Tuesday, February 1, 2011

Mahout on Amazon EC2 - part 3 - Debugging

Connecting to the management web interface of a Hadoop node


1) Log in to the AWS management console.
Select the default security group, and add TCP ports 50010-50090 (source IP 0.0.0.0/0).
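If you prefer the command line, the same rule can be added with the EC2 API tools (a sketch; assumes the tools are installed and configured with your credentials, and that your instances use the default security group):

ec2-authorize default -P tcp -p 50010-50090 -s 0.0.0.0/0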

2) You can view Hadoop node status (after starting Hadoop) by opening a web browser and
entering the following address:

http://ec2-50-16-155-136.compute-1.amazonaws.com:50070/
where ec2-XXX-XXXXXXXX is the node name, and 50070 is the default port of the namenode web server;
50030 is the default port of the job tracker, and 50060 is the default port of the task tracker.
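A quick way to verify that a daemon's web interface is reachable without a browser is curl (the hostname below is the example node above; substitute your own):

curl -s -o /dev/null -w "%{http_code}\n" http://ec2-50-16-155-136.compute-1.amazonaws.com:50070/

It should print 200 once the namenode is up.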



Common errors and their solutions:

* When starting Hadoop, the following message is presented:
<32|0>bickson@biggerbro:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2$ ./bin/start-all.sh
localhost: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
localhost: @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
localhost: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
localhost: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
localhost: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
localhost: It is also possible that the RSA host key has just been changed.
localhost: The fingerprint for the RSA key sent by the remote host is
localhost: 06:95:7b:c8:0e:85:e7:ba:aa:b1:31:6e:fc:0e:ae:4d.
localhost: Please contact your system administrator.
localhost: Add correct host key in /mnt/bigbrofs/usr6/bickson/.ssh/known_hosts to get rid of this message.
localhost: Offending key in /mnt/bigbrofs/usr6/bickson/.ssh/known_hosts:1
localhost: RSA host key for localhost has changed and you have requested strict checking.
localhost: Host key verification failed.
Solution:

echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
Note:

The >> operator appends to ~/.ssh/config, creating the file if it does not exist, so the same command works either way. Avoid a single >, which would overwrite any existing settings in the file.

* The following exception is received:
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
File /user/ubuntu/temp/markedPreferences/_temporary/_attempt_local_0001_r_000000_0/part-r-00000 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock
(FSNamesystem.java:1271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke
(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) 


Solution:
1) I saw this error when the system ran out of disk space. Increase the number of nodes or use a larger instance type.
2) Another cause is that the datanode has not finished booting. Wait at least 200 seconds after starting Hadoop before submitting jobs. The check below distinguishes the two cases.
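To see which case you are in, query HDFS capacity and the number of live datanodes with the standard dfsadmin command (run from the Hadoop root directory):

./bin/hadoop dfsadmin -report

If no datanodes are listed as available, they are still booting (or failed to start); if the remaining DFS capacity is near zero, you are out of disk space.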

* Job tracker fails to run with the following error:
2011-02-02 16:02:47,097 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = ip-10-114-75-91/10.114.75.91
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
2011-02-02 16:02:47,200 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
2011-02-02 16:02:47,220 FATAL org.apache.hadoop.mapred.JobTracker: java.lang.RuntimeException: Not a host:port pair: local
 at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
 at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
 at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
 at org.apache.hadoop.mapred.JobTracker.(JobTracker.java:1579)
 at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
 at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
 at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)


Solution: edit the file /path/to/hadoop/conf/mapred-site.xml and add the following property inside the <configuration> element:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>

* When connecting to EC2 host you get the following error:
ssh -i ./graphlabkey.pem -o "StrictHostKeyChecking no" 
ubuntu@ec2-50-16-101-232.compute-1.amazonaws.com "/home/ubuntu/ec2-metadata -i"
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for './graphlabkey.pem' are too open.
It is recommended that your private key files are NOT accessible by others.
This private key will be ignored.
Solution:
chmod 400 graphlabkey.pem

* Exception: cannot lock storage
2011-02-03 23:47:48,623 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /mnt. The directory is already locked.
2011-02-03 23:47:48,736 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Cannot lock storage /mnt. The directory is already locked.
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:112)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:216)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) 
Solution: search for and remove the file in_use.lock under the Hadoop storage directories.
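For example, assuming the storage directory is /mnt as in the log above (adjust the path to your dfs.data.dir setting):

find /mnt -name in_use.lock -exec rm {} \;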

* Exception:
tasktracker running as process XXX. Stop it first.

Solution:
1) Hadoop is already running - kill it first using stop-all.sh (on a single machine) or stop-mapred.sh and stop-dfs.sh (on a cluster).
2) If you killed Hadoop and are still getting this error, check whether /tmp contains *.pid files; if so, remove them (see the example below).
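By default Hadoop keeps its pid files under /tmp, named hadoop-<user>-<daemon>.pid. For example (the exact file names depend on the user running Hadoop):

ls /tmp/*.pid
rm -f /tmp/hadoop-*-tasktracker.pid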

* The datanode fails to start with the following error:
2010-08-06 12:12:06,900 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /Users/jchen/Data/Hadoop/dfs/data: namenode namespaceID = 773619367; datanode namespaceID = 2049079249
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
Solution: remove all files named VERSION from all of Hadoop's temporary directories (search thoroughly; Hadoop has at least 3 working directories) and reformat the namenode file system, as sketched below.
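A sketch of the cleanup, assuming the working directories live under /tmp (the default hadoop.tmp.dir is /tmp/hadoop-${user.name}; adjust the path if you changed it):

find /tmp -name VERSION -exec rm {} \;
./bin/hadoop namenode -format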

Error:
bash-3.2$ ./bin/start-all.sh 
starting namenode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-namenode-biggerbro.ml.cmu.edu.out
localhost: starting datanode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-datanode-biggerbro.ml.cmu.edu.out
localhost: starting secondarynamenode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-secondarynamenode-biggerbro.ml.cmu.edu.out
localhost: Exception in thread "main" java.net.BindException: Address already in use
localhost: 	at sun.nio.ch.Net.bind(Native Method)
localhost: 	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
localhost: 	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
localhost: 	at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
localhost: 	at org.apache.hadoop.http.HttpServer.start(HttpServer.java:425)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:165)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:115)
localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:469)
starting jobtracker, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-jobtracker-biggerbro.ml.cmu.edu.out
Solution: kill every process using ./bin/stop-all.sh, wait a few minutes and retry. If this does not help, you may need to change port numbers in the config files, or find and kill the process holding the port (see below).
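On Linux you can check which process is holding a port (here 50090, the secondary namenode's default web port, which is the daemon failing in the trace above):

netstat -nlp | grep 50090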

Error:
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/name is in an inconsistent state: storage directory does not exist or is not accessible.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 04:33:10,572 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/name is in an inconsistent state: storage directory does not exist or is not accessible.
Solution: it seems HDFS was not formatted properly; format it using the command
./bin/hadoop namenode -format

Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
	at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
	at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:247)
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
	at org.apache.hadoop.io.WritableName.getClass(WritableName.java:71)
	at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:1613)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1555)
	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1428)
	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1417)
	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1412)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

Solution: verify that MAHOUT_HOME is properly defined.
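For example (the path below is an assumption; point it at wherever you unpacked Mahout):

export MAHOUT_HOME=/home/ubuntu/mahout
echo $MAHOUT_HOME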

Error:
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 05:31:35,693 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000, call delete(/tmp/hadoop/mapred/system, true) from 127.0.0.1:51103: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop/mapred/system. Name node is in safe mode.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop/mapred/system. Name node is in safe mode.
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 05:33:39,712 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9000, call addBlock(/user/bickson/small_netflix_mahout_transpose/part-r-00000, DFSClient_-1810781150) from 127.0.0.1:48972: error: java.io.IOException: File /user/bickson/small_netflix_mahout_transpose/part-r-00000 could only be replicated to 0 nodes, instead of 1
hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:java.io.IOException: File /user/bickson/small_netflix_mahout_transpose/part-r-00000 could only be replicated to 0 nodes, instead of 1
Solution: this error may happen if you try to access the HDFS file system before Hadoop has finished starting up; the namenode stays in safe mode while it loads the filesystem image. Wait a few minutes and try again.
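You can also check whether the namenode is still in safe mode, or block until it leaves it, using the standard dfsadmin commands:

./bin/hadoop dfsadmin -safemode get
./bin/hadoop dfsadmin -safemode wait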
