Setup steps on Ubuntu.
Hadoop cluster
Hadoop is an open-source framework for distributed computing, originally based on Google’s papers on MapReduce, Bigtable, and GFS.
.
Intro
Initially, Hadoop was built to support Apache Nutch, a crawler sub-project of Apache Lucene (http://lucene.apache.org/java). But with the support of the community, especially from Yahoo!, it has graduated to a top-level Apache project and become one of the most popular frameworks for distributed systems, deployable on large clusters of commodity hardware.
Now Hadoop contains more sub-projects than just the core: HDFS (a highly fault-tolerant distributed filesystem), MapReduce (parallel processing of vast amounts of data), HBase (a NoSQL database), Pig (analyzing large data sets), Hive (querying data), ZooKeeper (distributed coordination), and more related projects (Mahout, Avro, Chukwa, …).
(source: TheRegister)
.
Single-node setup
These are the steps for setting up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux. The main goal is to get a “simple” Hadoop installation up and running so that you can play around with the software and develop with it.
(source: HortonWorks)
.
Preparation
+ Hadoop: 0.20.203.x (current stable)
+ Ubuntu: 10.10 (maverick)
+ Sun Java 6 (see Sun_JDK_Ubuntu for more information).
1. Add the Canonical Partner Repository to your apt repositories:
$ sudo add-apt-repository "deb http://archive.canonical.com/ maverick partner"
2. Update the source list
$ sudo apt-get update
3. Install sun-java6-jdk
$ sudo apt-get install sun-java6-jdk sun-java6-bin
4. Select Sun’s Java as the default on your machine.
$ sudo update-java-alternatives -s java-6-sun
The full JDK will be placed in /usr/lib/jvm/java-6-sun.
After installation, make a quick check whether Sun’s JDK is correctly set up:
hduser@devserver:~$ java -version
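If the JDK is set up correctly, this should print something like the following (the exact update and build numbers will differ on your machine):
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)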
+ Add a dedicated Hadoop system user
Actually, this is not required. We’ll skip it since this is for quick development, not dedicated deployment.
In this article, I’ll use the admin user *hduser* for Hadoop-related operations.
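If you do prefer a dedicated user for a longer-lived deployment, a minimal sketch would be something like the following (the user and group names simply mirror the hduser/hdgroup names used later in this article):
$ sudo addgroup hdgroup
$ sudo adduser --ingroup hdgroup hduser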
+ SSH configuration (see SSH Without Password for more information).
By default, Ubuntu has sshd up and running, already configured to allow SSH public-key authentication.
As this is a single-node cluster, our “remote” host is actually the same as “localhost” (devserver).
hduser@devserver:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
f6:61:a8:27:35:cf:4c:6d:13:22:70:cf:4c:c8:a0:23 hduser@devserver

hduser@devserver:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@localhost
(Alternatively: hduser@devserver:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys )
Now you should be able to log in via SSH without a password.
hduser@devserver:~$ ssh hduser@localhost
Last login: Fri Dec 16 17:22:33 2011 from 192.168.22.84
+ Some more
Install rsync (if not yet installed):
$ sudo apt-get install rsync
Avoid IPv6 problems by disabling IPv6, or by preferring IPv4: add the following line to $HADOOP_HOME/conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
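Alternatively, IPv6 can be disabled system-wide via sysctl; a sketch of that approach (assumes root access, takes effect after sudo sysctl -p or a reboot) is to append these lines to /etc/sysctl.conf:
# disable IPv6 for all interfaces
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1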
.
Installation
Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /opt/hadoop.
(Optionally: if there is a dedicated user, e.g. hduser of hdgroup, make sure to change the owner and permissions accordingly; see the chown sketch after the commands below.)
Assuming that the downloaded package is _hadoop-0.20.203.0rc1.tar.gz_, which will be extracted to the _hadoop-0.20.203.0_ folder:
$ mkdir -p /opt/hadoop
$ chmod 755 -R /opt/hadoop
$ tar -xf hadoop-0.20.203.0rc1.tar.gz
$ mv hadoop-0.20.203.0 /opt/hadoop
(Alternatively:
$ tar -xf hadoop-0.20.203.0rc1.tar.gz -C /opt/hadoop/
)
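As noted above, if a dedicated hduser/hdgroup is used instead of the admin user, adjust the ownership of the extracted folder, for example:
$ sudo chown -R hduser:hdgroup /opt/hadoop/hadoop-0.20.203.0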
Now we can update the Hadoop user’s $HOME/.bashrc for convenient Hadoop commands:
#### ADDED – dq 2010.12.16
export JAVA_HOME="/usr/lib/jvm/java-6-sun"
export HADOOP_HOME="/opt/hadoop/hadoop-0.20.203.0"

#alias fs="hadoop fs"
#alias hfs="fs -ls"

#lzohead() {
#  hadoop fs -cat $1 | lzop -dc | head -1000 | less
#}

export PATH="$PATH:$HADOOP_HOME/bin"
#### ADDED END
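Reload the file so the current shell picks up the new variables, then do a quick sanity check (hadoop version should report 0.20.203.0):
$ source $HOME/.bashrc
$ hadoop version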
.
Basic configuration
The options on the official page may make a new user feel confused or overwhelmed:
http://hadoop.apache.org/common/docs/current/cluster_setup.html
We’ll just cover the most basic configuration here. First, take a quick glance at the HDFS design to understand the concepts:
http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html
Create a temp directory for Hadoop:
$ mkdir -p /opt/hadoop/tmp
Edit the configuration files, adding the corresponding lines below:
conf/hadoop-env.sh
Besides the HADOOP_OPTS line mentioned above, specify JAVA_HOME explicitly in hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
.
conf/core-site.xml
<configuration>
<!-- ADDED – dq 2011.12.16 -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/tmp</value>
<description>A base for other temporary directories</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>name of default FileSystem (see fs.SCHEME.impl)</description>
</property>
<!-- ADDED END – in conf/core-site.xml -->
</configuration>
.
conf/mapred-site.xml
<configuration>
<!-- ADDED – dq 2011.12.16 -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>host and port for MapReduce job tracker</description>
</property>
<!-- ADDED END – in conf/mapred-site.xml -->
</configuration>
conf/hdfs-site.xml
<configuration>
<!-- ADDED – dq 2011.12.20 -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>default block replication, will be used if replication
is not specified when the file is created
</description>
</property>
<!-- ADDED END – in conf/hdfs-site.xml -->
</configuration>
Format the HDFS filesystem via the NameNode
(Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster, i.e. in HDFS.)
$ cd $HADOOP_HOME
$ ./bin/hadoop namenode -format
Note: if you have already added $HADOOP_HOME/bin to $PATH, just using “hadoop” is enough. Otherwise you should run “bin/hadoop” from $HADOOP_HOME for every Hadoop command.
The output will look like this:
11/12/20 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.203
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911807; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2011
************************************************************/
11/12/20 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
11/12/20 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
11/12/20 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/12/20 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
11/12/20 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
11/12/20 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
Check to see if we can start the single-node cluster:
$ bin/start-all.sh
The output will look like this:
starting namenode, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-namenode-devserver.out
localhost: starting datanode, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-datanode-devserver.out
localhost: starting secondarynamenode, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-secondarynamenode-devserver.out
starting jobtracker, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-jobtracker-devserver.out
localhost: starting tasktracker, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-tasktracker-devserver.out
To check the (Java) process IDs quickly, use Sun JDK’s jps:
$ jps
12353 JobTracker
12610 Jps
12088 NameNode
12465 TaskTracker
12287 SecondaryNameNode
12186 DataNode
We can also check with netstat if Hadoop is listening on the configured ports.
$ netstat -netpl | grep java
tcp  0  0 0.0.0.0:50090    0.0.0.0:*  LISTEN  1000  73537  12287/java
tcp  0  0 0.0.0.0:50060    0.0.0.0:*  LISTEN  1000  74042  12465/java
tcp  0  0 0.0.0.0:55277    0.0.0.0:*  LISTEN  1000  72741  12186/java
tcp  0  0 0.0.0.0:50030    0.0.0.0:*  LISTEN  1000  73614  12353/java
tcp  0  0 0.0.0.0:53486    0.0.0.0:*  LISTEN  1000  72550  12088/java
tcp  0  0 127.0.0.1:57045  0.0.0.0:*  LISTEN  1000  73681  12465/java
tcp  0  0 0.0.0.0:50070    0.0.0.0:*  LISTEN  1000  73452  12088/java
tcp  0  0 0.0.0.0:50010    0.0.0.0:*  LISTEN  1000  73532  12186/java
tcp  0  0 0.0.0.0:40122    0.0.0.0:*  LISTEN  1000  73063  12287/java
tcp  0  0 0.0.0.0:50075    0.0.0.0:*  LISTEN  1000  73785  12186/java
tcp  0  0 0.0.0.0:50020    0.0.0.0:*  LISTEN  1000  73793  12186/java
tcp  0  0 0.0.0.0:43332    0.0.0.0:*  LISTEN  1000  73366  12353/java
tcp  0  0 127.0.0.1:54310  0.0.0.0:*  LISTEN  1000  73131  12088/java
tcp  0  0 127.0.0.1:54311  0.0.0.0:*  LISTEN  1000  73442  12353/java
(you can use option -a instead of -l to also show ESTABLISHED connections besides the LISTENing ones)
If there are any errors, examine the log files in the logs directory.
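For example, to peek at the most recent NameNode log (the file name below is an assumption derived from the hadoop-<user>-<daemon>-<host> pattern shown in the startup output above; adjust it to your own host name):
$ tail -n 50 $HADOOP_HOME/logs/hadoop-hduser-namenode-devserver.log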
.
Running MapReduce
Stop the cluster:
$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
We’ll run the wordcount job (http://wiki.apache.org/hadoop/WordCount).
$ mkdir -p /opt/hadoop/gutenberg
First, download some books from project Gutenberg to a temp folder:
http://www.gutenberg.org/ebooks/20417
http://www.gutenberg.org/etext/5000
http://www.gutenberg.org/etext/4300
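Save each book in Plain Text UTF-8 format into that folder; a wget sketch (the download URLs below are placeholders, copy the actual plain-text links from the pages above):
$ cd /opt/hadoop/gutenberg
$ wget -O pg20417.txt <plain-text-url-of-ebook-20417>
$ wget -O pg5000.txt <plain-text-url-of-ebook-5000>
$ wget -O pg4300.txt <plain-text-url-of-ebook-4300>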
$ ls -la /opt/hadoop/gutenberg
drwxr-xr-x 2 hduser hdgroup    4096 2011-12-20 11:26 .
drwxr-xr-x 6 hduser hdgroup    4096 2011-12-20 12:02 ..
-rw-r--r-- 1 hduser hdgroup  661807 2011-12-20 11:21 pg20417.txt
-rw-r--r-- 1 hduser hdgroup 1540092 2011-12-20 11:21 pg4300.txt
-rw-r--r-- 1 hduser hdgroup 1391685 2011-12-20 11:21 pg5000.txt
Restart the Hadoop cluster if it’s not running already.
$ bin/start-all.sh
Now we copy the files from our local file system to Hadoop’s HDFS. (note that the normal filesystem will not recognize the files in HDFS, by default)
$ hadoop dfs -copyFromLocal /opt/hadoop/gutenberg $HOME/gutenberg
$ hadoop dfs -ls $HOME
drwxr-xr-x   - hduser supergroup          0 2011-12-27 11:40 /home/hduser/gutenberg
$ hadoop dfs -ls $HOME/gutenberg
Found 3 items
-rw-r--r--   1 hduser supergroup     661807 2011-12-20 11:40 /home/hduser/gutenberg/pg20417.txt
-rw-r--r--   1 hduser supergroup    1540092 2011-12-20 11:40 /home/hduser/gutenberg/pg4300.txt
-rw-r--r--   1 hduser supergroup    1391685 2011-12-20 11:40 /home/hduser/gutenberg/pg5000.txt
Now, we actually run the WordCount example job.
$ bin/hadoop jar hadoop*examples*.jar wordcount $HOME/gutenberg $HOME/gutenberg-output
If you meet an exception like the one below:
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
    at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file
then check your JRE version to make sure it supports wildcards on the classpath, and check whether you are running the command from the $HADOOP_HOME directory.
To make sure the command runs correctly, specify the full path to the example JAR (absolute path and no wildcard):
*$ bin/hadoop jar $HADOOP_HOME/hadoop-examples-0.20.203.0.jar wordcount $HOME/gutenberg $HOME/gutenberg-output*
The result will look like this:
11/12/20 18:00:06 INFO input.FileInputFormat: Total input paths to process : 3
11/12/20 18:00:07 INFO mapred.JobClient: Running job: job_201112201314_0003
11/12/20 18:00:08 INFO mapred.JobClient:  map 0% reduce 0%
11/12/20 18:00:58 INFO mapred.JobClient:  map 66% reduce 0%
11/12/20 18:01:18 INFO mapred.JobClient:  map 100% reduce 0%
11/12/20 18:01:33 INFO mapred.JobClient:  map 100% reduce 100%
11/12/20 18:01:41 INFO mapred.JobClient: Job complete: job_201112201314_0003
11/12/20 18:01:41 INFO mapred.JobClient: Counters: 25
11/12/20 18:01:41 INFO mapred.JobClient:   Job Counters
11/12/20 18:01:41 INFO mapred.JobClient:     Launched reduce tasks=1
11/12/20 18:01:41 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=94460
11/12/20 18:01:41 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/20 18:01:41 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/12/20 18:01:41 INFO mapred.JobClient:     Launched map tasks=3
11/12/20 18:01:41 INFO mapred.JobClient:     Data-local map tasks=3
11/12/20 18:01:41 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=32203
11/12/20 18:01:41 INFO mapred.JobClient:   File Output Format Counters
11/12/20 18:01:41 INFO mapred.JobClient:     Bytes Written=880829
11/12/20 18:01:41 INFO mapred.JobClient:   FileSystemCounters
11/12/20 18:01:41 INFO mapred.JobClient:     FILE_BYTES_READ=2214823
11/12/20 18:01:41 INFO mapred.JobClient:     HDFS_BYTES_READ=3593945
11/12/20 18:01:41 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3773724
11/12/20 18:01:41 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880829
11/12/20 18:01:41 INFO mapred.JobClient:   File Input Format Counters
11/12/20 18:01:41 INFO mapred.JobClient:     Bytes Read=3593584
11/12/20 18:01:41 INFO mapred.JobClient:   Map-Reduce Framework
11/12/20 18:01:41 INFO mapred.JobClient:     Reduce input groups=82334
11/12/20 18:01:41 INFO mapred.JobClient:     Map output materialized bytes=1474328
11/12/20 18:01:41 INFO mapred.JobClient:     Combine output records=102321
11/12/20 18:01:41 INFO mapred.JobClient:     Map input records=77932
11/12/20 18:01:41 INFO mapred.JobClient:     Reduce shuffle bytes=1474328
11/12/20 18:01:41 INFO mapred.JobClient:     Reduce output records=82334
11/12/20 18:01:41 INFO mapred.JobClient:     Spilled Records=255959
11/12/20 18:01:41 INFO mapred.JobClient:     Map output bytes=6076092
11/12/20 18:01:41 INFO mapred.JobClient:     Combine input records=629172
11/12/20 18:01:41 INFO mapred.JobClient:     Map output records=629172
11/12/20 18:01:41 INFO mapred.JobClient:     SPLIT_RAW_BYTES=361
11/12/20 18:01:41 INFO mapred.JobClient:     Reduce input records=102321
Check if the result is successfully stored in the HDFS directory:
*$ bin/hadoop dfs -ls $HOME*
*$ bin/hadoop dfs -ls $HOME/gutenberg-output*
Found 3 items
-rw-r--r--   1 hduser supergroup          0 2011-12-20 18:01 /home/hduser/gutenberg-output/_SUCCESS
drwxr-xr-x   - hduser supergroup          0 2011-12-20 18:00 /home/hduser/gutenberg-output/_logs
-rw-r--r--   1 hduser supergroup     880829 2011-12-20 18:01 /home/hduser/gutenberg-output/part-r-00000
We can override some Hadoop settings on the fly by using the “-D” option (Java system property):
*$ bin/hadoop jar $HADOOP_HOME/hadoop-examples-0.20.203.0.jar wordcount -Dmapred.reduce.tasks=16 $HOME/gutenberg $HOME/gutenberg-output2*
Check the result
*$ bin/hadoop dfs -cat $HOME/gutenberg-output/part-r-00000*
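To pull the merged result back to the local filesystem instead of cat-ing it from HDFS, the getmerge command can be used, for example (the /tmp output path is just an illustration):
$ bin/hadoop dfs -getmerge $HOME/gutenberg-output /tmp/gutenberg-output.txt
$ head /tmp/gutenberg-output.txt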
.
Web interfaces
Hadoop Distributed Filesystem
http://192.168.22.77:50070/dfshealth.jsp
Hadoop JobTracker
http://192.168.22.77:50030/jobtracker.jsp
Hadoop TaskTracker
http://192.168.22.77:50060/tasktracker.jsp
Hadoop logs: any of the above web interfaces + “/logs/”, i.e. “host:port/logs/”:
http://192.168.22.77:50030/logs/
.
. . .