Hadoop cluster setup

Setup steps on Ubuntu.

Hadoop cluster

Hadoop is an open-source framework for distributed computing, originally based on Google’s papers on MapReduce, Bigtable, and GFS (the Google File System).

.

Intro

Initially, Hadoop was built to support Apache Nutch, a crawler sub-project of "Apache Lucene":http://lucene.apache.org/java . But (with the support of the community, especially from Yahoo!) it has graduated to a top-level Apache project and become one of the most popular frameworks for distributed systems, one that can be deployed on large clusters of commodity hardware.

Now Hadoop contains more sub-projects than just the core: HDFS (a highly fault-tolerant distributed filesystem), MapReduce (parallel processing of vast amounts of data), HBase (a NoSQL database), Pig (analyzing large data sets), Hive (querying data), ZooKeeper (distributed coordination), and some more related projects (Mahout, Avro, Chukwa, …).


(image source: TheRegister)

.

Single-node setup

This post covers the steps for setting up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux. The main goal is to get a “simple” Hadoop installation up and running so that you can play around with the software and develop with it.


(image source: HortonWorks)

.

Preparation

+ Hadoop: 0.20.203.x (current stable)

+ Ubuntu: 10.10 (maverick)

+ Sun Java 6 (see Sun_JDK_Ubuntu for more information).

1. Add the Canonical Partner Repository to your apt repositories:

$ sudo add-apt-repository "deb http://archive.canonical.com/ maverick partner"

2. Update the source list

$ sudo apt-get update

3. Install sun-java6-jdk

$ sudo apt-get install sun-java6-jdk sun-java6-bin

4. Select Sun’s Java as the default on your machine.

$ sudo update-java-alternatives -s java-6-sun

The full JDK will be placed in /usr/lib/jvm/java-6-sun .
After installation, make a quick check whether Sun’s JDK is correctly set up:

hduser@devserver:~$ java -version

+ Add a dedicated Hadoop system user

Actually, this is not required. We’ll skip it since this is for quick development, not a dedicated deployment (a minimal sketch is given below in case you do want one).
In this article, I’ll use the admin user *hduser* for Hadoop-related operations.
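
(If you do prefer a dedicated user, a sketch on Ubuntu, using the *hduser* / *hdgroup* names that appear later in this article:)

$ sudo addgroup hdgroup
$ sudo adduser --ingroup hdgroup hduser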

+ SSH configuration (see SSH Without Password for more information).

By default, Ubuntu has sshd up and running, already configured to allow SSH public-key authentication.

As this is a single-node cluster, our “remote” host is actually the same as “localhost” (devserver).

hduser@devserver:~ $ ssh-keygen -t rsa -P ""
 Generating public/private rsa key pair.
 Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
 Enter passphrase (empty for no passphrase):
 Enter same passphrase again:
 Your identification has been saved in /home/hduser/.ssh/id_rsa.
 Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
 The key fingerprint is:
 f6:61:a8:27:35:cf:4c:6d:13:22:70:cf:4c:c8:a0:23 hduser@devserver
hduser@devserver:~ $ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@localhost
(Alternatively:
hduser@devserver:~ $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
 )

 

Now you should be able to log in via SSH without a password.

hduser@devserver:~ $ ssh hduser@localhost
Last login: Fri Dec 16 17:22:33 2011 from 192.168.22.84

+ Some more

Install rsync (if not installed yet):

$ sudo apt-get install rsync

Avoid IPv6 problems by disabling IPv6, or by preferring IPv4: add the following line to $HADOOP_HOME/conf/hadoop-env.sh (a system-wide alternative is sketched right after):

export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
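
(Alternatively, a sketch of disabling IPv6 system-wide on Ubuntu instead of using the Java flag; this assumes you are fine with turning IPv6 off for the whole machine. Add these lines to /etc/sysctl.conf and reload:)

# in /etc/sysctl.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

$ sudo sysctl -p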

.

Installation

Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /opt/hadoop .
(Optionally: if there is a dedicated user hduser in group hdgroup, make sure to change the owner and permissions accordingly; a chown sketch follows the extraction commands below.)

Assuming the downloaded package is _hadoop-0.20.203.0rc1.tar.gz_, which will be extracted to a _hadoop-0.20.203.0_ folder:

$ mkdir -p /opt/hadoop
$ chmod 755 -R /opt/hadoop
$ tar -xf hadoop-0.20.203.0rc1.tar.gz
$ mv hadoop-0.20.203.0 /opt/hadoop

(Alternatively:

$ tar -xf hadoop-0.20.203.0rc1.tar.gz -C /opt/hadoop/

)
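
(If you created the dedicated hduser/hdgroup above, a chown along these lines would set the ownership:)

$ sudo chown -R hduser:hdgroup /opt/hadoop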

Now we can update $HOME/.bashrc of the Hadoop user for convenient Hadoop commands:

#### ADDED - dq 2010.12.16
export JAVA_HOME="/usr/lib/jvm/java-6-sun"
export HADOOP_HOME="/opt/hadoop/hadoop-0.20.203.0"
#alias fs="hadoop fs"
#alias hfs="fs -ls"

#lzohead() {
#  hadoop fs -cat $1 | lzop -dc | head -1000 | less
#}

export PATH="$PATH:$HADOOP_HOME/bin"
#### ADDED END
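
(If you uncomment the aliases and the lzohead helper above, and have lzop installed, usage would look roughly like this; the .lzo path is just a hypothetical example:)

$ fs -ls /
$ lzohead /user/hduser/some-job-output/part-r-00000.lzo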

.

Basic configuration

The options on the official page may leave a new user confused or overwhelmed:
http://hadoop.apache.org/common/docs/current/cluster_setup.html

We’ll just cover the most basic settings here. First, take a quick glance at the HDFS design to understand the concepts:
http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html

Create a temp directory for Hadoop:

$ mkdir -p /opt/hadoop/tmp

Edit the configuration files, adding the corresponding lines below:

conf/hadoop-env.sh
Besides the HADOOP_OPTS mentioned above, specify JAVA_HOME explicitly in hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

.

conf/core-site.xml

<configuration>
<!-- ADDED - dq 2011.12.16 -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/tmp</value>
<description>A base for other temporary directories</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>name of default FileSystem (see fs.SCHEME.impl)</description>
</property>
<!-- ADDED END - in conf/core-site.xml -->
</configuration>
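
(With fs.default.name set as above, HDFS paths without a scheme resolve against hdfs://localhost:54310, and relative paths resolve against the user’s HDFS home directory, by default /user/<username>. For example:)

$ bin/hadoop fs -ls gutenberg
 (equivalent to: bin/hadoop fs -ls hdfs://localhost:54310/user/hduser/gutenberg)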

.

conf/mapred-site.xml

<configuration>
<!-- ADDED - dq 2011.12.16 -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>host and port for MapReduce job tracker</description>
</property>
<!-- ADDED END - in conf/mapred-site.xml -->
</configuration>

 

conf/hdfs-site.xml

<configuration>
<!-- ADDED - dq 2011.12.20 -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>default block replication, will be used if replication
is not specified when the file is created
</description>
</property>
<!-- ADDED END - in conf/hdfs-site.xml -->
</configuration>

 

Format the HDFS filesystem via the NameNode.
(Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster, i.e. in HDFS.)

$ cd $HADOOP_HOME
$ ./bin/hadoop namenode -format

Note: if you have already added $HADOOP_HOME/bin to your $PATH, just using “hadoop” is enough. Otherwise you should run “bin/hadoop” from $HADOOP_HOME for every Hadoop command.

 

The output will look like this:

 11/12/20 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
 /************************************************************
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG: host = ubuntu/127.0.1.1
 STARTUP_MSG: args = [-format]
 STARTUP_MSG: version = 0.20.203
 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911807; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2011
 ************************************************************/
 11/12/20 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
 11/12/20 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
 11/12/20 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
 11/12/20 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
 11/12/20 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
 11/12/20 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
 /************************************************************
 SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
 ************************************************************/

 

Check to see if we can start the single-node cluster:

$ bin/start-all.sh

The output will look like this:

 starting namenode, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-namenode-devserver.out
 localhost: starting datanode, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-datanode-devserver.out
 localhost: starting secondarynamenode, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-secondarynamenode-devserver.out
 starting jobtracker, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-jobtracker-devserver.out
 localhost: starting tasktracker, logging to /opt/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-hduser-tasktracker-devserver.out

 

To check the (Java) process IDs quickly, use Sun JDK’s jps:

$ jps
 12353 JobTracker
 12610 Jps
 12088 NameNode
 12465 TaskTracker
 12287 SecondaryNameNode
 12186 DataNode

We can also check with netstat whether Hadoop is listening on the configured ports:

$ netstat -netpl | grep java
 tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 1000 73537 12287/java
 tcp 0 0 0.0.0.0:50060 0.0.0.0:* LISTEN 1000 74042 12465/java
 tcp 0 0 0.0.0.0:55277 0.0.0.0:* LISTEN 1000 72741 12186/java
 tcp 0 0 0.0.0.0:50030 0.0.0.0:* LISTEN 1000 73614 12353/java
 tcp 0 0 0.0.0.0:53486 0.0.0.0:* LISTEN 1000 72550 12088/java
 tcp 0 0 127.0.0.1:57045 0.0.0.0:* LISTEN 1000 73681 12465/java
 tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1000 73452 12088/java
 tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 1000 73532 12186/java
 tcp 0 0 0.0.0.0:40122 0.0.0.0:* LISTEN 1000 73063 12287/java
 tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN 1000 73785 12186/java
 tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN 1000 73793 12186/java
 tcp 0 0 0.0.0.0:43332 0.0.0.0:* LISTEN 1000 73366 12353/java
 tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1000 73131 12088/java
 tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1000 73442 12353/java

(you can use the -a option instead of -l to also show ESTABLISHED connections besides the LISTENing ones)

If there are any errors, examine the log files in the logs directory.

.

Running MapReduce

Stop the cluster:

$ bin/stop-all.sh
 stopping jobtracker
 localhost: stopping tasktracker
 stopping namenode
 localhost: stopping datanode
 localhost: stopping secondarynamenode

We’ll run the wordcount job (http://wiki.apache.org/hadoop/WordCount).

$ mkdir -p /opt/hadoop/gutenberg

First, download some books from Project Gutenberg (plain-text format) into that folder; a wget sketch follows the links:
http://www.gutenberg.org/ebooks/20417
http://www.gutenberg.org/etext/5000
http://www.gutenberg.org/etext/4300
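
(For example with wget; the exact plain-text download URLs below are an assumption and may differ on the Gutenberg site, but the filenames should match the listing that follows:)

$ cd /opt/hadoop/gutenberg
$ wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
$ wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
$ wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt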

$ ls -la /opt/hadoop/gutenberg
 drwxr-xr-x 2 hduser hdgroup 4096 2011-12-20 11:26 .
 drwxr-xr-x 6 hduser hdgroup 4096 2011-12-20 12:02 ..
 -rw-r--r-- 1 hduser hdgroup 661807 2011-12-20 11:21 pg20417.txt
 -rw-r--r-- 1 hduser hdgroup 1540092 2011-12-20 11:21 pg4300.txt
 -rw-r--r-- 1 hduser hdgroup 1391685 2011-12-20 11:21 pg5000.txt

Restart the Hadoop cluster if it’s not running already.

$ bin/start-all.sh

Now we copy the files from our local filesystem to Hadoop’s HDFS. (Note that files stored in HDFS are not visible through the normal local filesystem by default.)

$ hadoop dfs -copyFromLocal /opt/hadoop/gutenberg $HOME/gutenberg
$ hadoop dfs -ls $HOME
 drwxr-xr-x - hduser supergroup 0 2011-12-27 11:40 /home/hduser/gutenberg
$ hadoop dfs -ls $HOME/gutenberg
 Found 3 items
 -rw-r--r-- 1 hduser supergroup 661807 2011-12-20 11:40 /home/hduser/gutenberg/pg20417.txt
 -rw-r--r-- 1 hduser supergroup 1540092 2011-12-20 11:40 /home/hduser/gutenberg/pg4300.txt
 -rw-r--r-- 1 hduser supergroup 1391685 2011-12-20 11:40 /home/hduser/gutenberg/pg5000.txt

Now, we actually run the WordCount example job.

$ bin/hadoop jar hadoop*examples*.jar wordcount $HOME/gutenberg $HOME/gutenberg-output

 

If you encounter an exception like the one below:

Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
        at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file

Then check whether the wildcard can actually resolve to the examples JAR, in particular whether you are running the command from the $HADOOP_HOME directory.

To make sure the command runs correctly, specify the full path to the examples JAR (absolute path, no wildcard):

$ bin/hadoop jar $HADOOP_HOME/hadoop-examples-0.20.203.0.jar wordcount $HOME/gutenberg $HOME/gutenberg-output

The result will look like this:

 11/12/20 18:00:06 INFO input.FileInputFormat: Total input paths to process : 3
 11/12/20 18:00:07 INFO mapred.JobClient: Running job: job_201112201314_0003
 11/12/20 18:00:08 INFO mapred.JobClient: map 0% reduce 0%
 11/12/20 18:00:58 INFO mapred.JobClient: map 66% reduce 0%
 11/12/20 18:01:18 INFO mapred.JobClient: map 100% reduce 0%
 11/12/20 18:01:33 INFO mapred.JobClient: map 100% reduce 100%
 11/12/20 18:01:41 INFO mapred.JobClient: Job complete: job_201112201314_0003
 11/12/20 18:01:41 INFO mapred.JobClient: Counters: 25
 11/12/20 18:01:41 INFO mapred.JobClient: Job Counters
 11/12/20 18:01:41 INFO mapred.JobClient: Launched reduce tasks=1
 11/12/20 18:01:41 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=94460
 11/12/20 18:01:41 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
 11/12/20 18:01:41 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
 11/12/20 18:01:41 INFO mapred.JobClient: Launched map tasks=3
 11/12/20 18:01:41 INFO mapred.JobClient: Data-local map tasks=3
 11/12/20 18:01:41 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=32203
 11/12/20 18:01:41 INFO mapred.JobClient: File Output Format Counters
 11/12/20 18:01:41 INFO mapred.JobClient: Bytes Written=880829
 11/12/20 18:01:41 INFO mapred.JobClient: FileSystemCounters
 11/12/20 18:01:41 INFO mapred.JobClient: FILE_BYTES_READ=2214823
 11/12/20 18:01:41 INFO mapred.JobClient: HDFS_BYTES_READ=3593945
 11/12/20 18:01:41 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3773724
 11/12/20 18:01:41 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=880829
 11/12/20 18:01:41 INFO mapred.JobClient: File Input Format Counters
 11/12/20 18:01:41 INFO mapred.JobClient: Bytes Read=3593584
 11/12/20 18:01:41 INFO mapred.JobClient: Map-Reduce Framework
 11/12/20 18:01:41 INFO mapred.JobClient: Reduce input groups=82334
 11/12/20 18:01:41 INFO mapred.JobClient: Map output materialized bytes=1474328
 11/12/20 18:01:41 INFO mapred.JobClient: Combine output records=102321
 11/12/20 18:01:41 INFO mapred.JobClient: Map input records=77932
 11/12/20 18:01:41 INFO mapred.JobClient: Reduce shuffle bytes=1474328
 11/12/20 18:01:41 INFO mapred.JobClient: Reduce output records=82334
 11/12/20 18:01:41 INFO mapred.JobClient: Spilled Records=255959
 11/12/20 18:01:41 INFO mapred.JobClient: Map output bytes=6076092
 11/12/20 18:01:41 INFO mapred.JobClient: Combine input records=629172
 11/12/20 18:01:41 INFO mapred.JobClient: Map output records=629172
 11/12/20 18:01:41 INFO mapred.JobClient: SPLIT_RAW_BYTES=361
 11/12/20 18:01:41 INFO mapred.JobClient: Reduce input records=102321

Check whether the result is successfully stored in the HDFS directory:

$ bin/hadoop dfs -ls $HOME

$ bin/hadoop dfs -ls $HOME/gutenberg-output
 Found 3 items
 -rw-r--r-- 1 hduser supergroup 0 2011-12-20 18:01 /home/hduser/gutenberg-output/_SUCCESS
 drwxr-xr-x - hduser supergroup 0 2011-12-20 18:00 /home/hduser/gutenberg-output/_logs
 -rw-r--r-- 1 hduser supergroup 880829 2011-12-20 18:01 /home/hduser/gutenberg-output/part-r-00000

We can override some Hadoop settings on the fly by using the “-D” option (which sets a Hadoop configuration property for the job):

$ bin/hadoop jar $HADOOP_HOME/hadoop-examples-0.20.203.0.jar wordcount -Dmapred.reduce.tasks=16 $HOME/gutenberg $HOME/gutenberg-output2

Check the result:

$ bin/hadoop dfs -cat $HOME/gutenberg-output/part-r-00000
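
(Alternatively, to pull the whole output directory back to the local filesystem, -getmerge concatenates the part files into a single local file; /tmp/gutenberg-output.txt is just a hypothetical destination:)

$ bin/hadoop dfs -getmerge $HOME/gutenberg-output /tmp/gutenberg-output.txt
$ head /tmp/gutenberg-output.txt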

.

Web interfaces

Hadoop Distributed Filesystem
http://192.168.22.77:50070/dfshealth.jsp

Hadoop JobTracker
http://192.168.22.77:50030/jobtracker.jsp

Hadoop TaskTracker
http://192.168.22.77:50060/tasktracker.jsp

Hadoop logs: append “/logs/” to any of the web interfaces above, i.e. “host:port/logs/”
http://192.168.22.77:50030/logs/
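
(A quick command-line check that a web UI is reachable, assuming curl is installed; it should print an HTTP 200 status line when the NameNode is up:)

$ curl -sI http://localhost:50070/dfshealth.jsp | head -1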

.

. . .
