Adding a new worker node to my Hadoop-Pi Cluster

As I started exploring more datasets beyond my home automation logs, I found myself in need of a little more space, as the existing datanodes were only set up on 16 GB SD cards. So I decided to add a new node with a 64 GB SD card and upgrade the other nodes at a later date.

  1. Prepare Raspberry PI as per post Preparing a new Raspberry Pi server
  2. Configure networking
  3. Install prerequisites
  4. Create Group and User accounts for hadoop system
  5. Setup passphraseless ssh
  6. Download and unpack hadoop binaries
  7. Update Configuration files
  8. Create the Hadoop filesystem
  9. Clear up incompatible jline jars
  10. Reboot and startup services
  11. Job Done… but….

2. Configure Networking

With the new Debian Jessie operating system, networking is done slightly differently from before, and if you try the old /etc/network options you could end up with a confused system. I somehow managed to get both a static IP address and DHCP at the same time, which caused much confusion. It's now done via /etc/dhcpcd.conf

sudo vi /etc/dhcpcd.conf

At the bottom of the file create an entry for eth0 with your chosen settings

# Custom static IP address for eth0.

interface eth0
static ip_address=192.168.1.213/24
static routers=192.168.1.254
static domain_name_servers=192.168.1.254

Save

Set the hostname

sudo vi /etc/hostname

set to worker02

Update the hosts file

sudo vi /etc/hosts

192.168.1.210 master01
192.168.1.211 master02
192.168.1.212 worker01
192.168.1.213 worker02

NOTE: There could be an entry for 127.0.1.1 for the host. If there is, delete it, as it can cause issues when running MapReduce jobs on the cluster.
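A quick way to check for that entry is to grep for it. This is only a minimal sketch: has_loopback_alias is a made-up helper name, and it reads /etc/hosts unless another path is passed as its argument.

```shell
# Return success (0) if the given hosts file has a line starting with 127.0.1.1.
# has_loopback_alias is a hypothetical helper name, not a standard tool.
has_loopback_alias() {
    grep -qE '^[[:space:]]*127\.0\.1\.1[[:space:]]' "${1:-/etc/hosts}"
}

# Example: has_loopback_alias && echo "remove the 127.0.1.1 line from /etc/hosts"
```

It only reports whether the line exists; the actual removal is still done by hand in the editor.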

Reboot for the settings to take effect

sudo reboot

Whilst that is rebooting log onto the master01 server and add worker02 to its hosts file using the same procedure

Reconnect using the new ip address and logon again

3. Install prerequisites

Firstly update the repository information

sudo apt-get update

now install the prerequisites

sudo apt-get install oracle-java8-jdk libsnappy-java libssl-dev r-base

4. Create Group and User accounts for hadoop system
Now we need to create the group and user accounts that will run the hadoop system. These need to match the accounts used on the rest of the cluster. In my case that is username hduser and group hadoop.
Create the group

sudo groupadd hadoop

Create the user in the hadoop group

sudo adduser --ingroup hadoop hduser

Do not add a password just press enter until asked to try again and select ‘n’


This has the effect of locking the account; we won't ever want to log in directly using this account.
Leave all the other fields blank

5. Setup passphraseless ssh
For the system to operate, hduser must have passphraseless ssh access to run jobs. To help with managing the cluster I will also enable passphraseless access from hduser@master01.

Generate ssh keys for hduser

sudo -H -u hduser ssh-keygen -t rsa -P ""

Save the key in the default /home/hduser/.ssh/id_rsa

Enable local passphraseless access

sudo -u hduser bash -c 'cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys'

Test running a simple ssh command as hduser
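Any harmless remote command will do for this test. As a sketch, the hypothetical helper below (check_hduser_ssh is my name for it, not a standard command) runs hostname over ssh as hduser, against localhost by default.

```shell
# Run `hostname` over ssh as hduser; the target host defaults to localhost.
# check_hduser_ssh is a hypothetical helper name used for illustration.
check_hduser_ssh() {
    sudo -H -u hduser ssh "${1:-localhost}" hostname
}

# Example: check_hduser_ssh            # should print worker02 with no password prompt
```

The very first connection to a host asks you to accept its host key; after that the command should return straight away without prompting.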

Enable passphraseless access for hduser@master01

Get the master01 public key and copy it to the worker02 node. Carry out this step on the master01 server.

master01> sudo scp /home/hduser/.ssh/id_rsa.pub pi@worker02:~/master01.pub

Now install it for hduser on worker02

sudo -u hduser bash -c 'cat /home/pi/master01.pub >> /home/hduser/.ssh/authorized_keys'

And again test running simple ssh command as hduser

6. Download and unpack hadoop binaries
For this step I will assume that you have a copy of the hadoop binaries compiled for the Raspberry Pi. If you want more information on how to do this check out this post. If you want to save a few hours feel free to download my version, but obviously I disclaim any liability in connection with its use.

wget https://owainlloyd.info/hadoop-2.6.4.armf.tar.gz

Unpack these to /opt

sudo tar -zxvf hadoop-2.6.4.armf.tar.gz -C /opt

For ease create a symbolic link to the newly created directory

sudo ln -s /opt/hadoop-2.6.4 /opt/hadoop

Change the ownership of these to the hduser and hadoop group

sudo chown -R hduser:hadoop /opt/hadoop-2.6.4

sudo chown -R hduser:hadoop /opt/hadoop

7. Update Configuration files

Now we need to copy the configuration files from the master01 server to the new worker node. Do this on the master01 server

master01> sudo -H -u hduser scp /opt/hadoop/etc/hadoop/* hduser@worker02:/opt/hadoop/etc/hadoop

Note not all configuration files are needed – it’s just easier to copy the whole lot across.

Having done that we need to add some environment variables into the bash shell

sudo vi /etc/bash.bashrc

And add the following lines at the bottom

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL
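To see what the sed expression in the JAVA_HOME line does, here it is applied to a sample path. The path is only an assumption about where the JDK package lands; check the real one on your Pi with readlink -f /usr/bin/java.

```shell
# Sample path only: an assumed result of `readlink -f /usr/bin/java`.
sample_java=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/bin/java

# Same sed expression as in the JAVA_HOME export: strip the trailing jre/bin/java.
derived_home=$(echo "$sample_java" | sed "s:jre/bin/java::")
echo "$derived_home"
```

What is left is the JDK root directory (with a trailing slash), which is what Hadoop expects in JAVA_HOME.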

8. Create the Hadoop filesystem

Create and set the file permissions for the datanode

sudo mkdir /hdfs
sudo chmod 0777 /hdfs
sudo mkdir /hdfs/tmp
sudo chmod 0750 /hdfs/tmp

Change ownership to the hduser

sudo chown -R hduser:hadoop /hdfs/tmp

9. Clear up incompatible jline jars
Jline2 is now used, but for some reason the previous version still exists in the Hadoop lib directories. The easy solution is to go through and delete every copy of the old version

find /opt/hadoop/ -name jline-0.9.94.jar
sudo find /opt/hadoop/ -name jline-0.9.94.jar -exec rm -rf {} \;

Just to check it has actually worked

find /opt/hadoop/ -name jline-0.9.94.jar
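If you want to try the clean-up somewhere safe first, the sketch below runs the same idea against a scratch directory. The directory layout and the jline-2.12.jar name are invented stand-ins for /opt/hadoop, and it uses find's -delete action instead of -exec rm.

```shell
# Scratch tree standing in for /opt/hadoop (hypothetical layout and jar names).
scratch=$(mktemp -d)
mkdir -p "$scratch/share/hadoop/yarn/lib" "$scratch/share/hadoop/kms/lib"
touch "$scratch/share/hadoop/yarn/lib/jline-0.9.94.jar" \
      "$scratch/share/hadoop/kms/lib/jline-0.9.94.jar" \
      "$scratch/share/hadoop/yarn/lib/jline-2.12.jar"

# List the matches first, then delete only regular files with that exact name.
find "$scratch" -type f -name 'jline-0.9.94.jar' -print
find "$scratch" -type f -name 'jline-0.9.94.jar' -delete

# Count what is left: the old jars should be gone, the jline2 jar untouched.
remaining_old=$(find "$scratch" -name 'jline-0.9.94.jar' | wc -l)
remaining_new=$(find "$scratch" -name 'jline-2*.jar' | wc -l)
echo "old: $remaining_old, new: $remaining_new"
```

Restricting the match to -type f means a directory that happened to share the name would be left alone.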

Note: when first trying this I only found out about it after a few days, when my oozie jobs started to fail randomly whenever certain jobs were run on the new node. More information can be found here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

10. Reboot and startup services

Now the only thing needed is to reboot and startup the services.

sudo reboot

Once rebooted log back in and

Start the datanode

sudo -H -u hduser /opt/hadoop-2.6.4/sbin/hadoop-daemon.sh start datanode

and the node manager

sudo -H -u hduser /opt/hadoop/sbin/yarn-daemon.sh start nodemanager
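To confirm that both daemons came up, the JDK's jps tool lists the running Java processes by name. The helper below is a sketch with a made-up name (worker_daemons_up); it reads jps-style output on stdin and succeeds only if both DataNode and NodeManager appear. It assumes jps is on the PATH for hduser.

```shell
# Succeed only if jps-style output on stdin lists both worker daemons.
# worker_daemons_up is a hypothetical helper; jps itself ships with the JDK.
worker_daemons_up() {
    out=$(cat)
    echo "$out" | grep -qw 'DataNode' && echo "$out" | grep -qw 'NodeManager'
}

# Example: sudo -H -u hduser jps | worker_daemons_up && echo "worker02 daemons are up"
```

jps needs to be run as the user that owns the processes (hduser here) to see them.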


And there is my new worker node 🙂

Note: In my initial attempts I accidentally ran these scripts as root, not hduser. This leaves some of the files in the directories with the wrong permissions and prevents the node from working. If this happens you need to delete the directories and redo the steps.

11. Job done.. But..
A couple of additional steps are needed on master01 to finish the inclusion of the new node.

Update slaves files

We need to add worker02 into the slaves files on master01 to ensure that the new node is included when performing cluster level activities.

sudo vi /opt/hadoop/etc/hadoop/slaves

and add worker02
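If you would rather script this than edit by hand, a minimal sketch is below. add_slave is a hypothetical helper name; it appends the hostname only when the file does not already contain it as a whole line, so running it twice is harmless. It assumes the file ends with a newline.

```shell
# Append a hostname to a slaves file only if it is not already listed.
# add_slave is a hypothetical helper name; the path default matches this cluster.
add_slave() {
    host=$1
    file=${2:-/opt/hadoop/etc/hadoop/slaves}
    grep -qxF "$host" "$file" || echo "$host" >> "$file"
}

# Example (as a user with write access to the file): add_slave worker02
```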

Rebalance the cluster

The automatic actions taken by the cluster only make sure that under-replicated blocks are copied. The cluster does not automatically rebalance when new nodes are added. To do this we need to run

sudo -H -u hduser ssh hduser@master01 /opt/hadoop/bin/hadoop balancer

Note: The reason I am using ssh to run the command even though I am logged into that server is that I wish for the program to continue to run after I disconnect the current session as it can take a long time depending on how much data needs to be moved.
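Another common way to keep a long-running command alive after logout is nohup. The sketch below is an alternative, not what I ran: start_balancer_detached is a made-up helper name, the log path is arbitrary, and it calls hdfs balancer with its -threshold option (a percentage, 10 by default). It expects to be run from a shell that is already hduser.

```shell
# Start the balancer detached from the terminal so it survives a logout.
# start_balancer_detached is a hypothetical helper; run it as hduser.
#   $1 = threshold in percent (assumed default 10)
#   $2 = log file (arbitrary default /tmp/balancer.log)
start_balancer_detached() {
    nohup "${HADOOP_INSTALL:-/opt/hadoop}/bin/hdfs" balancer \
        -threshold "${1:-10}" > "${2:-/tmp/balancer.log}" 2>&1 < /dev/null &
}

# Example: start_balancer_detached 10 /tmp/balancer.log
```

Progress can then be followed with tail -f on the log file from any later session.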

Before rebalancing:

After rebalancing:
