Install Standalone Hadoop 2.6 on AWS Ubuntu

This is a quick step-by-step guide to installing standalone Hadoop 2.6 on an AWS instance. It covers preparing the machine, downloading and configuring Hadoop, starting Hadoop, and checking that it is running.

Download Hadoop Demo

Launch AWS Instance i2.xlarge
apt Update and install Java
Mount the large SSD drive
Add a hadoop user
Create and Setup SSH Certificates
Install Hadoop
Start Hadoop
Check Hadoop is Running

Hadoop AWS Install Tutorial

Launch AWS Instance i2.xlarge

I2 instances are High Storage Instances with very fast SSD-backed instance storage, optimized for high random I/O performance and high IOPS at a low cost. Launch an i2.xlarge instance (4 vCPU, 30.5 GiB RAM, 800 GB SSD) and set the security group to allow TCP connections to any port from your home IP address. This access is used to connect with PuTTY and to view the web interfaces that Hadoop provides.

apt Update and install Java

First get the latest updates, then install Oracle Java 7.

sudo apt-get update
sudo apt-get install htop
# Install Oracle Java 7
sudo apt-get install -y python-software-properties debconf-utils
sudo add-apt-repository ppa:webupd8team/java -y
sudo apt-get update
echo "oracle-java7-installer shared/accepted-oracle-license-v1-1 select true" | sudo debconf-set-selections
sudo apt-get install -y oracle-java7-installer
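
To confirm that the install picked up Java 7, the reported version can be checked. The following is a minimal sketch: java_major_version is a hypothetical helper (not part of Java or Hadoop), and it assumes the pre-Java 9 version string format, such as java version "1.7.0_80".

```shell
# Hypothetical helper: read `java -version` output on stdin and print the
# major version (e.g. 7). Assumes the old "1.x.y" version string format.
java_major_version() {
  sed -n 's/.*version "1\.\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Usage (java prints its version on stderr, hence 2>&1):
#   java -version 2>&1 | java_major_version
```

If the helper prints nothing, java is either missing or reports a version string in a different format.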

Mount the large SSD drive

Mount the SSD drive as /mnt/bigd/.

sudo mkfs -t ext4 /dev/xvdb
sudo mkdir -p /mnt/bigd
sudo mount /dev/xvdb /mnt/bigd/
sudo chown ubuntu /mnt/bigd/
sudo chgrp ubuntu /mnt/bigd/
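
Before putting data on the drive it is worth confirming that the mount succeeded. The sketch below is one way to do that, assuming a Linux-style mounts table where the second field is the mount point; is_mounted is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given directory appears as a mount
# point in a mounts table (defaults to /proc/mounts).
is_mounted() {
  mp="${1%/}"              # drop one trailing slash, so /mnt/bigd/ also works
  [ -n "$mp" ] || mp=/     # "/" itself would otherwise become empty
  awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage:
#   is_mounted /mnt/bigd/ && echo "SSD is mounted"
```

The optional second argument only exists so the check can be tried against a saved copy of the mounts table.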

Add a hadoop user

Create a dedicated Hadoop user and make it the owner of the SSD mount.

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
cd /mnt/bigd/
sudo chown hduser .

Create and Setup SSH Certificates

sudo apt-get install -y ssh
su hduser
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
# We can check if ssh works
# ssh localhost
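
If ssh localhost still asks for a password, overly open permissions on the key files are a common cause, because sshd's StrictModes setting rejects them. The sketch below tightens them; secure_ssh_dir is a hypothetical helper, and it assumes authorized_keys already exists in the directory.

```shell
# Hypothetical helper: restrict an SSH directory and its authorized_keys
# file to the owning user only.
secure_ssh_dir() {
  dir="${1:-$HOME/.ssh}"
  chmod 700 "$dir" &&
  chmod 600 "$dir/authorized_keys"
}

# Usage, followed by a non-interactive login test that fails instead of
# prompting for a password:
#   secure_ssh_dir
#   ssh -o BatchMode=yes localhost true && echo "passwordless ssh works"
```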

Install Hadoop

Download Hadoop tar

su hduser
cd /mnt/bigd/
wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar xvzf hadoop-2.6.0.tar.gz
ln -s hadoop-2.6.0 hadoop

Set Environment Paths

echo 'export JAVA_HOME=/usr/lib/jvm/java-7-oracle/' >> ~/.bashrc 
echo 'export HDIR=/mnt/bigd' >> ~/.bashrc
echo 'export HADOOP_INSTALL=/mnt/bigd/hadoop' >> ~/.bashrc 
echo 'export PATH=$PATH:$HADOOP_INSTALL/bin' >> ~/.bashrc 
echo 'export PATH=$PATH:$HADOOP_INSTALL/sbin' >> ~/.bashrc 
echo 'export HADOOP_MAPRED_HOME=$HADOOP_INSTALL' >> ~/.bashrc 
echo 'export HADOOP_COMMON_HOME=$HADOOP_INSTALL' >> ~/.bashrc 
echo 'export HADOOP_HDFS_HOME=$HADOOP_INSTALL' >> ~/.bashrc 
echo 'export YARN_HOME=$HADOOP_INSTALL' >> ~/.bashrc 
echo 'export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native' >> ~/.bashrc 
echo 'export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"' >> ~/.bashrc 
echo 'export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar' >> ~/.bashrc 
source ~/.bashrc
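
A quick check that the variables took effect in the current shell can save confusion later. This is a minimal sketch; check_hadoop_env is a hypothetical helper, and the list of names is simply taken from the exports above.

```shell
# Hypothetical helper: report any of the variables set above that are
# empty or unset. Returns non-zero if at least one is missing.
check_hadoop_env() {
  missing=0
  for name in JAVA_HOME HDIR HADOOP_INSTALL HADOOP_MAPRED_HOME \
              HADOOP_COMMON_HOME HADOOP_HDFS_HOME YARN_HOME; do
    eval "value=\${$name:-}"   # indirect lookup of the variable called $name
    if [ -z "$value" ]; then
      echo "not set: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_hadoop_env && echo "environment looks complete"
```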

Set XML Configuration Files

The config files can be downloaded as part of hadoop-demo.zip. They are then copied over the existing empty configs so that Hadoop uses the newly created directories.

echo 'export JAVA_HOME=/usr/lib/jvm/java-7-oracle/' >> $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh
wget http://www.timestored.com/learn/hadoop/f/hadoop-demo.zip
unzip hadoop-demo.zip
rm hadoop-demo.zip
mkdir -p $HDIR/tmp
cp hadoop-demo/core-site.xml $HADOOP_INSTALL/etc/hadoop/
cp hadoop-demo/mapred-site.xml $HADOOP_INSTALL/etc/hadoop/
mkdir -p $HDIR/store/hdfs/namenode
mkdir -p $HDIR/store/hdfs/datanode
cp hadoop-demo/hdfs-site.xml $HADOOP_INSTALL/etc/hadoop/
rm -r hadoop-demo

This should leave you with the following configs:

core-site.xml

<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/bigd/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>

 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>
</configuration>

hdfs-site.xml

<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/mnt/bigd/store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/mnt/bigd/store/hdfs/datanode</value>
 </property>
</configuration>

mapred-site.xml

<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
 </property>
</configuration>
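
To confirm that the copied files contain the expected values, a property can be read back from the XML. The sketch below is a hypothetical helper, not a Hadoop tool, and it is not a general XML parser: it assumes each <name> and <value> element sits on its own line, as in the files shown above.

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop *-site.xml file. Prints nothing if the property is not found.
get_hadoop_property() {
  awk -v key="$1" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")      # strip everything up to the opening tag
      sub(/<\/value>.*/, "")    # strip the closing tag and anything after
      print
      exit
    }
  ' "$2"
}

# Usage:
#   get_hadoop_property dfs.namenode.name.dir "$HADOOP_INSTALL/etc/hadoop/hdfs-site.xml"
```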

Format the New Hadoop Filesystem

# In Hadoop 2.x, "hdfs namenode -format" replaces the deprecated "hadoop namenode -format"
hdfs namenode -format
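
Formatting wipes the NameNode metadata, so it should only be done once on a fresh install. The sketch below guards against repeating it by accident; namenode_is_formatted is a hypothetical helper, and it relies on the assumption that a formatted name directory contains a current/VERSION file.

```shell
# Hypothetical helper: succeed if the given NameNode directory already
# looks formatted (it contains current/VERSION).
namenode_is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Usage: run the format command above only when this check fails.
#   namenode_is_formatted "$HDIR/store/hdfs/namenode" && echo "already formatted, skipping"
```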

Start Hadoop

cd $HADOOP_INSTALL/sbin
./start-all.sh
# $HADOOP_INSTALL/sbin/start-all.sh # start everything
# $HADOOP_INSTALL/sbin/stop-all.sh # stop everything

Check Hadoop is Running

Run jps to check that the Hadoop processes are running.

$ jps
5736 NodeManager
6101 Jps
5241 DataNode
5075 NameNode
5441 SecondaryNameNode
5592 ResourceManager
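
The same check can be scripted instead of read by eye. This is a minimal sketch: missing_daemons is a hypothetical helper, and the list of expected daemons is taken from the jps output shown above.

```shell
# Hypothetical helper: read jps output on stdin and print the name of each
# expected Hadoop daemon that is not listed. Prints nothing if all are up.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare the whole second field, so SecondaryNameNode is not
    # mistaken for NameNode.
    if ! printf '%s\n' "$jps_output" |
         awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}

# Usage:
#   jps | missing_daemons
```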

Alternatively, connect to the web interfaces on port 50070 (NameNode) and port 50090 (SecondaryNameNode). If Hadoop is running locally, use http://localhost:50070; otherwise replace localhost with your instance's domain name.