Big Data and Its Ecosystem Projects


  • Ubuntu 18.04 installed on a virtual machine.

Install OpenJDK on Ubuntu


The Hadoop framework is written in Java, and its services require a compatible Java Runtime Environment (JRE) and Java Development Kit (JDK). Use the following command to update your system before initiating a new installation:

sudo apt update

At the moment, Apache Hadoop 3.x fully supports Java 8. The OpenJDK 8 package in Ubuntu contains both the runtime environment and development kit.

Type the following command in your terminal to install OpenJDK 8:

sudo apt install openjdk-8-jdk -y

The OpenJDK or Oracle Java version can affect how elements of a Hadoop ecosystem interact. 

Once the installation process is complete, verify the current version:

java -version; javac -version

The output informs you which Java edition is in use.

Set Up a Non-Root User for Hadoop Environment


It is advisable to create a non-root user, specifically for the Hadoop environment. A distinct user improves security and helps you manage your cluster more efficiently. To ensure the smooth functioning of Hadoop services, the user should have the ability to establish a passwordless SSH connection with the localhost.

Install OpenSSH on Ubuntu


Install the OpenSSH server and client using the following command:

sudo apt install openssh-server openssh-client -y

If OpenSSH is already present, the output confirms that the latest version is installed.

If you have installed OpenSSH for the first time, use this opportunity to implement these vital SSH security recommendations.

Create Hadoop User

Utilize the adduser command to create a new Hadoop user:

sudo adduser hdoop

The username, in this example, is hdoop. You are free to use any username and password you see fit. Switch to the newly created user and enter the corresponding password:

su - hdoop

The user now needs to be able to SSH to the localhost without being prompted for a password.

Enable Passwordless SSH for Hadoop User

Generate an SSH key pair and define the location it is to be stored in:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

The system proceeds to generate and save the SSH key pair.

Use the cat command to append the public key to the authorized_keys file in the .ssh directory:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Set the permissions for your user with the chmod command:

chmod 0600 ~/.ssh/authorized_keys

The new user is now able to SSH without needing to enter a password every time. Verify everything is set up correctly by using the hdoop user to SSH to localhost:

ssh localhost

After an initial prompt, the Hadoop user is now able to establish an SSH connection to the localhost seamlessly.

Download and Install Hadoop on Ubuntu

Visit the official Apache Hadoop project page, and select the version of Hadoop you want to implement.

The steps outlined in this tutorial use the Binary download for Hadoop Version 3.2.1.

Select your preferred option, and you are presented with a mirror link that allows you to download the Hadoop tar package.


Note: It is sound practice to verify Hadoop downloads originating from mirror sites. The instructions for using GPG or SHA-512 for verification are provided on the official download page.

Use the provided mirror link and download the Hadoop package with the wget command:
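A minimal sketch of that download step, assuming the Apache archive URL below rather than the mirror link the download page gives you, together with a helper for the SHA-512 check mentioned in the note above. The names HADOOP_URL, download_hadoop, and verify_sha512 are illustrative, not part of Hadoop.

```shell
# Assumed values: adjust to the version and mirror you actually selected.
HADOOP_VERSION="3.2.1"
# Assumption: Apache archive layout; your mirror link may differ.
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"

# Download the tarball into the current directory.
download_hadoop() {
    wget "$HADOOP_URL"
}

# Succeed only if the file's SHA-512 digest equals the expected hex string.
# The comparison ignores letter case.
verify_sha512() {
    actual="$(sha512sum "$1" | awk '{print $1}' | tr 'A-F' 'a-f')"
    expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    [ "$actual" = "$expected" ]
}

# Usage (not run automatically):
#   download_hadoop
#   verify_sha512 "hadoop-${HADOOP_VERSION}.tar.gz" "<digest from the official .sha512 file>" \
#       && echo "checksum OK"
```

The functions are only defined here, so nothing is downloaded until you call download_hadoop yourself.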


Once the download is complete, extract the files to initiate the Hadoop installation:

tar xzf hadoop-3.2.1.tar.gz

The Hadoop binary files are now located within the hadoop-3.2.1 directory.

Single Node Hadoop Deployment (Pseudo-Distributed Mode)


Hadoop excels when deployed in a fully distributed mode on a large cluster of networked servers. However, if you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node.

This setup, also called pseudo-distributed mode, allows each Hadoop daemon to run as a single Java process. A Hadoop environment is configured by editing a set of configuration files:

      • .bashrc
      • core-site.xml
      • hdfs-site.xml
      • mapred-site.xml
      • yarn-site.xml

Configure Hadoop Environment Variables (bashrc)


Edit the .bashrc shell configuration file using a text editor of your choice (we will be using nano):

sudo nano .bashrc

Define the Hadoop environment variables by adding the following content to the end of the file:

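#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
# Make the hdfs, yarn, and start-*.sh commands available on the PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"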

Once you add the variables, save and exit the .bashrc file.

It is vital to apply the changes to the current running environment by using the following command:

source ~/.bashrc

Edit File


The file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related project settings.

When setting up a single node Hadoop cluster, you need to define which Java implementation is to be utilized. Use the previously created $HADOOP_HOME variable to access the file:

sudo nano $HADOOP_HOME/etc/hadoop/

Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to the OpenJDK installation on your system. If you have installed the same version as presented in the first part of this tutorial, add the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The path needs to match the location of the Java installation on your system.

If you need help to locate the correct Java path, run the following command in your terminal window:

which javac

The resulting output provides the path to the Java binary directory.

Use the provided path to find the OpenJDK directory with the following command:

readlink -f /usr/bin/javac

The section of the path just before the /bin/javac directory needs to be assigned to the $JAVA_HOME variable.
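The path trimming described above can be sketched as a small shell helper. The function name java_home_from_javac is hypothetical, and the sample path is simply the OpenJDK 8 location used earlier. This sketch also assumes the file being edited in this section is hadoop-env.sh under $HADOOP_HOME/etc/hadoop, since the file name does not appear in the text.

```shell
# Derive a JAVA_HOME candidate from the resolved javac path by stripping
# the trailing /bin/javac component.
java_home_from_javac() {
    printf '%s\n' "${1%/bin/javac}"
}

# On a machine with a JDK installed you would pass the real path:
#   java_home_from_javac "$(readlink -f "$(which javac)")"
java_home_from_javac /usr/lib/jvm/java-8-openjdk-amd64/bin/javac
```

The printed value is what goes on the right-hand side of the export JAVA_HOME= line.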

Edit core-site.xml File

The core-site.xml file defines HDFS and Hadoop core properties.

To set up Hadoop in a pseudo-distributed mode, you need to specify the URL for your NameNode, and the temporary directory Hadoop uses for the map and reduce process.

Open the core-site.xml file in a text editor:

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration to override the default values for the temporary directory and add your HDFS URL to replace the default local file system setting:
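A minimal sketch of such a configuration, written from the shell. The temporary directory /home/hdoop/tmpdata and the NameNode URL hdfs://127.0.0.1:9000 are assumed example values, and CONF_DIR, TMP_DIR, and NAMENODE_URL are hypothetical variable names. By default the file is written to a scratch directory, so nothing in a real installation is overwritten.

```shell
# Sketch only: set CONF_DIR="$HADOOP_HOME/etc/hadoop" to write the real file.
# Note that this replaces the whole file rather than editing it in place.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
TMP_DIR="${TMP_DIR:-/home/hdoop/tmpdata}"              # assumed temporary directory
NAMENODE_URL="${NAMENODE_URL:-hdfs://127.0.0.1:9000}"  # assumed NameNode URL

mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>${TMP_DIR}</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>${NAMENODE_URL}</value>
  </property>
</configuration>
EOF
echo "wrote $CONF_DIR/core-site.xml"
```

fs.defaultFS is the current name of the property that older guides call fs.default.name.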


This example uses values specific to the local system. You should use values that match your system's requirements. The data needs to be consistent throughout the configuration process.

Do not forget to create a Linux directory in the location you specified for your temporary data.

Edit hdfs-site.xml File

The properties in the hdfs-site.xml file govern the location for storing node metadata, fsimage file, and edit log file. Configure the file by defining the NameNode and DataNode storage directories.

Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single node setup.

Use the following command to open the hdfs-site.xml file for editing:

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories to your custom locations:
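A minimal sketch of that configuration, assuming /home/hdoop/dfsdata as the storage root (an example value, not a requirement). CONF_DIR and DFS_DATA are hypothetical variable names, and the file goes to a scratch directory unless CONF_DIR is set.

```shell
# Sketch only: set CONF_DIR="$HADOOP_HOME/etc/hadoop" to write the real file.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
DFS_DATA="${DFS_DATA:-/home/hdoop/dfsdata}"   # assumed storage root

mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/hdfs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>${DFS_DATA}/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>${DFS_DATA}/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
echo "wrote $CONF_DIR/hdfs-site.xml"
```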


If necessary, create the directories you specified for NameNode and DataNode storage.

Edit mapred-site.xml File

Use the following command to access the mapred-site.xml file and define MapReduce values:

sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following configuration to change the default MapReduce framework name value to yarn: 
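A minimal sketch of that change. CONF_DIR is a hypothetical variable that defaults to a scratch directory, so a real mapred-site.xml is only touched if you point it there.

```shell
# Sketch only: set CONF_DIR="$HADOOP_HOME/etc/hadoop" to write the real file.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
echo "wrote $CONF_DIR/mapred-site.xml"
```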

Edit yarn-site.xml File

The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the Node Manager, Resource Manager, Containers, and Application Master.

Open the yarn-site.xml file in a text editor:

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Append the following configuration to the file:
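A minimal sketch with only two properties: the shuffle auxiliary service that MapReduce on YARN needs, and the ResourceManager host. The address 127.0.0.1 is an assumed single-node value, and CONF_DIR and RM_HOST are hypothetical variable names. A fuller setup may need additional properties.

```shell
# Sketch only: set CONF_DIR="$HADOOP_HOME/etc/hadoop" to write the real file.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
RM_HOST="${RM_HOST:-127.0.0.1}"   # assumed ResourceManager host for a single node

mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/yarn-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>${RM_HOST}</value>
  </property>
</configuration>
EOF
echo "wrote $CONF_DIR/yarn-site.xml"
```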






Format HDFS NameNode

It is important to format the NameNode before starting Hadoop services for the first time:

hdfs namenode -format

The shutdown notification signifies the end of the NameNode format process.

Start Hadoop Cluster

Navigate to the hadoop-3.2.1/sbin directory and execute the following commands to start the NameNode and DataNode:


The system takes a few moments to initiate the necessary nodes.

Once the namenode, datanodes, and secondary namenode are up and running, start the YARN resource and nodemanagers by typing:


As with the previous command, the output informs you that the processes are starting.

Type this simple command to check if all the daemons are active and running as Java processes:


If everything is working as intended, the resulting list of running Java processes contains all the HDFS and YARN daemons.
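The start-up and check sequence can be sketched as follows. start-dfs.sh and start-yarn.sh are the scripts shipped in Hadoop's sbin directory, and jps is the JDK tool that lists running Java processes. The helper missing_daemons and the variable EXPECTED_DAEMONS are hypothetical names introduced for illustration.

```shell
# Daemons expected on a single-node (pseudo-distributed) setup.
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# Print every expected daemon that is absent from the given jps output.
# jps prints one "<pid> <name>" pair per line; names are matched exactly.
missing_daemons() {
    for d in $EXPECTED_DAEMONS; do
        printf '%s\n' "$1" \
            | awk -v name="$d" '$2 == name { found = 1 } END { exit !found }' \
            || printf '%s\n' "$d"
    done
}

# On the cluster (not run here):
#   "$HADOOP_HOME/sbin/start-dfs.sh"
#   "$HADOOP_HOME/sbin/start-yarn.sh"
#   missing_daemons "$(jps)"
```

An empty result from missing_daemons means all five daemons were found.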

Access Hadoop UI from Browser

Use your preferred browser and navigate to your localhost URL or IP. The default port number 9870 gives you access to the Hadoop NameNode UI:


The NameNode user interface provides a comprehensive overview of the entire cluster.

The default port 9864 is used to access individual DataNodes directly from your browser:


The YARN Resource Manager is accessible on port 8088:


The Resource Manager is an invaluable tool that allows you to monitor all running processes in your Hadoop cluster.
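As a quick reference, the three web interfaces mentioned in this section can be listed from the shell. The sketch assumes the services run on localhost; UI_HOST and ui_url are hypothetical names.

```shell
# Host serving the web UIs; override for a remote machine.
UI_HOST="${UI_HOST:-localhost}"

# Build the URL for a given UI port.
ui_url() {
    printf 'http://%s:%s\n' "$UI_HOST" "$1"
}

# 9870 = NameNode, 9864 = DataNode, 8088 = YARN Resource Manager
for port in 9870 9864 8088; do
    ui_url "$port"
done

# With the cluster running, a status check could look like:
#   curl -s -o /dev/null -w '%{http_code}\n' "$(ui_url 9870)"
```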


You have successfully installed Hadoop on Ubuntu and deployed it in a pseudo-distributed mode. A single node Hadoop deployment is an excellent starting point to explore basic HDFS commands and acquire the experience you need to design a fully distributed Hadoop cluster.


  • Ubuntu 18.04 installed on a virtual machine.

Create A New User

To add the new user ‘hduser’, provide only the new password and skip the remaining prompts by pressing Enter.

Add the highlighted line in the sudoers file to give root permissions to hduser.

Java Environment

Switch user to hduser and download the JDK in /usr/local.

Setting Oracle JDK as the default JVM

Setting Up Hduser with SSH Keys & Enable Localhost

Hadoop 3.x Installation Step By Step

Setting Up Hadoop 3.x in Pseudo Distributed Mode

Disable IPV6

Setting Hadoop Environment

Setting up Hadoop Configurations

Output to format the namenode with Cluster ID



  • Ubuntu 18.04 installed on a virtual machine.

What are we going to install in order to create the Hadoop Multi-Node Cluster?

  • Java 8;
  • SSH;
  • PDSH;

1st Step: Configuring our Network

Go to the Network Settings of your Virtual Machine and enable Adapter 2. Then, instead of NAT, choose Virtual Host-Only Adapter, and where it says “Promiscuous Mode” select the option “Allow All”.

2nd Step:

Install SSH using the following command:

It will ask you for the password. When it asks for confirmation, just give it.

3rd Step:

Install PDSH using the following command:

Just as before, give confirmation when needed.

4th Step:

Open the .bashrc file with the following command:

At the end of the file just write the following line:

5th Step:

Now let’s configure SSH. Let’s create a new key using the following command:

Just press Enter every time it is needed.

6th Step:

Now we need to copy the public key to the authorized_keys file with the following command:

7th Step:

Now we can verify the SSH configuration by connecting to the localhost:

Just type “yes” and press Enter when needed.

8th Step:

This is the step where we install Java 8. We use this command:

Just as previously, give confirmation when needed.

9th Step:

This step isn’t really a step, it’s just to check if Java is now correctly installed:


10th Step:

Download Hadoop using the following command:

11th Step:

We need to unzip the hadoop-3.2.1.tar.gz file with the following command:

12th Step:

Change the hadoop-3.2.1 folder name to hadoop (this makes it easier to use). Use this command:

13th Step:

Open the file in the nano editor to edit JAVA_HOME:

Paste this line to JAVA_HOME:

(I forgot to take a screenshot for this step, but it’s really easy to find. Once you find it, just remove the # comment marker and paste the line as I said.)

14th Step:

Move the hadoop folder to /usr/local/hadoop. This is the command:

Provide the password when needed.

15th Step:

Open the environment file on nano with this command:

Then, add the following configurations:

16th Step:

Now we will add a user called hadoopuser, and we will set up its configuration:

Provide the password and you can leave the rest blank, just press Enter.

Now type these commands:

17th Step:

Now we need to verify the machine IP address:

Note the IP address shown for your machine. It will be different from mine, so you need to act accordingly when the IP addresses are used later.

My network will be as follows:




In your case, just keep adding 1 to the last number of the IP you get on your machine, just as I did for mine.
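The "keep adding 1 to the last number" rule can be sketched as a small helper that prints hosts-file entries. The function name hosts_entries, the address 192.168.56.101, and the worker names hadoop-slave1 and hadoop-slave2 are assumptions for illustration. Only the name hadoop-master is used later in this guide.

```shell
# Print hosts-file entries for a master and two workers, starting from the
# master's IPv4 address and adding 1 to the last octet for each further node.
hosts_entries() {
    prefix="${1%.*}"   # first three octets
    last="${1##*.}"    # last octet
    i=0
    for name in hadoop-master hadoop-slave1 hadoop-slave2; do
        printf '%s.%s %s\n' "$prefix" "$(( last + i ))" "$name"
        i=$(( i + 1 ))
    done
}

hosts_entries 192.168.56.101   # example address; use the one your machine reports
```

The helper does no range checking, so pick a starting address whose last octet leaves room for the extra nodes.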

18th Step:

Open the hosts file and insert your Network configurations:

19th Step:

Now is the time to create the Slaves.

Shut down your Master Virtual Machine and clone it twice, naming one Slave1 and the other Slave2.

Make sure the “Generate new MAC addresses for all network adapters” option is chosen.

Also, make a Full Clone.

Clone for Slave1, then do the same for Slave2.

20th Step:

On the master VM, open the hostname file on nano:

Insert the name of your master virtual machine. (Note: it’s the same name you entered previously in the hosts file.)

Now do the same on the slaves:

Also, you should reboot all of them so this configuration takes effect:

21st Step:

Configure the SSH on hadoop-master, with the hadoopuser. This is the command:

22nd Step:

Create an SSH key:

23rd Step:

Now we need to copy the SSH key to all the users. Use this command:

24th Step:

On hadoop-master, open core-site.xml file on nano:

Then add the following configurations:

25th Step:

Still on hadoop-master, open the hdfs-site.xml file.

Add the following configurations:

26th Step:

We’re still on hadoop-master, let’s open the workers file:

Add these two lines: (the slave names, remember the hosts file?)
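A minimal sketch of the workers file, assuming the worker hostnames are hadoop-slave1 and hadoop-slave2 (placeholder names that must match whatever you entered in the hosts file). CONF_DIR is a hypothetical variable. With the layout used in this guide the real file would be /usr/local/hadoop/etc/hadoop/workers, but the sketch writes to a scratch directory by default.

```shell
# Sketch only: set CONF_DIR=/usr/local/hadoop/etc/hadoop to write the real file.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
mkdir -p "$CONF_DIR"

# One worker hostname per line.
printf '%s\n' hadoop-slave1 hadoop-slave2 > "$CONF_DIR/workers"
cat "$CONF_DIR/workers"
```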

27th Step:

We need to copy the Hadoop Master configurations to the slaves. To do that, we use these commands:

28th Step:

Now we need to format the HDFS file system. Run these commands:

29th Step:

Start HDFS with this command:

To check if this worked, run the following command. This will tell you what resources have been initialized:

Now we need to do the same in the slaves:

30th Step:

Let’s see if this worked:

Open your browser and type hadoop-master:9870.

This is what mine shows, hopefully yours is showing the same thing!

As you can see, both nodes are operational!

31st Step:

Let’s configure YARN. Just execute the following commands:

32nd Step:

In both slaves, open yarn-site.xml on nano:

You have to add the following configurations on both slaves:

33rd Step:

On the master, let’s start YARN. Use this command:

34th Step:

Open your browser. Now you will type http://hadoop-master:8088/cluster

As you can see, the cluster shows 2 active nodes!

35th Step:

Just kidding, there are no more steps. Hopefully you managed to do it all correctly, and if so, congratulations on building a Hadoop Multi-Node Cluster!


  • You have a Google Cloud account. If not, create a free-tier Google Cloud account. This will give you USD 300 of free credit
  • Manual installation of Cloudera Manager without Google’s Dataproc functionality

Once you have created a Google Cloud account, navigate to the console and hit the drop-down for “Select a project”

Now on the top-right, hit “NEW PROJECT”. Add a “project name” and click save. Leave “organization” as-is

From the navigation menu on the left, select Compute Engine -> VM Instances

“Create” a new VM Instance

Add a generic name for the instance. I generally use instance-1 or instance-001 and continue the numbers consecutively

Select the “us-central1 (Iowa)” region with the “us-central1-a” zone. This seems to be the cheapest option available

The “n1” series of general-purpose machine types is the cheapest option

Under machine type, select “Custom” with 2 vCPU cores and 12 GB of RAM. Please note there is a limit to the number of cores and total RAM provided under the free-tier usage policy

Under “Boot disk”, select CentOS 7 as the OS and 100 GB as storage

Under Identity and API access, leave the access scopes as-is

Under Firewall, select both boxes to enable HTTP and HTTPS traffic

Repeat the steps above to create 4 nodes, each with the same configuration

In the SSH drop-down, select “open in browser window”. Repeat for all nodes. Enter the commands:

sudo su -
vi /etc/selinux/config

Inside the config file, set SELINUX=disabled

vi /etc/ssh/sshd_config

Under Authentication, change

PermitRootLogin yes
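The two manual edits above can also be scripted. This is a minimal sketch assuming GNU sed, as shipped with CentOS 7. The function names disable_selinux and permit_root_login are hypothetical, and each takes the path of the file to change, so you can try them on a copy first.

```shell
# Set SELINUX=disabled in a SELinux config file (path passed as $1).
# The SELINUXTYPE= line is left untouched.
disable_selinux() {
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$1"
}

# Set "PermitRootLogin yes" in an sshd_config file (path passed as $1),
# replacing an existing or commented-out PermitRootLogin line.
permit_root_login() {
    sed -i -E 's/^#?[[:space:]]*PermitRootLogin[[:space:]].*/PermitRootLogin yes/' "$1"
}

# Usage as root on each node (not run automatically):
#   disable_selinux /etc/selinux/config
#   permit_root_login /etc/ssh/sshd_config
```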

Now we can log in to instance-2/3/4 from instance-1 without a password

Ensure that you’ve done the above steps on all nodes, then reboot all 4 nodes

Re-login into instance-1 as root user and enter:

hit enter three times

and your keys will be generated under /root/.ssh/

In instance-1, as root user:

cd /root/.ssh

And copy the public key

In cloud console menu, metadata -> sshkeys -> edit -> add item -> enter key and save

Now, in the terminal, on all nodes:

service sshd restart

From instance-1:

ssh instance-2

“yes” to establish connection

Repeat for instance-3 and 4

Cluster setup is completed for 4 nodes on Google Cloud Platform

Download and install Java (the JDK .rpm package) on instance-1

Let’s install it on the other nodes now:

Copying the jdk….rpm to the other nodes

scp jdk….rpm instance-2:/tmp
scp jdk….rpm instance-3:/tmp
scp jdk….rpm instance-4:/tmp

Let’s navigate to instance-2 and run the following commands:

ssh instance-2
cd /tmp
rpm -ivh jdk….rpm

Repeat the same steps on instance-3/4

Java is installed on all 4 nodes on Google Cloud Platform


Head over to the Cloudera Manager Downloads page, enter your details and hit download. Copy this link

To install CM, change permissions and run the installer.bin

wget <and paste it here>
chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

This window will now open, hit Next and accept all licenses

This launches the Cloudera Manager Login Page. Use admin/admin as credentials

Here’s the Cloudera Manager Homepage

Accept the licenses

There are other options, but this 60 day Enterprise-trial period seems to be the best option.

  • Under VPC Network, hit Firewall rules
  • Click new to create a firewall rule and add a generic name
  • Ensure the logs are turned off
  • Selecting Ingress traffic, ensure to pick “All instances in the network”
  • Pick IP Ranges under source filter
  • Under Source IP ranges, enter your public IP address
  • Ensure tcp and udp are checked and the appropriate ports are selected: 22,7180,7187,8888,8889
  • Hit create

Let’s do the cluster installation now

Adding a generic name

Add the internal IP addresses of the 4 nodes here and search

We’re selecting a public repository

Parcels method chosen for installation

Using the latest CDH version, all other configs on this page are left as-is

Accepting the JDK license

We’re using the root login through password-less communication. To authenticate we will now use the private key from the ssh-keygen we did earlier

Using root login on instance-1:

cd /root/.ssh
cat id_rsa

Saving the private key to a file and uploading it in the cluster installation’s current step i.e. “Private Key File”

Install Parcels, this takes a while

This concludes the Cloudera Manager cluster installation

Essential Services chosen

PostgreSQL embedded database chosen as default. Please note this embedded PostgreSQL is not supported for use in production environments.

No changes at this step (Review Changes)

First Run Command on the services selected

Services are up and running

The cluster is up and running and the configuration is completed. Some minor configuration warnings are present; however, they can be safely ignored for the purpose of this assessment.

This concludes the Cloudera Manager cluster configuration

Congratulations! This concludes Cloudera Manager installation on Google Cloud Platform

Download VirtualBox

Download Hortonworks Sandbox

  1. Choose Installation type – Virtual Box
  2. Press Let’s Go
  3. Fill up your Details (for signup with Cloudera)
  4. Press Continue
  5. Click the check box to accept the T&C
  6. Press Submit
  7. Click and Download Sandbox 2.5.0
  8. Open Virtual Box and import the .ova file.
  9. After the import has finished, select the “Hortonworks Docker Sandbox HDP” environment and click the start button
  10. It takes some time before the startup screen appears.
  11. Open your browser and go to
        Explore the Hortonworks Data Platform (make sure you disabled your popup blocker)
  12. Use maria_dev as username and password
  13. You will see the Ambari Sandbox


Data Analysis with Hive

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Hive was initially developed by Facebook. Later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

  • A relational database
  • A design for OnLine Transaction Processing (OLTP)
  • A language for real-time queries and row-level updates

Features of Hive

  • It stores schema in a database and processed data in HDFS.
  • It is designed for OLAP.
  • It provides an SQL-like query language called HiveQL or HQL.
  • It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive:

[Figure: Hive architecture]

This component diagram contains different units. The following table describes each unit:

Unit Name | Operation
User Interface | Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).
Meta Store | Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mapping.
HiveQL Process Engine | HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for a MapReduce job and process it.
Execution Engine | The Hive Execution Engine connects the HiveQL Process Engine and MapReduce. It processes the query and generates the same results as MapReduce would. It uses the flavor of MapReduce.
HDFS or HBase | The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

[Figure: how Hive works]

Hive data types are categorized into numeric types, string types, misc types, and complex types. A list of Hive data types is given below.

Integer Types

Type | Size | Range
TINYINT | 1-byte signed integer | -128 to 127
SMALLINT | 2-byte signed integer | -32,768 to 32,767
INT | 4-byte signed integer | -2,147,483,648 to 2,147,483,647
BIGINT | 8-byte signed integer | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
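
These ranges follow from two's-complement arithmetic: an n-byte signed integer spans -2^(8n-1) to 2^(8n-1)-1. A minimal sketch that recomputes the first three rows with shell arithmetic follows. The function name signed_range is hypothetical, and the 8-byte case is left out because it would overflow the shell's own 64-bit integers.

```shell
# Print the value range of an n-byte signed (two's-complement) integer.
# Intended for n = 1, 2 or 4.
signed_range() {
    bits=$(( $1 * 8 ))
    max=$(( (1 << (bits - 1)) - 1 ))
    min=$(( -max - 1 ))
    printf '%s to %s\n' "$min" "$max"
}

signed_range 1   # TINYINT
signed_range 2   # SMALLINT
signed_range 4   # INT
```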

Decimal Type

Type | Size | Description
FLOAT | 4-byte | Single precision floating point number
DOUBLE | 8-byte | Double precision floating point number

Date/Time Types


  • The TIMESTAMP type supports the traditional UNIX timestamp with optional nanosecond precision.
  • As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
  • As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with decimal precision.
  • As string, it follows java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” (9 decimal place precision)


The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD. However, it does not include the time of day. The Date type ranges from 0000-01-01 to 9999-12-31.

String Types


The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").


The varchar is a variable-length type whose length specifier ranges between 1 and 65535 and sets the maximum number of characters allowed in the character string.


The char is a fixed-length type whose maximum length is fixed at 255.

Complex Type

Type | Description | Example
Struct | It is similar to a C struct or an object, where fields are accessed using the “dot” notation. | struct('James','Roy')
Map | It contains key-value tuples, where the fields are accessed using array notation. | map('first','James','last','Roy')
Array | It is a collection of values of a similar type that are indexable using zero-based integers. | array('James','Roy')
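
To try the three access styles together, a short HiveQL query can be written to a file from the shell and run with the Hive command line. This is a sketch only: QUERY_FILE is a hypothetical variable, the aliases are arbitrary, and it assumes a Hive version that accepts a SELECT without a FROM clause in the subquery.

```shell
# Write the query to a scratch file unless QUERY_FILE is set.
QUERY_FILE="${QUERY_FILE:-$(mktemp)}"

cat > "$QUERY_FILE" <<'EOF'
-- Build one value of each complex type, then read one element from each.
SELECT t.s.col1     AS struct_field,   -- struct() names its fields col1, col2, ...
       t.m['first'] AS map_value,      -- map lookup by key
       t.a[0]       AS array_element   -- zero-based array index
FROM (
  SELECT struct('James', 'Roy')               AS s,
         map('first', 'James', 'last', 'Roy') AS m,
         array('James', 'Roy')                AS a
) t;
EOF

echo "wrote $QUERY_FILE"
# With Hive installed: hive -f "$QUERY_FILE"
```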