Spark Architecture


1st Step:


2nd Step:


3rd Step:

sudo nano /etc/hostname

4th Step:

ip addr

5th Step:

sudo nano /etc/hosts
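
For reference, the entries added to /etc/hosts in this step might look like the sketch below. The host names pd-master, pd-slave1 and pd-slave2 are the ones used later in this post; the IP addresses are made-up placeholders, so substitute whatever `ip addr` reported on each of your machines. The snippet writes to a scratch file instead of /etc/hosts, so it is safe to run as-is.

```shell
#!/usr/bin/env bash
# Sketch only: the 192.168.1.x addresses are placeholders, not real values.
HOSTS_SNIPPET="$(mktemp)"

cat > "$HOSTS_SNIPPET" <<'EOF'
192.168.1.10  pd-master
192.168.1.11  pd-slave1
192.168.1.12  pd-slave2
EOF

# Review the entries, then append them on every node with, for example:
#   sudo tee -a /etc/hosts < "$HOSTS_SNIPPET"
cat "$HOSTS_SNIPPET"
```

Every node needs the same three entries so that each machine can resolve the others by name.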

6th Step:

sudo reboot

7th Step:

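$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install openjdk-11-jdk
$ java -version
Note: the webupd8team/java PPA has been discontinued, and openjdk-11-jdk is normally available from the standard Ubuntu repositories, so the add-apt-repository line can most likely be skipped.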

8th Step:

$ sudo apt-get install scala
$ scala -version

9th Step:

$ sudo apt-get install openssh-server openssh-client
$ ssh-keygen -t rsa -P ""
$ cat ~/.ssh/ >> ~/.ssh/authorized_keys
$ ssh-copy-id user@pd-master
$ ssh-copy-id user@pd-slave1
$ ssh-copy-id user@pd-slave2
$ ssh pd-slave1
$ ssh pd-slave2
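
If you want to double-check that the key exchange worked, a small helper like the one below can try a non-interactive login on each node. This is only a sketch: the function name check_passwordless_ssh is my own, and the host names are the ones from the ssh-copy-id commands above, so adjust them to your cluster.

```shell
#!/usr/bin/env bash
# Sketch: report whether key-based SSH login works for each node.
# Host names come from the ssh-copy-id commands above; adjust to your cluster.
NODES="pd-master pd-slave1 pd-slave2"

check_passwordless_ssh() {
  local host
  local status=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
      status=1
    fi
  done
  return "$status"
}

# Usage (run on the master):
#   check_passwordless_ssh $NODES
```

A node reported as FAILED usually means its authorized_keys file does not contain the master's public key yet.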

10th Step:

$ wget
$ tar xvf spark-2.4.4-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.4-bin-hadoop2.7 /usr/local/spark
$ gedit ~/.bashrc
export PATH=$PATH:/usr/local/spark/bin
Note: the original screenshot had a mistake: it showed spaces around “=”. Don’t leave spaces like that; write “PATH=$PATH:/usr/local/spark/bin”.
$ source ~/.bashrc
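
One small annoyance with this approach is that every `source ~/.bashrc` appends the Spark directory to PATH again. A guarded version, sketched below, only appends it when it is missing. The helper name add_to_path is my own invention; the directory is the install location used in this step.

```shell
#!/usr/bin/env bash
# Sketch: add Spark's bin directory to PATH only if it is not already there.
SPARK_BIN="/usr/local/spark/bin"   # install location used in this step

add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="$PATH:$1" ;;        # note: no spaces around "="
  esac
}

add_to_path "$SPARK_BIN"
export PATH
```

Placing these lines in ~/.bashrc instead of the plain export keeps PATH from growing each time the file is sourced.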

11th Step:

$ cd /usr/local/spark/conf
$ cp
$ sudo vim
export SPARK_MASTER_HOST='<MASTER-IP>'
export JAVA_HOME=<Path_of_JAVA_installation>
$ sudo nano slaves
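
To make the end result of this step concrete, here is a sketch of what the two configuration files could contain. I am assuming the standard Spark standalone file names, conf/spark-env.sh and conf/slaves; the master IP and the JAVA_HOME path are placeholders to replace with your own values, and the worker names are the ones from the SSH step. The files are written to a scratch directory so nothing under /usr/local/spark is touched.

```shell
#!/usr/bin/env bash
# Sketch only: CONF_DIR stands in for /usr/local/spark/conf.
CONF_DIR="$(mktemp -d)"

# spark-env.sh: placeholder master IP and a typical (assumed) OpenJDK 11 path.
cat > "$CONF_DIR/spark-env.sh" <<'EOF'
export SPARK_MASTER_HOST='192.168.1.10'
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
EOF

# slaves: one worker host name per line.
cat > "$CONF_DIR/slaves" <<'EOF'
pd-slave1
pd-slave2
EOF

echo "Sample config written to $CONF_DIR"
```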

12th Step:

$ cd /usr/local/spark
$ ./sbin/
$ ./sbin/

13th Step:

$ jps
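
In a standalone cluster, `jps` should normally list a process named Master on the master node and one named Worker on each worker node. If you would rather script that check than read the output by eye, a helper along these lines works; the function name has_daemon is my own, and it only parses the usual "<pid> <name>" lines that jps prints.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the jps output on stdin contains the given process name.
has_daemon() {
  # $1 = process name to look for; stdin = output of `jps`
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Usage on the master node:
#   jps | has_daemon Master && echo "Master is running"
# Usage on a worker node:
#   jps | has_daemon Worker && echo "Worker is running"
```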

14th Step:


Sign Up with Databricks Community Edition

You can sign up for the free Databricks Community Edition on the Databricks website.

To sign up, you need to provide some personal information along with your reason for using it. Once you submit the form, you should see a confirmation notice; then wait for the email.

Databricks will then send a verification email to the address you provided.


Launch the Databricks, Upload the Data, and Write Your First Script

Sign in to Databricks Community Edition.

1. Sign in with the username and password you registered with Databricks.

2. Create a new cluster. On the left-hand side, click ‘Clusters’, then specify the cluster name and the Apache Spark and Python versions. For simplicity, I will choose 4.3 (includes Apache Spark 2.4.5, Scala 2.11) by default. To check that the cluster is running, look for it as active and running under the ‘Interactive Clusters’ section.

3. Pick a sample dataset, for example Cars93.

Home → Import & Explore Data → Drag File to Upload → Create Table in Notebook



4. Run the following commands in the workspace environment. It’s a simple way to interact with data in the FileStore. To change languages within a cell, start the cell with one of these magic commands:

  • %python – Allows you to execute Python code in the cell.
  • %r – Allows you to execute R code in the cell.
  • %scala – Allows you to execute Scala code in the cell.
  • %sql – Allows you to execute SQL statements in the cell.
  • %sh – Allows you to execute Bash shell commands and code in the cell.
  • %fs – Allows you to execute Databricks Filesystem commands in the cell.
  • %md – Allows you to render Markdown syntax as formatted content in the cell.
  • %run – Allows you to run another notebook from a cell in the current notebook.

To read more about magic commands, see the Databricks notebook documentation.





Spark can be deployed in a Hadoop cluster in three ways: standalone, YARN, and SIMR (Spark in MapReduce).

Some Intuition

In Spark, there are two modes to submit a job:

(i) Client mode

(ii) Cluster mode

Client mode: 

In client mode, Spark is installed on our local client machine, so the Driver program (which is the entry point to a Spark program) resides on the client machine, i.e. the SparkSession or SparkContext lives on the client machine.

Whenever we submit a job with “spark-submit”, the request goes to the Resource Manager, which then starts the Application Master on one of the Worker nodes.

Note: I am skipping the detailed intermediate steps explained above here.

The Application Master launches the Executors (i.e. Containers in terms of Hadoop) and the jobs will be executed.

After the Executors are launched, they start communicating directly with the Driver program (i.e. the SparkSession or SparkContext), and the output is returned directly to the client.

The drawback of Spark client mode w.r.t. YARN is that the client machine needs to stay available for as long as the job is running. You cannot submit your job, turn off your laptop, and leave the office before the job has finished. 😛

If you do, the job won’t be able to return its output, because the connection between the Driver and the Executors will be broken.

Cluster Mode: 

The only difference in this mode is that Spark is installed on the cluster, not on the local machine. Whenever we submit a job with “spark-submit”, the request goes to the Resource Manager, which then starts the Application Master on one of the Worker nodes.

Now, the Application Master launches the Driver program (which holds the SparkSession/SparkContext) on that Worker node.

That means that in cluster mode the Spark driver runs inside an Application Master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, by contrast, the driver runs on the client machine, and the Application Master is only used for requesting resources from YARN.
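
On the command line, the choice between the two modes comes down to the --deploy-mode flag of spark-submit (Spark 2.x syntax). The sketch below only builds and prints the command so you can see the difference; the function name build_submit_cmd and the my-app.jar file name are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the spark-submit command for a given YARN deploy mode.
build_submit_cmd() {
  # $1 = deploy mode (client or cluster), $2 = application jar or script
  case "$1" in
    client|cluster) ;;
    *) echo "deploy mode must be 'client' or 'cluster'" >&2; return 1 ;;
  esac
  echo "spark-submit --master yarn --deploy-mode $1 $2"
}

# Driver stays on the submitting machine:
build_submit_cmd client my-app.jar
# Driver runs inside the Application Master on the cluster:
build_submit_cmd cluster my-app.jar
```

With the cluster variant you can disconnect after submitting; with the client variant the submitting machine has to stay up until the job ends.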


Here, I have explained how the Spark Driver and Executors work.


Integrate Spark with YARN (General Procedure)

To communicate with the YARN Resource Manager, Spark needs to be aware of your Hadoop configuration. This is done via the HADOOP_CONF_DIR environment variable. The SPARK_HOME variable is not mandatory but is useful when submitting Spark jobs from the command line.

  • Edit the “bashrc” file and add the following lines:

export HADOOP_CONF_DIR=/<path of hadoop dir>/etc/hadoop
export YARN_CONF_DIR=/<path of hadoop dir>/etc/hadoop
export SPARK_HOME=/<path of spark dir>
export LD_LIBRARY_PATH=/<path of hadoop dir>/lib/native:$LD_LIBRARY_PATH

  • Restart your session by logging out and logging in again.
  • Rename the spark default template config file:
    mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
  • Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:
    spark.master yarn

Copy all Spark jars from $SPARK_HOME/jars to HDFS so that they can be shared among all the worker nodes:

hdfs dfs -mkdir -p /user/spark/share/lib
hdfs dfs -put $SPARK_HOME/jars/*.jar /user/spark/share/lib

Add/modify the following parameters in spark-defaults.conf:

spark.master yarn
spark.yarn.jars hdfs://hmaster:9000/user/spark/share/lib/*.jar
spark.executor.memory 1g
spark.driver.memory 512m


Option 1 

If you already have Hadoop installed on your cluster and want to run Spark on YARN, it’s very easy:

Step 1: Find the YARN Master node (i.e. which runs the Resource Manager). The following steps are to be performed on the master node only.

Step 2: Download the Spark tgz package and extract it somewhere.

Step 3: Define these environment variables, in .bashrc for example:

# Spark variables
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=<extracted_spark_package>

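Step 4: Run your Spark job with the --master option set to yarn-client or yarn-cluster (on Spark 2.0 and later, use --master yarn together with --deploy-mode client or --deploy-mode cluster instead):

spark-submit \
--master yarn-client \
--class org.apache.spark.examples.JavaSparkPi \
$SPARK_HOME/lib/spark-examples-1.5.1-hadoop2.6.0.jar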

This particular example uses a pre-compiled example job which comes with the Spark installation.


Option 2