Spark 3.0
Spark
Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. Simply put, Spark is a fast and general engine for large-scale data processing.
The fast part means that it’s faster than previous approaches to working with Big Data, such as classical MapReduce. The secret to this speed is that Spark keeps data in memory (RAM), which makes processing much faster than reading from and writing to disk drives.
The general part means that it can be used for multiple things like running distributed SQL, creating data pipelines, ingesting data into a database, running Machine Learning algorithms, working with graphs or data streams, and much more.
Components of Spark
Spark as a whole consists of various libraries and APIs. The main components of Apache Spark are as follows:
Spark Core
Spark Core is the basic building block of Spark; it includes the components for job scheduling, memory management, fault tolerance, and more. Spark Core is also home to the API that defines Resilient Distributed Datasets (RDDs) and provides the operations for building and manipulating them.
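For a flavor of the RDD API that Spark Core exposes, here is a minimal PySpark sketch; the numbers and the squaring logic are invented purely for illustration:

```python
from pyspark.sql import SparkSession

# Minimal RDD sketch: build an RDD, apply a transformation, run an action.
spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])   # RDD from an in-memory list
squares = numbers.map(lambda x: x * x)      # transformation (lazy)
total = squares.reduce(lambda a, b: a + b)  # action triggers the computation
print(total)                                # 55

spark.stop()
```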
Spark SQL
Apache Spark works with structured data using its ‘go to’ tool, Spark SQL. Spark SQL allows querying data via SQL, as well as via Apache Hive’s dialect of SQL, the Hive Query Language (HQL). It also supports data from various sources such as Parquet files, JSON, log files, and Hive tables. Spark SQL lets programmers combine SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, Scala, and R.
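As a rough, hedged illustration of querying the same data with SQL and with the DataFrame API, consider the sketch below; the people.json file and its name and age columns are assumptions made for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# Hypothetical input file and columns, used only for illustration.
people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

# The same question asked two ways: plain SQL and the DataFrame API.
adults_sql = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults_df = people.select("name", "age").where(people.age >= 18)

adults_sql.show()
adults_df.show()

spark.stop()
```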
Spark Streaming
Spark Streaming processes live streams of data. Data generated by various sources is processed by Spark Streaming the moment it arrives. Examples of this data include log files, messages containing status updates posted by users, etc.
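A minimal sketch of the classic DStream-based word count, assuming a plain-text source on a local socket (for example, one opened with nc -lk 9999); the host, port, and batch interval are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Word count over a live text stream, in 5-second micro-batches.
sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts

ssc.start()
ssc.awaitTermination()
```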
MLlib
Apache Spark ships with a library of common Machine Learning (ML) services called MLlib. It provides various types of ML algorithms, including regression, clustering, and classification, which can perform various operations on data to extract meaningful insights from it.
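As a rough sketch of an MLlib workflow, the example below fits a linear regression model; the housing.csv file and its rooms, area, and price columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical dataset and column names, used only for illustration.
data = spark.read.csv("housing.csv", header=True, inferSchema=True)

# MLlib expects the input features packed into a single vector column.
assembler = VectorAssembler(inputCols=["rooms", "area"], outputCol="features")
train = assembler.transform(data).select("features", "price")

model = LinearRegression(featuresCol="features", labelCol="price").fit(train)
print(model.coefficients, model.intercept)

spark.stop()
```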
GraphX
GraphX is Apache Spark’s library for graphs and graph-parallel computation. It includes a number of graph algorithms that help users simplify graph analytics.
Prerequisite
- Hardware / Software Requirements
- For Local Installation of Spark:
- Good to have: a good knowledge of Python and some data analytics libraries such as Pandas
Course Outline
Module 1: Introduction to Hadoop and its Ecosystem
The objective of this Apache Hadoop ecosystem components tutorial is to give an overview of the different components of the Hadoop ecosystem that make Hadoop so powerful. We will also learn about Hadoop ecosystem components such as HDFS and HDFS components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and HBase components.
Module 2: What is Spark and its Architecture
This module introduces Spark, an open-source project from the Apache Software Foundation used mainly for Big Data analytics. We will also look at the key difference between MapReduce and Spark: their approach to data processing.
Hands-on : Local Installation of Spark with Python
Hands-on : Databricks Setup
Module 3: Python – A Quick Review
In this module, you will get a quick review of the Python language. We will not go in depth, but we will discuss some important components of the Python language.
Hands-on : Python Code Along
Hands-on : Python Review Exercise
Module 4: Spark DataFrame Basics
Data is a fundamental element in any machine learning workload, so in this module, we will learn how to create and manage datasets using Spark DataFrames. We will also gain some knowledge of the RDD, which is the fundamental data structure of Spark. A short DataFrame sketch follows the hands-on items below.
Hands-on : Working with RDD
Hands-on : Working with DataFrame
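As a small taste of this module, here is a hedged sketch of creating a DataFrame from an in-memory list and peeking at the RDD underneath it; the names and ages are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Build a DataFrame from a local list; the rows are invented for the example.
rows = [("Alice", 34), ("Bob", 45), ("Carol", 29)]
df = spark.createDataFrame(rows, ["name", "age"])

df.printSchema()
df.filter(df.age > 30).show()

# The same data is also reachable as an RDD of Row objects.
print(df.rdd.take(2))

spark.stop()
```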
Module 5: Introduction to Machine Learning with MLlib
The goal of this series is to help you get started with Apache Spark’s ML library. Together we will explore how to solve various interesting machine learning use cases in a well-structured way. By the end, you will be able to use Spark ML with high confidence and implement an organized, easy-to-maintain workflow for your future projects.
Hands-on : Consultancy Project on Linear Regression
Hands-on : Consultancy Project on Logistic Regression
Hands-on : Consultancy Project on K-Means
Module 6: Spark Streaming
In a world where we generate data at an extremely fast rate, analyzing that data correctly and delivering useful, meaningful results at the right time can provide helpful solutions for many domains dealing with data products. One of the frameworks that can handle big data in real time and perform different analyses is Apache Spark. In this section, we are going to use Spark Streaming to process high-velocity data at scale.
Hands-on : Spark Streaming Code Along with RDD
Lab Files
You can find all Lab Files and Instructions here.
Lab 1 : Installation of Spark
Lab 2 : Python_Crash_Course_Exercises (.zip file. Please unzip and Upload to Databricks)
Lab 3 : DataFrame Exercise (Spark DataFrames Project Exercise) (.zip file. Please unzip and Upload to Databricks)
Lab 4 : Consultancy Project I (Linear_Regression_Consulting_Project) (.zip file. Please unzip and Upload to Databricks)
Lab 5 : Consultancy Project II (Logistic_Regression_Consulting_Project) (.zip file. Please unzip and Upload to Databricks)
Lab 6 : Consultancy Project III (Clustering_Consulting_Project) (.zip file. Please unzip and Upload to Databricks)
Lab 7 : Spark Streaming
Extra Resources
Databricks Notebook
Book
Cheatsheet
Mindmap
- Click here to see the Mind Map for Module 1 and Module 2
Dataset
Preparing for CCA175 [Optional]
Big data and analytics are fast becoming must-have skills for companies all across the world. The technology is valued for the efficiency gained by harnessing data and using it for business decision-making on operations, cost-saving initiatives, customer service, and profitability. Such big data technologies include Hadoop, Apache Spark, Machine Learning and Data Mining.
Recruitment of professionally trained big data analysts has been a big challenge for human resource experts across the world. Professionals with the right skills and certification are rare and hard to come by.
Among the Apache Spark certifications you should acquire if you are looking to increase your skills in big data and analytics is the CCA175 Certification. This certification will also give you an advantage in the employment market or as a big data consultant.
Essentially, the following steps are necessary to prepare for and pass the CCA175 Spark and Hadoop Developer exam:
- Read everything you can about Spark & Hadoop
- Have a good understanding of how to execute HDFS Commands
- Learn how to move data between relational databases & HDFS using Sqoop
- Choose a programming language between Python and Scala
- Polish up on your SQL (HiveQL) skills
- Develop Spark-based applications using core APIs
- Integrate Spark SQL & data frames to Spark-based applications
- Learn how to stream data pipelines – Flume, Kafka and Spark Structured Streaming
- Take Practice Tests
1. The CCA175 Spark and Hadoop Developer certification exam
The CCA175 Certification is conducted by Cloudera and involves an exam on a variety of topics including Impala, Avro, Flume, HDFS, Spark with Scala and Python. The CCA175 exam is scenario-based, where you will have 2 hours to answer between 8 and 12 scenario questions, using tools like Impala or Hive, usually with some coding required.
In evaluating your score, Cloudera will look at your results and not the code itself, with a minimum score of 70% required to earn the certification. You should expect your results within 3 days of the exam, and your certificate about a week after.
The CCA175 exam is available in either the Scala or Python programming language. It is a practical, hands-on exam that is administered remotely to all registered candidates and can be taken anywhere in the world in any of the available time slots. You will see the available time slots when registering for the exam.
2. Preparing for the exam
This article is meant to be a preparatory guide for the CCA175 Spark and Hadoop Developer Certification exam. As the exam tests your proficiency in different programming languages and the quality of your code, it is important that you prepare well. I shall walk you through the objectives of the test, the skills outline, and the resources that you need to stock up on. Taking any certification exam costs time and money, so once you have decided to take the exam, you will want to pass on the first sitting.
You should strive to learn effectively, and avoid time-wasting learning that will neither improve your skills nor your chances of excelling in this exam. This guide is expected to steer you on the right path, to ensure you get the best chance of passing the exam.

Getting Started
You need to enhance your proficiency in coding using some key languages, relevant to this exam. You should, therefore, take time to revise and upgrade your skills where necessary.
To take this exam, you should have the following key skills:
- Sqoop – It is one of the Apache Foundation projects, used for efficiently transferring bulk data. It is most often used to import data from relational databases into Hadoop, but it can transfer data in the reverse direction too. You should, therefore, ensure you have a good understanding of how to use Sqoop.
- Spark – Obviously, this is the main celebrity here. You should have a good understanding and skills on how to code using Python or Scala.
- HiveQL or SQL – You should be confident about your ability to use Hive or SQL, and can write such scripts with relative ease.
Learning objectives and Skills Outline
Cloudera lists three required skills for the CCA175 certification on its website; you can look at those skills here:
| Skills | Knowledge required |
| --- | --- |
| Transform, Stage, and Store | Convert a set of data values in a given format stored in HDFS into new data values or a new data format and write them into HDFS; load data from HDFS for use in Spark applications. |
| Data Analysis | Use Spark SQL to interact with the metastore programmatically in your applications; generate reports by using queries against loaded data; use metastore tables as an input source or an output sink for Spark applications. |
| Configuration | This is a practical exam and the candidate should be familiar with all aspects of generating a result, not just writing code; supply command-line options to change your application configuration, such as increasing available memory. |
Transform, Stage, and Store
The first learning objective for the CCA175 certification is to transform, stage and store data. The Hadoop ecosystem uses a distributed file system known as the Hadoop Distributed File System, or HDFS.
By the end of your learning you should be able to do the following:
- Convert a set of HDFS data values into a new data set or format.
- Load data from HDFS, and use it for Spark applications.
- Use Spark to write back the data into HDFS.
- Read and write files in different formats.
- Use the Spark API to extract, transform and load data.
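The hedged sketch below ties these objectives together: it loads a file from HDFS, transforms it, and writes it back in a different format. The paths and the orders columns (status, quantity, unit_price) are assumptions made for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical HDFS paths and columns, for illustration only.
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# Transform: keep completed orders and add a derived column.
completed = (orders.where(F.col("status") == "COMPLETE")
                   .withColumn("total", F.col("quantity") * F.col("unit_price")))

# Store the result back into HDFS in a new format (Parquet).
completed.write.mode("overwrite").parquet("hdfs:///data/orders_complete_parquet")

spark.stop()
```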
Data Analysis
The second learning skill that will be tested is data analysis. You will be expected to understand how to use Spark applications to analyze data, filter data, run calculation routines and queries, join datasets, and produce data in required formats.
You should, therefore, be able to use Spark SQL to interact with data and generate reports using queries against loaded data. You should also be able to use metastore tables as inputs or outputs for Spark applications and query databases from Spark.
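The sketch below illustrates metastore tables as both a source and a sink; it assumes a Hive metastore is configured and that a hypothetical sales table with region and amount columns exists:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets the session talk to the Hive metastore, if one is configured.
spark = (SparkSession.builder
         .appName("metastore-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table and column names, for illustration only.
sales = spark.table("sales")          # metastore table as an input source
sales.printSchema()

report = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
""")

report.write.mode("overwrite").saveAsTable("sales_by_region")  # metastore table as a sink

spark.stop()
```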
Configuration
Upon earning the CCA175 certification, you should be able to comfortably configure and organize sets of data to different specifications. As the exam is practical with scenario questions, you will be expected to be familiar with how to solve the given problems, as the scores will be based on results and not the code.
You should also be able to change your application's configuration by supplying command-line options.
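As a hedged illustration, the same settings can be supplied either in code or as spark-submit options such as --executor-memory 4g or --conf spark.sql.shuffle.partitions=200; the specific values below are arbitrary examples:

```python
from pyspark.sql import SparkSession

# Example configuration set from code; equivalent settings can be passed on the
# command line with spark-submit (e.g. --executor-memory 4g,
# --conf spark.sql.shuffle.partitions=200). Values are arbitrary examples.
spark = (SparkSession.builder
         .appName("config-sketch")
         .config("spark.executor.memory", "4g")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

print(spark.conf.get("spark.sql.shuffle.partitions"))

spark.stop()
```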
The three skills above will be mandatory for you to pass the exam. In this article, we shall further break down the skills with a few more details, for you to prepare for.
3. Preparing for the CCA175 Certification Exam
Some people say that you can fully prepare for this exam in only one month, but in reality that is rarely possible. You will waste your money if you sit for the exam unprepared. Make sure you go through as many resources as possible before booking your exam.
Read everything you can about Spark & Hadoop
Apache Spark is a distributed processing engine that provides APIs to facilitate distributed computing. It is not a programming language but a cluster-computing framework. It is open source, making it easily available and free to use.
Choose a programming language between Python and Scala
The CCA175 certification exam is available in either Scala or Python programming languages. You should, therefore, be comfortable with the use of at least one of these.
Check out the following link if you wish to use the Scala programming language:
CCA 175 – Spark and Hadoop Developer Certification – Scala
If you are more comfortable with using the Python programming language, then you can take the following course on Udemy:
CCA 175 – Spark and Hadoop Developer – Python (pyspark)
These courses are available on Udemy at an affordable cost and will teach you the full CCA175 Spark and Hadoop Developer curriculum. They cover Apache Sqoop and how to execute HDFS commands. Also included in the course content is programming with Scala or Python, with all the fundamentals provided.
YouTube also offers some great resources that you can use for free. Check out the video below to learn about Spark.
https://www.youtube.com/watch?time_continue=145&v=9mELEARcxJo&feature=emb_logo
Some of the other places you could look online to learn about Hadoop and Spark are from the following links:
- https://blog.matthewrathbone.com/2016/09/01/a-beginners-guide-to-hadoop-storage-formats.html (An Introduction to Hadoop and Spark Storage Formats (or File Formats))
- https://www.oreilly.com/library/view/hadoop-application-architectures/9781491910313/ch01.html (Chapter 1. Data Modeling in Hadoop)
- https://www.youtube.com/watch?v=ziqx2hJY8Hg (Hadoop Tutorial: Intro to HDFS)
Have a good understanding of how to execute HDFS Commands
A basic requirement for acing the CCA175 exam is familiarity with HDFS commands. By the time you sit the exam, you should be able to comfortably execute basic and frequently used Hadoop HDFS commands to perform file operations.
This site provides some of the best tutorials from which you can learn these commands.
Learn how to move data between relational databases & HDFS using Sqoop
Sqoop is used for data transfer between Hadoop and relational databases. You should be able to use the Sqoop tool to import and export data.
Check out the tutorial below for a detailed view of what Sqoop is and how to use it.
Polish up on your SQL (HiveQL) skills
This certification exam requires that you have a good understanding of the SQL programming language. You should, therefore, polish up on your SQL skills and learn how to structure databases, author and manage SQL databases, and do data analysis with SQL. We recommend that you learn how to use the Hive Query Language.
Check out this link (Hive tutorials) that can help you to learn Hive SQL in 3 days.
Develop Spark-based applications using core APIs
In practicing for the CCA exam, you should be able to develop Spark-based applications using core APIs.
Spark will require a data structure to hold data. You can use either of two options: Dataset and DataFrame.
You should be able to perform transformation and action tasks for the dataset. Learning these tasks will be critical for you to be able to pass your exam.
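To make the transformation/action distinction concrete, here is a hedged sketch; the input path and column names are assumptions for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("core-api-sketch").getOrCreate()

# Hypothetical input; only the transformation/action distinction matters here.
df = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# Transformations are lazy: this only builds a query plan.
by_customer = (df.select("customer_id", "amount")
                 .where(F.col("amount") > 0)
                 .groupBy("customer_id")
                 .agg(F.sum("amount").alias("total")))

# Actions trigger execution of the whole plan.
by_customer.show(10)
print(by_customer.count())

spark.stop()
```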
Check out the following sites for some free tutorials that you can use to learn how to develop Spark-based applications using core APIs.
Integrate Spark SQL & data frames to Spark-based applications
A great resource that you can use to learn how to integrate Spark SQL and DataFrames into Spark-based applications is the following free tutorial from towardsdatascience.com.
Here you will get lessons on how to leverage the power of relational databases using Spark SQL and DataFrames. In greater detail, you will get to understand the challenges of scaling relational databases, understand Spark SQL and DataFrames, and gain insights from an actual case study.
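As a rough sketch of pulling a relational table into Spark SQL and DataFrames, the JDBC read below uses entirely hypothetical connection details and assumes a matching JDBC driver is on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-sketch").getOrCreate()

# Hypothetical database, table, and credentials, for illustration only.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost:3306/shop")
             .option("dbtable", "customers")
             .option("user", "retail_user")
             .option("password", "********")
             .load())

# Once loaded, the table can be queried with SQL like any other DataFrame.
customers.createOrReplaceTempView("customers")
spark.sql("SELECT country, COUNT(*) AS n FROM customers GROUP BY country").show()

spark.stop()
```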
Learn how to stream data pipelines – Flume, Kafka and Spark Structured Streaming
Watch the video below to learn how you can:
- Develop end-to-end applications that read data from web server logs.
- Stream and connect into Kafka.
- Process data using Spark Streaming.
- Write data to HBase.
https://www.youtube.com/watch?time_continue=1&v=czBLDvL1KrI&feature=emb_logo
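To give a feel for the Kafka-to-Spark leg of such a pipeline, here is a minimal, hedged Structured Streaming sketch; the broker address and the weblogs topic are placeholders, the spark-sql-kafka connector package is assumed to be available, and the console sink stands in for HBase, which would need an external connector:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical broker and topic; requires the spark-sql-kafka connector package.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "weblogs")
       .load())

# Kafka values arrive as bytes; cast to string before processing.
lines = raw.selectExpr("CAST(value AS STRING) AS line")
errors = lines.where(F.col("line").contains(" 500 "))  # e.g. HTTP 500 log lines

# Simplified sink: print to the console instead of writing to HBase.
query = (errors.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
```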
Take Practice Tests
As with all other skills, mastery requires practice.
Once you have gone through the notes and learned all the skills that you need for this course, you should embark on an intense practice regime. Remember that the CCA175 examination uses scenario questions, and you will be scored based on the final results rather than the code you write.
You need to be prepared to try as many practice questions as possible.
4. Registering for the CCA175 Certification Exam
We highly recommend that you do not register for the exam until you are thoroughly prepared for it. You should take these suggested courses, and use the resources to learn as much as possible about everything you need to know to get the certification.
Once you are confident that you have mastered the content of the course, then you can proceed and register for the examination.
How to register for the CCA175 Certification Exam:
To register for this exam,
- Log on to Cloudera using this link and click on the purchase link for CCA Spark and Hadoop Developer (CCA175). If you have never registered before at Cloudera, you will need to create a Cloudera Single Sign On (SSO) account before proceeding to register for the exam.
- Review all the details carefully and click on purchase. The certification costs $295, so you will need to have funded your account sufficiently to be able to process the payment.
- When registration is complete, you will receive an email with instructions about how to create an account with examslocal.com, to enable scheduling your exam.
- Follow the instructions, and you will be booked into your examination slot. If you can’t see your preferred time slot, check whether an alternative time slot will suit you.
Due to limited time slots, it will be best for you to register as early as possible, as time slots are given on a first-come, first-served basis.
You can also reschedule your exam, should you wish to. Again, log on to examslocal.com and click on “my exams”, where you will be guided on how to reschedule.