Spark 4.0
What is Apache Spark?
Core Architecture
- Spark SQL: Structured data queries with DataFrame APIs
- MLlib: Machine learning algorithms and utilities
- GraphX: Graph processing capabilities
- Structured Streaming: Real-time data processing
- Spark Connect: Client-server architecture for remote connectivity (introduced in Spark 3.4, enhanced in 4.0+)
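A minimal PySpark sketch of the DataFrame-centric workflow these components enable: build a session, create a small DataFrame, and run an aggregation through the Spark SQL engine (the app name and data are illustrative).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point for the DataFrame and SQL APIs.
spark = SparkSession.builder.appName("spark-overview-demo").getOrCreate()

# A tiny in-memory DataFrame; in practice this would come from files, tables, or streams.
df = spark.createDataFrame(
    [("web", 10), ("web", 5), ("mobile", 7)],
    ["channel", "clicks"],
)

# Declarative transformations are optimized by the Catalyst engine before execution.
df.groupBy("channel").agg(F.sum("clicks").alias("total_clicks")).show()
```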
Apache Spark 4.0 (Released February 2025)
Major Platform Changes
- Dropped Java 8/11 support; JDK 17 is now the default (JDK 21 also supported)
- Dropped Scala 2.12; Scala 2.13 is now the default
- Python 3.9+ required (Python 3.8 dropped)
- Hadoop upgraded from 3.3.4 to 3.4.1
- Kubernetes operator support enhanced
Core SQL Features
- Session variables for procedural SQL programming
- SQL user-defined functions for reusable query logic
- Pipe syntax for streamlined query chaining (all three features are sketched after this list)
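A hedged sketch of the three SQL features above, driven from PySpark via spark.sql; the view name, variable, and function are illustrative, and the exact syntax for session variables, SQL UDFs, and pipe operators should be checked against the 4.0 SQL reference.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-features-demo").getOrCreate()
spark.range(100).createOrReplaceTempView("readings")  # illustrative data

# Session variable: declared once, then set and referenced across statements.
spark.sql("DECLARE VARIABLE threshold INT DEFAULT 10")
spark.sql("SET VAR threshold = 25")

# SQL user-defined function: reusable scalar logic written entirely in SQL.
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION fahrenheit_to_celsius(f DOUBLE)
    RETURNS DOUBLE
    RETURN (f - 32) * 5.0 / 9.0
""")

# Pipe syntax: chain relational steps left to right with |>.
spark.sql("""
    FROM readings
    |> WHERE id > threshold
    |> SELECT id, fahrenheit_to_celsius(CAST(id AS DOUBLE)) AS celsius
    |> ORDER BY id
    |> LIMIT 5
""").show()
```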
Streaming Enhancements
PySpark Improvements
Spark Connect Enhancements
- Full API compatibility for Java client
- ML on Spark Connect support
- Swift client implementation
- spark.api.mode configuration for easy toggling
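A sketch of the spark.api.mode toggle listed above: the same DataFrame code runs either through a Spark Connect client-server session ("connect") or through the classic in-process driver ("classic").

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("api-mode-demo")
    .config("spark.api.mode", "connect")  # switch to "classic" for the legacy path
    .getOrCreate()
)

# Application code is unchanged regardless of the selected mode.
spark.range(5).selectExpr("id", "id * 2 AS doubled").show()
```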
Data Source and Format Updates
- ORC now uses ZSTD compression by default
- Enhanced columnar processing
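A small sketch of working with the new ORC default: spark.sql.orc.compression.codec controls the session-wide codec, and a per-write compression option can override it (output paths are illustrative).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-compression-demo").getOrCreate()

# ZSTD is the session default for ORC in Spark 4.0+.
print(spark.conf.get("spark.sql.orc.compression.codec"))

df = spark.range(1000)
df.write.mode("overwrite").orc("/tmp/orc_zstd")  # uses the session default codec
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/orc_snappy")  # per-write override
```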
Performance and Observability
- Enhanced Catalyst optimizer
- Better AQE (Adaptive Query Execution) handling
- Improved predicate pushdown
- Structured Logging Framework for better debugging
- spark.eventLog.rolling.enabled now enabled by default
- Enhanced Spark UI with flame graphs and thread dumps
- Prometheus metrics enabled by default
- RocksDB is now the default state store backend
- Improved shuffle service configuration
- CRC32C support for shuffle checksums
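A hedged configuration sketch tying several of the observability and state-store items above together; the keys are standard Spark configs, but their defaults vary by version and deployment, so treat the values as illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("observability-demo")
    # Rolling event logs keep long-running applications' logs bounded.
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.rolling.enabled", "true")
    # Structured (JSON) logging for easier ingestion into log pipelines.
    .config("spark.log.structuredLogging.enabled", "true")
    # CRC32C checksums for shuffle data integrity.
    .config("spark.shuffle.checksum.algorithm", "CRC32C")
    # RocksDB-backed state store for streaming state.
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)
```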
Apache Spark 4.1 (Released December 2025)
Spark Declarative Pipelines (SDP)
- Define datasets and queries; Spark handles execution graph, dependency ordering, parallelism, checkpoints, and retries
- Author pipelines in Python and SQL (see the sketch after this list)
- Compile and run via CLI (spark-pipelines)
- Multi-language support through Spark Connect
- Streaming Tables: Tables managed by streaming queries
- Materialized Views: Tables defined as query results
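A heavily hedged sketch of what authoring a declarative pipeline in Python could look like; the pyspark pipelines module path, decorator names, and table names below are assumptions based on the SDP bullets above, not a confirmed API.

```python
# Assumed module path and decorators; verify against the 4.1 SDP documentation.
from pyspark import pipelines as dp
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.getActiveSession()

@dp.materialized_view
def clean_orders() -> DataFrame:
    # Materialized view: a table defined entirely by this query's result.
    return spark.read.table("raw_orders").where("order_status IS NOT NULL")

@dp.table
def orders_per_region() -> DataFrame:
    # Streaming table: maintained incrementally by a streaming query; SDP
    # resolves the dependency on clean_orders, checkpoints, and retries.
    return (
        spark.readStream.table("clean_orders")
        .groupBy("region")
        .count()
    )
```

A file like this would then be compiled and executed by the spark-pipelines CLI mentioned above, with the execution graph and dependency ordering derived from the decorated definitions.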
Real-Time Mode (RTM) in Structured Streaming
- Stateless/single-stage streaming queries (Scala only)
- Kafka source support
- Kafka and Foreach sinks
- Update output mode
- Simple configuration change, no code rewrite needed (see the sketch after this list)
- Sub-second latency for stateful workloads
- Single-digit millisecond latency for stateless workloads
- Orders of magnitude improvement over micro-batch mode
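A sketch of the kind of stateless, single-stage Kafka-to-Kafka query that real-time mode targets, written with the standard Structured Streaming APIs; per the bullets above, switching it to real-time mode is a configuration toggle rather than a code rewrite (the exact setting is release-specific, so it is not shown here), and broker addresses and topics are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rtm-candidate-query").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events_in")
    .load()
)

# Stateless, single-stage transformation: clean each record independently.
cleaned = (
    events.select(F.col("key"), F.upper(F.col("value").cast("string")).alias("value"))
    .where(F.col("value").isNotNull())
)

query = (
    cleaned.writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "events_out")
    .option("checkpointLocation", "/tmp/chk/events_out")
    .outputMode("update")
    .start()
)
```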
Enhanced PySpark Capabilities
- @arrow_udf: Scalar functions accepting pyarrow.Array objects
- @arrow_udtf: Vectorized table functions processing pyarrow.RecordBatches
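A hedged sketch of an Arrow-native scalar UDF; the arrow_udf import path and decorator signature are assumed from the bullet above, and the function simply converts Celsius to Fahrenheit on whole pyarrow batches.

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession
from pyspark.sql.functions import arrow_udf, col  # assumed export location in 4.1

spark = SparkSession.builder.appName("arrow-udf-demo").getOrCreate()

@arrow_udf("double")
def celsius_to_fahrenheit(temps: pa.Array) -> pa.Array:
    # Operates on whole pyarrow.Array batches with no pandas round trip.
    return pc.add(pc.multiply(temps, 1.8), 32.0)

df = spark.createDataFrame([(0.0,), (37.0,), (100.0,)], ["celsius"])
df.select(celsius_to_fahrenheit(col("celsius")).alias("fahrenheit")).show()
```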
Spark Connect Maturity
- Generally available for Python clients
- Intelligent model caching and memory management
- Models cached in memory or spilled to disk based on size
- Enhanced stability for ML workloads
- Protobuf execution plans compressed with ZSTD, improving stability for large and complex logical plans
- Chunked Arrow result streaming for large result sets
- Removed 2GB limit for local relations
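A short sketch of a Python client session over Spark Connect; the sc:// URL points at a running Connect server (host and port here are placeholders), and large results stream back over Arrow, chunked in 4.1 per the bullets above.

```python
from pyspark.sql import SparkSession

# Connect to a remote (or locally started) Spark Connect server.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")
print(df.groupBy("bucket").count().orderBy("bucket").limit(3).collect())
```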
SQL Feature Maturity
- CONTINUE HANDLER for error recovery in SQL scripting (sketched after this list, together with multi-variable DECLARE)
- Multi-variable DECLARE syntax
- VARIANT read-path improvements:
  - 8x faster reads vs. standard VARIANT
  - 30x faster reads vs. JSON strings
  - Trade-off: 20-50% slower writes (optimized for read-heavy analytics)
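A hedged SQL scripting sketch combining CONTINUE HANDLER and multi-variable DECLARE; it assumes scripting is switched on via spark.sql.scripting.enabled and that the handler catches the division error and lets the script keep running, so verify the exact syntax and behavior against the 4.1 docs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-scripting-demo").getOrCreate()
spark.conf.set("spark.sql.scripting.enabled", "true")

spark.sql("""
BEGIN
  -- Multi-variable DECLARE: several variables in one statement.
  DECLARE processed, failed INT DEFAULT 0;

  -- CONTINUE HANDLER: record the error and keep executing the script.
  DECLARE CONTINUE HANDLER FOR SQLEXCEPTION
    SET failed = failed + 1;

  SELECT 1 / 0;                   -- raises DIVIDE_BY_ZERO under ANSI mode
  SET processed = processed + 1;  -- still runs because the handler continued

  SELECT processed, failed;
END
""").show()
```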
Course Outline
Module 1: Introduction to Hadoop and its Ecosystem
The objective of this module is to give an overview of the different components of the Hadoop ecosystem that make Hadoop so powerful. We will learn about HDFS and its components, MapReduce, YARN, Hive, Apache Pig, and Apache HBase and its components.
Module 2: What is Spark and its Architecture
This module introduces Apache Spark, an open-source project from the Apache Software Foundation that is mainly used for big data analytics. We will also see that the key difference between MapReduce and Spark is their approach to data processing.
Hands-on : Local Installation of Spark with Python
Hands-on : Databricks Setup
Module 3: Python – A Quick Review (Optional)
In this module, you will get a quick review of the Python language. We will not go into depth, but we will discuss some important components of the language.
Hands-on : Python Code Along
Hands-on : Python Review Exercise
Module 4: Spark DataFrame Basics
Data is a fundamental element in any machine learning workload, so in this module we will learn how to create and manage datasets using Spark DataFrames. We will also get to know RDDs, the fundamental data structure of Spark.
Hands-on : Working with DataFrame
Module 5: Introduction to Machine Learning with MLlib
The goal of this module is to help you get started with Apache Spark's ML library. Together we will explore how to solve various interesting machine learning use cases in a well-structured way. By the end, you will be able to use Spark ML with confidence and implement an organized, easy-to-maintain workflow for your future projects.
Hands-on : Consultancy Project on Linear Regression
Hands-on : Consultancy Project on Logistic Regression
Module 6: Spark Structured Streaming
In a world where we generate data at an extremely fast rate, analyzing it correctly and delivering useful, meaningful results at the right time can provide valuable solutions for many domains dealing with data products. Apache Spark is one of the frameworks that can handle big data in real time and perform many kinds of analysis. In this section, we are going to use Spark Structured Streaming to process high-velocity data at scale.
Hands-on : Spark Structured Streaming Code Along
Code Along - Databricks Notebook
- Python Refresh: Python Quick Review
- DataFrame Operations
- Dataset
