What is Apache Spark?

Apache Spark is a powerful open-source distributed computing framework designed for big data processing and analytics. It processes large datasets in parallel across clusters of machines and is known for its speed: it is often cited as up to 100x faster than traditional Hadoop MapReduce for in-memory workloads, because it keeps intermediate data in memory rather than writing it to disk between operations.

 

Core Architecture

Spark supports multiple programming languages including Scala, Python (PySpark), Java, R, and SQL. The framework includes several specialized components:
  • Spark SQL: Structured data queries with DataFrame APIs
  • MLlib: Machine learning algorithms and utilities
  • GraphX: Graph processing capabilities
  • Structured Streaming: Real-time data processing
  • Spark Connect: Client-server architecture for remote connectivity (introduced in Spark 3.4, enhanced in 4.0+)
In Databricks environments, for example, Spark SQL integrates with Unity Catalog for federated queries across cloud data sources, while Structured Streaming supports real-time monitoring applications.

 

Apache Spark 4.0 (Released May 2025)

Apache Spark 4.0 marks a significant milestone as the inaugural release in the 4.x series, resolving over 5,100 tickets with contributions from more than 390 individuals.

 

Major Platform Changes

Java and Scala Upgrades:
  • Dropped Java 8/11 support; JDK 17 is now the default (JDK 21 also supported)
  • Dropped Scala 2.12; Scala 2.13 is now the default
  • Python 3.9+ required (Python 3.8 dropped)
Build and Runtime:
  • Pre-built distributions ship with Scala 2.13
  • Hadoop upgraded from 3.3.4 to 3.4.1
  • Kubernetes operator support enhanced

Core SQL Features

ANSI SQL Mode by Default: One of the most significant shifts in Spark 4.0 is enabling ANSI SQL mode by default, aligning Spark more closely with standard SQL semantics. This ensures stricter data handling by providing explicit error messages for operations that previously resulted in silent truncations or nulls, such as numeric overflows or division by zero.
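As a quick illustration (a minimal PySpark sketch; the values are illustrative), an expression that previously returned NULL now fails loudly under ANSI mode unless an explicit try_* function is used:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Under ANSI mode (the 4.0 default), division by zero raises a DIVIDE_BY_ZERO
# error instead of silently returning NULL as under the pre-4.0 defaults.
try:
    spark.sql("SELECT 10 / 0 AS result").show()
except Exception as err:
    print("ANSI mode error:", err)

# The try_* functions keep the old NULL-on-error behavior explicitly.
spark.sql("SELECT try_divide(10, 0) AS result").show()

# The legacy behavior can still be restored per session if a workload needs it.
spark.conf.set("spark.sql.ansi.enabled", "false")
```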
VARIANT Data Type: Apache Spark 4.0 introduces the new VARIANT data type designed specifically for semi-structured data, enabling the storage of complex JSON or map-like structures within a single column while maintaining the ability to efficiently query nested fields.
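For example (a small sketch; the JSON payload and field names are made up), semi-structured records can be parsed into a VARIANT column and then queried with path expressions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# parse_json() produces a VARIANT value; variant_get() extracts typed fields.
spark.sql("""
    SELECT parse_json('{"device": "sensor-7", "reading": {"temp": 21.5}}') AS payload
""").createOrReplaceTempView("events")

spark.sql("""
    SELECT
        variant_get(payload, '$.device', 'string')       AS device,
        variant_get(payload, '$.reading.temp', 'double') AS temp
    FROM events
""").show()
```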
Session Variables and SQL UDFs (see the sketch after this list):
  • Session variables for procedural SQL programming
  • SQL user-defined functions for reusable query logic
  • Pipe syntax for streamlined query chaining
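A minimal sketch of the features listed above (table, variable, and function names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Session variable: declared once, then reusable across statements in the session.
spark.sql("DECLARE VARIABLE min_amount DOUBLE DEFAULT 100.0")
spark.sql("SET VARIABLE min_amount = 250.0")

# SQL UDF: reusable query logic defined entirely in SQL.
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION with_tax(amount DOUBLE)
    RETURNS DOUBLE
    RETURN amount * 1.08
""")

spark.range(5).selectExpr("id * 100.0 AS amount").createOrReplaceTempView("orders")

spark.sql("""
    SELECT amount, with_tax(amount) AS total
    FROM orders
    WHERE amount >= min_amount
""").show()

# Pipe syntax chains relational operators left to right.
spark.sql("FROM orders |> WHERE amount >= min_amount |> SELECT amount, with_tax(amount) AS total").show()
```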
String Collation Support: Advanced string comparison capabilities for multilingual data handling, important for international applications.
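For instance (a short sketch; collation names should be checked against the 4.0 documentation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# UTF8_LCASE compares case-insensitively; the default UTF8_BINARY compares byte-for-byte.
spark.sql("SELECT 'Apache SPARK' COLLATE UTF8_LCASE = 'apache spark' AS case_insensitive_match").show()
```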

 

Streaming Enhancements

Arbitrary Stateful Processing v2 (transformWithState): Spark 4.0 introduces a new arbitrary stateful processing operator called transformWithState, providing greater control and easier debugging, with support for multiple state variables (column families) per key. This is particularly useful for applications with complex state management requirements.
State Data Source: New capability to query and debug streaming state for easier troubleshooting of stateful operations.
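As a small sketch of the state data source (the format name follows the 4.0 state data source reader; the checkpoint path is illustrative and must point at an existing streaming checkpoint):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the state store contents of a streaming query's checkpoint directly as a
# DataFrame for inspection and debugging.
state = (spark.read
         .format("statestore")
         .load("/tmp/checkpoints/orders_agg"))
state.show(truncate=False)
```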

 

PySpark Improvements

Python Data Source API: Allows developers to create custom data sources entirely in Python, enabling integration of new data formats and systems without Java/Scala development.
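A minimal sketch of a batch data source implemented purely in Python (the source name, schema, and generated rows are made up; the DataSource and DataSourceReader base classes come from pyspark.sql.datasource):

```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceReader

class FakeNumbersReader(DataSourceReader):
    def read(self, partition):
        # Yield plain tuples that match the schema declared by the data source.
        for i in range(5):
            yield (i, i * i)

class FakeNumbersSource(DataSource):
    @classmethod
    def name(cls):
        return "fake_numbers"

    def schema(self):
        return "n INT, n_squared INT"

    def reader(self, schema):
        return FakeNumbersReader()

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(FakeNumbersSource)
spark.read.format("fake_numbers").load().show()
```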
Python UDTFs (User-Defined Table Functions): You can create a Python class as a UDTF using a decorator that yields an iterator of output rows. A powerful aspect is dynamic schema UDTFs – your UDTF can define an analyze() method to produce a schema on the fly based on parameters.
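A short sketch of the simple, fixed-schema form (a dynamic-schema UDTF would instead define a static analyze() method that computes the schema from its arguments):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udtf

@udtf(returnType="n: int, n_squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        # Yield one output row (a tuple) per value in the requested range.
        for n in range(start, end + 1):
            yield (n, n * n)

spark = SparkSession.builder.getOrCreate()
SquareNumbers(lit(1), lit(3)).show()
```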
Unified Profiling: Comprehensive profiling support for PySpark UDFs to identify performance bottlenecks.
Native Plotting API: Built-in visualization capabilities directly in PySpark DataFrames.
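For example (a hedged sketch assuming the plotly-backed plotting accessor; plotly must be installed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10).selectExpr("id AS x", "id * id AS y")
fig = df.plot.line(x="x", y="y")   # returns a plotly figure built from the DataFrame
fig.show()
```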

 

Spark Connect Enhancements

Lightweight Python Client: A new lightweight Python client (pyspark-client) at just 1.5 MB, compared to 355MB for full PySpark.
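A short sketch of connecting through the slim client (the endpoint URL is illustrative; 15002 is the default Spark Connect port):

```python
# Install with: pip install pyspark-client   (instead of the full pyspark wheel)
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()
spark.range(3).show()
```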
New Features:
  • Full API compatibility for Java client
  • ML on Spark Connect support
  • Swift client implementation
  • spark.api.mode configuration for easy toggling

Data Source and Format Updates

Built-in XML Support: Native XML data source without external libraries.
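For example (a brief sketch; the file path is illustrative and option names follow the former spark-xml package):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

books = (spark.read
         .format("xml")
         .option("rowTag", "book")   # each <book> element becomes one row
         .load("/data/books.xml"))
books.printSchema()
```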
Parquet & ORC Improvements:
  • ORC now uses ZSTD compression by default
  • Enhanced columnar processing
Avro & Protobuf: Expanded schema evolution capabilities.

Performance and Observability

Query Optimization:
  • Enhanced Catalyst optimizer
  • Better AQE (Adaptive Query Execution) handling
  • Improved predicate pushdown
Monitoring:
  • Structured Logging Framework for better debugging
  • spark.eventLog.rolling.enabled now default
  • Enhanced Spark UI with flame graphs and thread dumps
  • Prometheus metrics enabled by default
Infrastructure (a sample configuration follows this list):
  • RocksDB is now the default state store backend
  • Improved shuffle service configuration
  • CRC32C support for shuffle checksums
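A hedged sketch of enabling some of these settings when building a session (the structured-logging flag name is an assumption and should be checked against the 4.0 configuration reference):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.rolling.enabled", "true")        # rolling event logs
         .config("spark.log.structuredLogging.enabled", "true")   # JSON structured logs (name assumed)
         .config("spark.sql.streaming.stateStore.providerClass",
                 "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
         .getOrCreate())
```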

Apache Spark 4.1 (Released December 2025)

Apache Spark 4.1.1 was released on January 9, 2026, following 4.1.0’s release on December 16, 2025. This release focuses on higher-level abstractions, ultra-low latency streaming, and enhanced Python capabilities.

 

Spark Declarative Pipelines (SDP)

Spark Declarative Pipelines (SDP) is a new component in Apache Spark 4.1, designed to let developers focus on data transformations rather than managing explicit dependencies and pipeline execution. With the declarative approach, developers define the desired end-state tables and how data flows between them.
Key Features:
  • Define datasets and queries; Spark handles execution graph, dependency ordering, parallelism, checkpoints, and retries
  • Author pipelines in Python and SQL
  • Compile and run via CLI (spark-pipelines)
  • Multi-language support through Spark Connect
Abstractions:
  • Streaming Tables: Tables managed by streaming queries
  • Materialized Views: Tables defined as query results
This shifts focus from “how-to” (imperative steps) to “what-to” (desired outcome), similar to how Spark moved from RDDs to DataFrames.
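A rough sketch of what a pipeline definition file might look like in Python. The import path and decorator names below (pyspark.pipelines, materialized_view, table) are assumptions based on the SDP design and should be checked against the 4.1 documentation; table names are made up:

```python
# Hypothetical pipeline definition (e.g. transformations.py), compiled and run
# with the spark-pipelines CLI mentioned above.
from pyspark import pipelines as dp            # module path assumed
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.getActiveSession()        # session provided by the pipeline runtime (assumption)

@dp.materialized_view                          # decorator name assumed
def clean_orders() -> DataFrame:
    # SDP infers the dependency on the raw source table and manages refreshes.
    return spark.read.table("raw_orders").where(col("amount") > 0)

@dp.table                                      # streaming table, decorator name assumed
def order_totals() -> DataFrame:
    return (spark.readStream.table("clean_orders")
            .groupBy("customer_id")
            .sum("amount"))
```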
 

Real-Time Mode (RTM) in Structured Streaming

Apache Spark 4.1 marks a major milestone for low-latency streaming with the first official support for Real-Time Mode in Structured Streaming, which offers continuous, low-latency processing with p99 latencies in the single-digit-millisecond range.
Current Support (Spark 4.1):
  • Stateless/single-stage streaming queries (Scala only)
  • Kafka source support
  • Kafka and Foreach sinks
  • Update output mode
  • Simple configuration change—no code rewrite needed
Performance Benefits:
  • Sub-second latency for stateful workloads
  • Single-digit millisecond latency for stateless workloads
  • Orders of magnitude improvement over micro-batch mode
This is highly relevant for monitoring and observability workloads, where end-to-end latency matters.

 

Enhanced PySpark Capabilities

Arrow-Native UDFs and UDTFs: Spark 4.1 introduces two new decorators that allow developers to bypass Pandas conversion overhead and work directly with PyArrow arrays and batches.
  • @arrow_udf: Scalar functions accepting pyarrow.Array objects
  • @arrow_udtf: Vectorized table functions processing pyarrow.RecordBatches
These eliminate serialization overhead, which is especially valuable for high-volume data processing.
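A hedged sketch of an Arrow-native scalar UDF (the import location of arrow_udf is an assumption and should be checked against the 4.1 API reference):

```python
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql import SparkSession
from pyspark.sql.functions import arrow_udf, col   # import path assumed

@arrow_udf("double")
def fahrenheit(celsius: pa.Array) -> pa.Array:
    # Operates on whole Arrow arrays; no pandas conversion in between.
    return pc.add(pc.multiply(celsius, 1.8), 32.0)

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).selectExpr("CAST(id AS DOUBLE) AS celsius")
df.select(fahrenheit(col("celsius")).alias("fahrenheit")).show()
```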
Python Worker Logging: By enabling spark.sql.pyspark.worker.logging.enabled, you can use the standard Python logging module inside your UDFs. Spark captures these logs and exposes them per session via a new Table-Valued Function: python_worker_logs().
This addresses a long-standing pain point in debugging PySpark UDFs.
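A small sketch based on the description above (the configuration key and the python_worker_logs() table-valued function are as named above; the UDF itself is made up):

```python
import logging
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.pyspark.worker.logging.enabled", "true")

@udf("double")
def safe_ratio(a: float, b: float) -> float:
    if b == 0:
        # Standard Python logging inside the UDF; captured by Spark per session.
        logging.getLogger("ratio_udf").warning("division by zero, returning 0.0")
        return 0.0
    return a / b

spark.createDataFrame([(1.0, 2.0), (3.0, 0.0)], "a double, b double") \
    .select(safe_ratio(col("a"), col("b")).alias("ratio")).show()

# Query the captured worker logs for this session.
spark.sql("SELECT * FROM python_worker_logs()").show(truncate=False)
```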
Python Data Source Filter Pushdown: Implement pushFilters method in DataSourceReader to handle filter conditions at the source level, reducing data transfer.
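A hedged fragment showing only the reader-side hook (the base class is pyspark.sql.datasource.DataSourceReader; the _can_handle() helper and the rows are hypothetical):

```python
from pyspark.sql.datasource import DataSourceReader

class LogReader(DataSourceReader):
    def __init__(self):
        self.pushed = []

    def _can_handle(self, f) -> bool:
        # Hypothetical check: accept only filter types this source can apply itself.
        return type(f).__name__ == "EqualTo"

    def pushFilters(self, filters):
        # Keep the filters the source can evaluate; return the rest so Spark
        # still applies them after the scan.
        remaining = []
        for f in filters:
            (self.pushed if self._can_handle(f) else remaining).append(f)
        return remaining

    def read(self, partition):
        # Apply self.pushed while producing rows (omitted in this sketch).
        yield from []
```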

 

Spark Connect Maturity

Spark ML on Connect (GA):
  • Generally available for Python clients
  • Intelligent model caching and memory management
  • Models cached in memory or spilled to disk based on size
  • Enhanced stability for ML workloads
Scalability Improvements:
  • Protobuf execution plans are now compressed using zstd, improving stability when handling large and complex logical plans
  • Chunked Arrow result streaming for large result sets
  • Removed 2GB limit for local relations

SQL Feature Maturity

SQL Scripting (GA): After its preview in 4.0, SQL Scripting is now Generally Available (GA) and enabled by default, transforming Spark SQL into a robust programmable environment with loops, conditionals, and complex control flow.
New in 4.1 (see the sketch after this list):
  • CONTINUE HANDLER for error recovery
  • Multi-variable DECLARE syntax
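A minimal SQL scripting sketch that also uses the multi-variable DECLARE (variable names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
BEGIN
  DECLARE i, total INT DEFAULT 0;      -- multi-variable DECLARE (new in 4.1)
  SET i = 1;
  WHILE i <= 5 DO
    SET total = total + i;
    SET i = i + 1;
  END WHILE;
  SELECT total AS sum_1_to_5;
END
""").show()
```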
VARIANT Shredding: A major performance enhancement in Spark 4.1 is shredding. This feature automatically extracts commonly occurring fields within a variant column and stores them as separate, typed Parquet fields.
Performance Impact:
  • 8x faster reads vs. standard VARIANT
  • 30x faster reads vs. JSON strings
  • Trade-off: 20-50% slower writes (optimized for read-heavy analytics)
Recursive CTEs: Spark 4.1 adds standard SQL syntax for Recursive Common Table Expressions, allowing you to traverse hierarchical data structures—such as org charts or graph topologies—purely within SQL.
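For example (table and column names are made up), walking an org chart from the top-level manager down through all reports:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame(
    [(1, "Ada", None), (2, "Grace", 1), (3, "Linus", 1), (4, "Guido", 2)],
    "id INT, name STRING, manager_id INT",
).createOrReplaceTempView("employees")

spark.sql("""
WITH RECURSIVE org AS (
    SELECT id, name, manager_id, 0 AS depth
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id, org.depth + 1
    FROM employees e
    JOIN org ON e.manager_id = org.id
)
SELECT * FROM org ORDER BY depth, id
""").show()
```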
Approximate Data Sketches: Native support for KLL (quantiles) and Theta sketches for efficient approximate set operations on massive datasets.

Module 1: Introduction to Hadoop and its Ecosystem

The objective of this Apache Hadoop ecosystem tutorial is to give an overview of the different components of the Hadoop ecosystem that make Hadoop so powerful. We will also learn about Hadoop ecosystem components such as HDFS and its components, MapReduce, YARN, Hive, Apache Pig, and Apache HBase and its components.

 

Module 2: What is Spark and its Architecture

This module introduces Apache Spark, an open-source project from the Apache Software Foundation that is mainly used for big data analytics. We will also look at the key difference between MapReduce and Spark: their approach to data processing.

Hands-on : Local Installation of Spark with Python 
Hands-on : Databricks Setup

 

Module 3: Python – A Quick Review (Optional)

In this module, you will get a quick review of the Python language. We will not go into depth, but we will discuss some of its important components.

Hands-on : Python Code Along
Hands-on : Python Review Exercise

 

Module 4: Spark DataFrame Basics

Data is a fundamental element in any machine learning workload, so in this module we will learn how to create and manage datasets using Spark DataFrames. We will also cover the RDD, which is the fundamental data structure of Spark.

Hands-on : Working with DataFrame

 

Module 5: Introduction to Machine Learning with MLlib

The goal of this series is to help you get started with Apache Spark’s ML library. Together we will explore how to solve various interesting machine learning use cases in a well-structured way. By the end, you will be able to use Spark ML with high confidence and implement an organized, easy-to-maintain workflow for your future projects.

Hands-on : Consultancy Project on Linear Regression
Hands-on : Consultancy Project on Logistic Regression

 

Module 6: Spark Structured Streaming 

In a world where we generate data at an extremely fast rate, analyzing that data correctly and delivering useful, meaningful results at the right time can provide valuable solutions for many data-driven domains. Apache Spark is one of the frameworks that can handle big data in real time and perform many kinds of analysis. In this section, we are going to use Spark Structured Streaming to process high-velocity data at scale.

Hands-on : Spark Structured Streaming Code Along

To do the labs, we’ll use Databricks Free Edition.

  • Click here to sign up.

To create a temporary email you can use this site!

All Exercise Files are here

  • Please click here 
  • The class ID is 41587