🚀 Spark + JupyterHub Installation Guide

Multi-User PySpark Training Environment
Version: 1.0 | Last Updated: January 2026 | Tested On: Ubuntu 24.04 LTS
Target Users: 5 concurrent users | Data Size: Up to 2GB per user

āš ļø CRITICAL REQUIREMENTS

Before you start, please note these important requirements:

  1. Java 17 is REQUIRED - Spark 4.1.1 does NOT work with Java 11
  2. Driver Memory: 512MB minimum - Lower values will cause failures
  3. 16GB RAM recommended for 5 concurrent users
  4. Ubuntu 24.04 hostname must be fixed (maps to 127.0.1.1 by default)
  5. All configurations use explicit IPs to avoid connection issues

This guide includes all tested fixes - follow every step exactly as written.

šŸ“ IMPORTANT: Replace 129.212.232.134 with your server's actual IP address throughout ALL commands in this guide.

Prerequisites


Server Specifications

Recommended Server Configuration:

  1. 8 CPU cores and 16 GB RAM (sized for 5 concurrent users, matching the worker settings used later in this guide)
  2. Enough disk space for the Spark and JupyterHub installations plus up to 2 GB of data per user

Minimum Requirements:

  1. Java 17, at least 512 MB of Spark driver memory, and a fixed hostname (see the critical requirements above)

Compatible Platforms:

  1. Ubuntu 24.04 LTS (the release this guide was tested on)

Initial Setup

Step 1 Connect to Your Server

# Replace 129.212.232.134 with your server's IP address
ssh root@129.212.232.134
Note: Throughout this entire guide, replace 129.212.232.134 with your actual server IP address.
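
If you prefer to paste the later commands verbatim and fix the IP afterwards, a one-pass substitution such as the sketch below can help. It assumes the default paths used in this guide, must be run only after those configuration files exist, and SERVER_IP is a placeholder you set to your real address.

# Hypothetical helper: swap the example IP in every file this guide creates
SERVER_IP=YOUR.SERVER.IP
grep -rl '129.212.232.134' \
  /etc/hosts /opt/spark/conf /opt/spark/sbin/start-*-fixed.sh \
  /opt/jupyterhub/etc/jupyterhub /etc/systemd/system/spark-*.service \
  /usr/local/share/jupyter/kernels/pyspark \
  | xargs -r sed -i "s/129\.212\.232\.134/${SERVER_IP}/g"
# If you edit the systemd units this way, run: systemctl daemon-reload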

Step 2 Update System

apt update && apt upgrade -y

Step 3 Install Essential Packages

apt install -y \
  python3 \
  python3-pip \
  python3-venv \
  nodejs \
  npm \
  curl \
  wget \
  htop \
  ufw \
  net-tools \
  nano \
  git

Step 4 Reboot Server

reboot

Wait 2 minutes, then reconnect:

ssh root@129.212.232.134

Install Java 17

Important: Spark 4.1.1 requires Java 17 (NOT Java 11)
# Install Java 17
apt install -y openjdk-17-jdk

# Verify installation
java -version

Expected output:

openjdk version "17.0.x"

Note: If you accidentally installed Java 11, the services will fail to start with a "class file version 61.0" error. Always use Java 17 for Spark 4.1.1.
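
Before moving on, it is worth confirming which Java the system will actually use; a minimal check, assuming the stock Ubuntu package layout:

# Both commands should point at Java 17
java -version 2>&1 | head -n 1
readlink -f "$(which java)"   # expect a path under /usr/lib/jvm/java-17-openjdk-amd64
# If another Java is selected, switch with: update-alternatives --config java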


Install Apache Spark 4.1.1

Step 1 Download Spark

cd /opt

wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz

Step 2 Extract and Setup

tar -xzf spark-4.1.1-bin-hadoop3.tgz

mv spark-4.1.1-bin-hadoop3 spark

rm spark-4.1.1-bin-hadoop3.tgz

chown -R root:root /opt/spark

Step 3 Verify Installation

ls -la /opt/spark/

/opt/spark/bin/spark-shell --version

✓ You should see Spark version 4.1.1
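
Optionally, run one of the bundled examples in local mode as a quick smoke test; this uses only what has been installed so far and does not require any of the cluster configuration that follows:

# Runs the SparkPi example locally; look for "Pi is roughly 3.14..." in the output
/opt/spark/bin/run-example SparkPi 10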


Configure Spark

Step 1 Fix Hostname Resolution

Critical: This prevents connection issues between master and worker
# Get hostname
HOSTNAME=$(hostname)
echo "Your hostname is: $HOSTNAME"

# Backup original hosts file
cp /etc/hosts /etc/hosts.backup

# Create proper hosts file
cat > /etc/hosts << EOF
127.0.0.1       localhost
129.212.232.134 $HOSTNAME

::1             localhost ip6-localhost ip6-loopback
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters
EOF

# Verify
cat /etc/hosts
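
A quick way to confirm the hostname no longer resolves to 127.0.1.1:

getent hosts "$(hostname)"   # should print 129.212.232.134 followed by your hostname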

Step 2 Create spark-env.sh

cat > /opt/spark/conf/spark-env.sh << 'EOF'
#!/usr/bin/env bash

# Critical: Replace 129.212.232.134 with actual IP
export SPARK_LOCAL_IP=129.212.232.134
export SPARK_MASTER_HOST=129.212.232.134
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080

# Worker settings
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=12g
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_WORKER_TIMEOUT=120

# Python settings
export PYSPARK_PYTHON=/opt/jupyterhub/bin/python3
export PYSPARK_DRIVER_PYTHON=/opt/jupyterhub/bin/python3

# Java settings
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
EOF

chmod +x /opt/spark/conf/spark-env.sh

Step 3 Create spark-defaults.conf

cat > /opt/spark/conf/spark-defaults.conf << 'EOF'
spark.master                                spark://129.212.232.134:7077
spark.driver.host                           129.212.232.134
spark.driver.bindAddress                    129.212.232.134
spark.driver.port                           7001
spark.driver.memory                         512m
spark.driver.memoryOverhead                 256m
spark.executor.memory                       768m
spark.executor.memoryOverhead               384m
spark.executor.cores                        1
spark.network.timeout                       300s
spark.rpc.askTimeout                        300s
spark.port.maxRetries                       100
spark.serializer                            org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions                4
spark.default.parallelism                   8
spark.sql.adaptive.enabled                  true
spark.cores.max                             8
spark.hadoop.fs.permissions.umask-mode      000
EOF
Note: The spark.hadoop.fs.permissions.umask-mode setting ensures users can delete files they create.
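
As a rough worst-case capacity check with these defaults: the worker can host at most 8 executors (8 cores at 1 core each), i.e. 8 x (768m + 384m) ≈ 9.2 GB, which fits inside the worker's 12 GB; each active notebook additionally runs a driver of 512m + 256m ≈ 0.75 GB on the host, so 5 simultaneous applications add roughly 3.8 GB more. In practice the training workload should rarely hit this ceiling all at once, which is why 16 GB of RAM is recommended.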

Step 4 Create Start Scripts

Master Start Script:

cat > /opt/spark/sbin/start-master-fixed.sh << 'EOF'
#!/usr/bin/env bash
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_LOCAL_IP=129.212.232.134
export SPARK_MASTER_HOST=129.212.232.134

$SPARK_HOME/sbin/start-master.sh \
    --host 129.212.232.134 \
    --port 7077 \
    --webui-port 8080
EOF

chmod +x /opt/spark/sbin/start-master-fixed.sh

Worker Start Script:

cat > /opt/spark/sbin/start-worker-fixed.sh << 'EOF'
#!/usr/bin/env bash
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_LOCAL_IP=129.212.232.134

$SPARK_HOME/sbin/start-worker.sh \
    --host 129.212.232.134 \
    --port 7078 \
    --webui-port 8081 \
    spark://129.212.232.134:7077
EOF

chmod +x /opt/spark/sbin/start-worker-fixed.sh

Install JupyterHub

Step 1 Create Virtual Environment

python3 -m venv /opt/jupyterhub

Step 2 Install JupyterHub Components

# Upgrade pip
/opt/jupyterhub/bin/pip install --upgrade pip wheel

# Install JupyterHub and JupyterLab
/opt/jupyterhub/bin/pip install jupyterhub jupyterlab notebook

# Install PySpark
/opt/jupyterhub/bin/pip install pyspark==4.1.1 py4j ipykernel

Step 3 Install Node.js Proxy

npm install -g configurable-http-proxy

Step 4 Verify Installations

/opt/jupyterhub/bin/jupyterhub --version
/opt/jupyterhub/bin/jupyter --version
which configurable-http-proxy
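
Optionally confirm that the PySpark package in the virtual environment matches the Spark build:

/opt/jupyterhub/bin/python3 -c "import pyspark; print(pyspark.__version__)"

Expected: 4.1.1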

Configure JupyterHub

Step 1 Create Configuration Directory

mkdir -p /opt/jupyterhub/etc/jupyterhub
cd /opt/jupyterhub/etc/jupyterhub

Step 2 Generate Config File

/opt/jupyterhub/bin/jupyterhub --generate-config

Step 3 Edit Configuration

cat > /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py << 'EOF'
# Basic settings
c.JupyterHub.bind_url = 'http://0.0.0.0:8000'
c.Spawner.default_url = '/lab'
c.Spawner.notebook_dir = '~'

# Environment variables for all users
c.Spawner.environment = {
    'SPARK_HOME': '/opt/spark',
    'PYTHONPATH': '/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.7-src.zip',
    'JAVA_HOME': '/usr/lib/jvm/java-17-openjdk-amd64',
    'SPARK_LOCAL_IP': '129.212.232.134',
    'PYSPARK_PYTHON': '/opt/jupyterhub/bin/python3',
    'PYSPARK_DRIVER_PYTHON': '/opt/jupyterhub/bin/python3'
}

# Resource limits per user
c.Spawner.mem_limit = '2G'
c.Spawner.cpu_limit = 2
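# Note: the default spawner treats these limits as advisory; a spawner with
# resource-limit support (e.g. SystemdSpawner) is needed to actually enforce them.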

# Allow all authenticated users to login
c.Authenticator.allow_all = True

# Preserve user environment variables (fixes file permission issues)
c.Spawner.env_keep = ['USER', 'PATH', 'HOME']
EOF

Create PySpark Kernel

mkdir -p /usr/local/share/jupyter/kernels/pyspark

cat > /usr/local/share/jupyter/kernels/pyspark/kernel.json << 'EOF'
{
  "display_name": "PySpark (Spark 4.1.1)",
  "language": "python",
  "argv": [
    "/opt/jupyterhub/bin/python3",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/opt/spark",
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.7-src.zip:/opt/jupyterhub/lib/python3.12/site-packages",
    "JAVA_HOME": "/usr/lib/jvm/java-17-openjdk-amd64",
    "SPARK_LOCAL_IP": "129.212.232.134",
    "PATH": "/opt/jupyterhub/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    "PYSPARK_PYTHON": "/opt/jupyterhub/bin/python3",
    "PYSPARK_DRIVER_PYTHON": "/opt/jupyterhub/bin/python3",
    "PYSPARK_SUBMIT_ARGS": "--master spark://129.212.232.134:7077 --conf spark.driver.host=129.212.232.134 --conf spark.driver.bindAddress=129.212.232.134 --executor-memory 768m --executor-cores 1 --driver-memory 512m --conf spark.executor.memoryOverhead=384m --conf spark.driver.memoryOverhead=256m pyspark-shell"
  }
}
EOF
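
To confirm the kernel is registered where JupyterLab will find it:

/opt/jupyterhub/bin/jupyter kernelspec list
# Expect a "pyspark" entry pointing at /usr/local/share/jupyter/kernels/pyspark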

Setup System Services

Step 1 Spark Master Service

cat > /etc/systemd/system/spark-master.service << 'EOF'
[Unit]
Description=Apache Spark Master
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=129.212.232.134"
Environment="SPARK_MASTER_HOST=129.212.232.134"

ExecStartPre=/bin/sleep 5
ExecStart=/opt/spark/sbin/start-master-fixed.sh
ExecStop=/opt/spark/sbin/stop-master.sh

Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

Step 2 Spark Worker Service

cat > /etc/systemd/system/spark-worker.service << 'EOF'
[Unit]
Description=Apache Spark Worker
After=spark-master.service network-online.target
Requires=spark-master.service
Wants=network-online.target

[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=129.212.232.134"

ExecStartPre=/bin/bash -c 'for i in {1..30}; do curl -s http://129.212.232.134:8080 && break || sleep 2; done'
ExecStart=/opt/spark/sbin/start-worker-fixed.sh
ExecStop=/opt/spark/sbin/stop-worker.sh

Restart=on-failure
RestartSec=15

[Install]
WantedBy=multi-user.target
EOF

Step 3 JupyterHub Service

cat > /etc/systemd/system/jupyterhub.service << 'EOF'
[Unit]
Description=JupyterHub
After=spark-worker.service network.target
Requires=spark-worker.service

[Service]
Type=simple
User=root
Environment="PATH=/opt/jupyterhub/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/opt/jupyterhub/bin/jupyterhub -f /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py
WorkingDirectory=/opt/jupyterhub/etc/jupyterhub

Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

Step 4 Reload Systemd

systemctl daemon-reload
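
Verify that systemd picked up all three units before starting anything:

systemctl list-unit-files 'spark-*' 'jupyterhub.service'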

Configure Firewall

# Enable firewall
ufw --force enable

# Allow SSH (IMPORTANT!)
ufw allow 22/tcp

# Spark ports
ufw allow 7077/tcp comment 'Spark Master'
ufw allow 7078/tcp comment 'Spark Worker'
ufw allow 8080/tcp comment 'Spark Master UI'
ufw allow 8081/tcp comment 'Spark Worker UI'
ufw allow 7001:7010/tcp comment 'Spark Driver'

# JupyterHub
ufw allow 8000/tcp comment 'JupyterHub'

# Spark Application UI
ufw allow 4040:4050/tcp comment 'Spark App UI'

# Reload firewall
ufw reload

# Verify firewall rules
ufw status numbered

Create User Accounts

# Create 5 training users
for i in {1..5}; do
  useradd -m -s /bin/bash user$i
  echo "user$i:Training@123" | chpasswd
  echo "Created user$i with password: Training@123"
done

# Verify users
ls -la /home/

User Credentials:

  1. Usernames: user1, user2, user3, user4, user5
  2. Password (all users): Training@123

Set Proper Permissions

# Fix home directory permissions for all users
# This prevents "Permission denied" errors when deleting Spark files
for i in {1..5}; do
  chown -R user$i:user$i /home/user$i
  chmod -R 755 /home/user$i
done

echo "āœ“ User permissions set correctly"

Install Common Python Packages

# Install data science and ML packages
/opt/jupyterhub/bin/pip install \
  pandas \
  numpy \
  matplotlib \
  seaborn \
  plotly \
  scikit-learn \
  scipy \
  statsmodels \
  requests \
  beautifulsoup4 \
  sqlalchemy \
  pymysql \
  psycopg2-binary

# Verify installations
/opt/jupyterhub/bin/pip list

Start Services

Step 1 Enable Services (Auto-start on boot)

systemctl enable spark-master
systemctl enable spark-worker
systemctl enable jupyterhub

Step 2 Start Spark Master

systemctl start spark-master
echo "Waiting for master to start..."
sleep 15

# Check status
systemctl status spark-master --no-pager

Expected: Active: active (running)

Step 3 Start Spark Worker

systemctl start spark-worker
echo "Waiting for worker to start..."
sleep 15

# Check status
systemctl status spark-worker --no-pager

Expected: Active: active (running)

Step 4 Start JupyterHub

systemctl start jupyterhub
echo "Waiting for JupyterHub to start..."
sleep 10

# Check status
systemctl status jupyterhub --no-pager

Expected: Active: active (running)


Verification

Check All Services

systemctl status spark-master spark-worker jupyterhub --no-pager

All should show: Active: active (running)

Check Processes

jps

Expected output:

12345 Master
12346 Worker
12347 Jps

Check Network Bindings

netstat -tuln | grep "7077\|7078\|8080\|8081\|8000"

Expected: Spark ports 7077, 7078, 8080, and 8081 bound to 129.212.232.134, and JupyterHub listening on 0.0.0.0:8000

Check Worker Registration

curl -s http://129.212.232.134:8080 | grep -i "Workers (1)"

Expected: Shows "Workers (1)"

Check Logs

# Master logs
ls /opt/spark/logs/*Master*.out
tail -50 /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-*.out

# Worker logs
ls /opt/spark/logs/*Worker*.out
tail -50 /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-*.out

Look for: "Successfully registered with master"


Testing

Test 1: Access Web Interfaces

Open in browser:

  1. Spark Master UI: http://129.212.232.134:8080
    Should show: 1 Worker, 8 cores, 12GB memory
  2. Spark Worker UI: http://129.212.232.134:8081
    Should show: Worker details
  3. JupyterHub: http://129.212.232.134:8000
    Should show: Login page

Test 2: Login to JupyterHub

  1. Go to: http://129.212.232.134:8000
  2. Login with: user1 / Training@123
  3. You should see JupyterLab interface

Test 3: Create and Run PySpark Notebook

  1. In JupyterLab, create a new notebook (File → New → Notebook, or use the Launcher)
  2. Select kernel: PySpark (Spark 4.1.1)
  3. In first cell, paste this code:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("InstallationTest") \
    .getOrCreate()

# Verify configuration
print(f"āœ“ Spark Version: {spark.version}")
print(f"āœ“ Master: {spark.sparkContext.master}")
print(f"āœ“ Application Name: {spark.sparkContext.appName}")

# Test data processing
print("\n=== Testing Data Processing ===")
df = spark.range(1, 1000000).toDF("id")
df = df.withColumn("value", df.id * 2)

print(f"āœ“ Total Rows: {df.count():,}")
print("\nāœ“ Sample Data:")
df.show(10)

print("\nāœ“ Statistics:")
df.describe().show()

# Stop Spark
spark.stop()
print("\nāœ“āœ“āœ“ INSTALLATION SUCCESSFUL! āœ“āœ“āœ“")
  4. Press Shift + Enter to run

Test 4: Check Spark Application UI

While the notebook is running:

  1. Open the Spark Application UI at http://129.212.232.134:4040 (ports 4040-4050 were opened in the firewall step)
  2. You should see the InstallationTest application with its jobs and stages
  3. The application also appears under Running Applications in the Master UI at http://129.212.232.134:8080

Expected Results

All tests should pass with:

  1. Spark version reported as 4.1.1 and master as spark://129.212.232.134:7077
  2. A total row count of 999,999 and the sample and statistics tables printed without errors
  3. The application visible in the Spark Master and Application UIs while it runs

Monitoring

Check Cluster Health Script

cat > /root/check-spark.sh << 'EOF'
#!/bin/bash
echo "=== SPARK CLUSTER STATUS ==="
echo ""
echo "Master: $(systemctl is-active spark-master)"
echo "Worker: $(systemctl is-active spark-worker)"
echo "JupyterHub: $(systemctl is-active jupyterhub)"
echo ""
echo "=== REGISTERED WORKERS ==="
curl -s http://129.212.232.134:8080 | grep -c "worker-" | xargs echo "Workers:"
echo ""
echo "=== ACTIVE APPLICATIONS ==="
curl -s http://129.212.232.134:8080 | grep -c "app-" | xargs echo "Apps:"
echo ""
echo "=== RESOURCE USAGE ==="
free -h | grep Mem
echo ""
df -h | grep -E "Filesystem|/$"
EOF

chmod +x /root/check-spark.sh

Run health check:

/root/check-spark.sh
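
On a healthy single-node setup the output should look roughly like this (illustrative only; counts and the resource figures at the end will differ):

=== SPARK CLUSTER STATUS ===

Master: active
Worker: active
JupyterHub: active

=== REGISTERED WORKERS ===
Workers: 1

=== ACTIVE APPLICATIONS ===
Apps: 0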

Monitor Logs

# Real-time monitoring
journalctl -u spark-master -f
journalctl -u spark-worker -f
journalctl -u jupyterhub -f

# System resources
htop

Maintenance

Adding More Users

# Add new user
useradd -m -s /bin/bash user6
echo "user6:Training@123" | chpasswd

Installing Additional Packages

# Install for all users
/opt/jupyterhub/bin/pip install package-name

# Restart JupyterHub
systemctl restart jupyterhub

Updating Spark Configuration

# Edit configuration
nano /opt/spark/conf/spark-defaults.conf

# Restart Spark services
systemctl restart spark-master spark-worker

Backup Configuration

# Create backup
tar -czf spark-jupyterhub-backup-$(date +%Y%m%d).tar.gz \
  /opt/spark/conf \
  /opt/jupyterhub/etc/jupyterhub \
  /etc/systemd/system/spark-*.service \
  /etc/systemd/system/jupyterhub.service \
  /usr/local/share/jupyter/kernels/pyspark

# Verify backup
ls -lh spark-jupyterhub-backup-*.tar.gz
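
To restore on a rebuilt server, a sketch (GNU tar stores these paths relative to /, so extract from the filesystem root and reload systemd afterwards):

tar -xzf spark-jupyterhub-backup-YYYYMMDD.tar.gz -C /
systemctl daemon-reload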

Checking Resource Usage

# Memory usage
free -h

# Disk usage
df -h

# CPU usage
htop

# Active Spark applications
curl -s http://129.212.232.134:8080

Restarting All Services

systemctl restart spark-master && \
  sleep 10 && \
  systemctl restart spark-worker && \
  sleep 10 && \
  systemctl restart jupyterhub

Scaling: Adding More Worker Nodes

When to Add Workers

Add more worker nodes when:

  1. The single worker's 8 cores or 12 GB of memory are consistently fully allocated (check the Master UI)
  2. You need to support more concurrent users or larger datasets than the current server handles comfortably
  3. free -h shows little memory headroom left for JupyterLab sessions and the OS

Setup New Worker Node

On Each New Worker Server:

# 1. Install Java 17
apt update
apt install -y openjdk-17-jdk

# 2. Download and extract Spark
cd /opt
wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
tar -xzf spark-4.1.1-bin-hadoop3.tgz
mv spark-4.1.1-bin-hadoop3 spark
rm spark-4.1.1-bin-hadoop3.tgz

# 3. Fix hostname resolution
HOSTNAME=$(hostname)
WORKER_IP=$(hostname -I | awk '{print $1}')
cat > /etc/hosts << EOF
127.0.0.1       localhost
$WORKER_IP      $HOSTNAME
EOF

# 4. Create spark-env.sh
cat > /opt/spark/conf/spark-env.sh << EOF
export SPARK_LOCAL_IP=$WORKER_IP
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=12g
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
EOF

chmod +x /opt/spark/conf/spark-env.sh

# 5. Create worker systemd service
# Replace 129.212.232.134 with your MASTER IP
cat > /etc/systemd/system/spark-worker.service << EOF
[Unit]
Description=Apache Spark Worker
After=network-online.target

[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=$WORKER_IP"

ExecStart=/opt/spark/sbin/start-worker.sh spark://129.212.232.134:7077
ExecStop=/opt/spark/sbin/stop-worker.sh

Restart=on-failure
RestartSec=15

[Install]
WantedBy=multi-user.target
EOF

# 6. Configure firewall
ufw allow 22/tcp
ufw allow 7078/tcp
ufw allow 8081/tcp
ufw --force enable

# 7. Start worker
systemctl daemon-reload
systemctl enable spark-worker
systemctl start spark-worker

Verify Connection

On Master Server:

# Check Master UI
curl http://129.212.232.134:8080 | grep -i "worker"

# Or open in browser
# http://129.212.232.134:8080

Expected: Shows all workers listed with their IPs

On Worker Server:

# Check worker status
systemctl status spark-worker

# Check worker logs
tail -50 /opt/spark/logs/spark-*-Worker-*.out | grep "Successfully registered"

Expected: "Successfully registered with master"

Quick Health Check

# On master, count workers
curl -s http://129.212.232.134:8080 | grep -c "worker-"

Quick Reference

Service Commands

# Check status
systemctl status spark-master spark-worker jupyterhub

# Start services
systemctl start spark-master
systemctl start spark-worker
systemctl start jupyterhub

# Stop services
systemctl stop spark-master spark-worker jupyterhub

# Restart services
systemctl restart spark-master spark-worker jupyterhub

# View logs
journalctl -u spark-master -f
journalctl -u spark-worker -f
journalctl -u jupyterhub -f

Access URLs

  1. Spark Master UI: http://129.212.232.134:8080
  2. Spark Worker UI: http://129.212.232.134:8081
  3. JupyterHub: http://129.212.232.134:8000
  4. Spark Application UI (while an application is running): http://129.212.232.134:4040

Default Credentials

  1. Users user1 through user5, password Training@123

Important Paths

  1. Spark installation: /opt/spark (configuration in /opt/spark/conf, logs in /opt/spark/logs)
  2. JupyterHub environment and config: /opt/jupyterhub and /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py
  3. PySpark kernel: /usr/local/share/jupyter/kernels/pyspark/kernel.json
  4. systemd units: /etc/systemd/system/spark-master.service, spark-worker.service, jupyterhub.service

Reset User Password

# Reset password for a user
echo "user5:NewPassword123" | chpasswd

# Or interactive way
passwd user5