⚠️ CRITICAL REQUIREMENTS
Before you start, please note these important requirements:
- Java 17 is REQUIRED - Spark 4.1.1 does NOT work with Java 11
- Driver Memory: 512MB minimum - Lower values will cause failures
- 16GB RAM recommended for 5 concurrent users
- Ubuntu 24.04 hostname must be fixed (maps to 127.0.1.1 by default)
- All configurations use explicit IPs to avoid connection issues
This guide includes all tested fixes - follow every step exactly as written.
Replace 129.212.232.134 with your server's actual IP address throughout ALL commands in this guide.
📋 TABLE OF CONTENTS
- Prerequisites
- Server Specifications
- Initial Setup
- Install Java 17
- Install Apache Spark 4.1.1
- Configure Spark
- Install JupyterHub
- Configure JupyterHub
- Create PySpark Kernel
- Setup System Services
- Configure Firewall
- Create User Accounts
- Install Common Python Packages
- Start Services
- Verification
- Testing
- Monitoring
- Maintenance
- Scaling: Adding More Worker Nodes
- Quick Reference
Prerequisites
- Ubuntu 24.04 LTS server (VPS, Cloud VM, or On-Premises)
- Root access or sudo privileges
- SSH access to the server
- Basic Linux command line knowledge
- Your server's public IP address
Server Specifications
Recommended Server Configuration:
- RAM: 16GB
- CPU: 8 vCPUs (or 8 cores)
- Storage: 320GB SSD
- OS: Ubuntu 24.04 LTS x64
- Network: Public IP address assigned
Minimum Requirements:
- RAM: 8GB (supports 2-3 concurrent users)
- CPU: 4 vCPUs/cores
- Storage: 160GB SSD
Compatible Platforms:
- DigitalOcean Droplets
- AWS EC2 Instances
- Azure Virtual Machines
- Google Cloud Compute Engine
- On-Premises Physical Servers
- On-Premises Virtual Machines (VMware, Hyper-V, KVM)
Initial Setup
Step 1 Connect to Your Server
# Replace 129.212.232.134 with your server's IP address
ssh root@129.212.232.134
Replace 129.212.232.134 with your actual server IP address.
Step 2 Update System
apt update && apt upgrade -y
Step 3 Install Essential Packages
apt install -y \
python3 \
python3-pip \
python3-venv \
nodejs \
npm \
curl \
wget \
htop \
ufw \
net-tools \
nano \
git
Step 4 Reboot Server
reboot
Wait 2 minutes, then reconnect:
ssh root@129.212.232.134
Install Java 17
# Install Java 17
apt install -y openjdk-17-jdk
# Verify installation
java -version
Expected output:
openjdk version "17.0.x"
Note: If you accidentally install Java 11, the services will fail to start with a "class file version 61.0" error. Always use Java 17 for Spark 4.1.1.
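If more than one JDK ends up installed, a quick way to check which one is active and switch back to Java 17 (paths assume the default Ubuntu openjdk-17 package):
# Show which java binary is currently in use
readlink -f "$(which java)"
# If it does not point at java-17-openjdk-amd64, pick Java 17 interactively
update-alternatives --config java
# Confirm
java -version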
Install Apache Spark 4.1.1
Step 1 Download Spark
cd /opt
wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
Step 2 Extract and Setup
tar -xzf spark-4.1.1-bin-hadoop3.tgz
mv spark-4.1.1-bin-hadoop3 spark
rm spark-4.1.1-bin-hadoop3.tgz
chown -R root:root /opt/spark
Step 3 Verify Installation
ls -la /opt/spark/
/opt/spark/bin/spark-shell --version
✅ You should see Spark version 4.1.1.
Configure Spark
Step 1 Fix Hostname Resolution
# Get hostname
HOSTNAME=$(hostname)
echo "Your hostname is: $HOSTNAME"
# Backup original hosts file
cp /etc/hosts /etc/hosts.backup
# Create proper hosts file
cat > /etc/hosts << EOF
127.0.0.1 localhost
129.212.232.134 $HOSTNAME
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
EOF
# Verify
cat /etc/hosts
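As a quick sanity check, confirm the hostname now resolves to the server's public IP instead of the default 127.0.1.1 (commands assume the hosts file written above):
# Should print your public IP followed by the hostname
getent hosts "$(hostname)"
# Should list the same IP among the interface addresses
hostname -I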
Step 2 Create spark-env.sh
cat > /opt/spark/conf/spark-env.sh << 'EOF'
#!/usr/bin/env bash
# Critical: Replace 129.212.232.134 with actual IP
export SPARK_LOCAL_IP=129.212.232.134
export SPARK_MASTER_HOST=129.212.232.134
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
# Worker settings
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=12g
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_WORKER_TIMEOUT=120
# Python settings
export PYSPARK_PYTHON=/opt/jupyterhub/bin/python3
export PYSPARK_DRIVER_PYTHON=/opt/jupyterhub/bin/python3
# Java settings
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
EOF
chmod +x /opt/spark/conf/spark-env.sh
Step 3 Create spark-defaults.conf
cat > /opt/spark/conf/spark-defaults.conf << 'EOF'
spark.master spark://129.212.232.134:7077
spark.driver.host 129.212.232.134
spark.driver.bindAddress 129.212.232.134
spark.driver.port 7001
spark.driver.memory 512m
spark.driver.memoryOverhead 256m
spark.executor.memory 768m
spark.executor.memoryOverhead 384m
spark.executor.cores 1
spark.network.timeout 300s
spark.rpc.askTimeout 300s
spark.port.maxRetries 100
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 4
spark.default.parallelism 8
spark.sql.adaptive.enabled true
spark.cores.max 8
spark.hadoop.fs.permissions.umask-mode 000
EOF
The spark.hadoop.fs.permissions.umask-mode setting ensures users can delete files they create.
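For capacity planning, it helps to see how these values add up per notebook session. A rough sketch of the arithmetic, assuming one driver and one or more single-core executors per user as configured above:
# Per-session footprint with the values above:
#   driver   = 512m + 256m overhead  = 768m  (runs on this server, outside the worker)
#   executor = 768m + 384m overhead  = 1152m (allocated from SPARK_WORKER_MEMORY=12g)
# spark.cores.max=8 with spark.executor.cores=1 lets one busy application claim up
# to 8 one-core executors (8 x 1152m = 9216m), so consider lowering spark.cores.max
# if all 5 users are expected to run jobs at the same time.
echo "worst-case executor memory per app: $(( 8 * 1152 ))m"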
Step 4 Create Start Scripts
Master Start Script:
cat > /opt/spark/sbin/start-master-fixed.sh << 'EOF'
#!/usr/bin/env bash
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_LOCAL_IP=129.212.232.134
export SPARK_MASTER_HOST=129.212.232.134
$SPARK_HOME/sbin/start-master.sh \
--host 129.212.232.134 \
--port 7077 \
--webui-port 8080
EOF
chmod +x /opt/spark/sbin/start-master-fixed.sh
Worker Start Script:
cat > /opt/spark/sbin/start-worker-fixed.sh << 'EOF'
#!/usr/bin/env bash
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_LOCAL_IP=129.212.232.134
$SPARK_HOME/sbin/start-worker.sh \
--host 129.212.232.134 \
--port 7078 \
--webui-port 8081 \
spark://129.212.232.134:7077
EOF
chmod +x /opt/spark/sbin/start-worker-fixed.sh
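Before wiring these scripts into systemd (next section), you can smoke-test them by hand; a short sketch, assuming port 8080 is still free:
# Start the master manually and confirm the web UI answers (expect 200)
/opt/spark/sbin/start-master-fixed.sh
sleep 5
curl -s -o /dev/null -w "%{http_code}\n" http://129.212.232.134:8080
# Stop it again so systemd can manage it in the next section
/opt/spark/sbin/stop-master.sh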
Install JupyterHub
Step 1 Create Virtual Environment
python3 -m venv /opt/jupyterhub
Step 2 Install JupyterHub Components
# Upgrade pip
/opt/jupyterhub/bin/pip install --upgrade pip wheel
# Install JupyterHub and JupyterLab
/opt/jupyterhub/bin/pip install jupyterhub jupyterlab notebook
# Install PySpark
/opt/jupyterhub/bin/pip install pyspark==4.1.1 py4j ipykernel
Step 3 Install Node.js Proxy
npm install -g configurable-http-proxy
Step 4 Verify Installations
/opt/jupyterhub/bin/jupyterhub --version
/opt/jupyterhub/bin/jupyter --version
which configurable-http-proxy
Configure JupyterHub
Step 1 Create Configuration Directory
mkdir -p /opt/jupyterhub/etc/jupyterhub
cd /opt/jupyterhub/etc/jupyterhub
Step 2 Generate Config File
/opt/jupyterhub/bin/jupyterhub --generate-config
Step 3 Edit Configuration
cat > /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py << 'EOF'
# Basic settings
c.JupyterHub.bind_url = 'http://0.0.0.0:8000'
c.Spawner.default_url = '/lab'
c.Spawner.notebook_dir = '~'
# Environment variables for all users
c.Spawner.environment = {
'SPARK_HOME': '/opt/spark',
'PYTHONPATH': '/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.7-src.zip',
'JAVA_HOME': '/usr/lib/jvm/java-17-openjdk-amd64',
'SPARK_LOCAL_IP': '129.212.232.134',
'PYSPARK_PYTHON': '/opt/jupyterhub/bin/python3',
'PYSPARK_DRIVER_PYTHON': '/opt/jupyterhub/bin/python3'
}
# Resource limits per user
c.Spawner.mem_limit = '2G'
c.Spawner.cpu_limit = 2
# Allow all authenticated users to login
c.Authenticator.allow_all = True
# Preserve user environment variables (fixes file permission issues)
c.Spawner.env_keep = ['USER', 'PATH', 'HOME']
EOF
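The config file is plain Python, so a syntax check catches typos before the service exists; a quick sketch:
# Prints nothing on success, a traceback on a syntax error
/opt/jupyterhub/bin/python3 -m py_compile /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py
# Optional: run JupyterHub once in the foreground to confirm it starts,
# then stop it with Ctrl+C before creating the systemd service
/opt/jupyterhub/bin/jupyterhub -f /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py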
Create PySpark Kernel
mkdir -p /usr/local/share/jupyter/kernels/pyspark
cat > /usr/local/share/jupyter/kernels/pyspark/kernel.json << 'EOF'
{
"display_name": "PySpark (Spark 4.1.1)",
"language": "python",
"argv": [
"/opt/jupyterhub/bin/python3",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "/opt/spark",
"PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.7-src.zip:/opt/jupyterhub/lib/python3.12/site-packages",
"JAVA_HOME": "/usr/lib/jvm/java-17-openjdk-amd64",
"SPARK_LOCAL_IP": "129.212.232.134",
"PATH": "/opt/jupyterhub/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"PYSPARK_PYTHON": "/opt/jupyterhub/bin/python3",
"PYSPARK_DRIVER_PYTHON": "/opt/jupyterhub/bin/python3",
"PYSPARK_SUBMIT_ARGS": "--master spark://129.212.232.134:7077 --conf spark.driver.host=129.212.232.134 --conf spark.driver.bindAddress=129.212.232.134 --executor-memory 768m --executor-cores 1 --driver-memory 512m --conf spark.executor.memoryOverhead=384m --conf spark.driver.memoryOverhead=256m pyspark-shell"
}
}
EOF
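To confirm the kernel is registered where JupyterHub will look for it, list the installed kernelspecs:
/opt/jupyterhub/bin/jupyter kernelspec list
# Expected to include a line similar to:
#   pyspark    /usr/local/share/jupyter/kernels/pyspark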
Setup System Services
Step 1 Spark Master Service
cat > /etc/systemd/system/spark-master.service << 'EOF'
[Unit]
Description=Apache Spark Master
After=network-online.target
Wants=network-online.target
[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=129.212.232.134"
Environment="SPARK_MASTER_HOST=129.212.232.134"
ExecStartPre=/bin/sleep 5
ExecStart=/opt/spark/sbin/start-master-fixed.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
Step 2 Spark Worker Service
cat > /etc/systemd/system/spark-worker.service << 'EOF'
[Unit]
Description=Apache Spark Worker
After=spark-master.service network-online.target
Requires=spark-master.service
Wants=network-online.target
[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=129.212.232.134"
ExecStartPre=/bin/bash -c 'for i in {1..30}; do curl -s http://129.212.232.134:8080 && break || sleep 2; done'
ExecStart=/opt/spark/sbin/start-worker-fixed.sh
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=on-failure
RestartSec=15
[Install]
WantedBy=multi-user.target
EOF
Step 3 JupyterHub Service
cat > /etc/systemd/system/jupyterhub.service << 'EOF'
[Unit]
Description=JupyterHub
After=spark-worker.service network.target
Requires=spark-worker.service
[Service]
Type=simple
User=root
Environment="PATH=/opt/jupyterhub/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/opt/jupyterhub/bin/jupyterhub -f /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py
WorkingDirectory=/opt/jupyterhub/etc/jupyterhub
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
Step 4 Reload Systemd
systemctl daemon-reload
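Optionally, verify the unit files parse cleanly before enabling them; systemd-analyze flags misspelled section or option names:
systemd-analyze verify /etc/systemd/system/spark-master.service \
  /etc/systemd/system/spark-worker.service \
  /etc/systemd/system/jupyterhub.service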
Configure Firewall
# Enable firewall
ufw --force enable
# Allow SSH (IMPORTANT!)
ufw allow 22/tcp
# Spark ports
ufw allow 7077/tcp comment 'Spark Master'
ufw allow 7078/tcp comment 'Spark Worker'
ufw allow 8080/tcp comment 'Spark Master UI'
ufw allow 8081/tcp comment 'Spark Worker UI'
ufw allow 7001:7010/tcp comment 'Spark Driver'
# JupyterHub
ufw allow 8000/tcp comment 'JupyterHub'
# Spark Application UI
ufw allow 4040:4050/tcp comment 'Spark App UI'
# Reload firewall
ufw reload
# Verify firewall rules
ufw status numbered
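The rules above open every port to the whole internet, which is fine for a short training session but worth tightening otherwise. A sketch of restricting JupyterHub to a trusted subnet (203.0.113.0/24 is a placeholder for your office or VPN range):
# List rules with numbers, then delete the open 8000/tcp rule by its number
ufw status numbered
ufw delete <RULE_NUMBER>
# Re-add the port restricted to your network (placeholder subnet)
ufw allow from 203.0.113.0/24 to any port 8000 proto tcp comment 'JupyterHub (trusted net)'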
Create User Accounts
# Create 5 training users
for i in {1..5}; do
useradd -m -s /bin/bash user$i
echo "user$i:Training@123" | chpasswd
echo "Created user$i with password: Training@123"
done
# Verify users
ls -la /home/
User Credentials:
- user1 / Training@123
- user2 / Training@123
- user3 / Training@123
- user4 / Training@123
- user5 / Training@123
Set Proper Permissions
# Fix home directory permissions for all users
# This prevents "Permission denied" errors when deleting Spark files
for i in {1..5}; do
chown -R user$i:user$i /home/user$i
chmod -R 755 /home/user$i
done
echo "ā User permissions set correctly"
Install Common Python Packages
# Install data science and ML packages
/opt/jupyterhub/bin/pip install \
pandas \
numpy \
matplotlib \
seaborn \
plotly \
scikit-learn \
scipy \
statsmodels \
requests \
beautifulsoup4 \
sqlalchemy \
pymysql \
psycopg2-binary
# Verify installations
/opt/jupyterhub/bin/pip list
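A quick import test confirms the packages landed in the environment the PySpark kernel uses:
/opt/jupyterhub/bin/python3 -c "import pandas, numpy, sklearn, pyspark; print('pandas', pandas.__version__, '| pyspark', pyspark.__version__)"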
Start Services
Step 1 Enable Services (Auto-start on boot)
systemctl enable spark-master
systemctl enable spark-worker
systemctl enable jupyterhub
Step 2 Start Spark Master
systemctl start spark-master
echo "Waiting for master to start..."
sleep 15
# Check status
systemctl status spark-master --no-pager
Expected: Active: active (running)
Step 3 Start Spark Worker
systemctl start spark-worker
echo "Waiting for worker to start..."
sleep 15
# Check status
systemctl status spark-worker --no-pager
Expected: Active: active (running)
Step 4 Start JupyterHub
systemctl start jupyterhub
echo "Waiting for JupyterHub to start..."
sleep 10
# Check status
systemctl status jupyterhub --no-pager
Expected: Active: active (running)
Verification
Check All Services
systemctl status spark-master spark-worker jupyterhub --no-pager
All should show: Active: active (running)
Check Processes
jps
Expected output:
12345 Master
12346 Worker
12347 Jps
Check Network Bindings
netstat -tuln | grep "7077\|7078\|8080\|8081\|8000"
Expected: All ports listening on 129.212.232.134
Check Worker Registration
curl -s http://129.212.232.134:8080 | grep -i "Workers (1)"
Expected: Shows "Workers (1)"
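If grepping HTML feels brittle, the standalone master UI also serves a machine-readable summary at /json on recent Spark releases; a hedged alternative check:
# Returns cluster status as JSON, including the number of alive workers
curl -s http://129.212.232.134:8080/json | python3 -m json.tool | grep -E '"aliveworkers"|"cores"|"memory"'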
Check Logs
# Master logs
ls /opt/spark/logs/*Master*.out
tail -50 /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-*.out
# Worker logs
ls /opt/spark/logs/*Worker*.out
tail -50 /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-*.out
Look for: "Successfully registered with master"
Testing
Test 1: Access Web Interfaces
Open in browser:
- Spark Master UI: http://129.212.232.134:8080 (should show 1 worker, 8 cores, 12GB memory)
- Spark Worker UI: http://129.212.232.134:8081 (should show worker details)
- JupyterHub: http://129.212.232.134:8000 (should show the login page)
Test 2: Login to JupyterHub
- Go to: http://129.212.232.134:8000
- Login with: user1 / Training@123
- You should see JupyterLab interface
Test 3: Create and Run PySpark Notebook
- Click New → Notebook
- Select the kernel: PySpark (Spark 4.1.1)
- In the first cell, paste this code:
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
.appName("InstallationTest") \
.getOrCreate()
# Verify configuration
print(f"ā Spark Version: {spark.version}")
print(f"ā Master: {spark.sparkContext.master}")
print(f"ā Application Name: {spark.sparkContext.appName}")
# Test data processing
print("\n=== Testing Data Processing ===")
df = spark.range(1, 1000000).toDF("id")
df = df.withColumn("value", df.id * 2)
print(f"ā Total Rows: {df.count():,}")
print("\nā Sample Data:")
df.show(10)
print("\nā Statistics:")
df.describe().show()
# Stop Spark
spark.stop()
print("\nāāā INSTALLATION SUCCESSFUL! āāā")
- Press Shift + Enter to run
Test 4: Check Spark Application UI
While the notebook is running:
- Open: http://129.212.232.134:4040
- You should see Spark Application UI with job details
Expected Results
All tests should pass with:
- ✅ No errors in notebook execution
- ✅ Spark version shows 4.1.1
- ✅ Master URL shows spark://129.212.232.134:7077
- ✅ Data operations complete successfully
- ⚠️ Warnings about the native Hadoop library are NORMAL (ignore them)
Monitoring
Check Cluster Health Script
cat > /root/check-spark.sh << 'EOF'
#!/bin/bash
echo "=== SPARK CLUSTER STATUS ==="
echo ""
echo "Master: $(systemctl is-active spark-master)"
echo "Worker: $(systemctl is-active spark-worker)"
echo "JupyterHub: $(systemctl is-active jupyterhub)"
echo ""
echo "=== REGISTERED WORKERS ==="
curl -s http://129.212.232.134:8080 | grep -c "worker-" | xargs echo "Workers:"
echo ""
echo "=== ACTIVE APPLICATIONS ==="
curl -s http://129.212.232.134:8080 | grep -c "app-" | xargs echo "Apps:"
echo ""
echo "=== RESOURCE USAGE ==="
free -h | grep Mem
echo ""
df -h | grep -E "Filesystem|/$"
EOF
chmod +x /root/check-spark.sh
Run health check:
/root/check-spark.sh
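To run the health check on a schedule instead of on demand, one option is a cron entry that appends output to a log file (the log path is an arbitrary choice):
# Run the check every 15 minutes and keep the output for later review
(crontab -l 2>/dev/null; echo "*/15 * * * * /root/check-spark.sh >> /var/log/spark-health.log 2>&1") | crontab -
# Inspect the most recent results
tail -40 /var/log/spark-health.log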
Monitor Logs
# Real-time monitoring
journalctl -u spark-master -f
journalctl -u spark-worker -f
journalctl -u jupyterhub -f
# System resources
htop
Maintenance
Adding More Users
# Add new user
useradd -m -s /bin/bash user6
echo "user6:Training@123" | chpasswd
Installing Additional Packages
# Install for all users
/opt/jupyterhub/bin/pip install package-name
# Restart JupyterHub
systemctl restart jupyterhub
Updating Spark Configuration
# Edit configuration
nano /opt/spark/conf/spark-defaults.conf
# Restart Spark services
systemctl restart spark-master spark-worker
Backup Configuration
# Create backup
tar -czf spark-jupyterhub-backup-$(date +%Y%m%d).tar.gz \
/opt/spark/conf \
/opt/jupyterhub/etc/jupyterhub \
/etc/systemd/system/spark-*.service \
/etc/systemd/system/jupyterhub.service \
/usr/local/share/jupyter/kernels/pyspark
# Verify backup
ls -lh spark-jupyterhub-backup-*.tar.gz
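To restore from the archive on a rebuilt server (after Java, Spark, and JupyterHub are installed in the same paths), a minimal sketch, with YYYYMMDD replaced by the backup date:
# Extract the saved configs back to their original locations, then reload services
tar -xzf spark-jupyterhub-backup-YYYYMMDD.tar.gz -C /
systemctl daemon-reload
systemctl restart spark-master spark-worker jupyterhub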
Checking Resource Usage
# Memory usage
free -h
# Disk usage
df -h
# CPU usage
htop
# Active Spark applications
curl -s http://129.212.232.134:8080
Restarting All Services
systemctl restart spark-master && \
sleep 10 && \
systemctl restart spark-worker && \
sleep 10 && \
systemctl restart jupyterhub
Scaling: Adding More Worker Nodes
When to Add Workers
Add more worker nodes when:
- You need to support more concurrent users (more than 5)
- You are processing larger datasets (more than 5GB per user)
- You need faster job execution times (see the note on spark.cores.max after this list)
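Note that spark.cores.max in spark-defaults.conf caps how many cores a single application can claim (currently 8), so adding workers increases total capacity for concurrent users, but one job still tops out at 8 cores until you raise it. A sketch of the change on the master (driver) node, adjusting the number to your total core count:
# On the MASTER node: let a single application claim cores across both workers
sed -i 's/^spark.cores.max.*/spark.cores.max 16/' /opt/spark/conf/spark-defaults.conf
# New notebook sessions pick this up automatically; running sessions keep the old cap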
Setup New Worker Node
On Each New Worker Server:
# 1. Install Java 17
apt update
apt install -y openjdk-17-jdk
# 2. Download and extract Spark
cd /opt
wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
tar -xzf spark-4.1.1-bin-hadoop3.tgz
mv spark-4.1.1-bin-hadoop3 spark
rm spark-4.1.1-bin-hadoop3.tgz
# 3. Fix hostname resolution
HOSTNAME=$(hostname)
WORKER_IP=$(hostname -I | awk '{print $1}')
cat > /etc/hosts << EOF
127.0.0.1 localhost
$WORKER_IP $HOSTNAME
EOF
# 4. Create spark-env.sh
cat > /opt/spark/conf/spark-env.sh << EOF
export SPARK_LOCAL_IP=$WORKER_IP
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=12g
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
EOF
chmod +x /opt/spark/conf/spark-env.sh
# 5. Create worker systemd service
# Replace 129.212.232.134 with your MASTER IP
cat > /etc/systemd/system/spark-worker.service << EOF
[Unit]
Description=Apache Spark Worker
After=network-online.target
[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=$WORKER_IP"
ExecStart=/opt/spark/sbin/start-worker.sh spark://129.212.232.134:7077
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=on-failure
RestartSec=15
[Install]
WantedBy=multi-user.target
EOF
# 6. Configure firewall
ufw allow 22/tcp
ufw allow 7078/tcp
ufw allow 8081/tcp
ufw --force enable
# 7. Start worker
systemctl daemon-reload
systemctl enable spark-worker
systemctl start spark-worker
Verify Connection
On Master Server:
# Check Master UI
curl http://129.212.232.134:8080 | grep -i "worker"
# Or open in browser
# http://129.212.232.134:8080
Expected: Shows all workers listed with their IPs
On Worker Server:
# Check worker status
systemctl status spark-worker
# Check worker logs
tail -50 /opt/spark/logs/spark-*-Worker-*.out | grep "Successfully registered"
Expected: "Successfully registered with master"
Quick Health Check
# On master, count workers
curl -s http://129.212.232.134:8080 | grep -c "worker-"
Quick Reference
Service Commands
# Check status
systemctl status spark-master spark-worker jupyterhub
# Start services
systemctl start spark-master
systemctl start spark-worker
systemctl start jupyterhub
# Stop services
systemctl stop spark-master spark-worker jupyterhub
# Restart services
systemctl restart spark-master spark-worker jupyterhub
# View logs
journalctl -u spark-master -f
journalctl -u spark-worker -f
journalctl -u jupyterhub -f
Access URLs
- JupyterHub: http://129.212.232.134:8000
- Spark Master UI: http://129.212.232.134:8080
- Spark Worker UI: http://129.212.232.134:8081
- Spark Application UI: http://129.212.232.134:4040
Default Credentials
- Users: user1, user2, user3, user4, user5
- Password: Training@123
Important Paths
- Spark Home: /opt/spark
- Spark Config: /opt/spark/conf
- Spark Logs: /opt/spark/logs
- JupyterHub Config: /opt/jupyterhub/etc/jupyterhub
- PySpark Kernel: /usr/local/share/jupyter/kernels/pyspark
Reset User Password
# Reset password for a user
echo "user5:NewPassword123" | chpasswd
# Or interactive way
passwd user5