⚠️ CRITICAL REQUIREMENTS
Before you start, please note these important requirements:
- Java 17 is REQUIRED - Spark 4.1.1 does NOT work with Java 11
- Driver Memory: 512MB minimum - Lower values will cause failures
- 16GB RAM recommended for 5 concurrent users
- Ubuntu 24.04 hostname must be fixed (maps to 127.0.1.1 by default)
- All configurations use explicit IPs to avoid connection issues
This guide includes all tested fixes - follow every step exactly as written.
Replace 129.212.232.134 with your server's actual IP address throughout ALL commands in this guide.
📋 TABLE OF CONTENTS
- Prerequisites
- Server Specifications
- Initial Setup
- Install Java 17
- Install Apache Spark 4.1.1
- Configure Spark
- Install JupyterHub
- Configure JupyterHub
- Create PySpark Kernel
- Setup System Services
- Configure Firewall
- Create User Accounts
- Install Common Python Packages
- Start Services
- Verification
- Testing
- Monitoring
- Maintenance
- Scaling: Adding More Worker Nodes
- Quick Reference
Prerequisites
- Ubuntu 24.04 LTS server (VPS, Cloud VM, or On-Premises)
- Root access or sudo privileges
- SSH access to the server
- Basic Linux command line knowledge
- Your server's public IP address
Server Specifications
Recommended Server Configuration:
- RAM: 16GB
- CPU: 8 vCPUs (or 8 cores)
- Storage: 320GB SSD
- OS: Ubuntu 24.04 LTS x64
- Network: Public IP address assigned
Minimum Requirements:
- RAM: 8GB (supports 2-3 concurrent users)
- CPU: 4 vCPUs/cores
- Storage: 160GB SSD
Compatible Platforms:
- DigitalOcean Droplets
- AWS EC2 Instances
- Azure Virtual Machines
- Google Cloud Compute Engine
- On-Premises Physical Servers
- On-Premises Virtual Machines (VMware, Hyper-V, KVM)
Initial Setup
Step 1 Connect to Your Server
# Replace 129.212.232.134 with your server's IP address
ssh root@129.212.232.134
Replace 129.212.232.134 with your actual server IP address.
Step 2 Update System
apt update && apt upgrade -y
Step 3 Install Essential Packages
apt install -y \
python3 \
python3-pip \
python3-venv \
nodejs \
npm \
curl \
wget \
htop \
ufw \
net-tools \
nano \
git
Step 4 Reboot Server
reboot
Wait 2 minutes, then reconnect:
ssh root@129.212.232.134
Install Java 17
# Install Java 17
apt install -y openjdk-17-jdk
# Verify installation
java -version
Expected output:
openjdk version "17.0.x"
Note: If you accidentally install Java 11, the services will fail to start with a "class file version 61.0" error. Always use Java 17 for Spark 4.1.1.
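If more than one JDK ends up installed, a quick way to check which one is active and switch back to Java 17 (paths assume the default Ubuntu openjdk-17 package):
# Show which java binary is currently in use
readlink -f "$(which java)"
# If it does not point at java-17-openjdk-amd64, pick Java 17 interactively
update-alternatives --config java
# Confirm
java -version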
Install Apache Spark 4.1.1
Step 1 Download Spark
cd /opt
wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
Step 2 Extract and Setup
tar -xzf spark-4.1.1-bin-hadoop3.tgz
mv spark-4.1.1-bin-hadoop3 spark
rm spark-4.1.1-bin-hadoop3.tgz
chown -R root:root /opt/spark
Step 3 Verify Installation
ls -la /opt/spark/
/opt/spark/bin/spark-shell --version
✅ You should see Spark version 4.1.1.
Configure Spark
Step 1 Fix Hostname Resolution
# Get hostname
HOSTNAME=$(hostname)
echo "Your hostname is: $HOSTNAME"
# Backup original hosts file
cp /etc/hosts /etc/hosts.backup
# Create proper hosts file
cat > /etc/hosts << EOF
127.0.0.1 localhost
129.212.232.134 $HOSTNAME
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
EOF
# Verify
cat /etc/hosts
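As a quick sanity check, confirm the hostname now resolves to the server's public IP instead of the default 127.0.1.1 (commands assume the hosts file written above):
# Should print your public IP followed by the hostname
getent hosts "$(hostname)"
# Should list the same IP among the interface addresses
hostname -I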
Step 2 Create spark-env.sh
cat > /opt/spark/conf/spark-env.sh << 'EOF'
#!/usr/bin/env bash
# Critical: Replace 129.212.232.134 with actual IP
export SPARK_LOCAL_IP=129.212.232.134
export SPARK_MASTER_HOST=129.212.232.134
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
# Worker settings
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=12g
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_WORKER_TIMEOUT=120
# Python settings
export PYSPARK_PYTHON=/opt/jupyterhub/bin/python3
export PYSPARK_DRIVER_PYTHON=/opt/jupyterhub/bin/python3
# Java settings
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
EOF
chmod +x /opt/spark/conf/spark-env.sh
Step 3 Create spark-defaults.conf
cat > /opt/spark/conf/spark-defaults.conf << 'EOF'
spark.master spark://129.212.232.134:7077
spark.driver.host 129.212.232.134
spark.driver.bindAddress 129.212.232.134
spark.driver.port 7001
spark.driver.memory 512m
spark.driver.memoryOverhead 256m
spark.executor.memory 768m
spark.executor.memoryOverhead 384m
spark.executor.cores 1
spark.network.timeout 300s
spark.rpc.askTimeout 300s
spark.port.maxRetries 100
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions 4
spark.default.parallelism 8
spark.sql.adaptive.enabled true
spark.cores.max 8
spark.hadoop.fs.permissions.umask-mode 000
EOF
The spark.hadoop.fs.permissions.umask-mode setting ensures users can delete files they create.
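For capacity planning, it helps to see how these values add up per notebook session. A rough sketch of the arithmetic, assuming one driver and one or more single-core executors per user as configured above:
# Per-session footprint with the values above:
#   driver   = 512m + 256m overhead  = 768m  (runs on this server, outside the worker)
#   executor = 768m + 384m overhead  = 1152m (allocated from SPARK_WORKER_MEMORY=12g)
# spark.cores.max=8 with spark.executor.cores=1 lets one busy application claim up
# to 8 one-core executors (8 x 1152m = 9216m), so consider lowering spark.cores.max
# if all 5 users are expected to run jobs at the same time.
echo "worst-case executor memory per app: $(( 8 * 1152 ))m"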
Step 4 Create Start Scripts
Master Start Script:
cat > /opt/spark/sbin/start-master-fixed.sh << 'EOF'
#!/usr/bin/env bash
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_LOCAL_IP=129.212.232.134
export SPARK_MASTER_HOST=129.212.232.134
$SPARK_HOME/sbin/start-master.sh \
--host 129.212.232.134 \
--port 7077 \
--webui-port 8080
EOF
chmod +x /opt/spark/sbin/start-master-fixed.sh
Worker Start Script:
cat > /opt/spark/sbin/start-worker-fixed.sh << 'EOF'
#!/usr/bin/env bash
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export SPARK_LOCAL_IP=129.212.232.134
$SPARK_HOME/sbin/start-worker.sh \
--host 129.212.232.134 \
--port 7078 \
--webui-port 8081 \
spark://129.212.232.134:7077
EOF
chmod +x /opt/spark/sbin/start-worker-fixed.sh
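Before wiring these scripts into systemd (next section), you can smoke-test them by hand; a short sketch, assuming port 8080 is still free:
# Start the master manually and confirm the web UI answers (expect 200)
/opt/spark/sbin/start-master-fixed.sh
sleep 5
curl -s -o /dev/null -w "%{http_code}\n" http://129.212.232.134:8080
# Stop it again so systemd can manage it in the next section
/opt/spark/sbin/stop-master.sh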
Install JupyterHub
Step 1 Create Virtual Environment
python3 -m venv /opt/jupyterhub
Step 2 Install JupyterHub Components
# Upgrade pip
/opt/jupyterhub/bin/pip install --upgrade pip wheel
# Install JupyterHub and JupyterLab
/opt/jupyterhub/bin/pip install jupyterhub jupyterlab notebook
# Install PySpark
/opt/jupyterhub/bin/pip install pyspark==4.1.1 py4j ipykernel
Step 3 Install Node.js Proxy
npm install -g configurable-http-proxy
Step 4 Verify Installations
/opt/jupyterhub/bin/jupyterhub --version
/opt/jupyterhub/bin/jupyter --version
which configurable-http-proxy
Configure JupyterHub
Step 1 Create Configuration Directory
mkdir -p /opt/jupyterhub/etc/jupyterhub
cd /opt/jupyterhub/etc/jupyterhub
Step 2 Generate Config File
/opt/jupyterhub/bin/jupyterhub --generate-config
Step 3 Edit Configuration
cat > /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py << 'EOF'
# Basic settings
c.JupyterHub.bind_url = 'http://0.0.0.0:8000'
c.Spawner.default_url = '/lab'
c.Spawner.notebook_dir = '~'
# Environment variables for all users
c.Spawner.environment = {
'SPARK_HOME': '/opt/spark',
'PYTHONPATH': '/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.7-src.zip',
'JAVA_HOME': '/usr/lib/jvm/java-17-openjdk-amd64',
'SPARK_LOCAL_IP': '129.212.232.134',
'PYSPARK_PYTHON': '/opt/jupyterhub/bin/python3',
'PYSPARK_DRIVER_PYTHON': '/opt/jupyterhub/bin/python3'
}
# Resource limits per user
c.Spawner.mem_limit = '2G'
c.Spawner.cpu_limit = 2
# Allow all authenticated users to login
c.Authenticator.allow_all = True
# Preserve user environment variables (fixes file permission issues)
c.Spawner.env_keep = ['USER', 'PATH', 'HOME']
EOF
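The config file is plain Python, so a syntax check catches typos before the service exists; a quick sketch:
# Prints nothing on success, a traceback on a syntax error
/opt/jupyterhub/bin/python3 -m py_compile /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py
# Optional: run JupyterHub once in the foreground to confirm it starts,
# then stop it with Ctrl+C before creating the systemd service
/opt/jupyterhub/bin/jupyterhub -f /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py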
Create PySpark Kernel
mkdir -p /usr/local/share/jupyter/kernels/pyspark
cat > /usr/local/share/jupyter/kernels/pyspark/kernel.json << 'EOF'
{
"display_name": "PySpark (Spark 4.1.1)",
"language": "python",
"argv": [
"/opt/jupyterhub/bin/python3",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "/opt/spark",
"PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.7-src.zip:/opt/jupyterhub/lib/python3.12/site-packages",
"JAVA_HOME": "/usr/lib/jvm/java-17-openjdk-amd64",
"SPARK_LOCAL_IP": "129.212.232.134",
"PATH": "/opt/jupyterhub/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"PYSPARK_PYTHON": "/opt/jupyterhub/bin/python3",
"PYSPARK_DRIVER_PYTHON": "/opt/jupyterhub/bin/python3",
"PYSPARK_SUBMIT_ARGS": "--master spark://129.212.232.134:7077 --conf spark.driver.host=129.212.232.134 --conf spark.driver.bindAddress=129.212.232.134 --executor-memory 768m --executor-cores 1 --driver-memory 512m --conf spark.executor.memoryOverhead=384m --conf spark.driver.memoryOverhead=256m pyspark-shell"
}
}
EOF
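To confirm the kernel is registered where JupyterHub will look for it, list the installed kernelspecs:
/opt/jupyterhub/bin/jupyter kernelspec list
# Expected to include a line similar to:
#   pyspark    /usr/local/share/jupyter/kernels/pyspark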
Setup System Services
Step 1 Spark Master Service
cat > /etc/systemd/system/spark-master.service << 'EOF'
[Unit]
Description=Apache Spark Master
After=network-online.target
Wants=network-online.target
[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=129.212.232.134"
Environment="SPARK_MASTER_HOST=129.212.232.134"
ExecStartPre=/bin/sleep 5
ExecStart=/opt/spark/sbin/start-master-fixed.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
Step 2 Spark Worker Service
cat > /etc/systemd/system/spark-worker.service << 'EOF'
[Unit]
Description=Apache Spark Worker
After=spark-master.service network-online.target
Requires=spark-master.service
Wants=network-online.target
[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=129.212.232.134"
ExecStartPre=/bin/bash -c 'for i in {1..30}; do curl -s http://129.212.232.134:8080 && break || sleep 2; done'
ExecStart=/opt/spark/sbin/start-worker-fixed.sh
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=on-failure
RestartSec=15
[Install]
WantedBy=multi-user.target
EOF
Step 3 JupyterHub Service
cat > /etc/systemd/system/jupyterhub.service << 'EOF'
[Unit]
Description=JupyterHub
After=spark-worker.service network.target
Requires=spark-worker.service
[Service]
Type=simple
User=root
Environment="PATH=/opt/jupyterhub/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/opt/jupyterhub/bin/jupyterhub -f /opt/jupyterhub/etc/jupyterhub/jupyterhub_config.py
WorkingDirectory=/opt/jupyterhub/etc/jupyterhub
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
Step 4 Reload Systemd
systemctl daemon-reload
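Optionally, verify the unit files parse cleanly before enabling them; systemd-analyze flags misspelled section or option names:
systemd-analyze verify /etc/systemd/system/spark-master.service \
  /etc/systemd/system/spark-worker.service \
  /etc/systemd/system/jupyterhub.service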
Configure Firewall
# Enable firewall
ufw --force enable
# Allow SSH (IMPORTANT!)
ufw allow 22/tcp
# Spark ports
ufw allow 7077/tcp comment 'Spark Master'
ufw allow 7078/tcp comment 'Spark Worker'
ufw allow 8080/tcp comment 'Spark Master UI'
ufw allow 8081/tcp comment 'Spark Worker UI'
ufw allow 7001:7010/tcp comment 'Spark Driver'
# JupyterHub
ufw allow 8000/tcp comment 'JupyterHub'
# Spark Application UI
ufw allow 4040:4050/tcp comment 'Spark App UI'
# Reload firewall
ufw reload
# Verify firewall rules
ufw status numbered
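The rules above open every port to the whole internet, which is fine for a short training session but worth tightening otherwise. A sketch of restricting JupyterHub to a trusted subnet (203.0.113.0/24 is a placeholder for your office or VPN range):
# List rules with numbers, then delete the open 8000/tcp rule by its number
ufw status numbered
ufw delete <RULE_NUMBER>
# Re-add the port restricted to your network (placeholder subnet)
ufw allow from 203.0.113.0/24 to any port 8000 proto tcp comment 'JupyterHub (trusted net)'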
Create User Accounts
# Create 5 training users
for i in {1..5}; do
useradd -m -s /bin/bash user$i
echo "user$i:Training@123" | chpasswd
echo "Created user$i with password: Training@123"
done
# Verify users
ls -la /home/
User Credentials:
- user1 / Training@123
- user2 / Training@123
- user3 / Training@123
- user4 / Training@123
- user5 / Training@123
Set Proper Permissions
# Fix home directory permissions for all users
# This prevents "Permission denied" errors when deleting Spark files
for i in {1..5}; do
chown -R user$i:user$i /home/user$i
chmod -R 755 /home/user$i
done
echo "ā User permissions set correctly"
Install Common Python Packages
# Install data science and ML packages
/opt/jupyterhub/bin/pip install \
pandas \
numpy \
matplotlib \
seaborn \
plotly \
scikit-learn \
scipy \
statsmodels \
requests \
beautifulsoup4 \
sqlalchemy \
pymysql \
psycopg2-binary
# Verify installations
/opt/jupyterhub/bin/pip list
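A quick import test confirms the packages landed in the environment the PySpark kernel uses:
/opt/jupyterhub/bin/python3 -c "import pandas, numpy, sklearn, pyspark; print('pandas', pandas.__version__, '| pyspark', pyspark.__version__)"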
Start Services
Step 1 Enable Services (Auto-start on boot)
systemctl enable spark-master
systemctl enable spark-worker
systemctl enable jupyterhub
Step 2 Start Spark Master
systemctl start spark-master
echo "Waiting for master to start..."
sleep 15
# Check status
systemctl status spark-master --no-pager
Expected: Active: active (running)
Step 3 Start Spark Worker
systemctl start spark-worker
echo "Waiting for worker to start..."
sleep 15
# Check status
systemctl status spark-worker --no-pager
Expected: Active: active (running)
Step 4 Start JupyterHub
systemctl start jupyterhub
echo "Waiting for JupyterHub to start..."
sleep 10
# Check status
systemctl status jupyterhub --no-pager
Expected: Active: active (running)
Verification
Check All Services
systemctl status spark-master spark-worker jupyterhub --no-pager
All should show: Active: active (running)
Check Processes
jps
Expected output:
12345 Master
12346 Worker
12347 Jps
Check Network Bindings
netstat -tuln | grep "7077\|7078\|8080\|8081\|8000"
Expected: All ports listening on 129.212.232.134
Check Worker Registration
curl -s http://129.212.232.134:8080 | grep -i "Workers (1)"
Expected: Shows "Workers (1)"
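If grepping HTML feels brittle, the standalone master UI also serves a machine-readable summary at /json on recent Spark releases; a hedged alternative check:
# Returns cluster status as JSON, including the number of alive workers
curl -s http://129.212.232.134:8080/json | python3 -m json.tool | grep -E '"aliveworkers"|"cores"|"memory"'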
Check Logs
# Master logs
ls /opt/spark/logs/*Master*.out
tail -50 /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-*.out
# Worker logs
ls /opt/spark/logs/*Worker*.out
tail -50 /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-*.out
Look for: "Successfully registered with master"
Testing
Test 1: Access Web Interfaces
Open in browser:
- Spark Master UI: http://129.212.232.134:8080 (should show 1 worker, 8 cores, 12GB memory)
- Spark Worker UI: http://129.212.232.134:8081 (should show worker details)
- JupyterHub: http://129.212.232.134:8000 (should show the login page)
Test 2: Login to JupyterHub
- Go to: http://129.212.232.134:8000
- Login with: user1 / Training@123
- You should see JupyterLab interface
Test 3: Create and Run PySpark Notebook
- Click New → Notebook
- Select the kernel: PySpark (Spark 4.1.1)
- In the first cell, paste this code:
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
.appName("InstallationTest") \
.getOrCreate()
# Verify configuration
print(f"ā Spark Version: {spark.version}")
print(f"ā Master: {spark.sparkContext.master}")
print(f"ā Application Name: {spark.sparkContext.appName}")
# Test data processing
print("\n=== Testing Data Processing ===")
df = spark.range(1, 1000000).toDF("id")
df = df.withColumn("value", df.id * 2)
print(f"ā Total Rows: {df.count():,}")
print("\nā Sample Data:")
df.show(10)
print("\nā Statistics:")
df.describe().show()
# Stop Spark
spark.stop()
print("\nāāā INSTALLATION SUCCESSFUL! āāā")
- Press Shift + Enter to run
Test 4: Check Spark Application UI
While the notebook is running:
- Open: http://129.212.232.134:4040
- You should see Spark Application UI with job details
Expected Results
All tests should pass with:
- ✅ No errors in notebook execution
- ✅ Spark version shows 4.1.1
- ✅ Master URL shows spark://129.212.232.134:7077
- ✅ Data operations complete successfully
- ⚠️ Warnings about the native Hadoop library are NORMAL (ignore them)
Monitoring
Check Cluster Health Script
cat > /root/check-spark.sh << 'EOF'
#!/bin/bash
echo "=== SPARK CLUSTER STATUS ==="
echo ""
echo "Master: $(systemctl is-active spark-master)"
echo "Worker: $(systemctl is-active spark-worker)"
echo "JupyterHub: $(systemctl is-active jupyterhub)"
echo ""
echo "=== REGISTERED WORKERS ==="
curl -s http://129.212.232.134:8080 | grep -c "worker-" | xargs echo "Workers:"
echo ""
echo "=== ACTIVE APPLICATIONS ==="
curl -s http://129.212.232.134:8080 | grep -c "app-" | xargs echo "Apps:"
echo ""
echo "=== RESOURCE USAGE ==="
free -h | grep Mem
echo ""
df -h | grep -E "Filesystem|/$"
EOF
chmod +x /root/check-spark.sh
Run health check:
/root/check-spark.sh
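To run the health check on a schedule instead of on demand, one option is a cron entry that appends output to a log file (the log path is an arbitrary choice):
# Run the check every 15 minutes and keep the output for later review
(crontab -l 2>/dev/null; echo "*/15 * * * * /root/check-spark.sh >> /var/log/spark-health.log 2>&1") | crontab -
# Inspect the most recent results
tail -40 /var/log/spark-health.log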
Monitor Logs
# Real-time monitoring
journalctl -u spark-master -f
journalctl -u spark-worker -f
journalctl -u jupyterhub -f
# System resources
htop
Maintenance
Adding More Users
# Add new user
useradd -m -s /bin/bash user6
echo "user6:Training@123" | chpasswd
Installing Additional Packages
# Install for all users
/opt/jupyterhub/bin/pip install package-name
# Restart JupyterHub
systemctl restart jupyterhub
Updating Spark Configuration
# Edit configuration
nano /opt/spark/conf/spark-defaults.conf
# Restart Spark services
systemctl restart spark-master spark-worker
Backup Configuration
# Create backup
tar -czf spark-jupyterhub-backup-$(date +%Y%m%d).tar.gz \
/opt/spark/conf \
/opt/jupyterhub/etc/jupyterhub \
/etc/systemd/system/spark-*.service \
/etc/systemd/system/jupyterhub.service \
/usr/local/share/jupyter/kernels/pyspark
# Verify backup
ls -lh spark-jupyterhub-backup-*.tar.gz
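To restore from the archive on a rebuilt server (after Java, Spark, and JupyterHub are installed in the same paths), a minimal sketch, with YYYYMMDD replaced by the backup date:
# Extract the saved configs back to their original locations, then reload services
tar -xzf spark-jupyterhub-backup-YYYYMMDD.tar.gz -C /
systemctl daemon-reload
systemctl restart spark-master spark-worker jupyterhub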
Checking Resource Usage
# Memory usage
free -h
# Disk usage
df -h
# CPU usage
htop
# Active Spark applications
curl -s http://129.212.232.134:8080
Restarting All Services
systemctl restart spark-master && \
sleep 10 && \
systemctl restart spark-worker && \
sleep 10 && \
systemctl restart jupyterhub
Scaling: Adding More Worker Nodes
When to Add Workers
Add more worker nodes when:
- You need to support more concurrent users (more than 5)
- You are processing larger datasets (more than 5GB per user)
- You need faster job execution times (see the note on spark.cores.max after this list)
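Note that spark.cores.max in spark-defaults.conf caps how many cores a single application can claim (currently 8), so adding workers increases total capacity for concurrent users, but one job still tops out at 8 cores until you raise it. A sketch of the change on the master (driver) node, adjusting the number to your total core count:
# On the MASTER node: let a single application claim cores across both workers
sed -i 's/^spark.cores.max.*/spark.cores.max 16/' /opt/spark/conf/spark-defaults.conf
# New notebook sessions pick this up automatically; running sessions keep the old cap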
Setup New Worker Node
On Each New Worker Server:
# 1. Install Java 17
apt update
apt install -y openjdk-17-jdk
# 2. Download and extract Spark
cd /opt
wget https://archive.apache.org/dist/spark/spark-4.1.1/spark-4.1.1-bin-hadoop3.tgz
tar -xzf spark-4.1.1-bin-hadoop3.tgz
mv spark-4.1.1-bin-hadoop3 spark
rm spark-4.1.1-bin-hadoop3.tgz
# 3. Fix hostname resolution
HOSTNAME=$(hostname)
WORKER_IP=$(hostname -I | awk '{print $1}')
cat > /etc/hosts << EOF
127.0.0.1 localhost
$WORKER_IP $HOSTNAME
EOF
# 4. Create spark-env.sh
cat > /opt/spark/conf/spark-env.sh << EOF
export SPARK_LOCAL_IP=$WORKER_IP
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=12g
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
EOF
chmod +x /opt/spark/conf/spark-env.sh
# 5. Create worker systemd service
# Replace 129.212.232.134 with your MASTER IP
cat > /etc/systemd/system/spark-worker.service << EOF
[Unit]
Description=Apache Spark Worker
After=network-online.target
[Service]
Type=forking
User=root
Environment="SPARK_HOME=/opt/spark"
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="SPARK_LOCAL_IP=$WORKER_IP"
ExecStart=/opt/spark/sbin/start-worker.sh spark://129.212.232.134:7077
ExecStop=/opt/spark/sbin/stop-worker.sh
Restart=on-failure
RestartSec=15
[Install]
WantedBy=multi-user.target
EOF
# 6. Configure firewall
ufw allow 22/tcp
ufw allow 7078/tcp
ufw allow 8081/tcp
ufw --force enable
# 7. Start worker
systemctl daemon-reload
systemctl enable spark-worker
systemctl start spark-worker
Verify Connection
On Master Server:
# Check Master UI
curl http://129.212.232.134:8080 | grep -i "worker"
# Or open in browser
# http://129.212.232.134:8080
Expected: Shows all workers listed with their IPs
On Worker Server:
# Check worker status
systemctl status spark-worker
# Check worker logs
tail -50 /opt/spark/logs/spark-*-Worker-*.out | grep "Successfully registered"
Expected: "Successfully registered with master"
Quick Health Check
# On master, count workers
curl -s http://129.212.232.134:8080 | grep -c "worker-"
Quick Reference
Service Commands
# Check status
systemctl status spark-master spark-worker jupyterhub
# Start services
systemctl start spark-master
systemctl start spark-worker
systemctl start jupyterhub
# Stop services
systemctl stop spark-master spark-worker jupyterhub
# Restart services
systemctl restart spark-master spark-worker jupyterhub
# View logs
journalctl -u spark-master -f
journalctl -u spark-worker -f
journalctl -u jupyterhub -f
Access URLs
- JupyterHub: http://129.212.232.134:8000
- Spark Master UI: http://129.212.232.134:8080
- Spark Worker UI: http://129.212.232.134:8081
- Spark Application UI: http://129.212.232.134:4040
Default Credentials
- Users: user1, user2, user3, user4, user5
- Password: Training@123
Important Paths
- Spark Home: /opt/spark
- Spark Config: /opt/spark/conf
- Spark Logs: /opt/spark/logs
- JupyterHub Config: /opt/jupyterhub/etc/jupyterhub
- PySpark Kernel: /usr/local/share/jupyter/kernels/pyspark
Reset User Password
# Reset password for a user
echo "user5:NewPassword123" | chpasswd
# Or interactive way
passwd user5