Apache Spark Up and Running FAST with Docker
Table of Contents
- Introduction
- Quick Start: Single-Node Spark Cluster
- Running Your First PySpark Application
- Multi-Worker Setup for Distributed Processing
- Working with Data: File Handling
- Database Integration: PostgreSQL Example
- Spark History Server for Job Monitoring
- Production-Ready Configuration
- Troubleshooting Common Issues
- Advanced Spark Submit Options
- Connecting from External PySpark Applications
- Real-World Example: ETL Pipeline
- Best Practices Summary
- Cleanup and Maintenance
- Conclusion
Introduction
Running Apache Spark in Docker containers provides an isolated, reproducible environment perfect for development, testing, and even production workloads. This comprehensive guide uses the official Apache Spark Docker images and walks you through everything from a simple single-node setup to a full production-ready cluster with monitoring, security, and database integration.
What You'll Learn
- Setting up a Spark cluster with Docker Compose using official Apache images
- Configuring multi-worker distributed processing
- Mounting volumes for data persistence
- Connecting to databases (PostgreSQL, MySQL)
- Submitting and running PySpark applications
- Production deployment with monitoring and security
- Troubleshooting common Docker-Spark issues
Prerequisites
- Docker Engine 20.10+ installed
- Docker Compose 2.0+ installed
- Basic understanding of Apache Spark concepts
- 8GB+ RAM available for Docker
- Familiarity with command line operations
Quick Start: Single-Node Spark Cluster
Let's start with the simplest possible Spark setup using official Apache images.
Step 1: Create Project Directory
mkdir spark-docker-tutorial
cd spark-docker-tutorial
Step 2: Create Docker Compose File
Create a file named docker-compose.yml:
version: '3.8'
services:
spark-master:
image: apache/spark:3.5.0
container_name: spark-master
hostname: spark-master
command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
environment:
- SPARK_MASTER_HOST=spark-master
- SPARK_MASTER_PORT=7077
- SPARK_MASTER_WEBUI_PORT=8080
ports:
- '8080:8080' # Master Web UI
- '7077:7077' # Master Port
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
spark-worker:
image: apache/spark:3.5.0
container_name: spark-worker
hostname: spark-worker
command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
environment:
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=2g
- SPARK_WORKER_WEBUI_PORT=8081
ports:
- '8081:8081' # Worker Web UI
depends_on:
- spark-master
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
Step 3: Start the Cluster
docker-compose up -d
Expected output:
Creating network "spark-docker-tutorial_default" with the default driver
Creating spark-master ... done
Creating spark-worker ... done
Step 4: Verify the Setup
Check if containers are running:
docker-compose ps
You should see:
Name Command State Ports
--------------------------------------------------------------------------------------------------
spark-master /opt/spark/bin/spark-clas... Up 0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp
spark-worker /opt/spark/bin/spark-clas... Up 0.0.0.0:8081->8081/tcp
Step 5: Access the Spark Web UI
Open your browser and navigate to: http://localhost:8080
You should see the Spark Master UI showing:
- 1 worker connected
- 2 cores available
- 2GB memory available
Congratulations! You now have a working Spark cluster using official Apache images.
Running Your First PySpark Application
Let's run a simple PySpark application to verify everything works.
Create Application Directory
mkdir -p spark-apps
Create a Test Application
Create spark-apps/word_count.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split
# Create Spark session
spark = SparkSession.builder \
.appName("WordCount") \
.master("spark://spark-master:7077") \
.getOrCreate()
# Sample data
data = [
("Apache Spark is a unified analytics engine",),
("Spark provides high-level APIs in Java, Scala, Python and R",),
("Spark also supports a rich set of higher-level tools",)
]
# Create DataFrame
df = spark.createDataFrame(data, ["text"])
# Perform word count
word_count = df.select(
explode(split(col("text"), " ")).alias("word")
).groupBy("word") \
.count() \
.orderBy(col("count").desc())
# Display results
print("\\n=== WORD COUNT RESULTS ===")
word_count.show(20, truncate=False)
# Stop Spark session
spark.stop()
Submit the Application
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
/opt/spark-apps/word_count.py
You should see output like:
=== WORD COUNT RESULTS ===
+-------------+-----+
|word |count|
+-------------+-----+
|Spark |3 |
|a |2 |
|in |1 |
|Apache |1 |
|unified |1 |
|analytics |1 |
|engine |1 |
+-------------+-----+
Multi-Worker Setup for Distributed Processing
For real-world data processing, you need multiple workers to distribute the workload.
Scaling Workers Dynamically
The easiest way to add workers:
docker-compose up -d --scale spark-worker=3
This creates 3 worker instances. Note that scaling only works if the spark-worker service has no container_name and no fixed host port mapping (such as '8081:8081'), since multiple replicas cannot share either; remove them from the quick-start file before scaling. Check the Spark UI at http://localhost:8080 to see all 3 workers.
Static Multi-Worker Configuration
For better control and unique naming, define workers explicitly:
version: '3.8'
services:
spark-master:
image: apache/spark:3.5.0
container_name: spark-master
hostname: spark-master
command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
environment:
- SPARK_MASTER_HOST=spark-master
- SPARK_MASTER_PORT=7077
- SPARK_MASTER_WEBUI_PORT=8080
ports:
- '8080:8080'
- '7077:7077'
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
networks:
- spark-network
spark-worker-1:
image: apache/spark:3.5.0
container_name: spark-worker-1
hostname: spark-worker-1
command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
environment:
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=4g
- SPARK_WORKER_WEBUI_PORT=8081
ports:
- '8081:8081'
depends_on:
- spark-master
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
networks:
- spark-network
spark-worker-2:
image: apache/spark:3.5.0
container_name: spark-worker-2
hostname: spark-worker-2
command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
environment:
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=4g
- SPARK_WORKER_WEBUI_PORT=8082
ports:
- '8082:8082'
depends_on:
- spark-master
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
networks:
- spark-network
spark-worker-3:
image: apache/spark:3.5.0
container_name: spark-worker-3
hostname: spark-worker-3
command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
environment:
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=4g
- SPARK_WORKER_WEBUI_PORT=8083
ports:
- '8083:8083'
depends_on:
- spark-master
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
networks:
- spark-network
networks:
spark-network:
driver: bridge
Testing Distributed Processing
Create spark-apps/parallel_processing.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand
import time
spark = SparkSession.builder \
.appName("ParallelProcessing") \
.master("spark://spark-master:7077") \
.config("spark.executor.instances", "3") \
.config("spark.executor.cores", "2") \
.config("spark.executor.memory", "2g") \
.getOrCreate()
# Create large dataset
print("Creating dataset with 10 million rows...")
df = spark.range(0, 10000000).withColumn("random_value", rand())
# Repartition to ensure work is distributed
df = df.repartition(12) # 3 workers * 2 cores * 2 tasks per core
# Perform computation
print("Starting distributed computation...")
start_time = time.time()
result = df.groupBy((col("id") % 100).alias("partition")) \
.agg({"random_value": "avg", "id": "count"}) \
.orderBy("partition")
# Trigger action
count = result.count()
end_time = time.time()
print(f"\\nProcessed {count} partitions in {end_time - start_time:.2f} seconds")
print("\\nSample results:")
result.show(10)
# Check partition distribution
print("\\nNumber of partitions:", df.rdd.getNumPartitions())
spark.stop()
Run it:
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--executor-memory 2G \
--total-executor-cores 6 \
/opt/spark-apps/parallel_processing.py
Watch the Spark UI to see tasks distributed across workers!
Working with Data: File Handling
Real Spark applications need to read and write data files.
Create Data Directory Structure
mkdir -p spark-data/input spark-data/output
Sample CSV Data
Create spark-data/input/sales_data.csv:
date,product,category,quantity,price
2024-01-01,Laptop,Electronics,5,999.99
2024-01-01,Mouse,Electronics,20,29.99
2024-01-01,Desk,Furniture,3,299.99
2024-01-02,Chair,Furniture,8,149.99
2024-01-02,Keyboard,Electronics,15,79.99
2024-01-02,Monitor,Electronics,10,399.99
2024-01-03,Laptop,Electronics,3,999.99
2024-01-03,Desk,Furniture,5,299.99
Data Processing Application
Create spark-apps/sales_analysis.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, avg, count, round as _round
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
spark = SparkSession.builder \
.appName("SalesAnalysis") \
.master("spark://spark-master:7077") \
.getOrCreate()
# Define schema for better performance
schema = StructType([
StructField("date", DateType(), True),
StructField("product", StringType(), True),
StructField("category", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("price", DoubleType(), True)
])
# Read CSV data
print("Reading sales data...")
sales_df = spark.read \
.option("header", "true") \
.option("dateFormat", "yyyy-MM-dd") \
.schema(schema) \
.csv("/opt/spark-data/input/sales_data.csv")
# Calculate total revenue per product
sales_df = sales_df.withColumn("revenue", col("quantity") * col("price"))
print("\\n=== TOTAL SALES BY PRODUCT ===")
product_sales = sales_df.groupBy("product") \
.agg(
_sum("quantity").alias("total_quantity"),
_round(_sum("revenue"), 2).alias("total_revenue")
) \
.orderBy(col("total_revenue").desc())
product_sales.show()
# Category analysis
print("\\n=== SALES BY CATEGORY ===")
category_sales = sales_df.groupBy("category") \
.agg(
count("*").alias("transactions"),
_sum("quantity").alias("total_items"),
_round(_sum("revenue"), 2).alias("total_revenue"),
_round(avg("revenue"), 2).alias("avg_transaction_value")
) \
.orderBy(col("total_revenue").desc())
category_sales.show()
# Save results
print("\\nSaving results to output directory...")
product_sales.coalesce(1).write \
.mode("overwrite") \
.option("header", "true") \
.csv("/opt/spark-data/output/product_sales")
category_sales.coalesce(1).write \
.mode("overwrite") \
.option("header", "true") \
.csv("/opt/spark-data/output/category_sales")
print("\\nAnalysis complete! Check spark-data/output/ directory for results.")
spark.stop()
Run the Analysis
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
/opt/spark-apps/sales_analysis.py
Check results:
ls -la spark-data/output/
cat spark-data/output/product_sales/part-*.csv
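CSV is convenient for eyeballing results, but for larger outputs a columnar format is usually the better choice. Here is a minimal sketch of the same write step using Parquet, reusing spark, sales_df, and product_sales from sales_analysis.py above (the output paths are illustrative):
# Parquet keeps the schema and compresses well; no header option needed
product_sales.write \
    .mode("overwrite") \
    .parquet("/opt/spark-data/output/product_sales_parquet")
# Partitioning by category lets later reads skip files for other categories
sales_df.write \
    .mode("overwrite") \
    .partitionBy("category") \
    .parquet("/opt/spark-data/output/sales_partitioned")
# Reading back requires no schema definition
spark.read.parquet("/opt/spark-data/output/sales_partitioned").printSchema()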
Database Integration: PostgreSQL Example
Let's connect Spark to a PostgreSQL database - a common real-world scenario.
Complete Database Setup
Create docker-compose-with-db.yml:
version: '3.8'
services:
postgres:
image: postgres:15
container_name: postgres-db
hostname: postgres-db
environment:
POSTGRES_DB: analytics
POSTGRES_USER: spark_user
POSTGRES_PASSWORD: spark_password
ports:
- '5432:5432'
volumes:
- postgres_data:/var/lib/postgresql/data
- ./init-db.sql:/docker-entrypoint-initdb.d/init-db.sql
networks:
- spark-network
spark-master:
image: apache/spark:3.5.0
container_name: spark-master
hostname: spark-master
command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
environment:
- SPARK_MASTER_HOST=spark-master
- SPARK_MASTER_PORT=7077
- SPARK_MASTER_WEBUI_PORT=8080
ports:
- '8080:8080'
- '7077:7077'
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
- ./jars:/opt/spark/jars
networks:
- spark-network
spark-worker:
image: apache/spark:3.5.0
command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
environment:
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=4g
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
- ./jars:/opt/spark/jars
depends_on:
- spark-master
- postgres
networks:
- spark-network
deploy:
replicas: 2
networks:
spark-network:
driver: bridge
volumes:
postgres_data:
Initialize Database
Create init-db.sql:
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
city VARCHAR(50),
country VARCHAR(50),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(customer_id),
order_date DATE,
total_amount DECIMAL(10, 2),
status VARCHAR(20)
);
-- Insert sample data
INSERT INTO customers (first_name, last_name, email, city, country) VALUES
('John', 'Doe', 'john.doe@example.com', 'New York', 'USA'),
('Jane', 'Smith', 'jane.smith@example.com', 'London', 'UK'),
('Carlos', 'Rodriguez', 'carlos.rodriguez@example.com', 'Madrid', 'Spain'),
('Yuki', 'Tanaka', 'yuki.tanaka@example.com', 'Tokyo', 'Japan'),
('Emma', 'Brown', 'emma.brown@example.com', 'Sydney', 'Australia');
INSERT INTO orders (customer_id, order_date, total_amount, status) VALUES
(1, '2024-01-15', 1500.00, 'completed'),
(1, '2024-02-20', 750.50, 'completed'),
(2, '2024-01-10', 2200.00, 'completed'),
(3, '2024-02-05', 1100.75, 'pending'),
(4, '2024-01-25', 890.00, 'completed'),
(5, '2024-02-15', 1650.25, 'completed'),
(2, '2024-03-01', 3200.00, 'shipped');
Download JDBC Driver
mkdir -p jars
cd jars
wget https://jdbc.postgresql.org/download/postgresql-42.7.1.jar
cd ..
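If the containers have internet access, an alternative to downloading the JAR manually is to let Spark resolve it from Maven Central. A sketch of the session setup, using the Maven coordinates for the same driver version:
from pyspark.sql import SparkSession
# Spark downloads the PostgreSQL JDBC driver at startup via its Maven coordinates
spark = SparkSession.builder \
    .appName("DatabaseAnalysis") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.1") \
    .getOrCreate()
The spark-submit equivalent is --packages org.postgresql:postgresql:42.7.1 instead of --jars.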
PySpark Database Application
Create spark-apps/db_analysis.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, count, avg, round as _round
# Database connection properties
db_properties = {
"user": "spark_user",
"password": "spark_password",
"driver": "org.postgresql.Driver"
}
jdbc_url = "jdbc:postgresql://postgres-db:5432/analytics"
# Create Spark session
spark = SparkSession.builder \
.appName("DatabaseAnalysis") \
.master("spark://spark-master:7077") \
.config("spark.jars", "/opt/spark/jars/postgresql-42.7.1.jar") \
.getOrCreate()
print("\\n=== READING DATA FROM POSTGRESQL ===")
# Read customers table
customers_df = spark.read.jdbc(
url=jdbc_url,
table="customers",
properties=db_properties
)
print(f"Total customers: {customers_df.count()}")
customers_df.show()
# Read orders table
orders_df = spark.read.jdbc(
url=jdbc_url,
table="orders",
properties=db_properties
)
print(f"\\nTotal orders: {orders_df.count()}")
orders_df.show()
# Perform analysis: Customer order summary
print("\\n=== CUSTOMER ORDER SUMMARY ===")
customer_summary = customers_df.join(
orders_df,
customers_df.customer_id == orders_df.customer_id,
"left"
).groupBy(
customers_df.customer_id,
"first_name",
"last_name",
"country"
).agg(
count("order_id").alias("total_orders"),
_round(_sum("total_amount"), 2).alias("total_spent"),
_round(avg("total_amount"), 2).alias("avg_order_value")
).orderBy(col("total_spent").desc())
customer_summary.show()
# Analyze by country
print("\\n=== SALES BY COUNTRY ===")
country_sales = customers_df.join(
orders_df,
customers_df.customer_id == orders_df.customer_id
).groupBy("country").agg(
count("order_id").alias("total_orders"),
_round(_sum("total_amount"), 2).alias("total_revenue")
).orderBy(col("total_revenue").desc())
country_sales.show()
# Write results back to database
print("\\n=== WRITING RESULTS TO DATABASE ===")
customer_summary.write.jdbc(
url=jdbc_url,
table="customer_analytics",
mode="overwrite",
properties=db_properties
)
print("Analysis complete! Results saved to 'customer_analytics' table.")
spark.stop()
Start Everything and Run
# Start services
docker-compose -f docker-compose-with-db.yml up -d
# Wait for PostgreSQL to be ready (about 10 seconds)
sleep 10
# Run the analysis
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--jars /opt/spark/jars/postgresql-42.7.1.jar \
/opt/spark-apps/db_analysis.py
Verify Results in PostgreSQL
docker exec -it postgres-db psql -U spark_user -d analytics -c "SELECT * FROM customer_analytics;"
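The sample tables here are tiny, but on larger tables a single JDBC read goes through one connection. You can parallelize the read across the cluster by partitioning on a numeric column; a minimal sketch, assuming the same spark, jdbc_url, and db_properties as in db_analysis.py (the bounds describe the expected range of order_id values and do not filter rows):
# Spark issues one query per partition, splitting the order_id range into 4 slices
orders_parallel = spark.read.jdbc(
    url=jdbc_url,
    table="orders",
    column="order_id",
    lowerBound=1,
    upperBound=100,
    numPartitions=4,
    properties=db_properties
)
print("JDBC partitions:", orders_parallel.rdd.getNumPartitions())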
For more database integration examples, see Connect PostgreSQL Database using PySpark.
Spark History Server for Job Monitoring
The History Server lets you review completed applications.
Enhanced Docker Compose with History Server
version: '3.8'
services:
spark-master:
image: apache/spark:3.5.0
container_name: spark-master
hostname: spark-master
command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
environment:
- SPARK_MASTER_HOST=spark-master
- SPARK_MASTER_PORT=7077
- SPARK_MASTER_WEBUI_PORT=8080
ports:
- '8080:8080'
- '7077:7077'
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-events:/opt/spark/spark-events
networks:
- spark-network
spark-worker:
image: apache/spark:3.5.0
command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
environment:
- SPARK_WORKER_CORES=2
- SPARK_WORKER_MEMORY=4g
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-events:/opt/spark/spark-events
depends_on:
- spark-master
networks:
- spark-network
deploy:
replicas: 2
spark-history:
image: apache/spark:3.5.0
container_name: spark-history
hostname: spark-history
command: /opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer
environment:
- SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/opt/spark/spark-events
ports:
- '18080:18080'
volumes:
- ./spark-events:/opt/spark/spark-events
depends_on:
- spark-master
networks:
- spark-network
networks:
spark-network:
driver: bridge
Configure Event Logging
Create spark-defaults.conf. These settings only take effect if the file ends up in /opt/spark/conf inside the containers (for example, by mounting ./conf:/opt/spark/conf as the production setup below does); otherwise, pass the equivalent --conf flags at submit time as shown in the next step:
spark.eventLog.enabled=true
spark.eventLog.dir=/opt/spark/spark-events
spark.history.fs.logDirectory=/opt/spark/spark-events
Create Events Directory
mkdir spark-events
Submit with Event Logging
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=/opt/spark/spark-events \
/opt/spark-apps/word_count.py
Access History Server
After running applications, access the History Server at http://localhost:18080 to see:
- Completed jobs
- Execution timelines
- Stage details
- Executor metrics
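If you would rather not pass --conf flags on every submit, event logging can also be switched on inside the application itself. A minimal sketch, assuming the spark-events volume is mounted as in the compose file above:
from pyspark.sql import SparkSession
# Event logs written to this directory are picked up by the History Server container
spark = SparkSession.builder \
    .appName("WordCountWithHistory") \
    .master("spark://spark-master:7077") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "/opt/spark/spark-events") \
    .getOrCreate()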
Production-Ready Configuration
Here's a complete production configuration with all best practices using official Apache Spark images.
docker-compose-production.yml
version: '3.8'
services:
spark-master:
image: apache/spark:3.5.0
container_name: spark-master
hostname: spark-master
command: >
bash -c "/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master"
environment:
- SPARK_MASTER_HOST=spark-master
- SPARK_MASTER_PORT=7077
- SPARK_MASTER_WEBUI_PORT=8080
- SPARK_NO_DAEMONIZE=true
ports:
- '8080:8080'
- '7077:7077'
- '4040-4050:4040-4050'
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
- ./spark-events:/opt/spark/spark-events
- ./jars:/opt/spark/jars
- ./conf:/opt/spark/conf
networks:
- spark-network
deploy:
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '2'
memory: 4G
healthcheck:
test: ['CMD', 'curl', '-f', 'http://localhost:8080']
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
restart: unless-stopped
spark-worker-1:
image: apache/spark:3.5.0
container_name: spark-worker-1
hostname: spark-worker-1
command: >
bash -c "/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker
spark://spark-master:7077"
environment:
- SPARK_WORKER_CORES=4
- SPARK_WORKER_MEMORY=6g
- SPARK_WORKER_WEBUI_PORT=8081
- SPARK_NO_DAEMONIZE=true
ports:
- '8081:8081'
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
- ./spark-events:/opt/spark/spark-events
- ./jars:/opt/spark/jars
- ./conf:/opt/spark/conf
networks:
- spark-network
deploy:
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '2'
memory: 4G
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
depends_on:
spark-master:
condition: service_healthy
restart: unless-stopped
spark-worker-2:
image: apache/spark:3.5.0
container_name: spark-worker-2
hostname: spark-worker-2
command: >
bash -c "/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker
spark://spark-master:7077"
environment:
- SPARK_WORKER_CORES=4
- SPARK_WORKER_MEMORY=6g
- SPARK_WORKER_WEBUI_PORT=8082
- SPARK_NO_DAEMONIZE=true
ports:
- '8082:8082'
volumes:
- ./spark-apps:/opt/spark-apps
- ./spark-data:/opt/spark-data
- ./spark-events:/opt/spark/spark-events
- ./jars:/opt/spark/jars
- ./conf:/opt/spark/conf
networks:
- spark-network
deploy:
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '2'
memory: 4G
restart_policy:
condition: on-failure
delay: 5s
max_attempts: 3
depends_on:
spark-master:
condition: service_healthy
restart: unless-stopped
spark-history:
image: apache/spark:3.5.0
container_name: spark-history
hostname: spark-history
command: /opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer
environment:
- SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/opt/spark/spark-events -Dspark.history.ui.port=18080
- SPARK_NO_DAEMONIZE=true
ports:
- '18080:18080'
volumes:
- ./spark-events:/opt/spark/spark-events
networks:
- spark-network
depends_on:
- spark-master
restart: unless-stopped
networks:
spark-network:
driver: bridge
name: spark-network
Performance Tuning Configuration
Create conf/spark-defaults.conf:
# Event Logging
spark.eventLog.enabled=true
spark.eventLog.dir=/opt/spark/spark-events
# Serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m
# Memory management
spark.memory.fraction=0.8
spark.memory.storageFraction=0.3
# Shuffle optimization
spark.shuffle.compress=true
spark.shuffle.spill.compress=true
spark.shuffle.file.buffer=64k
# Dynamic allocation
spark.dynamicAllocation.enabled=false
spark.shuffle.service.enabled=false
# UI
spark.ui.retainedJobs=100
spark.ui.retainedStages=100
spark.sql.ui.retainedExecutions=100
# Compression
spark.eventLog.compress=true
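To confirm which of these defaults an application actually picked up, you can dump the effective configuration from PySpark. A small sketch:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ConfCheck") \
    .master("spark://spark-master:7077") \
    .getOrCreate()
# Every property the session resolved, including values from spark-defaults.conf
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key}={value}")
# Or check a single property, with a fallback if it is unset
print(spark.conf.get("spark.serializer", "not set"))
spark.stop()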
Troubleshooting Common Issues
Issue 1: Workers Not Connecting
Symptoms: Workers don't appear in Master UI
Check logs:
docker logs spark-worker
Verify network connectivity:
docker exec spark-worker ping spark-master
Solution: Ensure workers can reach the master. Check SPARK_MASTER_HOST is set to spark-master (the hostname, not localhost).
Issue 2: Out of Memory Errors
Symptoms: Jobs fail with java.lang.OutOfMemoryError
Solution: Increase worker and executor memory:
environment:
- SPARK_WORKER_MEMORY=8g
And when submitting:
--executor-memory 4G --driver-memory 2G
Issue 3: Port Already in Use
Symptoms: Error starting userland proxy: bind: address already in use
Check what's using the port:
# Windows
netstat -ano | findstr :8080
# Linux/Mac
lsof -i :8080
Solution: Change port mapping:
ports:
- '8090:8080' # Use 8090 on host instead
Issue 4: Permission Denied on Mounted Volumes
Symptoms: Permission denied errors when reading/writing files
Solution: The Apache Spark Docker image runs as user spark (UID 185). On Linux:
sudo chown -R 185:185 ./spark-apps ./spark-data ./spark-events
Or set appropriate permissions:
chmod -R 777 ./spark-apps ./spark-data ./spark-events
Issue 5: Cannot Connect to Database
Symptoms: No suitable driver found or connection timeout
Solutions:
- Ensure the JDBC driver is in the /opt/spark/jars directory
- Verify the containers are on the same network
- Use the container hostname: jdbc:postgresql://postgres-db:5432/
- Check that the database is ready: docker logs postgres-db
- Add the --jars flag when submitting: --jars /opt/spark/jars/postgresql-42.7.1.jar
Issue 6: Container Exits Immediately
Symptoms: Spark containers stop right after starting
Solution: Apache Spark Docker images need SPARK_NO_DAEMONIZE=true to run in foreground:
environment:
- SPARK_NO_DAEMONIZE=true
Debugging Commands
View real-time logs:
docker logs -f spark-worker
Access container shell:
docker exec -it spark-master bash
Inspect container:
docker inspect spark-master
Test connectivity:
docker exec spark-worker curl spark-master:8080
Check resource usage:
docker stats
Advanced Spark Submit Options
Configuring Resources
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--deploy-mode client \
--executor-memory 4G \
--executor-cores 2 \
--total-executor-cores 8 \
--driver-memory 2G \
--conf spark.sql.shuffle.partitions=100 \
--conf spark.default.parallelism=100 \
/opt/spark-apps/my_app.py
With Python Dependencies
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--py-files /opt/spark-apps/dependencies.zip \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
/opt/spark-apps/streaming_app.py
With External JARs
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--jars /opt/spark/jars/postgresql-42.7.1.jar,/opt/spark/jars/mysql-connector-java-8.0.30.jar \
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
/opt/spark-apps/etl_job.py
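When you cannot change the spark-submit command itself, most of the flags above have configuration equivalents that can be set on the session builder. A sketch mirroring the resource flags (in standalone mode, --total-executor-cores corresponds to spark.cores.max; when submitting with spark-submit, driver memory is best left to --driver-memory since the driver JVM is already running by the time this code executes):
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("ConfiguredApp") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .config("spark.cores.max", "8") \
    .config("spark.sql.shuffle.partitions", "100") \
    .config("spark.default.parallelism", "100") \
    .config("spark.jars", "/opt/spark/jars/postgresql-42.7.1.jar") \
    .getOrCreate()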
Connecting from External PySpark Applications
You can connect to your Docker Spark cluster from Python scripts running on your host machine.
Install PySpark Locally
pip install pyspark==3.5.0
Example External Connection
from pyspark.sql import SparkSession
# Connect to Docker Spark cluster
spark = SparkSession.builder \
.appName("ExternalApp") \
.master("spark://localhost:7077") \
.config("spark.executor.memory", "2g") \
.config("spark.executor.cores", "2") \
.config("spark.cores.max", "4") \
.getOrCreate()
# Your Spark code here
df = spark.createDataFrame([
(1, "Alice"),
(2, "Bob"),
(3, "Charlie")
], ["id", "name"])
df.show()
spark.stop()
Note: Ensure port 7077 is exposed in your docker-compose file.
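One caveat with this setup: in client mode the executors open connections back to the driver on your host, so exposing port 7077 alone may not be enough. If the application hangs after connecting to the master, advertise an address the containers can reach and pin the driver ports. A sketch under that assumption, where 192.168.1.50 is a placeholder for your machine's LAN IP and 7078/7079 are arbitrary port choices that must be reachable from the containers:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("ExternalApp")
    .master("spark://localhost:7077")
    # Address the driver advertises to executors; must be reachable from inside Docker
    .config("spark.driver.host", "192.168.1.50")
    .config("spark.driver.bindAddress", "0.0.0.0")
    # Pin the normally random driver ports so they are predictable
    .config("spark.driver.port", "7078")
    .config("spark.blockManager.port", "7079")
    .getOrCreate()
)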
Real-World Example: ETL Pipeline
Let's build a complete ETL pipeline that:
- Reads data from CSV
- Transforms and enriches it
- Writes to PostgreSQL
- Generates a summary report
Sample Input Data
Create spark-data/input/transactions.csv:
transaction_id,customer_id,product_id,quantity,price,transaction_date
1001,101,501,2,29.99,2024-01-15
1002,102,502,1,149.99,2024-01-15
1003,101,503,3,19.99,2024-01-16
1004,103,501,1,29.99,2024-01-16
1005,104,504,2,79.99,2024-01-17
1006,102,505,1,299.99,2024-01-17
1007,105,502,2,149.99,2024-01-18
ETL Application
Create spark-apps/etl_pipeline.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
col, sum as _sum, count, avg, max as _max,
min as _min, round as _round, current_timestamp, lit
)
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, DateType
from datetime import datetime
# Database configuration
jdbc_url = "jdbc:postgresql://postgres-db:5432/analytics"
db_properties = {
"user": "spark_user",
"password": "spark_password",
"driver": "org.postgresql.Driver"
}
# Initialize Spark
spark = SparkSession.builder \
.appName("ETL_Pipeline") \
.master("spark://spark-master:7077") \
.config("spark.jars", "/opt/spark/jars/postgresql-42.7.1.jar") \
.getOrCreate()
print("=" * 60)
print("ETL PIPELINE STARTED")
print(f"Timestamp: {datetime.now()}")
print("=" * 60)
# EXTRACT: Read CSV data
print("\\n[EXTRACT] Reading transaction data from CSV...")
schema = StructType([
StructField("transaction_id", IntegerType(), True),
StructField("customer_id", IntegerType(), True),
StructField("product_id", IntegerType(), True),
StructField("quantity", IntegerType(), True),
StructField("price", DoubleType(), True),
StructField("transaction_date", DateType(), True)
])
transactions = spark.read \
.option("header", "true") \
.option("dateFormat", "yyyy-MM-dd") \
.schema(schema) \
.csv("/opt/spark-data/input/transactions.csv")
print(f"Loaded {transactions.count()} transactions")
transactions.show(5)
# TRANSFORM: Calculate revenue and add metadata
print("\\n[TRANSFORM] Calculating revenue and enriching data...")
enriched_transactions = transactions.withColumn(
"revenue", col("quantity") * col("price")
).withColumn(
"processed_at", current_timestamp()
).withColumn(
"processing_batch", lit("batch_2024_01")
)
enriched_transactions.show(5)
# TRANSFORM: Generate customer summary
print("\\n[TRANSFORM] Creating customer summary...")
customer_summary = enriched_transactions.groupBy("customer_id").agg(
count("transaction_id").alias("total_transactions"),
_sum("quantity").alias("total_items_purchased"),
_round(_sum("revenue"), 2).alias("total_revenue"),
_round(avg("revenue"), 2).alias("avg_transaction_value"),
_max("transaction_date").alias("last_purchase_date"),
_min("transaction_date").alias("first_purchase_date")
).orderBy(col("total_revenue").desc())
customer_summary.show()
# TRANSFORM: Product performance
print("\\n[TRANSFORM] Analyzing product performance...")
product_summary = enriched_transactions.groupBy("product_id").agg(
count("transaction_id").alias("times_purchased"),
_sum("quantity").alias("total_quantity_sold"),
_round(_sum("revenue"), 2).alias("total_revenue"),
_round(avg("price"), 2).alias("avg_price")
).orderBy(col("total_revenue").desc())
product_summary.show()
# LOAD: Write to PostgreSQL
print("\\n[LOAD] Writing data to PostgreSQL...")
# Write enriched transactions
enriched_transactions.write.jdbc(
url=jdbc_url,
table="enriched_transactions",
mode="overwrite",
properties=db_properties
)
print("✓ Enriched transactions written")
# Write customer summary
customer_summary.write.jdbc(
url=jdbc_url,
table="customer_summary",
mode="overwrite",
properties=db_properties
)
print("✓ Customer summary written")
# Write product summary
product_summary.write.jdbc(
url=jdbc_url,
table="product_summary",
mode="overwrite",
properties=db_properties
)
print("✓ Product summary written")
# Generate report
print("\\n" + "=" * 60)
print("ETL PIPELINE SUMMARY REPORT")
print("=" * 60)
print(f"Total Transactions Processed: {transactions.count()}")
print(f"Unique Customers: {transactions.select('customer_id').distinct().count()}")
print(f"Unique Products: {transactions.select('product_id').distinct().count()}")
print(f"Total Revenue: ${enriched_transactions.agg(_sum('revenue')).collect()[0][0]:.2f}")
print("=" * 60)
spark.stop()
print("\\nETL Pipeline completed successfully!")
Run the Complete Pipeline
docker exec -it spark-master /opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--jars /opt/spark/jars/postgresql-42.7.1.jar \
--executor-memory 2G \
--total-executor-cores 4 \
/opt/spark-apps/etl_pipeline.py
Verify Results
docker exec -it postgres-db psql -U spark_user -d analytics
# Inside PostgreSQL
\dt -- List tables
SELECT * FROM customer_summary;
SELECT * FROM product_summary;
Best Practices Summary
Development
- ✅ Use official Apache Spark images with specific version tags
- ✅ Start small (1-2 workers) for development
- ✅ Mount code as volumes for easy iteration
- ✅ Use --scale for quick worker adjustments
- ✅ Set SPARK_NO_DAEMONIZE=true for proper container execution
Production
- ✅ Enable event logging and History Server
- ✅ Configure resource limits (CPU/memory)
- ✅ Use health checks and restart policies
- ✅ Use specific Spark versions (never use latest)
- ✅ Monitor resource usage with docker stats
Performance
- ✅ Tune executor memory and cores based on workload
- ✅ Configure appropriate shuffle partitions
- ✅ Use Kryo serialization for better performance
- ✅ Enable compression for shuffle and event logs
- ✅ Mount configuration files via volumes
Debugging
- ✅ Always check logs first: docker logs -f spark-worker
- ✅ Use the Spark UI to monitor job progress
- ✅ Verify network connectivity between containers
- ✅ Test database connections before running jobs
- ✅ Check file permissions (Apache images use UID 185)
Cleanup and Maintenance
Stop Services
docker-compose down
Stop and Remove Volumes
docker-compose down -v
Remove Event Logs
rm -rf spark-events/*
Clean Up Docker Resources
# Remove stopped containers
docker container prune
# Remove unused images
docker image prune -a
# Remove unused volumes
docker volume prune
Conclusion
You now have a complete understanding of running Apache Spark in Docker using official Apache images, from basic setups to production-ready clusters with monitoring, database integration, and real-world ETL pipelines.
Key Takeaways
- Official Images: Using Apache's official Spark images ensures compatibility and support
- Quick Start: Get Spark running in minutes with Docker Compose
- Scalability: Easily scale workers for distributed processing
- Data Integration: Connect to databases and process files seamlessly
- Monitoring: Use History Server and Spark UI for job insights
- Production Ready: Apply best practices for reliable deployments
Key Differences from Bitnami Images
- Apache images use /opt/spark/bin/spark-class for launching services
- Require the SPARK_NO_DAEMONIZE=true environment variable
- Run as user spark (UID 185) instead of root
- Less opinionated configuration; more manual setup required
- Closer to the official Apache Spark distribution
Next Steps
- Explore PySpark Tutorial for more PySpark concepts
- Learn about Spark performance tuning
- Set up streaming pipelines with Kafka and Spark
- Deploy to production with Kubernetes
Additional Resources
- Official Apache Spark Docker Documentation
- Apache Spark GitHub Repository
- Apache Spark Documentation
- Docker Compose Reference
Happy Spark processing! 🚀