Apache SPARK Up and Running FAST with Docker

Introduction

Running Apache Spark in Docker containers provides an isolated, reproducible environment perfect for development, testing, and even production workloads. This comprehensive guide uses the official Apache Spark Docker images and walks you through everything from a simple single-node setup to a full production-ready cluster with monitoring, security, and database integration.

What You'll Learn

  • Setting up a Spark cluster with Docker Compose using official Apache images
  • Configuring multi-worker distributed processing
  • Mounting volumes for data persistence
  • Connecting to databases (PostgreSQL, MySQL)
  • Submitting and running PySpark applications
  • Production deployment with monitoring and security
  • Troubleshooting common Docker-Spark issues

Prerequisites

  • Docker Engine 20.10+ installed
  • Docker Compose 2.0+ installed
  • Basic understanding of Apache Spark concepts
  • 8GB+ RAM available for Docker
  • Familiarity with command line operations

Quick Start: Single-Node Spark Cluster

Let's start with the simplest possible Spark setup using official Apache images.

Step 1: Create Project Directory

mkdir spark-docker-tutorial
cd spark-docker-tutorial

Step 2: Create Docker Compose File

Create a file named docker-compose.yml:

version: '3.8'

services:
  spark-master:
    image: apache/spark:3.5.0
    container_name: spark-master
    hostname: spark-master
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - '8080:8080' # Master Web UI
      - '7077:7077' # Master Port
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data

  spark-worker:
    image: apache/spark:3.5.0
    container_name: spark-worker
    hostname: spark-worker
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2g
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - '8081:8081' # Worker Web UI
    depends_on:
      - spark-master
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data

Step 3: Start the Cluster

docker-compose up -d

Expected output:

Creating network "spark-docker-tutorial_default" with the default driver
Creating spark-master ... done
Creating spark-worker ... done

Step 4: Verify the Setup

Check if containers are running:

docker-compose ps

You should see:

     Name                   Command               State                    Ports
--------------------------------------------------------------------------------------------------
spark-master    /opt/spark/bin/spark-clas...   Up      0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp
spark-worker    /opt/spark/bin/spark-clas...   Up      0.0.0.0:8081->8081/tcp

Step 5: Access the Spark Web UI

Open your browser and navigate to: http://localhost:8080

You should see the Spark Master UI showing:

  • 1 worker connected
  • 2 cores available
  • 2GB memory available

Congratulations! You now have a working Spark cluster using official Apache images.
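
If you'd rather verify from a script, the standalone Master also reports its state as JSON on the web UI port (the /json/ endpoint). A minimal sketch, assuming the port mapping from the compose file above; the field names follow the Master's JSON output:

import json
import urllib.request

# Query the Spark Master's JSON status endpoint (published on localhost:8080 above).
with urllib.request.urlopen("http://localhost:8080/json/") as response:
    status = json.load(response)

print("Status :", status.get("status"))
print("Workers:", len(status.get("workers", [])))
print("Cores  :", status.get("cores"))
print("Memory :", status.get("memory"), "MB")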

Running Your First PySpark Application

Let's run a simple PySpark application to verify everything works.

Create Application Directory

mkdir -p spark-apps

Create a Test Application

Create spark-apps/word_count.py:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

# Create Spark session
spark = SparkSession.builder \
    .appName("WordCount") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

# Sample data
data = [
    ("Apache Spark is a unified analytics engine",),
    ("Spark provides high-level APIs in Java, Scala, Python and R",),
    ("Spark also supports a rich set of higher-level tools",)
]

# Create DataFrame
df = spark.createDataFrame(data, ["text"])

# Perform word count
word_count = df.select(
    explode(split(col("text"), " ")).alias("word")
).groupBy("word") \
 .count() \
 .orderBy(col("count").desc())

# Display results
print("\\n=== WORD COUNT RESULTS ===")
word_count.show(20, truncate=False)

# Stop Spark session
spark.stop()

Submit the Application

docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  /opt/spark-apps/word_count.py

You should see output like:

=== WORD COUNT RESULTS ===
+-------------+-----+
|word         |count|
+-------------+-----+
|Spark        |3    |
|a            |2    |
|in           |1    |
|Apache       |1    |
|unified      |1    |
|analytics    |1    |
|engine       |1    |
+-------------+-----+

Multi-Worker Setup for Distributed Processing

For real-world data processing, you need multiple workers to distribute the workload.

Scaling Workers Dynamically

The easiest way to add workers is to scale the worker service. Note that scaling only works if the worker service has no fixed container_name and no published host port (both would collide across replicas), so remove those two lines from the quick-start compose file first, or use the static configuration in the next section:

docker-compose up -d --scale spark-worker=3

This starts 3 worker instances. Check the Spark UI at http://localhost:8080 to see all 3 workers.

Static Multi-Worker Configuration

For better control and unique naming, define workers explicitly:

version: '3.8'

services:
  spark-master:
    image: apache/spark:3.5.0
    container_name: spark-master
    hostname: spark-master
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
    networks:
      - spark-network

  spark-worker-1:
    image: apache/spark:3.5.0
    container_name: spark-worker-1
    hostname: spark-worker-1
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=4g
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - '8081:8081'
    depends_on:
      - spark-master
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
    networks:
      - spark-network

  spark-worker-2:
    image: apache/spark:3.5.0
    container_name: spark-worker-2
    hostname: spark-worker-2
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=4g
      - SPARK_WORKER_WEBUI_PORT=8082
    ports:
      - '8082:8082'
    depends_on:
      - spark-master
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
    networks:
      - spark-network

  spark-worker-3:
    image: apache/spark:3.5.0
    container_name: spark-worker-3
    hostname: spark-worker-3
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=4g
      - SPARK_WORKER_WEBUI_PORT=8083
    ports:
      - '8083:8083'
    depends_on:
      - spark-master
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
    networks:
      - spark-network

networks:
  spark-network:
    driver: bridge

Testing Distributed Processing

Create spark-apps/parallel_processing.py:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rand
import time

spark = SparkSession.builder \
    .appName("ParallelProcessing") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.instances", "3") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

# Create large dataset
print("Creating dataset with 10 million rows...")
df = spark.range(0, 10000000).withColumn("random_value", rand())

# Repartition to ensure work is distributed
df = df.repartition(12)  # 3 workers * 2 cores * 2 tasks per core

# Perform computation
print("Starting distributed computation...")
start_time = time.time()

result = df.groupBy((col("id") % 100).alias("partition")) \
    .agg({"random_value": "avg", "id": "count"}) \
    .orderBy("partition")

# Trigger action
count = result.count()
end_time = time.time()

print(f"\\nProcessed {count} partitions in {end_time - start_time:.2f} seconds")
print("\\nSample results:")
result.show(10)

# Check partition distribution
print("\\nNumber of partitions:", df.rdd.getNumPartitions())

spark.stop()

Run it:

docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --executor-memory 2G \
  --total-executor-cores 6 \
  /opt/spark-apps/parallel_processing.py

Watch the Spark UI to see tasks distributed across workers!
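
You can also confirm the row distribution from inside the job itself using Spark's built-in partition id function. A small sketch that could be appended to parallel_processing.py just before spark.stop():

from pyspark.sql.functions import spark_partition_id

# Count rows per partition; roughly equal counts mean the 12 partitions
# created above are being processed evenly across the workers.
rows_per_partition = df.groupBy(spark_partition_id().alias("partition_id")).count()
rows_per_partition.orderBy("partition_id").show(12)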

Working with Data: File Handling

Real Spark applications need to read and write data files.

Create Data Directory Structure

mkdir -p spark-data/input spark-data/output

Sample CSV Data

Create spark-data/input/sales_data.csv:

date,product,category,quantity,price
2024-01-01,Laptop,Electronics,5,999.99
2024-01-01,Mouse,Electronics,20,29.99
2024-01-01,Desk,Furniture,3,299.99
2024-01-02,Chair,Furniture,8,149.99
2024-01-02,Keyboard,Electronics,15,79.99
2024-01-02,Monitor,Electronics,10,399.99
2024-01-03,Laptop,Electronics,3,999.99
2024-01-03,Desk,Furniture,5,299.99

Data Processing Application

Create spark-apps/sales_analysis.py:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, avg, count, round as _round
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

spark = SparkSession.builder \
    .appName("SalesAnalysis") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

# Define schema for better performance
schema = StructType([
    StructField("date", DateType(), True),
    StructField("product", StringType(), True),
    StructField("category", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", DoubleType(), True)
])

# Read CSV data
print("Reading sales data...")
sales_df = spark.read \
    .option("header", "true") \
    .option("dateFormat", "yyyy-MM-dd") \
    .schema(schema) \
    .csv("/opt/spark-data/input/sales_data.csv")

# Calculate total revenue per product
sales_df = sales_df.withColumn("revenue", col("quantity") * col("price"))

print("\\n=== TOTAL SALES BY PRODUCT ===")
product_sales = sales_df.groupBy("product") \
    .agg(
        _sum("quantity").alias("total_quantity"),
        _round(_sum("revenue"), 2).alias("total_revenue")
    ) \
    .orderBy(col("total_revenue").desc())

product_sales.show()

# Category analysis
print("\\n=== SALES BY CATEGORY ===")
category_sales = sales_df.groupBy("category") \
    .agg(
        count("*").alias("transactions"),
        _sum("quantity").alias("total_items"),
        _round(_sum("revenue"), 2).alias("total_revenue"),
        _round(avg("revenue"), 2).alias("avg_transaction_value")
    ) \
    .orderBy(col("total_revenue").desc())

category_sales.show()

# Save results
print("\\nSaving results to output directory...")
product_sales.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("/opt/spark-data/output/product_sales")

category_sales.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("/opt/spark-data/output/category_sales")

print("\\nAnalysis complete! Check spark-data/output/ directory for results.")

spark.stop()

Run the Analysis

docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  /opt/spark-apps/sales_analysis.py

Check results:

ls -la spark-data/output/
cat spark-data/output/product_sales/part-*.csv
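
CSV works for small samples, but columnar formats are usually a better fit for anything larger. As a sketch of the alternative (the output path is illustrative), the same results could be written and re-read as Parquet:

# Write the product summary as Parquet instead of CSV (illustrative path).
product_sales.write.mode("overwrite").parquet("/opt/spark-data/output/product_sales_parquet")

# Parquet stores the schema, so reading it back needs no header or schema options.
reloaded = spark.read.parquet("/opt/spark-data/output/product_sales_parquet")
reloaded.show()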

Database Integration: PostgreSQL Example

Let's connect Spark to a PostgreSQL database - a common real-world scenario.

Complete Database Setup

Create docker-compose-with-db.yml:

version: '3.8'

services:
  postgres:
    image: postgres:15
    container_name: postgres-db
    hostname: postgres-db
    environment:
      POSTGRES_DB: analytics
      POSTGRES_USER: spark_user
      POSTGRES_PASSWORD: spark_password
    ports:
      - '5432:5432'
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-db.sql:/docker-entrypoint-initdb.d/init-db.sql
    networks:
      - spark-network

  spark-master:
    image: apache/spark:3.5.0
    container_name: spark-master
    hostname: spark-master
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
      - ./jars/postgresql-42.7.1.jar:/opt/spark/jars/postgresql-42.7.1.jar
    networks:
      - spark-network

  spark-worker:
    image: apache/spark:3.5.0
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=4g
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
      - ./jars/postgresql-42.7.1.jar:/opt/spark/jars/postgresql-42.7.1.jar
    depends_on:
      - spark-master
      - postgres
    networks:
      - spark-network
    deploy:
      replicas: 2

networks:
  spark-network:
    driver: bridge

volumes:
  postgres_data:

Initialize Database

Create init-db.sql:

CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100),
    city VARCHAR(50),
    country VARCHAR(50),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    order_date DATE,
    total_amount DECIMAL(10, 2),
    status VARCHAR(20)
);

-- Insert sample data
INSERT INTO customers (first_name, last_name, email, city, country) VALUES
('John', 'Doe', '[email protected]', 'New York', 'USA'),
('Jane', 'Smith', '[email protected]', 'London', 'UK'),
('Carlos', 'Rodriguez', '[email protected]', 'Madrid', 'Spain'),
('Yuki', 'Tanaka', '[email protected]', 'Tokyo', 'Japan'),
('Emma', 'Brown', '[email protected]', 'Sydney', 'Australia');

INSERT INTO orders (customer_id, order_date, total_amount, status) VALUES
(1, '2024-01-15', 1500.00, 'completed'),
(1, '2024-02-20', 750.50, 'completed'),
(2, '2024-01-10', 2200.00, 'completed'),
(3, '2024-02-05', 1100.75, 'pending'),
(4, '2024-01-25', 890.00, 'completed'),
(5, '2024-02-15', 1650.25, 'completed'),
(2, '2024-03-01', 3200.00, 'shipped');

Download JDBC Driver

mkdir -p jars
cd jars
wget https://jdbc.postgresql.org/download/postgresql-42.7.1.jar
cd ..

PySpark Database Application

Create spark-apps/db_analysis.py:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, count, avg, round as _round

# Database connection properties
db_properties = {
    "user": "spark_user",
    "password": "spark_password",
    "driver": "org.postgresql.Driver"
}

jdbc_url = "jdbc:postgresql://postgres-db:5432/analytics"

# Create Spark session
spark = SparkSession.builder \
    .appName("DatabaseAnalysis") \
    .master("spark://spark-master:7077") \
    .config("spark.jars", "/opt/spark/jars/postgresql-42.7.1.jar") \
    .getOrCreate()

print("\\n=== READING DATA FROM POSTGRESQL ===")

# Read customers table
customers_df = spark.read.jdbc(
    url=jdbc_url,
    table="customers",
    properties=db_properties
)

print(f"Total customers: {customers_df.count()}")
customers_df.show()

# Read orders table
orders_df = spark.read.jdbc(
    url=jdbc_url,
    table="orders",
    properties=db_properties
)

print(f"\\nTotal orders: {orders_df.count()}")
orders_df.show()

# Perform analysis: Customer order summary
print("\\n=== CUSTOMER ORDER SUMMARY ===")
customer_summary = customers_df.join(
    orders_df,
    customers_df.customer_id == orders_df.customer_id,
    "left"
).groupBy(
    customers_df.customer_id,
    "first_name",
    "last_name",
    "country"
).agg(
    count("order_id").alias("total_orders"),
    _round(_sum("total_amount"), 2).alias("total_spent"),
    _round(avg("total_amount"), 2).alias("avg_order_value")
).orderBy(col("total_spent").desc())

customer_summary.show()

# Analyze by country
print("\\n=== SALES BY COUNTRY ===")
country_sales = customers_df.join(
    orders_df,
    customers_df.customer_id == orders_df.customer_id
).groupBy("country").agg(
    count("order_id").alias("total_orders"),
    _round(_sum("total_amount"), 2).alias("total_revenue")
).orderBy(col("total_revenue").desc())

country_sales.show()

# Write results back to database
print("\\n=== WRITING RESULTS TO DATABASE ===")
customer_summary.write.jdbc(
    url=jdbc_url,
    table="customer_analytics",
    mode="overwrite",
    properties=db_properties
)

print("Analysis complete! Results saved to 'customer_analytics' table.")

spark.stop()

Start Everything and Run

# Start services
docker-compose -f docker-compose-with-db.yml up -d

# Wait for PostgreSQL to be ready (about 10 seconds)
sleep 10

# Run the analysis
docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --jars /opt/spark/jars/postgresql-42.7.1.jar \
  /opt/spark-apps/db_analysis.py

Verify Results in PostgreSQL

docker exec -it postgres-db psql -U spark_user -d analytics -c "SELECT * FROM customer_analytics;"

For more database integration examples, see Connect PostgreSQL Database using PySpark.
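
Two JDBC read options are worth knowing for larger tables: pushing a query down to PostgreSQL instead of pulling a whole table, and splitting a read into parallel partitions. A sketch reusing jdbc_url and db_properties from db_analysis.py (the bounds below are illustrative):

# Pushdown: only completed orders are transferred from PostgreSQL.
completed_orders = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT * FROM orders WHERE status = 'completed') AS completed",
    properties=db_properties
)

# Partitioned read: Spark issues one query per partition over the order_id range,
# so the table is loaded by several executors in parallel.
orders_parallel = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "orders") \
    .option("user", db_properties["user"]) \
    .option("password", db_properties["password"]) \
    .option("driver", db_properties["driver"]) \
    .option("partitionColumn", "order_id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000") \
    .option("numPartitions", "4") \
    .load()

completed_orders.show()
orders_parallel.show()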

Spark History Server for Job Monitoring

The History Server lets you review completed applications.

Enhanced Docker Compose with History Server

version: '3.8'

services:
  spark-master:
    image: apache/spark:3.5.0
    container_name: spark-master
    hostname: spark-master
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-events:/opt/spark/spark-events
    networks:
      - spark-network

  spark-worker:
    image: apache/spark:3.5.0
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=4g
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-events:/opt/spark/spark-events
    depends_on:
      - spark-master
    networks:
      - spark-network
    deploy:
      replicas: 2

  spark-history:
    image: apache/spark:3.5.0
    container_name: spark-history
    hostname: spark-history
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer
    environment:
      - SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/opt/spark/spark-events
    ports:
      - '18080:18080'
    volumes:
      - ./spark-events:/opt/spark/spark-events
    depends_on:
      - spark-master
    networks:
      - spark-network

networks:
  spark-network:
    driver: bridge

Configure Event Logging

Create spark-defaults.conf (for it to take effect, mount it into the containers at /opt/spark/conf/spark-defaults.conf, as the production setup later in this guide does; the spark-submit example below passes the same settings with --conf flags instead):

spark.eventLog.enabled=true
spark.eventLog.dir=/opt/spark/spark-events
spark.history.fs.logDirectory=/opt/spark/spark-events

Create Events Directory

mkdir spark-events

Submit with Event Logging

docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/opt/spark/spark-events \
  /opt/spark-apps/word_count.py
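
Alternatively, event logging can be enabled from inside the application rather than with --conf flags; a sketch of the relevant SparkSession options (the events directory must be the one mounted on the master, workers, and History Server):

from pyspark.sql import SparkSession

# Enable event logging in code; logs land in the shared spark-events volume
# so the History Server can pick them up.
spark = SparkSession.builder \
    .appName("LoggedApp") \
    .master("spark://spark-master:7077") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "/opt/spark/spark-events") \
    .getOrCreate()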

Access History Server

After running applications, access the History Server at http://localhost:18080 to see:

  • Completed jobs
  • Execution timelines
  • Stage details
  • Executor metrics

Production-Ready Configuration

Here's a complete production configuration with all best practices using official Apache Spark images.

docker-compose-production.yml

version: '3.8'

services:
  spark-master:
    image: apache/spark:3.5.0
    container_name: spark-master
    hostname: spark-master
    command: >
      bash -c "/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master"
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_NO_DAEMONIZE=true
    ports:
      - '8080:8080'
      - '7077:7077'
      - '4040-4050:4040-4050'
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
      - ./spark-events:/opt/spark/spark-events
      - ./jars/postgresql-42.7.1.jar:/opt/spark/jars/postgresql-42.7.1.jar
      - ./conf:/opt/spark/conf
    networks:
      - spark-network
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '2'
          memory: 4G
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8080']
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped

  spark-worker-1:
    image: apache/spark:3.5.0
    container_name: spark-worker-1
    hostname: spark-worker-1
    command: >
      bash -c "/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker
      spark://spark-master:7077"
    environment:
      - SPARK_WORKER_CORES=4
      - SPARK_WORKER_MEMORY=6g
      - SPARK_WORKER_WEBUI_PORT=8081
      - SPARK_NO_DAEMONIZE=true
    ports:
      - '8081:8081'
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
      - ./spark-events:/opt/spark/spark-events
      - ./jars/postgresql-42.7.1.jar:/opt/spark/jars/postgresql-42.7.1.jar
      - ./conf:/opt/spark/conf
    networks:
      - spark-network
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '2'
          memory: 4G
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    depends_on:
      spark-master:
        condition: service_healthy
    restart: unless-stopped

  spark-worker-2:
    image: apache/spark:3.5.0
    container_name: spark-worker-2
    hostname: spark-worker-2
    command: >
      bash -c "/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker
      spark://spark-master:7077"
    environment:
      - SPARK_WORKER_CORES=4
      - SPARK_WORKER_MEMORY=6g
      - SPARK_WORKER_WEBUI_PORT=8082
      - SPARK_NO_DAEMONIZE=true
    ports:
      - '8082:8082'
    volumes:
      - ./spark-apps:/opt/spark-apps
      - ./spark-data:/opt/spark-data
      - ./spark-events:/opt/spark/spark-events
      - ./jars/postgresql-42.7.1.jar:/opt/spark/jars/postgresql-42.7.1.jar
      - ./conf:/opt/spark/conf
    networks:
      - spark-network
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          cpus: '2'
          memory: 4G
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    depends_on:
      spark-master:
        condition: service_healthy
    restart: unless-stopped

  spark-history:
    image: apache/spark:3.5.0
    container_name: spark-history
    hostname: spark-history
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer
    environment:
      - SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/opt/spark/spark-events -Dspark.history.ui.port=18080
      - SPARK_NO_DAEMONIZE=true
    ports:
      - '18080:18080'
    volumes:
      - ./spark-events:/opt/spark/spark-events
    networks:
      - spark-network
    depends_on:
      - spark-master
    restart: unless-stopped

networks:
  spark-network:
    driver: bridge
    name: spark-network

Performance Tuning Configuration

Create conf/spark-defaults.conf:

# Event Logging
spark.eventLog.enabled=true
spark.eventLog.dir=/opt/spark/spark-events

# Serialization
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m

# Memory management
spark.memory.fraction=0.8
spark.memory.storageFraction=0.3

# Shuffle optimization
spark.shuffle.compress=true
spark.shuffle.spill.compress=true
spark.shuffle.file.buffer=64k

# Dynamic allocation
spark.dynamicAllocation.enabled=false
spark.shuffle.service.enabled=false

# UI
spark.ui.retainedJobs=100
spark.ui.retainedStages=100
spark.sql.ui.retainedExecutions=100

# Compression
spark.eventLog.compress=true
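
To check which of these settings an application actually picked up, read the effective configuration back at runtime; a quick sketch to drop into any of the PySpark scripts above:

# Print the effective configuration (spark-defaults.conf, --conf flags,
# and builder .config() calls all end up here).
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key} = {value}")

# Or look up a single property, with a fallback if it is unset.
print(spark.conf.get("spark.serializer", "not set"))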

Troubleshooting Common Issues

Issue 1: Workers Not Connecting

Symptoms: Workers don't appear in Master UI

Check logs:

docker logs spark-worker

Verify network connectivity:

docker exec spark-worker ping spark-master

Solution: Ensure workers can reach the master over the Docker network and that the worker command points at spark://spark-master:7077 (the container hostname, not localhost), matching SPARK_MASTER_HOST on the master service.

Issue 2: Out of Memory Errors

Symptoms: Jobs fail with java.lang.OutOfMemoryError

Solution: Increase worker and executor memory:

environment:
  - SPARK_WORKER_MEMORY=8g

And when submitting:

--executor-memory 4G --driver-memory 2G

Issue 3: Port Already in Use

Symptoms: Error starting userland proxy: bind: address already in use

Check what's using the port:

# Windows
netstat -ano | findstr :8080

# Linux/Mac
lsof -i :8080

Solution: Change port mapping:

ports:
  - '8090:8080' # Use 8090 on host instead

Issue 4: Permission Denied on Mounted Volumes

Symptoms: Permission denied errors when reading/writing files

Solution: The Apache Spark Docker image runs as user spark (UID 185). On Linux:

sudo chown -R 185:185 ./spark-apps ./spark-data ./spark-events

Or, for local development only, loosen permissions (too permissive for shared or production hosts):

chmod -R 777 ./spark-apps ./spark-data ./spark-events

Issue 5: Cannot Connect to Database

Symptoms: No suitable driver found or connection timeout

Solutions:

  1. Ensure JDBC driver is in /opt/spark/jars directory
  2. Verify containers are on the same network
  3. Use container hostname: jdbc:postgresql://postgres-db:5432/
  4. Check database is ready: docker logs postgres-db
  5. Add --jars flag when submitting:
    --jars /opt/spark/jars/postgresql-42.7.1.jar
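
A quick way to separate connection problems from application logic is a one-row smoke test; a sketch reusing the connection details from db_analysis.py:

# jdbc_smoke_test.py - exercises driver, network path, and credentials in one go.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("JdbcSmokeTest") \
    .master("spark://spark-master:7077") \
    .config("spark.jars", "/opt/spark/jars/postgresql-42.7.1.jar") \
    .getOrCreate()

# "SELECT 1" round-trips to PostgreSQL without touching any real table.
probe = spark.read.jdbc(
    url="jdbc:postgresql://postgres-db:5432/analytics",
    table="(SELECT 1 AS ok) AS probe",
    properties={
        "user": "spark_user",
        "password": "spark_password",
        "driver": "org.postgresql.Driver"
    }
)
probe.show()

spark.stop()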
    

Issue 6: Container Exits Immediately

Symptoms: Spark containers stop right after starting

Solution: Spark's sbin start scripts daemonize by default, so a container running them exits as soon as the script returns. Set SPARK_NO_DAEMONIZE=true to keep the process in the foreground (launching via spark-class directly, as in this guide, already runs in the foreground):

environment:
  - SPARK_NO_DAEMONIZE=true

Debugging Commands

View real-time logs:

docker logs -f spark-worker

Access container shell:

docker exec -it spark-master bash

Inspect container:

docker inspect spark-master

Test connectivity:

docker exec spark-worker curl spark-master:8080

Check resource usage:

docker stats

Advanced Spark Submit Options

Configuring Resources

docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --executor-memory 4G \
  --executor-cores 2 \
  --total-executor-cores 8 \
  --driver-memory 2G \
  --conf spark.sql.shuffle.partitions=100 \
  --conf spark.default.parallelism=100 \
  /opt/spark-apps/my_app.py
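
The same settings can also be made in application code via the SparkSession builder (per Spark's configuration precedence, values set in code override spark-submit flags, which override spark-defaults.conf). A sketch of the code-level equivalent; driver memory is omitted because it must be set before the driver JVM starts:

from pyspark.sql import SparkSession

# Resource settings expressed in code instead of spark-submit flags.
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .config("spark.cores.max", "8") \
    .config("spark.sql.shuffle.partitions", "100") \
    .config("spark.default.parallelism", "100") \
    .getOrCreate()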

With Python Dependencies

docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --py-files /opt/spark-apps/dependencies.zip \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
  /opt/spark-apps/streaming_app.py

With External JARs

docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --jars /opt/spark/jars/postgresql-42.7.1.jar,/opt/spark/jars/mysql-connector-java-8.0.30.jar \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
  /opt/spark-apps/etl_job.py

Connecting from External PySpark Applications

You can connect to your Docker Spark cluster from Python scripts running on your host machine.

Install PySpark Locally

pip install pyspark==3.5.0

Example External Connection

from pyspark.sql import SparkSession

# Connect to Docker Spark cluster
spark = SparkSession.builder \
    .appName("ExternalApp") \
    .master("spark://localhost:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.cores.max", "4") \
    .getOrCreate()

# Your Spark code here
df = spark.createDataFrame([
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie")
], ["id", "name"])

df.show()

spark.stop()

Note: Ensure port 7077 is published in your docker-compose file, and remember that the executors inside the containers must also be able to connect back to the driver on your host; see the sketch below if the application hangs after registering with the master.
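
When that reverse connection fails, setting the driver address explicitly usually resolves it; a hedged sketch with an illustrative address:

from pyspark.sql import SparkSession

# HOST_IP is a placeholder: use an address of your machine that the containers
# can route to (for example your LAN IP or the Docker bridge gateway).
HOST_IP = "192.168.1.50"

spark = SparkSession.builder \
    .appName("ExternalApp") \
    .master("spark://localhost:7077") \
    .config("spark.driver.host", HOST_IP) \
    .config("spark.driver.bindAddress", "0.0.0.0") \
    .getOrCreate()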

Real-World Example: ETL Pipeline

Let's build a complete ETL pipeline that:

  1. Reads data from CSV
  2. Transforms and enriches it
  3. Writes to PostgreSQL
  4. Generates a summary report

Sample Input Data

Create spark-data/input/transactions.csv:

transaction_id,customer_id,product_id,quantity,price,transaction_date
1001,101,501,2,29.99,2024-01-15
1002,102,502,1,149.99,2024-01-15
1003,101,503,3,19.99,2024-01-16
1004,103,501,1,29.99,2024-01-16
1005,104,504,2,79.99,2024-01-17
1006,102,505,1,299.99,2024-01-17
1007,105,502,2,149.99,2024-01-18

ETL Application

Create spark-apps/etl_pipeline.py:

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, sum as _sum, count, avg, max as _max,
    min as _min, round as _round, current_timestamp, lit
)
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, DateType
from datetime import datetime

# Database configuration
jdbc_url = "jdbc:postgresql://postgres-db:5432/analytics"
db_properties = {
    "user": "spark_user",
    "password": "spark_password",
    "driver": "org.postgresql.Driver"
}

# Initialize Spark
spark = SparkSession.builder \
    .appName("ETL_Pipeline") \
    .master("spark://spark-master:7077") \
    .config("spark.jars", "/opt/spark/jars/postgresql-42.7.1.jar") \
    .getOrCreate()

print("=" * 60)
print("ETL PIPELINE STARTED")
print(f"Timestamp: {datetime.now()}")
print("=" * 60)

# EXTRACT: Read CSV data
print("\\n[EXTRACT] Reading transaction data from CSV...")
schema = StructType([
    StructField("transaction_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", DoubleType(), True),
    StructField("transaction_date", DateType(), True)
])

transactions = spark.read \
    .option("header", "true") \
    .option("dateFormat", "yyyy-MM-dd") \
    .schema(schema) \
    .csv("/opt/spark-data/input/transactions.csv")

print(f"Loaded {transactions.count()} transactions")
transactions.show(5)

# TRANSFORM: Calculate revenue and add metadata
print("\\n[TRANSFORM] Calculating revenue and enriching data...")
enriched_transactions = transactions.withColumn(
    "revenue", col("quantity") * col("price")
).withColumn(
    "processed_at", current_timestamp()
).withColumn(
    "processing_batch", lit("batch_2024_01")
)

enriched_transactions.show(5)

# TRANSFORM: Generate customer summary
print("\\n[TRANSFORM] Creating customer summary...")
customer_summary = enriched_transactions.groupBy("customer_id").agg(
    count("transaction_id").alias("total_transactions"),
    _sum("quantity").alias("total_items_purchased"),
    _round(_sum("revenue"), 2).alias("total_revenue"),
    _round(avg("revenue"), 2).alias("avg_transaction_value"),
    _max("transaction_date").alias("last_purchase_date"),
    _min("transaction_date").alias("first_purchase_date")
).orderBy(col("total_revenue").desc())

customer_summary.show()

# TRANSFORM: Product performance
print("\\n[TRANSFORM] Analyzing product performance...")
product_summary = enriched_transactions.groupBy("product_id").agg(
    count("transaction_id").alias("times_purchased"),
    _sum("quantity").alias("total_quantity_sold"),
    _round(_sum("revenue"), 2).alias("total_revenue"),
    _round(avg("price"), 2).alias("avg_price")
).orderBy(col("total_revenue").desc())

product_summary.show()

# LOAD: Write to PostgreSQL
print("\\n[LOAD] Writing data to PostgreSQL...")

# Write enriched transactions
enriched_transactions.write.jdbc(
    url=jdbc_url,
    table="enriched_transactions",
    mode="overwrite",
    properties=db_properties
)
print("✓ Enriched transactions written")

# Write customer summary
customer_summary.write.jdbc(
    url=jdbc_url,
    table="customer_summary",
    mode="overwrite",
    properties=db_properties
)
print("✓ Customer summary written")

# Write product summary
product_summary.write.jdbc(
    url=jdbc_url,
    table="product_summary",
    mode="overwrite",
    properties=db_properties
)
print("✓ Product summary written")

# Generate report
print("\\n" + "=" * 60)
print("ETL PIPELINE SUMMARY REPORT")
print("=" * 60)
print(f"Total Transactions Processed: {transactions.count()}")
print(f"Unique Customers: {transactions.select('customer_id').distinct().count()}")
print(f"Unique Products: {transactions.select('product_id').distinct().count()}")
print(f"Total Revenue: ${enriched_transactions.agg(_sum('revenue')).collect()[0][0]:.2f}")
print("=" * 60)

spark.stop()
print("\\nETL Pipeline completed successfully!")

Run the Complete Pipeline

docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --jars /opt/spark/jars/postgresql-42.7.1.jar \
  --executor-memory 2G \
  --total-executor-cores 4 \
  /opt/spark-apps/etl_pipeline.py

Verify Results

docker exec -it postgres-db psql -U spark_user -d analytics

-- Inside PostgreSQL
\dt  -- List tables
SELECT * FROM customer_summary;
SELECT * FROM product_summary;

Best Practices Summary

Development

  • ✅ Use official Apache Spark images with specific version tags
  • ✅ Start small (1-2 workers) for development
  • ✅ Mount code as volumes for easy iteration
  • ✅ Use --scale for quick worker adjustments
  • ✅ Set SPARK_NO_DAEMONIZE=true for proper container execution

Production

  • ✅ Enable event logging and History Server
  • ✅ Configure resource limits (CPU/memory)
  • ✅ Use health checks and restart policies
  • ✅ Use specific Spark versions (never use latest)
  • ✅ Monitor resource usage with docker stats
  • ✅ Configure proper networking with custom networks

Performance

  • ✅ Tune executor memory and cores based on workload
  • ✅ Configure appropriate shuffle partitions
  • ✅ Use Kryo serialization for better performance
  • ✅ Enable compression for shuffle and event logs
  • ✅ Mount configuration files via volumes

Debugging

  • ✅ Always check logs first: docker logs -f spark-worker
  • ✅ Use Spark UI to monitor job progress
  • ✅ Verify network connectivity between containers
  • ✅ Test database connections before running jobs
  • ✅ Check file permissions (Apache images use UID 185)

Cleanup and Maintenance

Stop Services

docker-compose down

Stop and Remove Volumes

docker-compose down -v

Remove Event Logs

rm -rf spark-events/*

Clean Up Docker Resources

# Remove stopped containers
docker container prune

# Remove unused images
docker image prune -a

# Remove unused volumes
docker volume prune

Conclusion

You now have a complete understanding of running Apache Spark in Docker using official Apache images, from basic setups to production-ready clusters with monitoring, database integration, and real-world ETL pipelines.

Key Takeaways

  1. Official Images: Using Apache's official Spark images ensures compatibility and support
  2. Quick Start: Get Spark running in minutes with Docker Compose
  3. Scalability: Easily scale workers for distributed processing
  4. Data Integration: Connect to databases and process files seamlessly
  5. Monitoring: Use History Server and Spark UI for job insights
  6. Production Ready: Apply best practices for reliable deployments

Key Differences from Bitnami Images

  • Apache images use /opt/spark/bin/spark-class for launching services
  • Require SPARK_NO_DAEMONIZE=true environment variable
  • Run as the spark user (UID 185), so host-mounted volumes may need ownership or permission adjustments
  • Less opinionated configuration - more manual setup required
  • Closer to official Apache Spark distribution

Happy Spark processing! 🚀
