How to Install Apache Spark on a Local Machine using Windows

Introduction

Installing Apache Spark on Windows can be challenging due to compatibility issues and environment configuration requirements. This comprehensive guide walks you through every step of the installation process, from prerequisites to verification, with detailed troubleshooting tips for common errors.

By the end of this guide, you'll have a fully functional Apache Spark installation on your Windows machine, ready for local development and testing.

Understanding the Components

Before diving into installation, it's important to understand what you're installing:

Apache Spark: A unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R.

Hadoop Winutils: Windows utilities for Hadoop that Spark requires on Windows systems. Even though you're not installing Hadoop, Spark needs these utilities to function properly on Windows.

Java Development Kit (JDK): Spark is built on the JVM (Java Virtual Machine) and requires Java to run.

Python (Optional but Recommended): Required if you want to use PySpark, Spark's Python API.

Prerequisites

System Requirements

  • Operating System: Windows 10 or Windows 11 (64-bit)
  • RAM: Minimum 4GB (8GB or more recommended)
  • Disk Space: At least 5GB free space
  • Administrator Access: Required for setting environment variables

Required Software

1. Java Development Kit (JDK)

Apache Spark requires JDK 8, 11, or 17. JDK 11 is recommended for most use cases.

Download and Install JDK:

  1. Visit Oracle JDK Downloads or Adoptium (OpenJDK)
  2. Download JDK 11 (Windows x64 Installer)
  3. Run the installer with default settings
  4. Note the installation path (e.g., C:\Program Files\Java\jdk-11.0.18)

Verify Java Installation:

Open Command Prompt and run:

java -version

You should see output like:

java version "11.0.18" 2023-01-17 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.18+9-LTS-195)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.18+9-LTS-195, mixed mode)

Important Notes:

  • If you have multiple Java versions installed, Spark will use the version specified in JAVA_HOME
  • Do not install Java in paths with spaces (e.g., avoid C:\Program Files (x86)\)
  • If you must use a path with spaces, use the short name format (progra~1 or progra~2)

2. Python (Optional but Recommended)

If you plan to use PySpark, install Python 3.8 or higher.

Download and Install Python:

  1. Visit Python Downloads
  2. Download Python 3.11 (Windows installer - 64-bit)
  3. IMPORTANT: Check "Add Python to PATH" during installation
  4. Choose "Install Now" or customize to select installation directory

Verify Python Installation:

python --version
pip --version

Expected output:

Python 3.11.5
pip 23.2.1 from C:\Users\YourName\AppData\Local\Programs\Python\Python311\lib\site-packages\pip (python 3.11)

Install PySpark Package (Optional):

pip install pyspark

This installs the PySpark Python package, but you'll still need the full Spark distribution for local development as described in the rest of this guide.
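
To confirm the pip package itself imports correctly, a quick check like the one below works; it is only a sanity check of the Python package, and the version shown is an example:

# sanity_check.py - confirm the pip-installed PySpark package imports
import pyspark

print(pyspark.__version__)  # e.g., 3.5.0 - ideally matching the distribution you download below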

3. Hadoop Winutils

Spark on Windows requires Hadoop's native Windows utilities (winutils.exe and hadoop.dll).

Download Winutils:

  1. Visit the winutils GitHub repository
  2. Navigate to the Hadoop version matching your Spark distribution:
    • For Spark 3.5.x: Use hadoop-3.3.6
    • For Spark 3.4.x: Use hadoop-3.3.5
    • For Spark 3.3.x: Use hadoop-3.3.1
  3. Download both winutils.exe and hadoop.dll from the bin folder

Create Hadoop Directory Structure:

# Create directories
mkdir C:\hadoop\bin

# Move downloaded files to:
# C:\hadoop\bin\winutils.exe
# C:\hadoop\bin\hadoop.dll

Set Permissions (Important):

  1. Right-click on C:\hadoop\bin\winutils.exe
  2. Select "Properties" → "Security" tab
  3. Click "Edit" → "Add" → Enter "Everyone" → "OK"
  4. Check "Full control" and click "OK"
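
As an optional sanity check, a small Python snippet like the following (illustrative, assuming the C:\hadoop layout above) confirms both files ended up where Spark will look for them:

import os

# Check that winutils.exe and hadoop.dll are present under HADOOP_HOME\bin
hadoop_bin = os.path.join(os.environ.get("HADOOP_HOME", r"C:\hadoop"), "bin")
for name in ("winutils.exe", "hadoop.dll"):
    path = os.path.join(hadoop_bin, name)
    print(path, "-", "found" if os.path.exists(path) else "MISSING")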

Step 1: Download Apache Spark

Choose Your Spark Version

Visit the Apache Spark Downloads page.

Version Selection:

  • Latest Stable Release: Spark 3.5.0+ (recommended for new projects)
  • Package Type: "Pre-built for Apache Hadoop 3.3 and later"

Download Options:

  1. Direct Download (Recommended):

    • Click the suggested mirror link
    • Download spark-3.5.0-bin-hadoop3.tgz
  2. Command Line Download:

    # Using curl (if available)
    curl -O https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
    

Verify Download: Check the file size matches the website (approximately 400MB).

Step 2: Extract Apache Spark

Extract the Archive

Using 7-Zip (Recommended):

  1. Download and install 7-Zip
  2. Right-click on spark-3.5.0-bin-hadoop3.tgz
  3. Select "7-Zip" → "Extract Here"
  4. You'll get a .tar file - extract it again
  5. Final extracted folder: spark-3.5.0-bin-hadoop3

Using Windows Built-in Extraction:

  1. Rename .tgz to .tar.gz
  2. Right-click → "Extract All"
  3. May require extracting twice (tar.gz, then tar)

Move to Installation Directory

Create a clean installation path without spaces:

# Option 1: Simple path (Recommended)
# Do not create C:\spark first - move renames the extracted folder to C:\spark
move spark-3.5.0-bin-hadoop3 C:\spark

# Option 2: Versioned path
# Create only the parent folder; move then renames the extracted folder to 3.5.0
mkdir C:\spark
move spark-3.5.0-bin-hadoop3 C:\spark\3.5.0

Final Structure:

C:\spark\
├── bin\
│   ├── spark-shell.cmd
│   ├── spark-submit.cmd
│   ├── pyspark.cmd
│   └── ...
├── conf\
├── jars\
├── python\
└── ...

Step 3: Configure Environment Variables

Environment variables tell Windows where to find Spark, Java, and Hadoop.

Open Environment Variables Dialog

Method 1: GUI

  1. Right-click "This PC" or "Computer"
  2. Select "Properties"
  3. Click "Advanced system settings"
  4. Click "Environment Variables" button

Method 2: Command

  1. Press Win + R
  2. Type sysdm.cpl and press Enter
  3. Go to "Advanced" tab → "Environment Variables"

Method 3: Search

  1. Search for "Environment Variables" in Windows Search
  2. Select "Edit the system environment variables"

Create System Variables

Click "New" under "System variables" (not User variables) to create each:

JAVA_HOME

Variable name: JAVA_HOME
Variable value: C:\Program Files\Java\jdk-11.0.18

Important:

  • Use your actual JDK installation path
  • If path contains spaces, use short name format:
    C:\progra~1\Java\jdk-11.0.18  (for Program Files)
    C:\progra~2\Java\jdk-11.0.18  (for Program Files (x86))
    

Find Short Name:

dir /x C:\

HADOOP_HOME

Variable name: HADOOP_HOME
Variable value: C:\hadoop

SPARK_HOME

Variable name: SPARK_HOME
Variable value: C:\spark

Or if using versioned path:

Variable value: C:\spark\3.5.0

PYSPARK_PYTHON (Optional)

If you have multiple Python versions:

Variable name: PYSPARK_PYTHON
Variable value: C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe

Update PATH Variable

  1. Find "Path" in "System variables"
  2. Click "Edit"
  3. Click "New" and add each path (one per line):
%JAVA_HOME%\bin
%HADOOP_HOME%\bin
%SPARK_HOME%\bin
%SPARK_HOME%\python

Order matters: Ensure these paths appear before any other Java or Python installations.

Verify Environment Variables

Open a new Command Prompt (important - existing windows won't see new variables):

# Verify variables are set
echo %JAVA_HOME%
echo %HADOOP_HOME%
echo %SPARK_HOME%

# Verify binaries are accessible
where java
where spark-shell
where winutils

Expected output:

C:\Program Files\Java\jdk-11.0.18
C:\hadoop
C:\spark

C:\Program Files\Java\jdk-11.0.18\bin\java.exe
C:\spark\bin\spark-shell
C:\spark\bin\spark-shell.cmd
C:\hadoop\bin\winutils.exe

Step 4: Configure Spark (Optional)

Create Spark Configuration

Navigate to Spark's conf directory:

cd C:\spark\conf

Copy template configuration:

copy spark-defaults.conf.template spark-defaults.conf
copy log4j2.properties.template log4j2.properties

Edit Spark Defaults

Open spark-defaults.conf and add basic configurations:

# Application Properties
spark.app.name              MySparkApp
spark.master                local[*]

# Memory Configuration
spark.driver.memory         2g
spark.executor.memory       2g

# Python Configuration (if using PySpark)
spark.pyspark.python        python
spark.pyspark.driver.python python

# UI Configuration
spark.ui.port               4040

# Logging Configuration
spark.eventLog.enabled      false
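
The same settings can also be supplied per application from code. The sketch below is illustrative (not required if you use spark-defaults.conf) and shows how builder-level configuration overrides the file for that session:

from pyspark.sql import SparkSession

# Per-application overrides of spark-defaults.conf (illustrative values)
spark = (
    SparkSession.builder
    .appName("MySparkApp")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")
    .config("spark.ui.showConsoleProgress", "false")
    .getOrCreate()
)

print(spark.conf.get("spark.sql.shuffle.partitions"))  # 4
spark.stop()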

Configure Logging

Edit log4j2.properties to reduce verbosity:

# Change rootLogger level from INFO to WARN
rootLogger.level = warn

# Or for specific loggers
logger.spark.name = org.apache.spark
logger.spark.level = warn

Step 5: Test Spark Installation

Test Spark Shell (Scala)

Open Command Prompt and run:

spark-shell

Expected Output:

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Scala version 2.12.18 (Java HotSpot(TM) 64-Bit Server VM, Java 11.0.18)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Run Test Commands:

// Create a simple RDD
val data = 1 to 10
val rdd = sc.parallelize(data)

// Perform a transformation and action
val result = rdd.map(_ * 2).collect()
println(result.mkString(", "))

// Output: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20

Exit Spark Shell:

:quit

Test PySpark

Open Command Prompt and run:

pyspark

Expected Output:

Python 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Python version 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34)
Spark context Web UI available at http://localhost:4040
SparkSession available as 'spark'.

>>>

Run Test Commands:

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])

# Show the DataFrame
df.show()

# Output:
# +-------+---+
# |   name|age|
# +-------+---+
# |  Alice| 25|
# |    Bob| 30|
# |Charlie| 35|
# +-------+---+

# Perform a transformation
df.filter(df.age > 28).show()
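
If you also want to verify Spark SQL, an optional extra check like this (run in the same session; the view name is arbitrary) works:

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 28").show()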

Exit PySpark:

exit()

Access Spark Web UI

While Spark is running, open a browser and navigate to:

http://localhost:4040

You should see the Spark Web UI showing:

  • Active jobs
  • Stages
  • Storage
  • Environment configuration
  • Executors
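
If you prefer to read the address programmatically (useful when port 4040 was busy and Spark bound to the next free port), a running PySpark session exposes it:

# Inside an active pyspark session: print the bound Web UI address
print(spark.sparkContext.uiWebUrl)  # e.g., http://localhost:4040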

Step 6: Run a Spark Application

Create a Simple PySpark Script

Create a file test_spark.py:

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("WindowsSparkTest") \
    .master("local[*]") \
    .getOrCreate()

# Create sample data
data = [
    ("Data Engineering", 95),
    ("Machine Learning", 88),
    ("DevOps", 92),
    ("Cloud Computing", 90)
]

# Create DataFrame
df = spark.createDataFrame(data, ["Topic", "Score"])

# Show DataFrame
print("Original Data:")
df.show()

# Perform aggregations
print(f"Average Score: {df.agg({'Score': 'avg'}).collect()[0][0]:.2f}")
print(f"Max Score: {df.agg({'Score': 'max'}).collect()[0][0]}")
print(f"Min Score: {df.agg({'Score': 'min'}).collect()[0][0]}")

# Filter high scores
print("\nHigh Scores (>90):")
df.filter(df.Score > 90).show()

# Stop Spark session
spark.stop()
print("\nSpark application completed successfully!")

Run the Script

python test_spark.py

Or using spark-submit:

spark-submit test_spark.py

Expected Output:

Original Data:
+----------------+-----+
|           Topic|Score|
+----------------+-----+
|Data Engineering|   95|
|Machine Learning|   88|
|          DevOps|   92|
| Cloud Computing|   90|
+----------------+-----+

Average Score: 91.25
Max Score: 95
Min Score: 88

High Scores (>90):
+----------------+-----+
|           Topic|Score|
+----------------+-----+
|Data Engineering|   95|
|          DevOps|   92|
+----------------+-----+

Spark application completed successfully!

Troubleshooting Common Issues

Issue 1: "java is not recognized as an internal or external command"

Cause: JAVA_HOME not set or PATH not updated

Solution:

  1. Verify JAVA_HOME is set correctly
  2. Ensure %JAVA_HOME%\bin is in PATH
  3. Open a new Command Prompt window
  4. Run: echo %JAVA_HOME% and where java

Issue 2: "JAVA_HOME is set to an invalid directory"

Cause: JAVA_HOME points to wrong location

Solution:

# Find Java installation
where java

# Set JAVA_HOME to the directory ABOVE bin
# If java is at: C:\Program Files\Java\jdk-11.0.18\bin\java.exe
# Then JAVA_HOME should be: C:\Program Files\Java\jdk-11.0.18

Issue 3: "Could not find or load main class org.apache.spark.deploy.SparkSubmit"

Cause: SPARK_HOME not set or incorrect

Solution:

  1. Verify SPARK_HOME points to Spark installation directory
  2. Check that %SPARK_HOME%\bin\spark-shell.cmd exists
  3. Ensure no trailing slashes in SPARK_HOME path

Issue 4: "java.io.IOException: Could not locate Hadoop executable"

Cause: HADOOP_HOME not set or winutils.exe missing

Solution:

  1. Download winutils.exe and hadoop.dll
  2. Place in C:\hadoop\bin\
  3. Set HADOOP_HOME to C:\hadoop
  4. Verify: running where winutils should locate C:\hadoop\bin\winutils.exe

Issue 5: "java.io.FileNotFoundException: java.io.IOException: (null) entry in command string: null chmod 0644"

Cause: Missing hadoop.dll or incorrect permissions

Solution:

  1. Download hadoop.dll (same version as winutils.exe)
  2. Place in C:\hadoop\bin\hadoop.dll
  3. Set permissions on winutils.exe (Everyone → Full Control)
  4. Restart Command Prompt

Issue 6: "py4j.protocol.Py4JNetworkError: Answer from Java side is empty"

Cause: Python/Java version mismatch or port conflict

Solution:

  1. Verify Java version is 8, 11, or 17
  2. Check if port 4040 is already in use:
    netstat -ano | findstr 4040
    
  3. Kill the conflicting process, or start Spark on a different UI port:
    pyspark --conf spark.ui.port=4041
    

Issue 7: Spark Shell Starts but No Output

Cause: Logging level too verbose

Solution:

  1. Edit conf/log4j2.properties
  2. Change rootLogger.level = info to rootLogger.level = warn
  3. Or set at runtime:
    spark.sparkContext.setLogLevel("WARN")
    

Issue 8: "Exception in thread 'main' java.lang.UnsupportedClassVersionError"

Cause: Java version mismatch (code compiled with newer Java than runtime)

Solution:

  1. Check Java version: java -version
  2. Spark 3.5.x requires Java 8, 11, or 17
  3. Update Java or use compatible Spark version

Issue 9: PATH Too Long Error

Cause: Windows PATH variable exceeds 2048 character limit

Solution:

  1. Use short name format for paths (progra~1)
  2. Remove unnecessary entries from PATH
  3. Use user variables instead of system variables where possible

Issue 10: PySpark Cannot Find Python

Cause: PYSPARK_PYTHON not set or wrong Python version

Solution:

# Set Python path explicitly
set PYSPARK_PYTHON=C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe

# Or add to Environment Variables permanently
# Variable: PYSPARK_PYTHON
# Value: [path to python.exe]
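
If you launch scripts with plain python rather than spark-submit, the interpreter can also be pinned from inside the script before the session is created. A small sketch (the path below is a placeholder for your actual python.exe):

import os
from pyspark.sql import SparkSession

# Must be set before the SparkSession/SparkContext is created
os.environ["PYSPARK_PYTHON"] = r"C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe"

spark = SparkSession.builder.master("local[*]").getOrCreate()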

Performance Tuning for Windows

Memory Configuration

Edit spark-defaults.conf:

# Adjust based on your system RAM
# For 8GB system:
spark.driver.memory         2g
spark.executor.memory       2g

# For 16GB system:
spark.driver.memory         4g
spark.executor.memory       4g

Optimize for Local Development

# Use all available cores
spark.master                local[*]

# Disable unnecessary features for local dev
spark.eventLog.enabled      false
spark.ui.showConsoleProgress false

# Increase parallelism for better performance
spark.default.parallelism   4
spark.sql.shuffle.partitions 4

Windows-Specific Optimizations

# Relax the file-permission umask (Windows has no POSIX permissions)
spark.hadoop.fs.permissions.umask-mode 000

# Use local disk for temp files
spark.local.dir             C:/temp/spark
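
For per-application runs, the same scratch-directory setting can be passed from code. A short sketch (creating the folder up front simply to avoid permission surprises):

import os
from pyspark.sql import SparkSession

# Scratch directory for shuffle and spill files
os.makedirs(r"C:\temp\spark", exist_ok=True)

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.local.dir", r"C:\temp\spark")
    .getOrCreate()
)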

Verification Checklist

Before considering your installation complete, verify:

  • Java version is 8, 11, or 17: java -version
  • Python version is 3.8+: python --version
  • JAVA_HOME is set: echo %JAVA_HOME%
  • HADOOP_HOME is set: echo %HADOOP_HOME%
  • SPARK_HOME is set: echo %SPARK_HOME%
  • All binaries are in PATH: where java, where winutils, where spark-shell
  • Spark shell starts: spark-shell
  • PySpark shell starts: pyspark
  • Web UI is accessible: http://localhost:4040 (while Spark is running)
  • Sample application runs successfully
  • No Java exceptions in output

Alternative: Using Windows Subsystem for Linux (WSL)

If you encounter persistent issues with Windows, consider using WSL2:

Advantages:

  • Native Linux environment
  • No winutils.exe required
  • Better performance
  • Easier troubleshooting

Quick Setup:

# Install WSL2 with Ubuntu
wsl --install -d Ubuntu

# Inside WSL, install dependencies
sudo apt update
sudo apt install openjdk-11-jdk python3 python3-pip

# Download and extract Spark
wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark

# Set environment variables in ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

# Test installation
spark-shell

Next Steps

Congratulations! You now have Apache Spark installed on your Windows machine. Here are recommended next steps:

Explore Development Options

  • Apache Spark with Docker - Containerized Spark development
  • Set up Jupyter Notebook with PySpark for interactive development
  • Install Visual Studio Code with Python extension for script development

Production Deployment

  • Databricks - Managed Spark platform for production workloads
  • Learn about Spark on cloud platforms (AWS EMR, Azure HDInsight, GCP Dataproc)
  • Understand cluster management and resource optimization

Conclusion

Installing Apache Spark on Windows requires careful attention to environment configuration and dependencies. While the process can be challenging due to Windows-specific requirements like winutils.exe, following this comprehensive guide ensures a successful installation.

Key takeaways:

  • Always use paths without spaces for installation directories
  • Set JAVA_HOME, HADOOP_HOME, and SPARK_HOME consistently (this guide uses System variables)
  • A new Command Prompt window is required after setting environment variables
  • Winutils.exe and hadoop.dll are mandatory for Spark on Windows
  • Java version compatibility is critical (use JDK 11 for best compatibility)

With your local Spark installation complete, you're ready to start learning and developing big data applications on your Windows machine!
