How to Install Apache Spark on a Local Machine using Windows
Table of Contents
- Introduction
- Understanding the Components
- Prerequisites
- Step 1: Download Apache Spark
- Step 2: Extract Apache Spark
- Step 3: Configure Environment Variables
- Step 4: Configure Spark (Optional)
- Step 5: Test Spark Installation
- Step 6: Run a Spark Application
- Troubleshooting Common Issues
- Issue 1: "java is not recognized as an internal or external command"
- Issue 2: "JAVA_HOME is set to an invalid directory"
- Issue 3: "Could not find or load main class org.apache.spark.deploy.SparkSubmit"
- Issue 4: "java.io.IOException: Could not locate Hadoop executable"
- Issue 5: "java.io.FileNotFoundException: java.io.IOException: (null) entry in command string: null chmod 0644"
- Issue 6: "py4j.protocol.Py4JNetworkError: Answer from Java side is empty"
- Issue 7: Spark Shell Starts but No Output
- Issue 8: "Exception in thread 'main' java.lang.UnsupportedClassVersionError"
- Issue 9: PATH Too Long Error
- Issue 10: PySpark Cannot Find Python
- Performance Tuning for Windows
- Verification Checklist
- Alternative: Using Windows Subsystem for Linux (WSL)
- Next Steps
- Conclusion
Introduction
Installing Apache Spark on Windows can be challenging due to compatibility issues and environment configuration requirements. This comprehensive guide walks you through every step of the installation process, from prerequisites to verification, with detailed troubleshooting tips for common errors.
By the end of this guide, you'll have a fully functional Apache Spark installation on your Windows machine, ready for local development and testing.
Understanding the Components
Before diving into installation, it's important to understand what you're installing:
Apache Spark: A unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R.
Hadoop Winutils: Windows utilities for Hadoop that Spark requires on Windows systems. Even though you're not installing Hadoop, Spark needs these utilities to function properly on Windows.
Java Development Kit (JDK): Spark is built on the JVM (Java Virtual Machine) and requires Java to run.
Python (Optional but Recommended): Required if you want to use PySpark, Spark's Python API.
Prerequisites
System Requirements
- Operating System: Windows 10 or Windows 11 (64-bit)
- RAM: Minimum 4GB (8GB or more recommended)
- Disk Space: At least 5GB free space
- Administrator Access: Required for setting environment variables
Required Software
1. Java Development Kit (JDK)
Apache Spark requires JDK 8, 11, or 17. JDK 11 is recommended for most use cases.
Download and Install JDK:
- Visit Oracle JDK Downloads or Adoptium (OpenJDK)
- Download JDK 11 (Windows x64 Installer)
- Run the installer with default settings
- Note the installation path (e.g., C:\Program Files\Java\jdk-11.0.18)
Verify Java Installation:
Open Command Prompt and run:
java -version
You should see output like:
java version "11.0.18" 2023-01-17 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.18+9-LTS-195)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.18+9-LTS-195, mixed mode)
Important Notes:
- If you have multiple Java versions installed, Spark will use the version specified in JAVA_HOME
- Do not install Java in paths with spaces (e.g., avoid C:\Program Files (x86)\)
- If you must use a path with spaces, use the short name format (progra~1 or progra~2)
2. Python (Optional but Recommended)
If you plan to use PySpark, install Python 3.8 or higher.
Download and Install Python:
- Visit Python Downloads
- Download Python 3.11 (Windows installer - 64-bit)
- IMPORTANT: Check "Add Python to PATH" during installation
- Choose "Install Now" or customize to select installation directory
Verify Python Installation:
python --version
pip --version
Expected output:
Python 3.11.5
pip 23.2.1 from C:\Users\YourName\AppData\Local\Programs\Python\Python311\lib\site-packages\pip (python 3.11)
Install PySpark Package (Optional):
pip install pyspark
This installs the PySpark Python package, but you'll still need the full Spark distribution for the rest of this guide.
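If you want a quick sanity check that the pip package itself is importable (separate from the full distribution you are about to install), a minimal script like the following works; the file name verify_pyspark_install.py is just an example:
# verify_pyspark_install.py - confirms the pip-installed package imports cleanly
import pyspark
print("PySpark version:", pyspark.__version__)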
3. Hadoop Winutils
Spark on Windows requires Hadoop's native Windows utilities (winutils.exe and hadoop.dll).
Download Winutils:
- Visit winutils GitHub repository
- Navigate to the Hadoop version matching your Spark distribution:
  - For Spark 3.5.x: use hadoop-3.3.6
  - For Spark 3.4.x: use hadoop-3.3.5
  - For Spark 3.3.x: use hadoop-3.3.1
- Download both winutils.exe and hadoop.dll from the bin folder
Create Hadoop Directory Structure:
# Create directories
mkdir C:\hadoop\bin
# Move downloaded files to:
# C:\hadoop\bin\winutils.exe
# C:\hadoop\bin\hadoop.dll
Set Permissions (Important):
- Right-click on C:\hadoop\bin\winutils.exe
- Select "Properties" → "Security" tab
- Click "Edit" → "Add" → Enter "Everyone" → "OK"
- Check "Full control" and click "OK"
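To confirm that winutils.exe actually runs (and not just that the file exists), you can call it and list a directory. The snippet below is a small optional check, assuming the standard winutils subcommands such as ls are present in the build you downloaded:
# check_winutils.py - optional check that winutils.exe launches and can list a directory
import os
import subprocess
winutils = os.path.join(os.environ.get("HADOOP_HOME", r"C:\hadoop"), "bin", "winutils.exe")
result = subprocess.run([winutils, "ls", "C:\\"], capture_output=True, text=True)
# A permissions-style listing means winutils works; an error here often points to a missing hadoop.dll
print(result.stdout or result.stderr)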
Step 1: Download Apache Spark
Choose Your Spark Version
Visit the Apache Spark Downloads page.
Version Selection:
- Latest Stable Release: Spark 3.5.0+ (recommended for new projects)
- Package Type: "Pre-built for Apache Hadoop 3.3 and later"
Download Options:
Direct Download (Recommended):
- Click the suggested mirror link
- Download spark-3.5.0-bin-hadoop3.tgz
Command Line Download:
# Using curl (if available)
curl -O https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Verify Download: Check the file size matches the website (approximately 400MB).
Step 2: Extract Apache Spark
Extract the Archive
Using 7-Zip (Recommended):
- Download and install 7-Zip
- Right-click on spark-3.5.0-bin-hadoop3.tgz
- Select "7-Zip" → "Extract Here"
- You'll get a .tar file - extract it again
- Final extracted folder: spark-3.5.0-bin-hadoop3
Using Windows Built-in Extraction:
- Rename .tgz to .tar.gz
- Right-click → "Extract All"
- May require extracting twice (tar.gz, then tar)
Move to Installation Directory
Create a clean installation path without spaces:
# Option 1: Simple path (Recommended)
# Don't create C:\spark first - the move renames the extracted folder to C:\spark
move spark-3.5.0-bin-hadoop3 C:\spark
# Option 2: Versioned path
mkdir C:\spark
move spark-3.5.0-bin-hadoop3 C:\spark\3.5.0
Final Structure:
C:\spark\
├── bin\
│ ├── spark-shell.cmd
│ ├── spark-submit.cmd
│ ├── pyspark.cmd
│ └── ...
├── conf\
├── jars\
├── python\
└── ...
Step 3: Configure Environment Variables
Environment variables tell Windows where to find Spark, Java, and Hadoop.
Open Environment Variables Dialog
Method 1: GUI
- Right-click "This PC" or "Computer"
- Select "Properties"
- Click "Advanced system settings"
- Click "Environment Variables" button
Method 2: Command
- Press Win + R
- Type sysdm.cpl and press Enter
- Go to "Advanced" tab → "Environment Variables"
Method 3: Search
- Search for "Environment Variables" in Windows Search
- Select "Edit the system environment variables"
Create System Variables
Click "New" under "System variables" (not User variables) to create each:
JAVA_HOME
Variable name: JAVA_HOME
Variable value: C:\Program Files\Java\jdk-11.0.18
Important:
- Use your actual JDK installation path
- If the path contains spaces, use the short name format:
  C:\progra~1\Java\jdk-11.0.18 (for Program Files)
  C:\progra~2\Java\jdk-11.0.18 (for Program Files (x86))
Find Short Name:
dir /x "C:\Program Files"
HADOOP_HOME
Variable name: HADOOP_HOME
Variable value: C:\hadoop
SPARK_HOME
Variable name: SPARK_HOME
Variable value: C:\spark
Or if using versioned path:
Variable value: C:\spark\3.5.0
PYSPARK_PYTHON (Optional)
If you have multiple Python versions:
Variable name: PYSPARK_PYTHON
Variable value: C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe
Update PATH Variable
- Find "Path" in "System variables"
- Click "Edit"
- Click "New" and add each path (one per line):
%JAVA_HOME%\bin
%HADOOP_HOME%\bin
%SPARK_HOME%\bin
%SPARK_HOME%\python
Order matters: Ensure these paths appear before any other Java or Python installations.
Verify Environment Variables
Open a new Command Prompt (important - existing windows won't see new variables):
# Verify variables are set
echo %JAVA_HOME%
echo %HADOOP_HOME%
echo %SPARK_HOME%
# Verify binaries are accessible
where java
where spark-shell
where winutils
Expected output:
C:\Program Files\Java\jdk-11.0.18
C:\hadoop
C:\spark
C:\Program Files\Java\jdk-11.0.18\bin\java.exe
C:\spark\bin\spark-shell
C:\spark\bin\spark-shell.cmd
C:\hadoop\bin\winutils.exe
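If you prefer to script this check, the short Python sketch below (the file name check_env.py is just an example) verifies the same three variables and the key executables under them:
# check_env.py - optional pre-flight check for the variables configured above
import os
def check(var, expected_file):
    # Confirm the variable is set, points at a directory, and contains the expected file
    path = os.environ.get(var)
    if not path or not os.path.isdir(path):
        print(f"FAIL: {var} is not set or is not a directory ({path})")
    elif not os.path.isfile(os.path.join(path, expected_file)):
        print(f"FAIL: {var} = {path}, but {expected_file} was not found under it")
    else:
        print(f"OK:   {var} = {path}")
check("JAVA_HOME", os.path.join("bin", "java.exe"))
check("HADOOP_HOME", os.path.join("bin", "winutils.exe"))
check("SPARK_HOME", os.path.join("bin", "spark-shell.cmd"))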
Step 4: Configure Spark (Optional)
Create Spark Configuration
Navigate to Spark's conf directory:
cd C:\spark\conf
Copy template configuration:
copy spark-defaults.conf.template spark-defaults.conf
copy log4j2.properties.template log4j2.properties
Edit Spark Defaults
Open spark-defaults.conf and add basic configurations:
# Application Properties
spark.app.name MySparkApp
spark.master local[*]
# Memory Configuration
spark.driver.memory 2g
spark.executor.memory 2g
# Python Configuration (if using PySpark)
spark.pyspark.python python
spark.pyspark.driver.python python
# UI Configuration
spark.ui.port 4040
# Logging Configuration
spark.eventLog.enabled false
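These properties can also be set in code when the session is created; values passed to the builder take precedence over spark-defaults.conf. A minimal PySpark sketch using the same property names:
from pyspark.sql import SparkSession
# Builder settings override spark-defaults.conf for this application
spark = SparkSession.builder \
    .appName("MySparkApp") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()
print(spark.conf.get("spark.driver.memory"))
spark.stop()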
Configure Logging
Edit log4j2.properties to reduce verbosity:
# Change rootLogger level from INFO to WARN
rootLogger.level = warn
# Or for specific loggers
logger.spark.name = org.apache.spark
logger.spark.level = warn
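The log level can also be changed at runtime for a single application, without touching log4j2.properties; a quick PySpark example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Affects log messages emitted after this call; valid levels include INFO, WARN, and ERROR
spark.sparkContext.setLogLevel("WARN")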
Step 5: Test Spark Installation
Test Spark Shell (Scala)
Open Command Prompt and run:
spark-shell
Expected Output:
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.0
/_/
Using Scala version 2.12.18 (Java HotSpot(TM) 64-Bit Server VM, Java 11.0.18)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Run Test Commands:
// Create a simple RDD
val data = 1 to 10
val rdd = sc.parallelize(data)
// Perform a transformation and action
val result = rdd.map(_ * 2).collect()
println(result.mkString(", "))
// Output: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20
Exit Spark Shell:
:quit
Test PySpark
Open Command Prompt and run:
pyspark
Expected Output:
Python 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.5.0
/_/
Using Python version 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34)
Spark context Web UI available at http://localhost:4040
SparkSession available as 'spark'.
>>>
Run Test Commands:
# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])
# Show the DataFrame
df.show()
# Output:
# +-------+---+
# | name|age|
# +-------+---+
# | Alice| 25|
# | Bob| 30|
# |Charlie| 35|
# +-------+---+
# Perform a transformation
df.filter(df.age > 28).show()
Exit PySpark:
exit()
Access Spark Web UI
While Spark is running, open a browser and navigate to:
http://localhost:4040
You should see the Spark Web UI showing:
- Active jobs
- Stages
- Storage
- Environment configuration
- Executors
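If port 4040 was already taken, Spark automatically moves the UI to the next free port (4041, 4042, ...). You can ask a running session for the exact address, for example from the pyspark shell:
# Run inside an active pyspark shell, where 'spark' already exists
print(spark.sparkContext.uiWebUrl)  # e.g. http://localhost:4040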
Step 6: Run a Spark Application
Create a Simple PySpark Script
Create a file test_spark.py:
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
    .appName("WindowsSparkTest") \
    .master("local[*]") \
    .getOrCreate()
# Create sample data
data = [
    ("Data Engineering", 95),
    ("Machine Learning", 88),
    ("DevOps", 92),
    ("Cloud Computing", 90)
]
# Create DataFrame
df = spark.createDataFrame(data, ["Topic", "Score"])
# Show DataFrame
print("Original Data:")
df.show()
# Perform aggregations
print(f"Average Score: {df.agg({'Score': 'avg'}).collect()[0][0]:.2f}")
print(f"Max Score: {df.agg({'Score': 'max'}).collect()[0][0]}")
print(f"Min Score: {df.agg({'Score': 'min'}).collect()[0][0]}")
# Filter high scores
print("\nHigh Scores (>90):")
df.filter(df.Score > 90).show()
# Stop Spark session
spark.stop()
print("\nSpark application completed successfully!")
Run the Script
python test_spark.py
Or using spark-submit:
spark-submit test_spark.py
Expected Output:
Original Data:
+----------------+-----+
| Topic|Score|
+----------------+-----+
|Data Engineering| 95|
|Machine Learning| 88|
| DevOps| 92|
| Cloud Computing| 90|
+----------------+-----+
Average Score: 91.25
Max Score: 95
Min Score: 88
High Scores (>90):
+----------------+-----+
| Topic|Score|
+----------------+-----+
|Data Engineering| 95|
| DevOps| 92|
+----------------+-----+
Spark application completed successfully!
Troubleshooting Common Issues
Issue 1: "java is not recognized as an internal or external command"
Cause: JAVA_HOME not set or PATH not updated
Solution:
- Verify JAVA_HOME is set correctly
- Ensure %JAVA_HOME%\bin is in PATH
- Open a new Command Prompt window
- Run: echo %JAVA_HOME% and where java
Issue 2: "JAVA_HOME is set to an invalid directory"
Cause: JAVA_HOME points to wrong location
Solution:
# Find Java installation
where java
# Set JAVA_HOME to the directory ABOVE bin
# If java is at: C:\Program Files\Java\jdk-11.0.18\bin\java.exe
# Then JAVA_HOME should be: C:\Program Files\Java\jdk-11.0.18
Issue 3: "Could not find or load main class org.apache.spark.deploy.SparkSubmit"
Cause: SPARK_HOME not set or incorrect
Solution:
- Verify SPARK_HOME points to Spark installation directory
- Check that %SPARK_HOME%\bin\spark-shell.cmd exists
- Ensure no trailing slashes in the SPARK_HOME path
Issue 4: "java.io.IOException: Could not locate Hadoop executable"
Cause: HADOOP_HOME not set or winutils.exe missing
Solution:
- Download winutils.exe and hadoop.dll
- Place them in C:\hadoop\bin\
- Set HADOOP_HOME to C:\hadoop
- Verify: where winutils should find it
Issue 5: "java.io.FileNotFoundException: java.io.IOException: (null) entry in command string: null chmod 0644"
Cause: Missing hadoop.dll or incorrect permissions
Solution:
- Download hadoop.dll (same version as winutils.exe)
- Place it at C:\hadoop\bin\hadoop.dll
- Set permissions on winutils.exe (Everyone → Full Control)
- Restart Command Prompt
Issue 6: "py4j.protocol.Py4JNetworkError: Answer from Java side is empty"
Cause: Python/Java version mismatch or port conflict
Solution:
- Verify Java version is 8, 11, or 17
- Check if port 4040 is already in use: netstat -ano | findstr 4040
- Kill the conflicting process, or start PySpark on a different UI port:
  pyspark --conf spark.ui.port=4041
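Equivalently, when building the session in a script, you can pin the UI to a free port yourself; a short sketch:
from pyspark.sql import SparkSession
# Avoid the clash on 4040 by choosing the UI port explicitly
spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.ui.port", "4041") \
    .getOrCreate()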
Issue 7: Spark Shell Starts but No Output
Cause: Logging level too verbose
Solution:
- Edit conf/log4j2.properties
- Change rootLogger.level = info to rootLogger.level = warn
- Or set at runtime: spark.sparkContext.setLogLevel("WARN")
Issue 8: "Exception in thread 'main' java.lang.UnsupportedClassVersionError"
Cause: Java version mismatch (code compiled with newer Java than runtime)
Solution:
- Check Java version: java -version
- Spark 3.5.x requires Java 8, 11, or 17
- Update Java or use compatible Spark version
Issue 9: PATH Too Long Error
Cause: The Windows PATH variable exceeds the 2048-character limit
Solution:
- Use short name format for paths (progra~1)
- Remove unnecessary entries from PATH
- Use user variables instead of system variables where possible
Issue 10: PySpark Cannot Find Python
Cause: PYSPARK_PYTHON not set or wrong Python version
Solution:
# Set Python path explicitly
set PYSPARK_PYTHON=C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe
# Or add to Environment Variables permanently
# Variable: PYSPARK_PYTHON
# Value: [path to python.exe]
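Alternatively, you can point Spark at the interpreter that is running your script from inside the script itself, before the session is created. A small sketch of this pattern:
import os
import sys
from pyspark.sql import SparkSession
# Use the same interpreter for the driver and the worker processes
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
spark = SparkSession.builder.master("local[*]").getOrCreate()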
Performance Tuning for Windows
Memory Configuration
Edit spark-defaults.conf:
# Adjust based on your system RAM
# For 8GB system:
spark.driver.memory 2g
spark.executor.memory 2g
# For 16GB system:
spark.driver.memory 4g
spark.executor.memory 4g
Optimize for Local Development
# Use all available cores
spark.master local[*]
# Disable unnecessary features for local dev
spark.eventLog.enabled false
spark.ui.showConsoleProgress false
# Increase parallelism for better performance
spark.default.parallelism 4
spark.sql.shuffle.partitions 4
Windows-Specific Optimizations
# Disable POSIX file permissions (Windows doesn't support them)
spark.hadoop.fs.permissions.umask-mode 000
# Use local disk for temp files
spark.local.dir C:/temp/spark
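As with the settings in Step 4, these tuning values can be applied per application at session creation instead of in spark-defaults.conf; a sketch (make sure the temp directory exists first):
from pyspark.sql import SparkSession
# Tuning values from the sections above, applied to a single application
spark = SparkSession.builder \
    .appName("LocalTuning") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.ui.showConsoleProgress", "false") \
    .config("spark.local.dir", "C:/temp/spark") \
    .getOrCreate()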
Verification Checklist
Before considering your installation complete, verify:
- Java version is 8, 11, or 17: java -version
- Python version is 3.8+: python --version
- JAVA_HOME is set: echo %JAVA_HOME%
- HADOOP_HOME is set: echo %HADOOP_HOME%
- SPARK_HOME is set: echo %SPARK_HOME%
- All binaries are in PATH: where java, where winutils, where spark-shell
- Spark shell starts: spark-shell
- PySpark shell starts: pyspark
- Web UI is accessible: http://localhost:4040 (while Spark is running)
- Sample application runs successfully
- No Java exceptions in output
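For a scripted end-to-end check, the following small PySpark program (the file name verify_spark.py is just an example) creates a session, runs a tiny job, and prints the version it is running against:
# verify_spark.py - end-to-end installation check
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("InstallCheck").master("local[*]").getOrCreate()
count = spark.range(1000).filter("id % 2 = 0").count()
print("Spark version:", spark.version)
print("Even ids counted:", count)  # expected: 500
spark.stop()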
Alternative: Using Windows Subsystem for Linux (WSL)
If you encounter persistent issues with Windows, consider using WSL2:
Advantages:
- Native Linux environment
- No winutils.exe required
- Better performance
- Easier troubleshooting
Quick Setup:
# Install WSL2 with Ubuntu
wsl --install -d Ubuntu
# Inside WSL, install dependencies
sudo apt update
sudo apt install openjdk-11-jdk python3 python3-pip
# Download and extract Spark
wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
# Set environment variables in ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
# Test installation
spark-shell
Next Steps
Congratulations! You now have Apache Spark installed on your Windows machine. Here are recommended next steps:
Learn Spark Basics
- Apache Spark and PySpark Overview - Understand Spark's architecture and core concepts
- PySpark Tutorial - Get hands-on with PySpark's DataFrame API
Explore Development Options
- Apache Spark with Docker - Containerized Spark development
- Set up Jupyter Notebook with PySpark for interactive development
- Install Visual Studio Code with Python extension for script development
Advanced Topics
- Connecting to PostgreSQL with PySpark - Database integration
- Learn about Spark's MLlib for machine learning
- Explore Spark Streaming for real-time data processing
Production Deployment
- Databricks - Managed Spark platform for production workloads
- Learn about Spark on cloud platforms (AWS EMR, Azure HDInsight, GCP Dataproc)
- Understand cluster management and resource optimization
Conclusion
Installing Apache Spark on Windows requires careful attention to environment configuration and dependencies. While the process can be challenging due to Windows-specific requirements like winutils.exe, following this comprehensive guide ensures a successful installation.
Key takeaways:
- Always use paths without spaces for installation directories
- Environment variables must be set as System variables, not User variables
- A new Command Prompt window is required after setting environment variables
- Winutils.exe and hadoop.dll are mandatory for Spark on Windows
- Java version compatibility is critical (use JDK 11 for best compatibility)
With your local Spark installation complete, you're ready to start learning and developing big data applications on your Windows machine!