Connect to a PostgreSQL database using PySpark
To connect to a PostgreSQL database from PySpark, you need the PostgreSQL JDBC driver on Spark's classpath; you can then use the PySpark DataFrameReader to load data from the database.
First, download the PostgreSQL JDBC driver (postgresql-<version>.jar) from the official website: https://jdbc.postgresql.org/download/
Next, include the JDBC driver when starting your PySpark application by setting the --jars option:

pyspark --jars /path/to/postgresql-<version>.jar
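Alternatively, if you launch Spark from a script rather than the pyspark shell, you can point the spark.jars configuration at the driver when building the session. This is a minimal sketch; the jar path is a placeholder for wherever you saved the download:

from pyspark.sql import SparkSession

# Add the PostgreSQL JDBC driver to the classpath at session creation
# (path is a placeholder for your downloaded jar)
spark = SparkSession.builder \
    .appName("PostgreSQL Connection Example") \
    .config("spark.jars", "/path/to/postgresql-<version>.jar") \
    .getOrCreate()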
Now you can use PySpark to connect to the PostgreSQL database. Here's an example of how to do it:
from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession.builder \
    .appName("PostgreSQL Connection Example") \
    .getOrCreate()

# Database connection properties
url = "jdbc:postgresql://localhost:5432/your_database_name"
properties = {
    "user": "your_user",
    "password": "your_password",
    "driver": "org.postgresql.Driver"
}

# Read data from the PostgreSQL table
table_name = "your_table_name"
df = spark.read \
    .jdbc(url, table_name, properties=properties)

# Show the DataFrame
df.show()
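Equivalently, you can express the same read with the generic format("jdbc") reader, passing each connection property as an option. A sketch using the same placeholder values:

# Same read, using the generic JDBC data source instead of read.jdbc
df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", "your_user") \
    .option("password", "your_password") \
    .option("driver", "org.postgresql.Driver") \
    .load()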
Make sure to replace your_database_name, your_user, your_password, and your_table_name with the appropriate values for your PostgreSQL database.
This example initializes a Spark session, sets the database connection properties, reads data from the specified PostgreSQL table, and shows the resulting DataFrame.
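If you only need a subset of the table, Spark (2.4 and later) can push a SQL query down to PostgreSQL via the query option, and you can write results back with DataFrameWriter.jdbc. The sketch below assumes hypothetical columns id, name, and active and a placeholder output table your_output_table:

# Push a SQL query down to PostgreSQL instead of pulling the whole table
# (the "query" option requires Spark 2.4+; column names are placeholders)
active_df = spark.read \
    .format("jdbc") \
    .option("url", url) \
    .option("query", "SELECT id, name FROM your_table_name WHERE active = true") \
    .option("user", "your_user") \
    .option("password", "your_password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

# Write the result back to PostgreSQL; mode="append" adds rows to an
# existing table, while mode="overwrite" replaces its contents
active_df.write.jdbc(url, "your_output_table", mode="append", properties=properties)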