Connect to a PostgreSQL database using PySpark

To connect to a PostgreSQL database using PySpark, you need to make the PostgreSQL JDBC driver available to Spark and use the PySpark DataFrameReader to load data from the database.

First, you need to download the PostgreSQL JDBC driver (postgresql-<version>.jar) from the official website: https://jdbc.postgresql.org/download/

Next, include the JDBC driver when starting your PySpark application by passing the --jars option:

pyspark --jars /path/to/postgresql-<version>.jar
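If you run the application as a batch job rather than in the interactive shell, spark-submit accepts the same --jars option. The script name below is a placeholder:

```shell
# Equivalent driver setup for a batch job; replace the jar path and
# script name with your own.
spark-submit --jars /path/to/postgresql-<version>.jar your_script.py
```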

Now you can use PySpark to connect to the PostgreSQL database. Here's an example of how to do it:

from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession.builder \
    .appName("PostgreSQL Connection Example") \
    .getOrCreate()

# Database connection properties
url = "jdbc:postgresql://localhost:5432/your_database_name"
properties = {
    "user": "your_user",
    "password": "your_password",
    "driver": "org.postgresql.Driver"
}

# Read data from the PostgreSQL table
table_name = "your_table_name"
df = spark.read \
    .jdbc(url, table_name, properties=properties)

# Show the DataFrame
df.show()

Make sure to replace your_database_name, your_user, your_password, and your_table_name with the appropriate values for your PostgreSQL database.
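If you build this URL in more than one place, a small helper keeps the host, port, and database name together. This is just an illustrative convenience, not part of the PySpark API; the host, port, and database values are placeholders:

```python
# Assemble a PostgreSQL JDBC URL from its parts. The host, port, and
# database values used below are illustrative placeholders.
def postgres_jdbc_url(host: str, port: int, database: str) -> str:
    return f"jdbc:postgresql://{host}:{port}/{database}"

print(postgres_jdbc_url("localhost", 5432, "your_database_name"))
# -> jdbc:postgresql://localhost:5432/your_database_name
```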

This example initializes a Spark session, sets the database connection properties, reads data from the specified PostgreSQL table, and shows the resulting DataFrame.
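If you only need a subset of rows, spark.read.jdbc also accepts a parenthesised subquery with an alias in place of a bare table name, so the filtering runs inside PostgreSQL and only matching rows are transferred. The table, column, and alias names below are placeholders; the actual read call (shown as a comment) reuses the spark, url, and properties objects from the example above:

```python
# A subquery (with an alias) can stand in for the table name, pushing the
# WHERE clause down to PostgreSQL. All names here are placeholders.
subquery = "(SELECT id, name FROM your_table_name WHERE active = true) AS active_rows"

# Reusing `spark`, `url`, and `properties` from the example above:
# df_active = spark.read.jdbc(url, subquery, properties=properties)
```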