Apache Spark and PySpark

Apache Spark

info: https://spark.apache.org/

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Features

  • Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  • Runs Everywhere: It runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access diverse data sources including HDFS, Cassandra, HBase, and S3.
  • Rich APIs: It provides high-level APIs in Java, Scala, Python, and R, along with interactive shells for Scala and Python.
  • Unified Engine: A single optimized engine that supports general execution graphs runs batch, interactive, and streaming workloads alike.
  • Rich Set of Tools: It supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Components

  • Spark Core: It is the base engine for large-scale parallel and distributed data processing, built around the RDD abstraction. It provides distributed task dispatching, scheduling, and basic I/O functionalities (see the RDD sketch after this list).
  • Spark SQL: It is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
  • MLlib: It is Apache Spark's scalable machine learning library. It provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
  • GraphX: It is Apache Spark's API for graphs and graph-parallel computation. It unifies ETL, exploratory analysis, and iterative graph computation within a single system.
  • Spark Streaming: It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
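
To make the Spark Core description concrete, here is a minimal RDD sketch; the app name and numbers are illustrative assumptions, not taken from the Spark docs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('core-example').getOrCreate()
sc = spark.sparkContext  # Spark Core's entry point for RDD operations

# Distribute a local list across the cluster and run a parallel map/reduce.
rdd = sc.parallelize([1, 2, 3, 4, 5])
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # 55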

Use Cases

  • Data Processing: Apache Spark is used for processing large datasets in a distributed manner.
  • Machine Learning: It is used for building and training machine learning models at scale.
  • Graph Processing: It is used for processing and analyzing large-scale graphs.
  • Real-time Stream Processing: It is used for processing real-time data streams.

Advantages

  • Speed: In-memory execution makes it significantly faster than Hadoop MapReduce, particularly for iterative workloads.
  • Ease of Use: It provides high-level APIs in Java, Scala, Python, and R.
  • Unified Engine: It provides a unified engine for diverse workloads.
  • Rich Set of Tools: It supports a rich set of higher-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.

Disadvantages

  • Complexity: It can be complex to set up and configure.
  • Memory Management: It requires careful memory management to avoid out-of-memory errors.
  • Learning Curve: It has a steep learning curve for beginners.

PySpark

info: https://spark.apache.org/docs/latest/api/python/index.html

PySpark is the Python API for Apache Spark. It allows you to write Spark applications in Python. PySpark is built on top of Spark's Java API and communicates with the JVM through Py4J. A minimal example that starts a session and reads a CSV file into a DataFrame:

from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession, the entry point for DataFrame operations.
spark = SparkSession.builder.appName('example').getOrCreate()

# Read a CSV file, treating the first row as column names and inferring column types.
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df.show()

Spark SQL

info: https://spark.apache.org/docs/latest/sql-programming-guide.html

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
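
As a small sketch of the DataFrame/SQL duality described above (the file, view, and column names are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sql-example').getOrCreate()
df = spark.read.csv('people.csv', header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView('people')
adults = spark.sql('SELECT name, age FROM people WHERE age >= 18')
adults.show()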

MLlib

info: https://spark.apache.org/docs/latest/ml-guide.html

MLlib is Apache Spark's scalable machine learning library. It provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
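
A minimal Pipeline sketch in that spirit; the tiny training set and column names are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName('ml-example').getOrCreate()

# An illustrative training set: two numeric features and a binary label.
training = spark.createDataFrame(
    [(25.0, 30000.0, 0.0), (40.0, 80000.0, 1.0),
     (35.0, 60000.0, 1.0), (22.0, 20000.0, 0.0)],
    ['age', 'income', 'label'])

# Assemble the raw columns into the single vector column MLlib expects,
# then chain that step with a logistic regression in one Pipeline.
assembler = VectorAssembler(inputCols=['age', 'income'], outputCol='features')
lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(training)

model.transform(training).select('label', 'prediction').show()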

GraphX

info: https://spark.apache.org/docs/latest/graphx-programming-guide.html

GraphX is Apache Spark's API for graphs and graph-parallel computation. It unifies ETL, exploratory analysis, and iterative graph computation within a single system.
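
Note that GraphX itself is exposed only through Spark's Scala API. From Python, a common substitute is the separate GraphFrames package; the sketch below assumes graphframes is installed and uses invented vertex and edge data.

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # third-party package, not bundled with Spark

spark = SparkSession.builder.appName('graph-example').getOrCreate()

# GraphFrames expects vertices with an 'id' column and edges with 'src'/'dst'.
vertices = spark.createDataFrame(
    [('a', 'Alice'), ('b', 'Bob'), ('c', 'Carol')], ['id', 'name'])
edges = spark.createDataFrame(
    [('a', 'b'), ('b', 'c'), ('c', 'a')], ['src', 'dst'])

g = GraphFrame(vertices, edges)
g.inDegrees.show()  # per-vertex in-degree
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()  # PageRank scores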

Spark Streaming

info: https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.
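
The classic network word count illustrates the DStream API; this sketch assumes text lines arrive on a local TCP socket (for example via nc -lk 9999).

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='streaming-example')
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# Split each incoming line into words and count them per batch.
lines = ssc.socketTextStream('localhost', 9999)
counts = (lines.flatMap(lambda line: line.split(' '))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()             # the transformations above only run once the context starts
ssc.awaitTermination()  # block until the streaming job is stopped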
