Apache Spark and PySpark
Apache Spark
info: https://spark.apache.org/
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Features
- Speed: Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
- Runs Everywhere: It can run on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
- Rich APIs: It provides APIs in Java, Scala, Python, and R, along with an interactive shell for Scala and Python.
- Unified Engine: A single optimized engine, built around general execution graphs (DAGs), serves batch, interactive, and streaming workloads alike.
- Rich Set of Tools: It supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Components
- Spark Core: It is the base engine for large-scale parallel and distributed data processing. It provides distributed task dispatching, scheduling, and basic I/O functionalities.
- Spark SQL: It is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
- MLlib: It is Apache Spark's scalable machine learning library. It provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
- GraphX: It is Apache Spark's API for graphs and graph-parallel computation. It unifies ETL, exploratory analysis, and iterative graph computation within a single system.
- Spark Streaming: It is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
Use Cases
- Data Processing: Apache Spark is used for processing large datasets in a distributed manner.
- Machine Learning: It is used for building and training machine learning models at scale.
- Graph Processing: It is used for processing and analyzing large-scale graphs.
- Real-time Stream Processing: It is used for processing real-time data streams.
Advantages
- Speed: In-memory computation makes it significantly faster than Hadoop MapReduce for iterative and interactive workloads.
- Ease of Use: It provides high-level APIs in Java, Scala, Python, and R.
- Unified Engine: It provides a unified engine for diverse workloads.
- Rich Set of Tools: It supports a rich set of higher-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.
Disadvantages
- Complexity: It can be complex to set up and configure.
- Memory Management: It requires careful memory management to avoid out-of-memory errors.
- Learning Curve: It has a steep learning curve for beginners.
PySpark
info: https://spark.apache.org/docs/latest/api/python/index.html
PySpark is the Python API for Apache Spark. It allows you to write Spark applications using Python. PySpark is built on top of Spark's Java API and is exposed to Python through Py4J. A minimal example that starts a session and reads a CSV file into a DataFrame:

```python
from pyspark.sql import SparkSession

# SparkSession is the entry point for the DataFrame API
spark = SparkSession.builder.appName('example').getOrCreate()

# Read a CSV into a DataFrame; treat the first row as a header and infer column types
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df.show()
```
Spark SQL
info: https://spark.apache.org/docs/latest/sql-programming-guide.html
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It can run unmodified Hive queries on existing deployments and data, often dramatically faster than Hive on MapReduce.
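As a sketch of that DataFrame/SQL duality, the snippet below builds a small in-memory DataFrame (a hypothetical stand-in for any structured source such as CSV, Parquet, or a Hive table), registers it as a temporary view, and queries it with plain SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sql-example').getOrCreate()

# Hypothetical in-memory data; any structured source works the same way
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Expose the DataFrame to the SQL engine as a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()
```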
MLlib
info: https://spark.apache.org/docs/latest/ml-guide.html
MLlib is Apache Spark's scalable machine learning library. It provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
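A minimal sketch of such a pipeline, chaining feature extraction and a classifier; the toy data and app name here are illustrative assumptions, not from the MLlib docs:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('mllib-example').getOrCreate()

# Hypothetical toy training data: (id, text, label)
training = spark.createDataFrame([
    (0, "spark is fast", 1.0),
    (1, "hadoop mapreduce on disk", 0.0),
], ["id", "text", "label"])

# Chain tokenization, feature hashing, and logistic regression into one pipeline
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# Fit the whole pipeline as a single estimator, then score the same data
model = pipeline.fit(training)
model.transform(training).select("text", "prediction").show()
```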
GraphX
info: https://spark.apache.org/docs/latest/graphx-programming-guide.html
GraphX is Apache Spark's API for graphs and graph-parallel computation. It unifies ETL, exploratory analysis, and iterative graph computation within a single system.
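Note that GraphX itself exposes Scala and Java APIs only; from Python, graph-parallel computation is usually done through the separate GraphFrames package instead. A minimal sketch, assuming graphframes is installed alongside Spark:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # third-party package, not bundled with Spark

spark = SparkSession.builder.appName('graph-example').getOrCreate()

# Vertices need an 'id' column; edges need 'src' and 'dst' columns
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

# Build the graph and run PageRank over it
g = GraphFrame(vertices, edges)
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()
```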
Spark Streaming
info: https://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.
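The classic network word count illustrates the DStream API described above; it assumes a text source on localhost:9999 (for example, `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one to receive data, one to process it; 1-second batches
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Count words in each batch of lines arriving on the socket
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```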
Further Reading
- Learning Spark: Lightning-Fast Big Data Analysis
- Spark: The Definitive Guide
- Mastering Apache Spark 2.x - Second Edition
- Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale
- High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
- Apache Spark in 24 Hours, Sams Teach Yourself
- Apache Spark 2.x for Java Developers
- Apache Spark 2.x Cookbook
- Apache Spark 2.x Machine Learning Cookbook
- Apache Spark 2.x for Python Developers
- Apache Spark 2.x for Scala Developers
- Apache Spark 2.x Graph Processing
- Apache Spark 2.x Machine Learning