Apache Spark Architecture: End-to-End

A visual breakdown of how your Spark application runs, from code submission to distributed execution on a cluster.

1. Client Machine

Where you write your PySpark script and submit it to the cluster.

e.g., Your laptop, an Airflow worker, or a CI/CD runner.

spark-submit job.py --date ...

Action: SUBMIT JOB
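A fuller submission might spell out the resource request that the Cluster Manager receives in step 2. This is a sketch only; the master URL, deploy mode, and numbers are illustrative and simply mirror the example figures used throughout:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --conf spark.driver.maxResultSize=2g \
  --num-executors 10 \
  --executor-memory 26g \
  --executor-cores 4 \
  job.py --date ...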

The Cluster Environment

2. Cluster Manager

The overall resource commander. It receives job requests and allocates resources on the physical machines.

e.g., Receives request: "I need 1 Driver (8GB RAM) and 10 Executors (26GB RAM each)."

e.g., Kubernetes, YARN, Mesos, or Spark's Standalone manager.

Action: ALLOCATE RESOURCES

Allocates Driver

Allocates Executors
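The same request can also be expressed as standard Spark configuration when the session is built. A minimal sketch, assuming the illustrative figures above (the app name is made up; spark.driver.memory usually has to be set at submit time, before the driver JVM starts):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-job")                    # illustrative name
    .config("spark.executor.instances", "10")  # "10 Executors"
    .config("spark.executor.memory", "26g")    # "26GB RAM each"
    .config("spark.executor.cores", "4")
    .config("spark.driver.memory", "8g")       # normally set via spark-submit instead
    .getOrCreate()
)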

3. Spark Driver (The Brain)

Runs your main() function, creates the execution plan (DAG), and directs the Executors.

e.g., Plan: "Read table1, join with table2, then write the result to S3."

spark.driver.memory: 8g, spark.driver.maxResultSize: 2g

Action: SEND TASKS & CODE TO EXECUTORS
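A minimal PySpark sketch of that plan; the paths, table names, and join key are assumptions, not part of the original example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-job").getOrCreate()

table1 = spark.read.parquet("s3://my-bucket/table1/")  # illustrative path
table2 = spark.read.parquet("s3://my-bucket/table2/")
result = table1.join(table2, on="id", how="inner")     # join key "id" is an assumption

result.explain()  # prints the physical plan (DAG) the Driver has built
result.write.mode("overwrite").parquet("s3://my-bucket/output/")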

4. Worker Nodes (The Workforce)

Worker Node 1

e.g., An AWS EC2 instance (m5.4xlarge) with 16 vCPU & 64 GiB RAM.

Executor 1

A process that executes tasks on data partitions.

Task: "Read partition #5 of table1 and filter for filter_column = filter_value."

spark.executor.memory: 26g, spark.executor.cores: 4
A Task is a unit of work on a single data partition; an executor runs up to one task per core at a time.

Another executor on the same node (a worker node often hosts more than one executor process).
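From the application's point of view, the number of tasks in a stage follows the number of partitions. A small sketch, reusing the column and value names from the task example above (the path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://my-bucket/table1/")  # illustrative path
print(df.rdd.getNumPartitions())                   # roughly one task per partition, per stage
filtered = df.filter(df["filter_column"] == "filter_value")
print(filtered.count())                            # an action: the Driver schedules tasks onto the executors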

Worker Node 2 ... N

Additional worker nodes with similar configuration.

Executor 2 ... 10

Multiple executors processing different data partitions.

spark.executor.memory: 26g, spark.executor.cores: 4

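With the illustrative numbers above, the cluster can run 10 executors × 4 cores = 40 tasks at the same time; a stage with more partitions than that is processed in successive waves.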

5. Data Source / Sink

Where the data lives permanently. Executors read from and write to this storage.

e.g., Amazon S3, Google Cloud Storage, HDFS, Hive Tables, databases.

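A hedged sketch of reading from and writing to a few of these systems; every path, table name, and connection detail below is made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

from_s3 = spark.read.parquet("s3://my-bucket/events/")    # object storage
from_hive = spark.read.table("analytics.daily_events")    # Hive metastore table
from_db = (
    spark.read.format("jdbc")                             # relational database
    .option("url", "jdbc:postgresql://db-host:5432/appdb")
    .option("dbtable", "public.events")
    .option("user", "reader")
    .option("password", "...")
    .load()
)

from_s3.write.mode("overwrite").parquet("s3://my-bucket/output/")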