A visual breakdown of how your Spark application runs, from code submission to distributed execution on a cluster.
1. Client
Where you write your PySpark script and submit it to the cluster.
e.g., Your laptop, an Airflow worker, or a CI/CD runner.
spark-submit job.py --date ...
Action: SUBMIT JOB
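The client's only job is to package the application and hand it off; no Spark computation runs here. As a rough sketch (the `--date` argument mirrors the spark-submit line above; the argparse handling and app name are illustrative assumptions), the submitted `job.py` might look like this:

```python
# Hypothetical job.py -- a minimal entry point matching the submit command above.
import argparse

from pyspark.sql import SparkSession


def main():
    parser = argparse.ArgumentParser()
    # --date is the application argument shown in the spark-submit example.
    parser.add_argument("--date", required=True, help="processing date")
    args = parser.parse_args()

    # The SparkSession is created on the Driver once the cluster manager
    # has allocated resources for it (next steps).
    spark = SparkSession.builder.appName("example-job").getOrCreate()

    # ... transformations for args.date would go here ...

    spark.stop()


if __name__ == "__main__":
    main()
```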
2. Cluster Manager
The overall resource commander. It receives job requests and allocates resources on the physical machines.
e.g., Receives request: "I need 1 Driver (8GB RAM) and 10 Executors (26GB RAM each)."
Action: ALLOCATE RESOURCES
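Those numbers map directly onto standard Spark configuration properties. A minimal sketch using the sizes from the example above; in practice these are usually passed to spark-submit (`--driver-memory`, `--num-executors`, `--executor-memory`), and driver memory in particular must be set at submit time because the driver JVM is already running by the time application code executes:

```python
from pyspark.sql import SparkSession

# Values mirror the example request: 1 Driver (8 GB) and 10 Executors (26 GB each).
spark = (
    SparkSession.builder
    .appName("example-job")
    .config("spark.driver.memory", "8g")       # only effective when set at submit time
    .config("spark.executor.instances", "10")
    .config("spark.executor.memory", "26g")
    .getOrCreate()
)
```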
It first allocates the Driver, then allocates the Executors.
3. Driver
Runs your main() function, creates the execution plan (DAG), and directs the Executors.
e.g., Plan: "Read table1, join with table2, then write the result to S3."
Action: SEND TASKS & CODE TO EXECUTORS
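In PySpark terms, that plan is just a few lazy transformations followed by one action: the Driver builds the DAG as the code runs and only ships tasks to the Executors when the write is triggered. A minimal sketch, assuming table1 and table2 are registered tables, an id join key, and a placeholder S3 path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-job").getOrCreate()

# Lazy transformations: the Driver only builds the logical plan / DAG here.
t1 = spark.read.table("table1")
t2 = spark.read.table("table2")
joined = t1.join(t2, on="id", how="inner")   # "id" is an assumed join key

# Optional: inspect the plan the Driver has built.
joined.explain()

# The action: the Driver splits the DAG into stages and tasks
# and sends them to the Executors.
joined.write.mode("overwrite").parquet("s3://example-bucket/output/")
```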
4. Worker Nodes (The Workforce)
Worker Node 1
e.g., An AWS EC2 instance (m5.4xlarge) with 16 vCPU & 64 GiB RAM.
Executor
A process that executes tasks on data partitions.
Task: "Read partition #5 of table1 and filter for filter_column = filter_value." (See the code sketch after this section.)
Another Executor (on the same node)
Worker Nodes 2 ... N
Additional worker nodes with similar configuration, each running additional Executors that process different data partitions.
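The per-partition tasks in the Worker Node 1 example come from ordinary DataFrame code: Spark turns a single filter into as many tasks as there are partitions and spreads them across the Executors. A minimal sketch, reusing the placeholder column and value from the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-job").getOrCreate()

# Written once on the Driver...
df = spark.read.table("table1")
filtered = df.filter(F.col("filter_column") == "filter_value")

# ...executed as one task per partition on the Executors.
# "Partition #5" in the example above is just one of these.
print(df.rdd.getNumPartitions())  # how many tasks a full scan would produce

# The action that actually triggers the tasks (placeholder output path).
filtered.write.mode("overwrite").parquet("s3://example-bucket/filtered/")
```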
5. Storage
Where the data lives permanently. Executors read from and write to this storage.
e.g., Amazon S3, Google Cloud Storage, HDFS, Hive Tables, databases.
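From the application's point of view, these systems are just sources and sinks addressed by a table name or a path; the Executors do the actual reading and writing in parallel. A short sketch with placeholder paths, assuming Hive support and the relevant S3/GCS connectors are available on the cluster:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-job")
    .enableHiveSupport()   # needed to read Hive metastore tables
    .getOrCreate()
)

# Reads: each Executor pulls only the files/partitions assigned to its tasks.
hive_df = spark.read.table("table1")                       # Hive table
s3_df = spark.read.parquet("s3://example-bucket/input/")   # Amazon S3
gcs_df = spark.read.parquet("gs://example-bucket/input/")  # Google Cloud Storage

# Writes: Executors write their output partitions in parallel; the Driver coordinates.
hive_df.write.mode("overwrite").parquet("s3://example-bucket/output/")
```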