1. What is Apache Spark and how does it work internally?
Apache Spark is an open-source, distributed computing engine designed for large-scale data processing. It keeps intermediate results in memory wherever possible, which makes it significantly faster than Hadoop MapReduce, especially for iterative and interactive workloads.
How it works internally:
Spark applications are divided into jobs, which are split into stages, and each stage contains multiple tasks.
Spark constructs a Directed Acyclic Graph (DAG) of all transformations.
The driver program coordinates the workflow.
The cluster manager allocates resources.
Executors (JVMs on worker nodes) run tasks and store intermediate data.
2. What are jobs, stages, and tasks in Spark?
| Concept | Description |
|---------|-------------|
| Job | Triggered by an action (e.g., collect, count). Represents a full computation. |
| Stage | Subdivision of a job. Created at shuffle boundaries. |
| Task | The smallest unit of work. One task processes one partition. |
3. What is lazy evaluation in Spark?
Lazy evaluation means Spark does not execute transformations immediately. Instead, it builds a logical DAG of transformations.
Execution happens only when an action (e.g., count(), collect(), saveAsTextFile()) is called.
Benefits:
Enables query optimization (via the Catalyst optimizer for DataFrames/SQL).
Avoids unnecessary computation.
Allows fault tolerance using lineage.
4. What is data lineage in Spark?
Data lineage is the record of transformations applied to a dataset (RDD or DataFrame). It forms a DAG that tracks dependencies between datasets.
Uses:
Provides fault tolerance (lost partitions can be recomputed).
Enables lazy evaluation and stage planning.
Helps Spark rebuild results without full replication.
Example:
rdd1 → map → filter → reduce forms a lineage chain Spark can replay if data is lost.
5. What is the difference between narrow and wide transformations?
| Transformation Type | Description | Example |
|---------------------|-------------|---------|
| Narrow | Each partition of the parent RDD is used by at most one child partition. No shuffle. | map, filter, flatMap |
| Wide | Data from multiple parent partitions is needed to form a child partition. Causes a shuffle. | groupByKey, reduceByKey, join |
Wide transformations create stage boundaries since they require data movement (shuffle).
6. What are stage boundaries and how are they formed?
Stage boundaries occur wherever Spark needs to perform a shuffle operation (wide transformation).
Stages are divided based on narrow vs. wide dependencies.
Example:
map + filter = one stage (narrow)
reduceByKey triggers a shuffle = new stage boundary
7. What is a Spark job lifecycle?
Job submission: User triggers an action.
DAG creation: Logical plan built from transformations.
Stage division: Based on shuffle boundaries.
Task scheduling: Tasks assigned to executors.
Execution: Tasks run, shuffle if needed.
Result collection: Results sent back or saved to storage.
Completion: Executors released, job marked as finished.
8. What is the relationship between cluster, driver, and executors?
| Component | Role |
|-----------|------|
| Cluster | Group of worker nodes managed by a cluster manager (e.g., YARN, Kubernetes). |
| Driver | Main program that builds the DAG, coordinates jobs, and schedules tasks. |
| Executor | JVM process running on a worker node. Executes tasks, stores data, and reports status to the driver. |