Apache Spark

Apache Spark Interview Questions and Answers


1. What is Apache Spark and how does it work internally?

Apache Spark is an open-source, distributed computing engine designed for large-scale data processing. It keeps intermediate data in memory, which makes it significantly faster than disk-based Hadoop MapReduce for iterative and interactive workloads.

How it works internally:

  • Spark applications are divided into jobs, which are split into stages, and each stage contains multiple tasks.

  • Spark constructs a Directed Acyclic Graph (DAG) of all transformations.

  • The driver program coordinates the workflow.

  • The cluster manager allocates resources.

  • Executors (JVMs on worker nodes) run tasks and store intermediate data.


2. What are jobs, stages, and tasks in Spark?

  • Job: Triggered by an action (e.g., collect, count). Represents a full computation.

  • Stage: Subdivision of a job, created at shuffle boundaries.

  • Task: The smallest unit of work. One task = one partition.

Execution flow: Action → Job → Stages → Tasks → Executors.
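The flow above can be sketched in plain Python. This is a toy model, not Spark's actual scheduler, and all names are hypothetical:

```python
# Toy sketch (not Spark internals): split a plan of transformations into
# stages at shuffle boundaries, then fan each stage out into per-partition tasks.
WIDE = {"reduceByKey", "groupByKey", "join"}  # wide transformations force a shuffle

def plan_stages(ops, num_partitions):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE:              # a wide op closes the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    # one task per partition per stage
    tasks = [(i, p) for i in range(len(stages)) for p in range(num_partitions)]
    return stages, tasks

stages, tasks = plan_stages(["map", "filter", "reduceByKey", "map"], num_partitions=4)
print(stages)      # [['map', 'filter', 'reduceByKey'], ['map']]
print(len(tasks))  # 8 tasks: 2 stages x 4 partitions
```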


3. What is lazy evaluation?

Lazy evaluation means Spark does not execute transformations immediately. Instead, it builds a logical DAG of transformations. Execution happens only when an action (e.g., count(), collect(), saveAsTextFile()) is called.

Benefits:

  • Enables query optimization (Catalyst).

  • Avoids unnecessary computation.

  • Allows fault tolerance using lineage.
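The idea can be illustrated with a tiny pure-Python sketch (a toy model, not Spark's implementation): transformations only record themselves, and nothing runs until an action is called.

```python
# Toy sketch of lazy evaluation: transformations record a plan; work happens
# only when an action such as collect() runs.
class LazyDataset:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, tuple(ops)

    def map(self, f):                      # transformation: just records the op
        return LazyDataset(self.data, self.ops + (("map", f),))

    def filter(self, f):                   # transformation: just records the op
        return LazyDataset(self.data, self.ops + (("filter", f),))

    def collect(self):                     # action: replays the recorded plan
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has executed yet; ds.ops is just the recorded plan (2 ops).
print(ds.collect())  # [20, 30, 40]
```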


4. What is data lineage in Spark?

Data lineage is the record of transformations applied to a dataset (RDD or DataFrame). It forms a DAG that tracks dependencies between datasets.

Uses:

  • Provides fault tolerance (lost partitions can be recomputed).

  • Enables lazy evaluation and stage planning.

  • Helps Spark rebuild results without full replication.

Example: rdd1 → map → filter forms a lineage chain that Spark can replay to rebuild a lost partition.
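Lineage-based recovery can be sketched in plain Python (a toy model, not Spark internals): if a partition is lost, replay the recorded transformations from the source data.

```python
# Toy sketch: the lineage is just the ordered list of transformations.
lineage = [
    ("map", lambda x: x + 1),
    ("filter", lambda x: x % 2 == 0),
]

def recompute(source_partition, lineage):
    out = source_partition
    for kind, f in lineage:
        out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
    return out

# Partition "lost" in memory -> rebuild it from source + lineage,
# with no need for full data replication.
print(recompute([1, 2, 3, 4], lineage))  # [2, 4]
```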


5. What is the difference between narrow and wide transformations?

  • Narrow: Each partition of the parent RDD is used by at most one child partition. No shuffle. Examples: map, filter, flatMap.

  • Wide: Data from multiple parent partitions is needed to form a child partition. Causes a shuffle. Examples: groupByKey, reduceByKey, join.

Wide transformations create stage boundaries since they require data movement (shuffle).
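The dependency difference can be shown with hypothetical helpers (a toy Python model, not Spark APIs): a narrow transformation touches each partition independently, while a wide one must gather rows by key from every parent partition.

```python
# Toy sketch: narrow vs. wide dependencies over lists-of-partitions.
def narrow_map(partitions, f):
    # each child partition depends on exactly one parent partition
    return [[f(x) for x in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    # each child partition may need rows from ALL parent partitions (shuffle)
    out = [dict() for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out].setdefault(key, []).append(value)
    return out

parts = [[("a", 1), ("b", 2)], [("a", 3)]]
print(narrow_map(parts, lambda kv: (kv[0], kv[1] * 2)))
# [[('a', 2), ('b', 4)], [('a', 6)]]
grouped = wide_group_by_key(parts, num_out=2)
# All values for key 'a' now live in a single output partition.
```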


6. What are stage boundaries and how are they formed?

Stage boundaries occur wherever Spark needs to perform a shuffle operation (wide transformation). Stages are divided based on narrow vs. wide dependencies.

Example:

  • map + filter = one stage (narrow)

  • reduceByKey triggers a shuffle = new stage boundary


7. What is the Spark job lifecycle?

  1. Job submission: User triggers an action.

  2. DAG creation: Logical plan built from transformations.

  3. Stage division: Based on shuffle boundaries.

  4. Task scheduling: Tasks assigned to executors.

  5. Execution: Tasks run, shuffle if needed.

  6. Result collection: Results sent back or saved to storage.

  7. Completion: Executors released, job marked as finished.


8. What is the relationship between cluster, driver, and executors?

  • Cluster: Group of worker nodes managed by a cluster manager (YARN, Kubernetes).

  • Driver: Main program that builds the DAG, coordinates jobs, and schedules tasks.

  • Executor: JVM process running on worker nodes. Executes tasks, stores data, and reports status to the driver.

Flow: Driver → Cluster Manager → Executors → Tasks.


9. What is the lazy execution flow (DAG, stages, tasks)?

  1. Transformations build a logical DAG of dependencies.

  2. When an action is called, Spark:

    • Converts the DAG into stages (based on shuffle).

    • Splits stages into tasks (per partition).

    • Schedules tasks to executors.

  3. Executors execute tasks and report back to the driver.


10. What is shuffling in Spark and why do we need it?

Shuffling is the process of redistributing data across partitions or nodes, usually after wide transformations.

Why needed:

To group or join data with the same key (e.g., in reduceByKey, join, groupBy).

Why expensive:

  • Network I/O (data moved across executors)

  • Disk I/O (shuffle writes/reads)

  • Serialization overhead
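The two halves of a shuffle can be sketched in plain Python (a toy model, not Spark internals): map tasks write one bucket per reducer, then each reducer fetches its bucket from every map task.

```python
# Toy sketch of shuffle write + shuffle read.
def shuffle(map_outputs, num_reducers):
    # "shuffle write": every map task partitions its records by key hash
    writes = []
    for records in map_outputs:
        buckets = [[] for _ in range(num_reducers)]
        for key, value in records:
            buckets[hash(key) % num_reducers].append((key, value))
        writes.append(buckets)
    # "shuffle read": reducer r pulls bucket r from every map task
    # (this cross-task fetch is the network and disk I/O that makes shuffles costly)
    return [[kv for buckets in writes for kv in buckets[r]] for r in range(num_reducers)]

reducer_inputs = shuffle([[("a", 1), ("b", 1)], [("a", 2)]], num_reducers=2)
# All records for a given key now sit in exactly one reducer's input.
```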


11. How do you handle data skew in Spark?

Data skew occurs when some partitions contain much more data than others. It causes stragglers and uneven executor workloads.

Fixes:

  • Broadcast joins: avoid shuffling large tables.

  • Salting keys: add random prefixes to heavy keys.

  • Map-side combine: reduce data before shuffle.

  • AQE (Adaptive Query Execution): auto-split skewed partitions.

  • Repartition by key: redistribute data more evenly.
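Salting can be sketched in pure Python (a toy model with hypothetical helper names, not a Spark API): spread one hot key across N sub-keys before aggregating, then strip the salt and combine the partial results.

```python
import random

# Toy sketch of key salting for a skewed aggregation.
def salt_key(key, num_salts):
    return f"{key}_{random.randrange(num_salts)}"

def aggregate_with_salting(records, num_salts):
    # phase 1: aggregate on salted keys -> the hot key's work is split N ways
    partial = {}
    for key, value in records:
        sk = salt_key(key, num_salts) if key == "hot" else key  # salt only heavy keys
        partial[sk] = partial.get(sk, 0) + value
    # phase 2: strip the salt and combine the partial results
    final = {}
    for sk, value in partial.items():
        base = sk.rsplit("_", 1)[0] if sk.startswith("hot_") else sk
        final[base] = final.get(base, 0) + value
    return final

records = [("hot", 1)] * 1000 + [("rare", 5)]
print(aggregate_with_salting(records, num_salts=4))  # {'hot': 1000, 'rare': 5}
```

In real Spark, phase 1 and phase 2 would be two aggregations with a shuffle between them; the point is that the hot key no longer lands on a single task.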


12. What happens internally during a broadcast join?

  • The small dataset is broadcasted to all executors.

  • Each executor loads it into memory (hash map).

  • The large dataset is processed locally; each executor performs the join without shuffling data.

Benefits:

  • No shuffle of large dataset.

  • One stage instead of multiple.

  • Ideal when one table < spark.sql.autoBroadcastJoinThreshold (default 10 MB).
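The mechanics can be sketched in plain Python (a toy model, not Spark internals): the small table is copied to every "executor", and each large-table partition joins against it locally.

```python
# Toy sketch of a broadcast hash join: no shuffle of the large side.
def broadcast_join(large_partitions, small_table):
    broadcast = dict(small_table)          # built once, shipped to every executor
    joined = []
    for partition in large_partitions:     # each partition is processed locally
        joined.append([
            (key, value, broadcast[key])
            for key, value in partition
            if key in broadcast            # inner join: drop unmatched keys
        ])
    return joined

large = [[("a", 1), ("c", 9)], [("b", 2)]]   # big fact table, 2 partitions
small = [("a", "apple"), ("b", "banana")]    # small dimension table
print(broadcast_join(large, small))
# [[('a', 1, 'apple')], [('b', 2, 'banana')]]
```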


13. What is the difference between coalesce() and repartition()?

  • coalesce(n): No shuffle. Merges existing partitions to reduce their number. Fast, but the resulting partitions may be unbalanced.

  • repartition(n): Full shuffle. Increases or decreases the partition count. Slower, but produces evenly balanced partitions.

Use coalesce() for writing fewer output files, and repartition() before large joins or aggregations.
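The trade-off can be sketched in plain Python (a toy model, not Spark internals): coalesce() folds existing partitions together without moving individual rows across the cluster, so skew survives; repartition() redistributes every row into fresh, even partitions.

```python
# Toy sketch: coalesce merges partitions; repartition redistributes rows.
def coalesce(partitions, n):
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):   # fold whole partitions together, no shuffle
        out[i % n].extend(part)
    return out

def repartition(rows, n):
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):          # every row may move: full shuffle
        out[i % n].append(row)
    return out

parts = [[1], [2, 3, 4, 5], [6], [7]]
print(coalesce(parts, 2))                   # [[1, 6], [2, 3, 4, 5, 7]] - still skewed
flat = [row for part in parts for row in part]
print(repartition(flat, 2))                 # [[1, 3, 5, 7], [2, 4, 6]] - balanced
```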


14. What is the Catalyst Optimizer in Spark?

Catalyst Optimizer is Spark SQL’s query optimization framework. It converts a user query into an efficient execution plan.

Phases:

  1. Analysis – resolve columns, tables, and types.

  2. Logical Optimization – apply algebraic rules (pushdown, pruning).

  3. Physical Planning – choose best join/sort strategy.

  4. Code Generation (WholeStageCodegen) – generate optimized JVM bytecode.

Goal: Minimize shuffle, scan, and computation cost.
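One logical-optimization rule can be sketched in plain Python (a toy model, not Catalyst itself; the tuple-based plan format is invented for illustration): push a Filter below a Project so rows are dropped before columns are computed.

```python
# Toy sketch of predicate pushdown over a plan of (op, child, payload) tuples.
# Valid only when the predicate uses columns that exist below the Project.
def push_filter_below_project(plan):
    if plan[0] == "Filter" and plan[1][0] == "Project":
        filter_op, (project_op, child, cols), pred = plan[0], plan[1], plan[2]
        return (project_op, (filter_op, child, pred), cols)
    return plan  # rule does not apply; leave the plan unchanged

plan = ("Filter", ("Project", ("Scan", None, "t"), ["a", "b"]), "a > 5")
optimized = push_filter_below_project(plan)
print(optimized)
# ('Project', ('Filter', ('Scan', None, 't'), 'a > 5'), ['a', 'b'])
```

Catalyst applies many such rules repeatedly until the plan stops changing, which is why `df.explain()` often shows a plan quite different from the query you wrote.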


15. What is the difference between RDD and DataFrame?

  • Abstraction level: RDD is a low-level distributed collection of objects; DataFrame is a high-level table with a schema.

  • Schema: RDD has none; DataFrame has one (StructType).

  • Optimization: RDD logic is optimized manually; DataFrame queries are optimized automatically (Catalyst).

  • Performance: RDD is slower; DataFrame is faster (Tungsten, codegen).

  • Ease of use: RDD code is complex; the DataFrame API is simple and SQL-like.

  • Type safety: RDD is checked at compile time (Scala/Java); DataFrame only at runtime.

  • Best for: RDD suits unstructured data and custom logic; DataFrame suits structured, analytical workloads.


16. What is Tungsten in Spark?

Tungsten is Spark’s execution engine optimization layer that improves memory and CPU efficiency.

Key features:

  • Binary memory format (UnsafeRow): stores rows compactly, reduces GC overhead.

  • Off-heap memory management: avoids JVM GC pauses.

  • WholeStageCodegen: fuses multiple operators into a single compiled function.

  • Vectorized processing: executes batches of rows to leverage CPU cache and SIMD.

Goal: Run Spark computations as close to hardware speed as possible.
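The benefit of a binary row format can be sketched in plain Python (a toy model of the idea behind UnsafeRow, not Spark's actual layout):

```python
import struct
import sys

# Toy sketch: pack a row of two ints into 8 raw bytes instead of boxed objects.
def pack_row(a, b):
    return struct.pack("<ii", a, b)     # fixed-width binary layout, no object headers

def unpack_row(row):
    return struct.unpack("<ii", row)

row = pack_row(42, 7)
print(len(row))                         # 8 bytes for the whole row
print(unpack_row(row))                  # (42, 7)
# A tuple of two boxed ints costs far more than 8 bytes and adds GC pressure:
print(sys.getsizeof((42, 7)) > len(row))  # True
```

Because the layout is fixed, fields can be read by offset without deserializing the whole row, which is the same property that lets Tungsten compare and hash rows directly on their binary form.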

