Apache Hudi

🧩 Apache Hudi — Interview Questions & Answers


🔹 1️⃣ What is Apache Hudi and why is it used?

Answer: Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a data lake framework that brings ACID transactions, record-level upserts/deletes, and incremental ingestion to data lakes on object stores like S3, GCS, or HDFS.

It bridges the gap between:

  • Data warehouses (ACID, updates, fast queries)

  • Data lakes (scalable, cheap storage, flexible schema)

Use cases:

  • Change Data Capture (CDC) ingestion

  • Incremental ETL pipelines

  • GDPR deletions (right-to-be-forgotten)

  • Near real-time data serving

✅ Hudi makes your data lake behave like a database.


🔹 2️⃣ What are Hudi’s main components?

| Component | Role |
| --- | --- |
| Write Client | Handles insert, upsert, delete, and compaction operations. |
| Storage Layer | Organizes files and metadata on disk (S3/HDFS). |
| Timeline Service | Tracks all commits and actions on the dataset (provides ACID). |
| Query Engine Layer | Enables reads via Spark, Presto, Hive, Trino, etc. |
| Metadata Table | Stores file listings and stats to speed up queries and listing operations. |


🔹 3️⃣ What problems does Hudi solve?

| Problem | Hudi Solution |
| --- | --- |
| Immutable data lakes → can’t update/delete records | Record-level upserts and deletes |
| Expensive full reloads | Incremental queries |
| Schema changes | Built-in schema evolution |
| Eventual consistency on S3 | ACID transactions via the commit timeline |
| Long ingestion delays | Streaming ingestion with Spark/Flink |
| Slow S3 LIST operations | Metadata Table optimization |


🔹 4️⃣ What are Hudi’s table types?

| Table Type | Description | Use Case |
| --- | --- | --- |
| Copy-On-Write (COW) | Data is rewritten to new Parquet files on update. | Read-heavy batch workloads. |
| Merge-On-Read (MOR) | Writes changes to Avro log files, later compacted into Parquet. | Write-heavy streaming workloads. |

Example:

  • COW: nightly ETL batches → dashboards.

  • MOR: CDC stream → near-real-time analytics.
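The table type is chosen at write time. As a hedged sketch, the Spark DataSource options below use standard Hudi config keys, but the table name and field names (`orders`, `order_id`, `updated_at`, `order_date`) are hypothetical:

```python
# Illustrative write options selecting a Hudi table type via the Spark
# DataSource API. Config keys are standard Hudi; names are hypothetical.
def hudi_write_options(table_type: str) -> dict:
    """Return DataSource options for a COW or MOR Hudi table."""
    assert table_type in ("COPY_ON_WRITE", "MERGE_ON_READ")
    return {
        "hoodie.table.name": "orders",                        # hypothetical
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.partitionpath.field": "order_date",
    }

cow = hudi_write_options("COPY_ON_WRITE")   # read-heavy batch
mor = hudi_write_options("MERGE_ON_READ")   # write-heavy streaming

# With Spark this would be applied roughly as:
#   df.write.format("hudi").options(**mor).mode("append").save("s3://bucket/orders")
```

Once set, the table type is a property of the table itself and is not switched per write.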


🔹 5️⃣ What are the query types in Hudi?

| Query Type | Description | Use Case |
| --- | --- | --- |
| Snapshot Query | Reads the latest committed view (base + logs). | Real-time analytics. |
| Read Optimized Query | Reads only base Parquet files (no merge). | Fast batch queries. |
| Incremental Query | Reads only new data since a specific commit. | Incremental ETL pipelines. |


🔹 6️⃣ How does Hudi ensure ACID transactions on S3?

Hudi maintains a timeline (in .hoodie/) where every operation (commit, compaction, rollback) is an instant with a timestamp.

  • Writes are first staged as an inflight instant.

  • On a successful commit, the timeline is atomically updated.

  • Readers see only completed instants, which provides snapshot isolation.

  • If a failure occurs, the instant is rolled back.

✅ This gives atomic, consistent, isolated, durable (ACID) semantics even on eventually consistent stores like S3.
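The commit protocol above can be sketched as a toy state machine. This is an illustrative simulation of the visibility rule, not Hudi's actual implementation:

```python
# Toy simulation of Hudi's timeline-based commit protocol: readers only see
# files belonging to completed instants, so a failed write stays invisible.
class ToyTimeline:
    def __init__(self):
        self.completed = []   # committed instant timestamps, in order
        self.files = {}       # instant -> data files written by that instant

    def begin(self, instant, files):
        # "inflight": data is staged but not yet visible to readers
        self.files[instant] = files

    def commit(self, instant):
        # atomic publish: a single timeline update flips visibility
        self.completed.append(instant)

    def rollback(self, instant):
        # failed write: staged files are discarded, never seen by readers
        self.files.pop(instant, None)

    def snapshot(self):
        # readers assemble their view from completed instants only
        return [f for i in self.completed for f in self.files[i]]

t = ToyTimeline()
t.begin("001", ["a.parquet"]); t.commit("001")
t.begin("002", ["b.parquet"])            # writer crashes before commit
t.rollback("002")
assert t.snapshot() == ["a.parquet"]     # only instant 001 is visible
```

The key idea: visibility hinges on one atomic timeline update, so partially written data on S3 can never leak into a reader's snapshot.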


🔹 7️⃣ What are “instants” in Hudi?

Instants represent all actions in the dataset’s timeline:

| Type | Example | Purpose |
| --- | --- | --- |
| commit | 20250101123000.commit | Insert/upsert completed. |
| deltacommit | 20250101124000.deltacommit | MOR delta log write. |
| compaction | 20250101130000.compaction | Merge logs → Parquet. |
| clean | — | Delete obsolete file versions. |
| rollback | — | Undo a failed commit. |
| savepoint | — | Mark the dataset for rollback safety. |

Each instant moves through the states requested → inflight → completed.
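These lifecycle states show up as file names on the timeline under `.hoodie/`, with completed instants dropping the state suffix. The naming sketch below matches the common pattern to my knowledge, but treat it as illustrative:

```python
# Sketch of how instant lifecycle states appear as timeline file names
# (completed instants drop the state suffix). Illustrative naming.
def timeline_filename(ts: str, action: str, state: str) -> str:
    if state == "completed":
        return f"{ts}.{action}"
    return f"{ts}.{action}.{state}"

assert timeline_filename("20250101123000", "commit", "requested") == \
    "20250101123000.commit.requested"
assert timeline_filename("20250101123000", "commit", "inflight") == \
    "20250101123000.commit.inflight"
assert timeline_filename("20250101123000", "commit", "completed") == \
    "20250101123000.commit"
```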


🔹 8️⃣ What write operations does Hudi support?

| Operation | Description |
| --- | --- |
| insert | Add new records. |
| upsert | Insert or update existing records (via index). |
| bulk_insert | High-speed initial load (no index lookup). |
| insert_overwrite | Replace an entire dataset or partition. |
| delete | Remove records by key. |
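The operation is selected per write through a single config key. A minimal sketch, using standard Hudi config keys but a hypothetical table (`users`) and fields:

```python
# Illustrative Spark DataSource options selecting a Hudi write operation.
# The config keys are standard Hudi; table/field names are hypothetical.
VALID_OPERATIONS = {"insert", "upsert", "bulk_insert", "insert_overwrite", "delete"}

def write_options(operation: str) -> dict:
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unknown Hudi write operation: {operation}")
    return {
        "hoodie.table.name": "users",                         # hypothetical
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": "user_id",
        "hoodie.datasource.write.precombine.field": "ts",     # latest wins
    }

opts = write_options("upsert")
# df.write.format("hudi").options(**opts).mode("append").save(path)
```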


🔹 9️⃣ What is the difference between COW and MOR?

| Aspect | COW | MOR |
| --- | --- | --- |
| Storage | Only Parquet | Parquet + Avro logs |
| Write performance | Slower (rewrites entire files) | Faster (appends to logs) |
| Read performance | Faster (no merge) | Slightly slower (merge needed) |
| Use case | Batch, read-heavy | Streaming, write-heavy |


🔹 🔟 How does Hudi handle schema evolution?

  • Uses Avro schema stored in the .hoodie metadata.

  • New columns → allowed (added as nullable).

  • Dropping columns → not recommended (need rewrite).

  • Backward and forward compatibility handled automatically.

✅ Safe schema evolution = crucial for evolving pipelines.
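The compatibility rules above can be illustrated with a toy check in the spirit of Hudi's Avro-based evolution: added fields must be nullable (or defaulted), and existing fields must survive. This is a simplified sketch, not Hudi's actual validator:

```python
# Toy backward-compatibility check: new fields are fine if nullable;
# dropping fields breaks readers of old data. Illustrative only.
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    # every old field must still exist in the new schema
    if not set(old_fields) <= set(new_fields):
        return False
    # every newly added field must be nullable
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f]["nullable"] for f in added)

old = {"id": {"nullable": False}, "name": {"nullable": True}}
ok  = {**old, "email": {"nullable": True}}   # add a nullable column: fine
bad = {"id": {"nullable": False}}            # drop a column: breaks readers

assert is_backward_compatible(old, ok)
assert not is_backward_compatible(old, bad)
```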


🔹 11️⃣ What indexing mechanisms does Hudi use?

| Index Type | Description |
| --- | --- |
| Bloom Index (default) | Uses bloom filters of record keys stored in Parquet file footers. |
| Global Bloom Index | Enforces key uniqueness across all partitions. |
| Simple Index | Joins incoming keys against keys read from existing files; predictable for smaller tables. |
| HBase Index | External index for ultra-large scale. |
| Metadata Table Index | Modern, fast index layer for file and record lookups. |
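The point of the bloom index is file pruning during upserts: only files whose filter might contain a key are opened. In this toy illustration, exact per-file key sets stand in for the probabilistic bloom filters Hudi stores in Parquet footers:

```python
# Toy illustration of bloom-style index pruning during an upsert.
# Real bloom filters can return false positives, which are then resolved
# by actually reading the candidate file; exact sets are used here.
file_key_sets = {                      # file -> keys it contains (toy "filter")
    "part-0001.parquet": {"k1", "k2"},
    "part-0002.parquet": {"k3"},
}

def candidate_files(record_key: str) -> list:
    """Return files that might contain the record key."""
    return [f for f, keys in file_key_sets.items() if record_key in keys]

assert candidate_files("k3") == ["part-0002.parquet"]
assert candidate_files("k9") == []     # unseen key -> routed to insert path
```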


🔹 12️⃣ What is the Metadata Table and why is it important?

The Metadata Table (stored under .hoodie/metadata/) caches:

  • File listings

  • Bloom filters

  • Column stats

Benefits:

  • Avoids expensive S3 LIST operations (reduces latency by 10–100×).

  • Speeds up query planning and compaction.


🔹 13️⃣ What is compaction in Hudi?

  • Applies only to MOR tables.

  • Merges small delta log files into larger Parquet base files.

  • Reduces read latency and small file count.

  • Can be inline (during write) or asynchronous (background job).
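Conceptually, compaction is a per-key merge where the newest version wins. A toy sketch of that merge (the real implementation operates on Parquet base files and Avro log blocks):

```python
# Toy MOR compaction: merge a base file's records with newer delta-log
# records, keeping the latest version of each key. Illustrative only.
def compact(base: list, log: list) -> list:
    """base/log are lists of (key, ts, value); the latest ts wins per key."""
    merged = {}
    for key, ts, value in base + log:
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return sorted((k, ts, v) for k, (ts, v) in merged.items())

base = [("u1", 1, "a"), ("u2", 1, "b")]
log  = [("u2", 2, "b2"), ("u3", 2, "c")]   # update u2, insert u3
assert compact(base, log) == [("u1", 1, "a"), ("u2", 2, "b2"), ("u3", 2, "c")]
```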


🔹 14️⃣ What are cleaning and archival in Hudi?

| Process | Purpose |
| --- | --- |
| Cleaning | Deletes obsolete or unreferenced file versions. |
| Archival | Moves older commits from .hoodie/timeline to .hoodie/archive to reduce metadata overhead. |


🔹 15️⃣ How does Hudi support incremental data ingestion?

Hudi tracks commits via the timeline. You can read only the new data since a given instant:
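As a hedged sketch, the Spark read options below use standard Hudi config keys; the commit timestamp and path are placeholders:

```python
# Illustrative Spark read options for a Hudi incremental query.
incremental_opts = {
    "hoodie.datasource.query.type": "incremental",
    # read only commits after this instant (placeholder timestamp)
    "hoodie.datasource.read.begin.instanttime": "20250101123000",
}

# With Spark:
#   df = (spark.read.format("hudi")
#               .options(**incremental_opts)
#               .load("s3://bucket/orders"))     # hypothetical path
```

Downstream jobs typically persist the last consumed instant as a checkpoint and pass it as `begin.instanttime` on the next run.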

✅ This is the foundation of incremental ETL pipelines.


🔹 16️⃣ Which engines integrate with Hudi?

| Engine | Integration | Description |
| --- | --- | --- |
| Spark | DataSource API (format("hudi")) | Supports batch and streaming writes/reads. |
| Flink | Native connector | Real-time ingestion with exactly-once semantics. |
| Presto / Trino | Read connector | Fast analytical queries. |
| Hive | Sync via Hive Sync Tool | Enables Hive table queries on Hudi data. |


🔹 17️⃣ What is clustering in Hudi?

Clustering reorganizes existing data (for example, sorting by frequently filtered columns and coalescing small files) to optimize layout and file sizes for query performance, without changing record contents.
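A minimal sketch of enabling inline clustering. The config keys below are standard Hudi clustering options to my knowledge (verify against your Hudi version's docs); the sort columns are hypothetical:

```python
# Illustrative inline-clustering options; sort columns are hypothetical.
clustering_opts = {
    "hoodie.clustering.inline": "true",
    # trigger a clustering plan every N commits
    "hoodie.clustering.inline.max.commits": "4",
    # sort by commonly filtered columns to improve data skipping
    "hoodie.clustering.plan.strategy.sort.columns": "event_date,user_id",
}
# Passed alongside the usual write options:
#   df.write.format("hudi").options(**write_opts, **clustering_opts)...
```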


🔹 18️⃣ How do you tune Hudi performance?

| Tuning Area | Parameter / Strategy |
| --- | --- |
| Parallelism | hoodie.upsert.shuffle.parallelism, hoodie.insert.shuffle.parallelism |
| Compaction | Schedule asynchronously; tune compaction parallelism |
| File sizing | hoodie.parquet.max.file.size, hoodie.parquet.small.file.limit |
| Metadata Table | Enable for large datasets |
| Bloom filter | Skip index lookups for purely append-only workloads |
| Bulk insert | Use for the initial data load (no index lookup) |


🔹 19️⃣ How is Hudi different from Delta Lake and Iceberg?

| Feature | Hudi | Delta Lake | Iceberg |
| --- | --- | --- | --- |
| Design focus | Streaming + incremental ETL | DWH-like batch reliability | Large-scale metadata management |
| Upsert/Delete | ✅ Native | ✅ Native | ✅ Native |
| Table types | COW, MOR | Single table type | Single table type |
| Incremental reads | ✅ Yes | ✅ Limited | ✅ Yes |
| Metadata management | Metadata Table | Delta Log | Manifest lists |
| Integrations | Spark, Flink, Hive | Spark | Spark, Flink, Trino |
| Best for | Streaming CDC pipelines | Batch ETL, BI | Federated query lakes |


🔹 20️⃣ Real-world Use Case Example

You’re processing CDC data from Kafka to S3. Each event is either an insert, update, or delete.

Solution:

  • Ingest CDC stream via Spark Structured Streaming → Hudi upsert.

  • Maintain latest state table in MOR.

  • Query real-time dashboard via Snapshot reads.

  • Downstream batch ETL reads incrementally since last checkpoint.
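The pipeline above can be sketched as follows. The Hudi option keys are standard configs; the table name, record/precombine fields, Kafka topic, and S3 paths are hypothetical, and the Spark streaming wiring is shown in comments as a rough outline:

```python
# Hedged sketch of the CDC pipeline: Kafka -> Spark Structured Streaming
# -> Hudi MOR upserts. Names and paths are hypothetical.
hudi_opts = {
    "hoodie.table.name": "customers_cdc",                      # hypothetical
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "event_ts",    # latest wins
}

# Roughly, with Spark Structured Streaming:
#   (spark.readStream.format("kafka")
#         .option("subscribe", "customers_cdc")      # hypothetical topic
#         .load()
#         # ...parse the CDC payload into columns...
#         .writeStream.format("hudi")
#         .options(**hudi_opts)
#         .option("checkpointLocation", "s3://bucket/checkpoints/customers")
#         .start("s3://bucket/hudi/customers_cdc"))
```

Deletes from the CDC feed can be routed through the same path by issuing Hudi `delete` writes for tombstone events.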


✅ Summary

| Concept | Description |
| --- | --- |
| Timeline | ACID metadata for commits & actions |
| Instants | Individual commit events |
| Table Types | COW & MOR |
| Query Types | Snapshot, Read Optimized, Incremental |
| Compaction | Merge delta logs (MOR only) |
| Metadata Table | Index + file listing optimization |
| Incremental Query | Reads new data since a given commit |
| Indexing | Bloom / Metadata / Global |
| Clustering | Layout optimization for performance |


In summary:

Apache Hudi is a streaming data lake framework providing ACID, incremental processing, and upserts/deletes over object stores, designed for real-time data engineering pipelines. It integrates deeply with Spark and Flink, making it ideal for building incremental, low-latency, and fault-tolerant data lakes.
