Apache Parquet

📘 Apache Parquet: Deep-Dive Q&A


1. What is a row group in Parquet, and why does its size matter?

Answer: A row group is the fundamental horizontal partition of data in a Parquet file. Each row group contains a set of rows, and within it, data is stored column-by-column as column chunks.

  • Typical Size: 128MB–1GB (configurable).

  • Why It Matters:

    • Larger row groups:

      • Fewer files to manage.

      • Better compression (more data similarity).

      • More efficient for large scans.

    • Smaller row groups:

      • Faster point-lookups (less I/O to scan a single row).

      • But many small row groups inflate metadata and per-group overhead (and, when each file holds only one group, lead to the classic small-files problem).

👉 The "sweet spot" is usually ~128–512MB per row group for balanced performance.


2. What are column chunks and pages in Parquet? How do they affect read performance?

Answer: Inside a row group:

  • Column Chunk:

    • The storage of all values for a given column in that row group.

    • Example: user.age column chunk stores all ages for rows in this row group.

  • Pages:

    • Column chunks are further split into pages (default ~1MB each).

    • Pages are the smallest read/write unit in Parquet.

    • Types of pages: data pages, dictionary pages, index pages.

Effect on Performance:

  • Queries can skip pages using page-level statistics (min/max).

  • Smaller pages → more fine-grained skipping, but more metadata overhead.

  • Larger pages → fewer seeks, better sequential reads, but less selective skipping.


3. What metadata does the Parquet footer store, and how does it enable predicate pushdown?

Answer: Every Parquet file ends with a footer, which stores critical metadata:

  • Schema: column names, types, nesting info.

  • Row Group Information: number of rows, byte offsets, column chunk sizes.

  • Column Statistics: min, max, null count per column, per row group/page.

  • Encoding/Compression Info: which codec/encoding used.

How Predicate Pushdown Works:

  • Query engine (Spark/Presto) reads footer first.

  • Example query: a filter such as WHERE age > 40.

  • The footer might show:

    • RowGroup1: age min=10, max=40 → skip.

    • RowGroup2: age min=41, max=90 → read.

👉 Predicate pushdown = only scan row groups/pages that could contain matches → massive I/O savings.


4. Can a Parquet file be read if its footer is missing?

Answer: No. Without the footer:

  • The schema is unknown.

  • Column chunk/page offsets are missing.

  • Statistics are gone, so predicate pushdown can't work.

The file becomes essentially unreadable by Parquet readers. Some recovery tools can attempt to "guess" schema from raw bytes, but correctness is not guaranteed.

👉 In distributed systems (e.g., Spark job killed mid-write), a file without a footer is considered corrupted.


5. How does Parquet achieve columnar storage on disk?

Answer: Parquet uses a hierarchical storage layout:

  1. Row Groups: Horizontal partitions of rows.

  2. Column Chunks: Within each row group, data is stored column-by-column.

  3. Pages: Each column chunk is split into smaller pages (~1MB).

  4. Footer: Stores metadata (schema, stats, offsets).


👉 This layout allows scanning only the needed columns (e.g., just age), and skipping irrelevant row groups/pages via metadata.


6. What is predicate pushdown?

Answer: Predicate pushdown = pushing filter conditions from the query engine down to the storage layer, so only relevant data is read.

  • Enabled by min/max statistics in Parquet metadata.

  • Example: for a query filtering on amount > 1000:

    If RowGroup stats show amount max=500, that group is skipped entirely.

👉 Predicate pushdown = "filter early, scan less." Huge performance win, especially on cloud object stores.


7. How does Parquet handle nested data?

Answer: Parquet uses Dremel encoding (from Google's Dremel paper) to support nested data like arrays, maps, and structs.

  • Flattening: Each nested attribute becomes a column (user.id, user.name, user.emails).

  • Repetition Level (RL): Tracks if a value belongs to a new parent or repeats under the same parent.

  • Definition Level (DL): Tracks if a field is defined or null at a given nesting depth.

Example records:

  { "id": 1, "emails": ["a@x.com", "b@y.com"] }
  { "id": 2, "emails": null }

Flattened with RL/DL:

| id | emails  | DL | RL |
|----|---------|----|----|
| 1  | a@x.com | 2  | 0  |
| 1  | b@y.com | 2  | 1  |
| 2  | null    | 1  | 0  |

👉 This preserves hierarchy so readers can reconstruct the original nested structure.


8. Can you explain the compression mechanisms in Parquet?

Answer: Parquet compresses in two layers:

  1. Encoding (logical compression)

    • Dictionary encoding: replaces values with integer IDs (best for low-cardinality columns).

    • Run-length encoding (RLE): compresses consecutive duplicates as (value, count).

    • Delta encoding: stores differences between consecutive values (great for timestamps).

    • Bit-packing: stores integers in minimal bits needed.

    • Boolean encoding: stores booleans as bitsets (8 values in 1 byte).

  2. Codec (physical compression)

    • Snappy: fast, lower ratio.

    • Gzip: slower, higher ratio.

    • Zstd: modern, tunable balance.

👉 Encodings shrink redundancy before the heavy codec is applied, making Parquet very efficient.


9. Why does Snappy have a lower compression ratio?

Answer: Snappy is designed for speed over size.

  • Snappy: byte-oriented LZ77-style matching with no entropy-coding stage.

  • Gzip/Zstd: deeper compression (Huffman/entropy coding, advanced match finding).

Tradeoff:

  • Snappy ratio: ~2–4x, but compress/decompress is 🚀 fast.

  • Gzip ratio: ~5–10x, but 🐢 slower.

  • Zstd: ~3–8x, tunable speed/ratio.

👉 In big-data systems (Spark, Hive, Hudi), decompression speed matters more than squeezing out extra bytes → that's why Snappy is the default.


✅ Summary

Parquet achieves its efficiency through:

  • Columnar layout (row groups → column chunks → pages)

  • Rich metadata in the footer (for predicate pushdown)

  • Nested data support (via repetition/definition levels)

  • Two-tier compression (encoding + codec)

  • Balanced tradeoffs in codecs (Snappy for speed, Gzip/Zstd for ratio)
