Apache Parquet

📘 Apache Parquet: Deep-Dive Q&A


1. What is a row group in Parquet, and why does its size matter?

Answer: A row group is the fundamental horizontal partition of data in a Parquet file. Each row group contains a set of rows, and within it, data is stored column-by-column as column chunks.

  • Typical Size: 128MB–1GB (configurable).

  • Why It Matters:

    • Larger row groups:

      • Fewer files to manage.

      • Better compression (more data similarity).

      • More efficient for large scans.

    • Smaller row groups:

      • Faster point-lookups (less I/O to scan a single row).

      • But many small row groups inflate metadata and per-group overhead (and, when each file holds only one group, lead to the classic small-files problem).

👉 The "sweet spot" is usually ~128–512MB per row group for balanced performance.


2. What are column chunks and pages in Parquet? How do they affect read performance?

Answer: Inside a row group:

  • Column Chunk:

    • The storage of all values for a given column in that row group.

    • Example: user.age column chunk stores all ages for rows in this row group.

  • Pages:

    • Column chunks are further split into pages (default ~1MB each).

    • Pages are the smallest read/write unit in Parquet.

    • Types of pages: data pages, dictionary pages, index pages.

Effect on Performance:

  • Queries can skip pages using page-level statistics (min/max).

  • Smaller pages → more fine-grained skipping, but more metadata overhead.

  • Larger pages → fewer seeks, better sequential reads, but less selective skipping.


3. What metadata does the Parquet footer store, and how does it enable predicate pushdown?

Answer: Every Parquet file ends with a footer, which stores critical metadata:

  • Schema: column names, types, nesting info.

  • Row Group Information: number of rows, byte offsets, column chunk sizes.

  • Column Statistics: min, max, null count per column, per row group/page.

  • Encoding/Compression Info: which codec/encoding used.

How Predicate Pushdown Works:

  • Query engine (Spark/Presto) reads footer first.

  • Example query: a filter such as WHERE age > 40.

  • The footer might show:

    • RowGroup1: age min=10, max=40 → skip.

    • RowGroup2: age min=41, max=90 → read.

👉 Predicate pushdown = only scan row groups/pages that could contain matches → massive I/O savings.


4. Can a Parquet file be read if its footer is missing?

Answer: No. Without the footer:

  • The schema is unknown.

  • Column chunk/page offsets are missing.

  • Statistics are gone, so predicate pushdown can't work.

The file becomes essentially unreadable by Parquet readers. Some recovery tools can attempt to "guess" schema from raw bytes, but correctness is not guaranteed.

👉 In distributed systems (e.g., Spark job killed mid-write), a file without a footer is considered corrupted.


5. How does Parquet achieve columnar storage on disk?

Answer: Parquet uses a hierarchical storage layout:

  1. Row Groups: Horizontal partitions of rows.

  2. Column Chunks: Within each row group, data is stored column-by-column.

  3. Pages: Each column chunk is split into smaller pages (~1MB).

  4. Footer: Stores metadata (schema, stats, offsets).


👉 This layout allows scanning only the needed columns (e.g., just age), and skipping irrelevant row groups/pages via metadata.


6. What is predicate pushdown?

Answer: Predicate pushdown = pushing filter conditions from the query engine down to the storage layer, so only relevant data is read.

  • Enabled by min/max statistics in Parquet metadata.

  • Example: for a query filtering on amount > 1000:

    If RowGroup stats show amount max=500, that group is skipped entirely.

👉 Predicate pushdown = "filter early, scan less." Huge performance win, especially on cloud object stores.


7. How does Parquet handle nested data?

Answer: Parquet uses Dremel encoding (from Google's Dremel paper) to support nested data like arrays, maps, and structs.

  • Flattening: Each nested attribute becomes a column (user.id, user.name, user.emails).

  • Repetition Level (RL): Tracks if a value belongs to a new parent or repeats under the same parent.

  • Definition Level (DL): Tracks if a field is defined or null at a given nesting depth.

Example records:

  { "id": 1, "emails": ["a@x.com", "b@y.com"] }
  { "id": 2, "emails": null }

Flattened with RL/DL:

| id | emails  | DL | RL |
|----|---------|----|----|
| 1  | a@x.com | 2  | 0  |
| 1  | b@y.com | 2  | 1  |
| 2  | null    | 1  | 0  |

👉 This preserves hierarchy so readers can reconstruct the original nested structure.


8. Can you explain the compression mechanisms in Parquet?

Answer: Parquet compresses in two layers:

  1. Encoding (logical compression)

    • Dictionary encoding: replaces values with integer IDs (best for low-cardinality columns).

    • Run-length encoding (RLE): compresses consecutive duplicates as (value, count).

    • Delta encoding: stores differences between consecutive values (great for timestamps).

    • Bit-packing: stores integers in minimal bits needed.

    • Boolean encoding: stores booleans as bitsets (8 values in 1 byte).

  2. Codec (physical compression)

    • Snappy: fast, lower ratio.

    • Gzip: slower, higher ratio.

    • Zstd: modern, tunable balance.

👉 Encodings shrink redundancy before the heavy codec is applied, making Parquet very efficient.


9. Why does Snappy have a lower compression ratio?

Answer: Snappy is designed for speed over size.

  • Snappy: byte-oriented LZ77-style matching with no entropy-coding stage.

  • Gzip/Zstd: deeper compression (Huffman/entropy coding, advanced match finding).

Tradeoff:

  • Snappy ratio: ~2–4x, but compress/decompress is 🚀 fast.

  • Gzip ratio: ~5–10x, but 🐢 slower.

  • Zstd: ~3–8x, tunable speed/ratio.

👉 In big-data systems (Spark, Hive, Hudi), decompression speed matters more than squeezing out extra bytes → that's why Snappy is the default.


✅ Summary

Parquet achieves its efficiency through:

  • Columnar layout (row groups → column chunks → pages)

  • Rich metadata in the footer (for predicate pushdown)

  • Nested data support (via repetition/definition levels)

  • Two-tier compression (encoding + codec)

  • Balanced tradeoffs in codecs (Snappy for speed, Gzip/Zstd for ratio)
