Apache Parquet: Deep-Dive Q&A
1. What is a row group in Parquet, and why does its size matter?
Answer: A row group is the fundamental horizontal partition of data in a Parquet file. Each row group contains a set of rows, and within it, data is stored column-by-column as column chunks.
Typical Size: 128 MB–1 GB (configurable).
Why It Matters:
Larger row groups:
Fewer files to manage.
Better compression (more data similarity).
More efficient for large scans.
Smaller row groups:
Faster point-lookups (less I/O to scan a single row).
But may lead to the small files problem (too many tiny files, higher overhead).
👉 The "sweet spot" is usually ~128–512 MB per row group for balanced performance.
2. What are column chunks and pages in Parquet? How do they affect read performance?
Answer: Inside a row group:
Column Chunk:
The storage of all values for a given column in that row group.
Example:
The user.age column chunk stores all ages for rows in this row group.
Pages:
Column chunks are further split into pages (default ~1MB each).
Pages are the smallest read/write unit in Parquet.
Types of pages: data pages, dictionary pages, index pages.
Effect on Performance:
Queries can skip pages using page-level statistics (min/max).
Smaller pages → more fine-grained skipping, but more metadata overhead.
Larger pages → fewer seeks, better sequential reads, but less selective skipping.
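Page-level skipping can be sketched with a toy model (hypothetical structures, not Parquet's actual reader code):

```python
# Each "page" carries min/max stats, mirroring Parquet's data-page metadata.
pages = [
    {"min": 10, "max": 40, "values": [10, 25, 40]},
    {"min": 41, "max": 90, "values": [41, 70, 90]},
    {"min": 91, "max": 120, "values": [91, 100, 120]},
]

def scan_gt(pages, threshold):
    """Read only pages whose max exceeds the threshold; skip the rest."""
    hits = []
    for page in pages:
        if page["max"] <= threshold:
            continue  # the whole page is provably irrelevant: skip its I/O
        hits.extend(v for v in page["values"] if v > threshold)
    return hits

print(scan_gt(pages, 40))  # [41, 70, 90, 91, 100, 120]
```

The first page is eliminated from its statistics alone; its values are never scanned.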
3. What metadata is stored in the Parquet footer, and how does it enable predicate pushdown?
Answer: Every Parquet file ends with a footer, which stores critical metadata:
Schema: column names, types, nesting info.
Row Group Information: number of rows, byte offsets, column chunk sizes.
Column Statistics: min, max, null count per column, per row group/page.
Encoding/Compression Info: which codec and encodings were used.
How Predicate Pushdown Works:
Query engine (Spark/Presto) reads footer first.
Example query: a filter such as WHERE age > 40.
Footer might show:
RowGroup1: age min=10, max=40 → skip.
RowGroup2: age min=41, max=90 → read.
👉 Predicate pushdown = only scan row groups/pages that could contain matches → massive I/O savings.
4. What happens if a Parquet file is missing its footer? Can you still read it?
Answer: No. Without the footer:
The schema is unknown.
Column chunk/page offsets are missing.
Statistics are gone, so predicate pushdown can't work.
The file becomes essentially unreadable by Parquet readers. Some recovery tools can attempt to "guess" the schema from raw bytes, but correctness is not guaranteed.
👉 In distributed systems (e.g., a Spark job killed mid-write), a file without a footer is considered corrupted.
5. How does Parquet achieve columnar storage on disk?
Answer: Parquet uses a hierarchical storage layout:
Row Groups: Horizontal partitions of rows.
Column Chunks: Within each row group, data is stored column-by-column.
Pages: Each column chunk is split into smaller pages (~1MB).
Footer: Stores metadata (schema, stats, offsets).
Visualization: File = [RowGroup 1: column chunk A (pages…), column chunk B (pages…)] … [RowGroup N] [Footer: schema, stats, offsets].
👉 This layout allows scanning only the needed columns (e.g., just age), and skipping irrelevant row groups/pages via metadata.
6. What is predicate pushdown?
Answer: Predicate pushdown = pushing filter conditions from the query engine down to the storage layer, so only relevant data is read.
Enabled by min/max statistics in Parquet metadata.
Example:
If a query filters on amount > 500 and a row group's stats show amount max=500, that group is skipped entirely.
👉 Predicate pushdown = "filter early, scan less." Huge performance win, especially on cloud object stores.
7. How does Parquet handle nested data?
Answer: Parquet uses Dremel encoding (from Google's Dremel paper) to support nested data like arrays, maps, and structs.
Flattening: Each nested attribute becomes a column (user.id, user.name, user.emails).
Repetition Level (RL): Tracks if a value belongs to a new parent or repeats under the same parent.
Definition Level (DL): Tracks if a field is defined or null at a given nesting depth.
Example: user 1 has emails [a@x.com, b@y.com]; user 2 has none.
Flattened with RL/DL:

| user.id | user.emails | Definition Level | Repetition Level |
| --- | --- | --- | --- |
| 1 | a@x.com | 2 | 0 |
|  | b@y.com | 2 | 1 |
| 2 | null | 1 | 0 |
👉 This preserves hierarchy so readers can reconstruct the original nested structure.
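A minimal sketch of how a reader could rebuild the nested emails from that stream (simplified to one repeated field with max definition level 2, as in the example above; not Parquet's actual decoder):

```python
# Flattened stream for user.emails: (value, definition_level, repetition_level)
stream = [("a@x.com", 2, 0), ("b@y.com", 2, 1), (None, 1, 0)]

def reconstruct(stream, max_def=2):
    """Rebuild per-record email lists from RL/DL-annotated values."""
    records = []
    for value, dl, rl in stream:
        if rl == 0:        # RL 0 -> this value starts a new record
            records.append([])
        if dl == max_def:  # fully defined -> a real email value
            records[-1].append(value)
        # dl < max_def -> the field is null/absent at some nesting depth,
        # so the record keeps an empty email list
    return records

print(reconstruct(stream))  # [['a@x.com', 'b@y.com'], []]
```

RL decides record boundaries, DL decides null vs. present: together they losslessly encode arbitrary nesting in flat columns.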
8. Can you explain the compression mechanisms in Parquet?
Answer: Parquet compresses in two layers:
Encoding (logical compression)
Dictionary encoding: replaces values with integer IDs (best for low-cardinality).
Run-length encoding (RLE): compresses consecutive duplicates as (value, count).
Delta encoding: stores differences between consecutive values (great for timestamps).
Bit-packing: stores integers in minimal bits needed.
Boolean encoding: stores booleans as bitsets (8 values in 1 byte).
Codec (physical compression)
Snappy: fast, lower ratio.
Gzip: slower, higher ratio.
Zstd: modern, tunable balance.
👉 Encodings shrink redundancy before the heavy codec is applied, making Parquet very efficient.
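Toy versions of dictionary and run-length encoding illustrate the idea (conceptual sketches, not Parquet's actual wire format):

```python
def dictionary_encode(values):
    """Replace each value with a small integer ID (good for low cardinality)."""
    dictionary, ids, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

def rle_encode(ids):
    """Collapse consecutive duplicates into (value, count) pairs."""
    runs = []
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1] = (i, runs[-1][1] + 1)
        else:
            runs.append((i, 1))
    return runs

cities = ["NYC", "NYC", "NYC", "LA", "LA", "NYC"]
dictionary, ids = dictionary_encode(cities)
print(dictionary)       # ['NYC', 'LA']
print(ids)              # [0, 0, 0, 1, 1, 0]
print(rle_encode(ids))  # [(0, 3), (1, 2), (0, 1)]
```

Six strings become a 2-entry dictionary plus three short runs; the codec then compresses this already-compact representation.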
9. Why does Snappy have a lower ratio?
Answer: Snappy is designed for speed over size.
Snappy: lightweight dictionary + LZ77-like matching. Minimal entropy coding.
Gzip/Zstd: deeper compression (Huffman/entropy coding, advanced match finding).
Tradeoff:
Snappy ratio: ~2–4x, but compress/decompress is very fast.
Gzip ratio: ~5–10x, but noticeably slower.
Zstd: 3β8x, tunable speed/ratio.
👉 In big-data systems (Spark, Hive, Hudi), decompression speed matters more than squeezing out extra bytes, which is why Snappy is the default.
Summary
Parquet achieves its efficiency through:
Columnar layout (row groups → column chunks → pages)
Rich metadata in the footer (for predicate pushdown)
Nested data support (via repetition/definition levels)
Two-tier compression (encoding + codec)
Balanced tradeoffs in codecs (Snappy for speed, Gzip/Zstd for ratio)