Recently, there’s been a lot of interest in migrating from JSON-as-string columns to the new native Parquet Variant type. The design rationale is compelling: a self-describing binary encoding that avoids redundant text parsing, enables typed column access, and supports shredding — extracting frequently-accessed fields into dedicated Parquet columns for faster projection and filtering.

However, when we put this to the test using our upcoming variant-conformance-benchmark harness, the early numbers told a much more nuanced story. As of March 2026, if you are expecting a massive, out-of-the-box performance leap by simply switching to Variant in open-source Spark 4, our preliminary results suggest the engine implementations and defaults might still need some time to mature.

Below are the numbers, a read on where things look rough today, and an invitation to send more workloads so we can see how Variant behaves in the wild. This note is educational rather than a competitive benchmark claim: the same queries run over the same data slice, so readers can reason about storage choices and whether variant-conformance-benchmark might help on their own data later.

The Setup

We ran our benchmark on Spark 4 (local mode). While the benchmark was executed on a single machine with 3 timed runs, the timings are consistent, and the same performance patterns were reproduced on a second Intel + Linux machine.

Implementation status: Spark’s native Variant execution path and Iceberg’s integration for Variant tables are still under active development as of March 2026. The stack is not feature-complete: several planned performance-oriented paths and optimizations are missing or only partially landed. The numbers below reflect current open-source behavior, not an upper bound on what the design can deliver once implementations catch up.

  • Engine: Testing was performed using the latest Spark master (4.x) + Iceberg main as of March 2026.
  • Dataset: One calendar day of the GitHub Archive (GHA) dataset (2023-01-06), totaling ~4.4 million rows. Each row is an API event with a heavy, semi-structured payload column where the shape depends on the event type.
  • Workload: 14 analytics read-only queries (c-q01 to c-q14) covering group bys, aggregations, nested fields, arrays, and mixed types.
  • Configurations: We tested a matrix of data representations and storage formats: string vs. Variant, shredded vs. unshredded, and Native vs. Iceberg.
    • Note on “Native”: In this benchmark, “Native” specifically means a Spark native, Hive-style Parquet table under the Spark warehouse.
    • Note on vectorized reads: For Iceberg + Variant runs, vectorized reads are not supported (as of March 2026) and Iceberg falls back to its row-by-row reader. This is a known limitation and means the Iceberg Variant numbers carry an additional overhead not present in the Native runs.

Note on timing: The metric reported is query_median_s — Spark’s own query execution time extracted from its output, median of 3 timed runs, excluding JVM startup. Each query runs in its own spark-sql session (cold start), so this isolates query execution from session overhead.
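As a rough sketch, collecting query_median_s could look like the following. This is a hypothetical helper, not the actual variant-conformance-benchmark runner; it parses the execution time that spark-sql itself prints and takes the median across fresh-session runs:

```python
# Sketch of how query_median_s could be collected. Hypothetical helper, not the
# actual variant-conformance-benchmark runner: each query runs in a fresh
# spark-sql session (cold start), and we take the median of the execution time
# that spark-sql itself reports, which excludes JVM startup.
import re
import statistics
import subprocess

TIME_RE = re.compile(r"Time taken:\s*([\d.]+)\s*seconds")

def parse_time_taken(output: str) -> float:
    """Extract Spark's reported query execution time from spark-sql output."""
    match = TIME_RE.search(output)
    if match is None:
        raise ValueError("no 'Time taken' line found in spark-sql output")
    return float(match.group(1))

def query_median_s(sql_file: str, runs: int = 3) -> float:
    timings = []
    for _ in range(runs):
        # One spark-sql process per run isolates query execution from any
        # state carried over inside a long-lived session.
        proc = subprocess.run(
            ["spark-sql", "-f", sql_file],
            capture_output=True, text=True, check=True,
        )
        timings.append(parse_time_taken(proc.stdout + proc.stderr))
    return statistics.median(timings)
```

The median over 3 runs keeps a single slow outlier (page cache, JIT warmup) from dominating the reported number.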

The Results: Total Query Times

Here is the sum of the median query times (Σ query_median_s) across all 14 queries for each configuration. The percentage column shows how much slower each configuration is compared to the fastest row in this table (here, JSON as String + Native table):

| Representation | Table Format | Total Query Time (Σ query_median_s) | vs. fastest |
| --- | --- | --- | --- |
| JSON as String | Native table | 54.83 s | |
| Variant (Unshredded) | Iceberg table | 58.35 s | +6% |
| Variant (Unshredded) | Native table | 61.12 s | +11% |
| JSON as String | Iceberg table | 66.11 s | +21% |
| Variant (Shredded) | Native table | 80.25 s | +46% |
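The "vs. fastest" column is simply each total relative to the best row. For illustration, it can be recomputed from the summed medians above:

```python
# Recompute the "vs. fastest" column from the summed medians in the table
# above (values copied from the benchmark results).
totals = {
    "JSON as String + Native": 54.83,
    "Variant (Unshredded) + Iceberg": 58.35,
    "Variant (Unshredded) + Native": 61.12,
    "JSON as String + Iceberg": 66.11,
    "Variant (Shredded) + Native": 80.25,
}
fastest = min(totals.values())
vs_fastest = {
    name: round(100 * (total / fastest - 1))
    for name, total in totals.items()
}
```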

The Results: Per-Query Breakdown

Here is where the totals come from: per-query query_median_s. No single stack wins every row — String (Native), Unshredded Variant (Native), and Variant (Iceberg) each lead on different queries, while shredded Variant (Native) is often the slowest. c-q05 is the dramatic outlier (picked up again below).

How to read: Bold = fastest configuration in that row; values such as 5.954 (+32%) are 32% slower than that row’s minimum.
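For illustration, that convention can be reproduced mechanically. The helper below is not part of the benchmark itself; it bolds the row minimum and labels every other cell relative to it, using the c-q04 row as input:

```python
# Reproduce the per-row table convention: bold the row minimum, and mark every
# other cell "+N%" relative to that minimum. Illustrative helper only.
def format_row(values):
    lo = min(values)
    cells = []
    for v in values:
        if v == lo:
            cells.append(f"**{v:.3f}**")
        else:
            cells.append(f"{v:.3f} (+{round(100 * (v / lo - 1))}%)")
    return cells

# c-q04 timings, ordered String (Native) through Variant (Iceberg).
c_q04 = [4.930, 5.954, 4.739, 6.088, 4.514]
```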

No semi-structured access (baseline)

c-q01 counts events by type and does not touch payload.

| Query | String (Native) | Shredded Variant (Native) | Unshredded Variant (Native) | String (Iceberg) | Variant (Iceberg) |
| --- | --- | --- | --- | --- | --- |
| c-q01 | 2.114 | 2.301 (+9%) | **2.112** | 2.424 (+15%) | 2.434 (+15%) |

Shallow / scalar field access

c-q02 to c-q04, c-q05, and c-q07: top repos and actors, PR actions, aggregates on a single numeric path in payload, and repo languages with NULL handling.

| Query | String (Native) | Shredded Variant (Native) | Unshredded Variant (Native) | String (Iceberg) | Variant (Iceberg) |
| --- | --- | --- | --- | --- | --- |
| c-q02 | **3.067** | 3.118 (+2%) | 3.118 (+2%) | 3.277 (+7%) | 3.461 (+13%) |
| c-q03 | **2.634** | 2.921 (+11%) | 2.816 (+7%) | 2.751 (+4%) | 3.135 (+19%) |
| c-q04 | 4.930 (+9%) | 5.954 (+32%) | 4.739 (+5%) | 6.088 (+35%) | **4.514** |
| c-q05 | **4.163** | 11.771 (+183%) | 5.664 (+36%) | 5.147 (+24%) | 4.478 (+8%) |
| c-q07 | **3.544** | 6.013 (+70%) | 6.207 (+75%) | 4.247 (+20%) | 4.711 (+33%) |

Array index and array-length access

c-q06, c-q08, and c-q12 to c-q14: indexed commits / labels paths and array-length predicates.

| Query | String (Native) | Shredded Variant (Native) | Unshredded Variant (Native) | String (Iceberg) | Variant (Iceberg) |
| --- | --- | --- | --- | --- | --- |
| c-q06 | 4.804 (+17%) | 6.239 (+52%) | **4.092** | 5.776 (+41%) | 4.929 (+20%) |
| c-q08 | 3.700 (+2%) | 5.810 (+60%) | **3.633** | 4.732 (+30%) | 4.300 (+18%) |
| c-q12 | **3.518** | 5.926 (+68%) | 5.664 (+61%) | 4.123 (+17%) | 4.371 (+24%) |
| c-q13 | **3.165** | 5.740 (+81%) | 3.448 (+9%) | 3.899 (+23%) | 4.114 (+30%) |
| c-q14 | **4.235** | 6.210 (+47%) | 5.633 (+33%) | 5.112 (+21%) | 4.538 (+7%) |

Deep nested object path

c-q09 to c-q11: draft flag, PR volume by day, and opener login under payload.pull_request.

| Query | String (Native) | Shredded Variant (Native) | Unshredded Variant (Native) | String (Iceberg) | Variant (Iceberg) |
| --- | --- | --- | --- | --- | --- |
| c-q09 | 4.981 (+11%) | 5.753 (+28%) | 4.985 (+11%) | 6.207 (+38%) | **4.503** |
| c-q10 | 4.946 (+26%) | 6.587 (+68%) | **3.918** | 6.164 (+57%) | 4.385 (+12%) |
| c-q11 | 5.028 (+12%) | 5.909 (+32%) | 5.094 (+14%) | 6.161 (+37%) | **4.481** |

Full SQL, dataset preparation, and the benchmark runner will ship with the upcoming open-source variant-conformance-benchmark.

What We Observed from the Data

  1. Totals favor String + Native. The summed median time was lowest for JSON-as-string on a native Hive-style table. That flips the usual Variant story (JSON-as-string pays the parse cost on every read; Variant should shrink that), so it is worth digging into rather than treating it as noise.

  2. Shredded Variant + Native was the slowest stack overall (+46% vs. that baseline). All of these queries are read-only; shredding is mainly a write-time layout choice, and at read time you would expect the typed side columns to help. Seeing the opposite suggests profiling work is needed (for example, on how variant_get behaves over shredded vs. unshredded Parquet).

  3. c-q05 is the clearest outlier. Shredded Variant (Native) took 11.771 s on this row, while the other configurations clustered around 4.1–5.6 s. The aggregation is the same in both cases:

    Variant (11.771s):

SELECT round(avg(variant_get(payload, '$.size', 'double')), 6) AS avg_commits,
       max(variant_get(payload, '$.size', 'long'))              AS max_commits
FROM vcb.gha_events_variant
WHERE type = 'PushEvent'
  AND variant_get(payload, '$.size', 'long') IS NOT NULL;

String + get_json_object (4.163s):

SELECT round(avg(CAST(get_json_object(payload, '$.size') AS DOUBLE)), 6) AS avg_commits,
       max(CAST(get_json_object(payload, '$.size') AS BIGINT))          AS max_commits
FROM vcb.gha_events_string_json
WHERE type = 'PushEvent'
  AND get_json_object(payload, '$.size') IS NOT NULL;

One numeric path under payload is where shredding should help most; trailing plain string JSON by roughly 2.8x on exactly that shape is a strong signal.

  4. Iceberg vs. Native is confounded (table format, reader path, and representation all change together): use those comparisons directionally, not as a clean Variant-only A/B.
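To make the c-q05 discussion concrete, here is an illustrative pure-Python analogy (not Spark's actual execution path): extracting one numeric field by re-parsing JSON text on every row, versus reading it from an already-decoded structure, which is the kind of repeated work Variant's binary encoding is meant to avoid:

```python
# Illustrative analogy only, not Spark internals: why a JSON-as-string column
# (get_json_object-style, re-parse per access) is expected to lose to a
# pre-decoded representation (variant_get-style typed access) on a single
# numeric path like c-q05's payload.size.
import json
import statistics

# Synthetic stand-in rows: a small numeric field plus filler payload.
rows = [json.dumps({"size": n % 7, "other": "x" * 50}) for n in range(10_000)]

def avg_size_string(payload_rows):
    # String path: parse the full JSON text on every access.
    vals = [json.loads(p).get("size") for p in payload_rows]
    return statistics.mean(v for v in vals if v is not None)

# Stand-in for a binary Variant payload: decoded once, up front.
decoded = [json.loads(p) for p in rows]

def avg_size_decoded(docs):
    # Variant-style path: typed access into an already-decoded structure.
    return statistics.mean(d["size"] for d in docs if d.get("size") is not None)
```

Both paths compute the same average; the difference is purely how much parsing work is repeated per row, which is why the shredded result above (slower than re-parsing text) stands out.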

Datasets, workloads, and the benchmark

On one calendar day of the GitHub Archive (GHA) dataset and these 14 queries, string on a native table came out ahead overall, and shredded Variant was surprisingly slow on reads. Together with the setup caveats above, we mostly use numbers like these for spotting gaps so Spark and Iceberg can keep tightening the Variant path.

We would love more real workloads: different JSON shapes, sizes, and query patterns. We are curious how Variant behaves there — wins, losses, and anything weird in between. If you have a dataset or job where Variant shines (or where it should but does not), open an issue on this site’s repo or reach me via github.com/qlong. Once variant-conformance-benchmark is public, issues on that project will be welcome too.

We also plan to open source variant-conformance-benchmark (queries, prep scripts, runner) so people can reproduce this setup and run their own. The hope is twofold: help engineers see where things stand today, and give teams a practical way to decide whether Variant is the right type for their data — without treating any single run as the last word.

See also

Benchmarking Parquet Variants through Iceberg — an independent effort (JMH benchmarks, linked Iceberg and Parquet PRs) that asks similar readiness questions around Variant, shredding, and Iceberg + Spark. The harness and numbers differ from ours; taken together they reinforce that this space still needs measurement and review.