Setting the Table: Benchmarking Open Table Formats

Start of Main Content

At Brooklyn Data Company, we've watched the Modern Data Stack (MDS) grow and evolve by leaps and bounds every year. Not only are new players and categories entering the MDS, but existing players are improving their offerings all the time. One area that's starting to get more attention is open table storage formats, such as Apache Iceberg and Delta Lake.

The Delta Lake™ (governed by the Linux Foundation™) and Apache® Iceberg (governed by the Apache Software Foundation) storage formats bring data warehouse capabilities to the data lake, such as ACID transactions and data versioning. Cloud data warehouses and data lakehouses are adopting support for one or more of these formats, improving data interoperability and reducing the need to move data across many analytical systems. For example, Databricks fully supports Delta Lake (and can also use Iceberg with a few tweaks), and Snowflake currently has native support of Iceberg tables in private preview.

Why this benchmark?

To evaluate these table storage formats, we ran a set of comprehensive workloads against each of them to test the performance of inserts and deletes, and the effect of these updates on the performance of subsequent reads.

To this end, we took some of these common sales tables from the TPC-DS benchmark and ran the following:

Inserting new data: Newly arriving data is appended to an existing table (for example, batching in yesterday's data from an operational system). The data added is typically a small fraction of the overall data set.
Deleting existing data: Existing data is removed from the table for archival purposes or to fix data quality issues. Volumes deleted vary widely but typically cover a specific date range or source system.
GDPR Deletions: For any of our clients that operate in the EU, we have to be ready to accommodate any GDPR deletion requests — ideally without blowing up performance. Data is removed from the entire table for just one person regardless of date. This is a challenging deletion since tables are frequently partitioned by date. Data governance and protection are not limited to GDPR compliance; they are table stakes for all analytical systems.

Methodology

At a high level, these are the steps we followed to generate our benchmark:

Denormalize the standard TPC-DS dataset into one wide sales table in Parquet format, at 1TB scale.
- TPC-DS is a heavily normalized data model, and performing multiple joins isn't really going to test the performance of the underlying storage formats. So instead we denormalized the dataset, creating a wide sales table. By focusing on a single wide table, we test the storage layer more rigorously, asking it to add and remove larger volumes of data in a single step.
- Hold back a subset of the data (~1.5%) in a separate Parquet file, for merging back into the table as part of the benchmark exercises.
Create the main denormalized table in each storage format, based off of the Parquet table. Record the count of underlying files in the table.
- Run a set of 10 read queries altered from the TPC-DS queries 3 times each, and record the run times.
- We altered these queries to accommodate our new wide sales table. We chose them because they use only the store_sales table and dimension tables. They do not refer to any of the other _sales or _returns tables that we did not denormalize.
Steps. After each mutation step, we get the new file count for the table, as well as rerun the 10 queries (3x) to measure read performance.
- Upsert Medium Data: Using a MERGE statement, insert the subset of data that was initially held back in step 1B
- Delete Extra Small Data: Using a MERGE statement, delete a "extra small" amount (0.015%) of data
- Delete Small Data: Using a MERGE statement, delete a "small" amount (0.15%) of data
- Delete Medium Data: Using a MERGE statement, delete a "medium" amount (1.5%) of data
- GDPR Request: Using a DELETE statement, simulate a GDPR request by removing all rows with a matching customer ID.

We ran the entire process (steps 1-3) four times, and calculated the geomean runtime across all 4 runs (e.g., geomean of 4 measurement points for each write task, and 120 measurement points for each read test).

Infrastructure

To run the benchmark, we used the following stack (all in AWS us-east-2):

Amazon EMR v6.8, running Spark 3.3
- 16 workers, using i3.2xlarge VMs
- 1 driver using an i3.2xlarge VM
1 Amazon RDS MySQL instance using db.m5.large
- This database is used for the Hive metastore catalog
1TB datasets, stored in s3
- Parquet dataset used for initial ingestion
- Delta Lake dataset using Delta Lake 2.1.0 (available here)
- Iceberg dataset using Iceberg 0.14.1 ¹ (available here)

Findings

Overall, we saw Delta Lake outperform Iceberg after any kind of data mutation, especially on read workloads. Read workloads were consistently 7x - 8x faster on Delta Lake. Write workloads were typically 1.5x - 3x faster on Delta Lake. Let's dive into the specifics!

Performance on writes

Delta Lake and Iceberg had very similar performance when initially creating the denormalized dataset. However, write times on performing the mutations were faster on Delta Lake, with the "Delete Small Data" mutation seeing a difference of 17x.

Performance on reads

While Delta's read performance remained consistent after every mutation step, we saw Iceberg's read performance decrease drastically after the first mutation, and then increase again after the 5th mutation.

Why the Difference?

After seeing the differences above, we wanted to explore reasons why they're so stark. We did one more additional run, this time adding statements to collect the number of files added and removed after each step:

Apache Iceberg: select * from spark_catalog.etl_sf1000_iceberg.store_sales_denorm_iceberg.snapshots
- File counts were pulled from the summary field of the latest snapshot record.
Delta Lake: describe history store_sales_denorm_delta

The reason for the difference quickly became apparent — while both formats began with the same number of files, Iceberg added 92,000 new files in the Upsert Medium Data step, leading to drastically longer read and write times for every step thereafter. Our theory is that Delta's repartitionBeforeWrite feature is what's keeping its file counts so low.

We only used default "out-of-the-box" settings for both storage formats, but Iceberg provides a large number of configurations that might improve performance. We researched this extensively² and two options seemed most relevant for this testing:

First, Iceberg offers a table compaction feature implemented as a stored procedure. This feature rewrites the underlying data files in a more optimized form. We did not try it because we expect that it would require significant extra time and cloud spend to rewrite 92,000+ files on S3 between each step.
Second, Iceberg recently added support for a merge-on-read data modification strategy versus their default copy-on-write strategy. Copy-on-write means that any changed data files are fully rewritten by the transaction.
- Delta only supports copy-on-write currently, so to ensure a like-to-like comparison we used copy-on-write for this benchmark. However, we see that the Delta Lake community is working on Deletion Vectors for DML operations which seems similar to merge-on-read. We expect that merge-on-read / deletion vectors would be faster during data modification and slower during read queries. This would be a great extension of the benchmark once both features are generally available.

What Comes Next?

While these results are interesting, we view this as only the first iteration of this benchmarking exercise. We'd love to hear what others in the data community think of our approach, and invite you to create pull requests in our repo to improve the benchmarking process. As we stated above, there are at least two modifications to the Iceberg configurations that would be a great next step for testing. If you're an expert in either format and know of better ways for us to configure them, please submit a PR!

Another way we'd love to extend this benchmark is trying other query engines — it would be exciting to see these formats benchmarked using engines such as Trino in addition to Spark.

In addition - both of these formats are moving fast and evolving constantly. We're excited to see how they both improve, and how BDC and our clients can take advantage!

Links:

Note: This is an independent benchmark designed and carried out by Brooklyn Data Company. Databricks came to BDC with the idea for the benchmark.

¹Apache Iceberg version 1.0.0 was released while we were performing this exercise. We chose not to re-run the benchmarks since 1.0 is based on 0.14.1 and did not introduce any new or improved functionality, it only guarantees API stability moving forward. To ensure there were no differences between the two versions, we ran our 10 benchmark queries against the fresh dataset using version 1.0.0 and compared to our 0.14.1 numbers. The geometric means of the run times for each version were nearly equivalent, on both read and write workloads.

² GitHub issues give us confidence that our benchmark reflects real experience:

"Slow performance on TPC-DS tests"
"Spark on iceberg table is slower than spark on hive parquet table"
"Merge Into Performance to Iceberg Table"
"Spark3.2 and spark3.3 Dynamic partition pruning is not enabled" (during merge)
"Iceberg table maintenance/compaction within AWS" (compaction doesn't improve performance)

Written by: Brooklyn Data Co.

Published: Nov 28, 2022

Data and Analytics Engineering