
Snappy vs. Zstd
Introduction
Everyone has data, and at high volumes it is usually stored compressed. The race to find the best compression algorithm is ongoing in the open-source community. Two popular contenders are Snappy and Zstd. Below are my observations as a user of both.
🔎 Major Differences
Less Data
Zstd compression typically produces less data than Snappy compression: Zstd is a newer algorithm that trades some speed for a higher compression ratio, while Snappy is deliberately designed to favor speed over ratio.
Creating less data means that we can reduce the shuffle partitions slightly for a given workload, which allows us to process it more efficiently. Tuning the value of spark.sql.shuffle.partitions yields a slight performance boost.
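As a rough sketch, the two settings discussed above can be set in spark-defaults.conf. The partition count here is purely illustrative, not a recommendation; the right value depends on the workload.

```
# spark-defaults.conf (illustrative values)
# Write Parquet output with Zstd instead of the default Snappy:
spark.sql.parquet.compression.codec   zstd
# Slightly fewer shuffle partitions, since the shuffled data is smaller
# (Spark's default is 200; 160 is a hypothetical tuned value):
spark.sql.shuffle.partitions          160
```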
However, data stored in Zstd takes longer to decompress than data stored in Snappy. In a few cases, we needed to allocate larger executors to decompress files successfully, which means setting a higher value for spark.executor.memory.
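For jobs that hit memory pressure during decompression, the executor memory can be raised at submit time. The flags below are a sketch; the 8g figure is a hypothetical value, not a measured requirement.

```
# Illustrative spark-submit fragment (8g is a hypothetical value):
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.io.compression.codec=zstd \
  ...
```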
Incompatible Data
Zstd compression is a more recent technology, so consumers running older versions of Spark, Hive, Presto, etc. may be unable to read the data. Snappy compression is more widely adopted and supported.