Benchmarking Modern Data Processing Engines
Historically, data processing often relied on Pandas for in-memory workloads and Apache Spark for distributed processing. This approach was highly effective for structured tabular data.
However, modern workloads increasingly involve multimodal data, such as images, audio, and video frames. In these scenarios, traditional challenges like the JVM's garbage collection overhead and Python's Global Interpreter Lock can become performance bottlenecks, particularly when integrating GPU inference. This shift has accelerated the adoption of engines built on Rust and Apache Arrow.
Having transitioned many of my own single-instance pipelines to Polars, I wanted to objectively evaluate how it compares to other modern engines. This article presents a comparison of several Rust-based compute frameworks using real-world datasets to provide a practical guide for choosing the right tool.
The entire benchmark suite and runnable code can be found in the engine-comparison-demo repository. I encourage you to clone it, run the benchmarks on your own hardware, and verify the results yourself.
TL;DR: Rust-native engines promise up to 94x speedups over Pandas on single nodes. In real workloads you will likely see consistent but smaller improvements, with Polars still the best overall choice. For multimodal AI, pipelined execution models (Ray Data, Daft) achieve up to 18x faster processing than Spark by keeping GPUs saturated. There is no single "best" engine: the right choice depends on your data modality, scale, and whether you need GPUs.
Data Processing Workload Types
Before selecting an engine, it is helpful to categorize workloads, as different tools are optimized for different requirements. The primary distinction is between structured and multimodal data.
Structured / Tabular Data. These CPU-bound workloads, such as filtering, aggregating, and joining data, often fit in memory. Many of these workloads are run on distributed systems unnecessarily; roughly 90% of queries process less than 1 TB of data, which can typically be processed on a single machine.
Multimodal AI. These workloads are often GPU-bound, involving images, video, and audio. In traditional pipelines, preprocessing can become a bottleneck because deep learning models consume data faster than CPUs can decode it. On standard VM shapes like AWS g6.xlarge (4 CPUs per GPU), GPU utilization can drop below 20%. Maintaining adequate data throughput to the GPU is critical for overall pipeline efficiency.
The new engines address both worlds: Rust for raw speed, Apache Arrow for zero-copy data sharing, and streaming execution to handle datasets larger than RAM.
Part 1: Single-Node Processing
The Pandas bottleneck is not a bug: it is baked into the design. Pandas loads entire files into RAM at once (eager execution), creates intermediate copies for almost every operation, and is stuck on a single thread because of Python's GIL. In official PDS-H benchmarks at scale factor 10, Pandas took about 365 seconds to finish a standard query suite. Polars, using its streaming engine, finished the same suite in 3.89 seconds, roughly 94x faster.
Polars: Tabular Data Processing
Polars has effectively replaced Pandas for high-performance local ETL. Written entirely in Rust, it uses a lazy execution model: instead of running every line immediately, Polars builds an optimized query plan, applying predicate pushdown and projection (column) pruning, before executing it across all available CPU cores in streaming batches.
The performance advantage grows with dataset size. At scale factor 100 (~100 GB), Polars' streaming engine completed the benchmark in 23.94 seconds versus 152.27 seconds for its own in-memory engine, over 6x faster, showing that streaming execution handles larger-than-memory workloads gracefully.
From engine_comparison_examples.ipynb:
import polars as pl

# scan_parquet reads only the schema; no data is loaded yet
q = (
    pl.scan_parquet("yellow_tripdata_2024-01.parquet")
    .filter(
        (pl.col("trip_distance") > 5.0)
        & (pl.col("total_amount") > 30.0)
    )
    .group_by("payment_type")
    .agg(
        pl.col("total_amount").mean().alias("avg_fare"),
        pl.col("trip_distance").mean().alias("avg_distance"),
        pl.len().alias("trip_count"),
    )
)

# The entire plan is optimized and executed here, in parallel
result = q.collect()
The difference from Pandas is not syntax; it is architecture. Polars builds a plan of all operations and optimizes the entire pipeline before touching data. The filter predicate gets pushed down into the Parquet reader, skipping entire row groups, and only the columns used in the query are read from disk. Pandas cannot do this: every line runs immediately, creating expensive intermediate copies.
Apache DataFusion: Extensible Query Engine
Where Polars is a library for end-users, DataFusion is the "engine behind the engines." It powers InfluxDB 3.0, GreptimeDB, and Apple's Comet Spark accelerator. In November 2024, DataFusion became the fastest single-node engine for querying Parquet files in ClickBench, beating DuckDB, chDB, and ClickHouse on the same hardware. The Embucket team then showed DataFusion could run TPC-H at scale factor 1000 (~1 TB) on a single node, further evidence that most workloads do not need a distributed cluster.
From engine_comparison_examples.ipynb:
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("taxi", "yellow_tripdata_2024-01.parquet")

# SQL executed directly against Parquet; no intermediate copies
df = ctx.sql("""
    SELECT payment_type,
           COUNT(*) AS trip_count,
           AVG(trip_distance) AS avg_distance,
           AVG(total_amount) AS avg_fare
    FROM taxi
    WHERE trip_distance > 5.0 AND total_amount > 30.0
    GROUP BY payment_type
    ORDER BY trip_count DESC
""")
result = df.to_pandas()
DataFusion's strength is its modularity. It provides extension APIs for custom catalogs, table providers, optimizer rules, and execution plans, making it the go-to choice when you need to build a custom data platform or embed a query engine into your own product.
Daft: Multimodal Data Processing
Daft differentiates itself by treating multimodal data (images, audio, video, embeddings) as first-class citizens in the DataFrame abstraction. Where Pandas treats images as opaque byte arrays and Spark serializes them through the JVM, Daft exposes native Rust-powered expressions for decoding images, downloading URLs, and generating tensors, bypassing the Python loop overhead entirely.
From engine_comparison_examples.ipynb:
import daft

df = daft.read_parquet("yellow_tripdata_2024-01.parquet")
result = (
    df.where(
        (daft.col("trip_distance") > 5.0)
        & (daft.col("total_amount") > 30.0)
    )
    .groupby("payment_type")
    .agg(
        daft.col("total_amount").mean().alias("avg_fare"),
        daft.col("trip_distance").mean().alias("avg_distance"),
        daft.col("trip_distance").count().alias("trip_count"),
    )
    .collect()
)
Daft's API feels familiar to Pandas users, but under the hood it is backed by the same Rust + Arrow stack as Polars and DataFusion. Where Daft truly shines is when your "rows" contain images, PDFs, or tensors.
Benchmark: Single-Node Performance
I benchmarked all four engines (plus native Rust via Polars-rs) on ~41M NYC yellow taxi trips covering the full year of 2024. Full results are in the demo repo.
At this dataset size (~660 MB Parquet), all engines are fast. Polars advertises up to 94x over Pandas on specific heavy analytical benchmarks (like PDS-H). In my own demo benchmark on ~41M rows, I saw consistent improvements over Pandas, though nowhere near that multiplier. Notably, the other engines, and even hand-written Rust via Polars-rs, all performed in roughly the same range, with DataFusion posting the fastest total time and Polars remaining an excellent overall choice across all tools tested.
Part 2: Handling Multimodal Data
Traditional ETL typically reduces data volume through filtering and aggregation. In contrast, multimodal AI pipelines often expand data; for example, a single document path might generate numerous text chunks and embeddings.
Spark treats complex types like images and audio as binary data. Processing them usually requires serializing the data from the JVM to Python (via Py4J), processing it with libraries like Pillow or OpenCV, and serializing it back. This serialization round-trip can introduce substantial overhead.
Pipelined Execution and GPU Utilization
Spark processes partitions in parallel across cores, but its stage-based (bulk synchronous) execution creates a different bottleneck. Within a single task, all operations (download, decode, classify) are fused and run sequentially on one core with no async I/O overlap. More importantly, all tasks in a stage must complete before the next stage begins: GPUs sit completely idle while CPUs preprocess data, and CPUs sit idle during GPU inference. Intermediate results must be fully materialized between stages, which can also cause memory pressure.
An alternative approach is pipelined execution: instead of running stages one after another, modern engines overlap I/O, CPU work, and GPU inference, keeping all hardware busy at the same time.
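The effect is easy to simulate with nothing but the standard library. In the sketch below, `time.sleep` stands in for CPU decoding and GPU inference; the staged version finishes all decodes before any inference starts, while the pipelined version overlaps batch N's inference with batch N+1's decode through a bounded queue:

```python
import threading
import time
from queue import Queue

def decode(batch):   # stand-in for CPU preprocessing
    time.sleep(0.05)
    return batch

def infer(batch):    # stand-in for GPU inference
    time.sleep(0.05)
    return batch * 2

batches = list(range(8))

# Stage-based: every decode must finish before the first inference starts.
start = time.perf_counter()
decoded = [decode(b) for b in batches]
staged = [infer(b) for b in decoded]
staged_time = time.perf_counter() - start

# Pipelined: a producer thread decodes while the main thread infers.
def producer(q):
    for b in batches:
        q.put(decode(b))
    q.put(None)  # sentinel: no more batches

start = time.perf_counter()
q = Queue(maxsize=2)  # bounded buffer = bounded memory between stages
threading.Thread(target=producer, args=(q,)).start()
pipelined = []
while (b := q.get()) is not None:
    pipelined.append(infer(b))
pipelined_time = time.perf_counter() - start

print(f"staged: {staged_time:.2f}s, pipelined: {pipelined_time:.2f}s")
```

With 8 batches of 0.05 s decode + 0.05 s inference, the staged run takes about 0.8 s while the pipelined run takes about 0.45 s; the same principle, applied to real I/O, CPU, and GPU work, is what Ray Data and Daft exploit.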
Benchmark: Image Processing
I benchmarked image processing on 500 real photographs from the Food-101 dataset; full results are in the demo repo.
Polars and DataFusion are excluded from the multimodal benchmark because they lack native image operations; image work would still go through sequential Python. The key insight: Daft's Rust-native image operations are 3.6x faster than Pandas+Pillow, and pure Rust is 4.2x faster than Pandas for the total pipeline.
Part 3: Distributed Processing
When data or compute needs exceed a single machine, you go distributed. This is where architectural differences matter most.
Apache Spark
Spark is still the go-to for petabyte-scale tabular ETL with complex shuffles. Its strengths: battle-tested fault tolerance via RDD lineage, a mature ecosystem (Databricks, AWS EMR), and proven reliability for multi-TB joins. But for AI workloads, Spark struggles with GPUs: its stage-based execution model assigns tasks to JVM executors, leaving GPUs idle while waiting for slow CPU preprocessing.
From distributed_spark.ipynb:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("TaxiETL") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

orders = spark.read.parquet("s3a://lake/taxi/*.parquet")
zones = spark.read.csv("s3a://lake/taxi_zones.csv", header=True)

# Spark excels at this: joining massive tables with a shuffle
result = (
    orders
    .filter(F.col("trip_distance") > 5.0)
    .join(zones, orders.PULocationID == zones.LocationID, "inner")
    .groupBy("Borough")
    .agg(
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_fare"),
    )
    .orderBy(F.desc("total_revenue"))
)
Ray Data: GPU Utilization
Ray Data was built for the AI era. Unlike Spark's stage-based execution, Ray uses a streaming model designed to keep GPUs fed. Its killer feature is mixed resource scheduling: you can assign specific resources (e.g., "1 GPU, 4 CPUs") to different actors within the same pipeline.
Amazon reported saving over $120 million per year by migrating specific workloads from Spark to Ray. Their proof-of-concept showed 91% better cost efficiency and 13x more data per hour; in production the number settled at 82% better cost efficiency per GiB of S3 input.
From distributed_ray.ipynb:
import ray

ray.init()
ds = ray.data.read_images("s3://my-bucket/food101/")

class ImageClassifier:
    def __init__(self):
        import torch
        from torchvision.models import resnet18, ResNet18_Weights
        self.model = resnet18(weights=ResNet18_Weights.DEFAULT).cuda()
        self.model.eval()
        self.preprocess = ResNet18_Weights.DEFAULT.transforms()

    def __call__(self, batch):
        import torch
        from PIL import Image
        # read_images yields numpy arrays; the torchvision transforms
        # expect PIL images (or tensors), so convert first
        tensors = torch.stack([
            self.preprocess(Image.fromarray(img)) for img in batch["image"]
        ]).cuda()
        with torch.no_grad():
            # softmax turns raw logits into probabilities, so the
            # "confidence" column is actually interpretable
            probs = torch.softmax(self.model(tensors), dim=1)
        return {
            "prediction": probs.argmax(dim=1).cpu().numpy(),
            "confidence": probs.max(dim=1).values.cpu().numpy(),
        }

# ActorPoolStrategy creates persistent GPU workers
predictions = ds.map_batches(
    ImageClassifier,
    compute=ray.data.ActorPoolStrategy(size=4),
    num_gpus=1,
    batch_size=64,
)
predictions.write_parquet("s3://output/predictions/")
An important detail: as you add more CPUs per GPU, Ray Data keeps getting faster while other engines plateau. In Anyscale's benchmarks, going from a 4:1 to a 32:1 CPU-to-GPU ratio gave a 3x performance jump for image inference: the CPU "feeder" tasks could finally keep up with GPU demand.
Daft (Distributed)
Daft scales to distributed mode via its Flotilla engine, which puts one Swordfish worker per node. Swordfish handles local Rust execution while Flotilla manages cluster-wide scheduling. The key: it pipelines I/O with compute through small-batch streaming, keeping each node fully utilized.
In benchmarks across four multimodal workloads, Daft with Flotilla ran 2-18x faster than Spark. On the heaviest workload (video object detection with YOLO11n), Spark took over 3.5 hours while Daft finished in under 12 minutes, roughly 18x faster, largely because Spark loses time to serialization overhead and stage barriers.
Worth noting: Anyscale (the company behind Ray) published their own benchmarks showing that with manual tuning, Ray Data narrows the gap and can beat Daft on high-CPU instances. This is an active race; both engines are improving fast.
From distributed_daft.ipynb:
import daft
from daft import col

# Daft's Flotilla engine distributes work across the cluster
df = daft.read_parquet("s3://data-lake/pdf_metadata/*.parquet")

# Download PDFs; Daft parallelizes the downloads in Rust
df = df.with_column("pdf_bytes", col("pdf_url").url.download())

# Define a GPU UDF for embedding generation
@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()))
class TextEmbedder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

    def __call__(self, text_col):
        texts = text_col.to_pylist()
        embeddings = self.model.encode(texts, batch_size=32)
        return [emb.tolist() for emb in embeddings]

# Daft schedules CPU downloads and GPU embeddings simultaneously
df = df.with_column("embedding", TextEmbedder(col("text")))
df.write_parquet("s3://output/embeddings/")
Distributed Head-to-Head
| Feature | Spark | Ray Data | Daft (Flotilla) |
|---|---|---|---|
| Execution Model | Task-per-core, partition-based | Streaming tasks and actors | Swordfish-per-node, streaming batches |
| Strengths | Massive SQL/ETL, fault tolerance | Heterogeneous compute, GPU saturation | Multimodal pipelining, bounded memory |
| GPU Utilization | Poor (stage-based, no CPU/GPU overlap) | Excellent (explicit resource assignment) | Excellent (async I/O-compute pipelining) |
| Tuning Burden | High (executor memory, cores, partitions) | Medium (batch sizes, object store) | Low (engine manages resources) |
| Best For | Petabyte tabular ETL | Training/inference with mixed CPU+GPU | Multimodal data loading for PyTorch |
Part 4: The Role of Rust and Arrow
Beneath all the competition, something interesting is happening: Polars, Daft, DataFusion, and even Ray's core all share the same foundation, with Apache Arrow as the in-memory format and Rust for safe, high-performance concurrency.
This has real practical impact. The Arrow PyCapsule Interface (__arrow_c_stream__ protocol) lets you move data between engines without serialization: compute a result in DataFusion and hand it to Polars or PyArrow with zero overhead. Use Daft for multimodal preprocessing and pass Arrow batches straight into a PyTorch DataLoader. This significantly reduces serialization overhead across the pipeline.
from datafusion import SessionContext
import polars as pl

# Compute in DataFusion (Rust-native execution)
ctx = SessionContext()
ctx.register_parquet("events", "events.parquet")
df = ctx.sql("""
    SELECT user_id, COUNT(*) AS event_count
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY user_id
    HAVING COUNT(*) > 5
""")

# Zero-copy conversion to Arrow, then to Polars
arrow_table = df.to_arrow_table()
df_polars = pl.from_arrow(arrow_table)

# Continue analysis in Polars
result = df_polars.with_columns(
    pl.col("event_count").rank().alias("rank")
).sort("rank")
Why Rust over the JVM? Three reasons:
- No garbage collection pauses. Rust uses ownership and borrowing instead of a garbage collector: no "stop-the-world" pauses, no unpredictable latency spikes, and none of the GC-related OOM errors that haunt JVM-based systems at scale.
- No object tax. The JVM adds a 12-16 byte header to every object. When processing billions of small objects, this overhead adds up fast; Rust structs carry no such header.
- Compile-time thread safety. Rust's type system catches data races at compile time, so you get safe multi-threaded execution without subtle concurrency bugs.
Combined with Arrow's columnar memory layout, the serialization bottleneck, which can eat up to 30% of CPU cycles in legacy stacks, largely disappears.
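The "object tax" is easy to observe from Python itself, where per-object overhead is even larger than on the JVM. A stdlib-only sketch comparing one million integers stored as boxed heap objects versus a flat contiguous buffer (the kind of layout Arrow uses); CPython interns small ints, so the boxed figure slightly overstates real usage, but the ratio is representative:

```python
import sys
from array import array

n = 1_000_000
boxed = list(range(n))           # one heap object per value
columnar = array("q", range(n))  # one contiguous buffer of 8-byte ints

# sys.getsizeof(list) counts only the pointer array; each int object
# adds roughly 28 bytes on top of its 8-byte payload
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
columnar_bytes = sys.getsizeof(columnar)

print(f"boxed: {boxed_bytes / 1e6:.0f} MB, columnar: {columnar_bytes / 1e6:.0f} MB")
```

On CPython this comes out to roughly 36 MB boxed versus 8 MB columnar, and the columnar buffer is also what makes SIMD and cache-friendly scans possible.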
Part 5: Adopting Multiple Engines
The most practical takeaway: don't pick one engine; use each where it excels.
The pattern in production looks like this: raw data in a lake gets cleaned by Spark (or Dask for Python-native teams) into Parquet/Delta tables. Those tables then feed into Ray Data or Daft for "last mile" work: GPU inference, embedding generation, and the multimodal transforms Spark was never designed for. For ad-hoc analysis and local dev, Polars and DuckDB give you instant feedback without cluster overhead.
The glue that makes this work is open formats: Parquet, Delta, Iceberg, and Arrow. Every engine reads and writes them natively, so data flows between stages without conversion or lock-in.
Summary of Engine Use Cases
| Task | Recommended Tool | Trade-off |
|---|---|---|
| Ad-hoc analysis on a laptop | Polars / DuckDB | Speed > Scale |
| Massive SQL ETL (petabyte) | Apache Spark | Reliability > Speed |
| Python-native scaling | Dask | Simplicity > Optimization |
| Training LLMs / computer vision | Ray Data | GPU Utilization > ETL Features |
| Multimodal data loading | Daft | Developer Experience > Ecosystem |
| Building custom query engines | DataFusion | Extensibility > Out-of-box Features |
Diagonal Scaling: The New Frontier
For years the choice was simple: Scale Up (bigger machine) or Scale Out (more machines). Modern engines offer a third option: Diagonal Scaling. Polars Cloud does exactly this: it scales horizontally to multiple nodes when reading from cloud storage to max out I/O, but once filters and aggregations shrink the data, it switches to a single big node for compute, skipping the overhead of a distributed shuffle.
The surprising finding: bigger, more expensive instances can be cheaper per job because they drive higher GPU utilization, cutting total execution time enough to lower the overall bill.
Try It Yourself
The companion demo repository includes everything you need to reproduce these benchmarks:
# Install dependencies
uv sync
# Run the tabular benchmark (~2.9M NYC taxi trips)
uv run python -m engine_comparison.benchmarks.tabular
# Run the multimodal benchmark (500 real food photos)
uv run python -m engine_comparison.benchmarks.multimodal
# Run native Rust benchmarks (Polars-rs + image crate)
cd rust_benchmark && cargo run --release && cd ..
The repo also includes interactive notebooks for side-by-side API comparisons, and Docker Compose configurations for local distributed experiments with Spark, Ray, and Daft.
Key Takeaways
- Don't distribute if you don't have to. Polars and DataFusion handle up to ~1 TB on one machine with consistent, significant speedups over Pandas (up to 94x on heavy analytical benchmarks). Only reach for a cluster when you've actually hit the limits of a single node.
- Understand Pandas' constraints. Eager execution, single-threading, and intermediate copies are architectural decisions that become limitations for certain workloads. Lazy execution with predicate pushdown and column pruning provides significant performance advantages at scale.
- Pipelined execution improves hardware utilization. Spark's stage-based execution model can leave GPUs idle during CPU preprocessing. Pipelined execution models (like Ray Data and Daft) overlap operations to keep resources better utilized.
- Use each engine where it shines. Spark for petabyte ETL, Polars/DuckDB for ad-hoc analysis, Ray Data for GPU training pipelines, Daft for multimodal data loading, DataFusion for custom query engines.
- Rust + Arrow is the new foundation. This is not a coincidence; it eliminates GC pauses, serialization overhead, and concurrency bugs at the infrastructure level.
The best engine is the one that keeps your hardware busy and your developers productive.
References
Benchmarks and Performance Data
- Polars PDS-H Benchmarks - Polars streaming vs. Pandas at SF-10 (94x) and SF-100 (6.4x over in-memory)
- DataFusion ClickBench Results (Nov 2024) - Fastest single-node Parquet query engine
- Embucket: TPC-H SF-1000 on DataFusion - Full TPC-H at ~1 TB on a single node
- Amazon Spark-to-Ray Migration - $120M/year savings, 82% production cost efficiency
- Daft Flotilla Benchmarks - 2-18x faster than Spark on multimodal workloads
- Anyscale Ray vs. Daft Benchmarks - Ray Data ~30% average speedup on multimodal
Engines and Frameworks
- Polars - Rust DataFrame library with lazy execution
- Apache DataFusion - Embeddable Rust query engine (GitHub)
- Daft - Multimodal-native distributed DataFrame
- Ray - Distributed compute framework for AI (Data Internals)
- Apache Spark - Distributed ETL and SQL engine
- DuckDB - In-process analytical SQL engine
- Dask - Python-native parallel computing
Architecture and Ecosystem
- Polars Cloud: Diagonal Scaling - Dynamic vertical/horizontal scaling
- Daft Flotilla Architecture - Swordfish + Flotilla distributed engine
- Apache DataFusion Comet - Spark accelerator originally developed at Apple
- Apache Arrow - Columnar in-memory format (Flight RPC, PyCapsule Interface)
- InfluxDB 3.0 + DataFusion - Time-series database built on DataFusion
- GreptimeDB - Observability database using DataFusion
Demo Repository
- engine-comparison-demo - Companion code for this article: benchmarks, notebooks, and Docker Compose for Spark/Ray/Daft
Datasets
- NYC Taxi Trip Records - NYC TLC yellow taxi data (Parquet, ~2.9M rows/month)
- Food-101 Dataset - 101K food images from ETH Zurich (Bossard et al., ECCV 2014)