
Benchmarking Modern Data Processing Engines

Historically, data processing often relied on Pandas for in-memory workloads and Apache Spark for distributed processing. This approach was highly effective for structured tabular data.

However, modern workloads increasingly involve multimodal data, such as images, audio, and video frames. In these scenarios, traditional challenges like the JVM's garbage collection overhead and Python's Global Interpreter Lock can become performance bottlenecks, particularly when integrating GPU inference. This shift has accelerated the adoption of engines built on Rust and Apache Arrow.

Having transitioned many of my own single-instance pipelines to Polars, I wanted to objectively evaluate how it compares to other modern engines. This article presents a comparison of several Rust-based compute frameworks using real-world datasets to provide a practical guide for choosing the right tool.

The entire benchmark suite and runnable code can be found in the engine-comparison-demo repository. I encourage you to clone it, run the benchmarks on your own hardware, and verify the results yourself.

TL;DR: Rust-native engines promise up to 94x speedups over Pandas on single nodes. In real workloads you are more likely to see consistent but smaller improvements, with Polars the best overall choice. For multimodal AI, pipelined execution models (Ray Data, Daft) achieve up to 18x faster processing than Spark by keeping GPUs saturated. There is no single "best" engine — the right choice depends on your data modality, scale, and whether you need GPUs.


Data Processing Workload Types

Before selecting an engine, it is helpful to categorize workloads, as different tools are optimized for different requirements. The primary distinction is between structured and multimodal data.

Two Worlds of Data: Structured vs. Multimodal

Structured / Tabular Data. These CPU-bound workloads—such as filtering, aggregating, and joining data—often fit in memory. Many of these workloads are run on distributed systems unnecessarily; roughly 90% of queries process less than 1 TB of data, which can typically be processed on a single machine.

Multimodal AI. These workloads are often GPU-bound, involving images, video, and audio. In traditional pipelines, preprocessing can become a bottleneck because deep learning models consume data faster than CPUs can decode it. On standard VM shapes like AWS g6.xlarge (4 CPUs per GPU), GPU utilization can drop below 20%. Maintaining adequate data throughput to the GPU is critical for overall pipeline efficiency.
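A back-of-envelope model makes the starvation point concrete. The throughput numbers below are illustrative assumptions, not measurements:

```python
# Illustrative throughput model of GPU starvation; numbers are invented
decode_per_cpu = 50   # images/sec one CPU core can download + decode
gpu_capacity = 1000   # images/sec one GPU can run inference on

for cpus in (4, 8, 32):
    feed_rate = cpus * decode_per_cpu
    utilization = min(feed_rate / gpu_capacity, 1.0)
    print(f"{cpus:>2} CPUs per GPU -> {utilization:.0%} GPU utilization")
# 4 CPUs feed only 20% of the GPU; 32 CPUs finally saturate it
```

Under these assumptions, a 4:1 CPU-to-GPU shape leaves 80% of the accelerator idle, which is exactly why feed rate matters more than raw GPU speed.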

The new engines address both worlds: Rust for raw speed, Apache Arrow for zero-copy data sharing, and streaming execution to handle datasets larger than RAM.


Part 1: Single-Node Processing

The Pandas bottleneck is not a bug — it is baked into the design. Pandas loads entire files into RAM at once (eager execution), creates intermediate copies for almost every operation, and is stuck on a single thread because of Python's GIL. In official PDS-H benchmarks at scale factor 10, Pandas took about 365 seconds to finish a standard query suite. Polars, using its streaming engine, finished the same suite in 3.89 seconds — about 94x faster.

Eager vs. Lazy Execution

Polars: Tabular Data Processing

Polars has effectively replaced Pandas for high-performance local ETL. Written entirely in Rust, it uses a lazy execution model: instead of running every line immediately, Polars builds an optimized query plan — applying optimizations such as predicate pushdown and projection (column) pruning — before executing it across all available CPU cores in streaming batches.

The performance advantage grows with dataset size. At scale factor 100 (~100 GB), Polars' streaming engine completed the benchmark in 23.94 seconds versus 152.27 seconds for its own in-memory engine — over 6x faster — proving that streaming execution handles larger-than-memory workloads gracefully.

From engine_comparison_examples.ipynb:

import polars as pl

# scan_parquet reads only the schema — no data loaded yet
q = (
    pl.scan_parquet("yellow_tripdata_2024-01.parquet")
    .filter(
        (pl.col("trip_distance") > 5.0)
        & (pl.col("total_amount") > 30.0)
    )
    .group_by("payment_type")
    .agg(
        pl.col("total_amount").mean().alias("avg_fare"),
        pl.col("trip_distance").mean().alias("avg_distance"),
        pl.len().alias("trip_count"),
    )
)

# The entire plan is optimized and executed here, in parallel
result = q.collect()

The difference from Pandas is not syntax — it is architecture. Polars builds a plan of all operations and optimizes the entire pipeline before touching data. The filter predicate gets pushed down into the Parquet reader, skipping entire row groups. Only the columns used in the query are read from disk. Pandas cannot do this — every line runs immediately, creating expensive intermediate copies.

Apache DataFusion: Extensible Query Engine

Where Polars is a library for end-users, DataFusion is the "engine behind the engines." It powers InfluxDB 3.0, GreptimeDB, and Apple's Comet Spark accelerator. In November 2024, DataFusion became the fastest single-node engine for querying Parquet files in ClickBench, beating DuckDB, chDB, and ClickHouse on the same hardware. The Embucket team then showed DataFusion could run TPC-H at scale factor 1000 (~1 TB) on a single node — proving that most workloads do not need a distributed cluster.

From engine_comparison_examples.ipynb:

from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("taxi", "yellow_tripdata_2024-01.parquet")

# SQL executed directly against Parquet — no intermediate copies
df = ctx.sql("""
    SELECT payment_type,
           COUNT(*)           AS trip_count,
           AVG(trip_distance) AS avg_distance,
           AVG(total_amount)  AS avg_fare
    FROM taxi
    WHERE trip_distance > 5.0 AND total_amount > 30.0
    GROUP BY payment_type
    ORDER BY trip_count DESC
""")
result = df.to_pandas()

DataFusion's strength is its modularity. It provides extension APIs for custom catalogs, table providers, optimizer rules, and execution plans β€” making it the go-to choice when you need to build a custom data platform or embed a query engine into your own product.

Daft: Multimodal Data Processing

Daft differentiates itself by treating multimodal data — images, audio, video, embeddings — as first-class citizens in the DataFrame abstraction. Where Pandas treats images as opaque byte arrays and Spark serializes them through the JVM, Daft exposes native Rust-powered expressions for decoding images, downloading URLs, and generating tensors, bypassing the Python loop overhead entirely.

From engine_comparison_examples.ipynb:

import daft

df = daft.read_parquet("yellow_tripdata_2024-01.parquet")

result = (
    df.where(
        (daft.col("trip_distance") > 5.0)
        & (daft.col("total_amount") > 30.0)
    )
    .groupby("payment_type")
    .agg(
        daft.col("total_amount").mean().alias("avg_fare"),
        daft.col("trip_distance").mean().alias("avg_distance"),
        daft.col("trip_distance").count().alias("trip_count"),
    )
    .collect()
)

Daft's API feels familiar to Pandas users, but under the hood it is backed by the same Rust + Arrow stack as Polars and DataFusion. Where Daft truly shines is when your "rows" contain images, PDFs, or tensors.

Benchmark: Single-Node Performance

I benchmarked all four engines (plus native Rust via Polars-rs) on ~41M NYC yellow taxi trips from the full year of 2024. Full results from the demo repo:

Benchmark Results: Single-Node Performance

At this dataset size (~660 MB Parquet), all engines are fast. Polars promises up to 94x speedups over Pandas on specific massive analytical benchmarks (like PDS-H). In my own demo benchmark on ~41M rows, I saw a consistent improvement over Pandas, but nothing close to 94x. It is not the same massive multiplier, but still very good — especially considering that the other engines, and even clean pure Rust (via Polars-rs), perform about the same, with DataFusion posting the fastest total time and Polars remaining an excellent overall choice across all tools tested.


Part 2: Handling Multimodal Data

Traditional ETL typically reduces data volume through filtering and aggregation. In contrast, multimodal AI pipelines often expand data; for example, a single document path might generate numerous text chunks and embeddings.

Spark treats complex types like images and audio as binary data. Processing them usually requires serializing the data from the JVM to Python (via Py4J), processing it with libraries like Pillow or OpenCV, and serializing it back. This serialization round-trip can introduce substantial overhead.

Pipelined Execution and GPU Utilization

Spark processes partitions in parallel across cores, but its stage-based (bulk synchronous) execution creates a different bottleneck. Within a single task, all operations — download, decode, classify — are fused and run sequentially on one core with no async I/O overlap. More importantly, all tasks in a stage must complete before the next stage begins: GPUs sit completely idle while CPUs preprocess data, and CPUs sit idle during GPU inference. Intermediate results must be fully materialized between stages, which can also cause memory pressure.

Pipelined Execution Model

An alternative approach is pipelined execution: instead of running stages one after another, modern engines overlap I/O, CPU work, and GPU inference — keeping all hardware busy at the same time.
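The difference can be sketched with simple arithmetic: stage-based execution pays the sum of all step times for every batch, while a full pipeline is gated only by its slowest step. This is a deliberately simplified single-worker model, and the timings are invented:

```python
# Deliberately simplified single-worker model; timings are invented
download, decode, infer = 30.0, 50.0, 40.0   # ms per batch
batches = 1000

# Stage-based: every batch pays the sum of all three steps
staged_ms = batches * (download + decode + infer)

# Pipelined: once the pipeline is full, throughput is gated by the slowest step
pipelined_ms = (download + decode + infer) + (batches - 1) * max(download, decode, infer)

print(staged_ms / pipelined_ms)  # roughly 2.4x faster
```

The gap widens further in practice because pipelining also lets the GPU step run on different hardware than the CPU steps, instead of time-slicing one core.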

Benchmark: Image Processing

I benchmarked image processing on 500 real photographs from the Food-101 dataset. Results from the demo repo:

Benchmark Results: Multimodal Performance

Polars and DataFusion are excluded from the multimodal benchmark because they lack native image operations — image work would still fall back to sequential Python. The key insight: Daft's Rust-native image operations are 3.6x faster than Pandas+Pillow, and pure Rust is 4.2x faster than Pandas for the total pipeline.


Part 3: Distributed Processing

When data or compute needs exceed a single machine, you go distributed. This is where architectural differences matter most.

Apache Spark

Spark is still the go-to for petabyte-scale tabular ETL with complex shuffles. Its strengths: battle-tested fault tolerance via RDD lineage, a mature ecosystem (Databricks, AWS EMR), and proven reliability for multi-TB joins. But for AI workloads, Spark struggles with GPUs: its stage-based execution model assigns tasks to JVM executors, leaving GPUs idle while waiting for slow CPU preprocessing.

From distributed_spark.ipynb:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("TaxiETL") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

orders = spark.read.parquet("s3a://lake/taxi/*.parquet")
zones = spark.read.csv("s3a://lake/taxi_zones.csv", header=True)

# Spark excels at this: joining massive tables with a shuffle
result = (
    orders
    .filter(F.col("trip_distance") > 5.0)
    .join(zones, orders.PULocationID == zones.LocationID, "inner")
    .groupBy("Borough")
    .agg(
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_fare"),
    )
    .orderBy(F.desc("total_revenue"))
)

Ray Data: GPU Utilization

Ray Data was built for the AI era. Unlike Spark's stage-based execution, Ray uses a streaming model designed to keep GPUs fed. Its killer feature is mixed resource scheduling: you can assign specific resources (e.g., "1 GPU, 4 CPUs") to different actors within the same pipeline.

Amazon reported saving over $120 million per year by migrating specific workloads from Spark to Ray. Their proof-of-concept showed 91% better cost efficiency and 13x more data per hour; in production the number settled at 82% better cost efficiency per GiB of S3 input.

From distributed_ray.ipynb:

import ray

ray.init()

ds = ray.data.read_images("s3://my-bucket/food101/")

class ImageClassifier:
    def __init__(self):
        import torch
        from torchvision.models import resnet18, ResNet18_Weights
        self.model = resnet18(weights=ResNet18_Weights.DEFAULT).cuda()
        self.model.eval()
        self.preprocess = ResNet18_Weights.DEFAULT.transforms()

    def __call__(self, batch):
        import torch
        from PIL import Image
        # batch["image"] arrives as NumPy arrays; the torchvision
        # transform expects PIL images (or tensors), so convert first
        tensors = torch.stack([
            self.preprocess(Image.fromarray(img)) for img in batch["image"]
        ]).cuda()
        with torch.no_grad():
            # softmax turns raw logits into probabilities for "confidence"
            probs = torch.softmax(self.model(tensors), dim=1)
        return {
            "prediction": probs.argmax(dim=1).cpu().numpy(),
            "confidence": probs.max(dim=1).values.cpu().numpy(),
        }

# ActorPoolStrategy creates persistent GPU workers
predictions = ds.map_batches(
    ImageClassifier,
    compute=ray.data.ActorPoolStrategy(size=4),
    num_gpus=1,
    batch_size=64,
)
predictions.write_parquet("s3://output/predictions/")

An important detail: as you add more CPUs per GPU, Ray Data keeps getting faster while other engines plateau. In Anyscale's benchmarks, going from a 4:1 to a 32:1 CPU-to-GPU ratio gave a 3x performance jump for image inference — the CPU "feeder" tasks could finally keep up with GPU demand.

Daft (Distributed)

Daft scales to distributed mode via its Flotilla engine, which puts one Swordfish worker per node. Swordfish handles local Rust execution while Flotilla manages cluster-wide scheduling. The key: it pipelines I/O with compute through small-batch streaming, keeping each node fully utilized.

In benchmarks across four multimodal workloads, Daft with Flotilla ran 2-18x faster than Spark. On the heaviest workload (video object detection with YOLO11n), Spark took over 3.5 hours while Daft finished in under 12 minutes — 18x faster, mostly because Spark wastes time on serialization overhead and stage barriers.

Worth noting: Anyscale (the company behind Ray) published their own benchmarks showing that with manual tuning, Ray Data narrows the gap and can beat Daft on high-CPU instances. This is an active race — both engines are improving fast.

From distributed_daft.ipynb:

import daft
from daft import col

# Daft's Flotilla engine distributes work across the cluster
df = daft.read_parquet("s3://data-lake/pdf_metadata/*.parquet")

# Download PDFs — Daft parallelizes downloads in Rust
df = df.with_column("pdf_bytes", col("pdf_url").url.download())

# Define a GPU UDF for embedding generation
@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()))
class TextEmbedder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

    def __call__(self, text_col):
        texts = text_col.to_pylist()
        embeddings = self.model.encode(texts, batch_size=32)
        return [emb.tolist() for emb in embeddings]

# Daft schedules CPU downloads and GPU embeddings simultaneously
df = df.with_column("embedding", TextEmbedder(col("text")))
df.write_parquet("s3://output/embeddings/")

Distributed Head-to-Head

| Feature | Spark | Ray Data | Daft (Flotilla) |
| --- | --- | --- | --- |
| Execution Model | Task-per-core, partition-based | Streaming tasks and actors | Swordfish-per-node, streaming batches |
| Strengths | Massive SQL/ETL, fault tolerance | Heterogeneous compute, GPU saturation | Multimodal pipelining, bounded memory |
| GPU Utilization | Poor (stage-based, no CPU/GPU overlap) | Excellent (explicit resource assignment) | Excellent (async I/O-compute pipelining) |
| Tuning Burden | High (executor memory, cores, partitions) | Medium (batch sizes, object store) | Low (engine manages resources) |
| Best For | Petabyte tabular ETL | Training/inference with mixed CPU+GPU | Multimodal data loading for PyTorch |

Part 4: The Role of Rust and Arrow

Under all the competition, something interesting is happening: Polars, Daft, DataFusion, and even Ray's core all share the same foundation — Apache Arrow for the in-memory format and Rust for safe, high-performance concurrency.

Rust and Arrow Convergence

This has real practical impact. The Arrow PyCapsule Interface (__arrow_c_stream__ protocol) lets you move data between engines without serialization: compute a result in DataFusion and hand it to Polars or PyArrow with zero overhead. Use Daft for multimodal preprocessing and pass Arrow batches straight into a PyTorch DataLoader. This significantly reduces serialization overhead across the pipeline.

from datafusion import SessionContext
import polars as pl

# Compute in DataFusion (Rust-native execution)
ctx = SessionContext()
ctx.register_parquet("events", "events.parquet")
df = ctx.sql("""
    SELECT user_id, COUNT(*) as event_count
    FROM events WHERE event_type = 'purchase'
    GROUP BY user_id HAVING COUNT(*) > 5
""")

# Zero-copy conversion to Arrow, then to Polars
arrow_table = df.to_arrow_table()
df_polars = pl.from_arrow(arrow_table)

# Continue analysis in Polars
result = df_polars.with_columns(
    pl.col("event_count").rank().alias("rank")
).sort("rank")

Why Rust over the JVM? Three reasons:

  1. No garbage collection pauses. Rust uses ownership and borrowing instead of a garbage collector. No "stop-the-world" pauses, no unpredictable latency spikes. This eliminates the class of OOM errors that haunt JVM-based systems at scale.

  2. No object tax. The JVM adds 12-16 bytes per object header. When processing billions of small objects, this overhead adds up fast. Rust has no such overhead.

  3. Compile-time thread safety. Rust's type system catches data races at compile time, so you get safe multi-threaded execution without subtle concurrency bugs.
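The object tax in point 2 is easy to quantify with rough numbers. A 16-byte header and one float64 payload per object are illustrative assumptions, not measurements of any specific JVM:

```python
# Rough numbers: 16-byte header, one float64 payload per heap object
n = 1_000_000_000
header, payload = 16, 8

boxed_gib = n * (header + payload) / 2**30   # every value a heap object
columnar_gib = n * payload / 2**30           # one contiguous Arrow buffer

print(f"boxed: {boxed_gib:.1f} GiB, columnar: {columnar_gib:.1f} GiB")
# Headers alone triple the total footprint over the raw data
```

Real JVM collections avoid the worst case with primitive arrays, but any representation that boxes values per-row pays some version of this tax.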

Combined with Arrow's columnar memory layout, the serialization bottleneck — which eats up to 30% of CPU cycles in legacy stacks — just disappears.


Part 5: Adopting Multiple Engines

The most practical takeaway: don't pick one engine — use each where it excels.

The Coexistence Model: Multi-Engine Pipeline

The pattern in production looks like this: raw data in a lake gets cleaned by Spark (or Dask for Python-native teams) into Parquet/Delta tables. Those tables then get fed into Ray Data or Daft for "last mile" work — GPU inference, embedding generation, multimodal transforms that Spark was never designed for. For ad-hoc analysis and local dev, Polars and DuckDB give you instant feedback without cluster overhead.

The glue that makes this work: open formats — Parquet, Delta, Iceberg, Arrow. Every engine reads and writes them natively, so data flows between stages without conversion or lock-in.

Summary of Engine Use Cases

Engine Selection Matrix

| Task | Recommended Tool | Trade-off |
| --- | --- | --- |
| Ad-hoc analysis on a laptop | Polars / DuckDB | Speed > Scale |
| Massive SQL ETL (petabyte) | Apache Spark | Reliability > Speed |
| Python-native scaling | Dask | Simplicity > Optimization |
| Training LLMs / computer vision | Ray Data | GPU Utilization > ETL Features |
| Multimodal data loading | Daft | Developer Experience > Ecosystem |
| Building custom query engines | DataFusion | Extensibility > Out-of-box Features |

Diagonal Scaling: The New Frontier

For years the choice was simple: Scale Up (bigger machine) or Scale Out (more machines). Modern engines offer a third option — Diagonal Scaling. Polars Cloud does exactly this: it scales horizontally to multiple nodes when reading from cloud storage to max out I/O, but once filters and aggregations shrink the data, it switches to a single big node for compute — skipping the overhead of a distributed shuffle.

The surprising finding: bigger, more expensive instances can be cheaper per job because they drive higher GPU utilization, cutting total execution time enough to lower the overall bill.
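The arithmetic behind that finding is straightforward. The hourly prices and runtimes below are invented purely for illustration:

```python
# Invented numbers: a 4x pricier node that keeps its GPU saturated
small = {"usd_per_hr": 1.00, "job_hours": 6.0}   # cheap node, GPU mostly idle
big   = {"usd_per_hr": 4.00, "job_hours": 1.0}   # pricey node, GPU kept busy

small_cost = small["usd_per_hr"] * small["job_hours"]
big_cost = big["usd_per_hr"] * big["job_hours"]
print(small_cost, big_cost)  # the big node costs less per completed job
```

Whenever the speedup from higher utilization exceeds the price multiplier, the bigger instance wins on cost per job, not just on wall-clock time.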


Try It Yourself

The companion demo repository includes everything you need to reproduce these benchmarks:

# Install dependencies
uv sync

# Run the tabular benchmark (~2.9M NYC taxi trips)
uv run python -m engine_comparison.benchmarks.tabular

# Run the multimodal benchmark (500 real food photos)
uv run python -m engine_comparison.benchmarks.multimodal

# Run native Rust benchmarks (Polars-rs + image crate)
cd rust_benchmark && cargo run --release && cd ..

The repo also includes interactive notebooks for side-by-side API comparisons, and Docker Compose configurations for local distributed experiments with Spark, Ray, and Daft.


Key Takeaways

  1. Don't distribute if you don't have to. Polars and DataFusion handle up to ~1 TB on one machine with consistent, significant speedups over Pandas (promising up to 94x on heavy analytical benchmarks). Only reach for a cluster when you've actually hit the limits of a single node.

  2. Understand Pandas' constraints. Eager execution, single-threading, and intermediate copies are architectural decisions that become limitations for certain workloads. Lazy execution with predicate pushdown and column pruning provides significant performance advantages at scale.

  3. Pipelined execution improves hardware utilization. Spark's stage-based execution model can leave GPUs idle during CPU preprocessing. Pipelined execution models (like Ray Data and Daft) overlap operations to keep resources better utilized.

  4. Use each engine where it shines. Spark for petabyte ETL, Polars/DuckDB for ad-hoc analysis, Ray Data for GPU training pipelines, Daft for multimodal data loading, DataFusion for custom query engines.

  5. Rust + Arrow is the new foundation. This is not a coincidence — it eliminates GC pauses, serialization overhead, and concurrency bugs at the infrastructure level.

The best engine is the one that keeps your hardware busy and your developers productive.


References

Demo Repository

  • engine-comparison-demo - Companion code for this article: benchmarks, notebooks, and Docker Compose for Spark/Ray/Daft
