Modern Data Processing Engines Compared: Polars, DataFusion, Daft, Ray Data, Pandas, and Spark
For years, Pandas handled in-memory tabular work and Apache Spark handled the distributed kind. That split worked fine when the data was structured.
Modern workloads aren't always structured anymore. Image, audio, and video pipelines now sit alongside the tabular ones, and once GPU inference enters the picture, the JVM's garbage collection and Python's GIL stop being annoyances and start being the bottleneck. That's a big part of why a new generation of engines built on Rust and Apache Arrow has taken off.
I've moved most of my single-node pipelines to Polars over the last year, and I wanted to see how it actually stacks up against the rest of the new stack. So I benchmarked Polars, DataFusion, Daft, Ray Data, and Spark on real datasets — NYC taxi trips for the tabular side, Food-101 images for the multimodal side.
All of the code lives in the engine-comparison-demo repository. Clone it and run the benchmarks on your own hardware — your numbers will be different from mine, and that's the point.
TL;DR: On the marketing-friendly benchmarks, Rust-native engines hit 94x speedups over Pandas on a single node. On real workloads you'll see something more modest but consistent — Polars and DataFusion are both solid defaults. For multimodal pipelines, pipelined engines like Ray Data and Daft run up to 18x faster than Spark because they actually keep GPUs busy. There's no winner across the board. The right pick depends on your data, your scale, and whether GPUs are in the loop.
Data processing workload types
Engines specialize, so the first question to answer is what kind of workload you actually have. The biggest split is structured vs. multimodal.
Structured / Tabular Data. Filtering, aggregating, joining. CPU-bound, usually in memory. A lot of this work runs on distributed clusters that didn't need to be distributed in the first place — about 90% of queries process less than 1 TB of data, which fits comfortably on a single machine these days.
Multimodal AI. Images, audio, video. GPU-bound on the inference side, but the CPU preprocessing is what kills you. Deep learning models eat data faster than CPUs can decode it, so on a standard g6.xlarge (4 CPUs per GPU), GPU utilization drops below 20%. The whole pipeline ends up gated by how fast you can feed the GPU.
The new engines target both. Rust gives them speed, Arrow gives them zero-copy data sharing across processes, and streaming execution lets them handle datasets bigger than RAM.
Part 1: Single-Node Processing
The Pandas bottleneck isn't a bug. It's the design. Pandas loads files into RAM eagerly, copies intermediates on almost every operation, and stays on one thread because of the GIL. In the PDS-H benchmarks at scale factor 10, Pandas takes about 365 seconds to finish a standard query suite. Polars's streaming engine finishes the same work in 3.89 seconds — about 94x faster.
Polars: Tabular Data Processing
Polars has replaced Pandas as the default for serious local ETL in a lot of teams I've talked to. It's written in Rust and uses lazy execution: instead of running each line as it comes, it builds a query plan, optimizes it (predicate pushdown, projection pruning, column pruning), then executes across every CPU core in streaming batches.
The gap widens as the data grows. At scale factor 100 (~100 GB), Polars's streaming engine finished in 23.94 seconds against 152.27 seconds for its own in-memory engine — about 6x faster, on data that's larger than RAM.
From engine_comparison_examples.ipynb:
import polars as pl
# scan_parquet reads only the schema — no data loaded yet
q = (
pl.scan_parquet("yellow_tripdata_2024-01.parquet")
.filter(
(pl.col("trip_distance") > 5.0)
& (pl.col("total_amount") > 30.0)
)
.group_by("payment_type")
.agg(
pl.col("total_amount").mean().alias("avg_fare"),
pl.col("trip_distance").mean().alias("avg_distance"),
pl.len().alias("trip_count"),
)
)
# The entire plan is optimized and executed here, in parallel
result = q.collect()
The difference from Pandas isn't the API. It's the architecture. Polars builds the plan first and optimizes the whole pipeline before touching data. The filter gets pushed down into the Parquet reader, so entire row groups are skipped. Only the columns referenced in the query come off disk. Pandas can't do any of that — each line runs as soon as it's parsed and intermediate copies fall out everywhere.
Apache DataFusion: Extensible Query Engine
Where Polars is a library you use directly, DataFusion is the engine you build other engines on top of. It powers InfluxDB 3.0, GreptimeDB, and Apple's Comet Spark accelerator.
In November 2024, DataFusion became the fastest single-node engine for querying Parquet files in ClickBench, ahead of DuckDB, chDB, and ClickHouse on the same hardware. The Embucket team later ran TPC-H at scale factor 1000 (~1 TB) on a single node on top of it. That's a useful number to keep in mind whenever someone says they need a cluster.
From engine_comparison_examples.ipynb:
from datafusion import SessionContext
ctx = SessionContext()
ctx.register_parquet("taxi", "yellow_tripdata_2024-01.parquet")
# SQL executed directly against Parquet — no intermediate copies
df = ctx.sql("""
SELECT payment_type,
COUNT(*) AS trip_count,
AVG(trip_distance) AS avg_distance,
AVG(total_amount) AS avg_fare
FROM taxi
WHERE trip_distance > 5.0 AND total_amount > 30.0
GROUP BY payment_type
ORDER BY trip_count DESC
""")
result = df.to_pandas()
DataFusion's strength is its modularity. The extension APIs cover custom catalogs, table providers, optimizer rules, and execution plans, which is why it shows up whenever someone is building a custom data platform or embedding a query engine in their own product.
Daft: Multimodal Data Processing
Daft is the one that takes multimodal data seriously. Images, audio, video, embeddings — they're real types in the DataFrame, not opaque bytes. Pandas treats an image as a byte array, Spark serializes it through the JVM, but Daft has native Rust expressions for decoding images, fetching URLs, and producing tensors without the Python loop in the middle.
From engine_comparison_examples.ipynb:
import daft
df = daft.read_parquet("yellow_tripdata_2024-01.parquet")
result = (
df.where(
(daft.col("trip_distance") > 5.0)
& (daft.col("total_amount") > 30.0)
)
.groupby("payment_type")
.agg(
daft.col("total_amount").mean().alias("avg_fare"),
daft.col("trip_distance").mean().alias("avg_distance"),
daft.col("trip_distance").count().alias("trip_count"),
)
.collect()
)
The API will feel familiar if you know Pandas, but the engine is the same Rust + Arrow stack you find in Polars and DataFusion. Where Daft pays off is when the "rows" in your DataFrame are images, PDFs, or tensors.
Benchmark: Single-Node Performance
I ran all four engines, plus native Rust via Polars-rs, on ~41M NYC yellow taxi trips for full year 2024. Full results in the demo repo:
At this size (~660 MB Parquet), every modern engine is fast. The headline 94x Polars vs. Pandas number comes from PDS-H, a synthetic analytical benchmark — on my 41M-row workload I got a real, consistent improvement over Pandas, but nowhere near 94x. The more interesting result is that Polars, DataFusion, Daft, and pure Rust via Polars-rs all land in roughly the same ballpark on this workload. DataFusion edges out the rest on total time. Polars is the most pleasant to write and a solid all-rounder.
Part 2: Handling Multimodal Data
Tabular ETL usually shrinks data: filter, aggregate, write less than you read. Multimodal AI pipelines do the opposite. A single document path might fan out into dozens of text chunks and embedding vectors.
Spark treats images and audio as binary blobs. Touching them in Python means a JVM-to-Python hop through Py4J, processing with Pillow or OpenCV, then a serialization back. That round-trip dominates the wall-clock time on a lot of real workloads.
Pipelined Execution and GPU Utilization
Spark parallelizes partitions across cores, but its stage-based (bulk synchronous) execution is the catch. Inside a single task, every step (download, decode, classify) runs sequentially on one core with no async overlap. And every task in a stage has to finish before the next stage starts. GPUs sit idle while CPUs preprocess; CPUs sit idle while GPUs run inference. Each stage materializes its output before the next one begins, which puts memory pressure on top of everything.
Pipelined execution is the alternative. Instead of running stages back-to-back, the engine overlaps I/O, CPU work, and GPU inference, so the slow component is the only thing waiting at any given time.
Benchmark: Image Processing
I benchmarked image processing on 500 real photographs from the Food-101 dataset. Results from the demo repo:
Polars and DataFusion don't show up here because they don't have native image operations — image work would fall back to sequential Python and the comparison wouldn't be fair. The numbers that matter: Daft's Rust-native image ops are 3.6x faster than Pandas + Pillow, and pure Rust is 4.2x faster than Pandas across the whole pipeline.
Part 3: Distributed Processing
Once a single machine isn't enough, you have to pick how to spread the work. This is where the architectural differences between engines actually start to matter.
Apache Spark
Spark is still what I'd reach for on petabyte-scale tabular ETL with messy shuffles. Fault tolerance through RDD lineage works, the ecosystem is huge (Databricks, EMR), and multi-TB joins are well-trodden ground. AI workloads are where it falls apart. Tasks get scheduled onto JVM executors that don't know about GPUs, so GPUs sit idle waiting on CPU preprocessing.
From distributed_spark.ipynb:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
.appName("TaxiETL") \
.config("spark.sql.adaptive.enabled", "true") \
.getOrCreate()
orders = spark.read.parquet("s3a://lake/taxi/*.parquet")
zones = spark.read.csv("s3a://lake/taxi_zones.csv", header=True)
# Spark excels at this: joining massive tables with a shuffle
result = (
orders
.filter(F.col("trip_distance") > 5.0)
.join(zones, orders.PULocationID == zones.LocationID, "inner")
.groupBy("Borough")
.agg(
F.sum("total_amount").alias("total_revenue"),
F.avg("total_amount").alias("avg_fare"),
)
.orderBy(F.desc("total_revenue"))
)
Ray Data: GPU Utilization
Ray Data was designed around AI workloads. Instead of Spark's stage barriers, it uses a streaming model that keeps GPUs fed. The feature that earns its existence is mixed resource scheduling: you can declare that one actor wants "1 GPU, 4 CPUs" and another wants only CPUs, and Ray figures out the rest.
Amazon reported saving over $120 million per year by moving specific workloads from Spark to Ray. Their PoC numbers were 91% better cost efficiency and 13x more data per hour; the production number settled at 82% better cost efficiency per GiB of S3 input.
From distributed_ray.ipynb:
import ray
ray.init()
ds = ray.data.read_images("s3://my-bucket/food101/")
class ImageClassifier:
def __init__(self):
import torch
from torchvision.models import resnet18, ResNet18_Weights
self.model = resnet18(weights=ResNet18_Weights.DEFAULT).cuda()
self.model.eval()
self.preprocess = ResNet18_Weights.DEFAULT.transforms()
def __call__(self, batch):
import torch
tensors = torch.stack([
self.preprocess(img) for img in batch["image"]
]).cuda()
with torch.no_grad():
preds = self.model(tensors)
return {
"prediction": preds.argmax(dim=1).cpu().numpy(),
"confidence": preds.max(dim=1).values.cpu().numpy(),
}
# ActorPoolStrategy creates persistent GPU workers
predictions = ds.map_batches(
ImageClassifier,
compute=ray.data.ActorPoolStrategy(size=4),
num_gpus=1,
batch_size=64,
)
predictions.write_parquet("s3://output/predictions/")
One detail worth knowing: as you add more CPUs per GPU, Ray Data scales while other engines plateau. In Anyscale's benchmarks, going from a 4:1 to a 32:1 CPU-to-GPU ratio gave a 3x speedup on image inference, because the CPU feeders finally kept up with the GPU.
Daft (Distributed)
Daft scales out through its Flotilla engine: one Swordfish worker per node, with Flotilla on top doing cluster-wide scheduling. Swordfish handles local Rust execution and pipelines I/O with compute through small-batch streaming, so each node stays busy without waiting on the next stage.
Across four multimodal workloads, Daft with Flotilla ran 2–18x faster than Spark. On the heaviest one (video object detection with YOLO11n), Spark took over 3.5 hours and Daft finished in under 12 minutes — 18x, mostly because Spark spends that time on serialization and stage barriers.
One caveat: Anyscale, the company behind Ray, published their own benchmarks showing Ray Data closing or reversing the gap on high-CPU instances with manual tuning. The two are leapfrogging each other on a quarterly basis at the moment.
From distributed_daft.ipynb:
import daft
from daft import col
# Daft's Flotilla engine distributes work across the cluster
df = daft.read_parquet("s3://data-lake/pdf_metadata/*.parquet")
# Download PDFs — Daft parallelizes downloads in Rust
df = df.with_column("pdf_bytes", col("pdf_url").url.download())
# Define a GPU UDF for embedding generation
@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()))
class TextEmbedder:
def __init__(self):
from sentence_transformers import SentenceTransformer
self.model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
def __call__(self, text_col):
texts = text_col.to_pylist()
embeddings = self.model.encode(texts, batch_size=32)
return [emb.tolist() for emb in embeddings]
# Daft schedules CPU downloads and GPU embeddings simultaneously
df = df.with_column("embedding", TextEmbedder(col("text")))
df.write_parquet("s3://output/embeddings/")
Distributed Head-to-Head
| Feature | Spark | Ray Data | Daft (Flotilla) |
|---|---|---|---|
| Execution Model | Task-per-core, partition-based | Streaming tasks and actors | Swordfish-per-node, streaming batches |
| Strengths | Massive SQL/ETL, fault tolerance | Heterogeneous compute, GPU saturation | Multimodal pipelining, bounded memory |
| GPU Utilization | Poor (stage-based, no CPU/GPU overlap) | Excellent (explicit resource assignment) | Excellent (async I/O-compute pipelining) |
| Tuning Burden | High (executor memory, cores, partitions) | Medium (batch sizes, object store) | Low (engine manages resources) |
| Best For | Petabyte tabular ETL | Training/inference with mixed CPU+GPU | Multimodal data loading for PyTorch |
Part 4: The Role of Rust and Arrow
Underneath the competition, the convergence is the more interesting story. Polars, Daft, DataFusion, and Ray's core all share the same foundation: Apache Arrow for the in-memory format and Rust for the concurrency.
That's not just architectural neatness. The Arrow PyCapsule Interface (__arrow_c_stream__ protocol) lets you hand data between engines with no serialization: compute in DataFusion, pass the result to Polars or PyArrow at zero cost; do multimodal preprocessing in Daft and feed Arrow batches straight into a PyTorch DataLoader.
from datafusion import SessionContext
import polars as pl
# Compute in DataFusion (Rust-native execution)
ctx = SessionContext()
ctx.register_parquet("events", "events.parquet")
df = ctx.sql("""
SELECT user_id, COUNT(*) as event_count
FROM events WHERE event_type = 'purchase'
GROUP BY user_id HAVING COUNT(*) > 5
""")
# Zero-copy conversion to Arrow, then to Polars
arrow_table = df.to_arrow_table()
df_polars = pl.from_arrow(arrow_table)
# Continue analysis in Polars
result = df_polars.with_columns(
pl.col("event_count").rank().alias("rank")
).sort("rank")
So why Rust instead of the JVM?
-
No garbage collection pauses. Ownership and borrowing replace GC, which means no stop-the-world pauses and no unpredictable latency spikes. The OOM errors that show up at scale on JVM systems mostly disappear.
-
No object header tax. The JVM adds 12–16 bytes of object header to every object. Across billions of small objects, that's a memory budget all by itself. Rust doesn't pay it.
-
Compile-time thread safety. Rust's type system rejects data races before the binary builds, so multi-threaded execution doesn't come with the usual class of subtle concurrency bugs.
Combined with Arrow's columnar memory layout, the serialization bottleneck that eats up to 30% of CPU cycles in legacy JVM stacks goes away.
Part 5: Adopting Multiple Engines
The honest answer is that you don't pick one engine. You compose them.
In production, this usually looks like a relay. Raw data in a lake gets cleaned and joined by Spark (or Dask if your team is Python-only) into Parquet or Delta tables. Those tables flow into Ray Data or Daft for the last-mile work — GPU inference, embedding generation, multimodal transforms — that Spark was never designed for. For ad-hoc analysis and local development, Polars and DuckDB give you instant feedback without spinning up a cluster.
What makes this composition possible is open formats: Parquet, Delta, Iceberg, Arrow. Every engine in the stack reads and writes them natively, so the handoff between stages is just a file path.
Summary of Engine Use Cases
| Task | Recommended Tool | Trade-off |
|---|---|---|
| Ad-hoc analysis on a laptop | Polars / DuckDB | Speed > Scale |
| Massive SQL ETL (petabyte) | Apache Spark | Reliability > Speed |
| Python-native scaling | Dask | Simplicity > Optimization |
| Training LLMs / computer vision | Ray Data | GPU Utilization > ETL Features |
| Multimodal data loading | Daft | Developer Experience > Ecosystem |
| Building custom query engines | DataFusion | Extensibility > Out-of-box Features |
Diagonal Scaling
For a long time the choice was scale up (a bigger machine) or scale out (more machines). Polars Cloud is doing something in between, which they call diagonal scaling: scale horizontally while reading from cloud storage to max out I/O, then collapse to a single big node once filters and aggregations have shrunk the data, skipping the distributed shuffle entirely.
The counterintuitive part: a bigger, more expensive instance can come out cheaper per job because it pushes GPU utilization up high enough that total execution time drops faster than the hourly rate goes up.
Try It Yourself
The companion demo repository has everything you need to reproduce these benchmarks:
# Install dependencies
uv sync
# Run the tabular benchmark (~2.9M NYC taxi trips)
uv run python -m engine_comparison.benchmarks.tabular
# Run the multimodal benchmark (500 real food photos)
uv run python -m engine_comparison.benchmarks.multimodal
# Run native Rust benchmarks (Polars-rs + image crate)
cd rust_benchmark && cargo run --release && cd ..
The repo also has notebooks for side-by-side API comparisons and Docker Compose setups for local distributed runs of Spark, Ray, and Daft.
Key Takeaways
-
Don't distribute if you don't have to. Polars and DataFusion handle up to ~1 TB on a single machine, with consistent speedups over Pandas (and 94x on heavy analytical benchmarks like PDS-H). Reach for a cluster when you've actually hit the limits of a single node, not before.
-
Pandas's limits are architectural. Eager execution, single-threading, intermediate copies — these are deliberate design choices that turn into bottlenecks once data gets large. Lazy execution with predicate pushdown and column pruning is what changes that.
-
Pipelined execution beats stage barriers for GPU work. Spark's stages leave GPUs idle while CPUs preprocess. Ray Data and Daft overlap I/O, CPU, and GPU work so that no resource is sitting around waiting for the next stage.
-
Use each engine for what it's good at. Spark for petabyte ETL, Polars and DuckDB for ad-hoc analysis, Ray Data for GPU training pipelines, Daft for multimodal data loading, DataFusion for custom query engines.
-
Rust + Arrow is the shared foundation. It's not a coincidence that Polars, DataFusion, Daft, and Ray all converge here — it gets rid of GC pauses, cross-process serialization overhead, and a whole class of concurrency bugs.
If you're not sure where to start: Polars on a single machine, then Daft or Ray Data when GPUs enter the picture. Spark only when you actually have a petabyte.
References
Benchmarks and Performance Data
- Polars PDS-H Benchmarks - Polars streaming vs. Pandas at SF-10 (94x) and SF-100 (6.4x over in-memory)
- DataFusion ClickBench Results (Nov 2024) - Fastest single-node Parquet query engine
- Embucket: TPC-H SF-1000 on DataFusion - Full TPC-H at ~1 TB on a single node
- Amazon Spark-to-Ray Migration - $120M/year savings, 82% production cost efficiency
- Daft Flotilla Benchmarks - 2–18x faster than Spark on multimodal workloads
- Anyscale Ray vs. Daft Benchmarks - Ray Data ~30% average speedup on multimodal
Engines and Frameworks
- Polars - Rust DataFrame library with lazy execution
- Apache DataFusion - Embeddable Rust query engine (GitHub)
- Daft - Multimodal-native distributed DataFrame
- Ray - Distributed compute framework for AI (Data Internals)
- Apache Spark - Distributed ETL and SQL engine
- DuckDB - In-process analytical SQL engine
- Dask - Python-native parallel computing
Architecture and Ecosystem
- Polars Cloud: Diagonal Scaling - Dynamic vertical/horizontal scaling
- Daft Flotilla Architecture - Swordfish + Flotilla distributed engine
- Apache DataFusion Comet - Spark accelerator originally developed at Apple
- Apache Arrow - Columnar in-memory format (Flight RPC, PyCapsule Interface)
- InfluxDB 3.0 + DataFusion - Time-series database built on DataFusion
- GreptimeDB - Observability database using DataFusion
Demo Repository
- engine-comparison-demo - Companion code for this article: benchmarks, notebooks, and Docker Compose for Spark/Ray/Daft
Datasets
- NYC Taxi Trip Records - NYC TLC yellow taxi data (Parquet, ~2.9M rows/month)
- Food-101 Dataset - 101K food images from ETH Zurich (Bossard et al., ECCV 2014)