Advantages of Rust for Data Engineering

Having gone through some of Rust's main features, let's take a closer look at how they help our data engineering efforts. Some of these points have already been covered, but it is worth listing them again in a general overview.

Reliability and scalability

Rust is a reliable programming language. Code that compiles successfully comes with guarantees that help prevent many runtime errors: Rust is statically typed and enforces memory safety at compile time. Both features make Rust very reliable at translating ideas correctly into machine instructions.
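To make these guarantees a bit more concrete, here is a minimal sketch using only the standard library. The `Reading` and `ParseError` types and the `parse_reading` and `total` functions are hypothetical names invented for this illustration, and the "sensor,value" line format is an assumption. The sketch shows failure modes encoded in a function's return type, plus a commented-out line that the compiler would reject because the value has been moved.

```rust
use std::num::ParseFloatError;

/// Hypothetical record type for this sketch: one sensor reading.
#[derive(Debug, PartialEq)]
struct Reading {
    sensor: String,
    value: f64,
}

/// Every way parsing can fail is part of the function signature.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingField,
    BadNumber(ParseFloatError),
}

/// Parse a line such as "temp,21.5" (assumed format) into a `Reading`.
fn parse_reading(line: &str) -> Result<Reading, ParseError> {
    let (sensor, raw_value) = line.split_once(',').ok_or(ParseError::MissingField)?;
    let value = raw_value
        .trim()
        .parse::<f64>()
        .map_err(ParseError::BadNumber)?;
    Ok(Reading {
        sensor: sensor.trim().to_string(),
        value,
    })
}

/// Takes ownership of the readings and returns the sum of their values.
fn total(readings: Vec<Reading>) -> f64 {
    readings.iter().map(|r| r.value).sum()
}

fn main() {
    let lines = ["temp,21.5", "temp,abc", "broken", "temp,20.5"];
    let mut readings = Vec::new();
    for line in lines {
        // A `Result` cannot be used as a `Reading` without handling the error case.
        match parse_reading(line) {
            Ok(reading) => {
                println!("parsed a reading from sensor {}", reading.sensor);
                readings.push(reading);
            }
            Err(err) => eprintln!("skipping {line:?}: {err:?}"),
        }
    }
    let sum = total(readings);
    // println!("{}", readings.len()); // would not compile: `readings` was moved into `total`
    println!("sum of valid readings: {sum}");
}
```

Malformed input is still possible at runtime, of course. The point is only that the type system forces the caller to decide what happens in that case.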

Rust's type system can also make code more readable, which increases reliability not only in function but also in representation: another developer can reason about the code base without too much overhead (provided the original developer didn't try to be "too smart" by implementing overly complicated code).
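As a small illustration of types doubling as documentation, the sketch below models the inputs of an imaginary pipeline. The `Source` and `Compression` enums and the `describe` function are assumptions made up for this example, not part of any library.

```rust
/// Hypothetical compression options for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Compression {
    Uncompressed,
    Zstd,
}

/// Hypothetical pipeline inputs: each variant carries exactly the
/// settings that make sense for it, and nothing else.
#[derive(Debug, PartialEq)]
enum Source {
    Csv { path: String, delimiter: char },
    Parquet { path: String, compression: Compression },
}

/// The `match` must cover every variant; adding a new `Source` later
/// turns every forgotten call site into a compile error.
fn describe(source: &Source) -> String {
    match source {
        Source::Csv { path, delimiter } => {
            format!("CSV file {path} split on {delimiter:?}")
        }
        Source::Parquet { path, compression } => {
            format!("Parquet file {path} ({compression:?})")
        }
    }
}

fn main() {
    let sources = [
        Source::Csv { path: "events.csv".to_string(), delimiter: ',' },
        Source::Parquet {
            path: "events.parquet".to_string(),
            compression: Compression::Zstd,
        },
        Source::Parquet {
            path: "raw.parquet".to_string(),
            compression: Compression::Uncompressed,
        },
    ];
    for source in &sources {
        println!("{}", describe(source));
    }
}
```

A reader who has never seen the pipeline can tell from the type definitions alone which inputs exist and which options each one accepts.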

Rust is also very performant, with performance comparable to that of C++ and often much better than that of interpreted languages like Python and Ruby. Its concurrency model and memory safety help make it a good choice for building scalable systems that make efficient use of multicore CPUs.
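The sketch below hints at what this looks like in practice, using scoped threads from the standard library (`std::thread::scope`, available since Rust 1.63). The `parallel_sum` function and its chunking strategy are assumptions chosen for brevity, not a tuned implementation.

```rust
use std::thread;

/// Sum a slice by splitting it into chunks and summing each chunk on its own thread.
fn parallel_sum(values: &[u64], workers: usize) -> u64 {
    if values.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division so that every element lands in some chunk.
    let chunk_size = (values.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn all workers first; each one only borrows its own chunk.
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<u64>()))
            .collect();

        // Then wait for the partial sums and add them up.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}

fn main() {
    let values: Vec<u64> = (1..=1_000_000).collect();
    println!("sum = {}", parallel_sum(&values, 4));
}
```

The threads borrow `values` without locks or copies, and the compiler checks that these borrows cannot outlive the scope. Code that tried to mutate the slice from several threads at once would be rejected at compile time.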

Together, these properties make Rust a scalable and reliable tool for a wide variety of data engineering tasks.

Performance

I'd like to elaborate a little more on performance, starting with the fact that most programs can be made performant given enough time spent debugging, profiling and tweaking different parts. The time invested in tuning is usually not reflected in online benchmarks, yet it helps to see performance in relation to other things rather than as an absolute metric (unless you just want to calculate digits of pi). In Rust's case, it is easy to write performant code without resorting to arcane and obscure knowledge: use the basics and the software should be good enough. It's still possible to tune a lot more, but the basics are more than enough for 80% of cases.
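As an example of what "the basics" can look like, the following sketch computes an average with plain iterator adapters from the standard library. The `mean_valid` function and its rule for what counts as a valid value are assumptions made for this illustration.

```rust
/// Average of all finite, non-negative values, or `None` if there are none.
fn mean_valid(values: &[f64]) -> Option<f64> {
    // A single lazy pass: no intermediate collection is allocated.
    let (sum, count) = values
        .iter()
        .filter(|v| v.is_finite() && **v >= 0.0)
        .fold((0.0_f64, 0_usize), |(sum, count), v| (sum + v, count + 1));

    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

fn main() {
    let measurements = [4.0, f64::INFINITY, -2.0, 6.0];
    match mean_valid(&measurements) {
        Some(mean) => println!("mean of valid measurements: {mean}"),
        None => println!("no valid measurements"),
    }
}
```

Nothing here is tuned by hand. The filter and the fold are ordinary iterator calls, and the compiler is free to fuse them into a single loop.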

At the end of the day, performance is a function of how fast a set of instructions leads to the result and how much effort is needed to get there.

Sites such as Rosetta Code collect implementations of the same tasks in many programming languages, and online benchmarks built on such tasks usually rely on implementations "tailored" to be as efficient as possible. These don't reflect real-life conditions.

Here's a better benchmark: take two developers with comparable skill sets and ask them to implement the same thing in under 30 minutes (for example, inverting a matrix). One uses Rust and the other uses Python. Repeat this experiment often enough, then look at the average performance of the implementations, grouped by programming language. I'm almost certain Rust wins by a landslide.

Of course, every implementation can be tuned, but code that works is often never looked at again.

Popularity and community support

There is much to be said about Rust's steadily growing popularity and community support. For many years in a row, Rust has been voted the most loved programming language in the Stack Overflow Developer Survey.

It is now used across multiple industries, including web development, embedded programming, game development and, as this website will show, data engineering. It has been adopted by several large tech companies: Amazon, Microsoft, Mozilla, Discord and Google, among others.

There is also a growing and welcoming community of developers who contribute to the language and the tools built with it. Rust's community offers multiple avenues for learning, and this guide hopes to be one of them. More and more events around Rust are being organized, and new content is published daily to teach people about Rust.

New developments like WebAssembly (WASM) raise a lot of interest and curiosity about how they tie in with Rust's capabilities. The future looks bright for Rust's adoption.

Libraries and ecosystem

Rust's community has developed a rich ecosystem of libraries and tools, many of which are open source and available on platforms like crates.io, GitHub, and GitLab. There are many libraries you can use for data engineering tasks, notably Polars (pola.rs), Apache DataFusion and Apache Arrow, which we will cover on this website. The list is constantly growing, with new additions that leverage Rust's performance and integrate seamlessly with the language's capabilities.

Some libraries might still be missing, though, and you won't (yet) find the same breadth of coverage that ecosystems like Python's provide.

All things considered, Rust is a great candidate for almost all of the tasks that require data transformation. Over the next chapters, we'll go through all the details with a lot more code.

Next, we'll see how Rust compares to Python in a bit more detail.

How does Rust compare to Python (and other programming languages)?

It's important to know how Rust compares to Python if you are considering switching some of your workloads to Rust. Using Rust instead of Python is a tough sell, especially for things like data engineering. For any data problem that you can think of, there is surely some Python implementation