Rust for data in 15 minutes

We've covered a lot of ground so far and seen many concepts.

That can be a lot to keep in mind, so here's the executive summary: Rust in 15 minutes.

This will be maintained and updated constantly.

Why Rust for data engineering?

Why even bother when Python does the job, right? Well, here are a few arguments:

  • Rust allows you to build code that is very fast
  • Its type system and data structures allow for extremely efficient and safe code.
  • The compiler and package manager Rust comes bundled with are almost everything you need to get started and get very far.
  • It simplifies many downstream operations (no more complicated pip/pipenv Docker gymnastics).
  • Most of the errors that would wake you up at night in Python surface at compile time in Rust, so they are caught early. You catch the bugs, not your users.
  • Frankly, writing Rust is not that different from writing Python, provided you follow some sane practices and don't overdo things. This takes some experience.

Data & Functions

Working with data, at its essence, involves fetching, processing and storing data in one form or another. The Rust programming language stays very close to this spirit: most of the stuff is just data (algebraic types) and functions. No need to mess around with objects, garbage collection or, heaven forbid, the GIL.

Furthermore, the data models and dependencies are defined and specified in code, which eliminates much of the guessing, testing and validation work. If the code compiles, you know that every value has the shape its type declares and that every function receives the types it expects.

Data modelling

You can use Rust's built-in types to model data. Here's a simple example:

enum AnimalType {
	Dog,
	Cat,
	Bird,
}

struct Animal {
	animal_type: AnimalType,
	name: String,
	age: i32,
}

fn main() {
	let animal = Animal {
		animal_type: AnimalType::Dog,
		name: String::from("Buddy"),
		age: 3,
	};

	println!("{} is {} years old.", animal.name, animal.age);
}

The code above doesn't need tests for the shape of the data. Anyone using our Animal struct gets guarantees: the name is a String, the age is a 32-bit integer and the animal type is one of the possible variants (Dog, Cat, Bird). We don't need to write tests for any of this.

Now imagine the same code in Python and think of all the places it could go wrong. Here's the code without type annotations:

class AnimalType:
	DOG = 'dog'
	CAT = 'cat'
	BIRD = 'bird'

class Animal:
	def __init__(self, animal_type, name, age):
		self.animal_type = animal_type
		self.name = name
		self.age = age
		
animal = Animal(AnimalType.DOG, 'Buddy', 3)
animal_2 = Animal("Potato", None, "lol") 

print(f'{animal.name} is {animal.age} years old.')
print(f'{animal_2.name} is {animal_2.age} years old.')

# Buddy is 3 years old.
# None is lol years old.

OK, this isn't completely fair, right? This is the version without types. What does the version with types look like?

from enum import Enum

class AnimalType(Enum):
	DOG = 'dog'
	CAT = 'cat'
	BIRD = 'bird'

class Animal:
	def __init__(self, animal_type: AnimalType, name: str, age: int) -> None:
		self.animal_type = animal_type
		self.name = name
		self.age = age

animal = Animal(AnimalType.DOG, 'Buddy', 3)
animal_2 = Animal("Potato", None, "lol") 

print(f'{animal.name} is {animal.age} years old.')
print(f'{animal_2.name} is {animal_2.age} years old.')

# Buddy is 3 years old.
# None is lol years old.

Wow that helped... 🥸

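Python does not enforce type hints at runtime, so the annotations changed nothing about the output. Now, it is safe to say that a local linter or static type checker might have caught that for this simple example. But for a more complicated example, I would bet it wouldn't.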
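For comparison, here is a rough sketch of what happens when the animal_2 line is ported to Rust. The commented-out construction is rejected by the compiler because the types don't match, and a name that may be missing has to be declared as such. The `name: Option<String>` field and the `describe` helper are assumptions made for this illustration; they are not part of the Animal struct used elsewhere in this chapter.

```rust
#[allow(dead_code)] // not every variant is constructed in this short sketch
#[derive(Debug)]
enum AnimalType {
    Dog,
    Cat,
    Bird,
}

struct Animal {
    animal_type: AnimalType,
    // Assumption for this sketch: a missing name is modelled explicitly.
    name: Option<String>,
    age: i32,
}

// Hypothetical helper: builds the sentence and handles the missing-name case.
fn describe(animal: &Animal) -> String {
    match &animal.name {
        Some(name) => format!("{} is {} years old.", name, animal.age),
        None => format!("An unnamed {:?} is {} years old.", animal.animal_type, animal.age),
    }
}

fn main() {
    // The Rust counterpart of `Animal("Potato", None, "lol")` does not compile:
    // a &str is not an AnimalType, and "lol" is not an i32.
    // let animal_2 = Animal { animal_type: "Potato", name: None, age: "lol" };

    let buddy = Animal {
        animal_type: AnimalType::Dog,
        name: Some(String::from("Buddy")),
        age: 3,
    };
    let stray = Animal {
        animal_type: AnimalType::Cat,
        name: None,
        age: 2,
    };

    println!("{}", describe(&buddy));
    println!("{}", describe(&stray));
}
```

The point of the design is that "no name" becomes a case the compiler makes you handle in the match, instead of a None that silently ends up in a printed string.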

Here's another take:

Using Rust, you can enforce data rules (or data contracts, as some call them) before leaving the IDE, instead of waiting to catch violations downstream in production through telemetry or log analysis.
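As a minimal sketch of such a data contract: the compiler checks that an age is an i32, but not that it is a sensible one. One common way to push a value-level rule into the types is a small wrapper type with a checked constructor. The `Age` type, its 0 to 150 range, the error message and the `is_senior` function are all assumptions made up for this example. Note that the range check itself still runs at runtime, at the boundary where data enters; what the compiler enforces is that no downstream function can receive an `Age` that skipped the check.

```rust
use std::convert::TryFrom;

/// Hypothetical wrapper type: an age that has already passed validation.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Age(u8);

impl Age {
    /// Returns the validated age in years.
    fn years(self) -> u8 {
        self.0
    }
}

impl TryFrom<i32> for Age {
    type Error = String;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        // Assumed rule for illustration: ages must lie between 0 and 150.
        if (0..=150).contains(&value) {
            Ok(Age(value as u8))
        } else {
            Err(format!("invalid age: {}", value))
        }
    }
}

// Downstream code that takes an `Age` never needs to re-check the range.
// The threshold of 10 years is an arbitrary value chosen for this sketch.
fn is_senior(age: Age) -> bool {
    age.years() >= 10
}

fn main() {
    let raw_inputs: [i32; 2] = [12, -1];
    for raw in raw_inputs.iter() {
        // Bad input is rejected at the boundary, before it reaches `is_senior`.
        match Age::try_from(*raw) {
            Ok(age) => println!("{} is valid, senior: {}", raw, is_senior(age)),
            Err(e) => println!("rejected: {}", e),
        }
    }
}
```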

I like to think of Rust vs Python like this:

  • Using Python, you borrow some happiness from the future and spend it now, since it's easy to get something out the door. You'll have to pay some toll for maintenance, tests and integration work down the line.

  • With Rust, you can't borrow anything from the future (the borrow checker won't allow that). Instead, you spend some time upfront writing code and making sure it's correct, as an investment in future happiness: code you won't have to maintain much and fewer surprises in production.

At the end of the day, you'll have to balance out your tolerance for risk, the speed at which you want to ship features and the compounding cost of system maintenance.

Operating on data

Now the cool thing about Rust is that, as well as enforcing rules on the data itself, you can enforce rules on how operations on the data should work (or not).

Each function declares the types it operates on, and the code fails to compile if a caller passes something that doesn't match.

We've already covered this point in the previous chapters, but just to drive it home, here's another example.

enum AnimalType {
	Dog,
	Cat,
	Bird,
}

struct Animal {
	animal_type: AnimalType,
	name: String,
	age: i32,
}

fn print_animal_info(animal: Animal) {
	match animal.animal_type {
		AnimalType::Dog => println!("{} is a {}-year-old dog.", animal.name, animal.age),
		AnimalType::Cat => println!("{} is a {}-year-old cat.", animal.name, animal.age),
		AnimalType::Bird => println!("{} is a {}-year-old bird.", animal.name, animal.age),
	}
}

fn main() {
	let dog = Animal {
		animal_type: AnimalType::Dog,
		name: String::from("Buddy"),
		age: 3,
	};

	let cat = Animal {
		animal_type: AnimalType::Cat,
		name: String::from("Mittens"),
		age: 2,
	};

	print_animal_info(dog);
	print_animal_info(cat);
}

Try passing anything other than an Animal to print_animal_info and see what happens. Similarly, try to instantiate an Animal with improper types.

All mismatched combinations will fail at compile time. This keeps us safe when working with and operating on complicated data.
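One more compile-time rule worth sketching is exhaustiveness: a match on an enum must cover every variant. In the variation below, the hypothetical helpers `species_label` and `animal_info` return text instead of printing it, and they borrow the animal (`&Animal`) rather than taking ownership as print_animal_info does, so the value can still be used afterwards. Adding, say, a `Fish` variant to AnimalType without adding a matching arm would stop this code from compiling.

```rust
#[allow(dead_code)] // not every variant is constructed in this short sketch
enum AnimalType {
    Dog,
    Cat,
    Bird,
}

struct Animal {
    animal_type: AnimalType,
    name: String,
    age: i32,
}

// Hypothetical helper: every variant needs an arm, or the code won't compile.
fn species_label(animal_type: &AnimalType) -> &'static str {
    match animal_type {
        AnimalType::Dog => "dog",
        AnimalType::Cat => "cat",
        AnimalType::Bird => "bird",
    }
}

// Hypothetical helper: borrows the animal instead of taking ownership of it.
fn animal_info(animal: &Animal) -> String {
    format!(
        "{} is a {}-year-old {}.",
        animal.name,
        animal.age,
        species_label(&animal.animal_type)
    )
}

fn main() {
    let dog = Animal {
        animal_type: AnimalType::Dog,
        name: String::from("Buddy"),
        age: 3,
    };

    println!("{}", animal_info(&dog));
    // `dog` was only borrowed above, so it is still usable here.
    println!("{}", animal_info(&dog));
}
```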

Perhaps it is true that AGI might be written in Rust? ;)

The Zen of Rust

In one way or another, I like to entertain the thought that Rust comes close to the spirit of the Zen of Python. To this end, I propose the Zen of Rust, written with the help of GPT-4:

The Zen of Rust:

Safety over speed, for caution paves the way.
Expressive code trumps cryptic haze.
Simplicity shines brighter than perplexity.
Perform well, but not at clarity's expense.
Be fearless with concurrency; Rust shall protect you.
Ergonomics and affordance, a developer's delight.

Compose, not inherit, to harness true might.
An ecosystem thrives when it shares its light.
Document your wisdom, enlighten the crowd.
Errors instruct, be clear and be loud.
Lifetime annotations, the keepers of age.
Cherish the past, but don't be its slave.

Rust is a journey, growth is the prize.
In each line of code, its Zen shall arise.

Rust for data engineering

Rust is a fantastic tool for data engineering. So far we've covered how to get started with Rust, but that only scratches the surface of what is possible.

I'm beyond excited to finish this chapter since the following ones will have a lot more to do with actually doing some productive data work with Rust and learning by doing.

Next we'll cover how to implement the following using Rust:

  • Parsing CSV files
  • Serialising and using Parquet/Avro files
  • Working with APIs and JSON data
  • ... and more!

