Basics of representing data using Rust

One of the most important parts of data engineering is data modelling. It is usually a complicated and tricky exercise, since it affects most of the data work done afterwards.

It is tricky because data modelling is only possible when we have relatively solid assumptions about the data, how it is going to be accessed, and by whom.

Data modelling is a complete topic in itself that could warrant its own book, but for now we'll keep it simple. Our goal is to use the Rust programming language to represent some sort of data.

It isn't quite like OOP

Rust uses structs instead of classes to define data structures and associated behaviours. While structs can have methods and implement traits (similar to interfaces), they lack some of the features associated with traditional classes, most notably inheritance hierarchies and the subtype polymorphism that comes with them. Rust's approach encourages composition over inheritance and favours traits for code reuse and polymorphism.
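To make this a bit more concrete, here is a minimal sketch of composition plus trait-based polymorphism. The `Describe` trait and the `Location`, `Sensor` and `Reading` types are hypothetical names made up for this illustration; they don't come from any library.

```rust
// Hypothetical trait: a shared behaviour, similar to an interface.
trait Describe {
    fn describe(&self) -> String;
}

// A small component that other structs can embed (composition).
struct Location {
    city: String,
}

struct Sensor {
    name: String,
    location: Location, // a Sensor *has a* Location, it does not inherit from it
}

struct Reading {
    value: f64,
}

impl Describe for Sensor {
    fn describe(&self) -> String {
        format!("sensor {} in {}", self.name, self.location.city)
    }
}

impl Describe for Reading {
    fn describe(&self) -> String {
        format!("reading {:.1}", self.value)
    }
}

// Polymorphism through trait objects: any type implementing Describe is accepted.
fn describe_all(items: &[Box<dyn Describe>]) -> Vec<String> {
    items.iter().map(|item| item.describe()).collect()
}

fn main() {
    let items: Vec<Box<dyn Describe>> = vec![
        Box::new(Sensor {
            name: String::from("t1"),
            location: Location { city: String::from("Berlin") },
        }),
        Box::new(Reading { value: 21.5 }),
    ];
    for line in describe_all(&items) {
        println!("{line}");
    }
}
```

`Sensor` and `Reading` share no base class, yet both can be handled through the same `Describe` interface.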

Object-oriented languages like Java or Python (the usual suspects for data work) force you to think about your data in a specific way, which starts out very simple but gets more complicated and fragile as the project grows. OOP languages are generally good at representing things like genetics but fall short when representing a lot of things in real life that are (ironically) subject to evolving.

Data modelling zoo

Let's take the following example:

Imagine you're building a simulation of a zoo, where various animals reside in different enclosures. Each animal has unique characteristics, such as species, age, and behaviour. The zookeepers need to perform specific actions based on the type of animal, like feeding, cleaning, or providing medical care.

In Java, you might create a base class called Animal and subclass it to represent different animal types, such as Lion, Elephant and Giraffe. Each subclass would have its own specific attributes and behaviours. Similarly, you might create a Zookeeper class responsible for managing and interacting with the animals.

As the simulation evolves, you realise that some animals require special care or have additional features that are not easily represented using inheritance alone. For example, some animals might require specific diets or have unique attributes like the ability to fly or swim. You might also want to simulate behaviours that are not directly tied to the animal's species, such as an animal performing tricks or participating in a race.

In Java (or Python), accommodating these new requirements becomes challenging within the existing class hierarchy. You may need to introduce complex conditional logic, or modify the base classes and their subclasses and update their tests, potentially leading to a tightly coupled and hard-to-maintain codebase.

On the other hand, in a language like Rust, you can leverage its composition and trait system to handle these scenarios more flexibly. Rust's composition allows you to create reusable components that can be combined to represent various aspects of an animal, such as a Diet component, a FlightAbility component, or a Behaviour component. You can then mix and match these components based on the specific needs of each animal, without the constraints of a rigid class hierarchy.

Traits in Rust allow you to define common behaviours, such as Feedable or PerformTrick, and then implement those traits for the individual animal types that need them. This approach provides a more modular and flexible way to represent the diverse characteristics and behaviours of animals in a zoo simulation.

Here's some minimal code to illustrate the point:

// Define traits for behaviors
trait Feedable {
	fn feed(&self);
}

trait PerformTrick {
	fn perform_trick(&self);
}

// Define reusable components
struct Diet {
	// Fields specific to the diet component
	// ...
}

struct FlightAbility {
	// Fields specific to the flight ability component
	// ...
}

struct Behavior {
	// Fields specific to the behavior component
	// ...
}

// Implement traits for specific animals
struct Lion {
	diet: Diet,
	behavior: Behavior,
}

impl Feedable for Lion {
	fn feed(&self) {
		// Implement lion-specific feeding logic
		println!("Om nom nom!");
	}
}

impl PerformTrick for Lion {
	fn perform_trick(&self) {
		// Implement lion-specific trick logic
		println!("*jumps through hoop*! Roar!");
	}
}

struct Eagle {
	diet: Diet,
	flight_ability: FlightAbility,
}

impl Feedable for Eagle {
	fn feed(&self) {
		// Implement eagle-specific feeding logic
		// Inspired from https://www.youtube.com/watch?v=IdFxnbZtu1I
		println!("kwit-kwit-kwit-kwit-kee-kee-kee-kee-ker");
	}
}

impl PerformTrick for Eagle {
	fn perform_trick(&self) {
		// Implement eagle-specific trick logic
		println!("*does a barrel roll*");
	}
}

// Zookeeper interacts with animals based on traits
struct Zookeeper;

impl Zookeeper {
	fn feed_animal<T: Feedable>(&self, animal: &T) {
		animal.feed();
	}

	fn make_animal_perform_trick<T: PerformTrick>(&self, animal: &T) {
		animal.perform_trick();
	}
}

fn main() {
	let lion = Lion {
		diet: Diet {},
		behavior: Behavior {},
	};

	let eagle = Eagle {
		diet: Diet {},
		flight_ability: FlightAbility {},
	};

	let zookeeper = Zookeeper;

	zookeeper.feed_animal(&lion);
	zookeeper.feed_animal(&eagle);

	zookeeper.make_animal_perform_trick(&lion);
	zookeeper.make_animal_perform_trick(&eagle);
}

While it's still possible to achieve similar functionality in Java through interfaces and composition, the inheritance-heavy nature of the language can make it more cumbersome to handle evolving requirements and dynamic combinations of behaviours. Rust's design, with its emphasis on composition and traits, provides more flexibility and extensibility in scenarios where representing complex real-life entities is crucial.

Example (to the best of my knowledge):


// truncated previous code 

class Eagle implements Feedable, PerformTrick {
// class Eagle implements Feedable, PerformTrick, FlyHigh {
	private Diet diet;
	private FlightAbility flightAbility;

	// Constructor and other methods for Eagle

	public void feed() {
		// Implement eagle-specific feeding logic
	}

	public void performTrick() {
		// Implement eagle-specific trick logic
	}
	
	// public void flyHigh() {
			// Implement eagle-specific high-flying logic
	// }
}

If the Eagle now has to do something new, such as FlyHigh, we have to modify the existing Eagle class and possibly its hierarchy, potentially impacting other parts of the codebase that rely on those classes. In a large codebase, such modifications can be cumbersome, especially if many classes and methods need to be updated.

On the other hand, in Rust, the composition and trait system provides more flexibility. Adding a new behaviour like FlyHigh would involve defining a FlyHigh trait (relying on a component such as FlightAbility for any data it needs) and implementing that trait for the animal types that require it. Existing traits and their implementations stay untouched, which keeps the code modular and makes breaking changes in other parts of the codebase less likely.
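A minimal sketch of what that could look like follows. It is self-contained rather than a continuation of the zoo code above: the `max_altitude_m` field is a hypothetical addition, and the methods return a `String` instead of printing so the result is easy to check.

```rust
// Existing code, left unchanged: a trait, a component and an animal type.
trait Feedable {
    fn feed(&self) -> String;
}

struct FlightAbility {
    max_altitude_m: u32, // hypothetical field, for illustration only
}

struct Eagle {
    flight_ability: FlightAbility,
}

impl Feedable for Eagle {
    fn feed(&self) -> String {
        String::from("the eagle eats")
    }
}

// New requirement: a new trait and a new impl block.
// Nothing above this line had to be edited.
trait FlyHigh {
    fn fly_high(&self) -> String;
}

impl FlyHigh for Eagle {
    fn fly_high(&self) -> String {
        format!("the eagle climbs to {} m", self.flight_ability.max_altitude_m)
    }
}

fn main() {
    let eagle = Eagle {
        flight_ability: FlightAbility { max_altitude_m: 3000 },
    };
    println!("{}", eagle.feed());
    println!("{}", eagle.fly_high());
}
```

The `impl FlyHigh for Eagle` block could even live in a different module or file from the one that defines `Eagle`.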

I hope this made it a little clearer. This is a very simple example; imagine one ten times more complex and ask yourself which programming language would add more cognitive load. In the end, which language to go with can also be a matter of preference. I prefer Rust because there are no surprises.

Not just data modelling, state modelling too!

The bane of many data engineers and data analysts is usually contained in these two words: "data cleansing". It has (sadly) become an accepted fact that most of the data work has to be spent on efforts like data cleaning before any kind of useful insight comes out.

It shouldn't be this way, at least not in the ratio we're seeing in the industry today. Statements like "90% of the data work is spent cleaning data" make me sad. Ideally we should strive for silent pagers, fewer on-call rotations and far fewer surprises.

Let's take the example of providing a library that implements some sort of LightBulb logic.

// Define the possible states of a light bulb
enum LightState {
	On,
	Off,
}

// Define the struct to represent the light bulb
struct LightBulb {
	state: LightState,
}

impl LightBulb {
	// Method to toggle the state of the light bulb
	fn toggle(&mut self) {
		match self.state {
			LightState::On => self.state = LightState::Off,
			LightState::Off => self.state = LightState::On,
		}
	}

	// Method to check the current state of the light bulb
	fn is_on(&self) -> bool {
		match self.state {
			LightState::On => true,
			LightState::Off => false,
		}
	}
}

Using enums, we can model the different valid states our data (and program) can be in and make sure that anyone using the LightBulb won't end up with a NaN, null or None-like state by mistake, since the Rust compiler forces every match on the state to cover all possible variants.

Reminder: the compiler gives us this guarantee on its own, without any additional unit tests or edge-case tests.

fn main() {
	// Create a new light bulb instance
	let mut bulb = LightBulb {
		state: LightState::Off,
	};

	// Check the initial state of the light bulb
	println!("Is the light bulb on? {}", bulb.is_on());

	// Toggle the state of the light bulb
	bulb.toggle();

	// Check the updated state of the light bulb
	println!("Is the light bulb on? {}", bulb.is_on());
}

By leveraging Rust, we can now make sure that the data we are modelling is flexible as well as consistent throughout the lifetime of our program. This is phenomenal and much of the reliability will stem from this, since guarantees are embedded and enforced in code instead of specifications and docs.
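To see that guarantee at work, here is a small sketch that extends the light bulb with a third state. The `Dimmed` variant and the `brightness` function are hypothetical additions for illustration, not part of the example above.

```rust
// Hypothetical extension of the light bulb: a third state that carries data.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LightState {
    On,
    Off,
    Dimmed(u8), // brightness in percent; an assumption made for this sketch
}

// Every match on LightState now has to handle Dimmed as well. Leaving the
// arm out is a compile-time error (non-exhaustive patterns), not a runtime
// surprise.
fn brightness(state: LightState) -> u8 {
    match state {
        LightState::On => 100,
        LightState::Off => 0,
        // Clamp to 100 so an out-of-range value can't leak out
        LightState::Dimmed(percent) => percent.min(100),
    }
}

fn main() {
    for state in [LightState::On, LightState::Off, LightState::Dimmed(40)] {
        println!("{:?} -> {}%", state, brightness(state));
    }
}
```

If you delete the `Dimmed` arm, the program no longer compiles, which is exactly the kind of feedback we want before the code reaches production.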

Data Modelling with Rust

Let's go through a simple data modelling exercise for a blog that has Posts, which are either still in draft or already live and can contain an optional Hero Image (main header image).

Modelling this in Rust gives us:

pub enum PostStatus { // possible post status
	Live,
	Draft,
}

pub struct Image {
	id: u32,
	url: String,
	alt_text: String,
}

pub struct Post {
	id: u32,
	title: String,
	content: String,
	pub_date: String, // for simplicity, using String instead of DateTime
	status: PostStatus,
	hero_image: Option<Image>, // optional Hero image
}

impl Image {
	pub fn new(id: u32, url: String, alt_text: String) -> Image {
		Image {
			id,
			url,
			alt_text,
		}
	}

	pub fn print_url(&self) {
		println!("Image URL: {}", self.url);
	}
}

impl Post {
	pub fn new(id: u32, title: String, content: String, pub_date: String, status: PostStatus, hero_image: Option<Image>) -> Post {
		Post {
			id,
			title,
			content,
			pub_date,
			status,
			hero_image,
		}
	}
	
	pub fn print_title(&self) {
		println!("Title: {}", self.title);
	}

	pub fn print_status(&self) {
		match self.status {
			PostStatus::Live => println!("The post is live!"),
			PostStatus::Draft => println!("The post is a draft."),
		}
	}
}

fn main(){
	let image = Image::new(1, String::from("https://grugbrain.dev/grug.png"), String::from("An example image"));
	let post = Post::new(
		1, 
		String::from("My first blog post"), 
		String::from("Hello, world!"), 
		String::from("2023-05-18"), 
		PostStatus::Draft, 
		Some(image)
	);
	post.print_title();
	post.print_status();
}

Now, go ahead and try to add a new PostStatus called "PendingReview" to the code above.

Solution
	pub enum PostStatus { // possible post status
		Live,
		Draft,
		PendingReview,
	}
	
	pub struct Image {
		id: u32,
		url: String,
		alt_text: String,
	}
	
	pub struct Post {
		id: u32,
		title: String,
		content: String,
		pub_date: String, // for simplicity, using String instead of DateTime
		status: PostStatus,
		hero_image: Option<Image>, // optional Hero image
	}
	
	impl Image {
		pub fn new(id: u32, url: String, alt_text: String) -> Image {
			Image {
				id,
				url,
				alt_text,
			}
		}
	
		pub fn print_url(&self) {
			println!("Image URL: {}", self.url);
		}
	}
	
	impl Post {
		pub fn new(id: u32, title: String, content: String, pub_date: String, status: PostStatus, hero_image: Option<Image>) -> Post {
			Post {
				id,
				title,
				content,
				pub_date,
				status,
				hero_image,
			}
		}
		
		pub fn print_title(&self) {
			println!("Title: {}", self.title);
		}
	
		pub fn print_status(&self) {
			match self.status {
				PostStatus::Live => println!("The post is live!"),
				PostStatus::Draft => println!("The post is a draft."),
				// we need to cover all the possible cases! no more NaN :D
				PostStatus::PendingReview => println!("The post is still pending a review."),
			}
		}
	}
	
	fn main(){
		let image = Image::new(1, String::from("https://grugbrain.dev/grug.png"), String::from("An example image"));
		let post = Post::new(
			1, 
			String::from("My first blog post"), 
			String::from("Hello, world!"), 
			String::from("2023-05-18"), 
			PostStatus::PendingReview, 
			Some(image)
		);
		post.print_title();
		post.print_status();
	}
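The optional hero image gets the same treatment as the status: an `Option<Image>` has to be unpacked before the image can be used, so the "no image" case can't be forgotten. Below is a minimal sketch with trimmed-down versions of `Post` and `Image`; the `hero_summary` method is a hypothetical helper that is not part of the code above.

```rust
// Trimmed-down versions of the structs from the exercise.
struct Image {
    url: String,
    alt_text: String,
}

struct Post {
    title: String,
    hero_image: Option<Image>, // optional Hero image
}

impl Post {
    // Hypothetical helper: describe the hero image, if there is one.
    fn hero_summary(&self) -> String {
        match &self.hero_image {
            Some(image) => format!("{} ({})", image.url, image.alt_text),
            // The compiler rejects the match if this arm is missing
            None => String::from("no hero image"),
        }
    }
}

fn main() {
    let with_image = Post {
        title: String::from("My first blog post"),
        hero_image: Some(Image {
            url: String::from("https://grugbrain.dev/grug.png"),
            alt_text: String::from("An example image"),
        }),
    };
    let without_image = Post {
        title: String::from("A post without a header"),
        hero_image: None,
    };

    for post in [&with_image, &without_image] {
        println!("{}: {}", post.title, post.hero_summary());
    }
}
```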

For more exercises, check out the examples on this website and try to model them in Rust.