CSV

Reading CSV files

CSV stands for "Comma Separated Values". Let's first focus on reading CSV files. Here's an example:

Working with CSV data in memory

Name, Age
John, 25
Jane, 30
Bob, 22

The first line is called the header; it describes what each comma-separated field contains. Here's how to read this data in Rust:

extern crate csv;
use csv::{ReaderBuilder, StringRecord};
use std::io::BufReader;

fn main() {

    let csv_content = "Name,Age
John,25
Jane,30
Bob,22";

    // Read the CSV content using BufReader
    // A BufReader is like a buffer or temporary storage for reading data. 
    // Instead of reading data piece by piece directly from a source (like a file), BufReader reads a larger chunk at once and then lets you access parts of it. 
    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(csv_content.as_bytes()));

    for result in rdr.records() {
        // If everything went right, the CSV line will end up as StringRecord, which is an entry where every value is a String
        let record: StringRecord = result.expect("error reading CSV record");
        println!("{:?}", record);
    }
}
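
Running this should print each data row as a StringRecord, roughly:

StringRecord(["John", "25"])
StringRecord(["Jane", "30"])
StringRecord(["Bob", "22"])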

In this example, we used the csv crate, a Rust package that provides functionality for reading and writing CSV data. You can read more about it in its documentation.
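
By the way, to run these snippets yourself you'll need the crate declared in your project's Cargo.toml. A minimal dependency entry looks like this (the version is illustrative; any recent 1.x release should work):

[dependencies]
csv = "1"

You can add this line by hand or let Cargo do it with cargo add csv, which we'll use later.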

While this works, it's not extremely useful yet, since we are printing the fields as they come instead of parsing each row into something structured, which would make more sense.

Wouldn't it be nice if we could access the fields like we would access the properties of any other "object"? We can do exactly that by leveraging Rust structs:

extern crate csv;
use csv::{ReaderBuilder, StringRecord};
use std::io::BufReader;

// A Person struct, representing a row of the CSV file as we would expect it.
struct Person {
    name: String,
    age: u8,
}

fn main() {
    let csv_content = "Name,Age
John,25
Jane,30
Bob,22";

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(csv_content.as_bytes()));

    for result in rdr.records() {
        let record: StringRecord = result.expect("error reading CSV record");

        // parse the row into a Person struct
        let person = Person { 
            name: record[0].to_string(),
            age: record[1].parse().expect("error parsing age") 
        };
        println!("Name: {}, Age: {}", person.name, person.age);

    }
}

This already looks a little better: we can access each row's fields properly. There is an alternative way to write this:

extern crate csv;
use csv::{ReaderBuilder, StringRecord};
use std::io::BufReader;

struct Person {
    name: String,
    age: u8,
}

fn main() {
    let csv_content = "Name,Age
John,25
Jane,30
Bob,22";

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(csv_content.as_bytes()));

    let persons: Vec<Person> = rdr.records()
        .map(|result| result.expect("error reading CSV record"))
        .map(|record| Person {
            name: record[0].to_string(),
            age: record[1].parse().expect("error parsing age")
        })
        .collect();

    for person in &persons {
        println!("Name: {}, Age: {}", person.name, person.age);
    }
}

The differences between the original and the alternative version are essentially these:

Original

  • It reads the CSV content line by line.
  • For each line, it creates a Person and immediately prints out the name and age.
  • It does this one by one, for each line in the CSV.

Alternative

  • It reads the entire CSV content and directly converts all the lines into a list (or Vec) of Person structs.
  • After creating this list, it then goes through the list and prints out the names and ages of all the people.
  • So, instead of processing each line one by one, it first creates a full list of people and then prints them.

In essence:

  • The Original version is like reading a book and immediately telling someone what each page says as you read it. It's good for large datasets where memory usage is a concern, for situations where immediate action or feedback is required for each record, or when you want more granular error handling per record.
  • The Alternative version is like reading the entire book first, making a list of all the important points, and then telling someone all the points at once. It's better for smaller datasets where memory usage isn't a primary concern, or when you want to perform multiple transformations or operations on the entire dataset.

Depending on the use case, you can use one or the other.
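
One more note on the error handling mentioned above: every version so far calls .expect(), which panics on the first problem. If you'd rather hand errors back to the caller, you can collect into a Result instead. Here's a minimal sketch (converting errors to String is just one choice; a dedicated error type would work just as well):

extern crate csv;
use csv::ReaderBuilder;
use std::io::BufReader;

struct Person {
    name: String,
    age: u8,
}

fn main() {
    let csv_content = "Name,Age
John,25
Jane,30
Bob,twenty-two";

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(csv_content.as_bytes()));

    // Collecting into Result<Vec<_>, _> stops at the first bad record
    // and hands the error to the caller instead of panicking.
    let persons: Result<Vec<Person>, String> = rdr
        .records()
        .map(|result| {
            let record = result.map_err(|e| e.to_string())?;
            Ok(Person {
                name: record[0].to_string(),
                age: record[1].parse::<u8>().map_err(|e| e.to_string())?,
            })
        })
        .collect();

    match persons {
        Ok(people) => println!("parsed {} people", people.len()),
        Err(e) => println!("could not parse the CSV: {}", e),
    }
}

With the intentionally broken "twenty-two" value, this prints the parse error instead of panicking.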

Opening CSVs using file paths

Now obviously, you won't be pasting the contents of your CSVs into your code to parse them; you'll generally refer to an external file via its path and hand it to your program. Let's start by getting a sample CSV file. I will get one from this website: CSV Files, particularly the file hw_25000, containing height and weight data for 25000 individuals.

Create a new folder and run cargo init inside of it. Then run cargo add csv and copy the hw_25000.csv file in there. This is what the folder should look like at this point:

├── Cargo.lock
├── Cargo.toml
├── hw_25000.csv
├── src
│   └── main.rs

Let's inspect the file and see what's in it:

"Index", "Height(Inches)", "Weight(Pounds)"
1, 65.78331, 112.9925
2, 71.51521, 136.4873
3, 69.39874, 153.0269
4, 68.2166, 142.3354
5, 67.78781, 144.2971
6, 68.69784, 123.3024
7, 69.80204, 141.4947
8, 70.01472, 136.4623

So far so good! Let's access it in our Rust code now, like we did before. In our src/main.rs file, add the following code:

extern crate csv;
use csv::{ReaderBuilder, StringRecord};
use std::fs::File;
use std::io::BufReader;

fn main() {
    let file = File::open("hw_25000.csv").expect("Could not open the CSV file");

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(file));

    for result in rdr.records() {
        let record: StringRecord = result.expect("error reading CSV record");
        println!("{:?}", record);
    }
}

Finally, run cargo run and you should see the contents of the file written to the console.
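
The first few lines should look roughly like this (note the leading whitespace inside the fields; we'll deal with it in a moment):

StringRecord(["1", " 65.78331", " 112.9925"])
StringRecord(["2", " 71.51521", " 136.4873"])
StringRecord(["3", " 69.39874", " 153.0269"])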

Now, let's try to make it more interesting by parsing the fields into the types we expect them to have. In our case, the Index would be an integer and the height and weight would be floats; we're going to parse all of that into a neat struct, as we did previously.

Give it a try based on the examples above and compare with my solution :)

Parsing a CSV Row into a struct

The updated code:

extern crate csv;
use csv::{ReaderBuilder, StringRecord};
use std::fs::File;
use std::io::BufReader;

fn main() {

    struct Entry {
        index: i32,
        height: f32,
        weight: f32,
    }

    let file = File::open("hw_25000.csv").expect("Could not open the CSV file");

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(file));

    for result in rdr.records() {
        let record: StringRecord = result.expect("error reading CSV record");
        let entry = Entry { 
            index: record[0].trim().parse().expect("Failed to parse index"),
            height: record[1].trim().parse().expect("Failed to parse height"),
            weight: record[2].trim().parse().expect("Failed to parse weight")
        };
        println!("Index: {}, Height: {}, Weight: {}", entry.index, entry.height, entry.weight);
    }
}

This is starting to shape up much better! You can also see that I snuck in the .trim() function. It's required here because the dataset has some leading/trailing whitespace that breaks the number parsing. Even without my pointing it out, running the code without .trim() would have greeted you with this error:

   Compiling data-with-rust-code v0.1.0 (/home/spongebob/tests/data-with-rust-code)
    Finished dev [unoptimized + debuginfo] target(s) in 0.20s
     Running `target/debug/data-with-rust-code`
thread 'main' panicked at 'Failed to parse height: ParseFloatError { kind: Invalid }', src/main.rs:24:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

There is another way to solve this, that I'll show you below.

Now let's take stock of what we have: a piece of code that reads a CSV file and tells us if there are any issues with the parsing. Pretty neat! Although this is good enough, let's not get ahead of ourselves, as there are a few improvements we can make. Notice those record[1] and record[0]? They could become very problematic if, for one reason or another, the order of the columns in our CSV file changes. This can happen at any time, and when it does, we'll get an error that isn't representative of the real problem. Can you guess which one?

Parsing by index & sanitizing

Let's add some order to things by accessing the columns using their names as defined in the header. First we need to inspect the header:

"Index", "Height(Inches)", "Weight(Pounds)"

Looks good to me, but what about Rust? How does Rust interpret this header?

Here's the sample code to inspect the header (we'll cover proper debugging much later):

extern crate csv;
use csv::{ReaderBuilder};
use std::fs::File;
use std::io::BufReader;

fn main() {

    let file = File::open("hw_25000.csv").expect("Could not open the CSV file");

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(file));

    let header = rdr.headers().expect("error parsing header");
    println!("{:?}", header)

}

Running this will give us:

StringRecord(["Index", " \"Height(Inches)\"", " \"Weight(Pounds)\""])

Do you see the problem? If we try to access the second column using "Height(Inches)" as the index, we'll get an error, because it is not the same as " \"Height(Inches)\"", and since the latter is what the code gives us as a header, we need to work with it. The lack of standardization of CSV files is a big problem, but Rust does help here by forcing you to think about it and keeping you from assuming things.

As you can see, Rust doesn't make any assumptions about your data; you're in full control. But with great control comes great responsibility, so let's address this first.

extern crate csv;
use csv::{ReaderBuilder, StringRecord};
use std::fs::File;
use std::io::BufReader;

fn main() {

    let file = File::open("hw_25000.csv").expect("Could not open the CSV file");

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_reader(BufReader::new(file));

    let header = rdr.headers().expect("error parsing header");
    // We trim to remove the whitespace
    let trimmed_header: StringRecord = header.iter().map(|field| field.trim()).collect();

    // Similarly we remove the \" characters
    let extra_cleaned: StringRecord = trimmed_header.iter().map(|field| field.replace('\"', "")).collect();

    println!("{:?}", extra_cleaned)

    // this gives us a header of:
    // StringRecord(["Index", "Height(Inches)", "Weight(Pounds)"])

}

It's also possible to add .quoting(true).trim(csv::Trim::All) to the ReaderBuilder::new(), but this won't fix our problem here: because of the leading space, the quotes are parsed as part of the field data, and trimming only removes the whitespace around them. There's no magic happening under the compiler's watch. Now let's make it neat and tidy by extracting the cleaning into a separate function, and then access our columns using their names:

extern crate csv;
use csv::{ReaderBuilder, StringRecord};
use std::fs::File;
use std::io::BufReader;

fn main() {

    struct Entry {
        index: i32,
        height: f32,
        weight: f32,
    }

    let file = File::open("hw_25000.csv").expect("Could not open the CSV file");

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        // by adding this here, we can skip the .trim() calls we were adding when parsing each Entry :)
        .trim(csv::Trim::All)
        .from_reader(BufReader::new(file));

    let header = rdr.headers().expect("error parsing header");

    let trimmed_header = clean_column(header);

    // Get the position of each column from the cleaned header
    let index_column_index = trimmed_header.iter().position(|field| field == "Index").expect("Index column not found");
    let height_column_index = trimmed_header.iter().position(|field| field == "Height(Inches)").expect("Height(Inches) column not found");
    let weight_column_index = trimmed_header.iter().position(|field| field == "Weight(Pounds)").expect("Weight(Pounds) column not found");

    for result in rdr.records() {
        let record: StringRecord = result.expect("error reading CSV record");
        let entry = Entry { 
            index: record[index_column_index].parse().expect("Failed to parse index"),
            height: record[height_column_index].parse().expect("Failed to parse height"),
            weight: record[weight_column_index].parse().expect("Failed to parse weight")
        };
        println!("Index: {}, Height: {}, Weight: {}", entry.index, entry.height, entry.weight);
    }

}

fn clean_column(record: &StringRecord) -> StringRecord {
    record.iter().map(|field| field.trim().replace('\"', "")).collect()
}

Exercise: You can create a find_column_index function to make the code above a bit neater.
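
If you want to compare notes on that exercise, here's a minimal sketch of what such a helper could look like (panicking on a missing column mirrors the .expect() calls above; returning an Option or Result would be just as valid):

use csv::StringRecord;

// Look up a column's position by its (cleaned) header name.
fn find_column_index(header: &StringRecord, name: &str) -> usize {
    header
        .iter()
        .position(|field| field == name)
        .unwrap_or_else(|| panic!("{} column not found", name))
}

The three lookups above then shrink to calls like find_column_index(&trimmed_header, "Height(Inches)").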

Adding types for the units

So far, the code covers reading and parsing a CSV file. We can add one last thing though; it might seem cosmetic, but it can make a huge difference later on. You see how we're storing the height and weight as plain f32? We're essentially losing the unit information down the line as the code gets more complex. Let's do something about it.

extern crate csv;
use csv::{ReaderBuilder, StringRecord};
use std::fs::File;
use std::io::BufReader;
use std::str::FromStr;

fn main() {

    #[derive(Debug)]
    struct Pounds(f32);

    #[derive(Debug)]
    struct Inches(f32);

    impl FromStr for Pounds {
        type Err = std::num::ParseFloatError;
    
        fn from_str(s: &str) -> Result<Self, Self::Err> {
            s.parse().map(Pounds)
        }
    }
    
    impl FromStr for Inches {
        type Err = std::num::ParseFloatError;
    
        fn from_str(s: &str) -> Result<Self, Self::Err> {
            s.parse().map(Inches)
        }
    }

    struct Entry {
        index: i32,
        height: Inches,
        weight: Pounds,
    }

    let file = File::open("hw_25000.csv").expect("Could not open the CSV file");

    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        // by adding this here, we can skip the .trim() calls we were adding when parsing each Entry :)
        .trim(csv::Trim::All)
        .from_reader(BufReader::new(file));

    let header = rdr.headers().expect("error parsing header");

    let trimmed_header = clean_column(header);

    // Get the position of each column from the cleaned header
    let index_column_index = trimmed_header.iter().position(|field| field == "Index").expect("Index column not found");
    let height_column_index = trimmed_header.iter().position(|field| field == "Height(Inches)").expect("Height column not found");
    let weight_column_index = trimmed_header.iter().position(|field| field == "Weight(Pounds)").expect("Weight column not found");

    for result in rdr.records() {
        let record: StringRecord = result.expect("error reading CSV record");
        let entry = Entry { 
            index: record[index_column_index].parse().expect("Failed to parse index"),
            height: record[height_column_index].parse().expect("Failed to parse height"),
            weight: record[weight_column_index].parse().expect("Failed to parse weight")
        };
        println!("Index: {}, Height: {:?}, Weight: {:?}", entry.index, entry.height, entry.weight);
    }

}

fn clean_column(record: &StringRecord) -> StringRecord {
    record.iter().map(|field| field.trim().replace('\"', "")).collect()
}

Exercise: Try the same now with enums (Kilograms, Centimeters, etc.). As a bonus, add a constraint on the height, to check that the entered height is between certain values.
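
For the bonus, here's a minimal sketch of one way to enforce such a constraint: a constructor that validates the value before wrapping it (the 20.0..=110.0 bounds are arbitrary, picked purely for illustration):

#[derive(Debug)]
struct Inches(f32);

impl Inches {
    // Reject heights outside a plausible range instead of storing them blindly.
    fn new(value: f32) -> Result<Self, String> {
        if (20.0..=110.0).contains(&value) {
            Ok(Inches(value))
        } else {
            Err(format!("height of {} inches is out of range", value))
        }
    }
}

fn main() {
    println!("{:?}", Inches::new(65.78));  // Ok(Inches(65.78))
    println!("{:?}", Inches::new(500.0));  // Err("height of 500 inches is out of range")
}

A constructor like this pairs nicely with FromStr: from_str can parse the float and then delegate to Inches::new, so invalid rows get caught at parse time.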

Words on CSV parsing

Just like with food in a city you've never been to, it's always wise to treat any file you're importing and ingesting as a liability and a potential risk. This is why we invested so much time at the beginning.

By the way, the code we just wrote needs 0 tests.

Writing CSV files

Now of course, reading CSV files is only part of the story. Many times you'll have to store some of the data you are processing in Rust and write it to a CSV file for another system or data pipeline to work with.

As in the previous chapter, make sure that you add (or have already added) the csv crate to your project with cargo add csv.

Here's a first way to write to a CSV file:

extern crate csv;
use std::fs::File;

fn main() {

    let operation = write_csv();

    match operation {
        Ok(()) => println!("CSV written successfully."),
        Err(e) => println!("Error: {}", e),
    }
}

fn write_csv() -> Result<(), String> {
    // Create a new CSV writer.
    let file = File::create("output.csv").expect("Couldn't create output.csv");
    let mut writer = csv::Writer::from_writer(file);

    // Write some records.
    writer.write_record(&["Name", "Person Age", "Country"]).expect("Error writing header"); 
    writer.write_record(&["Alice", "30", "Canada"]).expect("Error writing record");
    writer.write_record(&["Bob", "35", "USA"]).expect("Error writing record");

    // Flush the writer to ensure everything gets written. (In Python, you wouldn't need this inside a "with open('..') as f:" block.)
    writer.flush().expect("Error writing");

    Ok(())
}
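
Running the above code generates a properly formatted output.csv, which should contain:

Name,Person Age,Country
Alice,30,Canada
Bob,35,USA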

Notice how we had to call writer.flush() to push the changes to the file. This isn't necessary if we use the following syntax:

extern crate csv;
use std::fs::File;

fn main() {

    let operation = write_csv();

    match operation {
        Ok(()) => println!("CSV written successfully."),
        Err(e) => println!("Error: {}", e),
    }
}

fn write_csv() -> Result<(), String> {
    // Create a new CSV writer.
    let file = File::create("output.csv").expect("Couldn't create output.csv");

    {
        let mut writer = csv::Writer::from_writer(file);

        // Write some records.
        writer.write_record(&["Name", "Person Age", "Country"]).expect("Error writing header"); 
        writer.write_record(&["Alice", "30", "Canada"]).expect("Error writing record");
        writer.write_record(&["Bob", "35", "USA"]).expect("Error writing record");
    }
    // Here, the csv::Writer goes out of scope and automatically flushes, because it implements a trait called 'Drop'.

    Ok(())
}

In this case, .flush() is called automatically when the writer goes out of scope, ensuring any buffered data is written to the file. This roughly replicates Python's with behavior. One caveat: a flush that happens during drop can't report errors, so when write failures matter, calling .flush() explicitly (as in the first version) is the safer choice.

Writing a Vec to CSV in Rust

The above works fine, but most of the time you'll find yourself wanting to write an existing data structure, say a list of tuples, as rows in the CSV file.

extern crate csv;
use std::fs::File;

fn main() {

    let operation = write_csv();

    match operation {
        Ok(()) => println!("CSV written successfully."),
        Err(e) => println!("Error: {}", e),
    }
}

fn write_csv() -> Result<(), String> {
    // Create a new CSV writer.
    let file = File::create("users.csv").expect("Couldn't create users.csv");

    let users: Vec<(&str, &str)> = vec![("Alice", "30"), ("Bob", "35")];

    {
        let mut writer = csv::Writer::from_writer(file);

        // Write the header.
        writer.write_record(&["Name", "Person Age"]).expect("Error writing header"); 

        // then write the records by looping over the vec
        for user in &users {
            let (name, age) = user;
            writer.write_record(&[name, age]).expect("Error writing record.");
        }
    }

    Ok(())
}

Now here's a fun thing: try to add a tuple to the vec that has a different length than the other records, say ("Peter", "41", "43"), and read the error.

Similarly, run this in Python and think about what happens:

import csv

with open('test.csv', 'w') as f:
    # create the csv writer
    writer = csv.writer(f)

    rows = [
        ('Name', 'Age'),
        ('Patrick', '32', '564'),
        ('Spongebob', '124', '23'),
    ]

    for row in rows:
        # write a row to the csv file
        writer.writerow(row)

While the Python code will gladly run without skipping a beat or throwing an error, the Rust code won't even compile: the vec is typed Vec<(&str, &str)>, so a three-element tuple is rejected before the program ever runs. This alone will save you from so much trouble, believe me.

We're not done yet; another important thing to know is how to write HashMaps (Python's dicts) out to CSV. Let's do that next.

Writing a dictionary to CSV in Rust

The code here won't be much different, but the approach is a bit more common.

extern crate csv;
use std::{fs::File, collections::HashMap};

fn main() {

    let operation = write_csv();

    match operation {
        Ok(()) => println!("CSV written successfully."),
        Err(e) => println!("Error: {}", e),
    }
}

fn write_csv() -> Result<(), String> {

    let file = File::create("countries.csv").expect("Couldn't create countries.csv");

    let mut countries = HashMap::new();
    countries.insert("Canada", "Ottawa");
    countries.insert("USA", "Washington D.C.");

    {
        let mut writer = csv::Writer::from_writer(file);
        writer.write_record(&["Country", "Capital"]).expect("Error writing header"); 

        for (country, capital) in &countries {
            writer.write_record(&[country, capital]).expect("Error writing record.");
        }
    }

    Ok(())
}

You'd similarly write structs to CSV files, and this is where Rust shines. In the example above, one always needs to keep in mind the order of the columns to make sure everything is written to the right place, which isn't always easy. Another subtlety: a HashMap doesn't guarantee any iteration order, so the rows may come out in a different order between runs.
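
If you also need deterministic row order, one option is to swap the HashMap for a BTreeMap, which iterates in sorted key order. A minimal sketch of just that change (same file name and columns as above):

extern crate csv;
use std::{collections::BTreeMap, fs::File};

fn main() {
    let file = File::create("countries.csv").expect("Couldn't create countries.csv");

    // A BTreeMap keeps its keys sorted, so iteration order is stable
    // between runs, unlike a HashMap.
    let mut countries = BTreeMap::new();
    countries.insert("Canada", "Ottawa");
    countries.insert("USA", "Washington D.C.");

    {
        let mut writer = csv::Writer::from_writer(file);
        writer.write_record(&["Country", "Capital"]).expect("Error writing header");

        for (country, capital) in &countries {
            writer.write_record(&[country, capital]).expect("Error writing record");
        }
    }
}

With the row order settled, let's see how structs can help with the column-order problem.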

Writing structs to CSV in Rust

Here's a simple example:

use csv::Writer;
use std::fs::File;

struct Country {
    name: String,
    capital: String,
}

impl Country {
    // Provide a method to convert the struct into an array of string slices that we'll hand over to the CSV writer
    fn to_csv(&self) -> [&str; 2] {
        [&self.name, &self.capital]
    }
}

fn main() {
    let file = File::create("countries.csv").expect("Couldn't create countries.csv");
    

    let canada = Country {
        name: "Canada".to_string(),
        capital: "Ottawa".to_string(),
    };
    let usa = Country {
        name: "USA".to_string(),
        capital: "Washington D.C.".to_string(),
    };

    {
        let mut writer = Writer::from_writer(file);

        writer.write_record(&["Name", "Capital"]).expect("Error writing header"); 
        writer.write_record(&canada.to_csv()).expect("Error writing header"); 
        writer.write_record(&usa.to_csv()).expect("Error writing header"); 
    }
}

Notice how, at the time we're writing, we don't need to worry about the ordering at all: it's defined once, inside to_csv. Later on in the course, we'll see how we can make all of this a lot simpler and keep our data impeccable.

The fact that CSV doesn't really define a "standard" doesn't mean we cannot have great CSV files. It also doesn't mean we need to write a huge battery of tests. Rust is enough.

I hope this was helpful so far. With this, you should be ready to fearlessly start working with CSV data, one file at a time, until you get more comfortable with Rust. Then we can move on to libraries that will help a lot with all of the above. But for now, let's move on to JSON/JSONL, two extremely popular data formats.