from dharpa.benchmarking.data import clear_system_cache, MemoryRecorder, get_example_file
from rich.jupyter import print
This document is a primer on data formats and structures, and how and why those affect our project. I have no idea how much of this is common knowledge, and how much will be news to most of you. I have noticed a few common misconceptions and assumptions about some of the topics in here, so I figured it makes sense to try to get everyone on the same page. I've tried to keep this simple and short, so there are some things in here that are over-simplified, bordering on incorrect.
My educated guess is that in our project we will mostly be concerned about structured, tabular data, which is why I'll be focussing on that. I might add a companion document about 'binary-blob' data later on.
Data serialization and storage¶
- data lives in memory or on disk
- lots of 0's and 1's -- binary format
- only 'decoding' gets you a useful representation
- 'text' is just an often used encoding format
The natural habitat of (digital) data is computer memory or disk. Data is always stored in binary form, and there is always some sort of decoding involved to make data usable in one way or another (with the exception of booleans, maybe). Even when we talk about text files (seemingly implying that those are not binary since they are 'text'), we are dealing with binary data. It's just such a standard data encoding format that tools to decode that sort of data are available everywhere. Decoding text is by no means trivial, but luckily our tools have evolved so much by now -- and we have standards like utf-8 commonly available -- that we as users hardly ever come across decoding issues anymore. At least not to the extent we used to. It still happens, and I would imagine quite a bit more in the Digital Humanities than in your everyday 'business' use-case. So it helps to be aware of at least the basics involved in text encoding standards, and I would recommend anyone doing any sort of programming to read up on it.
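To make the decoding step a bit more concrete, here is a minimal sketch in plain Python (the example string is made up): the same bytes on disk only turn into readable text if we pick the right codec.

# 'text' on disk is really just bytes; a codec turns it back into characters
raw = "Zürich".encode("utf-8")   # text -> bytes (this is what actually sits on disk)
print(raw)                        # b'Z\xc3\xbcrich'
print(raw.decode("utf-8"))        # correct codec: 'Zürich'
print(raw.decode("latin-1"))      # wrong codec: 'ZÃ¼rich' -- the classic mangled umlaut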
Tabular data (serialization) formats¶
- serialization/deserialization
- binary or not, here I come:
- avro, protobuf, pickle
- csv, json, yaml, xml
- 'structured formats' with schema, or without:
- avro, protobuf, thrift, flatbuffers, xml
- csv, json, yaml, messagepack, pickle
- zero-copy, memory-mapping?
The (arguably) most important type of data we'll be dealing with is structured, tabular data. So it pays to think about how its digital form is represented in memory, and what issues are involved when trying to store, manipulate and transfer it.
In our context, tabular data is always a two-dimensional matrix (rows, columns), where each column has the same number of items, and each column contains items of the same data type (or union of data types). Tabular data can have a column that can be used as an index, but that is optional. Each table can be described with a schema, which is basically a list of column names, and the data types associated with each of those columns.
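To illustrate what such a schema amounts to, here is a minimal sketch in plain Python; the column names and types are invented for the example, not taken from any of our actual datasets.

# a tiny table, stored column-wise, plus a hand-written schema describing it
table = {
    "title": ["Moby Dick", "Dubliners", "Ulysses"],
    "year": [1851, 1914, 1922],
    "in_public_domain": [True, True, True],
}
schema = {
    "title": "string",
    "year": "int64",
    "in_public_domain": "bool",
}
# every column must have the same number of items
assert len({len(column) for column in table.values()}) == 1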
Serialization / deserialization¶
Every programming language represents its basic data types differently in memory. That's especially true for the arguably most common data type: the string. Also, that in-memory representation is (almost) always different from the format of the (functionally) same data when exported into a file on disk.
This means that, if we want to export our data in our code, we need to do a step that is called 'serializing' (or 'marshalling'): we convert the data into a commonly accepted representation of a commonly accepted set of data-types. This serialization is usually expensive, computationally speaking. We want to avoid it, if at all possible, or at least always postpone it until the last possible moment, when we are sure the data won't change anymore, so we only have to do it once. The same goes for de-serializing data, just in the other direction: we only want to do it once, then keep it in memory in our native representation (if the size of the data allows it), so we don't have to read it again. Even if the content of a file is in the OS (page) cache (which would mean we don't actually have to read the file-content from disk) we'd still have to spend the cpu-cycles for de-serialization. So, big no-no, bad data-scientist!
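As a rough illustration of why we want to serialize as rarely (and as late) as possible, here is a small sketch using json and the standard-library timeit module; the dataset and numbers are made up, so treat it as the shape of an experiment rather than a benchmark.

import json
import timeit

# a made-up dataset of 100,000 small records
records = [{"id": i, "value": i * 0.5, "label": f"item-{i}"} for i in range(100_000)]

# serializing on every access pays the full json cost again and again
serialize_every_time = timeit.timeit(lambda: json.dumps(records), number=5)

# serializing once and reusing the resulting string is (almost) free afterwards
serialized = json.dumps(records)
reuse_serialized = timeit.timeit(lambda: len(serialized), number=5)

print(f"serialize every time: {serialize_every_time:.3f}s")
print(f"serialize once, then reuse: {reuse_serialized:.6f}s")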
Format types¶
For serialization, we have two basic options: text, and binary (let's just not nitpick and assume that this distinction makes sense).
Text-based formats¶
Serializing into a text-based format usually means taking all the elements our tabular data consists of, one by one, serializing each element into its textual representation (for example "hello world" for a string, 5 for an integer, true for a boolean in json), and then assembling one big 'meta'-string out of all those sub-elements. For csv, that might include a header-row, and adding delimiters like ',' in between the elements. For json it would be adding list ('[', ']') or dictionary ('{', '}') indicators, as well as delimiters and other elements as outlined in the JSON specification.
I haven't done any research on it, but I'd imagine csv would be one of the oldest widely-used data storage formats. Csv is a text-based tabular format, and it allows you to specify an optional header to describe column names. It allows for different delimiters between row cells (whole rows are delimited by the end-of-line special character). Other commonly used text-based formats are json, yaml, toml, xml. Those are not strictly tabular data formats, they can also contain just scalars, dictionaries, or lists (tabular data is always a list of dictionaries of the same shape).
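To make the difference between those text formats a bit more tangible, here is the same tiny, made-up table serialized as csv and as json -- a minimal sketch, not how we'd actually store project data.

import csv
import io
import json

rows = [
    {"title": "Moby Dick", "year": 1851},
    {"title": "Dubliners", "year": 1914},
]

# csv: an (optional) header row, then one delimited line per record
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "year"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())

# json: the same data as a list of dictionaries
print(json.dumps(rows, indent=2))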
Binary formats¶
Serializing into a binary format is usually very specific to the format itself, so there are not really any common tools to read more than one of them (unlike text-based formats, where you could use any text editor and at least display the content in a meaningful way), and different formats have different priorities (like small size of the resulting blob, quick random read/write access, suitability for streaming, etc.). Binary formats often have compression built-in, whereas text formats never do (but can usually be compressed well enough by external tools due to certain characteristics of encoded strings). Also, binary formats are usually a lot easier on the cpu for serialization/deserialization purposes, since it's easier to optimize them for that scenario.
Binary formats have existed for a long time, but in recent years they are being used more widely again. Decentralized software architecture (microservices) as well as 'big(-ish) data' played a huge part in that. Because, as it turns out, while serializing a few items of data per second into json and back is not that much of a problem, doing the same thing for millions of large (or even small) chunks of data actually is. In some cases that serialization step can take more time than the actual computation that was done on the data. To counter that issue, people came up with formats like 'Avro', 'Thrift', 'ProtoBuf'. Python's 'pickle' can also be considered a binary serialization format.
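The Avro/Thrift/Protobuf family needs extra libraries and schema definitions, so as a stand-in, here is a small sketch comparing json (text) with Python's built-in pickle (binary) for the same made-up data; the point is just that the two serializations produce very different byte blobs, not that either of them is the format we should use.

import json
import pickle

data = [{"id": i, "value": i * 0.5} for i in range(10_000)]

as_text = json.dumps(data).encode("utf-8")   # text-based serialization
as_binary = pickle.dumps(data)               # Python's built-in binary serialization

print(f"json:   {len(as_text)} bytes")
print(f"pickle: {len(as_binary)} bytes")
print(as_binary[:20])                        # not human-readable without the right decoder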
Schema¶
Another useful way to separate data formats is to check whether they include a (native) schema that describes the value types of the data they hold, or not. If schema information is present, it can either be included in the resulting file, or be stored elsewhere.
Having schema information for a dataset is highly desirable, because it tells us exactly what type of data we are dealing with (is it a string, integer, float? what precision?). While most text-based data formats don't come with a native schema definition format, there are sometimes external efforts to remedy that (JSON-schema, for example). None of the text-based formats I can think of off the top of my head include the schema in the resulting file. This is important, because the complexity of tools that handle data increases if they need to worry about secondary, 'side-car' files for incoming data.
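As an example of such an external effort, a JSON-schema document describing one record of a (made-up) table could look roughly like the following; note that it lives next to the data, not inside it.

# a hand-written, illustrative JSON-schema for one record of a table;
# it would typically be stored in a separate 'side-car' file
record_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
        "in_public_domain": {"type": "boolean"},
    },
    "required": ["title", "year"],
}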
Slight detour: csv¶
Csv is a bit special in that it can contain a 'header' row in the first line, which can be used to determine column names. Since this row is optional, it is not always present, which of course complicates the import process. Because csv files are so common in data science, most of the tools we use include some csv-import method that more or less smartly determines the (text-)encoding of the file, whether it has a header row, as well as the schema of the included data. This method usually converts the data into the appropriate internal representations of the column types once it is reasonably sure the inferred schema is correct-ish. Without a definite, (externally provided) schema it is not possible to guess this reliably in every case, so a developer should always assert that the correct types are present after such an import.
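A minimal sketch of that kind of defensive csv import with Pandas; the file name, column names and types are invented for the example.

import pandas as pd

# pass an explicit schema instead of relying purely on type inference...
df = pd.read_csv(
    "dataset.csv",                              # hypothetical file
    dtype={"title": "string", "year": "int64"},
)
# ...and assert afterwards that the columns really have the expected types
assert str(df["year"].dtype) == "int64"
assert str(df["title"].dtype) == "string"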
Streaming, zero-copy, memory-mapping¶
One thing that everyone working semi-seriously in data science and handling big-ish data should be aware of is that on most operating systems you can read (or 'load') data in more ways than one. The 'one' way is usually something like:
file = open('dataset.csv')
lines = file.read() # or file.readlines()
When using Pandas, it'll probably take the form of:
import pandas as pd
pd.read_csv('dataset.csv')
Both of those read the whole file into memory, which is fine if the dataset is small or there is a need for it to be in memory in full. Depending on the situation, it might be wasteful, though. For example when calculating the mean for a column of integers: in that case it's a better strategy to read one line of the file, process the column we are interested in, eject the line from memory, then read the next line, only keeping the current total and the number of items we processed so far. That way we'll never allocate more memory than what is needed for a single line. We can even process datasets that are larger than the available memory of our workstation.
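A rough sketch of that line-by-line strategy for a running mean, assuming a csv file with a header row and the integer column of interest at a known position (both assumptions are made up for the example):

def mean_of_column(path, column_index=2):
    total = 0
    count = 0
    with open(path) as f:
        next(f)  # skip the (assumed) header row
        for line in f:  # only one line is held in memory at a time
            cells = line.rstrip("\n").split(",")
            total = total + int(cells[column_index])
            count = count + 1
    return total / count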
As a matter of fact, we could do even better if we knew the offset and length of the column we are interested in: in that case, we would only have to read the bytes that hold the integer value we need, and could ignore the other cells of a row completely. Again, this might or might not be an issue depending on the size of the data in a row, but if we have a dataset with a lot of large columns, the I/O operations we save by only reading the exact data we need could improve the speed of processing considerably. Doing that doesn't really work for csv files, for example, since there is no good way for us to know the exact offset and length of the column we are interested in. There are data formats that support that kind of operation, though.
Along with those fairly simple strategies to deal with data efficiently, there are more advanced ones that also concern how data is handled in a system's memory as well as on disk. For those of you who are interested, I would recommend looking up the terms 'memory-mapping' and 'zero-copy'.
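For a first impression of what memory-mapping looks like in practice, here is a minimal sketch using the standard-library mmap module; the file name is made up, and real zero-copy pipelines involve quite a bit more than this.

import mmap

with open("dataset.bin", "rb") as f:                               # hypothetical binary file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # slicing only touches the pages the OS actually has to load,
        # instead of copying the whole file into our process up front
        first_bytes = mm[:64]
        print(first_bytes)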
Some random benchmarks, to illustrate¶
clear_system_cache()
file_path = get_example_file()
def count_lines(path):
    f = open(path)
    counter = 0
    length = 0  # ignores '\n' characters
    lines = f.readlines()  # materializes every line of the file in memory
    for line in lines:
        counter = counter + 1
        length = length + len(line)
    f.close()
    return {"no_lines": counter, "size": length}
profile_count_lines = MemoryRecorder.profile_func("Reading the whole file", "This iterates through all lines in memory, keeping all of them in memory at the same time.", False, count_lines, file_path)
print(profile_count_lines.report)
clear_system_cache()
file_obj = open(file_path, buffering=True)
def count_lines(f):
    counter = 0
    length = 0  # ignores '\n' characters
    for line in f:  # iterating over the file object yields one line at a time, instead of materializing a list
        counter = counter + 1
        length = length + len(line)
    return {"no_lines": counter, "size": length}
profile_count_lines = MemoryRecorder.profile_func("Reading the file line by line", "This allocates only very little memory, since once a line is read and processed, it will be disregarded.", False, count_lines, file_obj)
file_obj.close()
print(profile_count_lines.report)
Structured (binary) data layout strategies¶
- Row-based:
- (most commonly used dbs): sqlite, Postgres, MySQL, ...
- Avro
- Column-based
- OLAP dbs: duckdb, ClickHouse, BigQuery, Snowflake ...
- pandas dataframe
- parquet, feather
In order to use the more advanced operations on data I described earlier, the data formats we use need to support them. None of the simple formats like csv, json, yaml do. There is one category of applications that had to deal with those things for decades though: databases. So I think it pays to look a bit at how they handle storing data, and what kind of trade-offs they are making. Basically, a database is a system that lets you persist (mostly) structured data on disk, and gives you an easy, memory- and processing-efficient way to query and retrieve it back. To do that, they have different ways to persist data, add indexes, cache 'hot' data, and so on. As it turns out, there are 2 main ways data can be stored on disk for efficient retrieval: row-based, and column-based (I'm ignoring document/'nosql' databases here, since -- for almost all practical use-cases -- they are inferior to relational ones).
Row-oriented databases¶
The most common database type is 'row-oriented'. This means that data is stored in a way so that each row occupies a contiguous block of disk (or memory). Data is quick and easy to read (if you are interested in a subset of rows) and it is very easy and fast to add new rows/records. This fits the most common requirements businesses have for a database, since new items are added constantly, which is why most databases we encounter in the wild are row-based. Examples of such databases are: Postgres, sqlite, MySQL.
Column-oriented databases¶
Column-oriented databases have existed for a long time, but they are not as prevalent as their row-based cousins, and are often ignored by developers who haven't been exposed to them and their advantages. Instead of storing data row-by-row, they store data column-by-column. This means that column-cells are laid out next to each other on disk, and different columns occupy different regions of the storage (not necessarily close to each other at all). The querying logic is quite different for this type of database. The main advantage is that certain types of analytical queries are really fast (speedups of 10x or even 100x are quite possible), and it is very easy to request whole columns from such a database without it having to touch any irrelevant parts of the data. Compressing data is also easier with column-oriented databases, so they usually occupy less disk space than their row-based peers. The disadvantage of those databases is that it's harder and slower to add new data, so they are more common in situations where one either has a fixed dataset, or updates are rare and come in big(ger) batches. Also, certain types of queries are less suited to that layout, which makes it always a good idea to think about what you need out of your data before deciding on a database/database type.
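The difference in layout is easy to mimic with plain Python structures -- a rough mental model of the trade-off, not how databases actually implement it.

# row-oriented: one record after another; appending a record is cheap and natural,
# but aggregating a single column has to touch every record
rows = [
    {"title": "Moby Dick", "year": 1851},
    {"title": "Dubliners", "year": 1914},
]
rows.append({"title": "Ulysses", "year": 1922})

# column-oriented: each column is one contiguous sequence; aggregating one column
# only touches that column, but appending a record means touching every column
columns = {
    "title": ["Moby Dick", "Dubliners", "Ulysses"],
    "year": [1851, 1914, 1922],
}
average_year = sum(columns["year"]) / len(columns["year"])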
Row-based/column-based in data science¶
How is this relevant? Well, because in data science we are dealing mostly with fixed datasets, and the queries we run on them are mostly analytical in a way that fits column-oriented data layouts, although exceptions from that rule are not uncommon. So it makes sense to depart from the 'common wisdom' of using a row-based approach. In fact, Numpy arrays and Pandas dataframes (which are built on them) are kept in memory using the column-based approach. This is important to know, because it helps us use and query data correctly in our code. For example, it's not a good idea, and very slow, to add 'rows' to a Pandas dataframe. Instead, we should initialize the dataframe with the whole dataset once at the beginning, and then only add columns to it (which is very fast), but no new rows if at all possible.
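A small illustration of that advice (the column names and values are invented): create the dataframe from the complete dataset once, then derive new columns instead of growing it row by row.

import pandas as pd

# initialize once, from the full dataset...
df = pd.DataFrame({
    "year": [1851, 1914, 1922],
    "pages": [635, 152, 730],
})

# ...then add derived columns, which is cheap and vectorized
df["century"] = df["year"] // 100 + 1

# growing the frame row by row (e.g. repeated pd.concat in a loop) copies
# the whole dataframe on every append, and should be avoided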
One issue with Numpy/Pandas is that commonly a dataset is loaded into memory as a whole. There are ways around that (for example by processing a csv file in batches), but very often those are not used. In practice, this is probably not that big an issue in the DH field, since datasets seem to be smaller on average. But it is still a good idea to be as efficient as possible in this regard, especially for our purpose, since we won't have any knowledge or guarantees in advance about the data we'll be handling (which might very well be bigger than the available memory). Also, since we are building an interactive application, it makes a difference whether a query comes back within a second, or ten.
More random benchmarks, this time with Pandas¶
clear_system_cache()
import pandas as pd
def load_csv(path):
    counter = 0
    df = pd.read_csv(path)
    for _ in df["year_month"]:
        counter = counter + 1
    return counter
profile_read_csv = MemoryRecorder.profile_func("Reading a csv file with Pandas as a whole.", "This is ok, as long as the dataset is not too big.", False, load_csv, file_path)
print(profile_read_csv.report)
clear_system_cache()
import pandas as pd
def load_csv(path):
    counter = 0
    df = pd.read_csv(path)
    for index, row in df.iterrows():
        counter = counter + 1
    return counter
profile_read_csv = MemoryRecorder.profile_func("Reading a csv file with Pandas, iterating over rows", "As one can see, this is very very slow, and not a good idea at all to do in Pandas.", False, load_csv, file_path)
print(profile_read_csv.report)
def load_csv_in_chunks(path):
    counter = 0
    chunksize = 1000
    with pd.read_csv(path, chunksize=chunksize) as reader:
        for chunk_df in reader:
            for _ in chunk_df["year_month"]:
                counter = counter + 1
    return counter
profile_read_csv = MemoryRecorder.profile_func("Reading a csv file with Pandas, in chunks.", "This is a good approach when dealing with a dataset that is large, and we don't need it except for a single operation on a single column. We can optimize the execution-time/memory-usage by adjusting the 'chunksize' value.", False, load_csv_in_chunks, file_path)
print(profile_read_csv.report)