
Backblaze hard drive stats and time series databases

Backblaze publishes quarterly hard drive statistics. The data files are in CSV format and contain one row per day per drive. Each row records the drive's model and capacity, along with a large array of SMART attributes. The SMART data is relatively sparse, since most drive models report only a handful of the attributes. A couple of potential uses for this dataset come to mind.
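
To get a feel for the layout and the sparsity, here's a minimal sketch that counts populated SMART columns in the first few rows. It assumes the published column layout (date, serial_number, model, capacity_bytes, failure, and paired smart_N_normalized / smart_N_raw columns); the file name is hypothetical.

    import csv
    import itertools

    # Peek at the first few rows and count how many of the smart_* columns
    # actually contain a value, to gauge how sparse the SMART data is.
    with open("2023-01-01.csv", newline="") as f:
        reader = csv.DictReader(f)
        for row in itertools.islice(reader, 5):
            smart_cols = [k for k in row if k.startswith("smart_")]
            populated = [k for k in smart_cols if row[k] != ""]
            print(row["model"], f"{len(populated)}/{len(smart_cols)} SMART fields set")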

My goal today, however, was to try out some time series databases. I began importing the data into InfluxDB using a simple script built on influxdb-python, but stopped after I realized it was going to take about 10 hours to run (a 5k records/sec ingestion rate). Running the import script without actually connecting to InfluxDB was projected at about 2 hours, so it's entirely plausible that the majority of the overhead was on the Python side of things, between the CSV parsing and the client library serializing points and issuing HTTP requests. It would be interesting to rewrite the import script in a lower-level language like Rust or Go and see how it compares. I decided not to play around with InfluxDB more today, because I already use it fairly often at work.
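
A minimal sketch of this kind of script, using influxdb-python's InfluxDBClient and write_points; the database name, measurement name, and choice of fields are illustrative, not the exact script I ran.

    import csv
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="backblaze")

    def points_from_csv(path):
        # One InfluxDB point per CSV row; tags index the series, fields
        # hold the values. Only a few columns shown here for brevity.
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield {
                    "measurement": "drive_stats",
                    "time": row["date"],  # YYYY-MM-DD, parsed by the client
                    "tags": {"model": row["model"],
                             "serial_number": row["serial_number"]},
                    "fields": {"capacity_bytes": int(row["capacity_bytes"]),
                               "failure": int(row["failure"])},
                }

    # Write in batches of 5000 to amortize the HTTP round trips.
    batch = []
    for point in points_from_csv("2023-01-01.csv"):
        batch.append(point)
        if len(batch) == 5000:
            client.write_points(batch)
            batch.clear()
    if batch:
        client.write_points(batch)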

Next, I tried out TimescaleDB, a Postgres extension for time series data. I've had good experiences with Postgres in the past, so it seemed promising. To import the Backblaze CSV data, I used the timescaledb-parallel-copy program, which chunks the CSV data and runs several COPY commands in parallel. It took about 45 minutes to import the full dataset (70k records/sec ingestion rate).
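
Concretely, the setup looks something like this, assuming a drive_stats hypertable partitioned on the date column (table, database, and connection details here are hypothetical, and the SMART columns are elided):

    -- Plain Postgres table matching the CSV layout, converted into a
    -- TimescaleDB hypertable partitioned on the date column.
    CREATE TABLE drive_stats (
        date           date NOT NULL,
        serial_number  text,
        model          text,
        capacity_bytes bigint,
        failure        int
        -- plus the smart_N_normalized / smart_N_raw columns
    );
    SELECT create_hypertable('drive_stats', 'date');

The parallel copy itself is then a single command; the flag names below are from recent versions of timescaledb-parallel-copy:

    timescaledb-parallel-copy \
        --connection "host=localhost user=postgres sslmode=disable" \
        --db-name backblaze --table drive_stats \
        --file 2023-01-01.csv --skip-header --workers 8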

TimescaleDB has a neat compression feature that takes advantage of the fact that successive values in a time series tend to be fairly predictable, which makes them amenable to columnar encodings like delta-of-delta and Gorilla. They claim to achieve 91-96% compression on real-world datasets. When I tried this out on the Backblaze dataset, the Postgres data directory ended up 53% larger. A VACUUM FULL did not reclaim the space. I'm likely doing something wrong. More experimentation needed!
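
For anyone who wants to reproduce this, here's a sketch of the compression workflow, assuming TimescaleDB 2.x and the hypothetical drive_stats hypertable from above; segmenting by model is just a guess at a reasonable segmentby column.

    -- Enable compression on the hypertable, grouping compressed rows by
    -- drive model so each compressed batch holds similar data.
    ALTER TABLE drive_stats SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'model'
    );

    -- Compress every existing chunk.
    SELECT compress_chunk(c, if_not_compressed => true)
    FROM show_chunks('drive_stats') AS c;

    -- Compare the on-disk size before and after compression.
    SELECT pg_size_pretty(before_compression_total_bytes) AS before,
           pg_size_pretty(after_compression_total_bytes)  AS after
    FROM hypertable_compression_stats('drive_stats');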