Backblaze hard drive stats and time series databases
Backblaze publishes quarterly hard drive statistics. The data files are in CSV format and contain one row per day per drive. Each row records the drive's model and capacity, along with a large array of SMART attributes. The SMART data is relatively sparse, since most drive models report only a few SMART attributes. A couple of potential uses for this dataset:
- Calculate the reliability of different drives. The failure rate can then be taken into account when calculating the lifetime cost of a storage array (see the sketch after this list).
- Use SMART data to predict future drive failures.
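For the failure-rate calculation, a quick pandas sketch over the daily CSVs would look something like the following. The `model` and `failure` columns are part of the published files; the directory path is a placeholder, and this ignores subtleties like models with very few observed drive-days.

```python
import glob

import pandas as pd

# One row per drive per day; `failure` is 1 on the day a drive dies.
# The directory name is a placeholder for wherever the CSVs were unpacked.
frames = [
    pd.read_csv(path, usecols=["model", "failure"])
    for path in glob.glob("data_Q1_2024/*.csv")
]
daily = pd.concat(frames, ignore_index=True)

stats = daily.groupby("model").agg(
    drive_days=("failure", "size"),  # rows observed = drive-days in service
    failures=("failure", "sum"),
)
# Annualized failure rate: failures per drive-year, expressed as a percentage.
stats["afr_percent"] = stats["failures"] / stats["drive_days"] * 365 * 100

print(stats.sort_values("afr_percent", ascending=False).head(10))
```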
My goal today, however, was to try out some time series databases. I began importing the data into InfluxDB using a simple script built on influxdb-python, but stopped after I realized it was going to take about 10 hours to run (5k records/sec ingestion rate). Running the import script without actually connecting to InfluxDB was projected to take about 2 hours, so it's entirely plausible that the majority of the overhead was on the Python side of things. It would be interesting to rewrite the import script in a lower-level language like Rust or Go and see how it compares. I decided not to play around with InfluxDB more today, because I already use it fairly often at work.
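A stripped-down sketch of this style of import with influxdb-python looks something like the code below; the database name, measurement name, and choice of fields are illustrative rather than exactly what my script did.

```python
import csv
import glob

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="backblaze")

for path in glob.glob("data_Q1_2024/*.csv"):
    points = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # One InfluxDB point per drive per day; the CSV's date string is
            # used directly as the point timestamp.
            points.append({
                "measurement": "drive_stats",
                "time": row["date"],
                "tags": {
                    "serial_number": row["serial_number"],
                    "model": row["model"],
                },
                "fields": {
                    "capacity_bytes": int(row["capacity_bytes"]),
                    "failure": int(row["failure"]),
                },
            })
    # write_points splits the list into batched HTTP writes.
    client.write_points(points, batch_size=5000)
```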
Next, I tried out TimescaleDB, a Postgres extension for time series data. I've had good experiences with Postgres in the past, so it seemed promising. To import the Backblaze CSV data, I used the timescaledb-parallel-copy program, which chunks the CSV data and runs several COPY commands in parallel. It took about 45 minutes to import the full dataset (70k records/sec ingestion rate).
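Setting that up looks roughly like this: create the table, promote it to a hypertable, then run timescaledb-parallel-copy over each CSV file. The schema below is abbreviated (the real table also needs the smart_* columns), and the exact timescaledb-parallel-copy flags may vary between versions.

```python
import glob
import subprocess

import psycopg2

conn = psycopg2.connect("dbname=backblaze user=postgres host=localhost")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS drive_stats (
            date           date NOT NULL,
            serial_number  text,
            model          text,
            capacity_bytes bigint,
            failure        int
            -- ...plus the smart_*_raw / smart_*_normalized columns
        );
    """)
    # create_hypertable partitions the table into chunks on the time column.
    cur.execute(
        "SELECT create_hypertable('drive_stats', 'date', if_not_exists => TRUE);"
    )

for path in sorted(glob.glob("data_Q1_2024/*.csv")):
    # Fan the rows in each file out over several parallel COPY workers.
    subprocess.run(
        [
            "timescaledb-parallel-copy",
            "--db-name", "backblaze",
            "--table", "drive_stats",
            "--file", path,
            "--workers", "8",
            "--skip-header",
            "--copy-options", "CSV",
        ],
        check=True,
    )
```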
TimescaleDB has a neat compression feature that takes advantage of the fact that time series tend to be fairly predictable. They claim to achieve 91-96% compression on real-world datasets. When I tried this out on the Backblaze dataset, the Postgres data directory ended up 53% larger. A VACUUM FULL did not resolve it. I'm likely doing something wrong. More experimentation needed!
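For reference, enabling compression on a hypertable looks roughly like this in recent TimescaleDB versions; the segmentby/orderby choices below are guesses on my part rather than a vetted configuration.

```python
import psycopg2

conn = psycopg2.connect("dbname=backblaze user=postgres host=localhost")
conn.autocommit = True
with conn.cursor() as cur:
    # Mark the hypertable as compressible and tell TimescaleDB how to group
    # and order rows inside compressed chunks (column choices are guesses).
    cur.execute("""
        ALTER TABLE drive_stats SET (
            timescaledb.compress,
            timescaledb.compress_segmentby = 'serial_number',
            timescaledb.compress_orderby   = 'date DESC'
        );
    """)
    # Compress every existing chunk right away; a background policy via
    # add_compression_policy() is the other option.
    cur.execute(
        "SELECT compress_chunk(c, if_not_compressed => TRUE) "
        "FROM show_chunks('drive_stats') AS c;"
    )
```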