Storing Market Data Efficiently
Below is a comprehensive listing of tick data storage options.
The assumption is that the market data is level 2 tick data and/or very large in size (multiple GB per day).
If your data set is smaller, you should pick whichever format is easiest for you. That may even just be in-memory in whichever language you are using.
To quickly provide some overall guidance, we suggest:
- Flat Files - For a small number of simple cases using all data, e.g. backtesting.
- Standard Databases - If you need to query and slice the data and want to take advantage of standard SQL and tooling.
- Time-Series Databases - If you want to perform specialized queries and get great performance.
- Storing Market Data in Python - If you are working solely in Python.
For a smaller number of concurrent users, a shared file system can be used to scale a flat-file based solution. Where you have many concurrent users, a database solution is recommended.
Flat Files
Flat files have the benefit of being extremely simple and can be processed in almost any language.
Beyond the simplest use-cases we recommend Parquet file storage: it stores the data column-oriented and allows easy compression, with APIs available for every language. A short example of writing and reading Parquet from Python follows the table below.
Technology | Description |
---|---|
Apache Arrow | Standardized column-oriented format optimized for memory access that is able to represent flat and hierarchical data for efficient analytics. |
Apache Avro | Serialization format for record data, offers excellent schema evolution. |
Apache Iceberg | Open table format for analytic datasets. (Hive/Spark) |
Apache ORC | High-performance columnar storage for Hadoop. |
Apache Parquet | Column-oriented data file format designed for efficient data storage and retrieval including data compression. |
HDF5 | High performance file format to manage, process, and store your heterogeneous data. HDF5 is built for fast I/O processing and storage. |
CSV | Simple but not efficient. Sometimes zipped for data compression. |
Language Serialization | For example: pickling in Python or Java serialization. Not recommended unless you are very sure you will only ever have that specific access method. |
TeaFiles | TeaFiles provide fast read/write access to time series data from any software package on any platform. |
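As a minimal sketch of the recommended Parquet approach, the snippet below writes and reads a single day of ticks with pandas (backed by pyarrow). The column names and file name are hypothetical; the point is that you write once, compressed, and later read back only the columns you need.

```python
import pandas as pd

# Hypothetical level 2 tick data for a single symbol and day.
ticks = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-11-28 09:30:00.000001",
                                 "2023-11-28 09:30:00.000002"]),
    "symbol": ["GOOG", "GOOG"],
    "side": ["bid", "ask"],
    "price": [135.01, 135.03],
    "size": [200, 100],
})

# Write a compressed, column-oriented Parquet file (requires pyarrow or fastparquet).
ticks.to_parquet("GOOG-2023-11-28.parquet", compression="snappy", index=False)

# Read back only the columns needed - this is where columnar formats pay off.
prices = pd.read_parquet("GOOG-2023-11-28.parquet", columns=["timestamp", "price"])
```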
Folder / File Structure
Depending on your use-case, you will want to choose your folder structure carefully.
Will you be examining the data for less than 10 stocks at once? Within what time-frame?
Will you be wanting to look at data for 100s of stocks within a small time-frame?
Typically, when you only want to look at a small number of symbols at once, you might structure the data in folders as symbol/YYYY-MM-DD.csv, for example:
- Daily split: GOOG/2023-11-28.csv, AAPL/2023-11-28.csv
- Hourly split: GOOG/2023-11-28T08.csv, GOOG/2023-11-28T09.csv
This allows easy slicing of data by symbol and date/time.
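A small sketch of how that layout is used in practice is shown below, assuming a hypothetical root folder `ticks/` and CSV files with a `timestamp` column; slicing by symbol and date reduces to choosing which files to read.

```python
from pathlib import Path
import pandas as pd

root = Path("ticks")  # hypothetical root of the symbol/YYYY-MM-DD layout

def day_file(symbol: str, date: str) -> Path:
    # e.g. ticks/GOOG/2023-11-28.csv
    return root / symbol / f"{date}.csv"

def load_days(symbol: str, dates: list[str]) -> pd.DataFrame:
    # Slice by symbol and date simply by choosing which files to read.
    frames = [pd.read_csv(day_file(symbol, d), parse_dates=["timestamp"]) for d in dates]
    return pd.concat(frames, ignore_index=True)

# goog = load_days("GOOG", ["2023-11-27", "2023-11-28"])
```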
Standard Databases
If you happen to have a lot of existing in-house expertise and pre-configured workflows, you could consider using a standard database. PostgreSQL, MySQL and MS SQL will NOT scale to the largest tick data sets, but perhaps you don't need every message stored. Storing aggregates or samples, while still being able to reuse all your existing tooling, may be a worthwhile trade-off.
One thing to keep in mind is that for MySQL and PostgreSQL there are specialized extensions available that you could adopt later should you need more performance. For example, Citus is a PostgreSQL extension that adds distributed scale-out and columnar storage, making it more suitable for market data.
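To make the "store aggregates, keep your tooling" idea concrete, here is a minimal sketch using psycopg2 against PostgreSQL. The table name, columns and connection string are assumptions for illustration, not a prescribed schema: one-minute bars are orders of magnitude smaller than raw ticks, so a standard instance copes comfortably.

```python
import psycopg2

# Hypothetical 1-minute bar table: store aggregates rather than every tick,
# so a standard PostgreSQL instance can handle the volume.
DDL = """
CREATE TABLE IF NOT EXISTS bars_1m (
    symbol  text        NOT NULL,
    ts      timestamptz NOT NULL,
    open    numeric, high numeric, low numeric, close numeric,
    volume  bigint,
    PRIMARY KEY (symbol, ts)
);
"""

conn = psycopg2.connect("dbname=marketdata")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(
        "INSERT INTO bars_1m VALUES (%s, %s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
        ("GOOG", "2023-11-28 09:30:00+00", 135.01, 135.20, 134.95, 135.10, 18000),
    )
```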
Time-Series Databases
These databases come at increased cost, as they are either non-standard or expensive commercial solutions. With that cost, however, come benefits: they have been optimized for speed and for the specialized queries you may need, such as time (asof) joins. Our three recommendations are shown below; a small illustration of an asof join follows the table.
Product | Score | SQL | Time-Joins | Popularity | Description | License |
---|---|---|---|---|---|---|
ClickHouse (wp) | 8 | Some + Custom | asof | Popular | Very fast OLAP database with a cloud version available. Started 10 years ago at Yandex to store the Russian equivalent of Google Analytics. | Apache License 2.0 |
QuestDB | 7 | High + Extensions | asof+ | New | Fast database with a strong focus on time series. Very similar ideas to kdb+ but open source. | Apache License 2.0 |
kdb+ (wp) | 8 | Some + qSQL | Yes, aj/wj | Finance | Very fast column-oriented database with the custom language q and custom time-series joins. Steep learning curve and experts are difficult to find. | Commercial |
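The "Time-Joins" column above refers to asof joins: attaching to each trade the most recent quote at or before the trade time. The sketch below illustrates the concept with pandas merge_asof on hypothetical data; it is not tied to any of the products listed, which implement the same idea natively at far larger scale.

```python
import pandas as pd

# Hypothetical trades and quotes; an asof join attaches to each trade the
# latest quote at or before the trade timestamp, per symbol.
trades = pd.DataFrame({
    "ts": pd.to_datetime(["09:30:00.002", "09:30:00.010"]),
    "symbol": ["GOOG", "GOOG"],
    "price": [135.02, 135.05],
})
quotes = pd.DataFrame({
    "ts": pd.to_datetime(["09:30:00.001", "09:30:00.007"]),
    "symbol": ["GOOG", "GOOG"],
    "bid": [135.00, 135.03],
    "ask": [135.03, 135.06],
})

# Both inputs must be sorted on the join key.
joined = pd.merge_asof(trades.sort_values("ts"), quotes.sort_values("ts"),
                       on="ts", by="symbol")
```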
Storing Market Data in Python
We decided to add this category because a) Python is hugely popular and b) that popularity means custom solutions have recently been introduced that scale much better:
Technology | Description |
---|---|
ArcticDB | High performance, serverless DataFrame database built for the Python Data Science ecosystem. |
DuckDB | DuckDB is an in-process SQL OLAP database management system. |
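These tools also combine well with the flat-file layout described earlier. As a minimal sketch, DuckDB can query a folder of Parquet files in place from Python, with no server to run; the path and column names below are hypothetical.

```python
import duckdb

# In-memory DuckDB database; the Parquet files are queried where they sit on disk.
con = duckdb.connect()

# Hypothetical layout: ticks/GOOG/*.parquet with symbol, price and size columns.
vwap = con.execute("""
    SELECT symbol, sum(price * size) / sum(size) AS vwap
    FROM 'ticks/GOOG/*.parquet'
    GROUP BY symbol
""").df()
```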