Archive for January, 2025

The Big Data Events of 2024

Figure: PyArrow downloads

  1. Open-source tools are now as performant as pre-existing commercial offerings for data analysis, and in many ways offer more features.
    Proof: See the time-series benchmarks and note how many are open source: https://www.timestored.com/data/time-series-database-benchmarks
  2. Everyone has discovered that column-oriented storage and vectorized execution are the secret to fast analytics.
  3. The Arrow format has won. It is now a cornerstone technology used in Python, NumPy, Polars, DuckDB and R.
    Pandas replaces NumPy with Arrow; DuckDB quacks Arrow; QuestDB will support Arrow; InfluxDB (2023); Polars is built upon Apache Arrow.
  4. Apache Parquet has won as the lowest common denominator for basic data storage.
    QuestDB queries Parquet; DuckDB supports Parquet (2021); so does ClickHouse; GreptimeDB uses Arrow and Parquet.
  5. Iceberg vs Delta vs Hudi: Iceberg won. See the AWS announcement.
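
To make point 2 concrete, here is a minimal sketch of the difference between a row layout and a column layout, using only the Python standard library. The function names and the `trades` data are hypothetical, invented for illustration; this is a toy model of the idea, not how Arrow, DuckDB or any of the databases above actually implement it.

```python
from array import array


def rows_to_columns(rows):
    """Pivot a list of row dicts into one typed array per numeric field."""
    return {
        "price": array("d", (r["price"] for r in rows)),  # float64 column
        "size": array("q", (r["size"] for r in rows)),    # int64 column
    }


def total_size_by_row(rows):
    # Row layout: every record object is visited just to read one field.
    return sum(r["size"] for r in rows)


def total_size_by_column(columns):
    # Column layout: the aggregate scans a single contiguous typed array
    # and never touches the other columns.
    return sum(columns["size"])


# Hypothetical sample data, for illustration only.
trades = [
    {"price": 10.0, "size": 100},
    {"price": 20.0, "size": 200},
    {"price": 11.0, "size": 300},
]
columns = rows_to_columns(trades)
```

Both functions return the same total. The point is that the columnar version only reads the bytes of the one field it needs, which is the property that analytic engines exploit at much larger scale.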

 

Trends of 2024

DuckDB is on course to become the de facto column-oriented database that all others will be compared to.
ClickHouse conquered a number of enterprises, but difficulty deploying it and getting started now seems like the key factor that held it back.

Figure: DuckDB downloads

Figure: DuckDB stars

Underlying Factors

Why have SQL and Python won? In many ways these are terrible languages (the GIL, set theory), yet they won. I can’t name all the reasons, but here are some things that I believe worked in their favour:

  1. Open Source + Free = Hard to beat. We’ve seen open-source companies (license disputes mentioned below) take over every area. VCs and startups have realised that making big money selling dev tools requires solving two problems, distribution and technology, and that the harder one is now distribution. The important thing is getting your product into the hands and heads of as many people as possible. Once there, you can withhold all the useful enterprise features and charge for them, assuming AWS doesn’t try the same trick. I do wonder if this is causing the death of otherwise viable small software businesses.
  2. Google = a second brain that worked on keyword search. Languages that use judicious overloading are harder to search than languages with many function names. Google makes it easy to find the uniquely named functions that Python has. Does anyone still read the manual? Never mind the 500+ page language bibles that were the only way to learn a language 20 years ago.
  3. AI – It hasn’t been a factor to date, but its benefit is similar to Google’s, only stronger. The more data and usage a language has, the more chance AI can write your code, write your query and so on. Will this reinforce the advantage that fully expanded syntax and popularity already provide? APL could become even more dead than it already is.