The Big Data Events of 2024

pyarrow downloads

  1. Open source tools are now as performant as pre-existing commercial offerings for data analysis and in many ways offer more features.
    Proof: See the time-series benchmarks and note how many are open source: https://www.timestored.com/data/time-series-database-benchmarks
  2. Everyone has discovered that column-oriented storage and vectorized execution are the secret to fast analytics.
  3. The Arrow format has won. It is now a cornerstone technology used in Python, NumPy, Polars, DuckDB and R.
    Pandas replaces numpy with arrow, DuckDB quacks arrow, QuestDB will support arrow, InfluxDB (2023), Polars is built upon Apache Arrow.
  4. Apache Parquet has won as the lowest common denominator for basic data storage.
    QuestDB queries parquet, DuckDB supports parquet (2021), Clickhouse, GreptimeDB uses Arrow and Parquet.
  5. Iceberg vs Delta vs Hudi. Iceberg won. AWS announcement.
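
To make point 2 concrete, here is a minimal Python sketch (standard library only) contrasting a row-oriented layout with a column-oriented one. The table, its field names and its values are made up for illustration. Real engines such as DuckDB or Polars use typed, contiguous buffers and vectorized kernels rather than Python lists, so treat this as a picture of the idea, not of any implementation.

```python
from array import array

# Row-oriented: each record is stored together (hypothetical trade data).
rows = [
    {"sym": "AAA", "price": 10.0, "size": 100},
    {"sym": "BBB", "price": 20.5, "size": 200},
    {"sym": "AAA", "price": 11.0, "size": 300},
]

# Column-oriented: each column is one contiguous, typed sequence.
columns = {
    "sym": ["AAA", "BBB", "AAA"],
    "price": array("d", [10.0, 20.5, 11.0]),
    "size": array("q", [100, 200, 300]),
}


def total_size_rows(records):
    # Walks every record and picks out one field, touching unrelated data.
    return sum(record["size"] for record in records)


def total_size_columns(cols):
    # Reads only the one column the aggregate needs, in a single pass.
    return sum(cols["size"])


row_total = total_size_rows(rows)
col_total = total_size_columns(columns)
```

Both functions return the same total; the difference is how much unrelated data has to be scanned to get there, which is what matters once tables have hundreds of columns and billions of rows.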

 

Trends of 2024

DuckDB is on course to become the de facto column-oriented database that all others will be compared to.
Clickhouse conquered a number of enterprises, but difficulty deploying and getting started now seems to be a key factor that held it back.

DuckDB Downloads

DuckDB Stars

Underlying Factors

Why have SQL and Python won? In many ways these are terrible languages (the GIL, set theory), yet they won. I can’t say all the reasons but some things that I believe worked in their favour:

  1. Open Source + Free = Hard to beat. We’ve seen open-source companies (license disputes mentioned below) take over every area. VCs and startups have realised that making big money selling dev tools requires solving two problems, distribution and technology, and the harder one is now distribution. The important thing is getting your product into the hands and heads of as many people as possible. Once there, you can withhold all useful enterprise features and charge for them, assuming AWS doesn’t try the same trick. I do wonder if this is causing the death of otherwise viable small software businesses.
  2. Google = a second brain that worked on keyword search. Languages that rely on judicious overloading are harder to search than languages with many function names. Google makes it easier to find the uniquely named functions that Python has. Does anyone still read the manual? Never mind the 500+ page language bibles that were the only way to learn languages 20 years ago.
  3. AI – It hasn’t been a factor to date, but AI is similar to the Google benefit, only more so. The more data and usage a language has, the more chance AI can write your code, write your query etc. Will this reinforce the benefit that fully expanded syntax and popularity already provide? APL could be even more dead than it is already.

 

2024 – The year in Numbers and Images

2024 has been a good year, with new major versions of both QStudio and Pulse released. Thousands of new users are using our tools, and we continue to release regularly and keep improving. Thanks go to our users for raising issues, providing feedback and commercially backing us.

Github Stars Shooting Up

Admittedly we weren’t trying to get GitHub attention for the first 10 years, so it’s a low start, but we’ve made good progress.

 

QStudio Star History

Version 3.0 with Notebooks Launched in October

It looks like a quick holiday from coding in August, then coming back with fresh ideas and heads-down in October to get Notebooks released.

Downloads Grown Steadily Every Month

In case you are wondering, the one fortnight where the data is off the chart was due to this Hacker News post.

We used https://notes.cleve.ai/unwrapped to generate this calendar of our major LinkedIn posts:

Calendar Events

QStudio is now Free!

QStudio is now 100% Free. No registration or license required.

Free QStudio

Why? Are you shutting down?

Quite the opposite, we believe free and open source is the future and that is where we are going.
If anything we want customers to take this as a massive thanks.
Thank you for being part of driving QStudio forward and sponsoring development and cheering us along all these years.

Thanks in particular to the large finance firms that took a chance on us. Big firms can be bureaucratic, with onboarding, purchasing policies, vendor lists and 30-page contracts, so I want to thank all those individuals that jumped those hurdles to get us onboarded and those that put it on the corporate credit card.  Below this post is an image containing what may or may not be some customers and other firms that have provided feedback, assistance and input over the years. Strictly speaking we are not allowed to confirm nor deny customers.

What I would say as an external party is that these places knew how to complete paperwork, get out of their staff’s way and enable them to get work done, so they are probably better places to work on average.

 

Over the years, a few larger firms failed to onboard as those attempting it were ground down under the paperwork.
The good news for them is that QStudio is now free and the paperwork should be halved!

We look forward to improving QStudio together.

Being Free opens up more opportunities, please:

This is me cancelling all the individual users that paid annually for QStudio after 10+ years of building them up! Similarly all corporate contracts are also terminated.

Thanks

Note: For those who recently renewed we are offering a Free Pulse license for 10x the users you purchased for QStudio. Get in touch for a demo.

The Best SQL Notebooks

Want to create beautiful live-updating SQL notebooks,
while being able to easily source control the code,
and take static snapshots to share with colleagues that don’t have database access?

Today we launched exactly what you need and it’s available in both:

  • QStudio Version 4 – Desktop SQL Client entirely based on editing markdown files locally.
  • Pulse Version 3 – As a shared team server, where users only need a web address to get started and share results.

 

SQL Notebook Examples

We have worked with leading members of the community to create a showcase of examples.
These are snapshotted versions with static data. The source markdown and most of the data to recreate them are available on GitHub.

Let us know what you think, please report any issues, feature suggestions or bugs on our github QStudio issue or Pulse issue tracker.

Thanks to everyone that made this possible. Particularly Brian Luft, Rich Brown, Javier Ramirez, Alexander Unterrainer, Mark Street, James Galligan, Sean Keevey, Kevin Smyth, KX, Nick Psaris and QuestDB.

Ryan will be at Duckcon #6 Amsterdam

Duckcon #6 – Amsterdam

DuckDB has skyrocketed in the last year and Amsterdam is its home. QStudio will be there in 2025.


31st January 2025 – 16:40 Stock data analysis with DuckDB
One year ago we decided to bundle DuckDB as we thought it was awesome. A free column oriented database that can open local databases and perform ASOF joins at speed! We knew QStudio users would love it. This year Ryan is excited to be speaking at Duckcon #6 in Amsterdam.

QStudio was at Bigdata LDN

Ryan attended Big Data LDN in September, the highlights were:

  • Meeting the QuestDB team in-person and seeing their talk live.
  • Listening to this talk on modern data lake data formats.
  • Complaining to Jonny Press and Gary Davies about over 50% of the hall being AI dominated.

SQL+Markdown qStudio Experiment 2024

SQL+Markdown qStudio experiment 🚀 🚀 Quick report creation with nice git code commits.
If this is something that interests you, message me.
Particularly if you have tried other notebooks and hold strong opinions 😡 .

At TimeStored we are constantly running experiments with both Pulse and qStudio with small groups of users to see what new ideas may provide value. Most fail: they don’t work out, or they don’t gather enough interest to be viable. But we think SQLMarkdown might be a winner. We are already finding it useful for our own workflows.

kdb 5.0 – The Roadmap Ahead

In 1998 kdb+ was released and changed the finance database industry. We want to do it again.

Today we are releasing kdb+ 5.0 that Works Easily for Everyone, Everywhere, with Everything.

  • A Data Platform that Easily Works for
  • Everyone – Is the most user friendly q ever
  • Everywhere – Finance and beyond
  • Works with Everything
    • Works with every major database tool seamlessly.
    • Interoperates with R/Python and almost every major data tool using high speed standards

The Past – What we have done

Purpose of MS-DOS in Windows 95... - BetaArchive

15 years ago we had a product that was light years ahead of our competition. When you download q today it looks fundamentally similar to how it looked then. Users are presented with a bare q prompt and left to create a tickerplant, a framework and various parts themselves to get real work done.

The landscape has changed and we need to change faster with it. Today we address that. How?

 

Computer developer with glasses and colorful jumper sitting on a trading floor amongst finance people with grey business suits. Cartoon Lego style.

1. We are going to listen and embed ourselves with customers. Pierre and Oleg have been sitting and working with kdb teams at every major bank and hedge fund. They have seen the problems that are being solved, what amazing work those teams have done and where we can improve the core to help them.

2. We are working with the community. Data Intellect invented the marvellous Torq framework, Jo Shinonome has created Kola, Daniel Nugent wrote a wonderful testing framework and numerous others have written useful q modules. They’ve written some great useful components and provided us with lots of insight.

3. We are learning from the competition. Andrew and Ashok have gone round every database and technology similar to ours and examined their strengths and weaknesses. They coded on each and have found some amazing parts but going further they have looked at how those businesses operate and how they attract users.

 

The Past

2347: Dependency - explain xkcd

Previously. Someone downloaded kdb, then needed to email us to use it commercially and wait months for their company to negotiate a contract.

Previously. Someone starting with kdb had to recreate a lot of the framework work that teams in banks have already done, and had to discover and adapt the wonderful work the community has done. We want to unleash that creativity.

Previously. Someone trying to use kdb with Tableau, Pulse, Java or C# had to learn our own driver and struggle to get it to communicate.

Previously. Someone trying to write queries has to write qSQL.

Today we are releasing an amazing version of kdb+ that Works Easily for Everyone, Everywhere with Everything.

Everyone = Modules

Today: We are revealing a Module Framework built into kdb+. This is going to make it easier for everyone to get started.
Bringing the current enterprise quality code to everyone AND enabling existing community contributions to be reused easily.

The great news is, we’ve worked with partners to already have production quality modules available from day zero:

  • Torq – from Data Intellect
  • qSpec – Testing framework from Daniel Nugent
  • QML – q math library – by Andrey Zholos
  • qTips – analytics library from Nick Psaris
  • S3 – querying from KX
import `qml
import `:https://github.com/nugend/qspec as qspec
import `torq/utils
import `log
q).qml.nicdf .25 .5 .975
-0.6744898 0 1.959964

The framework is documented and public, so you can even load modules from github or your own git URL. (This has required making namespaces stricter to prevent one module from being able to affect another. No more IPC vs local loading oddities). Kdb now ships with a packaging tool called qpm based on concepts similar to NPM.

This will allow both KX and the community to experiment in modules and if successful to integrate those libraries into core.
It will allow you to get up and running with kdb+ faster, at less cost and receive production quality maintenance and feature updates for larger parts of your stack.

Everyone = SQL = Becoming as SQL compatible as possible.

Big_Data

Before – piv:{…….}    ij  -100 sublist.

  1. Example: Select *
  2. Example: Select * from t inner join v LIMIT 50
  3. Example: Pivot using duckdb notation
  4. Example: sums, prods, finance functions.
  5. Query it as if it were a standard PostgreSQL database – The old driver is loadable via module.
  6. Partitioned databases now allow “date=…” to be placed anywhere in the query. If it’s missing, a nice clear error message is sent.

q)select * from partitionedtable where (price<10) AND (date=.z.d)
q)PIVOT Cities ON Year USING first Population as POP,Population as P
Country Name          | 2000_POP 2000_P 2010_POP 2010_P 2020_POP 2020_P
----------------------|------------------------------------------------
NL      Amsterdam     | 1005     [1005] 1065     [1065] 1158     [1158]
US      Seattle       | 564      [564]  608      [608]  738      [738]
US      New York City | 8015     [8015] 8175     [8175] 8772     [8772]
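
For readers who have not used PIVOT, the reshaping it performs can be sketched in plain Python. This is only an illustration of the idea, using the city figures from the output above. The `pivot_first` helper is hypothetical and says nothing about how kdb+ or DuckDB implement it.

```python
# Long-format input: one row per (country, name, year, population).
cities = [
    ("NL", "Amsterdam", 2000, 1005),
    ("NL", "Amsterdam", 2010, 1065),
    ("NL", "Amsterdam", 2020, 1158),
    ("US", "Seattle", 2000, 564),
    ("US", "Seattle", 2010, 608),
    ("US", "Seattle", 2020, 738),
    ("US", "New York City", 2000, 8015),
    ("US", "New York City", 2010, 8175),
    ("US", "New York City", 2020, 8772),
]


def pivot_first(rows):
    """Turn years into columns, keeping the first population seen per cell."""
    wide = {}
    for country, name, year, population in rows:
        cell = wide.setdefault((country, name), {})
        cell.setdefault(year, population)  # the "first" aggregate
    return wide


wide = pivot_first(cities)
years = sorted({year for _, _, year, _ in cities})

# One output row per city, one column per year.
for (country, name), by_year in wide.items():
    print(country, name, [by_year.get(year) for year in years])
```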

 

With Everything = Postgres Wire Compatible

We’ve listened to user problems with ODBC, Tableau and KX drivers over the years and we are now bundling pgwire compatibility within the default kdb engine.
Anything that bundles a postgres driver will now work with kdb+.

With Everything = PyArrow + Parquet

Select from and save to a wide range of open standards: parquet, arrow, delta lake, iceberg.

q)select * from file.parquet
q)select * from s3://blah.com/foo
q)select * from http://homer.internal/data.csv
q)`:asd.parquet 0: table
`:asd.parquet

 

Type Hints

/ before: manual runtime checks (-6h is int, -9h is float)
func:{[argA;argB] if[not -6h=type argA;'`wrongType]; if[not -9h=type argB;'`wrongTypeB]; }
/ now
func:{ [argA:int; argB:float] }

 

This will provide runtime checking and optimization of code, and we’ve worked with qStudio and VS Code to automate checks in the UI.
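
As a rough analogy for what runtime checking from type hints buys you, here is a small Python sketch. It is not q and not a description of any kdb+ feature. The `checked` decorator and the `notional` function are hypothetical names invented for this example, and the check only handles plain classes as annotations.

```python
import functools
import inspect


def checked(func):
    """Raise TypeError when a call's arguments do not match the annotations."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = signature.parameters[name].annotation
            if expected is inspect.Parameter.empty:
                continue  # parameter has no hint, nothing to check
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@checked
def notional(quantity: int, price: float) -> float:
    # The body no longer needs hand-written type guards.
    return quantity * price
```

Calling `notional(10, 2.5)` works, while `notional("10", 2.5)` raises a `TypeError` that names the offending argument. That is the same job the hand-written `if[not ...]` guards do in the "before" version.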

Previously

You had to spend months getting kdb+, then setting it up and building a platform, integrating it with other systems, finding experts.

Today

Download, reuse the existing modules, it works with all existing tools, and the greater SQL and typing support allows more people to safely run queries.

Works Easily for Everyone, Everywhere with Everything.

  1. Everyone = friendlier SQL, type hints, more functions builtin including PIVOT.
  2. With Everything = S3 / Parquet / HTTP / PostgreSQL wire compatible.

With modules allowing greater community contribution and reuse.

One Last Thing: Everywhere = We are releasing the 32-bit version of q FREE for all uses including commercial.

Disclaimer: The above is entirely fictional based on some wishes of the author, no proprietary information is known nor being shared. If you like the ideas let KX know. If you dislike the ideas, let me know and post your thoughts for improvement.

DolphinDB & TimeStored: Partnering for Data Visualization in Quantitative Finance

DolphinDB and TimeStored working in partnership. Customization of qStudio can be found here.
Contents below.

 

DolphinDB, a leading provider of the real-time platform for analytics and stream processing, and TimeStored, a pioneering company in the field of data visualization and analytics, are proud to announce a partnership focused on advancing data visualization in quantitative finance. With an emphasis on integrating DolphinDB’s capabilities into TimeStored’s flagship products, qStudio and Pulse, this partnership aims to deliver innovative enhancements to complex analysis scenarios including quantitative trading, high-frequency backtesting, and risk management.

In the competitive field of quantitative trading, a high level of precision in data analysis is essential. Rivals are constantly striving to boost productivity and efficiency to obtain a competitive edge in the dynamic financial markets. To meet this challenge, DolphinDB is committed to providing cutting-edge real-time analytics tools to people worldwide. It offers a unified platform with over 1500 built-in features and a collection of stream computing engines for data warehouse, analytics, and streaming applications. Because of its exceptional efficiency in investment research, DolphinDB has emerged as a significant technology pillar in key areas including strategic research, risk control, and measurement platforms.

Data visualization is intrinsically intertwined with data analysis, serving as an indispensable partner in the exploration of complex datasets and the extraction of valuable insights. By deeply integrating DolphinDB’s efficient investment research capabilities with TimeStored’s advanced visualization technology, we have constructed a solution that can intuitively display complex financial data. By transforming abstract financial data into intuitive charts and indicators, we have significantly enhanced the readability of information and the efficiency of decision-making. This not only meets the current financial market’s demand for data transparency and immediacy but also provides a powerful analysis and decision-support platform for financial professionals. It empowers them to quickly seize opportunities and effectively manage risks in a volatile market.

The latest update to qStudio introduces powerful new features: DolphinDB syntax highlighting, code completion, and a server tree view. These enhancements significantly streamline developers’ workflow, offering intuitive coding and improved navigation. Moreover, the partnership has enabled the visualization of DolphinDB data within TimeStored’s Pulse product. It opens up new horizons for users interested in streaming data visualization, enabling a dynamic and interactive approach to analyzing real-time data.

This partnership leverages the technological strengths of both companies to revolutionize data management. DolphinDB and TimeStored are committed to delivering top-tier solutions for data analysis and quantitative investment research to global market participants.

About DolphinDB

Founded in 2016, DolphinDB is committed to providing users worldwide with cutting-edge real-time analytics platforms. Our flagship product, DolphinDB, offers a unified platform for data warehouse, analytics, and streaming workloads. At its core, it is a high-performance distributed time-series database. With a fully featured programming language, over 1500 built-in functions, and a suite of stream computing engines, DolphinDB enables rapid development of high-performance applications for mission-critical tasks in global financial institutions.

As an enterprise-focused real-time analytics provider, we take pride in enabling organizations to unlock the value of big data and make smarter decisions through real-time insights into their most demanding analytical workloads.

About TimeStored

TimeStored specializes in real-time interactive data tools, offering robust solutions since 2013. Their products, like Pulse and qStudio, support a wide array of databases and enhance data analysis capabilities. Pulse enables the creation of real-time interactive dashboards, facilitating collaborative data visualization. qStudio, a free SQL analysis tool, features an intelligent SQL editor with functionalities like syntax highlighting and code completion, aimed at improving the efficiency and effectiveness of data analysts.

 

The Future of kdb+?

It’s been 2 years since I worked full time in kdb+ but people seem to always want to talk to me about kdb+ and where I think it’s going, so to save rehashing the same debates I’m going to put it here and refer to it in future. Please leave a comment if you want and I will reply.

Let’s first look at the use cases for kdb+, consider the alternatives, then which I think will win for each use-case and why.

Use Cases

A. Historical market data storage and analysis. – e.g. MS Horizon, Citi CloudKDB, UBS Krypton (3 I worked on).
B. Local quant analysis – e.g. Liquidity analysis, PnL analysis, profitability per client.
C. Real-time Streaming Calculation Engines – e.g. Streaming VWAP, Streaming TCA…
D. Distributed Computing – e.g. Margin calculations for stock portfolios or risk analysis. Spread data out, perform costly calcs, recombine.
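
To give a feel for use case C, a streaming VWAP can be sketched in a few lines of Python. This is a minimal illustration with made-up trades. The `StreamingVwap` class is a hypothetical name, and a production engine would also handle multiple symbols, time windows and zero-volume edge cases.

```python
class StreamingVwap:
    """Keeps running totals so each new trade updates the VWAP in O(1)."""

    def __init__(self):
        self.notional = 0.0  # running sum of price * size
        self.volume = 0      # running sum of size

    def update(self, price, size):
        # Fold one trade into the totals and return the current VWAP.
        self.notional += price * size
        self.volume += size
        return self.notional / self.volume


vwap = StreamingVwap()
for price, size in [(100.0, 10), (101.0, 30), (99.0, 10)]:
    latest = vwap.update(price, size)
```

The point of the streaming form is that no history needs to be rescanned: each tick only touches two running totals.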

Alternatives

Historical Market Data – kdb+ Alternatives

A large number of users want to query big data to get minute bars, perform asof joins or more advanced time-series analysis.

  • New Database Technologies – Clickhouse, QuestDB.
  • Cloud Vendors – BigQuery / Redshift
  • Market Data as a Service
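
For anyone unfamiliar with the asof join mentioned above, the core idea is: for each trade, find the most recent quote at or before the trade time. Below is a minimal Python sketch using the standard `bisect` module. The `asof_join` function and the sample data are invented for illustration, and real databases do this per symbol over far larger, typed columns.

```python
from bisect import bisect_right


def asof_join(trades, quotes):
    """For each trade, attach the latest quote at or before the trade time.

    Both inputs are lists of (time, value) tuples; quotes must be sorted by time.
    """
    quote_times = [time for time, _ in quotes]
    joined = []
    for trade_time, trade_price in trades:
        # Index of the last quote whose time is <= trade_time, or -1 if none.
        i = bisect_right(quote_times, trade_time) - 1
        quote = quotes[i][1] if i >= 0 else None
        joined.append((trade_time, trade_price, quote))
    return joined


quotes = [(1, 99.5), (5, 100.0), (9, 100.5)]
trades = [(0, 99.0), (5, 100.1), (7, 100.2)]
result = asof_join(trades, quotes)
```

The first trade arrives before any quote, so it gets `None`; the other two both pick up the quote from time 5.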

Let me tell you three secrets: 1. Most users don’t need the “speed” of kdb+. 2. Most internal bank platforms don’t fully unleash the speed of kdb+. 3. The competitors are now fast enough. I mean, ClickBench is totally transparent on benchmarking.

Likely Outcome: – Kdb+ can hold their existing clients but haven’t and won’t get the 2nd tier firms as they either want cloud native or something else. The previous major customers for this had to invest heavily to build their own platform. As far as I’m hearing the kdb cloud platform still needs work.

Local Quant Analysis – Alternatives

  • Python – with DuckDB
  • Python – with Polars
  • Python – with PyKX
  • Python – with dataframe/modin/….

Now I’m exaggerating slightly but the local quant analysis game is over and everyone has realised Python has won. The only question is who will provide the speedy add-on. In one corner we have widely popular free community tools that know how to generate interest at huge scale, are fast and well funded. In the other we have a niche company that never spread outside finance, wants to charge $300K to get started and has an exotic syntax.

Likely Outcome: DuckDB or Polars. Why? It’s free. People at Uni will start with it and not change. Any sensible quant currently in a firm will want to use a free tool so that they are guaranteed to be able to use similar analytics at their next firm. Without that ability they can only go to places that have kdb+ or else face losing a large percentage of their skillset.

Real-time Streaming / Distributed Computing

These were always the less popular cases for kdb+ and never the ones that “won” the contract. The ironic thing is, combining streaming with historical data in one model is kdb’s largest strength. However the few times I’ve seen it done, it’s either taken someone very experienced and skillful or it has become a mess. These messes have been so bad that they have put other parts of the firm off adopting kdb+ for other use cases.

Likely Outcome: Unsure which will win but not kdb+. Kafka has won mindshare and is deployed at scale but Flink/RisingWave etc. are upcoming stars.

Summary

Kdb+ is an absolutely amazing technology but it’s about the same amazing today as it was 15 years ago when I started. In that time the world has moved on. The best open source companies have stolen the best kdb+ ideas:

  • Parquet/Iceberg is basically kdb+ on disk format for optimized column storage.
  • Apache Arrow – in-memory format is kdb+ in memory column format.
  • Even Kafka log/replay/ksql concept could be viewed as similar to a tplog viewed from a certain angle.
  • QuestDB / DuckDB / Clickhouse all have asof joins
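
The log/replay comparison in the list above can be made concrete with a toy Python sketch: an append-only log whose entries are replayed in order to rebuild state. The `EventLog` class and the sample events are hypothetical, and neither a kdb+ tplog nor a Kafka topic is implemented this way internally. The sketch only shows the shared concept.

```python
class EventLog:
    """Append-only log; state is rebuilt by replaying entries in order."""

    def __init__(self):
        self.entries = []

    def append(self, table, row):
        self.entries.append((table, row))

    def replay(self, handler, start=0):
        # Feed every entry from `start` onwards to the handler, in order.
        for table, row in self.entries[start:]:
            handler(table, row)
        return len(self.entries)  # offset to resume from next time


log = EventLog()
log.append("trade", {"sym": "AAA", "size": 100})
log.append("trade", {"sym": "AAA", "size": 50})
log.append("quote", {"sym": "AAA", "bid": 9.9})

position = {}


def on_event(table, row):
    # A subscriber that only cares about trades.
    if table == "trade":
        position[row["sym"]] = position.get(row["sym"], 0) + row["size"]


next_offset = log.replay(on_event)
```

A subscriber that crashes can rebuild `position` by replaying from offset 0, or catch up from its last known offset, which is the property both designs rely on.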

Not only have the competitors learnt and taken the best parts of kdb+ but they have standardised on them. e.g. Snowflake, Dremio, Confluent, Databricks are all going to support Apache Iceberg/parquet. QuestDB / DuckDB / Python are all going to natively support parquet. This means in comparisons it’s no longer KX against one competitor, it’s KX against many competitors at once. If your data is parquet, you can run any of them against your data.

As many at KX would agree, I’ve talked to them for years about issues around this and, to be fair, they have changed, but they are not changing quickly enough.
They need to do four things:

  1. Get a free version out there that can be used for many things and have an easy reasonable license for customers with less money to use.
  2. Focus on making the core product great. – For years we had Delta this and now it’s kdb.ai. In the meantime MongoDB/InfluxDB won huge contracts with a good database alone.
  3. Reduce the steep learning curve. Make kdb+ easier to learn by even changing the language and technology if need be.
  4. You must become more popular else it’s a slow death

This is focussing on the core tech product.
Looking more widely at their financials and other huge costs/initiatives such as AI and massive marketing spending, wider changes at the firm should also be considered.

2024-08-03: This post got 10K+ views on the front page of Hacker News. To see the follow-up discussion, go here.

Author: Ryan Hamilton