Analyzing Stock Data with DuckDB: Loading and calculating VWAP & OHLC from NYSE TAQ

DuckDB vs kdb+

KX published an article on medium (18th December 2024) demonstrating stock analysis using kdb+. Can we load the same NYSE TAQ data and perform the same analysis using a free database? How will it perform?
Let's try it with DuckDB.

Loading the NYSE TAQ Trade Data

  1. Download the data at https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/,
    in particular the EQY_US_ALL_TRADE_20240702.gz and unzip the file.
  2. Download QStudio which bundles DuckDB and go to File menu -> New DuckDB Database.
  3. Run the below SQL code to load the data into a table and to add a ttime column of the correct type.

Loading NYSE TAQ data

It was very cool seeing DuckDB use all my CPUs by default to load the data.

Querying the Data

column_name column_type null key default extra
TimeVARCHARYES
ExchangeVARCHARYES
SymbolVARCHARYES
Sale ConditionVARCHARYES
Trade VolumeBIGINTYES
Trade PriceDOUBLEYES
Trade Stop Stock IndicatorVARCHARYES
Trade Correction IndicatorVARCHARYES
Sequence NumberBIGINTYES
Trade IdBIGINTYES
Source of TradeVARCHARYES
Trade Reporting FacilityVARCHARYES
Participant TimestampVARCHARYES
Trade Reporting Facility TRF TimestampVARCHARYES
Trade Through Exempt IndicatorBIGINTYES
ttimeTIMEYES

Example: Count the Number of Trades by Stock

This was a simple query that even I was able to write in 1 minute. The great news is that the result is the exact same as that produced by kdb+.

QStudio querying NYSE TAQ Data

The bad news is that it's significantly slower. 500ms is fast enough that it's not interrupting my workflow but it's not close to the 113 milliseconds reported for kdb+. I tried adjusting the CPUs used and running in-memory but performance stayed similar to initial settings or got worse.

DuckDB vs kdb+

Monitoring the Trade Volume for one stock

The ttime column that I created earlier is only Time, i.e. it is NOT a timestamp or a datetime. Time seems to have less functions available than timestamp in DuckDB so rounding it to the nearest time bucket is more tricky but this did give me a chance to create a cool xbarTime function. See our time-bucketing article for more examples.

Additional to the trades within each time bar, we want to add a cumulative sum, for this we use a window function:

earliest total_trade_volume sums
04:00:00.01 10805 10805
04:05:05.83 2257 13062
04:10:29.05 1078 14140
04:15:10.54 2216 16356
04:20:03.30 1365 17721
04:25:02.46 4717 22438
... .... .....
Charting Stock Trades SQL Time-series

The chart was plotted in QStudio a Free SQL client with charting, it looks similar to the chart from the original article.

Volume Weighted Average

The weighted average is similar to an ordinary arithmetic mean (the most common type of average), except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. In finance it is common to weight trades by the number of trades executed at a price point, to find a weighted average of all prices.

In SQL this is fairly simple and for fun I tried a second method including creating a wavg function. This gives the same result as kdb+ and takes 600ms.

Symbol wavg
68085147
A 126.02
AA 40.73
AAA 25.03
AAAU 23.01
AACG 0.81
AACI 11.37
.... .....
Weighted Average plotted vs Last

The weighted price and latest price move roughly in line similar to the original chart using kdb+ and KX developer.

Open-High-Low-Close

I've previously calculated candlestick charts in duckdb so adapting this query didn't take long:

Candlestick Chart in Pulse width=

Candlestick chart of query displayed in Pulse.

ttimelowopenclosehighvolumevwap
04:00:00.01215.89216.6216.11216.610805216.24
04:05:05.83216.1216.18216.27216.272257216.15
04:10:29.05216.2216.28216.35216.351078216.25
04:15:10.54216.28216.3216.39216.462216216.37
04:20:03.30216.24216.38216.26216.421365216.36
04:25:02.46216.21216.28216.4216.434717216.32
....

Conclusion

  • We've successfully recreated every example from the kdb+ article.
  • DuckDB was slower but all queries ran within a reasonable time that wouldn't disturb your workflow.
  • Loading the data was easier in DuckDB.
  • Time column xbar was more difficult in DuckDB and in general the SQL was more verbose.

The good news is that DuckDB is Free.

Everyone now has access to a powerful database for analysing stock data at speed. Technology that previously cost thousands of pounds annually. That's pretty cool.

QStudio is a free SQL client with DuckDB bundled and integrated, download it as the easiest way to try DuckDKB today.