PRQL Quick Start by Rich Brown

This tutorial demonstrates using QStudio a Free SQL Client to open an SQLite database and run PRQL queries against the tables. QStudio allows running PRQL against 30+ databases as it uses a translation layer to seamlessly convert queries. For more details see the QStudio PRQL support page.

The original source of this tutorial is Rich Brown via github and the source can be found here.

Background

Over the years, we have collected a ton of data - mostly from spreadsheets or PDF files, or retrieved from the Town, Vision or the Grafton County Register of Deeds. For example all the following files are in my Github repository.

All these files have a similar format: a bunch of rows (usually representing parcels) with interesting data in the columns.

For example, I scanned all the Old-to-New documents for 2021 to 2024; they're in the "OldToNewValues" spreadsheet. If you look at that spreadsheet, you'll see these columns:

PID
	Map/Block/Unit
	Location (street address)
	Owner
	Code
	OldValue
	NewValue
	Ratio
	Difference
	Year
	

Or look at the "Scraped Data" file. From time to time, I run a script that loads every page from Vision and plucks up the values for each property. It produces these columns:

PID
	Owner
	Street Address
	Map/Block/Unit
	Book/Page
	Assessment
	Appraisal
	Lot size,
	...
	

The Grafton County Register of Deeds file includes columns for:

ID
	Date&Time
	Date
	Time
	Type
	Book&Page
	Book
	Page
	Party1 (seller)
	Party2 (Buyer)
	...
	

The good side of spreadsheets

Spreadsheets are a very convenient way to view data. Everyone knows how to open and manipulate them. And lots of interesting data is reported in a spreadsheet.

You could consider putting each of the above into a separate tab of a spreadsheet. And if you got another source of data, you could add a new tab (with its columns and rows) to the spreadsheet.

The downsides of spreadsheets

But spreadsheets come with downsides, primarily because modifying a spreadsheet is (mostly) irreversible - it's very hard to get it back to its original state. For example:

  • To see the difference or ratio between old and new values, you could create a new column. But each time there's a new column, the "underlying data" gets changed.

  • Or sort by a column. But the act of sorting changes the underlying data of the spreadsheet. And putting the data back to its original state gets really hard.

  • Or, you don't particularly care about a particular column, say last year's assessed land value: that column just gets in the way. You could delete it - but maybe you'll want it later on.

What if there were a way that you can select only the columns you want, and create new columns with their own calculations, and sort every which way?

Using a database

That's where SQL databases come in. Each of those spreadsheets becomes a table in the database. The table's columns and rows exactly match the spreadsheet's columns and rows. When you use a database, you create queries that produce new results without changing the underlying data.

The fundamental rule of databases (their "super power") is that every operation starts with a table and creates a new (modified) table. Subsequent operations make further modifications to create yet another table. That new table doesn't get "added to the database". Instead, the database program just displays the final result, without changing the underlying tables.

And finally, a database query is simply a series of instructions that tell how to manipulate the tables. That query can be saved and re-executed at any time, or modified to answer different questions.

qStudio demonstration

qStudio is an application that lets you view and manipulate the tables of a SQL database. It's the first tool that I think is good enough to criticize - that is, all the prior ones were just too hard to use, with too many steps to get useful results. I like qStudio because it:

  • Runs on Windows, Mac, Linux.
  • Provides a good enough user interface. (Although qStudio's interface has some some sharp edges, they're easy to recognize and work around.)
  • Uses PRQL (pronounced "Prequel") to make it MUCH simpler to write queries (SQL is the pits, you won't need to know it)

Playing with qStudio

Lesson #2 tells how to install qStudio, download and open the _Property_in_Lyme_ SQLite database, and then start making your own queries.

  1. Install qStudio

    On a Mac:

    • Download the Mac version of qStudio from: https://randomneuronsfiring.com/wp-content/uploads/qStudio.zip
    • Unzip the result to get the qStudio.app application
    • Double-click the application. You will see a "unverified developer" (or similar) warning. Use System Preferences -> Privacy and Security to indicate that you trust the application. Click "Open anyway" if necessary. qStudio opens.

    On a Windows machine:

    • Download the Installer for Windows from the qStudio Downloads page: https://www.timestored.com/qstudio/download
    • Run the installer in the usual way
    • Using the command line, enter winget install prqlc to install the PRQL compiler.
    • Open qStudio
  2. Register for a free qStudio license When qStudio starts up, it requires you to register for a free account. Provide your email address and you will be taken to a page that displays a license code (a long string). Copy that string and paste it into the qStudio window.

  3. Download the "Property in Lyme" SQLite database The latest copy of the database is always available from: https://raw.githubusercontent.com/TaxFairness/TaxFairness/refs/heads/main/Property_In_Lyme.sqlite

  4. Open the "Property in Lyme" SQLite database in qStudio. To do this, drag the database's icon into the qStudio window (top left corner, but the bottom half of that pane).

  5. (You might also download the Chinook database that is frequently used for examples. You can also drag the Chinook database to qStudio - it will work properly when both are opened.)

Playing with qStudio

qStudio is now running. Here are some hints for trying its features.

View the tables

Click any of the names of tables at the top-left of the window. The columns and rows appear at the bottom of the window. Interesting tables are:

  • CleanScrapedData - a cleaned-up list of every parcel in Lyme, gleaned from information from the Vision database along with easement and frontage information.
  • GraftonCtyRoD - all records from Grafton County Register of Deeds back to April 2019.
  • LymeOldToNew - data from the (original) 2024 MS-1. I will update when we are permitted to view the info.

Create and edit queries

The basic process is:

  1. Create a new file in qStudio (File -> New File)
  2. Save as qStudioTest.prql (".prql" suffix is important) using File -> Save As... I recommend you save it to a new folder on the Desktop so it's easy to find later.
  3. Enter a query. Start simple with from LymeOldToNew - this statement just displays all the rows and columns from the LymeOldToNew table.
  4. Save the file with Cmd-S on Mac; Ctl-S on Windows
  5. Execute the query (Cmd-E / Ctl-E)
  6. You may be prompted with a request to install the "SQLite driver". Allow it to happen.
  7. The result from the query appears in the bottom "Result" pane. If the Result pane is not visible, choose Windows -> Result
  8. Edit the query by typing in the window. Save and Execute it as above.

Enhancing your query

Here are examples taken from the first lesson. You can experiment with qStudio by modifying the queries as shown below:

Note - these examples use the PRQL language (pronounced "Prequel"), not SQL. The PRQL Reference Book gives full details on creating queries.

Select two columns from the LymeOldToNew table, ignoring the rest

Do this by adding a select statement at the bottom. Note that the select statement requires a list enclosed within { ... } that has column names ("LO_Location" and "LO_Owner") separated by ",".

from LymeOldToNew
	select {
	   LO_Location,
	   LO_Owner,
	}
	

Here's the way to think about the query above. The statement from LymeOldToNew starts a "pipeline" with the contents of that table. It passes it to the next statement (select ...) that modifies the table, by retaining only the two named columns (with all the rows) that could be passed along to any subsequent statement.

Note that PRQL doesn't care about line breaks. A statement can be on one or many lines: the following is equivalent to the above: select { LO_Location, LO_Owner }

Select a couple more columns

Add more column names between the { ... }, separated by commas. Trailing commas are allowed, even if no other column name appears afterwards.

from LymeOldToNew
	select {
	   LO_Location,
	   LO_Owner,
	   LO_OldValue,
	   LO_NewValue,
	   LO_Year,
	}
	

The qStudio result now shows five columns, the first two plus the three new ones. The Result pane shows them in the order they are listed in the query.

Create a new column named "difference" of the new to old values

To do this, add a new line with the column name ("difference") followed by =, and then the calculation.

from LymeOldToNew
	select {
	   LO_Location,
	   LO_Owner,
	   LO_OldValue,
	   LO_NewValue,
	   LO_Year,
	   difference = LO_NewValue - LO_OldValue,
	}
	

_Note that the new "difference" column in the result is just as "real" as any of the original "LO_xxx" columns._

Note too, that the comma at the end of the line is optional, but recommended because you can comment out any line (see below) without hassle.

OK. That's all very interesting, but which property had the biggest change? Let's sort by the "difference" field. Add this line to the end, then re-execute the query:

sort {difference}
	

This sorts the rows by the value in the "difference" column. To reverse the sort, use sort { -difference }.

Oh, but that shows many different years. Let's just look at 2024. Add this line to the end:

filter (LO_Year == 2024)
	

_This includes any rows where the LO_Year is equal (==) to 2024. (The double-equal is required.) To exclude the year 2024, you can use "not-equal" !=. And the >, >=, <, and <= operators work as expected._

Commenting out lines

Sometimes you want to temporarily (or permanently) remove a line or group of lines from a query. PRQL uses # as the comment character. When it appears, the remainder of that line will be ignored. There is a keyboard shortcut - Cmd-/ (Mac) or Ctl-/ (Windows) - that comments out/uncomments the lines that are selected.

Export to Excel

One cool feature of qStudio is that you can export the results directly to Excel. The Result pane opens directly in a new workbook. To do this, click the small Excel icon in the top-right of the window. It's the second icon on the right side of the image below.

icons at top of result

Further observations

  1. PRQL statements "transform" the table they receive by modifing it in some way and passing it along to the next statement. This is the power feature of PRQL: the statements can be "stacked" to form a pipeline that continally modifies the data as it passes through to produce a final result.

  2. The filter statement could have been placed much earlier in the query. In fact, it could have immediately followed the initial from ... That would have removed all the other years prior to passing them to the first select - and in the end, the result would have been identical.

  3. Similarly, the sort ... statement could have been placed anywhere in this pipeline. From that point in the pipeline, all the rows would have remained in the same order.

  4. Finally, a bit of "PRQL lingo":

    • Since PRQL statements "transform" a table, they're also called transforms.
    • Tables are also called relations.
    • A tuple is the PRQL term for column names that are enclosed in { ... }. For example, it's proper to say the select transform requires a "tuple" (instead of requiring a "list").

Common Problems

You may encounter little problems with qStudio and PRQL. Here's how to work around them.

PRQL queries

PRQL is a terrific language for writing queries. When the query is correct, it works great.

But its diagnostic messages are often arcane. When things aren't correct, it displays the error in a "red bar", and displays details below. Here are some common error indications you'll encounter and what to do about them:

Missing main pipeline

The screen shot below shows [E00001] Error: Missing main pipeline in the bottom half of the window.

This is most likely caused by trying to execute an empty query or a commented-out line. Click somewhere in the query window (to un-select) and re-execute.

red-bar error

Unknown name

PRQL is case-sensitive and cares about spelling. The error message (in the image below) Unknown name 'EF_MuniEasemen' shows that the column name wasn't recognized because it was misspelled. Similarly, if it had been typed in lower-case as ef_munieasement it also would have given the "Unknown name" error. To correct the error, simply fix the typo and re-execute.

non-zero exit

Unknown name (reason #2)

Correcting line 7 above (adding back the "t") but changing the query in line 5 (within the select) to EF_MuniEasemen also gives an "Unknown name" error. (Try it.)

But qStudio still reports it on line 7. The reason is a bit subtle: the select transform passes along the named columns, including one named EF_MuniEasemen (without the "t"). Yet the filter transform is looking for a column name that has a "t", and so it gives the "Unknown name" error. To correct the error, simply fix the typo and re-execute.

Column heading in "..."

Note: neither qStudio nor PRQL have any idea of the actual column names in the database. You must type them correctly in the query. If a column name is consistently misspelled (for example, if both lines have EF_MuniEasemen), qStudio will dutifully create and execute the query.

When the response returns, the database simply substitutes the misspelling for the column name and the contents of the column. In the screen shot below, notice:

  • the column heading has its name in quotes ("EF_MuniEasemen"), an indication that there's no such column name in the table
  • the column contains the string EF_MuniEasemen for all its rows.

To correct the error, simply fix the typo(s) and re-execute.

column holds results

Result window workarounds

The Result window displays table contents and the results of a query. Sometimes qStudio doesn't act quite as expected.

Some columns are too wide in Result window

Drag the separator between columns to adjust their width.

Result window doesn't always show a table

If you click a table name at the top left, the Result pane (bottom of the window) shows the first 1,000 rows of that table. When you execute a query, the Result pane shows the results (as it should).

But if you then click a table name again, the Result pane does not change. To work around it, click a different table name, then the one you want. The Result pane changes to the desired table.

Result window only shows top 1,000 rows of a table

When you click a table name, the Result window shows the first 1000 rows. There doesn't seem to be a way to see more rows, or the next 1000 rows. To work around this, execute the query from <table-name> then all rows appear.

Result window doesn't pop out

Sometimes, clicking the "pop-out" icon of the Result window (black rectangle with arrow - see the screen shots above) causes it to become a separate window. Sometimes it doesn't. I don't know why.

More Transforms

PRQL has more transforms (statements) that modify a table as it passes through its pipeline. Much of the discussion below comes from the PRQL Reference Tutorial

We have already introduced several PRQL transforms. Let's review what each does and how each changes the "shape" of the table:

  • from - begins a pipeline and passes the entire table to the next transform.
  • select - Changes the number of columns by retaining only those named within the tuple (the list of column names inside the { ... }) but never the number of rows.
  • filter - Changes the number of rows by excluding the ones that don't match the criteria. Never changes the number of columns.
  • derive - Changes the number of columns by adding a new column, calculated from other columns in the row. Never changes the number of rows.
  • sort - Changes the order of the rows, but leaves the number of rows and columns unchanged.

This lesson talks about these new transforms.

take

The take transform picks rows to pass through based on their position within the table. The number of columns is unchanged. The set of rows picked can be specified in two ways:

  • a plain number x, which will pick the first x rows, or
  • an inclusive range of rows start..end

For example:

from invoices
	take 4         # takes the first four rows

	# - or -

	from invoices
	take 4..7      # takes rows 4, 5, 6, and 7
	

aggregate

The aggregate transform takes a tuple (a list of column names) and “distills down” data from all the rows into a single row. This is frequently used for statistical analysis.

from invoices
	aggregate { sum_of_orders = sum total }
	

In the query above, the "invoices" table has a column named total (perhaps it's the total of a single order). It sums all the values in the total column to produce a single row.

The number of columns is equal to the number of items within the tuple. In the example above, the result would be one column. In the example below, the resulting table has two columns - sum_of_orders and avg_of_orders

from invoices
	aggregate {
		sum_of_orders = sum total,
		avg_of_orders = average total,
	}
	

group

The group transform performs a set of operations on "groups" of rows based on some characteristic. You might use group to analyze data by city or some other value.

The group transform retains all the columns: the number of rows is equal to the number of different combinations within the group characteristic (say, the number of different cities).

See the "grouping" section of the PRQL docs for more information.

join

The join transform combines the columns of two tables together "side-by-side" based on related columns of each table. It is often useful to "join" two separate tables that each have columns with interesting data using common parameters, and then use select to extract the interesting columns.

The join transform adds columns to the result (it contains all the columns from both tables). The number of rows varies based on the data within the tables and the operators used to join the tables.

See the "join" section of the PRQL docs for more informatioin.

Practical Example

Let's try a practical example of using PRQL with qStudio.

The LymeOldToNew table of the _Property_in_Lyme.sqlite_ database has all the parcels in Lyme, showing their old and new assessments for each of the years 2022, 2023, and 2024. (In fact, there were two sets of 2024 data, one from August, and another in October that corrected many values.)

Question How can we understand what properties changed value between the August and October 2024 readings?

The following steps show the iterative process for reviewing and massaging data sets to produce the desired information. As you work through the steps, you may find oddities in the underlying data that need to be tidied up. The beauty of the PRQL lanuage (as opposed to using SQL) is that you can focus on the broad picture and let PRQL take care of the details.

Examine the LymeOldToNew table

Let's first review the data that we have available. (Be sure to retrieve a fresh copy of the _Property_In_Lyme.sqlite_ database from the repository and add it to qStudio.)

  • Click the LymeOldToNew table in the upper-left corner of qStudio
  • Notice that it has columns for address, owner, old and new values, and a year and CollectedOn column. (There are other columns that can mostly be ignored.)
  • We can use those last two columns to make comparisons between data from different years by creating a table that contains only rows where the year is 2022, or where the CollectedOn column has a certain date.

Create a table for the August 2024 rows

PRQL has the ability to create a new table from an existing one. The let statement does this. It assigns a name to a set of transforms that result in a table.

To start, create a new PRQL query. Always begin like this:

  1. Create a new query file. Within qStudio, choose File -> New, save it as AugToOct.prql on the Desktop. (The .prql suffix is required so that qStudio treats the lines as a PRQL query.)
  2. Add the following lines
  3. Execute the query (with Cmd-E or Ctl-E)
let aug24 = (
	   from LymeOldToNew
	)
	from aug24
	

At this point, the aug24 table has all the rows of its parent table. Let's keep only the rows that were collected on August 29, 2024. To do this, add the new line below. The @2024-08-29 is special notation for representing a date.

let aug24= (
		from LymeOldToNew
		filter (LO_CollectedOn == @2024-08-29)
	)
	from aug24
	

Now the aug24 table contains only rows where the LO_CollectedOn column has that date.

We don't want most of the columns - let's select the ones we want:

let aug24= (
		from LymeOldToNew
		filter (LO_CollectedOn == @2024-08-29)
		select {
			PID = LO_PID,
			Location = LO_Location,
			AugOld = LO_OldValue,
			AugNew = LO_NewValue,
		}
	)
	from aug24
	

Now the aug24 table contains only four columns, and only the rows from August 29, 2024. Keeping the PID column is especially important - we'll see why in a moment.

You'll also see that the column names in the result no longer have the "LO_" prefix because we assigned new names to them: PID, Location, AugOld, and AugNew.

Create an October table

Using a similar process, we can create a table that holds the October 2024 rows. Here's the completed example.

let oct24 = (
		from LymeOldToNew
		filter (LO_CollectedOn == @2024-10-31)
		select {
			LO_PID,
			OctOld = LO_OldValue,
			OctNew = LO_NewValue,
		}
	)
	from oct24
	

The oct24 table has just three columns: its PID (we left its name as "LO_PID") as well as the old and new values. (You will probably have to comment-out (with Cmd-/ or Ctl-/) the earlier from aug24 line.)

Use join to combine those tables

The join transform combines two tables, side-by-side, using one or more criteria to "match" the rows.

In this case, we'll use the PID values - they're the Parcel ID in the property tax database that uniquely identifies each parcel. Here's what we'll add to the query:

from aug24
	join oct24 (PID == LO_PID)
	

The way to think about the above is, "Start with the aug24 table, and join ("combine") it with the rows of the oct24 table, using the values of each table's PID columns to determine which of the rows to connect."

The process the computer follows is tedious, but straightforward: Start with the first row of aug24 - find its PID (it happens to be 1). Find a corresponding row (with a LO_PID of 1) in oct24 and place both rows side-by-side in the result. Then examine each subsequent row of aug24 and combine with a corresponding row of oct24 where their PID values match. The resulting table has seven columns: the four from aug24 and three from oct24

Try this now: Type the two lines above into qStudio, below the two table definitions and execute the query. (Comment out any earlier from... lines.)

We're now starting to get useful data. Notice the following: the values in the PID columns match; the AugOld and OctOld values always match; and that sometimes the AugNew and OctNew values don't.

Clean up the data a little bit

As you scroll through the results, you'll notice a lot of lines that are blank, or have only OctOld and OctNew values. These result from somewhat poor data preparation. (Those lines happen to be the totals of the old and new columns that were mistakenly included in the raw data files.)

It might be possible to go back to the source data, clean it up, and update the database. But sometimes, that's not possible: the data comes from from a source that cannot be changed.

But there's another way: we can just adjust the query to leave out those lines. In this case, all the PID values are empty. Add this statement to the bottom of the query:

filter (aug24.PID != '')
	

This transform means, "filter (and pass through) all the PID values that are not blank (''). Notice that all those nearly-blank lines are gone.

Note: The aug24.PID notation is a way to specify the column named PID that came from the aug24 table. (Tables of a join may well have identical column names - using the notation tablename.columnname helps indicate which column to use from the desired table.)

Find "new" values that don't match

It's hard to scan a table of numbers, comparing values in different columns. The computer can do it for us. We can make a new column using the derive transform that represents the difference between two columns.

Add this line to the end of the query:

derive AugToOctDiff = OctNew - AugNew
	

This creates a new column AugToOctDiff that contains the difference between the two columns named there.

That does half the job: the results show lots of rows where the difference is zero - that is, where the August and October values are the same. Let's retain only the rows where the difference is non-zero.

filter (AugToOctDiff != 0)
	

This result retains only the rows where the difference is non-zero. But again, it's hard to make sense of the information, since the rows seem to be in a random order. Let's sort by the difference column:

sort { AugToOctDiff }
	

The result now shows the property value changes, sorted by the difference between August and October.

A little more "data cleaning"

The result also has a LO_PID column left over from the oct24 table. It was critical to performing the join operation, but is no longer needed. It's also confusing to others reading the results ("What's this LO_PID thing? Do I need to know about it?") Better to take it out.

There's a way to use the select transform to exclude the named columns instead of retaining them. Place a ! in front of the tuple/list to designate the columns to exclude.

select !{ LO_PID }
	

This removes the LO_PID column, while retaining all the other columns.

Summary

This exercise shows how qStudio and PRQL allow you to investigate a data set interactively. You can view the full data set, then filter the rows, select the columns, and join rows together to get a result.

The original source of this tutorial is Rich Brown via github and the source can be found here.