Histograms and recursion in SQL
A few weeks ago, while making a histogram in a SQL query, I discovered that some solutions out there do not include bins with 0 counts. This bugged me so I figured out how to do it. I did not know about recursion in SQL when I first started on this problem, but it was fun to learn so that’s a bit of a bonus here.
For my example, I’ll be using some baseball data, but this should work with whatever kind of data you have. The SQL queries are done within a Jupyter notebook.
I’m in a rush, can I just see the final solution?
import pandas as pd
import sqlalchemy
import sqlalchemy_utils
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
# Define a database name
# Set your postgres username
dbname = "baseball"
username = "lacar" # change this to your username
# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(database=dbname, user=username)
# Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine("postgres://%s@localhost/%s" % (username, dbname))
print(engine.url)
postgres://lacar@localhost/baseball
Basic histogram
Let’s say the problem is: “Build a histogram for the number of at-bats for all major league baseball players in 2019. Place the counts into bins with width of 50-at-bats. Therefore, count the number of players who had between 0 and 49 at-bats, 50 and 99 at-bats, 100 and 149, etc.
Let’s get a preview of the table we’re working with, ordering by the number of at-bats in descending fashion.
# All MLB players
sql_query = """
SELECT "Name", "Team", "AB"
FROM batting_stats
WHERE "Season"=2019
ORDER BY "AB" DESC
LIMIT 5;
"""
pd.read_sql_query(sql_query,con)
Name | Team | AB | |
---|---|---|---|
0 | Whit Merrifield | Royals | 681.0 |
1 | Marcus Semien | Athletics | 657.0 |
2 | Rafael Devers | Red Sox | 647.0 |
3 | Jonathan Villar | Orioles | 642.0 |
4 | Ozzie Albies | Braves | 640.0 |
From this table, each player’s number of at-bats is on a separate line. This tutorial describes how to go about it. I won’t repeat their lesson in detail, but I encourage you to take a look at it if what I write below does not make sense. I essentially applied their example to my query. We can take each player’s number of at-bats, divide by 50, and FLOOR
the result. By then multiplying by 50, you can then get back the bin groups based on the original at-bats, and then use COUNT
to produce the number in that bin.
# All MLB players
sql_query = """
SELECT FLOOR("AB"/50.0)*50 AS ab_floor,
COUNT(*)
FROM batting_stats
WHERE "Season"=2019
GROUP BY ab_floor
ORDER BY ab_floor;
"""
pd.read_sql_query(sql_query,con)
ab_floor | count | |
---|---|---|
0 | 0.0 | 444 |
1 | 50.0 | 104 |
2 | 100.0 | 51 |
3 | 150.0 | 44 |
4 | 200.0 | 51 |
5 | 250.0 | 39 |
6 | 300.0 | 46 |
7 | 350.0 | 38 |
8 | 400.0 | 38 |
9 | 450.0 | 44 |
10 | 500.0 | 35 |
11 | 550.0 | 37 |
12 | 600.0 | 17 |
13 | 650.0 | 2 |
As a quick sanity check, we can look back up in the preview of our table and see that Whit Merrifield of the Royals and (681 at-bats) and Marcus Semien of the Athletics (657 at-bats) batted the most. This matches the count of “2” for the 650 at-bat bin. It makes sense that there are a lot of players in the 0-49 bin since pitchers don’t hit in American League ballparks and many players have brief stints in the major leagues. Aside from these observations, I did more sanity checks when I worked this out on my own, so I think we can feel pretty good about the query.
But let’s say we wanted to limit the histogram to my favorite team. We see something peculiar.
# Interim table - using Padres, doing a group by the bin
sql_query = """
SELECT FLOOR("AB"/50.0)*50 AS ab_floor,
COUNT(*)
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'
GROUP BY ab_floor
ORDER BY ab_floor;
"""
df_query = pd.read_sql_query(sql_query,con)
df_query
ab_floor | count | |
---|---|---|
0 | 0.0 | 20 |
1 | 50.0 | 1 |
2 | 150.0 | 1 |
3 | 200.0 | 2 |
4 | 250.0 | 2 |
5 | 300.0 | 3 |
6 | 350.0 | 1 |
7 | 400.0 | 2 |
8 | 550.0 | 1 |
9 | 600.0 | 1 |
If you look closely, there are missing bins! The missing bins are where no players have at-bats in that bin. For example, no Padres player had an at-bat that fell between 450 and 500 at-bats, so that bin doesn’t show up. But it would make a better histogram if all bins of 0 count are represented.
(We did not have to worry about this when we were looking at at-bats across 990 players from all 30 teams. With more players, “gaps” in the histogram are less likely. But since we limited this to the Padres only, this last query was with data from only 34 players.)
I searched around for tutorials and I couldn’t seem to find an answer of how to include a bin and display a count of 0. The same tutorial I mentioned above acknowledges the issue in the example they provided and had this quote:
It has one failing in that if we have no data in a bucket (e.g. no purchases of 55 to 60 dollars), then that row will not appear in the results. We can fix that with a more complex query, but let’s skip it for now.
I did not see a more complex query later and the missing bin observation bugged me. This felt like a challenge.
Recursion
I put on my clever hat and thought about what we can do. I thought about creating the bin intervals and then left joining the above query onto the manufactured bins. That seemed like a reasonable approach, but I hadn’t used SQL to synthesize values before. That’s where I came across recursion in SQL. Some of this syntax is specific to PostgreSQL so keep that in mind when using your SQL variant.
From the tutorial, it looks like we would make a recursive CTE. What is interesting is how an SQL recursion differs from a Python recursion, at least from what I have learned. From a Python recursion tutorial, one starts with your input data and recursion shrinks this until a base case is met where the problem can be solved trivially. However in SQL recursion, you start with an “anchor term” (which appears analagous to a base case) and uses UNION
or UNION ALL
to horizontally concatenate the “recursive term” onto the anchor term. Recursion stops adding new rows when a condition is met.
It took me a while to wrap my head around this, so I started slow. My original goal is to create bins like 0, 50, 150, etc. but not rely on the data to give me back those values. First I looked at the range of data and how many bins I need.
# Get number of bins
sql_query = """
SELECT MIN(FLOOR("AB"/50.0)*50) AS min_ab_bin,
MAX(FLOOR("AB"/50.0)*50) AS max_ab_bin,
(MAX(FLOOR("AB"/50.0)*50)-MIN(FLOOR("AB"/50.0)*50))/50 AS no_bins
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres';
"""
df_query = pd.read_sql_query(sql_query,con)
df_query
min_ab_bin | max_ab_bin | no_bins | |
---|---|---|---|
0 | 0.0 | 600.0 | 12.0 |
I created something recursively but just using a placeholder as a bin number. The SQL recursion structure relies on creating a CTE using WITH RECURSIVE
. My anchor term was simply 0. The recursive term was simply to add 1 and then it would stop when the bin_number was less than 13, since you can see from the above query that the max number of bins is 12.
# Try doing something recursively
sql_query = """
WITH RECURSIVE
no_bins_t AS
(-- anchor term
SELECT 0 AS bin_number
UNION ALL
--recursive term
SELECT bin_number + 1 AS bin_number
FROM no_bins_t
WHERE bin_number < 13)
SELECT bin_number
FROM no_bins_t;
"""
df_query = pd.read_sql_query(sql_query,con)
df_query
bin_number | |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 6 |
7 | 7 |
8 | 8 |
9 | 9 |
10 | 10 |
11 | 11 |
12 | 12 |
13 | 13 |
Great. Now we can put in proper subqueries and variables instead of relying on hard coding numbers. The anchor term has the subquery that evaluates the minimum bin. The 13 in the recursive term can be substituted for with the subquery by identifying the max bin.
# Get range
sql_query = """
WITH RECURSIVE
no_bins_t AS
(-- anchor term
SELECT MIN(FLOOR("AB"/50.0)*50) AS bin_number
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'
UNION ALL
--recursive term
SELECT bin_number + 1 AS bin_number
FROM no_bins_t
WHERE bin_number < (SELECT MAX(FLOOR("AB"/50.0)) + 1
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'))
SELECT bin_number,
bin_number*50 AS at_bat_bin
FROM no_bins_t;
"""
df_query = pd.read_sql_query(sql_query,con)
df_query
bin_number | at_bat_bin | |
---|---|---|
0 | 0.0 | 0.0 |
1 | 1.0 | 50.0 |
2 | 2.0 | 100.0 |
3 | 3.0 | 150.0 |
4 | 4.0 | 200.0 |
5 | 5.0 | 250.0 |
6 | 6.0 | 300.0 |
7 | 7.0 | 350.0 |
8 | 8.0 | 400.0 |
9 | 9.0 | 450.0 |
10 | 10.0 | 500.0 |
11 | 11.0 | 550.0 |
12 | 12.0 | 600.0 |
13 | 13.0 | 650.0 |
Final query: fully represented histogram using recursion
# Get range
sql_query = """
WITH RECURSIVE
no_bins_t AS
(-- anchor term
SELECT MIN(FLOOR("AB"/50.0)*50) AS bin_number
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'
UNION ALL
--recursive term
SELECT bin_number + 1 AS bin_number
FROM no_bins_t
WHERE bin_number < (SELECT MAX(FLOOR("AB"/50.0)) + 1
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'))
SELECT bin_number,
bin_number*50 AS at_bat_bin,
n_players
FROM no_bins_t
LEFT JOIN
(SELECT FLOOR("AB"/50.0)*50 AS ab_floor,
COUNT(*) AS n_players
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'
GROUP BY ab_floor
ORDER BY ab_floor) AS padres_ab
ON no_bins_t.bin_number*50=padres_ab.ab_floor;
"""
df_query = pd.read_sql_query(sql_query,con)
df_query
bin_number | at_bat_bin | n_players | |
---|---|---|---|
0 | 0.0 | 0.0 | 20.0 |
1 | 1.0 | 50.0 | 1.0 |
2 | 2.0 | 100.0 | NaN |
3 | 3.0 | 150.0 | 1.0 |
4 | 4.0 | 200.0 | 2.0 |
5 | 5.0 | 250.0 | 2.0 |
6 | 6.0 | 300.0 | 3.0 |
7 | 7.0 | 350.0 | 1.0 |
8 | 8.0 | 400.0 | 2.0 |
9 | 9.0 | 450.0 | NaN |
10 | 10.0 | 500.0 | NaN |
11 | 11.0 | 550.0 | 1.0 |
12 | 12.0 | 600.0 | 1.0 |
13 | 13.0 | 650.0 | NaN |
We have NULL
values but we can easily clean this up using CASE.
# Get range
sql_query = """
WITH RECURSIVE
no_bins_t AS
(-- anchor term
SELECT MIN(FLOOR("AB"/50.0)*50) AS bin_number
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'
UNION ALL
--recursive term
SELECT bin_number + 1 AS bin_number
FROM no_bins_t
WHERE bin_number < (SELECT MAX(FLOOR("AB"/50.0)) + 1
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'))
SELECT bin_number,
bin_number*50 AS at_bat_bin,
CASE WHEN n_players IS NULL THEN 0
ELSE n_players END AS n_players
FROM no_bins_t
LEFT JOIN
(SELECT FLOOR("AB"/50.0)*50 AS ab_floor,
COUNT(*) AS n_players
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'
GROUP BY ab_floor
ORDER BY ab_floor) AS padres_ab
ON no_bins_t.bin_number*50=padres_ab.ab_floor;
"""
df_query = pd.read_sql_query(sql_query,con)
df_query
bin_number | at_bat_bin | n_players | |
---|---|---|---|
0 | 0.0 | 0.0 | 20 |
1 | 1.0 | 50.0 | 1 |
2 | 2.0 | 100.0 | 0 |
3 | 3.0 | 150.0 | 1 |
4 | 4.0 | 200.0 | 2 |
5 | 5.0 | 250.0 | 2 |
6 | 6.0 | 300.0 | 3 |
7 | 7.0 | 350.0 | 1 |
8 | 8.0 | 400.0 | 2 |
9 | 9.0 | 450.0 | 0 |
10 | 10.0 | 500.0 | 0 |
11 | 11.0 | 550.0 | 1 |
12 | 12.0 | 600.0 | 1 |
13 | 13.0 | 650.0 | 0 |
We can make things look nicer with CONCAT
, although we need to rely on ordering by bin_number since the concatenated at_bat_bin will look like a string and some will appear out of order.
# Get range
sql_query = """
WITH RECURSIVE
no_bins_t AS
(-- anchor term
SELECT MIN(FLOOR("AB"/50.0)*50) AS bin_number
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'
UNION ALL
--recursive term
SELECT bin_number + 1 AS bin_number
FROM no_bins_t
WHERE bin_number < (SELECT MAX(FLOOR("AB"/50.0)) + 1
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'))
SELECT bin_number,
CONCAT(bin_number*50, '-', bin_number*50 + 49) AS at_bat_bin_interval,
CASE WHEN n_players IS NULL THEN 0
ELSE n_players END AS n_players
FROM no_bins_t
LEFT JOIN
(SELECT FLOOR("AB"/50.0)*50 AS ab_floor,
COUNT(*) AS n_players
FROM batting_stats
WHERE "Season"=2019
AND "Team"='Padres'
GROUP BY ab_floor
ORDER BY ab_floor) AS padres_ab
ON no_bins_t.bin_number*50=padres_ab.ab_floor
ORDER BY no_bins_t.bin_number;
"""
df_query = pd.read_sql_query(sql_query,con)
df_query
bin_number | at_bat_bin_interval | n_players | |
---|---|---|---|
0 | 0.0 | 0-49 | 20 |
1 | 1.0 | 50-99 | 1 |
2 | 2.0 | 100-149 | 0 |
3 | 3.0 | 150-199 | 1 |
4 | 4.0 | 200-249 | 2 |
5 | 5.0 | 250-299 | 2 |
6 | 6.0 | 300-349 | 3 |
7 | 7.0 | 350-399 | 1 |
8 | 8.0 | 400-449 | 2 |
9 | 9.0 | 450-499 | 0 |
10 | 10.0 | 500-549 | 0 |
11 | 11.0 | 550-599 | 1 |
12 | 12.0 | 600-649 | 1 |
13 | 13.0 | 650-699 | 0 |
There you have it. One thing to keep in mind is that the recursive query uses a CTE and I haven’t yet figured out a way to use another CTE at the same time. If you figure this out, please let me know!