PostgreSQL and Jupyter notebooks
PostgreSQL is one of the most popular variants of SQL. It is common to use PostgreSQL with pgadmin but I am not a big fan of their UI. By contrast, interacting with PostgreSQL through a Jupyter notebook has several benefits. Most immediately apparent is that SQL queries can be brought directly into a Python environment for data analysis, visualization, and machine learning. But the process of learning SQL itself is supported by the notebook format. Jupyter make it easy to adopt literate programming practices since note-taking is easy. In addition, with different cells in the same notebook, one can build complex SQL queries by first writing, understanding, and recording simpler commands.
Here is an example of a complicated-looking query in a Jupyter notebook cell. This is based on data for my baseball project that I worked on as a fellow at Insight Data Science. (For the baseball aficionados, the question I asked was “What is each team’s hit average against the infield shift versus a standard alignment?” The output provides an approximation of team batting average.)
sql_query = """
SELECT
ob_events_table.team_name,
ob_events_table.if_fielding_alignment,
ob_events_table.n_ob,
total_ab_table.n_ob,
ob_events_table.n_ob::decimal/total_ab_table.n_ob AS obp
FROM
(SELECT
CASE WHEN inning_topbot='Top' THEN away_team
WHEN inning_topbot='Bot' THEN home_team
END AS team_name,
if_fielding_alignment,
COUNT(batter) AS n_ob
FROM statcast
WHERE events IS NOT NULL
AND events IN ('single', 'double', 'triple', 'home_run')
GROUP BY team_name, if_fielding_alignment
HAVING if_fielding_alignment IN ('Infield shift', 'Standard')) AS ob_events_table
JOIN
(SELECT
CASE WHEN inning_topbot='Top' THEN away_team
WHEN inning_topbot='Bot' THEN home_team
END AS team_name,
if_fielding_alignment,
COUNT(batter) AS n_ob
FROM statcast
WHERE events IS NOT NULL
GROUP BY team_name, if_fielding_alignment
HAVING if_fielding_alignment IN ('Infield shift', 'Standard')) AS total_ab_table
ON
ob_events_table.team_name=total_ab_table.team_name
AND
ob_events_table.if_fielding_alignment=total_ab_table.if_fielding_alignment;
"""
df_query = pd.read_sql_query(sql_query,con)
You can see the output of this query (shift against the Angels and Orioles!), see how I built it, and learn how to setup PostgreSQL with Jupyter yourself in this notebook located on my project repo.