Automating a Postgres Report Using Python
I monitor a lab during the evening, and we track student attendance in a PostgreSQL database on a local server. On Saturday mornings, I have to prepare a report that details student attendance for each instructor. Suffice it to say that Saturday-morning reports are the easiest to forget. In this write-up, we’re going to look at how I automated the report so that I can focus my morning on other things, like writing blog posts.
The typical workflow is the following:
- Connect to the database using pgAdmin III
- Run the report
- Copy and paste the data into Excel
- Pivot the data using Excel
- Copy and paste the data into Outlook and send.
This looks like a lot of workflows found in businesses everywhere.
The Current Setup
I have access to a Postgres database running on the network. Let’s start by connecting directly to the database and looking at the data model.
The details of this view aren’t really important. It rolls up all attendance into weekly blocks and lists each student-week as a record. The first few records of the query are from the most recent week, and most of my students come from one professor.
We need this data pivoted so that the weeks are columns and the students are arranged by instructor, course, and name.
First, let’s make a project directory.
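The directory name here is just an example:

```shell
mkdir attendance-report && cd attendance-report
```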
In this project, we’re going to use pipenv instead of pip and virtualenv. Pipenv combines both into one, which makes package management a little bit easier.
To set up a new project, simply do
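Assuming Pipenv is already installed (`pip install --user pipenv`), initializing the environment looks like this:

```shell
pipenv --python 3.5
```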
I’m using Python 3.5, since that’s the default on Ubuntu 16.04. This project doesn’t depend on any specific Python version, so use whatever version you want.
Next, we need to install a couple of dependencies.
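Pipenv installs packages into the project’s virtual environment and records them in the Pipfile:

```shell
pipenv install psycopg2 numpy pandas
```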
Now, let’s start the script. In report.py:
A brief rundown of each package here:
- argparse - simplifies command-line interface development
- MIMEText - makes writing email messages easy
- numpy - the mother of all numeric computation libraries in Python
- smtplib - establishes an SMTP session for sending email from Python
- psycopg2 - a PostgreSQL database adapter
- pandas - data manipulation and analysis
Setting Up the Command Line Interface
The first function we’re going to write configures the command line interface.
Here, we create an ArgumentParser that describes the script’s functionality. Then, we add the required host and password parameters. We also add optional user and port flags, since it’s nice to have that flexibility even though I don’t need it every time.
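The original function isn’t shown here, so this is a sketch; the flag names and defaults are assumptions:

```python
import argparse

def parse_cli(argv=None):
    """Build the command line interface for the report script."""
    parser = argparse.ArgumentParser(
        description="Generate and send the weekly attendance report."
    )
    parser.add_argument("host", help="address of the Postgres server")
    parser.add_argument("password", help="database password")
    parser.add_argument("-u", "--user", default="postgres",
                        help="database user (default: postgres)")
    parser.add_argument("-p", "--port", type=int, default=5432,
                        help="database port (default: 5432)")
    return parser.parse_args(argv)
```

Accepting an optional `argv` makes the function easy to test without touching `sys.argv`.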
Connecting to Postgres
In this function, we define the connection parameters and create a connection object. We’ll use this to interact with the database. We have to wrap this call in a try-except block, since there’s always a chance that this connection could fail unexpectedly.
Let’s take it for a test drive.
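A minimal main section to exercise the connection might look like this (it assumes the two functions described above live in report.py):

```python
# report.py (continued) -- open a connection, then close it again
if __name__ == "__main__":
    args = parse_cli()
    connection = connect(args)
    connection.close()
```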
And call it from the command line.
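The hostname and password below are placeholders for your own:

```shell
pipenv run python report.py db.example.com mypassword
```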
If all goes well, we won’t see any output. The most likely causes of failure here are bad credentials and networking issues. Make sure you can connect to the database using your normal methods.
Getting the Data
The next step is to get the data. We need to get a cursor from the connection, which we then use to execute a query. The cursor stores the result of the query, and we can retrieve the rows with fetchall().
Again, we need to wrap this in a try-except block, since we could lose the connection to the Postgres server at any moment.
Pivoting the Data
We need to transform this data to get it into the desired format. To do that, we’re going to use pandas, which gives us DataFrames, powerful data manipulation objects.
We start by loading the data into a DataFrame and labeling the columns. Then, we use DataFrame.pivot_table() to transform the data into the required format. Each record will be a unique combination of instructor, course, first, and last. The columns will represent weeks, and we use the NumPy function sum() to aggregate the durations into weekly totals per record.
The last step is to fill NaN cells with 0, so we use DataFrame.fillna() to do so.
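Putting those three steps together, the function might look like this; the column names are assumed to match the query’s SELECT list:

```python
import numpy as np
import pandas as pd

# Column order assumed to match the SELECT in get_data()
COLUMNS = ["instructor", "course", "last", "first", "week", "duration"]

def pivot_rows(rows):
    """Pivot raw rows into one record per student, one column per week."""
    frame = pd.DataFrame(rows, columns=COLUMNS)
    pivoted = frame.pivot_table(
        index=["instructor", "course", "last", "first"],
        columns="week",
        values="duration",
        aggfunc=np.sum,
    )
    return pivoted.fillna(0)
```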
Option 1: Exporting to CSV
If all we’re interested in is getting the data, this is a good place to finish off.
DataFrames make exporting to CSV as simple as possible. Let’s look at the finalized main section.
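A sketch of the finished main section, assuming the functions described above; the output filename is arbitrary:

```python
# report.py (continued) -- fetch, pivot, and export the report
if __name__ == "__main__":
    args = parse_cli()
    connection = connect(args)
    rows = get_data(connection)
    report = pivot_rows(rows)
    report.to_csv("attendance_report.csv")
```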
We call each function in sequence, feeding the output of one into the next. This is a clean, easy-to-read decomposition of the problem.
Option 2: Emailing the Report
We can take this design further, actually emailing the report to the individual faculty members instead of manually submitting it.
Let’s replace the existing pivot_rows with this one, which returns a dictionary of faculty email addresses and their related data.
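A sketch of the replacement, assuming the query now also selects each instructor’s email address as its first column:

```python
import numpy as np
import pandas as pd

# Assumes the query now also returns the instructor's email address
COLUMNS = ["email", "instructor", "course", "last", "first", "week", "duration"]

def pivot_rows(rows):
    """Return a dict mapping each faculty email to that instructor's report."""
    frame = pd.DataFrame(rows, columns=COLUMNS)
    reports = {}
    for email, group in frame.groupby("email"):
        reports[email] = group.pivot_table(
            index=["instructor", "course", "last", "first"],
            columns="week",
            values="duration",
            aggfunc=np.sum,
        ).fillna(0)
    return reports
```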
Then, add a method to build the email message itself.
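Something along these lines; the subject line is a placeholder:

```python
from email.mime.text import MIMEText

def build_message(sender, recipient, report):
    """Wrap one instructor's pivoted report in a plain-text email."""
    body = "Weekly attendance report:\n\n{}".format(report)
    message = MIMEText(body)
    message["Subject"] = "Weekly Attendance Report"
    message["From"] = sender
    message["To"] = recipient
    return message
```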
Then, we’ll use smtplib to submit the report.
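A sketch of the sending function; the SMTP host, user, and password below are placeholders:

```python
import smtplib

# In a real deployment these would come from an untracked settings.py;
# never commit credentials to version control.
SMTP_HOST = "smtp.example.edu"  # hypothetical server
SMTP_USER = "me@example.edu"
SMTP_PASSWORD = "change-me"

def send_report(message):
    """Deliver one report over an encrypted SMTP session."""
    with smtplib.SMTP(SMTP_HOST, 587) as server:
        server.starttls()  # encrypt the session, as the server requires
        server.login(SMTP_USER, SMTP_PASSWORD)
        server.send_message(message)
```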
Here, we open a session with smtplib.SMTP, which connects to the server. Then, we start TLS encryption as required by the server’s settings. I use my credentials to log in and then send the email to the designated recipient. If we were designing this system for a public environment, we would want a separate settings.py module to hold the sensitive credentials. With that said, it’s a good time to announce:
Do not store credentials in version control!
Let’s look at the main section for this option.
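Again assuming the functions sketched above, the main section becomes:

```python
# report.py (continued) -- build and send one report per instructor
if __name__ == "__main__":
    args = parse_cli()
    connection = connect(args)
    rows = get_data(connection)
    reports = pivot_rows(rows)
    for email, report in reports.items():
        send_report(build_message(SMTP_USER, email, report))
```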
Automating the Process
I want to fully automate this report. As long as my computer is on, it’s going to send at 3AM on Saturday morning so that the faculty members have it as soon as possible. To do this, I’m going to set up a simple cron job. If you’re not familiar with how to use cron, check out this link.
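The crontab entry might look like this; the project path and credentials are placeholders, and the fields mean minute 0, hour 3, any day of month, any month, day-of-week 6 (Saturday):

```shell
# Run the report at 3:00 AM every Saturday
0 3 * * 6 cd /home/me/attendance-report && pipenv run python report.py db.example.com mypassword
```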
Note that we need to cd into the correct directory so that Pipenv can see the Pipfile.
That’s it for this writeup. Thanks for taking the time to read, and please leave a comment with your thoughts.