I monitor a lab during the evening, and we track student attendance in a PostgreSQL database on a local server. On Saturday mornings, I have to prepare a report that details student attendance for each instructor. Suffice it to say that Saturday morning reports are the easiest to forget. In this write-up, we’re going to look at how I automated the report so that I can focus my morning on other things, like writing blog posts.

The typical workflow is the following:

  1. Connect to the database using pgAdminIII
  2. Run the report
  3. Copy and paste the data into Excel
  4. Pivot the data using Excel
  5. Copy and paste the data into Outlook and send.

This looks like a lot of workflows found in businesses everywhere.

The Current Setup

I have access to a Postgres database running on the network. Let’s start by connecting directly to the database and looking at the data model.

$ psql -h <host> -p 5432 -U postgres -d timein
psql (9.5.11, server 9.4.13)

timein=# select * from weekly_time limit 3;

           email            | course |   monday   | first_name | last_name  | duration 
----------------------------+--------+------------+------------+------------+----------
 profof.math@university.edu | 0999   | 2018-01-08 | jane       | doe        |       50
 profof.math@university.edu | 0999   | 2018-01-08 | john       | smith      |       41
 profof.math@university.edu | 0997   | 2018-01-08 | george     | washington |       50
(3 rows)

The details of this view aren’t really important. It rolls up all attendance into weekly blocks and lists each student-week as a record. The first few records of the query are from the most recent week, and most of my students come from one professor.

We need this data pivoted so that the weeks are columns and the students are arranged by instructor, course, and name.

Getting Started

First, let’s make a project directory.

$ mkdir reports
$ cd reports

In this project, we’re going to use pipenv instead of pip and virtualenv. Pipenv combines both into one, which makes package management a little bit easier.

~/reports $ sudo apt install python3-pip
~/reports $ pip3 install pipenv

To set up a new project, simply do

~/reports $ pipenv --python 3.5

I’m using Python 3.5, since that’s the default on Ubuntu 16.04. This project doesn’t depend on any specific Python version, so use whatever version you want.

Next, we need to install a couple of dependencies.

~/reports $ pipenv install psycopg2 pandas

Now, let’s start the script. In report.py:

import argparse
from email.mime.text import MIMEText
import smtplib

import numpy as np
import psycopg2 as pg
import pandas as pd

A brief rundown of each package here:

  * argparse builds the command line interface
  * email.mime.text and smtplib build and send the email report
  * psycopg2 connects to the Postgres database
  * pandas (along with numpy) pivots the data

Setting Up the Command Line Interface

The first function we’re going to write configures the command line interface.

def parse_args():
    parser = argparse.ArgumentParser(description="Connect to timein database and send reports.")
    parser.add_argument("host", help="ip address or url of postgres instance")
    parser.add_argument("password", help="password of the postgres user")
    parser.add_argument("-u", "--user", default="postgres", type=str, help="username of user to connect with. Default to postgres")
    parser.add_argument("-p", "--port", default=5432, type=int, help="port to connect on. Default 5432")
    return parser.parse_args()

Here, we create an ArgumentParser that describes the script’s functionality. Then, we add required host and password parameters. We add optional user and port flags, since it’s nice to have that flexibility, but I don’t need it every time.
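To see how the flags behave without running the whole script, we can feed the parser an argument list directly instead of letting it read sys.argv. Here’s a quick sketch using the same parser as above (the host and password values are made up):

```python
import argparse

def parse_args(argv=None):
    # Same parser as in report.py; passing argv lets us exercise it directly.
    parser = argparse.ArgumentParser(description="Connect to timein database and send reports.")
    parser.add_argument("host", help="ip address or url of postgres instance")
    parser.add_argument("password", help="password of the postgres user")
    parser.add_argument("-u", "--user", default="postgres", type=str,
                        help="username of user to connect with. Default to postgres")
    parser.add_argument("-p", "--port", default=5432, type=int,
                        help="port to connect on. Default 5432")
    return parser.parse_args(argv)

args = parse_args(["db.example.com", "s3cret", "--port", "5433"])
print(args.host, args.user, args.port)  # db.example.com postgres 5433
```

Note that the port comes back as an int thanks to type=int, and user falls back to its default when the flag is omitted.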

Connecting to Postgres

def get_connection(host, password, user="postgres", port=5432):
    """Get a connection to the pg instance"""
    params = {
        "database": "timein",
        "user": user,
        "password": password,
        "host": host,
        "port": port
    }

    try:
        return pg.connect(**params)
    except Exception as ex:
        print("Connection could not be established: ", ex)

In this function, we define the connection parameters and create a connection object. We’ll use this to interact with the database. We have to wrap this call in a try-except block, since there’s always a chance that this connection could fail unexpectedly.

Let’s take it for a test drive.

if __name__ == "__main__":
    args = parse_args()
    connection = get_connection(args.host, args.password, user=args.user, port=args.port)

And call it from the command line.

~/reports $ pipenv run python report.py <host> $PG_PASSWORD
~/reports $

If all goes well, we won’t see any output. The most likely causes of failure here are bad credentials and networking issues. Make sure you can connect to the database using your normal methods.

Getting the Data

The next step is to get the data. We need to get a cursor from the connection, which we then use to execute a query. The cursor stores the result of the query, and we can query the cursor for the data using fetchall().

def get_view(connection):
    """Get the timein view from the connected db"""
    try:
        cursor = connection.cursor()
        cursor.execute("SELECT * FROM weekly_time")
        rows = cursor.fetchall()
        return rows
    except Exception as ex:
        print("Could not get data from database: ", ex)

Again, we need to wrap this in a try-except since we could always lose the PG server at any moment.

Pivoting the Data

We need to transform this data to get it into the desired format. To do that, we’re going to use pandas, which gives us DataFrames, powerful data manipulation objects.

def pivot_rows(rows):
    df = pd.DataFrame(rows, columns=["instructor", "course", "week", "first", "last", "duration"])
    table = pd.pivot_table(df, values="duration", index=["instructor", "course", "last", "first"], columns=["week"], aggfunc=np.sum)
    table = table.fillna(0)
    return table

We start by loading the data into a DataFrame and labeling the columns. Then, we use pd.pivot_table() to transform the data into the required format. Each record will be a unique combination of instructor, course, first, and last. The columns will represent weeks, and we use the NumPy function sum() to aggregate the durations into weekly totals per record.

The last step is to fill NaN cells with 0, so we use DataFrame.fillna() to do so.
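To see the reshaping in action, here’s the same pivot run on a few made-up rows (invented for illustration, in the same shape the view returns):

```python
import numpy as np
import pandas as pd

def pivot_rows(rows):
    # Same transformation as in report.py: weeks become columns,
    # records become (instructor, course, last, first) combinations.
    df = pd.DataFrame(rows, columns=["instructor", "course", "week", "first", "last", "duration"])
    table = pd.pivot_table(df, values="duration",
                           index=["instructor", "course", "last", "first"],
                           columns=["week"], aggfunc=np.sum)
    return table.fillna(0)

rows = [
    ("profof.math@university.edu", "0999", "2018-01-08", "jane", "doe", 50),
    ("profof.math@university.edu", "0999", "2018-01-15", "jane", "doe", 30),
    ("profof.math@university.edu", "0997", "2018-01-08", "george", "washington", 50),
]
table = pivot_rows(rows)
print(table)
```

George has no record for the second week, so fillna(0) puts a 0 in his 2018-01-15 column instead of NaN.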

Option 1: Exporting to CSV

If all we’re interested in is getting the data, this is a good place to finish off.

def export(table, path="out.csv"):
    """Write the pivot table to a CSV file"""
    try:
        table.to_csv(path)
    except Exception as ex:
        print("Could not export to CSV: ", ex)

DataFrames make exporting to CSV as simple as possible. Let’s look at the finalized main section.
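As a quick sanity check of the export, we can write a small frame to CSV and read it back, assuming a writable temp directory (the data here is made up):

```python
import os
import tempfile

import pandas as pd

# Round-trip a tiny frame through CSV to confirm nothing is lost.
df = pd.DataFrame({"last": ["doe", "smith"], "duration": [50, 41]})
path = os.path.join(tempfile.gettempdir(), "out.csv")
df.to_csv(path, index=False)
restored = pd.read_csv(path)
print(restored.equals(df))  # True
```

In the real report we keep the index (the instructor/course/name combination), so we omit index=False there.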

if __name__ == "__main__":
    args = parse_args()
    connection = get_connection(args.host, args.password, user=args.user, port=args.port)
    rows = get_view(connection)
    table = pivot_rows(rows)
    export(table, "out.csv")

We call each function in sequence, feeding the output of each one into the next. This is a clean, easy-to-read decomposition of the problem.

Option 2: Emailing the Report

We can take this design further, actually emailing the report to the individual faculty members instead of manually submitting it.

Let’s replace the existing pivot_rows with this one, which returns a dictionary of faculty email addresses and their related data.

def pivot_rows(rows):
    df = pd.DataFrame(rows, columns=["instructor", "course", "week", "first", "last", "duration"])
    dfs = {}
    for instructor in df["instructor"].unique():
        if instructor != "":
            selected = df.query("instructor=='{0}'".format(instructor))
            table = pd.pivot_table(selected,
                values="duration",
                index=["instructor", "course", "last", "first"],
                columns=["week"],
                aggfunc=np.sum)
            dfs[instructor] = table.fillna(0)
    return dfs

Then, add a method to build the email message itself.

def get_email(pivot, recipient):
    message = MIMEText(pivot.to_html(), "html")
    message['Subject'] = "Lab times"
    message['From'] = "awilson@university.edu"
    message['To'] = recipient
    return message
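We can sanity-check the message construction without actually sending anything. Here’s the same helper run on a tiny made-up table:

```python
from email.mime.text import MIMEText

import pandas as pd

def get_email(pivot, recipient):
    # Same helper as above: render the pivot as an HTML email body.
    message = MIMEText(pivot.to_html(), "html")
    message['Subject'] = "Lab times"
    message['From'] = "awilson@university.edu"
    message['To'] = recipient
    return message

pivot = pd.DataFrame({"2018-01-08": [50]}, index=["doe, jane"])
msg = get_email(pivot, "profof.math@university.edu")
print(msg["To"], msg.get_content_type())  # profof.math@university.edu text/html
```

Because the body is the DataFrame rendered with to_html(), most mail clients will display it as a proper table rather than plain text.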

Then, we’ll use smtplib to submit the report.

def send_email(email, recipient):
    """Send an email message to recipient"""
    FROM = "awilson@university.edu"
    PASSWORD = "mySuperSecretPassword"
    SMTP_SERVER = "smtp.example.com"
    SMTP_PORT = 587

    try:
        s = smtplib.SMTP(SMTP_SERVER, SMTP_PORT)
        s.starttls()
        s.login(FROM, PASSWORD)
        s.sendmail(FROM, recipient, email.as_string())
        s.quit()
    except Exception as ex:
        print("Could not connect to SMTP server: ", ex)

Here, we connect to the server with smtplib.SMTP, which opens the connection for us. Then, we start TLS encryption, as required by the server’s settings. I use my credentials to log in and then send the email to the designated recipient. If we were designing this system for a public environment, we would want to use a separate settings.py script that contains the sensitive credentials. With that said, it’s a good time to announce:

Do not store credentials in version control!
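One lightweight alternative to a settings.py is pulling the credentials from environment variables at runtime. Here’s a sketch — the variable names REPORT_SMTP_USER and REPORT_SMTP_PASSWORD are hypothetical, not anything the script above requires:

```python
import os

def get_smtp_credentials():
    # Read credentials from the environment instead of hard-coding them.
    # REPORT_SMTP_USER / REPORT_SMTP_PASSWORD are made-up variable names.
    user = os.environ.get("REPORT_SMTP_USER", "")
    password = os.environ.get("REPORT_SMTP_PASSWORD", "")
    if not user or not password:
        raise RuntimeError("SMTP credentials are not set in the environment")
    return user, password

# For demonstration only; normally these would be set in your shell profile.
os.environ["REPORT_SMTP_USER"] = "awilson@university.edu"
os.environ["REPORT_SMTP_PASSWORD"] = "example-only"
print(get_smtp_credentials()[0])  # awilson@university.edu
```

This keeps secrets out of the source file entirely, which also plays nicely with the cron setup below since cron jobs can define their own environment variables.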

Let’s look at the main section for this option.

if __name__ == "__main__":
    args = parse_args()
    connection = get_connection(args.host, args.password, user=args.user, port=args.port)
    rows = get_view(connection)
    pivots = pivot_rows(rows)

    for instructor in pivots:
        print("Emailing: ", instructor)
        email = get_email(pivots[instructor], instructor)
        send_email(email, instructor)

Automating the Process

I want to fully automate this report. As long as my computer is on, it’s going to send at 3AM on Saturday morning so that the faculty members have it as soon as possible. To do this, I’m going to set up a simple cron job. If you’re not familiar with how to use cron, it’s worth reading a quick tutorial first.

Note that we need to cd into the correct directory so that Pipenv can see the Pipfile.

0 3 * * 6 cd /home/aaron/reports && pipenv run python report.py <host> $TIME_IN_PASSWORD

That’s it for this write-up. Thanks for taking the time to read, and please leave a comment with your thoughts.