August 15, 2024

Python and Standardization at OPMI

Analysts at OPMI often have to perform similar data cleaning and analytic steps across various data pipelines in order to prepare the data for different projects. Instead of manually repeating these processes for every data pipeline and project, we have standardized the process using Python.

A day in the life of an OPMI analyst typically involves answering interesting research questions using a variety of survey-based or ridership datasets. Usually, this means digging into data pipelines that take various inputs (such as our rider satisfaction survey), processing them into the desired format, and then conducting different analyses. Often we work with the same data sources that are updated periodically (such as our ridership information), or we might be faced with entirely new inputs that raise novel analysis questions. In these cases, we build custom pipelines from scratch.

For each of these different use cases we are often faced with relatively similar tasks. Even across projects as varied as our yearly system demographic survey and our bike and pedestrian data analysis, many of the coding steps are in fact very similar. For our survey data, for instance, a common task is cross-walking MBTA asset names across different input files. One file might have the Providence Commuter Rail line spelled 'Providence/Stoughton' and another 'Providence / Stoughton Line'. Instead of manually matching every combination of names to the official GTFS (General Transit Feed Specification) naming standard, we decided to automate some of this processing.

To account for some of these common questions and standardize how we perform analysis across our team, we decided to implement a more robust solution: Enter OpyMI.

Overview

OpyMI is our new Python library for shared analytics and engineering tasks across the OPMI team. It is intended to provide helpful functions for time-consuming but relatively manual tasks like name matching or quickly checking results across datasets. While we don't have a precise measure of how long each team member was spending on these tasks, the hope is that the library frees up some of this manual processing time for more interesting or complicated analysis.

More importantly, its functionality is meant to provide modular building blocks across our codebase instead of isolated scripts that don't communicate with other parts of the codebase (which often ends in replicated, error-prone code). OpyMI allows analysts to quickly import and share useful functions across their analyses. If one analyst has already solved a problem (such as fuzzy name matching for typos), others can reuse that solution in their work too. This also ensures we follow the same methods when dealing with the same type of problem (consistency!).

Having shared modules also allows us to make changes quickly by altering only the source code, instead of updating logic replicated across multiple scripts. For example, a single line of code that represents a database connector can be re-used in multiple projects (and is now handled by the SQLAlchemy URL connector within our server module).
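As a rough sketch of the idea (the helper name and URL below are illustrative, not the actual OpyMI interface), a shared function lets every project build its database engine from a single SQLAlchemy connection URL instead of repeating connection code in each script:

# shared_db.py -- illustrative sketch, not the actual OpyMI source
from sqlalchemy import create_engine

def get_engine(url: str):
    # One shared place to build the database engine; changing the
    # connection logic here updates every project that imports it.
    return create_engine(url)

# In an individual analysis script (hypothetical URL):
# engine = get_engine("postgresql+psycopg2://analyst:password@db-host:5432/opmi")
# ridership = pd.read_sql("SELECT * FROM ridership", engine)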

Lastly, this solution makes testing easier as well, which, given the repeated nature of our code, has been a challenging and slow process in the past. Besides the unit tests required to build the package, which ensure the functionality performs as intended, having code that serves as modular building blocks lets us investigate errors more quickly as they come up in our analysis.

To summarize, we wanted a solution that saves manual processing time, provides modular building blocks we can share across the team, keeps our methods consistent, lets us make updates in one place, and is easy to test.

Build process

Building OpyMI was a months-long project. Earlier this year, our analyst Dea solicited team feedback about what would be most useful to include in the first version of the library. From a few rounds of comments, the initial set of modules spanned various areas of our work: database connections (server), transit system data (mbta), date handling (time_opymi), Census geographies (census), and our customer satisfaction survey (panel).

Why Python?

We chose Python to build the library primarily because of its extensive support for such a wide array of analysis, engineering, and data science tasks. By using Python, we were able to test each step of the build with pytest, unit-testing the different functions to make sure they produced the expected behavior before using them across our codebase.
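For example, a unit test for a hypothetical name-cleaning helper might look like the sketch below (the function and test here are illustrative, not the actual OpyMI test suite):

# test_names.py -- illustrative pytest example, not the actual OpyMI tests
def standardize_route_name(name: str) -> str:
    # Toy name-cleaning helper: strip stray spaces around the slash.
    return "/".join(part.strip() for part in name.split("/")).strip()

def test_standardize_route_name():
    assert standardize_route_name("Providence / Stoughton Line") == "Providence/Stoughton Line"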

To make the process of testing and re-building the library as a package as automated as possible, we also implemented Continuous Integration and Delivery (CI/CD) pipelines with GitHub Actions that are triggered whenever we push to the repository. This is where all of the tests are run and the package is re-built automatically.

The CI/CD GitHub Actions workflow file consists of the following steps:

Figure 1 - the CI/CD GitHub Actions workflow

These CI/CD pipelines are also extremely useful in making sure that the codebase is developed little by little and any new changes are integrated quickly (hence the name Continuous Integration and Delivery). Each time we push a change, we can be sure it gets tested in the CI step; if the tests fail, the package doesn't get updated.


This makes it easier to make smaller changes over short periods and check that they work with the rest of the functionality. It also saves us the time we would otherwise spend running every step manually. For the last part of the build pipeline, actually building the set of files into a Python package, we used PSR (Python Semantic Release) along with a standard pyproject.toml file that outlines the external dependencies and build tools.

Some examples

Here's some of the functionality of this new library:

server:

The server module builds data pipelining functionality into the package and allows us to interact with SQL databases more effectively. Under the hood, the module consists of a client class, where each object is instantiated with a user's credentials and can be used for all basic database operations. We are using SQLAlchemy because of its flexibility in working with SQL queries. In this first version of the module, the client class implements data loading, insertion, deletion, and creation of new schemas and columns.
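A minimal sketch of what such a client could look like (class and method names here are illustrative; the real OpyMI implementation may differ):

import pandas as pd
from sqlalchemy import create_engine

class Client:
    # Illustrative client: each object wraps one user's database connection.
    def __init__(self, url: str):
        self.engine = create_engine(url)

    def load(self, query: str) -> pd.DataFrame:
        # Run a SELECT query and return the result as a DataFrame.
        return pd.read_sql(query, self.engine)

    def insert(self, df: pd.DataFrame, table: str, schema: str | None = None) -> None:
        # Append rows from a DataFrame to an existing table.
        df.to_sql(table, self.engine, schema=schema, if_exists="append", index=False)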

mbta:  

First off, the mbta module is designed to help us with string matching, especially when we are dealing with a list of non-standard names that we would like to analyze. It uses fuzzy string matching to compare string similarity, so the most common kinds of typos (such as capitalization differences or extra spaces) can be easily accounted for. Secondly, the mbta module allows us to quickly access information about the system, such as loading transit routes and stops.
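The sketch below illustrates the name-matching idea using the standard library's difflib (OpyMI's own matcher is not shown here; the function name and route list are illustrative):

from difflib import get_close_matches

# Illustrative subset of official GTFS route names.
GTFS_ROUTE_NAMES = ["Providence/Stoughton Line", "Framingham/Worcester Line", "Red Line"]

def match_route_name(raw_name: str, cutoff: float = 0.6) -> str | None:
    # Normalize obvious issues (case, stray spaces) before fuzzy matching.
    cleaned = " ".join(raw_name.split()).lower()
    candidates = {name.lower(): name for name in GTFS_ROUTE_NAMES}
    matches = get_close_matches(cleaned, candidates.keys(), n=1, cutoff=cutoff)
    return candidates[matches[0]] if matches else None

# match_route_name("providence / stoughton line")  ->  "Providence/Stoughton Line"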


This information comes from our GTFS feed, and we use the partridge library to parse its txt files and extract the different bits of information. Previously, we would have to rely on geospatial software like ArcGIS for getting basic information, such as a data frame of stops, routes, and their geometries. This typically happened once for each new project, and the custom file would be saved locally, where we'd repeat the process if we wanted a slightly different format of the data:

Figure 2 - the mbta module process
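As a rough sketch of what that loading step looks like with partridge (assuming a locally downloaded copy of the feed; the actual OpyMI code may structure this differently):

import partridge as ptg

# Path to a local copy of the GTFS feed (illustrative file name).
feed = ptg.load_feed("MBTA_GTFS.zip")

# The feed's txt files become pandas DataFrames.
stops = feed.stops[["stop_id", "stop_name", "stop_lat", "stop_lon"]]
routes = feed.routes[["route_id", "route_long_name"]]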

time_opymi:

A lot of our analytics tasks use some form of temporal information, typically with data points in hourly or daily intervals. Working with dates has similar challenges to station string matching, so we wanted some way to account for date formatting when processing batches of time-based data. This module uses dateutil as a parsing and formatting helper, helping us match even cases like '1 janUary 2024' to '01-01-2024'.
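For instance (a simplified sketch; OpyMI wraps this in its own helper), dateutil's parser handles the odd capitalization before we format the date consistently:

from dateutil import parser

raw = "1 janUary 2024"
parsed = parser.parse(raw)               # -> datetime(2024, 1, 1, 0, 0)
formatted = parsed.strftime("%m-%d-%Y")  # -> "01-01-2024"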


Some of the other functionality includes calculating date ranges broken down into weekends or weekdays, as well as being able to exclude certain dates. One common example is looking at all weekdays across a month while taking out the days when there was a closure.
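A rough sketch of that kind of calculation with pandas (the actual OpyMI function names and signatures may differ):

import pandas as pd

def weekdays_in_range(start: str, end: str, exclude=()):
    # Illustrative helper: every calendar day in the range, keeping Monday-Friday only.
    days = pd.date_range(start, end, freq="D")
    weekdays = days[days.dayofweek < 5]
    # Drop any excluded dates, e.g. days with a closure.
    return weekdays.difference(pd.to_datetime(list(exclude)))

# weekdays_in_range("2024-03-01", "2024-03-31", exclude=["2024-03-15"])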

census:

Many of our common US Census tasks involve merging or filtering on certain geometries. For this, the census module provides functions for extracting the different Census IDs from smaller divisions (e.g. tracts from block groups; counties from tracts). It builds upon the pygris package, which allows us to load Census information for different years and states, such as area or population.
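Because Census GEOIDs are hierarchical (state, county, and tract are successively longer prefixes of a block group ID), the extraction itself can be as simple as slicing the string. The helper below is an illustrative sketch, not the actual OpyMI function:

def parent_geoids(block_group_geoid: str) -> dict:
    # A 12-digit block group GEOID encodes: state (2) + county (3) + tract (6) + block group (1).
    return {
        "state": block_group_geoid[:2],
        "county": block_group_geoid[:5],
        "tract": block_group_geoid[:11],
        "block_group": block_group_geoid,
    }

# parent_geoids("250250101011") -> tract "25025010101", county "25025", state "25" (illustrative ID)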


Furthermore, it includes functions for more custom geospatial tasks, such as outputting the Census geometry for a point (latitude, longitude) or checking that a geometry or a GEOID is valid. This is typically useful for any task that involves looking at travel-related research questions across different Census geographies.
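A simplified sketch of the point-lookup idea using pygris and geopandas (the actual OpyMI function may work differently; the state, county, and coordinates are illustrative):

import geopandas as gpd
from pygris import tracts
from shapely.geometry import Point

# Load tract geometries for one state/county (pygris downloads the TIGER/Line files).
suffolk_tracts = tracts(state="MA", county="Suffolk", year=2022)

# A (latitude, longitude) pair becomes a shapely Point in (lon, lat) order.
pt = Point(-71.0589, 42.3601)
pt = gpd.GeoSeries([pt], crs="EPSG:4326").to_crs(suffolk_tracts.crs).iloc[0]

# The tract whose polygon contains the point (empty if none do).
match = suffolk_tracts[suffolk_tracts.contains(pt)]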

panel:

The panel module contains functions that simplify processing the customer satisfaction survey from Qualtrics into our research databases. In it, we've included some common Qualtrics-related cleaning functionality as well as functions for metric calculations (such as % satisfaction or mean wait times).
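For example, a percent-satisfaction calculation on cleaned responses might look something like this (the column names and the 4-or-5 threshold are illustrative, not the actual OPMI definitions):

import pandas as pd

# Illustrative cleaned survey responses on a 1-5 satisfaction scale.
responses = pd.DataFrame({
    "rider_id": [1, 2, 3, 4],
    "satisfaction": [5, 4, 2, 3],
})

# Share of riders rating 4 or 5, expressed as a percentage.
pct_satisfied = (responses["satisfaction"] >= 4).mean() * 100
# pct_satisfied -> 50.0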


Conclusion

Overall, OpyMI is an exciting new library that lets us streamline common coding tasks. With it we can save time and share code across the team, while having a standard repository to refer to for our analysis work.