FOSS4G 2022 general tracks

Seth Hostetter

Seth Hostetter is the Director of Data Operations Systems and Analytics at the New York City Department of Transportation. He oversees the management, maintenance, analysis, application and presentation of the agency’s safety data. His team provides analytic support for project and program development, design, and prioritization. He has provided the analysis for numerous studies including Safer Cycling: Bicycle Ridership and Safety in New York City, Vision Zero Borough Pedestrian Safety Action Plans, Making Safer Streets, and Pedestrian Safety and Action Plan. Seth is a data analyst and transportation planner with over a decade of experience in New York City. He received a bachelor’s in Environmental Design from SUNY University at Buffalo and received a master’s degree in Urban Planning from Columbia University, concentrating in transportation.


Sessions

08-25
17:50
5min
Building a data analytics library in Python
Seth Hostetter

The primary mission of the Data Operations Systems and Analytics team at NYC DOT is to support the agency's data analysis and data product needs relating to transportation safety. The team's safety analyses for projects and programs typically involve merging data from a variety of sources with collision data, asset data, and/or program data. The bulk of the analysis is performed in PostgreSQL databases, all with a geospatial component. The work necessitates ingesting input data from other databases, CSV/Excel files, and various geospatial data formats. It is critical that the analysis be documented and repeatable.

Moving data around (getting external data into the database, transforming it, geocoding it, and so on) previously occupied the bulk of the team's time, reducing capacity for the actual analysis. Additionally, the volume of one-off and exploratory analyses resulted in a cluttered database environment with multiple versions of datasets with unclear lineage and states of completeness.
Modeled on the infrastructure-as-code idea, we began building a Python library that preserves the entire analysis workflow, from data ingestion through analysis to output generation, in a single Python file or Jupyter notebook. The library began as a way to reduce friction and standardize the process of ingesting external data into the various database environments we use. It has since grown into the primary method for facilitating reproducible data analysis processes, covering data ingestion, transformation, analysis, and output generation.
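The library itself is not yet public, so the sketch below only illustrates the single-file workflow idea it describes: ingest, transform, and output all live in one script, so the lineage of every table is documented and the run is repeatable. The function name, table names, and columns are assumptions, and SQLite stands in here for the team's PostgreSQL environments.

```python
# Hypothetical sketch of a single-file, reproducible analysis workflow.
# Not the NYC DOT library's actual API; SQLite stands in for PostgreSQL.
import sqlite3

import pandas as pd


def run_workflow(csv_path: str, con) -> pd.DataFrame:
    # 1. Ingest: load the external flat file into the database, so the
    #    raw input is captured alongside everything derived from it.
    raw = pd.read_csv(csv_path)
    raw.to_sql("collisions_raw", con, if_exists="replace", index=False)

    # 2. Transform/analyze: the SQL lives in the same file as the
    #    ingestion step, so the output table's lineage is self-documenting.
    result = pd.read_sql(
        "SELECT borough, COUNT(*) AS n FROM collisions_raw GROUP BY borough",
        con,
    )

    # 3. Output: persist the derived table; rerunning the script
    #    reproduces it from scratch.
    result.to_sql("collisions_by_borough", con, if_exists="replace", index=False)
    return result
```

Because every step is plain code rather than ad hoc database operations, the whole pipeline can be version-controlled and rerun, which is the property the talk emphasizes.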

The library provides basic database connections and facilitates quick and easy import and export of flat files, geospatial data files, and other databases. It supports both inferred and defined schemas, allowing quick exploration as well as more thoroughly defined data pipelines. It standardizes column naming, comments, and permissions; includes built-in database cleaning and geocoding processes; and we have started building simple geospatial data display functions for exploratory analysis. The code relies heavily on numpy, pandas, GDAL/ogr2ogr, pyodbc, psycopg2, shapely, and plain SQL and Python. The library is not an ORM, though it occupies a similar role, geared towards analytic workflows.
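As one concrete illustration of the column-naming standardization mentioned above, a utility along these lines could normalize the messy headers that arrive in CSV/Excel sources before tables are created. The function name and the exact rules are assumptions for illustration, not the library's actual API.

```python
# Hypothetical sketch of a column-name standardization step applied on
# ingest. Rules shown (snake_case, lowercase) are assumptions.
import re

import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lowercased, snake_cased column names,
    so tables built from heterogeneous sources follow one convention."""

    def clean(name: str) -> str:
        # Collapse any run of non-alphanumeric characters into one "_".
        name = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
        return name.strip("_").lower()

    out = df.copy()
    out.columns = [clean(c) for c in out.columns]
    return out
```

Applying one such pass at ingestion time means downstream SQL never has to quote or guess at column names, which is part of what keeps exploratory work in a shared database tidy.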

The talk will discuss how the library has evolved over time, its functionality and use cases in the team's daily workflows, and where we would like to extend the functionality and open it up for contributions. While the library is not currently open source, we are actively working on creating an open version and migrating to Python 3.x. The library has greatly improved the speed and simplicity of conducting exploratory analysis and enhanced the quality and completeness of the documentation of our more substantial data analytics and research.
The library should be of interest and utility to anyone who works with data without the support of a dedicated data engineering team and needs to collect multiple datasets in a variety of formats, as well as anyone looking to standardize their data analysis workflows from beginning to end.

Use cases & applications
Room 4