Scalable Architecture for Distributed Spatiotemporal Analytics
11-05, 10:30–11:00 (America/New_York), Lake Audubon

This proposal introduces a scalable architecture for processing and analyzing high-volume GPS, ADS-B, and AIS data streams to enable low-latency, large-scale spatiotemporal analytics, built on a hybrid stack comprising PostgreSQL with PostGIS, Spark, Cassandra, and Iceberg.


The proposed architecture is designed to process large-scale GPS, ADS-B, and AIS data. The platform supports analytical use cases involving millions of vehicle data points streaming in from diverse global sources. At its core, our ingestion pipeline transforms raw positional pings into logical chunks of sequential data referred to as "trips." A trip typically represents a complete real-world journey, such as a vehicle commute from home to office or a commercial flight from NYC to SFO.
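As a minimal sketch of this segmentation idea (the proposal does not specify the exact rules, so the 30-minute gap threshold and field names below are assumptions), pings for a single vehicle can be grouped into trips by splitting on long gaps between consecutive fixes:

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Ping:
    vehicle_id: str
    ts: int        # epoch seconds
    lat: float
    lon: float

def segment_into_trips(pings: Iterable[Ping], max_gap_s: int = 1800) -> List[List[Ping]]:
    """Group time-ordered pings for one vehicle into trips, splitting on long gaps."""
    trips: List[List[Ping]] = []
    current: List[Ping] = []
    for ping in sorted(pings, key=lambda p: p.ts):
        if current and ping.ts - current[-1].ts > max_gap_s:
            trips.append(current)   # a long silence ends the current trip
            current = []
        current.append(ping)
    if current:
        trips.append(current)
    return trips
```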
A common analytical workflow involves a user defining one or more polygons on a map along with time windows of interest. The objective is to identify trips that start in, end in, or pass through these spatial regions during the specified intervals. These queries are submitted to the analytics backend, which returns the set of trips that satisfy the user's criteria.
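A hypothetical request shape for this workflow might look as follows; the field names and match modes are purely illustrative, not the platform's actual API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

class MatchMode(Enum):
    START = "start"
    END = "end"
    PASSTHROUGH = "passthrough"

@dataclass
class TripQuery:
    polygons_wkt: List[str]              # user-drawn regions, e.g. "POLYGON((...))"
    time_windows: List[Tuple[int, int]]  # (start_epoch, end_epoch) pairs
    modes: List[MatchMode]               # which spatial relationships to match

query = TripQuery(
    polygons_wkt=["POLYGON((-74.02 40.70, -73.93 40.70, -73.93 40.80, -74.02 40.80, -74.02 40.70))"],
    time_windows=[(1704067200, 1704070800)],
    modes=[MatchMode.START, MatchMode.PASSTHROUGH],
)
```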
Our initial implementation was built atop the PostgreSQL ecosystem, leveraging PostGIS and stored procedures to handle complex geospatial queries. This solution was effective and drove initial customer engagement. It scaled to approximately 200TB across several PostgreSQL instances, with data partitioned by contractual client.
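The stored procedures encapsulated queries of roughly this form; the trips table, column names, and SRID below are assumptions for illustration only:

```python
import psycopg2  # driver assumed for connecting to the PostGIS-backed instance

# Roughly the kind of query the stored procedures wrapped.
FIND_TRIPS_STARTING_IN = """
SELECT trip_id
FROM trips
WHERE ST_Intersects(start_point, ST_GeomFromText(%s, 4326))
  AND start_ts BETWEEN to_timestamp(%s) AND to_timestamp(%s);
"""

def find_trips_starting_in(conn, polygon_wkt: str, t0: int, t1: int) -> list:
    """Return trip IDs that begin inside the polygon during [t0, t1]."""
    with conn.cursor() as cur:
        cur.execute(FIND_TRIPS_STARTING_IN, (polygon_wkt, t0, t1))
        return [row[0] for row in cur.fetchall()]
```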
However, with rapid customer onboarding and exponential data growth, we encountered scalability limitations. Chief among these were:
- Skewed load distribution, leading to persistent disk I/O bottlenecks on certain customer instances.
- Limited observability and debuggability of stored procedures, making it difficult to estimate execution time or troubleshoot performance regressions.
These constraints prompted a broader exploration of distributed data processing frameworks and storage engines.
Following months of prototyping and benchmarking, we transitioned to a stack composed entirely of open-source technologies, selected for their scalability, community support, and interoperability.
While PostgreSQL continues to serve essential functions such as maintaining metadata, user-submitted geometries, and analytics job results, the heavy lifting is now delegated to a distributed analytics backend comprising:
- Apache Spark: Serves as the compute layer for analytics. Spark jobs directly replace the earlier stored procedures, offering greater flexibility, observability, and parallelism.
- Apache Cassandra: Serves as the primary indexing layer for fast, pre-filtered trip lookup.
- Apache Iceberg: Used to store high-fidelity trip data on cost-effective object storage, indexed for efficient retrieval.
Cassandra hosts three core indexing tables aligned with common user query patterns:
- trip_start: Indexes trips by their starting geohash and timestamp
- trip_end: Indexes trips by their end geohash and timestamp
- trip_passthrough: Indexes trips by intermediate geohashes and timestamps
Each table is partitioned using a compound key consisting of a 6-character geohash and a 1-hour epoch bin, yielding partitions approximately 1 km × 1 km in area and 60 minutes in duration. This granularity ensures that partitions stay within Cassandra’s optimal size threshold (~100MB), enabling low-latency reads.
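A sketch of this partitioning scheme, assuming illustrative column names and a third-party geohash encoder such as pygeohash; the trip_start DDL and clustering order are assumptions, not the production schema:

```python
import pygeohash  # any equivalent geohash encoder works

TRIP_START_DDL = """
CREATE TABLE IF NOT EXISTS trip_start (
    geohash6  text,
    hour_bin  bigint,
    start_ts  timestamp,
    trip_id   uuid,
    PRIMARY KEY ((geohash6, hour_bin), start_ts, trip_id)
);
"""

def partition_key(lat: float, lon: float, epoch_s: int):
    """Map a trip's start position and time to its Cassandra partition."""
    geohash6 = pygeohash.encode(lat, lon, precision=6)  # ~1 km cell
    hour_bin = epoch_s // 3600                          # 1-hour bucket
    return geohash6, hour_bin
```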
When users submit polygon-based queries, the application decomposes the polygons into a set of intersecting geohash cells. These are used to scan the corresponding partitions in Cassandra. A second, more precise geometric filtering stage follows, eliminating false positives inherent in geohash approximations. This two-step approach minimizes expensive spatial computations and improves overall throughput.
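A simplified sketch of the two-step filtering, assuming shapely and pygeohash: the cover here uses crude grid sampling rather than an exact geohash cover algorithm, and all names are illustrative.

```python
from shapely.geometry import Point, Polygon
import pygeohash

def geohash_cover(polygon: Polygon, step_deg: float = 0.005) -> set:
    """Approximate the set of geohash-6 cells overlapping the polygon."""
    min_x, min_y, max_x, max_y = polygon.bounds
    expanded = polygon.buffer(step_deg)  # pad so boundary cells are not missed
    cells = set()
    y = min_y
    while y <= max_y:
        x = min_x
        while x <= max_x:
            if expanded.contains(Point(x, y)):  # shapely uses (lon, lat) order
                cells.add(pygeohash.encode(y, x, precision=6))
            x += step_deg
        y += step_deg
    return cells

def precise_filter(candidates: list, polygon: Polygon) -> list:
    """Second stage: keep only trips whose start point truly lies inside the polygon."""
    return [t for t in candidates
            if polygon.contains(Point(t["start_lon"], t["start_lat"]))]
```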
Once candidate trips are identified, their full-resolution ping data is retrieved from Iceberg. By deferring access to this large, immutable dataset until late in the query lifecycle, we drastically reduce the volume of data read, improving performance.
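A sketch of this late, targeted read, assuming a Spark session already configured with an Iceberg catalog; the table name "lake.trip_pings" and its columns are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trip-ping-fetch").getOrCreate()

candidate_ids = ["trip-001", "trip-002"]  # produced by the Cassandra index lookup

pings = (
    spark.table("lake.trip_pings")                     # Iceberg table via the catalog
         .where(F.col("trip_id").isin(candidate_ids))  # touch only the matching trips
         .select("trip_id", "ts", "lat", "lon")
)
```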
Spark then assembles the final result set in the desired format and persists it back into PostgreSQL for downstream consumption or display.
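A sketch of that final write, assuming Spark's JDBC sink with the PostgreSQL driver on the classpath; host, database, and table names are placeholders:

```python
from pyspark.sql import DataFrame

def persist_results(results_df: DataFrame) -> None:
    """Write the assembled result set to PostgreSQL for downstream consumption."""
    (
        results_df.write
            .format("jdbc")  # requires the PostgreSQL JDBC driver on the classpath
            .option("url", "jdbc:postgresql://pg-host:5432/analytics")  # placeholder
            .option("dbtable", "analytics_job_results")                 # placeholder
            .option("user", "analytics_user")
            .option("password", "***")
            .mode("append")
            .save()
    )
```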
The platform currently ingests approximately 400GB of new data per day and manages an active historical archive exceeding 500TB. It is designed to operate at global scale, ingesting and serving spatiotemporal data across continents with low query latency and a fault-tolerant architecture.
Our talk will focus on the lessons learned migrating from a relational database to a distributed architecture, along with a high-level walkthrough of the tradeoffs involved in choosing a distributed technology stack for geospatial analytics at scale.