11-05, 10:30–11:00 (America/New_York), Lake Anne
A deep technical dive into building a global, spatially aware embeddings database with 100% open source components.
Earth Index aims to make our entire planet searchable, transforming billions of satellite image chips into high-dimensional embeddings to unlock insights into environmental change. Storing and efficiently querying this global dataset presents a significant scalability challenge, particularly on a non-profit budget. We’ll show how we’ve created a fast, open source solution for less than 1% of the cost of a proprietary one.
This talk details our approach to geospatially sharding and partitioning the massive embedding dataset across a distributed Postgres cluster. We’ll start by looking at design challenges including efficient data distribution, schema design, gridding schemes, global identifiers and query optimization.
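To make the gridding, global-identifier, and sharding ideas concrete, here is a minimal sketch in Python. It is not the Earth Index implementation: the regular lat/lon grid, the constants (`CHIPS_PER_DEG`, `TILES_PER_DEG`, `NUM_SHARDS`), and the function names are all assumptions chosen for illustration, and a production system might well use a different gridding scheme entirely.

```python
import math

# All constants below are hypothetical, chosen only for illustration.
CHIPS_PER_DEG = 100   # fine grid: 0.01-degree image chips
TILES_PER_DEG = 1     # coarse grid: 1-degree tiles used as the partition key
NUM_SHARDS = 8        # assumed number of Postgres worker nodes


def grid_index(lat, lon, cells_per_deg):
    """Return (row, col) of the cell containing a point on a regular lat/lon grid."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    n_rows = 180 * cells_per_deg
    n_cols = 360 * cells_per_deg
    # Clamp so that lat == 90 and lon == 180 fall into the last row/column.
    row = min(math.floor((lat + 90.0) * cells_per_deg), n_rows - 1)
    col = min(math.floor((lon + 180.0) * cells_per_deg), n_cols - 1)
    return row, col


def chip_id(lat, lon):
    """Globally unique integer identifier for the chip containing the point."""
    row, col = grid_index(lat, lon, CHIPS_PER_DEG)
    return row * (360 * CHIPS_PER_DEG) + col


def tile_key(lat, lon):
    """Coarse tile identifier; every chip inside one tile shares this key."""
    row, col = grid_index(lat, lon, TILES_PER_DEG)
    return row * (360 * TILES_PER_DEG) + col


def shard_for(lat, lon):
    """Map a point to a shard by taking its tile key modulo the shard count."""
    return tile_key(lat, lon) % NUM_SHARDS
```

Because the shard is derived from the coarse tile rather than from the individual chip, all embeddings for one area land on the same node, so a spatially bounded nearest neighbor query only needs to touch the shards whose tiles intersect the search area.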
We’ll then cover the performance achieved in practice, including query latency for nearest neighbor searches, data ingestion rates, and index build times at scale. Crucially, we will present a transparent analysis of the costs, comparing our self-managed, open source Postgres architecture against the projected costs of specialized managed vector database services and demonstrating significant cost efficiencies.
This should be a fascinating talk for anyone interested in massive datasets at the intersection of geospatial, AI and open source.