DuckLake: A scalable data lakehouse for web mapping? FOSS4G Europe 2026

DuckLake: A scalable data lakehouse for web mapping?
.ical
2026-06-30 10:30–11:00, A11

PostgreSQL and PostGIS remain the default choice for geospatial data management. Their strict ACID compliance and extensive spatial capabilities cover most standard use cases. However, scaling a dynamic, multi-tenant WebGIS platform, where users continuously upload custom datasets, reveals the scalability boundaries of traditional relational architectures.

During the development of the WebGIS platform, GOAT, we encountered these constraints directly. Initially, all spatial data was stored in a monolithic PostgreSQL database. To prevent the system catalogs from overloading due to creating thousands of user-specific tables, we consolidated datasets into multi-tenant tables grouped by geometry type. This stabilized the table count, but as platform adoption increased, individual tables rapidly exceeded 50GB. The resulting volume of data and indexes degraded query performance and complicated maintenance.

Attempting to resolve this computationally, we implemented Citus to shard and distribute the data. While this approach provided vertical scaling, managing a distributed PostgreSQL cluster introduced significant operational complexity. Furthermore, the application logic required extensive refactoring to accurately route queries using distribution columns. When we evaluated horizontal scaling through read replicas, the infrastructure cost of duplicating a multi-terabyte database proved prohibitive for a small team. Even with a strong preference for the PostGIS ecosystem, we were maintaining a fragile system that was slow to back up and difficult to sustainably host.

A common structural response to this challenge is separating storage from compute via a lakehouse architecture. We evaluated managed solutions like BigQuery and Databricks, which offer expanding spatial support. Yet, for an open-source project, these platforms present distinct disadvantages: high costs, vendor lock-in, and an inability to be self-hosted, bypassing data sovereignty requirements for certain clients. Focusing on open-source frameworks, Apache Iceberg stood out as a mature, production-ready standard. However, we ultimately opted to test a novel framework still in its early stages: DuckLake.

DuckLake provides a strict separation of concerns. It manages metadata within a lightweight relational database while storing the actual data in highly compressed Parquet files. Built around DuckDB, it allows direct access to DuckDB's native spatial functions, including vector tile generation.

We integrated DuckLake despite its beta status, primarily to test DuckDB's minimal storage requirements, strong analytical capabilities, and open-source foundation. The infrastructural shift was measurable. Transitioning large-scale system layers—such as nationwide street networks—and user data from PostGIS to Parquet reduced our total storage footprint by at least 90%. This was achieved through a combination of Parquet's columnar compression and the elimination of traditional database indexes, which previously accounted for a massive portion of our storage.

Currently, our data is stored on scalable volumes and backed up to S3-compatible object storage. Analytical compute has been decoupled and is managed by Windmill, orchestrating background Python jobs to execute DuckDB queries on demand. While a minor subset of highly specific queries show slight performance regressions compared to PostGIS, the vast majority execute significantly faster. This shift is particularly noticeable during large-scale spatial analytics, as columnar storage outperforms row-based continuous reading when scanning and aggregating massive datasets.

Because Parquet files are highly compressed and immutable, they cannot be edited in place. To support data mutations, DuckLake writes edits to separate delta files, which are dynamically merged during query execution. Furthermore, DuckLake can save small edits directly to PostgreSQL using inlining. The underlying metadata layer, responsible for tracking these file shifts, is managed by PostgreSQL. Since PostgreSQL is optimized for rapid transactions, this hybrid approach allows us to retain the metadata management of a relational database while utilizing the storage efficiency of Parquet for massive spatial datasets. Additionally, DuckLake provides built-in support for versioning and time travel, maintaining a history of data mutations.

While analytical workloads benefited from this architecture, we encountered distinct challenges regarding the high-frequency queries required by web mapping. Although Parquet files can be optimized through sorting and partitioning, they are not naturally designed for the low-latency, point-lookup access patterns required by vector tile or feature services. Notably, standard Parquet implementations do not inherently support spatial indexes, meaning spatial queries often trigger full file scans rather than targeted reads.
Specifically, generating dynamic vector tiles requires evaluating and filtering entire Parquet files for every incoming user request, creating a severe performance bottleneck on larger datasets. Additionally, maintaining continuous access to the PostgreSQL metadata layer via DuckDB frequently exhausted database connection limits, which introduced noticeable latency during initial query execution.

To mitigate these issues, we introduced connection pooling using PgBouncer and caching strategies via Redis. However, these solutions only masked the underlying problem for large datasets. Ultimately, we adopted a hybrid vector tile architecture to bypass these limitations: we rely on static vector tiles generated by Tippecanoe for base layers, reserving DuckDB's dynamic, on-the-fly vector tile generation strictly for volatile, filtered, or actively edited datasets.

During the development of GOAT, testing DuckLake demonstrated both its current utility and its limitations. While the framework is still in its early stages and requires architectural workarounds—like our hybrid vector tile setup—to handle low-latency web mapping, it functions as a practical complement to PostgreSQL and PostGIS for heavy analytical workloads. The upcoming native support for GeoParquet is expected to address several of the current performance bottlenecks by introducing geometry as a native Parquet type, which should improve execution times for spatial bounding box and intersection queries. For analytics-heavy platforms managing multi-tenant data, the combination of DuckLake and DuckDB offers an alternative approach to scaling spatial infrastructure while maintaining manageable server costs.

Indicate what is (are) the open source project(s) essential in your talk:

Essential Open Source Project(s):
- GOAT (https://github.com/plan4better/goat): The primary focus of the talk, an open-source WebGIS platform for scalable spatial analytics.
- DuckDB: An in-process SQL OLAP database management system used for high-performance spatial analytics on Parquet files.
- DuckLake: A novel open-source framework connecting DuckDB to data lakehouse architectures.
- PostgreSQL / PostGIS: The established relational database.
- Windmill: The open-source version of Windmill used to orchestrate and execute background Python processing jobs.

Give indication of resources (video, web pages, papers, etc.) to read in advance, that will help get up to speed on advanced topics.:

DuckLake Website: https://ducklake.select/
DuckLake Video: https://www.youtube.com/watch?v=zeonmOO9jm4&t=1897s

Assign a number between 1 and 4 indicating the level of technical complexity of your contribution.: 3: you should know what you're doing to understand this talk Select at least one general theme that best defines your proposal: Software status, new project development Under which license do you make your contribution available? The conference contribution comprises the abstract, the text contribution for the conference proceedings, the presentation materials as well as the video recording and live transmission of the presentation: CC BY

Majk Shkurti

A seasoned WebGIS developer with extensive experience in building and managing geospatial applications. With a strong focus on usability and open-source technologies, there is a clear drive to make spatial data more accessible and practical.

As CTO at Plan4Better, the work centers on developing innovative solutions at the intersection of mobility, sustainability, and GIS. With a passion for turning complex data into useful tools, the focus lies in bridging the gap between advanced geospatial technology and real-world decision-making.

DuckLake: A scalable data lakehouse for web mapping? .ical 2026-06-30 10:30–11:00, A11

DuckLake: A scalable data lakehouse for web mapping?
.ical
2026-06-30 10:30–11:00, A11