BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//talks.osgeo.org//foss4g-europe-2026//speaker//TE8YTF
BEGIN:VTIMEZONE
TZID:EET
BEGIN:STANDARD
DTSTART:20001029T050000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:EET
TZOFFSETFROM:+0300
TZOFFSETTO:+0200
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T040000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:EEST
TZOFFSETFROM:+0200
TZOFFSETTO:+0300
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-foss4g-europe-2026-33YRC7@talks.osgeo.org
DTSTART;TZID=EET:20260630T103000
DTEND;TZID=EET:20260630T110000
DESCRIPTION:PostgreSQL and PostGIS remain the default choice for geospatial
  data management. Their strict ACID compliance and extensive spatial capab
 ilities cover most standard use cases. However\, scaling a dynamic\, multi
 -tenant WebGIS platform\, where users continuously upload custom datasets\
 , reveals the scalability boundaries of traditional relational architectur
 es.\n\nDuring the development of the WebGIS platform\, [GOAT](https://gith
 ub.com/plan4better/goat)\, we encountered these constraints directly. Init
 ially\, all spatial data was stored in a monolithic PostgreSQL database. T
 o prevent the system catalogs from overloading due to creating thousands o
 f user-specific tables\, we consolidated datasets into multi-tenant tables
  grouped by geometry type. This stabilized the table count\, but as platfo
 rm adoption increased\, individual tables rapidly exceeded 50GB. The resul
 ting volume of data and indexes degraded query performance and complicated
  maintenance.\n\nAttempting to resolve this computationally\, we implement
 ed Citus to shard and distribute the data. While this approach provided ve
 rtical scaling\, managing a distributed PostgreSQL cluster introduced sign
 ificant operational complexity. Furthermore\, the application logic requir
 ed extensive refactoring to accurately route queries using distribution co
 lumns. When we evaluated horizontal scaling through read replicas\, the in
 frastructure cost of duplicating a multi-terabyte database proved prohibit
 ive for a small team. Even with a strong preference for the PostGIS ecosys
 tem\, we were maintaining a fragile system that was slow to back up and di
 fficult to sustainably host.\n\nA common structural response to this chall
 enge is separating storage from compute via a lakehouse architecture. We e
 valuated managed solutions like BigQuery and Databricks\, which offer expa
 nding spatial support. Yet\, for an open-source project\, these platforms 
 present distinct disadvantages: high costs\, vendor lock-in\, and an inabi
 lity to be self-hosted\, bypassing data sovereignty requirements for certa
 in clients. Focusing on open-source frameworks\, Apache Iceberg stood out 
 as a mature\, production-ready standard. However\, we ultimately opted to 
 test a novel framework still in its early stages: DuckLake. \n\nDuckLake p
 rovides a strict separation of concerns. It manages metadata within a ligh
 tweight relational database while storing the actual data in highly compre
 ssed Parquet files. Built around DuckDB\, it allows direct access to DuckD
 B's native spatial functions\, including vector tile generation.\n\nWe int
 egrated DuckLake despite its beta status\, primarily to test DuckDB's mini
 mal storage requirements\, strong analytical capabilities\, and open-sourc
 e foundation. The infrastructural shift was measurable. Transitioning larg
 e-scale system layers—such as nationwide street networks—and user data
  from PostGIS to Parquet reduced our total storage footprint by at least 9
 0%. This was achieved through a combination of Parquet's columnar compress
 ion and the elimination of traditional database indexes\, which previously
  accounted for a massive portion of our storage.\n\nCurrently\, our data i
 s stored on scalable volumes and backed up to S3-compatible object storage
 . Analytical compute has been decoupled and is managed by Windmill\, orche
 strating background Python jobs to execute DuckDB queries on demand. While
  a minor subset of highly specific queries show slight performance regress
 ions compared to PostGIS\, the vast majority execute significantly faster.
  This shift is particularly noticeable during large-scale spatial analytic
 s\, as columnar storage outperforms row-based continuous reading when scan
 ning and aggregating massive datasets.  \n\nBecause Parquet files are high
 ly compressed and immutable\, they cannot be edited in place. To support d
 ata mutations\, DuckLake writes edits to separate delta files\, which are 
 dynamically merged during query execution. Furthermore\, DuckLake can save
  small edits directly to PostgreSQL using inlining. The underlying metadat
 a layer\, responsible for tracking these file shifts\, is managed by Postg
 reSQL. Since PostgreSQL is optimized for rapid transactions\, this hybrid 
 approach allows us to retain the metadata management of a relational datab
 ase while utilizing the storage efficiency of Parquet for massive spatial 
 datasets. Additionally\, DuckLake provides built-in support for versioning
  and time travel\, maintaining a history of data mutations.\n\nWhile analy
 tical workloads benefited from this architecture\, we encountered distinct
  challenges regarding the high-frequency queries required by web mapping. 
 Although Parquet files can be optimized through sorting and partitioning\,
  they are not naturally designed for the low-latency\, point-lookup access
  patterns required by vector tile or feature services. Notably\, standard 
 Parquet implementations do not inherently support spatial indexes\, meanin
 g spatial queries often trigger full file scans rather than targeted reads
 .\nSpecifically\, generating dynamic vector tiles requires evaluating and 
 filtering entire Parquet files for every incoming user request\, creating 
 a severe performance bottleneck on larger datasets. Additionally\, maintai
 ning continuous access to the PostgreSQL metadata layer via DuckDB frequen
 tly exhausted database connection limits\, which introduced noticeable lat
 ency during initial query execution.\n\nTo mitigate these issues\, we intr
 oduced connection pooling using PgBouncer and caching strategies via Redis
 . However\, these solutions only masked the underlying problem for large d
 atasets. Ultimately\, we adopted a hybrid vector tile architecture to bypa
 ss these limitations: we rely on static vector tiles generated by Tippecan
 oe for base layers\, reserving DuckDB's dynamic\, on-the-fly vector tile g
 eneration strictly for volatile\, filtered\, or actively edited datasets.\
 n\nDuring the development of GOAT\, testing DuckLake demonstrated both its
  current utility and its limitations. While the framework is still in its 
 early stages and requires architectural workarounds—like our hybrid vect
 or tile setup—to handle low-latency web mapping\, it functions as a prac
 tical complement to PostgreSQL and PostGIS for heavy analytical workloads.
  The upcoming native support for GeoParquet is expected to address several
  of the current performance bottlenecks by introducing geometry as a nativ
 e Parquet type\, which should improve execution times for spatial bounding
  box and intersection queries. For analytics-heavy platforms managing mult
 i-tenant data\, the combination of DuckLake and DuckDB offers an alternati
 ve approach to scaling spatial infrastructure while maintaining manageable
  server costs.
DTSTAMP:20260604T211853Z
LOCATION:A11
SUMMARY:DuckLake: A scalable data lakehouse for web mapping? - Majk Shkurti
URL:https://talks.osgeo.org/foss4g-europe-2026/talk/33YRC7/
END:VEVENT
END:VCALENDAR