Searching the Spatial Data Lake: Bringing GeoParquet to Apache Lucene FOSS4G NA 2024

09-11, 10:30–11:00 (America/Chicago), Grand B

This talk presents a new approach for making GeoParquet queryable: an Apache Lucene codec that makes the data searchable with the spatial indexing capabilities of Apache Lucene and Lucenia.

As storage costs continue to rise, the geospatial community is looking for new ways to integrate with data processing tools and make its data more queryable. This talk presents a new approach: an Apache Lucene codec that reads GeoParquet and makes the data searchable with the spatial indexing capabilities of Apache Lucene and Lucenia. We will demonstrate how to create a GeoParquet file with spatial indexing enabled and how to query that file using Apache Lucene. We will also discuss the performance implications of spatial indexing with GeoParquet and how it can make geospatial queries more efficient.

Many users have consolidated data storage into Parquet, Avro, and Iceberg formats; however, these formats are not always easy to query. Tools like DuckDB work well for small-scale use cases, but larger datasets call for a more robust solution. For those already running a search cluster or search solution, extending Apache Lucene to support new on-disk formats is a natural way to make that data interoperate with the existing search solution. Conversely, some users will adopt new software only if it supports their existing storage formats.

Some familiarity with Apache Lucene may be needed, but it is already a core component of numerous search engines used daily. Apache Lucene is a robust, high-performance text search engine library written entirely in Java. Its versatility makes it suitable for applications that require comprehensive full-text search across different platforms, and it is used extensively in search engines, recommendation systems, and data analytics platforms. Lucene serves as the foundational indexing component in popular projects like Apache Solr, Elasticsearch, OpenSearch, and Lucenia. Its advanced features, such as spatial indexing and geospatial queries, make it particularly well suited for geospatial applications in the open-source community.

Apache Parquet is a columnar storage format widely used in the big data ecosystem. It is a popular choice for storing data in data lakes and warehouses. Parquet is efficient for storing and querying large datasets because it lays data out column by column, which allows each column to be compressed and encoded efficiently. Parquet is also popular for geospatial data because it supports complex data types like arrays and maps, which are common in geospatial datasets.
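To make the columnar point concrete, here is a minimal Python sketch. It is an illustration only, not Parquet's actual reader or writer: the sample rows and the helper names `to_columns`, `run_length_encode`, and `run_length_decode` are hypothetical. It shows how pivoting rows into columns puts repeated values next to each other, where a simple scheme such as run-length encoding can collapse them.

```python
from itertools import groupby

# Hypothetical sample records, stored row by row: (city, country).
rows = [
    ("Austin", "US"),
    ("Dallas", "US"),
    ("St. Louis", "US"),
    ("Toronto", "CA"),
    ("Ottawa", "CA"),
]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


cities, countries = to_columns(rows)
encoded = run_length_encode(countries)

# Five country values are stored as two (value, count) pairs.
print(len(countries), "values stored as", len(encoded), "runs")
```

In the row layout the country values are interleaved with city names, so there are no runs to collapse; the column layout is what exposes them.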

GeoParquet is an extension of Parquet that adds support for geospatial data types like points, lines, and polygons. It is a popular choice for geospatial data because it stores and queries that data efficiently. One limitation of GeoParquet is that it does not support spatial indexing, so a spatial query has to examine far more data than it ultimately returns.
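The cost of querying without a spatial index can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the approach presented in the talk and not Lucene's data structure (Lucene indexes points with BKD trees): a uniform grid stands in for a real spatial index, and the cell size, sample points, and function names are all hypothetical.

```python
from collections import defaultdict
from math import floor

CELL_DEGREES = 1.0  # assumed grid cell size, chosen only for this illustration


def cell_of(lon, lat):
    """Return the grid cell (column, row) that contains a coordinate."""
    return (floor(lon / CELL_DEGREES), floor(lat / CELL_DEGREES))


def build_grid_index(points):
    """Map each grid cell to the ids of the points that fall inside it."""
    index = defaultdict(list)
    for point_id, (lon, lat) in enumerate(points):
        index[cell_of(lon, lat)].append(point_id)
    return index


def in_box(point, box):
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def full_scan(points, box):
    """Without an index, every point must be tested against the box."""
    return [i for i, point in enumerate(points) if in_box(point, box)]


def indexed_query(points, index, box):
    """With an index, only points in cells overlapping the box are tested."""
    min_lon, min_lat, max_lon, max_lat = box
    min_cx, min_cy = cell_of(min_lon, min_lat)
    max_cx, max_cy = cell_of(max_lon, max_lat)
    hits = []
    for cx in range(min_cx, max_cx + 1):
        for cy in range(min_cy, max_cy + 1):
            for point_id in index.get((cx, cy), []):
                if in_box(points[point_id], box):
                    hits.append(point_id)
    return sorted(hits)


# Approximate (lon, lat) pairs: St. Louis, Chicago, Kansas City, Austin.
points = [(-90.20, 38.63), (-87.63, 41.88), (-94.58, 39.10), (-97.74, 30.27)]
index = build_grid_index(points)
box = (-91.0, 38.0, -87.0, 42.0)  # min_lon, min_lat, max_lon, max_lat

print(full_scan(points, box))
print(indexed_query(points, index, box))
```

Both functions return the same ids, but the full scan tests every point while the indexed query visits only the cells that overlap the box. That gap grows with dataset size, which is the motivation for pairing GeoParquet with an index.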

In this talk, we address that limitation with an Apache Lucene codec for GeoParquet, which makes the data searchable using Apache Lucene and Lucenia's spatial indexing capabilities. The demonstration covers creating a GeoParquet file with spatial indexing enabled and querying it through Apache Lucene, followed by a discussion of the performance implications and the efficiency gains for geospatial queries.

Wes Richardet
