Carlos Eduardo Mota
Geoscience Researcher, Geologist PhD, Head of the Data Science Group of the Geoscientific Infrastructure Directorship of the Geological Survey of Brazil. 15 years of experience in Python and several FOSS4G tools (MapServer, GeoServer, PostGIS, GeoNode).
Sessions
The IS_Agro project is an initiative focused on the critical evaluation and adaptation of methodologies designed in global forums for application in the national context, through the development of new agro-socio-environmental metrics and indicators (IASs) that aim to provide a more accurate and authentic representation of the agricultural landscape in the national territory. IASs are measures used to monitor and evaluate agricultural performance in its social, economic and environmental aspects, and are therefore of great importance in guiding more sustainable political strategies and agricultural practices, whether by public or private entities, serving “to evaluate the performance of agriculture in terms of its environmental, social and economic performance, providing comparative data and information between federative entities or countries, among several other applications” (EMBRAPA SOLOS, 2023). In this project, IASs are developed by different teams specialized in the proposed themes, whose work is first approved and published in the scientific arena. To automate data collection, allocation, calculation and the constant updating of the IASs, a team called the Digital Module develops solutions for each indicator, transforming them into digital algorithms. Structured, semi-structured and unstructured registry data are collected and stored in a data lakehouse, which requires careful organization within the repository so that the data is always available and easily accessible. The medallion architecture was adopted, which allocates data in three layers with different purposes, while an open-source platform is used for pipeline management and automation.
The conception of this project as a digital platform linked to the Brazilian Agricultural Observatory aims to publish indicators and parameters derived from well-founded technical and scientific data, capable of evaluating the effective performance of the national agricultural sector at the municipal or state level, contributing to sectoral policies and to planning and management processes aimed at building sustainable agriculture and positioning the country correctly on the international scene. Thus, the general objective is to develop an intelligent environment that automates and manages the IAS pipelines in a data storage organization based on the medallion architecture, serving as the foundation of the dashboard that publishes the indicators.
A data pipeline is a succession of connected phases that enable the collection, storage, modification, analysis, and representation of data, with the purpose of acquiring meaningful insights and supporting informed choices (CALANCA, 2023). A data lakehouse, the destination of the project pipelines, is “like a modern data platform built from a combination of a data lake and a data warehouse” (ORACLE CLOUD INFRASTRUCTURE, 2023), using “the flexible storage of unstructured data from a data lake and the management capabilities and tools of data warehouses, and then strategically deploying them together as a larger system” (ORACLE CLOUD INFRASTRUCTURE, 2023). The medallion architecture is a sequential data storage structure that logically organizes the data in the lakehouse, aiming to incrementally and progressively improve the structure and quality of the data as it flows through the three layers of the architecture (ARQUITETURA medallion, 2024). The terms bronze (raw data from the source), silver (transformed and validated data), and gold (refined and enriched data ready for use in projects) describe the quality of the data along the process (SKAYA et al., 2024). Pipeline management is performed by Apache Airflow (version 2.4.4), an open-source platform for developing, scheduling, and monitoring batch-oriented workflows, based on the Python programming language, which allows workflows to be connected to virtually any technology (WHAT is Airflow™?, 2023). The Airflow execution environment was structured with Docker, an open-source platform for creating and managing containers: modular, lightweight units that bundle only the essentials for execution. The developed image is available on GitHub.
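As an illustration of how such a pipeline can be declared in Airflow 2.x, consider the minimal sketch below; the DAG id, schedule, task names and task bodies are hypothetical placeholders, not the project's actual code.

```python
# Minimal Airflow 2.x sketch of a monthly medallion pipeline.
# DAG id, task names and bodies are hypothetical placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@monthly", start_date=datetime(2024, 1, 1), catchup=False)
def ias_indicator_pipeline():
    @task
    def extract_bronze() -> str:
        # Download the raw source file and keep its original format (bronze).
        return "/lakehouse/bronze/source.csv"

    @task
    def transform_silver(raw_path: str) -> str:
        # Clean and validate, write the tabular structure as parquet (silver).
        return "/lakehouse/silver/source.parquet"

    @task
    def load_gold(silver_path: str) -> None:
        # Apply the indicator calculations and export to PostgreSQL (gold).
        ...

    load_gold(transform_silver(extract_bronze()))


ias_indicator_pipeline()
```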
The routines are planned to run once a month (to be confirmed). Raw data is collected by download, keeping its original format, and a hash of each file is recorded so that changes at the source can be detected and the file downloaded again when it changes. This data is then cleaned and processed as needed. At the end of the silver phase, a tabular structure is created with geocode (integer, IBGE code of municipalities or states), date (timestamp, ISO 8601), source (text) and value (floating point, real number) and saved in the data lakehouse as .parquet, an open-source columnar storage format designed for highly compressed storage and efficient data retrieval, providing improved performance for handling complex data in bulk (OVERVIEW, 2022). The .parquet files saved in the data lake are available for use in the gold tier with one-to-many cardinality. In this last phase of the architecture, the necessary calculations are performed for each source of the indicators (some sources require no calculation). The process ends with the export of the gold data to tables in a project database in PostgreSQL, ready for use by an internally developed API that provides the data both for the dashboard to be developed (by another team) and for publication to society through the project website.
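A minimal sketch of the bronze-to-silver step described above might look like the following; the helper names are hypothetical, and pandas with pyarrow is assumed for the .parquet output.

```python
# Hash-based change detection for raw downloads and the silver-layer schema.
# Helper names are illustrative; pandas + pyarrow assumed for parquet output.
import hashlib
from pathlib import Path

import pandas as pd


def file_sha256(path: Path) -> str:
    """Hash a raw file so the pipeline can detect a change at the source."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_silver(records: list[dict], out: Path) -> None:
    """Persist the silver table: geocode, date, source and value columns."""
    df = pd.DataFrame(records)
    df["geocode"] = df["geocode"].astype("int64")  # IBGE municipality/state code
    df["date"] = pd.to_datetime(df["date"])        # ISO 8601 timestamp
    df["source"] = df["source"].astype("string")
    df["value"] = df["value"].astype("float64")    # real number
    df.to_parquet(out, index=False)
```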
This model has been adjusted and corrected throughout the development of the project in the Digital Module. Being flexible, it is now considered ready to receive any indicator developed by the other teams, as well as to support the development of the dashboard that will publish the results for use by society.
This talk presents case studies of deploying GeoNode, an open-source geospatial content management system, in production environments within two Brazilian government agencies: the Geological Survey of Brazil (SGB) and the Brazilian Federal Police (PF). We'll explore how these agencies have successfully implemented and customized GeoNode to meet their specific needs, addressing common challenges in large-scale FOSS4G deployments.
Key points we'll cover:
SGB's approach:
- Developing a Helm chart for automated GeoNode 4 installation on Red Hat OpenShift
- Addressing security requirements like rootless execution and random UID support
- Implementing autoscaling for most components based on CPU and memory utilization
- Exploring cluster implementation of GeoServer for improved scalability
PF's customizations:
- Creating a dedicated "inteligeo-deploy" repository for enhanced deployment features
- Implementing centralized configuration and logging
- Improving security by separating credentials and using Podman instead of Docker
- Integrating with internal systems and scheduling data updates
We'll discuss the challenges faced, solutions implemented, and lessons learned from both approaches. These case studies demonstrate that FOSS4G solutions like GeoNode are ready for production use in government agencies, providing flexibility, scalability, and security.
By sharing our experiences, we aim to help other organizations successfully deploy GeoNode and other FOSS4G solutions in production environments. We welcome questions and discussions on best practices for large-scale FOSS4G implementations.
The Geological Survey of Brazil (SGB) is undergoing a digital transformation process. One of the pillars of this process involves the speed, scalability, security and availability of the data produced. Furthermore, CPRM is creating a favorable environment for the adoption of cloud-based architecture.
This paper presents an overview of a newly developed Geological Spatial Data Infrastructure for the SGB. GeoSGB is the main source for internal mapping, infographic and dashboard applications. It will assimilate legacy services that run on isolated map servers, such as a self-hosted node of the OneGeology OGC services.
The solution adopted is GeoNode 4.2, which brings together map and data services, a metadata catalog and a spatial database. It is free and open-source software with a very active community. In addition, GeoNode has a good content management system and a rich API, and it is fully customizable. To meet data access demands, it was customized to run in Kubernetes-based environments, and each mapped area produces its own geoservice, exportable to different formats such as shapefile and GeoTIFF.
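As an example of what this exportability looks like in practice, the sketch below requests a layer as a zipped shapefile through the underlying WFS endpoint; the host and layer name are placeholders, and "SHAPE-ZIP" is the output format offered by the GeoServer instance behind GeoNode.

```python
# Sketch: export a published layer as a zipped shapefile via WFS.
# Host and layer name are placeholders; SHAPE-ZIP is GeoServer's format name.
import requests

params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "geonode:example_layer",  # placeholder layer name
    "outputFormat": "SHAPE-ZIP",
}
resp = requests.get(
    "https://geosgb.example.org/geoserver/ows", params=params, timeout=60
)
resp.raise_for_status()
with open("example_layer.zip", "wb") as f:
    f.write(resp.content)
```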
However, a complete separation between the production and publishing environments became necessary. SGB's production pipeline is composed of internally developed data management software. Some of these systems are being modernized, with updates to business rules, frameworks and security. GIS work is carried out in ArcGIS Enterprise®, with some exceptions in QGIS and GeoServer. Given this background, it should be considered a hybrid GIS model.
Regarding database structures, a harmonization process was necessary, mainly for those produced with proprietary GIS. For legacy reasons, the proprietary structures were maintained as long as they could be exported to OGC WKT or WKB. Exported geometries are analyzed for compliance with the Simple Features standard (OGC/ISO 19125). The information eligible for publication was consolidated into database views and is replicated verbatim to GeoSGB by script.
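The compliance check can be illustrated with Shapely 2.x, as in the sketch below; it is illustrative only, not the SGB's actual validation code.

```python
# Sketch: validate a geometry exported as OGC WKT against the Simple
# Features validity rules, using Shapely 2.x. Illustrative code only.
from shapely import from_wkt
from shapely.validation import explain_validity


def check_simple_features(wkt: str) -> tuple[bool, str]:
    """Return (is_valid, reason) for a geometry exported as WKT."""
    geom = from_wkt(wkt)
    return geom.is_valid, explain_validity(geom)


# A self-intersecting "bow-tie" polygon fails the check:
ok, reason = check_simple_features("POLYGON ((0 0, 1 1, 1 0, 0 1, 0 0))")
print(ok, reason)  # False, with a "Self-intersection" explanation
```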
Metadata production for the continuous databases is carried out semi-automatically, from templates, in accordance with the mapping program. This is possible by integrating GeoNode's APIs with the internal databases, delivering the associated metadata and resources directly to the authors. Contact with (meta)data authors is managed through GeoNode.
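A hypothetical sketch of this integration, assuming GeoNode's v2 REST API and token authentication; the host, token, endpoint usage and field values are placeholders, not SGB's actual setup.

```python
# Sketch: push templated metadata to a GeoNode resource via the v2 REST API.
# Host, token and payload values are placeholders, not SGB's actual setup.
import requests

GEONODE_URL = "https://geosgb.example.org"  # placeholder host
TOKEN = "..."                               # token auth assumed


def update_metadata(resource_pk: int, title: str, abstract: str) -> None:
    """Patch the title and abstract of a published resource from a template."""
    resp = requests.patch(
        f"{GEONODE_URL}/api/v2/resources/{resource_pk}/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"title": title, "abstract": abstract},
        timeout=30,
    )
    resp.raise_for_status()
```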
The symbolization of thematic layers involved the development of interoperable symbol libraries, based on SVG glyphs embedded in OpenType fonts (ISO/IEC 14496-22:2007), with near-identical rendering across different multi-platform GIS software.
Data and metadata pipelines were implemented as Python scripts, with specific libraries built around the GeoNode APIs. Apache Airflow manages the entire process of extracting the internal databases, running quality tests and structure analysis, and loading into the GeoSGB database server, and is also responsible for notification activities.
GeoSGB is now a continuously developed platform, focused on increasing the quality of the data delivered to customers.
Future perspectives involve turning this transformation itself into a research line in geotechnologies and high-performance IT services. It should involve plug-in development for data management, processing and visualization, including the use of artificial intelligence. In operational terms, the adoption of OGC APIs, data internationalization and harmonization, together with the adoption of domain-specific OGC standards such as GeoSciML and WaterML, will contribute to making SGB a global supplier of geoscientific data.