The Digital Module of the IS_Agro Project: Using the medallion architecture as a basis for automating pipeline execution routines in Apache Airflow
12-05, 16:00–16:30 (America/Belem), Room I

The IS_Agro project is an initiative focused on the critical evaluation and adaptation of methodologies designed in global forums for application in the Brazilian context, through the development of new agro-socio-environmental metrics and indicators (IASs) that aim to provide a more accurate and authentic representation of the agricultural landscape in the national territory. IASs are measures used to monitor and evaluate agricultural performance in its social, economic, and environmental dimensions, and are therefore important in guiding more sustainable policies and agricultural practices, whether adopted by public or private entities, serving “to evaluate the performance of agriculture in terms of its environmental, social and economic performance, providing comparative data and information between federative entities or countries, among several other applications” (EMBRAPA SOLOS, 2023). In this project, IASs are developed by different teams specialized in the proposed themes, whose work has previously been approved and published in the scientific literature. To automate data collection, storage, calculation, and continuous updating of the IASs, a team called the Digital Module develops solutions for each indicator, turning them into digital algorithms. Structured, semi-structured, and unstructured source data are collected and stored in a data lakehouse, which requires careful organization of the repository so that the data are always available and easily accessible. The medallion architecture was adopted, which allocates data across three layers with different purposes, while an open-source platform was used for pipeline management and automation.

Conceived as a digital platform linked to the Brazilian Agricultural Observatory, the project aims to publish indicators and parameters derived from well-founded technical and scientific data, capable of evaluating the effective performance of the national agricultural sector at the municipal or state level, thereby contributing to sectoral policies and to planning and management processes aimed at building sustainable agriculture and positioning the country correctly on the international scene. The general objective is thus to develop an intelligent environment that automates and manages the IAS pipelines in a data storage environment organized according to the medallion architecture, serving as the basis of the data panel that publishes the indicators.

A data pipeline is a succession of connected phases that enable the collection, storage, modification, analysis, and representation of data, with the purpose of acquiring meaningful insights and supporting informed choices (CALANCA, 2023). A data lakehouse, the destination of the project pipelines, is “like a modern data platform built from a combination of a data lake and a data warehouse” (ORACLE CLOUD INFRASTRUCTURE, 2023), using “the flexible storage of unstructured data from a data lake and the management capabilities and tools of data warehouses, and then strategically deploying them together as a larger system” (ORACLE CLOUD INFRASTRUCTURE, 2023). The medallion architecture structures data storage sequentially so as to organize the data in the lakehouse logically, incrementally and progressively improving the structure and quality of the data as it flows through the three layers of the architecture (ARQUITETURA medallion, 2024). The terms bronze (raw data from the source), silver (transformed and validated data), and gold (refined and enriched data ready for use in projects) describe the quality of the data at each stage (SKAYA et al., 2024). Pipeline management is performed by Apache Airflow (version 2.4.4), an open-source platform for developing, scheduling, and monitoring batch-oriented workflows, based on the Python programming language, which allows workflows to be connected to virtually any technology (WHAT is Airflow™?, 2023). The Airflow execution environment was structured in Docker, an open-source platform for creating and managing containers as modular virtual machines that contain only what is essential for execution. The image developed is available on GitHub.
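The bronze → silver → gold flow described above can be sketched in plain Python. This is an illustrative example, not the project's actual code: the function names and sample records are hypothetical, and in the real pipeline each step would be an Airflow task (e.g. a `PythonOperator`) inside a monthly DAG.

```python
# Hypothetical sketch of the three medallion layers as chained functions.

def bronze_ingest(raw_records):
    """Bronze layer: keep the data exactly as collected from the source."""
    return list(raw_records)  # no transformation; raw fidelity is preserved

def silver_transform(bronze_records):
    """Silver layer: clean and validate, producing the standard tabular
    schema (geocode, date, source, value)."""
    silver = []
    for rec in bronze_records:
        if rec.get("value") is None:          # drop invalid rows
            continue
        silver.append({
            "geocode": int(rec["geocode"]),   # IBGE municipality/state code
            "date": rec["date"],              # ISO 8601 timestamp string
            "source": str(rec["source"]),
            "value": float(rec["value"]),
        })
    return silver

def gold_compute(silver_records):
    """Gold layer: indicator calculation; here, a simple sum per geocode
    (a stand-in for whatever calculation each indicator source requires)."""
    totals = {}
    for rec in silver_records:
        totals[rec["geocode"]] = totals.get(rec["geocode"], 0.0) + rec["value"]
    return totals

raw = [
    {"geocode": "1501402", "date": "2024-01-01T00:00:00", "source": "IBGE", "value": "10.5"},
    {"geocode": "1501402", "date": "2024-02-01T00:00:00", "source": "IBGE", "value": "4.5"},
    {"geocode": "1500107", "date": "2024-01-01T00:00:00", "source": "IBGE", "value": None},
]
gold = gold_compute(silver_transform(bronze_ingest(raw)))
print(gold)  # {1501402: 15.0}
```

The key design point the layers encode is separation of concerns: the bronze step never alters source data, so any silver or gold logic can be re-run from scratch without re-downloading.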
The routines are planned to run once a month (schedule to be confirmed). Raw data are collected by download, keeping their original format; a hash of each file is recorded so that updates can be detected and the file downloaded again when the source changes. These data are then cleaned and processed as needed. At the end of the silver phase, a tabular structure is created with the columns geocode (integer, IBGE code of municipalities or states), date (timestamp, ISO 8601), source (text), and value (floating point, real number), and is saved in the data lakehouse as .parquet, an open-source columnar storage format designed for highly compressed storage and efficient data retrieval, providing improved performance for handling complex data at scale (OVERVIEW, 2022). The .parquet files saved in the data lake are available for use in the gold tier with one-to-many cardinality. In this last phase of the architecture, the calculations required for each indicator source are performed (some sources require no calculation). The process ends with the export of the gold data to tables in a PostgreSQL project database, ready for use by an internally developed API that serves the data panel to be developed (by another team) and published to society through the project website.
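The hash-based update check mentioned above can be sketched as follows. The helper names and the in-memory digest store are assumptions for illustration; the project may use a different hash algorithm or persistence scheme. The idea is that a digest of each downloaded file is kept, and a new copy is only ingested into the bronze layer when the digest differs from the stored one.

```python
import hashlib

def file_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of a downloaded file's bytes."""
    return hashlib.sha256(content).hexdigest()

def needs_refresh(content: bytes, stored_digests: dict, name: str) -> bool:
    """True when the source file changed since the last monthly run."""
    digest = file_digest(content)
    if stored_digests.get(name) == digest:
        return False               # unchanged: skip re-ingestion
    stored_digests[name] = digest  # record the new state
    return True

digests = {}
first = needs_refresh(b"geocode;date;source;value", digests, "source.csv")
again = needs_refresh(b"geocode;date;source;value", digests, "source.csv")
changed = needs_refresh(b"geocode;date;source;value\nnew rows", digests, "source.csv")
print(first, again, changed)  # True False True
```

Comparing digests rather than timestamps makes the monthly run idempotent: a source republished without modification produces the same hash and triggers no downstream work.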

This model has been adjusted and corrected by the Digital Module throughout the development of the project. Being flexible, it is now considered ready to receive any indicator developed by the other teams, as well as to support the development of the data panel that will publish the results to society.

See also: Apresentação - Presentation (406.9 KB)

Geoscience Researcher, PhD Geologist, Head of the Data Science Group of the Geoscientific Infrastructure Directorship of the Geological Survey of Brazil. 15 years of experience with Python and several FOSS4G tools (MapServer, GeoServer, PostGIS, GeoNode).
