Marco Minghini
Dr. Marco Minghini obtained a BSc degree (2008), an MSc degree (2010) and a PhD degree (2014) in Environmental Engineering at Politecnico di Milano, Italy. From 2014 to 2018 he was a Postdoctoral Research Fellow at the GEOlab of Politecnico di Milano. Since 2018 he has worked as a Scientific Project Officer at the European Commission - Joint Research Centre (JRC) in Ispra, Italy, focusing on (geo)data interoperability, sharing and standardisation in support of European data spaces, and contributing to the operation and evolution of the INSPIRE infrastructure. He is an advocate of open source software and open data: an OSGeo Charter Member since 2015, an active OpenStreetMap contributor, a Voting Member of the Humanitarian OpenStreetMap Team, and Chair of the ISPRS ICWG "Openness in Geospatial Science and Remote Sensing". He is a regular participant and presenter at global and local FOSS4G events, was Secretary of FOSS4G Europe 2015, and Co-chair of the Academic Track at FOSS4G 2022 and 2023 and at FOSS4G Europe 2024.
Sessions
Building footprints (hereinafter buildings) represent key geospatial datasets for several applications, including city planning, demographic analyses, modelling of energy production and consumption, disaster preparedness and response, and digital twins. Traditionally, buildings are produced by governmental organisations as part of their cartographic databases, with coverage ranging from local to national and licensing conditions being heterogeneous and not always open. This makes it challenging to derive open building datasets at a continental or global scale. Over the last decade, however, unparalleled developments in the resolution of satellite imagery, in artificial intelligence techniques and in citizen engagement in geospatial data collection have enabled the creation of several building datasets available at least at a continental scale under open licenses.
In this work, we analyse four such open building datasets. The first is the building dataset extracted from the well-known OpenStreetMap (OSM, https://www.openstreetmap.org) crowdsourcing project, which creates and maintains a database of the whole world released under the Open Database License (ODbL). OSM buildings are typically derived from the digitisation of high-resolution satellite imagery and, in some cases, from the import of other databases with ODbL-compatible licenses. The second dataset is EUBUCCO (https://eubucco.com), a pan-European building database produced by a research team at the Technical University Berlin by merging different input sources: governmental datasets when available and open, and OSM otherwise [1]. EUBUCCO is mostly licensed under the ODbL, with the only exceptions being two regions in Italy and the Czech Republic. The third dataset is Microsoft Open Building Footprints (MS, https://github.com/microsoft/GlobalMLBuildingFootprints), extracted by applying machine learning to high-resolution Bing Maps satellite imagery acquired between 2014 and 2023, available at the global scale and also licensed under the ODbL. The fourth dataset, called Digital Building Stock Model (DBSM), was produced by the Joint Research Centre (JRC) of the European Commission to support energy-related studies. It is an ODbL-licensed pan-European dataset produced from the hierarchical conflation of three input datasets: OSM, MS and the European Settlement Map [2].
The objective of this work is to compare the four datasets – which derive from different approaches following heterogeneous processing steps and governance rules – in terms of their geometry (i.e. attributes are out of scope) in order to draw conclusions on their similarities and differences. It is known from the literature that building completeness in OSM (which plays a key role in three out of the four datasets: OSM itself, EUBUCCO and DBSM) varies with the degree of urbanisation [3] and that machine learning applied to satellite imagery (used in MS) may perform differently depending on the urban or rural context [4]. In light of this, we analyse the building datasets according to the degree of urbanisation of their location, using the administrative boundaries provided by Eurostat, which classify each European province as urban, semi-urban or rural (https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units/countries).
We chose five European Union (EU) countries for the analysis: Malta (MT), Greece (EL), Belgium (BE), Denmark (DK) and Sweden (SE). The choice was motivated by the need to: i) select countries of different size and geographical location, which ensures that their national OSM communities differ substantially; ii) select countries with different shares of urban, semi-urban and rural areas; and iii) select two sets of countries for which the input source of EUBUCCO buildings was either a governmental dataset (BE, DK) or OSM (MT, EL, SE), to detect potentially different behaviours.
From a methodological point of view, for each country and degree of urbanisation we first calculated and compared the total number and total area of buildings in all datasets, and examined their statistics through box plots. This was followed by the calculation, for each pair of datasets and degree of urbanisation, of the intersection of the building areas and of its fraction of the total building area of each of the two datasets. Finally, we intersected all four datasets and calculated the fraction of each dataset's area that this intersection represents.
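As an illustration, the pairwise comparison could be implemented along the following lines with GeoPandas; this is a minimal sketch assuming that each dataset has been clipped to the area of interest and loaded as a GeoDataFrame of building polygons in a projected CRS (the file names and the function are illustrative, not the actual code of the study).

```python
import geopandas as gpd

def pairwise_intersection_share(gdf_a, gdf_b):
    """Intersection of the building areas of two datasets and its fraction
    of each dataset's total building area (assumes a projected CRS in metres)."""
    inter = gpd.overlay(gdf_a[["geometry"]], gdf_b[["geometry"]], how="intersection")
    inter_area = inter.geometry.area.sum()
    return {
        "intersection_area_m2": inter_area,
        "fraction_of_a": inter_area / gdf_a.geometry.area.sum(),
        "fraction_of_b": inter_area / gdf_b.geometry.area.sum(),
    }

# Illustrative usage with two hypothetical files clipped to one province
# osm = gpd.read_file("osm_buildings.gpkg").to_crs(3035)
# ms = gpd.read_file("ms_buildings.gpkg").to_crs(3035)
# print(pairwise_intersection_share(osm, ms))
```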
Results show that in urban areas, while the datasets are overall similar in terms of total building area, the total number of buildings is typically higher in EUBUCCO for DK and BE, where the information comes from governmental datasets. This suggests that such datasets outperform OSM in modelling the footprints of individual buildings in the most urbanised areas. In contrast, in semi-urban and rural areas, where OSM traditionally lacks completeness, MS (and, as a consequence, DBSM, which is also based on MS) captures more buildings. This is especially evident in SE, where 94% of the country's area is not urban. When calculating the intersection between building areas for each pair of datasets, across all countries and degrees of urbanisation the area of OSM buildings scores the lowest percentages of intersection when compared to the building areas of the other datasets. The lowest such percentages, equal to 25%, are obtained in the comparison against MS in non-urban areas. EUBUCCO represents an obvious exception in the countries (MT, EL and SE) where it uses OSM. Finally, the dataset for which the intersection of the buildings of all four datasets represents the largest percentage of its area is OSM, with values even higher than 80% for urban areas. This indicates that EUBUCCO and, even more so, DBSM can be considered a sort of ‘OSM extension’ that improves its completeness. In contrast, the lowest values are scored by MS, resulting from its radically different generation process compared to the other datasets.
The whole procedure was written in Python using libraries such as Pandas, Dask-GeoPandas and Plotly. The code is available under the European Union Public License (EUPL) v1.2 at https://github.com/eurogeoss/building-datasets in the form of Jupyter Notebooks. Work is ongoing to extend the analysis to the whole EU in order to validate the results of this study and formulate recommendations at the continental level.
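For instance, the box plots mentioned above can be produced with Plotly along these lines; this is a minimal sketch assuming a Pandas DataFrame with one row per building and illustrative column names, not the actual notebook code.

```python
import pandas as pd
import plotly.express as px

# Hypothetical table with one row per building footprint
df = pd.DataFrame({
    "dataset": ["OSM", "OSM", "MS", "MS", "EUBUCCO", "DBSM"],
    "degree_of_urbanisation": ["urban", "rural", "urban", "rural", "urban", "rural"],
    "area_m2": [120.4, 85.2, 98.7, 64.1, 131.0, 72.5],
})

# One box per dataset, grouped by degree of urbanisation
fig = px.box(
    df, x="degree_of_urbanisation", y="area_m2", color="dataset", log_y=True,
    title="Building footprint areas by dataset and degree of urbanisation",
)
fig.show()
```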
Tenders Electronic Daily (TED) is the platform where all public tenders published by European Union (EU) Member States and European institutions are accessible. With approximately 520,000 public procurement notices published per year, worth more than €420 billion, TED is a cornerstone of EU public procurement. The TED database is available as open data, providing an extremely interesting source for in-depth analysis of public procurement in the EU.
We developed an application that – based on an extraction of the TED database for two years (2021 and 2022) – allows users to: i) automatically label TED documents using GPT; ii) visualise the labels generated by GPT for all documents and manually correct them; iii) use the corrected labels to train a Support Vector Machine (SVM) Machine Learning classifier; and iv) assess the classification accuracy. The application supports an iterative process of re-labelling (using GPT) and re-training the SVM classifier until the expected classification performance is reached and the classifier can be applied to the whole TED dataset. In addition to the progressive improvement of the Machine Learning classifier through this controlled cycle of iterations, the benefits of the approach include user involvement in the correction and enrichment of labels, and flexibility in adapting to the specific needs of the datasets and domain – the latter meaning that applicability is not limited to the TED database. Inclusion of the TED database for 2023 is currently ongoing; similarly, a dedicated UI is under development to provide user-friendly access to the application.
The use case investigates the degree to which EU public procurements are relevant to open source geospatial software, open geospatial standards and open geospatial data. For this purpose, a specific set of keywords was initially listed for each of the three categories; this was then complemented by a series of similar keywords retrieved through a semantic text analysis tool named SeTA (https://seta.jrc.ec.europa.eu, developed in-house at the JRC) and further validated by an expert. The final list of keywords was used to filter the TED documents to be annotated in the first step using GPT. The presentation will show the classification results and shed some light on the relevance of open source geospatial software, open geospatial standards and open geospatial data in EU tenders.
The GPT models used by the application run on a platform created under a special contract signed by the European Commission with Microsoft Azure. The platform, named GPT@JRC, provides internal APIs that can be accessed upon obtaining an authorization token. From Python, users can query the APIs using the OpenAI library, which offers convenient access to the OpenAI REST API from any Python 3.7+ application. GPT@JRC offers several Large Language Models, including 'gpt-35-turbo-0613', 'gpt-35-turbo-16k' and 'gpt-35-turbo-0301'.
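As an illustration, a query to the APIs could look as follows; this is a minimal sketch in which the base URL and the token are placeholders, since the actual GPT@JRC endpoint and authentication details are internal and not reproduced here.

```python
from openai import OpenAI

# Placeholder endpoint and token: the actual GPT@JRC base URL and
# authentication details are internal and not shown here.
client = OpenAI(
    base_url="https://gpt-jrc.example.europa.eu/v1",
    api_key="YOUR_AUTHORIZATION_TOKEN",
)

response = client.chat.completions.create(
    model="gpt-35-turbo-16k",  # one of the models listed above
    messages=[
        {"role": "user",
         "content": "Does the following tender abstract cover open source "
                    "geospatial software? Answer only 'yes' or 'no'.\n\n"
                    "<abstract text>"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```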
More concretely, the GPT models are used to annotate the TED database by asking whether a certain TED document, typically through its abstract, covers a specific topic. The expected response is a simple ‘yes’ or ‘no’. By interacting with the APIs, we retrieve the responses and append them as labels to our documents. This allows us to classify documents automatically, without manually annotating training data. Subsequently, we verify on a sample basis whether the documents have been correctly classified. Following this manual validation phase, as mentioned above, we use the result as input to train an SVM classifier (using the Python scikit-learn library) to determine whether a general rule can distinguish the topic of any TED document from the text of its abstract.
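A minimal sketch of this final step is given below, assuming the GPT-derived and manually verified labels are already available; the toy abstracts and the TF-IDF plus linear SVM pipeline are illustrative choices, not necessarily the exact configuration used in the application.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: TED abstracts with GPT-derived (and manually
# corrected) labels; 1 = relevant to open source geospatial software, 0 = not.
abstracts = [
    "Procurement of an open source GIS platform based on QGIS and PostGIS.",
    "Road resurfacing works on the municipal network.",
    "Implementation of OGC API services with open source components.",
    "Supply of office furniture for the regional administration.",
    "Maintenance of a geoportal built on GeoServer and OpenLayers.",
    "Catering services for school canteens.",
]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    abstracts, labels, test_size=0.33, stratify=labels, random_state=42)

# TF-IDF features of the abstract text feeding a linear SVM classifier
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```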