Semantic annotation and classification of EU tendering data on open geospatial software, standards and data using Large Language Models FOSS4G Europe 2024

2024-07-04 10:30–11:00, Destination Earth (Van46 ring)

Tenders Electronic Daily (TED) is the platform where all public tenders published in European Union (EU) Member States and by European institutions are accessible. With approximately 520,000 public procurement notices published per year, worth more than €420 billion, TED is a cornerstone of EU public procurement. The TED database is available as open data, making it a valuable source for in-depth analysis of public procurement in the EU.
We developed an application that, based on an extraction of the TED database for two years (2021 and 2022), allows users to: i) automatically label TED documents using GPT; ii) visualise the labels generated by GPT for all documents and manually correct them; iii) use the corrected labels to train a Support Vector Machine (SVM) classifier; and iv) assess the classification accuracy. The application supports an iterative process of re-labelling (using GPT) and re-training the SVM classifier until the expected classification performance is reached and the classifier can be applied to the whole TED dataset. In addition to the progressive improvement of the classifier through this controlled cycle of iterations, the benefits of the approach include user involvement in the correction and enrichment of labels, and flexibility in adapting to the specific needs of other datasets and domains, meaning that applicability is not limited to the TED database. Inclusion of the TED database for 2023 is ongoing, and a dedicated UI is under development to provide user-friendly access to the application.
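Steps iii and iv above can be illustrated with a minimal sketch. It assumes scikit-learn, a TF-IDF text representation and a linear SVM; the abstract does not state which features or kernel the application uses, and the function name `train_and_assess` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_and_assess(abstracts, labels, test_size=0.25, seed=0):
    """Train an SVM on corrected labels and report held-out accuracy.

    TF-IDF features and LinearSVC are assumptions made for this sketch,
    not choices confirmed by the abstract.
    """
    # Hold out part of the corrected labels to assess accuracy (step iv).
    x_train, x_test, y_train, y_test = train_test_split(
        abstracts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # Vectorise the abstract text and fit the classifier (step iii).
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(x_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(x_test))
    return model, accuracy
```

In an iterative setting, the returned accuracy would be compared against the target performance; if it falls short, labels are corrected or regenerated and the function is called again.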
The use case investigates the degree to which EU public procurements are relevant to open source geospatial software, open geospatial standards and open geospatial data. To this end, an initial set of keywords was listed for each of the three categories; this was then complemented with similar keywords retrieved through a semantic text analysis tool named SeTA (https://seta.jrc.ec.europa.eu, developed in-house at the JRC) and further validated by an expert. The final list of keywords was used to filter the TED documents to be annotated with GPT in the first step. The presentation will show the classification results and shed some light on the relevance of open source geospatial software, open geospatial standards and open geospatial data in EU tenders.
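The keyword-based pre-filtering described above could look roughly like the following sketch. The keyword lists are hypothetical placeholders (the actual lists are not given in the abstract), and whole-word, case-insensitive matching is an assumption.

```python
import re

# Hypothetical keyword lists, one per category; the real lists were
# expanded with SeTA and validated by an expert.
KEYWORDS = {
    "open_software": ["QGIS", "PostGIS", "GeoServer"],
    "open_standards": ["WMS", "WFS", "GeoPackage"],
    "open_data": ["OpenStreetMap", "Copernicus"],
}


def compile_patterns(keywords):
    """Build one case-insensitive, whole-word regex per category."""
    return {
        category: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for category, terms in keywords.items()
    }


def filter_documents(documents, keywords):
    """Map each matching document ID to the categories whose keywords it contains."""
    patterns = compile_patterns(keywords)
    selected = {}
    for doc_id, text in documents.items():
        hits = [cat for cat, pattern in patterns.items() if pattern.search(text)]
        if hits:
            selected[doc_id] = hits
    return selected
```

Only the documents returned by `filter_documents` would then be passed on for annotation, which keeps the number of model queries manageable.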
The GPT models used by the application run on a platform created under a special contract between the European Commission and Microsoft Azure. The platform, named GPT@JRC, provides internal APIs that can be accessed after obtaining an authorization token. From Python, users can query the APIs with the OpenAI library, which offers convenient access to the OpenAI REST API from any Python 3.7+ application. GPT@JRC offers several Large Language Models, including 'gpt-35-turbo-0613', 'gpt-35-turbo-16k' and 'gpt-35-turbo-0301'.
More concretely, GPT models are used to annotate the TED database by asking whether a given TED document, typically represented by its abstract, covers a specific topic. The expected response is a simple ‘yes’ or ‘no’. By interacting with the APIs, we retrieve the responses and append them as labels to our documents. This allows us to perform unsupervised document classification. Subsequently, we verify on a sample basis whether the documents have been correctly classified. Following this manual validation phase, as mentioned before, we use the result as input to an SVM classifier (using the Python scikit-learn library) to determine whether a general rule can distinguish the topic of any TED document from the text of its abstract.
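The yes/no annotation step can be sketched as follows. The prompt wording, the function names and the `ask` callable are all hypothetical: `ask` stands in for whatever function sends a prompt to the GPT@JRC API and returns the reply text, since the actual client code is not shown in the abstract.

```python
from typing import Callable, Dict, Optional

# Hypothetical prompt; the wording used by the authors is not given.
PROMPT_TEMPLATE = (
    "Does the following tender abstract concern {topic}? "
    "Answer with a single word: yes or no.\n\nAbstract: {abstract}"
)


def parse_answer(raw: str) -> Optional[str]:
    """Normalise a model reply to 'yes' or 'no'; return None if it is neither."""
    cleaned = raw.strip().strip(".!'\"‘’").lower()
    return cleaned if cleaned in ("yes", "no") else None


def label_documents(
    abstracts: Dict[str, str], topic: str, ask: Callable[[str], str]
) -> Dict[str, Optional[str]]:
    """Ask the model about each abstract and attach the parsed reply as a label."""
    labels = {}
    for doc_id, abstract in abstracts.items():
        prompt = PROMPT_TEMPLATE.format(topic=topic, abstract=abstract)
        labels[doc_id] = parse_answer(ask(prompt))
    return labels
```

Replies that are neither 'yes' nor 'no' are stored as `None`, so they can be flagged for the manual correction step rather than silently treated as a label.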

See also: Presentation pdf (6.2 MB)

Marco Minghini

Dr. Marco Minghini obtained a BSc degree (2008), an MSc degree (2010) and a PhD degree (2014) in Environmental Engineering at Politecnico di Milano. From 2014 to 2018 he was a Postdoctoral Research Fellow at the GEOlab of Politecnico di Milano, Italy. Since 2018 he has worked as a Scientific Project Officer at the European Commission - Joint Research Centre (JRC) in Ispra, Italy, focusing on (geo)data interoperability, sharing and standardisation in support of European data spaces, and contributing to the operation and evolution of the INSPIRE infrastructure. He is an advocate of open source software and open data. He has been an OSGeo Charter Member since 2015 and is an active OpenStreetMap contributor, a Voting Member of the Humanitarian OpenStreetMap Team and Chair of the ISPRS ICWG "Openness in Geospatial Science and Remote Sensing". He is a regular participant and presenter at global and local FOSS4G events, and has served as Secretary of FOSS4G Europe 2015 and Co-chair of the Academic Track at FOSS4G 2022 and 2023 and FOSS4G Europe 2024.

This speaker also appears in:

Pan-European open building footprints: analysis and comparison in selected countries
