BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//talks.osgeo.org//foss4g-europe-2026//talk//SAADFE
BEGIN:VTIMEZONE
TZID:EET
BEGIN:STANDARD
DTSTART:20001029T050000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:EET
TZOFFSETFROM:+0300
TZOFFSETTO:+0200
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T040000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:EEST
TZOFFSETFROM:+0200
TZOFFSETTO:+0300
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-foss4g-europe-2026-SAADFE@talks.osgeo.org
DTSTART;TZID=EET:20260629T150000
DTEND;TZID=EET:20260629T153000
DESCRIPTION:Metadata are a foundational component of any data infrastructur
 e\, as they enable dataset discovery\, evaluation\, and reuse. Metadata cr
 eation and maintenance\, however\, is typically a costly\, inconsistent\, 
 and largely manual process. In the European context\, this bottleneck is 
 visible in major data sharing initiatives such as the INSPIRE Directive an
 d the Common European Data Spaces\, where high-quality\, interoperable met
 adata are a legal requirement as well as a precondition for data market pa
 rticipation.\nTwo main challenges affect the creation and maintenance of g
 eospatial metadata. The first is schema heterogeneity\, since existing sch
 emas (including ISO 19115\, DCAT\, GeoDCAT-AP\, Dublin Core\, INSPIRE prof
 iles) show partially overlapping semantics but incompatible serialisations
 . Migration between schemas is typically performed through hand-crafted tr
 ansformations encoding explicit field-to-field mappings: brittle\, schema-
 specific artefacts requiring specialist maintenance. The second challenge 
 is content inconsistency: even within a single schema\, records produced b
 y different operators (including within the same organisation) may exhibit
  inconsistent terminology\, structure\, and level of detail in free-text f
 ields – which undermines catalogue-level discoverability.\nGenerative 
 Artificial Intelligence (GenAI) and Large Language Models (LLMs) in partic
 ular offer promising capabilities to address these challenges. AI-assist
 ed metadata management (acquisition\, cleaning\, verification and maintena
 nce) has already been explored in the literature [1]\, but to our knowledg
 e not in the case of geospatial datasets. Existing work includes\, for exa
 mple\, AI support for digitising libraries\, classifying research resourc
 es\, and creating website content. Whatever the purpose\, research has fo
 cused on how
  to instruct LLMs to produce structured outputs for high-quality metadata 
 [2] or to address the limitations of metadata schemas\, particularly free-
 text attributes whose definitions are often self-referential or provide in
 sufficient contextual meaning for GenAI to process [3].\nThis work extends
  such literature by proposing an open-source\, AI-powered geospatial metad
 ata editor. The core design principle is schema agnosticism: output schema
 s are defined as declarative configuration files\, and any new schema can 
 be registered without modifying the application code. Field extraction and
  generation are driven by attribute type classification\, independently of
  field names or schema-specific logic\, with GeoDCAT-AP 3.0.0 as the curre
 nt default.\nThe tool recognises three different types of metadata attribu
 tes and tackles them differently. Type A attributes\, such as title\, keyw
 ords\, and description\, are characterised by a free-text nature and as su
 ch\, they are suitable for automatic generation using LLMs and prompts wit
 h contextual information. Type B are technical attributes such as spatial 
 resolution\, bounding box\, file extension\, etc. that can be automaticall
 y extracted from the dataset by standard functions. Finally\, Type C are o
 rganisational attributes\, such as publisher\, contact point and licensing
 : these can usually be reused across several records within the same organ
 isation.\nInput handling is multimodal. For extraction of Type B attribute
 s\, the tool accepts native geospatial datasets processed via GDAL/OGR\, o
 r the URL to the OGC service serving the dataset. As a third option\, usef
 ul for large datasets\, the textual output of the gdalinfo or ogrinfo insp
 ection utilities can be provided directly. Type A attributes are generated
  using accompanying documentation on the relevant dataset\, such as techni
 cal reports and scientific publications\, which are indexed as corpus cont
 ent. The generation phase implements Retrieval-Augmented Generation (RAG):
  semantically oriented queries are issued against the vector corpus for ea
 ch free-text attribute\, results are ranked by relevance\, and a context p
 assage is assembled to inform generation. The LLM component is any OpenAI-
 compatible inference endpoint\, supporting both cloud-hosted and fully loc
 al deployments.\nThe tool supports three modes of operation through a unif
 ied processing pipeline\, with differences arising solely from the inputs 
 provided. In the creation mode\, the user supplies a geospatial dataset a
 nd supporting documentation. Technical (Type B) attributes are populated d
 eterministically from the dataset\; publisher’s information (Type C attr
 ibutes) is parsed from the structured publisher document provided as input
 . Finally\, by querying the assembled corpus and including the Type B and 
 C metadata attributes\, the LLM generates free-text fields (Type A) such a
 s title\, description\, keywords\, and provenance. A shared\, versioned pr
 ompt template encodes conventions – expected content sections\, field or
 der\, and level of specificity – applied consistently across all generat
 ed metadata records. This template functions as a content harmonisation in
 strument: metadata produced by different operators for different datasets 
 converge on a common descriptive structure\, improving catalogue-level dis
 coverability without requiring ad-hoc normalisation. The use of a template
  and of different prompting techniques to reduce model hallucination and i
 mprove harmonisation across datasets has been investigated in an article c
 urrently under review [4].\nIn the schema migration mode\, the user addit
 ionally supplies an existing metadata record in any supported structured f
 ormat. The legacy record is parsed into field candidates and indexed in th
 e retrieval corpus. During generation\, the LLM receives the legacy metada
 ta as contextual input\; the mapping from source schema to target schema f
 ields emerges from the model's semantic knowledge of both standards rathe
 r than from explicit transformation rules\, generalising to schema pairs f
 or which no hand-crafted converter exists.\nFinally\, in the enrichment mo
 de\, the user supplies a partial record or publisher documentation declar
 ing organisational fields such as publisher\, license\, and contact point
 \; these are incorporated at extraction time\, while remaining free-text f
 ields are generated and harmonised through the prompt template.\nIn all th
 ree modes\, the LLM is invoked only during the generation phase and only u
 pon explicit user action\, preserving a human-in-the-loop validation step 
 before the export.\nSource code was submitted for intellectual property an
 d security screening for release under the open-source European Union Publ
 ic Licence (EUPL) according to organisational policy. Schema definitions
  and prompt templates are expressed as human-readable declarative files\, 
 versioned and customisable without modifying application code.\nThe empir
 ical evaluation in [4] is currently bounded by the institutional homogenei
 ty of JRC-produced metadata\; future research will address this limitation
  by incorporating records from Member State authorities engaged in the INS
 PIRE GeoDCAT-AP Pilot [5]\, enabling a more representative assessment of t
 he tool's harmonisation capacity across heterogeneous producer communities
 .
DTSTAMP:20260605T020624Z
LOCATION:A01
SUMMARY:An Open-Source AI-Powered Geospatial Metadata Editor for Schema-Agn
 ostic Generation\, Migration\, and Content Harmonisation - Margherita Di L
 eo
URL:https://talks.osgeo.org/foss4g-europe-2026/talk/SAADFE/
END:VEVENT
END:VCALENDAR
