09-10, 14:00–14:30 (America/Chicago), Grand G
This research presents a lean AI/ML approach using open-source resources to optimize geospatial-focused AI training data. It employs iterative human-machine teaming and quantitative analysis to enhance data diversity and label quality, enabling efficient and effective GeoAI.
Successful AI and machine learning depend on high-quality training data, with data diversity and quality significantly impacting model performance. Traditional data labeling methods often lack rigor, leading to the selection of redundant or irrelevant data and resulting in poor model performance. Our research promotes a 'lean AI/ML' approach that leverages open-source tools and introduces an open methodology to quantitatively optimize data diversity and label quantity, thus improving ML model performance.
In geospatial applications, as in many traditional workflows, the amount of data needed for labeling is often determined subjectively, leading to unnecessary resource use. We aim to present ways for organizations to address this issue by using open data, open-source tools, and open models to experiment with and apply AI effectively without massive upfront financial investment.
Our methodology addresses these challenges with an iterative human-machine approach to computer vision (CV). We incorporate model-predicted labels and capture human adjustments for individual object classes to systematically optimize the diversity and quantity of labels needed for detection models. Unlike traditional pre-labeling approaches that prioritize efficiency over quality, our method trains a pre-labeling model on the initial batch of human-created labels. This enables us to identify the data diversity and quantity of labels required to optimize a CV model for specific mission needs.
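To make the class-level comparison concrete, the sketch below shows one way human adjustments to pre-labels could be tallied per object class; the box format, IoU matching, and thresholds are illustrative assumptions rather than a prescription of our tooling.

```python
# Hypothetical sketch: record how much human labelers adjust pre-labels,
# broken out by object class. Box format and thresholds are assumptions.
from collections import defaultdict

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def adjustment_stats(predicted, corrected, match_iou=0.5):
    """predicted/corrected: lists of (class_name, box). Returns per-class
    counts of kept, adjusted, added, and removed labels."""
    stats = defaultdict(lambda: {"kept": 0, "adjusted": 0, "added": 0, "removed": 0})
    unmatched = list(corrected)
    for cls, pbox in predicted:
        best = max((c for c in unmatched if c[0] == cls),
                   key=lambda c: iou(pbox, c[1]), default=None)
        if best and iou(pbox, best[1]) >= match_iou:
            unmatched.remove(best)
            # Near-identical boxes count as accepted pre-labels.
            stats[cls]["kept" if iou(pbox, best[1]) > 0.9 else "adjusted"] += 1
        else:
            stats[cls]["removed"] += 1          # human deleted or reclassified it
    for cls, _ in unmatched:
        stats[cls]["added"] += 1                # human added an object the model missed
    return stats
```

Tallies like these, aggregated per class, are the raw material for the comparisons described in the following steps.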
We begin by training several pre-labeling models with different common model architectures on the initial human-generated labels. These models generate preliminary labels for the next set of data in a labeling campaign. Throughout the campaign, we continually retrain the pre-labeling models on each labeling batch. We compare the predictions from the pre-labeling models with the human-labeled data in each batch, using any differences to adjust the labeling process for each object class.
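In outline, that batch loop might look like the following sketch; the architecture names and the train/predict/label-collection callables are placeholders for whatever framework and labeling tool an organization actually uses.

```python
# Hypothetical outline of the iterative pre-labeling loop. The callables
# (train, predict, collect_human_labels) stand in for whichever training
# framework and labeling tool an organization actually uses.
def run_labeling_campaign(batches, initial_labels, architectures,
                          train, predict, collect_human_labels, compare):
    """batches: iterable of unlabeled scene batches.
    architectures: e.g. ["faster_rcnn", "retinanet", "yolo"] (examples only).
    compare: per-class disagreement metric, e.g. adjustment_stats above."""
    labeled = list(initial_labels)
    history = []
    for batch in batches:
        # Retrain each pre-labeling model on everything labeled so far.
        models = {arch: train(arch, labeled) for arch in architectures}
        # Pre-label the next batch with each model, then capture human corrections.
        predictions = {arch: predict(model, batch) for arch, model in models.items()}
        human_labels = collect_human_labels(batch, prelabels=predictions)
        # Per-class disagreement between predictions and human labels guides
        # which classes and scenes go into the next batch.
        history.append({arch: compare(preds, human_labels)
                        for arch, preds in predictions.items()})
        labeled.extend(human_labels)
    return labeled, history
```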
This approach allows us to identify specific deficiencies in the model early in the labeling campaign. The analysis helps us choose the next set of data for labeling in sequential batches to improve model performance. As the human adjustments to predicted labels shrink over iterations, we can see where additional labeled data will have diminishing returns on model performance. By using different model architectures and comparing their outputs, we can identify which data benefits the models most.
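One simple way the diminishing-returns signal could be operationalized is to watch the per-class adjustment rate across batches; the threshold and patience values below are illustrative assumptions.

```python
# Hypothetical check for diminishing returns: if the fraction of pre-labels
# that humans still adjust for a class has stayed below a threshold for the
# last few batches, additional labels for that class are unlikely to help.
def classes_with_diminishing_returns(adjustment_rates, threshold=0.05, patience=2):
    """adjustment_rates: {class_name: [rate per batch, oldest first]}.
    A rate is the share of pre-labels the labelers changed, added, or removed."""
    saturated = []
    for cls, rates in adjustment_rates.items():
        recent = rates[-patience:]
        if len(recent) == patience and all(r < threshold for r in recent):
            saturated.append(cls)
    return saturated

# Example: 'building' has plateaued, 'vehicle' still benefits from more labels.
rates = {"building": [0.40, 0.12, 0.04, 0.03], "vehicle": [0.55, 0.38, 0.21, 0.15]}
print(classes_with_diminishing_returns(rates))   # -> ['building']
```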
The methodology relies on strategically involving human labelers to optimize the model. Initially, human labelers are shown pre-labeled predictions, allowing them to easily identify and correct significant model inaccuracies. As the iterative cycles progress and the difference between model predictions and human adjustments diminishes, a portion of scenes is given to human labelers without revealing the predictions. This process helps identify potential model bias and highlights areas where human adjustments may no longer add accuracy because the pre-labeling models have become increasingly accurate.
These blind scenes inform adjustments to the labeling workflow and trigger additional human review when significant anomalies between predicted and human-generated labels are identified. This safeguard is crucial for preventing incorrect human labels from steering data selection and quantity requirements in later stages.
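A minimal sketch of how blind scenes could be assigned and anomalous scenes flagged for extra review is shown below; the blind fraction and the count-based anomaly check are simplified assumptions, not the full comparison we perform.

```python
# Hypothetical sketch of routing a share of scenes to "blind" labeling (no
# pre-labels shown) and flagging anomalies for extra review. The blind
# fraction and anomaly threshold are illustrative assumptions.
import random

def assign_blind_scenes(scene_ids, blind_fraction=0.1, seed=0):
    """Pick a subset of scenes whose labelers will not see model predictions."""
    rng = random.Random(seed)
    blind = set(rng.sample(sorted(scene_ids),
                           max(1, int(len(scene_ids) * blind_fraction))))
    return {sid: ("blind" if sid in blind else "prelabeled") for sid in scene_ids}

def needs_extra_review(predicted_count, human_count, tolerance=0.3):
    """Flag a scene when predicted and human object counts diverge sharply,
    so an expert can decide whether the model or the labels are at fault."""
    if max(predicted_count, human_count) == 0:
        return False
    return abs(predicted_count - human_count) / max(predicted_count, human_count) > tolerance

assignments = assign_blind_scenes(["scene_01", "scene_02", "scene_03", "scene_04"])
print(assignments)
print(needs_extra_review(predicted_count=12, human_count=5))   # -> True
```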
Our method reduces the need for a large pool of human labelers, relying instead on a smaller group of highly skilled subject matter experts who guide the optimization of the pre-labeling models through each batch of a labeling campaign. High-quality human labels are crucial for accurately comparing corrected and predicted objects in each scene. Involving experts rather than many labelers ensures precise labeling and leverages their expertise to guide data selection and avoid unnecessary labeling.
The operational efficiency this method brings to data labeling campaigns depends on the quality of human-generated labels and on how those labels guide data selection. This requires advanced data labeling tools that offer full transparency and dynamic adjustment of quality control and trust-score mechanisms. These tools must be able to route labels for additional human review based on ontology or scene complexity to avoid misdirecting the labeling campaign.
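As a rough illustration, and not any particular labeling tool's API, review routing driven by trust scores and complexity could look like the sketch below; the weights and thresholds are assumptions made for the example.

```python
# Hypothetical sketch of dynamically routing labels to additional review
# based on a labeler trust score and scene/ontology complexity. The scoring
# weights and thresholds are assumptions, not a specific tool's API.
def review_priority(trust_score, scene_complexity, class_difficulty):
    """All inputs in [0, 1]; a higher result means more review is warranted."""
    return (1.0 - trust_score) * 0.5 + scene_complexity * 0.3 + class_difficulty * 0.2

def route_label(trust_score, scene_complexity, class_difficulty,
                review_threshold=0.45, escalate_threshold=0.7):
    score = review_priority(trust_score, scene_complexity, class_difficulty)
    if score >= escalate_threshold:
        return "expert_adjudication"     # send to a subject matter expert
    if score >= review_threshold:
        return "peer_review"             # second labeler checks the work
    return "accept"                      # trusted label, no extra review

# A less-trusted labeler on a complex scene gets escalated.
print(route_label(trust_score=0.4, scene_complexity=0.9, class_difficulty=0.8))
```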
Our research introduces a systematic approach to quantitatively optimizing data labeling campaigns. The method reduces the number of labels by eliminating labeling that does not meaningfully improve model performance, allowing organizations to conduct more efficient, targeted labeling campaigns and to experiment with GeoAI effectively. By prioritizing fewer but higher-quality labels and leveraging expert knowledge, it ensures that the data selected for labeling is both necessary and sufficient.
Attendees of this presentation will learn how to curate data using open data and open models as a foundation for applying AI in open geospatial tools. This approach enables organizations to leverage the power of AI and machine learning without substantial upfront investment, making it accessible to a wider range of users and applications in the geospatial field.