FOSS4G 2022 academic track

An approach for real-time validation of the location of biodiversity observations contributed in a citizen science project
2022-08-25, 14:15–14:45 (Europe/Rome), Room Hall 3A


Thanks to technological advances, public participation in scientific projects, known as citizen science, has grown significantly in recent years (Schade and Tsinaraki 2016; Land-Zandstra et al. 2016). Contributors to citizen science projects are very diverse, differing in expertise, age, culture, and so on, and thus the data they contribute should be validated before being used in any scientific analysis. Experts typically validate citizen science data, but this is a time-consuming process. One consequence is that volunteers do not receive timely feedback on their contributions and may lose the motivation to keep contributing. A method for (semi-)automating the validation of citizen science data is therefore critical. One approach researchers are now exploring is the use of machine learning (ML) algorithms to validate citizen science data.


We developed a citizen science project with the goal of collecting and automatically validating biodiversity observations while also providing participants with real-time feedback. We implemented the application with the Django framework and a PostgreSQL/PostGIS database for data storage. In general, biodiversity citizen science applications focus on automatically identifying or validating species images, with less emphasis on automatically validating the location of observations. Our application's focus, aside from image and date validation (Lotfian et al. 2019), is on automatically validating the location of biodiversity observations based on the environmental variables surrounding the observation point. In this project, we generated species distribution models using four machine learning algorithms (Random Forest, Balanced Random Forest, Deep Neural Network, and Naive Bayes) and used the models to validate the location of a newly added observation. After comparing the performance of the algorithms, we chose the best-performing one for our real-time location validation application.
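To illustrate what distinguishes a Balanced Random Forest from a plain Random Forest, the following minimal sketch shows the balanced bootstrap step as it is commonly described: each tree is trained on a sample that contains the same number of rows from every class, which matters when presence records are far rarer than background points. The abstract does not describe the authors' implementation, so the function name, the labels, and the class sizes below are hypothetical.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: equal draws (with replacement) per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Every class contributes as many rows as the rarest class has.
    n_per_class = min(len(rows) for rows in by_class.values())
    sample = []
    for label in sorted(by_class):
        sample.extend(rng.choices(by_class[label], k=n_per_class))
    return sample


# Hypothetical labels: 1 = species observed, 0 = background point.
labels = [1] * 5 + [0] * 95
sample = balanced_bootstrap(labels, random.Random(0))
n_presence = sum(labels[i] for i in sample)
n_background = len(sample) - n_presence
```

Each tree in the forest would receive its own balanced sample, so the rare class is not swamped by the majority class during training.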

We developed an API that validates new observations using the trained models of the chosen algorithm. The API was built with the Flask framework. It takes the location and species name as parameters and predicts the likelihood of observing a species (for the time being, a bird species) in a given neighborhood. The model prediction, together with information on the species' habitat characteristics, is then communicated to participants as real-time feedback. The API has three endpoints: a POST request that takes the species name and the observation location and returns the predicted probability of observing the species in a 1 km neighborhood around that location; a GET request that takes the observation location and returns the five species most likely to be observed in a 1 km neighborhood around it; and a GET request that returns the species' common names in English.
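The logic behind the three endpoints can be sketched independently of the web framework. The sketch below is an assumption-laden illustration, not the project's code: the function names, the species list, the coordinates, and the toy distance-based models are all hypothetical stand-ins for the trained species distribution models, and the Flask routing that would expose each function is omitted.

```python
import math


def _toy_model(center_lat, center_lon):
    """Hypothetical stand-in for a trained model: probability decays with distance.

    A real model would use environmental variables sampled in the 1 km
    neighborhood around the observation point.
    """
    def predict(lat, lon):
        return math.exp(-math.hypot(lat - center_lat, lon - center_lon))
    return predict


# Hypothetical species and model centers, for illustration only.
MODELS = {
    "Common Blackbird": _toy_model(46.78, 6.64),
    "Great Tit": _toy_model(46.52, 6.63),
    "European Robin": _toy_model(46.20, 6.14),
    "House Sparrow": _toy_model(46.95, 7.45),
    "Common Chaffinch": _toy_model(47.37, 8.54),
    "Eurasian Blue Tit": _toy_model(46.00, 8.95),
}


def predict_species(species, lat, lon):
    """POST endpoint logic: probability of observing one species at a location."""
    if species not in MODELS:
        return {"error": f"unknown species: {species}"}
    return {"species": species, "probability": MODELS[species](lat, lon)}


def top_species(lat, lon, n=5):
    """GET endpoint logic: the n species most likely to be observed at a location."""
    scored = [(name, model(lat, lon)) for name, model in MODELS.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [{"species": name, "probability": p} for name, p in scored[:n]]


def list_species():
    """GET endpoint logic: English common names of the modeled species."""
    return sorted(MODELS)


single = predict_species("Great Tit", 46.5, 6.6)
ranking = top_species(46.5, 6.6)
names = list_species()
```

Keeping the prediction logic in plain functions like these makes it testable without a running server; the web layer only has to parse request parameters and serialize the returned dictionaries.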

User experiment:

A user experiment was carried out to investigate the impact of automatic feedback on simplifying the validation task and improving data quality, as well as the impact of real-time feedback on sustaining participation. In addition, a questionnaire was distributed to volunteers, asking for their feedback on the application interface and about the impact of real-time feedback on their motivation to continue contributing to the application.


The results were divided into two parts: first, the performance of the machine learning algorithms and their comparison, and second, the results of testing the application through the user experiment.

We used the AUC metric to compare the performance of the machine learning algorithms. The results showed that while DNN had a higher median AUC (0.86) than the other three algorithms, its performance was very poor for some species (AUC below 0.6). Balanced Random Forest (median AUC 0.82) performed more consistently across all species than the other three algorithms. Furthermore, for some species where the other three algorithms performed poorly (AUC below 0.70), Balanced Random Forest outperformed them.
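The comparison above rests on computing one AUC per species and then summarizing across species. A minimal sketch of that calculation follows, using the rank-based definition of AUC (the probability that a randomly chosen presence is scored higher than a randomly chosen absence, with ties counted as half). The species names, labels, and scores are made up for illustration, and the 0.6 cut-off simply mirrors the threshold mentioned in the text.

```python
from statistics import median


def auc(labels, scores):
    """AUC as the share of (presence, absence) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one presence and one absence")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out labels and predicted probabilities per species.
per_species = {
    "species_a": ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    "species_b": ([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]),
    "species_c": ([1, 1, 0, 0], [0.6, 0.1, 0.4, 0.3]),
}

aucs = {name: auc(y, s) for name, (y, s) in per_species.items()}
median_auc = median(aucs.values())
# Species for which the model is barely better than chance.
weak_species = [name for name, value in aucs.items() if value < 0.6]
```

Reporting the list of weak species alongside the median makes the trade-off described above visible: an algorithm can have the highest median AUC and still fail badly on individual species.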

The user experiment provided preliminary findings that support the combination of citizen science and machine learning. Participants with a higher number of contributions found real-time feedback more useful for learning about biodiversity and stated that it increased their motivation to contribute to the project. In addition, as a result of automatic data validation, only 10% of observations were flagged for expert verification, so combining human and machine power led to a faster validation process and improved data quality.

Why it should be considered:

Data validation and long-term participation have always been two of the most difficult challenges in citizen science and VGI (volunteered geographic information) projects. Various studies have addressed biodiversity data validation, focusing primarily on observation images and automatic species identification; however, not enough attention has been paid to validating the location of observations, particularly automatic location validation that takes species habitat characteristics into account. Furthermore, to the best of our knowledge, the combination of machine learning and citizen science to sustain participation by giving participants real-time, user-centered, machine-generated feedback has so far received little attention. Our work is therefore new, original, and fully coherent with the vision of community citizen science, where scientists and citizen scientists are supposed to learn from each other.


Land-Zandstra, Anne M., Jeroen L. A. Devilee, Frans Snik, Franka Buurmeijer, and Jos M. van den Broek. 2016. “Citizen Science on a Smartphone: Participants’ Motivations and Learning.” Public Understanding of Science 25 (1): 45–60.

Lotfian, Maryam, Jens Ingensand, Olivier Ertz, Simon Oulevay, and Thibaud Chassin. 2019. “Auto-Filtering Validation in Citizen Science Biodiversity Monitoring: A Case Study.” In Proceedings of the 29th ICA Conference, July 15-20, 2019. Vol. 2.

Schade, S., and C. Tsinaraki. 2016. “Survey Report: Data Management in Citizen Science Projects.” EUR 27920 EN. Luxembourg: Publications Office of the European Union. doi:10.2788/539115.

Maryam Lotfian obtained her BSc in Geomatics Engineering from Tehran University (Iran) in 2013. In 2016, she received her MSc in Environmental and Geomatics Engineering from Politecnico di Milano (Italy) with a thesis on urban climate classification using satellite imagery. In collaboration with the University of Applied Sciences and Arts Western Switzerland (HEIG-VD), she began her PhD at the Politecnico di Milano Department of Civil and Environmental Engineering in October 2017. Her research focuses on challenges in citizen science projects, including participants’ motivations and data quality. She works on developing machine learning algorithms for automated data validation in citizen science projects.