2022-08-25, 15:15–15:20 (Europe/Rome), Room Hall 3A
Topic modelling is a branch of Natural Language Processing that deals with the discovery of conversation topics in a document corpus. In social media, it translates into aggregating posts into topics of conversation and observing how these topics evolve over time (hence the “dynamic” adjective [Murakami, 2021]). Conveying the results of topic modelling to an analyst is challenging, since the topics often do not lend themselves naturally to meaningful labelling and the relationships between them can involve hundreds of dimensions. Furthermore, the popularity of topics is itself subject to change over time.
In this paper, we propose a spatialization technique based on open-source software that reduces the intrinsic complexity of dynamic topic modelling output to familiar topographic objects, namely ridges, valleys, and peaks. This offers new possibilities for understanding complex relationships that change over time, and it overcomes issues with traditional topic modelling visualisation approaches such as network graphs [Karpovich, 2017].
Spatialization [Fabrikant, 2017], a technique that uses spatial metaphors to aid cognitive tasks, has been a research field since the early ’90s. It can be used to make sense of vast amounts of information by reducing them to a physical landscape. In this work, we consider the spatialization of topics in a 3D space where the X-axis is the similarity of topics posted on the same day, the Y-axis is the similarity of topics across time and how their relationships evolve, and the Z-axis is a measure of topic popularity. With this approach, a topic is reduced to a single point in a 3D space, and the interpolated surface constructed out of these points becomes a landscape with peaks, ridges, and valleys. More precisely, “valleys” represent the less popular topics, “peaks” the more popular ones, and flat surfaces the topics of average popularity.
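As an illustration of how scattered topic points can become a continuous surface, the following minimal sketch uses inverse distance weighting. This is an assumption made for the example: the abstract does not state which interpolation method the project uses, and the function name `idw_surface` and the sample data are hypothetical.

```python
import math

def idw_surface(points, xs, ys, power=2.0):
    """Interpolate scattered (x, y, z) topic points onto a regular grid.

    points: (x, y, z) tuples, where x is topic similarity, y is time,
            and z is popularity.
    xs, ys: grid coordinates along the X and Y axes.
    Returns a list of rows (one per y value) of interpolated z values.
    """
    surface = []
    for gy in ys:
        row = []
        for gx in xs:
            weighted_sum = 0.0
            weight_total = 0.0
            exact = None
            for px, py, pz in points:
                distance = math.hypot(gx - px, gy - py)
                if distance == 0.0:
                    # The grid node coincides with a topic: use its value.
                    exact = pz
                    break
                weight = 1.0 / distance ** power
                weighted_sum += weight * pz
                weight_total += weight
            row.append(exact if exact is not None else weighted_sum / weight_total)
        surface.append(row)
    return surface

# Hypothetical data: four topics at the corners of a unit square.
topics = [(0.0, 0.0, 1.0), (1.0, 0.0, 5.0), (0.0, 1.0, 2.0), (1.0, 1.0, 3.0)]
grid = idw_surface(topics, xs=[0.0, 0.5, 1.0], ys=[0.0, 0.5, 1.0])
```

Grid nodes that coincide with a topic keep that topic's popularity, while the node at the centre of the square, being equidistant from all four topics, receives their mean.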
Our team is working on the Australian Data Observatory project, which has been collecting tweets and other social media posts (Instagram, Reddit, YouTube, Flickr, etc.) related to Australia for the last 12 months. Through the use of the new Twitter academic license, the project is harvesting tens of millions of tweets per month. The social media posts are stored and analyzed daily using the deep learning BERTopic package. The BERTopic output is then stored and served through a REST API, which is used by different clients (at present these are Jupyter notebooks and a web application). The intended audience of our platform is composed of domain researchers, including social scientists, linguists, and data journalists. The goal is to support big data exploration at scale and overcome the smaller-scale cottage industry of social media research that has hitherto been the norm in academia in Australia.
Topic modelling is often presented using 2D visualizations, such as circles with size proportional to topic popularity and position related to the similarity between topics. The dynamic (temporal) aspect of topic evolution is typically shown with animations that show how topics morph into different ones and wax and wane in popularity, or it is ignored completely and researchers just use static topic modelling visualisations. There is merit in trying a different approach for dynamic topic visualisation: namely, to map the social media landscape to the physical one, as this metaphor allows the simultaneous appreciation of time, topic similarity, and popularity while allowing (via zoom operations) the aggregation/disaggregation of topics into bigger/smaller clusters of posts. This 3D landscape naturally aids the end-user in understanding complex, highly dimensional data at a scale and volume that would otherwise be impossible. The formation of islands, archipelagos, mountain ranges, or valleys related to mainstream topics such as Covid, vaccination, and lockdown, through to geopolitical events such as the invasion of Ukraine, provides a finger on the pulse of what is being discussed at scale by the broader population across the social media landscape.
This approach is currently realised using a web application that enables the “topographic” exploration of the topic landscape with functions to improve the user experience in the areas of topic labelling and inter-topic distance.
The proposed visualization raises a few critical issues:
distance between topics has to be drastically reduced in dimensionality from the ones provided by the Deep Learning model to just one (the X-axis);
the Y-axis (time) has to be put in relation to a completely different measure (distance between topics) to make it amenable to an interpolation;
topic popularity (the Z-axis) has a huge variability leading to irregular surfaces, hence the need for a non-linear scaling of the Z-axis;
communicating the meaning of each topic to the user is difficult, as the top terms of each topic may not be meaningful to a human, and make for a poor label.
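The first three issues above can be sketched as follows. This is only a minimal illustration under stated assumptions: the abstract does not name the dimensionality reduction or scaling methods actually used, so projection onto the first principal component and a base-10 logarithm are stand-ins chosen for the example, and all function names and sample values are hypothetical.

```python
import math

def project_to_one_dimension(vectors, iterations=100):
    """Project vectors onto their first principal component.

    The component is found by power iteration on the covariance of the
    centred data (an assumed stand-in for the project's own reduction).
    """
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centred = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    direction = [1.0 / (i + 1) for i in range(dim)]  # deterministic start
    for _ in range(iterations):
        scores = [sum(c[i] * direction[i] for i in range(dim)) for c in centred]
        updated = [sum(s * c[i] for s, c in zip(scores, centred)) for i in range(dim)]
        norm = math.sqrt(sum(u * u for u in updated))
        if norm == 0.0:
            break
        direction = [u / norm for u in updated]
    return [sum(c[i] * direction[i] for i in range(dim)) for c in centred]

def rescale(values, low, high):
    """Linearly map values onto [low, high], e.g. days onto the X-axis range."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

def compress_popularity(counts):
    """Non-linear Z scaling: log10(1 + count) tames very popular topics."""
    return [math.log10(1 + c) for c in counts]

# Hypothetical topic embeddings (three dimensions instead of hundreds).
embeddings = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.0, 0.0), (3.0, 3.0, 0.0)]
x = project_to_one_dimension(embeddings)   # X-axis: inter-topic distance
y = rescale([0, 1, 2, 3], min(x), max(x))  # Y-axis: days, in the units of X
z = compress_popularity([0, 9, 99, 999])   # Z-axis: scaled post counts
```

Rescaling time to the range of the X-axis puts both horizontal axes in comparable units, which keeps a distance-based interpolation from being dominated by one of them. The fourth issue, topic labelling, is not addressed by this sketch.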
The proposed processing and visualization are developed using only open-source tools and frameworks, leveraging the work of the open-source geospatial community.
All the software developed in the course of the Australian Data Observatory project is released under the Apache 2.0 license and is available through the University of Melbourne GitLab source code repository.
I have been working on data analytics, software development, and geospatial technologies since graduating in Statistics (1991); first as an ESRI employee, then (1997-2012) as a freelance consultant, and currently as a member of the professional staff at the University of Melbourne.