Topic modelling: interpretability and applications

What are topic models and why is their interpretability important?

Topic modelling algorithms, such as Latent Dirichlet Allocation (LDA), which we used in the H2020-funded coordination and support action CAMERA, are a family of natural language processing (NLP) models used to detect underlying topics in large text corpora. However, the interpretability of the topics inferred by LDA and similar algorithms is often limited. As a result, defining the uncovered topics and presenting them in comprehensible formats often requires a fair amount of manual labour.

What do we mean by the term “topic” in topic models? In LDA, a topic is a multinomial distribution over the terms in the vocabulary of the corpus. What LDA gives as output is therefore not easily interpretable by humans, and the raw output of such models is difficult to use as a standalone product or even as input for other models.
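To make the idea of a topic as a distribution more concrete, here is a minimal sketch in Python. The vocabulary and the probabilities are invented for illustration and do not come from the CAMERA model; the helper `top_terms` is a hypothetical name.

```python
# Toy illustration: the vocabulary and probabilities below are invented,
# not taken from the CAMERA model.
vocabulary = ["safety", "accident", "automation", "rail", "emission"]

# One topic = one probability distribution over the whole vocabulary.
topic = [0.40, 0.25, 0.20, 0.10, 0.05]


def top_terms(topic_probs, vocab, n=3):
    """Return the n most probable (term, probability) pairs for a topic."""
    ranked = sorted(zip(vocab, topic_probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# A valid distribution sums to 1 (up to floating-point error).
assert abs(sum(topic) - 1.0) < 1e-9

print(top_terms(topic, vocabulary))
```

Even in this tiny case, the "topic" is just a list of numbers. A real model produces one such list per topic over thousands of terms, which is why some form of ranking or visualisation is needed before a human can name the topic.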

How is topic modelling relevant for our CAMERA objectives?

In CAMERA, we used LDA for two purposes: first, to filter out research projects irrelevant to mobility research in Europe and, second, to uncover topics among the retained mobility research projects. In the first case, we started from a large database of research initiatives funded by the FP7 and H2020 framework programmes and exploited topic modelling to remove any entries unrelated to the CAMERA analysis, effectively creating CAMERA’s dataset. In the second, we uncovered hidden topics for interpretation in order to develop a standalone product for further data mining processes and mobility reports. While that is the main goal of CAMERA as a coordination and support action, the output of topic modelling will also come in handy in other predictive models currently being trained.
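One plausible way to use a topic model as a filter is sketched below. This is not the actual CAMERA pipeline: it assumes that each project already has a document-topic distribution, that a human has marked some topics as mobility-related, and that a simple threshold decides what is kept. All names, values and the threshold are hypothetical.

```python
# Hypothetical document-topic distributions: each row sums to 1 and gives
# the share of each topic in one project description (values invented).
doc_topics = {
    "project_a": [0.70, 0.20, 0.10],
    "project_b": [0.05, 0.15, 0.80],
    "project_c": [0.35, 0.40, 0.25],
}

# Assumption: topics 0 and 1 were judged mobility-related after inspection.
MOBILITY_TOPICS = {0, 1}
THRESHOLD = 0.5  # illustrative cut-off, not a recommended value


def mobility_share(distribution, relevant=MOBILITY_TOPICS):
    """Total probability mass a document places on the relevant topics."""
    return sum(p for k, p in enumerate(distribution) if k in relevant)


def filter_projects(doc_topics, threshold=THRESHOLD):
    """Keep the projects whose mobility share reaches the threshold."""
    return [
        name
        for name, distribution in doc_topics.items()
        if mobility_share(distribution) >= threshold
    ]


retained = filter_projects(doc_topics)
print(retained)
```

The choice of relevant topics and of the threshold is where the manual judgement enters, which is one reason interpretable topics matter even for a purely mechanical filtering step.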

However, as already mentioned, those model outputs are difficult to use as is. How can we then easily interpret them and convert them into something that can be useful and presentable in mobility reports for interested decision-makers?

Visualising topic models’ output

Due to the complexity of the output acquired from LDA, the interpretation of the uncovered topics would be extremely difficult if not for interactive visualisations.

For CAMERA, we turned to the visualisation package LDAvis, presented in detail here, which provides an interactive way of visualising the results of an LDA model. The visualisation provided by LDAvis, shown in Figure 1 below (as a static image here, although it is interactive in LDAvis itself), is composed of two main parts: a global map of the topic model on the left and horizontal bar charts with the most relevant terms for each topic on the right.

While most topic visualisations focus on the most probable terms specific to each topic, LDAvis also offers its own metric, relevance. Relevance ranks the terms within a topic, taking into account not only a term’s probability within a specific topic but also its probability across the whole corpus. In fact, the authors showed that just looking at the probability of topic-specific terms when trying to interpret a topic is suboptimal. The relevance metric combines the two aforementioned quantities via a single parameter, λ, which takes values between 0 and 1 and determines how much each of the two parts contributes to the metric.
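As defined by the LDAvis authors, the relevance of a term w to a topic t is λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)), where p(w | t) is the term’s probability within the topic and p(w) is its probability in the whole corpus. The sketch below computes it in plain Python; the three terms and their probabilities are invented for illustration, and the function names are hypothetical.

```python
import math


def relevance(p_term_given_topic, p_term, lam=0.6):
    """Relevance of a term to a topic for a given lambda in [0, 1].

    lam = 1 ranks purely by the term's probability within the topic;
    lam = 0 ranks purely by lift, i.e. p(term | topic) / p(term).
    """
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


# Invented values: term -> (p(term | topic), p(term) in the whole corpus).
terms = {
    "safety": (0.080, 0.030),    # frequent in the topic, fairly common overall
    "accident": (0.030, 0.004),  # rarer, but concentrated in this topic
    "project": (0.060, 0.070),   # frequent everywhere, not topic-specific
}


def rank_terms(terms, lam):
    """Return the terms sorted from most to least relevant."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam=lam), reverse=True)


for lam in (1.0, 0.6, 0.0):
    print(lam, rank_terms(terms, lam))
```

With these toy numbers, a generic word such as “project” ranks high when λ = 1 but drops to the bottom as λ decreases, while a rare but topic-specific word such as “accident” moves up. This is the trade-off that the λ slider in LDAvis lets the reader explore.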

In Figure 1, the selected topic is topic 4 (red circle on the left-hand side), λ is set to 0.6 and, on the right-hand side, the 20 most relevant terms for topic 4 under these fixed parameters are shown. The red horizontal bars give us the probabilities of the terms within the selected topic, while the ratio between the red and blue bars illustrates how specific a particular term is to that topic versus all other topics. Looking at topic 4, while the term “safety” is the most relevant term for this topic, it is also more common across the corpus than, for example, the terms “accident” and “automation”, which are relatively rare in the whole corpus, with most of their occurrences pertaining to this topic. This suggests that topic 4 identifies research performed in the area of automated, intelligent vehicles.

Relying on this kind of analysis, we were able to interpret the mobility research topics uncovered in CAMERA much more easily.

Figure 1. Example of topic modelling with nine uncovered topics

What about cases where the LDA output is not the final goal?

More often than not, the LDA topic-term distribution is not the final product; it is more useful when fed to subsequent models in other applications. The following list represents just the tip of the iceberg of possible applications:

  • Text summarisation. We live in an era of unprecedented amounts of data available online. Text summarisation can help us extract the key points needed for a particular application, business or research objective, and topic modelling can improve the performance of such models, yielding better summaries of the content.
  • Query expansion. As topic models uncover relationships among words through latent topics, they can exploit the relationships in order to expand shorter queries on the semantic level, which can significantly improve the performance of search engines.
  • Sentiment analysis. Sentiment analysis deals with the extraction of sentiments and opinions of various groups of people (customers, stakeholders, investors, etc.). A big challenge there is deriving a useful numerical variable from text, and topic models can help. As one example, enriching such models with the topic-term distribution could help airlines better categorise passengers’ online reviews and improve their services where doing so has the greatest effect on revenue.
  • Recommender systems. Similar to sentiment analysis, topic models in recommender systems can better group various services or products in comparison to more traditional clustering algorithms, ultimately resulting in more appropriate matching of users and products.
  • Blockchain. Though cryptocurrencies are growing in popularity, they remain risky to use due to their volatile and unregulated market. Topic modelling can help assess large quantities of unstructured information available online from Bitcoin developers and investors, improving the automatic detection of fraudulent activities, risk levels, and even future events on the market.
  • Understanding scientific publications. As with data, the amount of knowledge generated today is unmatched in both breadth and extent, and it is growing faster than ever before. It is becoming increasingly difficult to promptly and efficiently find the information needed and to assess its reliability and value for research or business needs. Moreover, there is a growing need to systematise all generated knowledge. Our goal in CAMERA falls in line with this objective: we want to systematise recent research performed in the area of mobility so that users can access information more quickly and efficiently. In line with the efforts of CAMERA, we believe valuable insights should be more readily available to decision-makers and researchers working in related areas of interest.
  • And many more usages …
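Several of the applications above rely on the same basic move: treating the topic proportions of a document as a numerical feature vector. The sketch below shows that idea with a nearest-neighbour lookup, as a recommender or review-grouping step might use. It assumes document-topic distributions are available (LDA infers these alongside the topic-term distributions); the review names and values are invented, and the functions are hypothetical helpers rather than part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented document-topic distributions for three passenger reviews.
doc_topics = {
    "review_1": [0.80, 0.15, 0.05],
    "review_2": [0.75, 0.20, 0.05],
    "review_3": [0.05, 0.10, 0.85],
}


def most_similar(name, doc_topics):
    """Return the other document whose topic mix is closest to `name`."""
    query = doc_topics[name]
    others = (other for other in doc_topics if other != name)
    return max(others, key=lambda other: cosine_similarity(query, doc_topics[other]))


print(most_similar("review_1", doc_topics))
print(most_similar("review_3", doc_topics))
```

The same vectors could equally be passed as input features to a classifier or a clustering algorithm; cosine similarity is used here only because it keeps the example short.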

Are you interested in finding out how we used topic modelling in CAMERA to systematise the knowledge on mobility research initiatives in Europe since 2007? Stay tuned for the upcoming publication of our Second Mobility Report.

About Author

Damir Valput

Damir is a Data Scientist who enjoys developing mathematical models, designing algorithms, wrestling with data, and working on his n-th cup of coffee. When not in front of a computer, you can find him in the cinema, playing board games, exploring a new corner of the world or wondering what to eat.
