Probabilistic Topic Models with MALLET


Probabilistic Topic Models (PTMs) are a suite of statistical algorithms that aim to discover the main themes, called topics, that pervade a large and otherwise unstructured collection of natural language documents. PTMs can annotate and summarize such a corpus with the thematic and semantic information that topics provide.
The most successful contribution in the field of PTMs is Latent Dirichlet Allocation (LDA). LDA is based on the idea that each document is a hidden mixture of multiple topics, where each topic is defined as a probability distribution over the words of the vocabulary. Thus, the corpus can be formalized by a generative model, namely a simple probabilistic procedure by which each document could in principle be generated.
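To make the generative procedure concrete, here is a minimal Python sketch of how a single document could be generated. The vocabulary, the two topic distributions, and the Dirichlet parameter `alpha` are all invented for illustration, and the topics are fixed by hand instead of being drawn from their own Dirichlet prior as in the full LDA model.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # 1. Draw the document's topic mixture.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to the mixture.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from the chosen topic's word distribution.
        words.append(rng.choices(vocabulary, weights=topics[z])[0])
    return theta, words


# Toy data (assumed for illustration): a "sports" topic and a "politics" topic.
vocabulary = ["goal", "match", "team", "vote", "party", "election"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]

rng = random.Random(0)
theta, words = generate_document(topics, vocabulary, alpha=[0.5, 0.5], length=10, rng=rng)
print("topic mixture:", theta)
print("document:", " ".join(words))
```

Fitting a topic model amounts to running this story in reverse: given only the observed words, infer the topic distributions and the per-document mixtures that most plausibly produced them.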
In this lightning talk I will introduce the intuition behind the LDA model. Then I will show how to use the MALLET API to analyze a large corpus of news articles with topic models. MALLET is a Java-based package for statistical natural language processing, document classification, topic modeling, and other machine learning applications to text. Finally, I will show how to integrate MALLET into a Python environment, to gain access to all the built-in functionality of a scripting language.
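One simple way to drive MALLET from Python is to call its command-line tool through `subprocess`; this is a swapped-in approach for illustration, not necessarily the integration shown in the talk, which uses the Java API. The sketch below only assumes MALLET's `import-dir` and `train-topics` commands; the launcher path, the corpus directory, and the output file names are hypothetical.

```python
import subprocess


def build_mallet_commands(mallet_bin, corpus_dir, num_topics):
    """Build the two MALLET command lines: import the corpus, then train topics."""
    model_input = "corpus.mallet"  # hypothetical intermediate file name
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", corpus_dir,
        "--output", model_input,
        "--keep-sequence",       # topic modeling needs word order preserved
        "--remove-stopwords",
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", model_input,
        "--num-topics", str(num_topics),
        "--output-topic-keys", "topic_keys.txt",  # hypothetical output names
        "--output-doc-topics", "doc_topics.txt",
    ]
    return import_cmd, train_cmd


def run_mallet(mallet_bin, corpus_dir, num_topics):
    """Run both steps in order; raises CalledProcessError if either one fails."""
    for cmd in build_mallet_commands(mallet_bin, corpus_dir, num_topics):
        subprocess.run(cmd, check=True)


# Assumed locations: adjust to wherever MALLET and the news corpus actually live.
import_cmd, train_cmd = build_mallet_commands("bin/mallet", "news_articles", 20)
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

Calling `run_mallet("bin/mallet", "news_articles", 20)` would execute both steps, after which the topic keys and per-document topic proportions can be read back from the output files with ordinary Python file handling.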