Jun Zhu
Assoc. Prof. @ THU
Adj. Faculty @ CMU
MedLDA: Max-Margin Supervised Topic Models

This is a C++ implementation of max-margin supervised topic models, including the variational algorithms presented in [1][2] and the two Monte Carlo methods presented in [3]. We also provide a C++ implementation of the supervised LDA regression model presented in [5].

Downloads

  • Download the readme.txt
  • Download the source code of the variational algorithms presented in [1][2]: 1) Windows: MedLDAc.zip; and 2) Linux: medlda.zip
  • Download the source code of the collapsed Gibbs sampling and importance sampling algorithms presented in [3]: 1) collapsed Gibbs sampling: MedLDA_Gibbs.zip; and 2) importance sampling: MedLDA_IS.zip
  • Download the MedLDA regression model at: MedLDA_Regression
  • Download the supervised LDA regression model at: sLDA_Regression

Update News

  • Two Monte Carlo methods, based on the NIPS paper [3], were released on Nov 18, 2012.
  • The implementations of the MedLDA regression model and the supervised LDA regression model were released on Dec 2, 2011.
  • Bug on data loading and "est" option fixed on Dec 22nd, 2010. Thanks to Simon Lacoste-Julien and Michalis Raptis for reporting.
  • The Linux version of the MedLDA classification model (version 1.0) was released on Aug 7th, 2010.
  • The MedLDA classification model (version 1.0) was released on Jul 8th, 2010.

Short Description

Supervised topic models use a document's side information to discover predictive low-dimensional representations of documents. Existing models rely on likelihood-based estimation. In this project, we present a general framework of max-margin supervised topic models for both continuous and categorical response variables. Our approach, maximum entropy discrimination latent Dirichlet allocation (MedLDA), uses the max-margin principle to train supervised topic models and to estimate predictive topic representations that are arguably more suitable for prediction tasks. The general principle of MedLDA can be applied to perform joint max-margin learning and maximum likelihood estimation for arbitrary topic models, directed or undirected, and supervised or unsupervised, when supervised side information is available. We develop efficient variational methods for posterior inference and parameter estimation, and demonstrate qualitatively and quantitatively the advantages of MedLDA over likelihood-based topic models on movie review, hotel review, and 20 Newsgroups data sets.

The following figures show 2D embeddings of the latent representations discovered on the 20 Newsgroups data set by MedLDA and by standard unsupervised LDA. The embeddings are computed with t-SNE (t-Distributed Stochastic Neighbor Embedding).

[Figures: 2D t-SNE embeddings of the latent representations on 20 Newsgroups. Panels: MedLDA, LDA]
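
To make the max-margin principle mentioned in the description more concrete, here is a minimal sketch of a multiclass hinge loss evaluated on a document's topic representation. It is not taken from the released code. It assumes that each class has a weight vector over the K topics (eta) and that zbar holds the document's topic proportions; the function names and the unit margin are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch only; these names are hypothetical and are not
// the API of the MedLDA source code.

// Score of one class: dot product of its weights with the topic proportions.
double class_score(const std::vector<double>& eta_y,
                   const std::vector<double>& zbar) {
    assert(eta_y.size() == zbar.size());
    double s = 0.0;
    for (std::size_t k = 0; k < zbar.size(); ++k) s += eta_y[k] * zbar[k];
    return s;
}

// Multiclass hinge loss with unit margin:
//   max(0, max over y != label of (1 + score(y) - score(label))).
double multiclass_hinge(const std::vector<std::vector<double>>& eta,
                        const std::vector<double>& zbar,
                        std::size_t label) {
    assert(label < eta.size());
    const double true_score = class_score(eta[label], zbar);
    double loss = 0.0;  // the loss is zero when every margin is satisfied
    for (std::size_t y = 0; y < eta.size(); ++y) {
        if (y == label) continue;
        loss = std::max(loss, 1.0 + class_score(eta[y], zbar) - true_score);
    }
    return loss;
}
```

The loss is zero only when the true class outscores every other class by at least the margin, which is the sense in which a max-margin criterion pushes the topic representation to be discriminative rather than merely likely.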

The following figures show the average latent representations for documents in the categories "comp.graphics" and "sci.electronics", respectively. The latent representations discovered by MedLDA are much sparser and more discriminative than those discovered by unsupervised LDA. More examples are provided in the paper.

[Figures: average latent representations per category. Panels: comp.graphics, sci.electronics]
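
The per-category averages described above could be computed along the following lines. This is a sketch under assumptions: theta holds one K-dimensional topic-proportion vector per document in a category, and sparsity is summarized by counting the topics whose average weight exceeds a threshold. That summary is chosen here for illustration and is not a measure taken from the paper; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise mean of per-document topic proportions.
std::vector<double> average_representation(
        const std::vector<std::vector<double>>& theta) {
    assert(!theta.empty());
    std::vector<double> avg(theta.front().size(), 0.0);
    for (const auto& doc : theta) {
        assert(doc.size() == avg.size());
        for (std::size_t k = 0; k < avg.size(); ++k) avg[k] += doc[k];
    }
    for (double& v : avg) v /= static_cast<double>(theta.size());
    return avg;
}

// Hypothetical sparsity summary: number of topics with average weight > eps.
std::size_t active_topics(const std::vector<double>& avg, double eps) {
    std::size_t n = 0;
    for (double v : avg) {
        if (v > eps) ++n;
    }
    return n;
}
```

A smaller count of active topics for the same category indicates a sparser average representation.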

References

  1. Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: Maximum Margin Supervised Topic Models, Journal of Machine Learning Research, 13(Aug):2237--2278, 2012.

  2. Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification, In ICML, Montreal, Canada, 2009.

  3. Qixia Jiang, Jun Zhu, Maosong Sun, and Eric P. Xing. Monte Carlo Methods for Maximum Margin Supervised Topic Models, In NIPS, Lake Tahoe, USA, 2012.

  4. Jun Zhu and Eric P. Xing. Conditional Topic Random Fields, In ICML, Haifa, Israel, 2010.

  5. David Blei and Jon D. McAuliffe. Supervised Topic Models, In NIPS, Vancouver, Canada, 2007.

Last updated on Dec. 4th, 2012.