Talk:Probabilistic latent semantic analysis

	This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
	This article has not yet received a rating on Wikipedia's content assessment scale.
	This article has not yet received a rating on the importance scale.

Corrected a few inconsistencies/confusions/inaccuracies:

afaik the acronym PLSA is more common than the lower-case variant pLSA -- we need to be consistent anyway.
Fisher kernels allow PLSA to be used in a discriminative setting, not as a generative model.
Whoever wrote the part about "severe overfitting problems" should provide a reference for that.

I stumbled upon a paper stating these overfitting problems and added the reference. -- Preceding unsigned comment added by Keretapi (talk • contribs) 14:37, 17 September 2007 (UTC)

In "Evolutions...", _discriminative_ was obviously wrong -- I think what was meant is _generative_ -- that's one way to present LDA.
Added a bullet on the extension to higher-order data

Sunny house 20:00, 22 August 2007 (UTC)

Excellent. Rama 08:38, 23 August 2007 (UTC)

Errr -- whoever added the graph: it's nice and everything, but could you try to use the same notation as in the article? Sunny house (talk) 19:44, 11 March 2008 (UTC)

No, in every paper I have read, the latent variable is always denoted as 'z'. So the text should be changed instead.--137.250.39.133 (talk) 09:21, 18 April 2008 (UTC)

Dear 137.250.39.133: first, the goal here is not necessarily to reproduce what you read in other papers, but to provide a self-contained explanation of PLSA. Whether the latent variable is denoted c or z is inconsequential as long as it is clear that it is a latent variable. However, the main issue with the graph is that it is confusing w.r.t. the document variable 'd', which is denoted by theta in the graph. I doubt every paper you read uses this notation -- of the papers cited here, Hofmann, Vinokourov et al. and Gaussier et al. certainly do not. Finally, there is a captioning problem: the words are not the only observables; the document index is observed too (by definition). Sunny house (talk) 13:18, 5 July 2008 (UTC)

Actually Hofmann, in his original paper "Probabilistic Latent Semantic Analysis", uses "d" for the document variable, "z" for the topic and "w" for the observed word. However, this is by no means important, and several other works use different letters for the variables, in both the plate notation and the formulas. It is in fact more common to see "z" as the topic, but this should not be taken as a rule. For clarity, both the text and the image should use the same letters. Paulo Gaspar.
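
To make the notation question concrete, here is a sketch of the model written with the d/z/w letters mentioned in this thread. It assumes the usual formulation of PLSA, in which the word w and the document index d are conditionally independent given the latent topic z. The letters are only a convention; the article text may use c where this sketch uses z.

```latex
\begin{align}
P(w, d) &= \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z) \\
        &= P(d) \sum_{z} P(z \mid d)\, P(w \mid z)
\end{align}
```

In both forms d and w are the observed variables and only z is latent, so whichever letters are chosen, the graph and its caption should mark the document index as observed alongside the words.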