Probabilistic latent semantic analysis

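Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval), is a statistical technique for analyzing two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.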

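Standard latent semantic analysis stems from linear algebra and downsizes the occurrence table, usually via a singular value decomposition. Probabilistic latent semantic analysis, by contrast, is based on a mixture decomposition derived from a latent class model.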

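Model

Given observations in the form of co-occurrences $(w,d)$ of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions: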

$$P(w,d)=\sum_{c}P(c)P(d|c)P(w|c)=P(d)\sum_{c}P(c|d)P(w|c)$$

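where $c$ is the topic of the word. Note that the number of topics is a hyperparameter: it must be chosen in advance and is not estimated from the data. The first formulation is the symmetric one, in which $w$ and $d$ are both generated from the latent class $c$ in similar ways (using the conditional probabilities $P(d|c)$ and $P(w|c)$). The second formulation is the asymmetric one: for each document $d$, a latent class $c$ is chosen conditionally on the document according to $P(c|d)$, and a word is then generated from that class according to $P(w|c)$. Although this example uses words and documents, the co-occurrence of any pair of discrete variables can be modeled in exactly the same way.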
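The agreement between the two formulations follows from Bayes' rule, $P(c|d)=P(c)P(d|c)/P(d)$. The following is a minimal numerical sketch of that check on a randomly generated toy model; the sizes, the helper names (`normalize`, `random_distribution`, `joint_symmetric`, `joint_asymmetric`) and the random parameters are assumptions made purely for illustration.

```python
import random

def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def random_distribution(n, rng):
    """Draw an arbitrary strictly positive probability vector of length n."""
    return normalize([rng.random() + 0.1 for _ in range(n)])

rng = random.Random(0)
n_topics, n_docs, n_words = 2, 3, 4  # toy sizes, chosen arbitrarily

# Parameters of the symmetric formulation.
p_c = random_distribution(n_topics, rng)                                    # P(c)
p_d_given_c = [random_distribution(n_docs, rng) for _ in range(n_topics)]   # P(d|c)
p_w_given_c = [random_distribution(n_words, rng) for _ in range(n_topics)]  # P(w|c)

def joint_symmetric(w, d):
    """P(w,d) = sum_c P(c) P(d|c) P(w|c)."""
    return sum(p_c[c] * p_d_given_c[c][d] * p_w_given_c[c][w]
               for c in range(n_topics))

# Parameters of the asymmetric formulation, obtained through Bayes' rule.
p_d = [sum(p_c[c] * p_d_given_c[c][d] for c in range(n_topics))
       for d in range(n_docs)]                                              # P(d)
p_c_given_d = [[p_c[c] * p_d_given_c[c][d] / p_d[d] for c in range(n_topics)]
               for d in range(n_docs)]                                      # P(c|d)

def joint_asymmetric(w, d):
    """P(w,d) = P(d) sum_c P(c|d) P(w|c)."""
    return p_d[d] * sum(p_c_given_d[d][c] * p_w_given_c[c][w]
                        for c in range(n_topics))

pairs = [(w, d) for w in range(n_words) for d in range(n_docs)]
max_gap = max(abs(joint_symmetric(w, d) - joint_asymmetric(w, d)) for w, d in pairs)
total = sum(joint_symmetric(w, d) for w, d in pairs)
print(max_gap, total)  # the gap is at floating-point noise level; the joint sums to about 1
```

The two functions differ only in how the same joint distribution is factored, so either parameterization can be converted into the other.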

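Consequently, the number of parameters is $cd+wc$, where $c$, $d$ and $w$ here denote the numbers of topics, documents and words respectively, so the parameter count grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents.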
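As a small worked illustration of this count, the sketch below evaluates $cd+wc$ for a fixed vocabulary and topic number while the collection grows. The function name `plsa_parameter_count` and the sizes are hypothetical, and the count covers the entries of the $P(c|d)$ and $P(w|c)$ tables as in the formula above, without subtracting normalization constraints.

```python
def plsa_parameter_count(n_topics, n_docs, n_words):
    """Number of table entries in the asymmetric model: c*d + w*c."""
    doc_topic_entries = n_topics * n_docs    # P(c|d): one row per document
    topic_word_entries = n_words * n_topics  # P(w|c): one row per topic
    return doc_topic_entries + topic_word_entries

n_topics, n_words = 10, 5000  # illustrative sizes only
sizes = {n_docs: plsa_parameter_count(n_topics, n_docs, n_words)
         for n_docs in (1000, 2000, 3000)}
print(sizes)  # {1000: 60000, 2000: 70000, 3000: 80000}
```

Each additional document adds `n_topics` entries, which is the linear growth described above.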

The parameters of the model are learned with the expectation–maximization (EM) algorithm.
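A minimal sketch of how EM might be applied to the asymmetric formulation is given below. It assumes a small dense count table `counts[d][w]` in which every document contains at least one word, random initialization, and a fixed number of iterations; the function names and the toy data are illustrative and do not come from any particular library or from the original papers. The E-step computes the posterior $P(c|d,w)\propto P(c|d)P(w|c)$, and the M-step re-estimates $P(c|d)$ and $P(w|c)$ from the counts weighted by that posterior.

```python
import math
import random

def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def log_likelihood(counts, p_c_given_d, p_w_given_c):
    """Sum over (d, w) of n(d,w) * log sum_c P(c|d) P(w|c)."""
    n_topics = len(p_w_given_c)
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(p_c_given_d[d][c] * p_w_given_c[c][w] for c in range(n_topics))
                ll += n * math.log(p)
    return ll

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(c|d) and P(w|c) by EM; assumes every document has at least one word."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_c_given_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
                   for _ in range(n_docs)]
    p_w_given_c = [normalize([rng.random() + 0.1 for _ in range(n_words)])
                   for _ in range(n_topics)]
    history = []
    for _ in range(n_iter):
        history.append(log_likelihood(counts, p_c_given_d, p_w_given_c))
        new_cd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wc = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior over topics for this (d, w) pair.
                post = normalize([p_c_given_d[d][c] * p_w_given_c[c][w]
                                  for c in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for c in range(n_topics):
                    new_cd[d][c] += n * post[c]
                    new_wc[c][w] += n * post[c]
        # M-step: renormalize the expected counts.
        p_c_given_d = [normalize(row) for row in new_cd]
        p_w_given_c = [normalize(row) for row in new_wc]
    history.append(log_likelihood(counts, p_c_given_d, p_w_given_c))
    return p_c_given_d, p_w_given_c, history

# Toy document-word counts (4 documents, 4 words), invented for illustration.
counts = [[4, 3, 0, 0],
          [5, 2, 0, 1],
          [0, 0, 3, 4],
          [0, 1, 4, 2]]
doc_topics, topic_words, history = plsa_em(counts, n_topics=2)
print(history[0], history[-1])  # the log-likelihood should not decrease
```

EM only guarantees convergence to a local optimum, so different seeds may yield different topics; in practice several restarts are often compared.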

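Applications

PLSA may be used in a discriminative setting, via Fisher kernels.^[1]

PLSA has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas.

The aspect model used in probabilistic latent semantic analysis has been reported to suffer from severe overfitting.^[2]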

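Extensions

Hierarchical extensions:
- Asymmetric: MASHA (Multinomial ASymmetric Hierarchical Analysis)^[3]
- Symmetric: HPLSA (Hierarchical Probabilistic Latent Semantic Analysis)^[4]

Generative models: the following models have been developed to address an often-criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents.
- Latent Dirichlet allocation (LDA): adds a Dirichlet prior on the per-document topic distribution

Higher-order data: although this is rarely discussed in the scientific literature, PLSA extends naturally to higher-order data (three modes or more), that is, it can model co-occurrences over three or more variables. In the symmetric formulation above, this is done simply by adding conditional probability distributions for the additional variables. This is the probabilistic analogue of non-negative tensor factorization.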
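As an illustration of the higher-order case, suppose a third discrete variable $z$ is observed alongside each word and document (for example an author or a time stamp; $z$ is a hypothetical symbol introduced here, not part of the model above). Under the same conditional-independence assumption, the symmetric formulation would become:

```latex
P(w,d,z)=\sum_{c}P(c)\,P(d|c)\,P(w|c)\,P(z|c)
```

Summing this expression over $z$ recovers the two-variable model $P(w,d)=\sum_{c}P(c)P(d|c)P(w|c)$, since $\sum_{z}P(z|c)=1$.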

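History

PLSA is an instance of a latent class model (see the references therein), and it is related to non-negative matrix factorization.^[5]^[6] The present terminology was coined by Thomas Hofmann in 1999.^[7]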

See also

Vector space model

References

[1] Thomas Hofmann, Learning the Similarity of Documents: an information-geometric approach to document retrieval and categorization, Advances in Neural Information Processing Systems 12, pp. 914–920, MIT Press, 2000
[2] David M. Blei, Andrew Y. Ng and Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3, pp. 993–1022, 2003. doi:10.1162/jmlr.2003.3.4-5.993
[3] Alexei Vinokourov and Mark Girolami, A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections, in Information Processing and Management, 2002
[4] Eric Gaussier, Cyril Goutte, Kris Popat and Francine Chen, A Hierarchical Model for Clustering and Categorising Documents, in "Advances in Information Retrieval -- Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (ECIR-02)", 2002
[5] Chris Ding, Tao Li and Wei Peng, Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence, Chi-Square Statistic, and a Hybrid Method, AAAI 2006
[6] Chris Ding, Tao Li and Wei Peng, On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing, 2008
[7] Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999
