Probabilistic latent semantic analysis

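Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval), is a statistical technique for analyzing two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved.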

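Unlike standard latent semantic analysis, which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model.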

Model

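Given observations in the form of co-occurrences $(w,d)$ of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions: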

$$P(w,d)=\sum_{c}P(c)P(d|c)P(w|c)=P(d)\sum_{c}P(c|d)P(w|c)$$

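where $c$ is the word's topic. Note that the number of topics is a hyperparameter that must be chosen in advance; it is not estimated from the data. The first formulation is the symmetric one, in which $w$ and $d$ are both generated from the latent class $c$ in similar ways (using the conditional probabilities $P(d|c)$ and $P(w|c)$). The second formulation is the asymmetric one: for each document $d$, a latent class $c$ is chosen conditionally on the document according to $P(c|d)$, and a word is then generated from that class according to $P(w|c)$. Although this example uses words and documents, the co-occurrence of any pair of discrete variables can be modelled in exactly the same way.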
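The two formulations agree because of Bayes' rule, $P(c)P(d|c)=P(d)P(c|d)$. The following is a minimal NumPy sketch that checks this numerically. The sizes, the random parameter tables, and all variable names are assumptions made for illustration, not part of the model description.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_docs, n_words = 3, 4, 5  # hypothetical sizes

# Symmetric parameterization: P(c), P(d|c), P(w|c), drawn at random for illustration
p_c = rng.dirichlet(np.ones(n_topics))                        # shape (C,)
p_d_given_c = rng.dirichlet(np.ones(n_docs), size=n_topics)   # shape (C, D), rows sum to 1
p_w_given_c = rng.dirichlet(np.ones(n_words), size=n_topics)  # shape (C, W), rows sum to 1

# Symmetric form: P(w,d) = sum_c P(c) P(d|c) P(w|c)
joint_sym = np.einsum("c,cd,cw->wd", p_c, p_d_given_c, p_w_given_c)

# Asymmetric form: P(d) = sum_c P(c) P(d|c), and P(c|d) = P(c) P(d|c) / P(d)
p_d = p_c @ p_d_given_c                                      # shape (D,)
p_c_given_d = (p_c[:, None] * p_d_given_c) / p_d[None, :]    # shape (C, D)
joint_asym = p_d[None, :] * np.einsum("cd,cw->wd", p_c_given_d, p_w_given_c)

print(np.allclose(joint_sym, joint_asym))  # the two forms give the same joint table
```

Both arrays hold the full joint table $P(w,d)$, so each also sums to 1 over all word-document pairs.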

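The number of model parameters is therefore equal to $cd+wc$ (where, in this expression, $c$, $d$ and $w$ denote the numbers of topics, documents and words), so it grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents.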
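As a small worked example of this linear growth, the sketch below evaluates $cd+wc$ for a few collection sizes. The function name and the sizes are hypothetical, and the count covers table entries only, ignoring the normalization constraints on each distribution.

```python
def plsa_param_count(n_topics: int, n_docs: int, n_words: int) -> int:
    """Entries in the P(c|d) and P(w|c) tables, following the expression cd + wc."""
    return n_topics * n_docs + n_words * n_topics

# Hypothetical setting: 10 topics, 5,000-word vocabulary, growing document collection
for n_docs in (1_000, 2_000, 4_000):
    print(n_docs, plsa_param_count(n_topics=10, n_docs=n_docs, n_words=5_000))
# prints:
# 1000 60000
# 2000 70000
# 4000 90000
```

Each additional document adds one row of $P(c|d)$, that is, `n_topics` further entries, while the $P(w|c)$ table stays the same size.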

The model parameters are learned using the expectation-maximization (EM) algorithm.
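One way EM for the asymmetric formulation could be written is sketched below, assuming a dense count matrix `counts` of shape (documents, words) in which every document has at least one nonzero count. The function name `plsa_em`, its arguments, the random initialization, and the small `eps` guard are illustrative choices, not taken from the source. The E-step computes $P(c|d,w)\propto P(c|d)P(w|c)$, and the M-step re-estimates $P(w|c)$ and $P(c|d)$ from the count-weighted responsibilities.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit asymmetric PLSA by EM; counts is a (D, W) array of co-occurrence counts."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_c_given_d = rng.dirichlet(np.ones(n_topics), size=n_docs)   # (D, C)
    p_w_given_c = rng.dirichlet(np.ones(n_words), size=n_topics)  # (C, W)
    eps = 1e-12  # guards against division by zero and log(0)
    log_liks = []
    for _ in range(n_iter):
        # E-step: P(c|d,w) is proportional to P(c|d) P(w|c); joint has shape (D, C, W)
        joint = p_c_given_d[:, :, None] * p_w_given_c[None, :, :]
        mix = joint.sum(axis=1)  # (D, W): sum_c P(c|d) P(w|c) = P(w|d)
        # Log-likelihood of the current parameters (up to the constant P(d) term)
        log_liks.append(float((counts * np.log(mix + eps)).sum()))
        resp = joint / (mix[:, None, :] + eps)
        # M-step: re-estimate both tables from count-weighted responsibilities
        weighted = counts[:, None, :] * resp  # (D, C, W)
        p_w_given_c = weighted.sum(axis=0)
        p_w_given_c /= p_w_given_c.sum(axis=1, keepdims=True) + eps
        p_c_given_d = weighted.sum(axis=2)
        p_c_given_d /= p_c_given_d.sum(axis=1, keepdims=True) + eps
    return p_c_given_d, p_w_given_c, log_liks

# Toy usage: two groups of documents that use mostly disjoint vocabulary
counts = np.array([[4, 3, 0, 0],
                   [5, 2, 1, 0],
                   [0, 0, 3, 6],
                   [0, 1, 4, 5]], dtype=float)
p_c_given_d, p_w_given_c, log_liks = plsa_em(counts, n_topics=2)
```

The recorded log-likelihood should not decrease from one iteration to the next, up to floating-point error. The dense (D, C, W) array keeps the sketch short but would be too large for realistic vocabularies, where one would iterate over nonzero counts only.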

Applications

PLSA may be used in a discriminative setting via Fisher kernels.^[1]

PLSA has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas.

The aspect model used in probabilistic latent semantic analysis has been reported to suffer from severe overfitting.^[2]

Extensions

Hierarchical extensions:
- Asymmetric: MASHA (Multinomial ASymmetric Hierarchical Analysis)^[3]
- Symmetric: HPLSA (Hierarchical Probabilistic Latent Semantic Analysis)^[4]

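Generative models: The following models have been developed to address an often-criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents.
- Latent Dirichlet allocation (LDA): adds a Dirichlet prior on the per-document topic distribution

Higher-order data: Although this is rarely discussed in the scientific literature, PLSA extends naturally to higher-order data (three modes and higher), that is, it can model co-occurrences over three or more variables. In the symmetric formulation above, this only requires adding conditional probability distributions for the additional variables. This is the probabilistic analogue of non-negative tensor factorization.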
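As an illustration of the higher-order case, suppose a third observed discrete variable $x$ is added (the symbol $x$ is a hypothetical name introduced here, for example an author or a time stamp). Under the same conditional-independence assumption given the latent class $c$, the symmetric formulation would read:

$$P(w,d,x)=\sum_{c}P(c)P(d|c)P(w|c)P(x|c)$$

Dropping the factor $P(x|c)$, which sums to 1 over $x$, recovers the two-variable symmetric form.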

History

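PLSA is a special case of a latent class model (see the references therein), and it is related to non-negative matrix factorization.^[5]^[6] The present terminology was coined by Thomas Hofmann in 1999.^[7]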

See also

Vector space model

References

[1] Thomas Hofmann, Learning the Similarity of Documents: an information-geometric approach to document retrieval and categorization, Advances in Neural Information Processing Systems 12, pp. 914–920, MIT Press, 2000
[2] David M. Blei, Andrew Y. Ng and Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3, pp. 993–1022, 2003. doi:10.1162/jmlr.2003.3.4-5.993
[3] Alexei Vinokourov and Mark Girolami, A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections, in Information Processing and Management, 2002
[4] Eric Gaussier, Cyril Goutte, Kris Popat and Francine Chen, A Hierarchical Model for Clustering and Categorising Documents, in "Advances in Information Retrieval -- Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (ECIR-02)", 2002
[5] Chris Ding, Tao Li and Wei Peng, Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence, Chi-Square Statistic, and a Hybrid Method, AAAI 2006
[6] Chris Ding, Tao Li and Wei Peng, On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing, 2008
[7] Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999
