How to print the LDA topic models from gensim? Python

Question

Using gensim, I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA models?

When printing lda.print_topics(10), the code raised the following error, because print_topics() returned None:

    Traceback (most recent call last):
      File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
        for top in lda.print_topics(2):
    TypeError: 'NoneType' object is not iterable

The code:

    from gensim import corpora, models, similarities
    from gensim.models import hdpmodel, ldamodel
    from itertools import izip

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in documents]

    # remove words that appear only once
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
    texts = [[word for word in text if word not in tokens_once]
             for text in texts]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # I can print out the topics for LSA
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
    corpus_lsi = lsi[corpus]

    for l, t in izip(corpus_lsi, corpus):
        print l, "#", t
    print
    for top in lsi.print_topics(2):
        print top

    # I can print out the documents and the most probable topics for each doc.
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    for l, t in izip(corpus_lda, corpus):
        print l, "#", t
    print

    # But I am unable to print out the topics, how should I do it?
    for top in lda.print_topics(10):
        print top

alvas · Answer

After some messing around, it seems that print_topics(numoftopics) for the ldamodel has a bug. So my workaround is to use print_topic(topicid):

    >>> print lda.print_topics()
    None
    >>> for i in range(lda.num_topics):
    ...     print lda.print_topic(i)
    ...
    0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
    ...
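The workaround above can be wrapped in a small helper. This is a minimal sketch, assuming a trained model object that exposes `num_topics` and `print_topic(topic_id)` the way gensim's LdaModel does. The helper name `topic_strings` and the stand-in class `FakeLda` are hypothetical and exist only so the snippet runs without a trained model.

```python
def topic_strings(model):
    """Return one formatted string per topic, for every topic id."""
    # range(model.num_topics) covers ids 0 .. num_topics - 1 inclusive.
    return [model.print_topic(topic_id) for topic_id in range(model.num_topics)]


class FakeLda:
    """Hypothetical stand-in for a trained model (not part of gensim)."""
    num_topics = 3

    def print_topic(self, topic_id):
        return "0.5*word{0} + 0.5*term{0}".format(topic_id)


for line in topic_strings(FakeLda()):
    print(line)
```

With a real model, `topic_strings(lda)` should give one string per topic, including the last one.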

user2597000 · Answer

I think the syntax of show_topics has changed over time:

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

For num_topics number of topics, return the num_words most significant words (10 words per topic, by default).

The topics are returned as a list: a list of strings if formatted is True, or a list of (probability, word) 2-tuples if False.

If log is True, also output this result to the log.

Unlike LSA, there is no natural ordering between the topics in LDA. The returned subset of num_topics <= self.num_topics topics is therefore arbitrary and may change between two LDA training runs.
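The two return shapes carry the same information. As a rough illustration, the sketch below turns a formatted topic string into (probability, word) pairs. It assumes the `weight*word + weight*word` layout seen in the output earlier in this thread. The function name `parse_topic_string` is hypothetical, not a gensim API.

```python
def parse_topic_string(formatted_topic):
    """Split '0.083*response + 0.083*time' into [(0.083, 'response'), ...]."""
    pairs = []
    for term in formatted_topic.split(" + "):
        weight, word = term.split("*", 1)
        # Some gensim versions wrap each word in double quotes; strip them.
        pairs.append((float(weight), word.strip().strip('"')))
    return pairs


example = '0.083*response + 0.083*interface + 0.083*"time"'
print(parse_topic_string(example))
```

In practice it is usually simpler to pass formatted=False and skip the parsing, but the exact tuple layout depends on the gensim version, so it is worth checking on a single topic first.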

zanbri · Answer

Are you using any logging? print_topics prints to the log file, as stated in the docs.

As @mac389 says, lda.show_topics() is the way to go to print to the screen.
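If the topics only go to the log, the log has to be routed somewhere visible. The sketch below uses only the standard logging module and assumes gensim's modules log under names starting with "gensim" (for example "gensim.models.ldamodel"). The helper name `log_gensim_to` is hypothetical, and the logged message is simulated rather than produced by a real model.

```python
import io
import logging


def log_gensim_to(stream, level=logging.INFO):
    """Attach a handler so INFO messages from 'gensim.*' loggers reach `stream`."""
    gensim_logger = logging.getLogger("gensim")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s : %(message)s"))
    gensim_logger.addHandler(handler)
    gensim_logger.setLevel(level)
    return handler


buffer = io.StringIO()
log_gensim_to(buffer)
# Simulate the kind of message a model module would log (illustrative text only).
logging.getLogger("gensim.models.ldamodel").info("topic #0: 0.083*response + 0.083*time")
print(buffer.getvalue(), end="")
```

To send the messages to the console instead, pass `sys.stderr` as the stream, or call `logging.basicConfig(level=logging.INFO)` once before training the model.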

xu2mao · Answer

You can use:

    for i in lda_model.show_topics():
        print i[0], i[1]

Shirish Kumar · Answer

Here is some sample code to print the topics:

    import pickle

    from gensim import corpora, models


    def ExtractTopics(filename, numTopics=5):
        # filename is a pickle file holding a list of lists, each a bag of words
        with open(filename, "rb") as f:
            texts = pickle.load(f)

        # generate the dictionary (named `dictionary` to avoid shadowing the built-in dict)
        dictionary = corpora.Dictionary(texts)

        # remove words with low document frequency; 3 is an arbitrary threshold I picked here
        low_occurrence_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq <= 3]
        dictionary.filter_tokens(low_occurrence_ids)
        dictionary.compactify()
        corpus = [dictionary.doc2bow(t) for t in texts]

        # generate the LDA model
        lda = models.ldamodel.LdaModel(corpus, num_topics=numTopics)

        # print the topics; without id2word the model returns token ids, so map them back to words
        i = 0
        for topic in lda.show_topics(num_topics=numTopics, formatted=False, topn=20):
            i = i + 1
            print "Topic #" + str(i) + ":",
            for p, id in topic:
                print dictionary[int(id)],
            print ""

Nde Samuel Mbah · Answer

I think it is always more helpful to see the topics as a list of words. The following code snippet helps achieve that goal. I assume you already have an LDA model called lda_model.

    for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
        print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

In the code above, I have decided to show the first 30 words belonging to each topic. For simplicity, I show only the first two topics I get.

    Topic: 0
    Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
    Topic: 1
    Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

I don't really like the way the topics above look, so I usually modify my code as shown:

    for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
        print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic])))

... and the output (first 2 topics shown) will look like this:

    Topic: 0
    Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental
    Topic: 1
    Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head

Maneet · Answer

Recently, I came across a similar issue while working with Python 3 and Gensim 2.3.0. print_topics() and show_topics() were not giving any error, but they were not printing anything either. It turns out that show_topics() returns a list. So one can simply do:

    # lda_model is your trained LdaModel
    topic_list = lda_model.show_topics()
    print(topic_list)

Feng Mai · Answer

You can also export the top words from each topic to a CSV file. topn controls how many words under each topic to export.

    import pandas as pd

    top_words_per_topic = []
    for t in range(lda_model.num_topics):
        top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=5)])

    pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv")

The CSV file has the following format:

    Topic  Word  P
    0      w1    0.004437
    0      w2    0.003553
    0      w3    0.002953
    0      w4    0.002866
    0      w5    0.008813
    1      w6    0.003393
    1      w7    0.003289
    1      w8    0.003197
    ...
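If pandas is not available, the same export can be sketched with the standard csv module. This assumes, as in the snippet above, that show_topic(t, topn=...) yields (word, probability) pairs. The helper name `write_top_words` and the stand-in class `FakeLda` are hypothetical and only there to make the example runnable without a trained model.

```python
import csv
import io


def write_top_words(model, out_file, topn=5):
    """Write one (Topic, Word, P) row per top word of every topic."""
    writer = csv.writer(out_file)
    writer.writerow(["Topic", "Word", "P"])
    for topic_id in range(model.num_topics):
        # Assumes show_topic yields (word, probability) pairs.
        for word, prob in model.show_topic(topic_id, topn=topn):
            writer.writerow([topic_id, word, prob])


class FakeLda:
    """Hypothetical stand-in for a trained model (not part of gensim)."""
    num_topics = 2

    def show_topic(self, topic_id, topn=10):
        words = [("w%d" % (topic_id * 10 + k), 0.5 / (k + 1)) for k in range(3)]
        return words[:topn]


out = io.StringIO()
write_top_words(FakeLda(), out, topn=2)
print(out.getvalue())
```

To write to disk, open the target with `open("top_words.csv", "w", newline="")` and pass that file object instead of the StringIO buffer.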

Shivom Sharma · Answer

This code works fine, but I want to know the topic name instead of Topic: 0 and Topic: 1. How do I know which topic this word belongs to?

    for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
        print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

    Topic: 0
    Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
    Topic: 1
    Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']
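As far as I can tell, LDA only gives integer topic ids, not names, so any label has to be built by hand. One possible approach, sketched below, is to label each topic with its top few words and to build a reverse lookup from word to topic ids. The input is assumed to have the shape used in the loop above, a list of (topic_id, [(word, probability), ...]) pairs. The function names are hypothetical, and the sample probabilities are made up for illustration.

```python
from collections import defaultdict


def label_topics(topics, label_words=3):
    """Build a readable label per topic from its highest-ranked words."""
    return {tid: "_".join(w for w, _ in pairs[:label_words]) for tid, pairs in topics}


def topics_by_word(topics):
    """Map each word to the list of topic ids whose top words include it."""
    index = defaultdict(list)
    for tid, pairs in topics:
        for word, _ in pairs:
            index[word].append(tid)
    return dict(index)


# Hand-written sample in the assumed shape; the probabilities are invented.
topics = [
    (0, [("associate", 0.05), ("incident", 0.04), ("time", 0.03)]),
    (1, [("work", 0.06), ("associate", 0.05), ("cage", 0.02)]),
]
print(label_topics(topics, label_words=2))
print(topics_by_word(topics)["associate"])
```

A word can appear among the top words of several topics, which is why the lookup returns a list rather than a single id.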