
Simple implementation of N-Gram, tf-idf and cosine similarity in Python

I need to compare documents stored in a database and come up with a similarity score between 0 and 1.

The method I use needs to be very simple: a vanilla implementation of n-grams (where the number of grams can be configured), together with simple implementations of tf-idf and cosine similarity.

Is there a program that can do this, or should I start writing it from scratch?

52
seanieb

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For cosine similarity:


import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|. (Despite the name, higher values mean more similar.)
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
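
For example, with two hand-built count vectors over a shared vocabulary (the vectors here are made up purely for illustration):

import numpy

doc1 = numpy.array([1, 1, 0, 1])
doc2 = numpy.array([1, 2, 1, 0])
print(cosine_distance(doc1, doc2))  # ~0.71; identical vectors give 1.0, disjoint ones 0.0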

For the ngrams:


from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{Tuple}s
    """

    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)] 

For tf-idf you will first have to compute the term distribution. I use Lucene for that, but you could just as well do something similar with NLTK, using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
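
If you want to stay in pure NLTK, here is a rough sketch of that idea. FreqDist and its freq() method are real NLTK APIs; the tiny corpus and the tf/idf helpers (and their exact weighting) are illustrative assumptions:

from math import log
from nltk import FreqDist

docs = [['this', 'is', 'a', 'foo', 'bar'],
        ['foo', 'bar', 'bar', 'black', 'sheep']]

dists = [FreqDist(doc) for doc in docs]  # per-document term distributions

def tf(term, dist):
    return dist.freq(term)  # relative frequency within one document

def idf(term, dists):
    df = sum(1 for d in dists if term in d)  # number of docs containing the term
    return log(float(len(dists)) / df) if df else 0.0

print(tf('sheep', dists[1]) * idf('sheep', dists))  # ~0.139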

If you like PyLucene, this will show you how to compute tf.idf:

# assumes an open reader/searcher plus a few preexisting objects: sim (a
# Similarity instance), tmap (a term-to-id dict), fieldname, max_doc and
# maxtokensperdoc
# reader = lucene.IndexReader(FSDirectory.open(index_loc))
docs = reader.numDocs()
for i in xrange(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()]) ** 0.5
            for k, v in rec.items():
                rec[k] = v / denom
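
Once each rec has been cosine-normalized like this, the cosine similarity between two documents reduces to a dot product over their shared terms. A minimal sketch, assuming rec_a and rec_b are two of the normalized dictionaries built above:

def sparse_cosine(rec_a, rec_b):
    # both vectors already have unit length, so the dot product is the cosine
    return sum(w * rec_b[t] for t, w in rec_a.items() if t in rec_b)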
50
roman

If you are interested, I wrote a tutorial series (Part I and Part II) about tf-idf using scikit-learn (sklearn) in Python.

One of the parts covers cosine similarity.
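
If you go the scikit-learn route, the whole pipeline from the question (n-grams, tf-idf, cosine) collapses into a few lines. A minimal sketch using the real TfidfVectorizer and cosine_similarity APIs; the documents are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["this is a foo bar",
        "foo bar bar black sheep",
        "this is a sentence"]

# ngram_range=(1, 2) uses unigrams and bigrams; adjust to taste
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# pairwise cosine similarity matrix, values in [0, 1]
print(cosine_similarity(tfidf))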

25
Tarantula

Here is an answer with just python + numpy, in short:

Cosine:

def cosine_sim(u,v):
    return np.dot(u,v) / (sqrt(np.dot(u,u)) * sqrt(np.dot(v,v)))

Ngrams:

def ngrams(sentence, n):
    return zip(*[sentence.split()[i:] for i in range(n)])

TF-IDF (it's a little weird, but it works):

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), 
    ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), 
    ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 
    'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300], 
    [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0], 
    [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try: 
            return float(len(matrix)) /sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k: v for k, v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix,doc,term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc in range(len(tfidf))]

Here is the long answer, with tests:

import numpy as np
from math import sqrt, log
from itertools import chain, product
from collections import defaultdict

def cosine_sim(u,v):
    return np.dot(u,v) / (sqrt(np.dot(u,u)) * sqrt(np.dot(v,v)))

def ngrams(sentence, n):
    return zip(*[sentence.split()[i:] for i in range(n)])

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), 
    ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), 
    ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 
    'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300], 
    [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0], 
    [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try: 
            return float(len(matrix)) /sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k: v for k, v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix,doc,term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc in range(len(tfidf))]


def corpus2vectors(corpus):
    def vectorize(sentence, vocab):
        # lowercase to match the lowercased vocabulary built below
        return [sentence.lower().split().count(i) for i in vocab]
    vectorized_corpus = []
    vocab = sorted(set(chain(*[i.lower().split() for i in corpus])))
    for i in corpus:
        vectorized_corpus.append((i, vectorize(i, vocab)))
    return vectorized_corpus, vocab

def create_test_corpus():
    sent1 = "this is a foo bar"
    sent2 = "foo bar bar black sheep"
    sent3 = "this is a sentence"

    all_sents = [sent1,sent2,sent3]
    corpus, vocab = corpus2vectors(all_sents)
    return corpus, vocab

def test_cosine():
    corpus, vocab = create_test_corpus()

    for sentx, senty in product(corpus, corpus):
        print sentx[0]
        print senty[0]
        print "cosine =", cosine_sim(sentx[1], senty[1])
        print

def test_ngrams():
    corpus, vocab = create_test_corpus()
    for sentx in corpus:
        print sentx[0]
        print ngrams(sentx[0],2)
        print ngrams(sentx[0],3)
        print

def test_tfidf():
    corpus, vocab = create_test_corpus()
    print corpus
    print vocab
    print tfidf(corpus, vocab)

print "Testing cosine..."
test_cosine()
print
print "Testing ngrams..."
test_ngrams()
print
print "Testing tfidf..."
test_tfidf()
print

[out]:

Testing cosine...
this is a foo bar
this is a foo bar
cosine = 1.0

this is a foo bar
foo bar bar black sheep
cosine = 0.507092552837

this is a foo bar
this is a sentence
cosine = 0.67082039325

foo bar bar black sheep
this is a foo bar
cosine = 0.507092552837

foo bar bar black sheep
foo bar bar black sheep
cosine = 1.0

foo bar bar black sheep
this is a sentence
cosine = 0.0

this is a sentence
this is a foo bar
cosine = 0.67082039325

this is a sentence
foo bar bar black sheep
cosine = 0.0

this is a sentence
this is a sentence
cosine = 1.0


Testing ngrams...
this is a foo bar
[('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')]
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')]

foo bar bar black sheep
[('foo', 'bar'), ('bar', 'bar'), ('bar', 'black'), ('black', 'sheep')]
[('foo', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep')]

this is a sentence
[('this', 'is'), ('is', 'a'), ('a', 'sentence')]
[('this', 'is', 'a'), ('is', 'a', 'sentence')]


Testing tfidf...
[('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']
[[0.30000000000000004, 0.30000000000000004, 0.0, 0.30000000000000004, 0.30000000000000004, 0.0, 0.0, 0.30000000000000004], [0.0, 0.6000000000000001, 0.6000000000000001, 0.30000000000000004, 0.0, 0.0, 0.6000000000000001, 0.0], [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]
9
alvas

In case you are still interested in this problem, I did something very similar using Lucene Java and Jython. Here are some snippets from my code.

Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)

To turn a list of ngrams into a Document:

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

To store a document in an index:

wr = IndexWriter(index_dir, NGramAnalyzer(2, 4), True,  # 2 and 4 are example n-gram bounds
                 IndexWriter.MaxFieldLength.LIMITED)
wr.addDocument(doc)

Building queries is a bit trickier, because Lucene's QueryParser expects a query language with special operators, quotation marks, etc., but it can be worked around (as partially explained here).
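
One workaround is to skip QueryParser and build the query programmatically. A sketch under the same Jython/PyLucene setup; BooleanQuery, TermQuery, Term and BooleanClause are standard Lucene classes, searcher is assumed to be an IndexSearcher opened on the index above, and query_ngrams is a hypothetical list holding the n-grams of the query document:

query = BooleanQuery()
for ng in query_ngrams:
    # SHOULD means documents match if they share any n-gram with the query
    query.add(TermQuery(Term('n-grams', ng)), BooleanClause.Occur.SHOULD)
hits = searcher.search(query, 10)  # top 10 matches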

4
Fred Foo

For our information retrieval course, we use code written by our professor in Java. Sorry, no Python port. "It is being released for educational and research purposes only under the GNU General Public License."

You can browse the documentation at http://userweb.cs.utexas.edu/~mooney/ir-course/doc/

But more specifically, check out: http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/

3
Penang