n-grams in Python: four-, five-, six-grams?

Question

I'm looking for a way to split a text into n-grams. Normally I would do something like:

    import nltk
    from nltk import bigrams

    string = "I really like python, it's pretty awesome."
    string_bigrams = bigrams(string)
    print(list(string_bigrams))

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!

alvas · Accepted Answer

Great native-Python answers have been given by other users. But here is the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module in nltk that people seldom use. It's not because n-grams are hard to read, but because training a model on n-grams where n > 3 results in a lot of data sparsity.

    from nltk import ngrams

    sentence = 'this is a foo bar sentences and i want to ngramize it'
    n = 6
    sixgrams = ngrams(sentence.split(), n)
    for grams in sixgrams:
        print(grams)
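To make the sparsity remark a bit more concrete, here is a minimal sketch in plain Python (no nltk). The helper names `ngram_counts` and `singleton_ratio` and the toy corpus are made up for illustration. It measures what fraction of the distinct n-grams occur only once; on small corpora that fraction tends to climb quickly as n grows, which is the practical problem with large n.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def singleton_ratio(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = ngram_counts(tokens, n)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus, invented for this example
tokens = ("the cat sat on the mat and the cat sat on the rug "
          "and the dog sat on the mat").split()

for n in range(1, 6):
    print(n, round(singleton_ratio(tokens, n), 2))
```

On a real corpus the exact numbers will differ, but the same helper can be used to check how sparse your own data becomes for a given n.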

inspectorG4dget · Answer

I'm surprised that this hasn't shown up yet:

    In [34]: sentence = "I really like python, it's pretty awesome.".split()
    In [35]: N = 4
    In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]
    In [37]: for gram in grams: print(gram)
    ['I', 'really', 'like', 'python,']
    ['really', 'like', 'python,', "it's"]
    ['like', 'python,', "it's", 'pretty']
    ['python,', "it's", 'pretty', 'awesome.']

M.A.Hassan · Answer

Here is another simple way to make n-grams:

    >>> import nltk
    >>> from nltk.util import ngrams
    >>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
    >>> tokenize = nltk.word_tokenize(text)
    >>> tokenize
    ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    >>> bigrams = list(ngrams(tokenize, 2))  # list() because ngrams() returns a lazy iterator in current NLTK
    >>> bigrams
    [('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
    >>> trigrams = list(ngrams(tokenize, 3))
    >>> trigrams
    [('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
    >>> fourgrams = list(ngrams(tokenize, 4))
    >>> fourgrams
    [('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

Δημητρης Παππάς · Answer

Using only nltk tools:

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    def get_ngrams(text, n):
        n_grams = ngrams(word_tokenize(text), n)
        return [' '.join(grams) for grams in n_grams]

Example output:

    >>> get_ngrams('This is the simplest text i could think of', 3)
    ['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep each n-gram as a sequence of tokens rather than a joined string, simply remove the ' '.join.

tzaman · Answer

You can easily write your own function to do this using itertools:

    from itertools import islice, tee

    s = 'spam and eggs'
    N = 3
    # tee() makes N independent copies of the sequence; islice() skips the
    # first `index` items of each copy; zip() walks the copies in lockstep.
    trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
    list(trigrams)
    # [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
    #  ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
    #  ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
    #  ('e', 'g', 'g'), ('g', 'g', 's')]
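The same tee/islice idea works on word tokens, and in fact on any iterable, including one-shot iterators such as a file being read line by line. Below is a small sketch that wraps it in a function; the name `iter_ngrams` is hypothetical, not something from itertools or nltk.

```python
from itertools import islice, tee

def iter_ngrams(iterable, n):
    """Return a lazy iterator of n-grams (tuples) over any iterable."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # n independent copies of the input
    copies = tee(iterable, n)
    # Skip the first k items of the k-th copy so the copies are staggered
    shifted = (islice(copy, k, None) for k, copy in enumerate(copies))
    # zip stops as soon as the shortest (most shifted) copy is exhausted
    return zip(*shifted)

words = "I really like python, it's pretty awesome.".split()
fourgrams = list(iter_ngrams(words, 4))
for gram in fourgrams:
    print(gram)
```

Because nothing is sliced or copied up front, this should also be usable on inputs that are too large to hold as a list, though tee does buffer up to n items internally.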

sel · Answer

Four-grams are already in NLTK; here is a piece of code that can help you with this:

    import nltk
    from nltk.collocations import QuadgramCollocationFinder

    # You should tokenize your text first
    text = "I do not like green eggs and ham, I do not like them Sam I am!"
    tokens = nltk.wordpunct_tokenize(text)
    fourgrams = QuadgramCollocationFinder.from_words(tokens)
    for fourgram, freq in fourgrams.ngram_fd.items():
        print(fourgram, freq)

Hope that helps.

Serendipity · Answer

A more elegant approach to building bigrams is Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once as-is and once offset by one element.

    string = "I really like python, it's pretty awesome."

    def find_bigrams(s):
        input_list = s.split(" ")
        return list(zip(input_list, input_list[1:]))

    def find_ngrams(s, n):
        input_list = s.split(" ")
        # n copies of the list, the i-th one offset by i elements
        return list(zip(*[input_list[i:] for i in range(n)]))

    find_bigrams(string)
    # [('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]

Nik · Answer

I have never dealt with nltk, but I did work with N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. D would give you the histogram of your N-grams.

    N = 2  # n-gram size (example value)
    D = dict()
    string = 'whatever string...'
    strparts = string.split()
    for i in range(len(strparts) - N + 1):  # +1 so the last N-gram is included
        gram = tuple(strparts[i:i+N])
        try:
            D[gram] += 1
        except KeyError:
            D[gram] = 1
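The same histogram can be built with collections.Counter from the standard library, which removes the need for the try/except. This is a sketch rather than a drop-in replacement; the function name `ngram_histogram` is my own.

```python
from collections import Counter

def ngram_histogram(text, n):
    """Map each word n-gram (as a tuple) to its number of occurrences."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

hist = ngram_histogram("to be or not to be", 2)
# most_common(k) lists the k most frequent n-grams with their counts
print(hist.most_common(1))
```

A Counter also returns 0 for n-grams it has never seen, which is convenient when looking up arbitrary tuples.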

bhatman · Answer

People have already answered pretty well for the scenario where you need bigrams or trigrams, but if you need every n-gram for the sentence, you can use nltk.util.everygrams:

    >>> from nltk.util import everygrams
    >>> message = "who let the dogs out"
    >>> msg_split = message.split()
    >>> list(everygrams(msg_split))
    [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

If you have a limit, as in the case of trigrams where the maximum length should be 3, you can use the max_len parameter to specify it.

    >>> list(everygrams(msg_split, max_len=2))
    [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

You can simply change the max_len parameter to get n-grams up to any length: four-grams, five-grams, six-grams or even hundred-grams.

The previously mentioned solutions can be adapted to do the same thing, but this one is much more straightforward.

For further reading, click here.

And when you just need n-grams of one specific size, such as bigrams or trigrams, you can use nltk.util.ngrams as mentioned in M.A.Hassan's answer.
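For anyone who wants the "every n-gram up to some length" behaviour without installing nltk, here is a rough pure-Python sketch. The function `all_ngrams` and its `min_len`/`max_len` parameters are my own names, chosen to resemble the max_len parameter above; it is not nltk's implementation and does not handle padding.

```python
def all_ngrams(tokens, min_len=1, max_len=None):
    """Yield every n-gram with min_len <= n <= max_len, shortest first."""
    tokens = list(tokens)
    if max_len is None:
        max_len = len(tokens)  # default: up to the whole sequence
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

msg_split = "who let the dogs out".split()

# Only the trigrams and four-grams
only_3_to_4 = list(all_ngrams(msg_split, min_len=3, max_len=4))
for gram in only_3_to_4:
    print(gram)
```

The n-grams come out grouped by length, shortest first; code that depends on a particular ordering should sort the result explicitly.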

Yann Dubois · Answer

If efficiency is an issue and you have to build n-grams for several different values of n (up to a hundred, as you say), but you want to use pure Python, I would do:

    from itertools import chain

    def n_grams(seq, n=1):
        """Returns an iterator over the n-grams given a listTokens"""
        # shiftToken(i) yields seq without its first i elements
        shiftToken = lambda i: (el for j, el in enumerate(seq) if j >= i)
        shiftedTokens = (shiftToken(i) for i in range(n))
        tupleNGrams = zip(*shiftedTokens)
        return tupleNGrams  # if join in generator: (" ".join(i) for i in tupleNGrams)

    def range_ngrams(listTokens, ngramRange=(1,2)):
        """Returns an iterator over all n-grams for n in range(*ngramRange) given a listTokens."""
        return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage:

    >>> input_list = 'test the ngrams generator'.split()
    >>> list(range_ngrams(input_list, ngramRange=(1,4)))  # upper bound is exclusive, as in range()
    [('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

Roughly the same speed as NLTK:

    import nltk

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=5)
    # 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    n_grams(input_list,n=5)
    # 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=1)
    nltk.ngrams(input_list,n=2)
    nltk.ngrams(input_list,n=3)
    nltk.ngrams(input_list,n=4)
    nltk.ngrams(input_list,n=5)
    # 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    range_ngrams(input_list, ngramRange=(1,6))
    # 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
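One thing to keep in mind when timing n-gram functions: in Python 3, zip returns a lazy iterator, so calling a zip-based function such as n_grams does very little work until the result is actually iterated. A comparison that consumes the iterators may be more informative. The sketch below uses the standard timeit module outside IPython and compares two pure-Python variants; the names `n_grams_zip`, `n_grams_slice` and `consume` are mine, and no timings are quoted because they depend on the machine.

```python
import timeit
from collections import deque

def n_grams_zip(seq, n):
    """n-grams via zip over n shifted copies of the list."""
    return zip(*(seq[i:] for i in range(n)))

def n_grams_slice(seq, n):
    """n-grams via one slice per starting position."""
    return (tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def consume(iterator):
    """Exhaust an iterator without storing its items."""
    deque(iterator, maxlen=0)

# Split into word tokens first, so that we time word n-grams
tokens = "test the ngrams iterator vs slicing".split() * 1000

for func in (n_grams_zip, n_grams_slice):
    seconds = timeit.timeit(lambda: consume(func(tokens, 5)), number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```

The same `consume` wrapper could be put around nltk.ngrams to include it in the comparison.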

Reposted from my previous answer.

Franck Dernoncourt · Answer

You can use sklearn.feature_extraction.text.CountVectorizer:

    import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html

    ngram_size = 4
    string = ["I really like python, it's pretty awesome."]
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
    vect.fit(string)
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    print('{1}-grams: {0}'.format(vect.get_feature_names_out().tolist(), ngram_size))

Output:

    4-grams: ['like python it pretty', 'python it pretty awesome', 'really like python it']

You can set ngram_size to any positive integer, i.e. you can split a text into four-grams, five-grams or even hundred-grams.

Daniel Pérez Rada · Answer

Nltk is great, but it is sometimes overkill for some projects:

    import re

    def tokenize(text, ngrams=1):
        # Replace punctuation and whitespace characters with spaces
        text = re.sub(r'[\b\\"\'/\s+\,\.:\?;]', ' ', text)
        # Collapse runs of whitespace
        text = re.sub(r'\s+', ' ', text)
        tokens = text.split()
        return [tuple(tokens[i:i+ngrams]) for i in range(len(tokens)-ngrams+1)]

Usage example:

    >>> text = "This is an example text"
    >>> tokenize(text, 2)
    [('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
    >>> tokenize(text, 3)
    [('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]

Joe Zhow · Answer

You can get all 4- to 6-grams using the code below, without any other package:

    from itertools import chain

    def get_m_2_ngrams(input_list, min_n, max_n):
        # Chain the n-grams for every n from min_n to max_n (inclusive)
        for s in chain(*[get_ngrams(input_list, k) for k in range(min_n, max_n + 1)]):
            yield ' '.join(s)

    def get_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])

    if __name__ == '__main__':
        input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
        for s in get_m_2_ngrams(input_list, 4, 6):
            print(s)

The output is below:

    I am aware that
    am aware that nltk
    aware that nltk only
    that nltk only offers
    nltk only offers bigrams
    only offers bigrams and
    offers bigrams and trigrams
    bigrams and trigrams ,
    and trigrams , but
    trigrams , but is
    , but is there
    but is there a
    is there a way
    there a way to
    a way to split
    way to split my
    to split my text
    split my text in
    my text in four-grams
    text in four-grams ,
    in four-grams , five-grams
    four-grams , five-grams or
    , five-grams or even
    five-grams or even hundred-grams
    I am aware that nltk
    am aware that nltk only
    aware that nltk only offers
    that nltk only offers bigrams
    nltk only offers bigrams and
    only offers bigrams and trigrams
    offers bigrams and trigrams ,
    bigrams and trigrams , but
    and trigrams , but is
    trigrams , but is there
    , but is there a
    but is there a way
    is there a way to
    there a way to split
    a way to split my
    way to split my text
    to split my text in
    split my text in four-grams
    my text in four-grams ,
    text in four-grams , five-grams
    in four-grams , five-grams or
    four-grams , five-grams or even
    , five-grams or even hundred-grams
    I am aware that nltk only
    am aware that nltk only offers
    aware that nltk only offers bigrams
    that nltk only offers bigrams and
    nltk only offers bigrams and trigrams
    only offers bigrams and trigrams ,
    offers bigrams and trigrams , but
    bigrams and trigrams , but is
    and trigrams , but is there
    trigrams , but is there a
    , but is there a way
    but is there a way to
    is there a way to split
    there a way to split my
    a way to split my text
    way to split my text in
    to split my text in four-grams
    split my text in four-grams ,
    my text in four-grams , five-grams
    text in four-grams , five-grams or
    in four-grams , five-grams or even
    four-grams , five-grams or even hundred-grams

You can find more details in this blog.