Python NLTK: Bigrams trigrams fourgrams

Question

I have this example and I want to know how to get this result. I have some text, I tokenize it, and then I collect the bigrams, trigrams, and fourgrams:

import nltk
from nltk import word_tokenize
from nltk.util import ngrams

text = "Hi How are you? i am fine and you"
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)

bigrams: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

trigrams=ngrams(token,3)

trigrams: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

Given bigrams [(a,b), (b,c), (c,d)] and trigrams [(a,b,c), (b,c,d), (c,d,f)], I want the new trigram list to be [(c,d,f)], which means newtrigram = [('are', 'you', '?'), ('?', 'i', 'am'), ...] and so on.

Any ideas would be helpful.

prooffreader · Accepted Answer

If you apply some set theory (assuming I'm interpreting your question correctly), you'll see that the trigrams you want are simply the slices [2:5], [4:7], [6:9], etc. of the token list.

You can generate them like this:

>>> new_trigrams = []
>>> c = 2
>>> while c < len(token) - 2:
...     new_trigrams.append((token[c], token[c+1], token[c+2]))
...     c += 2
...
>>> print(new_trigrams)
[('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
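The same loop can be generalized to any n-gram size and step. This is only a minimal sketch: `strided_ngrams` is a hypothetical helper (not part of NLTK), and the token list is hard-coded on the assumption that it matches what the tokenizer produced for the text in the question.

```python
def strided_ngrams(tokens, n, start=0, step=1):
    """Return n-grams of `tokens` beginning at index `start`, moving `step` tokens each time."""
    # The last valid starting index is len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(start, len(tokens) - n + 1, step)]

# Hard-coded so the sketch runs without NLTK; assumed to match the question's tokens.
token = ['Hi', 'How', 'are', 'you', '?', 'i', 'am', 'fine', 'and', 'you']

new_trigrams = strided_ngrams(token, 3, start=2, step=2)
print(new_trigrams)
# [('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
```

With the defaults (`start=0`, `step=1`) the helper returns the ordinary sliding-window n-grams, so `strided_ngrams(token, 2)` should give the same nine bigrams listed in the question.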

Lewistrick · Answer

I do it like this:

def words_to_ngrams(words, n, sep=" "):
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]

This takes a list of words as input and returns a list of n-grams (for the given n), each joined by sep (a space by default).
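For illustration, here is how the function might be called. The sample words and the variable names `bigram_strings` and `trigram_strings` are assumptions made for this example, not part of the answer.

```python
def words_to_ngrams(words, n, sep=" "):
    # Join each window of n consecutive words into one string.
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "Hi How are you".split()

bigram_strings = words_to_ngrams(words, 2)
trigram_strings = words_to_ngrams(words, 3, sep="_")

print(bigram_strings)   # ['Hi How', 'How are', 'are you']
print(trigram_strings)  # ['Hi_How_are', 'How_are_you']
```

Note that this variant returns joined strings rather than tuples, which is convenient for dictionary keys or display but loses the token boundaries if a token itself contains the separator.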

alvas · Answer

Try everygrams:

from nltk import everygrams
list(everygrams('hello', 1, 5))

[out]:

[('h',), ('e',), ('l',), ('l',), ('o',), ('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o'), ('h', 'e', 'l'), ('e', 'l', 'l'), ('l', 'l', 'o'), ('h', 'e', 'l', 'l'), ('e', 'l', 'l', 'o'), ('h', 'e', 'l', 'l', 'o')]

Word tokens:

from nltk import everygrams
list(everygrams('hello Word is a fun program'.split(), 1, 5))

[out]:

[('hello',), ('Word',), ('is',), ('a',), ('fun',), ('program',), ('hello', 'Word'), ('Word', 'is'), ('is', 'a'), ('a', 'fun'), ('fun', 'program'), ('hello', 'Word', 'is'), ('Word', 'is', 'a'), ('is', 'a', 'fun'), ('a', 'fun', 'program'), ('hello', 'Word', 'is', 'a'), ('Word', 'is', 'a', 'fun'), ('is', 'a', 'fun', 'program'), ('hello', 'Word', 'is', 'a', 'fun'), ('Word', 'is', 'a', 'fun', 'program')]
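To make the idea behind this output concrete, here is a rough pure-Python sketch that collects every n-gram from a minimum to a maximum length, grouped by length as in the listings above. `everygrams_by_length` is a hypothetical name, not NLTK's implementation, and the order in which NLTK's own `everygrams` yields results may differ between versions.

```python
def everygrams_by_length(sequence, min_len=1, max_len=None):
    """Return all n-grams with min_len <= n <= max_len, shortest first."""
    items = list(sequence)
    if max_len is None:
        max_len = len(items)
    result = []
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the sequence.
        for i in range(len(items) - n + 1):
            result.append(tuple(items[i:i + n]))
    return result

grams = everygrams_by_length('hello', 1, 5)
print(len(grams))   # 15
print(grams[:5])    # [('h',), ('e',), ('l',), ('l',), ('o',)]
print(grams[-1])    # ('h', 'e', 'l', 'l', 'o')
```

Passing a list of words instead of a string, for example `'hello Word is a fun program'.split()`, gives word-level n-grams in the same way.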