
UserWarning: Your stop_words may be inconsistent with your preprocessing

I am following this document clustering tutorial. As input, I give a txt file that can be downloaded here. It is a combined file of 3 other txt files, joined using \n. After creating the tf-idf matrix I received this warning:

UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words. 'stop_words.' % sorted(inconsistent))

I guess it has something to do with the order of lemmatization and stop word removal, but since this is my first text-processing project, I am a bit lost and do not know how to fix it...

import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer


stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens


totalvocab_stemmed = []
totalvocab_tokenized = []
with open('shortResultList.txt', encoding="utf8") as synopses:
    for i in synopses:
        allwords_stemmed = tokenize_and_stem(i)  # for each item in 'synopses', tokenize/stem
        totalvocab_stemmed.extend(allwords_stemmed)  # extend the 'totalvocab_stemmed' list
        allwords_tokenized = tokenize_only(i)
        totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print ('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print (vocab_frame.head())

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

with open('shortResultList.txt', encoding="utf8") as synopses:
    tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses

print(tfidf_matrix.shape)
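The warning itself points at the mismatch: tokenizer=tokenize_and_stem stems every token, while stop_words='english' is scikit-learn's unstemmed list, so the library runs the stop words through the same tokenizer, gets forms like 'abov' and 'becaus', and cannot find them back in the list. Below is a minimal sketch, assuming one wants to keep the stemming tokenizer, that stems the built-in stop word list before passing it to the vectorizer; the name stemmed_stop_words is mine, and stemmer / tokenize_and_stem are the objects defined above.

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Stem the built-in English stop words with the same stemmer the tokenizer
# uses, so the stop list and the token stream agree.
stemmed_stop_words = sorted({stemmer.stem(w) for w in ENGLISH_STOP_WORDS})

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=stemmed_stop_words,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

Stemming is not perfectly idempotent, so a handful of entries may still be flagged, but the bulk of the warning should go away.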

I ran into this problem because of the PT-BR language.

TL;DR: Remove the accents of your language.

# Special thanks for the user Humberto Diogenes from Python List (answer from Aug 11, 2008)
# Link: http://python.6.x6.nabble.com/O-jeito-mais-rapido-de-remover-acentos-de-uma-string-td2041508.html

# I found the issue by chance (I swear, haha) but this guy gave the tip before me
# Link: https://github.com/scikit-learn/scikit-learn/issues/12897#issuecomment-518644215

import spacy
from unicodedata import normalize

nlp = spacy.load('pt_core_news_sm')

# Define default stopwords list
stoplist = spacy.lang.pt.stop_words.STOP_WORDS

def replace_ptbr_char_by_word(word):
    """Remove accented characters, token by token."""
    word = str(word)
    word = normalize('NFKD', word).encode('ASCII', 'ignore').decode('ASCII')
    return word

def remove_pt_br_char_by_text(text):
    """Remove accented characters over the entire text, dropping stop words."""
    text = str(text)
    text = " ".join(replace_ptbr_char_by_word(word) for word in text.split() if word not in stoplist)
    return text

df['text'] = df['text'].apply(remove_pt_br_char_by_text)
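As a quick sanity check on a single string (the sample sentence and the expected output are my own illustration; exactly which words get dropped depends on spaCy's Portuguese STOP_WORDS):

sample = "ação e coração"
print(remove_pt_br_char_by_text(sample))
# expected: "acao coracao" (accents stripped, and "e" dropped if it is in the stop list)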

I put the solution and the references in this Gist.

Flavio