Finding 2- and 3-word phrases using the R tm package

Question

I am trying to find code that actually works for finding the most frequently used two- and three-word phrases with the R text mining package tm (maybe there is another package for this that I do not know of). I have been trying to use the tokenizer, but seem to have had no luck.

If you have worked on a similar situation in the past, could you post code that is tested and actually works? Thank you so much!

Timothy P. Jurka · Answer

You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the tau package installed it is fairly straightforward.

    library(tm)
    library(tau)

    # Custom tokenizer: returns the n-word phrases found in x
    tokenize_ngrams <- function(x, n = 3) {
      return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))
    }

    texts <- c("This is the first document.",
               "This is the second file.",
               "This is the third text.")
    corpus <- Corpus(VectorSource(texts))
    matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))

Here n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in the RTextTools package, which simplifies things further.

    library(RTextTools)

    texts <- c("This is the first document.",
               "This is the second file.",
               "This is the third text.")
    matrix <- create_matrix(texts, ngramLength = 3)

This returns an object of class DocumentTermMatrix for use with the tm package.
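For intuition about what an n-gram tokenizer produces, and how to get from phrases to the "most frequent" ones the question asks about, here is a minimal sketch in base R only, without tm or tau. The helper name count_ngrams and its simple punctuation handling are assumptions made for illustration; they are not part of either package.

```r
# Hypothetical helper: count word n-grams across a character vector (base R only).
count_ngrams <- function(texts, n = 2) {
  grams <- lapply(texts, function(txt) {
    # Lowercase, strip punctuation, split on whitespace
    words <- strsplit(tolower(gsub("[[:punct:]]", "", txt)), "[[:space:]]+")[[1]]
    words <- words[words != ""]
    # Documents shorter than n words contribute no n-grams
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  # Tabulate all phrases and put the most frequent first
  sort(table(as.character(unlist(grams))), decreasing = TRUE)
}

texts <- c("This is the first document.",
           "This is the second file.",
           "This is the third text.")

head(count_ngrams(texts, n = 2))  # most frequent 2-word phrases
head(count_ngrams(texts, n = 3))  # most frequent 3-word phrases
```

With a real DocumentTermMatrix the same ranking would come from summing each term's column and sorting, but the sketch above avoids any package dependency.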

Ben · Answer

This is part 5 of the FAQ of the tm package:

5. Can I use bigrams instead of single tokens in a term-document matrix?

Yes. RWeka provides a tokenizer for arbitrary n-grams which can be passed directly to the term-document matrix constructor. For example:

    library("RWeka")
    library("tm")

    data("crude")

    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

    tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

    inspect(tdm[340:345, 1:10])

Patrick Perry · Answer

The corpus library has a function called term_stats that does what you want:

    library(corpus)
    corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
    text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
    term_stats(corpus, ngrams = 2:3)
    ##    term            count support
    ## 1  of the            336       1
    ## 2  the scarecrow     208       1
    ## 3  to the            185       1
    ## 4  and the           166       1
    ## 5  said the          152       1
    ## 6  in the            147       1
    ## 7  the lion          141       1
    ## 8  the tin           123       1
    ## 9  the tin woodman   114       1
    ## 10 tin woodman       114       1
    ## 11 i am               84       1
    ## 12 it was             69       1
    ## 13 in a               64       1
    ## 14 the great          63       1
    ## 15 the wicked         61       1
    ## 16 wicked witch       60       1
    ## 17 at the             59       1
    ## 18 the little         59       1
    ## 19 the wicked witch   58       1
    ## 20 back to            57       1
    ## ⋮  (52511 rows total)

Here, count is the number of appearances, and support is the number of documents containing the term.
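In the output above every support is 1, which is what you would expect if the book is loaded as a single document. To make the distinction between the two columns concrete, here is a small base R sketch on a made-up three-document example. The helper bigrams_of and the sample sentences are hypothetical; this is not how the corpus package computes its statistics, only an illustration of the two definitions.

```r
# Hypothetical helper: the 2-word phrases of a single text (base R only).
bigrams_of <- function(txt) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", txt)), "[[:space:]]+")[[1]]
  words <- words[words != ""]
  if (length(words) < 2) return(character(0))
  # Pair each word with the word that follows it
  paste(head(words, -1), tail(words, -1))
}

docs <- c("the cat sat on the mat",
          "the cat ran",
          "a dog ran to the cat and the cat hid")

per_doc <- lapply(docs, bigrams_of)

# count: total number of occurrences across all documents
count <- table(unlist(per_doc))

# support: number of documents in which the phrase occurs at least once
support <- table(unlist(lapply(per_doc, unique)))

count["the cat"]
support["the cat"]
```

Here "the cat" occurs twice in the third document, so its count is higher than its support.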

Géraud · Answer

I had a similar problem using the tm and ngram packages. After debugging mclapply, I saw that there were problems on documents with fewer than 2 words, with the following error:

 input 'x' has nwords=1 and n=2; must have nwords >= n

So I added a filter to remove documents with too low a word count:

    myCorpus.3 <- tm_filter(myCorpus.2, function (x) {
      length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
    })
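The same check can be tried on a plain character vector before wiring it into tm_filter. This is a minimal sketch in base R; the helper name has_min_words and the sample strings are assumptions for illustration, and the threshold is generalized to any n so it also covers 3-word phrases.

```r
# Hypothetical helper: TRUE when a text has at least n whitespace-separated words.
has_min_words <- function(txt, n = 2) {
  words <- strsplit(trimws(txt), "[[:blank:]]+")[[1]]
  sum(words != "") >= n
}

samples <- c("hello", "hello world", "   ", "one two three")

# Keep only the texts long enough to yield at least one bigram
keep <- unname(vapply(samples, has_min_words, logical(1), n = 2))
samples[keep]
```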

Then my tokenize function looks like:

    bigramTokenizer <- function(x) {
      x <- as.character(x)

      # Find words
      one.list <- c()
      tryCatch({
        one.gram <- ngram::ngram(x, n = 1)
        one.list <- ngram::get.ngrams(one.gram)
      },
      error = function(cond) { warning(cond) })

      # Find 2-grams
      two.list <- c()
      tryCatch({
        two.gram <- ngram::ngram(x, n = 2)
        two.list <- ngram::get.ngrams(two.gram)
      },
      error = function(cond) { warning(cond) })

      res <- unlist(c(one.list, two.list))
      res[res != '']
    }

Then you can test the function with:

dtmTest <- lapply(myCorpus.3, bigramTokenizer)

And finally:

dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))

Monika Singh · Answer

Try the tidytext package

    library(dplyr)
    library(tidytext)
    library(janeaustenr)
    library(tidyr)

Suppose I have a dataframe CommentData that contains a Comment column, and I want to find occurrences of two words together. Then try

    bigram_filtered <- CommentData %>%
      unnest_tokens(bigram, Comment, token = "ngrams", n = 2) %>%
      separate(bigram, c("Word1", "Word2"), sep = " ") %>%
      # the stop word column in tidytext's stop_words is named `word` (lowercase)
      filter(!Word1 %in% stop_words$word,
             !Word2 %in% stop_words$word) %>%
      count(Word1, Word2, sort = TRUE)

The code above creates the tokens, removes stop words that do not help the analysis (e.g. the, an, to, etc.), and then counts the occurrences of the remaining word pairs. You can then use the unite function to combine the individual words back into a single bigram column alongside their counts.

    bigrams_united <- bigram_filtered %>%
      unite(bigram, Word1, Word2, sep = " ")
    bigrams_united

Renato Lyke · Answer

Try this code.

    library(tm)
    library(SnowballC)
    library(class)
    library(wordcloud)

    keywords <- read.csv(file.choose(), header = TRUE, na.strings = c("NA", "-", "?"))
    keywords_doc <- Corpus(VectorSource(keywords$"use your column that you need"))
    keywords_doc <- tm_map(keywords_doc, removeNumbers)
    keywords_doc <- tm_map(keywords_doc, tolower)
    keywords_doc <- tm_map(keywords_doc, stripWhitespace)
    keywords_doc <- tm_map(keywords_doc, removePunctuation)
    keywords_doc <- tm_map(keywords_doc, PlainTextDocument)
    keywords_doc <- tm_map(keywords_doc, stemDocument)

This is the bigram or trigram section that you could use

    BigramTokenizer <- function(x)
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

    # creating of document matrix
    keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))

    # remove sparse terms
    keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.95)

    # Frequency of the words appearing
    keyword.freq <- rowSums(as.matrix(keywords_naremoval))
    subsetkeyword.freq <- subset(keyword.freq, keyword.freq >= 20)
    frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)

    # Sorting of the words
    frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
    frequentKeywordSubsetDF <- frequentKeywordSubsetDF[with(frequentKeywordSubsetDF, order(-frequentKeywordSubsetDF$freq)), ]
    frequentKeywordDF <- frequentKeywordDF[with(frequentKeywordDF, order(-frequentKeywordDF$freq)), ]

    # Printing of the words
    wordcloud(frequentKeywordDF$term, freq = frequentKeywordDF$freq,
              random.order = FALSE, rot.per = 0.35, scale = c(5, 0.5),
              min.freq = 30, colors = brewer.pal(8, "Dark2"))

I hope this helps. This is complete code that you could use.