How can I split a text into sentences using the Stanford parser?

Question

How can I split a text or a paragraph into sentences using the Stanford parser?

Is there a method that extracts sentences, such as the getSentencesFromString() provided for Ruby?

Kenston Choi · Accepted Answer

You can check the DocumentPreprocessor class. Below is a short snippet. There may be other ways to do what you want.

    String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
    Reader reader = new StringReader(paragraph);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    List<String> sentenceList = new ArrayList<String>();

    for (List<HasWord> sentence : dp) {
        // Use SentenceUtils, not Sentence
        String sentenceString = SentenceUtils.listToString(sentence);
        sentenceList.add(sentenceString);
    }

    for (String sentence : sentenceList) {
        System.out.println(sentence);
    }

Kevin · Answer

I know there is already an accepted answer, but typically you would just grab the SentencesAnnotation from an annotated document.

    // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // read some text in the text variable
    String text = ... // Add your text here!

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);

    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    for (CoreMap sentence : sentences) {
        // traversing the words in the current sentence
        // a CoreLabel is a CoreMap with additional token-specific methods
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            // this is the text of the token
            String word = token.get(TextAnnotation.class);
            // this is the POS tag of the token
            String pos = token.get(PartOfSpeechAnnotation.class);
            // this is the NER label of the token
            String ne = token.get(NamedEntityTagAnnotation.class);
        }
    }

Source - http://nlp.stanford.edu/software/corenlp.shtml (about halfway down the page)

If you are only looking for sentences, you can drop the later steps such as "parse" and "dcoref" from the pipeline initialization, which saves load and processing time. Rock and roll. ~K

dantiston · Answer

There are a couple of issues with the accepted answer. First, the tokenizer transforms some characters, such as the character “ into the two characters ``. Second, joining the tokenized text back together with whitespace does not return the same result as before. As a result, the example in the accepted answer transforms the input text in non-trivial ways.

However, the CoreLabel class that the tokenizer uses keeps track of the source characters each token maps to, so it is simple to rebuild the proper string if you have the original.

Approach 1 below shows the accepted answer's approach; Approach 2 shows my approach, which overcomes these issues.

    String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
    List<String> sentenceList;

    /* ** APPROACH 1 (BAD!) ** */
    Reader reader = new StringReader(paragraph);
    DocumentPreprocessor dp = new DocumentPreprocessor(reader);
    sentenceList = new ArrayList<String>();
    for (List<HasWord> sentence : dp) {
        sentenceList.add(Sentence.listToString(sentence));
    }
    System.out.println(StringUtils.join(sentenceList, " _ "));

    /* ** APPROACH 2 ** */
    //// Tokenize
    List<CoreLabel> tokens = new ArrayList<CoreLabel>();
    PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(
            new StringReader(paragraph), new CoreLabelTokenFactory(), "");
    while (tokenizer.hasNext()) {
        tokens.add(tokenizer.next());
    }

    //// Split sentences from tokens
    List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);

    //// Join back together
    int end;
    int start = 0;
    sentenceList = new ArrayList<String>();
    for (List<CoreLabel> sentence : sentences) {
        // endPosition() is the character offset just past the last token of the sentence
        end = sentence.get(sentence.size() - 1).endPosition();
        sentenceList.add(paragraph.substring(start, end).trim());
        start = end;
    }
    System.out.println(StringUtils.join(sentenceList, " _ "));

This outputs:

    My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
    My 1st sentence. _ “Does it work for questions?” _ My third sentence.
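The substring step in Approach 2 does not depend on the tokenizer itself. Here is a minimal sketch using only the JDK, assuming the end offset of each sentence is already known (for instance from endPosition() on the last token of each sentence). The class name OffsetSentenceJoin, the method cutByEndOffsets, and the hard-coded offsets 16, 46, and 65 are illustrative assumptions and not part of CoreNLP:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSentenceJoin {

    // Cuts the original text at the given sentence end offsets (exclusive),
    // so the original characters and spacing inside each sentence are preserved.
    static List<String> cutByEndOffsets(String original, int[] sentenceEnds) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int end : sentenceEnds) {
            result.add(original.substring(start, end).trim());
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        // \u201C and \u201D are the curly quotes from the example paragraph.
        String paragraph =
            "My 1st sentence. \u201CDoes it work for questions?\u201D My third sentence.";
        // Assumed offsets: in real code they would come from the tokenizer's CoreLabels.
        int[] ends = {16, 46, 65};
        System.out.println(String.join(" _ ", cutByEndOffsets(paragraph, ends)));
    }
}
```

Because the sentences are sliced out of the original string rather than rebuilt from tokens, the curly quotes and the original spacing survive unchanged, which matches the second output line above.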

Yaniv.H · Answer

Using the .NET C# package: this will split sentences, handle the parentheses correctly, and preserve the original spaces and punctuation:

    public class NlpDemo
    {
        public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(
            new CoreLabelTokenFactory(),
            "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");

        public void ParseFile(string fileName)
        {
            using (var stream = File.OpenRead(fileName))
            {
                SplitSentences(stream);
            }
        }

        public void SplitSentences(Stream stream)
        {
            var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
            preProcessor.setTokenizerFactory(TokenizerFactory);

            foreach (java.util.List sentence in preProcessor)
            {
                ProcessSentence(sentence);
            }
        }

        // Print the sentence with original spaces and punctuation.
        public void ProcessSentence(java.util.List sentence)
        {
            System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
        }
    }

Input: - This sentence's characters possess a certain charm often found in punctuation and prose. This is a second sentence? It is indeed.

Output: 3 sentences ('?' is treated as an end-of-sentence delimiter)

Note: for a sentence like "Mrs. Havisham's class was impeccable (as far as one could see!) in all aspects." the tokenizer will correctly see that the period at the end of Mrs. is not an EOS, but it will incorrectly mark the ! inside the parentheses as an EOS and split off "in all aspects." as a second sentence.

cindyxiaoxiaoli · Answer

Use the Simple API provided by Stanford CoreNLP version 3.6.0 or 3.7.0.

Here is an example with 3.6.0. It works exactly the same way with 3.7.0.

Java code snippet

    import java.util.List;

    import edu.stanford.nlp.simple.Document;
    import edu.stanford.nlp.simple.Sentence;

    public class TestSplitSentences {
        public static void main(String[] args) {
            Document doc = new Document("The text paragraph. Another sentence. Yet another sentence.");
            List<Sentence> sentences = doc.sentences();
            sentences.stream().forEach(System.out::println);
        }
    }

Yields:

The text paragraph.

Another sentence.

Yet another sentence.

pom.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>stanfordcorenlp</groupId>
        <artifactId>stanfordcorenlp</artifactId>
        <version>1.0-SNAPSHOT</version>

        <properties>
            <maven.compiler.source>1.8</maven.compiler.source>
            <maven.compiler.target>1.8</maven.compiler.target>
        </properties>

        <dependencies>
            <!-- https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp -->
            <dependency>
                <groupId>edu.stanford.nlp</groupId>
                <artifactId>stanford-corenlp</artifactId>
                <version>3.6.0</version>
            </dependency>
            <!-- https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java -->
            <dependency>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
                <version>2.6.1</version>
            </dependency>
        </dependencies>
    </project>

Delirante · Answer

You can use the Stanford tagger for this quite easily.

    String text = new String("Your text...."); // Your own text.
    List<List<HasWord>> tokenizedSentences = MaxentTagger.tokenizeText(new StringReader(text));

    for (List<HasWord> act : tokenizedSentences) { // Iterate over the sentences
        // As the accepted answer notes, recent releases use SentenceUtils instead of Sentence.
        System.out.println(edu.stanford.nlp.ling.Sentence.listToString(act)); // This is your sentence
    }

demongolem · Answer

Another point, covered only in a few downvoted answers, is how to set the sentence delimiters. The most common approach, and the default, is to rely on the usual punctuation marks that indicate the end of a sentence. Collected corpora can come in other formats, one of which puts each sentence on its own line.

To set your delimiters for DocumentPreprocessor, as in the accepted answer, you would use setSentenceDelimiter(String). To use the pipeline approach suggested in @Kevin's answer, you would work with the ssplit properties. For example, to use the end-of-line scheme described in the previous paragraph, set the property ssplit.eolonly to true.
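As a rough sketch of the pipeline variant, the properties could be assembled as follows. Only java.util.Properties is used, so the snippet runs without CoreNLP. The class name SentenceSplitConfig and the method lineBasedSplitProps are made up for illustration, and the annotator list is trimmed to tokenize and ssplit on the assumption that only sentence boundaries are needed:

```java
import java.util.Properties;

public class SentenceSplitConfig {

    // Builds pipeline properties that treat each input line as one sentence.
    static Properties lineBasedSplitProps() {
        Properties props = new Properties();
        // Assumption: only tokenization and sentence splitting are required.
        props.setProperty("annotators", "tokenize, ssplit");
        // Split sentences at line ends only, as described in the answer above.
        props.setProperty("ssplit.eolonly", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = lineBasedSplitProps();
        // With CoreNLP on the classpath, props would be handed to
        // new StanfordCoreNLP(props), as in @Kevin's answer.
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Keeping the configuration in one small method makes it easy to switch between punctuation-based and line-based splitting without touching the rest of the pipeline code.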

Rahul Shah · Answer

Set the paths of the input and output files in the code below:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Properties;

    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    public class NLPExample {
        public static void main(String[] args) throws IOException {
            // Backslashes in Windows paths must be escaped in Java string literals.
            String inputPath = "C:\\Users\\Acer\\Downloads\\stanford-corenlp-full-2018-02-27\\input.txt";
            String outputPath = "C:\\Users\\Acer\\Downloads\\stanford-corenlp-full-2018-02-27\\output.txt";

            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // A single writer for the output file; try-with-resources closes and flushes both streams.
            try (BufferedReader br = new BufferedReader(new FileReader(inputPath));
                 PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(outputPath, false)))) {
                String line;
                while ((line = br.readLine()) != null) {
                    out.println(line); // echo the raw input line
                    Annotation annotation = new Annotation(line);
                    pipeline.annotate(annotation);
                    pipeline.prettyPrint(annotation, out); // write the annotated sentences
                }
            }
            System.out.println("Done...");
        }
    }

bernie2436 · Answer

You can use the document preprocessor. It's really easy. Just feed it a filename.

    for (List<HasWord> sentence : new DocumentPreprocessor("pathto/filename.txt")) {
        // sentence is a list of words in a sentence
    }

demongolem · Answer

A variation on @Kevin's answer that addresses the question is as follows:

    for (CoreMap sentence : sentences) {
        String sentenceText = sentence.get(TextAnnotation.class);
    }

which gets you the text of each sentence without bothering with the other annotators.