Improving the extraction of human names with nltk

Question

I am trying to extract human names from text.

Does anyone have a method that they recommend?

Here is what I have tried (code is below): I use nltk to find everything tagged as a person, then build a list of all the NNP parts of that person. I skip persons that have only one NNP, which avoids grabbing a lone surname.

I am getting decent results, but I was wondering whether there are better ways to solve this problem.

Code:

    import nltk
    from nameparser.parser import HumanName

    def get_human_names(text):
        tokens = nltk.tokenize.word_tokenize(text)
        pos = nltk.pos_tag(tokens)
        sentt = nltk.ne_chunk(pos, binary = False)
        person_list = []
        person = []
        name = ""
        # t.node is the NLTK 2 attribute; NLTK 3 replaced it with t.label()
        for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
            for leaf in subtree.leaves():
                person.append(leaf[0])
            if len(person) > 1: #avoid grabbing lone surnames
                for part in person:
                    name += part + ' '
                if name[:-1] not in person_list:
                    person_list.append(name[:-1])
                name = ''
            person = []

        return (person_list)

    text = """
    Some economists have responded positively to Bitcoin, including Francois R. Velde, senior economist of the Federal Reserve in Chicago who described it as "an elegant solution to the problem of creating a digital currency." In November 2013 Richard Branson announced that Virgin Galactic would accept Bitcoin as payment, saying that he had invested in Bitcoin and found it "fascinating how a whole new global currency has been created", encouraging others to also invest in Bitcoin.
    Other economists commenting on Bitcoin have been critical. Economist Paul Krugman has suggested that the structure of the currency incentivizes hoarding and that its value derives from the expectation that others will accept it as payment. Economist Larry Summers has expressed a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market strategist for ConvergEx Group, has remarked on the effect of increasing use of Bitcoin and its restricted supply, noting, "When incremental adoption meets relatively fixed supply, it should be no surprise that prices go up. And that’s exactly what is happening to BTC prices."
    """

    names = get_human_names(text)
    print("LAST, FIRST")
    for name in names:
        last_first = HumanName(name).last + ', ' + HumanName(name).first
        print(last_first)

Output:

    LAST, FIRST
    Velde, Francois
    Branson, Richard
    Galactic, Virgin
    Krugman, Paul
    Summers, Larry
    Colas, Nick

Apart from Virgin Galactic, all of this output is valid. Of course, knowing that Virgin Galactic is not a human name in the context of this article is the hard (maybe impossible) part.

troyane · Accepted Answer

I have to agree with the suggestion that "improve my code" is not a good fit for this site, but I can give you a direction to dig into.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you must download some core files. Here is a script that can do all of that for you.

I wrote this script:

    import nltk
    from nltk.tag.stanford import NERTagger

    # Paths are relative to the folder holding the downloaded Stanford NER files
    st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
    text = """YOUR TEXT GOES HERE"""

    for sent in nltk.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)
        for tag in tags:
            if tag[1]=='PERSON': print(tag)

and got a not-so-bad result:

    ('Francois', 'PERSON')
    ('R.', 'PERSON')
    ('Velde', 'PERSON')
    ('Richard', 'PERSON')
    ('Branson', 'PERSON')
    ('Virgin', 'PERSON')
    ('Galactic', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Paul', 'PERSON')
    ('Krugman', 'PERSON')
    ('Larry', 'PERSON')
    ('Summers', 'PERSON')
    ('Bitcoin', 'PERSON')
    ('Nick', 'PERSON')
    ('Colas', 'PERSON')
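The tagger returns one tuple per token, so full names still have to be reassembled. A minimal sketch of one way to do that, assuming the tags arrive in sentence order and that adjacent PERSON tokens belong to the same name (`group_person_tokens` and the sample list are hypothetical, not part of NLTK or Stanford NER):

```python
from itertools import groupby


def group_person_tokens(tagged_tokens):
    """Join runs of adjacent PERSON-tagged tokens into full-name strings."""
    names = []
    # groupby splits the sequence into runs of consecutive pairs sharing a tag
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == "PERSON":
            names.append(" ".join(token for token, _ in run))
    return names


# Hand-written sample in the (token, tag) shape shown above; "O" marks non-entities
sample = [
    ("Francois", "PERSON"), ("R.", "PERSON"), ("Velde", "PERSON"),
    (",", "O"), ("senior", "O"), ("economist", "O"),
    ("Richard", "PERSON"), ("Branson", "PERSON"),
    ("announced", "O"),
]

print(group_person_tokens(sample))  # ['Francois R. Velde', 'Richard Branson']
```

Two different people mentioned back to back with no token between them would be merged into one name, so this is only a rough heuristic.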

Hope this is helpful.

Curtis Mattoon · Answer

For anyone else looking, I found this article useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

    >>> import nltk
    >>> def extract_entities(text):
    ...     for sent in nltk.sent_tokenize(text):
    ...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
    ...             # only named-entity subtrees have a label; plain tokens are tuples
    ...             if hasattr(chunk, 'label'):
    ...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
    ...

Viktor Vojnovski · Answer

You can try to resolve the names you find and check whether they appear in a database such as freebase.com. Get the data locally and query it (it's in RDF), or use Google's API: https://developers.google.com/freebase/v1/getting-started . Most big companies, geographical locations, etc. (which would be caught by your snippet) could then be discarded based on the Freebase data.
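The Freebase API is no longer available, but the same idea works against any local list of known entities. A minimal sketch, assuming the knowledge-base extract has already been loaded into a set of names (`drop_known_non_persons` and `KNOWN_ORGANIZATIONS` are hypothetical stand-ins, not a real API):

```python
def drop_known_non_persons(candidates, known_entities):
    """Return the candidates that do not appear in known_entities (case-insensitive)."""
    known = {name.casefold() for name in known_entities}
    return [name for name in candidates if name.casefold() not in known]


# Hypothetical stand-in for a locally stored knowledge-base extract
KNOWN_ORGANIZATIONS = {"Virgin Galactic", "Federal Reserve", "ConvergEx Group"}

candidates = ["Francois R. Velde", "Richard Branson", "Virgin Galactic", "Paul Krugman"]
print(drop_known_non_persons(candidates, KNOWN_ORGANIZATIONS))
# ['Francois R. Velde', 'Richard Branson', 'Paul Krugman']
```

The quality of the result depends entirely on how complete the entity list is for your domain.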

neel · Answer

spaCy can be a good alternative for pulling names out of text.

https://spacy.io/usage/training#ner
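A rough sketch of what that could look like, assuming spaCy is installed together with an English pipeline; the model name `en_core_web_sm` is my assumption, and the helper names are hypothetical:

```python
def person_names(doc):
    """Collect the text of every entity labelled PERSON, in first-seen order, without repeats."""
    seen = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text not in seen:
            seen.append(ent.text)
    return seen


def extract_with_spacy(text):
    # Imported here so person_names() stays usable without spaCy installed
    import spacy

    # Assumed model; install it with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    return person_names(nlp(text))
```

Calling `extract_with_spacy(text)` on the article text from the question should return a list of name strings; which spans get labelled PERSON depends on the model, so companies can still slip through.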

Shivansh bhandari · Answer

Actually, I wanted to extract only the person's name, so I thought of checking every name that comes out as output against WordNet (a large lexical database of English). More information on WordNet is available here: http://www.nltk.org/howto/wordnet.html

    import nltk
    from nameparser.parser import HumanName
    from nltk.corpus import wordnet

    person_list = []
    # person_names is the same list object as person_list (an alias, not a copy),
    # so the remove() calls at the bottom shrink the list while it is being iterated
    person_names = person_list

    def get_human_names(text):
        tokens = nltk.tokenize.word_tokenize(text)
        pos = nltk.pos_tag(tokens)
        sentt = nltk.ne_chunk(pos, binary = False)

        person = []
        name = ""
        for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
            for leaf in subtree.leaves():
                person.append(leaf[0])
            if len(person) > 1: #avoid grabbing lone surnames
                for part in person:
                    name += part + ' '
                if name[:-1] not in person_list:
                    person_list.append(name[:-1])
                name = ''
            person = []
    #    print (person_list)

    text = """
    Some economists have responded positively to Bitcoin, including Francois R. Velde, senior economist of the Federal Reserve in Chicago who described it as "an elegant solution to the problem of creating a digital currency." In November 2013 Richard Branson announced that Virgin Galactic would accept Bitcoin as payment, saying that he had invested in Bitcoin and found it "fascinating how a whole new global currency has been created", encouraging others to also invest in Bitcoin.
    Other economists commenting on Bitcoin have been critical. Economist Paul Krugman has suggested that the structure of the currency incentivizes hoarding and that its value derives from the expectation that others will accept it as payment. Economist Larry Summers has expressed a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market strategist for ConvergEx Group, has remarked on the effect of increasing use of Bitcoin and its restricted supply, noting, "When incremental adoption meets relatively fixed supply, it should be no surprise that prices go up. And that’s exactly what is happening to BTC prices."
    """

    # get_human_names() fills the module-level person_list and returns None
    names = get_human_names(text)
    for person in person_list:
        person_split = person.split(" ")
        for name in person_split:
            # any token that WordNet knows as a word disqualifies the whole name
            if wordnet.synsets(name):
                if(name in person):
                    person_names.remove(person)
                    break

    print(person_names)

OUTPUT

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

Apart from Larry Summers, all the names are correct; that one is dropped because of the surname "Summers".
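One thing to watch in the loop above: `person_names` and `person_list` refer to the same list, so entries are removed from the list while it is being iterated, and the element right after each removal is never checked. A minimal sketch of the same dictionary-word filter that builds a new list instead; the lookup is passed in as a function, and `COMMON_WORDS` is a hypothetical stand-in for `lambda w: bool(wordnet.synsets(w))`, so it does not reproduce WordNet's actual coverage:

```python
def filter_dictionary_names(names, is_dictionary_word):
    """Keep only the names in which no token is a dictionary word."""
    kept = []
    for full_name in names:
        if not any(is_dictionary_word(token) for token in full_name.split()):
            kept.append(full_name)
    return kept


# Hypothetical stand-in vocabulary; swap in a real WordNet lookup as needed
COMMON_WORDS = {"virgin", "economist", "summers"}


def looks_like_word(token):
    return token.lower() in COMMON_WORDS


names = ["Francois R. Velde", "Virgin Galactic", "Economist Paul Krugman", "Nick Colas"]
print(filter_dictionary_names(names, looks_like_word))
# ['Francois R. Velde', 'Nick Colas']
```

Because every entry is checked, this variant is stricter than the original loop: a name carrying a title such as "Economist" is dropped as well, so stripping titles before filtering may be worth considering.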

Martin Thoma · Answer

The answer of @troyane didn't quite work for me, but it helped a lot with this one.

Prerequisites

Create a folder stanford-ner and download the following two files into it:

english.all.3class.distsim.crf.ser.gz
stanford-ner.jar (look for the download and extract the archive)

Script

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import nltk
    from nltk.tag.stanford import StanfordNERTagger

    text = u"""
    Some economists have responded positively to Bitcoin, including Francois R. Velde, senior economist of the Federal Reserve in Chicago who described it as "an elegant solution to the problem of creating a digital currency." In November 2013 Richard Branson announced that Virgin Galactic would accept Bitcoin as payment, saying that he had invested in Bitcoin and found it "fascinating how a whole new global currency has been created", encouraging others to also invest in Bitcoin.
    Other economists commenting on Bitcoin have been critical. Economist Paul Krugman has suggested that the structure of the currency incentivizes hoarding and that its value derives from the expectation that others will accept it as payment. Economist Larry Summers has expressed a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market strategist for ConvergEx Group, has remarked on the effect of increasing use of Bitcoin and its restricted supply, noting, "When incremental adoption meets relatively fixed supply, it should be no surprise that prices go up. And that’s exactly what is happening to BTC prices.
    """

    # Both paths point into the stanford-ner folder created in the prerequisites
    st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                           'stanford-ner/stanford-ner.jar')

    for sent in nltk.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)
        for tag in tags:
            if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
                print(tag)

Results

    (u'Bitcoin', u'LOCATION')       # wrong
    (u'Francois', u'PERSON')
    (u'R.', u'PERSON')
    (u'Velde', u'PERSON')
    (u'Federal', u'ORGANIZATION')
    (u'Reserve', u'ORGANIZATION')
    (u'Chicago', u'LOCATION')
    (u'Richard', u'PERSON')
    (u'Branson', u'PERSON')
    (u'Virgin', u'PERSON')          # Wrong
    (u'Galactic', u'PERSON')        # Wrong
    (u'Bitcoin', u'PERSON')         # Wrong
    (u'Bitcoin', u'LOCATION')       # Wrong
    (u'Bitcoin', u'LOCATION')       # Wrong
    (u'Paul', u'PERSON')
    (u'Krugman', u'PERSON')
    (u'Larry', u'PERSON')
    (u'Summers', u'PERSON')
    (u'Bitcoin', u'PERSON')         # Wrong
    (u'Nick', u'PERSON')
    (u'Colas', u'PERSON')
    (u'ConvergEx', u'ORGANIZATION')
    (u'Group', u'ORGANIZATION')
    (u'Bitcoin', u'LOCATION')       # Wrong
    (u'BTC', u'ORGANIZATION')       # Wrong

C.Rider · Answer

This worked quite well for me. I just had to change one line to get it to run.

 for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):

needs to be

 for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
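If the same code has to run against both old and new NLTK releases, a small helper can hide the difference. This is only a sketch under the assumption that older trees expose a `node` attribute and newer ones a `label()` method, as the two lines above suggest; `tree_label` is a hypothetical name, not an NLTK function:

```python
def tree_label(tree):
    """Return a chunk's label via label() when available, else via the old .node attribute."""
    label = getattr(tree, "label", None)
    if callable(label):
        return label()
    return tree.node
```

The filter then becomes `filter=lambda t: tree_label(t) == 'PERSON'`.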

There were some imperfections in the output (for example, it identified "Money Laundering" as a person), but with my data a name database may not be dependable.