
Improving the extraction of human names with nltk

I am trying to extract human names from text.

Does anyone have a method that they recommend?

This is what I tried (code is below): I am using nltk to find everything marked as a person and then generating a list of all the NNP parts of that person. I skip persons where only one NNP is found, which avoids grabbing a lone surname.

I am getting decent results, but I was wondering whether there are better ways to solve this problem.

Code:

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []

    return (person_list)

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print "LAST, FIRST"
for name in names:
    last_first = HumanName(name).last + ', ' + HumanName(name).first
    print last_first

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

Apart from Virgin Galactic, all of this output is valid. Of course, knowing that Virgin Galactic is not a human name in the context of this article is the hard (maybe impossible) part.

37
emh

I must agree with the suggestion that "make my code better" is not well suited to this site, but I can give you a direction to dig into.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you must download some core files. There is a script that can do all of that for you.

I wrote this script:

import nltk
from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1]=='PERSON': print tag

and got a result that is not too bad:

('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')
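
The tagger labels one token at a time, so full names still have to be reassembled. A minimal sketch of one way to do that, assuming you keep the complete (token, tag) list for a sentence rather than only the PERSON hits, so that adjacency is preserved. The helper name `group_person_tokens` and the `min_tokens` parameter are hypothetical, and the sample data is hand-written rather than real tagger output:

```python
from itertools import groupby


def group_person_tokens(tagged_tokens, label="PERSON", min_tokens=1):
    """Join runs of adjacent tokens tagged `label` into full names."""
    names = []
    # groupby splits the sequence into runs of consecutive equal tags
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != label:
            continue
        tokens = [token for token, _ in run]
        if len(tokens) >= min_tokens:
            names.append(" ".join(tokens))
    return names


# Hand-written sample in the (token, tag) shape that st.tag() returns;
# "O" is the tag the Stanford tagger gives to non-entity tokens.
sample = [
    ("Francois", "PERSON"), ("R.", "PERSON"), ("Velde", "PERSON"),
    (",", "O"), ("senior", "O"), ("economist", "O"),
    ("Richard", "PERSON"), ("Branson", "PERSON"),
    ("announced", "O"), ("Bitcoin", "PERSON"),
]

print(group_person_tokens(sample))
print(group_person_tokens(sample, min_tokens=2))
```

With `min_tokens=2` the single-token hit "Bitcoin" is dropped, which mirrors the question's rule of skipping lone tokens.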

Hope this is helpful.

19
troyane

For anyone else looking, I found this article useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

>>> import nltk
>>> def extract_entities(text):
...     for sent in nltk.sent_tokenize(text):
...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
...             if hasattr(chunk, 'node'):
...                 print chunk.node, ' '.join(c[0] for c in chunk.leaves())
...
6
Curtis Mattoon

You can try to resolve the names you find and check whether they appear in a database such as freebase.com. Get the data locally and query it (it is in RDF), or use Google's API: https://developers.google.com/freebase/v1/getting-started. Most big companies, geographical locations, etc. (which would be caught by your snippet) could then be discarded based on the Freebase data.
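
The filtering step could look roughly like the sketch below. It assumes the database lookup has already produced a set of known non-person entities; `KNOWN_NON_PERSONS` is a hypothetical, hand-written stand-in for that result, and `drop_known_entities` is an illustrative helper name, not part of any library:

```python
def drop_known_entities(candidates, known_non_persons):
    """Keep candidates that do not match a known non-person entity.

    Matching is exact apart from letter case.
    """
    known = {name.casefold() for name in known_non_persons}
    return [c for c in candidates if c.casefold() not in known]


# Hypothetical stand-in for entities resolved through a knowledge base
KNOWN_NON_PERSONS = {"Virgin Galactic", "Federal Reserve", "ConvergEx Group"}

candidates = [
    "Francois R. Velde",
    "Richard Branson",
    "Virgin Galactic",
    "Paul Krugman",
]

print(drop_known_entities(candidates, KNOWN_NON_PERSONS))
```

Exact matching keeps the example short; real data would probably need some normalisation of spelling variants before the lookup.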

5
Viktor Vojnovski

Spacy can be a good alternative for fetching names from text.

https://spacy.io/usage/training#ner

4
neel

I actually wanted to extract only person names, so I thought of checking every name that comes out as output against WordNet (a large lexical database of English). More information on WordNet is available here: http://www.nltk.org/howto/wordnet.html

import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet

person_list = []
# person_names is an alias of person_list (the same list object), not a copy
person_names = person_list
def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)

    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
#     print (person_list)

text = """

Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
for person in person_list:
    person_split = person.split(" ")
    for name in person_split:
        if wordnet.synsets(name):
            if(name in person):
                person_names.remove(person)
                break

print(person_names)

OUTPUT

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

Apart from Larry Summers, all the names are correct; Larry Summers is missing because of the last name "Summers".
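
The same filtering idea can be sketched so that it builds a new list instead of removing entries from a list inside the loop, since in Python removing items from a list while iterating over it can skip elements. `is_dictionary_word` is a hypothetical callable; with WordNet it could be `lambda w: bool(wordnet.synsets(w))`. The toy vocabulary below is an assumption that only keeps the example self-contained:

```python
def filter_names(names, is_dictionary_word):
    """Return the names in which no token is a dictionary word."""
    kept = []
    for full_name in names:
        tokens = full_name.split()
        # Drop the whole name as soon as one token is an ordinary word
        if not any(is_dictionary_word(token) for token in tokens):
            kept.append(full_name)
    return kept


# Toy vocabulary standing in for WordNet (assumption, illustration only)
TOY_VOCAB = {"virgin", "galactic", "summers"}


def in_toy_vocab(word):
    return word.lower() in TOY_VOCAB


names = [
    "Francois R. Velde",
    "Virgin Galactic",
    "Paul Krugman",
    "Larry Summers",
    "Nick Colas",
]

print(filter_names(names, in_toy_vocab))
```

The trade-off is the same as above: any real surname that is also an ordinary word, such as "Summers", causes the whole name to be dropped.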

2
Shivansh bhandari

The answer of @trojane didn't quite work for me, but it helped a lot with this one.

Prerequisites

Create a folder stanford-ner and download into it the two files that the script loads: english.all.3class.distsim.crf.ser.gz and stanford-ner.jar.

Script

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk
from nltk.tag.stanford import StanfordNERTagger

text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that’s exactly what is happening to BTC prices.
"""

st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)

Results

(u'Bitcoin', u'LOCATION')       # Wrong
(u'Francois', u'PERSON')
(u'R.', u'PERSON')
(u'Velde', u'PERSON')
(u'Federal', u'ORGANIZATION')
(u'Reserve', u'ORGANIZATION')
(u'Chicago', u'LOCATION')
(u'Richard', u'PERSON')
(u'Branson', u'PERSON')
(u'Virgin', u'PERSON')         # Wrong
(u'Galactic', u'PERSON')       # Wrong
(u'Bitcoin', u'PERSON')        # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Paul', u'PERSON')
(u'Krugman', u'PERSON')
(u'Larry', u'PERSON')
(u'Summers', u'PERSON')
(u'Bitcoin', u'PERSON')        # Wrong
(u'Nick', u'PERSON')
(u'Colas', u'PERSON')
(u'ConvergEx', u'ORGANIZATION')
(u'Group', u'ORGANIZATION')     
(u'Bitcoin', u'LOCATION')       # Wrong
(u'BTC', u'ORGANIZATION')       # Wrong
1
Martin Thoma

This worked pretty well for me. I just had to change one line to get it to run.

    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):

needs to be

    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):

There were some imperfections in the output (for example, it identified "money laundering" as a person), but with my data a name database may not be dependable.

0
C.Rider