How to find a tag with particular text using Beautiful Soup?

Question

I have the following HTML (line breaks shown as they appear in the source):

... <tr> <td class="pos">
 "Some text:"
 <br>
 <strong>some value</strong>
 </td> </tr> <tr> <td class="pos">
 "Fixed text:"
 <br>
 <strong>text I am looking for</strong>
 </td> </tr> <tr> <td class="pos">
 "Some other text:"
 <br>
 <strong>some other value</strong>
 </td> </tr> ...

How do I find the text I am looking for? The code below returns the first value found, so I need to filter by "Fixed text:" somehow.

result = soup.find('td', {'class' :'pos'}).find('strong').text

Update: if I use the following code:

title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))

then it returns only "Fixed text:".

user130076 · Accepted Answer

You can pass a regular expression to the text parameter of findAll, like so:

import BeautifulSoup
import re

columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})
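For readers on the newer bs4 package, here is a minimal sketch of the same idea. It assumes Python 3 and bs4 with the built-in "html.parser"; the helper name strong_after_label and the inline HTML sample are illustrative and not part of the original answer. It searches the text nodes themselves and then walks up to the enclosing td, because in bs4 a td with several children has no single .string, so combining the tag name with a text filter may match nothing.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the question (assumed for illustration).
HTML = """
<table>
<tr><td class="pos">
"Some text:"
<br>
<strong>some value</strong>
</td></tr>
<tr><td class="pos">
"Fixed text:"
<br>
<strong>text I am looking for</strong>
</td></tr>
<tr><td class="pos">
"Some other text:"
<br>
<strong>some other value</strong>
</td></tr>
</table>
"""


def strong_after_label(soup, label):
    """Return the <strong> text in the td.pos cell containing `label`, or None."""
    pattern = re.compile(re.escape(label))
    # No tag name here: find_all(string=...) yields the matching text nodes.
    for node in soup.find_all(string=pattern):
        cell = node.find_parent("td", class_="pos")
        if cell is None:
            continue
        strong = cell.find("strong")
        if strong is not None:
            return strong.get_text(strip=True)
    return None


soup = BeautifulSoup(HTML, "html.parser")
result = strong_after_label(soup, "Fixed text:")
print(result)
```

Escaping the label with re.escape keeps characters such as "." or "(" in a label from being read as regex syntax; drop it if you actually want to pass a pattern.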

Bruno Bronosky · Answer

This post led me to my answer, even though the answer itself is missing from this post. I felt I should give something back.

The difficulty lies in the inconsistent behavior of BeautifulSoup.find when searching with and without text.

Note: If you have BeautifulSoup installed, you can test this locally via:

curl https://Gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python

Code: https://Gist.github.com/4060082

# Taken from https://Gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re

soup = BeautifulSoup(urlopen('https://Gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')

# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>

# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>>  'nextSibling': <br />,
#>>  'parent': <td class="pos">
#>>  "Fixed text:"
#>>  <br />
#>>  <strong>text I am looking for</strong>
#>>  </td>,
#>>  'previous': <td class="pos">
#>>  "Fixed text:"
#>>  <br />
#>>  <strong>text I am looking for</strong>
#>>  </td>,
#>>  'previousSibling': None}

# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']

# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True
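The output above comes from the old BeautifulSoup 3 package under Python 2. As far as I can tell, bs4 behaves differently on exactly this point, so here is a small sketch for comparison. It assumes Python 3, bs4 and the built-in "html.parser"; the one-line HTML sample is made up for illustration. With only a string filter the match is the text node, while a tag name plus a string filter returns the tag itself.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Tiny sample document (assumed for illustration).
HTML = (
    '<table><tr><td class="pos">"Fixed text:"<br>'
    "<strong>some value</strong></td></tr></table>"
)
soup = BeautifulSoup(HTML, "html.parser")

# String filter only: the result is the text node itself.
text_node = soup.find(string="some value")

# Tag name plus string filter: bs4 returns the tag whose .string matches.
tag = soup.find("strong", string="some value")

node_is_string = isinstance(text_node, NavigableString)
tag_is_tag = isinstance(tag, Tag)
# The text node's parent is the very same <strong> element.
same_element = text_node.parent is tag

print(node_is_string, tag_is_tag, same_element)
```

So under bs4 the .parent hop shown above is only needed when you search without a tag name.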

Although it is certainly too late to help the OP, I hope this is still useful, because it resolves all the dilemmas around searching by text.

Prasad Giri · Answer

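from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from urllib.parse import urljoin, urlparse

# `soup` (a parsed BeautifulSoup document) and `keyword` (the text to look for)
# are not defined in this snippet and must already exist.
rawLinks = soup.findAll('a', href=True)
for link in rawLinks:
    innercontent = link.text
    # Case-insensitive substring match on the link text
    if keyword.lower() in innercontent.lower():
        print(link)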
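The snippet above depends on a soup and a keyword defined elsewhere. A self-contained sketch of the same approach might look like this; the helper name links_containing and the sample links are assumptions for illustration, and it parses an inline string with "html.parser" instead of fetching a page.

```python
from bs4 import BeautifulSoup

# Sample links (assumed for illustration).
HTML = """
<a href="/docs">Documentation</a>
<a href="/blog">Blog posts</a>
<a>No href, skipped</a>
<a href="/api">API documentation</a>
"""


def links_containing(soup, keyword):
    """Return the <a href=...> tags whose text contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [
        a for a in soup.find_all("a", href=True)
        if needle in a.get_text().lower()
    ]


soup = BeautifulSoup(HTML, "html.parser")
hrefs = [a["href"] for a in links_containing(soup, "DOCUMENTATION")]
print(hrefs)
```

Filtering in plain Python like this is a simple alternative to a regex when you only need a case-insensitive substring test.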