Extracting data from an HTML table

Question

I'm looking for a way to extract certain information from HTML in a Linux shell environment.

This is the part I'm interested in:

    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
      <tr valign="top">
        <th>Tests</th>
        <th>Failures</th>
        <th>Success Rate</th>
        <th>Average Time</th>
        <th>Min Time</th>
        <th>Max Time</th>
      </tr>
      <tr valign="top" class="Failure">
        <td>103</td>
        <td>24</td>
        <td>76.70%</td>
        <td>71 ms</td>
        <td>0 ms</td>
        <td>829 ms</td>
      </tr>
    </table>

I want to store the values in shell variables, or echo them as key-value pairs extracted from the HTML above. Example:

    Tests : 103
    Failures : 24
    Success Rate : 76.70 %
    and so on..

What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.

But using Java here seems like overkill, since the executable jar would have to be bundled with the "wrapper" script you want to run.

I'm sure there are "shell" languages out there that can do the same thing, e.g. Perl, Python, Bash, etc.

My problem is that I have no experience with any of them. Can someone help me solve this "fairly easy" problem?

Quick update:

I forgot to mention that the .html document contains more tables and more rows, sorry about that (early morning).

Update #2:

I tried to install Bsoup like this, since I don't have root access:

    $ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz
    $ tar -zxvf beautifulsoup4-4.1.0.tar.gz
    $ cp -r beautifulsoup4-4.1.0/bs4 .
    $ vi htmlParse.py # (paste code from ) Tichodromas' answer, just in case this (http://Pastebin.com/4Je11Y9q) is what I pasted
    $ run file (python htmlParse.py)

Error:

    $ python htmlParse.py
    Traceback (most recent call last):
      File "htmlParse.py", line 1, in ?
        from bs4 import BeautifulSoup
      File "/home/gdd/setup/py/bs4/__init__.py", line 29
        from .builder import builder_registry
             ^
    SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives this error:

    Traceback (most recent call last):
      File "test.py", line 27, in ?
        headings = [th.get_text() for th in table.find("tr").find_all("th")]
    TypeError: 'NoneType' object is not callable

Any ideas?

user647772 · Accepted Answer

A Python solution using BeautifulSoup4 (Edit: with proper skipping. Edit3: using class="details" to select the table):

    from bs4 import BeautifulSoup

    html = """
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
      <tr valign="top">
        <th>Tests</th>
        <th>Failures</th>
        <th>Success Rate</th>
        <th>Average Time</th>
        <th>Min Time</th>
        <th>Max Time</th>
      </tr>
      <tr valign="top" class="Failure">
        <td>103</td>
        <td>24</td>
        <td>76.70%</td>
        <td>71 ms</td>
        <td>0 ms</td>
        <td>829 ms</td>
      </tr>
    </table>"""

    soup = BeautifulSoup(html)
    table = soup.find("table", attrs={"class":"details"})

    # The first tr contains the field names.
    headings = [th.get_text() for th in table.find("tr").find_all("th")]

    # Every following tr is a data row: pair each cell with its heading.
    datasets = []
    for row in table.find_all("tr")[1:]:
        dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
        datasets.append(dataset)

    print datasets

The result looks like this:

[[(u'Tests', u'103'), (u'Failures', u'24'), (u'Success Rate', u'76.70%'), (u'Average Time', u'71 ms'), (u'Min Time', u'0 ms'), (u'Max Time', u'829 ms')]]

Edit2: To produce the desired output, use something like this:

    for dataset in datasets:
        for field in dataset:
            print "{0:<16}: {1}".format(field[0], field[1])

Result:

    Tests           : 103
    Failures        : 24
    Success Rate    : 76.70%
    Average Time    : 71 ms
    Min Time        : 0 ms
    Max Time        : 829 ms
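Since the question asks for shell variables, the same (heading, value) pairs can also be printed as shell assignments. The following is only a minimal sketch in Python 3: the function name `to_shell_assignments` and the upper-case naming scheme are assumptions made for illustration, not part of the answer above.

```python
import re
import shlex

def to_shell_assignments(pairs):
    """Turn (heading, value) pairs into lines such as TESTS=103."""
    lines = []
    for heading, value in pairs:
        # Shell variable names may only contain letters, digits and
        # underscores, so every other run of characters becomes "_".
        # (A heading starting with a digit is not handled here.)
        name = re.sub(r"[^A-Za-z0-9_]+", "_", heading.strip()).upper()
        # shlex.quote makes the value safe to paste into a shell.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines

# Sample row in the same shape as one entry of the list of pairs.
row = [("Tests", "103"), ("Failures", "24"),
       ("Success Rate", "76.70%"), ("Average Time", "71 ms")]

for line in to_shell_assignments(row):
    print(line)
```

If the script is saved under a hypothetical name such as `table2sh.py`, a wrapper script could load the variables with `eval "$(python3 table2sh.py)"` and then read `$TESTS` or `$SUCCESS_RATE`.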

Michel Müller · Answer

Here is the top answer, adapted for Python 3 compatibility and improved by stripping whitespace in the cells:

    from bs4 import BeautifulSoup

    html = """
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
      <tr valign="top">
        <th>Tests</th>
        <th>Failures</th>
        <th>Success Rate</th>
        <th>Average Time</th>
        <th>Min Time</th>
        <th>Max Time</th>
      </tr>
      <tr valign="top" class="Failure">
        <td>103</td>
        <td>24</td>
        <td>76.70%</td>
        <td>71 ms</td>
        <td>0 ms</td>
        <td>829 ms</td>
      </tr>
    </table>"""

    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find("table")

    # The first tr contains the field names.
    headings = [th.get_text().strip() for th in table.find("tr").find_all("th")]
    print(headings)

    # Each remaining tr becomes a dict mapping heading -> stripped cell text.
    datasets = []
    for row in table.find_all("tr")[1:]:
        dataset = dict(zip(headings, (td.get_text().strip() for td in row.find_all("td"))))
        datasets.append(dataset)

    print(datasets)

Stephane Rouberol · Answer

Assuming your HTML is stored in a file mycode.html, here is a bash approach:

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

Note: the output is not perfectly aligned.
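If alignment matters, the `heading:value` lines can be padded afterwards. The following is a rough Python 3 sketch, not part of the bash answer; the function name `align_pairs` is made up for illustration, and the input is assumed to be lines with a single `:` separator like the ones `paste -d:` produces.

```python
def align_pairs(lines):
    """Pad each heading so that all the colons line up."""
    pairs = []
    for line in lines:
        # Split on the first colon only, in case a value contains one.
        key, _, value = line.partition(":")
        pairs.append((key.strip(), value.strip()))
    width = max(len(key) for key, _ in pairs)
    return ["{0:<{1}} : {2}".format(key, width, value) for key, value in pairs]

# Sample lines in the "heading:value" shape described above.
sample = ["Tests:103", "Failures:24", "Success Rate:76.70%"]

for line in align_pairs(sample):
    print(line)
```

The width is taken from the longest heading rather than hard-coded, so the sketch keeps working when other tables use longer column names.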

paolov · Answer

Below is a Python regex-based solution that I have tested on Python 2.7. It does not rely on the xml module, so it still works when the markup is not well-formed XML.

    import re

    # input args: html string
    # output: tables as a list, column max length
    def extract_html_tables(html):
        tables=[]
        maxlen=0
        rex1=r'<table.*?/table>'
        rex2=r'<tr.*?/tr>'
        rex3=r'<(td|th).*?/(td|th)>'
        s = re.search(rex1,html,re.DOTALL)
        while s:
            t = s.group()  # the table
            s2 = re.search(rex2,t,re.DOTALL)
            table = []
            while s2:
                r = s2.group() # the row
                s3 = re.search(rex3,r,re.DOTALL)
                row=[]
                while s3:
                    d = s3.group() # the cell
                    #row.append(strip_tags(d).strip() )
                    row.append(d.strip() )
                    r = re.sub(rex3,'',r,1,re.DOTALL)
                    s3 = re.search(rex3,r,re.DOTALL)
                table.append( row )
                if maxlen<len(row):
                    maxlen = len(row)
                t = re.sub(rex2,'',t,1,re.DOTALL)
                s2 = re.search(rex2,t,re.DOTALL)
            html = re.sub(rex1,'',html,1,re.DOTALL)
            tables.append(table)
            s = re.search(rex1,html,re.DOTALL)
        return tables, maxlen

    html = """
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
      <tr valign="top">
        <th>Tests</th>
        <th>Failures</th>
        <th>Success Rate</th>
        <th>Average Time</th>
        <th>Min Time</th>
        <th>Max Time</th>
      </tr>
      <tr valign="top" class="Failure">
        <td>103</td>
        <td>24</td>
        <td>76.70%</td>
        <td>71 ms</td>
        <td>0 ms</td>
        <td>829 ms</td>
      </tr>
    </table>"""

    print extract_html_tables(html)
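As an alternative to regular expressions, the standard-library `html.parser` module (Python 3) also accepts markup that is not well-formed XML. The sketch below swaps in that technique and is not the regex method above; the class name `TableCollector` is an assumption, and nested tables are not handled properly.

```python
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # text pieces of the open cell, or None

    def _flush_cell(self):
        # Store the open cell, if any, in the current row.
        if self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag in ("table", "tr", "td", "th"):
            # A new table/row/cell implicitly ends an unclosed cell.
            self._flush_cell()
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush_cell()

def extract_tables(html):
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.tables

# Second table deliberately omits its </td> tags.
sample = ("<table><tr><th>Tests</th><th>Failures</th></tr>"
          "<tr><td>103</td><td> 24 </td></tr></table>"
          "<table><tr><td>a<td>b</tr></table>")
print(extract_tables(sample))
```

Unlike the regex version, the cells come back as plain text without the surrounding `<td>` or `<th>` tags.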

cdtits · Answer

    undef $/;
    $text = <DATA>;

    @tabs = $text =~ m!<table.*?>(.*?)</table>!gms;

    for (@tabs) {
        @th = m!<th>(.*?)</th>!gms;
        @td = m!<td>(.*?)</td>!gms;
    }

    for $i (0..$#th) {
        printf "%-16s\t: %s\n", $th[$i], $td[$i];
    }

    __DATA__
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
      <tr valign="top">
        <th>Tests</th>
        <th>Failures</th>
        <th>Success Rate</th>
        <th>Average Time</th>
        <th>Min Time</th>
        <th>Max Time</th>
      </tr>
      <tr valign="top" class="Failure">
        <td>103</td>
        <td>24</td>
        <td>76.70%</td>
        <td>71 ms</td>
        <td>0 ms</td>
        <td>829 ms</td>
      </tr>
    </table>

Output:

    Tests           	: 103
    Failures        	: 24
    Success Rate    	: 76.70%
    Average Time    	: 71 ms
    Min Time        	: 0 ms
    Max Time        	: 829 ms

mzjn · Answer

A Python solution that uses only the standard library (it takes advantage of the fact that this HTML happens to be well-formed XML). Multiple rows of data can be handled.

(Tested with Python 2.6 and 2.7. The question has since been updated to say that the OP uses Python 2.4, so this answer may not be very useful in that case. ElementTree was added in Python 2.5.)

    from xml.etree.ElementTree import fromstring

    HTML = """
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
      <tr valign="top">
        <th>Tests</th>
        <th>Failures</th>
        <th>Success Rate</th>
        <th>Average Time</th>
        <th>Min Time</th>
        <th>Max Time</th>
      </tr>
      <tr valign="top" class="Failure">
        <td>103</td>
        <td>24</td>
        <td>76.70%</td>
        <td>71 ms</td>
        <td>0 ms</td>
        <td>829 ms</td>
      </tr>
      <tr valign="top" class="whatever">
        <td>A</td>
        <td>B</td>
        <td>C</td>
        <td>D</td>
        <td>E</td>
        <td>F</td>
      </tr>
    </table>"""

    tree = fromstring(HTML)
    rows = tree.findall("tr")
    headrow = rows[0]
    datarows = rows[1:]

    # For each column, join the values of all data rows after the heading.
    for num, h in enumerate(headrow):
        data = ", ".join([row[num].text for row in datarows])
        print "{0:<16}: {1}".format(h.text, data)

Output:

    Tests           : 103, A
    Failures        : 24, B
    Success Rate    : 76.70%, C
    Average Time    : 71 ms, D
    Min Time        : 0 ms, E
    Max Time        : 829 ms, F
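The question mentions that the real document holds several tables. Under the same assumption that the whole document is well-formed XML, the ElementTree approach could be extended roughly as follows (Python 3). The sample document, the function name `tables_to_records`, and the choice of one dict per data row are all illustrative assumptions.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical document with two tables; it must be well-formed XML.
DOC = """<html><body>
<table class="details">
  <tr><th>Tests</th><th>Failures</th></tr>
  <tr><td>103</td><td>24</td></tr>
  <tr><td>7</td><td>0</td></tr>
</table>
<table class="details">
  <tr><th>Min Time</th><th>Max Time</th></tr>
  <tr><td>0 ms</td><td>829 ms</td></tr>
</table>
</body></html>"""

def tables_to_records(xml_text):
    """Return one list of dicts per <table>, keyed by the header row."""
    root = fromstring(xml_text)
    result = []
    # iter() walks the whole tree, so tables at any depth are found.
    for table in root.iter("table"):
        # Only direct <tr> children are considered (no <tbody> handling).
        rows = table.findall("tr")
        if not rows:
            continue
        headings = [(cell.text or "").strip() for cell in rows[0]]
        records = [
            dict(zip(headings, ((cell.text or "").strip() for cell in row)))
            for row in rows[1:]
        ]
        result.append(records)
    return result

for records in tables_to_records(DOC):
    for record in records:
        print(record)
```

Real-world HTML is often not valid XML (unclosed tags, `&nbsp;` entities), in which case `fromstring` raises a `ParseError` and one of the more tolerant approaches in this thread is a better fit.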