Using PDFBox to determine the coordinates of words in a document

Question

I'm using PDFBox to extract the coordinates of words/strings in a PDF document, and so far I have managed to determine the position of individual characters. Here is the code so far, taken from the PDFBox documentation:

    package printtextlocations;

    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    // PDFBox 1.x API (org.apache.pdfbox.util.*)
    import org.apache.pdfbox.exceptions.InvalidPasswordException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;

    public class PrintTextLocations extends PDFTextStripper {

        public PrintTextLocations() throws IOException {
            super.setSortByPosition(true);
        }

        public static void main(String[] args) throws Exception {
            PDDocument document = null;
            try {
                // Backslashes must be escaped inside a Java string literal
                File input = new File("C:\\path\\to\\PDF.pdf");
                document = PDDocument.load(input);
                if (document.isEncrypted()) {
                    try {
                        document.decrypt("");
                    } catch (InvalidPasswordException e) {
                        System.err.println("Error: Document is encrypted with a password.");
                        System.exit(1);
                    }
                }
                PrintTextLocations printer = new PrintTextLocations();
                List allPages = document.getDocumentCatalog().getAllPages();
                for (int i = 0; i < allPages.size(); i++) {
                    PDPage page = (PDPage) allPages.get(i);
                    System.out.println("Processing page: " + i);
                    PDStream contents = page.getContents();
                    if (contents != null) {
                        printer.processStream(page, page.findResources(), page.getContents().getStream());
                    }
                }
            } finally {
                if (document != null) {
                    document.close();
                }
            }
        }

        /**
         * Called once per character; prints its position and metrics.
         *
         * @param text The text to be processed
         */
        @Override /* this is questionable, not sure if needed... */
        protected void processTextPosition(TextPosition text) {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSize()
                    + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir()
                    + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj() + "]"
                    + text.getCharacter());
        }
    }

This produces a series of lines giving the position of each character, spaces included, which look like this:

String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P

where "P" is the character. I have not been able to find a function in PDFBox for locating words, and I am not familiar enough with Java to reliably concatenate these characters back into words that I could search, even though the spaces are included in the output. Has anyone else been in a similar situation, and if so, how did you approach it? I really only need the coordinates of the first character of the word, which should simplify things, but I don't know how to match a string against this kind of output.

Nicolas W. · Answer

There is no function in PDFBox that extracts words automatically. I'm currently working on extracting data and gathering it into blocks, and here is my process:

I extract all the characters of the document (called glyphs) and store them in a list.
I analyze the coordinates of each glyph by looping over the list. If two consecutive glyphs overlap vertically (the top of the current glyph lies between the top and bottom of the previous one, or the bottom of the current glyph lies between the top and bottom of the previous one), I add the current glyph to the same line.
At this point I have extracted the individual lines of the document (be careful: if your document has several columns, a "line" here means all the glyphs that overlap vertically, that is, the text of every column that shares the same vertical coordinates).
Then you can compare the left coordinate of the current glyph with the right coordinate of the previous one to decide whether they belong to the same word. The PDFTextStripper class provides a getSpacingTolerance() method which gives a tolerance, arrived at by trial and error, for what counts as a "normal" space. If the difference between the right and left coordinates is smaller than this value, the two glyphs belong to the same word.

I applied this method in my own work and it works well.
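The process described above can be sketched without any PDFBox dependency. This is only an illustration under assumptions, not the answerer's actual code: the Glyph and Word classes, the groupIntoWords method and the maxGap parameter are hypothetical names, glyphs are assumed to arrive in reading order (as with setSortByPosition(true)), and y is assumed to grow downward. In a real stripper, each Glyph would be filled from a TextPosition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GlyphWordGrouper {

    /** Hypothetical glyph holder: one character and its bounding box (y grows downward). */
    static final class Glyph {
        final String ch;
        final float left, top, width, height;

        Glyph(String ch, float left, float top, float width, float height) {
            this.ch = ch;
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }

        float right() { return left + width; }
        float bottom() { return top + height; }
    }

    /** A word together with the coordinates of its first glyph. */
    static final class Word {
        final String text;
        final float x, y;

        Word(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** True if the top or the bottom of cur lies between the top and bottom of prev. */
    static boolean sameLine(Glyph prev, Glyph cur) {
        boolean topInside = cur.top >= prev.top && cur.top <= prev.bottom();
        boolean bottomInside = cur.bottom() >= prev.top && cur.bottom() <= prev.bottom();
        return topInside || bottomInside;
    }

    /**
     * Groups glyphs into words. A word ends at a whitespace glyph, at a line change,
     * or when the horizontal gap to the previous glyph is at least maxGap
     * (an assumed threshold, e.g. derived from the width of a space).
     */
    static List<Word> groupIntoWords(List<Glyph> glyphs, float maxGap) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph first = null; // first glyph of the word being built
        Glyph prev = null;  // previously seen glyph

        for (Glyph g : glyphs) {
            boolean isSpace = g.ch.trim().isEmpty();
            boolean breaks = prev != null
                    && (!sameLine(prev, g) || g.left - prev.right() >= maxGap);
            if ((isSpace || breaks) && first != null) {
                words.add(new Word(current.toString(), first.left, first.top));
                current.setLength(0);
                first = null;
            }
            if (!isSpace) {
                if (first == null) {
                    first = g;
                }
                current.append(g.ch);
            }
            prev = g;
        }
        if (first != null) {
            words.add(new Word(current.toString(), first.left, first.top));
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 10f, 10f, 5f, 10f),
                new Glyph("i", 15.5f, 10f, 2f, 10f),
                new Glyph("y", 22f, 12f, 5f, 10f),
                new Glyph("o", 27.2f, 10f, 5f, 10f),
                new Glyph("x", 10f, 40f, 5f, 10f));
        for (Word w : groupIntoWords(glyphs, 2f)) {
            System.out.println(w.text + " @ (" + w.x + ", " + w.y + ")");
        }
    }
}
```

Because each Word keeps the position of its first glyph, finding a search term's coordinates comes down to comparing its text with the term.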

Dainesch · Answer

Based on the original idea, here is a version of the text search for PDFBox 2. The code itself is rough, but simple. It should get you started fairly quickly.

    import java.io.IOException;
    import java.io.Writer;
    import java.util.List;
    import java.util.Set;

    // The answerer's own data class; its source is not part of this answer
    import lu.abac.pdfclient.data.PDFTextLocation;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    public class PrintTextLocator extends PDFTextStripper {

        private final Set<PDFTextLocation> locations;

        public PrintTextLocator(PDDocument document, Set<PDFTextLocation> locations) throws IOException {
            super.setSortByPosition(true);
            this.document = document;
            this.locations = locations;
            // Discard the extracted text; only the positions matter here
            this.output = new Writer() {
                @Override
                public void write(char[] cbuf, int off, int len) throws IOException {
                }

                @Override
                public void flush() throws IOException {
                }

                @Override
                public void close() throws IOException {
                }
            };
        }

        public Set<PDFTextLocation> doSearch() throws IOException {
            processPages(document.getDocumentCatalog().getPages());
            return locations;
        }

        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
            super.writeString(text);
            String searchText = text.toLowerCase();
            for (PDFTextLocation textLoc : locations) {
                // Case-insensitive search within the current chunk of text
                int start = searchText.indexOf(textLoc.getText().toLowerCase());
                if (start != -1) {
                    // found: record the position of the first matching character
                    TextPosition pos = textPositions.get(start);
                    textLoc.setFound(true);
                    textLoc.setPage(getCurrentPageNo());
                    textLoc.setX(pos.getXDirAdj());
                    textLoc.setY(pos.getYDirAdj());
                }
            }
        }
    }
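The PDFTextLocation class imported above is not included in the answer. The following is a hypothetical stand-in, assuming only the accessors that PrintTextLocator actually calls (getText, setFound, setPage, setX, setY); the constructor, the extra getters and toString are guesses added for illustration. It is placed in the default package here, so the package and visibility would need adjusting to match the import.

```java
import java.util.Objects;

/** Hypothetical stand-in for lu.abac.pdfclient.data.PDFTextLocation (not the answerer's class). */
final class PDFTextLocation {

    private final String text; // the text to search for
    private boolean found;     // set once a match has been located
    private int page;          // page number of the match
    private float x;           // x coordinate of the first matching character
    private float y;           // y coordinate of the first matching character

    PDFTextLocation(String text) {
        this.text = Objects.requireNonNull(text, "text");
    }

    public String getText() { return text; }

    public boolean isFound() { return found; }
    public void setFound(boolean found) { this.found = found; }

    public int getPage() { return page; }
    public void setPage(int page) { this.page = page; }

    public float getX() { return x; }
    public void setX(float x) { this.x = x; }

    public float getY() { return y; }
    public void setY(float y) { this.y = y; }

    // Instances live in a Set and are mutated during the search, so equality
    // and hashing depend only on the immutable search text.
    @Override
    public boolean equals(Object o) {
        return o instanceof PDFTextLocation && text.equals(((PDFTextLocation) o).text);
    }

    @Override
    public int hashCode() {
        return text.hashCode();
    }

    @Override
    public String toString() {
        return found
                ? text + " (page " + page + ") [" + x + " : " + y + "]"
                : text + " (not found)";
    }
}
```

With such a class, a search would consist of filling a Set with one PDFTextLocation per search term, passing it to the PrintTextLocator constructor together with the loaded PDDocument, and calling doSearch().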

Marouita · Answer

Take a look at this, I think it is what you need:

https://jackson-brain.com/using-pdfbox-to-locate-text-coordinates-within-a-pdf-in-Java/

Here is the code:

    import java.io.File;
    import java.io.IOException;
    import java.text.DecimalFormat;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // PDFBox 1.x API (org.apache.pdfbox.util.*)
    import org.apache.pdfbox.exceptions.InvalidPasswordException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;

    public class PrintTextLocations extends PDFTextStripper {

        public static StringBuilder tWord = new StringBuilder();
        public static String seek;
        public static String[] seekA;
        public static List<String> wordList = new ArrayList<>();
        public static boolean is1stChar = true;
        public static boolean lineMatch;
        public static int pageNo = 1;
        public static double lastYVal;

        public PrintTextLocations() throws IOException {
            super.setSortByPosition(true);
        }

        public static void main(String[] args) throws Exception {
            PDDocument document = null;
            // args[0] = PDF file, args[1] = comma-separated list of words to find
            seekA = args[1].split(",");
            seek = args[1];
            try {
                File input = new File(args[0]);
                document = PDDocument.load(input);
                if (document.isEncrypted()) {
                    try {
                        document.decrypt("");
                    } catch (InvalidPasswordException e) {
                        System.err.println("Error: Document is encrypted with a password.");
                        System.exit(1);
                    }
                }
                PrintTextLocations printer = new PrintTextLocations();
                List allPages = document.getDocumentCatalog().getAllPages();
                for (int i = 0; i < allPages.size(); i++) {
                    PDPage page = (PDPage) allPages.get(i);
                    PDStream contents = page.getContents();
                    if (contents != null) {
                        printer.processStream(page, page.findResources(), page.getContents().getStream());
                    }
                    pageNo += 1;
                }
            } finally {
                if (document != null) {
                    System.out.println(wordList);
                    document.close();
                }
            }
        }

        @Override
        protected void processTextPosition(TextPosition text) {
            String tChar = text.getCharacter();
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSize()
                    + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir()
                    + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj() + "]"
                    + text.getCharacter());
            // Punctuation and whitespace end the current word
            String REGEX = "[,.(:;!?)/]";
            char c = tChar.charAt(0);
            lineMatch = matchCharLine(text);
            if ((!tChar.matches(REGEX)) && (!Character.isWhitespace(c))) {
                if ((!is1stChar) && lineMatch) {
                    appendChar(tChar);
                } else if (is1stChar) {
                    setWordCoord(text, tChar);
                }
            } else {
                endWord();
            }
        }

        protected void appendChar(String tChar) {
            tWord.append(tChar);
            is1stChar = false;
        }

        // Starts a new word, prefixed with "(page)[x : y] " of its first character
        protected void setWordCoord(TextPosition text, String tChar) {
            tWord.append("(").append(pageNo).append(")[")
                    .append(roundVal(Float.valueOf(text.getXDirAdj()))).append(" : ")
                    .append(roundVal(Float.valueOf(text.getYDirAdj()))).append("] ")
                    .append(tChar);
            is1stChar = false;
        }

        protected void endWord() {
            // Strip non-ASCII characters; the backslashes are escaped for the Java string literal
            String newWord = tWord.toString().replaceAll("[^\\x00-\\x7F]", "");
            String sWord = newWord.substring(newWord.lastIndexOf(' ') + 1);
            if (!"".equals(sWord)) {
                if (Arrays.asList(seekA).contains(sWord)) {
                    wordList.add(newWord);
                } else if ("SHOWMETHEMONEY".equals(seek)) {
                    // Special search term: list every word
                    wordList.add(newWord);
                }
            }
            tWord.delete(0, tWord.length());
            is1stChar = true;
        }

        // Returns true if this character has the same (rounded) y value as the previous one
        protected boolean matchCharLine(TextPosition text) {
            Double yVal = roundVal(Float.valueOf(text.getYDirAdj()));
            if (yVal.doubleValue() == lastYVal) {
                return true;
            }
            lastYVal = yVal.doubleValue();
            endWord();
            return false;
        }

        // Rounds to one decimal place (the pattern then appends a literal 0)
        protected Double roundVal(Float yVal) {
            DecimalFormat rounded = new DecimalFormat("0.0'0'");
            return Double.valueOf(rounded.format(yVal));
        }
    }

Dependencies:

PDFBox, FontBox, Apache Commons Logging.

You can run it from the command line:

    javac PrintTextLocations.java
    sudo java PrintTextLocations file.pdf Word1,Word2,....

The output looks similar to:

[(1)[190.3 : 286.8] Word1, (1)[283.3 : 286.8] Word2, ...]