web-dev-qa-db-fra.com

Comment obtenir du texte brut d'un fichier pdf en utilisant java

J'ai quelques fichiers pdf, En utilisant pdfbox, je les ai convertis en texte et stockés dans des fichiers texte, Maintenant, des fichiers texte que je veux supprimer

  1. Des hyperliens
  2. Tous les caractères spéciaux
  3. Lignes vierges
  4. pieds de page des fichiers pdf
  5. “1)”, “2)”, “a)”, “balles”, etc.

Je veux obtenir un texte valide ligne par ligne comme ceci:

Nous proposons OntoGain, une méthode d’apprentissage des ontologies à partir de termes conceptuels multi-mots extraits du texte brut. OntoGain suit un processus d’apprentissage ontologique défini par des couches de traitement distinctes. En se basant sur l'extraction de termes simples, une hiérarchie de concepts est formée en regroupant les concepts extraits. Le terme dérivé taxonomie est ensuite enrichi de relations non taxonomiques. Plusieurs méthodes de pointe différentes ont été examinées pour la mise en œuvre de chaque couche. OntoGain est basé sur des concepts de termes multi-mots, car les termes multi-mots ou composés sont dotés d'une sémantique plus solide et plus distinctive que les termes simples à mots simples. Nous avons opté pour une méthode de classification hiérarchique et un algorithme d'analyse de concept formel (FCA) pour construire le terme taxonomie. De plus, un algorithme de règle d'association est appliqué pour révéler des relations non taxonomiques. Une méthode qui essaie de réaliser le niveau de généralisation le plus approprié entre les concepts d'une relation est également implémentée. Pour montrer la preuve de concept, un prototype de système est mis en œuvre. OntoGain permet la transformation de l'ontologie dérivée en OWL à l'aide de Jena Semantic Web Frame-work1. OntoGain est appliqué sur deux sources de données distinctes, un corpus médical et informatique, et ses résultats sont comparés aux résultats similaires obtenus par Text2Onto, une méthode d’apprentissage à la pointe de la technologie ontologique. L'analyse des résultats de 11.5 CCD1.1 indique qu'OntoGain fonctionne mieux que Text2Onto en termes d'extraction de précision de concepts plus corrects, tout en étant plus sélectifs d'extraits moins complexes mais de concepts plus raisonnables.

Comment puis-je atteindre cet objectif?

21
user2609542

En utilisant pdfbox nous pouvons y arriver

Exemple : 

public static void main(String args[]) {

    PDFParser parser = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    PDFTextStripper pdfStripper;

    String parsedText;
    String fileName = "E:\\Files\\Small Files\\PDF\\JDBC.pdf";
    File file = new File(fileName);
    try {
        parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText.replaceAll("[^A-Za-z0-9. ]+", ""));
    } catch (Exception e) {
        e.printStackTrace();
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }

    }
}
29
SANN3

Bonjour, nous pouvons extraire les fichiers pdf en utilisant Apache Tika 

L'exemple est:

import Java.io.IOException;
import Java.io.InputStream;
import Java.util.HashMap;
import Java.util.Map;
import org.Apache.http.HttpEntity;
import org.Apache.http.HttpResponse;
import org.Apache.http.client.methods.HttpGet;
import org.Apache.http.impl.client.DefaultHttpClient;
import org.Apache.tika.metadata.Metadata;
import org.Apache.tika.metadata.TikaCoreProperties;
import org.Apache.tika.parser.AutoDetectParser;
import org.Apache.tika.parser.ParseContext;
import org.Apache.tika.sax.BodyContentHandler;

public class WebPagePdfExtractor {

    public Map<String, Object> processRecord(String url) {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        Map<String, Object> map = new HashMap<String, Object>();
        try {
            HttpGet httpGet = new HttpGet(url);
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            InputStream input = null;
            if (entity != null) {
                try {
                    input = entity.getContent();
                    BodyContentHandler handler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    AutoDetectParser parser = new AutoDetectParser();
                    ParseContext parseContext = new ParseContext();
                    parser.parse(input, handler, metadata, parseContext);
                    map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
                    map.put("title", metadata.get(TikaCoreProperties.TITLE));
                    map.put("pageCount", metadata.get("xmpTPg:NPages"));
                    map.put("status_code", response.getStatusLine().getStatusCode() + "");
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    if (input != null) {
                        try {
                            input.close();
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
        return map;
    }

    public static void main(String arg[]) {
        WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
        Map<String, Object> extractedMap = webPagePdfExtractor.processRecord("http://math.about.com/library/q20.pdf");
        System.out.println(extractedMap.get("text"));
    }

}
16
Johnsa Philip

Vous pouvez utiliser iText pour faire de telles choses

//iText imports

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

par exemple:

try {     
    PdfReader reader = new PdfReader(INPUTFILE);
    int n = reader.getNumberOfPages(); 
    String str=PdfTextExtractor.getTextFromPage(reader, 2); //Extracting the content from a particular page.
    System.out.println(str);
    reader.close();
} catch (Exception e) {
    System.out.println(e);
}

un autre

try {

    PdfReader reader = new PdfReader("c:/temp/test.pdf");
    System.out.println("This PDF has "+reader.getNumberOfPages()+" pages.");
    String page = PdfTextExtractor.getTextFromPage(reader, 2);
    System.out.println("Page Content:\n\n"+page+"\n\n");
    System.out.println("Is this document tampered: "+reader.isTampered());
    System.out.println("Is this document encrypted: "+reader.isEncrypted());
} catch (IOException e) {
    e.printStackTrace();
}

les exemples ci-dessus peuvent uniquement extraire le texte, mais vous devez en faire davantage pour supprimer les hyperliens, les puces, les en-têtes et les numéros.

11
Ulaga

Extraire tous les mots-clés du fichier PDF sur votre ordinateur local ou une chaîne encodée en Base64:

    import org.Apache.commons.codec.binary.Base64;
    import org.Apache.pdfbox.pdmodel.PDDocument;
    import org.Apache.pdfbox.text.PDFTextStripper;

    import Java.io.File;
    import Java.io.FileInputStream;
    import Java.io.FileNotFoundException;
    import Java.io.IOException;
    import Java.util.HashMap;
    import Java.util.Map;

    public class WebPagePdfExtractor {

        public static void main(String arg[]) {
            WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();

            System.out.println("From file:   " + webPagePdfExtractor.processRecord(createByteArray()).get("text"));

            System.out.println("From string: " + webPagePdfExtractor.processRecord(getArrayFromBase64EncodedString()).get("text"));
        }

        public Map<String, Object> processRecord(byte[] byteArray) {
            Map<String, Object> map = new HashMap<>();
            try {
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setSortByPosition(false);
                stripper.setShouldSeparateByBeads(true);

                PDDocument document = PDDocument.load(byteArray);
                String text = stripper.getText(document);
                map.put("text", text.replaceAll("\n|\r|\t", " "));
            } catch (Exception exception) {
                exception.printStackTrace();
            }
            return map;
        }

        private static byte[] getArrayFromBase64EncodedString() {
            String encodedContent = "data:application/pdf;base64,JVBERi0xLjMKJcTl8uXrp/Og0MTGCjQgMCBvYmoKPDwgL0xlbmd0aCA1IDAgUiAvRmlsdGVyIC9GbGF0ZURlY29kZSA+PgpzdHJlYW0KeAGF0E0OgjAQBeA9p3hL3UCHlha2Gg9A0sS1AepPxIDl/rFFErVESDddvPlm8nqU6EFpzARjBCVkLHNkipBzPBsc8UCyt4TKgmCr/9HI+GDqg2x8Luzk8UtfYwX5DVWLnQaLmd+qHTsF3V5QEekWidZuDNpgc7L1FvqGg35fOzPlqslFYJrzZdnkq6YI77TXtrs3GBo7oKvNss9mfhT0IAV+e6CUL5pSTWb0t1tVBKbI5McsXxNmciYKZW5kc3RyZWFtCmVuZG9iago1IDAgb2JqCjE4NQplbmRvYmoKMiAwIG9iago8PCAvVHlwZSAvUGFnZSAvUGFyZW50IDMgMCBSIC9SZXNvdXJjZXMgNiAwIFIgL0NvbnRlbnRzIDQgMCBSIC9NZWRpYUJveCBbMCAwIDU5NSA4NDJdC" +
                    "j4+CmVuZG9iago2IDAgb2JqCjw8IC9Qcm9jU2V0IFsgL1BERiAvVGV4dCBdIC9Db2xvclNwYWNlIDw8IC9DczEgNyAwIFIgL0NzMiA4IDAgUiA+PiAvRm9udCA8PAovVFQxIDkgMCBSID4+ID4+CmVuZG9iagoxMCAwIG9iago8PCAvTGVuZ3RoIDExIDAgUiAvTiAxIC9BbHRlcm5hdGUgL0RldmljZUdyYXkgL0ZpbHRlciAvRmxhdGVEZWNvZGUgPj4Kc3RyZWFtCngBhVVdaBxVFD67c2cDEgcftA0ttIM/bQnpMolWE4u12026SRO362ZTmyrKdHY2O81kZpyZ3SahT6XgmxYE6augPsaCCLYqNi/2paXFkko1DwoRWowgKH1S8Dsz22R2QTLDnfnuueeee8537rmXqOtv3fPstEo054R+oZybPjl9Su26TWlSqJvw6Ebg5UqlCcaO65j8b38e3qUUS+7sZ1vtY1v25KoZGNC6huZWA2OOKKURZWqG54dEXZcgHzwbeoxvAz85WynngdeAldZcQHqqYDqmbxlqwdcX1JLv1i" +
                    "w76etW42xjy2fObrCv/OxG6w5mJ8fx74XPF0xnahJ4H/CSoY8w7gO+27ROFGOcTnvhkXKsn842ZqdyLfnJmn90qiW/UG+MMs4SpZcW65U3gJ8AXnVOF4+39Ndn3XG200Mk9RhB/hTws8Ba3RzjPKnAFd8tsz7Lw6o5PAL8MvAlKxyrAMO+9EPQnGQ5sKDFep79xFoie0Y/VgLeBnzItAu8FuyIiheW2OYg8LxjF3ktxC4um0EUL2IXP4X1ymisL6dDv8JznyaS99Sso2PA4EQerfujLIc/cujZ0d56EXjJb5Q59j3Aa7o/UgCGzcxjVX2YeX4BeIBOpHQyyaXT+Brk0L+INyCLmhHyyMdYDX2bCtBw0Hz0DGgVgHRaAColtEz0WCeeo1IVPZVmollBhNjK/ahvUH7Xp9SAtE7rkNaBXqNfIsk8/Upz6OchbWBspsNuHl44tAgP2BO2+aBl0xXbhSaeRzsoJsQrYlAMkSpeFYfFITEM6ZA4GM2JvU/6zn4+2LD0LtZN+r4MDkKsZ8MzB6xwNAE8+AfrzkaaCbYu7mjs87yP3j/vv2MZtz74s429APoxJ7/BogtrJiXmXj/3TU/CQ3VFfPXWne7r5+h4MktR3qqdWZLX5PvyCr735NWkDflneRXvvbZcPcoL/5O5zSFGO5LNQc48m1G0ccYbwCG4qUVz9rdZTLLptmK0YMlClJ2ruP/LCfPDPLexUnMu7vC8tz9jNs33ig+LdL5Pu6y" +
                    "ta59oP2p/aCvax0C/Sx9KX0rfSlekq9INUqVr0rL0nfS99Ln0NXpfQLosXenYSXHsG7sHfsZ71mjtMGaGsxQQ88LazApLH/F3BmOb+TOh1V4Dnbt/Yy3liLJTeUYZVnYrzykTSq9yQDmsbFcG0PqVUWUvRnZusGRjPc6AhX+SZ4umI67iPLFXdbDnw0sd76ZfXMPWhjXYST0Ontnapg6vEVe/FVVjvDtdnAY6TSFii84ich86nB8nqv7O2VyTODVSb+KUsMQu0S/GWjWYEwdQheNt9TjIVZoZyQxncqRmejNDmf7MMcZRrNH5ktmL0SF8RxLeM8sx/5s1xGcY7x3mqAlso4dbKzTncd8R5V1vwbdm6qE6oGkvqTlcr6Y65hjZPlW3bTUaClTfDEy/aVazxHc3zyP66/XoTk5tu2E0/GYso1TqJtF/t4+TNAplbmRzdHJlYW0KZW5kb2JqCjExIDAgb2JqCjExMTYKZW5kb2JqCjcgMCBvYmoKWyAvSUNDQmFzZWQgMTAgMCBSIF0KZW5kb2JqCjEyIDAgb2JqCjw8IC9MZW5ndGgg" +
                    "MTMgMCBSIC9OIDMgL0FsdGVybmF0ZSAvRGV2aWNlUkdCIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlID4+CnN0cmVhbQp4AYVVW4gbVRj+kznJCrvO09rVLaRDvXQpu0u2Fd2ltJpbk7RrGrLZ1RZBs5OTZMzsJM5M0gt9KoLii6u+SUG8vS0IgtJ6wdYH+1KpUFZ36yIoPrR4QSj0RbfxO5NkJllqm2XPfPP93/lv558ZooG1Qr2u+xWiJcM2c8mo8tzRY8rAOvnpIRqkURosqFY9ks3OEn5CK679v1s/kE8wVyfubO9Xb7kbLHJLJfLdB75WtNQl4BNEgbNq3bSJBobBTx+36wKLHIZNJAj8osDlNoaNhhfb+DVHk8/FoDkLLKuVQhF4BXh8sYcv9+B2DlDAT5Ib3NRURfQia9ZKms4dQ3u5h7lHeTe4pDdQs/PbgXXIqs4dxnUMtb9SLMQFngReUQuJOeBHgK81tYVMB9+u29Ec8GNE/p2N6nwEeDdwqmQenAeGH79ZaaS6+J1Tlfyz4LeB/8ZYzBzp7F1TrRh6STvB367wtOhviEhSN" +
                    "DudB4Yf6YBZywk9cpBKRR5PAI8Dv16tHRY5wKf0mdWcE7zIZ+1UJSbyFPzllwqHssCjwL9yPSn0iCX9W7eznRxYyNAzIi5isTi3nHrhh4XsSj4FHnGZbpv5zl62XNIOpjv6TypmSvBi77W67swocgv4zUZO1I5YgcmCmUgCw2cgy4150U+Bm7TgKxCnGi1iVcmgTVIoR0mK4lonE5YSaaSD4bByMBx3Xc2Es8+iKniNmo7Nwpp1lO2dXa1CZbAGXXe0KsVCH1EDnir0B9iK61OhGO4a4Mr/46edy42OnxobYWG2F//72Czbz6bZDCnsKfY0O8DiYGfYPtd3Fnu6FYl8biBK28/LiMgd3QJqv4gabSpg/QWKGlmuh76uLI82xjzLGfMFTb3yxt89vdKws+oqJvo6euRePQ/8FrgeWMW6HthwfSiBnwIb+FtHb7xaap6902VxUhpOtNan23oWXVUElerOziV0QUPNvKfmiV4fl05/+aAXbZWde/7q0KXTJWN51GNFF/irmVsZOjPuseEfw3+GV8PvhT8M/y69LX0qfSWdlz6XLpMiXZ" +
                    "AuSl9L30ofS1+4+rvNkHv2JDIXcyXyFtPVrbC315hYOSpvlx+W4/IO+VF51lUp8og8JafkXbBsd8/Nm2+lt3L05Siidftz51jiWdFcTzgD3/2YAM2L2DcD88hYo+PwaaLfYt4MOglt75PXqYiF2BRLb5nuaTHzXd/BRDAejJAS3B2cCU4FDwncfZaDu2CbwZrozQ3z4Sr6KuU2PyG+JxSr1U+aWrliK3vC4SeVCD59XEkb6uS4UtB1xTFZisktbjZ5cZLEd1PsI7qZc76Hvm1XPM5+hmj/X3j3fe9xxxpEKxbRyOMeN4Z35QPvEp17Qm2YzbY/8vm+I7JKe/c4976hKN5fP7daN/EeG3iLaPPNVuuf91utzQ/gf4Pogv4foJ98VQplbmRzdHJlYW0KZW5kb2JqCjEzIDAgb2JqCjEwNzkKZW5kb2JqCjggMCBvYmoKWyAvSUNDQmFzZWQgMTIgMCBSIF0KZW5kb2JqCjMgMCBvYmoKPDwgL1R5cGUgL1BhZ2VzIC9NZWRpYUJveCBbMCAwIDU5NSA4NDJdIC9Db3VudCAxIC9LaWR" +
                    "zIFsgMiAwIFIgXSA+PgplbmRvYmoKMTQgMCBvYmoKPDwgL1R5cGUgL0NhdGFsb2cgL1BhZ2VzIDMgMCBSID4+CmVuZG9iago5IDAgb2JqCjw8IC9UeXBlIC9Gb250IC9TdWJ0eXBlIC9UcnVlVHlwZSAvQmFzZUZvbnQgL0NOVFpYVStNZW5" +
                    "sby1SZWd1bGFyIC9Gb250RGVzY3JpcHRvcgoxNSAwIFIgL0VuY29kaW5nIC9NYWNSb21hbkVuY29kaW5nIC9GaXJzdENoYXIgMzIgL0xhc3RDaGFyIDExNiAvV2lkdGhzIFsgNjAyCjAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgNjAyIDYwMiA2MDIgNjAyIDYwMiA2MDIgMCAwIDAgMCAwIDAgMCAwIDAKMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgMCAwIDAgNjAyIDAgMAo2MDIgNjAyIDYwMiA2MDIgNjAyIDYwMiAwIDAgNjAyIDYwMiAwIDAgNjAyIDAgMCA2MDIgNjAyIF0gPj4KZW5kb2JqCjE1IDAgb2JqCjw8IC9UeXBlIC9Gb250RGVzY3JpcHRvciAvRm9udE5hbWUgL0NOVFpYVStNZW5sby1SZWd1bGFyIC9GbGFncyAzMyAvRm9udEJCb3gKWy01NTggLTM3NSA3MTggMTA0MV0g" +
                    "L0l0YWxpY0FuZ2xlIDAgL0FzY2VudCA5MjggL0Rlc2NlbnQgLTIzNiAvQ2FwSGVpZ2h0IDcyOQovU3RlbVYgOTkgL1hIZWlnaHQgNTQ3…/ZfICj5JcLdi/ATmQZKogDPg0lIDBunI0ZGOB1OB/Lpyce1TbJqCpBThycVs3GyQPZSLKexbMGyFss8LF4sNb2lElu5HPlJ2439G1jKsbRh6cTyPNpx8I6AFxa8P+xD2E4e/G+5PqJ/8aDzERFvGBJR/WLkfwcM3kRCiZpokDMdxhn5MeD9Rn5MSm0mYUpLSF98J5HXaQgtpJvoDWGesEe4C4NgK3woWsQ88RgzszXsMM4WyALeIC5gO5B/FYk/pNxVCJGoZT8NYc8LIknrONeVQYznus51pYeZHCaXw+RYIJLAEogJfMEbVPrvv31S6icvTMlp1EQhO41cOuXb0EEkSYkmGaMXSzuIfhCKAA4Y/YScTs9ASizblWVyWB1UT4fwNfSp9+mgwLFd4oI3D++9++kuheYWpOnEeBhLJrv7kVg" +
                    "Xk1hkVDRExLgkieUZTTt1jZYGkTTiXU8tULUtIsEIfeKMgY5AV3u7yZyTQdK6Mm923fwgHe1GZWTfmCJy5CYi05PgwqWzB5HBw2n2wL7OBEmVPZxmZYpWi6TSU7pM2BNY1kojs0sLN1bPOLZ4/nuzL1CNp/S+zt27dx+lA4Y/2/hA1fq8kR9kZF57u6R96YgvZRmsvfe5OBj5TSKjkN+wRqu6LrRF1yjF19lbYhudDVKTdVe/8DAClihbX6MNEuItofH9kF9k+FwXMofC7rqCDHcZu293G7tz0qmNWi2iM6FvYrYN2RuEvCbT7GDnZ0xDyMZt/Otb8z+aP+/dOS17927ZurVu24ZVnrYFz7w95jxlayE+8b3Nf/m6b5/j2QMb1v26qeXZ8iWVSUkH7fYLb1bKBw7aA57LYgVGYAGtLM8dT3WgIwC6PAIaVSOjsCaUatXEFiJKBm0fvTEQODe0K9Mki/mK3DPnBOUsHkchH/ckhFIHZJmyrE6T0+TIFi7xfvRjx/X33jves5rFBb6Gk4GsHXwbLX1Hlp0XZZeKa8eRYe4EURUX3agy" +
                    "1RnXWxp1QiNZo2tS7baBjUTYqDqBGONtspI7UEwosSsoMUVevAM5CEO9mmRVEquF/ExwsrxOCTd7OpKnpXxFjfzzO8uPTnz44Oydb7bunLy1kHXu5huMBt59vYvfsNtPZmb4tjfvdblQGjXI23jUayTpg9w5VfFRjer4RqP6DRGPs/ViY3iDscmVYCN9dQkqKZaGxbuMga6uwBXZeYLq/MKI6jShPq0DqDNBUBg0Wy2C0y6YjMSRGU4TJKslPKhYuJS7fkL7u+m7F33yzc3PeOBb6qSWsZv4Zys3bVq5as0atu+gK5Ff4ldLH+d3vvuW36bL6Ab6LF0X37Pw4I4dB//4+z0+RZ8y306xEuNGP7LI3V+tItF2baRBRfZHqurNjjr7O3H1fdrMTZE6GilG6dWSNt8uStbh/Y03u9AkM1G3snI7rtwMyCKWd2DKMeegZ6W749Lj0+3pjvSEZtJMm4VmdbNme3hzRHNkc1RztH4m7rJ3Q4OzB5uc2XpE9M0eOOh+mi1LoNfdwlGfQtuwV159duGWPfTAgfv/VP3GBz98d4eu2jirfca81uK6o8P62oWsJxaXLT57sN/4npUtpY/8eXvr4bhVzwwa6E9MnDIlc2PQditxr2bMIowYLdLd0YxYouv1lvqQJn0bfREiRCIJo0xmzeg43Ju8NTk2KIaDe0ynpqxeHlEd5ixUx790gazCdL9/QFPpiWvX3y/byg1ramr" +
                    "q6mpq1sAZYeQ/utYVTaP3Uys10cHTuOaj8xfPdV44L/uSzE8Jyt6K/GQFIyKGKSUIChgEY09jScPcUUJf02C4DCNRShuL32qS0Zoo1WGVZIMYbEXZ2QmaSVamWRUUnlgS+DzknT3F7eWPHpnBf+Dnqf3GR3d82g1ran4XItRPl744dl/OW8nJNIeGUS118782Lt3lWyT72RH08USUUxgZiFIyUm3IfonWkxf10mG1EKYioUzSGTQW47mhHYGhHZmKAVzJDKD60b9lA0aJxFHZyWSndqBKs8TEM3Mn0JV8hZ930uRdf5IsTZPnz/UG0uCMd6JfTq9lefDRornXFke7E6O0tpjEUDDXhYWH1tvC6w2AlmgzHEk63D8xikjaUZLZ7BiNhtjRqy10846gERo7u9EC0RKRGyV0Bx0nDL3pxzg5TJANrleZEdlZMH31ytXrvWtWrPZ3Xx3fUjSneeTmNSlbyjuuX+9Y2JDmF3JOffzxqVOfnuefBXggNmb/gJTtvpCqWQ/TIVSFZ+mQh6ZvkPcRlF+MIr8Ud2SoHvC3OKne1KZ9UU0FiYzVhUqaQgvaGIoMTWwoxnKM67KFOU1BZrGTpfh/uBhz4LEnVtb5/RmvL3ljl7C/Z6ywv3H9W2/0rJYsPTtK5l6W5daN+qqWDHh+6oicYqKpyD9kyocpRTtSXcR8Em1J7muwlW1LexrtSs5JZLuScwa5pXgGyx/hdZxIeA" +
                    "JTPMvxBMSaYoSmQ+nHtDywiJbzyzTe7xcfDmR5vTBcyPsKv7yBPEyXopGDxCAHOoUoppRILPQiribnP/IqOjlLka3XEo5eIbu8bCTCqRmej6+99ib/lF6im3/13ItnD8OtF4LyxN+YxQq0iwTyqjsx0mwIFVUkLkZSWbX1dmiLORxlVBGTIWSC" +
                    "NNE0wTAxNnJCdIHTeHOcTzt1nM80dUbxARJ9r/0+T2BoAA0leOYPHXrlpnIwoYmg8NPdo9LFdJYupavSQ9JD09Xpmtzw3IjcyNyo3OjcmNzY3LhcWzVUi9WsWqpWVYdUh1arqzXecG+EN9Ib5Y32xnhjvXFem5POpPLJEh5Ff6LMf2vVqgwKOx" +
                    "IeHbu64vXswkn3v54zdkzOzp2Oubnjy6B7dMEZfqlnubDymyWVX/SsEFbeWCy3YknJ0NxCWddt/CFxKspCjmFZ7tgfY1ibvokegcNxGL9GKZGsUI5imYqJoV/8GMZcsm0pXJhNRtkbfuofdPmBA3IYu/rV+/Oa6I3VNatqa1fVrF7Xc1xSe4um" +
                    "8Xf5df53fnwavfXR+Qud5y5iFJPtvRPjmIQ8JZJlbrdOK+g1EfG2kFBBpY6wxdvy4myRao0tXrSSOtouWuqs7ZH1JrHe1WZqSopTa+JjVOSBGEk/RiVZEgqSgu58RXZfObDIh6OR3+o23uo2RyjHipKn6ZU8Tak9CcETUz5L4pVkSPq3k2cPTBMGYPo2CFUCJx9oLqqqfPitsWvXdX1YtP+x+YemPrvqVkjBy789//70FjFn34ABk4vGjXXqo7dVtbQ6nW3Z2XM91RmCPn7jilf+4FD2incAMYS9hLExwx2pZyEG2E9M9HDIfnWIJhTzYclo1v88MnbdHNohp0Cyg8sx8Wdmb8Lr7nY+a9ayU5dP7ZZDI3uJH/b2NP9qzsaWE0KJlw5HnSvPvefwlwRZ2r988CqMnmzBUyScRGD+EUXyySgymowhY8k4Mp48gLl+EXmITFM+pPjvhSANSb5Ej5w4dXrxg8kTyhYtrEidUjZ/2cLZTxLyT2S78dEKZW5kc3RyZWFtCmVuZG9iagoxNyAwIG9iago0ODAxCmVuZG9iagoxOCAwIG9iagooKQplbmRvYmoKMTkgMCBvYmoKKE1hYyBPUyBYIDEwLjEyLjYgUXVhcnR6IFBERkNvbnRleHQpCmVuZG9iagoyMCAwIG9iagooKQplbmRvYmoKMjEgMCBvYmoKKCkKZW5kb2JqCjIyIDAgb2JqCihUZXh0TWF0ZSkKZW5kb2JqCjIzIDAgb2JqCihEOjIwMTcxMjEyMTMwMzQ4WjAwJzAwJykKZW5kb2JqCjI0IDAgb2JqCigpCmVuZG9iagoyNSAwIG9iagpbICgpIF0KZW5kb2JqCjEgMCBvYmoKPDwgL1RpdGxlIDE4IDAgUiAvQXV0aG9yIDIwIDAgUiAvU3ViamVjdCAyMSAwIFIgL1Byb2R1Y2VyIDE5IDAgUiAvQ3JlYXRvcgoyMiAwIFIgL0NyZWF0aW9uRGF0ZSAyMyAwIFIgL01vZERhdGUgMjMgMCBSIC9LZXl3b3JkcyAyNCAwIFIgL0FBUEw6S2V5d29yZHMKMjUgMCBSID4+CmVuZG9iagp4cmVmCjAgMjYKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDA4OTI5IDAwMDAwIG4gCjAwMDAwMDAzMDAgMDAwMDAgbiAKMDAwMDAwMzAyOCAwMDAwMCBuIAowMDAwMDAwMDIyIDAwMDAwIG4gCjAwMDAwMDAyODEgMDAwMDAgbiAKMDAwMDAwMDQwNCAwMDAwMCBuIAowMDAwMDAxNzUzIDAwMDAwIG4gCjAwMDAwMDI5OTIgMDAwMDAgbiAKMDAwMDAwMzE2MSAwMDAwMCBuIAowMDAwMDAwNTEyIDAwMDAwIG4gCjAwMDAwMDE3MzIgMDAwMDAgbiAKMDAwMDAwMTc4OSAwMDAwMCBuIAowMDAwMDAyOTcxIDAwMDAwIG4gCjAwMDAwMDMxMTEgMDAwMDAgbiAKMDAwMDAwMzU0NCAwMDAwMCB" +
                    "uIAowMDAwMDAzNzk2IDAwMDAwIG4gCjAwMDAwMDg2ODcgMDAwMDAgbiAKMDAwMDAwODcwOCAwMDAwMCBuIAowMDAwMDA4NzI3IDAwMDAwIG4gCjAwMDAwMDg3ODAgMDAwMDAgbiAKMDAwMDAwODc5OSAwMDAwMCBuIAowMDAwMDA4ODE4IDAwMDAwIG4gCjAwMDAwMDg4NDUgMDAwMDAgbiAKMDAwMDAwODg4NyAwMDAwMCBuIAowMDAwMDA4OTA2IDAwMDAwIG4gCnRyYWlsZXIKPDwgL1NpemUgMjYgL1Jvb3QgMTQgMCBSIC9JbmZvIDEgMCBSIC9JRCBbIDxkYjc4M2NhNDM2Mzg4YzI5ZDc5MDQ2NzY3NjUxNjE3OT4KPGRiNzgzY2E0MzYzODhjMjlkNzkwNDY3Njc2NTE2MTc5PiBdID4+CnN0YXJ0eHJlZgo5MTA0CiUlRU9GCg==";
            String content = encodedContent.substring("data:application/pdf;base64," .length());
            return Base64.decodeBase64(content);
        }

        public static byte[] createByteArray() {
            String pathToBinaryData = "/bla-bla/src/main/resources/small.pdf";

            File file = new File(pathToBinaryData);
            if (!file.exists()) {
                System.out.println(" could not be found in folder " + pathToBinaryData);
                return null;
            }

            FileInputStream fin = null;
            try {
                fin = new FileInputStream(file);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }

            byte fileContent[] = new byte[(int) file.length()];

            try {
                fin.read(fileContent);
            } catch (IOException e) {
                e.printStackTrace();
            }

            return fileContent;
        }
    }
0
Dmitri Algazin