Get lines and paragraphs, not symbols, from Google Vision API OCR on PDF

Question

I'm trying to use the now-supported PDF/TIFF document text detection in the Google Cloud Vision API. Using their sample code, I can submit a PDF and receive back a JSON object with the extracted text. My problem is that the JSON file saved to GCS only contains bounding boxes and text for "symbols", i.e. each character in each word. This makes the JSON object quite unwieldy and very hard to use. I'd like to be able to get the text and bounding boxes for "LINES", "PARAGRAPHS" and "BLOCKS", but I can't find a way to do so via the AsyncAnnotateFileRequest() method.

The sample code is as follows:

    import re

    from google.cloud import storage
    from google.cloud import vision
    from google.protobuf import json_format


    def async_detect_document(gcs_source_uri, gcs_destination_uri):
        """OCR with PDF/TIFF as source files on GCS"""
        # Supported mime_types are: 'application/pdf' and 'image/tiff'
        mime_type = 'application/pdf'

        # How many pages should be grouped into each json output file.
        batch_size = 2

        client = vision.ImageAnnotatorClient()

        feature = vision.types.Feature(
            type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

        gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
        input_config = vision.types.InputConfig(
            gcs_source=gcs_source, mime_type=mime_type)

        gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
        output_config = vision.types.OutputConfig(
            gcs_destination=gcs_destination, batch_size=batch_size)

        async_request = vision.types.AsyncAnnotateFileRequest(
            features=[feature], input_config=input_config,
            output_config=output_config)

        operation = client.async_batch_annotate_files(
            requests=[async_request])

        print('Waiting for the operation to finish.')
        operation.result(timeout=180)

        # Once the request has completed and the output has been
        # written to GCS, we can list all the output files.
        storage_client = storage.Client()

        match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
        bucket_name = match.group(1)
        prefix = match.group(2)

        bucket = storage_client.get_bucket(bucket_name)

        # List objects with the given prefix.
        blob_list = list(bucket.list_blobs(prefix=prefix))
        print('Output files:')
        for blob in blob_list:
            print(blob.name)

        # Process the first output file from GCS.
        # Since we specified batch_size=2, the first response contains
        # the first two pages of the input file.
        output = blob_list[0]

        json_string = output.download_as_string()
        response = json_format.Parse(
            json_string, vision.types.AnnotateFileResponse())

        # The actual response for the first page of the input file.
        first_page_response = response.responses[0]
        annotation = first_page_response.full_text_annotation

        # Here we print the full text from the first page.
        # The response contains more information:
        # annotation/pages/blocks/paragraphs/words/symbols
        # including confidence scores and bounding boxes
        print(u'Full text:\n{}'.format(annotation.text))

Dustin Ingram · Accepted Answer

Unfortunately, when using the DOCUMENT_TEXT_DETECTION type, you can only get the full text per page or the individual symbols. It's not too hard to assemble the paragraphs and lines from the symbols, though; something like this should work (extending your example):

    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
    paragraphs = []
    lines = []

    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                para = ""
                line = ""
                for word in paragraph.words:
                    for symbol in word.symbols:
                        line += symbol.text
                        # A plain space separates words within a line.
                        if symbol.property.detected_break.type == breaks.SPACE:
                            line += ' '
                        # A space that also ends the line: close the line.
                        if symbol.property.detected_break.type == breaks.EOL_SURE_SPACE:
                            line += ' '
                            lines.append(line)
                            para += line
                            line = ''
                        # A hard line break: close the line.
                        if symbol.property.detected_break.type == breaks.LINE_BREAK:
                            lines.append(line)
                            para += line
                            line = ''
                paragraphs.append(para)

    print(paragraphs)
    print(lines)
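Since the question also asks for bounding boxes, the same grouping idea can be extended to keep a box per line. The following is only a sketch, not part of the original answer. It works on the JSON output file parsed into plain dictionaries rather than on the protobuf objects, and it assumes the usual camelCase field names of that JSON (`fullTextAnnotation`, `detectedBreak`, `boundingBox`, `normalizedVertices`). The helper names `union_box` and `lines_with_boxes` are hypothetical, and treating `SURE_SPACE` as a space and `HYPHEN` as a line end goes beyond what the answer's loop handles.

```python
# Break types, as they are assumed to appear in the JSON output.
# HYPHEN and SURE_SPACE are additions beyond the loop in the answer.
LINE_ENDING_BREAKS = {"EOL_SURE_SPACE", "LINE_BREAK", "HYPHEN"}
SPACE_BREAKS = {"SPACE", "SURE_SPACE"}


def union_box(vertex_lists):
    """Return (x_min, y_min, x_max, y_max) covering every vertex, or None."""
    # Zero-valued coordinates may be omitted from the JSON, so default to 0.
    xs = [v.get("x", 0) for vertices in vertex_lists for v in vertices]
    ys = [v.get("y", 0) for vertices in vertex_lists for v in vertices]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def lines_with_boxes(page_response):
    """Group the symbols of one page response into (text, box) line tuples."""
    results = []
    annotation = page_response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                text, boxes = "", []
                for word in paragraph.get("words", []):
                    for symbol in word.get("symbols", []):
                        text += symbol.get("text", "")
                        box = symbol.get("boundingBox", {})
                        # Assumption: PDF output uses normalized vertices;
                        # fall back to pixel vertices if those are present.
                        vertices = (box.get("normalizedVertices")
                                    or box.get("vertices") or [])
                        if vertices:
                            boxes.append(vertices)
                        brk = (symbol.get("property", {})
                               .get("detectedBreak", {}).get("type"))
                        if brk in LINE_ENDING_BREAKS:
                            results.append((text, union_box(boxes)))
                            text, boxes = "", []
                        elif brk in SPACE_BREAKS:
                            text += " "
                # Keep any trailing text that had no explicit line break.
                if text:
                    results.append((text, union_box(boxes)))
    return results
```

To try it, the downloaded output file could be parsed with `json.loads(json_string)` and each entry of its `responses` list passed to `lines_with_boxes`. Paragraph and block boxes should not need this reconstruction, since those levels are expected to carry their own `boundingBox` field in the same output.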