files (shared.Files) | files (File, Blob, shared.Files) | The file to process. |
coordinates (bool) | coordinates (boolean) | If true, return bounding box coordinates for each element extracted via OCR. Default: false |
encoding (str) | encoding (string) | The encoding method used to decode the text input. Default: utf-8 |
extract_image_block_types (List[str]) | extractImageBlockTypes (string[]) | The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields |
gz_uncompressed_content_type (str) | gzUncompressedContentType (string) | If the file is gzipped, use this content type after unzipping. Example: application/pdf |
hi_res_model_name (str) | hiResModelName (string) | The name of the inference model used when strategy is hi_res. Example: chipper |
include_page_breaks (bool) | includePageBreaks (boolean) | If True, the output will include page breaks if the file type supports them. Default: false |
languages (List[str]) | languages (string[]) | The languages present in the document, for use in partitioning and/or OCR. See the Tesseract documentation for a full list of languages. |
output_format (str) | outputFormat (string) | The format of the response. Supported formats are application/json and text/csv. Default: application/json. |
pdf_infer_table_structure (bool) | pdfInferTableStructure (boolean) | Deprecated! If True and strategy=hi_res, any Table Elements extracted from a PDF will include an additional metadata field, 'text_as_html', whose value (string) is just a transformation of the data into an HTML table. |
skip_infer_table_types (List[str]) | skipInferTableTypes (string[]) | The document types for which table extraction should be skipped. Default: ['pdf', 'jpg', 'png', 'heic'] |
split_pdf_page (bool) | splitPdfPage (boolean) | Whether the PDF file should be split on the client. Ignored by the backend. |
strategy (str) | strategy (string) | The strategy to use for partitioning PDF/image. Options are fast, hi_res, auto. Default: auto |
unique_element_ids (bool) | uniqueElementIds (boolean) | When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise a SHA-256 of the element text is used. Default: False |
xml_keep_tags (bool) | xmlKeepTags (boolean) | If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents. |
chunking_strategy (str) | chunkingStrategy (string) | Use one of the supported strategies to chunk the returned elements after partitioning. When chunking_strategy is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: "basic", "by_title" |
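
The defaults and allowed values listed in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the SDK: `build_partition_params` and the three `ALLOWED_*` constants are hypothetical names introduced here for illustration, and the helper only assembles and validates a plain dictionary using the Python-style parameter names from the first column. It does not call the API.

```python
from typing import Any, Dict

# Allowed values as listed in the parameter table above.
ALLOWED_STRATEGIES = {"fast", "hi_res", "auto"}
ALLOWED_CHUNKING = {"basic", "by_title"}
ALLOWED_OUTPUT_FORMATS = {"application/json", "text/csv"}


def build_partition_params(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides onto the documented defaults
    and reject values the table does not list as supported."""
    # Only parameters with a default stated in the table are pre-filled.
    params: Dict[str, Any] = {
        "coordinates": False,
        "encoding": "utf-8",
        "include_page_breaks": False,
        "output_format": "application/json",
        "skip_infer_table_types": ["pdf", "jpg", "png", "heic"],
        "strategy": "auto",
        "unique_element_ids": False,
    }
    params.update(overrides)

    if params["strategy"] not in ALLOWED_STRATEGIES:
        raise ValueError(f"unsupported strategy: {params['strategy']!r}")
    if params["output_format"] not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {params['output_format']!r}")

    # chunking_strategy has no default: when it is absent, no chunking is performed.
    chunking = params.get("chunking_strategy")
    if chunking is not None and chunking not in ALLOWED_CHUNKING:
        raise ValueError(f"unsupported chunking_strategy: {chunking!r}")

    return params


if __name__ == "__main__":
    example = build_partition_params(
        strategy="hi_res",
        languages=["eng"],
        chunking_strategy="by_title",
    )
    print(example["strategy"], example["chunking_strategy"])
```

Validating locally in this way surfaces a mistyped `strategy` or `chunking_strategy` before the file is uploaded. How the resulting dictionary is passed to the client depends on the SDK version in use and is left out of this sketch.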