Parameters

The only required parameter is files, the file you wish to process.

| Python & direct call | JavaScript | Description |
| --- | --- | --- |
| files (shared.Files) | files (File, Blob, shared.Files) | The file to process. |
| coordinates (bool) | coordinates (boolean) | If true, return bounding box coordinates for each element extracted via OCR. Default: false |
| encoding (str) | encoding (string) | The encoding method used to decode the text input. Default: utf-8 |
| extract_image_block_types (List[str]) | extractImageBlockTypes (string[]) | The types of elements to extract as image blocks, returned as base64-encoded data stored in metadata fields. |
| gz_uncompressed_content_type (str) | gzUncompressedContentType (string) | If the file is gzipped, use this content type after unzipping. Example: application/pdf |
| hi_res_model_name (str) | hiResModelName (string) | The name of the inference model used when strategy is hi_res. Example: chipper |
| include_page_breaks (bool) | includePageBreaks (boolean) | If true, the output includes page breaks if the file type supports them. Default: false |
| languages (List[str]) | languages (string[]) | The languages present in the document, for use in partitioning and/or OCR. See the Tesseract documentation for a full list of languages. |
| output_format (str) | outputFormat (string) | The format of the response. Supported formats are application/json and text/csv. Default: application/json |
| pdf_infer_table_structure (bool) | pdfInferTableStructure (boolean) | Deprecated! If true and strategy is hi_res, any Table elements extracted from a PDF include an additional metadata field, text_as_html, whose string value is a transformation of the table data into an HTML table. |
| skip_infer_table_types (List[str]) | skipInferTableTypes (string[]) | The document types for which to skip table extraction. Default: ['pdf', 'jpg', 'png', 'heic'] |
| split_pdf_page (bool) | splitPdfPage (boolean) | Whether to split the PDF file on the client side. Ignored by the backend. |
| strategy (str) | strategy (string) | The strategy to use for partitioning PDFs and images. Options are fast, hi_res, and auto. Default: auto |
| unique_element_ids (bool) | uniqueElementIds (boolean) | If true, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise, a SHA-256 hash of the element text is used. Default: false |
| xml_keep_tags (bool) | xmlKeepTags (boolean) | If true, retain the XML tags in the output. Otherwise, extract only the text from within the tags. Applies only to XML documents. |
| chunking_strategy (str) | chunkingStrategy (string) | Use one of the supported strategies to chunk the returned elements after partitioning. When chunking_strategy is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: "basic", "by_title" |

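
The rule stated for chunking_strategy (chunking parameters are ignored unless a strategy is set, and some apply only to "by_title") can be sketched as a small client-side filter. This is an illustration only: `effective_params` and the two name sets are hypothetical helpers written for this page, not part of any Unstructured SDK, and the real filtering happens on the service side. The parameter names are taken from the tables on this page.

```python
# Hypothetical helper for illustration; not part of any Unstructured SDK.
# Chunking parameter names as listed in the chunking table on this page.
CHUNKING_PARAMS = {
    "combine_under_n_chars",
    "include_orig_elements",
    "max_characters",
    "multipage_sections",
    "new_after_n_chars",
    "overlap",
    "overlap_all",
}

# Parameters documented as applying only when chunking_strategy is "by_title".
BY_TITLE_ONLY = {"combine_under_n_chars", "multipage_sections"}


def effective_params(params: dict) -> dict:
    """Return only the parameters that would take effect per the tables."""
    strategy = params.get("chunking_strategy")
    if strategy is None:
        ignored = CHUNKING_PARAMS      # no strategy: all chunking params ignored
    elif strategy != "by_title":
        ignored = BY_TITLE_ONLY        # e.g. "basic": by_title-only params ignored
    else:
        ignored = set()                # "by_title": everything applies
    return {k: v for k, v in params.items() if k not in ignored}


request = {"strategy": "hi_res", "max_characters": 1000, "overlap": 50}
print(effective_params(request))  # -> {'strategy': 'hi_res'}
```

Running a request through a filter like this before sending it can make it obvious when a chunking option has been set without a chunking_strategy.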
The following parameters only apply when a chunking_strategy is specified. Otherwise, they are ignored.

| Python & direct call | JavaScript | Description |
| --- | --- | --- |
| combine_under_n_chars (int) | combineUnderNChars (number) | Applies only when chunking_strategy is set to "by_title". Use this parameter to combine small chunks until the combined chunk reaches a length of n chars. This can mitigate the appearance of small chunks created when short paragraphs, not intended as section headings, are identified as Title elements in certain documents. Default: the same value as max_characters |
| include_orig_elements (bool) | includeOrigElements (boolean) | If true (the default), the elements used to form a chunk appear in .metadata.orig_elements for that chunk. |
| max_characters (int) | maxCharacters (number) | Cut off new sections after reaching a length of n chars (hard max). Default: 500 |
| multipage_sections (bool) | multipageSections (boolean) | Applies only when chunking_strategy is set to "by_title". Determines whether a chunk can include elements from more than one page. Default: true |
| new_after_n_chars (int) | newAfterNChars (number) | Applies only when chunking_strategy is specified. Cut off new sections after reaching a length of n chars (soft max). Default: 1500 |
| overlap (int) | overlap (number) | A prefix of this many trailing characters from the prior text-split chunk is applied to the second and later chunks formed from oversized elements by text-splitting. Default: None |
| overlap_all (bool) | overlapAll (boolean) | If true, overlap is also applied to "normal" chunks formed by combining whole elements. Use with caution, as this can introduce noise into otherwise clean semantic units. Default: None |

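
To make the interaction between max_characters and overlap more concrete, here is a simplified sketch of text-splitting an oversized element. It is an assumption-laden illustration, not the service's implementation: `split_with_overlap` is a hypothetical function, it cuts at fixed character offsets rather than at any preferred boundary, and it treats the documented default of None for overlap as 0.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int = 500, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most max_characters (hypothetical sketch).

    Each chunk after the first begins with the last `overlap` characters
    of the chunk before it.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    if len(text) <= max_characters:
        return [text]  # not oversized, so no text-splitting is needed

    step = max_characters - overlap  # new characters contributed by each later chunk
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this chunk reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghijklmnopqrstuvwxyz", max_characters=10, overlap=3))
# -> ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

In the output, each later chunk starts with the final three characters of its predecessor ("hij", "opq", "vwx"), and no chunk exceeds the hard maximum of ten characters.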
Need help getting started? Check out the Examples page for some inspiration.