The Unstructured API provides parameters to customize the processing of documents. Below are the details for these parameters.
The only required parameter is files, the file you wish to process.
Python & direct call | JavaScript | Description |
---|---|---|
files (shared.Files) | files (File, Blob, shared.Files) | The file to process. |
coordinates (bool) | coordinates (boolean) | If true, return bounding box coordinates for each element extracted via OCR. Default: false |
encoding (str) | encoding (string) | The encoding method used to decode the text input. Default: utf-8 |
extract_image_block_types (List[str]) | extractImageBlockTypes (string[]) | The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields |
gz_uncompressed_content_type (str) | gzUncompressedContentType (string) | If file is gzipped, use this content type after unzipping. Example: application/pdf |
hi_res_model_name (str) | hiResModelName (string) | The name of the inference model used when strategy is hi_res. Example: chipper |
include_page_breaks (bool) | includePageBreaks (boolean) | If True, the output will include page breaks if the file type supports them. Default: false |
languages (List[str]) | languages (string[]) | The languages present in the document, for use in partitioning and/or OCR. See the Tesseract documentation for a full list of languages. |
output_format (str) | outputFormat (string) | The format of the response. Supported formats are application/json and text/csv. Default: application/json |
pdf_infer_table_structure (bool) | pdfInferTableStructure (boolean) | Deprecated! If True and strategy=hi_res, any Table elements extracted from a PDF will include an additional metadata field, 'text_as_html', whose value (string) is just a transformation of the data into an HTML table. |
skip_infer_table_types (List[str]) | skipInferTableTypes (string[]) | The document types for which to skip table extraction. Default: ['pdf', 'jpg', 'png', 'heic'] |
split_pdf_page (bool) | splitPdfPage (boolean) | Whether to split the PDF file on the client side. Ignored by the backend. |
strategy (str) | strategy (string) | The strategy to use for partitioning PDFs and images. Options are fast, hi_res, auto. Default: auto |
unique_element_ids (bool) | uniqueElementIds (boolean) | When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise, a SHA-256 hash of the element text is used. Default: False |
xml_keep_tags (bool) | xmlKeepTags (boolean) | If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents. |
chunking_strategy (str) | chunkingStrategy (string) | Use one of the supported strategies to chunk the returned elements after partitioning. When chunking_strategy is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: "basic", "by_title" |
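The Python and JavaScript columns above differ only in naming convention: snake_case versus camelCase. As a minimal sketch of that relationship, the hypothetical helper `to_js_name` below (not part of any Unstructured SDK) converts the Python-style names in this table to their JavaScript counterparts. The parameter values are illustrative assumptions only.

```python
def to_js_name(python_name: str) -> str:
    """Convert a snake_case parameter name to its camelCase counterpart."""
    first, *rest = python_name.split("_")
    return first + "".join(part.capitalize() for part in rest)


# Parameters as they would be named in Python or in a direct call.
# The values are illustrative, not recommendations.
python_params = {
    "strategy": "hi_res",
    "coordinates": True,
    "languages": ["eng"],
    "include_page_breaks": True,
    "skip_infer_table_types": ["jpg", "png"],
}

# The same parameters under the names listed in the JavaScript column.
js_params = {to_js_name(name): value for name, value in python_params.items()}
print(js_params)
```

Single-word names such as strategy and languages are identical in both columns, so the helper leaves them unchanged.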
The following parameters apply only when chunking_strategy is specified. Otherwise, they are ignored.
Python & direct call | JavaScript | Description |
---|---|---|
combine_under_n_chars (int) | combineUnderNChars (number) | Applies only when chunking_strategy is set to "by_title". Use this parameter to combine small chunks until the combined chunk reaches a length of n chars. This can mitigate the appearance of small chunks created when short paragraphs, not intended as section headings, are identified as Title elements in certain documents. Default: the same value as max_characters |
include_orig_elements (bool) | includeOrigElements (boolean) | When True (the default), the elements used to form a chunk appear in .metadata.orig_elements for that chunk. |
max_characters (int) | maxCharacters (number) | Cut off new sections after reaching a length of n chars (hard max). Default: 500 |
multipage_sections (bool) | multipageSections (boolean) | Applies only when chunking_strategy is set to "by_title". Determines if a chunk can include elements from more than one page. Default: true |
new_after_n_chars (int) | newAfterNChars (number) | Applies only when chunking_strategy is specified. Cut off new sections after reaching a length of n chars (soft max). Default: 1500 |
overlap (int) | overlap (number) | A prefix of this many trailing characters from the prior text-split chunk is applied to second and later chunks formed from oversized elements by text-splitting. Default: None |
overlap_all (bool) | overlapAll (boolean) | When True, overlap is also applied to ‘normal’ chunks formed by combining whole elements. Use with caution as this can introduce noise into otherwise clean semantic units. Default: None |
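The applicability rules in this table can be checked before sending a request. The sketch below is a hypothetical client-side helper, `inactive_chunking_params`, which is not part of the API or any SDK. It reports which supplied chunking parameters would have no effect according to the descriptions above: all of them when chunking_strategy is absent, and the by_title-only ones when another strategy is chosen.

```python
# Parameters the table marks as applying only to the "by_title" strategy.
BY_TITLE_ONLY = {"combine_under_n_chars", "multipage_sections"}

# All chunking parameters listed in the table above.
CHUNKING_PARAMS = BY_TITLE_ONLY | {
    "include_orig_elements",
    "max_characters",
    "new_after_n_chars",
    "overlap",
    "overlap_all",
}


def inactive_chunking_params(params: dict) -> list[str]:
    """Return the supplied chunking parameters that would have no effect."""
    strategy = params.get("chunking_strategy")
    supplied = [name for name in params if name in CHUNKING_PARAMS]
    if strategy is None:
        # No chunking is performed, so every chunking parameter is ignored.
        return sorted(supplied)
    if strategy != "by_title":
        # Only the by_title-specific parameters are inactive.
        return sorted(name for name in supplied if name in BY_TITLE_ONLY)
    return []


print(inactive_chunking_params({"max_characters": 1000, "overlap": 50}))
```

A non-empty result usually signals a forgotten chunking_strategy or a parameter intended for a different strategy.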