A standard partition configuration is a set of parameters that controls document partitioning, whether it runs through the API or through the unstructured library on a local system. The parameters fall into two groups: those passed to the partition method for the initial segmentation of documents, and those that coordinate the data after processing, including the dynamic metadata associated with each element.
pdf_infer_table_structure: Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.
skip_infer_table_types: List of document types for which table extraction should be skipped.
strategy (default auto): The strategy to use for partitioning PDF and image files. If set to ‘hi_res’, a layout detection model is used; otherwise, partition simply extracts the text from the document and processes it.
ocr_languages: The languages present in the document, for use in partitioning and/or OCR. To partition image or PDF documents with Tesseract through the local unstructured library, first install the appropriate Tesseract language pack. For other partitions, language is detected using a naive Bayesian filter via langdetect. Specifying multiple languages indicates that the text could be in any of them.
encoding: The encoding method used to decode the text input. If None, utf-8 will be used.
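As a rough illustration of how the parameters above fit together, the following sketch collects them in a small container and turns them into keyword arguments. The class name PartitionSettings and its to_kwargs method are hypothetical and are not part of the unstructured library; leaving skip_infer_table_types unset by default is likewise an assumption made for the example, since the text above does not state its default.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PartitionSettings:
    """Hypothetical container mirroring the parameters described above."""

    strategy: str = "auto"                              # documented default
    skip_infer_table_types: Optional[List[str]] = None  # assumption: unset by default
    ocr_languages: Optional[List[str]] = None
    encoding: Optional[str] = None                      # None means utf-8 is used

    def to_kwargs(self) -> Dict[str, Any]:
        """Build keyword arguments, leaving out every parameter that is None."""
        kwargs: Dict[str, Any] = {"strategy": self.strategy}
        if self.skip_infer_table_types is not None:
            kwargs["skip_infer_table_types"] = list(self.skip_infer_table_types)
        if self.ocr_languages is not None:
            kwargs["ocr_languages"] = list(self.ocr_languages)
        if self.encoding is not None:
            kwargs["encoding"] = self.encoding
        return kwargs


# Example: layout-model partitioning that skips table extraction for PDFs.
settings = PartitionSettings(strategy="hi_res", skip_infer_table_types=["pdf"])
print(settings.to_kwargs())
```

The deprecated pdf_infer_table_structure flag is left out on purpose, since skip_infer_table_types is its documented replacement. The resulting dictionary is intended to be unpacked into a partition call; whether a given installed version accepts every key, and in what format it expects ocr_languages, should be checked against that version's documentation.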