auto
(default strategy): The “auto” strategy will choose the partitioning strategy based on document characteristics and the function kwargs.
fast
: The “fast” strategy uses traditional NLP extraction techniques to pull all text elements quickly. It is not recommended for image-based file types.
hi_res
: The “hi_res” strategy will identify the document’s layout using detectron2. The advantage of “hi_res” is that it uses the document layout to gain additional information about document elements. We recommend using this strategy if your use case is highly sensitive to correct classifications for document elements.
ocr_only
: The “ocr_only” strategy uses Optical Character Recognition (OCR) to extract text from image-based files.
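The way “auto” might map document characteristics and kwargs to one of the other strategies can be sketched as follows. This is a minimal illustration, not the library's actual decision logic: `DocumentInfo`, `resolve_strategy`, and the three rules inside it are hypothetical, and only the four strategy names come from the list above.

```python
from dataclasses import dataclass

VALID_STRATEGIES = {"auto", "fast", "hi_res", "ocr_only"}


@dataclass
class DocumentInfo:
    """Hypothetical summary of a document; not part of any library."""
    is_image: bool              # image file type such as PNG or JPG
    has_extractable_text: bool  # an embedded text layer is present


def resolve_strategy(requested: str, doc: DocumentInfo,
                     infer_table_structure: bool = False) -> str:
    """Return the concrete strategy to run for one document."""
    if requested not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {requested!r}")
    # An explicit choice is always respected.
    if requested != "auto":
        return requested
    # Assumed rule 1: images and table extraction need layout detection.
    if doc.is_image or infer_table_structure:
        return "hi_res"
    # Assumed rule 2: an embedded text layer can be read directly.
    if doc.has_extractable_text:
        return "fast"
    # Assumed rule 3: no text layer (e.g. a scanned PDF), so fall back to OCR.
    return "ocr_only"
```

Under these assumed rules, a text-based PDF resolves to “fast”, a scanned PDF to “ocr_only”, and any document for which table extraction is requested to “hi_res”.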
Elements to Exclude
: Select the element types you want to exclude from document processing. This option is useful if you want to leave out specific elements, such as Table or Image elements.
Include Page Breaks
: If checked, the output will include page breaks if the file type supports them. For more information about page breaks, check out the documentation here.
Infer Table Structure
: Check this option if you want to extract tables from PDFs or images.
Keep XML Tags
: If checked, the output will retain the XML tags. This only applies to partition_xml. For more information about XML tags, check out the documentation here.
Reprocess all documents
: If checked, the workflow will reprocess documents that it has already processed.
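The effect of the element exclusion and page break options on the output can be sketched as a simple filter. This is an illustration under stated assumptions rather than the platform's implementation: elements are modeled as plain dictionaries with a `type` key, and `apply_output_options` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def apply_output_options(elements: List[Dict[str, str]],
                         exclude_types: Iterable[str] = (),
                         include_page_breaks: bool = False) -> List[Dict[str, str]]:
    """Drop excluded element types and, unless requested, page breaks."""
    excluded = set(exclude_types)
    # Page breaks are treated as one more element type to filter out.
    if not include_page_breaks:
        excluded.add("PageBreak")
    return [el for el in elements if el["type"] not in excluded]


# Example input, invented for illustration.
elements = [
    {"type": "Title", "text": "Quarterly Report"},
    {"type": "Table", "text": "Q1 10 Q2 12"},
    {"type": "PageBreak", "text": ""},
    {"type": "NarrativeText", "text": "Revenue grew in both quarters."},
]

kept = apply_output_options(elements, exclude_types=["Table"])
print([el["type"] for el in kept])
```

With Table excluded and page breaks left unchecked, only the Title and NarrativeText elements remain.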
Chunk by Title
: When a “Title” element appears, it marks the start of a new section. The system will then finish the current chunk and begin a new one, even if the current chunk has space to include the “Title” element. For more information about chunking by title, please refer to the documentation here.
Basic
: This strategy combines sequential elements to fill each chunk as far as possible while respecting the predefined “max_characters” (hard maximum) and “new_after_n_chars” (soft maximum) settings. For more information about basic chunking, please refer to the documentation here.
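The two chunking strategies can be sketched in a few lines. This is a simplified illustration, not the library's chunker: `chunk_elements` is a hypothetical function, elements are modeled as `(type, text)` tuples, and joining texts with a blank line is an assumption. Only the “max_characters” and “new_after_n_chars” settings and the “Title” rule come from the descriptions above.

```python
from typing import List, Optional, Tuple

Element = Tuple[str, str]  # (element type, text); a simplified stand-in


def chunk_elements(elements: List[Element],
                   max_characters: int = 500,
                   new_after_n_chars: Optional[int] = None,
                   by_title: bool = False) -> List[str]:
    """Combine sequential elements into chunks of at most max_characters."""
    # The soft maximum can never exceed the hard maximum.
    soft = max_characters if new_after_n_chars is None else min(new_after_n_chars, max_characters)
    chunks: List[str] = []
    current: List[str] = []
    size = 0

    def flush() -> None:
        nonlocal current, size
        if current:
            chunks.append("\n\n".join(current))
        current, size = [], 0

    for el_type, text in elements:
        if not text:
            continue
        # Chunk by Title: a Title closes the current chunk even if space remains.
        if by_title and el_type == "Title":
            flush()
        # An element longer than the hard maximum is split into pieces.
        pieces = [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
        for piece in pieces:
            sep = 2 if current else 0  # length of the "\n\n" separator
            if current and size + sep + len(piece) > max_characters:
                flush()
                sep = 0
            current.append(piece)
            size += sep + len(piece)
            # Soft maximum reached: close the chunk early.
            if size >= soft:
                flush()
    flush()
    return chunks


sections = [
    ("Title", "Intro"), ("NarrativeText", "aaaa"),
    ("Title", "Methods"), ("NarrativeText", "bbbb"),
]
print(chunk_elements(sections, max_characters=100))                 # one chunk
print(chunk_elements(sections, max_characters=100, by_title=True))  # one chunk per section
```

With the basic strategy all four elements fit into a single chunk. With `by_title=True` the second “Title” starts a new chunk although the first one is far below the limit.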
OpenAI
: Enter the API Key and select the model name from the dropdown menu. For more information about OpenAI embedding, please refer to the documentation here.
Bedrock
: Enter the AWS Access Key, AWS Secret Key, and AWS Region to connect to AWS Bedrock embedding models. For more information about AWS Bedrock embedding, please refer to the documentation here.
Actions
dropdown menu next to the respective workflow name: