The Ingest Library coordinates three steps: pulling data from data providers, partitioning the content, and pushing the partitioned content to a desired location. This documentation covers the library's features, architecture, installation, configuration, usage, API reference, troubleshooting, and examples.
Source Connectors: Connect to your favorite data storage platforms for effortless batch processing of your files.
Destination Connectors: Connect to your favorite data storage platforms to write your ingest results to.
Ingest Configuration: The configurations used when generating an ingest process.
The Ingest Library follows a modular architecture comprising the following components:
Source Connectors: These components are responsible for fetching data from external sources, which can include databases, web services, file systems, or data streams.
Partitioning Engine: This component partitions the incoming data into discrete Elements for processing and distribution.
Reformatters: Optional steps that transform the partitioned output, such as chunking and adding embeddings.
Destination Connectors: These components send the partitioned data to the desired destination, which could be a database, data warehouse, cloud storage, or any other user-defined target.
The library’s modular architecture provides flexibility and extensibility, allowing users to integrate custom components and adapt the library to their specific needs.
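The flow through these four components can be sketched in plain Python. This is only an illustration of the architecture, not the library's actual API: Element, in_memory_source, partition_paragraphs, chunk, and run_pipeline are all hypothetical names, the "source" is an in-memory dict, and the "destination" is a list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


@dataclass
class Element:
    """One partitioned piece of a document (hypothetical stand-in)."""
    text: str
    metadata: dict = field(default_factory=dict)


def in_memory_source(docs: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Source connector stand-in: yields (name, raw_text) pairs."""
    yield from docs.items()


def partition_paragraphs(name: str, raw: str) -> List[Element]:
    """Partitioning stand-in: one Element per non-empty paragraph."""
    return [
        Element(p.strip(), {"source": name})
        for p in raw.split("\n\n")
        if p.strip()
    ]


def chunk(elements: List[Element], max_chars: int) -> List[Element]:
    """Reformatter stand-in: greedily merge neighbours up to max_chars."""
    chunks: List[Element] = []
    for el in elements:
        if chunks and len(chunks[-1].text) + 1 + len(el.text) <= max_chars:
            merged = chunks[-1].text + " " + el.text
            chunks[-1] = Element(merged, chunks[-1].metadata)
        else:
            chunks.append(el)
    return chunks


def run_pipeline(
    docs: Dict[str, str],
    destination: List[Element],
    reformatters: Sequence[Callable[[List[Element]], List[Element]]] = (),
) -> List[Element]:
    """Source -> partition -> optional reformatters -> destination."""
    for name, raw in in_memory_source(docs):
        elements = partition_paragraphs(name, raw)
        for step in reformatters:
            elements = step(elements)
        destination.extend(elements)  # destination connector stand-in
    return destination


results = run_pipeline(
    {"a.txt": "Hello.\n\nWorld.\n\nA much longer third paragraph here."},
    destination=[],
    reformatters=[lambda els: chunk(els, max_chars=20)],
)
```

Because each stage only consumes and produces a list of elements, any stage can be swapped for a custom one without touching the others, which is the flexibility the modular design is meant to provide.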
To install the Ingest Library, follow these steps:
Run pip install unstructured to install the latest version of the unstructured library, which includes the ingest code and the CLI.
For specific connectors, run pip install unstructured[CONNECTOR_DEPS], where CONNECTOR_DEPS is the extra dependency label for a particular connector. For example, pip install unstructured[s3] installs the dependencies needed to interact with the S3 connectors. If these are not installed beforehand, the ingest CLI prints an error message the first time you run it, prompting you with the correct pip command to run.
Once installed, you can run unstructured-ingest --help to get all the available commands.
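If you drive the library from Python rather than the CLI, a similar check for missing connector dependencies can be sketched with the standard library. This is an illustrative helper, not part of the Ingest Library: missing_extra_hint is a hypothetical name, and which module a given extra provides (for example, that the s3 extra supplies s3fs) is an assumption you should verify for your installed version.

```python
import importlib.util
from typing import Optional


def missing_extra_hint(module_name: str, extra: str) -> Optional[str]:
    """Return a pip command if a top-level module is not importable, else None.

    module_name must be a top-level module name; find_spec can raise for
    dotted names whose parent package is missing.
    """
    if importlib.util.find_spec(module_name) is None:
        return f'pip install "unstructured[{extra}]"'
    return None


# Assumption: the s3 extra provides the s3fs module.
hint = missing_extra_hint("s3fs", "s3")
if hint is not None:
    print(f"Missing connector dependencies. Try: {hint}")
```

The quotes around the package spec in the hint keep shells such as zsh from interpreting the square brackets.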
The Ingest Library requires configuration to define data sources, ingestion processes, and destination targets. In the CLI, configuration is supplied through the supported CLI parameters. When the library is run from Python, the parameters exposed in the CLI map to Python config classes, which are described in more detail in the configs section.
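The relationship between CLI parameters and config classes can be illustrated with a small, self-contained sketch. It does not reproduce the library's real config classes or flags: ExampleSourceConfig, its fields, and the example-ingest program name are hypothetical, chosen only to show how one set of fields can back both a CLI and a Python object.

```python
import argparse
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ExampleSourceConfig:
    """Hypothetical config class; field names are illustrative only."""
    remote_url: str = ""
    num_processes: int = 2
    verbose: bool = False


def build_parser(config_cls) -> argparse.ArgumentParser:
    """Derive one CLI flag per dataclass field (snake_case -> --kebab-case)."""
    parser = argparse.ArgumentParser(prog="example-ingest")
    for f in fields(config_cls):
        flag = "--" + f.name.replace("_", "-")
        if f.type is bool:
            # Boolean fields become on/off switches.
            parser.add_argument(flag, action="store_true", default=f.default)
        else:
            parser.add_argument(flag, type=f.type, default=f.default)
    return parser


def config_from_args(config_cls, argv: List[str]):
    """Parse CLI-style arguments into an instance of the config class."""
    namespace = build_parser(config_cls).parse_args(argv)
    return config_cls(**vars(namespace))


# The same settings expressed as CLI arguments and as a Python object.
from_cli = config_from_args(
    ExampleSourceConfig,
    ["--remote-url", "s3://bucket/path", "--num-processes", "4"],
)
from_python = ExampleSourceConfig(remote_url="s3://bucket/path", num_processes=4)
```

Here from_cli and from_python compare equal, which is the point of the mapping: a script can construct the config object directly instead of assembling a command line.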