Fork · Cohere · published Apr 28, 2024

cohere-ai/unstructured

forked from Unstructured-IO/unstructured

Description: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Language: HTML

License: Apache-2.0

Stars: 2

Forks: 3

Open issues: 11

Created: 2024-04-28T15:42:49Z

Pushed: 2024-09-03T23:21:33Z

Default branch: main

Fork: yes

Parent repository: Unstructured-IO/unstructured

Archived: no

README:

Open-Source Pre-Processing Tools for Unstructured Data

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. Its use cases revolve around streamlining and optimizing the data processing workflow for LLMs. The library's modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient at transforming unstructured data into structured outputs.

API Announcement!

We are thrilled to announce our newly launched Unstructured API, which exposes the capabilities of the unstructured library as an API. Check out the `unstructured-api` GitHub repository to start making API calls. You'll also find instructions there for hosting your own version of the API.

While access to the hosted Unstructured API will remain free, API keys are required to make requests. To prevent disruption, get yours here and start using it today!

:rocket: Beta Feature: Chipper Model

We are releasing the beta version of our Chipper model to deliver superior performance when processing high-resolution, complex documents. To use the Chipper model in your API request, pass the hi_res_model_name=chipper parameter. Please refer to the documentation here.

As the Chipper model is in beta, we welcome feedback and suggestions. If you are interested in testing the Chipper model, we encourage you to connect with us in our Slack community.
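As a rough illustration of where the hi_res_model_name=chipper parameter goes, the sketch below only builds a curl command and never sends it. The endpoint URL, the `unstructured-api-key` header name, the `files` form field, and the `strategy=hi_res` field are assumptions that do not come from this README; check them against the `unstructured-api` README before relying on them. `chipper_curl_command` is a hypothetical helper name.

```python
import shlex

# ASSUMPTION: hosted endpoint URL; not stated in this README.
API_URL = "https://api.unstructured.io/general/v0/general"


def chipper_curl_command(file_path: str, api_key: str, api_url: str = API_URL) -> list[str]:
    """Build (but do not run) a curl command that asks for the Chipper model."""
    return [
        "curl", "-X", "POST", api_url,
        "-H", "accept: application/json",
        # ASSUMPTION: header name used to pass the API key.
        "-H", f"unstructured-api-key: {api_key}",
        # ASSUMPTION: the document is uploaded as a "files" form field.
        "-F", f"files=@{file_path}",
        # ASSUMPTION: Chipper is selected together with the hi_res strategy.
        "-F", "strategy=hi_res",
        # From the README: the parameter that selects the Chipper model.
        "-F", "hi_res_model_name=chipper",
    ]


cmd = chipper_curl_command("example-docs/layout-parser-paper-fast.pdf", "YOUR_API_KEY")
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the request before any document leaves your machine.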

:eight_pointed_black_star: Quick Start

There are several ways to use the unstructured library:

1. Install from PyPI
2. Install for local development

  • For installation with conda on a Windows system, please refer to the documentation

Run the library in a container

The following instructions are intended to help you get up and running using Docker to interact with unstructured. See here if you don't already have Docker installed on your machine.

NOTE: we build multi-platform images to support both x86_64 and Apple silicon hardware. docker pull should download the corresponding image for your architecture, but you can specify with --platform (e.g. --platform linux/amd64) if needed.

We build Docker images for all pushes to main. We tag each image with the corresponding short commit hash (e.g. fbc7a69) and the application version (e.g. 0.5.5-dev1). We also tag the most recent image with latest. To leverage this, docker pull from our image repository.

docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
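The tagging scheme described above (latest, a short commit hash, or an application version) can be sketched as a small helper that assembles the image reference and the matching docker pull command. The helper names `image_ref` and `docker_pull_command` are hypothetical; the repository path, the example tags, and the --platform flag come from the text above. Nothing is executed, the commands are only printed.

```python
import shlex
from typing import Optional

REPOSITORY = "downloads.unstructured.io/unstructured-io/unstructured"


def image_ref(tag: str = "latest") -> str:
    """Return the full image reference for a tag (latest, commit hash, or version)."""
    return f"{REPOSITORY}:{tag}"


def docker_pull_command(tag: str = "latest", platform: Optional[str] = None) -> list[str]:
    """Build the docker pull argument list, optionally pinning the platform."""
    cmd = ["docker", "pull"]
    if platform:
        cmd += ["--platform", platform]
    cmd.append(image_ref(tag))
    return cmd


# Most recent image, architecture chosen by Docker.
print(shlex.join(docker_pull_command()))
# Image for a specific short commit hash, forced to x86_64.
print(shlex.join(docker_pull_command("fbc7a69", platform="linux/amd64")))
```

Pinning a commit hash or version tag instead of latest keeps a pipeline reproducible when new images are pushed to main.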

Once pulled, you can create a container from this image and shell to it.

# create the container
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest

# this will drop you into a bash shell where the Docker image is running
docker exec -it unstructured bash

You can also build your own Docker image.

If you only plan on parsing one type of data, you can speed up building the image by commenting out some of the packages/requirements necessary for other data types. See the Dockerfile to find out which lines are necessary for your use case.

make docker-build

# this will drop you into a bash shell where the Docker image is running
make docker-start-bash

Once in the running container, you can try things directly in the Python interpreter's interactive mode.

# this will drop you into a python console so you can run the below partition functions
python3

>>> from unstructured.partition.pdf import partition_pdf
>>> elements = partition_pdf(filename="example-docs/layout-parser-paper-fast.pdf")

>>> from unstructured.partition.text import partition_text
>>> elements = partition_text(filename="example-docs/fake-text.txt")
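To see what the partition functions above return, a minimal sketch like the following can summarize the result. It assumes each returned element exposes `category` and `text` attributes; `count_by_category` and `demo` are hypothetical helper names, and `demo` has to be run from the repository root so that the example file path resolves.

```python
from collections import Counter


def count_by_category(elements) -> Counter:
    """Count elements by their `category` attribute (class name as a fallback)."""
    return Counter(getattr(el, "category", type(el).__name__) for el in elements)


def demo() -> None:
    # Imported here so the helper above can be used without the library installed.
    from unstructured.partition.text import partition_text

    elements = partition_text(filename="example-docs/fake-text.txt")

    # Overview of what kinds of elements were found.
    for category, count in count_by_category(elements).most_common():
        print(f"{category}: {count}")

    # Peek at the text of the first few elements.
    for el in elements[:3]:
        print(repr(el.text))
```

Calling `demo()` inside the container prints one line per element category followed by the first few text snippets.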

Installing the library

Use the following instructions to get up and running with unstructured and test your installation.

  • Install the Python SDK to support all document types with pip install "unstructured[all-docs]"
  • For plain text files, HTML, XML, JSON and emails, which do not require any extra dependencies, you can run pip install unstructured
  • To process other doc types, you can install the extras required for those documents, such as pip install "unstructured[docx,pptx]"
  • Install the following system dependencies if they are not already available on your system.

Depending on what document types you're parsing, you may not need all of these.

  • libmagic-dev (filetype detection)
  • poppler-utils (images and PDFs)
  • tesseract-ocr (images and PDFs, install tesseract-lang for additional language support)
  • libreoffice (MS Office docs)
  • pandoc (EPUBs, RTFs and Open Office docs). Please note that to handle RTF files, you need version 2.14.2 or newer. Running either make install-pandoc or ./scripts/install-pandoc.sh will install the correct version for you.
  • For suggestions on how to install on Windows and to learn about dependencies for other features, see the installation documentation…
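A quick way to check the system dependencies listed above is to look for their executables on the PATH. This is only a sketch: the executable names mapped to each package are assumptions about what those packages typically install, libmagic is left out because it is a shared library rather than a command, and the helper names are hypothetical. The 2.14.2 threshold for RTF support comes from the list above.

```python
import re
import shutil

# ASSUMPTION: executable typically provided by each package named in the list.
SYSTEM_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}

# From the README: RTF handling needs pandoc 2.14.2 or newer.
MIN_PANDOC_FOR_RTF = (2, 14, 2)


def missing_tools(tools: dict = SYSTEM_TOOLS) -> list[str]:
    """Return the package names whose executable is not found on PATH."""
    return [pkg for pkg, exe in tools.items() if shutil.which(exe) is None]


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted version number, e.g. 'pandoc 2.19.2' -> (2, 19, 2)."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def pandoc_handles_rtf(version_output: str) -> bool:
    """Check a `pandoc --version` first line against the RTF minimum."""
    return parse_version(version_output) >= MIN_PANDOC_FOR_RTF


print("Missing (by package name):", missing_tools())
```

Since not every document type needs every tool, a non-empty result is a hint about which formats may fail, not necessarily an error.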
