GOT-OCR Node Documentation

Overview

The got_ocr node is a component of ComfyUI LLM Party that performs Optical Character Recognition (OCR) on images. Using a transformer-based OCR model, it can either extract plain text from an image or return the extracted content in formatted form, which makes it well suited to adding OCR to language model workflows.

Functionality

The primary function of the got_ocr node is to process images to extract textual data using specified OCR models. It can handle various OCR operations, including multi-crop processing and OCR rendering, and supports different devices and data types for optimized performance.

Inputs

The got_ocr node accepts the following inputs:

Required Inputs

Model Name or Path (model_name_or_path): A string specifying the name or path of the OCR model to be used.
Device (device): A selection between auto, cuda, cpu, or mps, which determines the hardware device used for processing.
OCR Type (ocr_type): A selection between ocr and format, which defines the type of OCR operation to be performed.
Image (image): The image input that needs to be processed for OCR. It should be provided in a compatible format.
Enable (is_enable): A boolean (true/false) option to enable or disable the node’s function.

Optional Inputs

OCR Box (ocr_box): A string to specify the bounding box for OCR if needed.
OCR Color (ocr_color): A string to specify color settings for the OCR operation.
Multi Crop (multi_crop): A boolean (true/false) option that activates multi-crop processing if enabled.
Render (render): A boolean (true/false) option that enables rendering of the OCR results into an HTML file.
Output Directory Path (out_dir_path): A string specifying the path where output files will be stored. Defaults to a pre-configured output directory.
Data Type (dtype): A selection from float32, float16, bfloat16, int8, or int4, indicating the data type to be used during processing.
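
For readers who build or inspect custom nodes, the inputs listed above map naturally onto ComfyUI's INPUT_TYPES convention. The sketch below is illustrative only and is not the node's actual source: the class name GotOcrSketch, the method name run, the IMAGE socket type, and every default value are assumptions. Only the input names and the choice lists come from this document.

```python
class GotOcrSketch:
    """Illustrative stand-in for the got_ocr node's input declaration."""

    @classmethod
    def INPUT_TYPES(cls):
        # All default values below are assumptions, not taken from the node.
        return {
            "required": {
                "model_name_or_path": ("STRING", {"default": ""}),
                "device": (["auto", "cuda", "cpu", "mps"], {"default": "auto"}),
                "ocr_type": (["ocr", "format"], {"default": "ocr"}),
                "image": ("IMAGE",),  # assumed socket type
                "is_enable": ("BOOLEAN", {"default": True}),
            },
            "optional": {
                "ocr_box": ("STRING", {"default": ""}),
                "ocr_color": ("STRING", {"default": ""}),
                "multi_crop": ("BOOLEAN", {"default": False}),
                "render": ("BOOLEAN", {"default": False}),
                "out_dir_path": ("STRING", {"default": ""}),
                "dtype": (
                    ["float32", "float16", "bfloat16", "int8", "int4"],
                    {"default": "float32"},
                ),
            },
        }

    RETURN_TYPES = ("STRING",)  # a single string output
    RETURN_NAMES = ("text",)
    FUNCTION = "run"            # hypothetical entry-point name
    CATEGORY = "sketch"

    def run(self, model_name_or_path, device, ocr_type, image, is_enable, **optional):
        # Placeholder body: a real node would load the model and run OCR here.
        raise NotImplementedError("OCR logic is out of scope for this sketch")
```

Declaring the optional inputs in their own group means a workflow can leave them unconnected, and the node then falls back to whatever defaults it defines.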

Outputs

The got_ocr node generates the following output:

Text (text): A string that contains the text extracted from the provided image.

Usage in ComfyUI Workflows

The got_ocr node can be placed in any ComfyUI workflow that needs OCR, for example to extract text from images before it is passed to an LLM. Because the node accepts a range of devices, data types, and OCR options, it fits scenarios such as:

Extracting text from scanned documents or receipts.
Processing image content for data collection and analysis.
Integrating into complex LLM workflows that need multimodal data handling.
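
One way to picture the wiring is ComfyUI's API-format workflow, in which each node is keyed by an id and a link to another node is written as [source_node_id, output_index]. The sketch below builds such a fragment as a Python dict. The class_type string got_ocr is assumed to match the node name used in this document, and the model path and image file name are placeholders. A workflow exported from your own installation remains the authoritative reference.

```python
import json

# Hypothetical two-node fragment: load an image, then run OCR on it.
workflow = {
    "1": {
        "class_type": "LoadImage",  # ComfyUI's built-in image loader
        "inputs": {"image": "receipt.png"},  # placeholder file name
    },
    "2": {
        "class_type": "got_ocr",  # assumed to match the node name in this document
        "inputs": {
            "model_name_or_path": "path/to/ocr-model",  # placeholder
            "device": "auto",
            "ocr_type": "ocr",
            "image": ["1", 0],  # link: output 0 of node "1"
            "is_enable": True,
        },
    },
}

# A running ComfyUI server expects the workflow wrapped under a "prompt" key.
payload = json.dumps({"prompt": workflow}, indent=2)
print(payload)
```

The text output of node 2 could then be linked into a downstream LLM node in the same way, by referencing ["2", 0] in that node's inputs.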

Special Features

Device Flexibility: The node can choose a suitable device automatically or let users specify one manually, so it adapts to different hardware.
Data Type Configurations: Supports various data types for processing, which can be adjusted to balance computational efficiency and precision.
Rendering Capabilities: Offers the option to render OCR results into a navigable HTML file for ease of access and review.
Customizable Outputs: Users can define the output directory, ensuring that file management aligns with workflow requirements.
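
To illustrate the device option, the following is a minimal sketch of how an auto setting might be resolved to a concrete device. The function name resolve_device and the preference order (cuda, then mps, then cpu) are assumptions, not confirmed behaviour of the node. The availability flags are passed in as arguments so that the example runs without PyTorch installed.

```python
def resolve_device(requested, cuda_available, mps_available):
    """Map a device choice to a concrete device name.

    The preference order for "auto" (cuda, then mps, then cpu) is an
    assumption made for this sketch.
    """
    choices = ("auto", "cuda", "cpu", "mps")
    if requested not in choices:
        raise ValueError(f"device must be one of {choices}, got {requested!r}")
    if requested != "auto":
        return requested  # an explicit choice is passed through unchanged
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# In a PyTorch environment the two flags would typically come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
print(resolve_device("auto", cuda_available=False, mps_available=True))
```

Passing an explicit device skips the detection entirely, which is useful when the automatic choice is not the one you want, for example to force CPU execution on a machine whose GPU is busy.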

Considerations

Ensure that the model name or path provided is accessible and compatible with the node's functions.
If using multi-crop or render functionalities, verify that the input image is suitable for these operations to avoid performance issues.
For optimal performance, match the data type (dtype) and device configuration with the available hardware.
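
The trade-off between data type and hardware can be made concrete with rough arithmetic: the memory needed for model weights scales with the number of bytes per weight. The sketch below is an estimate under stated assumptions. The parameter count is a hypothetical round number rather than a property of any particular OCR model, and the figures ignore activations, quantization metadata, and framework overhead.

```python
# Bytes needed to store one weight in each dtype the node offers.
BYTES_PER_WEIGHT = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def estimate_weight_gib(num_params, dtype):
    """Approximate weight storage in GiB for a model with num_params weights."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 2**30


# Hypothetical model size chosen for round numbers (about 537 million weights).
example_params = 2**29

for name in BYTES_PER_WEIGHT:
    print(f"{name:>8}: {estimate_weight_gib(example_params, name):.2f} GiB")
```

Under these assumptions, moving from float32 to float16 or bfloat16 halves the weight footprint, and int8 and int4 reduce it further, at some cost in numerical precision.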

With these options and considerations in mind, the got_ocr node is a practical way to add text extraction from images to workflows in the ComfyUI environment.
