ComfyUI-Florence2 Custom Nodes Repository

Introduction

The ComfyUI-Florence2 repository integrates the Florence-2 vision foundation model into ComfyUI. Florence-2 uses a prompt-based approach to handle a range of vision and vision-language tasks, including captioning, object detection, and segmentation. It was trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images, to support multi-task learning. Its sequence-to-sequence architecture performs well in both zero-shot and fine-tuned settings, which makes it competitive among vision foundation models.

Purpose

This repository makes Florence-2 available as ComfyUI nodes so that users can run its tasks inside their workflows. A notable addition in this fork is Document Visual Question Answering (DocVQA), which lets users ask questions about document images and extract information from them.

Installation

To install the ComfyUI-Florence2 custom nodes, follow these steps:

Clone the repository into the ComfyUI/custom_nodes folder:

```
git clone https://github.com/kijai/ComfyUI-Florence2 ComfyUI/custom_nodes/ComfyUI-Florence2
```

Install the dependencies listed in requirements.txt, running the command from inside the cloned ComfyUI-Florence2 folder. transformers version 4.38.0 or higher is required:
```
pip install -r requirements.txt
```
For users of the portable version, run the following in the ComfyUI_windows_portable directory:
```
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-Florence2\requirements.txt
```
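To confirm that the installed transformers package meets the 4.38.0 minimum mentioned above, a small check like the following may help. It is a minimal sketch and not part of the repository: the helper names (`parse_release`, `meets_minimum`, `installed_transformers_version`) are hypothetical, and the comparison deliberately ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Minimum version stated in the installation notes above.
MINIMUM_TRANSFORMERS = "4.38.0"


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components of a version string.

    Parsing stops at the first component that does not start with a digit,
    so pre-release or dev suffixes are ignored (a simplification).
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MINIMUM_TRANSFORMERS) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed in this environment")
    elif meets_minimum(found):
        print(f"transformers {found} satisfies >= {MINIMUM_TRANSFORMERS}")
    else:
        print(f"transformers {found} is older than {MINIMUM_TRANSFORMERS}")
```

For the portable version, the script would need to be run with the embedded interpreter (python_embeded\python.exe) so that it inspects the same environment ComfyUI uses.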

Node Descriptions

This repository provides several nodes, each designed to integrate Florence-2 functionalities into ComfyUI:

DownloadAndLoadFlorence2Model: Facilitates the automatic download and loading of Florence-2 models into ComfyUI/models/LLM.
DownloadAndLoadFlorence2Lora: Provides the same download-and-load functionality for LoRA models.
Florence2ModelLoader: Loads Florence-2 models.
Florence2Run: Executes tasks using the loaded Florence-2 model.
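The download node described above places models under ComfyUI/models/LLM. The sketch below illustrates one way such a target folder could be derived from a model identifier and checked before downloading. It is an assumption-laden illustration, not the repository's code: the function names, the rule of using the last segment of the identifier as the folder name, and the example identifier `example-org/florence-2-checkpoint` are all hypothetical.

```python
from pathlib import Path


def local_model_dir(models_dir: str, repo_id: str) -> Path:
    """Map a model identifier to a folder under <models_dir>/LLM.

    Assumption: the folder is named after the last path segment of the
    identifier, e.g. 'org/name' -> '<models_dir>/LLM/name'.
    """
    model_name = repo_id.rstrip("/").rsplit("/", 1)[-1]
    return Path(models_dir) / "LLM" / model_name


def needs_download(models_dir: str, repo_id: str) -> bool:
    """Return True when the target folder does not exist yet."""
    return not local_model_dir(models_dir, repo_id).exists()


if __name__ == "__main__":
    # Hypothetical identifier, used only to show the path mapping.
    target = local_model_dir("ComfyUI/models", "example-org/florence-2-checkpoint")
    print(target)
    print("download needed:", needs_download("ComfyUI/models", "example-org/florence-2-checkpoint"))
```

Checking for the folder first means a model already present under ComfyUI/models/LLM is not fetched again.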

Special Features

Document Visual Question Answering (DocVQA)

The standout feature in this repository is its support for Document Visual Question Answering (DocVQA) using the Florence-2 model. This feature allows users to ask questions about document images, providing answers based on both visual and textual data. This is particularly advantageous for interpreting information from scanned documents, forms, receipts, and other text-heavy visuals.
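Because Florence-2 is prompt-based, a task is selected by a special token at the start of the text prompt, and tasks that take extra text (such as a question) append it after the token. The sketch below shows how a DocVQA prompt might be assembled under that convention. The `<CAPTION>`, `<OD>`, and `<OCR>` tokens follow the published Florence-2 model card; the `<DocVQA>` token, the dictionary keys, the exact concatenation (no separator), and the `build_prompt` helper are assumptions for illustration and may differ from what the nodes actually do.

```python
# Task name -> task token. The keys are hypothetical labels chosen for this sketch.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "docvqa": "<DocVQA>",  # assumed token for document question answering
}

# Tasks in this sketch that require extra text after the token.
TASKS_REQUIRING_TEXT = {"docvqa"}


def build_prompt(task: str, text_input: str = "") -> str:
    """Build the text prompt for a task: the task token plus optional text."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task!r}")
    text_input = text_input.strip()
    if task in TASKS_REQUIRING_TEXT and not text_input:
        raise ValueError(f"task {task!r} needs a question or other text input")
    return TASK_TOKENS[task] + text_input


if __name__ == "__main__":
    print(build_prompt("caption"))
    print(build_prompt("docvqa", "What is the invoice total?"))
```

The resulting string, together with the document image, is what would be passed to the model's processor; the generated text is then the answer to the question.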

Supported Models

The repository supports a range of Florence-2 models, which can be automatically downloaded. Some of these models include:

Official:
Tested Finetunes:

Usage in ComfyUI Workflows

ComfyUI-Florence2 can significantly enhance your workflows by:

Automating Model Download: Simplifies the process of obtaining and integrating Florence-2 models.
Executing Vision Tasks: Provides powerful nodes to perform complex vision-related tasks seamlessly.
Document Interaction: With DocVQA, users can easily query and extract specific information from document images, offering a robust tool for handling text-heavy visual data.

Together, these nodes let users run Florence-2 tasks such as captioning, object detection, segmentation, and document question answering directly inside ComfyUI workflows.
