Florence-2 Operator

Overview

The Florence-2 LOP provides an interface to Microsoft’s Florence-2 vision foundation model. It supports a range of vision tasks, including detailed image captioning, object detection (region proposal), optical character recognition (OCR), and referring expression segmentation. The operator requires the separate SideCar server to be running: model loading and inference both happen inside the SideCar process and use its resources, which may include a dedicated GPU.


Requirements

SideCar Server: The SideCar server application must be running. See the SideCar Guide for setup instructions.
SideCar Dependencies: The Python environment used by the SideCar server needs the following packages installed:
- torch>=2.1.1 (CUDA version recommended)
- transformers
- timm
- einops
ChatTD Operator: Required for asynchronous communication with the SideCar server and logging. Ensure the ChatTD Operator parameter on the ‘About’ page points to your configured ChatTD instance.
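
One way to confirm the package requirements above is to run a short check with the same Python interpreter that SideCar uses. The following is a minimal sketch, not part of the operator: the helper names (`check_sidecar_packages`, `torch_is_recent_enough`) are hypothetical, and it only reports installed versions without verifying CUDA support.

```python
import re
from importlib import metadata

# Package names taken from the Requirements list above.
REQUIRED = ["torch", "transformers", "timm", "einops"]
MIN_TORCH = (2, 1, 1)


def parse_version(text):
    """Turn a version string such as '2.1.1+cu121' into a tuple like (2, 1, 1)."""
    numbers = []
    for piece in text.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def check_sidecar_packages():
    """Return {package name: installed version string, or None if missing}."""
    report = {}
    for name in REQUIRED:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


def torch_is_recent_enough(report):
    """True if torch is installed and its version is at least 2.1.1."""
    version = report.get("torch")
    return version is not None and parse_version(version) >= MIN_TORCH


if __name__ == "__main__":
    report = check_sidecar_packages()
    for name, version in report.items():
        print(f"{name}: {version or 'MISSING'}")
    print("torch >= 2.1.1:", torch_is_recent_enough(report))
```

Running this inside TouchDesigner's own Python would check the wrong environment; it is meant to be run with the interpreter that launches SideCar.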

Input/Output

Inputs

Input TOP (in_top): Connect the image (TOP) you want to process here.

Outputs

Output Text DAT (output_dat): Contains the primary text result from the selected Florence-2 task (e.g., the generated caption, the OCR text).
Conversation DAT (conversation_dat): Stores the latest interaction, typically including the input prompt (if used) and the assistant’s (Florence-2’s) response.
History DAT (history_dat): Appends a log entry for each successful processing task, storing the role, message, model used, and timestamp.

Parameters

Page: Florence2

Load Model (Load) op('florence').par.Load Pulse

Default: None

Process Image (Process) op('florence').par.Process Pulse

Default: None

Reset (Reset) op('florence').par.Reset Pulse

Default: None

Active (Active) op('florence').par.Active Toggle

Default: Off

Status (Status) op('florence').par.Status String

Default: None

Florence Model (Florencemodel) op('florence').par.Florencemodel Menu

Default: microsoft/Florence-2-base
Options: microsoft/Florence-2-base, microsoft/Florence-2-base-ft, microsoft/Florence-2-large, microsoft/Florence-2-large-ft, HuggingFaceM4/Florence-2-DocVQA, thwri/CogFlorence-2.1-Large, thwri/CogFlorence-2.2-Large, gokaygokay/Florence-2-SD3-Captioner, gokaygokay/Florence-2-Flux-Large, MiaoshouAI/Florence-2-base-PromptGen-v1.5, MiaoshouAI/Florence-2-large-PromptGen-v1.5, MiaoshouAI/Florence-2-base-PromptGen-v2.0, MiaoshouAI/Florence-2-large-PromptGen-v2.0

Input Prompt (Prompt) op('florence').par.Prompt String

Default: None

Max Tokens (Maxtokens) op('florence').par.Maxtokens Int

Default: 512
Range: 1 to N/A
Slider Range: 64 to 4096

Num Beams (Numbeams) op('florence').par.Numbeams Int

Default: 3
Range: 1 to N/A
Slider Range: 1 to 10

Do Sample (Dosample) op('florence').par.Dosample Toggle

Default: On

Random Seed (Seed) op('florence').par.Seed Int

Default: 42
Range: -1 to N/A
Slider Range: 0 to 10000

Fill Region Masks (Fillmask) op('florence').par.Fillmask Toggle

Default: On

Mask Selection (Maskselect) op('florence').par.Maskselect Str

Default: None

Page: About

Bypass (Bypass) op('florence').par.Bypass Toggle

Default: Off

Show Built-in Parameters (Showbuiltin) op('florence').par.Showbuiltin Toggle

Default: Off

Version (Version) op('florence').par.Version String

Default: 1.0.0

Last Updated (Lastupdated) op('florence').par.Lastupdated String

Default: 2024-11-09

Creator (Creator) op('florence').par.Creator String

Default: dotsimulate

Website (Website) op('florence').par.Website String

Default: https://dotsimulate.com

ChatTD Operator (Chattd) op('florence').par.Chattd OP

Default: /dot_lops/ChatTD

Usage Examples

Image Captioning

1. Ensure the SideCar server is running.
2. Connect an image TOP to the Florence-2 input.
3. Select a desired model (e.g., 'microsoft/Florence-2-large') from the `Florence Model` menu.
4. Pulse the `Load Model` parameter and wait for the `Status` parameter to indicate that the model is ready (the first load can take a while).
5. Set the `Task` parameter to 'more_detailed_caption'.
6. Pulse the `Process Image` parameter.
7. Monitor the `Status` parameter. The generated caption will appear in the `output_dat` DAT.
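
The steps above can also be driven from a script. The sketch below is one possible way to do it, under stated assumptions: the parameter names `Florencemodel`, `Load`, `Prompt`, and `Process` come from the Parameters section, but the scripting name of the `Task` parameter is not listed there, so `Task` is a guess. The helper names `load_model` and `run_task` are hypothetical.

```python
def load_model(florence, model_name="microsoft/Florence-2-large"):
    """Select a model and ask SideCar to load it (returns immediately)."""
    florence.par.Florencemodel = model_name
    florence.par.Load.pulse()


def run_task(florence, task, prompt=None):
    """Set the task (and optionally the prompt), then trigger processing."""
    # ASSUMPTION: the Task parameter's scripting name is 'Task'.
    if getattr(florence.par, "Task", None) is None:
        raise AttributeError("No 'Task' parameter found; check its scripting name.")
    florence.par.Task = task
    if prompt is not None:
        florence.par.Prompt = prompt
    florence.par.Process.pulse()
```

In TouchDesigner this might be used as `florence = op('florence')`, then `load_model(florence)`, and later `run_task(florence, 'more_detailed_caption')`. Because loading is asynchronous, `run_task` should only be called once `florence.par.Status` shows that the model is ready, not immediately after `load_model`.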

Optical Character Recognition (OCR)

1. Ensure SideCar is running and the desired model is loaded (pulse `Load Model`).
2. Connect an image TOP containing text to the input.
3. Set the `Task` parameter to 'ocr'.
4. Pulse `Process Image`.
5. The extracted text will appear in the `output_dat` DAT.

Object Detection (Region Proposal)

1. Ensure SideCar is running and the model is loaded.
2. Connect an image TOP.
3. Set the `Task` parameter to 'region_proposal'.
4. Pulse `Process Image`.
5. The results (bounding boxes and labels) will appear in the `output_dat` DAT (often as structured text or JSON). Visualizations may appear in the node viewer depending on internal settings.
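
If the result arrives as JSON, it can be turned into a list of boxes for further use. The sketch below assumes a layout with parallel `bboxes` (each `[x1, y1, x2, y2]`) and `labels` lists, optionally nested under a single task key such as `<REGION_PROPOSAL>`. That layout is an assumption about this operator's output, so inspect the actual contents of `output_dat` before relying on it. The function name `parse_regions` is hypothetical.

```python
import json


def parse_regions(text):
    """Parse region-proposal JSON text into a list of box dictionaries."""
    data = json.loads(text)
    # ASSUMPTION: results may be nested under one task key; unwrap it if so.
    if isinstance(data, dict) and "bboxes" not in data and len(data) == 1:
        data = next(iter(data.values()))
    bboxes = data.get("bboxes", [])
    labels = list(data.get("labels", []))
    # Pad with empty labels so every box gets an entry.
    labels += [""] * (len(bboxes) - len(labels))
    regions = []
    for (x1, y1, x2, y2), label in zip(bboxes, labels):
        regions.append({
            "label": label,
            "x1": float(x1), "y1": float(y1),
            "x2": float(x2), "y2": float(y2),
            "width": float(x2) - float(x1),
            "height": float(y2) - float(y1),
        })
    return regions
```

In TouchDesigner, the input text would typically come from the DAT's `.text` member.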

Technical Notes

SideCar Dependency: This operator cannot function without the SideCar server. All model loading and inference occur in the SideCar process.
Resource Intensive: Florence-2 models, especially the larger variants, require significant computational resources, primarily GPU VRAM. Ensure the machine running SideCar meets the requirements for the selected model.
Asynchronous Operation: Communication with the SideCar server (loading models, processing images) is handled asynchronously via ChatTD’s TDAsyncIO, preventing TouchDesigner from freezing.
Task-Specific Prompts: Some tasks like docvqa or referring_expression_segmentation require an appropriate Input Prompt to function correctly.
Precision & Attention: The Precision and Attention Mechanism parameters affect performance and resource usage on the SideCar server. Using fp16 or bf16 precision and flash_attention_2 (if installed and supported) can speed up inference significantly.
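
The Task-Specific Prompts note can be enforced in a script before processing is triggered. This is a minimal sketch: the set below contains only the two tasks named in that note, other tasks may also need a prompt, and `missing_prompt` is a hypothetical helper rather than part of the operator.

```python
# Tasks this document names as needing an Input Prompt; the list may be incomplete.
PROMPT_REQUIRED_TASKS = {"docvqa", "referring_expression_segmentation"}


def missing_prompt(task, prompt):
    """Return True if the task needs a prompt and the given prompt is empty."""
    return task in PROMPT_REQUIRED_TASKS and not (prompt or "").strip()
```

A script could call `missing_prompt(task, prompt)` and skip pulsing `Process Image` when it returns `True`.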

Related Operators

SideCar: The backend service required for this operator to function.
ChatTD: Provides core services like asynchronous task execution and logging.
OCR Operator: Another operator focused specifically on OCR, potentially using different backends (like EasyOCR or PaddleOCR via SideCar).