
Florence-2 Operator

The Florence-2 LOP provides an interface to Microsoft’s Florence-2 vision foundation model. It supports a wide range of vision tasks, including detailed image captioning, object detection (region proposal), OCR (Optical Character Recognition), and referring expression segmentation. The operator does not run the model itself: model loading and inference happen inside the separate SideCar server process, using that process’s resources (including a dedicated GPU, if one is available), so SideCar must be running.


Requirements

  • SideCar Server: The SideCar server application must be running. See the SideCar Guide for setup instructions.
  • SideCar Dependencies: The Python environment used by the SideCar server needs the following packages installed:
    • torch>=2.1.1 (a CUDA-enabled build is recommended)
    • transformers
    • timm
    • einops
  • ChatTD Operator: Required for asynchronous communication with the SideCar server and logging. Ensure the ChatTD Operator parameter on the ‘About’ page points to your configured ChatTD instance.
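The SideCar dependency list above can be checked with a short script. This is a minimal sketch, not part of the operator: run it with the same Python interpreter that starts SideCar. It relies only on the standard library `importlib.metadata`, treats torch 2.1.1 as the only minimum version (as listed above), and reports whether CUDA is visible if torch imports.

```python
"""Check the SideCar Python environment for the Florence-2 dependencies.

Sketch only: run it with the interpreter that launches the SideCar server.
"""
import importlib.metadata
import re

# Packages from the requirements list; only torch has a minimum version.
REQUIREMENTS = {"torch": (2, 1, 1), "transformers": None, "timm": None, "einops": None}


def parse_version(text):
    """Return the leading numeric release tuple, e.g. '2.3.0+cu121' -> (2, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def check_requirements(requirements=REQUIREMENTS):
    """Map each package name to (installed_version or None, meets_requirement)."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        ok = minimum is None or parse_version(installed) >= minimum
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_requirements().items():
        print(f"{name:<12} {installed or 'missing':<16} {'OK' if ok else 'NEEDS ATTENTION'}")
    try:
        import torch  # only imported for the CUDA check
        print("CUDA available:", torch.cuda.is_available())
    except ImportError:
        print("CUDA available: unknown (torch not importable)")
```

Local version suffixes such as `+cu121` are ignored by the comparison, so a CUDA build of torch 2.3.0 passes the `>=2.1.1` check.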
Input/Output

  • Input TOP (in_top): Connect the image TOP you want to process.
  • Output Text DAT (output_dat): Contains the primary text result from the selected Florence-2 task (e.g., the generated caption, the OCR text).
  • Conversation DAT (conversation_dat): Stores the latest interaction, typically including the input prompt (if used) and the assistant’s (Florence-2’s) response.
  • History DAT (history_dat): Appends a log entry for each successful processing task, storing the role, message, model used, and timestamp.
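Because the History DAT gains one row per processed task, a script can pull recent results back out of it. The sketch below is hypothetical: `recent_messages` is not an operator API, and it assumes a header row, that the first four columns are role, message, model and timestamp in the order listed above, and that Florence-2 responses use the role `assistant`. Check your History DAT before relying on any of these.

```python
# Hypothetical helper for reading the History DAT.
# Assumptions (not documented by the operator): the first row is a header, and
# the first four columns are role, message, model, timestamp in that order.

def recent_messages(rows, role="assistant", model=None, limit=5):
    """Return up to `limit` messages for `role` (and `model`, if given), newest first.

    `rows` is a list of rows, each a list of cell strings. New entries are
    appended at the bottom, so the last matching rows are the most recent.
    """
    matches = []
    for row in rows[1:]:  # skip the assumed header row
        if len(row) < 4:
            continue  # ignore malformed rows
        row_role, message, row_model = row[0], row[1], row[2]
        if row_role == role and (model is None or row_model == model):
            matches.append(message)
    return list(reversed(matches))[:limit]
```

Inside TouchDesigner, a DAT `dat` can be turned into the expected list of strings with `[[cell.val for cell in row] for row in dat.rows()]`; the exact path of the History DAT inside the operator is not given on this page.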
Parameters

  • Load Model (Load): Pulse, `op('florence').par.Load`. Default: None
  • Process Image (Process): Pulse, `op('florence').par.Process`. Default: None
  • Reset (Reset): Pulse, `op('florence').par.Reset`. Default: None
  • Active (Active): Toggle, `op('florence').par.Active`. Default: Off
  • Status (Status): String, `op('florence').par.Status`. Default: None
  • Florence Model (Florencemodel): Menu, `op('florence').par.Florencemodel`. Default: microsoft/Florence-2-base. Options: microsoft/Florence-2-base, microsoft/Florence-2-base-ft, microsoft/Florence-2-large, microsoft/Florence-2-large-ft, HuggingFaceM4/Florence-2-DocVQA, thwri/CogFlorence-2.1-Large, thwri/CogFlorence-2.2-Large, gokaygokay/Florence-2-SD3-Captioner, gokaygokay/Florence-2-Flux-Large, MiaoshouAI/Florence-2-base-PromptGen-v1.5, MiaoshouAI/Florence-2-large-PromptGen-v1.5, MiaoshouAI/Florence-2-base-PromptGen-v2.0, MiaoshouAI/Florence-2-large-PromptGen-v2.0
  • Precision (Precision): Menu, `op('florence').par.Precision`. Default: fp16. Options: fp16, bf16, fp32
  • Attention Mechanism (Attention): Menu, `op('florence').par.Attention`. Default: sdpa. Options: sdpa, flash_attention_2, eager
  • Task (Task): Menu, `op('florence').par.Task`. Default: more_detailed_caption. Options: caption, region_caption, dense_region_caption, region_proposal, detailed_caption, more_detailed_caption, caption_to_phrase_grounding, referring_expression_segmentation, ocr, ocr_with_region, docvqa, prompt_gen_tags, prompt_gen_mixed_caption, prompt_gen_analyze
  • Input Prompt (Prompt): String, `op('florence').par.Prompt`. Default: None
  • Max Tokens (Maxtokens): Int, `op('florence').par.Maxtokens`. Default: 512. Range: 1 and up (no maximum). Slider range: 64 to 4096
  • Num Beams (Numbeams): Int, `op('florence').par.Numbeams`. Default: 3. Range: 1 and up (no maximum). Slider range: 1 to 10
  • Do Sample (Dosample): Toggle, `op('florence').par.Dosample`. Default: On
  • Random Seed (Seed): Int, `op('florence').par.Seed`. Default: 42. Range: -1 and up (no maximum). Slider range: 0 to 10000
  • Fill Region Masks (Fillmask): Toggle, `op('florence').par.Fillmask`. Default: On
  • Mask Selection (Maskselect): String, `op('florence').par.Maskselect`. Default: None
  • Bypass (Bypass): Toggle, `op('florence').par.Bypass`. Default: Off
  • Show Built-in Parameters (Showbuiltin): Toggle, `op('florence').par.Showbuiltin`. Default: Off
  • Version (Version): String, `op('florence').par.Version`. Default: 1.0.0
  • Last Updated (Lastupdated): String, `op('florence').par.Lastupdated`. Default: 2024-11-09
  • Creator (Creator): String, `op('florence').par.Creator`. Default: dotsimulate
  • Website (Website): String, `op('florence').par.Website`. Default: https://dotsimulate.com
  • ChatTD Operator (Chattd): OP, `op('florence').par.Chattd`. Default: /dot_lops/ChatTD
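The menu parameters above can also be set from a Python script. The following sketch validates every value against the documented menu options before assigning anything, so a typo does not leave the operator half-configured. `configure`, `MENU_OPTIONS` and the `PROMPT_TASKS` rule are hypothetical; the rule is deliberately stricter than the operator and only mirrors the technical note that docvqa and referring_expression_segmentation need an Input Prompt.

```python
# Hypothetical helper for TouchDesigner scripts; `florence` is the operator,
# e.g. op('florence'). Option lists are copied from the parameter reference.
MENU_OPTIONS = {
    "Precision": ("fp16", "bf16", "fp32"),
    "Attention": ("sdpa", "flash_attention_2", "eager"),
    "Task": (
        "caption", "region_caption", "dense_region_caption", "region_proposal",
        "detailed_caption", "more_detailed_caption", "caption_to_phrase_grounding",
        "referring_expression_segmentation", "ocr", "ocr_with_region", "docvqa",
        "prompt_gen_tags", "prompt_gen_mixed_caption", "prompt_gen_analyze",
    ),
}

# Assumed rule: these tasks must be given a non-empty Prompt in the same call.
PROMPT_TASKS = {"docvqa", "referring_expression_segmentation"}


def configure(florence, **values):
    """Validate every value first, then assign them by parameter name."""
    for name, value in values.items():
        options = MENU_OPTIONS.get(name)
        if options is not None and value not in options:
            raise ValueError(f"{name}={value!r} is not one of {options}")
    if values.get("Task") in PROMPT_TASKS and not values.get("Prompt"):
        raise ValueError(f"Task {values['Task']!r} needs a non-empty Prompt")
    for name, value in values.items():
        # Same effect as writing florence.par.Task = 'ocr' in a script.
        setattr(florence.par, name, value)
```

For example, `configure(op('florence'), Task='docvqa', Prompt='What is the invoice total?', Precision='bf16')` sets all three parameters, while `configure(op('florence'), Task='docvqa')` raises before changing anything.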
Usage Examples

Generating a Detailed Caption

1. Ensure the SideCar server is running.
2. Connect an image TOP to the Florence-2 input.
3. Select a model (e.g., 'microsoft/Florence-2-large') from the `Florence Model` menu.
4. Pulse the `Load Model` parameter and wait until `Status` indicates that the model is ready (the first load may take a while).
5. Set the `Task` parameter to 'more_detailed_caption'.
6. Pulse the `Process Image` parameter.
7. Monitor the `Status` parameter. The generated caption will appear in the `output_dat` DAT.
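The caption steps above can also be scripted with the parameters from the reference list. Because loading happens asynchronously on SideCar, this sketch keeps loading and processing as two separate calls rather than guessing how long a load takes. `load_model` and `run_task` are hypothetical helper names, not part of the operator.

```python
# Hypothetical helpers for a TouchDesigner Text DAT; `florence` is the operator,
# e.g. op('florence'). Parameter names come from the parameter reference.

def load_model(florence, model="microsoft/Florence-2-large"):
    """Select a model and start loading it on SideCar (returns immediately)."""
    florence.par.Florencemodel = model
    florence.par.Load.pulse()


def run_task(florence, task="more_detailed_caption", prompt=None):
    """Set the task (and an optional prompt), then request processing.

    The result is written to output_dat later, once SideCar responds.
    """
    florence.par.Task = task
    if prompt is not None:  # leave an existing Input Prompt untouched by default
        florence.par.Prompt = prompt
    florence.par.Process.pulse()


# Usage inside TouchDesigner:
# florence = op('florence')
# load_model(florence)
# ...after Status reports that the model is ready:
# run_task(florence)          # detailed caption
# run_task(florence, 'ocr')   # OCR on the same input
```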
Performing OCR

1. Ensure SideCar is running and the desired model is loaded (pulse `Load Model`).
2. Connect an image TOP containing text to the input.
3. Set the `Task` parameter to 'ocr'.
4. Pulse `Process Image`.
5. The extracted text will appear in the `output_dat` DAT.
Region Proposal (Object Detection)

1. Ensure SideCar is running and the model is loaded.
2. Connect an image TOP.
3. Set the `Task` parameter to 'region_proposal'.
4. Pulse `Process Image`.
5. The results (bounding boxes and any labels) appear in the `output_dat` DAT as structured text, often JSON. Depending on internal settings, a visualization may also appear in the operator’s viewer.
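Using the region-proposal results in a network means converting them to coordinates TouchDesigner understands, and this page does not pin down the exact output format. The sketch below assumes `output_dat` holds JSON shaped like Florence-2's post-processed output in Hugging Face transformers, `{"<REGION_PROPOSAL>": {"bboxes": [[x1, y1, x2, y2]], "labels": [""]}}`, with pixel coordinates and the origin at the image's top-left. `parse_regions` is a hypothetical helper; adjust the parsing if your output differs.

```python
import json


def parse_regions(text, width, height, flip_y=True):
    """Parse region-proposal JSON into (label, u0, v0, u1, v1) tuples in 0..1.

    Assumes Florence-2 post-processed output: pixel boxes [x1, y1, x2, y2] with
    the origin at the image's top-left, optionally wrapped in {"<TASK>": {...}}.
    """
    data = json.loads(text)
    if len(data) == 1:
        (inner,) = data.values()
        if isinstance(inner, dict):
            data = inner  # unwrap {"<REGION_PROPOSAL>": {...}}
    boxes = data.get("bboxes", [])
    labels = list(data.get("labels") or [])
    labels += [""] * (len(boxes) - len(labels))  # pad missing labels with ""
    regions = []
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        u0, u1 = x1 / width, x2 / width
        if flip_y:
            # Image rows grow downward; TouchDesigner's v coordinate grows upward.
            v0, v1 = 1 - y2 / height, 1 - y1 / height
        else:
            v0, v1 = y1 / height, y2 / height
        regions.append((label, u0, v0, u1, v1))
    return regions
```

Inside TouchDesigner, `text` could come from the output DAT's `.text` and `width`/`height` from the input TOP's `.width` and `.height`, so the boxes line up with the image regardless of its resolution.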
Technical Notes

  • SideCar Dependency: This operator is critically dependent on the SideCar server. All model loading and inference occur in the SideCar process.
  • Resource Intensive: Florence-2 models, especially the larger variants, require significant computational resources, above all GPU VRAM. Make sure the machine running SideCar has enough GPU memory for the selected model and precision.
  • Asynchronous Operation: Communication with the SideCar server (loading models, processing images) is handled asynchronously via ChatTD’s TDAsyncIO, preventing TouchDesigner from freezing.
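  • Task-Specific Prompts: Some tasks, such as docvqa and referring_expression_segmentation, only work with an appropriate Input Prompt (for example, the question to answer for docvqa, or a description of the object to segment for referring_expression_segmentation).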
  • Precision & Attention: The Precision and Attention Mechanism parameters affect speed and memory use on the SideCar server. fp16 and bf16 store the weights in half the memory of fp32, and together with flash_attention_2 (if installed and supported by the GPU) they can offer significant speedups.
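The precision trade-off in the notes above is easy to quantify for the weights alone: parameter count times bytes per parameter. The sketch uses the approximate sizes from Microsoft's Florence-2 model cards (about 0.23B parameters for Florence-2-base and 0.77B for Florence-2-large); activations and framework overhead come on top. It also shows one hypothetical way to fall back from flash_attention_2 to sdpa when the `flash_attn` package is not installed. That fallback only checks installation, not GPU support, and is not necessarily what SideCar does.

```python
import importlib.util

# Bytes per parameter for each Precision menu option.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp32": 4}

# Approximate parameter counts from Microsoft's Florence-2 model cards.
PARAMS = {"base": 0.23e9, "large": 0.77e9}


def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in GB (excludes activations and overhead)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def resolve_attention(requested):
    """Hypothetical fallback: use sdpa if flash_attention_2 is not installed.

    Only checks that the flash_attn package is importable, not GPU support.
    """
    if requested == "flash_attention_2" and importlib.util.find_spec("flash_attn") is None:
        return "sdpa"
    return requested


for size, count in PARAMS.items():
    for precision in ("fp32", "fp16"):
        print(f"{size:<5} {precision}: {weight_memory_gb(count, precision):.2f} GB")
```

For Florence-2-large this works out to roughly 3.08 GB of weights in fp32 and 1.54 GB in fp16 or bf16; Florence-2-base needs about 0.92 GB and 0.46 GB respectively.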
Related Operators

  • SideCar: The backend service required for this operator to function.
  • ChatTD: Provides core services like asynchronous task execution and logging.
  • OCR Operator: A dedicated OCR operator, which may use other backends (such as EasyOCR or PaddleOCR via SideCar).