Source Crawl4ai Operator
Overview

The Source Crawl4ai LOP uses the crawl4ai Python library to fetch content from web pages, sitemaps, or lists of URLs. It renders pages with headless browsers (via Playwright), extracts the main content, converts it to Markdown, and structures the output into a DAT table compatible with the Rag Index operator. It supports multiple crawl modes, URL filtering, and resource-management features such as concurrency limits and adaptive memory-usage control.
Requirements

- Python Packages: `crawl4ai` (the core crawling library), `playwright` (required by `crawl4ai` for browser automation), and `requests` (implicitly needed for sitemap fetching). These can be installed via the ChatTD operator's Python manager by installing `crawl4ai` first.
- Playwright Browsers: After installing the Python packages, the necessary browser binaries must be downloaded by pulsing the `Install/Update Playwright Browsers` parameter on this operator.
- ChatTD Operator: Required for dependency management (package installation) and asynchronous task execution. Ensure the `ChatTD Operator` parameter on the 'About' page points to your configured ChatTD instance.
Input/Output

Inputs

None

Outputs

- Output Table (DAT): The primary output, containing the crawled content. Columns match the requirements of the Rag Index operator:
  - `doc_id`: Unique ID for the crawled page/chunk.
  - `filename`: Source URL of the crawled page.
  - `content`: Crawled content formatted as Markdown.
  - `metadata`: JSON string containing source URL, timestamp, content length, etc.
  - `source_path`: Source URL (duplicate of `filename`).
  - `timestamp`: Unix timestamp of when the content was processed.
- Internal DATs: `index_table` (summary view) and `content_table` (detailed view of the selected doc), accessible via the operator viewer.
- Status/Log: Information is logged via the linked Logger component within ChatTD. Key status info is also reflected in the `Status`, `Progress`, and `URLs Processed` parameters.
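The column layout above can be sketched in plain Python. This is an illustrative sketch, not the operator's internal code: the `make_output_row` helper and the exact keys inside the metadata JSON are assumptions.

```python
import hashlib
import json
import time

def make_output_row(url: str, markdown: str) -> dict:
    """Build one output-table row in the column layout described above (illustrative)."""
    ts = int(time.time())
    return {
        "doc_id": hashlib.sha1(url.encode("utf-8")).hexdigest()[:16],  # stable per-URL ID
        "filename": url,
        "content": markdown,
        "metadata": json.dumps({          # JSON string, as the metadata column expects
            "source_url": url,
            "timestamp": ts,
            "content_length": len(markdown),
        }),
        "source_path": url,               # duplicate of filename, per the column list above
        "timestamp": ts,
    }

row = make_output_row("https://example.com/page", "# Title\n\nBody text.")
print(sorted(row.keys()))
```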
Parameters

All parameters are scriptable as `op('source_crawl4ai').par.<Name>`, e.g. `op('source_crawl4ai').par.Url`.

Page: Crawl Config

- Target URL / Sitemap / .txt (`Url`): Str. Default: None.
- Max Depth (Recursive) (`Maxdepth`): Int. Default: 2. Range: 1 to N/A. Slider range: 1 to 10.
- Include URL Patterns (`Includepatterns`): Str. Default: None.
- Exclude URL Patterns (`Excludepatterns`): Str. Default: None.
- Start Crawl (`Startcrawl`): Pulse.
- Stop Crawl (`Stopcrawl`): Pulse.
- Current Status (`Status`): Str. Default: `/DOC_GENERATOR/NETWORK_VIEWER/source_crawl4ai initialized`.
- Progress (`Progress`): Float. Default: 0.
- URLs Processed (`Urlsprocessed`): Int. Default: 0.
- Max Concurrent Sessions (`Maxconcurrent`): Int. Default: 5. Range: 1 to N/A. Slider range: 1 to 20.
- Memory Threshold (%) (`Memorythreshold`): Float. Default: 70. Range: 30 to 95. Slider range: 50 to 90.
- Install/Update Playwright Browsers (`Installplaywright`): Pulse.
- Clear Table Data (`Clearoutput`): Pulse.

Caution: Exposing the viewer of large index tables can be heavy.

- Select Doc (`Selectdoc`): Int. Default: 1. Range: 1 to N/A.
- Display File (`Displayfile`): Str. Default: None.
Page: About

- Bypass (`Bypass`): Toggle. Default: 0. Options: off, on.
- Show Built-in Parameters (`Showbuiltin`): Toggle. Default: 0. Options: off, on.
- Version (`Version`): Str. Default: 1.0.0.
- Last Updated (`Lastupdated`): Str. Default: 2025-04-30.
- Creator (`Creator`): Str. Default: dotsimulate.
- Website (`Website`): Str. Default: https://dotsimulate.com.
- ChatTD Operator (`Chattd`): OP. Default: /dot_lops/ChatTD.
- Clear Log (`Clearlog`): Pulse.
- Convert To Text (`Converttotext`): Toggle. Default: 0. Options: off, on.
Usage Examples

Crawling a Single Page

1. Set 'Target URL / Sitemap / .txt' to the full URL (e.g., https://docs.derivative.ca/Introduction_to_Python).
2. Set 'Crawl Mode' to 'Single Page'.
3. Pulse 'Start Crawl'.
4. Monitor 'Status' and view results in the Output Table DAT.
Crawling from a Sitemap

1. Set 'Target URL / Sitemap / .txt' to the exact URL of the sitemap (e.g., https://example.com/sitemap.xml).
2. Set 'Crawl Mode' to 'Sitemap Batch'.
3. (Optional) Set 'Include/Exclude URL Patterns' to filter URLs from the sitemap.
4. Adjust 'Max Concurrent Sessions' based on your system.
5. Pulse 'Start Crawl'.
6. Monitor 'Status' and 'Progress'.
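Under the hood, the sitemap step boils down to extracting `<loc>` entries from standard sitemap XML. A minimal stand-alone sketch of that idea (not the operator's internals), using only the Python standard library:

```python
import xml.etree.ElementTree as ET

# A small sample sitemap in the standard sitemaps.org format.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> entry from a standard sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

print(sitemap_urls(SITEMAP_XML))
```

In practice the XML would be fetched with `requests.get(...)` first; the sample string here stands in for that response body.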
Recursive Crawl of a Small Site Section

1. Set 'Target URL / Sitemap / .txt' to the starting page (e.g., https://yoursite.com/documentation/).
2. Set 'Crawl Mode' to 'Crawl Site Links'.
3. Set 'Max Depth' (e.g., 3). Be cautious with high values on large sites.
4. (Optional) Set 'Exclude URL Patterns' to avoid specific sections (e.g., */blog* */forum*).
5. Adjust 'Max Concurrent Sessions'.
6. Pulse 'Start Crawl'.
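The include/exclude patterns behave like shell-style wildcards. The operator's exact matching semantics are not documented here; the sketch below shows one plausible interpretation using Python's `fnmatch`, and the `url_allowed` helper is a hypothetical name, not part of the operator's API.

```python
from fnmatch import fnmatch

def url_allowed(url, include_patterns=(), exclude_patterns=()):
    """Keep a URL only if it matches no exclude pattern and, when include
    patterns are given, matches at least one of them."""
    if any(fnmatch(url, p) for p in exclude_patterns):
        return False
    if include_patterns and not any(fnmatch(url, p) for p in include_patterns):
        return False
    return True

exclude = ["*/blog*", "*/forum*"]  # same patterns as the example above
urls = [
    "https://yoursite.com/documentation/setup",
    "https://yoursite.com/blog/2024/news",
    "https://yoursite.com/forum/thread-42",
]
kept = [u for u in urls if url_allowed(u, exclude_patterns=exclude)]
print(kept)  # only the documentation URL survives
```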
Initial Setup (Installation)

1. Ensure the 'ChatTD Operator' parameter points to your ChatTD instance.
2. Use ChatTD's Python Manager to install the 'crawl4ai' package.
3. Return to this operator and pulse 'Install/Update Playwright Browsers'.
4. Monitor the Textport for download progress. Installation is complete when the logs indicate success.
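For reference, the equivalent manual steps in a shell pointed at the same Python environment TouchDesigner uses would look roughly like this; this is a sketch, and the exact Python interpreter path and browser choice depend on your setup:

```shell
# Install the crawling library (pulls in playwright as a dependency)
python -m pip install crawl4ai

# Download the browser binaries Playwright needs
python -m playwright install chromium
```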
Technical Notes

- Dependencies: Requires the `crawl4ai` and `playwright` Python packages, installable via ChatTD. Crucially, Playwright also needs browser binaries, downloaded by pulsing the `Install/Update Playwright Browsers` parameter.
- Resource Usage: Crawling, especially in batch modes (Sitemap, Recursive, Text File), uses headless browsers and can consume significant CPU, RAM, and network bandwidth.
- Concurrency: Adjust `Max Concurrent Sessions` carefully; setting it too high can destabilize TouchDesigner or your system.
- Memory Management: `Memory Threshold (%)` helps prevent crashes on large crawls by pausing new sessions when system RAM usage is high.
- Filtering: Use `Include URL Patterns` and `Exclude URL Patterns` to limit the scope of crawls and avoid unwanted pages or file types. Wildcards (`*`, `?`) are supported.
- Output Format: Content is output as Markdown in the `content` column of the output DAT, ready for ingestion by the Rag Index operator.
- Stopping: Pulsing `Stop Crawl` attempts a graceful shutdown, but currently active browser tasks may take time to fully terminate.
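A concurrency cap like `Max Concurrent Sessions` typically follows a common pattern: gate each crawl task behind a semaphore so only N run at once. The sketch below illustrates that generic pattern with `asyncio`; it is not the operator's actual implementation, and the sleep stands in for a real page fetch.

```python
import asyncio

MAX_CONCURRENT = 5  # mirrors the 'Max Concurrent Sessions' parameter

async def crawl_one(url, sem, results):
    async with sem:                  # at most MAX_CONCURRENT crawls run at once
        await asyncio.sleep(0.01)    # stand-in for a real page fetch/render
        results.append(url)

async def crawl_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    results = []
    await asyncio.gather(*(crawl_one(u, sem, results) for u in urls))
    return results

done = asyncio.run(crawl_all([f"https://example.com/p{i}" for i in range(12)]))
print(len(done))
```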
Related Operators

- Rag Index: Ingests the output of this operator to create a searchable index.
- ChatTD: Provides core services (dependency management, asynchronous task execution) required by this operator.
- Source Webscraper: An alternative web-scraping operator using a different backend (`aiohttp`, `trafilatura`). It may be lighter weight for simpler scraping tasks that do not require full browser rendering.