Source Crawl4ai Operator

The Source Crawl4ai LOP uses the crawl4ai Python library to fetch content from web pages, sitemaps, or lists of URLs. It drives headless browsers (via Playwright) to render pages, extracts the main content, converts it to Markdown, and structures the output into a DAT table compatible with the Rag Index operator. It supports multiple crawling modes, URL filtering, and resource-management features such as concurrency limits and adaptive memory-usage control.

Source Crawl4ai UI

Requirements

  • Python Packages:
    • crawl4ai: The core crawling library.
    • playwright: Required by crawl4ai for browser automation.
    • requests: Used internally for sitemap fetching.
    • Installing crawl4ai via the ChatTD operator’s Python manager will pull in the other packages as dependencies.
  • Playwright Browsers: After installing the Python packages, the necessary browser binaries must be downloaded using the Install/Update Playwright Browsers parameter on this operator.
  • ChatTD Operator: Required for dependency management (package installation) and asynchronous task execution. Ensure the ChatTD Operator parameter on the ‘About’ page points to your configured ChatTD instance.

Inputs

None

Outputs

  • Output Table (DAT): The primary output, containing the crawled content. Columns match the requirements for the Rag Index operator:
    • doc_id: Unique ID for the crawled page/chunk.
    • filename: Source URL of the crawled page.
    • content: Crawled content formatted as Markdown.
    • metadata: JSON string containing source URL, timestamp, content length, etc.
    • source_path: Source URL (duplicate of filename).
    • timestamp: Unix timestamp of when the content was processed.
  • Internal DATs: (Accessible via operator viewer) index_table (summary view) and content_table (detailed view of selected doc).
  • Status/Log: Information is logged via the linked Logger component within ChatTD. Key status info is also reflected in the Status, Progress, and URLs Processed parameters.
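As a sketch of how these columns fit together, the following assembles one output-table row in plain Python. The column names follow the schema above; the exact metadata keys shown (`source_url`, `content_length`) are assumptions for illustration, not the operator's guaranteed key names:

```python
import json
import time

def build_row(url: str, markdown: str, doc_id: str) -> dict:
    """Assemble one output-table row using the documented columns.
    Metadata fields mirror the ones the description mentions
    (source URL, timestamp, content length); real keys may differ."""
    ts = int(time.time())
    return {
        "doc_id": doc_id,
        "filename": url,
        "content": markdown,
        "metadata": json.dumps({
            "source_url": url,
            "timestamp": ts,
            "content_length": len(markdown),
        }),
        "source_path": url,  # duplicate of filename, per the schema
        "timestamp": ts,
    }

row = build_row("https://example.com/docs", "# Title\n\nBody text.", "doc_0001")
```

A row shaped like this can be appended directly to a Table DAT, one cell per column.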
Parameters

Target URL / Sitemap / .txt (Url) op('source_crawl4ai').par.Url Str
Default:
None
Crawl Mode (Crawlmode) op('source_crawl4ai').par.Crawlmode Menu
Default:
recursive
Options:
recursive, single, sitemap, txtfile
Max Depth (Recursive) (Maxdepth) op('source_crawl4ai').par.Maxdepth Int
Default:
2
Range:
1 to N/A
Slider Range:
1 to 10
Include URL Patterns (Includepatterns) op('source_crawl4ai').par.Includepatterns Str
Default:
None
Exclude URL Patterns (Excludepatterns) op('source_crawl4ai').par.Excludepatterns Str
Default:
None
Start Crawl (Startcrawl) op('source_crawl4ai').par.Startcrawl Pulse
Default:
None
Stop Crawl (Stopcrawl) op('source_crawl4ai').par.Stopcrawl Pulse
Default:
None
Current Status (Status) op('source_crawl4ai').par.Status Str
Default:
/DOC_GENERATOR/NETWORK_VIEWER/source_crawl4ai initialized
Progress (Progress) op('source_crawl4ai').par.Progress Float
Default:
0
URLs Processed (Urlsprocessed) op('source_crawl4ai').par.Urlsprocessed Int
Default:
0
Max Concurrent Sessions (Maxconcurrent) op('source_crawl4ai').par.Maxconcurrent Int
Default:
5
Range:
1 to N/A
Slider Range:
1 to 20
Memory Threshold (%) (Memorythreshold) op('source_crawl4ai').par.Memorythreshold Float
Default:
70
Range:
30 to 95
Slider Range:
50 to 90
Install/Update Playwright Browsers (Installplaywright) op('source_crawl4ai').par.Installplaywright Pulse
Default:
None
Clear Table Data (Clearoutput) op('source_crawl4ai').par.Clearoutput Pulse
Default:
None
Caution: Displaying the viewer of large index tables can be performance-heavy (Header)
Display (Display) op('source_crawl4ai').par.Display Menu
Default:
content
Options:
index, content
Select Doc (Selectdoc) op('source_crawl4ai').par.Selectdoc Int
Default:
1
Range:
1 to N/A
Slider Range:
1 to N/A
Display File (Displayfile) op('source_crawl4ai').par.Displayfile Str
Default:
None
Bypass (Bypass) op('source_crawl4ai').par.Bypass Toggle
Default:
0
Options:
off, on
Show Built-in Parameters (Showbuiltin) op('source_crawl4ai').par.Showbuiltin Toggle
Default:
0
Options:
off, on
Version (Version) op('source_crawl4ai').par.Version Str
Default:
1.0.0
Last Updated (Lastupdated) op('source_crawl4ai').par.Lastupdated Str
Default:
2025-04-30
Creator (Creator) op('source_crawl4ai').par.Creator Str
Default:
dotsimulate
Website (Website) op('source_crawl4ai').par.Website Str
Default:
https://dotsimulate.com
ChatTD Operator (Chattd) op('source_crawl4ai').par.Chattd OP
Default:
/dot_lops/ChatTD
Clear Log (Clearlog) op('source_crawl4ai').par.Clearlog Pulse
Default:
None
Convert To Text (Converttotext) op('source_crawl4ai').par.Converttotext Toggle
Default:
0
Options:
off, on
Crawling a Single Page

1. Set 'Target URL / Sitemap / .txt' to the full URL (e.g., https://docs.derivative.ca/Introduction_to_Python).
2. Set 'Crawl Mode' to 'Single Page'.
3. Pulse 'Start Crawl'.
4. Monitor 'Status' and view results in the Output Table DAT.
Crawling from a Sitemap

1. Set 'Target URL / Sitemap / .txt' to the exact URL of the sitemap (e.g., https://example.com/sitemap.xml).
2. Set 'Crawl Mode' to 'Sitemap Batch'.
3. (Optional) Set 'Include/Exclude URL Patterns' to filter URLs from the sitemap.
4. Adjust 'Max Concurrent Sessions' based on your system.
5. Pulse 'Start Crawl'.
6. Monitor 'Status' and 'Progress'.
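The sitemap parsing and pattern filtering in steps 1 and 3 can be sketched in plain Python. The sitemap XML and the exclude pattern below are made-up examples; the real operator fetches and filters internally:

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatch

# Hypothetical sitemap; the operator fetches the real one over HTTP.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/a</loc></url>
  <url><loc>https://example.com/blog/post</loc></url>
</urlset>"""

def sitemap_urls(xml_text):
    # Sitemap files declare a fixed namespace; find every <loc> under it.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text.strip()
            for loc in ET.fromstring(xml_text).findall(".//sm:loc", ns)]

def keep(url, include, exclude):
    # Wildcard (*, ?) matching, mirroring the Include/Exclude URL Patterns idea.
    if include and not any(fnmatch(url, p) for p in include):
        return False
    return not any(fnmatch(url, p) for p in exclude)

urls = [u for u in sitemap_urls(SITEMAP_XML) if keep(u, [], ["*/blog/*"])]
```

With an empty include list every URL passes the include check, so only the exclude patterns prune the batch.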
Recursive Site Crawl

1. Set 'Target URL / Sitemap / .txt' to the starting page (e.g., https://yoursite.com/documentation/).
2. Set 'Crawl Mode' to 'Crawl Site Links'.
3. Set 'Max Depth' (e.g., 3). Be cautious with high values on large sites.
4. (Optional) Set 'Exclude URL Patterns' to avoid specific sections (e.g., */blog*, */forum*).
5. Adjust 'Max Concurrent Sessions'.
6. Pulse 'Start Crawl'.
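Conceptually, the recursive mode behaves like a breadth-first traversal bounded by Max Depth and pruned by exclude patterns. A minimal sketch over a hypothetical in-memory link graph (the real operator discovers links by rendering pages in headless browsers):

```python
from collections import deque

# Hypothetical link graph standing in for real pages: url -> outgoing links.
LINKS = {
    "https://yoursite.com/docs/": ["https://yoursite.com/docs/a",
                                   "https://yoursite.com/blog/x"],
    "https://yoursite.com/docs/a": ["https://yoursite.com/docs/b"],
    "https://yoursite.com/docs/b": ["https://yoursite.com/docs/c"],
}

def crawl(start, max_depth):
    """Breadth-first crawl honoring a depth limit (depth 1 = start page)."""
    seen = {start}
    order = []
    queue = deque([(start, 1)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond Max Depth
        for link in LINKS.get(url, []):
            if link not in seen and "/blog" not in link:  # exclude-pattern stand-in
                seen.add(link)
                queue.append((link, depth + 1))
    return order

visited = crawl("https://yoursite.com/docs/", max_depth=2)
```

With max_depth=2, only the start page and its direct links are fetched, which is why high depth values on large sites grow the crawl so quickly.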
Installing Dependencies

1. Ensure the 'ChatTD Operator' parameter points to your ChatTD instance.
2. Use ChatTD's Python Manager to install the 'crawl4ai' package.
3. Return to this operator and pulse the 'Install/Update Playwright Browsers' parameter.
4. Monitor the Textport for download progress. Installation is complete when the logs indicate success.
Notes

  • Dependencies: Requires crawl4ai and playwright Python packages, installable via ChatTD. Crucially, Playwright also needs browser binaries downloaded via the Install/Update Playwright Browsers parameter pulse.
  • Resource Usage: Crawling, especially in batch modes (Sitemap, Recursive, Text File), uses headless browsers and can consume significant CPU, RAM, and network bandwidth.
  • Concurrency: Adjust Max Concurrent Sessions carefully. Too high can destabilize TouchDesigner or your system.
  • Memory Management: The Memory Threshold (%) helps prevent crashes on large crawls by pausing new sessions when system RAM usage is high.
  • Filtering: Use Include URL Patterns and Exclude URL Patterns effectively to limit the scope of crawls and avoid unwanted pages or file types. Wildcards (*, ?) are supported.
  • Output Format: Content is output as Markdown in the content column of the output DAT, ready for ingestion by the Rag Index operator.
  • Stopping: Pulsing Stop Crawl attempts a graceful shutdown, but currently active browser tasks might take time to fully terminate.
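The concurrency cap behaves like a semaphore around page sessions: no more than the configured number run at once. A minimal asyncio sketch, with `asyncio.sleep` standing in for real page rendering and `MAX_CONCURRENT` standing in for the Max Concurrent Sessions parameter:

```python
import asyncio

MAX_CONCURRENT = 3  # stands in for the Max Concurrent Sessions parameter

async def fetch(url, sem, counter):
    # Only MAX_CONCURRENT coroutines may hold the semaphore at once.
    async with sem:
        counter["active"] += 1
        counter["peak"] = max(counter["peak"], counter["active"])
        await asyncio.sleep(0.01)  # placeholder for real page rendering
        counter["active"] -= 1
    return url

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    counter = {"active": 0, "peak": 0}
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    await asyncio.gather(*(fetch(u, sem, counter) for u in urls))
    return counter

stats = asyncio.run(main())
```

Raising the limit increases throughput but also multiplies browser memory and CPU use, which is why the Memory Threshold gate pauses new sessions when RAM usage climbs.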
Related Operators

  • Rag Index: Ingests the output of this operator to create a searchable index.
  • ChatTD: Provides core services like dependency management and asynchronous task execution required by this operator.
  • Source Webscraper: An alternative web scraping operator using a different backend (aiohttp, trafilatura). Might be lighter weight for simpler scraping tasks not requiring full browser rendering.