Source Crawl4ai Operator

Recent updates:

  • Table input mode support
  • Better deduplication algorithms
  • Enhanced multi-agent source gathering
  • Improved handling of multiple sources simultaneously

The Source Crawl4ai LOP utilizes the crawl4ai Python library to fetch content from web pages, sitemaps, or lists of URLs. It uses headless browsers (via Playwright) to render pages, extracts the main content, converts it to Markdown, and structures the output into a DAT table compatible with the Rag Index operator. It supports various crawling modes, URL filtering, and resource management features like concurrency limits and adaptive memory usage control.

[Image: Source Crawl4ai UI]

Requirements:

  • Python Packages:
    • crawl4ai: the core crawling library.
    • playwright: required by crawl4ai for browser automation.
    • requests: used for sitemap fetching.
    • These can be installed via the ChatTD operator’s Python manager by installing crawl4ai first.
  • Playwright Browsers: After installing the Python packages, the necessary browser binaries must be downloaded using the Install/Update Playwright Browsers parameter on this operator.
  • ChatTD Operator: Required for dependency management (package installation) and asynchronous task execution. Ensure the ChatTD Operator parameter on the ‘About’ page points to your configured ChatTD instance.
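
For reference, a manual equivalent of that setup outside the ChatTD Python manager might look like the sketch below. `pip install crawl4ai` and `python -m playwright install` are the standard commands for these packages, but the interpreter path is an assumption; inside TouchDesigner, prefer the ChatTD manager.

```python
# Run with the same Python environment that TouchDesigner uses.
import subprocess, sys

# Install the crawling library (pulls in playwright as a dependency)
subprocess.run([sys.executable, '-m', 'pip', 'install', 'crawl4ai'], check=True)
# Download the browser binaries Playwright drives for crawl4ai
subprocess.run([sys.executable, '-m', 'playwright', 'install', 'chromium'], check=True)
```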

Inputs: None

Outputs:
  • Output Table (DAT): The primary output, containing the crawled content. Columns match the requirements for the Rag Index operator:
    • doc_id: Unique ID for the crawled page/chunk.
    • filename: Source URL of the crawled page.
    • content: Crawled content formatted as Markdown.
    • metadata: JSON string containing source URL, timestamp, content length, etc.
    • source_path: Source URL (duplicate of filename).
    • timestamp: Unix timestamp of when the content was processed.
  • Internal DATs: (Accessible via operator viewer) index_table (summary view) and content_table (detailed view of selected doc).
  • Status/Log: Information is logged via the linked Logger component within ChatTD. Key status info is also reflected in the Status, Progress, and URLs Processed parameters.
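
A minimal sketch of reading those columns from another script in the network, assuming the operator's table output has been wired into a DAT named 'crawl_out' (a hypothetical name):

```python
import json

table = op('crawl_out')                  # hypothetical DAT fed by the Output Table
for r in range(1, table.numRows):        # row 0 is the header
    url = table[r, 'filename'].val
    md = table[r, 'content'].val         # crawled page as Markdown
    meta = json.loads(table[r, 'metadata'].val)
    debug(url, len(md), meta)
```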
Parameters

  • Crawl Source (Crawlsource): op('source_crawl4ai').par.Crawlsource, Menu. Default: url
  • Target URL / Sitemap / .txt (Url): op('source_crawl4ai').par.Url, Str. Default: none
  • URL Table (Urltable): op('source_crawl4ai').par.Urltable, OP. Default: none
  • Include URL Patterns (Includepatterns): op('source_crawl4ai').par.Includepatterns, Str. Default: none
  • Exclude URL Patterns (Excludepatterns): op('source_crawl4ai').par.Excludepatterns, Str. Default: none
  • Current Status (Status): op('source_crawl4ai').par.Status, Str. Default: none
  • Progress (Progress): op('source_crawl4ai').par.Progress, Float. Default: none
  • URLs Processed (Urlsprocessed): op('source_crawl4ai').par.Urlsprocessed, Int. Default: none
  • Caution: Exposing the viewer of large index tables will be heavy (Header)
  • Start Crawl (Startcrawl): op('source_crawl4ai').par.Startcrawl, Pulse
  • Stop Crawl (Stopcrawl): op('source_crawl4ai').par.Stopcrawl, Pulse
  • Clear Table on Crawl (Clearontable): op('source_crawl4ai').par.Clearontable, Toggle. Default: none
  • Avoid Repeats (Usehistory): op('source_crawl4ai').par.Usehistory, Toggle. Default: none
  • Clear Scrape History (Clearhistory): op('source_crawl4ai').par.Clearhistory, Pulse
  • Crawl Mode (Crawlmode): op('source_crawl4ai').par.Crawlmode, Menu. Default: recursive
  • Max Depth (Recursive) (Maxdepth): op('source_crawl4ai').par.Maxdepth, Int. Default: 2. Range: 1 to 10
  • Max Concurrent Sessions (Maxconcurrent): op('source_crawl4ai').par.Maxconcurrent, Int. Default: 5. Range: 1 to 20
  • Memory Threshold (%) (Memorythreshold): op('source_crawl4ai').par.Memorythreshold, Float. Default: 70.0. Range: 30 to 95
  • Install/Update Playwright Browsers (Installplaywright): op('source_crawl4ai').par.Installplaywright, Pulse
  • Clear Table Data (Clearoutput): op('source_crawl4ai').par.Clearoutput, Pulse
  • Display (Display): op('source_crawl4ai').par.Display, Menu. Default: index
  • Display File (Displayfile): op('source_crawl4ai').par.Displayfile, Str. Default: none
  • Select Doc (Selectdoc): op('source_crawl4ai').par.Selectdoc, Int. Default: 1. Range: 1 to 1
  • Agent Return Content (Agentreturncontent): op('source_crawl4ai').par.Agentreturncontent, Menu. Default: none
  • Agent Execution Mode (Agentexecutionmode): op('source_crawl4ai').par.Agentexecutionmode, Menu. Default: wait
  • Agent Calls Add to Table (Agenttotable): op('source_crawl4ai').par.Agenttotable, Toggle. Default: none
  • Bypass (Bypass): op('source_crawl4ai').par.Bypass, Toggle. Default: none
  • Show Built-in Parameters (Showbuiltin): op('source_crawl4ai').par.Showbuiltin, Toggle. Default: none
  • Version (Version): op('source_crawl4ai').par.Version, Str. Default: none
  • Last Updated (Lastupdated): op('source_crawl4ai').par.Lastupdated, Str. Default: none
  • Creator (Creator): op('source_crawl4ai').par.Creator, Str. Default: none
  • Website (Website): op('source_crawl4ai').par.Website, Str. Default: none
  • ChatTD Operator (Chattd): op('source_crawl4ai').par.Chattd, OP. Default: none
  • Clear Log (Clearlog): op('source_crawl4ai').par.Clearlog, Pulse
  • Convert To Text (Converttotext): op('source_crawl4ai').par.Converttotext, Toggle. Default: none
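
A minimal sketch of launching and monitoring a crawl from Python, using the parameters listed above (the target URL is a placeholder):

```python
crawler = op('source_crawl4ai')

# Configure and launch
crawler.par.Url = 'https://example.com/docs/'   # placeholder target
crawler.par.Clearontable = True                 # start from an empty table
crawler.par.Startcrawl.pulse()

# Poll status, e.g. from a Timer CHOP or Execute DAT callback
debug(crawler.par.Status.eval(),
      crawler.par.Progress.eval(),
      crawler.par.Urlsprocessed.eval())
```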
🔧 GetTool Enabled: 2 tools

This operator exposes 2 tools that allow Agent and Gemini Live LOPs to crawl web pages and websites to extract content, supporting both single-page crawling and full recursive site crawling for AI-driven content gathering. Used as a tool by an Agent LOP, it lets an AI autonomously crawl the web to gather information.

When connected to an Agent, this operator provides the following functions:

  • crawl_single_page(url): Fetches and returns the text content of a single, specific web page. This is best used when the agent needs the contents of one exact URL.
  • crawl_full_website_recursively(url, max_depth=2): Crawls an entire website by following internal links, starting from a given URL. It processes up to 20 pages to gather comprehensive information. This is ideal when an agent needs to understand the content of a whole website, not just a single page.
Typical workflow:

  1. Connect to Agent: Add the Source Crawl4ai LOP to the Tool sequence parameter on an Agent LOP.
  2. Agent Prompts: When the Agent receives a prompt that requires web content, it can choose to call one of the crawl tools.
  3. Execution: The Source Crawl4ai LOP executes the crawl asynchronously and returns the extracted Markdown content to the Agent.
  4. Response: The Agent then uses this content to formulate its response.
"Please summarize the main points from the article at https://example.com/news/latest-ai-breakthroughs and also give me an overview of the company's products from their website."

In this scenario, the Agent could:

  1. Call crawl_single_page with the URL https://example.com/news/latest-ai-breakthroughs.
  2. Call crawl_full_website_recursively with the URL https://example.com/products.
  3. Use the content from both tool calls to generate a comprehensive summary and overview.
Crawling a single page (see the sketch after these steps):

  1. Set ‘Target URL / Sitemap / .txt’ to the full URL (e.g., https://docs.derivative.ca/Introduction_to_Python).
  2. Set ‘Crawl Mode’ to ‘Single Page’.
  3. Pulse ‘Start Crawl’.
  4. Monitor ‘Status’ and view results in the Output Table DAT.
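
The same steps in Python. The 'single' menu token is an assumption; check crawler.par.Crawlmode.menuNames for the actual value.

```python
crawler = op('source_crawl4ai')
crawler.par.Url = 'https://docs.derivative.ca/Introduction_to_Python'
crawler.par.Crawlmode = 'single'     # assumed token for 'Single Page'
crawler.par.Startcrawl.pulse()
```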
Crawling a sitemap (see the sketch after these steps):

  1. Set ‘Target URL / Sitemap / .txt’ to the exact URL of the sitemap (e.g., https://example.com/sitemap.xml).
  2. Set ‘Crawl Mode’ to ‘Sitemap Batch’.
  3. (Optional) Set ‘Include/Exclude URL Patterns’ to filter URLs from the sitemap.
  4. Adjust ‘Max Concurrent Sessions’ based on your system.
  5. Pulse ‘Start Crawl’.
  6. Monitor ‘Status’ and ‘Progress’.
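
The same workflow scripted. 'sitemap' is an assumed menu token, and the patterns are illustrative:

```python
crawler = op('source_crawl4ai')
crawler.par.Url = 'https://example.com/sitemap.xml'
crawler.par.Crawlmode = 'sitemap'            # assumed token for 'Sitemap Batch'
crawler.par.Includepatterns = '*/docs/*'     # optional: keep only docs pages
crawler.par.Excludepatterns = '*.pdf'        # optional: skip binary files
crawler.par.Maxconcurrent = 5
crawler.par.Startcrawl.pulse()
```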
Crawling site links recursively (see the sketch after these steps):

  1. Set ‘Target URL / Sitemap / .txt’ to the starting page (e.g., https://yoursite.com/documentation/).
  2. Set ‘Crawl Mode’ to ‘Crawl Site Links’.
  3. Set ‘Max Depth’ (e.g., 3). Be cautious with high values on large sites.
  4. (Optional) Set ‘Exclude URL Patterns’ to avoid specific sections (e.g., /blog /forum).
  5. Adjust ‘Max Concurrent Sessions’.
  6. Pulse ‘Start Crawl’.
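
Scripted equivalent; 'recursive' matches the default listed in the parameter table, and the exclude patterns mirror step 4:

```python
crawler = op('source_crawl4ai')
crawler.par.Url = 'https://yoursite.com/documentation/'
crawler.par.Crawlmode = 'recursive'
crawler.par.Maxdepth = 3                         # be cautious with high values on large sites
crawler.par.Excludepatterns = '/blog* /forum*'   # space-separated patterns, per step 4
crawler.par.Startcrawl.pulse()
```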
Installing dependencies:

  1. Ensure the ‘ChatTD Operator’ parameter points to your ChatTD instance.
  2. Use ChatTD’s Python Manager to install the ‘crawl4ai’ package.
  3. Return to this operator. Pulse the ‘Install/Update Playwright Browsers’ parameter.
  4. Monitor the Textport for download progress. Installation is complete when the logs indicate success.
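
Steps 1 and 3 scripted; the ChatTD path is hypothetical:

```python
crawler = op('source_crawl4ai')
crawler.par.Chattd = op('/project1/ChatTD')   # hypothetical path to your ChatTD instance
crawler.par.Installplaywright.pulse()         # watch the Textport for download progress
```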
Notes:

  • Dependencies: Requires the crawl4ai and playwright Python packages, installable via ChatTD. Crucially, Playwright also needs browser binaries, downloaded by pulsing the Install/Update Playwright Browsers parameter.
  • Resource Usage: Crawling, especially in batch modes (Sitemap, Recursive, Text File), uses headless browsers and can consume significant CPU, RAM, and network bandwidth.
  • Concurrency: Adjust Max Concurrent Sessions carefully; setting it too high can destabilize TouchDesigner or your system.
  • Memory Management: Memory Threshold (%) helps prevent crashes on large crawls by pausing new sessions when system RAM usage is high.
  • Filtering: Use Include URL Patterns and Exclude URL Patterns to limit the scope of crawls and avoid unwanted pages or file types. Wildcards (*, ?) are supported; see the sketch after this list.
  • Output Format: Content is output as Markdown in the content column of the output DAT, ready for ingestion by the Rag Index operator.
  • Stopping: Pulsing Stop Crawl attempts a graceful shutdown, but currently active browser tasks may take time to fully terminate.
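
The wildcard semantics are glob-style. A quick illustration of how such include/exclude patterns typically combine, using Python's fnmatch to demonstrate the idea (not the operator's internal code):

```python
from fnmatch import fnmatch

urls = [
    'https://example.com/docs/intro',
    'https://example.com/blog/post-1',
    'https://example.com/files/manual.pdf',
]
include, exclude = '*example.com*', '*/blog/*'

# Keep URLs matching the include pattern and not matching the exclude pattern
kept = [u for u in urls if fnmatch(u, include) and not fnmatch(u, exclude)]
# kept -> ['https://example.com/docs/intro', 'https://example.com/files/manual.pdf']
```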
Related operators:

  • Rag Index: Ingests the output of this operator to create a searchable index.
  • ChatTD: Provides core services (dependency management, asynchronous task execution) required by this operator.
  • Source Webscraper: An alternative web-scraping operator using a different backend (aiohttp, trafilatura); it can be lighter weight for simpler scraping tasks that don't require full browser rendering.