# Source Crawl4ai Operator

Source Crawl4AI v1.3.0 (September 2, 2025)

- Table input mode support
- Better deduplication algorithms
- Enhanced multi-agent source gathering
- Improved handling of multiple simultaneous sources

The Source Crawl4ai LOP uses the `crawl4ai` Python library to fetch content from web pages, sitemaps, or lists of URLs. It renders pages with headless browsers (via Playwright), extracts the main content, converts it to Markdown, and structures the output into a DAT table compatible with the Rag Index operator. It supports multiple crawling modes, URL filtering, and resource-management features such as concurrency limits and adaptive memory usage control.
## Requirements

- Python Packages:
  - `crawl4ai`: the core crawling library.
  - `playwright`: required by `crawl4ai` for browser automation.
  - `requests`: implicitly needed for sitemap fetching.

  These can be installed via the ChatTD operator’s Python manager by first installing `crawl4ai`.
- Playwright Browsers: after installing the Python packages, the necessary browser binaries must be downloaded using the `Install/Update Playwright Browsers` parameter on this operator.
- ChatTD Operator: required for dependency management (package installation) and asynchronous task execution. Ensure the `ChatTD Operator` parameter on the ‘About’ page points to your configured ChatTD instance.
## Input/Output

### Inputs

None.

### Outputs

- Output Table (DAT): the primary output, containing the crawled content. Columns match the requirements of the Rag Index operator (a read-out sketch follows this list):
  - `doc_id`: unique ID for the crawled page/chunk.
  - `filename`: source URL of the crawled page.
  - `content`: crawled content formatted as Markdown.
  - `metadata`: JSON string containing source URL, timestamp, content length, etc.
  - `source_path`: source URL (duplicate of `filename`).
  - `timestamp`: Unix timestamp of when the content was processed.
- Internal DATs (accessible via the operator viewer): `index_table` (summary view) and `content_table` (detailed view of the selected doc).
- Status/Log: information is logged via the linked `Logger` component within ChatTD. Key status info is also reflected in the `Status`, `Progress`, and `URLs Processed` parameters.
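Once a crawl completes, the output table can be read from Python. A minimal sketch, assuming the operator’s Output Table DAT has been wired into a Table DAT named `crawl_out` (an assumed name):

```python
# Read crawled rows from the output table (row 0 is the header).
out = op('crawl_out')  # assumed name for the DAT receiving the output

for r in range(1, out.numRows):
    doc_id  = out[r, 'doc_id'].val
    url     = out[r, 'filename'].val
    content = out[r, 'content'].val  # Markdown text
    print(doc_id, url, len(content), 'chars')
```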
## Parameters

### Page: Crawl Config

| Parameter | Type | Default | Range |
| --- | --- | --- | --- |
| `op('source_crawl4ai').par.Url` | Str | None | |
| `op('source_crawl4ai').par.Urltable` | OP | None | |
| `op('source_crawl4ai').par.Includepatterns` | Str | None | |
| `op('source_crawl4ai').par.Excludepatterns` | Str | None | |
| `op('source_crawl4ai').par.Status` | Str | None | |
| `op('source_crawl4ai').par.Progress` | Float | None | |
| `op('source_crawl4ai').par.Urlsprocessed` | Int | None | |
| `op('source_crawl4ai').par.Startcrawl` | Pulse | None | |
| `op('source_crawl4ai').par.Stopcrawl` | Pulse | None | |
| `op('source_crawl4ai').par.Clearontable` | Toggle | None | |
| `op('source_crawl4ai').par.Usehistory` | Toggle | None | |
| `op('source_crawl4ai').par.Clearhistory` | Pulse | None | |
| `op('source_crawl4ai').par.Maxdepth` | Int | 2 | 1 to 10 |
| `op('source_crawl4ai').par.Maxconcurrent` | Int | 5 | 1 to 20 |
| `op('source_crawl4ai').par.Memorythreshold` | Float | 70.0 | 30 to 95 |
| `op('source_crawl4ai').par.Installplaywright` | Pulse | None | |
| `op('source_crawl4ai').par.Clearoutput` | Pulse | None | |
| `op('source_crawl4ai').par.Displayfile` | Str | None | |
| `op('source_crawl4ai').par.Selectdoc` | Int | 1 | 1 to 1 |
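These are standard TouchDesigner custom parameters and can be set from scripts. A minimal sketch of driving a crawl from a table of URLs, assuming the operator is named `source_crawl4ai` and a Table DAT named `url_list` holds one URL per row (both names are assumptions):

```python
crawler = op('source_crawl4ai')  # assumed operator name

# Point the OP parameter at a Table DAT listing one URL per row.
crawler.par.Urltable = op('url_list')  # assumed DAT name
crawler.par.Usehistory = True  # assumption: skips URLs crawled in previous runs

# Resource limits (ranges per the table above).
crawler.par.Maxconcurrent = 5       # 1 to 20
crawler.par.Memorythreshold = 70.0  # 30 to 95

crawler.par.Startcrawl.pulse()
```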
### Page: Agents

| Parameter | Type | Default |
| --- | --- | --- |
| `op('source_crawl4ai').par.Agenttotable` | Toggle | None |
### Page: About

| Parameter | Type | Default |
| --- | --- | --- |
| `op('source_crawl4ai').par.Bypass` | Toggle | None |
| `op('source_crawl4ai').par.Showbuiltin` | Toggle | None |
| `op('source_crawl4ai').par.Version` | Str | None |
| `op('source_crawl4ai').par.Lastupdated` | Str | None |
| `op('source_crawl4ai').par.Creator` | Str | None |
| `op('source_crawl4ai').par.Website` | Str | None |
| `op('source_crawl4ai').par.Chattd` | OP | None |
| `op('source_crawl4ai').par.Clearlog` | Pulse | None |
| `op('source_crawl4ai').par.Converttotext` | Toggle | None |
## Agent Tool Integration

This operator exposes two tools that allow Agent and Gemini Live LOPs to crawl web pages and websites to extract content, supporting both single-page crawling and full recursive website crawling for AI-driven content gathering.

Use the Tool Debugger operator to inspect the exact tool definitions, schemas, and parameters.

The Source Crawl4ai LOP can be used as a tool by Agent LOPs, allowing an AI to autonomously crawl web pages and websites to gather information.

### Available Tools

When connected to an Agent, this operator provides the following functions:

- `crawl_single_page(url)`: Fetches and returns the text content of a single, specific web page. Best used when the agent needs the contents of one exact URL.
- `crawl_full_website_recursively(url, max_depth=2)`: Crawls an entire website by following internal links, starting from a given URL. It processes up to 20 pages to gather comprehensive information. Ideal when an agent needs to understand the content of a whole website, not just a single page.
### How It Works

- Connect to Agent: Add the Source Crawl4ai LOP to the `Tool` sequence parameter on an Agent LOP.
- Agent Prompts: When the Agent receives a prompt that requires web content, it can choose to call one of the crawl tools.
- Execution: The Source Crawl4ai LOP executes the crawl asynchronously and returns the extracted Markdown content to the Agent.
- Response: The Agent then uses this content to formulate its response. (A status-monitoring sketch follows below.)
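While a tool-initiated crawl runs, the documented status parameters can be watched from Python; a minimal sketch, assuming the operator is named `source_crawl4ai`:

```python
crawler = op('source_crawl4ai')  # assumed operator name

# These parameters are updated by the operator during a crawl.
print('Status:   ', crawler.par.Status.eval())
print('Progress: ', crawler.par.Progress.eval())
print('Processed:', crawler.par.Urlsprocessed.eval())
```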
### Example Agent Prompt

"Please summarize the main points from the article at https://example.com/news/latest-ai-breakthroughs and also give me an overview of the company's products from their website."

In this scenario, the Agent could:

- Call `crawl_single_page` with the URL `https://example.com/news/latest-ai-breakthroughs`.
- Call `crawl_full_website_recursively` with the URL `https://example.com/products`.
- Use the content from both tool calls to generate a comprehensive summary and overview.
## Usage Examples

### Crawling a Single Page

- Set ‘Target URL / Sitemap / .txt’ to the full URL (e.g., https://docs.derivative.ca/Introduction_to_Python).
- Set ‘Crawl Mode’ to ‘Single Page’.
- Pulse ‘Start Crawl’.
- Monitor ‘Status’ and view results in the Output Table DAT. (A scripted version is sketched below.)
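The same steps can be scripted; a minimal sketch, assuming the operator is named `source_crawl4ai`. The ‘Crawl Mode’ menu parameter is not listed in the table above, so set it in the parameter dialog or look up its scripting name there:

```python
crawler = op('source_crawl4ai')  # assumed operator name

crawler.par.Url = 'https://docs.derivative.ca/Introduction_to_Python'
# Set 'Crawl Mode' to 'Single Page' in the parameter dialog first.
crawler.par.Startcrawl.pulse()
```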
### Crawling from a Sitemap

- Set ‘Target URL / Sitemap / .txt’ to the exact URL of the sitemap (e.g., https://example.com/sitemap.xml).
- Set ‘Crawl Mode’ to ‘Sitemap Batch’.
- (Optional) Set ‘Include/Exclude URL Patterns’ to filter URLs from the sitemap.
- Adjust ‘Max Concurrent Sessions’ based on your system.
- Pulse ‘Start Crawl’.
- Monitor ‘Status’ and ‘Progress’. (A scripted version is sketched below.)
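A scripted equivalent, under the same assumptions as the previous sketch:

```python
crawler = op('source_crawl4ai')  # assumed operator name

crawler.par.Url = 'https://example.com/sitemap.xml'
# Set 'Crawl Mode' to 'Sitemap Batch' in the parameter dialog first.
crawler.par.Includepatterns = '*/docs/*'  # optional: keep only matching URLs
crawler.par.Maxconcurrent = 8
crawler.par.Startcrawl.pulse()
```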
### Recursive Crawl of a Small Site Section

- Set ‘Target URL / Sitemap / .txt’ to the starting page (e.g., https://yoursite.com/documentation/).
- Set ‘Crawl Mode’ to ‘Crawl Site Links’.
- Set ‘Max Depth’ (e.g., 3). Be cautious with high values on large sites.
- (Optional) Set ‘Exclude URL Patterns’ to avoid specific sections (e.g., /blog /forum).
- Adjust ‘Max Concurrent Sessions’.
- Pulse ‘Start Crawl’. (A scripted version is sketched below.)
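And the recursive variant as a script, again assuming the operator name:

```python
crawler = op('source_crawl4ai')  # assumed operator name

crawler.par.Url = 'https://yoursite.com/documentation/'
# Set 'Crawl Mode' to 'Crawl Site Links' in the parameter dialog first.
crawler.par.Maxdepth = 3
crawler.par.Excludepatterns = '*/blog/* */forum/*'  # skip these sections
crawler.par.Maxconcurrent = 5
crawler.par.Startcrawl.pulse()
```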
### Initial Setup (Installation)

- Ensure the ‘ChatTD Operator’ parameter points to your ChatTD instance.
- Use ChatTD’s Python Manager to install the ‘crawl4ai’ package.
- Return to this operator and pulse the ‘Install/Update Playwright Browsers’ parameter.
- Monitor the Textport for download progress. Installation is complete when the logs indicate success.
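The browser download can also be triggered from a script once `crawl4ai` is installed; a minimal sketch:

```python
crawler = op('source_crawl4ai')  # assumed operator name

# Downloads the Playwright browser binaries; watch the Textport for progress.
crawler.par.Installplaywright.pulse()
```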
## Technical Notes

- Dependencies: requires the `crawl4ai` and `playwright` Python packages, installable via ChatTD. Crucially, Playwright also needs browser binaries downloaded via the `Install/Update Playwright Browsers` parameter pulse.
- Resource Usage: crawling, especially in batch modes (Sitemap, Recursive, Text File), uses headless browsers and can consume significant CPU, RAM, and network bandwidth.
- Concurrency: adjust `Max Concurrent Sessions` carefully. Setting it too high can destabilize TouchDesigner or your system.
- Memory Management: the `Memory Threshold (%)` parameter helps prevent crashes on large crawls by pausing new sessions when system RAM usage is high.
- Filtering: use `Include URL Patterns` and `Exclude URL Patterns` to limit the scope of crawls and avoid unwanted pages or file types. Wildcards (`*`, `?`) are supported; a sketch of the matching semantics follows this list.
- Output Format: content is output as Markdown in the `content` column of the output DAT, ready for ingestion by the Rag Index operator.
- Stopping: pulsing `Stop Crawl` attempts a graceful shutdown, but currently active browser tasks may take time to fully terminate.
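The operator’s exact matching logic is internal, but `*` and `?` are standard glob wildcards; Python’s `fnmatch` module illustrates the semantics (a sketch, not the operator’s implementation):

```python
from fnmatch import fnmatch

urls = [
    'https://example.com/docs/intro',
    'https://example.com/blog/2025/post-1',
    'https://example.com/assets/logo.png',
]

include = '*/docs/*'             # keep URLs matching this pattern
exclude = ('*/blog/*', '*.png')  # drop URLs matching any of these

kept = [u for u in urls
        if fnmatch(u, include)
        and not any(fnmatch(u, pat) for pat in exclude)]
print(kept)  # ['https://example.com/docs/intro']
```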
## Related Operators

- Rag Index: ingests the output of this operator to create a searchable index.
- ChatTD: provides core services like dependency management and asynchronous task execution required by this operator.
- Source Webscraper: an alternative web-scraping operator using a different backend (`aiohttp`, `trafilatura`). It may be lighter weight for simpler scraping tasks that don’t require full browser rendering.