Source Crawl4ai Operator
Overview

The Source Crawl4ai LOP uses the crawl4ai Python library to fetch content from web pages, sitemaps, or lists of URLs. It renders pages with headless browsers (via Playwright), extracts the main content, converts it to Markdown, and structures the output into a DAT table compatible with the Rag Index operator. It supports multiple crawl modes, URL filtering, and resource-management features such as concurrency limits and adaptive memory-usage control.
Requirements

- Python Packages: `crawl4ai` (the core crawling library), `playwright` (required by `crawl4ai` for browser automation), and `requests` (implicitly needed for sitemap fetching). These can be installed via the ChatTD operator's Python manager by installing `crawl4ai` first.
- Playwright Browsers: After installing the Python packages, the necessary browser binaries must be downloaded by pulsing the `Install/Update Playwright Browsers` parameter on this operator.
- ChatTD Operator: Required for dependency management (package installation) and asynchronous task execution. Ensure the `ChatTD Operator` parameter on the 'About' page points to your configured ChatTD instance.
Input/Output

Inputs

None

Outputs

- Output Table (DAT): The primary output, containing the crawled content. Columns match the requirements of the Rag Index operator:
  - `doc_id`: Unique ID for the crawled page/chunk.
  - `filename`: Source URL of the crawled page.
  - `content`: Crawled content formatted as Markdown.
  - `metadata`: JSON string containing source URL, timestamp, content length, etc.
  - `source_path`: Source URL (duplicate of `filename`).
  - `timestamp`: Unix timestamp of when the content was processed.
- Internal DATs: `index_table` (summary view) and `content_table` (detailed view of the selected doc), accessible via the operator viewer.
- Status/Log: Information is logged via the linked Logger component within ChatTD. Key status info is also reflected in the `Status`, `Progress`, and `URLs Processed` parameters.
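The column layout above can be sketched in plain Python. This is an illustrative sketch, not the operator's internal code: the `make_output_row` helper and the exact keys inside the metadata JSON are assumptions.

```python
import hashlib
import json
import time

def make_output_row(url: str, markdown: str) -> dict:
    """Build one output-table row in the column layout described above (illustrative)."""
    ts = int(time.time())
    return {
        "doc_id": hashlib.sha1(url.encode("utf-8")).hexdigest()[:16],  # stable per-URL ID
        "filename": url,
        "content": markdown,
        "metadata": json.dumps({          # JSON string, as the metadata column expects
            "source_url": url,
            "timestamp": ts,
            "content_length": len(markdown),
        }),
        "source_path": url,               # duplicate of filename, per the column list above
        "timestamp": ts,
    }

row = make_output_row("https://example.com/page", "# Title\n\nBody text.")
print(sorted(row.keys()))
```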
Parameters

All parameters are scriptable as `op('source_crawl4ai').par.<Name>`, e.g. `op('source_crawl4ai').par.Url`.

Page: Crawl Config

- Target URL / Sitemap / .txt (`Url`): Str. Default: None.
- Max Depth (Recursive) (`Maxdepth`): Int. Default: 2. Range: 1 to N/A. Slider range: 1 to 10.
- Include URL Patterns (`Includepatterns`): Str. Default: None.
- Exclude URL Patterns (`Excludepatterns`): Str. Default: None.
- Start Crawl (`Startcrawl`): Pulse.
- Stop Crawl (`Stopcrawl`): Pulse.
- Current Status (`Status`): Str. Default: `/DOC_GENERATOR/NETWORK_VIEWER/source_crawl4ai initialized`.
- Progress (`Progress`): Float. Default: 0.
- URLs Processed (`Urlsprocessed`): Int. Default: 0.
- Max Concurrent Sessions (`Maxconcurrent`): Int. Default: 5. Range: 1 to N/A. Slider range: 1 to 20.
- Memory Threshold (%) (`Memorythreshold`): Float. Default: 70. Range: 30 to 95. Slider range: 50 to 90.
- Install/Update Playwright Browsers (`Installplaywright`): Pulse.
- Clear Table Data (`Clearoutput`): Pulse.

Caution: Exposing the viewer of large index tables can be heavy.

- Select Doc (`Selectdoc`): Int. Default: 1. Range: 1 to N/A.
- Display File (`Displayfile`): Str. Default: None.
Page: About

- Bypass (`Bypass`): Toggle. Default: 0. Options: off, on.
- Show Built-in Parameters (`Showbuiltin`): Toggle. Default: 0. Options: off, on.
- Version (`Version`): Str. Default: 1.0.0.
- Last Updated (`Lastupdated`): Str. Default: 2025-04-30.
- Creator (`Creator`): Str. Default: dotsimulate.
- Website (`Website`): Str. Default: https://dotsimulate.com.
- ChatTD Operator (`Chattd`): OP. Default: /dot_lops/ChatTD.
- Clear Log (`Clearlog`): Pulse.
- Convert To Text (`Converttotext`): Toggle. Default: 0. Options: off, on.
Usage Examples

Crawling a Single Page

1. Set 'Target URL / Sitemap / .txt' to the full URL (e.g., https://docs.derivative.ca/Introduction_to_Python).
2. Set 'Crawl Mode' to 'Single Page'.
3. Pulse 'Start Crawl'.
4. Monitor 'Status' and view results in the Output Table DAT.
Crawling from a Sitemap

1. Set 'Target URL / Sitemap / .txt' to the exact URL of the sitemap (e.g., https://example.com/sitemap.xml).
2. Set 'Crawl Mode' to 'Sitemap Batch'.
3. (Optional) Set 'Include/Exclude URL Patterns' to filter URLs from the sitemap.
4. Adjust 'Max Concurrent Sessions' based on your system.
5. Pulse 'Start Crawl'.
6. Monitor 'Status' and 'Progress'.
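Under the hood, the sitemap step boils down to extracting `<loc>` entries from standard sitemap XML. A minimal stand-alone sketch of that idea (not the operator's internals), using only the Python standard library:

```python
import xml.etree.ElementTree as ET

# A small sample sitemap in the standard sitemaps.org format.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> entry from a standard sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

print(sitemap_urls(SITEMAP_XML))
```

In practice the XML would be fetched with `requests.get(...)` first; the sample string here stands in for that response body.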
Recursive Crawl of a Small Site Section

1. Set 'Target URL / Sitemap / .txt' to the starting page (e.g., https://yoursite.com/documentation/).
2. Set 'Crawl Mode' to 'Crawl Site Links'.
3. Set 'Max Depth' (e.g., 3). Be cautious with high values on large sites.
4. (Optional) Set 'Exclude URL Patterns' to avoid specific sections (e.g., */blog* */forum*).
5. Adjust 'Max Concurrent Sessions'.
6. Pulse 'Start Crawl'.
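The include/exclude patterns behave like shell-style wildcards. The operator's exact matching semantics are not documented here; the sketch below shows one plausible interpretation using Python's `fnmatch`, and the `url_allowed` helper is a hypothetical name, not part of the operator's API.

```python
from fnmatch import fnmatch

def url_allowed(url, include_patterns=(), exclude_patterns=()):
    """Keep a URL only if it matches no exclude pattern and, when include
    patterns are given, matches at least one of them."""
    if any(fnmatch(url, p) for p in exclude_patterns):
        return False
    if include_patterns and not any(fnmatch(url, p) for p in include_patterns):
        return False
    return True

exclude = ["*/blog*", "*/forum*"]  # same patterns as the example above
urls = [
    "https://yoursite.com/documentation/setup",
    "https://yoursite.com/blog/2024/news",
    "https://yoursite.com/forum/thread-42",
]
kept = [u for u in urls if url_allowed(u, exclude_patterns=exclude)]
print(kept)  # only the documentation URL survives
```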
Initial Setup (Installation)

1. Ensure the 'ChatTD Operator' parameter points to your ChatTD instance.
2. Use ChatTD's Python Manager to install the 'crawl4ai' package.
3. Return to this operator and pulse 'Install/Update Playwright Browsers'.
4. Monitor the Textport for download progress. Installation is complete when the logs indicate success.
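For reference, the equivalent manual steps in a shell pointed at the same Python environment TouchDesigner uses would look roughly like this; this is a sketch, and the exact Python interpreter path and browser choice depend on your setup:

```shell
# Install the crawling library (pulls in playwright as a dependency)
python -m pip install crawl4ai

# Download the browser binaries Playwright needs
python -m playwright install chromium
```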
Technical Notes

- Dependencies: Requires the `crawl4ai` and `playwright` Python packages, installable via ChatTD. Crucially, Playwright also needs browser binaries, downloaded by pulsing the `Install/Update Playwright Browsers` parameter.
- Resource Usage: Crawling, especially in batch modes (Sitemap, Recursive, Text File), uses headless browsers and can consume significant CPU, RAM, and network bandwidth.
- Concurrency: Adjust `Max Concurrent Sessions` carefully; setting it too high can destabilize TouchDesigner or your system.
- Memory Management: `Memory Threshold (%)` helps prevent crashes on large crawls by pausing new sessions when system RAM usage is high.
- Filtering: Use `Include URL Patterns` and `Exclude URL Patterns` to limit the scope of crawls and avoid unwanted pages or file types. Wildcards (`*`, `?`) are supported.
- Output Format: Content is output as Markdown in the `content` column of the output DAT, ready for ingestion by the Rag Index operator.
- Stopping: Pulsing `Stop Crawl` attempts a graceful shutdown, but currently active browser tasks may take time to fully terminate.
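A concurrency cap like `Max Concurrent Sessions` typically follows a common pattern: gate each crawl task behind a semaphore so only N run at once. The sketch below illustrates that generic pattern with `asyncio`; it is not the operator's actual implementation, and the sleep stands in for a real page fetch.

```python
import asyncio

MAX_CONCURRENT = 5  # mirrors the 'Max Concurrent Sessions' parameter

async def crawl_one(url, sem, results):
    async with sem:                  # at most MAX_CONCURRENT crawls run at once
        await asyncio.sleep(0.01)    # stand-in for a real page fetch/render
        results.append(url)

async def crawl_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    results = []
    await asyncio.gather(*(crawl_one(u, sem, results) for u in urls))
    return results

done = asyncio.run(crawl_all([f"https://example.com/p{i}" for i in range(12)]))
print(len(done))
```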
Related Operators

- Rag Index: Ingests the output of this operator to create a searchable index.
- ChatTD: Provides core services (dependency management, asynchronous task execution) required by this operator.
- Source Webscraper: An alternative web-scraping operator using a different backend (`aiohttp`, `trafilatura`). It may be lighter weight for simpler scraping tasks that do not require full browser rendering.