Source Crawl4ai
The Source Crawl4ai LOP uses the crawl4ai library to fetch content from web pages using headless browsers (via Playwright). It renders JavaScript-heavy pages, extracts content, converts it to Markdown, and structures the output into a table compatible with the RAG Index operator. It supports single page, recursive site crawling, sitemap batch, and text file URL list modes with URL filtering, memory-adaptive concurrency, and crawl history deduplication.

Requirements
- Python Packages: crawl4ai and playwright (installed via the Dependencies page)
- Playwright Browsers: Browser binaries must be downloaded after package installation using the Install/Update Browsers button on the Dependencies page
Input/Output
Inputs
None by default. Optionally, reference a table DAT via the URL Table parameter when using Table Input as the crawl source.
Outputs
- Output Table (DAT): Crawled content formatted for the RAG Index operator with columns: doc_id, filename (source URL), content (Markdown), metadata (JSON), source_path, and timestamp.
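Downstream scripts can treat each output row as a small record; a plain-Python sketch of assembling one such row (the make_output_row helper and the metadata contents are hypothetical illustrations — only the column names come from the list above):

```python
import hashlib
import json
from datetime import datetime, timezone

def make_output_row(url: str, markdown: str, source_path: str = "") -> dict:
    """Assemble one row in the output-table schema listed above (illustrative only)."""
    return {
        "doc_id": hashlib.md5(url.encode()).hexdigest(),  # stable ID derived from the URL (assumed scheme)
        "filename": url,                                  # source URL
        "content": markdown,                              # page content converted to Markdown
        "metadata": json.dumps({"source": "crawl4ai"}),   # JSON-encoded metadata (hypothetical fields)
        "source_path": source_path,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

row = make_output_row("https://example.com/page", "# Title\n\nBody text")
```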
Initial Setup
- On the Dependencies page, pulse Install Dependencies to install the crawl4ai and playwright packages.
- After installation completes, pulse Install/Update Browsers to download the required browser binaries.
- Monitor the Textport for progress. Installation is complete when logs indicate success.
- Optionally pulse Check Dependencies to verify everything is ready.
Usage Examples
Crawling a Single Page
- On the Crawl Config page, set Crawl Mode to “Single Page”.
- Enter the full URL in Target URL / Sitemap / .txt (e.g., https://docs.derivative.ca/Introduction_to_Python).
- Pulse Start Crawl.
- Monitor Current Status and view results in the output table.
Recursive Crawl of a Site Section
- Set Crawl Mode to “Crawl Site Links”.
- Enter the starting page URL in Target URL / Sitemap / .txt (e.g., https://yoursite.com/documentation/).
- Set Max Depth (Recursive) to control how many links deep to follow (e.g., 3). Be cautious with high values on large sites.
- Optionally set Exclude URL Patterns to skip specific sections (e.g., */blog* */forum*).
- Adjust Max Concurrent Sessions based on your system resources.
- Pulse Start Crawl.
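Depth in this mode counts link-following levels from the start page (depth 1 is the start URL alone, matching the Max Depth parameter described below). A toy breadth-first sketch of that semantics over a hypothetical in-memory link graph — not the operator's actual crawler:

```python
from collections import deque

# Hypothetical link graph: page -> internal links found on it
LINKS = {
    "/start": ["/a", "/b"],
    "/a": ["/c"],
    "/b": [],
    "/c": ["/d"],
}

def crawl_bfs(start: str, max_depth: int) -> list[str]:
    """Collect pages reachable within max_depth levels (depth 1 = start page only)."""
    seen, order = {start}, []
    queue = deque([(start, 1)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)          # "crawl" this page
        if depth >= max_depth:
            continue               # don't follow links past the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

With depth 1 only `/start` is visited; depth 2 adds the pages it links to, and so on — which is why high depth values on large sites grow the crawl quickly.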
Crawling from a Sitemap
- Set Crawl Mode to “Sitemap Batch”.
- Enter the exact sitemap URL in Target URL / Sitemap / .txt (e.g., https://example.com/sitemap.xml). This must point directly to the XML file, not just the base domain.
- Optionally set Include URL Patterns or Exclude URL Patterns to filter which URLs from the sitemap are crawled.
- Pulse Start Crawl and monitor Progress.
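Sitemap Batch mode needs the XML file itself because the URL list is read from its <loc> entries; a stdlib sketch of that extraction (illustrative only, not the operator's code):

```python
import xml.etree.ElementTree as ET

# Inline sample standing in for a fetched sitemap.xml
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>"""

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> entry from a sitemap document, in order."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]
```

This is also why pointing the parameter at the base domain fails: an HTML homepage has no <urlset>/<loc> structure to parse.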
Crawling URLs from a Table
- Set Crawl Source to “Table Input”.
- Reference a table DAT in URL Table. The operator will extract all valid URLs from any cell in the table.
- Pulse Start Crawl. The operator automatically batch-crawls all extracted URLs with deduplication.
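The extract-from-any-cell behavior can be pictured as a regex scan over every cell value; a rough sketch under the assumption that a simple http(s) pattern is used (the operator's actual matching rules may differ):

```python
import re

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(table_rows: list[list[str]]) -> list[str]:
    """Pull every http(s) URL out of every cell, deduplicated in order."""
    seen, urls = set(), []
    for row in table_rows:
        for cell in row:
            for url in URL_RE.findall(cell):
                if url not in seen:
                    seen.add(url)
                    urls.append(url)
    return urls

rows = [["see https://a.com/x", "note"], ["https://b.com", "https://a.com/x"]]
```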
Using URL Filters
Include URL Patterns and Exclude URL Patterns accept space-separated wildcard patterns using * (any characters) and ? (single character):
- /products/* : only include URLs under /products/
- *.pdf *.zip : exclude binary file downloads
- */login* */account* : exclude login and account pages
If include patterns are set, a URL must match at least one include pattern AND not match any exclude pattern to be crawled.
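This include/exclude decision maps naturally onto Python's fnmatch wildcards, where * and ? behave the same way; a sketch of the rule, not the operator's exact matcher:

```python
from fnmatch import fnmatchcase

def should_crawl(url: str, include: list[str], exclude: list[str]) -> bool:
    """A URL is crawled if it matches no exclude pattern and, when include
    patterns are set, at least one include pattern."""
    if any(fnmatchcase(url, pat) for pat in exclude):
        return False  # exclude wins regardless of include patterns
    if include and not any(fnmatchcase(url, pat) for pat in include):
        return False  # include list set, but nothing matched
    return True       # no include list means everything not excluded passes
```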
Crawl History and Deduplication
When Avoid Repeats is enabled (default), the operator tracks previous crawl attempts. Repeated crawl requests with the same URL, mode, and depth return cached results instantly instead of re-crawling. This is especially useful when an agent makes repeated requests for the same content.
- Pulse Clear Scrape History to reset the cache and force fresh crawls
- Disable Avoid Repeats to always crawl fresh regardless of history
New crawl results are automatically deduplicated against the output table — duplicate URLs are skipped. Enable Clear Table on Crawl to clear previous results before each new crawl instead.
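The changelog notes that duplicate detection uses an MD5 hash of the crawl settings; one plausible sketch of such a cache key (the exact fields the operator hashes are an assumption):

```python
import hashlib

def crawl_cache_key(url: str, mode: str, depth: int) -> str:
    """Derive a stable cache key from the settings that define a crawl."""
    payload = f"{url}|{mode}|{depth}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

Two requests with identical settings produce the same key and can be answered from history; changing any setting (e.g., the depth) yields a new key and a fresh crawl.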
Agent Tool Integration
This operator exposes two tools that allow Agent and Gemini Live LOPs to crawl single web pages and recursively crawl full websites to extract content as Markdown.
Use the Tool Debugger operator to inspect exact tool definitions, schemas, and parameters.
When connected to an Agent LOP, the agent can call two tools:
- crawl_single_page: Fetches the content of a single URL. Best for retrieving one specific page.
- crawl_full_website_recursively: Follows internal links from a starting URL, crawling up to 20 pages with a max depth of 3. Best for understanding an entire site’s content.
Configuring Agent Behavior
On the Agents page:
- Agent Execution Mode: Choose “Wait for Completion” to have the agent wait for crawl results before responding, or “Background Processing” to start the crawl and let the agent continue immediately.
- Agent Return Content: Controls how much content is sent back to the agent — “Status Only”, “Summary Only”, “Truncated Content” (default), or “Full Content”. Full content is always added to the output table regardless of this setting.
- Agent Calls Add to Table: When enabled (default), agent crawls populate the output table so you can see what was crawled and feed it to downstream operators like RAG Index.
- Truncate Content (chars): Sets the character limit per page when using “Truncated Content” mode (default: 2000).
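The per-page cut applied in “Truncated Content” mode can be sketched as a simple character limit (truncate_for_agent is a hypothetical helper; whether the operator appends a marker is an assumption):

```python
def truncate_for_agent(content: str, limit: int = 2000) -> str:
    """Return at most `limit` characters of page content for the agent;
    longer pages are cut and flagged."""
    if len(content) <= limit:
        return content
    return content[:limit] + "\n[truncated]"
```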
Example Agent Workflow
“Summarize the main points from https://example.com/news/latest and give me an overview of their product pages.”
The agent could call crawl_single_page for the news article, then crawl_full_website_recursively for the products section, and use both results to generate a comprehensive response.
Best Practices
- Start with moderate concurrency: Set Max Concurrent Sessions to 5-10 and increase based on system stability. Too many parallel browser sessions can exhaust RAM.
- Use the Memory Threshold: The Memory Threshold (%) parameter pauses new browser sessions when system RAM usage exceeds the threshold. Lower it if you experience memory issues on large crawls.
- Filter aggressively on large sites: Use Include URL Patterns and Exclude URL Patterns to avoid crawling irrelevant pages, binary files, or login pages.
- Prefer “Background Processing” for collection tasks: When an agent just needs to trigger a crawl without waiting for results, set Agent Execution Mode to “Background Processing” to avoid blocking the agent.
- Use “Truncated Content” for agent responses: Full page content can overwhelm agent context windows. The default “Truncated Content” mode returns the first portion of each page, which is usually sufficient for summarization tasks.
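The interplay of Max Concurrent Sessions and Memory Threshold (%) amounts to a gate checked before launching each new browser session; a plain-Python simulation of that behavior (the real adaptive dispatcher lives inside crawl4ai):

```python
def can_launch_session(active: int, max_concurrent: int,
                       mem_percent: float, mem_threshold: float = 70.0) -> bool:
    """Allow a new browser session only while both the concurrency cap and
    the memory threshold leave headroom."""
    return active < max_concurrent and mem_percent < mem_threshold
```

Lowering either value reduces peak RAM use at the cost of crawl speed, which is why both knobs appear in the memory-related troubleshooting advice below.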
Troubleshooting
- “Dependencies not satisfied”: Go to the Dependencies page, pulse Install Dependencies, then Install/Update Browsers.
- Crawl returns no content: Some sites block headless browsers. Try a different URL or check if the site requires authentication.
- “Crawl already in progress”: Only one crawl can run at a time. Pulse Stop Crawl to cancel the current operation, then try again.
- Recursion depth exceeded: The recursive crawl tool limits depth to 3. For deeper crawls, use the manual Crawl Site Links mode with a higher Max Depth setting.
- High memory usage: Lower Max Concurrent Sessions and Memory Threshold (%). Large crawls with many parallel browsers can consume significant RAM.
Parameters
Crawl Config
op('source_crawl4ai').par.Url Str Primary input URL or path. Its interpretation depends on the selected 'Crawl Mode'. - Single Page/Recursive Mode: Provide the full starting URL of the website you want to crawl (e.g., https://derivative.ca). - Sitemap Batch Mode: Provide the **exact URL** pointing directly to the sitemap XML file (e.g., https://example.com/sitemap.xml). Do NOT just provide the base domain. - Text File (URLs) Mode: Provide the **absolute file system path** to a local text file (.txt) containing a list of URLs, one per line. Lines starting with # are ignored. Ensure the URL or path is correct for the chosen mode.
- Default: https://dotdocs.netlify.app/
op('source_crawl4ai').par.Urltable OP Table containing URLs to crawl. URLs will be extracted from any cell containing a valid URL.
- Default: "" (Empty String)
op('source_crawl4ai').par.Includepatterns Str Space-separated list of URL patterns that URLs MUST match to be included in the crawl. Uses TouchDesigner-style wildcards: * : Matches any sequence of zero or more characters. ? : Matches any single character. If left blank, all URLs are considered for inclusion (unless they match an exclude pattern). If patterns are provided, a URL must match at least one of them AND not match any exclude pattern to be crawled. Examples: - /products/* : Only include URLs starting with /products/ - *.pdf *.docx : Only include URLs ending with .pdf or .docx - https://site.com/blog?id=* : Only include blog posts on site.com
- Default: "" (Empty String)
op('source_crawl4ai').par.Excludepatterns Str Space-separated list of URL patterns. URLs matching ANY of these patterns will be excluded from the crawl, regardless of include patterns. Uses TouchDesigner-style wildcards: * : Matches any sequence of zero or more characters. ? : Matches any single character. Examples: - */login* */account* : Exclude login and account pages. - *.zip *.mp4 : Exclude binary file downloads. - https://othersite.com/* : Exclude all URLs from othersite.com (useful if recursion picks up unwanted external links somehow).
- Default: "" (Empty String)
op('source_crawl4ai').par.Status Str Displays the current operational status of the crawler (e.g., Idle, Running crawl..., Completed, Error..., Stopping...). Assumes this parameter is linked via expression to the logger output.
- Default: "" (Empty String)
op('source_crawl4ai').par.Progress Float Shows the estimated percentage progress (0-100) of the current crawl operation. Accuracy is highest for Sitemap and Text File modes where the total URL count is known beforehand.
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('source_crawl4ai').par.Urlsprocessed Int Counts the total number of URLs attempted or processed (both successfully and unsuccessfully) during the last crawl operation.
- Default: 0
- Range: 0 to 1
- Slider Range: 0 to 1
op('source_crawl4ai').par.Startcrawl Pulse (Pulse Button) Initiates the crawl process based on the current parameters. Whether previous results are cleared depends on the "Clear Table on Crawl" setting.
- Default: False
op('source_crawl4ai').par.Stopcrawl Pulse (Pulse Button) Attempts to gracefully stop an ongoing crawl process. May take a moment for active browser sessions to finish or timeout.
- Default: False
op('source_crawl4ai').par.Clearontable Toggle When enabled, ANY crawl operation (manual or agent) will clear the output table before adding new results, removing all previous data. When disabled (default), new crawl results are added to existing data with automatic deduplication to prevent duplicate URLs. This setting affects both manual crawls and agent tool calls equally.
- Default: False
op('source_crawl4ai').par.Usehistory Toggle When enabled (default), the operator maintains a history of previous crawl attempts and provides instant responses for duplicate requests. This dramatically improves response time for repeated crawls by returning cached results instead of re-crawling. When disabled, every crawl request will be executed fresh.
- Default: True
op('source_crawl4ai').par.Clearhistory Pulse (Pulse Button) Clears all entries from the scrape history table. Use this to reset the crawl cache and force fresh crawls for all URLs. Useful when website content has changed significantly.
- Default: False
op('source_crawl4ai').par.Maxdepth Int Maximum depth for recursive crawling. Only used when 'Crawl Mode' is set to 'Recursive Internal'. A depth of 1 crawls only the starting URL. A depth of 2 crawls the starting URL and any internal pages directly linked from it. A depth of 3 crawls those pages and pages linked from them, and so on. Increasing depth can significantly increase crawl time and the number of pages processed. Be mindful of website structure and size.
- Default: 2
- Range: 1 to 10
- Slider Range: 1 to 10
op('source_crawl4ai').par.Maxconcurrent Int Maximum number of parallel browser sessions used during batch crawls (Sitemap, Recursive, Text File modes). Higher values can significantly speed up crawls on multi-URL tasks but increase RAM and CPU usage. Setting this too high relative to your system resources can lead to instability or crashes. Start with a moderate value (e.g., 5-10) and adjust based on performance and system stability.
- Default: 5
- Range: 1 to 20
- Slider Range: 1 to 20
op('source_crawl4ai').par.Memorythreshold Float Memory usage threshold (as a percentage of total system RAM) for crawl4ai's adaptive dispatcher. Only used in batch crawl modes (Sitemap, Recursive, Text File). The dispatcher monitors system memory usage and pauses launching new browser sessions if usage exceeds this threshold, preventing memory exhaustion on large crawls. Default is 70%. Lower it if you experience memory issues; raise it cautiously if you have ample RAM and want to maximize concurrency.
- Default: 70.0
- Range: 30 to 95
- Slider Range: 30 to 95
op('source_crawl4ai').par.Clearoutput Pulse (Pulse Button) Clears all data from the output table DAT and resets status/progress parameters to their default values.
- Default: False
op('source_crawl4ai').par.Displayfile Str
- Default: "" (Empty String)
op('source_crawl4ai').par.Selectdoc Int Selects a row from the output table for detailed viewing when the Display parameter is set to Content. Use the slider or enter a row number (starting from 1).
- Default: 70
- Range: 0 to 1
- Slider Range: 1 to 1
Agents
op('source_crawl4ai').par.Agenttotable Toggle When enabled (default), agent tool calls will add crawled content to the output table in addition to returning it to the agent. This allows you to see what the agent crawled in the operator's UI and use it with other operators like RAG indexers. When disabled, agent crawls only return content to the agent without populating the table.
- Default: True
op('source_crawl4ai').par.Truncatecontent Int Sets the character limit per page of content returned to the agent in “Truncated Content” mode.
- Default: 2000
- Range: 0 to 1
- Slider Range: 100 to 10000
Dependencies
op('source_crawl4ai').par.Installdependencies Pulse Install crawl4ai and playwright packages.
- Default: False
op('source_crawl4ai').par.Checkdependencies Pulse Check status of required dependencies.
- Default: False
op('source_crawl4ai').par.Installbrowsers Pulse Run "playwright install" to download necessary browser binaries.
- Default: False
Changelog
v1.4.0 (2025-09-24)
Added better dependency/installation handling and a Truncate Content parameter.
v1.3.0 (2025-09-01)
- Added Table Input mode
- Better deduplication
- Better ability to gather multiple sources from multiple agents at a time
v1.2.0 (2025-01-16)
Agent Tool Control System
- Added Agent Execution Mode parameter (Wait for Completion / Background Processing)
- Added Agent Return Content parameter (Status Only / Summary Only / Truncated Content / Full Content)
- Added Agent Calls Add to Table toggle
Scrape History System
- New history table tracks all crawl attempts with status and timestamps
- Instant cache responses for duplicate requests (sub-second vs 30+ seconds)
- Smart duplicate detection using MD5 hash of crawl settings
- Use Scrape History toggle and Clear Scrape History button
Duplicate Handling & Performance
- Transparent duplicate reporting in agent responses
- Automatic deduplication for all crawl operations
- Clear Table on Crawl parameter for unified table clearing control
- Background processing allows agents to continue without waiting
Stability Improvements
- Max depth limited to 3 to prevent recursion errors
- System recursion limit increased to 2000
- Proper RecursionError handling with user-friendly messages
- Failed crawls tracked in history to prevent repeated attempts
Enhanced Agent Responses
Agent responses now include duplicate counts, URLs, and cache status for better transparency.
Breaking Changes
None. All changes are backward compatible.
v1.1.0 (2025-07-03)
Added a GetTool function to the operator; it returns the tool definition for crawling a website.
v1.0.0 (2025-04-30)
Created the source operator for crawl4ai.