
Source Crawl4ai

v1.4.0 (Updated)

The Source Crawl4ai LOP uses the crawl4ai library to fetch content from web pages using headless browsers (via Playwright). It renders JavaScript-heavy pages, extracts content, converts it to Markdown, and structures the output into a table compatible with the RAG Index operator. It supports Single Page, Crawl Site Links (recursive), Sitemap Batch, and Text File (URLs) modes, with URL filtering, memory-adaptive concurrency, and crawl-history deduplication.

Source Crawl4ai UI

  • Python Packages: crawl4ai and playwright (installed via the Dependencies page)
  • Playwright Browsers: Browser binaries must be downloaded after package installation using the Install/Update Browsers button on the Dependencies page

Inputs: None by default. Optionally, reference a table DAT via the URL Table parameter when using Table Input as the crawl source.

  • Output Table (DAT): Crawled content formatted for the RAG Index operator with columns: doc_id, filename (source URL), content (Markdown), metadata (JSON), source_path, and timestamp.
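As a sketch, one row of that schema can be assembled in plain Python. The doc_id scheme (an MD5 of the URL) and the metadata fields shown here are assumptions for illustration, not the operator's actual implementation:

```python
import hashlib
import json
from datetime import datetime, timezone

def make_output_row(url: str, markdown: str) -> dict:
    """Build a row matching the documented output-table columns.
    The doc_id scheme (MD5 of the URL) is an assumption for illustration."""
    return {
        "doc_id": hashlib.md5(url.encode("utf-8")).hexdigest(),
        "filename": url,                      # source URL
        "content": markdown,                  # page content as Markdown
        "metadata": json.dumps({"source": "crawl4ai"}),  # hypothetical metadata
        "source_path": url,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

row = make_output_row("https://example.com/page", "# Title\n\nBody text.")
```

Downstream operators like RAG Index read the content and metadata columns, so keeping content as Markdown and metadata as a JSON string matches the documented contract.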
To install dependencies:
  1. On the Dependencies page, pulse Install Dependencies to install the crawl4ai and playwright packages.
  2. After installation completes, pulse Install/Update Browsers to download the required browser binaries.
  3. Monitor the Textport for progress. Installation is complete when logs indicate success.
  4. Optionally pulse Check Dependencies to verify everything is ready.
To crawl a single page:
  1. On the Crawl Config page, set Crawl Mode to “Single Page”.
  2. Enter the full URL in Target URL / Sitemap / .txt (e.g., https://docs.derivative.ca/Introduction_to_Python).
  3. Pulse Start Crawl.
  4. Monitor Current Status and view results in the output table.
To crawl a site recursively:
  1. Set Crawl Mode to “Crawl Site Links”.
  2. Enter the starting page URL in Target URL / Sitemap / .txt (e.g., https://yoursite.com/documentation/).
  3. Set Max Depth (Recursive) to control how many links deep to follow (e.g., 3). Be cautious with high values on large sites.
  4. Optionally set Exclude URL Patterns to skip specific sections (e.g., */blog* */forum*).
  5. Adjust Max Concurrent Sessions based on your system resources.
  6. Pulse Start Crawl.
To batch-crawl a sitemap:
  1. Set Crawl Mode to “Sitemap Batch”.
  2. Enter the exact sitemap URL in Target URL / Sitemap / .txt (e.g., https://example.com/sitemap.xml). This must point directly to the XML file, not just the base domain.
  3. Optionally set Include URL Patterns or Exclude URL Patterns to filter which URLs from the sitemap are crawled.
  4. Pulse Start Crawl and monitor Progress.
To crawl URLs from a table:
  1. Set Crawl Source to “Table Input”.
  2. Reference a table DAT in URL Table. The operator will extract all valid URLs from any cell in the table.
  3. Pulse Start Crawl. The operator automatically batch-crawls all extracted URLs with deduplication.
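The extraction step can be sketched as a simple regex scan over all cells; the actual URL-matching rules the operator uses may differ:

```python
import re

# Simple sketch of pulling URLs out of arbitrary table cells;
# the real operator's extraction rules may be stricter or looser.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(cells):
    seen, urls = set(), []
    for cell in cells:
        for url in URL_RE.findall(str(cell)):
            if url not in seen:          # deduplicate while keeping order
                seen.add(url)
                urls.append(url)
    return urls

cells = ["see https://a.com/x", "plain text", "https://a.com/x and https://b.com"]
extract_urls(cells)  # ['https://a.com/x', 'https://b.com']
```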

Include URL Patterns and Exclude URL Patterns accept space-separated wildcard patterns using * (any characters) and ? (single character):

  • /products/* — only include URLs under /products/
  • *.pdf *.zip — exclude binary file downloads
  • */login* */account* — exclude login and account pages

If include patterns are set, a URL must match at least one include pattern AND not match any exclude pattern to be crawled.
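Since the patterns use * and ? wildcards, the documented rule maps closely onto Python's fnmatch; this is a sketch of the logic, not the operator's code:

```python
from fnmatch import fnmatch

def should_crawl(url, include_patterns, exclude_patterns):
    """Sketch of the documented filter rule: with includes set, a URL must
    match at least one include pattern AND no exclude pattern."""
    if any(fnmatch(url, p) for p in exclude_patterns):
        return False                     # excludes always win
    if include_patterns:
        return any(fnmatch(url, p) for p in include_patterns)
    return True                          # no includes: everything passes

should_crawl("https://site.com/products/item1", ["*/products/*"], ["*.pdf"])      # True
should_crawl("https://site.com/products/manual.pdf", ["*/products/*"], ["*.pdf"]) # False
```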

When Avoid Repeats is enabled (default), the operator tracks previous crawl attempts. Repeated crawl requests with the same URL, mode, and depth return cached results instantly instead of re-crawling. This is especially useful when an agent makes repeated requests for the same content.

  • Pulse Clear Scrape History to reset the cache and force fresh crawls
  • Disable Avoid Repeats to always crawl fresh regardless of history
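The changelog notes that duplicate detection hashes the crawl settings with MD5. A minimal sketch of such a cache, assuming the key covers URL, mode, and depth (the exact fields are an assumption):

```python
import hashlib

def crawl_cache_key(url: str, mode: str, depth: int) -> str:
    """Hypothetical cache key: the changelog says duplicates are detected via
    an MD5 hash of the crawl settings; the exact fields are an assumption."""
    settings = f"{url}|{mode}|{depth}"
    return hashlib.md5(settings.encode("utf-8")).hexdigest()

history = {}

def cached_crawl(url, mode, depth, crawl_fn):
    key = crawl_cache_key(url, mode, depth)
    if key in history:               # repeat request: return instantly
        return history[key]
    result = crawl_fn(url)           # otherwise crawl fresh and record it
    history[key] = result
    return result
```

Clearing the history dict corresponds to pulsing Clear Scrape History; bypassing the lookup corresponds to disabling Avoid Repeats.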

New crawl results are automatically deduplicated against the output table — duplicate URLs are skipped. Enable Clear Table on Crawl to clear previous results before each new crawl instead.
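The table-level deduplication described above can be sketched as a URL-keyed skip, assuming the URL lives in the filename column as documented:

```python
def append_deduplicated(existing_rows, new_rows):
    """Sketch of table-level deduplication: skip any new row whose URL
    (the documented 'filename' column) is already present."""
    seen = {row["filename"] for row in existing_rows}
    added = []
    for row in new_rows:
        if row["filename"] not in seen:
            seen.add(row["filename"])
            existing_rows.append(row)
            added.append(row)
    return added
```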

🔧 GetTool Enabled (2 tools)

This operator exposes 2 tools that allow Agent and Gemini Live LOPs to crawl single web pages and recursively crawl full websites to extract content as Markdown.

When connected to an Agent LOP, the agent can call two tools:

  • crawl_single_page: Fetches the content of a single URL. Best for retrieving one specific page.
  • crawl_full_website_recursively: Follows internal links from a starting URL, crawling up to 20 pages with a max depth of 3. Best for understanding an entire site’s content.

On the Agents page:

  • Agent Execution Mode: Choose “Wait for Completion” to have the agent wait for crawl results before responding, or “Background Processing” to start the crawl and let the agent continue immediately.
  • Agent Return Content: Controls how much content is sent back to the agent — “Status Only”, “Summary Only”, “Truncated Content” (default), or “Full Content”. Full content is always added to the output table regardless of this setting.
  • Agent Calls Add to Table: When enabled (default), agent crawls populate the output table so you can see what was crawled and feed it to downstream operators like RAG Index.
  • Truncatecontent (chars): Sets the character limit per page when using “Truncated Content” mode (default: 2000).
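The four return-content modes can be sketched as a single function. The internal option names (none, summary, truncated, full) come from the parameter's menu options, but the payload field names here are illustrative, not the operator's actual response format:

```python
def agent_payload(results, mode, truncate_chars=2000):
    """Sketch of the documented Agent Return Content modes; the dict keys
    are illustrative, not the operator's actual response format."""
    urls = [r["url"] for r in results]
    payload = {"status": "success", "pages": len(results)}
    if mode == "none":            # "Status Only": just status and page count
        return payload
    payload["urls"] = urls        # "Summary Only" adds the list of URLs
    if mode == "truncated":       # first N characters of each page
        payload["content"] = [r["content"][:truncate_chars] for r in results]
    elif mode == "full":          # complete content from all pages
        payload["content"] = [r["content"] for r in results]
    return payload
```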
Example agent prompt: "Summarize the main points from https://example.com/news/latest and give me an overview of their product pages."

The agent could call crawl_single_page for the news article, then crawl_full_website_recursively for the products section, and use both results to generate a comprehensive response.

  • Start with moderate concurrency: Set Max Concurrent Sessions to 5-10 and increase based on system stability. Too many parallel browser sessions can exhaust RAM.
  • Use the Memory Threshold: The Memory Threshold (%) parameter pauses new browser sessions when system RAM usage exceeds the threshold. Lower it if you experience memory issues on large crawls.
  • Filter aggressively on large sites: Use Include URL Patterns and Exclude URL Patterns to avoid crawling irrelevant pages, binary files, or login pages.
  • Prefer “Background Processing” for collection tasks: When an agent just needs to trigger a crawl without waiting for results, set Agent Execution Mode to “Background Processing” to avoid blocking the agent.
  • Use “Truncated Content” for agent responses: Full page content can overwhelm agent context windows. The default “Truncated Content” mode returns the first portion of each page, which is usually sufficient for summarization tasks.
  • “Dependencies not satisfied”: Go to the Dependencies page, pulse Install Dependencies, then Install/Update Browsers.
  • Crawl returns no content: Some sites block headless browsers. Try a different URL or check if the site requires authentication.
  • “Crawl already in progress”: Only one crawl can run at a time. Pulse Stop Crawl to cancel the current operation, then try again.
  • Recursion depth exceeded: The recursive crawl tool limits depth to 3. For deeper crawls, use the manual Crawl Site Links mode with a higher Max Depth setting.
  • High memory usage: Lower Max Concurrent Sessions and Memory Threshold (%). Large crawls with many parallel browsers can consume significant RAM.
Crawl Source (Crawlsource) op('source_crawl4ai').par.Crawlsource Menu

Source of URLs to crawl: use the URL parameter, or extract URLs from a table.

Default:
url
Options:
url, table
Target URL / Sitemap / .txt (Url) op('source_crawl4ai').par.Url Str

Primary input URL or path. Its interpretation depends on the selected Crawl Mode:

  • Single Page / Recursive Mode: Provide the full starting URL of the website you want to crawl (e.g., https://derivative.ca).
  • Sitemap Batch Mode: Provide the **exact URL** pointing directly to the sitemap XML file (e.g., https://example.com/sitemap.xml). Do NOT just provide the base domain.
  • Text File (URLs) Mode: Provide the **absolute file system path** to a local text file (.txt) containing a list of URLs, one per line. Lines starting with # are ignored.

Ensure the URL or path is correct for the chosen mode.

Default:
https://dotdocs.netlify.app/
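For Text File (URLs) mode, the documented format (one URL per line, # comment lines ignored) can be parsed with a few lines of Python; this is a sketch, not the operator's parser:

```python
def read_url_list(path):
    """Sketch of the documented .txt format: one URL per line,
    blank lines and lines starting with '#' are ignored."""
    urls = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls
```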
URL Table (Urltable) op('source_crawl4ai').par.Urltable OP

Table containing URLs to crawl. URLs will be extracted from any cell containing a valid URL.

Default:
"" (Empty String)
Include URL Patterns (Includepatterns) op('source_crawl4ai').par.Includepatterns Str

Space-separated list of URL patterns that URLs MUST match to be included in the crawl. Uses TouchDesigner-style wildcards: * matches any sequence of zero or more characters; ? matches any single character. If left blank, all URLs are considered for inclusion (unless they match an exclude pattern). If patterns are provided, a URL must match at least one of them AND not match any exclude pattern to be crawled. Examples:

  • /products/* : Only include URLs starting with /products/
  • *.pdf *.docx : Only include URLs ending with .pdf or .docx
  • https://site.com/blog?id=* : Only include blog posts on site.com

Default:
"" (Empty String)
Exclude URL Patterns (Excludepatterns) op('source_crawl4ai').par.Excludepatterns Str

Space-separated list of URL patterns. URLs matching ANY of these patterns will be excluded from the crawl, regardless of include patterns. Uses TouchDesigner-style wildcards: * matches any sequence of zero or more characters; ? matches any single character. Examples:

  • */login* */account* : Exclude login and account pages.
  • *.zip *.mp4 : Exclude binary file downloads.
  • https://othersite.com/* : Exclude all URLs from othersite.com (useful if recursion picks up unwanted external links).

Default:
"" (Empty String)
Current Status (Status) op('source_crawl4ai').par.Status Str

Displays the current operational status of the crawler (e.g., Idle, Running crawl..., Completed, Error..., Stopping...). This parameter is linked via an expression to the logger output.

Default:
"" (Empty String)
Progress (Progress) op('source_crawl4ai').par.Progress Float

Shows the estimated percentage progress (0-100) of the current crawl operation. Accuracy is highest for Sitemap and Text File modes where the total URL count is known beforehand.

Default:
0.0
Range:
0 to 1
Slider Range:
0 to 1
URLs Processed (Urlsprocessed) op('source_crawl4ai').par.Urlsprocessed Int

Counts the total number of URLs attempted or processed (both successfully and unsuccessfully) during the last crawl operation.

Default:
0
Range:
0 to 1
Slider Range:
0 to 1
Caution: Displaying large index tables in the viewer can be heavy (Header)
Start Crawl (Startcrawl) op('source_crawl4ai').par.Startcrawl Pulse

(Pulse Button) Initiates the crawl process based on the current parameters. Whether previous results are cleared depends on the "Clear Table on Crawl" setting.

Default:
False
Stop Crawl (Stopcrawl) op('source_crawl4ai').par.Stopcrawl Pulse

(Pulse Button) Attempts to gracefully stop an ongoing crawl process. May take a moment for active browser sessions to finish or timeout.

Default:
False
Clear Table on Crawl (Clearontable) op('source_crawl4ai').par.Clearontable Toggle

When enabled, ANY crawl operation (manual or agent) will clear the output table before adding new results, removing all previous data. When disabled (default), new crawl results are added to existing data with automatic deduplication to prevent duplicate URLs. This setting affects both manual crawls and agent tool calls equally.

Default:
False
Avoid Repeats (Usehistory) op('source_crawl4ai').par.Usehistory Toggle

When enabled (default), the operator maintains a history of previous crawl attempts and provides instant responses for duplicate requests. This dramatically improves response time for repeated crawls by returning cached results instead of re-crawling. When disabled, every crawl request will be executed fresh.

Default:
True
Clear Scrape History (Clearhistory) op('source_crawl4ai').par.Clearhistory Pulse

(Pulse Button) Clears all entries from the scrape history table. Use this to reset the crawl cache and force fresh crawls for all URLs. Useful when website content has changed significantly.

Default:
False
Crawl Mode (Crawlmode) op('source_crawl4ai').par.Crawlmode Menu

Selects the operational mode for the crawler, determining how the 'Target URL / Sitemap / .txt' parameter is used:

  • Single Page: Crawls only the single URL provided. No links are followed.
  • Sitemap Batch: Fetches the specified XML sitemap, extracts all URLs within it, and crawls them in parallel.
  • Crawl Site Links: Starts at the provided URL and follows internal links (links within the same domain) up to the specified 'Max Depth'.
  • Text File (URLs): Reads a list of URLs from the specified local text file and crawls them in parallel.

Default:
recursive
Options:
recursive, single, sitemap, txtfile
Max Depth (Recursive) (Maxdepth) op('source_crawl4ai').par.Maxdepth Int

Maximum depth for recursive crawling. Only used when 'Crawl Mode' is set to 'Crawl Site Links'. A depth of 1 crawls only the starting URL. A depth of 2 crawls the starting URL and any internal pages directly linked from it. A depth of 3 crawls those pages and pages linked from them, and so on. Increasing depth can significantly increase crawl time and the number of pages processed. Be mindful of website structure and size.

Default:
2
Range:
1 to 10
Slider Range:
1 to 10
Max Concurrent Sessions (Maxconcurrent) op('source_crawl4ai').par.Maxconcurrent Int

Maximum number of parallel browser sessions used during batch crawls (Sitemap, Recursive, Text File modes). Higher values can significantly speed up crawls on multi-URL tasks but increase RAM and CPU usage. Setting this too high relative to your system resources can lead to instability or crashes. Start with a moderate value (e.g., 5-10) and adjust based on performance and system stability.

Default:
5
Range:
1 to 20
Slider Range:
1 to 20
Memory Threshold (%) (Memorythreshold) op('source_crawl4ai').par.Memorythreshold Float

Memory usage threshold (as a percentage of total system RAM) for crawl4ai's adaptive dispatcher. Only used in batch crawl modes (Sitemap, Recursive, Text File). The dispatcher monitors system memory usage and pauses launching new browser sessions if usage exceeds this threshold, preventing memory exhaustion on large crawls. Default is 70%. Lower it if you experience memory issues; raise it cautiously if you have ample RAM and want to maximize concurrency.

Default:
70.0
Range:
30 to 95
Slider Range:
30 to 95
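The dispatcher's gating behavior can be sketched as a simple threshold check; memory_percent_fn stands in for a real probe such as psutil.virtual_memory().percent (the function name here is hypothetical):

```python
def can_launch_session(memory_percent_fn, threshold=70.0):
    """Sketch of the adaptive-dispatcher gate: hold back new browser
    sessions while system RAM usage is at or above the threshold.
    memory_percent_fn is a stand-in for a real memory probe."""
    return memory_percent_fn() < threshold

can_launch_session(lambda: 55.0)        # True: under the 70% default
can_launch_session(lambda: 85.0)        # False: dispatcher would pause
```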
Clear Table Data (Clearoutput) op('source_crawl4ai').par.Clearoutput Pulse

(Pulse Button) Clears all data from the output table DAT and resets status/progress parameters to their default values.

Default:
False
Display (Display) op('source_crawl4ai').par.Display Menu

Selects which internal DAT view is shown in the operator viewer panel: the summary Index Table or the Content of the selected document row.

Default:
index
Options:
index, content
Display File (Displayfile) op('source_crawl4ai').par.Displayfile Str
Default:
"" (Empty String)
Select Doc (Selectdoc) op('source_crawl4ai').par.Selectdoc Int

Selects a row from the output table for detailed viewing when the Display parameter is set to Content. Use the slider or enter a row number (starting from 1).

Default:
70
Range:
0 to 1
Slider Range:
1 to 1
Agent Return Content (Agentreturncontent) op('source_crawl4ai').par.Agentreturncontent Menu

Controls what content is returned to the agent (only applies when Agent Execution Mode is "Wait for Completion"):

  • "Status Only": Returns just success/failure status and page count. Minimal response for agents that only need to know if the crawl worked.
  • "Summary Only": Returns status, page count, and the list of URLs crawled. A good balance of information without overwhelming the agent.
  • "Truncated Content" (default): Returns the first 2000 characters from each page plus URLs. Provides sample content without full text.
  • "Full Content": Returns complete crawled content from all pages. Use when the agent needs access to all text content.

Note: Regardless of this setting, full content is ALWAYS added to the output table (if "Agent Calls Add to Table" is enabled).

Default:
truncated
Options:
none, summary, truncated, full
Agent Execution Mode (Agentexecutionmode) op('source_crawl4ai').par.Agentexecutionmode Menu

Controls how agent tool calls are executed:

  • "Wait for Completion": The agent waits for the entire crawl to finish before receiving a response. Best when the agent needs the crawled content immediately.
  • "Background Processing": The agent receives an immediate success response while the crawl runs in the background. Best for collection tasks where the agent just needs to know the crawl started successfully. The crawled content will still be added to the table (if enabled) once the background crawl completes.

Default:
background
Options:
wait, background
Agent Calls Add to Table (Agenttotable) op('source_crawl4ai').par.Agenttotable Toggle

When enabled (default), agent tool calls will add crawled content to the output table in addition to returning it to the agent. This allows you to see what the agent crawled in the operator's UI and use it with other operators like RAG indexers. When disabled, agent crawls only return content to the agent without populating the table.

Default:
True
Truncatecontent (chars) (Truncatecontent) op('source_crawl4ai').par.Truncatecontent Int
Default:
2000
Range:
0 to 1
Slider Range:
100 to 10000
Install Dependencies (Installdependencies) op('source_crawl4ai').par.Installdependencies Pulse

Install crawl4ai and playwright packages.

Default:
False
Check Dependencies (Checkdependencies) op('source_crawl4ai').par.Checkdependencies Pulse

Check status of required dependencies.

Default:
False
Install/Update Browsers (Installbrowsers) op('source_crawl4ai').par.Installbrowsers Pulse

Run "playwright install" to download necessary browser binaries.

Default:
False
v1.4.0 (2025-09-24)

Added better dependency/installation handling and a Truncate Content parameter.

v1.3.0 (2025-09-01)

Added Table Input mode.

Better deduplication.

Better ability to gather multiple sources from multiple agents at a time.

v1.2.0 (2025-01-16)

Agent Tool Control System

  • Added Agent Execution Mode parameter (Wait for Completion / Background Processing)
  • Added Agent Return Content parameter (Status Only / Summary Only / Truncated Content / Full Content)
  • Added Agent Calls Add to Table toggle

Scrape History System

  • New history table tracks all crawl attempts with status and timestamps
  • Instant cache responses for duplicate requests (sub-second vs 30+ seconds)
  • Smart duplicate detection using MD5 hash of crawl settings
  • Use Scrape History toggle and Clear Scrape History button

Duplicate Handling & Performance

  • Transparent duplicate reporting in agent responses
  • Automatic deduplication for all crawl operations
  • Clear Table on Crawl parameter for unified table clearing control
  • Background processing allows agents to continue without waiting

Stability Improvements

  • Max depth limited to 3 to prevent recursion errors
  • System recursion limit increased to 2000
  • Proper RecursionError handling with user-friendly messages
  • Failed crawls tracked in history to prevent repeated attempts

Enhanced Agent Responses

Agent responses now include duplicate counts, URLs, and cache status for better transparency.

Breaking Changes

None; all changes are backward compatible.

v1.1.0 (2025-07-03)

Added a GetTool function to the operator; it returns the tool definition for crawling a website.

v1.0.0 (2025-04-30)

Created the source doc for crawl4ai.