Source Webscraper Operator
The Source Webscraper LOP allows you to extract content from websites directly within TouchDesigner. It automates the process of crawling web pages, respecting robots.txt rules and rate limits, to gather text and data based on specified criteria. This is beneficial for creating datasets for AI models (for use with Rag Index), collecting information for interactive installations, or monitoring changes on websites over time.
Note: This operator requires the Python packages: validators, aiohttp, beautifulsoup4, trafilatura, robotexclusionrulesparser.
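Before using the operator, it can help to confirm the required packages are importable from TouchDesigner's Python. A minimal check, runnable from the textport (note that beautifulsoup4 is imported as `bs4`):

```python
import importlib.util

# Import names for the required packages; beautifulsoup4 installs
# as the module 'bs4'.
REQUIRED = ["validators", "aiohttp", "bs4", "trafilatura", "robotexclusionrulesparser"]

def missing_packages(names):
    """Return the package names that are not importable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

print(missing_packages(REQUIRED))
```

An empty list means all dependencies are available; otherwise, install the listed packages into TouchDesigner's Python environment.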
Agent Tool Integration
This operator exposes one tool that allows Agent and Gemini Live LOPs to scrape and extract content from websites while respecting robots.txt and rate limits for AI-driven web content gathering.
Use the Tool Debugger operator to inspect exact tool definitions, schemas, and parameters.
The Source Webscraper operator exposes web scraping capabilities as a tool for AI agents, enabling automated content extraction from websites with proper rate limiting and robots.txt compliance.
Parameters
Parameters are organized into pages.
op('source_webscraper').par.Starturl - Str - Default: None
op('source_webscraper').par.Domainrestrict - Toggle - Default: On
op('source_webscraper').par.Startscraping - Pulse - Default: None
op('source_webscraper').par.Stopscraping - Pulse - Default: None
op('source_webscraper').par.Scrapesingle - Pulse - Default: None
op('source_webscraper').par.Status - Str - Default: None
op('source_webscraper').par.Progress - Float - Default: None
op('source_webscraper').par.Clear - Pulse - Default: None
op('source_webscraper').par.Selectdoc - Int - Default: 1 - Range: 1 to N/A - Slider Range: 1 to N/A
op('source_webscraper').par.Displayfile - Str - Default: None
op('source_webscraper').par.Useragent - Str - Default: TouchDesigner-RAG-WebScraper/1.0 (+https://derivative.ca)
op('source_webscraper').par.Ratelimit - Float - Default: 0.4 - Range: 0 to 5
op('source_webscraper').par.Maxdepth - Int - Default: 2 - Range: 0 to 2
op('source_webscraper').par.Maxurls - Int - Default: 100 - Range: 0 to 100
op('source_webscraper').par.Urlpatterns - Str - Default: .*
op('source_webscraper').par.Respectrobots - Toggle - Default: On
op('source_webscraper').par.Minlength - Int - Default: 20 - Range: 0 to 10000
op('source_webscraper').par.Removenav - Toggle - Default: On
op('source_webscraper').par.Removeads - Toggle - Default: On
op('source_webscraper').par.Waittime - Float - Default: 1 - Range: 0 to 1
op('source_webscraper').par.Useauth - Toggle - Default: Off
op('source_webscraper').par.Username - Str - Default: None
op('source_webscraper').par.Password - Str - Default: None
op('source_webscraper').par.Token - Str - Default: None
op('source_webscraper').par.Chattd - OP - Default: None
op('source_webscraper').par.Popups - Toggle - Default: On
op('source_webscraper').par.Showbuiltin - Toggle - Default: Off
op('source_webscraper').par.Bypass - Toggle - Default: Off
op('source_webscraper').par.Version - Str - Default: None
op('source_webscraper').par.Lastupdated - Str - Default: None
op('source_webscraper').par.Creator - Str - Default: None
op('source_webscraper').par.Website - Str - Default: None
op('source_webscraper').par.Debugprints - Toggle - Default: Off
Callbacks
onScrapeStart
onScrapeComplete
onUrlProcessed
onUrlError
onContentExtracted
onError
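The callbacks above can be implemented in the operator's callbacks DAT. The stubs below are a hypothetical sketch: the exact arguments each callback receives are an assumption here, so check the operator's generated callbacks DAT for the real signatures.

```python
# Hypothetical callback stubs sketched from the callback names above.
# The 'info' argument is an assumption; the operator may pass different
# arguments. Inspect the generated callbacks DAT for the real signatures.

def onScrapeStart(info):
    print('Scrape started:', info)

def onUrlProcessed(info):
    print('URL processed:', info)

def onUrlError(info):
    print('URL failed:', info)

def onContentExtracted(info):
    print('Content extracted:', info)

def onScrapeComplete(info):
    print('Scrape complete:', info)

def onError(info):
    print('Error:', info)
```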
Usage Examples
Scraping a Single Page
- Set ‘Start URL’ to the target webpage URL.
- Set ‘Max Crawl Depth’ to 0.
- Pulse ‘Start Scraping’.
- Check the output Index Table DAT for the result.
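The steps above can be scripted from a Script DAT or the textport. A minimal sketch, assuming an operator at the path 'source_webscraper' with the parameter names listed earlier:

```python
# A minimal sketch, assuming an operator at the path 'source_webscraper'
# with the parameters listed above.

def scrape_single_page(ws, url):
    """Configure a Source Webscraper for a depth-0 (single page) scrape."""
    ws.par.Starturl = url
    ws.par.Maxdepth = 0           # 0 = do not follow any links
    ws.par.Startscraping.pulse()  # begin the scrape

# Inside TouchDesigner:
# scrape_single_page(op('source_webscraper'), 'https://example.com/article')
```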
Crawling a Website Section
- Set ‘Start URL’ to the website’s homepage or section start page.
- Set ‘Max Crawl Depth’ (e.g., 2).
- Enable ‘Restrict to Domain’.
- Set ‘Max URLs to Process’ to limit scope.
- (Optional) Use ‘URL Patterns (Regex)’ to focus on specific paths (e.g., /blog/.*).
- Pulse ‘Start Scraping’. Monitor ‘Status’ and ‘Progress’.
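To see how a 'URL Patterns (Regex)' value such as /blog/.* narrows a crawl, a plain-Python illustration (the URLs below are placeholders, not real targets):

```python
import re

# Mirrors the 'URL Patterns (Regex)' parameter: only URLs matching the
# pattern are followed during the crawl.
pattern = re.compile(r"/blog/.*")

candidates = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/archive/2024",
    "https://example.com/about",
]

kept = [u for u in candidates if pattern.search(u)]
print(kept)
```

Only the two /blog/ URLs survive the filter; the /about page is skipped.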
Using Basic Authentication
- Enable ‘Use Authentication’.
- Select ‘Auth Type’: Basic Auth.
- Enter credentials in ‘Username’ and ‘Password’.
- Set ‘Start URL’ and other parameters.
- Pulse ‘Start Scraping’.
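Scripted, the same setup might look like this. A sketch assuming the parameter names listed above and an operator at the path 'source_webscraper':

```python
# A minimal sketch, assuming the parameter names listed above; run inside
# TouchDesigner against an operator at the path 'source_webscraper'.

def enable_basic_auth(ws, username, password):
    """Turn on HTTP Basic authentication for subsequent scrapes."""
    ws.par.Useauth = True
    ws.par.Username = username
    ws.par.Password = password

# Inside TouchDesigner:
# enable_basic_auth(op('source_webscraper'), 'user', 'secret')
```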
Technical Notes
- The scraper runs asynchronously using aiohttp to avoid blocking the main TouchDesigner thread.
- Rate limiting (‘Seconds Between Requests’) is crucial for ethical scraping and avoiding IP bans.
- robots.txt rules are respected by default, which may prevent scraping certain parts of a site.
- Content extraction quality depends on the website structure and the selected ‘Extract Mode’.
- The output Index Table is formatted for direct use with the Rag Index LOP.
- Scraping large sites can take significant time and generate large tables; use ‘Max URLs’ and ‘Max Depth’ wisely.
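The operator itself uses robotexclusionrulesparser for robots.txt compliance; the standard library's urllib.robotparser shown below illustrates the same logic offline, with hypothetical example rules:

```python
from urllib import robotparser

# Hypothetical robots.txt rules, parsed offline to show how a Disallow
# entry blocks specific paths for every user agent.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

agent = "TouchDesigner-RAG-WebScraper/1.0"
print(rp.can_fetch(agent, "https://example.com/public/page"))   # allowed
print(rp.can_fetch(agent, "https://example.com/private/page"))  # blocked
```

When ‘Respect robots.txt’ is enabled, URLs the rules disallow are skipped even if they match the crawl patterns.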