Source Webscraper Operator

The Source Webscraper LOP extracts content from websites directly within TouchDesigner. It crawls web pages automatically, respecting robots.txt rules and rate limits, and gathers text and data that match the criteria you specify. This is useful for building datasets for AI models (for use with Rag Index), collecting information for interactive installations, or monitoring changes on websites over time.

Note: This operator requires Python packages: validators, aiohttp, beautifulsoup4, trafilatura, robotexclusionrulesparser.

This operator exposes 1 tool that allows Agent and Gemini Live LOPs to scrape and extract content from websites. The tool gives AI agents automated content extraction with the same rate limiting and robots.txt compliance as the operator itself.

Parameters are organized into pages.

Start URL (Starturl) op('source_webscraper').par.Starturl Str
Default:
None
Restrict to Domain (Domainrestrict) op('source_webscraper').par.Domainrestrict Toggle
Default:
On
Start Scraping (Startscraping) op('source_webscraper').par.Startscraping Pulse
Default:
None
Stop Scraping (Stopscraping) op('source_webscraper').par.Stopscraping Pulse
Default:
None
Scrape Single URL (Scrapesingle) op('source_webscraper').par.Scrapesingle Pulse
Default:
None
Scrape Mode (Scrapemode) op('source_webscraper').par.Scrapemode Menu
Default:
simple
Options:
simple, browser
Current Status (Status) op('source_webscraper').par.Status Str
Default:
None
Progress (Progress) op('source_webscraper').par.Progress Float
Default:
None
Clear All (Clear) op('source_webscraper').par.Clear Pulse
Default:
None
Caution: Exposing the viewer of large index tables can be heavy on performance. (Header)
Display (Display) op('source_webscraper').par.Display Menu
Default:
index
Options:
index, content
Select Doc (Selectdoc) op('source_webscraper').par.Selectdoc Int
Default:
1
Range:
1 to N/A
Slider Range:
1 to N/A
Display File (Displayfile) op('source_webscraper').par.Displayfile Str
Default:
None
User Agent (Useragent) op('source_webscraper').par.Useragent Str
Default:
TouchDesigner-RAG-WebScraper/1.0 (+https://derivative.ca)
Seconds Between Requests (Ratelimit) op('source_webscraper').par.Ratelimit Float
Default:
0.4
Range:
0 to 5
Max Crawl Depth (Maxdepth) op('source_webscraper').par.Maxdepth Int
Default:
2
Range:
0 to 2
Max URLs to Process (Maxurls) op('source_webscraper').par.Maxurls Int
Default:
100
Range:
0 to 100
URL Patterns (Regex) (Urlpatterns) op('source_webscraper').par.Urlpatterns Str
Default:
.*
Respect robots.txt (Respectrobots) op('source_webscraper').par.Respectrobots Toggle
Default:
On
Extract Mode (Extractmode) op('source_webscraper').par.Extractmode Menu
Default:
auto
Options:
auto, article, conservative, aggressive
Min Content Length (Minlength) op('source_webscraper').par.Minlength Int
Default:
20
Range:
0 to 10000
Remove Navigation (Removenav) op('source_webscraper').par.Removenav Toggle
Default:
On
Remove Ads/Popups (Removeads) op('source_webscraper').par.Removeads Toggle
Default:
On
Wait Time (seconds) (Waittime) op('source_webscraper').par.Waittime Float
Default:
1
Range:
0 to 1
Use Authentication (Useauth) op('source_webscraper').par.Useauth Toggle
Default:
Off
Auth Type (Authtype) op('source_webscraper').par.Authtype Menu
Default:
none
Options:
none, basic, bearer
Username (Username) op('source_webscraper').par.Username Str
Default:
None
Password (Password) op('source_webscraper').par.Password Str
Default:
None
Bearer Token (Token) op('source_webscraper').par.Token Str
Default:
None
ChatTD (Chattd) op('source_webscraper').par.Chattd OP
Default:
None
Helper Popups (Popups) op('source_webscraper').par.Popups Toggle
Default:
On
Show Built In Pars (Showbuiltin) op('source_webscraper').par.Showbuiltin Toggle
Default:
Off
Bypass (Bypass) op('source_webscraper').par.Bypass Toggle
Default:
Off
Version (Version) op('source_webscraper').par.Version Str
Default:
None
Lastupdated (Lastupdated) op('source_webscraper').par.Lastupdated Str
Default:
None
Creator (Creator) op('source_webscraper').par.Creator Str
Default:
None
Website (Website) op('source_webscraper').par.Website Str
Default:
None
Extra Debug (Textport) (Debugprints) op('source_webscraper').par.Debugprints Toggle
Default:
Off
Available Callbacks:
  • onScrapeStart
  • onScrapeComplete
  • onUrlProcessed
  • onUrlError
  • onContentExtracted
  • onError
To scrape a single page:
  1. Set ‘Start URL’ to the target webpage URL.
  2. Set ‘Max Crawl Depth’ to 0.
  3. Pulse ‘Start Scraping’.
  4. Check the output Index Table DAT for the result.
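
The single-page steps can also be driven from Python. The following is a minimal sketch, not the operator's own scripting interface: the helper name `configure_single_page` and the operator path `source_webscraper` are assumptions, while the parameter names (`Starturl`, `Maxdepth`, `Startscraping`) are taken from the parameter list on this page.

```python
def configure_single_page(scraper, url):
    """Apply the single-page steps to a Source Webscraper operator.

    `scraper` is the operator object, e.g. op('source_webscraper')
    (the path is an assumption; use the name in your own network).
    """
    scraper.par.Starturl = url          # step 1: target page
    scraper.par.Maxdepth = 0            # step 2: do not follow any links
    scraper.par.Startscraping.pulse()   # step 3: start the scrape


# Inside TouchDesigner this would be called as, for example:
# configure_single_page(op('source_webscraper'), 'https://example.com/about')
```

Step 4 stays manual here: once ‘Current Status’ reports that the scrape has finished, inspect the operator's output Index Table DAT.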
To crawl a website or one of its sections:
  1. Set ‘Start URL’ to the website’s homepage or section start page.
  2. Set ‘Max Crawl Depth’ (e.g., 2).
  3. Enable ‘Restrict to Domain’.
  4. Set ‘Max URLs to Process’ to limit scope.
  5. (Optional) Use ‘URL Patterns (Regex)’ to focus on specific paths (e.g., /blog/.*).
  6. Pulse ‘Start Scraping’. Monitor ‘Status’ and ‘Progress’.
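
As a rough illustration of the crawl setup, the sketch below sets the same parameters from Python and previews which URLs a pattern would keep. The helper names are hypothetical, and the preview assumes the pattern is applied with `re.search` semantics; this page does not say how the operator actually matches ‘URL Patterns (Regex)’, so treat the preview as an approximation.

```python
import re


def configure_crawl(scraper, start_url, max_depth=2, max_urls=100,
                    url_pattern='.*'):
    """Apply the crawl steps to a Source Webscraper operator."""
    scraper.par.Starturl = start_url         # step 1: homepage or section
    scraper.par.Maxdepth = max_depth         # step 2: how many links deep
    scraper.par.Domainrestrict = True        # step 3: stay on the domain
    scraper.par.Maxurls = max_urls           # step 4: cap the crawl size
    scraper.par.Urlpatterns = url_pattern    # step 5: optional path filter
    scraper.par.Startscraping.pulse()        # step 6: start the crawl


def preview_pattern(pattern, urls):
    """Return the URLs a regex would keep (assumes re.search semantics)."""
    compiled = re.compile(pattern)
    return [u for u in urls if compiled.search(u)]
```

Checking a pattern such as `/blog/.*` against a handful of known URLs with `preview_pattern` before starting a long crawl can catch a filter that is too broad or too narrow. Inside TouchDesigner, `scraper.par.Status.eval()` and `scraper.par.Progress.eval()` can then be read to monitor the run.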
To scrape a site that requires authentication:
  1. Enable ‘Use Authentication’.
  2. Set ‘Auth Type’ to ‘basic’ (Basic Auth).
  3. Enter credentials in ‘Username’ and ‘Password’.
  4. Set ‘Start URL’ and other parameters.
  5. Pulse ‘Start Scraping’.
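
The authentication parameters can be filled in from a script as well. This is a sketch under assumptions: `configure_auth` is a hypothetical helper, and it relies only on the parameter names and the ‘Auth Type’ menu values (`basic`, `bearer`) listed on this page.

```python
def configure_auth(scraper, auth_type, username='', password='', token=''):
    """Enable authentication on a Source Webscraper operator.

    auth_type must be 'basic' or 'bearer', matching the Auth Type menu.
    """
    if auth_type not in ('basic', 'bearer'):
        raise ValueError("auth_type must be 'basic' or 'bearer'")

    scraper.par.Useauth = True
    scraper.par.Authtype = auth_type

    if auth_type == 'basic':
        # Basic Auth uses the Username and Password parameters.
        scraper.par.Username = username
        scraper.par.Password = password
    else:
        # Bearer auth uses the Bearer Token parameter instead.
        scraper.par.Token = token
```

Credentials are passed in as arguments so they can come from an environment variable or a local file that is not saved with the project, rather than being typed into a script.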
Notes:
  • The scraper runs asynchronously using aiohttp to avoid blocking the main TouchDesigner thread.
  • Rate limiting (Seconds Between Requests) is crucial for ethical scraping and avoiding IP bans.
  • robots.txt rules are respected by default, which may prevent scraping certain parts of a site.
  • Content extraction quality depends on the website structure and the selected Extract Mode.
  • The output Index Table is formatted for direct use with the Rag Index LOP.
  • Scraping large sites can take significant time and generate large tables; keep ‘Max URLs to Process’ and ‘Max Crawl Depth’ as low as the task allows.
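
To make the robots.txt and rate-limit notes concrete, here is a small standalone sketch. It is not the operator's implementation: the operator depends on robotexclusionrulesparser and aiohttp, whereas this example substitutes Python's standard-library `urllib.robotparser` so it runs anywhere. The helper names are hypothetical, and the time estimate assumes requests are issued one after another with the configured delay between them.

```python
from urllib.robotparser import RobotFileParser

# Default User Agent value from the parameter list on this page.
USER_AGENT = 'TouchDesigner-RAG-WebScraper/1.0 (+https://derivative.ca)'


def allowed_urls(robots_txt, urls, user_agent=USER_AGENT):
    """Return the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def min_crawl_seconds(url_count, seconds_between_requests=0.4):
    """Lower bound on crawl time from rate limiting alone.

    Counts one delay between each pair of consecutive requests and
    ignores download and extraction time.
    """
    return max(url_count - 1, 0) * seconds_between_requests
```

With the defaults on this page (100 URLs, 0.4 seconds between requests), the delays alone add up to roughly 40 seconds, before any network or extraction time is counted.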