
Source Webscraper Operator

The Source Webscraper LOP extracts content from websites directly within TouchDesigner. It automates crawling web pages, respecting robots.txt rules and rate limits, to gather text and data based on your criteria. This is useful for building datasets for AI models (for use with the Rag Index LOP), collecting information for interactive installations, or monitoring changes to websites over time.

Note: This operator requires the following Python packages: validators, aiohttp, beautifulsoup4, trafilatura, and robotexclusionrulesparser.

Parameters are organized into pages.

Start URL (Starturl) op('source_webscraper').par.Starturl Str
Default: None
Restrict to Domain (Domainrestrict) op('source_webscraper').par.Domainrestrict Toggle
Default: On
Start Scraping (Startscraping) op('source_webscraper').par.Startscraping Pulse
Default: None
Stop Scraping (Stopscraping) op('source_webscraper').par.Stopscraping Pulse
Default: None
Current Status (Status) op('source_webscraper').par.Status Str
Default: Ready
Progress (Progress) op('source_webscraper').par.Progress Float
Default: 0
Clear All (Clear) op('source_webscraper').par.Clear Pulse
Default: None
Caution: Viewing large index tables can be slow (Header)
Display (Display) op('source_webscraper').par.Display Menu
Default: index
Options: index, content
Select Doc (Selectdoc) op('source_webscraper').par.Selectdoc Int
Default: 0
Display File (Displayfile) op('source_webscraper').par.Displayfile Str
Default: "" (Empty String)
User Agent (Useragent) op('source_webscraper').par.Useragent Str
Default: TouchDesigner-RAG-WebScraper/1.0 (+https://derivative.ca)
Seconds Between Requests (Ratelimit) op('source_webscraper').par.Ratelimit Float
Default: 2
Range: 0.1 to 60
Slider Range: 0.5 to 10
Max Crawl Depth (Maxdepth) op('source_webscraper').par.Maxdepth Int
Default: 2
Range: 0 to 10
Max URLs to Process (Maxurls) op('source_webscraper').par.Maxurls Int
Default: 100
Range: 1 to 10000
Slider Range: 10 to 500
URL Patterns (Regex) (Urlpatterns) op('source_webscraper').par.Urlpatterns Str
Default: .*
Respect robots.txt (Respectrobots) op('source_webscraper').par.Respectrobots Toggle
Default: On
Extract Mode (Extractmode) op('source_webscraper').par.Extractmode Menu
Default: auto
Options: auto, article, conservative, aggressive
Min Content Length (Minlength) op('source_webscraper').par.Minlength Int
Default: 1000
Range: 0 to 10000
Slider Range: 100 to 2000
Remove Navigation (Removenav) op('source_webscraper').par.Removenav Toggle
Default: On
Remove Ads/Popups (Removeads) op('source_webscraper').par.Removeads Toggle
Default: On
Use Authentication (Useauth) op('source_webscraper').par.Useauth Toggle
Default: Off
Auth Type (Authtype) op('source_webscraper').par.Authtype Menu
Default: none
Options: none, basic, bearer
Username (Username) op('source_webscraper').par.Username Str
Default: None
Password (Password) op('source_webscraper').par.Password Str
Default: None
Bearer Token (Token) op('source_webscraper').par.Token Str
Default: None
ChatTD (Chattd) op('source_webscraper').par.Chattd OP
Default: /dot_lops/ChatTD
Helper Popups (Popups) op('source_webscraper').par.Popups Toggle
Default: On
Show Built In Pars (Showbuiltin) op('source_webscraper').par.Showbuiltin Toggle
Default: Off
Bypass (Bypass) op('source_webscraper').par.Bypass Toggle
Default: Off
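The parameters above can be read and set from Python elsewhere in the network. A minimal sketch of polling the crawl state; the operator path `source_webscraper` matches the examples above, but substitute your actual node path:

```python
def crawl_status(scraper):
    """Return the scraper's current status string and progress value."""
    # par.X.eval() reads a parameter's current value in TouchDesigner.
    return scraper.par.Status.eval(), scraper.par.Progress.eval()

# In TouchDesigner:
# status, progress = crawl_status(op('source_webscraper'))
```

This is handy in an Execute DAT or timer callback to drive UI feedback while a crawl runs.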
Available Callbacks:
  • onScrapeStart
  • onScrapeComplete
  • onUrlProcessed
  • onUrlError
  • onContentExtracted
  • onError
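A sketch of what a callbacks script for these hooks might look like. The callback names come from the list above, but the argument shape (a single `info` dict with `url`/`error` keys) is an assumption; check the operator's generated callback template for the real interface:

```python
# Assumed callback signatures: each hook receives one info dict.

def onScrapeStart(info):
    # Called once when a crawl begins.
    print('scrape started:', info.get('url', ''))

def onUrlProcessed(info):
    # Called per page successfully fetched.
    print('processed:', info.get('url', ''))

def onUrlError(info):
    # Log failures without stopping the crawl.
    print('failed:', info.get('url', ''), info.get('error', ''))

def onContentExtracted(info):
    # Called after text extraction; info is assumed to carry the text.
    pass

def onScrapeComplete(info):
    print('scrape complete')
```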
Single Page Scrape:
1. Set 'Start URL' to the target webpage URL.
2. Set 'Max Crawl Depth' to 0.
3. Pulse 'Start Scraping'.
4. Check the output Index Table DAT for the result.
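The steps above can be scripted. A minimal sketch; the helper function is hypothetical and the operator path is an assumption, but the parameter names match the reference above:

```python
def scrape_single_page(scraper, url):
    """Configure a Source Webscraper for a one-page grab, then start it."""
    scraper.par.Starturl = url
    scraper.par.Maxdepth = 0  # depth 0: fetch only the start URL
    scraper.par.Startscraping.pulse()

# In TouchDesigner:
# scrape_single_page(op('source_webscraper'), 'https://example.com/page')
```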
Site Crawl:
1. Set 'Start URL' to the website's homepage or section start page.
2. Set 'Max Crawl Depth' (e.g., 2).
3. Enable 'Restrict to Domain'.
4. Set 'Max URLs to Process' to limit scope.
5. (Optional) Use 'URL Patterns (Regex)' to focus on specific paths (e.g., `/blog/.*`).
6. Pulse 'Start Scraping'. Monitor 'Status' and 'Progress'.
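The crawl workflow above, as a hypothetical helper (parameter names from the reference; defaults mirror the operator's):

```python
def crawl_site(scraper, start_url, depth=2, max_urls=100, url_pattern='.*'):
    """Configure a bounded same-domain crawl, then start it."""
    scraper.par.Starturl = start_url
    scraper.par.Maxdepth = depth           # follow links this many hops deep
    scraper.par.Domainrestrict = True      # stay on the start URL's domain
    scraper.par.Maxurls = max_urls         # hard cap on pages processed
    scraper.par.Urlpatterns = url_pattern  # regex filter, e.g. '/blog/.*'
    scraper.par.Startscraping.pulse()

# In TouchDesigner:
# crawl_site(op('source_webscraper'), 'https://example.com', url_pattern='/blog/.*')
```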
Authenticated Scrape:
1. Enable 'Use Authentication'.
2. Select 'Auth Type': `basic`.
3. Enter credentials in 'Username' and 'Password'.
4. Set 'Start URL' and other parameters.
5. Pulse 'Start Scraping'.
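The same steps with HTTP Basic credentials, sketched as a hypothetical helper (for bearer auth, set `Authtype` to 'bearer' and fill `par.Token` instead):

```python
def scrape_with_basic_auth(scraper, url, user, password):
    """Configure HTTP Basic auth before starting a scrape."""
    scraper.par.Useauth = True
    scraper.par.Authtype = 'basic'
    scraper.par.Username = user
    scraper.par.Password = password
    scraper.par.Starturl = url
    scraper.par.Startscraping.pulse()
```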
Notes:
  • The scraper runs asynchronously using aiohttp to avoid blocking the main TouchDesigner thread.
  • Rate limiting (Seconds Between Requests) is crucial for respecting target servers.
  • robots.txt rules are respected by default, which may prevent scraping certain parts of a site.
  • Content extraction quality depends on the website structure and the selected Extract Mode.
  • The output Index Table is formatted for direct use with the Rag Index LOP.
  • Scraping large sites can take significant time and generate large tables; use Max URLs and Max Depth wisely.
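A quick way to budget a crawl follows from the rate limit: with one request every Ratelimit seconds, processing Maxurls pages takes at least Ratelimit × Maxurls seconds, before download and parse time:

```python
def min_crawl_seconds(rate_limit_s=2.0, max_urls=100):
    """Lower bound on crawl duration: one request per rate-limit interval."""
    return rate_limit_s * max_urls

# With the defaults (2 s between requests, 100 URLs), a full crawl
# takes at least 200 seconds, ignoring download and parse time.
```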