Source Webscraper Operator
Overview
The Source Webscraper LOP allows you to extract content from websites directly within TouchDesigner. It automates the process of crawling web pages, respecting robots.txt rules and rate limits, to gather text and data based on specified criteria. This is useful for creating datasets for AI models (for use with Rag Index), collecting information for interactive installations, or monitoring changes on websites over time.
Note: This operator requires the Python packages `validators`, `aiohttp`, `beautifulsoup4`, `trafilatura`, and `robotexclusionrulesparser`.
Parameters
Parameters are organized into pages.
Start URL (Starturl)
op('source_webscraper').par.Starturl
Str - Default: None
Restrict to Domain (Domainrestrict)
op('source_webscraper').par.Domainrestrict
Toggle - Default: On
Start Scraping (Startscraping)
op('source_webscraper').par.Startscraping
Pulse - Default: None
Stop Scraping (Stopscraping)
op('source_webscraper').par.Stopscraping
Pulse - Default: None
Current Status (Status)
op('source_webscraper').par.Status
Str - Default: Ready
Progress (Progress)
op('source_webscraper').par.Progress
Float - Default: 0
Clear All (Clear)
op('source_webscraper').par.Clear
Pulse - Default: None
Caution: Viewing large index tables can be slow.
Select Doc (Selectdoc)
op('source_webscraper').par.Selectdoc
Int - Default: 0
Display File (Displayfile)
op('source_webscraper').par.Displayfile
Str - Default: "" (Empty String)
User Agent (Useragent)
op('source_webscraper').par.Useragent
Str - Default: TouchDesigner-RAG-WebScraper/1.0 (+https://derivative.ca)
Seconds Between Requests (Ratelimit)
op('source_webscraper').par.Ratelimit
Float - Default: 2
- Range: 0.1 to 60
- Slider Range: 0.5 to 10
Max Crawl Depth (Maxdepth)
op('source_webscraper').par.Maxdepth
Int - Default: 2
- Range: 0 to 10
Max URLs to Process (Maxurls)
op('source_webscraper').par.Maxurls
Int - Default: 100
- Range: 1 to 10000
- Slider Range: 10 to 500
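A rough lower bound on crawl duration follows from these two parameters: each fetched page waits at least the rate-limit interval, so total time scales with the URL budget. A minimal sketch of that arithmetic (plain Python, no TouchDesigner required):

```python
# Rough lower bound on crawl duration: every fetched page waits at least
# the rate-limit interval, so total time scales with the URL budget.
def estimated_crawl_seconds(max_urls: int, rate_limit: float) -> float:
    """Lower-bound estimate; real crawls also spend time on network I/O."""
    return max_urls * rate_limit

# With the defaults (100 URLs, 2 seconds between requests), a full
# crawl takes at least a few minutes.
print(estimated_crawl_seconds(100, 2.0))  # → 200.0
```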
URL Patterns (Regex) (Urlpatterns)
op('source_webscraper').par.Urlpatterns
Str - Default: .*
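The default `.*` matches every URL; a narrower pattern such as `/blog/.*` restricts the crawl to matching links. Whether the operator applies search or full-match semantics is not documented here, so this sketch uses `re.search`, which matches the pattern anywhere in the URL:

```python
import re

# Filtering discovered links with a URL pattern such as "/blog/.*".
# Assumption: search semantics (pattern may match anywhere in the URL).
pattern = re.compile(r"/blog/.*")

urls = [
    "https://example.com/blog/post-1",
    "https://example.com/about",
]
kept = [u for u in urls if pattern.search(u)]
print(kept)  # → ['https://example.com/blog/post-1']
```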
Respect robots.txt (Respectrobots)
op('source_webscraper').par.Respectrobots
Toggle - Default: On
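When this toggle is on, URLs disallowed by the site's robots.txt are skipped. The operator uses the `robotexclusionrulesparser` package; the standard library's `RobotFileParser` shown below applies the same kind of rules and illustrates the effect:

```python
from urllib.robotparser import RobotFileParser

# Illustration of robots.txt filtering (stdlib parser, same rule format
# as the robotexclusionrulesparser package the operator depends on).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

agent = "TouchDesigner-RAG-WebScraper/1.0"
print(rp.can_fetch(agent, "https://example.com/private/page"))  # → False
print(rp.can_fetch(agent, "https://example.com/blog/post"))     # → True
```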
Min Content Length (Minlength)
op('source_webscraper').par.Minlength
Int - Default: 1000
- Range: 0 to 10000
- Slider Range: 100 to 2000
Remove Navigation (Removenav)
op('source_webscraper').par.Removenav
Toggle - Default: On
Remove Ads/Popups (Removeads)
op('source_webscraper').par.Removeads
Toggle - Default: On
Use Authentication (Useauth)
op('source_webscraper').par.Useauth
Toggle - Default: Off
Username (Username)
op('source_webscraper').par.Username
Str - Default: None
Password (Password)
op('source_webscraper').par.Password
Str - Default: None
Bearer Token (Token)
op('source_webscraper').par.Token
Str - Default: None
ChatTD (Chattd)
op('source_webscraper').par.Chattd
OP - Default: /dot_lops/ChatTD
Helper Popups (Popups)
op('source_webscraper').par.Popups
Toggle - Default: On
Show Built In Pars (Showbuiltin)
op('source_webscraper').par.Showbuiltin
Toggle - Default: Off
Bypass (Bypass)
op('source_webscraper').par.Bypass
Toggle - Default: Off
Callbacks
Available Callbacks:
onScrapeStart
onScrapeComplete
onUrlProcessed
onUrlError
onContentExtracted
onError
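A minimal sketch of what a callbacks script might contain. The exact callback signatures and the keys of the info argument are assumptions here, not documented behavior; check the operator's callback template for the real interface:

```python
# Hypothetical callback sketch. The info-dict keys ('url', 'depth',
# 'pages') are assumptions for illustration only.
def onUrlProcessed(info):
    # Assumed to fire once per crawled URL.
    return f"processed {info.get('url')} (depth {info.get('depth')})"

def onScrapeComplete(info):
    # Assumed to fire once when the crawl finishes.
    return f"scrape finished: {info.get('pages', 0)} pages"

print(onUrlProcessed({'url': 'https://example.com', 'depth': 0}))
print(onScrapeComplete({'pages': 12}))
```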
Usage Examples
Scraping a Single Page
1. Set 'Start URL' to the target webpage URL.
2. Set 'Max Crawl Depth' to 0.
3. Pulse 'Start Scraping'.
4. Check the output Index Table DAT for the result.
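The same steps can be scripted from the textport. This is a sketch that assumes the operator is named `source_webscraper` in the current component; adjust the path to match your network:

```python
# Configure and launch a single-page scrape.
# Assumption: the operator lives at 'source_webscraper' relative to here.
scraper = op('source_webscraper')
scraper.par.Starturl = 'https://example.com/article'
scraper.par.Maxdepth = 0           # 0 = fetch only the start URL
scraper.par.Startscraping.pulse()  # begin the scrape
```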
Crawling a Website Section
1. Set 'Start URL' to the website's homepage or section start page.
2. Set 'Max Crawl Depth' (e.g., 2).
3. Enable 'Restrict to Domain'.
4. Set 'Max URLs to Process' to limit scope.
5. (Optional) Use 'URL Patterns (Regex)' to focus on specific paths (e.g., `/blog/.*`).
6. Pulse 'Start Scraping'. Monitor 'Status' and 'Progress'.
Using Basic Authentication
1. Enable 'Use Authentication'.
2. Select 'Auth Type': `basic`.
3. Enter credentials in 'Username' and 'Password'.
4. Set 'Start URL' and other parameters.
5. Pulse 'Start Scraping'.
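For context on what the 'Username' and 'Password' parameters amount to on the wire: HTTP Basic auth sends an `Authorization: Basic <base64>` header encoding `username:password`. A small stdlib sketch:

```python
import base64

# HTTP Basic auth encodes "username:password" as base64 in the
# Authorization header; this is what Basic-auth parameters produce.
def basic_auth_header(username: str, password: str) -> str:
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

print(basic_auth_header("alice", "secret"))  # → Basic YWxpY2U6c2VjcmV0
```

Because the credentials are only encoded, not encrypted, Basic auth should only be used against HTTPS start URLs.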
Technical Notes
- The scraper runs asynchronously using `aiohttp` to avoid blocking the main TouchDesigner thread.
- Rate limiting ('Seconds Between Requests') is crucial for respecting target servers.
- robots.txt rules are respected by default, which may prevent scraping certain parts of a site.
- Content extraction quality depends on the website structure and the selected 'Extract Mode'.
- The output Index Table is formatted for direct use with the Rag Index LOP.
- Scraping large sites can take significant time and generate large tables; use 'Max URLs to Process' and 'Max Crawl Depth' wisely.