Source Webscraper Operator
Overview
The Source Webscraper LOP allows you to extract content from websites directly within TouchDesigner. It automates the process of crawling web pages, respecting robots.txt rules and rate limits, to gather text and data based on specified criteria. This is useful for creating datasets for AI models (for use with Rag Index), collecting information for interactive installations, or monitoring changes on websites over time.
Note: This operator requires the Python packages `validators`, `aiohttp`, `beautifulsoup4`, `trafilatura`, and `robotexclusionrulesparser`.
Parameters
Parameters are organized into pages.
Start URL (Starturl)
op('source_webscraper').par.Starturl
Str - Default: None
Restrict to Domain (Domainrestrict)
op('source_webscraper').par.Domainrestrict
Toggle - Default: On
Start Scraping (Startscraping)
op('source_webscraper').par.Startscraping
Pulse - Default: None
Stop Scraping (Stopscraping)
op('source_webscraper').par.Stopscraping
Pulse - Default: None
Current Status (Status)
op('source_webscraper').par.Status
Str - Default: Ready
Progress (Progress)
op('source_webscraper').par.Progress
Float - Default: 0
Clear All (Clear)
op('source_webscraper').par.Clear
Pulse - Default: None
Caution: Viewing large index tables can be slow.
Select Doc (Selectdoc)
op('source_webscraper').par.Selectdoc
Int - Default: 0
Display File (Displayfile)
op('source_webscraper').par.Displayfile
Str - Default: "" (Empty String)
User Agent (Useragent)
op('source_webscraper').par.Useragent
Str - Default: TouchDesigner-RAG-WebScraper/1.0 (+https://derivative.ca)
Seconds Between Requests (Ratelimit)
op('source_webscraper').par.Ratelimit
Float - Default: 2
- Range: 0.1 to 60
- Slider Range: 0.5 to 10
Max Crawl Depth (Maxdepth)
op('source_webscraper').par.Maxdepth
Int - Default: 2
- Range: 0 to 10
Max URLs to Process (Maxurls)
op('source_webscraper').par.Maxurls
Int - Default: 100
- Range: 1 to 10000
- Slider Range: 10 to 500
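A rough lower bound on crawl duration follows from these two parameters: each fetched page waits at least the rate-limit interval, so total time scales with the URL budget. A minimal sketch of that arithmetic (plain Python, no TouchDesigner required):

```python
# Rough lower bound on crawl duration: every fetched page waits at least
# the rate-limit interval, so total time scales with the URL budget.
def estimated_crawl_seconds(max_urls: int, rate_limit: float) -> float:
    """Lower-bound estimate; real crawls also spend time on network I/O."""
    return max_urls * rate_limit

# With the defaults (100 URLs, 2 seconds between requests), a full
# crawl takes at least a few minutes.
print(estimated_crawl_seconds(100, 2.0))  # → 200.0
```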
URL Patterns (Regex) (Urlpatterns)
op('source_webscraper').par.Urlpatterns
Str - Default: .*
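The default `.*` matches every URL; a narrower pattern such as `/blog/.*` restricts the crawl to matching links. Whether the operator applies search or full-match semantics is not documented here, so this sketch uses `re.search`, which matches the pattern anywhere in the URL:

```python
import re

# Filtering discovered links with a URL pattern such as "/blog/.*".
# Assumption: search semantics (pattern may match anywhere in the URL).
pattern = re.compile(r"/blog/.*")

urls = [
    "https://example.com/blog/post-1",
    "https://example.com/about",
]
kept = [u for u in urls if pattern.search(u)]
print(kept)  # → ['https://example.com/blog/post-1']
```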
Respect robots.txt (Respectrobots)
op('source_webscraper').par.Respectrobots
Toggle - Default: On
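When this toggle is on, URLs disallowed by the site's robots.txt are skipped. The operator uses the `robotexclusionrulesparser` package; the standard library's `RobotFileParser` shown below applies the same kind of rules and illustrates the effect:

```python
from urllib.robotparser import RobotFileParser

# Illustration of robots.txt filtering (stdlib parser, same rule format
# as the robotexclusionrulesparser package the operator depends on).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

agent = "TouchDesigner-RAG-WebScraper/1.0"
print(rp.can_fetch(agent, "https://example.com/private/page"))  # → False
print(rp.can_fetch(agent, "https://example.com/blog/post"))     # → True
```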
Min Content Length (Minlength)
op('source_webscraper').par.Minlength
Int - Default: 1000
- Range: 0 to 10000
- Slider Range: 100 to 2000
Remove Navigation (Removenav)
op('source_webscraper').par.Removenav
Toggle - Default: On
Remove Ads/Popups (Removeads)
op('source_webscraper').par.Removeads
Toggle - Default: On
Use Authentication (Useauth)
op('source_webscraper').par.Useauth
Toggle - Default: Off
Username (Username)
op('source_webscraper').par.Username
Str - Default: None
Password (Password)
op('source_webscraper').par.Password
Str - Default: None
Bearer Token (Token)
op('source_webscraper').par.Token
Str - Default: None
ChatTD (Chattd)
op('source_webscraper').par.Chattd
OP - Default: /dot_lops/ChatTD
Helper Popups (Popups)
op('source_webscraper').par.Popups
Toggle - Default: On
Show Built In Pars (Showbuiltin)
op('source_webscraper').par.Showbuiltin
Toggle - Default: Off
Bypass (Bypass)
op('source_webscraper').par.Bypass
Toggle - Default: Off
Callbacks
Available Callbacks:
onScrapeStart
onScrapeComplete
onUrlProcessed
onUrlError
onContentExtracted
onError
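A minimal sketch of what a callbacks script might contain. The exact callback signatures and the keys of the info argument are assumptions here, not documented behavior; check the operator's callback template for the real interface:

```python
# Hypothetical callback sketch. The info-dict keys ('url', 'depth',
# 'pages') are assumptions for illustration only.
def onUrlProcessed(info):
    # Assumed to fire once per crawled URL.
    return f"processed {info.get('url')} (depth {info.get('depth')})"

def onScrapeComplete(info):
    # Assumed to fire once when the crawl finishes.
    return f"scrape finished: {info.get('pages', 0)} pages"

print(onUrlProcessed({'url': 'https://example.com', 'depth': 0}))
print(onScrapeComplete({'pages': 12}))
```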
Usage Examples
Scraping a Single Page
1. Set 'Start URL' to the target webpage URL.
2. Set 'Max Crawl Depth' to 0.
3. Pulse 'Start Scraping'.
4. Check the output Index Table DAT for the result.
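The same steps can be scripted from the textport. This is a sketch that assumes the operator is named `source_webscraper` in the current component; adjust the path to match your network:

```python
# Configure and launch a single-page scrape.
# Assumption: the operator lives at 'source_webscraper' relative to here.
scraper = op('source_webscraper')
scraper.par.Starturl = 'https://example.com/article'
scraper.par.Maxdepth = 0           # 0 = fetch only the start URL
scraper.par.Startscraping.pulse()  # begin the scrape
```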
Crawling a Website Section
1. Set 'Start URL' to the website's homepage or section start page.
2. Set 'Max Crawl Depth' (e.g., 2).
3. Enable 'Restrict to Domain'.
4. Set 'Max URLs to Process' to limit scope.
5. (Optional) Use 'URL Patterns (Regex)' to focus on specific paths (e.g., `/blog/.*`).
6. Pulse 'Start Scraping'. Monitor 'Status' and 'Progress'.
Using Basic Authentication
1. Enable 'Use Authentication'.
2. Select 'Auth Type': `basic`.
3. Enter credentials in 'Username' and 'Password'.
4. Set 'Start URL' and other parameters.
5. Pulse 'Start Scraping'.
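For context on what the 'Username' and 'Password' parameters amount to on the wire: HTTP Basic auth sends an `Authorization: Basic <base64>` header encoding `username:password`. A small stdlib sketch:

```python
import base64

# HTTP Basic auth encodes "username:password" as base64 in the
# Authorization header; this is what Basic-auth parameters produce.
def basic_auth_header(username: str, password: str) -> str:
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

print(basic_auth_header("alice", "secret"))  # → Basic YWxpY2U6c2VjcmV0
```

Because the credentials are only encoded, not encrypted, Basic auth should only be used against HTTPS start URLs.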
Technical Notes
- The scraper runs asynchronously using `aiohttp` to avoid blocking the main TouchDesigner thread.
- Rate limiting ('Seconds Between Requests') is crucial for respecting target servers.
- robots.txt rules are respected by default, which may prevent scraping certain parts of a site.
- Content extraction quality depends on the website structure and the selected 'Extract Mode'.
- The output Index Table is formatted for direct use with the Rag Index LOP.
- Scraping large sites can take significant time and generate large tables; use 'Max URLs to Process' and 'Max Crawl Depth' wisely.