Source Webscraper
The Source Webscraper LOP extracts content from websites directly within TouchDesigner. It offers two scraping engines — a fast lightweight mode for static sites and a full browser mode for JavaScript-heavy pages — with recursive crawling, rate limiting, robots.txt compliance, and output formatted for direct use with the RAG Index LOP.
Key Features
- Two scraping engines: Simple mode (aiohttp + BeautifulSoup) for speed, Browser mode (Playwright) for full JavaScript rendering
- Recursive crawling with configurable depth, domain restriction, and URL pattern filtering
- Content extraction modes: Auto Detect, Article Only, Conservative, and Aggressive
- Ethical scraping controls: robots.txt compliance, per-domain rate limiting, configurable user agent
- Authentication support: Basic Auth and Bearer Token for protected sites
- Browser anti-detection: Stealth mode and proxy support for Browser mode
- RAG-ready output: Index table formatted for direct connection to a RAG Index LOP
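Robots.txt compliance, for example, comes down to a per-domain rule check before each fetch. The operator itself uses the robotexclusionrulesparser package (see Requirements); the sketch below shows the same kind of check with Python's standard library, using made-up rules and URLs for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
rules = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

agent = "TouchDesigner-RAG-WebScraper/1.0"
print(parser.can_fetch(agent, "https://example.com/docs/intro"))   # allowed path
print(parser.can_fetch(agent, "https://example.com/private/key"))  # disallowed path
```

A crawler runs this check once per candidate URL and simply skips any URL the rules disallow.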
Agent Tool Integration
This operator exposes two tools that allow Agent and Gemini Live LOPs to scrape web pages using either fast HTTP requests or a full browser with JavaScript support.
Use the Tool Debugger operator to inspect exact tool definitions, schemas, and parameters.
When connected to an Agent LOP, the agent gains access to two scraping tools:
- scrape_website_simple — Quickly fetches and extracts content from a single URL using HTTP requests. Best for static sites, documentation, and news articles.
- scrape_website_with_browser — Loads a page in a headless Chromium browser with full JavaScript support. Best for modern web apps, SPAs, and dynamically loaded content.
Both tools accept a single url parameter and return the extracted content with metadata. The agent does not need to configure extraction settings — both tools use Auto Detect mode automatically.
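A single-parameter tool of this kind is typically described to the model as a function schema. The dictionary below is a plausible sketch of that shape, not the operator's verbatim definition; use the Tool Debugger operator to inspect the real schemas:

```python
# Hypothetical sketch of the tool definition shape; field names are
# assumptions. Inspect the actual schema with the Tool Debugger operator.
scrape_simple_tool = {
    "name": "scrape_website_simple",
    "description": "Fetch and extract content from a single URL using HTTP requests.",
    "parameters": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Full URL including protocol."},
        },
        "required": ["url"],
    },
}

print(scrape_simple_tool["parameters"]["required"])
```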
Requirements
- Python packages: validators, aiohttp, beautifulsoup4, trafilatura, robotexclusionrulesparser (prompted for automatic install on first use)
- Browser mode additionally requires: playwright and Chromium browser binaries (prompted for automatic install when Browser mode is first selected)
Input/Output
Inputs
None — URLs are configured via the Start URL parameter on the Control page.
Outputs
- Index Table (DAT output): One row per scraped page with columns for doc_id, filename, source_path, content, metadata (JSON), and timestamp. This format is compatible with the RAG Index LOP.
- URL Table (internal): Tracks all discovered URLs with their processing status, crawl depth, and parent URL.
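If you consume these rows downstream or build compatible rows yourself, the documented column layout can be sketched as follows. The specific metadata keys shown are illustrative assumptions, not a guaranteed schema:

```python
import json
from datetime import datetime, timezone

# Documented Index Table columns.
columns = ["doc_id", "filename", "source_path", "content", "metadata", "timestamp"]

# One illustrative row; the metadata keys here are assumptions.
row = {
    "doc_id": "doc_0001",
    "filename": "intro.html",
    "source_path": "https://example.com/docs/intro",
    "content": "Extracted page text...",
    "metadata": json.dumps({"title": "Intro", "depth": 1}),
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

print([row[c] for c in columns[:3]])
```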
Usage Examples
Scraping a Single Page
- On the Control page, enter a full URL (including https://) in Start URL.
- Pulse Scrape Single URL.
- The result appears as a new row in the output Index Table.
This is useful for testing your extraction settings on one page before crawling an entire site.
Crawling a Website Section
- On the Control page, enter the starting URL in Start URL.
- On the Rules page, set Max Crawl Depth (e.g., 2 to follow links two levels deep).
- Set Max URLs to Process to cap how many pages are scraped (e.g., 50).
- Enable Restrict to Domain on the Control page to stay on the same site.
- Optionally set URL Patterns (Regex) on the Rules page to focus on specific paths (e.g., /docs/.*).
- Pulse Start Scraping on the Control page.
- Monitor Current Status and Progress as the crawl runs.
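The steps above combine into a depth-limited, breadth-first walk over discovered links. A minimal sketch of that logic, with a stubbed link graph standing in for real HTTP fetches (the URLs, graph, and function are illustrative, not the operator's code):

```python
import re
from collections import deque
from urllib.parse import urlparse

# Stubbed link graph: each page maps to its outgoing links.
LINKS = {
    "https://a.com/docs/": ["https://a.com/docs/x", "https://b.com/"],
    "https://a.com/docs/x": ["https://a.com/docs/y"],
    "https://a.com/docs/y": [],
    "https://b.com/": [],
}

def crawl(start, max_depth=2, max_urls=50, same_domain=True, pattern=r".*"):
    """Breadth-first crawl honoring depth, URL cap, domain, and regex rules."""
    domain = urlparse(start).netloc
    seen, out = {start}, []
    queue = deque([(start, 0)])
    while queue and len(out) < max_urls:
        url, depth = queue.popleft()
        out.append(url)  # "process" the page
        if depth >= max_depth:
            continue  # do not follow links past the configured depth
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if same_domain and urlparse(link).netloc != domain:
                continue  # Restrict to Domain
            if not re.search(pattern, link):
                continue  # URL Patterns (Regex)
            seen.add(link)
            queue.append((link, depth + 1))
    return out

print(crawl("https://a.com/docs/", max_depth=2, pattern=r"/docs/.*"))
```

Note that, as with the operator, a depth of 0 processes only the start URL itself.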
Scraping JavaScript-Heavy Sites
- On the Control page, set Scrape Mode to Browser (Full JS).
- On the Rules page, increase Wait Time (seconds) to 2-3 seconds for pages with dynamic content.
- If the site blocks automated browsers, enable Use Stealth Mode on the Auth / Browser page.
- Pulse Start Scraping or Scrape Single URL.
Accessing Protected Content
- On the Auth / Browser page, enable Use Authentication.
- Set Auth Type to Basic Auth or Bearer Token.
- For Basic Auth, fill in Username and Password. For Bearer Token, paste the raw token (without the Bearer prefix) into Bearer Token.
- Configure your URL and pulse Start Scraping.
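Both auth types map to a standard HTTP Authorization header. A stdlib sketch of what gets sent in each case (the helper function and credentials are mine, for illustration):

```python
import base64

def auth_header(auth_type, username="", password="", token=""):
    """Build the Authorization header for each auth type (illustrative)."""
    if auth_type == "Basic Auth":
        creds = base64.b64encode(f"{username}:{password}".encode()).decode()
        return {"Authorization": f"Basic {creds}"}
    if auth_type == "Bearer Token":
        # The operator adds the 'Bearer ' prefix itself, so supply the raw token.
        return {"Authorization": f"Bearer {token}"}
    return {}

print(auth_header("Basic Auth", "alice", "s3cret"))
print(auth_header("Bearer Token", token="abc123"))
```

This is why the Bearer Token parameter expects the raw token: the prefix is added at request time.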
Feeding Results into RAG Index
- Place a rag_index operator in your network.
- Wire the Source Webscraper output into the RAG Index input.
- Scrape your target site — results flow directly into the index in a compatible format.
Choosing a Scrape Mode
Section titled “Choosing a Scrape Mode”| Simple (Fast) | Browser (Full JS) | |
|---|---|---|
| Engine | aiohttp + BeautifulSoup + trafilatura | Playwright headless Chromium |
| Best for | Static HTML, documentation, news articles | SPAs, dynamic content, JS-rendered pages |
| Speed | Fast | Slower (launches real browser) |
| robots.txt | Respected when enabled | Not checked |
| Stealth/Proxy | Not available | Supported |
| Extra install | None beyond base packages | Playwright + Chromium binaries |
Best Practices
- Start small: Test with Scrape Single URL before launching a full crawl to verify your extraction settings produce good content.
- Use rate limiting: Keep Seconds Between Requests at 1-2 seconds minimum. Lower values risk IP bans. For smaller sites, 3-5 seconds is courteous.
- Restrict the domain: Enable Restrict to Domain to prevent the crawler from wandering to external sites.
- Cap your crawl: Set Max URLs to Process to a reasonable number. Crawling 1000+ pages generates large tables and takes significant time.
- Choose the right extract mode: Article Only works well for blogs and news. Auto Detect is a good default. Conservative and Aggressive adjust how aggressively boilerplate is stripped.
- Use Browser mode sparingly: Simple mode is much faster. Only switch to Browser mode when pages require JavaScript to render their content.
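The per-domain rate limiting recommended above can be pictured as a small bookkeeping table of "next allowed request" times, one entry per domain. A deterministic sketch (returning the required wait instead of sleeping; the class is illustrative, not the operator's code):

```python
from urllib.parse import urlparse

class DomainRateLimiter:
    """Track per-domain 'next allowed' request times (illustrative sketch)."""
    def __init__(self, seconds_between=2.0):
        self.seconds_between = seconds_between
        self.next_allowed = {}  # domain -> earliest time of next request

    def wait_needed(self, url, now):
        """Return how long to wait before fetching url at time `now`."""
        domain = urlparse(url).netloc
        wait = max(0.0, self.next_allowed.get(domain, 0.0) - now)
        self.next_allowed[domain] = now + wait + self.seconds_between
        return wait

limiter = DomainRateLimiter(seconds_between=2.0)
print(limiter.wait_needed("https://a.com/1", now=0.0))  # first hit: no wait
print(limiter.wait_needed("https://a.com/2", now=0.5))  # same domain: must wait
print(limiter.wait_needed("https://b.com/1", now=0.5))  # other domain: no wait
```

Because the table is keyed by domain, a crawl can still interleave requests to different sites at full speed while staying polite to each one.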
Troubleshooting
- “Missing dependencies” on first use: The operator will prompt you to install required packages automatically. Accept the install and wait for it to complete. The operator reinitializes after installation.
- No content extracted: Try a different Extract Mode on the Rules page. Some sites need Browser mode to render content. Also check that Min Content Length is not filtering out valid pages.
- Crawl stops immediately: Verify your URL includes the protocol (https://). Check that robots.txt is not blocking the path — you can temporarily disable Respect robots.txt to test.
- Getting blocked or empty responses: Increase Seconds Between Requests. In Browser mode, enable Use Stealth Mode and optionally configure a Proxy Server.
- Browser mode fails to start: Playwright needs to be installed separately. The operator prompts for installation when you first switch to Browser mode. If it fails, check the Textport for errors.
- Large tables causing slowness: The Control page warns that displaying the viewer of a large index table can be heavy. Set Display to Content and use Select Doc to browse individual results instead of viewing the full table.
Parameters
Control
op('source_webscraper').par.Starturl Str The starting URL for the web scraper. This is the entry point for crawling. Must be a valid URL including the protocol (http:// or https://). Examples: - https://example.com - https://docs.python.org/3/
- Default: "" (Empty String)
op('source_webscraper').par.Domainrestrict Toggle When enabled, the crawler will only follow links within the same domain as the start URL. This prevents the crawler from wandering to external websites. Example: If starting from example.com, only links to example.com will be followed.
- Default: True
op('source_webscraper').par.Status Str Displays the current status of the scraper. This is a read-only parameter that shows messages about the scraper's current state, such as "Scraping in progress" or "Scraping complete. Processed X URLs".
- Default: Set URL and start scraping
op('source_webscraper').par.Progress Float Shows the progress of the current scraping operation. This is a percentage (0-100) indicating how far along the scraper is in processing the maximum number of URLs. Progress is calculated as: (processed URLs / Max URLs) * 100.
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 100
op('source_webscraper').par.Startscraping Pulse Starts the web crawling process from the Start URL. This will begin crawling from the specified URL and follow links according to your depth, domain restriction, and pattern settings. Multiple URLs will be processed based on your configuration.
- Default: False
op('source_webscraper').par.Stopscraping Pulse Immediately stops any ongoing scraping process. This will cancel all worker tasks and close browser instances or HTTP sessions. Any URLs already processed will remain in the results tables.
- Default: False
op('source_webscraper').par.Scrapesingle Pulse Scrapes only the single URL specified in Start URL. Unlike Start Scraping, this will not follow any links or crawl beyond the initial page. Useful for testing extraction settings on a specific page without crawling an entire site.
- Default: False
op('source_webscraper').par.Clear Pulse Clears all scraped data from the tables. This removes all entries from both the index and URL tables, resets the visited URLs tracking, and returns the component to its initial state. Does not affect parameter settings.
- Default: False
op('source_webscraper').par.Displayfile Str
- Default: "" (Empty String)
op('source_webscraper').par.Selectdoc Int Select a document from the index table by its row number. This controls which document is displayed in the content view. The first row (0) is the header row, so valid values start at 1.
- Default: 1
- Range: 0 to 1
- Slider Range: 1 to 2
op('source_webscraper').par.Useragent Str The User-Agent HTTP header sent with requests. This identifies your scraper to websites. Default is a TouchDesigner-specific identifier. You may want to change this to mimic a standard browser if websites are blocking your scraper. Common browser User-Agents: - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15
- Default: TouchDesigner-RAG-WebScraper/1.0 (+https://derivative.ca)
op('source_webscraper').par.Ratelimit Float Controls how many seconds to wait between requests to the same domain. This is crucial for ethical scraping and avoiding IP bans. Guidelines: - 0: No rate limiting (not recommended) - 0.5-1: Fast scraping (use with caution) - 2-3: Moderate speed (recommended for most sites) - 5+: Very conservative (for sensitive sites) Many websites have rate limiting and will temporarily block your IP if you make too many requests too quickly. This setting helps you stay under those limits. For large websites like Wikipedia, 1-2 seconds is usually fine. For smaller sites, consider 3-5 seconds to reduce server load. Note: In Browser mode, this applies after each page is fully processed. In Simple mode, this applies between HTTP requests.
- Default: 0.4
- Range: 0 to 1
- Slider Range: 0 to 5
op('source_webscraper').par.Maxdepth Int Controls how many links deep the crawler will go from the start URL. - 0: Only scrape the start URL - 1: Scrape the start URL and direct links from it - 2+: Continue following links of links up to this depth Higher values will result in more pages being scraped but may take longer.
- Default: 2
- Range: 0 to 1
- Slider Range: 0 to 2
op('source_webscraper').par.Urlpatterns Str Regular expression pattern to filter which URLs to process. Only URLs matching this pattern will be scraped. Examples: - .* (Match all URLs) - .*\.pdf$ (Only PDF files) - /blog/.* (Only URLs containing /blog/) - ^https://example\.com/docs/.* (Only docs section) Leave empty or set to .* to process all URLs.
- Default: .*
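The pattern examples above behave like a regex search against each candidate URL. The sketch below assumes search semantics (the pattern may match anywhere in the URL), which is consistent with the "/blog/.* (Only URLs containing /blog/)" example:

```python
import re

urls = [
    "https://example.com/docs/setup",
    "https://example.com/blog/post-1",
    "https://example.com/files/manual.pdf",
]

def matches(pattern, url):
    # Assumes search semantics: the pattern may match anywhere in the URL.
    return re.search(pattern, url) is not None

print([u for u in urls if matches(r".*\.pdf$", u)])  # PDF files only
print([u for u in urls if matches(r"/blog/.*", u)])  # blog section only
```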
op('source_webscraper').par.Respectrobots Toggle When enabled, the scraper will check and obey the rules in robots.txt files. robots.txt is a standard file websites use to indicate which parts of their site should not be crawled by bots. Respecting these rules is considered good practice and may be legally required in some jurisdictions. Note: This only applies to Simple mode. Browser mode does not currently check robots.txt files.
- Default: True
op('source_webscraper').par.Maxurls Int Maximum number of URLs to process before stopping. This prevents runaway crawling and sets a clear boundary. Set this based on your needs: - Small values (10-50): Quick targeted scraping - Medium values (100-500): Moderate depth exploration - Large values (1000+): Comprehensive crawling (use with caution) The scraper will stop once this many URLs have been processed, even if there are more links to follow.
- Default: 100
- Range: 0 to 1
- Slider Range: 0 to 100
op('source_webscraper').par.Removenav Toggle When enabled, attempts to remove navigation elements, headers, and footers before extracting content. This helps focus on the main content and reduces noise in the extracted text. Most effective with Article extraction mode.
- Default: True
op('source_webscraper').par.Removeads Toggle When enabled, attempts to remove advertisements, popups, and banners before extracting content. This is based on common class and ID patterns used for ads. Most effective with Article extraction mode.
- Default: True
op('source_webscraper').par.Minlength Int Minimum character length for extracted content to be considered valid. Pages with content shorter than this will be skipped. This helps filter out error pages, login walls, and other non-content pages. Typical values: - 100-200: Minimal filtering - 500+: Ensures substantial content
- Default: 20
- Range: 0 to 1
- Slider Range: 0 to 1
op('source_webscraper').par.Waittime Float Time in seconds to wait for a page to stabilize after loading. This is particularly important in Browser mode. Many modern websites load content dynamically with JavaScript. This parameter gives the page time to fully render before extraction. Guidelines: - 0.5-1: Fast loading static pages - 2-3: Pages with some JavaScript - 5+: Complex web applications with lots of dynamic content Increase this value if you notice content is missing from your extractions.
- Default: 1.0
- Range: 0 to 1
- Slider Range: 0 to 1
Auth / Browser
op('source_webscraper').par.Useauth Toggle Enable this to use authentication when accessing websites that require login. When enabled, the scraper will use the credentials specified in the Auth parameters.
- Default: False
op('source_webscraper').par.Username Str Username for Basic Authentication. Only used when Auth Type is set to Basic Auth.
- Default: "" (Empty String)
op('source_webscraper').par.Password Str Password for Basic Authentication. Only used when Auth Type is set to Basic Auth.
- Default: "" (Empty String)
op('source_webscraper').par.Token Str Bearer token for token-based authentication. Only used when Auth Type is set to Bearer Token. Format: The raw token without the 'Bearer ' prefix. The scraper will automatically add 'Bearer ' when sending the request.
- Default: "" (Empty String)
op('source_webscraper').par.Usestealth Toggle When enabled, applies various techniques to make the browser harder to detect as automated. This can help bypass some anti-bot measures. Only applies in Browser mode. Techniques include: - Modifying navigator properties - Adding browser fingerprint randomization - Setting realistic headers - Emulating user behavior Note: This is not foolproof against sophisticated detection systems.
- Default: False
op('source_webscraper').par.Proxy Str Optional proxy server to route requests through. This can help with IP blocking or accessing geo-restricted content. Format: http://hostname:port or http://username:password@hostname:port Examples: - http://192.168.1.100:8080 - http://user:pass@proxy.example.com:8080 Only applies in Browser mode.
- Default: "" (Empty String)
Changelog
v1.1.0 (2025-07-03)
Added a GetTool function to the operator; it returns the tool definitions for both the Simple and Browser webscraping modes.
v1.0.0 (2025-03-14)
Initial release: created the Source Webscraper operator.