Source Webscraper
The Source Webscraper LOP extracts content from websites directly within TouchDesigner. It offers two scraping engines — a fast lightweight mode for static sites and a full browser mode for JavaScript-heavy pages — with recursive crawling, rate limiting, robots.txt compliance, and output formatted for direct use with the RAG Index LOP.
Key Features
- Two scraping engines: Simple mode (aiohttp + BeautifulSoup) for speed, Browser mode (Playwright) for full JavaScript rendering
- Recursive crawling with configurable depth, domain restriction, and URL pattern filtering
- Content extraction modes: Auto Detect, Article Only, Conservative, and Aggressive
- Ethical scraping controls: robots.txt compliance, per-domain rate limiting, configurable user agent
- Authentication support: Basic Auth and Bearer Token for protected sites
- Browser anti-detection: Stealth mode and proxy support for Browser mode
- RAG-ready output: Index table formatted for direct connection to a RAG Index LOP
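Robots.txt compliance, for example, comes down to a per-domain rule check before each fetch. The operator itself uses the robotexclusionrulesparser package (see Requirements); the sketch below shows the same kind of check with Python's standard library, using made-up rules and URLs for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
rules = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

agent = "TouchDesigner-RAG-WebScraper/1.0"
print(parser.can_fetch(agent, "https://example.com/docs/intro"))   # allowed path
print(parser.can_fetch(agent, "https://example.com/private/key"))  # disallowed path
```

A crawler runs this check once per candidate URL and simply skips any URL the rules disallow.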
Agent Tool Integration
This operator exposes two tools that allow Agent and Gemini Live LOPs to scrape web pages using either fast HTTP requests or a full browser with JavaScript support.
Use the Tool Debugger operator to inspect exact tool definitions, schemas, and parameters.
When connected to an Agent LOP, the agent gains access to two scraping tools:
- scrape_website_simple — Quickly fetches and extracts content from a single URL using HTTP requests. Best for static sites, documentation, and news articles.
- scrape_website_with_browser — Loads a page in a headless Chromium browser with full JavaScript support. Best for modern web apps, SPAs, and dynamically loaded content.
Both tools accept a single url parameter and return the extracted content with metadata. The agent does not need to configure extraction settings — both tools use Auto Detect mode automatically.
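A single-parameter tool of this kind is typically described to the model as a function schema. The dictionary below is a plausible sketch of that shape, not the operator's verbatim definition; use the Tool Debugger operator to inspect the real schemas:

```python
# Hypothetical sketch of the tool definition shape; field names are
# assumptions. Inspect the actual schema with the Tool Debugger operator.
scrape_simple_tool = {
    "name": "scrape_website_simple",
    "description": "Fetch and extract content from a single URL using HTTP requests.",
    "parameters": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "Full URL including protocol."},
        },
        "required": ["url"],
    },
}

print(scrape_simple_tool["parameters"]["required"])
```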
Requirements
- Python packages: validators, aiohttp, beautifulsoup4, trafilatura, robotexclusionrulesparser (prompted for automatic install on first use)
- Browser mode additionally requires: playwright and Chromium browser binaries (prompted for automatic install when Browser mode is first selected)
Input/Output
Inputs
None — URLs are configured via the Start URL parameter on the Control page.
Outputs
- Index Table (DAT output): One row per scraped page with columns for doc_id, filename, source_path, content, metadata (JSON), and timestamp. This format is compatible with the RAG Index LOP.
- URL Table (internal): Tracks all discovered URLs with their processing status, crawl depth, and parent URL.
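If you consume these rows downstream or build compatible rows yourself, the documented column layout can be sketched as follows. The specific metadata keys shown are illustrative assumptions, not a guaranteed schema:

```python
import json
from datetime import datetime, timezone

# Documented Index Table columns.
columns = ["doc_id", "filename", "source_path", "content", "metadata", "timestamp"]

# One illustrative row; the metadata keys here are assumptions.
row = {
    "doc_id": "doc_0001",
    "filename": "intro.html",
    "source_path": "https://example.com/docs/intro",
    "content": "Extracted page text...",
    "metadata": json.dumps({"title": "Intro", "depth": 1}),
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

print([row[c] for c in columns[:3]])
```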
Usage Examples
Scraping a Single Page
- On the Control page, enter a full URL (including https://) in Start URL.
- Pulse Scrape Single URL.
- The result appears as a new row in the output Index Table.
This is useful for testing your extraction settings on one page before crawling an entire site.
Crawling a Website Section
- On the Control page, enter the starting URL in Start URL.
- On the Rules page, set Max Crawl Depth (e.g., 2 to follow links two levels deep).
- Set Max URLs to Process to cap how many pages are scraped (e.g., 50).
- Enable Restrict to Domain on the Control page to stay on the same site.
- Optionally set URL Patterns (Regex) on the Rules page to focus on specific paths (e.g., /docs/.*).
- Pulse Start Scraping on the Control page.
- Monitor Current Status and Progress as the crawl runs.
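The steps above combine into a depth-limited, breadth-first walk over discovered links. A minimal sketch of that logic, with a stubbed link graph standing in for real HTTP fetches (the URLs, graph, and function are illustrative, not the operator's code):

```python
import re
from collections import deque
from urllib.parse import urlparse

# Stubbed link graph: each page maps to its outgoing links.
LINKS = {
    "https://a.com/docs/": ["https://a.com/docs/x", "https://b.com/"],
    "https://a.com/docs/x": ["https://a.com/docs/y"],
    "https://a.com/docs/y": [],
    "https://b.com/": [],
}

def crawl(start, max_depth=2, max_urls=50, same_domain=True, pattern=r".*"):
    """Breadth-first crawl honoring depth, URL cap, domain, and regex rules."""
    domain = urlparse(start).netloc
    seen, out = {start}, []
    queue = deque([(start, 0)])
    while queue and len(out) < max_urls:
        url, depth = queue.popleft()
        out.append(url)  # "process" the page
        if depth >= max_depth:
            continue  # do not follow links past the configured depth
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if same_domain and urlparse(link).netloc != domain:
                continue  # Restrict to Domain
            if not re.search(pattern, link):
                continue  # URL Patterns (Regex)
            seen.add(link)
            queue.append((link, depth + 1))
    return out

print(crawl("https://a.com/docs/", max_depth=2, pattern=r"/docs/.*"))
```

Note that, as with the operator, a depth of 0 processes only the start URL itself.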
Scraping JavaScript-Heavy Sites
- On the Control page, set Scrape Mode to Browser (Full JS).
- On the Rules page, increase Wait Time (seconds) to 2-3 seconds for pages with dynamic content.
- If the site blocks automated browsers, enable Use Stealth Mode on the Auth / Browser page.
- Pulse Start Scraping or Scrape Single URL.
Accessing Protected Content
- On the Auth / Browser page, enable Use Authentication.
- Set Auth Type to Basic Auth or Bearer Token.
- For Basic Auth, fill in Username and Password. For Bearer Token, paste the raw token (without the Bearer prefix) into Bearer Token.
- Configure your URL and pulse Start Scraping.
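Both auth types map to a standard HTTP Authorization header. A stdlib sketch of what gets sent in each case (the helper function and credentials are mine, for illustration):

```python
import base64

def auth_header(auth_type, username="", password="", token=""):
    """Build the Authorization header for each auth type (illustrative)."""
    if auth_type == "Basic Auth":
        creds = base64.b64encode(f"{username}:{password}".encode()).decode()
        return {"Authorization": f"Basic {creds}"}
    if auth_type == "Bearer Token":
        # The operator adds the 'Bearer ' prefix itself, so supply the raw token.
        return {"Authorization": f"Bearer {token}"}
    return {}

print(auth_header("Basic Auth", "alice", "s3cret"))
print(auth_header("Bearer Token", token="abc123"))
```

This is why the Bearer Token parameter expects the raw token: the prefix is added at request time.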
Feeding Results into RAG Index
- Place a rag_index operator in your network.
- Wire the Source Webscraper output into the RAG Index input.
- Scrape your target site — results flow directly into the index in a compatible format.
Choosing a Scrape Mode
Section titled “Choosing a Scrape Mode”| Simple (Fast) | Browser (Full JS) | |
|---|---|---|
| Engine | aiohttp + BeautifulSoup + trafilatura | Playwright headless Chromium |
| Best for | Static HTML, documentation, news articles | SPAs, dynamic content, JS-rendered pages |
| Speed | Fast | Slower (launches real browser) |
| robots.txt | Respected when enabled | Not checked |
| Stealth/Proxy | Not available | Supported |
| Extra install | None beyond base packages | Playwright + Chromium binaries |
Best Practices
- Start small: Test with Scrape Single URL before launching a full crawl to verify your extraction settings produce good content.
- Use rate limiting: Keep Seconds Between Requests at 1-2 seconds minimum. Lower values risk IP bans. For smaller sites, 3-5 seconds is courteous.
- Restrict the domain: Enable Restrict to Domain to prevent the crawler from wandering to external sites.
- Cap your crawl: Set Max URLs to Process to a reasonable number. Crawling 1000+ pages generates large tables and takes significant time.
- Choose the right extract mode: Article Only works well for blogs and news. Auto Detect is a good default. Conservative and Aggressive adjust how aggressively boilerplate is stripped.
- Use Browser mode sparingly: Simple mode is much faster. Only switch to Browser mode when pages require JavaScript to render their content.
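The per-domain rate limiting recommended above can be pictured as a small bookkeeping table of "next allowed request" times, one entry per domain. A deterministic sketch (returning the required wait instead of sleeping; the class is illustrative, not the operator's code):

```python
from urllib.parse import urlparse

class DomainRateLimiter:
    """Track per-domain 'next allowed' request times (illustrative sketch)."""
    def __init__(self, seconds_between=2.0):
        self.seconds_between = seconds_between
        self.next_allowed = {}  # domain -> earliest time of next request

    def wait_needed(self, url, now):
        """Return how long to wait before fetching url at time `now`."""
        domain = urlparse(url).netloc
        wait = max(0.0, self.next_allowed.get(domain, 0.0) - now)
        self.next_allowed[domain] = now + wait + self.seconds_between
        return wait

limiter = DomainRateLimiter(seconds_between=2.0)
print(limiter.wait_needed("https://a.com/1", now=0.0))  # first hit: no wait
print(limiter.wait_needed("https://a.com/2", now=0.5))  # same domain: must wait
print(limiter.wait_needed("https://b.com/1", now=0.5))  # other domain: no wait
```

Because the table is keyed by domain, a crawl can still interleave requests to different sites at full speed while staying polite to each one.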
Troubleshooting
- “Missing dependencies” on first use: The operator will prompt you to install required packages automatically. Accept the install and wait for it to complete. The operator reinitializes after installation.
- No content extracted: Try a different Extract Mode on the Rules page. Some sites need Browser mode to render content. Also check that Min Content Length is not filtering out valid pages.
- Crawl stops immediately: Verify your URL includes the protocol (https://). Check that robots.txt is not blocking the path — you can temporarily disable Respect robots.txt to test.
- Getting blocked or empty responses: Increase Seconds Between Requests. In Browser mode, enable Use Stealth Mode and optionally configure a Proxy Server.
- Browser mode fails to start: Playwright needs to be installed separately. The operator prompts for installation when you first switch to Browser mode. If it fails, check the Textport for errors.
- Large tables causing slowness: The Control page warns that displaying the viewer of a large index table can be heavy. Set Display to Content and use Select Doc to browse individual results instead of viewing the full table.
Parameters
Control
op('source_webscraper').par.Starturl Str The starting URL for the web scraper. This is the entry point for crawling. Must be a valid URL including the protocol (http:// or https://). Examples: - https://example.com - https://docs.python.org/3/
- Default: "" (Empty String)
op('source_webscraper').par.Domainrestrict Toggle When enabled, the crawler will only follow links within the same domain as the start URL. This prevents the crawler from wandering to external websites. Example: If starting from example.com, only links to example.com will be followed.
- Default: True
op('source_webscraper').par.Status Str Displays the current status of the scraper. This is a read-only parameter that shows messages about the scraper's current state, such as "Scraping in progress" or "Scraping complete. Processed X URLs".
- Default: Set URL and start scraping
op('source_webscraper').par.Progress Float Shows the progress of the current scraping operation. This is a percentage (0-100) indicating how far along the scraper is in processing the maximum number of URLs. Progress is calculated as: (processed URLs / Max URLs) * 100.
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 100
op('source_webscraper').par.Startscraping Pulse Starts the web crawling process from the Start URL. This will begin crawling from the specified URL and follow links according to your depth, domain restriction, and pattern settings. Multiple URLs will be processed based on your configuration.
- Default: False
op('source_webscraper').par.Stopscraping Pulse Immediately stops any ongoing scraping process. This will cancel all worker tasks and close browser instances or HTTP sessions. Any URLs already processed will remain in the results tables.
- Default: False
op('source_webscraper').par.Scrapesingle Pulse Scrapes only the single URL specified in Start URL. Unlike Start Scraping, this will not follow any links or crawl beyond the initial page. Useful for testing extraction settings on a specific page without crawling an entire site.
- Default: False
op('source_webscraper').par.Clear Pulse Clears all scraped data from the tables. This removes all entries from both the index and URL tables, resets the visited URLs tracking, and returns the component to its initial state. Does not affect parameter settings.
- Default: False
op('source_webscraper').par.Displayfile Str
- Default: "" (Empty String)
op('source_webscraper').par.Selectdoc Int Select a document from the index table by its row number. This controls which document is displayed in the content view. The first row (0) is the header row, so valid values start at 1.
- Default: 1
- Range: 0 to 1
- Slider Range: 1 to 2
op('source_webscraper').par.Useragent Str The User-Agent HTTP header sent with requests. This identifies your scraper to websites. Default is a TouchDesigner-specific identifier. You may want to change this to mimic a standard browser if websites are blocking your scraper. Common browser User-Agents: - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15
- Default: TouchDesigner-RAG-WebScraper/1.0 (+https://derivative.ca)
op('source_webscraper').par.Ratelimit Float Controls how many seconds to wait between requests to the same domain. This is crucial for ethical scraping and avoiding IP bans. Guidelines: - 0: No rate limiting (not recommended) - 0.5-1: Fast scraping (use with caution) - 2-3: Moderate speed (recommended for most sites) - 5+: Very conservative (for sensitive sites) Many websites have rate limiting and will temporarily block your IP if you make too many requests too quickly. This setting helps you stay under those limits. For large websites like Wikipedia, 1-2 seconds is usually fine. For smaller sites, consider 3-5 seconds to reduce server load. Note: In Browser mode, this applies after each page is fully processed. In Simple mode, this applies between HTTP requests.
- Default: 0.4
- Range: 0 to 1
- Slider Range: 0 to 5
op('source_webscraper').par.Maxdepth Int Controls how many links deep the crawler will go from the start URL. - 0: Only scrape the start URL - 1: Scrape the start URL and direct links from it - 2+: Continue following links of links up to this depth Higher values will result in more pages being scraped but may take longer.
- Default: 2
- Range: 0 to 1
- Slider Range: 0 to 2
op('source_webscraper').par.Urlpatterns Str Regular expression pattern to filter which URLs to process. Only URLs matching this pattern will be scraped. Examples: - .* (Match all URLs) - .*\.pdf$ (Only PDF files) - /blog/.* (Only URLs containing /blog/) - ^https://example\.com/docs/.* (Only docs section) Leave empty or set to .* to process all URLs.
- Default: .*
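The pattern examples above behave like a regex search against each candidate URL. The sketch below assumes search semantics (the pattern may match anywhere in the URL), which is consistent with the "/blog/.* (Only URLs containing /blog/)" example:

```python
import re

urls = [
    "https://example.com/docs/setup",
    "https://example.com/blog/post-1",
    "https://example.com/files/manual.pdf",
]

def matches(pattern, url):
    # Assumes search semantics: the pattern may match anywhere in the URL.
    return re.search(pattern, url) is not None

print([u for u in urls if matches(r".*\.pdf$", u)])  # PDF files only
print([u for u in urls if matches(r"/blog/.*", u)])  # blog section only
```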
op('source_webscraper').par.Respectrobots Toggle When enabled, the scraper will check and obey the rules in robots.txt files. robots.txt is a standard file websites use to indicate which parts of their site should not be crawled by bots. Respecting these rules is considered good practice and may be legally required in some jurisdictions. Note: This only applies to Simple mode. Browser mode does not currently check robots.txt files.
- Default: True
op('source_webscraper').par.Maxurls Int Maximum number of URLs to process before stopping. This prevents runaway crawling and sets a clear boundary. Set this based on your needs: - Small values (10-50): Quick targeted scraping - Medium values (100-500): Moderate depth exploration - Large values (1000+): Comprehensive crawling (use with caution) The scraper will stop once this many URLs have been processed, even if there are more links to follow.
- Default: 100
- Range: 0 to 1
- Slider Range: 0 to 100
op('source_webscraper').par.Removenav Toggle When enabled, attempts to remove navigation elements, headers, and footers before extracting content. This helps focus on the main content and reduces noise in the extracted text. Most effective with Article extraction mode.
- Default: True
op('source_webscraper').par.Removeads Toggle When enabled, attempts to remove advertisements, popups, and banners before extracting content. This is based on common class and ID patterns used for ads. Most effective with Article extraction mode.
- Default: True
op('source_webscraper').par.Minlength Int Minimum character length for extracted content to be considered valid. Pages with content shorter than this will be skipped. This helps filter out error pages, login walls, and other non-content pages. Typical values: - 100-200: Minimal filtering - 500+: Ensures substantial content
- Default: 20
- Range: 0 to 1
- Slider Range: 0 to 1
op('source_webscraper').par.Waittime Float Time in seconds to wait for a page to stabilize after loading. This is particularly important in Browser mode. Many modern websites load content dynamically with JavaScript. This parameter gives the page time to fully render before extraction. Guidelines: - 0.5-1: Fast loading static pages - 2-3: Pages with some JavaScript - 5+: Complex web applications with lots of dynamic content Increase this value if you notice content is missing from your extractions.
- Default: 1.0
- Range: 0 to 1
- Slider Range: 0 to 1
Auth / Browser
op('source_webscraper').par.Useauth Toggle Enable this to use authentication when accessing websites that require login. When enabled, the scraper will use the credentials specified in the Auth parameters.
- Default: False
op('source_webscraper').par.Username Str Username for Basic Authentication. Only used when Auth Type is set to Basic Auth.
- Default: "" (Empty String)
op('source_webscraper').par.Password Str Password for Basic Authentication. Only used when Auth Type is set to Basic Auth.
- Default: "" (Empty String)
op('source_webscraper').par.Token Str Bearer token for token-based authentication. Only used when Auth Type is set to Bearer Token. Format: The raw token without the 'Bearer ' prefix. The scraper will automatically add 'Bearer ' when sending the request.
- Default: "" (Empty String)
op('source_webscraper').par.Usestealth Toggle When enabled, applies various techniques to make the browser harder to detect as automated. This can help bypass some anti-bot measures. Only applies in Browser mode. Techniques include: - Modifying navigator properties - Adding browser fingerprint randomization - Setting realistic headers - Emulating user behavior Note: This is not foolproof against sophisticated detection systems.
- Default: False
op('source_webscraper').par.Proxy Str Optional proxy server to route requests through. This can help with IP blocking or accessing geo-restricted content. Format: http://hostname:port or http://username:password@hostname:port Examples: - http://192.168.1.100:8080 - http://user:pass@proxy.example.com:8080 Only applies in Browser mode.
- Default: "" (Empty String)
Changelog
v1.1.0 (2025-07-03)
Added a GetTool function to the operator; it returns the tool definitions for both the Simple and Browser webscraping modes.
v1.0.0 (2025-03-14)
Initial release: created the Source Webscraper operator.