Source Docs Operator

Overview

The Source Docs LOP (formerly DocumentParser) parses local documents into standardized index tables within TouchDesigner. HTML/HTM and Python files are supported initially, and the parser is designed to be extended to other formats. It extracts content and metadata from documents found within a specified folder structure, making it suitable for indexing local documentation, code, or text files for RAG systems.

Note: Requires the beautifulsoup4 Python library for HTML parsing.
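
A quick way to confirm the dependency is present is to test whether the package can be imported from the Python environment TouchDesigner uses. The following is a minimal sketch; the helper name `has_module` is hypothetical and not part of the operator.

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None

# beautifulsoup4 is installed under the import name "bs4".
if not has_module("bs4"):
    print("beautifulsoup4 is missing; HTML parsing will not be available")
```

Running this in the TouchDesigner textport checks the same interpreter the operator runs in, which may differ from a system-wide Python install.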

Parameters

Parameters are organized into pages.

Document Folder (Documentfolder) op('source_docs').par.Documentfolder Folder

Default: None

Folder Depth (Folderdepth) op('source_docs').par.Folderdepth Int

Default: 3
Range: 1 to 20

File Pattern (Filepattern) op('source_docs').par.Filepattern Str

Default: *.htm *.html *.py

Parse All Documents (Parseall) op('source_docs').par.Parseall Pulse

Default: None

Stop Parsing (Stopparsing) op('source_docs').par.Stopparsing Pulse

Default: None

Current Status (Status) op('source_docs').par.Status Str

Default: Ready

Progress (Progress) op('source_docs').par.Progress Float

Default: 0

Clear Index Table (Clear) op('source_docs').par.Clear Pulse

Default: None

Caution: Viewing large index tables can be slow Header

Max Stall Time (s) (Maxstalltime) op('source_docs').par.Maxstalltime Float

Default: 2
Range: 0.1 to 10

Preview File (Previewfile) op('source_docs').par.Previewfile File

Default: None

Parse Single File (Parsefile) op('source_docs').par.Parsefile Pulse

Default: None

Analyze Document Structure (Analyzedoc) op('source_docs').par.Analyzedoc Pulse

Default: None

Select Doc for Content View (Selectdoc) op('source_docs').par.Selectdoc Int

Default: 0

Selected Filename (Displayfile) op('source_docs').par.Displayfile Str

Default: "" (Empty String)

Include Sections (HTML Only) Header

Auto Analyze on Preview Change (Autoupdate) op('source_docs').par.Autoupdate Toggle

Default: Off

Include Unmatched Sections (Includemissing) op('source_docs').par.Includemissing Toggle

Default: On

ChatTD (Chattd) op('source_docs').par.Chattd OP

Default: /dot_lops/ChatTD

Helper Popups (Popups) op('source_docs').par.Popups Toggle

Default: On

Show Built In Pars (Showbuiltin) op('source_docs').par.Showbuiltin Toggle

Default: Off

Bypass (Bypass) op('source_docs').par.Bypass Toggle

Default: Off
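
To illustrate how 'File Pattern' and 'Folder Depth' might work together during file discovery, here is a minimal sketch using only the standard library. It is not the operator's actual implementation. The function name `find_documents`, the convention that depth 1 means "files directly in the root folder", and the case-insensitive matching are all assumptions made for this example.

```python
import fnmatch
import os
from typing import List

def find_documents(root: str,
                   patterns: str = "*.htm *.html *.py",
                   max_depth: int = 3) -> List[str]:
    """Collect files under `root` that match any space-separated glob.

    Assumption: depth 1 is the root folder itself, depth 2 its direct
    subfolders, and so on. Matching ignores case (also an assumption).
    """
    globs = [g.lower() for g in patterns.split()]
    root = os.path.abspath(root)
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 1 if rel == os.curdir else rel.count(os.sep) + 2
        if depth >= max_depth:
            dirnames[:] = []  # prune: do not descend below the depth limit
        for name in filenames:
            if any(fnmatch.fnmatch(name.lower(), g) for g in globs):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Pruning `dirnames` in place is the standard way to stop `os.walk` from descending further, so deep folder trees are never scanned beyond the limit.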

Callbacks

Available Callbacks:

onParseStart
onParseComplete
onFileParsed
onFileSkipped
onAnalyzeComplete
onError
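
As a rough illustration, a callbacks script could define functions with these names to track a parse run. This is a sketch only: the single `info` argument and whatever it contains are assumptions not confirmed by this page, so check the callbacks template that ships with the operator for the real signatures. Only a subset of the callbacks is shown.

```python
# Hypothetical callback stubs. The `info` argument is an assumption.
parse_log = []

def onParseStart(info):
    parse_log.clear()
    parse_log.append(("start", info))

def onFileParsed(info):
    parse_log.append(("parsed", info))

def onFileSkipped(info):
    parse_log.append(("skipped", info))

def onError(info):
    parse_log.append(("error", info))

def onParseComplete(info):
    # Summarize the events collected since onParseStart.
    counts = {kind: 0 for kind in ("parsed", "skipped", "error")}
    for kind, _ in parse_log:
        if kind in counts:
            counts[kind] += 1
    summary = "parsed={parsed} skipped={skipped} errors={error}".format(**counts)
    print(summary)
    return summary
```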

Usage Examples

Parsing All HTML Files in a Folder

1. Set 'Document Folder' to the root folder containing HTML files.
2. Set 'File Pattern' (e.g., `*.html *.htm`).
3. Adjust 'Folder Depth' as needed.
4. Pulse 'Parse All Documents'. Monitor 'Progress' and 'Status'.
5. Results appear in the output `index_table` DAT.
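
The steps above can also be driven from a script. The sketch below assumes the operator is reachable as `op('source_docs')`, as in the parameter reference. The helper name `parse_folder` and the folder path in the usage comment are placeholders for illustration.

```python
def parse_folder(source_docs, folder, pattern="*.html *.htm", depth=3):
    """Configure a Source Docs operator and start a full parse.

    `source_docs` is expected to be the operator returned by
    op('source_docs') inside TouchDesigner.
    """
    source_docs.par.Documentfolder = folder
    source_docs.par.Filepattern = pattern
    source_docs.par.Folderdepth = depth
    source_docs.par.Parseall.pulse()  # same as pulsing 'Parse All Documents'

# Inside TouchDesigner (placeholder path):
# parse_folder(op('source_docs'), 'C:/docs/html')
# print(op('source_docs').par.Status.eval(), op('source_docs').par.Progress.eval())
```

Because 'Parse All Documents' runs asynchronously, the call returns before parsing finishes; read 'Status' and 'Progress' later to follow the run.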

Parsing a Single Python File

1. Set 'Preview File' to the target `.py` file.
2. Pulse 'Parse Single File'.
3. The parsed code (as text) will be added to the `index_table`.

Customizing HTML Section Parsing

1. Set 'Preview File' to a representative HTML file.
2. Pulse 'Analyze Document Structure' on the Single/Preview page.
3. Go to the 'DocConfig' page. Toggle the dynamically generated parameters (e.g., `Include Maincontent`, `Include Sidebar`) to select desired sections.
4. (Optional) Toggle 'Include Unmatched Sections' based on preference.
5. Pulse 'Parse Single File' to test the configuration.
6. If satisfied, use 'Parse All Documents' to apply the config to all matching files.

Technical Notes

Parsing multiple documents (Parse All Documents) runs asynchronously via ChatTD.
HTML parsing uses BeautifulSoup4 to extract text content. Structure analysis generates CSS selectors to identify sections.
Python file parsing currently extracts the entire code content as text.
The output index_table is formatted for direct use with the Rag Index LOP.
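Large numbers of files or very large files can take time to process. Monitor 'Progress' and 'Status' during long runs, and pulse 'Stop Parsing' to halt a run in progress.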
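
To make the HTML note above more concrete, here is a small BeautifulSoup4 sketch that extracts text per top-level section and labels each section with a simple CSS selector. It is an illustration under assumptions, not the operator's own code: the helper names `css_selector` and `extract_sections` are hypothetical, and the selectors the operator actually generates may be built differently.

```python
from bs4 import BeautifulSoup

def css_selector(tag):
    """Build a simple selector from a tag's id or classes (illustrative)."""
    if tag.get("id"):
        return f"{tag.name}#{tag['id']}"
    classes = tag.get("class") or []
    if classes:
        return tag.name + "".join(f".{c}" for c in classes)
    return tag.name

def extract_sections(html):
    """Map a selector for each top-level body element to its text content."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body or soup  # fragments may have no <body>
    sections = {}
    for child in body.find_all(True, recursive=False):
        text = child.get_text(separator=" ", strip=True)
        if text:
            # Later elements with the same selector overwrite earlier ones.
            sections[css_selector(child)] = text
    return sections
```

Keying sections by selector mirrors the idea behind the 'Include ...' toggles: each detected section can be kept or dropped independently before its text reaches the index.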