Source Docs

v1.0.0

The Source Docs LOP parses local files into structured index tables for use with RAG pipelines. It supports HTML, Python, and generic text files, and splits each file into sections according to its format. For HTML files, it can also analyze a document to identify its sections, so you can toggle which ones to include before parsing.

Requirements

beautifulsoup4 — required for HTML parsing. On first use, the operator will prompt you to install it automatically through ChatTD’s Python manager.

Input/Output

Inputs

None — documents are loaded from the local filesystem via parameters.

Outputs

index_table — a DAT containing one row per parsed document with columns: doc_id, filename, source_path, content, metadata, and timestamp. This format is compatible with the RAG Index LOP for direct indexing.
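As a rough illustration of consuming this output, the sketch below turns a table with the columns listed above into a list of dictionaries. It assumes the first row of the table holds the column names, which this page does not confirm; `table_to_records` and `COLUMNS` are hypothetical names used only for this example.

```python
COLUMNS = ["doc_id", "filename", "source_path", "content", "metadata", "timestamp"]


def table_to_records(rows):
    """Convert a header row plus data rows into dicts keyed by column name.

    rows: list of rows, each a list of strings; the first row is assumed
    to be the header (an assumption, not documented behavior).
    """
    if not rows:
        return []
    header, *data = rows
    missing = [column for column in COLUMNS if column not in header]
    if missing:
        raise ValueError(f"index table is missing columns: {missing}")
    return [dict(zip(header, row)) for row in data]
```

Inside TouchDesigner, the rows could be read from the output DAT with something like `[[cell.val for cell in row] for row in dat.rows()]`, where `dat` refers to the index_table DAT.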

How Parsing Works

The operator detects file type by extension and applies an appropriate parsing strategy:

HTML / HTM — uses BeautifulSoup to locate the main content area (<article>, <main>, or content <div>), then extracts text organized by header sections (h1-h4). Section toggles on the DocConfig page control which sections are included.
Python (.py) — extracts module docstrings, class definitions, and function definitions as separate sections, preserving code structure.
All other files — splits content into sections by double newlines (paragraph breaks).
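The strategies above can be approximated with the standard library alone. This is a minimal sketch, not the operator's actual implementation: the function names are hypothetical, the Python strategy only looks at top-level classes and functions, and HTML handling is left out because it depends on BeautifulSoup.

```python
import ast
from pathlib import Path


def pick_strategy(filename):
    """Choose a parsing strategy from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in (".html", ".htm"):
        return "html"
    if suffix == ".py":
        return "python"
    return "text"


def split_paragraphs(text):
    """Generic strategy: one section per block separated by a blank line."""
    normalized = text.replace("\r\n", "\n")
    return [block.strip() for block in normalized.split("\n\n") if block.strip()]


def split_python(source):
    """Python strategy: module docstring, then each top-level class and function."""
    tree = ast.parse(source)
    sections = []
    docstring = ast.get_docstring(tree)
    if docstring:
        sections.append(("module docstring", docstring))
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            sections.append((f"class {node.name}", ast.get_source_segment(source, node)))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            sections.append((f"function {node.name}", ast.get_source_segment(source, node)))
    return sections
```

Keeping each class or function as its own section means a retrieval hit returns a complete definition rather than an arbitrary slice of the file.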

Usage Examples

Parsing a Folder of HTML Files

On the Parser page, set Document Folder to the root folder containing your files.
Enter a File Pattern such as *.html *.htm to match specific file types. Multiple patterns are space-separated.
Set Folder Depth to control how deep to scan. 1 scans only the specified folder; higher values include subfolders.
Pulse Parse All Documents [ slow ] to begin. The Progress bar and Current Status fields update as files are processed.
Parsing runs asynchronously with multiple workers. You can pulse Stop Parsing at any time to halt the process.
Results accumulate in the output index_table DAT.
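To make the File Pattern and Folder Depth semantics concrete, here is a small standard-library sketch of a scan that behaves as described in the steps above. It is an approximation under stated assumptions: `find_documents` is a hypothetical helper, and the operator's real matching rules (for example case sensitivity) may differ.

```python
import fnmatch
import os


def find_documents(root, patterns, depth):
    """Return sorted paths under root that match any space-separated pattern.

    depth=1 scans only root; each additional level descends one more
    layer of subfolders.
    """
    globs = patterns.split()
    root = os.path.abspath(root)
    base_level = root.rstrip(os.sep).count(os.sep)
    matches = []
    for folder, subdirs, files in os.walk(root):
        level = folder.count(os.sep) - base_level + 1
        if level >= depth:
            subdirs[:] = []  # prune in place so os.walk does not descend further
        for name in files:
            if any(fnmatch.fnmatch(name, pattern) for pattern in globs):
                matches.append(os.path.join(folder, name))
    return sorted(matches)
```

Running a scan like this before a batch parse is a cheap way to check how many files a given pattern and depth would pick up.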

Parsing a Single File

On the Parser page, set Preview File to the file you want to parse.
Pulse Parse Single File.
The parsed content appears in the index_table. Any previous single-file result is cleared before the new one is added.
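The same steps can be driven from a script. The sketch below uses the parameter names listed in the Parameters section (Previewfile, Parsefile); the helper name `parse_single_file` is hypothetical, and the code only does something useful when it is handed a real operator inside TouchDesigner.

```python
def parse_single_file(source_docs, file_path):
    """Point Preview File at file_path, then pulse Parse Single File.

    source_docs: the Source Docs operator, e.g. op('source_docs') in a
    TouchDesigner script (the operator path is an assumption).
    """
    source_docs.par.Previewfile = file_path
    source_docs.par.Parsefile.pulse()
```

Inside TouchDesigner, a call might look like `parse_single_file(op('source_docs'), 'docs/page.html')`, where the file path is a placeholder.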

Customizing HTML Sections with DocConfig

Set Preview File to a representative HTML document.
Switch to the DocConfig page and pulse Analyze Document Structure.
The operator scans the HTML and creates a toggle for each detected section, labeled with a preview of that section’s content. All sections are enabled by default.
Disable any sections you want to exclude from parsing.
Toggle Include Unmatched Sections to control whether sections not matched by the analysis are still included.
Return to the Parser page and pulse Parse Single File to test the configuration, or Parse All Documents to apply it across all matching files.

If Auto Update is enabled on the DocConfig page, changes to section toggles are applied automatically.
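The interaction between the section toggles and Include Unmatched Sections can be pictured as a simple filter. This is a sketch only: it assumes sections are matched to toggles by title, which this page does not specify, and `select_sections` is a hypothetical name.

```python
def select_sections(sections, toggles, include_unmatched):
    """Keep sections whose toggle is on; unmatched ones follow include_unmatched.

    sections: list of (title, text) pairs in document order.
    toggles: dict mapping a section title to True (enabled) or False (disabled).
    include_unmatched: whether to keep sections that have no toggle at all.
    """
    kept = []
    for title, text in sections:
        if title in toggles:
            if toggles[title]:
                kept.append((title, text))
        elif include_unmatched:
            kept.append((title, text))
    return kept
```

With this model, a section that was not present in the analyzed preview file is never dropped silently: it is either always kept or always skipped, depending on the one flag.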

Browsing Parsed Results

Use the Display menu on the Parser page to switch between viewing the full index table and the content of individual documents. When set to Content, use Select Doc to step through parsed documents.

Best Practices

Start with a single file to verify your section configuration before running a batch parse on an entire folder.
For large document sets, keep Folder Depth as low as your folder layout allows, so the scan skips directories you do not need.
Use Clear Tables to reset the index table and DocConfig section toggles before starting a new parsing session.
The index_table output is designed to wire directly into a RAG Index LOP for embedding and retrieval.

Troubleshooting

“Missing dependencies” status — beautifulsoup4 is not installed. The operator will show an install prompt when you trigger any parse action. Accept the prompt and the operator will reinitialize after installation.
Parsing appears stalled — if no progress is made for the duration set in Max Stall Time, the operator auto-stops. Check the Logger for error details on individual files that failed to process.
No content extracted from HTML — the parser looks for content inside <article>, <main>, or <div class="content"> / <div id="content"> elements. If your HTML does not use these containers, content falls back to <body>. Ensure the HTML has recognizable header tags (h1-h4) for section detection.
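When an HTML file yields no content, it can help to check which of these containers and header tags the file actually has. The probe below is a diagnostic sketch that uses Python's standard-library `html.parser` instead of BeautifulSoup, so it runs without the dependency. It reports the first candidate container in document order; the operator's real precedence among containers is not documented here, and the names `ContainerProbe` and `probe_html` are hypothetical.

```python
from html.parser import HTMLParser


class ContainerProbe(HTMLParser):
    """Records candidate content containers and h1-h4 headers as they appear."""

    def __init__(self):
        super().__init__()
        self.containers = []
        self.headers = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("article", "main"):
            self.containers.append(tag)
        elif tag == "div":
            classes = (attr.get("class") or "").split()
            if "content" in classes or attr.get("id") == "content":
                self.containers.append("div.content")
        elif tag in ("h1", "h2", "h3", "h4"):
            self.headers.append(tag)


def probe_html(html):
    """Summarize which container would likely be used and which headers exist."""
    probe = ContainerProbe()
    probe.feed(html)
    return {
        "container": probe.containers[0] if probe.containers else "body (fallback)",
        "headers": probe.headers,
    }
```

An empty `headers` list suggests the file has no h1-h4 tags for section detection, which matches the last symptom described above.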

Parameters

Parser

Document Folder (Documentfolder) op('source_docs').par.Documentfolder Folder

Default: "" (Empty String)

Folder Depth (Folderdepth) op('source_docs').par.Folderdepth Int

1 = Current folder only, 2 = One level deep, etc.

Default: 0
Range: 1 to 10
Slider Range: 1 to 10

File Pattern (Filepattern) op('source_docs').par.Filepattern Str

Default: "" (Empty String)

Current Status (Status) op('source_docs').par.Status Str

Default: "" (Empty String)

Progress (Progress) op('source_docs').par.Progress Float

Default: 0.0
Range: 0 to 1
Slider Range: 0 to 100

Caution: Exposing the viewer of large index tables will be heavy Header

Parse All Documents [ slow ] (Parseall) op('source_docs').par.Parseall Pulse

Default: False

Stop Parsing (Stopparsing) op('source_docs').par.Stopparsing Pulse

Default: False

Max Stall Time (Maxstalltime) op('source_docs').par.Maxstalltime Float

Default: 0.0
Range: 0 to 1
Slider Range: 0 to 1

Parse Single File (Parsefile) op('source_docs').par.Parsefile Pulse

Default: False

Preview File (Previewfile) op('source_docs').par.Previewfile File

Default: "" (Empty String)

Clear Tables (Clear) op('source_docs').par.Clear Pulse

Default: False

Display File (Displayfile) op('source_docs').par.Displayfile Str

Default: "" (Empty String)

Select Doc (Selectdoc) op('source_docs').par.Selectdoc Int

Default: 1
Range: 0 to 1
Slider Range: 1 to 0

Header

DocConfig

Analyze Document Structure (Analyzedoc) op('source_docs').par.Analyzedoc Pulse

Default: False

Auto Update (Autoupdate) op('source_docs').par.Autoupdate Toggle

Default: False

Include Unmatched Sections (Includemissing) op('source_docs').par.Includemissing Toggle

Default: False

Sections Header

Changelog

v1.0.0 (2024-11-06)

Initial release