Source Docs

v1.0.0

The Source Docs LOP parses local files into structured index tables for use with RAG pipelines. It supports HTML, Python, and generic text files, and extracts content in a way that follows each document's structure. For HTML files, it can analyze the document to identify sections, letting you toggle which sections to include before parsing.

  • beautifulsoup4 — required for HTML parsing. On first use, the operator will prompt you to install it automatically through ChatTD’s Python manager.

The operator takes no inputs. Documents are loaded from the local filesystem via parameters.

  • index_table — a DAT containing one row per parsed document with columns: doc_id, filename, source_path, content, metadata, and timestamp. This format is compatible with the RAG Index LOP for direct indexing.
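
As a rough illustration of that layout, the sketch below turns a header row plus data rows into dictionaries keyed by the column names listed above. It works on plain lists of strings. The helper name `rows_to_records` is hypothetical, and how the rows are read out of the DAT is left open.

```python
# Column names as listed for index_table in this document.
INDEX_COLUMNS = ["doc_id", "filename", "source_path", "content", "metadata", "timestamp"]


def rows_to_records(rows):
    """Convert a table (first row = header) into a list of dicts, one per document."""
    if not rows:
        return []
    header, *body = rows
    # Fail early if the table does not carry the expected columns.
    missing = [name for name in INDEX_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [dict(zip(header, row)) for row in body]
```

Inside TouchDesigner, an expression along the lines of `[[cell.val for cell in row] for row in dat.rows()]` should produce the list-of-lists input, where `dat` is a reference to the output table.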

The operator detects file type by extension and applies an appropriate parsing strategy:

  • HTML / HTM — uses BeautifulSoup to locate the main content area (<article>, <main>, or content <div>), then extracts text organized by header sections (h1-h4). Section toggles on the DocConfig page control which sections are included.
  • Python (.py) — extracts module docstrings, class definitions, and function definitions as separate sections, preserving code structure.
  • All other files — splits content into sections by double newlines (paragraph breaks).
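
The strategies above can be approximated in a few lines of standard-library Python. This is a minimal sketch of the idea, not the operator's actual code: the function names are hypothetical, the Python strategy is assumed to look only at top-level classes and functions, and the HTML branch is only selected here, not implemented.

```python
import ast
from pathlib import Path


def pick_strategy(path):
    """Choose a parsing strategy from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix in (".html", ".htm"):
        return "html"
    if suffix == ".py":
        return "python"
    return "text"


def split_paragraphs(text):
    """Generic strategy: one section per block separated by a blank line."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def split_python(source):
    """Python strategy: module docstring, then each top-level class and function."""
    tree = ast.parse(source)
    sections = []
    docstring = ast.get_docstring(tree)
    if docstring:
        sections.append(("module_docstring", docstring))
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            sections.append(("class " + node.name, ast.get_source_segment(source, node)))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            sections.append(("def " + node.name, ast.get_source_segment(source, node)))
    return sections
```

Keeping each class or function as its own section means a retrieval hit returns a complete definition rather than an arbitrary slice of the file.
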
  1. On the Parser page, set Document Folder to the root folder containing your files.
  2. Enter a File Pattern such as *.html *.htm to match specific file types. Multiple patterns are space-separated.
  3. Set Folder Depth to control how deep to scan. 1 scans only the specified folder; higher values include subfolders.
  4. Pulse Parse All Documents [ slow ] to begin. The Progress bar and Current Status fields update as files are processed.
  5. Parsing runs asynchronously with multiple workers. You can pulse Stop Parsing at any time to halt the process.
  6. Results accumulate in the output index_table DAT.
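
To preview which files a given File Pattern and Folder Depth would pick up, the matching rules in steps 2 and 3 can be sketched as below. This is an approximation under stated assumptions: the helper `find_files` is hypothetical, patterns are assumed to be shell-style globs matched against file names only, and depth 1 means the root folder alone.

```python
import fnmatch
import os


def find_files(root, pattern_string, depth):
    """Return files under root matching any space-separated pattern, down to depth."""
    patterns = pattern_string.split()
    matches = []
    for folder, subfolders, files in os.walk(root):
        subfolders.sort()  # deterministic traversal order
        rel = os.path.relpath(folder, root)
        # The root folder is level 1, its direct children are level 2, and so on.
        level = 1 if rel == os.curdir else rel.count(os.sep) + 2
        if level >= depth:
            subfolders[:] = []  # stop os.walk from descending further
        for name in sorted(files):
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                matches.append(os.path.join(folder, name))
    return matches
```

For example, `find_files("docs", "*.html *.htm", 2)` would list HTML files in `docs` and in its immediate subfolders, which is a cheap way to check the scope of a batch before pulsing Parse All Documents.
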
  1. On the Parser page, set Preview File to the file you want to parse.
  2. Pulse Parse Single File.
  3. The parsed content appears in the index_table. Each single-file parse clears any previous single-file result before adding the new one.
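
The same steps can be driven from a script. The sketch below assumes the operator is reachable as `op('source_docs')`, as in the parameter expressions in this document, and uses the parameter names listed there together with the `.val` and `.pulse()` members of TouchDesigner parameters. The two helper functions are hypothetical names introduced for illustration.

```python
def parse_single_file(source_docs, file_path):
    """Set Preview File, then pulse Parse Single File."""
    source_docs.par.Previewfile.val = file_path
    source_docs.par.Parsefile.pulse()


def start_batch_parse(source_docs, folder, pattern, depth=1):
    """Configure the batch parameters, then pulse Parse All Documents."""
    source_docs.par.Documentfolder.val = folder
    source_docs.par.Filepattern.val = pattern
    source_docs.par.Folderdepth.val = depth
    source_docs.par.Parseall.pulse()
```

Inside TouchDesigner this would be called as `parse_single_file(op('source_docs'), 'docs/index.html')`. Passing the operator in as an argument keeps the helpers independent of where the component sits in the network.
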
  1. Set Preview File to a representative HTML document.
  2. Switch to the DocConfig page and pulse Analyze Document Structure.
  3. The operator scans the HTML and creates a toggle for each detected section, labeled with a preview of that section’s content. All sections are enabled by default.
  4. Disable any sections you want to exclude from parsing.
  5. Toggle Include Unmatched Sections to control whether sections not matched by the analysis are still included.
  6. Return to the Parser page and pulse Parse Single File to test the configuration, or Parse All Documents to apply it across all matching files.

If Auto Update is enabled on the DocConfig page, changes to section toggles are applied automatically.

Use the Display menu on the Parser page to switch between viewing the full index table and the content of individual documents. When set to Content, use Select Doc to step through parsed documents.

  • Start with a single file to verify your section configuration before running a batch parse on an entire folder.
  • For large document sets, keep Folder Depth as shallow as needed to avoid scanning unnecessary directories.
  • Use Clear Tables to reset the index table and DocConfig section toggles before starting a new parsing session.
  • The index_table output is designed to wire directly into a RAG Index LOP for embedding and retrieval.
  • “Missing dependencies” status — beautifulsoup4 is not installed. The operator will show an install prompt when you trigger any parse action. Accept the prompt and the operator will reinitialize after installation.
  • Parsing appears stalled — if no progress is made for the duration set in Max Stall Time, the operator auto-stops. Check the Logger for error details on individual files that failed to process.
  • No content extracted from HTML — the parser looks for content inside <article>, <main>, or <div class="content"> / <div id="content"> elements. If your HTML does not use these containers, content falls back to <body>. Ensure the HTML has recognizable header tags (h1-h4) for section detection.
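
When an HTML file yields no content, a quick structural check outside the operator can help. The sketch below reports the first content container it encounters and the h1-h4 headers it finds. It deliberately uses Python's built-in `html.parser` instead of BeautifulSoup so it runs without extra installs, so it only approximates the lookup described above; in particular, the container it reports is the first one in document order, which may differ from the operator's own priority. The names `StructureProbe` and `probe_html` are hypothetical.

```python
from html.parser import HTMLParser


class StructureProbe(HTMLParser):
    """Records content containers and h1-h4 headers seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.containers = []
        self.headers = []
        self._open_header = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag in ("article", "main"):
            self.containers.append(tag)
        elif tag == "div" and ("content" in classes or attrs.get("id") == "content"):
            self.containers.append("div (content)")
        elif tag in ("h1", "h2", "h3", "h4"):
            self._open_header = tag
            self.headers.append([tag, ""])

    def handle_endtag(self, tag):
        if tag == self._open_header:
            self._open_header = None

    def handle_data(self, data):
        # Collect text only while inside a header element.
        if self._open_header:
            self.headers[-1][1] += data


def probe_html(html):
    """Summarize the container and header structure of an HTML string."""
    probe = StructureProbe()
    probe.feed(html)
    probe.close()
    return {
        "container": probe.containers[0] if probe.containers else "body (fallback)",
        "headers": [(tag, " ".join(text.split())) for tag, text in probe.headers],
    }
```

An empty `headers` list suggests the file has no header tags for section detection to work with.
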
Document Folder (Documentfolder) op('source_docs').par.Documentfolder Folder
  Default: "" (Empty String)
Folder Depth (Folderdepth) op('source_docs').par.Folderdepth Int
  1 = Current folder only, 2 = One level deep, etc.
  Default: 0
  Range: 1 to 10
  Slider Range: 1 to 10
File Pattern (Filepattern) op('source_docs').par.Filepattern Str
  Default: "" (Empty String)
Current Status (Status) op('source_docs').par.Status Str
  Default: "" (Empty String)
Progress (Progress) op('source_docs').par.Progress Float
  Default: 0.0
  Range: 0 to 1
  Slider Range: 0 to 100
Caution: Exposing the viewer of large index tables will be heavy Header
Parse All Documents [ slow ] (Parseall) op('source_docs').par.Parseall Pulse
  Default: False
Stop Parsing (Stopparsing) op('source_docs').par.Stopparsing Pulse
  Default: False
Max Stall Time (Maxstalltime) op('source_docs').par.Maxstalltime Float
  Default: 0.0
  Range: 0 to 1
  Slider Range: 0 to 1
Parse Single File (Parsefile) op('source_docs').par.Parsefile Pulse
  Default: False
Preview File (Previewfile) op('source_docs').par.Previewfile File
  Default: "" (Empty String)
Clear Tables (Clear) op('source_docs').par.Clear Pulse
  Default: False
Display (Display) op('source_docs').par.Display Menu
  Default: index
  Options: index, content
Display File (Displayfile) op('source_docs').par.Displayfile Str
  Default: "" (Empty String)
Select Doc (Selectdoc) op('source_docs').par.Selectdoc Int
  Default: 1
  Range: 0 to 1
  Slider Range: 1 to 0
Header
Analyze Document Structure (Analyzedoc) op('source_docs').par.Analyzedoc Pulse
  Default: False
Auto Update (Autoupdate) op('source_docs').par.Autoupdate Toggle
  Default: False
Include Unmatched Sections (Includemissing) op('source_docs').par.Includemissing Toggle
  Default: False
Sections Header
v1.0.0 (2024-11-06)

Initial release