Source Docs
The Source Docs LOP parses local files into structured index tables for use with RAG pipelines. It supports HTML, Python, and generic text files, with intelligent content extraction that respects document structure. For HTML files, it can analyze the document to identify sections, letting you toggle which sections to include before parsing.
Requirements
- beautifulsoup4: required for HTML parsing. On first use, the operator prompts you to install it automatically through ChatTD’s Python manager.
Input/Output
Inputs
None. Documents are loaded from the local filesystem via parameters.
Outputs
- index_table: a DAT containing one row per parsed document, with the columns doc_id, filename, source_path, content, metadata, and timestamp. This format is compatible with the RAG Index LOP for direct indexing.
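To make the row layout concrete, here is a minimal sketch of one index_table-style row built in plain Python. Only the six column names come from this page. The value formats (a hashed doc_id, JSON-encoded metadata, an ISO 8601 timestamp) and the helper name make_index_row are assumptions for illustration, not the operator's actual encoding.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Column names as listed for the index_table output.
COLUMNS = ["doc_id", "filename", "source_path", "content", "metadata", "timestamp"]

def make_index_row(path, content, metadata=None):
    """Build one row as a list of strings, ordered like COLUMNS (hypothetical formats)."""
    p = Path(path)
    row = {
        # Assumed ID scheme: short hash of the path.
        "doc_id": hashlib.sha1(p.as_posix().encode("utf-8")).hexdigest()[:12],
        "filename": p.name,
        "source_path": p.as_posix(),
        "content": content,
        # Assumed encoding: metadata stored as a JSON string.
        "metadata": json.dumps(metadata or {}, sort_keys=True),
        # Assumed format: UTC timestamp in ISO 8601.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return [row[column] for column in COLUMNS]

row = make_index_row("docs/guide/intro.html", "Intro text", {"sections": 3})
```

Keeping every cell a string mirrors how table DATs store values, which is why metadata is serialized rather than kept as a dict in this sketch.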
How Parsing Works
The operator detects the file type by extension and applies a matching parsing strategy:
- HTML / HTM: uses BeautifulSoup to locate the main content area (<article>, <main>, or a content <div>), then extracts text organized by header sections (h1-h4). Section toggles on the DocConfig page control which sections are included.
- Python (.py): extracts module docstrings, class definitions, and function definitions as separate sections, preserving code structure.
- All other files: splits content into sections at double newlines (paragraph breaks).
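The extension-based dispatch and the two non-HTML strategies can be sketched with the standard library alone. This is an approximation under stated assumptions, not the operator's code: the function names are hypothetical, the Python strategy here looks only at top-level classes and functions, and the HTML branch is left as a label because it depends on BeautifulSoup.

```python
import ast
from pathlib import Path

def pick_strategy(filename):
    """Choose a parsing strategy from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in (".html", ".htm"):
        return "html"
    if suffix == ".py":
        return "python"
    return "text"

def split_text_sections(text):
    """Generic strategy: split on double newlines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]

def split_python_sections(source):
    """Python strategy: module docstring, then each top-level class and function."""
    tree = ast.parse(source)
    sections = []
    docstring = ast.get_docstring(tree)
    if docstring:
        sections.append(("module docstring", docstring))
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            label = f"class {node.name}"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            label = f"function {node.name}"
        else:
            continue
        # get_source_segment returns the exact source text of the node.
        sections.append((label, ast.get_source_segment(source, node)))
    return sections
```

Using ast rather than regular expressions keeps each class or function body intact as one section, which matches the stated goal of preserving code structure.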
Usage Examples
Parsing a Folder of HTML Files
- On the Parser page, set Document Folder to the root folder containing your files.
- Enter a File Pattern such as *.html *.htm to match specific file types. Multiple patterns are space-separated.
- Set Folder Depth to control how deep to scan. 1 scans only the specified folder; higher values include subfolders.
- Pulse Parse All Documents [ slow ] to begin. The Progress bar and Current Status fields update as files are processed.
- Parsing runs asynchronously with multiple workers. You can pulse Stop Parsing at any time to halt the process.
- Results accumulate in the output index_table DAT.
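To preview which files a given File Pattern and Folder Depth would pick up, the scan can be approximated outside TouchDesigner. The sketch below assumes the semantics described in the steps above (space-separated glob patterns, depth 1 meaning the folder itself). The function name find_documents is hypothetical, and the operator's own matching rules (for example case sensitivity or an empty pattern) may differ.

```python
import fnmatch
import os
import tempfile
from pathlib import Path

def find_documents(root, pattern_string, depth=1):
    """Return files under root matching any space-separated glob pattern.

    depth=1 scans only root; each extra level admits one more layer of subfolders.
    """
    root = Path(root)
    patterns = pattern_string.split()  # "*.html *.htm" -> ["*.html", "*.htm"]
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        level = len(Path(dirpath).relative_to(root).parts) + 1  # root is level 1
        if level >= depth:
            dirnames[:] = []  # prune: stop os.walk from descending further
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                matches.append(Path(dirpath) / name)
    return sorted(matches)

# Small demonstration on a throwaway folder tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for relative in ("a.html", "b.txt", "sub/c.htm"):
        (base / relative).write_text("placeholder", encoding="utf-8")
    shallow = [p.name for p in find_documents(base, "*.html *.htm", depth=1)]
    deep = [p.name for p in find_documents(base, "*.html *.htm", depth=2)]
```

Pruning dirnames in place is what keeps a shallow scan cheap: subfolders beyond the requested depth are never opened.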
Parsing a Single File
- On the Parser page, set Preview File to the file you want to parse.
- Pulse Parse Single File.
- The parsed content appears in the index_table. This clears any previous single-file results before adding the new one.
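Both workflows can also be driven from a script, since each control maps to a parameter listed in the Parameters section. The helpers below are a sketch, assuming the operator is reachable as op('source_docs') and that the code runs inside TouchDesigner. The helper names and the example paths are hypothetical.

```python
def start_folder_parse(source_docs, folder, pattern="*.html *.htm", depth=1):
    """Fill in the Parser page settings, then pulse Parse All Documents."""
    source_docs.par.Documentfolder = folder
    source_docs.par.Filepattern = pattern
    source_docs.par.Folderdepth = depth
    source_docs.par.Parseall.pulse()

def parse_single_file(source_docs, path):
    """Point Preview File at one document, then pulse Parse Single File."""
    source_docs.par.Previewfile = path
    source_docs.par.Parsefile.pulse()

# Inside TouchDesigner, for example from a Text DAT or the textport
# (paths are placeholders):
# start_folder_parse(op('source_docs'), 'C:/docs/site')
# parse_single_file(op('source_docs'), 'C:/docs/site/index.html')
```

Passing the operator in as an argument keeps the helpers independent of where the Source Docs LOP lives in your network.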
Customizing HTML Sections with DocConfig
- Set Preview File to a representative HTML document.
- Switch to the DocConfig page and pulse Analyze Document Structure.
- The operator scans the HTML and creates a toggle for each detected section, labeled with a preview of that section’s content. All sections are enabled by default.
- Disable any sections you want to exclude from parsing.
- Toggle Include Unmatched Sections to control whether sections not matched by the analysis are still included.
- Return to the Parser page and pulse Parse Single File to test the configuration, or Parse All Documents to apply it across all matching files.
If Auto Update is enabled on the DocConfig page, changes to section toggles are applied automatically.
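The interaction between the per-section toggles and Include Unmatched Sections can be pictured as a simple filter. This sketch assumes sections are identified by title and that toggles are a mapping from title to on/off. The operator's internal matching is not documented here, so treat the function name filter_sections and the data shapes as illustrative only.

```python
def filter_sections(sections, toggles, include_unmatched=False):
    """Keep sections whose toggle is on.

    sections: list of (title, text) pairs from a parsed document.
    toggles: dict mapping a section title to True (include) or False (exclude).
    include_unmatched: whether to keep sections that have no toggle at all.
    """
    kept = []
    for title, text in sections:
        if title in toggles:
            if toggles[title]:
                kept.append((title, text))
        elif include_unmatched:
            kept.append((title, text))
    return kept

sections = [("Intro", "a"), ("Nav", "b"), ("Footer", "c")]
toggles = {"Intro": True, "Nav": False}  # "Footer" was not seen during analysis
strict = filter_sections(sections, toggles)
lenient = filter_sections(sections, toggles, include_unmatched=True)
```

With the toggle map built from one representative file, include_unmatched decides what happens to sections that appear only in other documents of the batch.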
Browsing Parsed Results
Use the Display menu on the Parser page to switch between viewing the full index table and the content of individual documents. When set to Content, use Select Doc to step through parsed documents.
Best Practices
- Start with a single file to verify your section configuration before running a batch parse on an entire folder.
- For large document sets, keep Folder Depth as shallow as possible to avoid scanning unnecessary directories.
- Use Clear Tables to reset the index table and DocConfig section toggles before starting a new parsing session.
- The index_table output is designed to wire directly into a RAG Index LOP for embedding and retrieval.
Troubleshooting
- “Missing dependencies” status: beautifulsoup4 is not installed. The operator shows an install prompt when you trigger any parse action. Accept the prompt and the operator reinitializes after installation.
- Parsing appears stalled: if no progress is made for the duration set in Max Stall Time, the operator auto-stops. Check the Logger for error details on individual files that failed to process.
- No content extracted from HTML: the parser looks for content inside <article>, <main>, or <div class="content"> / <div id="content"> elements. If your HTML does not use these containers, content falls back to <body>. Ensure the HTML has recognizable header tags (h1-h4) for section detection.
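When an HTML file yields no content, a quick check of which h1-h4 headers it actually contains can help. The sketch below deliberately uses Python's built-in html.parser as a stand-in for BeautifulSoup so it runs without extra installs. It is not the operator's parser: it ignores the container lookup, drops text that appears before the first header, and does not filter script or style text. The names HeaderSectionParser and split_by_headers are hypothetical.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4"}

class HeaderSectionParser(HTMLParser):
    """Collect [title, body] pairs, starting a new section at each h1-h4."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self.sections.append(["", ""])
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and anything before the first header
        index = 0 if self._in_header else 1  # 0 = title, 1 = body
        current = self.sections[-1]
        current[index] = (current[index] + " " + text).strip()

def split_by_headers(html):
    """Return a list of (title, body) tuples, one per detected header."""
    parser = HeaderSectionParser()
    parser.feed(html)
    parser.close()
    return [(title, body) for title, body in parser.sections]

sample_html = (
    "<main><p>Preamble</p><h1>Intro</h1><p>Hello world.</p>"
    "<h2>Setup</h2><p>Install it.</p></main>"
)
found = split_by_headers(sample_html)
```

An empty result for a real file suggests the document has no h1-h4 tags, which is the situation the last troubleshooting item describes.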
Parameters
Parser
op('source_docs').par.Documentfolder (Folder)
- Default: "" (Empty String)
op('source_docs').par.Folderdepth (Int): 1 = Current folder only, 2 = One level deep, etc.
- Default: 0
- Range: 1 to 10
- Slider Range: 1 to 10
op('source_docs').par.Filepattern (Str)
- Default: "" (Empty String)
op('source_docs').par.Status (Str)
- Default: "" (Empty String)
op('source_docs').par.Progress (Float)
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 100
op('source_docs').par.Parseall (Pulse)
- Default: False
op('source_docs').par.Stopparsing (Pulse)
- Default: False
op('source_docs').par.Maxstalltime (Float)
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('source_docs').par.Parsefile (Pulse)
- Default: False
op('source_docs').par.Previewfile (File)
- Default: "" (Empty String)
op('source_docs').par.Clear (Pulse)
- Default: False
op('source_docs').par.Displayfile (Str)
- Default: "" (Empty String)
op('source_docs').par.Selectdoc (Int)
- Default: 1
- Range: 0 to 1
- Slider Range: 1 to 0
DocConfig
op('source_docs').par.Analyzedoc (Pulse)
- Default: False
op('source_docs').par.Autoupdate (Toggle)
- Default: False
op('source_docs').par.Includemissing (Toggle)
- Default: False
Changelog
v1.0.0 (2024-11-06)
Initial release