Save Sources

The Save Sources LOP is a RAG utility operator that saves each row of an input table to its own Markdown file. It derives filenames from URLs, falls back to a filename column or the doc_id, and supports filename prefixes, overwrite control, and progress reporting. This makes it well suited to exporting scraped content, processed documents, or other tabular data into an organized folder.

Requirements

Input Table: A table DAT with required columns doc_id and content
Output Folder: A valid directory path for saving files
Optional Columns: source_path for URL-based filenames, custom filename column
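
The column requirements above can be checked before saving. The following is a minimal sketch, not the operator's own validation: missing_columns is a hypothetical helper, and the header list is assumed to come from the first row of the input table.

```python
# Hypothetical pre-flight check for the required columns listed above.
REQUIRED_COLUMNS = ('doc_id', 'content')


def missing_columns(header, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the header row."""
    present = {str(name).strip() for name in header}
    return [name for name in required if name not in present]


# Inside TouchDesigner the header could be read from a table DAT, for example
# [c.val for c in op('table1').row(0)], where 'table1' is an assumed name.
print(missing_columns(['doc_id', 'source_path']))
```

An empty result means the table has both required columns; optional columns such as source_path are ignored by this check.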

I/O

Input

Input Table DAT: Table containing data to save as files
- Required Columns: doc_id, content
- Optional Columns: source_path, custom filename column

Output

Markdown Files: Individual .md files saved to the specified output folder
Progress Tracking: Real-time status and progress information
File Statistics: Count of successfully saved files

Parameters

Save Config

Output Folder (Outputfolder) op('save_sources').par.Outputfolder folder

The directory where Markdown files will be saved

Default:: None

Filename Prefix (Optional) (Filenameprefix) op('save_sources').par.Filenameprefix str

Optional prefix to add to the beginning of each saved filename

Default:: None

Filename Column (Optional) (Filenamecolumn) op('save_sources').par.Filenamecolumn str

If URL is not used/available, use this column for filenames. If empty/not found, 'doc_id' is used

Default:: filename

Overwrite Existing Files (Overwrite) op('save_sources').par.Overwrite toggle

If enabled, existing Markdown files with the same name will be overwritten

Default:: false

Save Markdown Files (Savemarkdown) op('save_sources').par.Savemarkdown pulse

Starts the process of saving content from the input DAT to Markdown files

Default:: None

Clear Status (Clearstatus) op('save_sources').par.Clearstatus pulse

Resets the status, progress, and files saved counters

Default:: None

Current Status (Status) op('save_sources').par.Status str

Current operation status and progress information

Default:: None

Progress (%) (Progress) op('save_sources').par.Progress float

Percentage completion of the current save operation

Default:: None

Files Saved (Filessaved) op('save_sources').par.Filessaved int

Number of files successfully saved in the current operation

Default:: None

Use URL for Filename (Useurlasfilename) op('save_sources').par.Useurlasfilename toggle

If enabled, attempts to create a safe filename from the 'source_path' column URL

Default:: false

About

Bypass (Bypass) op('save_sources').par.Bypass toggle

Bypass the operator

Default:: false

Show Built-in Parameters (Showbuiltin) op('save_sources').par.Showbuiltin toggle

Show built-in TouchDesigner parameters

Default:: false

Version (Version) op('save_sources').par.Version str

Current version of the operator

Default:: None

Last Updated (Lastupdated) op('save_sources').par.Lastupdated str

Date of last update

Default:: None

Creator (Creator) op('save_sources').par.Creator str

Operator creator

Default:: None

Website (Website) op('save_sources').par.Website str

Filename Generation Strategy

The operator chooses each filename using three tiers, tried in order:

1. URL-Based Filenames (Primary)

When “Use URL for Filename” is enabled and a source_path column exists:

Parses URLs to extract meaningful path components
Sanitizes special characters and path separators
Handles query parameters for generic pages
Removes common file extensions (.html, .php, etc.)
Truncates to reasonable length (100 characters)
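
The steps above can be sketched in plain Python. This is an illustrative guess at the behavior, not the operator's actual code: url_to_filename, the extension list, and the rule of keeping the full path plus query string when a query is present are all assumptions, chosen so the sketch reproduces the naming examples given in this document.

```python
import re
from urllib.parse import urlparse

# Assumed list of "common" extensions; the operator's real list is not documented.
COMMON_EXTENSIONS = ('.html', '.htm', '.php', '.asp', '.aspx')
MAX_LENGTH = 100  # matches the 100-character limit stated above


def url_to_filename(url, max_length=MAX_LENGTH):
    """Build a filesystem-safe base name (without .md) from a URL."""
    parsed = urlparse(url)
    path = parsed.path.strip('/')
    if parsed.query:
        # Generic pages: keep the query so index.php?id=1 and ?id=2 differ.
        name = f'{path}?{parsed.query}' if path else parsed.query
    else:
        # Strip one common extension, compared case-insensitively.
        for ext in COMMON_EXTENSIONS:
            if path.lower().endswith(ext):
                path = path[:-len(ext)]
                break
        name = path
    if not name:
        # Bare domain such as https://example.com/
        name = parsed.netloc
    # Collapse every run of characters that is not a word character or hyphen.
    name = re.sub(r'[^\w\-]+', '_', name).strip('_')
    return name[:max_length]


print(url_to_filename('https://site.com/docs/tutorial.html'))  # docs_tutorial
```

An empty return value signals that no usable name could be derived, in which case a caller would move on to the next tier.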

2. Fallback Column (Secondary)

If URL generation fails or is disabled:

Uses the specified “Filename Column”
Applies basic sanitization for filesystem safety
Defaults to “filename” column if not specified

3. Document ID (Final Fallback)

If both above methods fail:

Uses the doc_id column value
Ensures every row gets a unique filename
Provides reliable fallback for any table structure
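
The three tiers can be expressed as a short fallback chain. The sketch below assumes each row is available as a dict keyed by column name; pick_base_name and the url_namer callback are hypothetical names, and sanitization is left out to keep the ordering visible.

```python
def pick_base_name(row, use_url=False, filename_column='filename', url_namer=None):
    """Return the base filename for one row, following the three tiers in order."""
    # Tier 1: URL-based name, only when enabled and a name can be derived.
    if use_url and url_namer is not None:
        candidate = url_namer(row.get('source_path') or '')
        if candidate:
            return candidate
    # Tier 2: the configured filename column, if present and non-empty.
    if filename_column:
        candidate = (row.get(filename_column) or '').strip()
        if candidate:
            return candidate
    # Tier 3: doc_id, which every valid input row must have.
    return row['doc_id']


row = {'doc_id': 'document_003', 'content': '### Third Document...',
       'source_path': '', 'filename': 'manual_name'}
print(pick_base_name(row, use_url=True, url_namer=lambda url: ''))  # manual_name
```

Passing the URL logic in as a callback keeps the ordering logic independent of how URLs are turned into names.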

Usage

Basic Setup

Prepare Input Table: Ensure your table DAT has doc_id and content columns
Set Output Folder: Choose or create a directory for the saved files
Configure Filename Strategy: Choose URL-based, column-based, or doc_id naming
Set Overwrite Policy: Decide whether to overwrite existing files
Save Files: Click “Save Markdown Files” to begin the process

Example Table Structure

doc_id          | content                    | source_path                | filename
document_001    | # My Document...           | https://example.com/doc1   | custom_name
document_002    | ## Another Document...     | https://site.com/page2     | another_file
document_003    | ### Third Document...      |                            | manual_name

Advanced Configuration

URL-Based Naming Examples

https://example.com/articles/machine-learning → articles_machine-learning.md
https://site.com/docs/tutorial.html → docs_tutorial.md
https://blog.com/index.php?id=123 → index_php_id_123.md

Custom Prefixes

Prefix: project_ → project_articles_machine-learning.md
Prefix: 2024_ → 2024_docs_tutorial.md
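
As a small illustration of how the prefix combines with the generated name, the sketch below joins the output folder, prefix, base name, and .md extension. build_output_path is a hypothetical helper, not part of the operator.

```python
from pathlib import Path


def build_output_path(folder, base_name, prefix=''):
    """Join folder, optional prefix, base name, and the .md extension."""
    return Path(folder) / f'{prefix}{base_name}.md'


print(build_output_path('output', 'docs_tutorial', prefix='2024_').name)
# 2024_docs_tutorial.md
```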

Integration Examples

With Source Operators

# Chain with source operators
source_crawl = op('source_crawl4ai')
save_sources = op('save_sources')

# Configure save sources to use crawled data
save_sources.par.Outputfolder = 'project/scraped_content'
save_sources.par.Useurlasfilename = True
save_sources.par.Filenameprefix = 'scraped_'
save_sources.par.Overwrite = False

# Save the crawled content
save_sources.par.Savemarkdown.pulse()

With RAG Index

# Save sources before indexing
save_sources = op('save_sources')
rag_index = op('rag_index')

# Configure output folder
save_sources.par.Outputfolder = 'knowledge_base/documents'
save_sources.par.Useurlasfilename = True

# Save files first
save_sources.par.Savemarkdown.pulse()

# Then index the saved files
# (Configure rag_index to read from the same folder)

Batch Processing Workflow

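import time

save_sources = op('save_sources')

# Process multiple source tables
source_tables = ['web_scrape_results', 'document_imports', 'api_responses']

for table_name in source_tables:
    # Configure for each source
    save_sources.par.Outputfolder = f'output/{table_name}'
    save_sources.par.Filenameprefix = f'{table_name}_'

    # Copy the source table into the operator's input table
    save_sources.op('input_table').copy(op(table_name))

    # Reset the status so a previous 'Completed' message is not mistaken
    # for the result of this run, then save files
    save_sources.par.Clearstatus.pulse()
    save_sources.par.Savemarkdown.pulse()

    # Wait for completion, stopping on an error or after a timeout.
    # Note: time.sleep() blocks TouchDesigner's main thread, so this polling
    # only helps if the save does not need new frames to make progress.
    deadline = time.time() + 60  # seconds
    while time.time() < deadline:
        status = save_sources.par.Status.eval()
        if 'Completed' in status or 'Error' in status:
            break
        time.sleep(0.1)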

File Management Features

Overwrite Protection

Overwrite enabled: Replaces existing files that have the same name
Overwrite disabled: Skips files that already exist (default)
Use Case: Leave Overwrite disabled for incremental updates so existing files are never lost
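
The overwrite policy can be sketched as a single write function. This is a minimal illustration under the behavior described above, not the operator's implementation; save_markdown is a hypothetical name, and creating missing parent folders is an assumption.

```python
from pathlib import Path


def save_markdown(path, content, overwrite=False):
    """Write content to path; return True if written, False if skipped."""
    target = Path(path)
    if target.exists() and not overwrite:
        # Existing file is left untouched when overwriting is disabled.
        return False
    # Assumption: missing parent folders are created on demand.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding='utf-8')
    return True
```

Counting the True results across all rows gives a figure comparable to the Files Saved counter, since skipped files are not counted as saved in this sketch.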

Progress Tracking

Real-time Status: Current operation phase and details
Progress Percentage: Completion percentage (0-100%)
Files Saved Counter: Number of successfully saved files
Error Reporting: Detailed error messages for troubleshooting

Filename Sanitization

Path Separators: Converts / and \ to _
Special Characters: Removes the remaining reserved characters < > : " | ? *
Unicode Support: Handles international characters safely
Length Limits: Truncates overly long filenames
Extension Management: Adds .md extension automatically
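
The sanitization rules above can be sketched as follows. This is an approximation for illustration only: sanitize_filename is a hypothetical helper, the 100-character limit is borrowed from the URL naming section, and raising an error on an empty result is an assumption.

```python
import re

MAX_STEM_LENGTH = 100  # assumption: same limit as URL-based names


def sanitize_filename(name, max_length=MAX_STEM_LENGTH):
    """Return a filesystem-safe Markdown filename for the given raw name."""
    stem = name.strip()
    # Path separators become underscores so names cannot escape the folder.
    stem = re.sub(r'[\\/]+', '_', stem)
    # Remaining reserved characters are dropped.
    stem = re.sub(r'[<>:"|?*]', '', stem)
    # Unicode letters are left as they are; only the length is limited.
    stem = stem[:max_length]
    if not stem:
        raise ValueError('name is empty after sanitization')
    return stem + '.md'


print(sanitize_filename('docs/guide: part 1?'))  # docs_guide part 1.md
```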

Use Cases

Web Scraping Export

Export scraped web content to organized file structure
Use URL-based filenames for intuitive organization
Maintain source traceability through filenames

Document Processing Pipeline

Save processed documents from various sources
Apply consistent naming conventions
Prepare files for further analysis or indexing

Knowledge Base Creation

Export curated content collections
Organize by topic, source, or date
Create searchable file archives

Content Migration

Export content from databases or APIs
Convert to Markdown for version control
Maintain metadata through filename conventions

Best Practices

Table Preparation

Required Columns: Always include doc_id and content
Clean Data: Remove or escape problematic characters in content
Consistent IDs: Use meaningful, unique document IDs
URL Validation: Ensure source_path contains valid URLs if using URL naming

Filename Strategy

URL Naming: Best for web-scraped content with meaningful URLs
Column Naming: Use for curated content with predefined names
Prefix Usage: Add project or date prefixes for organization
Length Consideration: Keep total path length under system limits

Performance Optimization

Batch Size: Process reasonable numbers of files at once
Folder Structure: Create organized subfolder hierarchies
Overwrite Settings: Use appropriate overwrite policies
Progress Monitoring: Check status regularly for large operations

Error Handling

Path Validation: Verify output folder exists and is writable
Content Validation: Check for empty or malformed content
Filename Conflicts: Handle duplicate filenames appropriately
Recovery: Use “Clear Status” to reset after errors
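
Path validation can be done before pulsing Save Markdown Files. The sketch below is one possible check using only the standard library; check_output_folder is a hypothetical helper and its messages are not the operator's own error strings.

```python
import os
from pathlib import Path


def check_output_folder(folder, create=True):
    """Return (ok, message) describing whether folder can receive files."""
    path = Path(folder)
    if not path.exists():
        if not create:
            return False, f'Folder does not exist: {path}'
        try:
            path.mkdir(parents=True, exist_ok=True)
        except OSError as exc:
            return False, f'Cannot create folder: {exc}'
    if not path.is_dir():
        return False, f'Not a directory: {path}'
    if not os.access(path, os.W_OK):
        return False, f'Folder is not writable: {path}'
    return True, 'OK'
```

Running a check like this first separates path problems from content problems, which makes later error messages easier to interpret.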

Troubleshooting

Common Issues

Permission Errors
- Verify output folder write permissions
- Check file system space availability
- Ensure folder path is accessible

Filename Conflicts
- Enable overwrite if updates are needed
- Use unique prefixes to avoid conflicts
- Check for duplicate doc_ids in input

Invalid Paths
- Use absolute paths or proper relative paths
- Verify folder exists or can be created
- Check for invalid characters in folder names

Performance Issues
- Process files in smaller batches
- Use faster storage devices
- Monitor system resources during operation
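
For the filename conflict checks listed above, duplicates can be detected before saving. This is a small sketch with a hypothetical helper, find_duplicates, which works equally on doc_id values or on generated filenames.

```python
from collections import Counter


def find_duplicates(values):
    """Return the sorted values that occur more than once."""
    counts = Counter(values)
    return sorted(value for value, count in counts.items() if count > 1)


doc_ids = ['document_001', 'document_002', 'document_001']
print(find_duplicates(doc_ids))  # ['document_001']
```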

Status Messages

“Idle”: Ready for operation
“Starting…”: Initializing save process
“Saving files…”: Actively saving files
“Completed: Saved X/Y files”: Operation finished successfully
“Error: [message]”: Operation failed with specific error
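
When polling the Status parameter from a script, the messages above can be turned into structured values. The sketch assumes the status strings match the list exactly as written; parse_status is a hypothetical helper and the operator may word its messages differently.

```python
import re

# Pattern for the documented completion message "Completed: Saved X/Y files".
_COMPLETED = re.compile(r'^Completed: Saved (\d+)/(\d+) files$')


def parse_status(status):
    """Classify a status string as idle, running, completed, or error."""
    text = status.strip()
    match = _COMPLETED.match(text)
    if match:
        return {'state': 'completed',
                'saved': int(match.group(1)),
                'total': int(match.group(2))}
    if text.startswith('Error:'):
        return {'state': 'error', 'message': text[len('Error:'):].strip()}
    if text == 'Idle':
        return {'state': 'idle'}
    # "Starting…" and "Saving files…" are both treated as in progress.
    return {'state': 'running'}


print(parse_status('Completed: Saved 8/10 files'))
```

Comparing saved with total shows how many rows were skipped or failed, for example because Overwrite was disabled.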

Together, these features give RAG workflows and content processing pipelines a predictable way to export table rows as organized Markdown files.