Package Exports
- @dbclean/cli
- @dbclean/cli/index.js
This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@dbclean/cli) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
Readme
๐งน DBClean CLI
Transform messy CSV data into clean, standardized datasets using AI-powered automation.
DBClean CLI is a powerful command-line tool that automatically cleans, standardizes, and restructures your CSV data using advanced AI models. Perfect for data scientists, analysts, and anyone working with messy datasets.
๐ Project Structure
dbclean-cli/
โโโ data/
โ โโโ data.csv # Your input file
โ โโโ data_cleaned.csv # After preclean
โ โโโ data_stitched.csv # Stitched data
โ โโโ train.csv # Training set (70%)
โ โโโ validate.csv # Validation set (15%)
โ โโโ test.csv # Test set (15%)
โโโ settings/
โ โโโ instructions.txt # Custom AI instructions
โ โโโ exclude_columns.txt # Columns to skip in preclean
โโโ outputs/
โ โโโ architect_output.txt # AI schema design
โ โโโ column_mapping.json # Column transformations
โ โโโ cleaned_columns/ # Individual column results
โ โโโ cleaner_changes_analysis.txt
โโโ config.json # Project configuration
โจ Features
- ๐ค AI-Powered Cleaning - Uses advanced language models to intelligently clean and standardize data
- ๐๏ธ Schema Design - Automatically creates optimal database schemas from your data
- ๐ Outlier Detection - Uses Isolation Forest to identify and remove anomalies from your dataset.
- โ๏ธ Data Splitting - Automatically splits your cleaned data into training, validation, and test sets.
- ๐ Full Pipeline - Complete automation from raw CSV to clean, structured data
- ๐ Column-by-Column Processing - Detailed cleaning and standardization of individual columns
- ๐ฏ Model Selection - Choose from multiple AI models for different tasks
- ๐ Custom Instructions - Guide the AI with your specific cleaning requirements
- ๐ Detailed Logging - Track every change and transformation
- โก Batch Processing - Handle large datasets efficiently
- ๐ฐ Credit-Based Billing - Pay only for what you use with transparent pricing
- ๐ Usage Analytics - Track your costs and optimize your usage
๐ณ Credit System
DBClean uses a transparent, pay-as-you-go credit system:
- Minimum Balance: $0.01 required to make requests
- Precision: 4 decimal places (charges as low as $0.0001)
- Pricing: Based on actual Gemini AI model costs with no markup
- Billing: Credits deducted only after successful processing
- Transparency: Detailed usage tracking and cost breakdown
Check your balance anytime with dbclean-cli credits
or get a complete overview with dbclean-cli account
.
๐ Quick Start
Prerequisites
- Node.js (version 16 or higher)
- npm (comes with Node.js)
- A DBClean API key (sign up at dbclean.dev)
Installation
Clone and setup the CLI:
cd dbclean-cli npm install npm link
Initialize with your API credentials:
dbclean-cli init
Enter your email and API key when prompted.
Test your setup:
dbclean-cli test-auth dbclean-cli account
Basic Usage
Place your CSV file in the
data/
directory asdata.csv
Run the complete pipeline:
dbclean-cli run
Get your cleaned data from
data/data_stitched.csv
๐
๐ Detailed Usage Guide
๐ง Setup Commands
Initialize CLI
dbclean-cli init
Set up your email and API key for authentication.
Test Authentication
dbclean-cli test-auth
Verify your credentials are working.
Check Status
dbclean-cli status
View your API key status and usage information.
Account Overview
dbclean-cli account
Complete account dashboard showing credits, usage, and status.
๐ฐ Credit Management
Check Credit Balance
dbclean-cli credits
View your current credit balance and usage warnings.
View Usage Statistics
# Basic usage summary
dbclean-cli usage
# Detailed breakdown by service and model
dbclean-cli usage --detailed
Track your API usage, token consumption, and costs.
List Available Models
dbclean-cli models
See all available AI models and their pricing.
๐ Pipeline Commands
Full Pipeline (Recommended)
# Basic full pipeline
dbclean-cli run
# With custom AI model
dbclean-cli run -m "gemini-2.5-pro"
# Different models for different steps
dbclean-cli run -ma "gemini-2.5-pro" -mc "gemini-2.5-flash"
# With custom instructions and larger sample
dbclean-cli run -i -x 10
# Skip certain steps
dbclean-cli run --skip-preclean --skip-architect --skip-isosplit
๐งฉ Individual Step Commands
1. Preclean - Data Preparation
dbclean-cli preclean
Prepares your raw CSV by:
- Removing problematic newlines and special characters
- Handling non-UTF8 characters
- Creating a clean base file for AI processing
2. Architect - Schema Design
# Basic schema design
dbclean-cli architect
# With specific model and larger sample
dbclean-cli architect -m "gemini-2.5-pro" -x 10
# With custom instructions
dbclean-cli architect -i
# List available models
dbclean-cli architect --list-models
Creates an optimized schema by:
- Analyzing your data structure
- Standardizing column names
- Defining data types and formats
- Providing cleaning examples
3. Cleaner - Data Cleaning
# Clean all columns
dbclean-cli cleaner
# With specific model
dbclean-cli cleaner -m "gemini-2.5-flash"
# List available models
dbclean-cli cleaner --list-models
Processes each column to:
- Standardize formats and values
- Fix inconsistencies
- Flag problematic entries
- Apply schema-guided cleaning
4. Stitcher - Final Assembly
dbclean-cli stitcher
Creates your final dataset by:
- Applying all architect corrections
- Integrating cleaner changes
- Generating final CSV with all improvements
- Creating detailed change analysis
5. Isosplit - Outlier Detection & Splitting
dbclean-cli isosplit
Processes the stitched data to:
- Detect and remove outliers using an Isolation Forest model.
- Shuffle the cleaned data randomly.
- Split the data into
train.csv
(70%),validate.csv
(15%), andtest.csv
(15%).
๐๏ธ Command Options
Model Selection
-m <model>
- Use same model for all AI steps-ma <model>
- Specific model for architect step-mc <model>
- Specific model for cleaner step--list-models
- Show available AI models
Processing Options
-x <number>
- Sample size for architect analysis (default: 5)-i
- Use custom instructions fromsettings/instructions.txt
--skip-preclean
- Skip data preparation step--skip-architect
- Skip schema design step--skip-cleaner
- Skip column cleaning step--skip-dedupe
- Skip the deduplication step--skip-isosplit
- Skip the outlier detection and data splitting step
Output Options
--log-file <path>
- Custom log file for silent-run
๐ค AI Models
DBClean supports multiple AI models for different use cases:
Recommended Models
- gemini-2.5-pro - Excellent for complex data understanding
- gemini-2.5-flash - Great general-purpose model
- gemini-2.0-flash - Good performance for large datasets
Model Selection Tips
- For complex, messy data: Use gemini-2.5-pro
- For speed and cost: Use gemini-2.0-flash
- For mixed workloads: Use different models per step with
-ma
and-mc
๐ Custom Instructions
Create settings/instructions.txt
to guide the AI with specific requirements:
Examples of custom instructions:
- "Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)"
- "Convert all dates to YYYY-MM-DD format"
- "Normalize company names (remove Inc, LLC, etc.)"
- "Flag any entries with missing critical information"
Use with: dbclean-cli run -i
๐ก Examples
Example 1: Customer Data Cleaning
# Place customer_data.csv in data/ as data.csv
dbclean-cli run -m "gemini-2.5-pro" -i -x 15
Example 2: Large Dataset (Silent Processing)
dbclean-cli silent-run -ma "gemini-2.5-pro" -mc "gemini-2.5-flash"
Example 3: Quick Test (Skip Heavy Steps)
dbclean-cli run --skip-cleaner -x 3
Example 4: Re-run Just Cleaning
dbclean-cli run --skip-preclean --skip-architect
Example 5: Skip Outlier Detection
dbclean-cli run --skip-isosplit
๐ง Configuration
config.json
Customize file paths and settings:
{
"data_dir": "data",
"data_cleaned_file_path": "data_cleaned.csv",
"data_stitched_file_path": "data_stitched.csv",
"settings__dir": "settings",
"outputs_dir": "outputs"
}
Exclude Columns
Add column names to settings/exclude_columns.txt
to skip them during preclean:
Internal_ID
Temp_Notes
Debug_Column
๐ฏ Best Practices
1. Start Small
- Begin with
-x 5
(5 rows) for initial testing - Increase sample size for better results on complex data
2. Use Custom Instructions
- Provide specific formatting requirements
- Include domain knowledge about your data
- Specify any business rules or constraints
3. Model Selection
- Use powerful models (gemini-2.5-pro) for initial architect step
- Use faster models (gemini-2.0-flash) for repetitive cleaner tasks
- Test different combinations to find optimal performance/cost
4. Iterative Approach
- Run architect first to understand data structure
- Review outputs before running full pipeline
- Use skip options to re-run specific steps
โ Troubleshooting
Common Issues
"API key not found"
dbclean-cli init # Re-enter credentials
dbclean-cli test-auth # Verify connection
"Data file not found"
- Ensure
data.csv
exists in thedata/
directory - Check file permissions and path
"Model not available"
dbclean-cli run --list-models # See available models
"Rate limit errors"
- The CLI automatically retries with delays
- Use
silent-run
for unattended processing - Consider using faster/cheaper models
Getting Help
dbclean-cli --help # General help
dbclean-cli run --help # Command-specific help
dbclean-cli test # Test console output
๐ค Support
- Documentation: dbclean.dev/docs
- Issues: Report bugs or request features
- Community: Join our Discord for support and tips
๐ License
This project is licensed under the MIT License - see the LICENSE file for details.
Ready to clean your data? Start with dbclean-cli init
and transform your messy CSV files into pristine datasets! ๐