# 🧹 DBClean

Transform messy CSV data into clean, standardized datasets using AI-powered automation.

DBClean is a powerful command-line tool that automatically cleans, standardizes, and restructures your CSV data using advanced AI models. Perfect for data scientists, analysts, and anyone working with messy datasets.
## 📁 Project Structure

After processing, your workspace will look like this:

```
your-project/
├── data.csv                     # Your original input file
├── data/
│   ├── data_cleaned.csv         # After preclean step
│   ├── data_deduped.csv         # After duplicate removal
│   ├── data_stitched.csv        # Final cleaned dataset
│   ├── train.csv                # Training set (70%)
│   ├── validate.csv             # Validation set (15%)
│   └── test.csv                 # Test set (15%)
├── settings/
│   ├── instructions.txt         # Custom AI instructions
│   └── exclude_columns.txt      # Columns to skip in preclean
└── outputs/
    ├── architect_output.txt     # AI schema design
    ├── column_mapping.json      # Column transformations
    ├── cleaned_columns/         # Individual column results
    ├── cleaner_changes_analysis.html
    └── dedupe_report.txt
```
## ✨ Features

- 🤖 **AI-Powered Cleaning** - Uses advanced language models to intelligently clean and standardize data
- 🏗️ **Schema Design** - Automatically creates optimal database schemas from your data
- 🔍 **Duplicate Detection** - AI-powered duplicate identification and removal
- 🎯 **Outlier Detection** - Uses Isolation Forest to identify and remove anomalies
- ✂️ **Data Splitting** - Automatically splits cleaned data into training, validation, and test sets
- 🔄 **Full Pipeline** - Complete automation from raw CSV to clean, structured data
- 📊 **Column-by-Column Processing** - Detailed cleaning and standardization of individual columns
- 🎛️ **Model Selection** - Choose from multiple AI models for different tasks
- 📝 **Custom Instructions** - Guide the AI with your specific cleaning requirements
- 💰 **Credit-Based Billing** - Pay only for what you use with transparent pricing
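The train/validate/test split produced by the pipeline is documented as 70/15/15. As a rough sketch of what those proportions mean in practice (the `split_rows` helper below is hypothetical, not part of the dbclean CLI, and the real `isosplit` step also removes outliers first):

```python
# Illustrative 70/15/15 split, matching the documented proportions.
# split_rows is a hypothetical helper, not part of dbclean.
def split_rows(rows):
    n = len(rows)
    n_train = int(n * 0.70)
    n_validate = int(n * 0.15)
    train = rows[:n_train]
    validate = rows[n_train:n_train + n_validate]
    test = rows[n_train + n_validate:]
    return train, validate, test

train, validate, test = split_rows(list(range(100)))
print(len(train), len(validate), len(test))  # 70 15 15
```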
## 💳 Credit System

DBClean uses a transparent, pay-as-you-go credit system:

- **Free Tier:** 5 free requests per month for new users
- **Minimum Balance:** $0.01 required for paid requests
- **Precision:** 4 decimal places (charges as low as $0.0001)
- **Pricing:** Based on actual AI model costs with no markup
- **Billing:** Credits deducted only after successful processing

Check your balance anytime with `dbclean credits`, or get a complete overview with `dbclean account`.
## 🚀 Quick Start

### 1. Initialize Your Account

```bash
dbclean init
```

Enter your email and API key when prompted. Don't have an account? Sign up at dbclean.dev

### 2. Verify Setup

```bash
dbclean test-auth
dbclean account
```

### 3. Process Your Data

```bash
# Place your CSV file as data.csv in your current directory
dbclean run
```

Your cleaned data will be available in `data/data_stitched.csv` 🎉
## 📋 Command Reference

### 🔧 Setup & Authentication

| Command | Description |
|---|---|
| `dbclean init` | Initialize with your email and API key |
| `dbclean test-auth` | Verify your credentials are working |
| `dbclean logout` | Remove stored credentials |
| `dbclean status` | Check API key status and account info |
### 💰 Account Management

| Command | Description |
|---|---|
| `dbclean account` | Complete account overview (credits, usage, status) |
| `dbclean credits` | Check your current credit balance |
| `dbclean usage` | View API usage statistics |
| `dbclean usage --detailed` | Detailed breakdown by service and model |
| `dbclean models` | List all available AI models |
### 📊 Data Processing Pipeline

| Command | Description |
|---|---|
| `dbclean run` | Execute complete pipeline (recommended) |
| `dbclean preclean` | Clean CSV data (remove newlines, special chars) |
| `dbclean architect` | AI-powered schema design and standardization |
| `dbclean dedupe` | AI-powered duplicate detection and removal |
| `dbclean cleaner` | AI-powered column-by-column data cleaning |
| `dbclean stitcher` | Combine all changes into final CSV |
| `dbclean isosplit` | Detect outliers and split into train/validate/test |
## 🔄 Complete Pipeline

The recommended approach is to use the full pipeline:

```bash
# Basic full pipeline
dbclean run

# With custom AI model
dbclean run -m "gemini-2.0-flash-exp"

# Different models for different steps
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

# With custom instructions and larger sample
dbclean run -i -x 10

# Skip certain steps
dbclean run --skip-preclean --skip-dedupe
```
### Pipeline Steps

1. **Preclean** - Prepares raw CSV by removing problematic characters and formatting
2. **Architect** - AI analyzes your data structure and creates optimized schema
3. **Dedupe** - AI identifies and removes duplicate records intelligently
4. **Cleaner** - AI processes each column to standardize and clean data
5. **Stitcher** - Combines all improvements into final dataset
6. **Isosplit** - Removes outliers and splits data for machine learning
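To make the preclean step more concrete, here is a minimal sketch of the kind of transformation it describes: replacing embedded newlines inside quoted CSV fields so each record fits on one line. This is only an illustration of the idea; the actual `dbclean preclean` behavior may differ.

```python
import csv
import io

# Sketch of preclean-style cleanup: newlines embedded in quoted CSV
# fields are replaced with spaces. Not dbclean's actual implementation.
raw = 'name,notes\n"Acme Inc","line one\nline two"\n'
reader = csv.reader(io.StringIO(raw))
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in reader:
    writer.writerow([field.replace("\n", " ") for field in row])
print(out.getvalue())  # name,notes / Acme Inc,line one line two
```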
## 🎛️ Command Options

### Model Selection

- `-m <model>` - Use the same model for all AI steps
- `--model-architect <model>` - Specific model for the architect step
- `--model-cleaner <model>` - Specific model for the cleaner step

### Processing Options

- `-x <number>` - Sample size for architect analysis (default: 5)
- `-i` - Use custom instructions from `settings/instructions.txt`
- `--input <file>` - Specify input CSV file (default: `data.csv`)

### Skip Options

- `--skip-preclean` - Skip data preparation step
- `--skip-architect` - Skip schema design step
- `--skip-dedupe` - Skip duplicate detection step
- `--skip-cleaner` - Skip column cleaning step
- `--skip-isosplit` - Skip outlier detection and data splitting
## 🤖 AI Models

### Recommended Models

| Model | Best For | Speed | Cost |
|---|---|---|---|
| `gemini-2.0-flash-exp` | General purpose, fast processing | ⚡⚡⚡ | 💲 |
| `gemini-2.0-flash-thinking` | Complex data analysis | ⚡⚡ | 💲💲 |
| `gemini-1.5-pro` | Large, complex datasets | ⚡ | 💲💲💲 |
### Model Selection Tips

- **For speed and cost:** Use `gemini-2.0-flash-exp`
- **For complex, messy data:** Use `gemini-2.0-flash-thinking` for architect
- **For mixed workloads:** Use different models per step with `--model-architect` and `--model-cleaner`

```bash
# List all available models
dbclean models
```
## 📝 Custom Instructions

Create custom cleaning instructions to guide the AI:

- **For architect step:** Use the `-i` flag with a `settings/instructions.txt` file
- **Example instructions:**

```
- Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)
- Convert all dates to YYYY-MM-DD format
- Normalize company names (remove Inc, LLC, etc.)
- Flag any entries with missing critical information
- Ensure email addresses are properly formatted
```

```bash
dbclean run -i # Uses instructions from settings/instructions.txt
```
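Since the instructions file is plain text at the documented `settings/instructions.txt` path, you can also create it programmatically, which is handy in automated setups. A sketch (the instruction lines themselves are just examples):

```python
from pathlib import Path

# Write the documented settings/instructions.txt programmatically.
# The path comes from this README; the instruction text is an example.
settings = Path("settings")
settings.mkdir(exist_ok=True)
instructions = "\n".join([
    "- Convert all dates to YYYY-MM-DD format",
    "- Ensure email addresses are properly formatted",
])
(settings / "instructions.txt").write_text(instructions + "\n")
```

Then run `dbclean run -i` as above to pick the file up.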
## 💡 Usage Examples

### Basic Processing

```bash
# Process a CSV file with default settings
dbclean run

# Use a specific input file
dbclean run --input customer_data.csv
```

### Advanced Processing

```bash
# High-quality processing with larger sample
dbclean run -m "gemini-2.0-flash-thinking" -x 15 -i

# Fast processing for large datasets
dbclean run -m "gemini-2.0-flash-exp" --skip-dedupe

# Custom pipeline - architect only
dbclean run --skip-preclean --skip-cleaner --skip-dedupe --skip-isosplit
```

### Individual Steps

```bash
# Run architect with custom model and sample size
dbclean architect -m "gemini-2.0-flash-thinking" -x 10 -i

# Clean data with specific model
dbclean cleaner -m "gemini-2.0-flash-exp"

# Remove duplicates with AI analysis
dbclean dedupe
```
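The dedupe step targets records that are duplicates in meaning, not just byte-identical rows. As a rough intuition for near-duplicate matching (this is not dbclean's actual algorithm, which the README describes as AI-powered), a string-similarity ratio can flag likely pairs:

```python
from difflib import SequenceMatcher

# Rough near-duplicate check via string similarity.
# NOT dbclean's AI-based method - just an intuition pump.
def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(similar("Acme Inc.", "ACME Inc"))     # likely duplicates
print(similar("Acme Inc.", "Globex Corp"))  # unrelated
```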
## 🎯 Best Practices

### 1. Start Small and Iterate

```bash
# Test with small sample first
dbclean architect -x 3

# Review outputs, then run full pipeline
dbclean run
```

### 2. Choose the Right Models

```bash
# For complex schema design
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"
```

### 3. Use Custom Instructions

Create `settings/instructions.txt` with domain-specific requirements:

```
Finance data requirements:
- Currency amounts in USD format ($X,XXX.XX)
- Account numbers must be 10-12 digits
- Transaction dates in YYYY-MM-DD format
```
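The formats in that example map directly onto standard formatting calls, which is useful when spot-checking cleaned output. A sketch (these helpers are not part of dbclean):

```python
from datetime import date

# Spot-check helpers for the finance formats listed above.
# They mirror the instruction file's targets; not dbclean code.
amount = 1234.5
print(f"${amount:,.2f}")    # $1,234.50  ($X,XXX.XX format)

d = date(2024, 3, 7)
print(d.isoformat())        # 2024-03-07 (YYYY-MM-DD format)

account = "0123456789"
print(account.isdigit() and 10 <= len(account) <= 12)  # True
```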
### 4. Monitor Your Usage

```bash
# Check account status regularly
dbclean account

# Monitor detailed usage
dbclean usage --detailed
```
## ❓ Troubleshooting

### Common Issues

#### Authentication Problems

```bash
dbclean init      # Re-enter credentials
dbclean test-auth # Verify connection
```

#### Data File Issues

- Ensure `data.csv` exists in current directory
- Use `--input <file>` for different file names
- Check file permissions and encoding

#### API Limits

- Check credit balance: `dbclean credits`
- View usage: `dbclean usage`
- Free tier: 5 requests per month, then paid credits required

#### Model Availability

```bash
dbclean models # See available models
```

### Getting Help

```bash
dbclean --help        # General help
dbclean run --help    # Command-specific help
dbclean help-commands # Detailed command reference
```
## 📄 Output Files

After processing, you'll have:

- `data/data_stitched.csv` - Your final, cleaned dataset
- `data/train.csv` - Training data (70%)
- `data/validate.csv` - Validation data (15%)
- `data/test.csv` - Test data (15%)
- `outputs/cleaner_changes_analysis.html` - Visual changes report
- `outputs/architect_output.txt` - AI schema analysis
- `outputs/column_mapping.json` - Column transformation details
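A quick way to confirm a run completed is to check for the documented files. The helper below is a hypothetical convenience (not part of the dbclean CLI); the file list is taken from this README:

```python
import os

# Files documented in the Output Files section of this README.
EXPECTED = [
    "data/data_stitched.csv",
    "data/train.csv",
    "data/validate.csv",
    "data/test.csv",
    "outputs/cleaner_changes_analysis.html",
    "outputs/architect_output.txt",
    "outputs/column_mapping.json",
]

# Hypothetical helper: report which expected outputs are absent.
def missing_outputs(base="."):
    return [p for p in EXPECTED if not os.path.exists(os.path.join(base, p))]

print(missing_outputs())  # lists any files a run has not yet produced
```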
## 🤝 Support

- **Documentation:** dbclean.dev/docs
- **Support:** dbclean.dev/support
- **API Status:** Check real-time status and get your API key

## 📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

---

Ready to clean your data? Start with `dbclean init` and transform your messy CSV files into pristine datasets! 🚀