JSPM

@dbclean/cli

1.0.1
  • ESM via JSPM
  • ES Module Entrypoint
  • Export Map
  • Keywords
  • License
  • Repository URL
  • TypeScript Types
  • README
  • Created
  • Published
  • Downloads 13
  • Score
    100M100P100Q37734F
  • License MIT

Transform messy CSV data into clean, standardized datasets using AI-powered automation

Package Exports

  • @dbclean/cli
  • @dbclean/cli/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@dbclean/cli) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

๐Ÿงน DBClean

Transform messy CSV data into clean, standardized datasets using AI-powered automation.

DBClean is a powerful command-line tool that automatically cleans, standardizes, and restructures your CSV data using advanced AI models. Perfect for data scientists, analysts, and anyone working with messy datasets.

๐Ÿ“ Project Structure

After processing, your workspace will look like this:

your-project/
โ”œโ”€โ”€ data.csv                  # Your original input file
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ data_cleaned.csv      # After preclean step
โ”‚   โ”œโ”€โ”€ data_deduped.csv      # After duplicate removal
โ”‚   โ”œโ”€โ”€ data_stitched.csv     # Final cleaned dataset
โ”‚   โ”œโ”€โ”€ train.csv             # Training set (70%)
โ”‚   โ”œโ”€โ”€ validate.csv          # Validation set (15%)
โ”‚   โ””โ”€โ”€ test.csv              # Test set (15%)
โ”œโ”€โ”€ settings/
โ”‚   โ”œโ”€โ”€ instructions.txt      # Custom AI instructions
โ”‚   โ””โ”€โ”€ exclude_columns.txt   # Columns to skip in preclean
โ””โ”€โ”€ outputs/
    โ”œโ”€โ”€ architect_output.txt  # AI schema design
    โ”œโ”€โ”€ column_mapping.json   # Column transformations
    โ”œโ”€โ”€ cleaned_columns/      # Individual column results
    โ”œโ”€โ”€ cleaner_changes_analysis.html
    โ””โ”€โ”€ dedupe_report.txt

โœจ Features

  • ๐Ÿค– AI-Powered Cleaning - Uses advanced language models to intelligently clean and standardize data
  • ๐Ÿ—๏ธ Schema Design - Automatically creates optimal database schemas from your data
  • ๐Ÿ” Duplicate Detection - AI-powered duplicate identification and removal
  • ๐ŸŽฏ Outlier Detection - Uses Isolation Forest to identify and remove anomalies
  • โœ‚๏ธ Data Splitting - Automatically splits cleaned data into training, validation, and test sets
  • ๐Ÿ”„ Full Pipeline - Complete automation from raw CSV to clean, structured data
  • ๐Ÿ“Š Column-by-Column Processing - Detailed cleaning and standardization of individual columns
  • ๐ŸŽฏ Model Selection - Choose from multiple AI models for different tasks
  • ๐Ÿ“‹ Custom Instructions - Guide the AI with your specific cleaning requirements
  • ๐Ÿ’ฐ Credit-Based Billing - Pay only for what you use with transparent pricing

๐Ÿ’ณ Credit System

DBClean uses a transparent, pay-as-you-go credit system:

  • Free Tier: 5 free requests per month for new users
  • Minimum Balance: $0.01 required for paid requests
  • Precision: 4 decimal places (charges as low as $0.0001)
  • Pricing: Based on actual AI model costs with no markup
  • Billing: Credits deducted only after successful processing

Check your balance anytime with dbclean credits or get a complete overview with dbclean account.

๐Ÿš€ Quick Start

1. Initialize Your Account

dbclean init

Enter your email and API key when prompted. Don't have an account? Sign up at dbclean.dev

2. Verify Setup

dbclean test-auth
dbclean account

3. Process Your Data

# Place your CSV file as data.csv in your current directory
dbclean run

Your cleaned data will be available in data/data_stitched.csv ๐ŸŽ‰

๐Ÿ“– Command Reference

๐Ÿ”ง Setup & Authentication

Command Description
dbclean init Initialize with your email and API key
dbclean test-auth Verify your credentials are working
dbclean logout Remove stored credentials
dbclean status Check API key status and account info

๐Ÿ’ฐ Account Management

Command Description
dbclean account Complete account overview (credits, usage, status)
dbclean credits Check your current credit balance
dbclean usage View API usage statistics
dbclean usage --detailed Detailed breakdown by service and model
dbclean models List all available AI models

๐Ÿ“Š Data Processing Pipeline

Command Description
dbclean run Execute complete pipeline (recommended)
dbclean preclean Clean CSV data (remove newlines, special chars)
dbclean architect AI-powered schema design and standardization
dbclean dedupe AI-powered duplicate detection and removal
dbclean cleaner AI-powered column-by-column data cleaning
dbclean stitcher Combine all changes into final CSV
dbclean isosplit Detect outliers and split into train/validate/test

๐Ÿ”„ Complete Pipeline

The recommended approach is to use the full pipeline:

# Basic full pipeline
dbclean run

# With custom AI model
dbclean run -m "gemini-2.0-flash-exp"

# Different models for different steps
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

# With custom instructions and larger sample
dbclean run -i -x 10

# Skip certain steps
dbclean run --skip-preclean --skip-dedupe

Pipeline Steps

  1. Preclean - Prepares raw CSV by removing problematic characters and formatting
  2. Architect - AI analyzes your data structure and creates optimized schema
  3. Dedupe - AI identifies and removes duplicate records intelligently
  4. Cleaner - AI processes each column to standardize and clean data
  5. Stitcher - Combines all improvements into final dataset
  6. Isosplit - Removes outliers and splits data for machine learning

๐ŸŽ›๏ธ Command Options

Model Selection

  • -m <model> - Use same model for all AI steps
  • --model-architect <model> - Specific model for architect step
  • --model-cleaner <model> - Specific model for cleaner step

Processing Options

  • -x <number> - Sample size for architect analysis (default: 5)
  • -i - Use custom instructions from settings/instructions.txt
  • --input <file> - Specify input CSV file (default: data.csv)

Skip Options

  • --skip-preclean - Skip data preparation step
  • --skip-architect - Skip schema design step
  • --skip-dedupe - Skip duplicate detection step
  • --skip-cleaner - Skip column cleaning step
  • --skip-isosplit - Skip outlier detection and data splitting

๐Ÿค– AI Models

Model Best For Speed Cost
gemini-2.0-flash-exp General purpose, fast processing โšกโšกโšก ๐Ÿ’ฒ
gemini-2.0-flash-thinking Complex data analysis โšกโšก ๐Ÿ’ฒ๐Ÿ’ฒ
gemini-1.5-pro Large, complex datasets โšก ๐Ÿ’ฒ๐Ÿ’ฒ๐Ÿ’ฒ

Model Selection Tips

  • For speed and cost: Use gemini-2.0-flash-exp
  • For complex, messy data: Use gemini-2.0-flash-thinking for architect
  • For mixed workloads: Use different models per step with --model-architect and --model-cleaner
# List all available models
dbclean models

๐Ÿ“ Custom Instructions

Create custom cleaning instructions to guide the AI:

  1. For architect step: Use the -i flag with a settings/instructions.txt file
  2. Example instructions:
    - Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)
    - Convert all dates to YYYY-MM-DD format
    - Normalize company names (remove Inc, LLC, etc.)
    - Flag any entries with missing critical information
    - Ensure email addresses are properly formatted
dbclean run -i  # Uses instructions from settings/instructions.txt

๐Ÿ’ก Usage Examples

Basic Processing

# Process a CSV file with default settings
dbclean run

# Use a specific input file
dbclean run --input customer_data.csv

Advanced Processing

# High-quality processing with larger sample
dbclean run -m "gemini-2.0-flash-thinking" -x 15 -i

# Fast processing for large datasets
dbclean run -m "gemini-2.0-flash-exp" --skip-dedupe

# Custom pipeline - architect only
dbclean run --skip-preclean --skip-cleaner --skip-dedupe --skip-isosplit

Individual Steps

# Run architect with custom model and sample size
dbclean architect -m "gemini-2.0-flash-thinking" -x 10 -i

# Clean data with specific model
dbclean cleaner -m "gemini-2.0-flash-exp"

# Remove duplicates with AI analysis
dbclean dedupe

๐ŸŽฏ Best Practices

1. Start Small and Iterate

# Test with small sample first
dbclean architect -x 3

# Review outputs, then run full pipeline
dbclean run

2. Choose the Right Models

# For complex schema design
dbclean run --model-architect "gemini-2.0-flash-thinking" --model-cleaner "gemini-2.0-flash-exp"

3. Use Custom Instructions

Create settings/instructions.txt with domain-specific requirements:

Finance data requirements:
- Currency amounts in USD format ($X,XXX.XX)
- Account numbers must be 10-12 digits
- Transaction dates in YYYY-MM-DD format

4. Monitor Your Usage

# Check account status regularly
dbclean account

# Monitor detailed usage
dbclean usage --detailed

โ— Troubleshooting

Common Issues

Authentication Problems

dbclean init     # Re-enter credentials
dbclean test-auth # Verify connection

Data File Issues

  • Ensure data.csv exists in current directory
  • Use --input <file> for different file names
  • Check file permissions and encoding

API Limits

  • Check credit balance: dbclean credits
  • View usage: dbclean usage
  • Free tier: 5 requests per month, then paid credits required

Model Availability

dbclean models   # See available models

Getting Help

dbclean --help              # General help
dbclean run --help          # Command-specific help
dbclean help-commands       # Detailed command reference

๐Ÿ“Š Output Files

After processing, you'll have:

  • data/data_stitched.csv - Your final, cleaned dataset
  • data/train.csv - Training data (70%)
  • data/validate.csv - Validation data (15%)
  • data/test.csv - Test data (15%)
  • outputs/cleaner_changes_analysis.html - Visual changes report
  • outputs/architect_output.txt - AI schema analysis
  • outputs/column_mapping.json - Column transformation details

๐Ÿค Support

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


Ready to clean your data? Start with dbclean init and transform your messy CSV files into pristine datasets! ๐Ÿš€