@dbclean/cli

1.0.0

Transform messy CSV data into clean, standardized datasets using AI-powered automation

Package Exports

  • @dbclean/cli
  • @dbclean/cli/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@dbclean/cli) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

🧹 DBClean CLI

Transform messy CSV data into clean, standardized datasets using AI-powered automation.

DBClean CLI is a command-line tool that automatically cleans, standardizes, and restructures your CSV data using AI models. It is aimed at data scientists, analysts, and anyone working with messy datasets.

📁 Project Structure

dbclean-cli/
├── data/
│   ├── data.csv              # Your input file
│   ├── data_cleaned.csv      # After preclean
│   ├── data_stitched.csv     # Stitched data
│   ├── train.csv             # Training set (70%)
│   ├── validate.csv          # Validation set (15%)
│   └── test.csv              # Test set (15%)
├── settings/
│   ├── instructions.txt      # Custom AI instructions
│   └── exclude_columns.txt   # Columns to skip in preclean
├── outputs/
│   ├── architect_output.txt  # AI schema design
│   ├── column_mapping.json   # Column transformations
│   ├── cleaned_columns/      # Individual column results
│   └── cleaner_changes_analysis.txt
└── config.json               # Project configuration
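Before a run it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the CLI: the helper name `missing_inputs` is hypothetical, and the README does not say which of these directories the CLI creates on its own, so the function only reports what is absent.

```python
from pathlib import Path

# Names copied from the project tree above. Whether dbclean-cli creates any
# of them automatically is not stated, so this only checks for them.
EXPECTED_DIRS = ("data", "settings", "outputs")
INPUT_FILE = Path("data") / "data.csv"


def missing_inputs(project_root: Path) -> list[str]:
    """Return the expected directories and the input file that are missing."""
    missing = [d for d in EXPECTED_DIRS if not (project_root / d).is_dir()]
    if not (project_root / INPUT_FILE).is_file():
        missing.append(INPUT_FILE.as_posix())
    return missing
```

Calling `missing_inputs(Path("."))` from the project root returns an empty list when everything needed for the first step is present.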

✨ Features

  • 🤖 AI-Powered Cleaning - Uses advanced language models to intelligently clean and standardize data
  • 🏗️ Schema Design - Automatically creates optimal database schemas from your data
  • 🔄 Outlier Detection - Uses Isolation Forest to identify and remove anomalies from your dataset
  • ✂️ Data Splitting - Automatically splits your cleaned data into training, validation, and test sets
  • 🔄 Full Pipeline - Complete automation from raw CSV to clean, structured data
  • 📊 Column-by-Column Processing - Detailed cleaning and standardization of individual columns
  • 🎯 Model Selection - Choose from multiple AI models for different tasks
  • 📋 Custom Instructions - Guide the AI with your specific cleaning requirements
  • 🔍 Detailed Logging - Track every change and transformation
  • ⚡ Batch Processing - Handle large datasets efficiently
  • 💰 Credit-Based Billing - Pay only for what you use with transparent pricing
  • 📊 Usage Analytics - Track your costs and optimize your usage

💳 Credit System

DBClean uses a transparent, pay-as-you-go credit system:

  • Minimum Balance: $0.01 required to make requests
  • Precision: 4 decimal places (charges as low as $0.0001)
  • Pricing: Based on actual Gemini AI model costs with no markup
  • Billing: Credits deducted only after successful processing
  • Transparency: Detailed usage tracking and cost breakdown

Check your balance anytime with dbclean-cli credits or get a complete overview with dbclean-cli account.
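The balance rules above can be illustrated with exact decimal arithmetic. This is a sketch under stated assumptions, not DBClean's billing code: the function names are hypothetical, and the rounding mode is a guess, since the README gives only the precision and the minimum balance.

```python
from decimal import Decimal, ROUND_HALF_UP

MIN_BALANCE = Decimal("0.01")   # minimum balance required to make requests
PRECISION = Decimal("0.0001")   # charges are tracked to four decimal places


def can_make_request(balance: Decimal) -> bool:
    """True when the balance meets the documented $0.01 minimum."""
    return balance >= MIN_BALANCE


def deduct(balance: Decimal, cost: Decimal) -> Decimal:
    """Subtract a charge rounded to four decimal places.

    ROUND_HALF_UP is an assumption; the actual rounding rule is not documented.
    """
    charged = cost.quantize(PRECISION, rounding=ROUND_HALF_UP)
    return balance - charged
```

Using `Decimal` rather than `float` keeps sub-cent charges such as $0.0001 exact when many small deductions are summed.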

🚀 Quick Start

Prerequisites

  • Node.js (version 16 or higher)
  • npm (comes with Node.js)
  • A DBClean API key (sign up at dbclean.dev)
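The Node.js requirement can be checked from a script by parsing the output of `node --version`, which prints a string such as `v18.17.0`. A small sketch follows; the helper name `node_version_ok` is hypothetical.

```python
import re


def node_version_ok(version_output: str, minimum_major: int = 16) -> bool:
    """Check the major version in `node --version` output against a minimum."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    return bool(match) and int(match.group(1)) >= minimum_major
```

To feed it real output, pass the text captured from `subprocess.run(["node", "--version"], capture_output=True, text=True).stdout`.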

Installation

  1. Clone the repository and set up the CLI:

    cd dbclean-cli
    npm install
    npm link

  2. Initialize with your API credentials:

    dbclean-cli init

    Enter your email and API key when prompted.

  3. Test your setup:

    dbclean-cli test-auth
    dbclean-cli account

Basic Usage

  1. Place your CSV file in the data/ directory as data.csv

  2. Run the complete pipeline:

    dbclean-cli run

  3. Get your cleaned data from data/data_stitched.csv 🎉

📖 Detailed Usage Guide

🔧 Setup Commands

Initialize CLI

dbclean-cli init

Set up your email and API key for authentication.

Test Authentication

dbclean-cli test-auth

Verify your credentials are working.

Check Status

dbclean-cli status

View your API key status and usage information.

Account Overview

dbclean-cli account

Complete account dashboard showing credits, usage, and status.

💰 Credit Management

Check Credit Balance

dbclean-cli credits

View your current credit balance and usage warnings.

View Usage Statistics

# Basic usage summary
dbclean-cli usage

# Detailed breakdown by service and model
dbclean-cli usage --detailed

Track your API usage, token consumption, and costs.

List Available Models

dbclean-cli models

See all available AI models and their pricing.

🔄 Pipeline Commands

# Basic full pipeline
dbclean-cli run

# With custom AI model
dbclean-cli run -m "gemini-2.5-pro"

# Different models for different steps
dbclean-cli run -ma "gemini-2.5-pro" -mc "gemini-2.5-flash"

# With custom instructions and larger sample
dbclean-cli run -i -x 10

# Skip certain steps
dbclean-cli run --skip-preclean --skip-architect --skip-isosplit

🧩 Individual Step Commands

1. Preclean - Data Preparation

dbclean-cli preclean

Prepares your raw CSV by:

  • Removing problematic newlines and special characters
  • Handling non-UTF8 characters
  • Creating a clean base file for AI processing
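To make the effect of this step concrete, here is a rough approximation of the same idea in Python. It is an illustration only, not the CLI's implementation: the function name `preclean_text` is hypothetical, and the exact characters the real preclean step removes are not documented here.

```python
import csv
import io


def preclean_text(raw: bytes) -> str:
    """Decode CSV bytes leniently and flatten whitespace inside each field."""
    # Bytes that are not valid UTF-8 become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Collapse every whitespace run, including embedded newlines,
        # into a single space and trim the ends of the field.
        writer.writerow([" ".join(cell.split()) for cell in row])
    return out.getvalue()
```

A quoted field containing a line break, such as `"line one\nline two"`, comes out as `line one line two`, so each record occupies exactly one line.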

2. Architect - Schema Design

# Basic schema design
dbclean-cli architect

# With specific model and larger sample
dbclean-cli architect -m "gemini-2.5-pro" -x 10

# With custom instructions
dbclean-cli architect -i

# List available models
dbclean-cli architect --list-models

Creates an optimized schema by:

  • Analyzing your data structure
  • Standardizing column names
  • Defining data types and formats
  • Providing cleaning examples

3. Cleaner - Data Cleaning

# Clean all columns
dbclean-cli cleaner

# With specific model
dbclean-cli cleaner -m "gemini-2.5-flash"

# List available models
dbclean-cli cleaner --list-models

Processes each column to:

  • Standardize formats and values
  • Fix inconsistencies
  • Flag problematic entries
  • Apply schema-guided cleaning

4. Stitcher - Final Assembly

dbclean-cli stitcher

Creates your final dataset by:

  • Applying all architect corrections
  • Integrating cleaner changes
  • Generating final CSV with all improvements
  • Creating detailed change analysis

5. Isosplit - Outlier Detection & Splitting

dbclean-cli isosplit

Processes the stitched data to:

  • Detect and remove outliers using an Isolation Forest model.
  • Shuffle the cleaned data randomly.
  • Split the data into train.csv (70%), validate.csv (15%), and test.csv (15%).
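The shuffle-and-split part of this step can be sketched with the standard library alone. The outlier-removal part is omitted here because Isolation Forest needs a third-party library. The function name and the choice to put rounding leftovers into the test set are assumptions, not documented CLI behavior.

```python
import random


def shuffle_and_split(rows, seed=None, train_pct=70, validate_pct=15):
    """Shuffle rows, then split them into train, validate, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises in the cut points.
    n_train = len(shuffled) * train_pct // 100
    n_validate = len(shuffled) * validate_pct // 100
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # remainder, about 15%
    return train, validate, test
```

Passing a fixed `seed` makes the split reproducible, which is useful when comparing models trained on the same data.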

🎛️ Command Options

Model Selection

  • -m <model> - Use same model for all AI steps
  • -ma <model> - Specific model for architect step
  • -mc <model> - Specific model for cleaner step
  • --list-models - Show available AI models

Processing Options

  • -x <number> - Sample size for architect analysis (default: 5)
  • -i - Use custom instructions from settings/instructions.txt
  • --skip-preclean - Skip data preparation step
  • --skip-architect - Skip schema design step
  • --skip-cleaner - Skip column cleaning step
  • --skip-dedupe - Skip the deduplication step
  • --skip-isosplit - Skip the outlier detection and data splitting step

Output Options

  • --log-file <path> - Custom log file for silent-run
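When the pipeline is driven from a script, the options above can be assembled into an argument list programmatically. The helper below is a hypothetical convenience wrapper that uses only the flags listed in this section. It builds the command and does not run it.

```python
def build_run_command(model=None, architect_model=None, cleaner_model=None,
                      sample_size=None, use_instructions=False, skip=()):
    """Assemble an argv list for `dbclean-cli run` from the documented flags."""
    allowed_skips = {"preclean", "architect", "cleaner", "dedupe", "isosplit"}
    argv = ["dbclean-cli", "run"]
    if model:
        argv += ["-m", model]
    if architect_model:
        argv += ["-ma", architect_model]
    if cleaner_model:
        argv += ["-mc", cleaner_model]
    if sample_size is not None:
        argv += ["-x", str(sample_size)]
    if use_instructions:
        argv.append("-i")
    for step in skip:
        # Reject step names that have no documented --skip-* flag.
        if step not in allowed_skips:
            raise ValueError(f"no --skip-{step} flag is documented")
        argv.append(f"--skip-{step}")
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Passing a list rather than a shell string avoids quoting problems with model names.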

🤖 AI Models

DBClean supports multiple AI models for different use cases:

  • gemini-2.5-pro - Excellent for complex data understanding
  • gemini-2.5-flash - Great general-purpose model
  • gemini-2.0-flash - Good performance for large datasets

Model Selection Tips

  • For complex, messy data: Use gemini-2.5-pro
  • For speed and cost: Use gemini-2.0-flash
  • For mixed workloads: Use different models per step with -ma and -mc

📝 Custom Instructions

Create settings/instructions.txt to guide the AI with specific requirements:

Examples of custom instructions:
- "Standardize all phone numbers to E.164 format (+1XXXXXXXXXX)"
- "Convert all dates to YYYY-MM-DD format"
- "Normalize company names (remove Inc, LLC, etc.)"
- "Flag any entries with missing critical information"

Use with: dbclean-cli run -i
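Because the cleaning itself is done by an AI model, it can be worth spot-checking the output against instructions like the first two above with a deterministic reference. The sketch below is such a reference, not something the CLI provides: it assumes US phone numbers and an input date format of MM/DD/YYYY, and both function names are hypothetical.

```python
import re
from datetime import datetime


def to_e164_us(phone: str):
    """Return +1XXXXXXXXXX for a US number, or None if it cannot be normalized."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None
    return "+1" + digits


def to_iso_date(value: str, input_format: str = "%m/%d/%Y") -> str:
    """Convert a date string in the assumed input format to YYYY-MM-DD."""
    return datetime.strptime(value, input_format).strftime("%Y-%m-%d")
```

Comparing a sample of cleaned values against these helpers gives a quick signal on whether the instructions were followed.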

💡 Examples

Example 1: Customer Data Cleaning

# Place customer_data.csv in data/ as data.csv
dbclean-cli run -m "gemini-2.5-pro" -i -x 15

Example 2: Large Dataset (Silent Processing)

dbclean-cli silent-run -ma "gemini-2.5-pro" -mc "gemini-2.5-flash"

Example 3: Quick Test (Skip Heavy Steps)

dbclean-cli run --skip-cleaner -x 3

Example 4: Re-run from the Cleaner Step

dbclean-cli run --skip-preclean --skip-architect

Example 5: Skip Outlier Detection

dbclean-cli run --skip-isosplit

🔧 Configuration

config.json

Customize file paths and settings:

{
  "data_dir": "data",
  "data_cleaned_file_path": "data_cleaned.csv",
  "data_stitched_file_path": "data_stitched.csv",
  "settings__dir": "settings",
  "outputs_dir": "outputs"
}
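A script that post-processes the results may want to locate the same files the CLI uses. The sketch below reads the keys shown above and resolves them to paths. It assumes, based on the project tree, that the two file entries are relative to `data_dir`; the function name is hypothetical and the key `settings__dir` is used exactly as it appears in the example.

```python
import json
from pathlib import Path


def resolve_paths(config_text: str, project_root: Path) -> dict:
    """Map config.json entries to concrete paths under the project root."""
    config = json.loads(config_text)
    data_dir = project_root / config["data_dir"]
    return {
        # Assumption: the CSV entries are file names inside data_dir.
        "cleaned": data_dir / config["data_cleaned_file_path"],
        "stitched": data_dir / config["data_stitched_file_path"],
        "settings": project_root / config["settings__dir"],
        "outputs": project_root / config["outputs_dir"],
    }
```

For example, `resolve_paths(Path("config.json").read_text(), Path("."))["stitched"]` yields `data/data_stitched.csv` with the default configuration.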

Exclude Columns

Add column names to settings/exclude_columns.txt to skip them during preclean:

Internal_ID
Temp_Notes
Debug_Column
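The file format implied by the example is one column name per line. A reader for that format might look like the sketch below. Exact, case-sensitive matching and the skipping of blank lines are assumptions, and both function names are hypothetical.

```python
from pathlib import Path


def load_excluded_columns(path: Path) -> set[str]:
    """Read one column name per line, ignoring blank lines."""
    if not path.is_file():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def columns_to_preclean(header: list[str], excluded: set[str]) -> list[str]:
    """Return the header columns that are not excluded, in original order."""
    return [name for name in header if name not in excluded]
```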

🎯 Best Practices

1. Start Small

  • Begin with -x 5 (5 rows) for initial testing
  • Increase sample size for better results on complex data

2. Use Custom Instructions

  • Provide specific formatting requirements
  • Include domain knowledge about your data
  • Specify any business rules or constraints

3. Model Selection

  • Use powerful models (gemini-2.5-pro) for initial architect step
  • Use faster models (gemini-2.0-flash) for repetitive cleaner tasks
  • Test different combinations to find optimal performance/cost

4. Iterative Approach

  • Run architect first to understand data structure
  • Review outputs before running full pipeline
  • Use skip options to re-run specific steps

❗ Troubleshooting

Common Issues

"API key not found"

dbclean-cli init  # Re-enter credentials
dbclean-cli test-auth  # Verify connection

"Data file not found"

  • Ensure data.csv exists in the data/ directory
  • Check file permissions and path

"Model not available"

dbclean-cli run --list-models  # See available models

"Rate limit errors"

  • The CLI automatically retries with delays
  • Use silent-run for unattended processing
  • Consider using faster/cheaper models
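For scripts that call a rate-limited service directly, the retry-with-delay idea mentioned above is commonly implemented as exponential backoff. The sketch below shows that general technique. It is not the CLI's actual retry logic, and `RateLimitError`, the attempt count, and the delays are placeholders.

```python
import time


class RateLimitError(Exception):
    """Placeholder for whatever error a rate-limited call raises."""


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the delay after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the error
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the function can be tested without actually waiting.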

Getting Help

dbclean-cli --help           # General help
dbclean-cli run --help       # Command-specific help
dbclean-cli test             # Test console output

🤝 Support

  • Documentation: dbclean.dev/docs
  • Issues: Report bugs or request features
  • Community: Join our Discord for support and tips

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Ready to clean your data? Start with dbclean-cli init and transform your messy CSV files into pristine datasets! 🚀