@dataset.sh/cli 0.1.1-beta.0 (MIT)

Dataset CLI for managing local and remote dataset storage

@dataset.sh/cli

A powerful command-line interface for managing datasets with local caching, remote downloads, and flexible storage management. Similar to package managers like pnpm, but designed specifically for dataset files.

Features

  • 📦 Local and Global Installation - Install datasets per-project or globally
  • 🔄 Intelligent Caching - Global cache with SHA-256 integrity verification
  • 🏷️ Tag and Version Support - Install by semantic tags or specific versions
  • 🔗 Symbolic Linking - Efficient storage with automatic linking strategies
  • 🌐 Multiple Servers - Support for multiple dataset servers with authentication
  • 📤 Dataset Unpacking - Extract dataset contents for direct use
  • 🔐 Security - Built-in checksum verification and retry logic

Installation

Global Installation

pnpm add -g @dataset.sh/cli
# or
npm install -g @dataset.sh/cli

After global installation, use the dataset.sh command:

dataset.sh init
dataset.sh install nlp/sentiment

Using npx (No Installation Required)

npx @dataset.sh/cli init
npx @dataset.sh/cli install nlp/sentiment
npx @dataset.sh/cli unpack nlp/sentiment

Local Project Installation

pnpm add @dataset.sh/cli
# or
npm install @dataset.sh/cli

Then use via npm scripts or npx.
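With a local install, npm scripts keep the commands short. A minimal package.json fragment (the script names and the example dataset are illustrative):

```json
{
  "scripts": {
    "datasets:install": "dataset.sh install",
    "datasets:unpack": "dataset.sh unpack nlp/sentiment -d ./data"
  }
}
```

Then run pnpm run datasets:install (or npm run datasets:install), and the locally installed binary is resolved automatically.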

Quick Start

1. Initialize a Project

# Using global installation
dataset.sh init

# Using npx (no installation required)
npx @dataset.sh/cli init

2. Install a Dataset

# Install dataset with default tag (main)
dataset.sh install nlp/sentiment
# or
npx @dataset.sh/cli install nlp/sentiment

# Install specific tag
dataset.sh install nlp/sentiment -t v1.2

# Install specific version (using version hash)
dataset.sh install nlp/sentiment -v a1b2c3d4e5f6...

# Install globally
dataset.sh install -g nlp/sentiment

3. Unpack for Direct Use

# Unpack to public/datasets/nlp/sentiment
dataset.sh unpack nlp/sentiment
# or
npx @dataset.sh/cli unpack nlp/sentiment

# Unpack to custom location
dataset.sh unpack nlp/sentiment -d ./data

Global Options

--debug

Enable detailed debug logging to stderr. This shows internal operations including:

  • Configuration loading and path resolution
  • Network requests and responses
  • Cache operations (hits/misses)
  • File system operations
  • Linking strategies and operations

# Enable debug logging for any command
dataset.sh --debug init
dataset.sh --debug install nlp/sentiment

# Using npx
npx @dataset.sh/cli --debug init
npx @dataset.sh/cli --debug install nlp/sentiment

Debug output includes timestamped logs with module prefixes:

  • [CLI] - Command-line interface operations
  • [CONFIG] - Configuration and path management
  • [NETWORK] - HTTP requests and server communication
  • [CACHE] - Cache operations and integrity checking
  • [LINKING] - File linking and symlink operations
  • [FS] - File system operations
  • [INIT] - Init command operations
  • [INSTALL] - Install command operations
  • [UNPACK] - Unpack command operations

Commands

dataset.sh init

Creates a datasets.json file in the current directory.

dataset.sh init

dataset.sh install [dataset]

Installs datasets from datasets.json or adds and installs a specific dataset.

# Install all datasets from datasets.json
dataset.sh install

# Install specific dataset
dataset.sh install nlp/sentiment

# Install with options
dataset.sh install nlp/sentiment -t v1.2 -s myserver
dataset.sh install -g nlp/sentiment -v a1b2c3d4e5f6...

Options:

  • -g, --global - Install to global directory (~/.dataset_sh/global)
  • -s, --server <profile> - Use specific server profile
  • -t, --tag <tag> - Install specific tag (default: main)
  • -v, --version <version> - Install specific version (64-character hex string)

dataset.sh unpack <dataset>

Unpacks dataset content to a destination folder. The dataset must be installed first.

# Unpack to public/datasets
dataset.sh unpack nlp/sentiment

# Unpack to custom directory
dataset.sh unpack nlp/sentiment -d ./data

# Unpack specific version
dataset.sh unpack nlp/sentiment -v a1b2c3d4e5f6...

Options:

  • -v, --version <version> - Unpack specific version (default: latest available)
  • -d, --dest <folder> - Destination folder (default: public/datasets)

Configuration

Environment Variables

  • DSH_CACHE_DIR - Global cache directory (default: ~/.dataset_sh/cache)
  • DSH_GLOBAL_DIR - Global install directory (default: ~/.dataset_sh/global)
  • DSH_PROFILE_FILE - Server profiles file (default: ~/.dataset_sh/profile.json)
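A sketch of how these env-var-with-default lookups can be resolved in Node. resolveDir is a hypothetical helper for illustration, not part of the CLI's public API; DSH_PROFILE_FILE resolves analogously to a file path:

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Illustrative: read a directory from an environment variable, falling back
// to the documented default under ~/.dataset_sh.
function resolveDir(
  envVar: string,
  defaultSubdir: string,
  env: Record<string, string | undefined> = process.env
): string {
  return env[envVar] ?? path.join(os.homedir(), ".dataset_sh", defaultSubdir);
}

const cacheDir = resolveDir("DSH_CACHE_DIR", "cache");
const globalDir = resolveDir("DSH_GLOBAL_DIR", "global");
```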

Server Profiles

Create ~/.dataset_sh/profile.json to configure server access:

{
  "servers": {
    "production": {
      "host": "https://api.example.com",
      "accessKey": "your-access-key"
    },
    "staging": {
      "host": "https://staging-api.example.com",
      "accessKey": "staging-key"
    }
  }
}

datasets.json Format

The datasets.json file tracks project dependencies:

{
  "datasets": {
    "nlp/sentiment": [
      {
        "tag": "v1.2",
        "host": "https://api.example.com"
      }
    ],
    "vision/imagenet": [
      {
        "version": "a1b2c3d4e5f6789...",
        "host": "https://api.example.com"
      }
    ]
  }
}
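The manifest shape above can be modeled with a couple of types. This is an illustrative reading of the format, not the CLI's internal definitions:

```typescript
// Illustrative types for the datasets.json shape shown above.
interface DatasetRef {
  tag?: string;      // semantic tag, e.g. "v1.2"
  version?: string;  // 64-character hex version hash
  host: string;      // server the dataset was installed from
}

interface Manifest {
  datasets: Record<string, DatasetRef[]>;
}

function parseManifest(json: string): Manifest {
  const parsed = JSON.parse(json) as Manifest;
  if (typeof parsed.datasets !== "object" || parsed.datasets === null) {
    throw new Error('datasets.json: missing "datasets" object');
  }
  return parsed;
}
```

Each dataset maps to a list of refs, so one project can pin the same dataset at several tags or versions, possibly from different hosts.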

File Organization

Local Installation Structure

project/
├── datasets.json          # Project dataset manifest
└── dsh_datasets/          # Local dataset installations
    └── nlp/
        └── sentiment/
            ├── tag/
            │   ├── main -> ../version/a1b2c3d4...
            │   └── v1.2 -> ../version/f6e5d4c3...
            └── version/
                ├── a1b2c3d4.../
                └── f6e5d4c3.../

Global Cache Structure

~/.dataset_sh/
├── cache/                # Global cache with integrity checking
│   └── nlp/
│       └── sentiment/
│           └── version/
│               ├── a1b2c3d4.../
│               │   └── sentiment.dataset
│               └── f6e5d4c3.../
│                   └── sentiment.dataset
├── global/               # Global installations
└── profile.json          # Server configurations

How It Works

Installation Process

  1. Tag Resolution - If installing by tag, resolves to specific version via API
  2. Cache Check - Checks if dataset exists in global cache and validates checksum
  3. Download - Downloads dataset if not cached or corrupted
  4. Verification - Validates SHA-256 checksum before caching
  5. Linking - Creates symbolic links (or copies) to target location
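The five steps can be sketched as a pipeline. Every function below is a hypothetical stand-in injected for illustration, not the CLI's real API:

```typescript
// Hypothetical dependency bundle standing in for the real implementation.
interface InstallDeps {
  resolveTag(dataset: string, tag: string): Promise<string>;            // 1. tag -> version hash
  cachedChecksumOk(dataset: string, version: string): Promise<boolean>; // 2. cache check
  download(dataset: string, version: string): Promise<void>;            // 3. fetch into cache
  verifyChecksum(dataset: string, version: string): Promise<boolean>;   // 4. SHA-256 check
  link(dataset: string, version: string, target: string): Promise<void>; // 5. link or copy
}

// Illustrative sketch of the install flow described above.
async function install(
  deps: InstallDeps, dataset: string, tag: string, target: string
): Promise<string> {
  const version = await deps.resolveTag(dataset, tag);
  if (!(await deps.cachedChecksumOk(dataset, version))) {
    await deps.download(dataset, version);
    if (!(await deps.verifyChecksum(dataset, version))) {
      throw new Error(`checksum mismatch for ${dataset}@${version}`);
    }
  }
  await deps.link(dataset, version, target);
  return version;
}
```

Because linking happens last, a cached dataset installs into a new project without touching the network at all.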

Caching Strategy

  • Global Cache - All datasets stored in ~/.dataset_sh/cache by version
  • Integrity Checking - SHA-256 checksums verify file integrity
  • Automatic Redownload - Corrupted cache entries are automatically redownloaded
  • Cross-Platform - Uses appropriate linking strategy per platform
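The integrity check boils down to hashing the cached file and comparing digests. A minimal sketch using Node's crypto module (sha256Hex and integrityOk are illustrative names, not the CLI's API):

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";

// Hash a cached file with SHA-256 and return the hex digest.
function sha256Hex(filePath: string): string {
  return createHash("sha256").update(fs.readFileSync(filePath)).digest("hex");
}

// Compare against the expected 64-character hex digest (case-insensitive).
function integrityOk(filePath: string, expectedHex: string): boolean {
  return sha256Hex(filePath) === expectedHex.toLowerCase();
}
```

A mismatch here is what triggers the automatic redownload of a corrupted cache entry.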

Network Resilience

  • Exponential Backoff - Retries failed downloads with 1s, 2s, 4s delays
  • Smart Error Handling - Distinguishes between retryable and permanent failures
  • Authentication Support - Bearer token authentication for private servers
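The documented 1s, 2s, 4s backoff can be sketched as a small retry helper. This is illustrative only: withRetry is a hypothetical name, and unlike the real CLI it does not distinguish retryable from permanent failures:

```typescript
// Illustrative retry helper: delays double from baseMs (1s, 2s, 4s by default).
// Pass a small baseMs when testing.
async function withRetry<T>(
  fn: () => Promise<T>, attempts = 4, baseMs = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts - 1) throw err;  // out of retries, surface the error
      const delay = baseMs * 2 ** attempt;     // exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```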

Examples

Machine Learning Workflow

# Initialize project
dataset.sh init

# Install training data
dataset.sh install ml/training-data -t latest

# Install validation set
dataset.sh install ml/validation-data -v a1b2c3d4e5f6...

# Unpack for training script
dataset.sh unpack ml/training-data -d ./data/train
dataset.sh unpack ml/validation-data -d ./data/val

Multi-Environment Setup

# Development
dataset.sh install nlp/dataset -t dev -s staging

# Production
dataset.sh install nlp/dataset -t v2.1 -s production

Global Dataset Management

# Install commonly used datasets globally
dataset.sh install -g common/embeddings
dataset.sh install -g common/stopwords

# Use in any project without reinstalling
dataset.sh unpack common/embeddings

Error Handling

The CLI provides clear, actionable error messages:

  • Network failures - Suggests checking connection and retry
  • Authentication errors - Points to profile configuration
  • Missing datasets - Shows available versions and tags
  • Disk space issues - Advises on freeing space
  • Permission errors - Guides on fixing file permissions

Troubleshooting

Debug Mode

When encountering issues, enable debug logging to see detailed internal operations:

dataset.sh --debug install problem/dataset
# or
npx @dataset.sh/cli --debug install problem/dataset

This will show:

  • Which server profiles are being used
  • Network request details and response codes
  • Cache hit/miss information
  • File system operations and linking strategies
  • Checksum verification steps

Common Issues

"datasets.json not found"

# Run init first
dataset.sh init
# or
npx @dataset.sh/cli init

"Server profile not found"

# Check your profile configuration
cat ~/.dataset_sh/profile.json

# Or create one
mkdir -p ~/.dataset_sh
echo '{"servers":{"default":{"host":"https://api.example.com"}}}' > ~/.dataset_sh/profile.json

"Checksum verification failed"

# Clear cache and retry
rm -rf ~/.dataset_sh/cache/category/dataset
dataset.sh --debug install category/dataset
# or
npx @dataset.sh/cli --debug install category/dataset

Network issues

# Use debug mode to see network details
dataset.sh --debug install category/dataset
# or
npx @dataset.sh/cli --debug install category/dataset

# Check server connectivity
curl -v https://your-server.com/api/health

Development

Building

pnpm build

Testing

pnpm test
pnpm test:watch

Compatibility

  • Node.js >= 16.0.0
  • TypeScript >= 5.0.0
  • Cross-platform - Works on Windows, macOS, and Linux

License

MIT