@dataset.sh/cli
A powerful command-line interface for managing datasets with local caching, remote downloads, and flexible storage management. Similar to package managers like pnpm, but designed specifically for dataset files.
Features
- 📦 Local and Global Installation - Install datasets per-project or globally
- 🔄 Intelligent Caching - Global cache with SHA-256 integrity verification
- 🏷️ Tag and Version Support - Install by semantic tags or specific versions
- 🔗 Symbolic Linking - Efficient storage with automatic linking strategies
- 🌐 Multiple Servers - Support for multiple dataset servers with authentication
- 📤 Dataset Unpacking - Extract dataset contents for direct use
- 🔐 Security - Built-in checksum verification and retry logic
Installation
Global Installation
pnpm add -g @dataset.sh/cli
# or
npm install -g @dataset.sh/cli
After global installation, use the dataset.sh command:
dataset.sh init
dataset.sh install nlp/sentiment
Using npx (No Installation Required)
npx @dataset.sh/cli init
npx @dataset.sh/cli install nlp/sentiment
npx @dataset.sh/cli unpack nlp/sentiment
Local Project Installation
pnpm add @dataset.sh/cli
# or
npm install @dataset.sh/cli
Then use via npm scripts or npx.
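One option (a sketch, not taken from the package docs) is to wire the locally installed CLI into package.json scripts. This assumes the package exposes the dataset.sh bin locally (npm places bins from node_modules/.bin on PATH inside scripts) and that your npm version supports npm pkg set; the script names below are just examples:
# Add npm scripts that call the locally installed CLI (script names are examples)
npm pkg set scripts.datasets:install="dataset.sh install"
npm pkg set scripts.datasets:unpack="dataset.sh unpack nlp/sentiment -d ./data"
# Then run them
npm run datasets:install
npm run datasets:unpack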
Quick Start
1. Initialize a Project
# Using global installation
dataset.sh init
# Using npx (no installation required)
npx @dataset.sh/cli init
2. Install a Dataset
# Install dataset with default tag (main)
dataset.sh install nlp/sentiment
# or
npx @dataset.sh/cli install nlp/sentiment
# Install specific tag
dataset.sh install nlp/sentiment -t v1.2
# Install specific version (using version hash)
dataset.sh install nlp/sentiment -v a1b2c3d4e5f6...
# Install globally
dataset.sh install -g nlp/sentiment
3. Unpack for Direct Use
# Unpack to public/datasets/nlp/sentiment
dataset.sh unpack nlp/sentiment
# or
npx @dataset.sh/cli unpack nlp/sentiment
# Unpack to custom location
dataset.sh unpack nlp/sentiment -d ./data
Global Options
--debug
Enable detailed debug logging to stderr. This shows internal operations including:
- Configuration loading and path resolution
- Network requests and responses
- Cache operations (hits/misses)
- File system operations
- Linking strategies and operations
# Enable debug logging for any command
dataset.sh --debug init
dataset.sh --debug install nlp/sentiment
# Using npx
npx @dataset.sh/cli --debug init
npx @dataset.sh/cli --debug install nlp/sentiment
Debug output includes timestamped logs with module prefixes:
- [CLI] - Command-line interface operations
- [CONFIG] - Configuration and path management
- [NETWORK] - HTTP requests and server communication
- [CACHE] - Cache operations and integrity checking
- [LINKING] - File linking and symlink operations
- [FS] - File system operations
- [INIT] - Init command operations
- [INSTALL] - Install command operations
- [UNPACK] - Unpack command operations
Commands
dataset.sh init
Creates a datasets.json file in the current directory.
dataset.sh init
dataset.sh install [dataset]
Installs datasets from datasets.json or adds and installs a specific dataset.
# Install all datasets from datasets.json
dataset.sh install
# Install specific dataset
dataset.sh install nlp/sentiment
# Install with options
dataset.sh install nlp/sentiment -t v1.2 -s myserver
dataset.sh install -g nlp/sentiment -v a1b2c3d4e5f6...
Options:
- -g, --global - Install to global directory (~/.dataset_sh/global)
- -s, --server <profile> - Use specific server profile
- -t, --tag <tag> - Install specific tag (default: main)
- -v, --version <version> - Install specific version (64-character hex string)
dataset.sh unpack <dataset>
Unpacks dataset content to a destination folder. The dataset must be installed first.
# Unpack to public/datasets
dataset.sh unpack nlp/sentiment
# Unpack to custom directory
dataset.sh unpack nlp/sentiment -d ./data
# Unpack specific version
dataset.sh unpack nlp/sentiment -v a1b2c3d4e5f6...
Options:
- -v, --version <version> - Unpack specific version (default: latest available)
- -d, --dest <folder> - Destination folder (default: public/datasets)
Configuration
Environment Variables
- DSH_CACHE_DIR - Global cache directory (default: ~/.dataset_sh/cache)
- DSH_GLOBAL_DIR - Global install directory (default: ~/.dataset_sh/global)
- DSH_PROFILE_FILE - Server profiles file (default: ~/.dataset_sh/profile.json)
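These variables can be set for a single invocation or exported for a whole session; for example (the paths below are placeholders, not defaults):
# Put the global cache on a larger disk for one command
DSH_CACHE_DIR=/mnt/storage/dsh-cache dataset.sh install nlp/sentiment
# Or export an override for the whole shell session
export DSH_GLOBAL_DIR="$HOME/shared/dataset_sh/global"
dataset.sh install -g common/embeddings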
Server Profiles
Create ~/.dataset_sh/profile.json to configure server access:
{
"servers": {
"production": {
"host": "https://api.example.com",
"accessKey": "your-access-key"
},
"staging": {
"host": "https://staging-api.example.com",
"accessKey": "staging-key"
}
}
}
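With these profiles in place, select one per command with the -s/--server option, for example:
# Pull from the "production" profile defined above
dataset.sh install nlp/sentiment -s production
# Use the staging profile while developing
dataset.sh install nlp/sentiment -t dev -s staging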
datasets.json Format
The datasets.json file tracks project dependencies:
{
"datasets": {
"nlp/sentiment": [
{
"tag": "v1.2",
"host": "https://api.example.com"
}
],
"vision/imagenet": [
{
"version": "a1b2c3d4e5f6789...",
"host": "https://api.example.com"
}
]
}
}
File Organization
Local Installation Structure
project/
├── datasets.json          # Project dataset manifest
├── dsh_datasets/          # Local dataset installations
│   └── nlp/
│       └── sentiment/
│           ├── tag/
│           │   ├── main -> ../version/a1b2c3d4...
│           │   └── v1.2 -> ../version/f6e5d4c3...
│           └── version/
│               ├── a1b2c3d4.../
│               └── f6e5d4c3.../
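Tags are plain symlinks into the version store, so you can check which version a tag is pinned to with standard tools:
# Each entry resolves to a version directory, e.g. main -> ../version/a1b2c3d4...
ls -l dsh_datasets/nlp/sentiment/tag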
Global Cache Structure
~/.dataset_sh/
├── cache/                 # Global cache with integrity checking
│   └── nlp/
│       └── sentiment/
│           └── version/
│               ├── a1b2c3d4.../
│               │   └── sentiment.dataset
│               └── f6e5d4c3.../
│                   └── sentiment.dataset
├── global/                # Global installations
└── profile.json           # Server configurations
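The cache keeps one directory per installed version, so it grows over time; you can check its size with standard tools (adjust the path if you have set DSH_CACHE_DIR):
# Show how much disk the global cache is using
du -sh ~/.dataset_sh/cache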
How It Works
Installation Process
- Tag Resolution - If installing by tag, resolves to specific version via API
- Cache Check - Checks if dataset exists in global cache and validates checksum
- Download - Downloads dataset if not cached or corrupted
- Verification - Validates SHA-256 checksum before caching (see the example after this list)
- Linking - Creates symbolic links (or copies) to target location
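You normally never need to run the verification step yourself, but it can be reproduced by hand to sanity-check a cached file. A sketch, assuming the layout shown under Global Cache Structure; replace <version-hash> with an actual 64-character version directory name:
# Hash a cached dataset file (shasum -a 256 or sha256sum) and compare the digest
# against the expected checksum for that version
shasum -a 256 ~/.dataset_sh/cache/nlp/sentiment/version/<version-hash>/sentiment.dataset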
Caching Strategy
- Global Cache - All datasets stored in ~/.dataset_sh/cache by version
- Integrity Checking - SHA-256 checksums verify file integrity
- Automatic Redownload - Corrupted cache entries are automatically redownloaded
- Cross-Platform - Uses appropriate linking strategy per platform
Network Resilience
- Exponential Backoff - Retries failed downloads with 1s, 2s, 4s delays (sketched after this list)
- Smart Error Handling - Distinguishes between retryable and permanent failures
- Authentication Support - Bearer token authentication for private servers
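The retry schedule is built into the CLI and needs no configuration; conceptually it behaves like this shell sketch (illustrative only, not the actual implementation):
# Conceptual retry loop with 1s, 2s, 4s backoff; the CLI applies this to the
# download itself, shown here as a plain shell equivalent
for delay in 1 2 4; do
  dataset.sh install nlp/sentiment && break
  echo "Download failed, retrying in ${delay}s..." >&2
  sleep "$delay"
done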
Examples
Machine Learning Workflow
# Initialize project
dataset.sh init
# Install training data
dataset.sh install ml/training-data -t latest
# Install validation set
dataset.sh install ml/validation-data -v a1b2c3d4e5f6...
# Unpack for training script
dataset.sh unpack ml/training-data -d ./data/train
dataset.sh unpack ml/validation-data -d ./data/val
Multi-Environment Setup
# Development
dataset.sh install nlp/dataset -t dev -s staging
# Production
dataset.sh install nlp/dataset -t v2.1 -s production
Global Dataset Management
# Install commonly used datasets globally
dataset.sh install -g common/embeddings
dataset.sh install -g common/stopwords
# Use in any project without reinstalling
dataset.sh unpack common/embeddings
Error Handling
The CLI provides clear, actionable error messages:
- Network failures - Suggests checking connection and retry
- Authentication errors - Points to profile configuration
- Missing datasets - Shows available versions and tags
- Disk space issues - Advises on freeing space
- Permission errors - Guides on fixing file permissions
Troubleshooting
Debug Mode
When encountering issues, enable debug logging to see detailed internal operations:
dataset.sh --debug install problem/dataset
# or
npx @dataset.sh/cli --debug install problem/dataset
This will show:
- Which server profiles are being used
- Network request details and response codes
- Cache hit/miss information
- File system operations and linking strategies
- Checksum verification steps
Common Issues
"datasets.json not found"
# Run init first
dataset.sh init
# or
npx @dataset.sh/cli init
"Server profile not found"
# Check your profile configuration
cat ~/.dataset_sh/profile.json
# Or create one
mkdir -p ~/.dataset_sh
echo '{"servers":{"default":{"host":"https://api.example.com"}}}' > ~/.dataset_sh/profile.json
"Checksum verification failed"
# Clear cache and retry
rm -rf ~/.dataset_sh/cache/category/dataset
dataset.sh --debug install category/dataset
# or
npx @dataset.sh/cli --debug install category/dataset
Network issues
# Use debug mode to see network details
dataset.sh --debug install category/dataset
# or
npx @dataset.sh/cli --debug install category/dataset
# Check server connectivity
curl -v https://your-server.com/api/health
Development
Building
pnpm build
Testing
pnpm test
pnpm test:watch
Compatibility
- Node.js >= 16.0.0
- TypeScript >= 5.0.0
- Cross-platform - Works on Windows, macOS, and Linux
License
MIT