JSPM

@theanikrtgiri/create-llm

2.0.1
  • Downloads 16
  • License MIT

The fastest way to start training your own Language Model. Create production-ready LLM training projects in seconds.

Package Exports

  • @theanikrtgiri/create-llm
  • @theanikrtgiri/create-llm/dist/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@theanikrtgiri/create-llm) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

🚀 create-llm

The fastest way to start training your own Language Model

Create production-ready LLM training projects in seconds. Like create-next-app but for training custom language models.


📦 npm Package • 📖 Documentation • 🐛 Report Bug • 💡 Request Feature

npx @theanikrtgiri/create-llm my-awesome-llm
cd my-awesome-llm
pip install -r requirements.txt
python training/train.py

That's it! You're training an LLM. ✨


Why create-llm?

Training a language model from scratch is complex. You need:

  • ✅ Model architecture (GPT, BERT, T5...)
  • ✅ Data preprocessing pipeline
  • ✅ Tokenizer training
  • ✅ Training loop with callbacks
  • ✅ Checkpoint management
  • ✅ Evaluation metrics
  • ✅ Text generation
  • ✅ Deployment tools

create-llm gives you all of this in one command.


Features

🎯 Right-Sized Templates

Choose from 4 templates optimized for different use cases:

  • NANO (1M params) - Learn in 2 minutes on any laptop
  • TINY (6M params) - Prototype in 15 minutes on CPU
  • SMALL (100M params) - Production models in hours
  • BASE (1B params) - Research-grade in days

🔧 Complete Toolkit

Everything you need out of the box:

  • PyTorch training infrastructure
  • Data preprocessing pipeline
  • Tokenizer training (BPE, WordPiece, Unigram)
  • Checkpoint management with auto-save
  • TensorBoard integration
  • Live training dashboard
  • Interactive chat interface
  • Model comparison tools
  • Deployment scripts

📊 Smart Defaults

Intelligent configuration that:

  • Auto-detects vocab size from tokenizer
  • Automatically handles sequence length mismatches
  • Warns about model/data size mismatches
  • Detects overfitting during training
  • Suggests optimal hyperparameters
  • Handles cross-platform paths
  • Provides detailed diagnostic messages for errors
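
The vocab-size auto-detection comes down to trusting the trained tokenizer over the config value. A minimal sketch of the idea, assuming the standard Hugging Face tokenizers JSON file (the generated loader may differ):

# Prefer the tokenizer's real vocab size over the value in llm.config.js
# (sketch; assumes a Hugging Face `tokenizers` file at tokenizer/tokenizer.json).
from tokenizers import Tokenizer

def detect_vocab_size(path="tokenizer/tokenizer.json", configured=10000):
    actual = Tokenizer.from_file(path).get_vocab_size()
    if actual != configured:
        print(f"Vocab size mismatch: config={configured}, tokenizer={actual}; using {actual}")
    return actual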

🎨 Plugin System

Optional integrations:

  • WandB - Experiment tracking
  • HuggingFace - Model sharing
  • SynthexAI - Synthetic data generation

Quick Start

🚀 One-Command Setup

# Using npx (recommended - no installation needed)
npx @theanikrtgiri/create-llm my-llm

# Or install globally
npm install -g @theanikrtgiri/create-llm
create-llm my-llm

🎯 Interactive Setup

npx @theanikrtgiri/create-llm

You'll be prompted for:

  • ๐Ÿ“ Project name
  • ๐ŸŽฏ Template (NANO, TINY, SMALL, BASE)
  • ๐Ÿ”ค Tokenizer type (BPE, WordPiece, Unigram)
  • ๐Ÿ”Œ Optional plugins (WandB, HuggingFace, SynthexAI)

⚡ Quick Mode

# Specify everything upfront
npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe --skip-install

Templates

📦 NANO (NEW!)

Perfect for learning and quick experiments

Parameters: ~1M
Hardware:   Any CPU (2GB RAM)
Time:       1-2 minutes
Data:       100+ examples
Use:        Learning, testing, demos

When to use:

  • First time training an LLM
  • Quick experiments and testing
  • Educational purposes
  • Understanding the pipeline
  • Limited data (100-1000 examples)

📦 TINY

Perfect for prototyping and small projects

Parameters: ~6M
Hardware:   CPU or basic GPU (4GB RAM)
Time:       5-15 minutes
Data:       1,000+ examples
Use:        Prototypes, small projects

When to use:

  • Small-scale projects
  • Limited data (1K-10K examples)
  • Prototyping before scaling
  • Personal experiments
  • CPU-only environments

📦 SMALL

Perfect for production applications

Parameters: ~100M
Hardware:   RTX 3060+ (12GB VRAM)
Time:       1-3 hours
Data:       10,000+ examples
Use:        Production, real apps

When to use:

  • Production applications
  • Domain-specific models
  • Real-world deployments
  • Good data availability
  • GPU available

📦 BASE

Perfect for research and high-quality models

Parameters: ~1B
Hardware:   A100 or multi-GPU
Time:       1-3 days
Data:       100,000+ examples
Use:        Research, high-quality

When to use:

  • Research projects
  • High-quality requirements
  • Large datasets available
  • Multi-GPU setup
  • Competitive performance needed

Complete Workflow

1️⃣ Create Your Project

npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe
cd my-llm

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Add Your Data

Place your text files in data/raw/:

# Example: Download Shakespeare
curl https://www.gutenberg.org/files/100/100-0.txt > data/raw/shakespeare.txt

# Or add your own files
cp /path/to/your/data.txt data/raw/

💡 Pro Tip: Start with at least 1MB of text for meaningful results
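
If you want to check that guideline quickly, a throwaway snippet like this (not part of the generated project) reports how much raw text you have:

# Report the total size of .txt files in data/raw/ (illustrative snippet).
from pathlib import Path

total_bytes = sum(p.stat().st_size for p in Path("data/raw").glob("*.txt"))
print(f"{total_bytes / 1_000_000:.2f} MB of raw text in data/raw/")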

4️⃣ Train Tokenizer

python tokenizer/train.py --data data/raw/

🔤 This creates a vocabulary from your data
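
Conceptually this step is ordinary BPE training with the Hugging Face tokenizers library. A rough sketch using the same defaults as the tokenizer section of llm.config.js; the generated tokenizer/train.py may be organized differently:

# Sketch of BPE tokenizer training using the Hugging Face `tokenizers` library.
from pathlib import Path
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

files = [str(p) for p in Path("data/raw").glob("*.txt")]
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=10000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"],
)
tokenizer.train(files, trainer)
tokenizer.save("tokenizer/tokenizer.json")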

5️⃣ Prepare Dataset

python data/prepare.py

📊 This tokenizes and prepares your data for training
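
The core idea is to tokenize the corpus and slice it into overlapping windows using the max_length and stride values from llm.config.js. A minimal sketch (the generated data/prepare.py may differ):

# Cut a long token stream into overlapping training windows.
# max_length and stride mirror the data section of llm.config.js.
def make_windows(token_ids, max_length=512, stride=256):
    windows = []
    for start in range(0, max(1, len(token_ids) - max_length + 1), stride):
        windows.append(token_ids[start:start + max_length])
    return windows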

6️⃣ Start Training

# Basic training
python training/train.py

# With live dashboard (recommended!)
python training/train.py --dashboard
# Then open http://localhost:5000

# Resume from checkpoint
python training/train.py --resume checkpoints/checkpoint-1000.pt

📈 Watch your model learn in real-time!

7️⃣ Evaluate Your Model

python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt

8️⃣ Generate Text

python evaluation/generate.py \
  --checkpoint checkpoints/checkpoint-best.pt \
  --prompt "Once upon a time" \
  --temperature 0.8

✨ See your model's creativity in action!
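
The --temperature flag controls how adventurous sampling is: logits are divided by the temperature before the softmax, so values below 1.0 make output more predictable and values above 1.0 make it more varied. In PyTorch terms the sampling step looks roughly like this (a sketch, not the exact generate.py implementation):

# Temperature sampling: scale logits, softmax, then draw one token.
import torch

def sample_next_token(logits, temperature=0.8):
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)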

9️⃣ Interactive Chat

python chat.py --checkpoint checkpoints/checkpoint-best.pt

💬 Chat with your trained model!

🔟 Deploy

# To Hugging Face
python deploy.py --to huggingface --repo-id username/my-model

# To Replicate
python deploy.py --to replicate --model-name my-model

🚀 Share your model with the world!


Project Structure

my-llm/
├── 📁 data/
│   ├── raw/              # Your training data goes here
│   ├── processed/        # Tokenized data (auto-generated)
│   ├── dataset.py        # PyTorch dataset classes
│   └── prepare.py        # Data preprocessing script
│
├── 📁 models/
│   ├── architectures/    # Model implementations
│   │   ├── gpt.py       # GPT architecture
│   │   ├── nano.py      # 1M parameter model
│   │   ├── tiny.py      # 6M parameter model
│   │   ├── small.py     # 100M parameter model
│   │   └── base.py      # 1B parameter model
│   ├── __init__.py
│   └── config.py        # Configuration loader
│
├── 📁 tokenizer/
│   ├── train.py         # Tokenizer training script
│   └── tokenizer.json   # Trained tokenizer (auto-generated)
│
├── 📁 training/
│   ├── train.py         # Main training script
│   ├── trainer.py       # Trainer class
│   ├── callbacks/       # Training callbacks
│   │   ├── base.py
│   │   ├── checkpoint.py
│   │   ├── logging.py
│   │   └── checkpoint_manager.py
│   └── dashboard/       # Live training dashboard
│       ├── dashboard_server.py
│       └── templates/
│
├── 📁 evaluation/
│   ├── evaluate.py      # Model evaluation
│   └── generate.py      # Text generation
│
├── 📁 plugins/          # Optional integrations
│   ├── wandb_plugin.py
│   ├── huggingface_plugin.py
│   └── synthex_plugin.py
│
├── 📁 checkpoints/      # Saved models (auto-generated)
├── 📁 logs/             # Training logs (auto-generated)
│
├── 📄 llm.config.js     # Main configuration file
├── 📄 requirements.txt  # Python dependencies
├── 📄 chat.py           # Interactive chat interface
├── 📄 deploy.py         # Deployment script
├── 📄 compare.py        # Model comparison tool
└── 📄 README.md         # Project documentation

Configuration

Everything is controlled via llm.config.js:

module.exports = {
  // Model architecture
  model: {
    type: 'gpt',
    size: 'tiny',
    vocab_size: 10000,      // Auto-detected from tokenizer
    max_length: 512,
    layers: 4,
    heads: 4,
    dim: 256,
    dropout: 0.2,
  },

  // Training settings
  training: {
    batch_size: 16,
    learning_rate: 0.0006,
    warmup_steps: 500,
    max_steps: 10000,
    eval_interval: 500,
    save_interval: 2000,
    optimizer: 'adamw',
    weight_decay: 0.01,
    gradient_clip: 1.0,
    mixed_precision: false,
    gradient_accumulation_steps: 1,
  },

  // Data settings
  data: {
    max_length: 512,
    stride: 256,
    val_split: 0.1,
    shuffle: true,
  },

  // Tokenizer settings
  tokenizer: {
    type: 'bpe',
    vocab_size: 10000,
    min_frequency: 2,
    special_tokens: ["<pad>", "<unk>", "<s>", "</s>"],
  },

  // Plugins
  plugins: [
    // 'wandb',
    // 'huggingface',
    // 'synthex',
  ],
};

📋 CLI Reference

Commands

npx @theanikrtgiri/create-llm [project-name] [options]

Options

Option              | Description                                        | Default
--------------------|----------------------------------------------------|------------
--template <name>   | Template to use (nano, tiny, small, base, custom)  | Interactive
--tokenizer <type>  | Tokenizer type (bpe, wordpiece, unigram)           | Interactive
--skip-install      | Skip npm/pip installation                          | false
-y, --yes           | Skip all prompts, use defaults                     | false
-h, --help          | Show help                                          | -
-v, --version       | Show version                                       | -

Examples

# Interactive mode (recommended for first time)
npx @theanikrtgiri/create-llm

# Quick start with defaults
npx @theanikrtgiri/create-llm my-project

# Specify everything
npx @theanikrtgiri/create-llm my-project --template nano --tokenizer bpe --skip-install

# Skip prompts
npx @theanikrtgiri/create-llm my-project -y

Advanced Features

Live Training Dashboard

Monitor training in real-time with a web interface:

python training/train.py --dashboard

Then open http://localhost:5000 to see:

  • Real-time loss curves
  • Learning rate schedule
  • Tokens per second
  • GPU memory usage
  • Recent checkpoints

Model Comparison

Compare multiple trained models:

python compare.py checkpoints/model-v1/ checkpoints/model-v2/

Shows:

  • Side-by-side metrics
  • Sample generations
  • Performance comparison
  • Recommendation

Checkpoint Management

Automatic checkpoint management:

  • Saves best model based on validation loss
  • Keeps last N checkpoints (configurable)
  • Auto-saves on Ctrl+C
  • Resume from any checkpoint

# Resume training
python training/train.py --resume checkpoints/checkpoint-5000.pt

# Evaluate specific checkpoint
python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt
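
The "keep last N checkpoints" behavior amounts to pruning old files. A minimal sketch, assuming checkpoints are named checkpoint-<step>.pt as in the commands above (the generated checkpoint_manager.py may work differently):

# Keep only the newest `keep` step checkpoints; never touch checkpoint-best.pt.
from pathlib import Path

def prune_checkpoints(ckpt_dir="checkpoints", keep=5):
    ckpts = sorted(
        (p for p in Path(ckpt_dir).glob("checkpoint-*.pt") if p.stem != "checkpoint-best"),
        key=lambda p: p.stat().st_mtime,
    )
    for old in ckpts[:-keep]:
        old.unlink()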

Custom Plugins

Create your own plugins:

# plugins/my_plugin.py
from plugins.base import BasePlugin

class MyPlugin(BasePlugin):
    def on_train_start(self, trainer):
        print("Training started!")
    
    def on_step_end(self, trainer, step, loss):
        # Log to your service
        pass
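
How the trainer consumes these hooks is an implementation detail, but conceptually it just calls every registered plugin at each lifecycle event. A simplified sketch (the generated trainer.py may wire this differently):

# Simplified view of hook dispatch: call every plugin at each lifecycle event.
class MiniTrainer:
    def __init__(self, plugins=None):
        self.plugins = plugins or []

    def train_step(self, step):
        # Placeholder for the real forward/backward pass.
        return 0.0

    def fit(self, max_steps):
        for plugin in self.plugins:
            plugin.on_train_start(self)
        for step in range(max_steps):
            loss = self.train_step(step)
            for plugin in self.plugins:
                plugin.on_step_end(self, step, loss)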

Best Practices

Data Preparation

Minimum Data Requirements:

  • NANO: 100+ examples (good for learning)
  • TINY: 1,000+ examples (minimum for decent results)
  • SMALL: 10,000+ examples (recommended)
  • BASE: 100,000+ examples (for quality)

Data Quality:

  • Use clean, well-formatted text
  • Remove HTML, markdown, or special formatting
  • Ensure consistent encoding (UTF-8)
  • Remove duplicates
  • Balance different content types
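
A minimal cleaning pass along these lines could look as follows (a sketch; real preprocessing usually deserves proper HTML parsing and smarter deduplication):

# Crude cleanup: strip HTML-ish tags, collapse whitespace, drop duplicate lines.
import re
from pathlib import Path

def clean_text_file(path):
    seen, kept = set(), []
    for line in Path(path).read_text(encoding="utf-8", errors="replace").splitlines():
        line = re.sub(r"<[^>]+>", " ", line)      # remove tags
        line = re.sub(r"\s+", " ", line).strip()  # normalize whitespace
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)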

Training Tips

Avoid Overfitting:

  • Watch for perplexity < 1.5 (warning sign)
  • Use validation split (10% recommended)
  • Increase dropout if overfitting
  • Add more data if possible
  • Use smaller model for small datasets
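
The perplexity warning is easy to reproduce yourself: perplexity is just the exponential of the validation cross-entropy loss, so the check boils down to a few lines:

# Perplexity = exp(validation loss); values near 1.0 suggest memorization.
import math

def check_overfitting(val_loss, threshold=1.5):
    perplexity = math.exp(val_loss)
    if perplexity < threshold:
        print(f"Warning: perplexity {perplexity:.2f} is suspiciously low (possible overfitting)")
    return perplexity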

Optimize Training:

  • Start with NANO to test pipeline
  • Use mixed precision on GPU (mixed_precision: true)
  • Increase gradient_accumulation_steps if OOM
  • Monitor GPU usage with dashboard
  • Save checkpoints frequently

Hyperparameter Tuning:

  • Learning rate: Start with 3e-4, adjust if unstable
  • Batch size: As large as GPU allows
  • Warmup steps: 10% of total steps
  • Dropout: 0.1-0.3 depending on data size

Deployment

Before Deploying:

  • Evaluate on held-out test set
  • Test generation quality
  • Check model size
  • Verify inference speed
  • Test on target hardware

Deployment Options:

  • Hugging Face Hub (easiest)
  • Replicate (API endpoint)
  • Docker container (custom)
  • Cloud platforms (AWS, GCP, Azure)

Troubleshooting

Common Issues

"Vocab size mismatch detected"

  • ✅ This is normal! The tool auto-detects and fixes it
  • The model will use the actual tokenizer vocab size

"Position embedding index error" or sequences too long

  • ✅ Automatically handled! Sequences exceeding max_length are truncated
  • The model logs warnings when truncation occurs
  • Check your data preprocessing if you see frequent truncation warnings
  • Consider increasing max_length in config if you need longer sequences
  • Note: Increasing max_length requires retraining from scratch

"Model may be too large for dataset"

  • ⚠️ Warning: Risk of overfitting
  • Solutions: Add more data, use smaller template, increase dropout

"Perplexity < 1.1 indicates severe overfitting"

  • โŒ Model memorized the data
  • Solutions: Add much more data, use smaller model, increase regularization

"CUDA out of memory"

  • Reduce batch_size in llm.config.js
  • Enable mixed_precision: true
  • Increase gradient_accumulation_steps
  • Use smaller model template
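
Why raising gradient_accumulation_steps helps with OOM: the trainer runs several small batches, scales each loss, and steps the optimizer only every N batches, so the effective batch size grows without extra memory. A generic PyTorch sketch (model, dataloader, and optimizer are assumed arguments, not names from the generated project):

# Gradient accumulation: effective batch = per-step batch_size * accumulation_steps.
import torch.nn.functional as F

def train_with_accumulation(model, dataloader, optimizer, accumulation_steps=4):
    model.train()
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(dataloader):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        (loss / accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()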

"Tokenizer not found"

  • Run python tokenizer/train.py --data data/raw/ first
  • Make sure data/raw/ contains .txt files

"Training loss not decreasing"

  • Check learning rate (try 1e-4 to 1e-3)
  • Verify data is loading correctly
  • Check for data preprocessing issues
  • Try longer warmup period

Getting Help


Requirements

For CLI Tool

  • Node.js 18.0.0 or higher
  • npm 8.0.0 or higher

For Training

  • Python 3.8 or higher
  • PyTorch 2.0.0 or higher
  • 4GB RAM minimum (NANO/TINY)
  • 12GB VRAM recommended (SMALL)
  • 40GB+ VRAM for BASE

Operating Systems

  • ✅ Windows 10/11
  • ✅ macOS 10.15+
  • ✅ Linux (Ubuntu 20.04+)

Development

Setup Development Environment

git clone https://github.com/theaniketgiri/create-llm.git
cd create-llm
npm install

Build

npm run build

Development Mode

npm run dev

Test Locally

node dist/index.js test-project --template nano

Run Tests

npm test

Publish

npm version patch  # or minor/major
npm publish

๐Ÿค Contributing

We welcome contributions from everyone!

📖 Contributing Guide • 🐛 Report Bug • 💡 Request Feature

🎯 Areas We Need Help

Area              | Description                          | Difficulty
------------------|--------------------------------------|-----------
🐛 Bug Fixes      | Fix issues and improve stability     | 🟢 Easy
📝 Documentation  | Improve guides and examples          | 🟢 Easy
🎨 New Templates  | Add BERT, T5, custom architectures   | 🟡 Medium
🔌 Plugins        | Integrate new services               | 🟡 Medium
🧪 Testing        | Increase test coverage               | 🟡 Medium
🌍 i18n           | Internationalization support         | 🔴 Hard

👥 Contributors

Thanks to all contributors who have helped make this project better!



Roadmap

v1.1 (Next Release)

  • More model architectures (BERT, T5)
  • Distributed training support
  • Model quantization tools
  • Fine-tuning templates

v1.2

  • Web UI for project management
  • Automatic hyperparameter tuning
  • Model compression tools
  • More deployment targets

v2.0

  • Multi-modal support
  • Reinforcement learning from human feedback
  • Advanced optimization techniques
  • Cloud training integration

📄 License

MIT © Aniket Giri

See LICENSE for more information.


Acknowledgments

Built with amazing open-source tools:

Special thanks to the LLM community for inspiration and feedback.


โญ Star History

If you find this project useful, please consider giving it a star!

Star History Chart


Made with ❤️ for the LLM community

GitHub • npm • Issues • Twitter

๐Ÿ™ Support This Project

If create-llm helped you, consider:

  • โญ Starring the repo
  • ๐Ÿ› Reporting bugs
  • ๐Ÿ’ก Suggesting features
  • ๐Ÿ“ Improving docs
  • ๐Ÿ”€ Contributing code

Together, let's make LLM training accessible to everyone!