create-llm
The fastest way to start training your own Language Model
Create production-ready LLM training projects in seconds. Like create-next-app but for training custom language models.
npm Package • Documentation • Report Bug • Request Feature
npx @theanikrtgiri/create-llm my-awesome-llm
cd my-awesome-llm
pip install -r requirements.txt
python training/train.py
That's it! You're training an LLM.
Why create-llm?
Training a language model from scratch is complex. You need:
- Model architecture (GPT, BERT, T5...)
- Data preprocessing pipeline
- Tokenizer training
- Training loop with callbacks
- Checkpoint management
- Evaluation metrics
- Text generation
- Deployment tools
create-llm gives you all of this in one command.
Features
Right-Sized Templates
Choose from 4 templates optimized for different use cases:
- NANO (1M params) - Learn in 2 minutes on any laptop
- TINY (6M params) - Prototype in 15 minutes on CPU
- SMALL (100M params) - Production models in hours
- BASE (1B params) - Research-grade in days
Complete Toolkit
Everything you need out of the box:
- PyTorch training infrastructure
- Data preprocessing pipeline
- Tokenizer training (BPE, WordPiece, Unigram)
- Checkpoint management with auto-save
- TensorBoard integration
- Live training dashboard
- Interactive chat interface
- Model comparison tools
- Deployment scripts
Smart Defaults
Intelligent configuration that:
- Auto-detects vocab size from tokenizer (see the sketch after this list)
- Automatically handles sequence length mismatches
- Warns about model/data size mismatches
- Detects overfitting during training
- Suggests optimal hyperparameters
- Handles cross-platform paths
- Provides detailed diagnostic messages for errors
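For intuition, here is a minimal sketch of what vocab-size auto-detection can look like, using the Hugging Face tokenizers library. The in-memory `config` dictionary and the file path are illustrative assumptions, not the tool's exact implementation.

```python
# Illustrative sketch of vocab-size auto-detection (not the tool's exact code).
from tokenizers import Tokenizer

config = {"model": {"vocab_size": 10000}}  # hypothetical in-memory config

tokenizer = Tokenizer.from_file("tokenizer/tokenizer.json")
actual_vocab = tokenizer.get_vocab_size()
if actual_vocab != config["model"]["vocab_size"]:
    print(f"Vocab size mismatch: using {actual_vocab} from the tokenizer")
    config["model"]["vocab_size"] = actual_vocab
```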
Plugin System
Optional integrations:
- WandB - Experiment tracking
- HuggingFace - Model sharing
- SynthexAI - Synthetic data generation
Quick Start
One-Command Setup
# Using npx (recommended - no installation needed)
npx @theanikrtgiri/create-llm my-llm
# Or install globally
npm install -g @theanikrtgiri/create-llm
create-llm my-llm
Interactive Setup
npx @theanikrtgiri/create-llm
You'll be prompted for:
- Project name
- Template (NANO, TINY, SMALL, BASE)
- Tokenizer type (BPE, WordPiece, Unigram)
- Optional plugins (WandB, HuggingFace, SynthexAI)
Quick Mode
# Specify everything upfront
npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe --skip-install
Templates
NANO (NEW!)
Perfect for learning and quick experiments
Parameters: ~1M
Hardware: Any CPU (2GB RAM)
Time: 1-2 minutes
Data: 100+ examples
Use: Learning, testing, demos
When to use:
- First time training an LLM
- Quick experiments and testing
- Educational purposes
- Understanding the pipeline
- Limited data (100-1000 examples)
TINY
Perfect for prototyping and small projects
Parameters: ~6M
Hardware: CPU or basic GPU (4GB RAM)
Time: 5-15 minutes
Data: 1,000+ examples
Use: Prototypes, small projects
When to use:
- Small-scale projects
- Limited data (1K-10K examples)
- Prototyping before scaling
- Personal experiments
- CPU-only environments
SMALL
Perfect for production applications
Parameters: ~100M
Hardware: RTX 3060+ (12GB VRAM)
Time: 1-3 hours
Data: 10,000+ examples
Use: Production, real apps
When to use:
- Production applications
- Domain-specific models
- Real-world deployments
- Good data availability
- GPU available
BASE
Perfect for research and high-quality models
Parameters: ~1B
Hardware: A100 or multi-GPU
Time: 1-3 days
Data: 100,000+ examples
Use: Research, high-quality models
When to use:
- Research projects
- High-quality requirements
- Large datasets available
- Multi-GPU setup
- Competitive performance needed
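As a rough sanity check on the parameter counts above, a GPT-style decoder's size can be estimated from its vocabulary, context length, hidden dimension, and layer count. The sketch below is an approximation for intuition only, not the exact count the generated project reports.

```python
# Rough GPT-style parameter estimate: embeddings plus ~12*dim^2 per block
# (attention + feed-forward), ignoring small terms like layer norms.
def estimate_params(vocab_size: int, max_length: int, dim: int, layers: int) -> int:
    embeddings = vocab_size * dim + max_length * dim  # token + position embeddings
    per_layer = 12 * dim * dim                        # attention + MLP weights
    return embeddings + layers * per_layer

# A TINY-like config (vocab 10000, context 512, dim 256, 4 layers) lands near 6M:
print(f"~{estimate_params(10000, 512, 256, 4) / 1e6:.1f}M parameters")
```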
Complete Workflow
1️⃣ Create Your Project
npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe
cd my-llm
2️⃣ Install Dependencies
pip install -r requirements.txt
3️⃣ Add Your Data
Place your text files in data/raw/:
# Example: Download Shakespeare
curl https://www.gutenberg.org/files/100/100-0.txt > data/raw/shakespeare.txt
# Or add your own files
cp /path/to/your/data.txt data/raw/
Pro tip: start with at least 1MB of text for meaningful results.
4️⃣ Train Tokenizer
python tokenizer/train.py --data data/raw/
This creates a vocabulary from your data.
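If you are curious what this step does under the hood, the sketch below shows BPE training with the Hugging Face tokenizers library. The paths, vocab size, and special tokens mirror the defaults in llm.config.js but are illustrative, not the project's exact script.

```python
# Minimal sketch of BPE tokenizer training with the `tokenizers` library.
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=10000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"],
)

files = [str(p) for p in Path("data/raw").glob("*.txt")]
tokenizer.train(files, trainer)
tokenizer.save("tokenizer/tokenizer.json")
```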
5️⃣ Prepare Dataset
python data/prepare.py
This tokenizes and prepares your data for training.
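Conceptually, preparation slices the tokenized corpus into overlapping windows controlled by the max_length and stride settings in llm.config.js. A minimal sketch, assuming a flat list of token ids (the actual prepare.py may differ in detail):

```python
def chunk_token_ids(token_ids, max_length=512, stride=256):
    """Yield overlapping windows of token ids for causal LM training.

    Trailing tokens that don't fill a full window are dropped in this
    simple version.
    """
    for start in range(0, max(len(token_ids) - max_length, 1), stride):
        window = token_ids[start:start + max_length]
        if len(window) > 1:  # need at least one input/target pair
            yield window
```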
6️⃣ Start Training
# Basic training
python training/train.py
# With live dashboard (recommended!)
python training/train.py --dashboard
# Then open http://localhost:5000
# Resume from checkpoint
python training/train.py --resume checkpoints/checkpoint-1000.pt
Watch your model learn in real-time!
7️⃣ Evaluate Your Model
python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt
8️⃣ Generate Text
python evaluation/generate.py \
--checkpoint checkpoints/checkpoint-best.pt \
--prompt "Once upon a time" \
--temperature 0.8
See your model's creativity in action!
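The --temperature flag controls how adventurous sampling is: logits are divided by the temperature before the softmax, so lower values concentrate probability on the most likely tokens. A minimal PyTorch sketch of the idea (not the project's exact generation code):

```python
import torch

def sample_next_token(logits, temperature=0.8):
    """Sample one token id from a 1-D tensor of logits."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```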
9️⃣ Interactive Chat
python chat.py --checkpoint checkpoints/checkpoint-best.pt
Chat with your trained model!
🔟 Deploy
# To Hugging Face
python deploy.py --to huggingface --repo-id username/my-model
# To Replicate
python deploy.py --to replicate --model-name my-model
Share your model with the world!
Project Structure
my-llm/
├── data/
│   ├── raw/                    # Your training data goes here
│   ├── processed/              # Tokenized data (auto-generated)
│   ├── dataset.py              # PyTorch dataset classes
│   └── prepare.py              # Data preprocessing script
│
├── models/
│   ├── architectures/          # Model implementations
│   │   ├── gpt.py              # GPT architecture
│   │   ├── nano.py             # 1M parameter model
│   │   ├── tiny.py             # 6M parameter model
│   │   ├── small.py            # 100M parameter model
│   │   └── base.py             # 1B parameter model
│   ├── __init__.py
│   └── config.py               # Configuration loader
│
├── tokenizer/
│   ├── train.py                # Tokenizer training script
│   └── tokenizer.json          # Trained tokenizer (auto-generated)
│
├── training/
│   ├── train.py                # Main training script
│   ├── trainer.py              # Trainer class
│   ├── callbacks/              # Training callbacks
│   │   ├── base.py
│   │   ├── checkpoint.py
│   │   ├── logging.py
│   │   └── checkpoint_manager.py
│   └── dashboard/              # Live training dashboard
│       ├── dashboard_server.py
│       └── templates/
│
├── evaluation/
│   ├── evaluate.py             # Model evaluation
│   └── generate.py             # Text generation
│
├── plugins/                    # Optional integrations
│   ├── wandb_plugin.py
│   ├── huggingface_plugin.py
│   └── synthex_plugin.py
│
├── checkpoints/                # Saved models (auto-generated)
├── logs/                       # Training logs (auto-generated)
│
├── llm.config.js               # Main configuration file
├── requirements.txt            # Python dependencies
├── chat.py                     # Interactive chat interface
├── deploy.py                   # Deployment script
├── compare.py                  # Model comparison tool
└── README.md                   # Project documentation
Configuration
Everything is controlled via llm.config.js:
module.exports = {
// Model architecture
model: {
type: 'gpt',
size: 'tiny',
vocab_size: 10000, // Auto-detected from tokenizer
max_length: 512,
layers: 4,
heads: 4,
dim: 256,
dropout: 0.2,
},
// Training settings
training: {
batch_size: 16,
learning_rate: 0.0006,
warmup_steps: 500,
max_steps: 10000,
eval_interval: 500,
save_interval: 2000,
optimizer: 'adamw',
weight_decay: 0.01,
gradient_clip: 1.0,
mixed_precision: false,
gradient_accumulation_steps: 1,
},
// Data settings
data: {
max_length: 512,
stride: 256,
val_split: 0.1,
shuffle: true,
},
// Tokenizer settings
tokenizer: {
type: 'bpe',
vocab_size: 10000,
min_frequency: 2,
special_tokens: ["<pad>", "<unk>", "<s>", "</s>"],
},
// Plugins
plugins: [
// 'wandb',
// 'huggingface',
// 'synthex',
],
};
CLI Reference
Commands
npx @theanikrtgiri/create-llm [project-name] [options]
Options
| Option | Description | Default |
|---|---|---|
| `--template <name>` | Template to use (nano, tiny, small, base, custom) | Interactive |
| `--tokenizer <type>` | Tokenizer type (bpe, wordpiece, unigram) | Interactive |
| `--skip-install` | Skip npm/pip installation | false |
| `-y, --yes` | Skip all prompts, use defaults | false |
| `-h, --help` | Show help | - |
| `-v, --version` | Show version | - |
Examples
# Interactive mode (recommended for first time)
npx @theanikrtgiri/create-llm
# Quick start with defaults
npx @theanikrtgiri/create-llm my-project
# Specify everything
npx @theanikrtgiri/create-llm my-project --template nano --tokenizer bpe --skip-install
# Skip prompts
npx @theanikrtgiri/create-llm my-project -y
Advanced Features
Live Training Dashboard
Monitor training in real-time with a web interface:
python training/train.py --dashboard
Then open http://localhost:5000 to see:
- Real-time loss curves
- Learning rate schedule
- Tokens per second
- GPU memory usage
- Recent checkpoints
Model Comparison
Compare multiple trained models:
python compare.py checkpoints/model-v1/ checkpoints/model-v2/
This shows:
- Side-by-side metrics
- Sample generations
- Performance comparison
- Recommendation
Checkpoint Management
Automatic checkpoint management:
- Saves best model based on validation loss
- Keeps last N checkpoints (configurable)
- Auto-saves on Ctrl+C
- Resume from any checkpoint
# Resume training
python training/train.py --resume checkpoints/checkpoint-5000.pt
# Evaluate specific checkpoint
python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt
Custom Plugins
Create your own plugins:
# plugins/my_plugin.py
from plugins.base import BasePlugin
class MyPlugin(BasePlugin):
    def on_train_start(self, trainer):
        print("Training started!")

    def on_step_end(self, trainer, step, loss):
        # Log to your service
        pass
Best Practices
Data Preparation
Minimum Data Requirements:
- NANO: 100+ examples (good for learning)
- TINY: 1,000+ examples (minimum for decent results)
- SMALL: 10,000+ examples (recommended)
- BASE: 100,000+ examples (for quality)
Data Quality:
- Use clean, well-formatted text
- Remove HTML, markdown, or special formatting
- Ensure consistent encoding (UTF-8)
- Remove duplicates (see the cleanup sketch after this list)
- Balance different content types
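A minimal cleanup sketch covering the encoding, whitespace, and duplicate points above. The paths and rules are illustrative; adapt them to your own data.

```python
# Basic text cleanup before training: normalize whitespace and drop exact
# duplicate lines. Output path is illustrative; use whatever file you will
# actually feed to the tokenizer.
import re
from pathlib import Path

seen, cleaned = set(), []
for path in Path("data/raw").glob("*.txt"):
    for line in path.read_text(encoding="utf-8", errors="replace").splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)

Path("data/cleaned.txt").write_text("\n".join(cleaned), encoding="utf-8")
```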
Training Tips
Avoid Overfitting:
- Watch for perplexity < 1.5 (a warning sign; see the note after this list)
- Use validation split (10% recommended)
- Increase dropout if overfitting
- Add more data if possible
- Use smaller model for small datasets
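Why such low perplexity is a red flag: perplexity is the exponential of the per-token cross-entropy loss, so values near 1.0 mean the model assigns almost all probability mass to the exact training text, i.e. it is memorizing rather than generalizing. The numbers below are an illustrative example, not project output.

```python
# Perplexity = exp(cross-entropy loss). A validation loss of 0.05 nats/token
# gives perplexity ~1.05, which usually signals memorization on a small dataset.
import math
print(math.exp(0.05))
```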
Optimize Training:
- Start with NANO to test pipeline
- Use mixed precision on GPU (mixed_precision: true)
- Increase gradient_accumulation_steps if you run out of GPU memory (see the sketch after this list)
- Monitor GPU usage with the dashboard
- Save checkpoints frequently
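Gradient accumulation trades wall-clock speed for memory: gradients from several small batches are summed before one optimizer step, which behaves like a larger effective batch. A minimal sketch, assuming the model returns a scalar loss (not the generated trainer's exact code):

```python
# Sketch of gradient accumulation: sum gradients over several micro-batches,
# then take one optimizer step.
def train_with_accumulation(model, optimizer, data_loader, accumulation_steps=4):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = model(inputs, targets) / accumulation_steps  # scale each micro-batch
        loss.backward()                                      # gradients accumulate
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```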
Hyperparameter Tuning:
- Learning rate: Start with 3e-4, adjust if unstable
- Batch size: As large as GPU allows
- Warmup steps: about 10% of total steps (see the sketch after this list)
- Dropout: 0.1-0.3 depending on data size
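For the warmup guideline, a simple linear ramp to the base learning rate can be written with PyTorch's LambdaLR. The function name and defaults below are illustrative, not the project's scheduler.

```python
# Linear learning-rate warmup: scale the base LR from near 0 up to 1.0 over
# `warmup_steps` optimizer steps, then hold it constant.
import torch

def warmup_schedule(optimizer, warmup_steps=1000):
    def lr_lambda(step):
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```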
Deployment
Before Deploying:
- Evaluate on held-out test set
- Test generation quality
- Check model size
- Verify inference speed
- Test on target hardware
Deployment Options:
- Hugging Face Hub (easiest)
- Replicate (API endpoint)
- Docker container (custom)
- Cloud platforms (AWS, GCP, Azure)
Troubleshooting
Common Issues
"Vocab size mismatch detected"
- This is normal! The tool auto-detects and fixes it
- The model will use the actual tokenizer vocab size
"Position embedding index error" or sequences too long
- Automatically handled: sequences exceeding max_length are truncated
- The model logs warnings when truncation occurs
- Check your data preprocessing if you see frequent truncation warnings
- Consider increasing max_length in the config if you need longer sequences
- Note: increasing max_length requires retraining from scratch
"Model may be too large for dataset"
- Warning: risk of overfitting
- Solutions: Add more data, use smaller template, increase dropout
"Perplexity < 1.1 indicates severe overfitting"
- The model has memorized the data
- Solutions: Add much more data, use smaller model, increase regularization
"CUDA out of memory"
- Reduce batch_size in llm.config.js
- Enable mixed_precision: true
- Increase gradient_accumulation_steps
- Use a smaller model template
"Tokenizer not found"
- Run python tokenizer/train.py --data data/raw/ first
- Make sure data/raw/ contains .txt files
"Training loss not decreasing"
- Check learning rate (try 1e-4 to 1e-3)
- Verify data is loading correctly
- Check for data preprocessing issues
- Try longer warmup period
Getting Help
- Full Documentation
- Discord Community
- Report Issues
- Email Support
Requirements
For CLI Tool
- Node.js 18.0.0 or higher
- npm 8.0.0 or higher
For Training
- Python 3.8 or higher
- PyTorch 2.0.0 or higher
- 4GB RAM minimum (NANO/TINY)
- 12GB VRAM recommended (SMALL)
- 40GB+ VRAM for BASE
Operating Systems
- Windows 10/11
- macOS 10.15+
- Linux (Ubuntu 20.04+)
Development
Setup Development Environment
git clone https://github.com/theaniketgiri/create-llm.git
cd create-llm
npm install
Build
npm run build
Development Mode
npm run dev
Test Locally
node dist/index.js test-project --template nano
Run Tests
npm test
Publish
npm version patch  # or minor/major
npm publish
Contributing
We welcome contributions from everyone!
Contributing Guide • Report Bug • Request Feature
Areas Where We Need Help
| Area | Description | Difficulty |
|---|---|---|
| Bug Fixes | Fix issues and improve stability | Easy |
| Documentation | Improve guides and examples | Easy |
| New Templates | Add BERT, T5, custom architectures | Medium |
| Plugins | Integrate new services | Medium |
| Testing | Increase test coverage | Medium |
| i18n | Internationalization support | Hard |
Contributors
Roadmap
v1.1 (Next Release)
- More model architectures (BERT, T5)
- Distributed training support
- Model quantization tools
- Fine-tuning templates
v1.2
- Web UI for project management
- Automatic hyperparameter tuning
- Model compression tools
- More deployment targets
v2.0
- Multi-modal support
- Reinforcement learning from human feedback
- Advanced optimization techniques
- Cloud training integration
License
MIT © Aniket Giri
See LICENSE for more information.
Acknowledgments
Built with amazing open-source tools:
- PyTorch - Deep learning framework
- Transformers - Model implementations
- Tokenizers - Fast tokenization
- Commander.js - CLI framework
- Inquirer.js - Interactive prompts
Special thanks to the LLM community for inspiration and feedback.
Star History
Made with ❤️ for the LLM community
GitHub • npm • Issues • Twitter
Support This Project
If create-llm helped you, consider:
- Starring the repo
- Reporting bugs
- Suggesting features
- Improving docs
- Contributing code
Together, let's make LLM training accessible to everyone!