# bare-llama.cpp
Native llama.cpp bindings for Bare.
Run LLM inference directly in your Bare JavaScript applications with GPU acceleration support.
## Requirements

- CMake 3.25+
- C/C++ compiler (clang, gcc, or MSVC)
- Node.js (for npm/cmake-bare)
- Bare runtime
## Building

Clone with submodules:

```sh
git clone --recursive https://github.com/CameronTofer/bare-llama.cpp
cd bare-llama.cpp
```

Or if already cloned:

```sh
git submodule update --init --recursive
```

Install dependencies and build:

```sh
npm install
```

Or manually:

```sh
bare-make generate
bare-make build
bare-make install
```

This creates `prebuilds/<platform>-<arch>/bare-llama.bare`.
## Build Options

For a debug build:

```sh
bare-make generate -- -D CMAKE_BUILD_TYPE=Debug
bare-make build
```

To disable GPU acceleration:

```sh
bare-make generate -- -D GGML_METAL=OFF -D GGML_CUDA=OFF
bare-make build
```

## Usage
```javascript
const { LlamaModel, LlamaContext, LlamaSampler, generate } = require('bare-llama')

// Load model (GGUF format)
const model = new LlamaModel('./model.gguf', {
  nGpuLayers: 99 // Offload layers to GPU (0 = CPU only)
})

// Create context
const ctx = new LlamaContext(model, {
  contextSize: 2048, // Max context length
  batchSize: 512 // Batch size for prompt processing
})

// Create sampler
const sampler = new LlamaSampler(model, {
  temp: 0.7, // Temperature (0 = greedy)
  topK: 40, // Top-K sampling
  topP: 0.95 // Top-P (nucleus) sampling
})

// Generate text
const output = generate(model, ctx, sampler, 'The meaning of life is', 128)
console.log(output)

// Cleanup
sampler.free()
ctx.free()
model.free()
```

## Embeddings
```javascript
const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./embedding-model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 1 // -1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank
})

const tokens = model.tokenize('Hello world', true)
ctx.decode(tokens)
const embedding = ctx.getEmbeddings(-1) // Float32Array

// Reuse context for multiple embeddings
ctx.clearMemory()
const tokens2 = model.tokenize('Another text', true)
ctx.decode(tokens2)
const embedding2 = ctx.getEmbeddings(-1)

ctx.free()
model.free()
```

## Reranking
Cross-encoder reranking scores how relevant a document is to a query. Use a reranker model (e.g. BGE reranker) with `poolingType: 4` (rank).

**Important:** You must call `ctx.clearMemory()` before each scoring to clear the KV cache. Without this, stale context from previous pairs corrupts the scores.
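The score returned for each query/document pair is a raw logit, not a probability. If you want values on a (0, 1) scale, passing scores through a sigmoid is a common convention; a minimal plain-JS helper (independent of the addon, shown here as an assumption rather than part of the API):

```javascript
// Map an unnormalized reranker logit to a (0, 1) relevance value.
// Larger logits map closer to 1, more negative logits closer to 0.
function sigmoid (x) {
  return 1 / (1 + Math.exp(-x))
}

console.log(sigmoid(0)) // 0.5
```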
```javascript
const { LlamaModel, LlamaContext, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./bge-reranker-v2-m3.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, {
  contextSize: 512,
  embeddings: true,
  poolingType: 4 // rank pooling (required for rerankers)
})

function rerank (query, document) {
  ctx.clearMemory() // critical: clear KV cache before each pair
  const tokens = model.tokenize(query + '\n' + document, true)
  ctx.decode(tokens)
  return ctx.getEmbeddings(0)[0] // single float score
}

const query = 'What is machine learning?'
const docs = [
  'Machine learning is a branch of AI that learns from data.',
  'The recipe calls for two cups of flour and one egg.'
]

const scored = docs
  .map((doc, i) => ({ i, score: rerank(query, doc) }))
  .sort((a, b) => b.score - a.score)

for (const { i, score } of scored) {
  console.log(`[${score.toFixed(4)}] ${docs[i]}`)
}

ctx.free()
model.free()
```

## Constrained Generation
```javascript
const { LlamaModel, LlamaContext, LlamaSampler, generate, setQuiet } = require('bare-llama')

setQuiet(true)

const model = new LlamaModel('./model.gguf', { nGpuLayers: 99 })
const ctx = new LlamaContext(model, { contextSize: 2048 })

// JSON schema constraint (requires llguidance)
const schema = JSON.stringify({
  type: 'object',
  properties: { name: { type: 'string' }, age: { type: 'integer' } },
  required: ['name', 'age']
})
const sampler = new LlamaSampler(model, { temp: 0, json: schema })

// Lark grammar constraint
const sampler2 = new LlamaSampler(model, { temp: 0, lark: 'start: "yes" | "no"' })
```

## Examples
| Example | Description |
|---|---|
| `examples/text-generation.js` | High-level `generate()` API |
| `examples/token-by-token.js` | Manual tokenize/sample/decode loop |
| `examples/cosine-similarity.js` | Embeddings + semantic similarity |
| `examples/json-constrained-output.js` | JSON schema constrained generation |
| `examples/lark-constrained-output.js` | Lark grammar constrained generation |
| `examples/tool-use-agent.js` | Agentic tool calling with multi-turn |
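The cosine-similarity example compares embedding vectors with the standard cosine formula; a minimal plain-JS helper (independent of the addon, suitable for the `Float32Array`s returned by `ctx.getEmbeddings()`) might look like:

```javascript
// Cosine similarity between two equal-length numeric vectors:
// dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero vectors.
function cosineSimilarity (a, b) {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

console.log(cosineSimilarity([1, 0], [1, 0])) // 1
console.log(cosineSimilarity([1, 0], [0, 1])) // 0
```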
Run examples with:

```sh
bare examples/text-generation.js -- /path/to/model.gguf
```

## Testing
Tests use brittle and skip gracefully when models aren't available.

```sh
npm test
```

Model-dependent tests require Ollama models installed locally:

```sh
ollama pull llama3.2:1b                # generation tests
ollama pull nomic-embed-text           # embedding tests
ollama pull qllama/bge-reranker-v2-m3  # reranking tests
```

## Benchmarks
```sh
npm run bench
```

Results are saved to `bench/results/` as JSON with full metadata (llama.cpp version, system info, platform). History is tracked in JSONL files for comparison across runs.
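JSONL history can be read back with a few lines of plain JS. A sketch (the `tps` field below is illustrative, not the actual benchmark schema):

```javascript
// Parse a JSONL string (one JSON object per line) into an array of records.
// Blank lines, such as a trailing newline, are skipped.
function parseJsonl (text) {
  return text
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line))
}

const history = parseJsonl('{"run":1,"tps":42.5}\n{"run":2,"tps":43.1}\n')
console.log(history.length) // 2
```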
## API Reference

### LlamaModel

`new LlamaModel(path, options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `nGpuLayers` | number | 0 | Number of layers to offload to GPU |

Properties:

- `name` - Model name from metadata
- `embeddingDimension` - Embedding vector size
- `trainingContextSize` - Training context length

Methods:

- `tokenize(text, addBos?)` - Convert text to tokens (Int32Array)
- `detokenize(tokens)` - Convert tokens back to text
- `isEogToken(token)` - Check if token is end-of-generation
- `getMeta(key)` - Get model metadata by key
- `free()` - Release model resources
### LlamaContext

`new LlamaContext(model, options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `contextSize` | number | 512 | Maximum context length |
| `batchSize` | number | 512 | Batch size for processing |
| `embeddings` | boolean | false | Enable embedding mode |
| `poolingType` | number | -1 | Pooling strategy (-1=unspecified, 0=none, 1=mean, 2=cls, 3=last, 4=rank) |

Properties:

- `contextSize` - Actual context size

Methods:

- `decode(tokens)` - Process tokens through the model
- `getEmbeddings(idx)` - Get embedding vector (Float32Array)
- `clearMemory()` - Clear context for reuse (faster than creating a new context)
- `free()` - Release context resources
### LlamaSampler

`new LlamaSampler(model, options?)`

| Option | Type | Default | Description |
|---|---|---|---|
| `temp` | number | 0 | Temperature (0 = greedy sampling) |
| `topK` | number | 40 | Top-K sampling parameter |
| `topP` | number | 0.95 | Top-P (nucleus) sampling parameter |
| `json` | string | - | JSON schema constraint (requires llguidance) |
| `lark` | string | - | Lark grammar constraint (requires llguidance) |

Methods:

- `sample(ctx, idx)` - Sample next token (-1 for last position)
- `accept(token)` - Accept token into sampler state
- `free()` - Release sampler resources
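For intuition on the `topP` option: nucleus sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches `p`, then samples only from that set. The sampler does this natively; the plain-JS sketch below only illustrates the selection rule over an example distribution:

```javascript
// Illustrative top-p (nucleus) filtering: return the indices of the
// highest-probability entries whose cumulative probability reaches p.
function topPFilter (probs, p) {
  const indexed = probs.map((prob, i) => ({ i, prob }))
  indexed.sort((a, b) => b.prob - a.prob) // highest probability first
  const kept = []
  let cum = 0
  for (const entry of indexed) {
    kept.push(entry.i)
    cum += entry.prob
    if (cum >= p) break // nucleus reached; drop the long tail
  }
  return kept
}

console.log(topPFilter([0.1, 0.6, 0.05, 0.25], 0.8)) // [ 1, 3 ]
```

Lowering `p` shrinks the nucleus, making output more deterministic; `p` close to 1 keeps nearly the full distribution.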
### generate()

`generate(model, ctx, sampler, prompt, maxTokens?)`

Convenience function for simple text generation. Returns the generated text (not including the prompt).
### Utility Functions

- `setQuiet(quiet?)` - Suppress llama.cpp output
- `setLogLevel(level)` - Set log level (0=off, 1=errors, 2=all)
- `readGgufMeta(path, key)` - Read GGUF metadata without loading the model
- `getModelName(path)` - Get model name from GGUF file
- `systemInfo()` - Get hardware/instruction set info (AVX, NEON, Metal, CUDA)
## Project Structure

```
index.js       Main module
binding.cpp    C++ native bindings
lib/
  ollama-models.js      Ollama model discovery
  ollama.js             GGUF metadata + Jinja chat templates
test/          Brittle test suite
bench/         Benchmark system
examples/      Usage examples
tools/
  ollama-hyperdrive.js  P2P model distribution (standalone CLI)
```

## Models
This addon works with GGUF format models. You can use models from Ollama (auto-detected from `~/.ollama/models`) or download GGUF files directly from Hugging Face.
## Platform Support
| Platform | Architecture | GPU Support |
|---|---|---|
| macOS | arm64, x64 | Metal |
| Linux | x64, arm64 | CUDA (if available) |
| Windows | x64, arm64 | CUDA (if available) |
| iOS | arm64 | Metal |
| Android | arm64, arm, x64, ia32 | - |
## Constrained generation (llguidance)
JSON schema and Lark grammar constraints require llguidance, which is built from Rust source. This is enabled automatically on native (non-cross-compiled) builds. Cross-compiled targets (iOS, Android, Windows arm64) do not include llguidance — constrained generation is unavailable on those platforms.
**Note:** Lark grammar constraints are currently not working correctly — llguidance does not appear to enforce token constraints as expected (e.g. allowing "Yes" when the grammar only permits "yes"). JSON schema constraints work fine.
## License
MIT