JSPM

  • ESM via JSPM
  • ES Module Entrypoint
  • Export Map
  • Keywords
  • License
  • Repository URL
  • TypeScript Types
  • README
  • Created
  • Published
  • Downloads 805
  • Score
    100M100P100Q7821F
  • License MIT

Memory & context compression for AI applications. Keeps the gist, drops the bulk.

Package Exports

  • gist-ai
  • gist-ai/mcp

Readme

gist

Memory & context compression for AI applications. Keeps the gist, drops the bulk.

gist is a small, dependency-free TypeScript library you drop into any app that talks to an LLM. It shrinks what you send to the model — cutting token cost and keeping long conversations inside the context window — without you rewriting how your app works.

npm install gist-ai
import { Compressor, ClaudeProvider } from "gist-ai";

const gist = new Compressor({ provider: new ClaudeProvider(), budget: 8000 });

await gist.ingest(messages);                       // distills + stores
const ctx = await gist.buildContext({ query });    // packed, budget-aware

// send ctx.messages to your LLM; read ctx.stats for the savings.

The same three lines work in every app — CutMind, VoiceCAD, anything.

What it does (the three layers)

Layer Component Job
1. Memory MemoryStore Keep recent turns verbatim; distill older ones into compact, deduped facts.
2. Context ContextPacker Given a token budget, assemble system + relevant memories + recent turns, trimming to fit.
3. Vectors VectorCompressor (optional) Quantize embeddings for fast, cheap memory recall. Swappable seam.

Layers 1 + 2 are the default and need nothing but an LLM provider. Layer 3 is opt-in and is where a TurboQuant-style native/WASM core can later drop in behind the same tiny interface — no rework.

Try it now (no API key)

npm install
npm run demo

The demo runs entirely offline with a MockProvider and prints the raw-vs-packed token counts so you can see the compression ratio directly.

Content compression (Layer 2+)

Coding agents read 10k-token logs to find one error line — and pay per token. gist routes each chunk of tool output to a specialist compressor (inspired by Headroom) and keeps the original retrievable via CCR (Compress-Cache-Retrieve):

const g = new Compressor({ provider });
const r = g.compress(hugeLog);     // auto-detects type, routes, caches original
// → r.text (compressed), r.stats.saved (0..1), r.handle
const original = g.ccr.retrieve(r.handle); // nothing is ever lost

Measured on the demo (npm run demo:compress):

content type tokens saved
noisy build log log 1174→102 91% (error line preserved)
repetitive JSON json 729→61 92% (anomaly kept)
commented code code 82→64 22%
prose report prose 74→55 26% (LLMLingua-2-style pruner)
stack trace / test output trace 64–90% (keeps the error + your-code frames + failing tests; collapses node_modules frames and passing tests)
CSV / tabular / markdown table table 85–95% on big tables (keeps header + sample rows + anomaly rows + a count)

Honest framing: logs/JSON compress 80–95%, code/prose 20–50% — real-world full-session savings land ~40–50%, not the headline 95%.

Aggressive mode

Pass level: "aggressive" to squeeze much harder — logs collapse to errors + a count, JSON arrays fold to a schema + count, prose keeps fewer tokens — while still preserving every error/anomaly. Measured by npm run eval (which scores token-savings and signal-recall):

safe aggressive signal kept
varied log 44% 75%
prose 33% 59%
aggregate 85% 92% 13/13
g.compress(hugeLog, { level: "aggressive" }); // when you only need the gist

The MCP gist_compress tool takes the same level argument.

Use it inside Claude Code / Cursor / Codex (MCP server)

gist ships an MCP server so any agent can compress what it reads. It exposes:

tool what it does
gist_compress route + compress content; returns compressed text + a CCR handle + token stats
gist_retrieve recover the full original for a handle (nothing is ever lost)
gist_stats session totals — tokens in/out, saved fraction

Build it, then register with Claude Code:

cd gist && npm install && npm run build
claude mcp add gist -- node "C:\Users\Theresa\gist\dist\mcp\server.js"

Or add to a project's .mcp.json (works for Cursor too):

{
  "mcpServers": {
    "gist": { "command": "node", "args": ["C:\\Users\\Theresa\\gist\\dist\\mcp\\server.js"] }
  }
}

Verify end-to-end with npm run test:mcp (spawns the server, exercises all three tools). The server is stdio, dependency-light (@modelcontextprotocol/sdk), and keeps CCR originals in-process for the session's lifetime.

Providers

  • MockProvider — deterministic, offline. For tests and the demo.
  • ClaudeProvider — Anthropic Messages API. Reads ANTHROPIC_API_KEY, or pass { apiKey }. Dependency-free (uses fetch).
  • Your own — implement the LLMProvider interface (complete, optional embed). gist is vendor-neutral.

Integrating into a Tauri app

gist is plain ESM TypeScript, so it runs in the Tauri webview (frontend) directly. Call ingest/buildContext from your UI layer; send the packed messages to whichever model you use. The heavy vector math (Layer 3) is the only part that may later move to the Rust backend via WASM — and it's already isolated behind VectorCompressor.

Layer 3 — quantizer benchmark

npm run demo:vectors (2000 vecs × 128 dims, recall@10):

method bytes ratio recall@10 trained?
scalar-int8 128 100% no
binary 16 32× 20% no
product-quant 16 32× 50% yes (k-means)
PQ + rerank 16 32× 100% yes
turboquant b=2 40 13× 60% no
turboquant b=3 56 70% no
turboquant b=4 72 80% no
TQ b=3 + rerank 56 100% no

Recall climbs cleanly with the bit-rate (rate–distortion), and searchWithRerank() recovers exact top-K from a cheap shortlist. TurboQuant's inner-product estimator is unbiased (mean signed error ≈ 0, ~0.97 correlation with true ⟨q,x⟩).

TurboQuant WASM core

TurboQuantWasm is a faithful implementation of TurboQuant (Zandieh et al., arXiv:2504.19874, 2025), Algorithm 2 — data-oblivious (no training pass, unlike PQ):

  1. Random rotation → coordinates follow a Beta distribution (Lemma 1); realized as a multi-round randomized fast Walsh–Hadamard transform (the O(d·log d) hot loop, compiled Rust → WASM, 614 bytes, base64-inlined so it loads with zero plumbing in Node/browser/Tauri).
  2. MSE stage — (b−1)-bit Lloyd–Max codebook computed for the Beta density.
  3. QJL stage — 1-bit Quantized JL on the residual → an unbiased inner-product estimator (Lemma 4).

Rebuild the kernel with npm run build:wasm (needs rustup target add wasm32-unknown-unknown).

compress() / decompress() / similarity() form a complete codec: keys are scored via the unbiased inner-product estimator, values are reconstructed via decompress() (95% cosine fidelity at b=3) and summed.

KV-cache demo (npm run demo:kvcache) — TurboQuant's sweet spot. Both keys and values quantized; attention output stays aligned as bits/channel drop:

bits/channel K+V cache (512 tok) attn-output cosine
2.5 40 KiB (6.4×) 0.775
3.5 56 KiB (4.6×) 0.916
4.5 72 KiB (3.6×) 0.975

(Synthetic single head with random-Gaussian K/V — a hard case where attention is near-uniform; real LLM attention is peaked, where the paper reports full quality-neutrality at 3.5 bits/channel.)

The one deviation from the paper: the rotation Π and JL matrix S use fast randomized Hadamard transforms rather than dense Gaussian/QR matrices (O(d·log d) vs O(d²)) — the standard fast realization, preserving the JL guarantees in high dimension.

Persistence

compressor.snapshot() returns a JSON-serializable blob; compressor.restore(blob) rehydrates it. Persist it anywhere (localStorage, a file, a DB) so memory survives restarts.

Roadmap

  • Layer 1 — tiered conversation memory + distillation
  • Layer 2 — budget-aware context packing with relevance ranking
  • Layer 3 — scalar (4×), binary (32×), and product quantization + re-rank
  • Persistence — snapshot / restore
  • Layer 3+ — faithful TurboQuant (arXiv:2504.19874) as a Rust→WASM core
  • Layer 2+ — content-aware compression (ContentRouter + log/json/code/prose) + CCR reversibility
  • LLMLingua-2-inspired prose token pruner
  • MCP server — gist_compress / gist_retrieve / gist_stats for Claude Code, Cursor, Codex
  • Trained-model prose pruner (ModernBERT) + AST-aware code compression
  • Layer 4 (separate module) — on-device model efficiency (BitNet/GGUF)

Status

v0.1.0 — working scaffold. APIs may shift before 1.0.

MIT.