# ComposeCache
Adaptive compositional semantic caching for LLM APIs and RAG pipelines.
## Why ComposeCache?
Existing semantic caches such as GPTCache treat every query atomically. ComposeCache decomposes compositional queries (e.g., "Compare X and Y") into sub-queries, caches each independently, and serves partial hits, saving 50%+ on LLM API costs.
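To illustrate the idea (the SDK itself delegates decomposition to an LLM, as described under Architecture — the pattern-matching function below is only a sketch of the output shape, and its names are illustrative):

```typescript
// Sketch only: what "decomposing a compositional query" produces.
// A "Compare X and Y" query becomes two independently cacheable
// sub-queries; anything else is treated as atomic (one sub-query).
type SubQuery = { id: string; text: string };

function decomposeCompare(query: string): SubQuery[] {
  const m = query.match(/^Compare (.+) and (.+)$/i);
  if (!m) return [{ id: "s1", text: query }]; // atomic query
  return [
    { id: "s1", text: `Describe ${m[1]}` },
    { id: "s2", text: `Describe ${m[2]}` },
  ];
}
```

Because each sub-answer is cached on its own, a later query such as "Compare France and Spain" can reuse the France sub-answer: that is what the `partial` cache type reports.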
## Quick Start

```bash
npm install composecache
npx composecache init --db postgres://localhost/myapp
```

```js
import { ComposeCache } from 'composecache';

const cache = new ComposeCache({
  database: process.env.DATABASE_URL,
  openaiApiKey: process.env.OPENAI_API_KEY
});

const response = await cache.complete({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: 'Compare France and Germany' }],
  documents: retrievedDocs // Optional: for RAG
});

console.log(response.content);   // The answer
console.log(response.cacheType); // 'exact' | 'semantic' | 'partial' | 'miss'
console.log(response.costSaved); // $ saved
```

## Features
- Compositional query decomposition (novel)
- Document-aware cache keys via MinHash
- Uncertainty-gated population (blocks hallucinations)
- Drop-in SDK for Node.js and Python
- Works with your own PostgreSQL database
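As a sketch of how the document-aware keys fit together, the snippet below combines a simplified MinHash-style fingerprint with the SHA-256 key scheme from the Architecture section (`norm(q) + doc_fingerprint + params_hash`). The function names, the 16-permutation signature, and the word-trigram shingling are assumptions for illustration, not the SDK's actual API:

```typescript
import { createHash } from "node:crypto";

// Illustrative sketch, not the library's implementation.
// Fingerprint the retrieved documents so the cache key changes
// whenever the underlying context changes.
function docFingerprint(docs: string[], numHashes = 16): string {
  // Shingle every document into word trigrams.
  const shingles = docs.flatMap((d) => {
    const w = d.toLowerCase().split(/\s+/).filter(Boolean);
    return w.slice(0, Math.max(w.length - 2, 0)).map((_, i) => w.slice(i, i + 3).join(" "));
  });
  if (shingles.length === 0) return "empty";
  // MinHash signature: per seeded permutation, keep the minimum hash.
  const sig = Array.from({ length: numHashes }, (_, seed) =>
    Math.min(
      ...shingles.map((s) =>
        createHash("sha256").update(`${seed}:${s}`).digest().readUInt32BE(0),
      ),
    ),
  );
  return sig.join("-");
}

// SHA-256 over normalized query + document fingerprint + params hash.
function cacheKey(query: string, docs: string[], params: object): string {
  const norm = query.trim().toLowerCase().replace(/\s+/g, " ");
  const paramsHash = createHash("sha256").update(JSON.stringify(params)).digest("hex");
  return createHash("sha256")
    .update(`${norm}|${docFingerprint(docs)}|${paramsHash}`)
    .digest("hex");
}
```

Normalizing the query means trivially different phrasings ("Compare  France and Germany " vs. "compare france and germany") land on the same exact-match key, while a changed model or document set produces a different one.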
## Architecture
### Query Processing Flow

```mermaid
flowchart TD
    Q["Incoming query q"] --> C{"Classify: atomic or compositional"}
    C -->|atomic| A["Compute SHA-256 key: norm(q) + doc_fingerprint + params_hash"]
    C -->|compositional| D["Decompose into sub-queries s1..sk with dependencies"]
    A --> P["Probe cache: exact hash first, then semantic plus document check"]
    D --> P
    P --> H{"All hits?"}
    H -->|yes| R["Return cached response or compose from sub-answers"]
    H -->|no or partial| G["Generate missing sub-answers via RAG plus LLM API"]
    R --> F["Compose final response"]
    G --> F
    F --> U["Uncertainty gate: cache only if uncertainty <= threshold"]
```

### System Architecture
```mermaid
flowchart TD
    APP["Developer application: Node.js or Python"]
    subgraph SDK["ComposeCache middleware SDK (npm package)"]
        direction LR
        S1["1 Decompose"] --> S2["2 Probe"] --> S3["3 Resolve"] --> S4["4 Compose"] --> S5["5 Populate"]
    end
    subgraph MODS["Core modules"]
        direction LR
        E["Embedder: all-MiniLM-L6-v2"]
        L["Decomposition LLM: gpt-4o-mini"]
        M["MinHash plus uncertainty estimator"]
    end
    DB["Developer PostgreSQL plus pgvector: exact keys and semantic vectors"]
    API["Upstream LLM API: OpenAI or Anthropic"]
    APP --> SDK
    SDK --> MODS
    SDK -->|cache read write| DB
    SDK -->|miss only| API
```

## Benchmarks
These synthetic benchmark numbers were collected in a local virtual environment using a deterministic mock LLM with a latency of about 120 ms per call.

Disclaimer: these values are not production throughput guarantees. They are controlled local measurements intended to validate algorithm behavior and relative improvements only.
### Benchmark Setup
- Environment: macOS, Node.js runtime in a local virtual development environment
- Workload: compositional query "Compare GDP of France and Germany"
- Iterations: 10 per scenario
- Command: `node scripts/bench.mjs`
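A hypothetical reconstruction of the harness's mock LLM (not the contents of `scripts/bench.mjs`): each call sleeps a fixed ~120 ms, so measured latency differences come only from how many calls the cache avoids.

```typescript
// Assumed setup per the description above: deterministic ~120 ms per call.
const MOCK_LATENCY_MS = 120;

async function mockLlmCall(prompt: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, MOCK_LATENCY_MS));
  return `mock answer for: ${prompt}`;
}

// Time a batch of sequential calls, as the no-cache baseline would make.
async function timeSequential(prompts: string[]): Promise<number> {
  const start = Date.now();
  for (const p of prompts) await mockLlmCall(p);
  return Date.now() - start;
}
```

Under this model, the baseline's 3 sequential sub-calls cost roughly 360 ms per query, which is consistent with the ~368 ms baseline row below.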
### Results
| Scenario | Avg Latency (ms) | Mock LLM Calls (10 runs) | Avg Tokens Saved |
|---|---|---|---|
| No cache baseline | 368.0 | 30 | 0 |
| ComposeCache cold (empty cache) | 146.1 | 13 | 126 |
| ComposeCache warm partial | 145.6 | 12 | 133 |
| ComposeCache warm full | 133.3 | 11 | 140 |
### Terminal Output Snapshot

```json
{
  "baseline": {
    "name": "No cache baseline",
    "avgLatencyMs": 368,
    "llmCalls": 30
  },
  "cold": {
    "name": "ComposeCache cold (empty cache)",
    "avgLatencyMs": 146.1,
    "avgTokensSaved": 126,
    "llmCalls": 13,
    "partialRate": 0
  },
  "partial": {
    "name": "ComposeCache warm partial",
    "avgLatencyMs": 145.6,
    "avgTokensSaved": 133,
    "llmCalls": 12,
    "partialRate": 0.1
  },
  "full": {
    "name": "ComposeCache warm full",
    "avgLatencyMs": 133.3,
    "avgTokensSaved": 140,
    "llmCalls": 11,
    "partialRate": 0
  }
}
```

## License
MIT