# ComposeCache
Adaptive compositional semantic caching for LLM APIs and RAG pipelines.
## Why ComposeCache?
Existing semantic caches such as GPTCache treat every query atomically. ComposeCache decomposes compositional queries (e.g., "Compare X and Y") into sub-queries, caches each independently, and serves partial hits, saving 50%+ on LLM API costs.
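To illustrate the idea (the SDK itself delegates decomposition to an LLM, as described under Architecture — the pattern-matching function below is only a sketch of the output shape, and its names are illustrative):

```typescript
// Sketch only: what "decomposing a compositional query" produces.
// A "Compare X and Y" query becomes two independently cacheable
// sub-queries; anything else is treated as atomic (one sub-query).
type SubQuery = { id: string; text: string };

function decomposeCompare(query: string): SubQuery[] {
  const m = query.match(/^Compare (.+) and (.+)$/i);
  if (!m) return [{ id: "s1", text: query }]; // atomic query
  return [
    { id: "s1", text: `Describe ${m[1]}` },
    { id: "s2", text: `Describe ${m[2]}` },
  ];
}
```

Because each sub-answer is cached on its own, a later query such as "Compare France and Spain" can reuse the France sub-answer: that is what the `partial` cache type reports.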
## Quick Start

```bash
npm install composecache
npx composecache init --db postgres://localhost/myapp
```

```js
import { ComposeCache } from 'composecache';

const cache = new ComposeCache({
  database: process.env.DATABASE_URL,
  openaiApiKey: process.env.OPENAI_API_KEY
});

const response = await cache.complete({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: 'Compare France and Germany' }],
  documents: retrievedDocs // Optional: for RAG
});

console.log(response.content);   // The answer
console.log(response.cacheType); // 'exact' | 'semantic' | 'partial' | 'miss'
console.log(response.costSaved); // $ saved
```

## Features
- Compositional query decomposition (novel)
- Document-aware cache keys via MinHash
- Uncertainty-gated population (blocks hallucinations)
- Drop-in SDK for Node.js and Python
- Works with your own PostgreSQL database
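As a sketch of how the document-aware keys fit together, the snippet below combines a simplified MinHash-style fingerprint with the SHA-256 key scheme from the Architecture section (`norm(q) + doc_fingerprint + params_hash`). The function names, the 16-permutation signature, and the word-trigram shingling are assumptions for illustration, not the SDK's actual API:

```typescript
import { createHash } from "node:crypto";

// Illustrative sketch, not the library's implementation.
// Fingerprint the retrieved documents so the cache key changes
// whenever the underlying context changes.
function docFingerprint(docs: string[], numHashes = 16): string {
  // Shingle every document into word trigrams.
  const shingles = docs.flatMap((d) => {
    const w = d.toLowerCase().split(/\s+/).filter(Boolean);
    return w.slice(0, Math.max(w.length - 2, 0)).map((_, i) => w.slice(i, i + 3).join(" "));
  });
  if (shingles.length === 0) return "empty";
  // MinHash signature: per seeded permutation, keep the minimum hash.
  const sig = Array.from({ length: numHashes }, (_, seed) =>
    Math.min(
      ...shingles.map((s) =>
        createHash("sha256").update(`${seed}:${s}`).digest().readUInt32BE(0),
      ),
    ),
  );
  return sig.join("-");
}

// SHA-256 over normalized query + document fingerprint + params hash.
function cacheKey(query: string, docs: string[], params: object): string {
  const norm = query.trim().toLowerCase().replace(/\s+/g, " ");
  const paramsHash = createHash("sha256").update(JSON.stringify(params)).digest("hex");
  return createHash("sha256")
    .update(`${norm}|${docFingerprint(docs)}|${paramsHash}`)
    .digest("hex");
}
```

Normalizing the query means trivially different phrasings ("Compare  France and Germany " vs. "compare france and germany") land on the same exact-match key, while a changed model or document set produces a different one.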
## Architecture
### Query Processing Flow

```mermaid
flowchart TD
    Q["Incoming query q"] --> C{"Classify: atomic or compositional"}
    C -->|atomic| A["Compute SHA-256 key: norm(q) + doc_fingerprint + params_hash"]
    C -->|compositional| D["Decompose into sub-queries s1..sk with dependencies"]
    A --> P["Probe cache: exact hash first, then semantic plus document check"]
    D --> P
    P --> H{"All hits?"}
    H -->|yes| R["Return cached response or compose from sub-answers"]
    H -->|no or partial| G["Generate missing sub-answers via RAG plus LLM API"]
    R --> F["Compose final response"]
    G --> F
    F --> U["Uncertainty gate: cache only if uncertainty <= threshold"]
```

### System Architecture
```mermaid
flowchart TD
    APP["Developer application: Node.js or Python"]
    subgraph SDK["ComposeCache middleware SDK (npm package)"]
        direction LR
        S1["1 Decompose"] --> S2["2 Probe"] --> S3["3 Resolve"] --> S4["4 Compose"] --> S5["5 Populate"]
    end
    subgraph MODS["Core modules"]
        direction LR
        E["Embedder: all-MiniLM-L6-v2"]
        L["Decomposition LLM: gpt-4o-mini"]
        M["MinHash plus uncertainty estimator"]
    end
    DB["Developer PostgreSQL plus pgvector: exact keys and semantic vectors"]
    API["Upstream LLM API: OpenAI or Anthropic"]
    APP --> SDK
    SDK --> MODS
    SDK -->|cache read write| DB
    SDK -->|miss only| API
```

## Benchmarks
These synthetic benchmark numbers were collected in a local virtual environment using a deterministic mock LLM with a latency of about 120 ms per call.

Disclaimer: these values are not production throughput guarantees. They are controlled local measurements intended to validate algorithm behavior and relative improvements only.
### Benchmark Setup
- Environment: macOS, Node.js runtime in a local virtual development environment
- Workload: compositional query "Compare GDP of France and Germany"
- Iterations: 10 per scenario
- Command: `node scripts/bench.mjs`
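A hypothetical reconstruction of the harness's mock LLM (not the contents of `scripts/bench.mjs`): each call sleeps a fixed ~120 ms, so measured latency differences come only from how many calls the cache avoids.

```typescript
// Assumed setup per the description above: deterministic ~120 ms per call.
const MOCK_LATENCY_MS = 120;

async function mockLlmCall(prompt: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, MOCK_LATENCY_MS));
  return `mock answer for: ${prompt}`;
}

// Time a batch of sequential calls, as the no-cache baseline would make.
async function timeSequential(prompts: string[]): Promise<number> {
  const start = Date.now();
  for (const p of prompts) await mockLlmCall(p);
  return Date.now() - start;
}
```

Under this model, the baseline's 3 sequential sub-calls cost roughly 360 ms per query, which is consistent with the ~368 ms baseline row below.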
### Results
| Scenario | Avg Latency (ms) | Mock LLM Calls (10 runs) | Avg Tokens Saved |
|---|---|---|---|
| No cache baseline | 368.0 | 30 | 0 |
| ComposeCache cold (empty cache) | 146.1 | 13 | 126 |
| ComposeCache warm partial | 145.6 | 12 | 133 |
| ComposeCache warm full | 133.3 | 11 | 140 |
### Terminal Output Snapshot

```json
{
  "baseline": {
    "name": "No cache baseline",
    "avgLatencyMs": 368,
    "llmCalls": 30
  },
  "cold": {
    "name": "ComposeCache cold (empty cache)",
    "avgLatencyMs": 146.1,
    "avgTokensSaved": 126,
    "llmCalls": 13,
    "partialRate": 0
  },
  "partial": {
    "name": "ComposeCache warm partial",
    "avgLatencyMs": 145.6,
    "avgTokensSaved": 133,
    "llmCalls": 12,
    "partialRate": 0.1
  },
  "full": {
    "name": "ComposeCache warm full",
    "avgLatencyMs": 133.3,
    "avgTokensSaved": 140,
    "llmCalls": 11,
    "partialRate": 0
  }
}
```

## License
MIT