JSPM – recallmem@0.1.0

Package Exports

This package does not declare an "exports" field, so JSPM detected and optimized its exports automatically. If any package subpath is missing, it is recommended to post an issue to the original package (recallmem) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

Persistent personal AI. Powered by Gemma 4 running locally on your own machine.

This is not a chatbot (chatbots forget you), and this is not an agent (agents don't remember you).

This IS a private AI, built on a deterministic memory framework where the chat LLM never reads from or writes to your memory database.

Two products in one repo:
👤 For users: install with npx recallmem and start chatting with an AI that actually remembers you.
👨‍💻 For developers: fork it and build your own AI app on top of the memory framework in lib/.

npx recallmem

That's the install. One command. The CLI handles the rest: it clones the repo, sets up the database, pulls the local AI models, writes the config file, and opens the chat in your browser. If you've already got Node, Postgres, and Ollama installed, you're chatting with your own private AI in about 5 minutes.

RecallMEM chat UI showing the AI remembering the user's name across conversations

Two chats. Different sessions. The AI remembers.

Why I built this

I wanted my own private AI for the kind of conversations I don't want sitting on someone else's server. Personal stuff. The stuff you'd actually want a real friend to help you think through.

The default model is Gemma 4 (Google's open-weights model that just dropped, Apache 2.0) running locally via Ollama. You can pick any size from E2B (runs on a phone) up to the 31B Dense (best quality, needs a workstation). Or skip Ollama entirely and bring your own API key for Claude, GPT, Groq, Together, OpenRouter, or anything OpenAI-compatible. Your call.

The thing is, the memory is the actual differentiator. Not the model. Not the UI. The memory. The AI builds a profile of who you are over time. It extracts facts after every conversation. It vector-searches across every chat you've ever had to find relevant context. By the time you've used it for a week, it knows you better than ChatGPT ever will, because ChatGPT forgets you the second you close the tab.

The longer version (what's wrong with every other "private AI" tool)

Here's the problem with every "private AI" tool I tried: they all fall into one of three buckets.

Local chat UIs for Ollama. Look pretty, but the AI has zero memory between conversations. Every chat is a stranger.
Memory libraries on GitHub. Powerful, but they're SDKs. You have to build the whole UI yourself.
Cloud-based memory products like Mem0. Have the full feature set, but your data goes to their servers. Defeats the whole point.

There's a gap right in the middle: a complete personal AI app with real working memory that runs 100% on your machine. So I built it.

What it does

Persistent memory across every chat (profile + facts + vector search) with temporal awareness, so the model knows what's current vs. historical. Auto-extracts facts in real time, retires stale ones when the truth changes, and stamps every memory with dates. Vector search over every past chat. Memory inspector you can edit. Custom rules. Wipe memory unrecoverably. File uploads (images, PDFs, code). Web search with Anthropic or Ollama (Ollama needs a free Brave Search API key). Bring your own LLM (Ollama, Anthropic, OpenAI, or any OpenAI-compatible API). Warm Claude-style dark mode.

Full feature list

Persistent memory across every chat. Three layers: a synthesized profile of who you are, an extracted facts table, and vector search over all past conversations.
Live fact extraction. Facts get extracted after every assistant reply, not just when the chat ends. Say "my birthday is 11/27", refresh /memory a moment later, and it's already there. Extraction always uses the local fast model (OLLAMA_FAST_MODEL), so cloud users don't get billed per turn.
Temporal awareness solves context collapse. Every fact is stamped with a valid_from date. When new information contradicts an old fact ("left Acme" replaces "works at Acme"), the old fact gets retired automatically. The model always sees what's current.
Self-healing categories. Facts re-route to the correct category after every chat, edit, or delete. No LLM, just a deterministic loop. So when the categorizer improves, your existing memory improves with it.
Resumed-conversation markers. When you reopen a chat from last week and continue it, the AI sees a system marker like [Conversation resumed 6 days later], so it knows time has passed and that the earlier turns are historical.
Dated recall. When the vector search pulls relevant chunks from past chats, each one is prefixed with the date it came from so the model can tell history from the present.
Auto-builds your profile from the extracted facts, with date stamps in every section. Updates after every reply.
Vector search across past conversations. Ask about something you discussed last month, and the AI finds it and uses it as context.
Memory inspector page. View, edit, or delete every fact, with collapsible category sections and a search filter for navigating long lists.
Sidebar chat search. Toggle between vector search (semantic, needs Ollama for embeddings) and text search (literal ILIKE on titles + transcripts, instant). Both search inside the conversations, not just titles.
Web search toggle. A globe button next to the input lets the AI search the web. Anthropic providers use Claude's native web search tool; Ollama uses Brave Search, which needs a free API key (see the limitations section). Not available for OpenAI yet.
Custom rules. Tell the AI how you want to be talked to. "Don't gaslight me." "I have dyslexia, no bullet points." "Don't add disclaimers." It applies them in every chat.
Wipe memory unrecoverably. DELETE + VACUUM FULL + CHECKPOINT. Gone for good at the database level.
File uploads. Drag and drop images, PDFs, code, text. Gemma 4 handles vision natively.
Warm dark mode. Claude-style charcoal palette via CSS variables, persisted across refreshes with no flash-of-light.
Chat history sidebar with date grouping, pinned chats, and the search toggle described above.
Markdown rendering for headings, code blocks, tables.
Streaming responses with smooth typewriter rendering.
Bring any LLM you want. Local Gemma 4 via Ollama, or plug in Anthropic (Claude), OpenAI (GPT), or any OpenAI-compatible API (Groq, Together, OpenRouter, Mistral, vLLM, LM Studio, etc.).
Test connection for cloud providers before saving the API key, so you don't find out your key is wrong mid-chat.
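The sidebar's text-search mode is described as a literal ILIKE over titles and transcripts. Here is a minimal sketch of how such a query could be built safely. It assumes a hypothetical s2m_chats table with user_id, title, transcript, and updated_at columns; the real schema lives in migrations/001_init.sql and may differ. The user's text goes in as a bind parameter, and %, _, and \ are escaped so they match literally instead of acting as wildcards.

```typescript
// Sketch only: table and column names are hypothetical, not RecallMEM's actual schema.
interface SqlQuery {
  text: string;
  values: string[];
}

// Escape LIKE/ILIKE metacharacters so the user's text matches literally.
// Postgres uses backslash as the default escape character for LIKE patterns.
function escapeLikePattern(input: string): string {
  return input.replace(/[\\%_]/g, (ch) => "\\" + ch);
}

// Build a parameterized, case-insensitive substring search over titles and transcripts.
function buildTextSearch(term: string, userId: string): SqlQuery {
  const pattern = `%${escapeLikePattern(term)}%`;
  return {
    text:
      "SELECT id, title FROM s2m_chats " +
      "WHERE user_id = $1 AND (title ILIKE $2 OR transcript ILIKE $2) " +
      "ORDER BY updated_at DESC LIMIT 50",
    values: [userId, pattern],
  };
}
```

Because the pattern is a parameter rather than string-concatenated SQL, a search for something like 50%_off can't widen the match or inject SQL.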

How is this different?

Comparison table vs ChatGPT, Claude.ai, and Mem0

	RecallMEM	ChatGPT / Claude.ai	Mem0
Runs locally	✅	❌	❌
Memory retrieval is deterministic (no LLM tool calls)	✅	❌	❌
Persistent memory across chats	✅	partial	✅
Temporal awareness (memories know when they were true)	✅	❌	❌
Auto-retires stale facts when truth changes	✅	❌	❌
You can edit / delete memories	✅	partial	✅
Vector search over past chats	✅	❌	✅
Custom rules / behavior	✅	✅	❌
Bring your own LLM (any provider)	✅	❌	❌
Use local models (Gemma 4, Llama, etc)	✅	❌	❌
No account / no signup	✅	❌	❌
Free	✅	partial	partial
Source available	✅ Apache 2.0	❌	partial

The actual differentiator nobody talks about (deterministic memory)

The thing nobody is doing right is how memory is read and written.

In ChatGPT and Claude.ai with memory turned on, the LLM is in charge of memory. The model decides when to remember something during your conversation. The model decides what to remember. The model decides what to retrieve when you ask a question. The whole memory layer is implemented as model behavior. You're trusting the LLM to be a librarian, and LLMs are not librarians. They hallucinate.

RecallMEM does it backwards. The chat LLM never touches your memory database. Not for reads, not for writes. The LLM only ever sees a system prompt that's already been assembled by deterministic TypeScript and SQL. Here's the actual flow:

When you send a message (memory READ path, 100% deterministic):

Plain SQL SELECT pulls your profile from s2m_user_profiles
Plain SQL SELECT pulls your top active facts from s2m_user_facts (retired facts are excluded automatically)
Each fact is stamped with its valid_from date so the model can reason about timelines
EmbeddingGemma converts your message to a 768-dim vector (math, not generation)
pgvector cosine similarity search ranks chunks from past conversations
Each retrieved chunk is stamped with its source-chat date ([from conversation on 2026-03-12]) so the model can tell history from now
If the chat is being resumed after a multi-hour gap, a one-time system marker like [Conversation resumed 6 days later] gets injected before the new user turn
TypeScript template assembles all of it into a system prompt
Then the chat LLM gets called, with the assembled context already in its prompt
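The read path above can be sketched as pure functions. This is a minimal illustration, not RecallMEM's actual prompts.ts. The Fact/Chunk shapes, the section headings, the "[since ...]" fact stamp format, and the 6-hour resume threshold are all assumptions. It shows the two properties the list describes: every fact and recalled chunk carries a date stamp, and a resume marker is produced only when the gap since the last message is long enough.

```typescript
// Hypothetical shapes for illustration; RecallMEM's real types may differ.
interface Fact {
  content: string;
  validFrom: Date;
}

interface Chunk {
  text: string;
  chatDate: Date;
}

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;
const RESUME_THRESHOLD_MS = 6 * HOUR_MS; // assumed cutoff for a "multi-hour gap"

// YYYY-MM-DD in UTC, so the output doesn't depend on the machine's timezone.
function isoDate(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// One-time marker injected before the new user turn, or null for short gaps.
function resumeMarker(lastMessageAt: Date, now: Date): string | null {
  const gap = now.getTime() - lastMessageAt.getTime();
  if (gap < RESUME_THRESHOLD_MS) return null;
  if (gap < DAY_MS) {
    return `[Conversation resumed ${Math.floor(gap / HOUR_MS)} hours later]`;
  }
  const days = Math.floor(gap / DAY_MS);
  return `[Conversation resumed ${days} day${days === 1 ? "" : "s"} later]`;
}

// Deterministic assembly: the same inputs always produce the same prompt text.
function assembleSystemPrompt(
  profile: string,
  facts: Fact[],
  chunks: Chunk[],
  rules: string[],
): string {
  const factLines = facts.map((f) => `- [since ${isoDate(f.validFrom)}] ${f.content}`);
  const recallLines = chunks.map(
    (c) => `[from conversation on ${isoDate(c.chatDate)}] ${c.text}`,
  );
  return [
    "## Profile",
    profile,
    "## Known facts (current)",
    ...factLines,
    "## Relevant past conversations",
    ...recallLines,
    "## User rules",
    ...rules.map((r) => `- ${r}`),
  ].join("\n");
}
```

Nothing in this step calls a model. The LLM only receives the finished string.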

The chat LLM never queries the database. It can't decide what to retrieve. It can't pick which facts are relevant. It can't hallucinate a memory that doesn't exist, because if it's not in the prompt, it doesn't exist for the model. The retrieval is 100% deterministic SQL + cosine similarity. No LLM tool calls touching your memory store.
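What the cosine-similarity step computes can be shown in a few lines. This is a brute-force, in-memory sketch for illustration only. In the database, the equivalent is a query shaped like ORDER BY embedding <=> $1 LIMIT 5 (the column name is assumed). pgvector's <=> operator returns cosine distance, which is 1 minus cosine similarity, and with an HNSW index the search is approximate rather than this exact scan. Breaking ties by id is an assumption made here so the ordering is fully deterministic.

```typescript
interface EmbeddedChunk {
  id: number;
  embedding: number[]; // e.g. 768 dimensions from EmbeddingGemma
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by similarity to the query vector and keep the top k (5 in the memory diagram).
function topKChunks(query: number[], chunks: EmbeddedChunk[], k = 5): EmbeddedChunk[] {
  return chunks
    .map((c) => ({ chunk: c, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score || x.chunk.id - y.chunk.id)
    .slice(0, k)
    .map((s) => s.chunk);
}
```

The same query vector and the same stored chunks always give the same ranking. That is the sense in which retrieval here is deterministic.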

After every assistant reply (memory WRITE path, LLM proposes, TypeScript validates):

A small local LLM (Gemma 4 E4B via Ollama) runs in the background to extract candidate facts from the running transcript. This happens fire-and-forget after the stream closes, so you never wait for it. It always uses the local model regardless of which provider the chat itself is using, so cloud users (Claude, GPT) don't get billed per turn for extraction.

The same LLM call also returns the IDs of any existing facts the new conversation contradicts. So when you say "I just left Acme to start a new job," the extractor returns the new fact AND flags the old "User works at Acme" fact for retirement. The TypeScript layer flips those rows to is_active=false and stamps valid_to=NOW(). History is preserved, the active set always reflects current truth.
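The retirement step can be sketched in memory. This is a minimal illustration under assumptions: the StoredFact shape and the applySupersession name are hypothetical. The real code presumably does this with an SQL UPDATE, something like UPDATE s2m_user_facts SET is_active = false, valid_to = NOW() WHERE id = ANY($1), using the column names mentioned above. The point of the sketch is that only IDs that exist and are still active get retired, so an ID the LLM invents is simply ignored.

```typescript
interface StoredFact {
  id: number;
  content: string;
  isActive: boolean;
  validFrom: Date;
  validTo: Date | null;
}

// Retire facts the extractor flagged as contradicted. Nothing is deleted,
// so history is preserved; only the active set changes.
function applySupersession(
  facts: StoredFact[],
  supersededIds: number[],
  now: Date,
): { facts: StoredFact[]; retired: number[] } {
  const flagged = new Set(supersededIds);
  const retired: number[] = [];
  const updated = facts.map((f) => {
    // Skip facts that weren't flagged or are already retired.
    // Flagged IDs with no matching fact (e.g. hallucinated IDs) never match anything.
    if (!flagged.has(f.id) || !f.isActive) return f;
    retired.push(f.id);
    return { ...f, isActive: false, validTo: now };
  });
  return { facts: updated, retired };
}
```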

But here's the key: the LLM only proposes facts and supersession decisions. It cannot write to the database. The TypeScript layer is the actual gatekeeper, and it runs every candidate fact through six validation steps before storage:

Quality gate. Conversations under 100 characters get zero facts extracted. The LLM never even sees them.
JSON parse validation. If the LLM returns malformed JSON or no array, the entire batch is dropped.
Type validation. Only strings survive. Objects, numbers, nested arrays, all rejected.
Garbage pattern filtering. A regex filter catches the most common LLM hallucinations: meta-observations like "user asked about X", AI behavior notes like "AI suggested Y", non-facts like "not mentioned", mood observations like "had a good conversation", and anything under 10 characters.
Deduplication. Case-insensitive normalized match against the entire facts table. Duplicates get dropped.
Categorization. The category (Identity, Family, Work, Health, etc.) is decided by keyword matching in TypeScript, not by the LLM. The LLM has no say in how facts get organized.
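The six steps above can be sketched as two pure functions. The quality gate runs before the LLM call, and everything else runs on the LLM's raw output. All concrete values are assumptions for illustration: the garbage regexes, the keyword table, and the "Uncategorized" fallback are hypothetical stand-ins, not the actual lists in lib/facts.ts. Keywords are matched on word boundaries, so "son" doesn't match "Sonnet" and "work" doesn't match "framework".

```typescript
const MIN_TRANSCRIPT_CHARS = 100;
const MIN_FACT_CHARS = 10;

// Hypothetical garbage patterns modeled on the categories described above.
const GARBAGE_PATTERNS: RegExp[] = [
  /^user (asked|wanted to know) about\b/i, // meta-observations
  /^(the )?ai (suggested|said|recommended)\b/i, // AI behavior notes
  /\bnot mentioned\b/i, // non-facts
  /\bhad a (good|great|nice) conversation\b/i, // mood observations
];

// Hypothetical keyword table; the real categorizer has more categories and inflections.
const CATEGORY_KEYWORDS: Record<string, string[]> = {
  Family: ["son", "daughter", "wife", "husband", "mom", "dad"],
  Work: ["job", "work", "works", "company", "manager"],
  Health: ["doctor", "allergic", "diagnosed", "medication"],
};

interface ValidatedFact {
  content: string;
  category: string;
}

// 1. Quality gate: runs before extraction, so short chats never reach the LLM.
function shouldExtract(transcript: string): boolean {
  return transcript.length >= MIN_TRANSCRIPT_CHARS;
}

function normalize(s: string): string {
  return s.trim().toLowerCase().replace(/\s+/g, " ");
}

function categorize(fact: string): string {
  const text = fact.toLowerCase();
  for (const [category, words] of Object.entries(CATEGORY_KEYWORDS)) {
    // \b...\b keeps "son" from matching inside "sonnet".
    if (words.some((w) => new RegExp(`\\b${w}\\b`).test(text))) return category;
  }
  return "Uncategorized";
}

function validateCandidateFacts(llmOutput: string, existingFacts: string[]): ValidatedFact[] {
  // 2. JSON parse validation: malformed output or a non-array drops the whole batch.
  let parsed: unknown;
  try {
    parsed = JSON.parse(llmOutput);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  const seen = new Set(existingFacts.map(normalize));
  const accepted: ValidatedFact[] = [];
  for (const item of parsed) {
    // 3. Type validation: only strings survive.
    if (typeof item !== "string") continue;
    const content = item.trim();
    // 4. Garbage filtering: too short, or matches a known hallucination pattern.
    if (content.length < MIN_FACT_CHARS) continue;
    if (GARBAGE_PATTERNS.some((p) => p.test(content))) continue;
    // 5. Deduplication: case-insensitive normalized match against stored and accepted facts.
    const key = normalize(content);
    if (seen.has(key)) continue;
    seen.add(key);
    // 6. Categorization: deterministic keyword routing, no LLM involved.
    accepted.push({ content, category: categorize(content) });
  }
  return accepted;
}
```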

After all six steps, the surviving facts get a plain SQL INSERT. And even then, you can edit or delete any fact in the Memory page if you don't agree with it.

Why this matters:

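Predictability. For the same message and the same stored data, RecallMEM loads the same facts and retrieves the same past-conversation chunks, because retrieval is plain SQL plus cosine similarity. Mention "my dog" and the chunks closest to that message come back every time. ChatGPT retrieves whatever the model decides to retrieve, which can vary from run to run.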
No hallucinated retrieval. The LLM cannot remember something that isn't actually in your facts table. If it's not in the database, it's not in the prompt.
Auditability. You can look at any chat and trace exactly which facts and chunks were loaded into the system prompt. With ChatGPT, you can't see what the model decided to surface from memory.
Limited prompt-injection exposure. The LLM in RecallMEM only sees what the deterministic layer feeds it. It can't query the rest of the database, so a prompt injection can at most expose what's already in the prompt. With ChatGPT, the model has tool access to memory, which means a prompt injection attack could theoretically make it dump memory contents.
Your data, your database. Memory is data you control, not behavior you have to trust the model to do correctly. You can write a script that queries Postgres directly, edit facts manually, run analytics on your own conversations.

This is the actual reason RecallMEM exists. Not "another local chat UI." A memory architecture where the LLM is intentionally not in charge.

For developers (the memory framework)

Underneath the chat UI, RecallMEM is a deterministic memory framework you can fork and use in your own AI app. The whole lib/ folder is intentionally framework-shaped. It's not a polished SDK with a public API contract, but it IS a working, opinionated memory architecture you can copy into your own project.

What's in lib/ and how to embed it in your app

The core files in lib/:

lib/
├── memory.ts        Memory orchestrator. Loads profile + facts + vector recall in parallel.
├── prompts.ts       Assembles the system prompt with all the memory context.
├── facts.ts         Fact extraction (LLM proposes) + validation (TypeScript decides).
├── profile.ts       Synthesizes a structured profile from the active facts.
├── chunks.ts        Splits transcripts into chunks, embeds them, runs vector search.
├── chats.ts         Chat CRUD + transcript serialization with the smart parser.
├── post-chat.ts     The post-chat pipeline (title gen, fact extract, profile rebuild, embed).
├── rules.ts         Custom user rules / instructions.
├── embeddings.ts    EmbeddingGemma calls via Ollama.
├── llm.ts           LLM router (Ollama, Anthropic, OpenAI, OpenAI-compatible).
└── db.ts            Postgres pool + the configurable user ID resolver.

Embedding it into your own app:

The lib functions default to a single-user setup (user_id = "local-user") but you can wire in your own auth system with two function calls at startup:

import { Pool } from "pg";
import { configureDb, setUserIdResolver } from "./lib/db";

// Use your existing Postgres pool (or skip this and let lib/ create its own)
const myPool = new Pool({ connectionString: process.env.DATABASE_URL });
configureDb({ pool: myPool });

// Wire in your auth system. Called whenever a lib function needs the current user.
// Can be sync or async. Return whatever string identifies the user in your app.
// getCurrentUserFromMyAuthSystem() is a placeholder for your own auth lookup.
setUserIdResolver(() => getCurrentUserFromMyAuthSystem());

That's it. No other changes needed. Every lib function (getProfile, getActiveFacts, searchChunks, storeFacts, rebuildProfile, etc.) reads from the configured resolver. Your auth system stays in your code, the memory framework stays in lib/.
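Because the resolver "can be sync or async", a consumer only needs to await whatever it returns. Here is a minimal sketch of that pattern. The names setResolver and currentUserId are hypothetical and do not describe lib/db.ts's actual internals; only the "local-user" default comes from the text above.

```typescript
type UserIdResolver = () => string | Promise<string>;

// Default matches the single-user setup described above.
let resolver: UserIdResolver = () => "local-user";

function setResolver(fn: UserIdResolver): void {
  resolver = fn;
}

// `await` accepts both plain values and promises, so callers don't care which kind they get.
async function currentUserId(): Promise<string> {
  const id = await resolver();
  if (!id) throw new Error("User ID resolver returned an empty value");
  return id;
}
```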

Using the memory layer in a chat request:

import { buildMemoryAwareSystemPrompt } from "./lib/memory";
import { runPostChatPipeline } from "./lib/post-chat";
import { updateChat } from "./lib/chats";

// chatId, userMessage, conversationHistory, and yourLLM come from your own app.

// 1. Build the system prompt from the user's memory
const systemPrompt = await buildMemoryAwareSystemPrompt(
  userMessage,
  chatId
);

// 2. Send to your LLM however you want (Ollama, Claude, GPT, whatever)
const response = await yourLLM.chat([
  { role: "system", content: systemPrompt },
  ...conversationHistory,
  { role: "user", content: userMessage },
]);

// 3. Save the chat, including the new user turn and the assistant reply
await updateChat(chatId, [
  ...conversationHistory,
  { role: "user", content: userMessage },
  { role: "assistant", content: response },
]);

// 4. (Async) Run the post-chat pipeline to extract facts, rebuild profile, embed chunks.
// Deliberately not awaited, so the user never waits on it.
runPostChatPipeline(chatId);

The memory framework doesn't care which LLM you use. It just assembles context. Bring your own model.

The schema lives in migrations/001_init.sql. Run it against any Postgres 17+ database with the pgvector extension installed. Tables are prefixed s2m_ (for "speak2me," the project this came from). Rename them in the migration if you want a different prefix.

License: Apache 2.0. Fork it, modify it, ship it commercially. The main obligations are to include the license, preserve the copyright notices and the NOTICE file, and mark files you've changed. See CONTRIBUTING.md for the full guide.

Quick start

npx recallmem

You need three things on your machine first: Node.js 20+, Postgres 17 with pgvector, and Ollama (optional, skip if you only want cloud providers). If any are missing, the CLI tells you exactly what to install for your OS.

Architecture diagrams (system, memory layers, post-chat sequence)

System architecture

flowchart TB
    Browser["Browser<br/>Chat UI<br/>localhost:3000"]
    NextJS["Next.js App<br/>API routes + SSR"]
    Postgres[("Postgres + pgvector<br/>localhost:5432<br/>Chats, facts, profile, embeddings")]
    Ollama["Ollama<br/>localhost:11434<br/>Gemma 4 + EmbeddingGemma"]
    Cloud{{"Optional: Cloud LLMs<br/>Anthropic / OpenAI / etc.<br/>Only if you add a provider"}}

    Browser <-->|HTTP / SSE| NextJS
    NextJS <-->|SQL + vector queries| Postgres
    NextJS <-->|"/api/chat<br/>/api/embed"| Ollama
    NextJS -.->|Optional API call| Cloud

    style Cloud stroke-dasharray: 5 5
    style Ollama fill:#dfe
    style Postgres fill:#dfe
    style NextJS fill:#dfe
    style Browser fill:#dfe

Everything in green runs on your machine. The dashed cloud box only activates if you explicitly add a cloud provider in settings. Otherwise, nothing leaves your computer. Ever.

The three-layer memory system

flowchart LR
    Chat[New chat message]
    Memory["Memory loader<br/>(parallel)"]
    Profile["Layer 1: Profile<br/>Synthesized summary<br/>(IDENTITY, FAMILY,<br/>WORK, HEALTH...)"]
    Facts["Layer 2: Facts<br/>Top 50 atomic statements<br/>(pinned to system prompt)"]
    Vector["Layer 3: Vector search<br/>Top 5 chunks from past<br/>conversations<br/>(semantic similarity)"]
    Rules["User custom rules<br/>(behavior instructions)"]
    Prompt["System prompt<br/>(profile + facts + recall + rules)"]
    LLM[LLM]
    Response[Streaming response]

    Chat --> Memory
    Memory --> Profile
    Memory --> Facts
    Memory --> Vector
    Memory --> Rules
    Profile --> Prompt
    Facts --> Prompt
    Vector --> Prompt
    Rules --> Prompt
    Prompt --> LLM
    LLM --> Response

Each layer does a different job:

Profile loads instantly. It's the "who am I talking to" baseline. One database row, always loaded into every system prompt.
Facts are atomic statements you can view, edit, and delete. Stored as individual rows. Pinned into the prompt every conversation.
Vector search finds semantically relevant prose from any past conversation. Catches the stuff that doesn't fit cleanly into facts, like that idea you were working through three weeks ago.

Together, they let the AI know your name, your family, your job, AND remember the specific thing you mentioned a month ago when it becomes relevant.
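The "Memory loader (parallel)" box in the diagram comes down to starting the independent reads at once and waiting for all of them. Here is a minimal sketch with hypothetical loader signatures passed in as parameters; the real loaders live in lib/memory.ts and its neighbors. The limits of 50 facts and 5 chunks come from the diagram.

```typescript
interface MemoryLoaders {
  loadProfile: () => Promise<string>;
  loadActiveFacts: (limit: number) => Promise<string[]>;
  searchPastChunks: (query: string, limit: number) => Promise<string[]>;
  loadRules: () => Promise<string[]>;
}

interface MemoryContext {
  profile: string;
  facts: string[];
  recall: string[];
  rules: string[];
}

// The four reads don't depend on each other, so total latency is roughly
// the slowest single read rather than the sum of all four.
async function loadMemory(message: string, loaders: MemoryLoaders): Promise<MemoryContext> {
  const [profile, facts, recall, rules] = await Promise.all([
    loaders.loadProfile(),
    loaders.loadActiveFacts(50),
    loaders.searchPastChunks(message, 5),
    loaders.loadRules(),
  ]);
  return { profile, facts, recall, rules };
}
```

If any loader rejects, Promise.all rejects as well. That is a reasonable default when a missing layer would silently degrade the prompt.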

What happens when you end a chat

sequenceDiagram
    actor User
    participant UI as Chat UI
    participant API as /api/chat/finalize
    participant LLM
    participant DB as Postgres

    User->>UI: Click "New chat"
    UI->>UI: Show "Saving memory..."
    UI->>API: POST chatId
    API->>LLM: Generate title (Gemma E4B)
    LLM-->>API: "Discussing project ideas"
    API->>DB: Save title
    API->>LLM: Extract facts (Gemma E4B)
    LLM-->>API: ["User's name is...", "User works at...", ...]
    API->>DB: Insert new facts (deduped)
    API->>DB: Rebuild profile from all facts
    API->>API: Embed transcript chunks
    API->>DB: Insert embeddings
    API-->>UI: Done
    UI->>UI: Clear chat, ready for next

Click "New chat", wait a few seconds, and the next conversation immediately sees the new memory.

Hardware requirements (which model fits which machine)

The biggest variable is which LLM you pick. RecallMEM lets you choose.

Fully open source (Ollama + Gemma 4 locally)

Setup	Model	RAM	Speed	Quality
Phone / iPad	Gemma 4 E2B	8GB	Fast	Basic
MacBook Air / Mac Mini M4	Gemma 4 E4B	16GB	Fast	Good
Mac Studio M2+	Gemma 4 26B MoE	32GB+	Very fast	Great
Workstation / server	Gemma 4 31B Dense	32GB+	Slower	Best

The 26B MoE is what I use as the default. It's a Mixture of Experts model, so it only activates 3.8B parameters per token even though it has 26B total. Much faster than the 31B Dense, almost the same quality. Ranked #6 globally on the Arena leaderboard.
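Here is the back-of-the-envelope arithmetic behind "3.8B of 26B active". It only counts parameters. All 26B weights still have to sit in memory, which is why the table lists 32GB+ for it. Real speedups also depend on memory bandwidth, attention, and routing overhead, which is presumably why the gain over the 31B Dense quoted in Troubleshooting (~3-5x) is smaller than the raw ratio.

```latex
\begin{aligned}
\frac{3.8\ \text{B active}}{26\ \text{B total}} &\approx 0.146 \quad (\approx 15\%\ \text{of the MoE's weights used per token}) \\
\frac{31\ \text{B active (31B Dense)}}{3.8\ \text{B active (26B MoE)}} &\approx 8.2
\end{aligned}
```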

Using cloud providers (Claude, GPT, Groq, etc.)

If you don't want to run a local LLM at all, you can plug in any cloud API:

Setup	RAM	Notes
Any laptop	~4GB free	Just runs Postgres + the Node.js app + browser. The LLM runs on the provider's servers.

You bring your own API key. The database, memory, profile, and rules stay on your machine. What gets sent to the provider is the chat messages plus the system prompt assembled for that turn.

One thing to know: when you use a cloud provider, your conversation goes to their servers. Your facts and profile get sent as part of the system prompt so the cloud LLM has context. This breaks the local-only guarantee for those specific conversations. Use Ollama for anything you want fully private.

CLI commands

npx recallmem            # Setup if needed, then start the app
npx recallmem init       # Setup only (deps check, DB, models, env)
npx recallmem start      # Start the server (assumes setup was done)
npx recallmem doctor     # Check what's missing or broken
npx recallmem upgrade    # Pull latest code, run pending migrations
npx recallmem version    # Print version
npx recallmem --help     # Show help

The default npx recallmem is what you'll use 99% of the time. It's smart about its state. On the first run it sets everything up, on subsequent runs it just starts the server.

If something breaks, run npx recallmem doctor first. It tells you exactly what's wrong and how to fix it.

Two ways to use it (just-run-it vs fork-and-hack)

The npx recallmem command auto-detects which workflow you're in.

Workflow 1: Just run it (most users)

You want to use RecallMEM as your daily AI tool. You don't care about the code.

npx recallmem

The CLI:

Detects that nothing is installed yet
Clones the repo to ~/.recallmem (one-time, ~50MB)
Runs npm install inside ~/.recallmem
Checks your dependencies (Postgres, pgvector, Ollama)
Pulls the embedding model if missing
Asks if you want to pull a chat model (~18GB, optional)
Creates the database, runs migrations, writes the config file
Starts the server and opens the chat in your browser

Subsequent runs are instant. Just npx recallmem and the chat opens.

To upgrade later when I ship a new version:

npx recallmem upgrade

That does a git pull, runs npm install if deps changed, and applies any pending migrations.

Workflow 2: Fork it and hack on it (developers)

You want to modify the code, contribute back, run your own variant.

git clone https://github.com/RealChrisSean/RecallMEM.git
cd RecallMEM
npm install
npx recallmem

The CLI detects that you're already inside a recallmem checkout and uses your current directory instead of cloning to ~/.recallmem. Hot reload works: edits to the code show up on the next dev server reload.

Same npx recallmem command. Different behavior because the CLI is smart about where it's running.
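The detection rule described above can be sketched as a small pure function. This illustration rests on assumptions and is not the CLI's real code. It supposes a checkout is recognized by a package.json whose name is "recallmem" (the npm package name). Otherwise it falls back to RECALLMEM_HOME or ~/.recallmem, as described under "Where things live on disk".

```typescript
interface CliEnv {
  cwd: string;
  home: string;
  recallmemHome?: string; // value of RECALLMEM_HOME, if set
  cwdPackageName?: string; // "name" field of ./package.json, if one exists
}

type Workflow = { kind: "checkout"; dir: string } | { kind: "managed"; dir: string };

// Inside a RecallMEM checkout: run it in place. Otherwise: use the managed install dir.
function resolveWorkflow(env: CliEnv): Workflow {
  if (env.cwdPackageName === "recallmem") {
    return { kind: "checkout", dir: env.cwd };
  }
  const dir = env.recallmemHome ?? `${env.home}/.recallmem`;
  return { kind: "managed", dir };
}
```

Keeping the decision a pure function of the environment makes it easy to test without touching the filesystem.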

See CONTRIBUTING.md for the dev workflow.

Testing:

npm test              # run the suite once
npm run test:watch    # re-run on file change

The test suite uses Vitest and currently covers the deterministic memory primitives: keyword inflection, the categorization router, and the regression cases that have bitten us in the past ("son" matching "Sonnet", "work" matching "framework", etc.). It's intentionally narrow and fast (~150ms). New tests go in test/unit/ and follow the same shape as test/unit/facts.test.ts. No DB or LLM required, pure functions only.

Optional observability (Langfuse):

If you're hacking on RecallMEM and want full trace timelines for every chat turn (memory build, LLM generation, fact extraction, supersession decisions, etc.), there's a built-in Langfuse integration. It's an optional peer dependency, so it's NOT installed by default and costs nothing when unused.

npm install langfuse

Then set these in .env.local:

LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_BASEURL=http://localhost:3000  # optional, defaults to cloud.langfuse.com

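Self-host Langfuse via Docker so traces stay on your machine. A self-hosted Langfuse listens on port 3000 by default, the same port RecallMEM's dev server uses, so run one of them on a different port and point LANGFUSE_BASEURL at Langfuse. This is a developer-only debugging tool. Trace payloads include the actual user message content, so don't enable it on machines where conversation contents shouldn't leave the local environment.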

Where things live on disk (and how to fully uninstall)

The default install location is ~/.recallmem. Override with RECALLMEM_HOME=/custom/path npx recallmem if you want it somewhere else.

What's in ~/.recallmem:

The full RecallMEM source code (cloned from GitHub)
node_modules/ with all dependencies
.env.local with your config
The Next.js build output (when you run it)

What's NOT in ~/.recallmem:

Your conversations, facts, profile, embeddings, rules, and API keys. Those all live in your Postgres database at /opt/homebrew/var/postgresql@17/ (Homebrew on Apple Silicon Macs; /usr/local/var/postgresql@17/ on Intel Macs) or /var/lib/postgresql/ (Linux). The Postgres data directory is the actual source of truth.

To completely uninstall:

rm -rf ~/.recallmem        # Remove the app
dropdb recallmem           # Remove the database (or use the in-app "Nuke everything" button first)

Privacy

If you only use Ollama and leave web search off, nothing leaves your machine, ever. You can air-gap the computer and it keeps working. If you turn on web search with Ollama, your message text goes to Brave Search, but your memory does not. If you add a cloud provider (Claude, GPT, etc.), only the chat messages and your assembled system prompt go to that provider's servers. Your database, embeddings, and saved API keys stay local.

Privacy diagram + truly unrecoverable deletion

flowchart TB
    subgraph Local["Your machine (always private)"]
        DB[(Postgres database<br/>Chats, facts, profile, embeddings, API keys, rules)]
        App[Next.js app]
        Ollama_Box[Ollama]
    end

    subgraph CloudOpt["Optional cloud (only if you add a provider)"]
        Anthropic[Anthropic API]
        OpenAI_API[OpenAI API]
        Other[Other LLM APIs]
    end

    User[You] <--> App
    App <--> DB
    App <--> Ollama_Box
    App -.->|"Conversation messages<br/>+ system prompt<br/>(only if you pick a cloud provider)"| Anthropic
    App -.-> OpenAI_API
    App -.-> Other

    style Local fill:#dfe
    style CloudOpt stroke-dasharray: 5 5

Always on your machine, never sent anywhere:

Your chat history
Your facts and profile
Your custom rules
Your vector embeddings
Your saved API keys

Sent only when you actively use a cloud provider:

The current conversation messages
The system prompt (which includes your profile, facts, and rules so the cloud LLM has context)

Truly unrecoverable deletion

When you click "Wipe memory" or "Nuke everything" on the Memory page, the app runs:

DELETE to remove rows from query results
VACUUM FULL <table> to physically rewrite the table on disk and release the dead row space
CHECKPOINT to flush all dirty pages to the data files, after which Postgres can recycle the older WAL segments that still hold the deleted rows

After those three steps, the data is gone from the database in any practically recoverable way.
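The three steps above can be sketched as a statement builder. The table list you pass in is up to you: s2m_user_facts and s2m_user_profiles are named earlier, and the rest of the schema is in migrations/001_init.sql. Run each statement on its own, outside BEGIN/COMMIT, because Postgres refuses to run VACUUM inside a transaction block. CHECKPOINT also requires superuser or, on Postgres 15+, the pg_checkpoint role.

```typescript
// Quote an identifier the way Postgres expects: wrap it in double quotes and double any inner quotes.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build the wipe sequence: all DELETEs, then VACUUM FULL per table, then one CHECKPOINT.
// Execute each statement separately; VACUUM cannot run inside a transaction block.
function buildWipeStatements(tables: string[]): string[] {
  const deletes = tables.map((t) => `DELETE FROM ${quoteIdent(t)}`);
  const vacuums = tables.map((t) => `VACUUM FULL ${quoteIdent(t)}`);
  return [...deletes, ...vacuums, "CHECKPOINT"];
}
```

With node-postgres, for example, you could loop over the result and await pool.query(sql) for each statement. Each call runs in autocommit mode, which keeps VACUUM outside a transaction.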

One thing I want to be honest about: filesystem-level forensic recovery (raw disk block scanning) is a separate problem. SSDs have wear leveling, so file overwrites don't always touch the original physical cells. The complete solution is full-disk encryption (FileVault on Mac, LUKS on Linux, BitLocker on Windows). With disk encryption and a strong login password, the data is genuinely unrecoverable. Not even Apple could read it, as long as you didn't let your iCloud account unlock the disk when setting up FileVault.

What it doesn't do (yet), honest limitations

I'm being honest about the limitations. This is v0.1.

No voice yet. It's text only. I want to add Whisper for speech-to-text and Piper for text-to-speech, both local. On the roadmap.
Web search works on Anthropic and Ollama. OpenAI not yet. Anthropic uses the native web_search_20250305 tool, no setup. Ollama (Gemma) uses Brave Search as a backend, which needs a free API key (5-minute setup): sign up at brave.com/search/api, pick the Free tier (2,000 searches/month), add BRAVE_SEARCH_API_KEY=your_key_here to your .env.local, and restart RecallMEM. The first time you turn on web search in the chat UI, you'll see a privacy modal explaining that Brave will see your message text but NOT your memory, profile, facts, or past conversations. If the key isn't set or the quota is exhausted, the toggle still works, but the AI tells you what to do instead of failing silently. OpenAI's native web search requires the Responses API path, which isn't plumbed through yet.
No multi-user. This is a personal app for one person on one machine. If you want a multi-user version, that's a separate fork.
Reasoning models (OpenAI o1/o3, Claude extended thinking) might have edge cases. They use different API parameters that I don't fully handle yet. Standard chat models work fine.
OpenAI vision isn't fully wired up. Gemma 4 (4B and up) handles images natively via Ollama. OpenAI uses a different format that I haven't plumbed through. Use Ollama or Anthropic for images.
No mobile app. It's a web app you run locally. You access it from your browser at localhost:3000. A native iOS/Android app is theoretically possible but it's a separate project I haven't started.
Fact supersession is LLM-judged and conservative. The local Gemma extractor decides whether a new fact contradicts an old one. It's intentionally cautious (only retires a fact when the replacement is unambiguous), so it might occasionally miss a real contradiction or, more rarely, retire something it shouldn't have. You can always inspect and edit/restore in the Memory page. For higher-stakes use cases, you'd want a stricter rule-based supersession layer on top, or a periodic profile-rebuild from full history.

Tech stack

Frontend / Backend: Next.js 16 (App Router) + TypeScript + Tailwind CSS v4
Database: Postgres 17 + pgvector (HNSW vector indexes)
Local LLM: Ollama with Gemma 4 (E2B / E4B / 26B MoE / 31B Dense)
Embeddings: EmbeddingGemma 300M (768 dimensions, runs in Ollama)
PDF parsing: pdf-parse v2
Markdown rendering: react-markdown + remark-gfm + @tailwindcss/typography
Cloud LLM transports (optional): Anthropic Messages API, OpenAI Chat Completions, OpenAI-compatible

Manual install (for the curious or for when npx recallmem can't be used)

If you want to know what npx recallmem is doing under the hood, or you don't want to use the CLI for some reason, here's the manual install.

macOS

# 1. Install Node.js
brew install node

# 2. Install Postgres 17 + pgvector
brew install postgresql@17 pgvector
brew services start postgresql@17

# 3. Install Ollama (skip if using cloud only)
brew install ollama
brew services start ollama

# 4. Pull the models
ollama pull embeddinggemma      # ~600MB, REQUIRED
ollama pull gemma4:26b          # ~18GB, recommended chat model
ollama pull gemma4:e4b          # ~4GB, fast model for background tasks

Linux (Ubuntu/Debian)

# 1. Node.js
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs

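# 2. Postgres + pgvector
# On most Ubuntu/Debian releases, postgresql-17 comes from the PGDG apt repo (apt.postgresql.org)
sudo apt install -y postgresql-17 postgresql-17-pgvector
sudo systemctl start postgresql
# Give your Linux user a Postgres role so createdb and the DATABASE_URL below work
sudo -u postgres createuser --superuser "$USER"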

# 3. Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 4. Pull models
ollama pull embeddinggemma
ollama pull gemma4:26b
ollama pull gemma4:e4b

Windows

Use WSL2 with Ubuntu and follow the Linux steps. Native Windows works too but it's rougher.

Setup

# 1. Clone the repo
git clone https://github.com/RealChrisSean/RecallMEM.git
cd RecallMEM

# 2. Install dependencies
npm install

# 3. Create the database
createdb recallmem

# 4. Run migrations
npm run migrate

# 5. Configure .env.local
cat > .env.local <<EOF
DATABASE_URL=postgres://$USER@localhost:5432/recallmem
OLLAMA_URL=http://localhost:11434
OLLAMA_CHAT_MODEL=gemma4:26b
OLLAMA_FAST_MODEL=gemma4:e4b
OLLAMA_EMBED_MODEL=embeddinggemma
EOF

# 6. Start the dev server
npm run dev

Open http://localhost:3000.

Troubleshooting (the real gotchas I hit)

Stuff I've actually hit. If you run into something else, run npx recallmem doctor first. It tells you exactly what's broken.

createdb: command not found

Add Postgres to your PATH:

export PATH="/opt/homebrew/opt/postgresql@17/bin:$PATH"

extension "vector" is not available

You're running Postgres 16 or older. The pgvector Homebrew bottle only ships extensions for Postgres 17 and 18. Switch to postgresql@17. I learned this the hard way. The install error message is cryptic and the fix took me 30 minutes the first time.

Ollama silently fails to pull a new model

You've got a version mismatch between the Ollama CLI and the Ollama server. This bites you if you have both Homebrew Ollama AND the desktop Ollama app installed. Check ollama --version. Both client and server should match.

brew upgrade ollama
pkill -f "Ollama"            # kill the old desktop app server
brew services start ollama   # start the new server from Homebrew

Gemma 4 31B is slow

Two reasons:

Thinking mode is on. When it's enabled, Gemma 4 spends a ton of tokens "thinking" before answering. The app disables it via think: false, but if you bypass the app and call Ollama directly without that setting, you'll see slow responses.
Dense vs MoE. 31B Dense activates all 31B parameters per token. Switch to gemma4:26b (Mixture of Experts, only 3.8B active per token) for ~3-5x the speed with minimal quality loss. This is what I use as the default.
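If you call Ollama yourself, the think: false setting goes in the request body. Here is a minimal sketch against Ollama's /api/chat endpoint. It assumes a recent Ollama version that accepts the think field, plus the OLLAMA_URL and model names from the setup section. The helper names buildOllamaChatBody and ollamaChat are hypothetical.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Request body for a single, non-streaming reply with thinking disabled.
function buildOllamaChatBody(model: string, messages: ChatMessage[]) {
  return { model, messages, stream: false, think: false };
}

async function ollamaChat(
  baseUrl: string,
  model: string,
  messages: ChatMessage[],
): Promise<string> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildOllamaChatBody(model, messages)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  // Non-streaming /api/chat responses carry the reply in message.content.
  const data = (await res.json()) as { message?: { content?: string } };
  return data.message?.content ?? "";
}
```

Usage: await ollamaChat("http://localhost:11434", "gemma4:26b", [{ role: "user", content: "Hi" }]).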

"My memory isn't being used in new chats"

Make sure you click "New chat" (or switch to another chat in the sidebar) to trigger the synchronous "Saving memory..." finalize step. If you just refresh the browser without ending the chat, the post-chat pipeline runs as a best-effort sendBeacon() and may not finish before the next chat starts.

The fix: always click "New chat" or switch chats in the sidebar before closing the browser if you said something you want remembered.

Contributing

Forks, PRs, bug reports, ideas, all welcome. See CONTRIBUTING.md for the dev setup and how the codebase is organized.

If you build something cool on top of RecallMEM, I'd love to hear about it.

License

Apache License 2.0. See LICENSE for the full text and NOTICE for third-party attributions. You can use, modify, fork, and redistribute this for any purpose, personal or commercial. The license includes a patent grant and the standard "no warranty, no liability" disclaimer.

Status

This is v0.1. It works. I use it every day.

It's also not "production ready" in the corporate sense. There's no CI, no error monitoring, no SLA. There's a small Vitest test suite that covers the deterministic memory primitives (keyword routing, inflection, regression cases), but it's intentionally narrow. If you want to use it as your daily AI tool, fork it, make it yours, and expect to read the code if something breaks. That's the deal.

I built RecallMEM because I wanted my own private AI. I'm sharing it because there's a real gap in the local AI ecosystem and someone needed to fill it. If this is useful to you, that's cool. If not, no hard feelings.

The repo: github.com/RealChrisSean/RecallMEM