Package Exports
- treechunk
- treechunk/summarizer/openai
Readme
TreeChunk
Contextual, hierarchical markdown chunking for RAG systems
WHAT?
Splits markdown documents into self-contained chunks that contain (hopefully) enough contextual information about what part of the document contained the chunk that it's useful for generation.
There's a static demo of this in action here: sgnt.ai/treechunk-demo.
Synopsis
Programmatic:
import { TreeChunker, OpenAISummarizer } from 'treechunk';
const summarizer = new OpenAISummarizer('Technical documentation context');
const chunker = new TreeChunker(summarizer);
await chunker.makeChunks(documentNode, async (chunk, source) => {
console.log(chunk); // The enriched chunk with context
console.log(source); // The original markdown source for this section
});Build a demo HTML page:
OPENAI_API_KEY=etc
tsx bin/demo.ts ./demo/Scamming.md "The document has come from the Wiki for an online crime game"API
TreeChunker
new TreeChunker(summarizer)- Create chunker with a summarizermakeChunks(node, onChunk, options?)- Process document, calling onChunk for each chunkonChunk: (chunk: string, source: string) => Promise<void>- Callback receives:chunk: The enriched chunk with hierarchical title and AI-generated contextsource: The original markdown source for this section
options?: TreeChunkerOptions- Optional configuration:dryRun?: boolean- When true, returns chunks without AI summaries (chunk equals source)
Summarizers
new OpenAISummarizer(context?, apiKey?)- OpenAI implementationcontext: Optional string added to promptsapiKey: Optional, defaults to OPENAI_API_KEY env var
Parser
parseMarkdown(markdown)- Parse markdown into DocumentNode treerenderDocument(node)- Convert DocumentNode back to markdown
Prior Art / See Also
This was an independent -- but not novel -- discovery, by which I mean I built it and then went to try and find out what other people who'd also built it called theirs. It brings together the following ideas:
- Document structure-based chunking, eg: LangChain's MarkdownHeaderTextSplitter
- Contextual Retrieval (see this Anthropic article for a similar take)
Todo / Next steps
- Expand out the demo
- Add raw chunk and location data to callback (low priority, as I don't need this)
License
MIT
If you use this, port this, whatever, I'd love it if you gave this project a shout-out.
Author
Peter Sergeant pete@sgnt.ai
This was built for Torn, whose Wiki the "Scamming" article is taken.