Package Exports
- treechunk
- treechunk/summarizer/openai
TreeChunk
Contextual, hierarchical markdown chunking for RAG systems
WHAT?
Splits markdown documents into self-contained chunks, each carrying (hopefully) enough context about where it sits in the document to be useful for generation.
There's a static demo of this in action here: sgnt.ai/treechunk-demo.
Synopsis
Programmatic:
import { TreeChunker, OpenAISummarizer } from 'treechunk';

const summarizer = new OpenAISummarizer('Technical documentation context');
const chunker = new TreeChunker(summarizer);

// documentNode is a DocumentNode tree, e.g. produced by parseMarkdown() (see Parser below)
await chunker.makeChunks(documentNode, async (chunk, source) => {
  console.log(chunk);  // The enriched chunk with context
  console.log(source); // The original markdown source for this section
});
Build a demo HTML page:
OPENAI_API_KEY=etc
tsx bin/demo.ts ./demo/Scamming.md "The document has come from the Wiki for an online crime game"
API
TreeChunker
new TreeChunker(summarizer)
- Create chunker with a summarizer

makeChunks(node, onChunk, options?)
- Process document, calling onChunk for each chunk

onChunk: (chunk: string, source: string) => Promise<void>
- Callback receives:
  - chunk: The enriched chunk with hierarchical title and AI-generated context
  - source: The original markdown source for this section

options?: TreeChunkerOptions
- Optional configuration:
  - dryRun?: boolean - When true, returns chunks without AI summaries (chunk equals source); see the sketch below
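A minimal dry-run sketch, reusing the chunker and documentNode from the synopsis above (the chunks array is illustrative; only dryRun is part of the documented options):

// Collect the raw chunks without calling the AI summarizer
const chunks: string[] = [];
await chunker.makeChunks(documentNode, async (chunk, source) => {
  chunks.push(chunk); // in a dry run, chunk is identical to source
}, { dryRun: true });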
Summarizers
new OpenAISummarizer(context?, apiKey?)
- OpenAI implementation
  - context: Optional string added to prompts
  - apiKey: Optional, defaults to OPENAI_API_KEY env var
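For example, passing a corpus description and an explicit key rather than relying on the environment variable (the context string and environment variable name here are placeholders):

// context is folded into the summarization prompts; apiKey overrides OPENAI_API_KEY
const summarizer = new OpenAISummarizer(
  'Articles from the wiki of an online crime game',
  process.env.MY_OPENAI_KEY,
);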
Parser
parseMarkdown(markdown)
- Parse markdown into DocumentNode tree

renderDocument(node)
- Convert DocumentNode back to markdown
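A small round-trip sketch, assuming both functions are importable from the package root (the file path is just the demo document):

import { readFileSync } from 'node:fs';
import { parseMarkdown, renderDocument } from 'treechunk';

// Parse a markdown file into a DocumentNode tree, then render it back to markdown
const markdown = readFileSync('./demo/Scamming.md', 'utf8');
const tree = parseMarkdown(markdown);
const roundTripped = renderDocument(tree);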
Prior Art / See Also
This was an independent -- but not novel -- discovery, by which I mean I built it and then went looking for what other people who'd also built it called theirs. It brings together the following ideas:
- Document structure-based chunking, e.g. LangChain's MarkdownHeaderTextSplitter
- Contextual Retrieval (see this Anthropic article for a similar take)
Todo / Next steps
- Expand out the demo
- Add raw chunk and location data to callback (low priority, as I don't need this)
License
MIT
If you use this, port this, whatever, I'd love it if you gave this project a shout-out.
Author
Peter Sergeant pete@sgnt.ai
This was built for Torn, from whose Wiki the "Scamming" article is taken.