notion-md-crawler
A library to recursively retrieve and serialize Notion pages and databases with customization for machine learning applications.
Features
- Crawling Pages and Databases: Dig deep into Notion's hierarchical structure with ease.
- Serialize to Markdown: Seamlessly convert Notion pages to Markdown for easy use in machine learning and other applications.
- Custom Serialization: Adapt the serialization process to fit your specific machine learning needs.
- User-Friendly: Built with customization and usability in mind, and type safe.
Installation
@notionhq/client must also be installed.
Using npm:
npm install notion-md-crawler @notionhq/client
Quick Start
⚠️ Note: Before getting started, create an integration and find the token. Details on the methods can be found in the API section.
import { Client } from "@notionhq/client";
import { crawler, pagesToString } from "notion-md-crawler";
// Initialize the Notion client with your credentials.
const client = new Client({ auth: process.env.NOTION_API_KEY });
const crawl = crawler({ client });
const main = async () => {
  const rootPageId = "****";
  const pages = await crawl(rootPageId);
  const result = pagesToString(pages);
};
main();
API
crawler
Parameters:
- options.client (NotionClient): Notion client.
- options.serializers (Serializers, Optional): Used to customize the serializer.
- rootPageId (string): ID of the root page to be crawled.
Returns:
- Promise<Pages>: Pages object resulting from recursively parsing Notion pages.
Pages Object
Each key is a page ID and each value is a Page object.
type Pages = Record<string, Page>;
Page Object
type Page = {
  metadata: {
    id: string;
    title: string;
    createdTime: string;
    lastEditedTime: string;
    parentId?: string;
  };
  properties: string[];
  lines: string[];
};
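Because the crawl result is a plain object keyed by page ID, it can be walked with standard object utilities. The following is a minimal sketch that assumes local copies of the Page and Pages types documented above, so it runs without the library. The sample data and the `childrenOf` and `toMarkdown` helpers are hypothetical and not part of notion-md-crawler.

```typescript
// Local copies of the documented types, so the sketch is self-contained.
type Page = {
  metadata: {
    id: string;
    title: string;
    createdTime: string;
    lastEditedTime: string;
    parentId?: string;
  };
  properties: string[];
  lines: string[];
};
type Pages = Record<string, Page>;

// Hypothetical sample data standing in for the result of `await crawl(rootPageId)`.
const crawledPages: Pages = {
  "root-id": {
    metadata: {
      id: "root-id",
      title: "Handbook",
      createdTime: "2023-01-01T00:00:00.000Z",
      lastEditedTime: "2023-01-02T00:00:00.000Z",
    },
    properties: [],
    lines: ["# Handbook", "Welcome."],
  },
  "child-id": {
    metadata: {
      id: "child-id",
      title: "Onboarding",
      createdTime: "2023-02-01T00:00:00.000Z",
      lastEditedTime: "2023-02-03T00:00:00.000Z",
      parentId: "root-id",
    },
    properties: [],
    lines: ["# Onboarding"],
  },
};

// Hypothetical helper: collect the pages whose parent is the given page ID.
const childrenOf = (all: Pages, parentId: string): Page[] =>
  Object.values(all).filter((page) => page.metadata.parentId === parentId);

// Hypothetical helper: join a page's serialized lines into one Markdown string.
const toMarkdown = (page: Page): string => page.lines.join("\n");

const childTitles = childrenOf(crawledPages, "root-id").map((p) => p.metadata.title);
const rootPage = crawledPages["root-id"];
const rootMarkdown = rootPage ? toMarkdown(rootPage) : "";
```

Looking pages up through `parentId` lets you rebuild the Notion hierarchy from the flat record when you need it.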
Use Metadata
Since crawler returns Page objects, and each Page object contains metadata, you can use that metadata in machine learning pipelines.
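One way to use the metadata is to turn each page into a text-plus-metadata record for an embedding or indexing pipeline, and to use `lastEditedTime` to re-process only recently changed pages. This is a minimal sketch under stated assumptions: the `PageDocument` shape and the `toDocuments` and `editedSince` helpers are hypothetical, the types are local copies of the ones documented above, and the sample data is made up.

```typescript
// Local copies of the documented types, so the sketch is self-contained.
type Page = {
  metadata: {
    id: string;
    title: string;
    createdTime: string;
    lastEditedTime: string;
    parentId?: string;
  };
  properties: string[];
  lines: string[];
};
type Pages = Record<string, Page>;

// Hypothetical record shape; adapt it to what your vector store or training pipeline expects.
type PageDocument = {
  id: string;
  text: string;
  metadata: { title: string; lastEditedTime: string; parentId: string | null };
};

// Convert every crawled page into one record carrying its Markdown text and metadata.
const toDocuments = (pages: Pages): PageDocument[] =>
  Object.values(pages).map((page) => ({
    id: page.metadata.id,
    text: page.lines.join("\n"),
    metadata: {
      title: page.metadata.title,
      lastEditedTime: page.metadata.lastEditedTime,
      parentId: page.metadata.parentId ?? null,
    },
  }));

// Keep only records edited after the cutoff, e.g. for incremental re-indexing.
const editedSince = (docs: PageDocument[], cutoffIso: string): PageDocument[] =>
  docs.filter((doc) => Date.parse(doc.metadata.lastEditedTime) > Date.parse(cutoffIso));

// Hypothetical sample data standing in for a real crawl result.
const samplePages: Pages = {
  "page-a": {
    metadata: {
      id: "page-a",
      title: "Roadmap",
      createdTime: "2023-01-01T00:00:00.000Z",
      lastEditedTime: "2023-05-01T00:00:00.000Z",
    },
    properties: [],
    lines: ["# Roadmap"],
  },
  "page-b": {
    metadata: {
      id: "page-b",
      title: "Release Notes",
      createdTime: "2023-03-01T00:00:00.000Z",
      lastEditedTime: "2023-07-15T00:00:00.000Z",
      parentId: "page-a",
    },
    properties: [],
    lines: ["# Release Notes", "- Fixed bugs"],
  },
};

const pageDocuments = toDocuments(samplePages);
const recentDocuments = editedSince(pageDocuments, "2023-06-01T00:00:00.000Z");
```

Keeping the page ID and title next to the text makes it possible to cite the source page when a retrieved chunk is shown to a user.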
Custom Serialization
notion-md-crawler gives you the flexibility to customize the serialization logic for various Notion objects to cater to the unique requirements of your machine learning model or any other use case.
Define your custom serializer
You can define your own custom serializer. You can also use the utility function for convenience.
import { BlockSerializer, crawler, serializer } from "notion-md-crawler";
const customEmbedSerializer: BlockSerializer<"embed"> = (block) => {
  // Nothing to embed when the block has no URL.
  if (!block.embed.url) return "";
  // You can use the serializer utility.
  const caption = serializer.utils.fromRichText(block.embed.caption);
  return `<figure>
<iframe src="${block.embed.url}"></iframe>
<figcaption>${caption}</figcaption>
</figure>`;
};
const serializers = {
  block: {
    embed: customEmbedSerializer,
  },
};
const crawl = crawler({ client, serializers });
Skip serialization
Returning false from a serializer skips the serialization of that block. This is useful when you want to omit unnecessary information.
const image: BlockSerializer<"image"> = () => false;
const crawl = crawler({ client, serializers: { block: { image } } });
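The skip contract can be illustrated without the library. The sketch below is conceptual and is not notion-md-crawler's implementation: `SimpleBlock`, `SimpleSerializer`, `skipImages`, and `serializeAll` are hypothetical, simplified stand-ins that only show how a `false` return value drops a block from the output.

```typescript
// Hypothetical, simplified block shape; the real library passes Notion block objects.
type SimpleBlock = { type: string; text: string };

// Simplified contract: return a string to emit it, or false to skip the block.
type SimpleSerializer = (block: SimpleBlock) => string | false;

// Skip image blocks, pass every other block's text through unchanged.
const skipImages: SimpleSerializer = (block) =>
  block.type === "image" ? false : block.text;

// Conceptual driver: serialize each block and keep only the results that are not false.
const serializeAll = (blocks: SimpleBlock[], serialize: SimpleSerializer): string[] =>
  blocks
    .map((block) => serialize(block))
    .filter((line): line is string => line !== false);

const keptLines = serializeAll(
  [
    { type: "paragraph", text: "Intro" },
    { type: "image", text: "![diagram](diagram.png)" },
    { type: "paragraph", text: "Outro" },
  ],
  skipImages,
);
```

The same idea extends to conditional skipping: a serializer can inspect the block and return `false` only in the cases it wants to omit.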
Advanced: Use default serializer in custom serializer
If you want to customize serialization only in specific cases, you can use the default serializer in a custom serializer.
import { BlockSerializer, crawler, serializer } from "notion-md-crawler";
const defaultImageSerializer = serializer.block.defaults.image;
const customImageSerializer: BlockSerializer<"image"> = (block) => {
  // Utility function to retrieve the link
  const { title, href } = serializer.utils.fromLink(block.image);
  // If the image is from a specific domain, wrap it in a special div
  if (href.includes("special-domain.com")) {
    return `<div class="special-image">
${defaultImageSerializer(block)}
</div>`;
  }
  // Use the default serializer for all other images
  return defaultImageSerializer(block);
};
const serializers = {
  block: {
    image: customImageSerializer,
  },
};
const crawl = crawler({ client, serializers });
Supported Blocks and Database properties
Blocks
Block Type | Supported |
---|---|
Text | ✅ Yes |
Bookmark | ✅ Yes |
Bulleted List | ✅ Yes |
Numbered List | ✅ Yes |
Heading 1 | ✅ Yes |
Heading 2 | ✅ Yes |
Heading 3 | ✅ Yes |
Quote | ✅ Yes |
Callout | ✅ Yes |
Equation (block) | ✅ Yes |
Equation (inline) | ✅ Yes |
Todos (checkboxes) | ✅ Yes |
Table Of Contents | ✅ Yes |
Divider | ✅ Yes |
Column | ✅ Yes |
Column List | ✅ Yes |
Toggle | ✅ Yes |
Image | ✅ Yes |
Embed | ✅ Yes |
Video | ✅ Yes |
Figma | ✅ Yes |
Google Maps | ✅ Yes |
Google Drive | ✅ Yes |
Tweet | ✅ Yes |
PDF | ✅ Yes |
Audio | ✅ Yes |
File | ✅ Yes |
Link | ✅ Yes |
Page Link | ✅ Yes |
External Page Link | ✅ Yes |
Code (block) | ✅ Yes |
Code (inline) | ✅ Yes |
Database Properties
Property Type | Supported |
---|---|
Checkbox | ✅ Yes |
Created By | ✅ Yes |
Created Time | ✅ Yes |
Date | ✅ Yes |
Email | ✅ Yes |
Files | ✅ Yes |
Formula | ✅ Yes |
Last Edited By | ✅ Yes |
Last Edited Time | ✅ Yes |
Multi Select | ✅ Yes |
Number | ✅ Yes |
People | ✅ Yes |
Phone Number | ✅ Yes |
Relation | ✅ Yes |
Rich Text | ✅ Yes |
Rollup | ✅ Yes |
Select | ✅ Yes |
Status | ✅ Yes |
Title | ✅ Yes |
Unique Id | ✅ Yes |
Url | ✅ Yes |
Verification | ❌ No |
Issues and Feedback
For any issues, feedback, or feature requests, please file an issue on GitHub.
License
MIT
Made with ❤️ by TomPenguin.