@aidalinfo/office-to-markdown
A modern TypeScript library for converting Office documents (DOCX) to Markdown format, optimized for the Bun ecosystem with advanced support for mathematical equations and tables.
Created by reverse engineering Microsoft's MarkItDown: a TypeScript reimplementation that brings Python's document conversion capabilities to the JavaScript/Bun ecosystem with enhanced performance and type safety.
Features
- DOCX to Markdown conversion with structure preservation
- Mathematical equation support (OMML → LaTeX)
- Table handling with automatic formatting
- Style preservation (bold, italic, headings)
- Image processing with alt text
- Simple and advanced API for different use cases
- Robust error handling with specific error codes
- Optimized performance with Bun runtime
- Complete TypeScript types for better DX
Installation
With Bun (recommended)
bun add @aidalinfo/office-to-markdown
With npm/yarn/pnpm
npm install @aidalinfo/office-to-markdown
# or
yarn add @aidalinfo/office-to-markdown
# or
pnpm add @aidalinfo/office-to-markdown
Required Dependencies
The following dependencies are automatically installed:
- mammoth - DOCX to HTML conversion
- turndown - HTML to Markdown conversion
- jszip - ZIP archive manipulation (DOCX)
Conversion Workflow
The conversion process follows these steps:
- File Detection - MIME type and extension verification
- Preprocessing - DOCX content extraction and modification
- Math Processing - OMML → LaTeX conversion
- Main Conversion - DOCX → HTML via mammoth
- Post-processing - HTML → Markdown with custom rules
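The ordering of these steps can be illustrated with a minimal, self-contained sketch. The stage functions below (`preprocessMath`, `toHtml`, `toMarkdown`) are hypothetical stand-ins that operate on a simplified string fragment. They are not the library's internals, which delegate the real work to mammoth and turndown.

```typescript
// A stage maps the previous stage's output to the next stage's input.
type Stage = (input: string) => string;

// Run the stages left to right, in the order the steps are listed.
function runPipeline(input: string, stages: Stage[]): string {
  return stages.reduce((acc, stage) => stage(acc), input);
}

// Hypothetical stand-in: rewrite a simple OMML fraction as a text run
// containing inline LaTeX.
const preprocessMath: Stage = (xml) =>
  xml.replace(
    /<m:f><m:num>(.*?)<\/m:num><m:den>(.*?)<\/m:den><\/m:f>/g,
    (_match, num: string, den: string) =>
      `<w:r><w:t>$\\frac{${num}}{${den}}$</w:t></w:r>`,
  );

// Hypothetical stand-in for the DOCX -> HTML step: turn text runs into paragraphs.
const toHtml: Stage = (xml) =>
  xml.replace(/<w:r><w:t>(.*?)<\/w:t><\/w:r>/g, "<p>$1</p>");

// Hypothetical stand-in for the HTML -> Markdown step: strip paragraph tags.
const toMarkdown: Stage = (html) => html.replace(/<\/?p>/g, "").trim();

const markdown = runPipeline(
  "<m:f><m:num>1</m:num><m:den>2</m:den></m:f>",
  [preprocessMath, toHtml, toMarkdown],
);
console.log(markdown); // $\frac{1}{2}$
```

Math is rewritten before the HTML conversion so that the later stages only ever see plain text runs, which is why the order of the stages matters.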
Simple Usage
Basic Conversion
import { docxToMarkdown } from '@aidalinfo/office-to-markdown';
// Simple file conversion
const markdown = await docxToMarkdown('./document.docx');
console.log(markdown);
Advanced API
import { OfficeToMarkdown } from '@aidalinfo/office-to-markdown';
const converter = new OfficeToMarkdown({
  headingStyle: 'atx',   // Use ## for headings
  preserveTables: true,  // Preserve tables
  convertMath: true,     // Convert equations to LaTeX
});
// Conversion with options
const result = await converter.convertDocx('./document.docx');
console.log('Title:', result.title);
console.log('Content:', result.markdown);
Conversion from Different Sources
import { OfficeToMarkdown } from '@aidalinfo/office-to-markdown';
const converter = new OfficeToMarkdown();
// From file path
const result1 = await converter.convert('./document.docx');
// From an ArrayBuffer
const buffer = await Bun.file('./document.docx').arrayBuffer();
const result2 = await converter.convert(buffer);
// From a Bun file
const file = Bun.file('./document.docx');
const result3 = await converter.convert(file);
// Batch processing
const results = await converter.convertMultiple([
  './doc1.docx',
  './doc2.docx',
  buffer
]);
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| headingStyle | 'atx' \| 'setext' | 'atx' | Markdown heading style |
| preserveTables | boolean | true | Preserve tables |
| convertMath | boolean | true | Convert mathematical equations |
| styleMap | string | - | Custom mapping for mammoth |
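One way to read this table is as a set of defaults that caller-supplied values override. The sketch below is self-contained and only mirrors the option names and defaults listed above. The `ConverterOptions` interface and the `resolveOptions` helper are illustrative names, not the package's exported API, and the real merging logic may differ.

```typescript
// Option names and defaults are taken from the table above.
interface ConverterOptions {
  headingStyle: "atx" | "setext";
  preserveTables: boolean;
  convertMath: boolean;
  styleMap?: string; // no default; intended for mammoth when set
}

const DEFAULT_OPTIONS: ConverterOptions = {
  headingStyle: "atx",
  preserveTables: true,
  convertMath: true,
};

// Hypothetical helper: caller-supplied values override the defaults.
function resolveOptions(
  overrides: Partial<ConverterOptions> = {},
): ConverterOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

const resolved = resolveOptions({ headingStyle: "setext" });
console.log(resolved.headingStyle, resolved.preserveTables, resolved.convertMath);
```

Passing only the options that differ from the defaults keeps call sites short, which matches how the constructor is used in the examples above.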
Technical Architecture
Module Structure
src/
├── converters/              # Document converters
│   ├── base-converter.ts    # Abstract base class
│   └── docx-converter.ts    # Specialized DOCX converter
├── preprocessing/           # Preliminary processing
│   └── docx-preprocessor.ts # DOCX preprocessing (math)
├── math/                    # Mathematical processing
│   └── omml-processor.ts    # OMML → LaTeX converter
├── utils/                   # Utilities
│   ├── html-to-markdown.ts  # HTML → Markdown conversion
│   ├── file-detector.ts     # File type detection
│   └── error-handler.ts     # Error handling
└── types/                   # TypeScript definitions
    ├── converter.ts         # Converter types
    ├── result.ts            # Result types
    └── stream-info.ts       # File info types
Conversion Pipeline
- File Detection - MIME type and extension verification
- Preprocessing - DOCX content extraction and modification
- Mathematical Processing - OMML → LaTeX conversion
- Main Conversion - DOCX → HTML via mammoth
- Post-processing - HTML → Markdown with custom rules
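The file detection step can be sketched as follows. A DOCX file is a ZIP archive, so besides checking the extension it is possible to check the ZIP signature at the start of the file. The `detectDocx` helper and the `DetectionResult` shape are assumptions made for illustration; the actual logic in `file-detector.ts` may work differently.

```typescript
// Standard MIME type for .docx files.
const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

// ZIP local file header signature: "PK\x03\x04".
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04];

interface DetectionResult {
  extension: string;
  mimetype: string | null;
  supported: boolean;
}

// Hypothetical helper: combine the extension check with a ZIP signature check
// on the first bytes of the file.
function detectDocx(fileName: string, head: Uint8Array): DetectionResult {
  const dot = fileName.lastIndexOf(".");
  const extension = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  const looksLikeZip = ZIP_MAGIC.every((byte, i) => head[i] === byte);
  const supported = extension === ".docx" && looksLikeZip;
  return { extension, mimetype: supported ? DOCX_MIME : null, supported };
}

const zipHead = new Uint8Array([0x50, 0x4b, 0x03, 0x04]);
console.log(detectDocx("report.DOCX", zipHead).supported); // true
console.log(detectDocx("report.pdf", zipHead).supported); // false
```

Checking both signals avoids treating a renamed file as a DOCX. A ZIP signature alone is not sufficient, since other formats (XLSX, PPTX, plain ZIP archives) share it.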
Mathematical Equation Handling
The equation conversion follows this process:
// OMML (Office Math Markup Language)
<m:f>
  <m:num>1</m:num>
  <m:den>2</m:den>
</m:f>
// ↓ Preprocessing
<w:r><w:t>$\frac{1}{2}$</w:t></w:r>
// ↓ Mammoth (HTML)
<p>$\frac{1}{2}$</p>
// ↓ Turndown (Markdown)
$\frac{1}{2}$
Supported Mathematical Elements
| OMML | LaTeX | Description |
|---|---|---|
| <m:f> | \frac{}{} | Fractions |
| <m:sSup> | ^{} | Exponents |
| <m:sSub> | _{} | Subscripts |
| <m:rad> | \sqrt{} | Square roots |
| <m:rad><m:deg> | \sqrt[]{} | Nth roots |
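The mappings in the table above lend themselves to a small recursive converter. The sketch below works on a simplified, already-parsed element tree. The `OmmlNode` type and the `toLatex` function are assumptions made for illustration and are not taken from `omml-processor.ts`; only the element names (`m:f`, `m:num`, `m:den`, `m:sSup`, `m:sSub`, `m:rad`, `m:deg`, `m:e`, `m:sup`, `m:sub`) come from OMML itself.

```typescript
// Simplified element tree; a real implementation would parse the DOCX XML first.
interface OmmlNode {
  tag: string; // e.g. "m:f", "m:num", "m:e"
  text?: string; // text content for leaf nodes
  children?: OmmlNode[];
}

// Convert the first direct child with the given tag, or return "" if absent.
function child(node: OmmlNode, tag: string): string {
  const found = (node.children ?? []).find((c) => c.tag === tag);
  return found ? toLatex(found) : "";
}

function toLatex(node: OmmlNode): string {
  switch (node.tag) {
    case "m:f": // fraction
      return `\\frac{${child(node, "m:num")}}{${child(node, "m:den")}}`;
    case "m:sSup": // exponent
      return `${child(node, "m:e")}^{${child(node, "m:sup")}}`;
    case "m:sSub": // subscript
      return `${child(node, "m:e")}_{${child(node, "m:sub")}}`;
    case "m:rad": {
      // square root, or nth root when a degree is present
      const degree = child(node, "m:deg");
      const body = child(node, "m:e");
      return degree ? `\\sqrt[${degree}]{${body}}` : `\\sqrt{${body}}`;
    }
    default:
      // Containers and text: emit own text, then children in document order.
      return (node.text ?? "") + (node.children ?? []).map(toLatex).join("");
  }
}

const fraction: OmmlNode = {
  tag: "m:f",
  children: [
    {
      tag: "m:num",
      children: [
        {
          tag: "m:sSup",
          children: [
            { tag: "m:e", text: "x" },
            { tag: "m:sup", text: "2" },
          ],
        },
      ],
    },
    { tag: "m:den", text: "2" },
  ],
};
console.log(toLatex(fraction)); // \frac{x^{2}}{2}
```

Because each case recurses into its operands, nested constructs such as an exponent inside a numerator are handled without extra code.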
Advanced Usage Examples
Error Handling
import {
  OfficeToMarkdown,
  FileConversionException,
  UnsupportedFormatException
} from '@aidalinfo/office-to-markdown';
async function convertSafely(filePath: string) {
  try {
    const converter = new OfficeToMarkdown();
    const result = await converter.convertDocx(filePath);
    return result.markdown;
  } catch (error) {
    if (error instanceof UnsupportedFormatException) {
      console.error('Unsupported format:', error.message);
    } else if (error instanceof FileConversionException) {
      console.error('Conversion error:', error.message);
    } else {
      // `error` is `unknown` under strict TypeScript, so log the value itself
      console.error('Unexpected error:', error);
    }
    throw error;
  }
}
Capability Checking
import { OfficeToMarkdown } from '@aidalinfo/office-to-markdown';
const converter = new OfficeToMarkdown();
// Check supported types
const info = converter.getSupportedTypes();
console.log('Extensions:', info.extensions); // ['.docx']
console.log('MIME types:', info.mimeTypes);
// Check if a file is supported
const isSupported = await converter.isSupported('./document.pdf');
console.log('PDF supported:', isSupported); // false
// Get file information
const fileInfo = await converter.getFileInfo('./document.docx');
console.log('MIME type:', fileInfo.mimetype);
console.log('Supported:', fileInfo.supported);
Usage with Node.js
import { readFile } from 'fs/promises';
import { OfficeToMarkdown } from '@aidalinfo/office-to-markdown';
// From Node.js Buffer
const buffer = await readFile('./document.docx');
const converter = new OfficeToMarkdown();
const result = await converter.convert(buffer);
console.log(result.markdown);
Testing and Validation
Test Results
- HTML → Markdown conversion with tables
- File type detection (DOCX vs others)
- OMML → LaTeX mathematical conversion
- Error handling with specific codes
- Complete pipeline tested with real documents
Performance
- Speed: ~80 ms for an average document (7 KB)
- Fidelity: Complete preservation of structure and content
- Robustness: Graceful error handling with fallbacks
Development
Prerequisites
- Bun >= 1.2.0 (recommended) or Node.js >= 20.0.0
- TypeScript >= 4.5.0
Development Installation
git clone https://github.com/aidalinfo/extract-kit.git
cd extract-kit/packages/office-to-markdown
bun install
Available Scripts
bun run build   # Complete build (ESM + types)
bun run dev     # Development mode with watch
bun run clean   # Clean dist/ folder
Testing
# Basic functionality test
bun run src/test.ts
# Test with real DOCX file
bun run test-docx.ts "your-file.docx"
Roadmap
- PPT/PPTX format support - Presentation conversion
- XLS/XLSX format support - Spreadsheet conversion
- Streaming API - Large file streaming processing
- Plugin system - Support for custom converters
- Web interface - Optional user interface
- Embedded image support - Image extraction and conversion
- CLI batch mode - Command-line interface
Contributing
Contributions are welcome! Please see our contribution guide.
Contribution Process
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License; see the LICENSE file for details.
Acknowledgments
- Inspired by Microsoft's MarkItDown project
- Uses mammoth.js for DOCX → HTML conversion
- Uses turndown for HTML → Markdown conversion
- Optimized for Bun runtime
Support
- Issues: GitHub Issues
- Documentation: GitHub Repository
- Email: contact@aidalinfo.com
Simple, fast, and reliable DOCX to Markdown conversion