@aidalinfo/office-to-markdown

1.0.2

Modern TypeScript library for converting Office documents (DOCX) to Markdown, optimized for the Bun runtime, with enhanced table support and math equation conversion.

Package Exports

  • @aidalinfo/office-to-markdown
  • @aidalinfo/office-to-markdown/dist/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@aidalinfo/office-to-markdown) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

📄 @aidalinfo/office-to-markdown

A modern TypeScript library for converting Office documents (DOCX) to Markdown format, optimized for the Bun ecosystem with advanced support for mathematical equations and tables.

🔬 Created by reverse engineering Microsoft's MarkItDown: a TypeScript reimplementation that brings the Python tool's document conversion capabilities to the JavaScript/Bun ecosystem, with enhanced performance and type safety.

🚀 Features

  • ✅ DOCX to Markdown conversion with structure preservation
  • ✅ Mathematical equation support (OMML → LaTeX)
  • ✅ Table handling with automatic formatting
  • ✅ Style preservation (bold, italic, headings)
  • ✅ Image processing with alt text
  • ✅ Simple and advanced API for different use cases
  • ✅ Robust error handling with specific error codes
  • ✅ Optimized performance with Bun runtime
  • ✅ Complete TypeScript types for better DX

📦 Installation

bun add @aidalinfo/office-to-markdown

With npm/yarn/pnpm

npm install @aidalinfo/office-to-markdown
# or
yarn add @aidalinfo/office-to-markdown
# or  
pnpm add @aidalinfo/office-to-markdown

Required Dependencies

The following dependencies are automatically installed:

  • mammoth - DOCX to HTML conversion
  • turndown - HTML to Markdown conversion
  • jszip - ZIP archive manipulation (DOCX)

๐Ÿ› ๏ธ Conversion Workflow

The conversion process follows these steps:

  1. File Detection - MIME type and extension verification
  2. Preprocessing - DOCX content extraction and modification
  3. Math Processing - OMML → LaTeX conversion
  4. Main Conversion - DOCX → HTML via mammoth
  5. Post-processing - HTML → Markdown with custom rules
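
As a rough illustration of the first step, the sketch below checks a file name and its leading bytes. It relies on the fact that a DOCX file is a ZIP archive, so it begins with the ZIP local-file signature `PK\x03\x04`. The name `looksLikeDocx` and its parameters are hypothetical and are not part of this package's API; the library's own detector may work differently.

```typescript
// Hypothetical helper, not part of @aidalinfo/office-to-markdown.
// A DOCX file is a ZIP container, so its first four bytes are "PK\x03\x04".
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04];

function looksLikeDocx(fileName: string, bytes: Uint8Array): boolean {
  // Extension check is case-insensitive.
  const hasDocxExtension = fileName.toLowerCase().endsWith(".docx");
  // Signature check compares the leading bytes against the ZIP magic number.
  const hasZipSignature =
    bytes.length >= ZIP_SIGNATURE.length &&
    ZIP_SIGNATURE.every((value, index) => bytes[index] === value);
  return hasDocxExtension && hasZipSignature;
}
```

A check like this only rules out obvious mismatches: any ZIP archive renamed to `.docx` would pass, so a stricter detector would also need to look inside the archive.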

🎯 Simple Usage

Basic Conversion

import { docxToMarkdown } from '@aidalinfo/office-to-markdown';

// Simple file conversion
const markdown = await docxToMarkdown('./document.docx');
console.log(markdown);

Advanced API

import { OfficeToMarkdown } from '@aidalinfo/office-to-markdown';

const converter = new OfficeToMarkdown({
  headingStyle: 'atx',           // Use ## for headings
  preserveTables: true,          // Preserve tables
  convertMath: true,             // Convert equations to LaTeX
});

// Conversion with options
const result = await converter.convertDocx('./document.docx');
console.log('Title:', result.title);
console.log('Content:', result.markdown);

Conversion from Different Sources

import { OfficeToMarkdown } from '@aidalinfo/office-to-markdown';

const converter = new OfficeToMarkdown();

// From file path
const result1 = await converter.convert('./document.docx');

// From an ArrayBuffer (raw bytes already in memory)
const buffer = await Bun.file('./document.docx').arrayBuffer();
const result2 = await converter.convert(buffer);

// From a Bun file handle
const file = Bun.file('./document.docx');
const result3 = await converter.convert(file);

// Batch processing
const results = await converter.convertMultiple([
  './doc1.docx',
  './doc2.docx',
  buffer
]);
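
The README does not say how `convertMultiple` behaves when one input fails. If per-document error isolation matters, one option is a small wrapper like the sketch below. `convertEach` and `BatchOutcome` are hypothetical names, and `convertOne` stands in for any async conversion function, for example `(source) => converter.convert(source)`.

```typescript
// Hypothetical wrapper, not part of @aidalinfo/office-to-markdown.
interface BatchOutcome<S, R> {
  succeeded: { source: S; result: R }[];
  failed: { source: S; reason: unknown }[];
}

async function convertEach<S, R>(
  sources: readonly S[],
  convertOne: (source: S) => Promise<R>,
): Promise<BatchOutcome<S, R>> {
  // Run all conversions concurrently; each one captures its own failure
  // so a single bad document cannot reject the whole batch.
  const settled = await Promise.all(
    sources.map(async (source) => {
      try {
        return { ok: true as const, source, result: await convertOne(source) };
      } catch (reason) {
        return { ok: false as const, source, reason: reason as unknown };
      }
    }),
  );

  // Partition the outcomes while keeping the original input order.
  const outcome: BatchOutcome<S, R> = { succeeded: [], failed: [] };
  for (const entry of settled) {
    if (entry.ok) {
      outcome.succeeded.push({ source: entry.source, result: entry.result });
    } else {
      outcome.failed.push({ source: entry.source, reason: entry.reason });
    }
  }
  return outcome;
}
```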

โš™๏ธ Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| headingStyle | 'atx' \| 'setext' | 'atx' | Markdown heading style |
| preserveTables | boolean | true | Preserve tables |
| convertMath | boolean | true | Convert mathematical equations |
| styleMap | string | - | Custom mapping for mammoth |
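
For readers unfamiliar with the two `headingStyle` values: ATX headings prefix the text with `#` characters, while Setext headings underline it with `=` (level 1) or `-` (level 2). The function below only illustrates that difference; `renderHeading` is a hypothetical name and is not part of this package. Falling back to ATX for level 3 and deeper is this sketch's choice, made because Setext has no syntax for those levels.

```typescript
type HeadingStyle = "atx" | "setext";

// Hypothetical illustration, not part of @aidalinfo/office-to-markdown.
function renderHeading(text: string, level: number, style: HeadingStyle): string {
  if (style === "setext" && level <= 2) {
    // Setext defines only two levels: "=" underlines level 1, "-" underlines level 2.
    const underline = (level === 1 ? "=" : "-").repeat(Math.max(text.length, 1));
    return `${text}\n${underline}`;
  }
  // ATX: one "#" per heading level, followed by a space.
  return `${"#".repeat(level)} ${text}`;
}
```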

🔧 Technical Architecture

Module Structure

src/
├── converters/              # Document converters
│   ├── base-converter.ts    # Abstract base class
│   └── docx-converter.ts    # Specialized DOCX converter
├── preprocessing/           # Preliminary processing
│   └── docx-preprocessor.ts # DOCX preprocessing (math)
├── math/                    # Mathematical processing
│   └── omml-processor.ts    # OMML → LaTeX converter
├── utils/                   # Utilities
│   ├── html-to-markdown.ts  # HTML → Markdown conversion
│   ├── file-detector.ts     # File type detection
│   └── error-handler.ts     # Error handling
└── types/                   # TypeScript definitions
    ├── converter.ts         # Converter types
    ├── result.ts            # Result types
    └── stream-info.ts       # File info types

Conversion Pipeline

  1. File Detection - MIME type and extension verification
  2. Preprocessing - DOCX content extraction and modification
  3. Mathematical Processing - OMML → LaTeX conversion
  4. Main Conversion - DOCX → HTML via mammoth
  5. Post-processing - HTML → Markdown with custom rules
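
To make the post-processing step more concrete, here is a sketch of how a table, once extracted from HTML as rows of cell text, could be written as a Markdown pipe table. `rowsToMarkdownTable` is a hypothetical function and is not this package's implementation; it treats the first row as the header and escapes literal `|` characters so they do not split cells.

```typescript
// Hypothetical helper, not part of @aidalinfo/office-to-markdown.
function rowsToMarkdownTable(rows: string[][]): string {
  if (rows.length === 0) return "";
  const columnCount = Math.max(...rows.map((row) => row.length));

  const formatRow = (row: string[]): string => {
    const cells: string[] = [];
    for (let i = 0; i < columnCount; i++) {
      // Pad short rows with empty cells; escape pipes and flatten line breaks.
      const raw = row[i] ?? "";
      cells.push(raw.replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ").trim());
    }
    return `| ${cells.join(" | ")} |`;
  };

  // The separator row is what makes Markdown treat the first row as a header.
  const separator = `| ${new Array<string>(columnCount).fill("---").join(" | ")} |`;
  return [formatRow(rows[0] ?? []), separator, ...rows.slice(1).map(formatRow)].join("\n");
}
```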

Mathematical Equation Handling

The equation conversion follows this process:

// OMML (Office Math Markup Language)
<m:f>
  <m:num>1</m:num>
  <m:den>2</m:den>
</m:f>

// ↓ Preprocessing

<w:r><w:t>$\frac{1}{2}$</w:t></w:r>

// ↓ Mammoth (HTML)

<p>$\frac{1}{2}$</p>

// ↓ Turndown (Markdown)

$\frac{1}{2}$
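
The preprocessing step shown above swaps each math element for an ordinary text run containing `$...$`-delimited LaTeX, so that later stages treat it as plain text. A minimal sketch of producing such a run follows. `latexToTextRun` and `escapeXmlText` are hypothetical names, and the XML escaping is an assumption about what a robust implementation would need, since LaTeX can contain `<`, `>`, and `&`.

```typescript
// Hypothetical helpers, not part of @aidalinfo/office-to-markdown.
function escapeXmlText(value: string): string {
  // "&" must be replaced first so the entities added afterwards are not double-escaped.
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function latexToTextRun(latex: string): string {
  // Wrap the LaTeX in inline-math delimiters and place it in a WordprocessingML text run.
  return `<w:r><w:t>${escapeXmlText(`$${latex}$`)}</w:t></w:r>`;
}
```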

Supported Mathematical Elements

| OMML | LaTeX | Description |
| --- | --- | --- |
| <m:f> | \frac{}{} | Fractions |
| <m:sSup> | ^{} | Exponents |
| <m:sSub> | _{} | Subscripts |
| <m:rad> | \sqrt{} | Square roots |
| <m:rad><m:deg> | \sqrt[]{} | Nth roots |
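
The mappings in this table can be sketched as a small recursive function over an already-parsed element tree. This is an illustration under stated assumptions, not the package's `omml-processor.ts`: the `OmmlNode` shape and the `ommlToLatex` name are hypothetical, and XML parsing is left out. The child element names used (`m:num`, `m:den`, `m:e`, `m:sup`, `m:sub`, `m:deg`, `m:t`) are standard OMML.

```typescript
// Hypothetical pre-parsed element tree; not the library's own data model.
interface OmmlNode {
  tag: string;           // e.g. "m:f", "m:num", "m:t"
  text?: string;         // text content of leaf nodes such as m:t
  children?: OmmlNode[];
}

function findChild(node: OmmlNode, tag: string): OmmlNode | undefined {
  return node.children?.find((candidate) => candidate.tag === tag);
}

function ommlToLatex(node: OmmlNode | undefined): string {
  if (!node) return "";
  switch (node.tag) {
    case "m:t": // literal text inside a math run
      return node.text ?? "";
    case "m:f": // fraction: numerator and denominator
      return `\\frac{${ommlToLatex(findChild(node, "m:num"))}}{${ommlToLatex(findChild(node, "m:den"))}}`;
    case "m:sSup": // superscript: base in m:e, exponent in m:sup
      return `${ommlToLatex(findChild(node, "m:e"))}^{${ommlToLatex(findChild(node, "m:sup"))}}`;
    case "m:sSub": // subscript: base in m:e, index in m:sub
      return `${ommlToLatex(findChild(node, "m:e"))}_{${ommlToLatex(findChild(node, "m:sub"))}}`;
    case "m:rad": { // radical: optional degree in m:deg, radicand in m:e
      const degree = ommlToLatex(findChild(node, "m:deg"));
      const radicand = ommlToLatex(findChild(node, "m:e"));
      return degree ? `\\sqrt[${degree}]{${radicand}}` : `\\sqrt{${radicand}}`;
    }
    default: // containers (m:oMath, m:r, m:num, m:e, ...) concatenate their children
      return (node.children ?? []).map((childNode) => ommlToLatex(childNode)).join("");
  }
}
```

Elements outside the table fall through to the default branch, so their text is kept but their structure is lost; a fuller converter would need explicit cases for them.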

🎨 Advanced Usage Examples

Error Handling

import { 
  OfficeToMarkdown, 
  FileConversionException, 
  UnsupportedFormatException 
} from '@aidalinfo/office-to-markdown';

async function convertSafely(filePath: string) {
  try {
    const converter = new OfficeToMarkdown();
    const result = await converter.convertDocx(filePath);
    return result.markdown;
  } catch (error) {
    if (error instanceof UnsupportedFormatException) {
      console.error('Unsupported format:', error.message);
    } else if (error instanceof FileConversionException) {
      console.error('Conversion error:', error.message);
    } else {
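      // `error` is `unknown` under strict TypeScript, so narrow before reading `.message`
      console.error('Unexpected error:', error instanceof Error ? error.message : error);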
    }
    throw error;
  }
}

Capability Checking

import { OfficeToMarkdown } from '@aidalinfo/office-to-markdown';

const converter = new OfficeToMarkdown();

// Check supported types
const info = converter.getSupportedTypes();
console.log('Extensions:', info.extensions); // ['.docx']
console.log('MIME types:', info.mimeTypes);

// Check if a file is supported
const isSupported = await converter.isSupported('./document.pdf');
console.log('PDF supported:', isSupported); // false

// Get file information
const fileInfo = await converter.getFileInfo('./document.docx');
console.log('MIME type:', fileInfo.mimetype);
console.log('Supported:', fileInfo.supported);

Usage with Node.js

import { readFile } from 'fs/promises';
import { OfficeToMarkdown } from '@aidalinfo/office-to-markdown';

// From Node.js Buffer
const buffer = await readFile('./document.docx');
const converter = new OfficeToMarkdown();
const result = await converter.convert(buffer);

console.log(result.markdown);

🧪 Testing and Validation

Test Results

  • ✅ HTML → Markdown conversion with tables
  • ✅ File type detection (DOCX vs others)
  • ✅ OMML → LaTeX mathematical conversion
  • ✅ Error handling with specific codes
  • ✅ Complete pipeline tested with real documents

Performance

  • Speed: ~80ms for an average document (7KB)
  • Fidelity: Complete preservation of structure and content
  • Robustness: Graceful error handling with fallbacks

🔧 Development

Prerequisites

  • Bun >= 1.2.0 (recommended) or Node.js >= 20.0.0
  • TypeScript >= 4.5.0

Development Installation

git clone https://github.com/aidalinfo/extract-kit.git
cd extract-kit/packages/office-to-markdown
bun install

Available Scripts

bun run build          # Complete build (ESM + types)
bun run dev            # Development mode with watch
bun run clean          # Clean dist/ folder

Testing

# Basic functionality test
bun run src/test.ts

# Test with real DOCX file
bun run test-docx.ts "your-file.docx"

🚀 Roadmap

  • PPT/PPTX format support - Presentation conversion
  • XLS/XLSX format support - Spreadsheet conversion
  • Streaming API - Large file streaming processing
  • Plugin system - Support for custom converters
  • Web interface - Optional user interface
  • Embedded image support - Image extraction and conversion
  • CLI batch mode - Command-line interface

๐Ÿค Contributing

Contributions are welcome! Please see our contribution guide.

Contribution Process

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Inspired by Microsoft's MarkItDown project
  • Uses mammoth.js for DOCX โ†’ HTML conversion
  • Uses turndown for HTML โ†’ Markdown conversion
  • Optimized for Bun runtime

📞 Support


@aidalinfo/office-to-markdown

Simple, fast, and reliable DOCX to Markdown conversion ⚡