JSPM


A lightweight TypeScript library designed to reconstruct paragraphs from OCRed inputs and transcriptions.

Package Exports

  • paragrafs
  • paragrafs/dist/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (paragrafs) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

paragrafs


A lightweight TypeScript library designed to reconstruct paragraphs from OCRed inputs and transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability.

Features

  • Segment Recognition: Intelligently groups text into logical paragraphs
  • Filler Removal: Identifies and removes common speech fillers (uh, umm, etc.)
  • Gap Detection: Detects significant pauses to identify paragraph breaks
  • Timestamp Formatting: Converts seconds to readable timestamps (H:MM:SS)
  • Punctuation Awareness: Uses punctuation to identify natural segment breaks
  • Customizable Parameters: Configure minimum words per segment, max segment length, etc.
  • Arabic Support: Handles Arabic question marks and other non-Latin punctuation
  • Transcript Formatting: Converts raw token streams into readable text with appropriate line breaks

Installation

npm install paragrafs

or

pnpm install paragrafs

or

yarn add paragrafs

Usage

Basic Example

import { estimateSegmentFromToken, mapSegmentsIntoFormattedSegments } from 'paragrafs';

// Example token from OCR or transcription
const token = {
    start: 0,
    end: 5,
    text: 'This is a sample text. It should be properly segmented.',
};

// Estimate segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Format the estimated segment
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);

console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."

Working with Transcriptions

import {
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Example transcription segments
const segments = [
    {
        start: 0,
        end: 6.5,
        text: 'The quick brown fox!',
        tokens: [
            { start: 0, end: 1, text: 'The' },
            { start: 1, end: 2, text: 'quick' },
            { start: 2, end: 3, text: 'brown' },
            { start: 3, end: 6.5, text: 'fox!' },
        ],
    },
    {
        start: 8,
        end: 13,
        text: 'Jumps right over the',
        tokens: [
            { start: 8, end: 9, text: 'Jumps' },
            { start: 9, end: 10, text: 'right' },
            { start: 10, end: 11, text: 'over' },
            { start: 12, end: 13, text: 'the' },
        ],
    },
];

// Options for segment formatting
const options = {
    fillers: ['uh', 'umm', 'hmmm'],
    gapThreshold: 3,
    maxSecondsPerSegment: 12,
    minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

// Get timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10);

console.log(transcript);
// Output:
// 0:00: The quick brown fox!
// 0:08: Jumps right over the

API Reference

Core Functions

estimateSegmentFromToken(token: Token): Segment

Splits a single token into word-level tokens and estimates timing for each word.
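
To make the idea of timing estimation concrete, here is a minimal sketch that does not use the library. The name estimateSegmentSketch is hypothetical, and the even split of the token's duration across its words is an assumption; the library may weight words differently.

```typescript
type Token = { start: number; end: number; text: string };
type Segment = Token & { tokens?: Token[] };

// Hypothetical stand-in for estimateSegmentFromToken (not the library's code):
// splits on whitespace and gives every word an equal share of the duration.
function estimateSegmentSketch(token: Token): Segment {
    const words = token.text.split(/\s+/).filter((word) => word.length > 0);
    const perWord = words.length > 0 ? (token.end - token.start) / words.length : 0;

    const tokens: Token[] = words.map((text, index) => ({
        start: token.start + index * perWord,
        end: token.start + (index + 1) * perWord,
        text,
    }));

    return { ...token, tokens };
}

const estimated = estimateSegmentSketch({ start: 0, end: 6, text: 'The quick brown' });
console.log(estimated.tokens);
// [ { start: 0, end: 2, text: 'The' },
//   { start: 2, end: 4, text: 'quick' },
//   { start: 4, end: 6, text: 'brown' } ]
```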

markTokensWithDividers(tokens: Token[], options): MarkedToken[]

Marks tokens with segment breaks based on fillers, gaps, and punctuation.
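
The three signals named above can be illustrated with a self-contained sketch. Everything here is an assumption made for illustration rather than the library's implementation: the name markTokensSketch, the choice to drop a filler and put a break in its place, the punctuation set, and the rule that consecutive breaks are collapsed.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;
type MarkOptions = { fillers: string[]; gapThreshold: number };

// Hypothetical sketch of divider marking (not the library's code).
function markTokensSketch(tokens: Token[], { fillers, gapThreshold }: MarkOptions): MarkedToken[] {
    const marked: MarkedToken[] = [];
    let previousEnd: number | undefined;

    // Adds a break unless the list is empty or already ends with one.
    const pushBreak = (): void => {
        if (marked.length > 0 && marked[marked.length - 1] !== 'SEGMENT_BREAK') {
            marked.push('SEGMENT_BREAK');
        }
    };

    for (const token of tokens) {
        if (previousEnd !== undefined && token.start - previousEnd > gapThreshold) {
            pushBreak(); // long pause before this token
        }
        previousEnd = token.end;

        if (fillers.includes(token.text.toLowerCase())) {
            pushBreak(); // assumption: the filler is dropped and replaced by a break
            continue;
        }

        marked.push(token);
        if (/[.!?؟]$/.test(token.text)) {
            pushBreak(); // sentence-ending punctuation, including the Arabic question mark
        }
    }
    return marked;
}

const markedDemo = markTokensSketch(
    [
        { start: 0, end: 1, text: 'Hello.' },
        { start: 1, end: 2, text: 'uh' },
        { start: 2, end: 3, text: 'next' },
        { start: 10, end: 11, text: 'late' },
    ],
    { fillers: ['uh', 'umm'], gapThreshold: 3 },
);
console.log(markedDemo.map((item) => (item === 'SEGMENT_BREAK' ? '|' : item.text)).join(' '));
// Hello. | next | late
```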

groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[]

Groups marked tokens into logical segments based on maximum segment length.
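
One plausible reading of this grouping step is sketched below. It assumes that a SEGMENT_BREAK closes the current segment only once that segment already runs longer than maxSecondsPerSegment, and that earlier breaks stay inside the segment's tokens (which would explain why MarkedSegment.tokens can contain breaks). The name groupSketch is hypothetical and the exact rule in the library may differ.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;
type MarkedSegment = { start: number; end: number; tokens: MarkedToken[] };

// Hypothetical sketch of grouping (not the library's code).
function groupSketch(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[] {
    const segments: MarkedSegment[] = [];
    let current: MarkedSegment | null = null;

    for (const item of markedTokens) {
        if (item === 'SEGMENT_BREAK') {
            if (current === null) continue; // nothing to break yet
            if (current.end - current.start > maxSecondsPerSegment) {
                segments.push(current); // long enough: close the segment here
                current = null;
            } else {
                current.tokens.push(item); // keep the break as an internal divider
            }
        } else if (current === null) {
            current = { start: item.start, end: item.end, tokens: [item] };
        } else {
            current.end = item.end;
            current.tokens.push(item);
        }
    }

    if (current !== null) segments.push(current);
    return segments;
}

const grouped = groupSketch(
    [
        { start: 0, end: 5, text: 'first.' },
        'SEGMENT_BREAK',
        { start: 5, end: 9, text: 'second.' },
        'SEGMENT_BREAK',
        { start: 9, end: 10, text: 'third' },
    ],
    8,
);
console.log(grouped.map((segment) => [segment.start, segment.end]));
// [ [ 0, 9 ], [ 9, 10 ] ]
```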

mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[]

Merges segments with too few words into the previous segment.
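
A minimal sketch of the merge, under two assumptions: "words" means non-break tokens, and a short first segment is left alone because it has no predecessor. mergeShortSketch is a hypothetical name, not the library function.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;
type MarkedSegment = { start: number; end: number; tokens: MarkedToken[] };

// Hypothetical sketch of short-segment merging (not the library's code).
function mergeShortSketch(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[] {
    const merged: MarkedSegment[] = [];

    for (const segment of segments) {
        const wordCount = segment.tokens.filter((item) => item !== 'SEGMENT_BREAK').length;
        const previous = merged[merged.length - 1];

        if (previous !== undefined && wordCount < minWordsPerSegment) {
            // Too short: extend the previous segment to cover this one.
            merged[merged.length - 1] = {
                start: previous.start,
                end: segment.end,
                tokens: [...previous.tokens, ...segment.tokens],
            };
        } else {
            merged.push({ ...segment, tokens: [...segment.tokens] });
        }
    }
    return merged;
}

const word = (start: number, text: string): Token => ({ start, end: start + 1, text });
const mergedDemo = mergeShortSketch(
    [
        { start: 0, end: 4, tokens: [word(0, 'The'), word(1, 'quick'), word(2, 'brown'), word(3, 'fox!')] },
        { start: 5, end: 6, tokens: [word(5, 'Yes.')] },
    ],
    3,
);
console.log(mergedDemo.length, mergedDemo[0].end);
// 1 6
```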

mapSegmentsIntoFormattedSegments(segments: MarkedSegment[], maxSecondsPerLine?: number): Segment[]

Converts marked segments into clean, formatted segments with proper text representation.

formatSegmentsToTimestampedTranscript(segments: MarkedSegment[], maxSecondsPerLine: number): string

Formats segments into a human-readable transcript with timestamps.
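
The following sketch shows one way such a transcript could be assembled. It is not the library's code: transcriptSketch and toTimestampSketch are hypothetical names, and the rule that an internal SEGMENT_BREAK starts a new line only once the current line spans more than maxSecondsPerLine is an assumption. With the segments from the transcription example it reproduces the output shown there.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;
type MarkedSegment = { start: number; end: number; tokens: MarkedToken[] };

// Hypothetical helper: M:SS below one hour, H:MM:SS otherwise.
function toTimestampSketch(seconds: number): string {
    const total = Math.floor(seconds);
    const hours = Math.floor(total / 3600);
    const minutes = Math.floor((total % 3600) / 60);
    const secs = String(total % 60).padStart(2, '0');
    return hours > 0 ? `${hours}:${String(minutes).padStart(2, '0')}:${secs}` : `${minutes}:${secs}`;
}

// Hypothetical sketch of transcript formatting (not the library's code).
function transcriptSketch(segments: MarkedSegment[], maxSecondsPerLine: number): string {
    const lines: string[] = [];

    for (const segment of segments) {
        let words: Token[] = [];
        const flush = (): void => {
            if (words.length > 0) {
                lines.push(`${toTimestampSketch(words[0].start)}: ${words.map((w) => w.text).join(' ')}`);
                words = [];
            }
        };

        for (const item of segment.tokens) {
            if (item !== 'SEGMENT_BREAK') {
                words.push(item);
            } else if (words.length > 0 && words[words.length - 1].end - words[0].start > maxSecondsPerLine) {
                flush(); // assumption: split only once the line has run long enough
            }
        }
        flush(); // every segment ends its last line
    }
    return lines.join('\n');
}

const demoTranscript = transcriptSketch(
    [
        {
            start: 0,
            end: 6.5,
            tokens: [
                { start: 0, end: 1, text: 'The' },
                { start: 1, end: 2, text: 'quick' },
                { start: 2, end: 3, text: 'brown' },
                { start: 3, end: 6.5, text: 'fox!' },
            ],
        },
        { start: 8, end: 9, tokens: [{ start: 8, end: 9, text: 'Jumps' }] },
    ],
    10,
);
console.log(demoTranscript);
// 0:00: The quick brown fox!
// 0:08: Jumps
```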

markAndCombineSegments(segments: Segment[], options): MarkedSegment[]

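Convenience wrapper that runs segments through the marking, grouping, and merging steps above in a single call, using the fillers, gapThreshold, maxSecondsPerSegment, and minWordsPerSegment options.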

Types

type Token = {
    start: number; // Start time in seconds
    end: number; // End time in seconds
    text: string; // The transcribed text
};

type Segment = Token & {
    tokens?: Token[]; // Word-by-word breakdown with timings
};

type MarkedToken = 'SEGMENT_BREAK' | Token;

type MarkedSegment = {
    start: number;
    end: number;
    tokens: MarkedToken[];
};
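
Because MarkedToken is a union of a string literal and Token, consumers need to narrow it before reading text or timings. The sketch below repeats the types so that it is self-contained; isToken and segmentText are hypothetical helpers written for illustration, not exports of the package.

```typescript
type Token = { start: number; end: number; text: string };
type MarkedToken = 'SEGMENT_BREAK' | Token;
type MarkedSegment = { start: number; end: number; tokens: MarkedToken[] };

// Type guard: narrows a MarkedToken to a real Token.
const isToken = (item: MarkedToken): item is Token => item !== 'SEGMENT_BREAK';

// Joins the words of a segment, skipping break markers.
function segmentText(segment: MarkedSegment): string {
    return segment.tokens
        .filter(isToken)
        .map((token) => token.text)
        .join(' ');
}

const sample: MarkedSegment = {
    start: 0,
    end: 3,
    tokens: [{ start: 0, end: 1, text: 'Hello.' }, 'SEGMENT_BREAK', { start: 2, end: 3, text: 'World' }],
};
console.log(segmentText(sample));
// Hello. World
```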

Utility Functions

isEndingWithPunctuation(text: string): boolean

Checks if the text ends with punctuation (including Arabic punctuation).
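
A check of this kind can be sketched with a single regular expression. The character set below (Latin sentence enders plus the Arabic question mark ؟ and the Arabic full stop ۔) is an assumption, as is trimming trailing whitespace first; the library's actual set may be broader or narrower. endsWithPunctuationSketch is a hypothetical name.

```typescript
// Assumed set of sentence-ending characters: . ! ? plus Arabic ؟ (U+061F) and ۔ (U+06D4).
const ENDING_PUNCTUATION = /[.!?\u061F\u06D4]$/;

// Hypothetical sketch of the punctuation check (not the library's code).
function endsWithPunctuationSketch(text: string): boolean {
    return ENDING_PUNCTUATION.test(text.trimEnd());
}

console.log(endsWithPunctuationSketch('The quick brown fox!')); // true
console.log(endsWithPunctuationSketch('كيف حالك؟')); // true
console.log(endsWithPunctuationSketch('Jumps right over the')); // false
```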

formatSecondsToTimestamp(seconds: number): string

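Formats seconds into a human-readable timestamp (H:MM:SS). The transcript example above prints 0:08 for 8 seconds, which suggests the hour field is omitted for times under one hour.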
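
The arithmetic behind such a formatter is short enough to show in full. This sketch is not the library's implementation: formatTimestampSketch is a hypothetical name, fractional seconds are floored by assumption, and the hour field is dropped below one hour to match the 0:00 and 0:08 values in the transcript example.

```typescript
// Hypothetical sketch of timestamp formatting (not the library's code).
function formatTimestampSketch(seconds: number): string {
    const total = Math.floor(seconds); // assumption: fractions of a second are discarded
    const hours = Math.floor(total / 3600);
    const minutes = Math.floor((total % 3600) / 60);
    const secs = String(total % 60).padStart(2, '0');

    return hours > 0
        ? `${hours}:${String(minutes).padStart(2, '0')}:${secs}`
        : `${minutes}:${secs}`;
}

console.log(formatTimestampSketch(8)); // 0:08
console.log(formatTimestampSketch(75.9)); // 1:15
console.log(formatTimestampSketch(3661)); // 1:01:01
```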

Use Cases

  • OCR Post-Processing: Clean up scanned text by properly reconstructing paragraphs
  • Transcript Formatting: Convert raw transcriptions into readable text
  • Subtitle Generation: Create properly formatted subtitles from audio transcriptions
  • Document Reconstruction: Rebuild properly formatted documents from extracted text

Contributing

Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests.

To get started:

  1. Fork the repository
  2. Install dependencies: bun install (requires Bun)
  3. Make your changes
  4. Run tests: bun test
  5. Submit a pull request

License

paragrafs is released under the MIT License. See the LICENSE.MD file for more details.

Author

Ragaeeb Haq


Built with TypeScript and Bun. Distributed as ES modules.