Package Exports

paragrafs
paragrafs/dist/index.mjs

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (paragrafs) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

paragrafs


A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability.

Features

Segment reconstruction – marks filler words, hints, and time gaps to create natural paragraph boundaries and merges overly short segments back into their predecessors.
Timestamped formatting – produces human-friendly transcripts with optional custom formatting callbacks and automatic timestamp rendering.
Ground-truth alignment – synchronises AI-generated tokens with human-edited text, interpolating timings for missing words and removing unknown tokens when applying the ground truth.
Selection helpers – exposes utilities to find tokens for string queries or cursor selections, enabling rich text editors to jump to precise timestamps.
Utility toolkit – includes helpers for timestamp formatting, punctuation detection (including Arabic punctuation), and hint creation for multi-word matching.
Bun-native toolchain – powered by the upstream tsdown CLI for bundling and Biome for linting, so the same commands run locally and in CI without any custom wrappers.

Installation

npm install paragrafs

pnpm install paragrafs

yarn add paragrafs

bun add paragrafs

Usage

Basic Example

import { estimateSegmentFromToken, mapSegmentsIntoFormattedSegments } from 'paragrafs';

// Example token from transcription
const token = {
    start: 0,
    end: 5,
    text: 'This is a sample text. It should be properly segmented.',
};

// Estimate segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Format the segment into readable text
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);

console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."
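
To give some intuition for what "word-level tokens" means here, the sketch below spreads a token's duration evenly across its words. This is an illustration only: `splitTokenEvenly` is a hypothetical helper, and the even split is an assumption; `estimateSegmentFromToken` may weight words differently.

```typescript
type Token = { start: number; end: number; text: string };

// Hypothetical helper (not part of paragrafs): gives every word an equal
// share of the parent token's duration.
function splitTokenEvenly(token: Token): Token[] {
    const parts = token.text.split(/\s+/).filter((w) => w.length > 0);
    if (parts.length === 0) return [];
    const step = (token.end - token.start) / parts.length;
    return parts.map((text, i) => ({
        start: token.start + i * step,
        end: token.start + (i + 1) * step,
        text,
    }));
}

const wordTokens = splitTokenEvenly({ start: 0, end: 5, text: 'This is a sample text.' });
console.log(wordTokens.length); // 5
console.log(wordTokens[1]); // { start: 1, end: 2, text: 'is' }
```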

Working with Transcriptions

import {
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Example transcription segments
const segments = [
    {
        start: 0,
        end: 6.5,
        text: 'The quick brown fox!',
        tokens: [
            { start: 0, end: 1, text: 'The' },
            { start: 1, end: 2, text: 'quick' },
            { start: 2, end: 3, text: 'brown' },
            { start: 3, end: 6.5, text: 'fox!' },
        ],
    },
    {
        start: 8,
        end: 13,
        text: 'Jumps right over the',
        tokens: [
            { start: 8, end: 9, text: 'Jumps' },
            { start: 9, end: 10, text: 'right' },
            { start: 10, end: 11, text: 'over' },
            { start: 12, end: 13, text: 'the' },
        ],
    },
];

// Options for segment formatting
const options = {
    fillers: ['uh', 'umm', 'hmmm'],
    gapThreshold: 3,
    maxSecondsPerSegment: 12,
    minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

// Get timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10);

console.log(transcript);
// Output:
// 0:00: The quick brown fox!
// 0:08: Jumps right over the

Aligning AI Tokens to Human-Edited Text

import { updateSegmentWithGroundTruth } from 'paragrafs';

const rawSegment = {
    start: 0,
    end: 10,
    text: 'The Buick crown flock jumps right over the crazy dog.',
    tokens: [
        /* AI-generated word timestamps */
    ],
};

const aligned = updateSegmentWithGroundTruth(rawSegment, 'The quick brown fox jumps right over the lazy dog.');
console.log(aligned.tokens);
// Each token now matches the ground-truth words exactly,
// with missing words interpolated where needed.
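
The API reference describes this alignment as LCS-based. As a rough sketch of that idea (not the library's actual code), the snippet below computes a longest common subsequence between the AI words and the ground-truth words and reports which ground-truth words found a match. `alignByLcs` and `normaliseWord` are hypothetical names, and the punctuation stripping is a simplifying assumption. Words that come back as -1 are the ones whose timings would need to be interpolated from their matched neighbours.

```typescript
type Token = { start: number; end: number; text: string };

// Compare words case-insensitively and without common punctuation (a simplification).
const normaliseWord = (word: string): string => word.toLowerCase().replace(/[.,!?;:]/g, '');

// For each ground-truth word, returns the index of the matching AI token, or -1 if none.
function alignByLcs(tokens: Token[], truthWords: string[]): number[] {
    const a = tokens.map((t) => normaliseWord(t.text));
    const b = truthWords.map((w) => normaliseWord(w));

    // dp[i][j] holds the LCS length of a[i..] and b[j..].
    const dp: number[][] = [];
    for (let i = 0; i <= a.length; i++) {
        const row: number[] = [];
        for (let j = 0; j <= b.length; j++) row.push(0);
        dp.push(row);
    }
    for (let i = a.length - 1; i >= 0; i--) {
        for (let j = b.length - 1; j >= 0; j--) {
            dp[i][j] = a[i] === b[j] ? dp[i + 1][j + 1] + 1 : Math.max(dp[i + 1][j], dp[i][j + 1]);
        }
    }

    // Walk the table from the top-left, recording matches along one optimal path.
    const matches: number[] = b.map(() => -1);
    let i = 0;
    let j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] === b[j]) {
            matches[j] = i;
            i++;
            j++;
        } else if (dp[i + 1][j] >= dp[i][j + 1]) {
            i++;
        } else {
            j++;
        }
    }
    return matches;
}

const aiTokens: Token[] = [
    { start: 0, end: 1, text: 'The' },
    { start: 1, end: 2, text: 'Buick' },
    { start: 2, end: 3, text: 'crown' },
    { start: 3, end: 4, text: 'flock' },
    { start: 4, end: 5, text: 'jumps' },
];
const alignment = alignByLcs(aiTokens, ['The', 'quick', 'brown', 'fox', 'jumps']);
console.log(alignment); // [0, -1, -1, -1, 4]
```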

Commands

bun run build – compiles the library with the official tsdown pipeline configured in tsdown.config.ts.
bun run lint – runs Biome’s formatter and linter against the repository root.
bun test – executes the Bun test suite with coverage output configured in package.json.

API Reference

Transcript builders

estimateSegmentFromToken(token: Token): Segment – splits multi-word tokens into per-word timings so they can participate in downstream processing.
markTokensWithDividers(tokens: Token[], options: MarkTokensWithDividersOptions): MarkedToken[] – inserts divider markers based on fillers, hints, punctuation, and timing gaps.
groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[] – chunks marked tokens into bounded-length segments.
mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[] – merges segments that contain fewer than the required word count into their predecessors.
cleanupIsolatedTokens(markedTokens: MarkedToken[]): MarkedToken[] – removes redundant divider markers that would isolate a single token on a line.
markAndCombineSegments(segments: Segment[], options): MarkedSegment[] – convenience pipeline that flattens tokens, marks dividers, groups, and merges short runs in one call.
mapSegmentsIntoFormattedSegments(segments: MarkedSegment[], maxSecondsPerLine?: number): Segment[] – flattens marked segments into readable text while respecting optional line duration caps.
formatSegmentsToTimestampedTranscript(segments: MarkedSegment[], maxSecondsPerLine: number, formatTokens?: (buffer: Token) => string): string – emits newline-separated transcript lines with timestamps or a custom formatter.
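
The builders above form a mark, group, and merge pipeline. The sketch below imitates two of those stages in isolation, using only timing gaps and a minimum word count, so the flow is easier to picture. `splitOnGaps` and `mergeShortGroups` are hypothetical stand-ins, not library functions, and the real pipeline also considers fillers, hints, punctuation, and a maximum segment duration.

```typescript
type Token = { start: number; end: number; text: string };

// Starts a new group whenever the silence before a token exceeds gapThreshold seconds.
function splitOnGaps(tokens: Token[], gapThreshold: number): Token[][] {
    const groups: Token[][] = [];
    let previous: Token | undefined;
    for (const token of tokens) {
        if (!previous || token.start - previous.end > gapThreshold) {
            groups.push([]);
        }
        groups[groups.length - 1].push(token);
        previous = token;
    }
    return groups;
}

// Folds any group with fewer than minWords tokens into the group before it.
function mergeShortGroups(groups: Token[][], minWords: number): Token[][] {
    const merged: Token[][] = [];
    for (const group of groups) {
        if (merged.length > 0 && group.length < minWords) {
            merged[merged.length - 1] = merged[merged.length - 1].concat(group);
        } else {
            merged.push(group.slice());
        }
    }
    return merged;
}

const spoken: Token[] = [
    { start: 0, end: 1, text: 'So' },
    { start: 1, end: 2, text: 'we' },
    { start: 2, end: 3, text: 'begin.' },
    { start: 7, end: 8, text: 'Right.' },
    { start: 12, end: 13, text: 'Next' },
    { start: 13, end: 14, text: 'topic' },
    { start: 14, end: 15, text: 'please.' },
];

const paragraphs = mergeShortGroups(splitOnGaps(spoken, 3), 2).map((group) =>
    group.map((t) => t.text).join(' '),
);
console.log(paragraphs); // [ 'So we begin. Right.', 'Next topic please.' ]
```

Here the lone "Right." forms its own group after the 4-second pause, then gets folded back into the preceding paragraph because it is shorter than the two-word minimum.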

Ground-truth alignment

updateSegmentWithGroundTruth(segment: Segment, groundTruth: string): GroundedSegment – applies LCS-based alignment to replace tokens with the ground-truth words while flagging unmatched entries.
applyGroundTruthToSegment(segment: Segment, groundTruth: string): Segment – wraps updateSegmentWithGroundTruth and filters unknown tokens for production-ready output.
mergeSegments(segments: Segment[], delimiter?: string): Segment – concatenates sequential segments into one continuous block, preserving timing.
splitSegment(segment: Segment, splitTime: number): Segment[] – divides a segment into two at a specific timestamp.

Editor helpers

getFirstMatchingToken(tokens: Token[], query: string): Token | null – scans for the first occurrence of a hint sequence produced by createHints.
getFirstTokenForSelection(segment: Segment, selectionStart: number, selectionEnd: number): Token | null – maps character selections within segment.text back to the corresponding timed token.
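
As a rough picture of how a character selection can be mapped back to a timed token, the sketch below walks the tokens while tracking character offsets. It assumes the displayed text is simply the tokens joined by single spaces, which may not hold for the library; `tokenAtSelection` is a hypothetical name, not the exported helper.

```typescript
type Token = { start: number; end: number; text: string };

// Returns the first token whose character range overlaps the selection, or null.
// Assumption: the visible text equals the token texts joined by single spaces.
function tokenAtSelection(tokens: Token[], selectionStart: number, selectionEnd: number): Token | null {
    let offset = 0;
    for (const token of tokens) {
        const tokenEnd = offset + token.text.length;
        // Overlap between the token's [offset, tokenEnd) and the selection range.
        if (selectionStart < tokenEnd && selectionEnd >= offset) return token;
        offset = tokenEnd + 1; // skip the joining space
    }
    return null;
}

const foxTokens: Token[] = [
    { start: 0, end: 1, text: 'The' },
    { start: 1, end: 2, text: 'quick' },
    { start: 2, end: 3, text: 'brown' },
    { start: 3, end: 6.5, text: 'fox!' },
];

// In "The quick brown fox!", characters 10 to 15 cover the word "brown".
const hit = tokenAtSelection(foxTokens, 10, 15);
console.log(hit ? hit.start : null); // 2
```

An editor could then seek its audio player to the returned token's start time.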

Utility functions

createHints(...hints: string[]): Hints – splits one or more hint strings into lookup maps keyed by the first word.
formatSecondsToTimestamp(seconds: number): string – renders numeric durations into m:ss or h:mm:ss strings.
isEndingWithPunctuation(text: string): boolean – checks for trailing punctuation, including Arabic variants.
tokenizeGroundTruth(groundTruth: string): string[] – tokenises human transcripts while attaching punctuation to the preceding word.
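
For reference, the behaviour described for the timestamp and punctuation helpers can be approximated as follows. These are hypothetical re-implementations written for illustration (`toTimestamp`, `endsWithPunctuation`); the exact set of punctuation marks the library recognises is an assumption here.

```typescript
// Renders seconds as m:ss, switching to h:mm:ss once the value reaches one hour.
function toTimestamp(totalSeconds: number): string {
    const whole = Math.floor(totalSeconds);
    const hours = Math.floor(whole / 3600);
    const minutes = Math.floor((whole % 3600) / 60);
    const seconds = whole % 60;
    const pad = (n: number): string => (n < 10 ? '0' + n : String(n));
    return hours > 0
        ? hours + ':' + pad(minutes) + ':' + pad(seconds)
        : minutes + ':' + pad(seconds);
}

// Trailing-punctuation check covering . ! ? plus the Arabic question mark (U+061F),
// Arabic semicolon (U+061B) and Arabic full stop (U+06D4). The chosen set is an assumption.
function endsWithPunctuation(text: string): boolean {
    return /[.!?\u061F\u061B\u06D4]$/.test(text.trim());
}

console.log(toTimestamp(8)); // 0:08
console.log(toTimestamp(75)); // 1:15
console.log(toTimestamp(3725)); // 1:02:05
console.log(endsWithPunctuation('fox!')); // true
console.log(endsWithPunctuation('over the')); // false
```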

Types

type Token = {
    start: number;
    end: number;
    text: string;
};

type Segment = Token & {
    tokens: Token[];
};

type MarkedToken = Token | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

type MarkedSegment = {
    start: number;
    end: number;
    tokens: MarkedToken[];
};

type GroundedToken = Token & { isUnknown?: boolean };

type GroundedSegment = Omit<Segment, 'tokens'> & { tokens: GroundedToken[] };

Use Cases

Transcript Formatting: Convert raw transcriptions into readable text
Subtitle Generation: Create properly formatted subtitles from audio transcriptions
Document Reconstruction: Rebuild properly formatted documents from extracted text

Contributing

Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests.

To get started:

Fork the repository
Install dependencies: bun install (requires Bun)
Make your changes
Run linting: bun run lint
Build the package: bun run build
Run tests: bun test
Submit a pull request

License

paragrafs is released under the MIT License. See the LICENSE.MD file for more details.

Author

Ragaeeb Haq

Built with TypeScript and Bun. Uses ESM module format.