  • License MIT

Library for converting the various transcript file formats to a common format.

    Transcriptator

    Library for converting the various transcript file formats to a common format.

    Originally designed to help users of the Podcast Namespace podcast:transcript tag.

    Installation

    This is a Node.js module available through npm or yarn.

    Using npm:

    npm install transcriptator

    Using yarn:

    yarn add transcriptator

    Using CDN:

    transcriptator is also available through the jsDelivr CDN.

    Usage

    There are three primary functions and two types. See the jsdoc for additional information.

    The convertFile function accepts the transcript file data and parses it into an array of Segment objects. If transcriptFormat is not defined, convertFile uses determineFormat to attempt to identify the format.

    convertFile(data: string, transcriptFormat: TranscriptFormat = undefined): Array<Segment>

    The determineFormat function accepts the transcript file data and attempts to identify the TranscriptFormat.

    determineFormat(data: string): TranscriptFormat
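
As a rough illustration of what format detection could involve, the sketch below sniffs a transcript string using simple markers taken from the supported formats described in this README. The SketchFormat type and the sniffFormat function are hypothetical names; this is not the library's TranscriptFormat enum or its actual detection logic.

```typescript
// Hypothetical format sniffer. The names are illustrative and do not come
// from the library; the real determineFormat may use different rules.
type SketchFormat = "vtt" | "srt" | "json" | "html" | "unknown";

function sniffFormat(data: string): SketchFormat {
    const trimmed = data.trim();
    // WebVTT files begin with the literal signature "WEBVTT".
    if (/^WEBVTT/.test(trimmed)) return "vtt";
    // JSON transcripts are either an object or an array at the top level.
    const first = trimmed.charAt(0);
    if (first === "{" || first === "[") {
        try {
            JSON.parse(trimmed);
            return "json";
        } catch (err) {
            // Not valid JSON; fall through to the remaining checks.
        }
    }
    // HTML transcripts use time and p elements.
    if (/<(time|p)[\s>]/i.test(trimmed)) return "html";
    // SRT cues contain a "-->" arrow between two timestamps.
    if (/\d\s*-->\s*\d/.test(trimmed)) return "srt";
    return "unknown";
}
```

The WebVTT check runs before the arrow check because WebVTT cues also contain "-->".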

    The combineSingleWordSegments function is a helper function for combining previously parsed Segment objects. It is only intended for the case where each existing Segment contains a single word in its body.

    combineSingleWordSegments(segments: Array<Segment>, maxLength = 32): Array<Segment>
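
To make the idea concrete, here is a minimal sketch of how single-word segments might be merged. It assumes that maxLength is a character limit on the combined body and that segments carry speaker, startTime, endTime, and body fields; both are assumptions, and SketchSegment and combineWords are hypothetical names rather than the library's implementation.

```typescript
// Local stand-in for the library's Segment type. The field names are assumed
// from the JSON transcript format and may not match the real definition.
interface SketchSegment {
    speaker?: string;
    startTime: number;
    endTime: number;
    body: string;
}

// Greedily merges consecutive single-word segments from the same speaker
// until adding another word would push the body past maxLength characters.
function combineWords(segments: SketchSegment[], maxLength = 32): SketchSegment[] {
    const combined: SketchSegment[] = [];
    for (const segment of segments) {
        const last = combined[combined.length - 1];
        if (
            last !== undefined &&
            last.speaker === segment.speaker &&
            last.body.length + 1 + segment.body.length <= maxLength
        ) {
            // Extend the previous segment to cover this word.
            last.body = last.body + " " + segment.body;
            last.endTime = segment.endTime;
        } else {
            // Copy so the caller's segments are not mutated.
            combined.push({ ...segment });
        }
    }
    return combined;
}
```

Each merged segment keeps the start time of its first word and takes the end time of its last word.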

    The TranscriptFormat enum defines the allowable transcript types supported by Transcriptator.

    The Segment type defines the segment/cue of the transcript.

    Supported File Formats

    SRT

    Transcripts which follow the SRT/SubRip format

    1
    00:00:00,780 --> 00:00:06,210
    Adam Curry: podcasting 2.0 March
    4 2023 Episode 124 on D flat
    
    2
    00:00:06,210 --> 00:00:12,990
    formable hello everybody welcome
    to a delayed board meeting of
    

    The hour and minutes in the timestamp are optional. The milliseconds may be separated from the seconds by either a comma or a decimal point.
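
The timestamp rule above can be sketched as a small parser. This is an illustrative sketch, not the library's code; timestampToSeconds is a hypothetical helper name, and returning seconds is an assumption.

```typescript
// Converts a timestamp with optional hours and minutes, and an optional
// fraction after "," or ".", into a number of seconds.
function timestampToSeconds(timestamp: string): number {
    // Groups: 1 = hours, 2 = minutes, 3 = seconds, 4 = fractional digits.
    const match = /^(?:(?:(\d+):)?(\d+):)?(\d+)(?:[,.](\d+))?$/.exec(timestamp.trim());
    if (match === null) {
        throw new Error("Unrecognized timestamp: " + timestamp);
    }
    const hours = match[1] ? parseInt(match[1], 10) : 0;
    const minutes = match[2] ? parseInt(match[2], 10) : 0;
    const seconds = parseInt(match[3] || "0", 10);
    const fraction = match[4] ? parseFloat("0." + match[4]) : 0;
    return hours * 3600 + minutes * 60 + seconds + fraction;
}
```

Under these assumptions, inputs such as "00:00:06,210", "00:00:25.801", and "0:30" are all accepted by the same pattern.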

    The parser attempts to find the speaker's name at the beginning of the first line of each segment.
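
One plausible way to pull a speaker name from the start of a segment is a "Name: text" pattern, sketched below. The pattern and the splitSpeaker name are assumptions for illustration; the library's actual heuristic is not described here and may differ.

```typescript
// Splits text of the form "Name: rest" into a speaker and the remaining body.
// The pattern is an illustrative guess, not the library's actual rule.
function splitSpeaker(text: string): { speaker: string | undefined; body: string } {
    // A short run without ":" or a newline, then ":" followed by whitespace.
    // Requiring whitespace after the colon avoids matching times such as "10:30".
    const match = /^([^:\n]{1,40}):\s+([\s\S]*)$/.exec(text);
    if (match === null) {
        return { speaker: undefined, body: text };
    }
    return { speaker: (match[1] || "").trim(), body: match[2] || "" };
}
```

A simple pattern like this one will also treat any leading "Label: text" as a speaker, so a real implementation would likely need extra checks.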

    HTML

    HTML data in the format below is considered to be a transcript.

    The elements cite, time, and p are used to define a segment. The cite element is not required, and the elements are not required to appear in a particular order.

    The elements may either be a child of the document directly or a direct child of the html or body element.

    Elements do not need to be on separate lines.
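
The sketch below shows one way the cite, time, and p elements could be grouped into segments regardless of order: a new segment begins whenever an element type repeats. This grouping rule, the regular-expression scan, and the names HtmlSketchSegment and groupHtmlSegments are assumptions for illustration, not the library's parser.

```typescript
interface HtmlSketchSegment {
    speaker?: string;   // text of the cite element
    startTime?: string; // text of the time element, left unparsed
    body?: string;      // text of the p element
}

// Scans for cite, time, and p elements in document order and starts a new
// segment whenever an element type repeats within the current segment.
function groupHtmlSegments(html: string): HtmlSketchSegment[] {
    const pattern = /<(cite|time|p)\b[^>]*>([\s\S]*?)<\/\1\s*>/gi;
    const segments: HtmlSketchSegment[] = [];
    let current: HtmlSketchSegment = {};
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(html)) !== null) {
        const tag = (match[1] || "").toLowerCase();
        // Collapse whitespace introduced by indentation and line breaks.
        const text = (match[2] || "").replace(/\s+/g, " ").trim();
        const key = tag === "cite" ? "speaker" : tag === "time" ? "startTime" : "body";
        if (current[key] !== undefined) {
            segments.push(current);
            current = {};
        }
        current[key] = text;
    }
    if (current.speaker !== undefined || current.startTime !== undefined || current.body !== undefined) {
        segments.push(current);
    }
    return segments;
}
```

A regular expression is enough for this flat structure, but nested markup inside p would call for a real HTML parser.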

    Example 1

    <html>
        <body>
            <cite>Alban:</cite>
            <time>0:00</time>
            <p>
                It is so stinking nice to like, show up and record this show. And Travis has already put together an
                outline. Kevin's got suggestions, I throw my thoughts into the mix. And then Travis goes and does all the
                work from there, too. It's out into the wild. And I don't see anything. That's an absolute joy for at least
                two thirds of the team. Yeah, I mean, exactly.
            </p>
            <cite>Kevin:</cite>
            <time>0:30</time>
            <p>
                You guys remember, like two months ago, when you were like, We're going all in on video Buzzcast. I was
                like, that's, I mean, I will agree and commit and disagree, disagree and commit, I'll do something. But I
                don't want to do this.
            </p>
        </body>
    </html>

    Example 2

    <p>
        It is so stinking nice to like, show up and record this show. And Travis has already put together an outline.
        Kevin's got suggestions, I throw my thoughts into the mix. And then Travis goes and does all the work from there,
        too. It's out into the wild. And I don't see anything. That's an absolute joy for at least two thirds of the team.
        Yeah, I mean, exactly.
    </p>
    <time>0:00</time>
    <p>
        You guys remember, like two months ago, when you were like, We're going all in on video Buzzcast. I was like,
        that's, I mean, I will agree and commit and disagree, disagree and commit, I'll do something. But I don't want to do
        this.
    </p>
    <time>0:30</time>

    JSON

    JSON data in one of the formats below is considered to be a transcript.

    In both formats, the data does not need to be in pretty print format.

    Format 1

    {
        "version": "1.0.0",
        "segments": [
            {
                "speaker": "Alban",
                "startTime": 0.0,
                "endTime": 4.8,
                "body": "It is so stinking nice to"
            },
            {
                "speaker": "Alban",
                "startTime": 0.0,
                "endTime": 4.8,
                "body": "like, show up and record this"
            }
        ]
    }

    There must be a segments list of objects containing speaker, startTime, endTime, and body.

    The startTime and endTime are assumed to be in seconds.
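
The Format 1 requirements can be expressed as a small validation step. The following is a sketch under the assumption that all four fields are required with the types shown in the example; JsonSegment and readFormat1 are hypothetical names and not part of the library.

```typescript
interface JsonSegment {
    speaker: string;
    startTime: number; // seconds
    endTime: number;   // seconds
    body: string;
}

// Returns the segments when the data matches Format 1, otherwise null.
function readFormat1(data: string): JsonSegment[] | null {
    let parsed: unknown;
    try {
        parsed = JSON.parse(data);
    } catch (err) {
        return null;
    }
    if (typeof parsed !== "object" || parsed === null) return null;
    const segments = (parsed as { segments?: unknown }).segments;
    if (!Array.isArray(segments)) return null;
    // Every entry must carry the four fields with the expected types.
    const valid = segments.every(
        (s) =>
            typeof s === "object" &&
            s !== null &&
            typeof s.speaker === "string" &&
            typeof s.startTime === "number" &&
            typeof s.endTime === "number" &&
            typeof s.body === "string"
    );
    return valid ? (segments as JsonSegment[]) : null;
}
```

Because JSON.parse ignores insignificant whitespace, the same check applies to pretty-printed and minified data.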

    Format 2

    [
        {
            "start": 1,
            "end": 5000,
            "text": "Subtitles: @marlonrock1986 (^^V^^)"
        },
        {
            "start": 25801,
            "end": 28700,
            "text": "It's another hot, sunny day today\nhere in Southern California."
        }
    ]

    The top level element must be a list of objects containing start, end, and text.

    The start and end are assumed to be in milliseconds.

    The parser attempts to find the speaker's name at the beginning of the text value.
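
Since Format 2 uses milliseconds while Format 1 uses seconds, a conversion step is implied when both feed a common segment shape. The sketch below assumes the common shape uses startTime, endTime, and body in seconds, mirroring Format 1; Cue, SecondsSegment, and cuesToSeconds are hypothetical names.

```typescript
interface Cue {
    start: number; // milliseconds
    end: number;   // milliseconds
    text: string;
}

interface SecondsSegment {
    startTime: number; // seconds
    endTime: number;   // seconds
    body: string;
}

// Maps Format 2 cues onto the seconds-based field names used by Format 1.
function cuesToSeconds(cues: Cue[]): SecondsSegment[] {
    return cues.map((cue) => ({
        startTime: cue.start / 1000,
        endTime: cue.end / 1000,
        body: cue.text,
    }));
}
```

Speaker detection is left out of this sketch to keep the unit conversion in focus.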

    WebVTT

    Transcripts which follow the WebVTT/VTT format

    WEBVTT
    
    1
    00:00:00.001 --> 00:00:05.000
    Subtitles: @marlonrock1986 (^^V^^)
    
    2
    00:00:25.801 --> 00:00:28.700
    It's another hot, sunny day today
    here in Southern California.
    

    The hour and minutes in the timestamp are optional. The milliseconds may be separated from the seconds by either a comma or a decimal point.

    The parser attempts to find the speaker's name at the beginning of the first line of each segment.

    Test Transcripts

    Transcripts used for testing are excerpts from the following shows.

    Contributing

    Please see the Contribution Guide.