JSPM

@dataset.sh/file

0.1.1-beta.1

TypeScript library for reading and writing the DatasetFile ZIP-based archive format

Package Exports

  • @dataset.sh/file
  • @dataset.sh/file/dist/index.js

This package does not declare an exports field, so the exports above were detected and optimized automatically by JSPM. If a package subpath is missing, consider opening an issue against the original package (@dataset.sh/file) requesting support for the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.
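For reference, an "exports" field matching the two subpaths detected above might look like this in the package's package.json (a hypothetical sketch, not the package's actual configuration):

```json
{
  "exports": {
    ".": "./dist/index.js",
    "./dist/index.js": "./dist/index.js"
  }
}
```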

Readme

@dataset.sh/file

A TypeScript library for reading and writing the DatasetFile ZIP-based archive format. It provides an efficient way to store and access structured datasets, with support for collections, type annotations, and binary files.

Features

  • 📦 ZIP-based format - Efficient compression and packaging of datasets
  • 📊 Multiple collections - Organize data into named collections (train, test, validation, etc.)
  • 🏷️ Type annotations - Include type information for each collection
  • 🔤 Typelang support - Define schemas using TypeScript-like syntax for cross-platform compatibility
  • 📎 Binary files - Attach model weights, images, or other binary assets

Installation

pnpm add @dataset.sh/file

Optional: Typelang Compiler

For enhanced type validation and cross-platform type generation, you can also install the Typelang compiler:

pnpm add @dataset.sh/typelang

Quick Start

Writing a Dataset

import {DatasetFile, DatasetFileWriter} from '@dataset.sh/file';

// Create a new dataset file
const writer = DatasetFile.open('my-dataset.dataset', 'w') as DatasetFileWriter;

// Add metadata
writer.updateMeta({
    author: 'Your Name',
    authorEmail: 'you@example.com',
    description: 'My awesome dataset',
    tags: ['nlp', 'classification'],
    dataset_metadata: {
        version: '1.0.0',
        created: new Date().toISOString()
    }
});

// Add a collection with data and Typelang schema
const trainData = [
    {id: 1, text: 'Hello world', label: 'greeting'},
    {id: 2, text: 'How are you?', label: 'question'}
];

// Define schema using Typelang syntax
const typeSchema = `// use TrainItem
type TrainItem = {
  id: int
  text: string
  label: string
}`;

await writer.addCollection('train', trainData, typeSchema);

// Add binary files (optional)
const modelWeights = Buffer.from('...');
writer.addBinaryFile('model.bin', modelWeights);

await writer.close();

Reading a Dataset

import {DatasetFile, DatasetFileReader} from '@dataset.sh/file';

// Open an existing dataset
const reader = DatasetFile.open('my-dataset.dataset', 'r') as DatasetFileReader;

// Access metadata
console.log('Author:', reader.meta.author);
console.log('Collections:', reader.collections());

// Read a collection
const trainCollection = reader.collection('train');

// Get type annotation (raw Typelang schema)
const typeAnnotation = await trainCollection.typeAnnotation();
console.log('Type annotation:', typeAnnotation);

// Generate code from type annotation
const codeUsage = await trainCollection.generateCode();
if (codeUsage) {
    console.log('Type name:', codeUsage.useClass);
    console.log('Compilation result:', codeUsage.result);
}

// Access data
console.log('First 5 items:', trainCollection.top(5));
console.log('Random sample:', trainCollection.randomSample(3));

// Iterate through data
for (const item of trainCollection) {
    console.log(item);
}

// Convert to array
const allData = trainCollection.toList();

// Access binary files
const modelData = reader.openBinaryFile('model.bin');

reader.close();

API Reference

DatasetFile

Main entry point for opening dataset files.

DatasetFile.open(filePath: string, mode: 'r' | 'w')

Opens a dataset file for reading or writing.

  • filePath: Path to the dataset file
  • mode: 'r' for reading, 'w' for writing
  • Returns: DatasetFileReader or DatasetFileWriter

DatasetFileWriter

Used for creating new dataset files.

Methods

  • updateMeta(meta: Partial<DatasetFileMeta>): Update dataset metadata
  • async addCollection(name: string, data: any[], type_annotation?: string): Add a data collection with optional Typelang schema
  • addBinaryFile(fileName: string, data: Buffer): Add a binary file
  • async close(): Close and save the dataset file

DatasetFileReader

Used for reading existing dataset files.

Properties

  • meta: Dataset metadata

Methods

  • collections(): Get list of collection names
  • collection(name: string): Get a collection reader
  • coll(name: string): Shorthand for collection()
  • binaryFiles(): List binary file names
  • openBinaryFile(fileName: string): Read a binary file
  • close(): Close the dataset file

CollectionReader

Reader for individual collections within a dataset.

Properties

  • length: Number of items in the collection

Methods

  • async typeAnnotation(): Get raw Typelang schema string
  • async generateCode(): Generate code usage information from type annotation (returns CodeUsage with source, useClass, and compile result)
  • top(n: number): Get first n items
  • randomSample(n: number): Get random sample
  • toList(): Convert to array
  • [Symbol.iterator](): Iterate through items
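The `[Symbol.iterator]()` entry above is what makes `for...of` work on a collection. As an illustration of that pattern only (a sketch of the same surface, not this library's internals), a minimal in-memory stand-in might look like:

```typescript
// Sketch of a CollectionReader-style surface: length, top(n), toList(),
// and [Symbol.iterator](). InMemoryCollection is a hypothetical class for
// illustration; it is not exported by @dataset.sh/file.
class InMemoryCollection<T> {
  constructor(private items: T[]) {}

  get length(): number {
    return this.items.length;
  }

  // First n items, like top(n)
  top(n: number): T[] {
    return this.items.slice(0, n);
  }

  // Full copy, like toList()
  toList(): T[] {
    return [...this.items];
  }

  // Enables `for (const item of collection)`
  [Symbol.iterator](): Iterator<T> {
    return this.items.values();
  }
}

const coll = new InMemoryCollection(['a', 'b', 'c']);
const seen: string[] = [];
for (const item of coll) {
  seen.push(item);
}
```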

File Format

DatasetFile uses a ZIP archive with the following structure:

dataset.dataset/
├── meta.json          # Dataset metadata
├── coll/              # Collections folder
│   ├── train/
│   │   ├── data.jsonl # Data in JSON Lines format
│   │   └── type.tl    # Typelang schema (optional)
│   └── test/
│       ├── data.jsonl
│       └── type.tl
└── bin/               # Binary files folder
    └── model.bin
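Each data.jsonl file holds one JSON object per line (JSON Lines). A minimal sketch of that encoding using only standard JSON follows; these helper functions are hypothetical illustrations of the format, not part of @dataset.sh/file:

```typescript
// Encode an array of records as JSON Lines: one JSON object per line
function toJsonLines(items: unknown[]): string {
  return items.map((item) => JSON.stringify(item)).join('\n');
}

// Decode JSON Lines back into an array, skipping blank lines
function fromJsonLines(text: string): unknown[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}

const rows = [
  {id: 1, text: 'Hello world', label: 'greeting'},
  {id: 2, text: 'How are you?', label: 'question'},
];
const encoded = toJsonLines(rows);
```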

Typelang Support

This library supports Typelang, a TypeScript-flavored schema definition language for cross-platform type generation.

Using Typelang Schemas

import {DatasetFile, DatasetFileWriter} from '@dataset.sh/file';

const writer = DatasetFile.open('typed-dataset.dataset', 'w') as DatasetFileWriter;

// Define complex types with Typelang
const userSchema = `// use User
type Address = {
  street: string
  city: string
  country: string
  postalCode?: string
}

type User = {
  id: string
  name: string
  email: string
  age: int
  address: Address
  tags: string[]
  status: "active" | "inactive" | "pending"
}`;

const userData = [{
    id: 'u1',
    name: 'Alice',
    email: 'alice@example.com',
    age: 30,
    address: {
        street: '123 Main St',
        city: 'San Francisco',
        country: 'USA'
    },
    tags: ['developer', 'team-lead'],
    status: 'active'
}];

await writer.addCollection('users', userData, userSchema);
await writer.close();

Generic Types

const responseSchema = `// use ApiResponse
type Response<T> = {
  success: bool
  data?: T
  error?: string
  timestamp: string
}

type UserData = {
  userId: string
  username: string
}

type ApiResponse = Response<UserData>`;

const responseData = [{
    success: true,
    data: {userId: 'u1', username: 'alice'},
    timestamp: new Date().toISOString()
}];

// Assumes an open DatasetFileWriter (see the previous example)
await writer.addCollection('responses', responseData, responseSchema);

Examples

Working with NLP Datasets

const writer = DatasetFile.open('nlp-dataset.dataset', 'w') as DatasetFileWriter;

writer.updateMeta({
    description: 'Sentiment analysis dataset',
    tags: ['nlp', 'sentiment', 'classification']
});

const data = [
    {text: 'This movie is great!', sentiment: 'positive'},
    {text: 'Terrible experience.', sentiment: 'negative'}
];

const sentimentSchema = `// use SentimentItem
type SentimentItem = {
  text: string
  sentiment: "positive" | "negative" | "neutral"
}`;

await writer.addCollection('train', data, sentimentSchema);

await writer.close();

Reading Python-created Datasets

This library is fully compatible with datasets created using the Python dataset-sh library, including those with Typelang type annotations:

const reader = DatasetFile.open('python-dataset.dataset', 'r') as DatasetFileReader;

// Read collections created in Python
const collection = reader.collection('data');

// Check for type annotation and generate code
const typeAnnotation = await collection.typeAnnotation();
if (typeAnnotation) {
    console.log('Type annotation:', typeAnnotation);
    const codeUsage = await collection.generateCode();
    if (codeUsage) {
        console.log('Type name:', codeUsage.useClass);
        console.log('Validation errors:', codeUsage.result.errors);
    }
}

// Iterate through data
for (const item of collection) {
    console.log(item);
}

reader.close();

Development

Building

pnpm build

Testing

pnpm test
pnpm test:watch
pnpm test:coverage

Running Examples

pnpm example
pnpm verify-python

Requirements

  • Node.js >= 16.0.0
  • TypeScript >= 5.0.0

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues and feature requests, please use the GitHub issue tracker.