JSPM

  • ESM via JSPM
  • ES Module Entrypoint
  • Export Map
  • Keywords
  • License
  • Repository URL
  • TypeScript Types
  • README
  • Created
  • Published
  • Downloads 44
  • Score
    100M100P100Q52071F
  • License MIT

Fixed-chunk dedup + entropy compression utility (MSH1 binary format)

Package Exports

  • msh-zip
  • msh-zip/lib/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (msh-zip) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

mshzip

Fixed-chunk deduplication + entropy compression utility

Eliminates duplicate chunks via SHA-256 hashing, then applies gzip entropy compression. Outputs the MSH1 binary container format. Zero external dependencies.

100MB repeated pattern →  1.4KB (100.0% reduction, 694 MB/s)
 50MB log file         →  9.0MB ( 82.0% reduction,  56 MB/s)
 50MB JSON             →  7.4MB ( 85.1% reduction,  57 MB/s)

Install

# Library
npm install mshzip

# Global CLI
npm install -g mshzip

Quick Start

Simple API

const { pack, unpack } = require('mshzip');

const data = Buffer.from('hello world'.repeat(1000));
const compressed = pack(data);
const restored = unpack(compressed);

console.log(data.equals(restored)); // true
console.log(compressed.length);     // ~46 bytes

With Options

const compressed = pack(data, {
  chunkSize: 256,              // 'auto' (default) | 8 ~ 16MB
  frameLimit: 64 * 1024 * 1024, // bytes per frame (default: 64MB)
  codec: 'gzip',               // 'gzip' (default) | 'none'
  crc: true                    // append CRC32 per frame
});

Streaming API

const { PackStream, UnpackStream } = require('mshzip');
const { pipeline } = require('stream/promises');
const fs = require('fs');

// Compress
const ps = new PackStream({ chunkSize: 128, codec: 'gzip' });
await pipeline(
  fs.createReadStream('data.bin'),
  ps,
  fs.createWriteStream('data.msh')
);
console.log(ps.stats);
// { bytesIn, bytesOut, frameCount, dictSize, chunkSize }

// Decompress
await pipeline(
  fs.createReadStream('data.msh'),
  new UnpackStream(),
  fs.createWriteStream('restored.bin')
);

Parallel Processing

const { WorkerPool } = require('mshzip/lib/parallel');

const pool = new WorkerPool(4);
await pool.init();
const results = await pool.runAll([
  { type: 'pack', inputPath: 'a.bin', outputPath: 'a.msh' },
  { type: 'pack', inputPath: 'b.bin', outputPath: 'b.msh' },
]);
pool.destroy();

CLI

mshzip pack -i <input> -o <output> [options]
  -i <path>          Input file (- = stdin)
  -o <path>          Output file (- = stdout)
  --chunk <N|auto>   Chunk size in bytes or auto (default: auto)
  --frame <N>        Max bytes per frame (default: 64MB)
  --codec <type>     gzip | none (default: gzip)
  --crc              Append CRC32 checksum
  --verbose          Verbose output

mshzip unpack -i <input> -o <output>
mshzip info -i <file>
mshzip multi pack|unpack <files...> --out-dir <dir> [--workers N]

Examples

# Compress / decompress
mshzip pack -i data.bin -o data.msh
mshzip unpack -i data.msh -o data.bin

# File info
mshzip info -i data.msh

# Parallel batch
mshzip multi pack *.log --out-dir ./compressed --workers 4

# Pipe (stdin/stdout)
cat data.bin | mshzip pack -i - -o - > data.msh
mshzip unpack -i data.msh -o - | sha256sum

How It Works

Compression (Pack)

Input data
  ↓
1. Auto-detect optimal chunk size (samples 1MB, tests 8 candidates)
  ↓
2. Split into fixed-size chunks (last chunk zero-padded)
  ↓
3. SHA-256 hash each chunk → deduplicate
   - New chunk → add to dictionary (assign global index)
   - Existing chunk → reuse existing index
  ↓
4. Build frame (split at frameLimit, default 64MB)
   - Dictionary section: new chunks for this frame
   - Sequence section: varint-encoded index array
  ↓
5. gzip compress payload (level=1, speed priority)
  ↓
6. Frame = Header(32B) + PayloadSize(4B) + Compressed payload [+ CRC32(4B)]
  ↓
Output MSH1 file

Decompression (Unpack)

MSH1 file
  ↓
1. Validate magic number 'MSH1'
2. Parse frame header (32B)
3. Decompress payload (gzip)
4. Extract new chunks from dictionary section → accumulate to global dict
5. Decode sequence section (varint) → index array
6. Look up dictionary by index → reassemble original chunks in order
7. Trim to origBytes (remove padding)
  ↓
Original data

Compression Example

640B input (chunkSize=128B)
  → 5 chunks: [A, B, A, A, C]  (A repeats 3 times)
  → Dictionary: 3 × 128B = 384B (only A, B, C stored)
  → Sequence: [0, 1, 0, 0, 2] → 5 bytes (varint)
  → gzip compressed output

100MB repeated pattern
  → 781,250 chunks, only ~10 unique
  → Dict: 1,280B + Seq: ~781KB → gzip → ~1.4KB
  → 99.9999% reduction!

Performance

Compression Ratio by Data Type

Data Type Ratio Pack Speed Notes
Repeated pattern ~95-100% 200-694 MB/s Most chunks are duplicates
Log files ~82% 56 MB/s Timestamps vary, structure repeats
JSON documents ~85% 57 MB/s Key names and structure repeat
Mixed text ~40% 100 MB/s Partial duplication
All zeros ~99% 500+ MB/s Perfect dedup
Random binary -1~-2% 50 MB/s No duplicates, overhead only
Pre-compressed (mp4, zip, jpg) -1~-2% - Dedup ineffective

Auto Chunk Size Detection

Data Type Auto-selected Reason
10B pattern repeat 128B Small pattern → small chunk maximizes dedup
256B pattern repeat 256B Matches pattern size → optimal 1:1 mapping
Log files 256B Log line ~200B → captures structural repetition
JSON documents 512B JSON object-level repetition
Random binary 2048B No dedup possible → larger chunks minimize seq overhead

MSH1 Format

Frame Structure

┌──────────────────────────────────────────────────────────────────┐
│ Frame Header (32B)  │ PayloadSize (4B)  │ Payload  │ CRC32 (4B) │
└──────────────────────────────────────────────────────────────────┘

Frame Header (32 bytes, Little-Endian)

Offset Size Type Field Description
0 4 bytes magic MSH1 (0x4D534831)
4 2 uint16 version Format version (currently: 1)
6 2 uint16 flags Bit flags (0x0001 = CRC32)
8 4 uint32 chunkSize Chunk size (8B ~ 16MB)
12 1 uint8 codecId 0 = none, 1 = gzip
13 3 - padding Reserved (0x000000)
16 4 uint32 origBytesLo Original size lower 32 bits
20 4 uint32 origBytesHi Original size upper 32 bits
24 4 uint32 dictEntries New dictionary entries in this frame
28 4 uint32 seqCount Sequence index count

Payload (after decompression)

[Dictionary section: dictEntries × chunkSize bytes] [Sequence section: seqCount uvarint values]

Multi-frame

┌──────────┬──────────┬─────┬──────────┐
│ Frame #0 │ Frame #1 │ ... │ Frame #N │
└──────────┴──────────┴─────┴──────────┘

- Frames split at frameLimit (default 64MB)
- Dictionary accumulates across frames (global dictionary)
- origBytesLo + origBytesHi supports 64-bit original size (4GB+)

Key Features

Feature Description
Fixed-chunk Dedup SHA-256 hash-based duplicate elimination
Auto Chunk Detection Analyzes input to auto-select optimal chunk size (8~4096B)
Streaming Node.js Transform Stream (PackStream / UnpackStream)
Parallel Processing Worker Thread pool for multi-file batch operations
Frame-based Dictionary accumulates across frames, 64-bit support (4GB+)
CRC32 Optional per-frame integrity verification
Cross-compatible MSH1 files 100% compatible with Python mshzip
CLI pack / unpack / info / multi — 4 commands
stdin/stdout Pipe-friendly with -i - / -o -

Defaults

Constant Value Description
DEFAULT_CHUNK_SIZE 'auto' Auto-detect (fallback: 128B)
DEFAULT_FRAME_LIMIT 64MB Max input bytes per frame
DEFAULT_CODEC 'gzip' gzip level=1 (speed priority)
MIN_CHUNK_SIZE 8B Minimum chunk size
MAX_CHUNK_SIZE 16MB Maximum chunk size

Module Structure

Module File Role
constants lib/constants.js MSH1 magic, header size, defaults, codec IDs
varint lib/varint.js LEB128 variable-length integer encode/decode
packer lib/packer.js Compression engine: SHA-256 dict, chunking, frame assembly
unpacker lib/unpacker.js Decompression engine: frame parsing, dict accumulation
stream lib/stream.js Transform Stream (PackStream / UnpackStream)
parallel lib/parallel.js Worker Thread pool (multi-file)
cli cli.js CLI entry point (pack / unpack / info / multi)

Limitations

  1. Dictionary memory: With highly unique chunks (random data), dict size approaches original
  2. Pre-compressed data: mp4, zip, jpg etc. see no dedup benefit (-1~-2%)
  3. Codecs: Only gzip (level=1) and none supported
  4. No cross-file dict sharing: Each file gets an independent dictionary in parallel mode
  5. V8 Map limit: ~16.77M chunks max before needing larger chunk size (5GB+ files)
  6. Auto-detect overhead: ~16ms (1MB sample × 8 candidates) — negligible in most cases

Cross-compatibility

MSH1 files are 100% interoperable between Node.js and Python implementations.

# Node.js compress → Python decompress
node cli.js pack -i data.bin -o data.msh
python -m mshzip unpack -i data.msh -o restored.bin

# Python compress → Node.js decompress
python -m mshzip pack -i data.bin -o data.msh
node cli.js unpack -i data.msh -o restored.bin

Python package: mshzip on PyPI


License

MIT