Fast audio duplicate detection using Chromaprint fingerprinting

Package Exports

  • audio-duplicates
  • audio-duplicates/lib/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (audio-duplicates) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

Audio Duplicates

A high-performance audio duplicate detection library built with native C++ and Chromaprint fingerprinting technology. Quickly find duplicate audio files across large collections with robust detection that handles different encodings, bitrates, and formats.

✨ Features

  • 🚀 High Performance: Native C++ implementation ~200x faster than JavaScript
  • 🎵 Format Support: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, and more
  • ⚡ Fast Matching: Optimized inverted index for O(1) duplicate lookups
  • 🔧 Robust Detection: Handles different bitrates, sample rates, and encodings
  • 💻 CLI Tool: Full-featured command-line interface for batch processing
  • 📝 TypeScript Support: Complete TypeScript definitions included
  • 🌍 Cross-Platform: Windows, macOS, and Linux support
  • 📊 Progress Reporting: Real-time progress bars and statistics

📦 Installation

Prerequisites

Install the required system libraries first:

macOS

brew install chromaprint libsndfile

Ubuntu/Debian

sudo apt-get update
sudo apt-get install libchromaprint-dev libsndfile1-dev

Windows

Download and install the Windows builds of the same libraries (Chromaprint and libsndfile).

Install the Package

Global Installation (for CLI usage)

npm install -g audio-duplicates

Local Installation (for API usage)

npm install audio-duplicates

The package automatically uses prebuilt binaries when available, falling back to source compilation if needed.

🚀 Quick Start

CLI Usage

Scan for Duplicates

# Scan a single directory
audio-duplicates scan /path/to/music

# Scan multiple directories
audio-duplicates scan /music/collection1 /music/collection2

# Scan with custom threshold
audio-duplicates scan /path/to/music --threshold 0.9

# Save results to file
audio-duplicates scan /path/to/music --output duplicates.json --format json

Compare Two Files

audio-duplicates compare song1.mp3 song2.mp3

Generate Fingerprint

# Generate and display fingerprint
audio-duplicates fingerprint song.mp3

# Save fingerprint to file
audio-duplicates fingerprint song.mp3 --output fingerprint.json

API Usage

Basic Duplicate Detection

const audioDuplicates = require('audio-duplicates');

async function findDuplicates() {
  // Scan directory for duplicates
  const duplicates = await audioDuplicates.scanDirectoryForDuplicates('/path/to/music', {
    threshold: 0.85,
    onProgress: (progress) => {
      console.log(`Processing: ${progress.current}/${progress.total} - ${progress.file}`);
    }
  });

  // Display results
  duplicates.forEach((group, index) => {
    console.log(`\nDuplicate Group ${index + 1}:`);
    group.files.forEach(file => {
      console.log(`  ${file.path} (similarity: ${file.similarity})`);
    });
  });
}

findDuplicates().catch(console.error);
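
Once groups are returned, a common next step is deciding which copy to keep. The sketch below assumes each group has the `{ files: [{ path, similarity }] }` shape used in the example above; `planCleanup` is a hypothetical helper, not part of the library, and it only produces a plan without deleting anything.

```javascript
// Hypothetical helper: choose one file to keep per duplicate group.
// Assumes the group shape shown in the README example:
//   { files: [{ path: string, similarity: number }, ...] }
function planCleanup(groups) {
  return groups
    // A group needs at least two files to contain a duplicate.
    .filter((group) => Array.isArray(group.files) && group.files.length >= 2)
    .map((group) => {
      // Sort a copy so the highest-similarity entry comes first.
      const sorted = [...group.files].sort((a, b) => b.similarity - a.similarity);
      const [keep, ...remove] = sorted;
      return { keep: keep.path, remove: remove.map((file) => file.path) };
    });
}
```

Reviewing the returned plan by hand before removing files is the safer choice, since similarity alone says nothing about which copy has the better encoding.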

Manual Fingerprint Comparison

const audioDuplicates = require('audio-duplicates');

async function compareFiles() {
  // Generate fingerprints
  const fp1 = await audioDuplicates.generateFingerprint('file1.mp3');
  const fp2 = await audioDuplicates.generateFingerprint('file2.mp3');

  // Compare fingerprints
  const result = await audioDuplicates.compareFingerprints(fp1, fp2);

  console.log('Similarity Score:', result.similarityScore);
  console.log('Are Duplicates:', result.isDuplicate);
  console.log('Confidence:', result.confidence);
}

compareFiles().catch(console.error);

Batch Processing with Index

const audioDuplicates = require('audio-duplicates');

async function batchProcess() {
  // Initialize index for batch processing
  await audioDuplicates.initializeIndex();

  // Add files to index
  const files = ['song1.mp3', 'song2.mp3', 'song3.mp3'];
  for (const file of files) {
    const fileId = await audioDuplicates.addFileToIndex(file);
    console.log(`Added ${file} with ID: ${fileId}`);
  }

  // Find all duplicates in the index
  const duplicateGroups = await audioDuplicates.findAllDuplicates();
  console.log('Found', duplicateGroups.length, 'duplicate groups');

  // Get index statistics
  const stats = await audioDuplicates.getIndexStats();
  console.log('Index Stats:', stats);

  // Clear index when done
  await audioDuplicates.clearIndex();
}

batchProcess().catch(console.error);
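
If an error is thrown between `initializeIndex()` and `clearIndex()`, the index in the example above is never cleared. One way to guard against that is a small wrapper with `try`/`finally`. This is a sketch, not part of the library: `withIndex` is a hypothetical name, and `lib` is assumed to expose the `initializeIndex()` and `clearIndex()` functions documented in this README.

```javascript
// Hypothetical wrapper: run `work` with an initialized index and always
// clear the index afterwards, even if `work` throws.
async function withIndex(lib, work) {
  await lib.initializeIndex();
  try {
    return await work(lib);
  } finally {
    // Runs on success and on error, so the index memory is released.
    await lib.clearIndex();
  }
}

// Usage (assuming the package is installed):
// const audioDuplicates = require('audio-duplicates');
// const groups = await withIndex(audioDuplicates, async (lib) => {
//   await lib.addFileToIndex('song1.mp3');
//   await lib.addFileToIndex('song2.mp3');
//   return lib.findAllDuplicates();
// });
```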

TypeScript Usage

import * as audioDuplicates from 'audio-duplicates';
import { DuplicateGroup, ScanOptions, Fingerprint } from 'audio-duplicates';

async function findDuplicatesTyped(): Promise<DuplicateGroup[]> {
  const options: ScanOptions = {
    threshold: 0.85,
    maxDuration: 300, // 5 minutes max
    onProgress: (progress: { current: number; total: number; file: string }) => {
      console.log(`${progress.current}/${progress.total}: ${progress.file}`);
    }
  };

  return await audioDuplicates.scanDirectoryForDuplicates('/path/to/music', options);
}

async function generateTypedFingerprint(filePath: string): Promise<Fingerprint> {
  return await audioDuplicates.generateFingerprint(filePath);
}

📖 API Reference

Core Functions

generateFingerprint(filePath: string): Promise<Fingerprint>

Generate an audio fingerprint from a file.

const fingerprint = await audioDuplicates.generateFingerprint('song.mp3');
console.log('Duration:', fingerprint.duration);
console.log('Sample Rate:', fingerprint.sampleRate);

generateFingerprintLimited(filePath: string, maxDuration: number): Promise<Fingerprint>

Generate fingerprint with duration limit (in seconds).

// Only fingerprint first 30 seconds
const fingerprint = await audioDuplicates.generateFingerprintLimited('song.mp3', 30);

compareFingerprints(fp1: Fingerprint, fp2: Fingerprint): Promise<MatchResult>

Compare two fingerprints and return similarity metrics.

const result = await audioDuplicates.compareFingerprints(fp1, fp2);
console.log('Similarity:', result.similarityScore); // 0.0 to 1.0
console.log('Is Duplicate:', result.isDuplicate);   // boolean
console.log('Confidence:', result.confidence);      // 0.0 to 1.0

Index Management

initializeIndex(): Promise<boolean>

Initialize the fingerprint index for batch processing.

addFileToIndex(filePath: string): Promise<number>

Add a file to the index and return its unique ID.

findAllDuplicates(): Promise<DuplicateGroup[]>

Find all duplicate groups in the current index.

getIndexStats(): Promise<IndexStats>

Get statistics about the current index.

const stats = await audioDuplicates.getIndexStats();
console.log('Files:', stats.fileCount);
console.log('Index Size:', stats.indexSize);
console.log('Load Factor:', stats.loadFactor);

clearIndex(): Promise<boolean>

Clear the current index and free memory.

Configuration

setSimilarityThreshold(threshold: number): Promise<boolean>

Set the similarity threshold (0.0 to 1.0) for duplicate detection.

await audioDuplicates.setSimilarityThreshold(0.9); // Stricter matching

High-Level Utilities

scanDirectoryForDuplicates(directory: string, options?: ScanOptions): Promise<DuplicateGroup[]>

Scan a directory for duplicates with progress reporting.

Options:

  • threshold?: number - Similarity threshold (default: 0.85)
  • maxDuration?: number - Max duration to fingerprint in seconds
  • onProgress?: (progress) => void - Progress callback
  • recursive?: boolean - Scan subdirectories (default: true)
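
To make the defaults explicit, the options can be validated before they are passed to `scanDirectoryForDuplicates`. The helper below is a hypothetical sketch: it applies the two defaults listed above (`threshold: 0.85`, `recursive: true`) and rejects out-of-range values. Whether the library performs similar checks internally is not stated here.

```javascript
// Hypothetical helper: validate scan options and fill in the documented defaults.
function normalizeScanOptions(options = {}) {
  const { threshold = 0.85, recursive = true, maxDuration, onProgress } = options;

  if (typeof threshold !== 'number' || Number.isNaN(threshold) || threshold < 0 || threshold > 1) {
    throw new RangeError('threshold must be a number between 0.0 and 1.0');
  }
  if (maxDuration !== undefined && !(typeof maxDuration === 'number' && maxDuration > 0)) {
    throw new RangeError('maxDuration must be a positive number of seconds');
  }
  if (onProgress !== undefined && typeof onProgress !== 'function') {
    throw new TypeError('onProgress must be a function');
  }

  // Only include optional keys that the caller actually provided.
  const normalized = { threshold, recursive: Boolean(recursive) };
  if (maxDuration !== undefined) normalized.maxDuration = maxDuration;
  if (onProgress !== undefined) normalized.onProgress = onProgress;
  return normalized;
}
```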

🖥️ CLI Reference

Commands

scan <directories...>

Scan directories for duplicate audio files.

# Basic scan
audio-duplicates scan /music

# Advanced options
audio-duplicates scan /music \
  --threshold 0.9 \
  --format json \
  --output results.json \
  --max-duration 180 \
  --no-progress

Options:

  • --threshold <number> - Similarity threshold (0.0-1.0, default: 0.85)
  • --format <format> - Output format: json, csv, or text (default: text)
  • --output <file> - Output file path
  • --max-duration <seconds> - Maximum duration to fingerprint
  • --no-progress - Disable progress bar
  • --recursive - Scan subdirectories (default: true)
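
When the CLI is driven from a script, the flags above can be assembled from an options object. The sketch below uses only the flags listed in this section; `buildScanArgs` and its option names (`progress`, `maxDuration`, and so on) are hypothetical and chosen for illustration.

```javascript
// Hypothetical helper: build the argument list for `audio-duplicates scan`.
const ALLOWED_FORMATS = ['json', 'csv', 'text'];

function buildScanArgs(directories, opts = {}) {
  if (!Array.isArray(directories) || directories.length === 0) {
    throw new Error('at least one directory is required');
  }
  if (opts.format !== undefined && !ALLOWED_FORMATS.includes(opts.format)) {
    throw new Error(`format must be one of: ${ALLOWED_FORMATS.join(', ')}`);
  }

  const args = ['scan', ...directories];
  if (opts.threshold !== undefined) args.push('--threshold', String(opts.threshold));
  if (opts.format !== undefined) args.push('--format', opts.format);
  if (opts.output !== undefined) args.push('--output', opts.output);
  if (opts.maxDuration !== undefined) args.push('--max-duration', String(opts.maxDuration));
  if (opts.progress === false) args.push('--no-progress');
  return args;
}
```

The resulting array can be passed to Node's `child_process.execFile('audio-duplicates', args, callback)`, which avoids shell quoting problems with paths that contain spaces.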

compare <file1> <file2>

Compare two audio files directly.

audio-duplicates compare song1.mp3 song2.wav --max-duration 60

fingerprint <file>

Generate and display fingerprint for an audio file.

audio-duplicates fingerprint song.mp3 --output fingerprint.json

Global Options

  • -v, --verbose - Verbose output with detailed information
  • --threshold <number> - Global similarity threshold
  • --format <format> - Global output format

📊 Performance

Benchmarks

On a modern CPU (Apple M1):

  • Fingerprint Generation: 2-5x real-time (faster than playback)
  • Index Lookup: ~1ms per query
  • Full Comparison: 10-50ms depending on file length
  • Memory Usage: ~4KB per minute of audio
  • Scalability: Efficiently handles 10,000+ files

Example Performance

Collection Size: 10,000 files (50GB)
Scan Time: ~8 minutes
Memory Usage: ~200MB
Duplicates Found: 847 groups (2,341 files)
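
The figures above can be cross-checked with simple arithmetic. The sketch below uses the ~4KB-per-minute figure from the benchmark list; the 4-minute average track length is an assumption made for illustration and does not come from the benchmark.

```javascript
// Rough estimates based on the README figures. These are back-of-envelope
// numbers, not measurements.
const FINGERPRINT_KB_PER_MINUTE = 4; // "~4KB per minute of audio"

// Estimated fingerprint data in MB for a collection.
function estimateFingerprintMemoryMB(fileCount, avgMinutesPerFile) {
  const totalKB = fileCount * avgMinutesPerFile * FINGERPRINT_KB_PER_MINUTE;
  return totalKB / 1024;
}

// Average scan throughput in files per second.
function filesPerSecond(fileCount, scanMinutes) {
  return fileCount / (scanMinutes * 60);
}

// 10,000 files at an assumed 4 minutes each:
console.log(estimateFingerprintMemoryMB(10000, 4)); // 156.25
// 10,000 files scanned in ~8 minutes:
console.log(filesPerSecond(10000, 8).toFixed(1)); // "20.8"
```

Under that assumption the fingerprint data alone comes to roughly 156MB, the same order of magnitude as the ~200MB reported above. The remainder may be index and runtime overhead, though the source does not break this down.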

🔧 Advanced Usage

Custom Similarity Thresholds

// Exact duplicates only (very strict)
await audioDuplicates.setSimilarityThreshold(0.95);

// Similar versions (more permissive)
await audioDuplicates.setSimilarityThreshold(0.75);

// Near-identical files (default)
await audioDuplicates.setSimilarityThreshold(0.85);
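
The three thresholds above can also be read as bands for labelling a similarity score after the fact. The helper below is one possible interpretation, not library behaviour: `classifySimilarity` and its labels are hypothetical, and the cut-offs simply reuse 0.95, 0.85, and 0.75.

```javascript
// Hypothetical helper: label a similarity score using the README's thresholds.
function classifySimilarity(score) {
  if (typeof score !== 'number' || Number.isNaN(score) || score < 0 || score > 1) {
    throw new RangeError('score must be a number between 0.0 and 1.0');
  }
  if (score >= 0.95) return 'exact';          // very strict threshold
  if (score >= 0.85) return 'near-identical'; // default threshold
  if (score >= 0.75) return 'similar';        // permissive threshold
  return 'different';
}
```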

Handling Large Collections

async function processLargeCollection(directories) {
  await audioDuplicates.initializeIndex();

  for (const dir of directories) {
    console.log(`Processing directory: ${dir}`);

    // Process in batches to manage memory
    const duplicates = await audioDuplicates.scanDirectoryForDuplicates(dir, {
      threshold: 0.85,
      maxDuration: 300, // Limit to 5 minutes per file
      onProgress: (progress) => {
        if (progress.current % 100 === 0) {
          console.log(`Processed ${progress.current}/${progress.total} files`);
        }
      }
    });

    console.log(`Found ${duplicates.length} duplicate groups in ${dir}`);
  }

  // Get final results
  const allDuplicates = await audioDuplicates.findAllDuplicates();
  console.log(`Total duplicate groups: ${allDuplicates.length}`);

  await audioDuplicates.clearIndex();
}

Output Formats

JSON Output

audio-duplicates scan /music --format json --output results.json

{
  "summary": {
    "totalFiles": 1500,
    "duplicateGroups": 23,
    "duplicateFiles": 67,
    "spaceWasted": "1.2GB"
  },
  "duplicateGroups": [
    {
      "groupId": 1,
      "avgSimilarity": 0.94,
      "files": [
        {
          "path": "/music/song1.mp3",
          "size": 5242880,
          "similarity": 1.0
        },
        {
          "path": "/music/copy/song1.mp3",
          "size": 5242880,
          "similarity": 0.94
        }
      ]
    }
  ]
}
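
A report in this shape can be post-processed in a few lines. The sketch below assumes the `duplicateGroups[].files[].size` fields shown above hold sizes in bytes, and treats the first file in each group as the copy to keep; `reclaimableBytes` is a hypothetical helper, not part of the package.

```javascript
// Hypothetical helper: sum the sizes of every file except the first one
// in each duplicate group.
function reclaimableBytes(report) {
  let total = 0;
  for (const group of report.duplicateGroups) {
    // Keep the first file in each group; count the rest as reclaimable.
    for (const file of group.files.slice(1)) {
      total += file.size;
    }
  }
  return total;
}

// Sample data copied from the JSON output above.
const sampleReport = {
  duplicateGroups: [
    {
      groupId: 1,
      avgSimilarity: 0.94,
      files: [
        { path: '/music/song1.mp3', size: 5242880, similarity: 1.0 },
        { path: '/music/copy/song1.mp3', size: 5242880, similarity: 0.94 }
      ]
    }
  ]
};

console.log(reclaimableBytes(sampleReport)); // 5242880
```

For a real report, the object would come from `JSON.parse(fs.readFileSync('results.json', 'utf8'))`.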

CSV Output

audio-duplicates scan /music --format csv --output results.csv

🐛 Troubleshooting

Common Issues

Build Errors

# macOS: Install dependencies
brew install chromaprint libsndfile

# Ubuntu: Install dependencies
sudo apt-get install libchromaprint-dev libsndfile1-dev

# Clear npm cache and rebuild
npm cache clean --force
npm rebuild

Runtime Errors

"Could not locate bindings file"

npm run build

"Failed to open audio file"

  • Check file format is supported
  • Verify file permissions
  • Ensure file is not corrupted

"Index not initialized"

// Always initialize before using index functions
await audioDuplicates.initializeIndex();

Performance Optimization

For large collections:

  • Use maxDuration to limit fingerprint length
  • Process directories in batches
  • Increase similarity threshold for faster results
  • Use SSD storage for audio files

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Run the test suite: npm test
  5. Submit a pull request

Development Setup

git clone https://github.com/mcande21/audio-duplicates.git
cd audio-duplicates
npm install
npm run build
npm test

📄 License

MIT License - see LICENSE file for details.

Happy duplicate hunting! 🎵