Audio Duplicates
A high-performance audio duplicate detection library built with native C++ and Chromaprint fingerprinting technology. Quickly find duplicate audio files across large collections with robust detection that handles different encodings, bitrates, and formats.
✨ Features
- 🚀 High Performance: Native C++ implementation ~200x faster than JavaScript
- 🧠 Memory Optimized: 80-90% lower memory usage than previous versions, with advanced pool management
- ⚡ Parallel Processing: Multi-threaded scanning with configurable concurrency
- 🎵 Format Support: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, and more
- ⚡ Fast Matching: Optimized inverted index for O(1) duplicate lookups
- 🔧 Robust Detection: Handles different bitrates, sample rates, and encodings
- 💻 CLI Tool: Full-featured command-line interface with memory monitoring
- 📝 TypeScript Support: Complete TypeScript definitions included
- 🌍 Cross-Platform: Windows, macOS, and Linux support
- 📊 Progress Reporting: Real-time progress bars and detailed statistics
- 🔍 Memory Monitoring: Live memory usage tracking and automatic cleanup
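The "inverted index" mentioned in the feature list is a general technique, and the sketch below illustrates the idea only; it is not the library's native implementation. It assumes a fingerprint is an array of integers and maps each value to the set of files containing it, so candidate matches come from hash-map lookups rather than from comparing every pair of files. The names InvertedIndex, add, and candidates are hypothetical.

```javascript
// Illustrative only: a tiny inverted index over integer fingerprint values.
class InvertedIndex {
  constructor() {
    this.postings = new Map(); // fingerprint value -> Set of file IDs
  }

  // Register every distinct value of a file's fingerprint.
  add(fileId, fingerprint) {
    for (const value of new Set(fingerprint)) {
      if (!this.postings.has(value)) this.postings.set(value, new Set());
      this.postings.get(value).add(fileId);
    }
  }

  // Count how many distinct values each indexed file shares with the query,
  // and return [fileId, sharedCount] pairs, best candidates first.
  candidates(fingerprint) {
    const shared = new Map();
    for (const value of new Set(fingerprint)) {
      for (const fileId of this.postings.get(value) ?? []) {
        shared.set(fileId, (shared.get(fileId) ?? 0) + 1);
      }
    }
    return [...shared.entries()].sort((a, b) => b[1] - a[1]);
  }
}

// Made-up fingerprints, for illustration.
const index = new InvertedIndex();
index.add('a.mp3', [101, 202, 303, 404]);
index.add('b.flac', [101, 202, 303, 999]);
index.add('c.wav', [555, 666, 777, 888]);

console.log(index.candidates([101, 202, 303, 404]));
```

Only files that share at least one value with the query are ever touched, which is what keeps lookups cheap as the collection grows. A real matcher would still run a full comparison on the shortlisted candidates.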
📦 Installation
Prerequisites
Install the required system libraries first:
macOS
brew install chromaprint libsndfile
Ubuntu/Debian
sudo apt-get update
sudo apt-get install libchromaprint-dev libsndfile1-dev
Windows
Download and install Chromaprint and libsndfile.
Install the Package
Global Installation (Recommended for CLI)
npm install -g audio-duplicates
Local Installation (for API usage)
npm install audio-duplicates
The package automatically uses prebuilt binaries when available, falling back to source compilation if needed.
Dependencies
Production Dependencies
- chalk (^4.1.2) - Terminal string styling for colorized CLI output
- cli-progress (^3.12.0) - Real-time progress bars for CLI operations
- commander (^11.0.0) - Command-line interface framework
- node-addon-api (^7.0.0) - Native addon development API
- node-gyp-build (^4.8.0) - Prebuilt binary loading and fallback
- p-limit (^7.1.1) - Concurrency control for parallel processing
- prebuild-install (^7.1.1) - Prebuilt binary installation
Development Dependencies
- chai (^4.3.7) - Assertion library for testing
- mocha (^10.2.0) - JavaScript test framework
- node-gyp (^10.0.0) - Native addon build tool
- prebuildify (^6.0.0) - Prebuilt binary generation
System Requirements
- Node.js: >=18.0.0
- System Libraries: Chromaprint, libsndfile
🚀 Quick Start
CLI Usage
Scan for Duplicates
# Basic scan
audio-duplicates scan /path/to/music
# Scan multiple directories
audio-duplicates scan /music/collection1 /music/collection2
# High-performance parallel scanning with memory management
audio-duplicates scan /path/to/music --parallel --threads 8 --memory-limit 512 --memory-stats
# Custom file types and threshold
audio-duplicates scan /path/to/music --extensions "mp3,flac,wav" --threshold 0.9
# Save results with progress tracking
audio-duplicates scan /path/to/music --output duplicates.json --format json --verbose
# Limit fingerprint duration for large files
audio-duplicates scan /path/to/music --max-duration 180
Compare Two Files
# Basic comparison
audio-duplicates compare song1.mp3 song2.mp3
# Compare with duration limit
audio-duplicates compare song1.mp3 song2.mp3 --max-duration 60
# Verbose comparison with detailed output
audio-duplicates compare song1.mp3 song2.mp3 --verbose
Generate Fingerprint
# Generate and display fingerprint
audio-duplicates fingerprint song.mp3
# Save fingerprint to file
audio-duplicates fingerprint song.mp3 --output fingerprint.json
# Limit fingerprint to first 30 seconds
audio-duplicates fingerprint song.mp3 --max-duration 30
API Usage
Basic Duplicate Detection
const audioDuplicates = require('audio-duplicates');
async function findDuplicates() {
  // Scan directory for duplicates
  const duplicates = await audioDuplicates.scanDirectoryForDuplicates('/path/to/music', {
    threshold: 0.85,
    onProgress: (progress) => {
      console.log(`Processing: ${progress.current}/${progress.total} - ${progress.file}`);
    }
  });
  // Display results
  duplicates.forEach((group, index) => {
    console.log(`\nDuplicate Group ${index + 1}:`);
    group.files.forEach(file => {
      console.log(`  ${file.path} (similarity: ${file.similarity})`);
    });
  });
}
findDuplicates().catch(console.error);
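Once the groups are returned, a common next step is deciding which copy in each group to keep. The sketch below assumes the group shape used in the example above ({ files: [{ path, similarity }] }). The helper name planCleanup and its keep rule (highest similarity first, shorter path on ties) are hypothetical choices for illustration, not part of the library.

```javascript
// Hypothetical helper: split each duplicate group into one file to keep
// and the remaining candidates for removal. Nothing is deleted here.
function planCleanup(groups) {
  return groups.map((group) => {
    // Copy before sorting so the caller's array is left untouched.
    const ranked = [...group.files].sort(
      (a, b) => b.similarity - a.similarity || a.path.length - b.path.length
    );
    const [keep, ...remove] = ranked;
    return { keep: keep.path, remove: remove.map((f) => f.path) };
  });
}

// Sample data in the shape used by the example above (values are made up).
const sampleGroups = [
  {
    files: [
      { path: '/music/copy/song1.mp3', similarity: 0.94 },
      { path: '/music/song1.mp3', similarity: 1.0 },
    ],
  },
];

console.log(planCleanup(sampleGroups));
```

Returning a plan instead of deleting files directly lets you review the result before anything is removed.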
Manual Fingerprint Comparison
const audioDuplicates = require('audio-duplicates');
async function compareFiles() {
  // Generate fingerprints
  const fp1 = await audioDuplicates.generateFingerprint('file1.mp3');
  const fp2 = await audioDuplicates.generateFingerprint('file2.mp3');
  // Compare fingerprints
  const result = await audioDuplicates.compareFingerprints(fp1, fp2);
  console.log('Similarity Score:', result.similarityScore);
  console.log('Are Duplicates:', result.isDuplicate);
  console.log('Confidence:', result.confidence);
}
compareFiles().catch(console.error);
Batch Processing with Index
const audioDuplicates = require('audio-duplicates');
async function batchProcess() {
  // Initialize index for batch processing
  await audioDuplicates.initializeIndex();
  // Add files to index
  const files = ['song1.mp3', 'song2.mp3', 'song3.mp3'];
  for (const file of files) {
    const fileId = await audioDuplicates.addFileToIndex(file);
    console.log(`Added ${file} with ID: ${fileId}`);
  }
  // Find all duplicates in the index
  const duplicateGroups = await audioDuplicates.findAllDuplicates();
  console.log('Found', duplicateGroups.length, 'duplicate groups');
  // Get index statistics
  const stats = await audioDuplicates.getIndexStats();
  console.log('Index Stats:', stats);
  // Clear index when done
  await audioDuplicates.clearIndex();
}
batchProcess().catch(console.error);
TypeScript Usage
import * as audioDuplicates from 'audio-duplicates';
import { DuplicateGroup, ScanOptions, Fingerprint } from 'audio-duplicates';
async function findDuplicatesTyped(): Promise<DuplicateGroup[]> {
  const options: ScanOptions = {
    threshold: 0.85,
    maxDuration: 300, // 5 minutes max
    onProgress: (progress: { current: number; total: number; file: string }) => {
      console.log(`${progress.current}/${progress.total}: ${progress.file}`);
    }
  };
  return await audioDuplicates.scanDirectoryForDuplicates('/path/to/music', options);
}
async function generateTypedFingerprint(filePath: string): Promise<Fingerprint> {
  return await audioDuplicates.generateFingerprint(filePath);
}
📖 API Reference
Core Functions
generateFingerprint(filePath: string): Promise<Fingerprint>
Generate an audio fingerprint from a file.
const fingerprint = await audioDuplicates.generateFingerprint('song.mp3');
console.log('Duration:', fingerprint.duration);
console.log('Sample Rate:', fingerprint.sampleRate);
generateFingerprintLimited(filePath: string, maxDuration: number): Promise<Fingerprint>
Generate fingerprint with duration limit (in seconds).
// Only fingerprint first 30 seconds
const fingerprint = await audioDuplicates.generateFingerprintLimited('song.mp3', 30);
compareFingerprints(fp1: Fingerprint, fp2: Fingerprint): Promise<MatchResult>
Compare two fingerprints and return similarity metrics.
const result = await audioDuplicates.compareFingerprints(fp1, fp2);
console.log('Similarity:', result.similarityScore); // 0.0 to 1.0
console.log('Is Duplicate:', result.isDuplicate); // boolean
console.log('Confidence:', result.confidence); // 0.0 to 1.0
Index Management
initializeIndex(): Promise<boolean>
Initialize the fingerprint index for batch processing.
addFileToIndex(filePath: string): Promise<number>
Add a file to the index and return its unique ID.
findAllDuplicates(): Promise<DuplicateGroup[]>
Find all duplicate groups in the current index.
getIndexStats(): Promise<IndexStats>
Get statistics about the current index.
const stats = await audioDuplicates.getIndexStats();
console.log('Files:', stats.fileCount);
console.log('Index Size:', stats.indexSize);
console.log('Load Factor:', stats.loadFactor);
clearIndex(): Promise<boolean>
Clear the current index and free memory.
Memory Management (v1.1.2)
getMemoryPoolStats(): Promise<MemoryPoolStats>
Get detailed statistics about native memory pool usage.
const stats = await audioDuplicates.getMemoryPoolStats();
console.log('Peak Usage:', (stats.peakUsage / 1024 / 1024).toFixed(1) + 'MB');
console.log('Total Allocated:', (stats.totalAllocated / 1024 / 1024).toFixed(1) + 'MB');
console.log('Active Allocations:', stats.activeAllocations);
getStreamingStats(): Promise<StreamingStats>
Get statistics about streaming audio processing.
const stats = await audioDuplicates.getStreamingStats();
console.log('Files Processed:', stats.filesProcessed);
console.log('Total Duration:', stats.totalDuration + 's');
console.log('Average Processing Speed:', stats.avgProcessingSpeed + 'x realtime');
clearMemoryPool(): Promise<boolean>
Force cleanup of the native memory pool.
// Clean up memory after processing
await audioDuplicates.clearMemoryPool();
Configuration
setSimilarityThreshold(threshold: number): Promise<boolean>
Set the similarity threshold (0.0 to 1.0) for duplicate detection.
await audioDuplicates.setSimilarityThreshold(0.9); // Stricter matching
High-Level Utilities
scanDirectoryForDuplicates(directory: string, options?: ScanOptions): Promise<DuplicateGroup[]>
Scan a directory for duplicates with progress reporting (sequential processing).
scanDirectoryForDuplicatesParallel(directory: string, options?: ScanOptions): Promise<DuplicateGroup[]>
Scan a directory for duplicates using parallel processing for improved performance.
scanMultipleDirectoriesForDuplicates(directories: string[], options?: ScanOptions): Promise<DuplicateGroup[]>
Scan multiple directories for duplicates across all directories.
ScanOptions:
- threshold?: number - Similarity threshold (default: 0.85)
- maxDuration?: number - Max duration to fingerprint in seconds
- extensions?: string[] - File extensions to scan (default: ['.wav'])
- concurrency?: number - Number of concurrent operations for parallel processing
- onProgress?: (progress) => void - Progress callback with detailed information
- recursive?: boolean - Scan subdirectories (default: true)
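The documented defaults can be applied and validated before a scan starts, so that bad input fails early. The following is a minimal sketch under stated assumptions: normalizeScanOptions and DOCUMENTED_DEFAULTS are hypothetical names, and the library may well do its own validation internally.

```javascript
// Defaults as listed in ScanOptions; the names here are hypothetical.
const DOCUMENTED_DEFAULTS = { threshold: 0.85, extensions: ['.wav'], recursive: true };

function normalizeScanOptions(options = {}) {
  const merged = { ...DOCUMENTED_DEFAULTS, ...options };
  // The similarity threshold is documented as a value from 0.0 to 1.0.
  if (!Number.isFinite(merged.threshold) || merged.threshold < 0 || merged.threshold > 1) {
    throw new RangeError('threshold must be a number between 0.0 and 1.0');
  }
  // Accept "mp3" or ".MP3" and emit ".mp3", the dotted form used in the defaults.
  merged.extensions = merged.extensions.map((ext) => {
    const lower = String(ext).trim().toLowerCase();
    return lower.startsWith('.') ? lower : `.${lower}`;
  });
  return merged;
}

console.log(normalizeScanOptions({ extensions: ['mp3', '.FLAC'], maxDuration: 180 }));
```

The normalized object can then be passed as the options argument of any of the scan functions.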
Progress Callback Details:
The onProgress callback receives detailed progress information:
const options = {
  onProgress: (progress) => {
    switch (progress.phase) {
      case 'discovery':
        console.log(`Found ${progress.audioFiles} audio files`);
        break;
      case 'processing':
        console.log(`Processing: ${progress.current}/${progress.total} - ${progress.file}`);
        if (progress.parallel) {
          console.log(`Running ${progress.concurrency} threads`);
        }
        break;
      case 'duplicate_detection':
        console.log('Analyzing fingerprints for duplicates...');
        break;
    }
  }
};
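For large scans, logging every file can flood the terminal. The sketch below wraps the phases shown above in a throttled callback. createProgressLogger and its every and log parameters are hypothetical, and the events are simulated so the example runs without the native addon.

```javascript
// Hypothetical factory: returns an onProgress callback that logs at most
// once every `every` files during the processing phase, plus the final file.
function createProgressLogger({ every = 100, log = console.log } = {}) {
  return (progress) => {
    if (progress.phase === 'discovery') {
      log(`Found ${progress.audioFiles} audio files`);
    } else if (progress.phase === 'processing') {
      const isLast = progress.current === progress.total;
      if (progress.current % every === 0 || isLast) {
        const pct = ((progress.current / progress.total) * 100).toFixed(1);
        log(`Processed ${progress.current}/${progress.total} (${pct}%)`);
      }
    } else if (progress.phase === 'duplicate_detection') {
      log('Analyzing fingerprints for duplicates...');
    }
  };
}

// Simulated events in the shape described above.
const lines = [];
const onProgress = createProgressLogger({ every: 2, log: (line) => lines.push(line) });
onProgress({ phase: 'discovery', audioFiles: 3 });
for (let current = 1; current <= 3; current++) {
  onProgress({ phase: 'processing', current, total: 3, file: `track${current}.wav` });
}
onProgress({ phase: 'duplicate_detection' });
console.log(lines.join('\n'));
```

Injecting the log function keeps the callback easy to test and lets you route messages to a file or a progress bar instead of the console.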
🖥️ CLI Reference
Commands
scan <directories...>
Scan directories for duplicate audio files with advanced performance features.
# Basic scan
audio-duplicates scan /music
# High-performance scan with all features
audio-duplicates scan /music \
--parallel \
--threads 8 \
--memory-limit 512 \
--memory-stats \
--threshold 0.9 \
--extensions "mp3,flac,wav,m4a" \
--format json \
--output results.json \
--max-duration 180 \
--verbose
Performance Options:
- --parallel - Enable parallel processing for faster scanning
- -j, --threads <number> - Number of threads for parallel processing (0=auto, default: CPU count)
- --memory-limit <mb> - Memory limit in MB (default: 256)
- --memory-stats - Show detailed memory statistics during processing
Detection Options:
- --threshold <number> - Similarity threshold (0.0-1.0, default: 0.85)
- --extensions <extensions> - File extensions to scan (comma-separated, default: wav)
- --max-duration <seconds> - Maximum duration to fingerprint per file
Output Options:
- --format <format> - Output format: json, csv, or text (default: text)
- --output <file> - Output file path
- --no-progress - Disable progress bar
- --recursive - Scan subdirectories (default: true)
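When the CLI is driven from a Node.js script, building the argument list programmatically avoids shell-quoting mistakes. The sketch below uses only the scan flags documented above. buildScanArgs and its option names (memoryLimit, memoryStats, and so on) are hypothetical, and the block only prints the command rather than running it.

```javascript
// Hypothetical helper: turn an options object into the documented scan flags.
function buildScanArgs(directories, opts = {}) {
  const args = ['scan', ...directories];
  if (opts.parallel) args.push('--parallel');
  if (opts.threads !== undefined) args.push('--threads', String(opts.threads));
  if (opts.memoryLimit !== undefined) args.push('--memory-limit', String(opts.memoryLimit));
  if (opts.memoryStats) args.push('--memory-stats');
  if (opts.threshold !== undefined) args.push('--threshold', String(opts.threshold));
  if (opts.extensions?.length) args.push('--extensions', opts.extensions.join(','));
  if (opts.maxDuration !== undefined) args.push('--max-duration', String(opts.maxDuration));
  if (opts.format) args.push('--format', opts.format);
  if (opts.output) args.push('--output', opts.output);
  if (opts.progress === false) args.push('--no-progress');
  return args;
}

const args = buildScanArgs(['/music'], {
  parallel: true,
  threads: 8,
  threshold: 0.9,
  extensions: ['mp3', 'flac'],
  format: 'json',
  output: 'results.json',
});
console.log(['audio-duplicates', ...args].join(' '));
```

Assuming the CLI is installed and on your PATH, the resulting array can be handed to Node's child_process.spawn('audio-duplicates', args, { stdio: 'inherit' }).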
compare <file1> <file2>
Compare two audio files directly.
# Basic comparison
audio-duplicates compare song1.mp3 song2.wav
# Advanced comparison
audio-duplicates compare song1.mp3 song2.wav --max-duration 60 --verbose
Options:
- --max-duration <seconds> - Maximum duration to fingerprint
fingerprint <file>
Generate and display fingerprint for an audio file.
# Generate fingerprint
audio-duplicates fingerprint song.mp3
# Save to file with duration limit
audio-duplicates fingerprint song.mp3 --output fingerprint.json --max-duration 30
Options:
- --output <file> - Output file path
- --max-duration <seconds> - Maximum duration to fingerprint
Global Options
These options apply to all commands:
- -v, --verbose - Verbose output with detailed information and memory stats
- --threshold <number> - Global similarity threshold (0.0-1.0)
- --format <format> - Global output format (json|csv|text)
- -j, --threads <number> - Global thread count for parallel operations
📊 Performance
Benchmarks
On a modern CPU (Apple M1):
- Fingerprint Generation: 2-5x real-time (faster than playback)
- Index Lookup: ~1ms per query
- Full Comparison: 10-50ms depending on file length
- Memory Usage: ~4KB per minute of audio
- Scalability: Efficiently handles 10,000+ files
Memory Optimization Features (v1.1.2)
Advanced Memory Management:
- Memory Pool: Efficient native memory allocation with automatic cleanup
- Streaming Processing: Large files processed in chunks to minimize memory footprint
- Garbage Collection: Automatic memory cleanup with configurable limits
- Memory Monitoring: Real-time tracking of both Node.js and native memory usage
Performance Monitoring:
# Enable memory statistics during scanning
audio-duplicates scan /music --memory-stats --memory-limit 256
Example Memory Statistics Output:
🧠 Memory Statistics:
Peak Node.js memory: 89.2MB heap + 156.4MB external
Native memory pool: 45.7MB peak usage
Total allocated: 892.3MB
Memory warnings: 0
Memory pool cleared
Example Performance
Collection Size: 10,000 files (50GB)
Scan Time: ~6 minutes (parallel) / ~8 minutes (sequential)
Memory Usage: ~80MB (with optimization) / ~200MB (without)
Duplicates Found: 847 groups (2,341 files)
Memory Reduction: 80-90% vs previous versions
🔧 Advanced Usage
Custom Similarity Thresholds
// Exact duplicates only (very strict)
await audioDuplicates.setSimilarityThreshold(0.95);
// Similar versions (more permissive)
await audioDuplicates.setSimilarityThreshold(0.75);
// Near-identical files (default)
await audioDuplicates.setSimilarityThreshold(0.85);
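Instead of rescanning at each threshold, one option is to scan once with a permissive threshold and triage the scores afterwards. The sketch below reuses the three example thresholds above as tiers. classifySimilarity and the tier labels are hypothetical, and the approach assumes the reported similarity scores are on the same 0.0 to 1.0 scale as the threshold.

```javascript
// Tiers mirror the three example thresholds above; labels are illustrative.
const TIERS = [
  { min: 0.95, label: 'exact duplicate' },
  { min: 0.85, label: 'near-identical' },
  { min: 0.75, label: 'similar version' },
];

// Return the label of the strictest tier the score satisfies.
function classifySimilarity(score) {
  const tier = TIERS.find((t) => score >= t.min);
  return tier ? tier.label : 'distinct';
}

for (const score of [0.97, 0.9, 0.8, 0.5]) {
  console.log(score.toFixed(2), '->', classifySimilarity(score));
}
```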
Handling Large Collections
const audioDuplicates = require('audio-duplicates');
const MemoryMonitor = require('audio-duplicates/lib/memory_monitor');
async function processLargeCollection(directories) {
  // Set up memory monitoring
  const memoryMonitor = new MemoryMonitor({
    memoryLimitMB: 512,
    enabled: true
  });
  memoryMonitor.start();
  memoryMonitor.onMemoryWarning((totalMB, ratio) => {
    console.log(`⚠️ Memory warning: ${totalMB.toFixed(1)}MB (${(ratio * 100).toFixed(1)}%)`);
  });
  await audioDuplicates.initializeIndex();
  for (const dir of directories) {
    console.log(`Processing directory: ${dir}`);
    // Use parallel processing for large collections
    const duplicates = await audioDuplicates.scanDirectoryForDuplicatesParallel(dir, {
      threshold: 0.85,
      maxDuration: 300, // Limit to 5 minutes per file
      concurrency: 8, // Use 8 threads
      extensions: ['.mp3', '.flac', '.wav', '.m4a'],
      onProgress: (progress) => {
        if (progress.phase === 'processing' && progress.current % 100 === 0) {
          console.log(`Processed ${progress.current}/${progress.total} files [${progress.concurrency} threads]`);
        }
      }
    });
    console.log(`Found ${duplicates.length} duplicate groups in ${dir}`);
    // Get memory statistics
    const poolStats = await audioDuplicates.getMemoryPoolStats();
    console.log(`Memory usage: ${(poolStats.peakUsage / 1024 / 1024).toFixed(1)}MB`);
  }
  // Get final results
  const allDuplicates = await audioDuplicates.findAllDuplicates();
  console.log(`Total duplicate groups: ${allDuplicates.length}`);
  // Cleanup
  await audioDuplicates.clearMemoryPool();
  await audioDuplicates.clearIndex();
  memoryMonitor.stop();
}
Output Formats
JSON Output
audio-duplicates scan /music --format json --output results.json
{
  "summary": {
    "totalFiles": 1500,
    "duplicateGroups": 23,
    "duplicateFiles": 67,
    "spaceWasted": "1.2GB"
  },
  "duplicateGroups": [
    {
      "groupId": 1,
      "avgSimilarity": 0.94,
      "files": [
        {
          "path": "/music/song1.mp3",
          "size": 5242880,
          "similarity": 1.0
        },
        {
          "path": "/music/copy/song1.mp3",
          "size": 5242880,
          "similarity": 0.94
        }
      ]
    }
  ]
}
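A saved JSON report can be post-processed with a few lines of Node.js. The sketch below assumes the report shape shown above and estimates how many bytes would be freed by keeping only the largest file in each group. reclaimableBytes is a hypothetical helper, and the report is inlined here so the example runs on its own.

```javascript
// Sample report in the shape shown above (trimmed to the fields used here).
const report = {
  duplicateGroups: [
    {
      groupId: 1,
      avgSimilarity: 0.94,
      files: [
        { path: '/music/song1.mp3', size: 5242880, similarity: 1.0 },
        { path: '/music/copy/song1.mp3', size: 5242880, similarity: 0.94 },
      ],
    },
  ],
};

// Bytes that could be reclaimed if only the largest file in a group is kept.
function reclaimableBytes(group) {
  const sizes = group.files.map((f) => f.size);
  return sizes.reduce((sum, s) => sum + s, 0) - Math.max(...sizes);
}

const totalBytes = report.duplicateGroups.reduce((sum, g) => sum + reclaimableBytes(g), 0);
console.log(`Reclaimable: ${(totalBytes / 1024 / 1024).toFixed(1)} MB`);
```

With a real report, the inline object would be replaced by JSON.parse(fs.readFileSync('results.json', 'utf8')).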
CSV Output
audio-duplicates scan /music --format csv --output results.csv
🐛 Troubleshooting
Common Issues
Build Errors
# macOS: Install dependencies
brew install chromaprint libsndfile
# Ubuntu: Install dependencies
sudo apt-get install libchromaprint-dev libsndfile1-dev
# Clear npm cache and rebuild
npm cache clean --force
npm rebuild
Runtime Errors
"Could not locate bindings file"
npm run build
"Failed to open audio file"
- Check file format is supported
- Verify file permissions
- Ensure file is not corrupted
"Index not initialized"
// Always initialize before using index functions
await audioDuplicates.initializeIndex();
Performance Optimization
For large collections:
- Enable parallel processing: Use the --parallel flag for multi-threaded scanning
- Configure memory limits: Set --memory-limit to prevent excessive memory usage
- Optimize thread count: Use --threads to match your CPU cores
- Limit fingerprint duration: Use --max-duration to process only file segments
- Monitor memory usage: Enable --memory-stats for detailed memory tracking
- Process in batches: Scan directories separately for very large collections
- Increase similarity threshold: Higher thresholds (0.9+) reduce processing time
- Use SSD storage: Faster I/O significantly improves performance
- Specify file types: Use --extensions to scan only needed formats
🤝 Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and add tests
- Run the test suite: npm test
- Submit a pull request
Development Setup
git clone https://github.com/mcande21/audio-duplicates.git
cd audio-duplicates
npm install
npm run build
npm test
📄 License
MIT License - see LICENSE file for details.
🙏 Acknowledgments
- Chromaprint - Audio fingerprinting library
- libsndfile - Audio file I/O library
- Node-API - Native addon interface
🔗 Related Projects
- AcoustID - Audio identification service
- fpcalc - Command-line fingerprinting tool
- MusicBrainz - Music metadata database
Happy duplicate hunting! 🎵