amazon-science/uniqsketch v1.2.1
amazon-science/uniqsketch
Captured source
source ↗published Apr 6, 2026seen 5dcaptured 15hhttp 200method plain
v1.2.1: performance, clustering, C++17 modernization
Repository: amazon-science/uniqsketch
Tag: v1.2.1
Published: 2026-04-06T16:09:09Z
Prerelease: no
Release notes:
Performance
- Replaced kseq++ (C++ wrapper) with kseq.h (Heng Li) for faster sequence I/O (~13% speedup in querysketch)
- Batch-read pattern in querySample eliminates per-read lock contention under OpenMP
- Rolling hash in lowComplexity filter replaces per-position recomputation
- Stack-allocated spectrum arrays in lowComplexity replace heap-allocated vectors
- Replaced std::endl with "\n" in hot loops to avoid unnecessary stream flushes
- BloomFilter constructor uses value-initialized allocation instead of manual zeroing
- comparesketch pre-caches all k-mer hashes to avoid redundant file I/O (M+N reads instead of M×N)
- comparesketch reuses per-thread Bloom filters with memset instead of re-allocating
- Sort lambda and range-for loops use const references to avoid copies
Features
- Genomically spaced signature selection in buildSketch — signatures are evenly distributed across reference genome positions instead of randomly shuffled
- Added read-level log output (logread_*.tsv) in querysketch listing matched read IDs per reference
- comparesketch now supports compressed FASTA input (.fa.gz) via kseq.h/zlib
- Added CMake build system (CMakeLists.txt) alongside existing Makefile
- Added
--cluster=Noption to uniqsketch: automatically clusters nearly identical references using single-linkage union-find on pairwise unique k-mer counts, selects a representative per cluster, and outputsclusters.tsvmapping all members to their representatives
Error Handling
- Added file existence validation for all input files across all three tools
- Typo'd filenames now produce clear error messages instead of core dumps
- Missing @list files are caught before processing
- Bloom filter error message now reports k-mer counts, k value, and suggests causes/fixes
- Empty lines in list files are skipped
Code Quality
- Modern C++17 style throughout: nullptr, using aliases, static_cast, structured bindings
- Removed using namespace std from headers (BloomFilter.hpp, ntcard.hpp)
- All headers are self-contained with explicit includes
- Fixed typo: identifyDiffernce → identifyDifference
- Fixed version string bug in querysketch ("querysketch1.1.0" → "querysketch Version 1.1.0")
- Fixed signed/unsigned comparison warnings in readCutoff checks
- Consistent Doxygen-style documentation on all public functions
- Removed unused b2c lookup table from SequenceUtil.hpp
- Simplified getBaseId to handle edge cases (no slash, no extension, multiple dots)
Tests
- Added comparesketchtest: loadBloomFilter, checkBloomFilter (identical and different references)
- Added BloomFilter tests: empty filter, get_pop, store/load roundtrip
- Extended getBaseId tests: multiple dots, no extension, no directory separator
- Total: 15 tests across 4 test suites (up from 9 across 3)
Dependencies
- Replaced kseq++ (kseq++.hpp, seqio.hpp, config.hpp) with kseq.h from seqtk
- Updated THIRD-PARTY-LICENSES accordingly
Notability
notability 3.0/10Routine minor release, no notable traction.