sarvamai/lhotse-asr
forked from lhotse-speech/lhotse
Description: Tools for handling multimodal data in machine learning projects.
License: Apache-2.0
Stars: 0
Forks: 1
Open issues: 0
Created: 2025-06-29T07:21:57Z
Pushed: 2025-07-01T14:23:25Z
Default branch: master
Fork: yes
Parent repository: lhotse-speech/lhotse
Archived: no
README:
Lhotse
Lhotse is a Python library aiming to make multimodal (speech, audio, video, image, text) data preparation flexible and accessible to a wider community. Alongside k2, it is a part of the next generation Kaldi speech processing library.
Tutorial presentations and materials
- (_Interspeech 2023_) Tutorial notebook 
- (_Interspeech 2023_) Tutorial slides
- (_Interspeech 2021_) Recorded lecture (3h)
About
Main goals (updated for 2025)
- Scale to multimodal data pipelines including audio, text, image, and video modalities.
- Provide state-of-the-art dataloading algorithms such as dataset blending and efficient on-the-fly bucketing.
- Handle data randomization (or de-duplication) for distributed multi-node training.
- Attract a wider community to multimodal processing tasks with a Python-centric design.
- Provide standard data preparation recipes for commonly used corpora.
- Flexible data preparation for model training with the notion of audio/video cuts.
- Support for efficient sequential I/O data formats such as Lhotse Shar (similar to webdataset).
Tutorials
The following tutorials are available in the examples directory:
- Basic complete Lhotse workflow 
- Transforming data with Cuts 
- WebDataset integration 
- How to combine multiple datasets 
- Lhotse Shar: storage format optimized for sequential I/O and modularity 
- Image and Video Support in Lhotse 
Examples of use
Check out the following links to see how Lhotse is being put to use:
- Icefall recipes: where k2 and Lhotse meet.
- Minimal ESPnet+Lhotse example
Main ideas
Like Kaldi, Lhotse provides standard data preparation recipes, but extends that with a seamless PyTorch integration through task-specific Dataset classes. The data and meta-data are represented in human-readable text manifests and exposed to the user through convenient Python classes.
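The manifest idea can be illustrated with a toy example. The sketch below is not Lhotse's implementation: `ToyRecording`, its fields, and the helper functions are hypothetical simplifications. It assumes only that a manifest is a text file with one JSON object per line, which is loaded back into plain Python objects.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToyRecording:
    """Hypothetical, simplified stand-in for one recording manifest entry."""

    id: str
    path: str
    sampling_rate: int
    num_samples: int

    @property
    def duration(self) -> float:
        # Duration in seconds is derived from the stored metadata.
        return self.num_samples / self.sampling_rate


def to_jsonl(recordings):
    # Serialize each entry as one human-readable JSON object per line.
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in recordings)


def from_jsonl(text):
    # Parse the text manifest back into Python objects, skipping blank lines.
    return [ToyRecording(**json.loads(line)) for line in text.splitlines() if line.strip()]


recordings = [
    ToyRecording("rec-1", "audio/rec-1.wav", 16000, 48000),
    ToyRecording("rec-2", "audio/rec-2.wav", 16000, 80000),
]
manifest_text = to_jsonl(recordings)
restored = from_jsonl(manifest_text)
```

The manifest stores only metadata (identifiers, paths, sample counts), so it stays small and can be inspected or edited with ordinary text tools.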
Lhotse introduces the notion of audio cuts, designed to ease the training data construction with operations such as mixing, truncation and padding that are performed on-the-fly to minimize the amount of storage required. Data augmentation and feature extraction are supported both in pre-computed mode, with highly-compressed feature matrices stored on disk, and on-the-fly mode that computes the transformations upon request. Additionally, Lhotse introduces feature-space cut mixing to make the best of both worlds.
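To make the on-the-fly idea concrete, here is a minimal sketch of a cut as a metadata-only window into a recording. `ToyCut` and its methods are hypothetical and are not Lhotse's API. The point is only that truncation and padding change metadata, and audio is materialized when requested.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyCut:
    """Hypothetical stand-in for a cut: a window into a recording, metadata only."""

    recording_id: str
    start: float  # offset into the recording, in seconds
    duration: float  # length of the audio window, in seconds
    pad_to: float = 0.0  # total duration after padding (0.0 means no padding)

    @property
    def total_duration(self) -> float:
        return max(self.duration, self.pad_to)

    def truncate(self, max_duration: float, offset: float = 0.0) -> "ToyCut":
        # Only metadata changes here; no audio is read or written.
        new_duration = min(max_duration, self.duration - offset)
        return replace(self, start=self.start + offset, duration=new_duration, pad_to=0.0)

    def pad(self, duration: float) -> "ToyCut":
        # Record the target length; the zeros are appended only in load().
        return replace(self, pad_to=max(duration, self.duration))

    def load(self, samples, sampling_rate: int):
        # Materialize the audio on request: slice the window, then zero-pad.
        begin = round(self.start * sampling_rate)
        end = begin + round(self.duration * sampling_rate)
        audio = list(samples[begin:end])
        missing = round(self.total_duration * sampling_rate) - len(audio)
        return audio + [0.0] * max(missing, 0)


sampling_rate = 10  # toy rate so the numbers stay readable
samples = [float(i) for i in range(50)]  # 5 seconds of "audio"
cut = ToyCut("rec-1", start=0.0, duration=5.0)
short = cut.truncate(max_duration=2.0, offset=1.0)  # window from 1.0 s to 3.0 s
padded = short.pad(duration=3.0)
audio = padded.load(samples, sampling_rate)
```

Because `short` and `padded` are just small metadata objects, many differently truncated or padded variants of the same recording can exist without storing any extra audio.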
Installation
Lhotse supports Python version 3.7 and later.
Pip
Lhotse is available on PyPI:
pip install lhotse
To install the latest, unreleased version, do:
pip install git+https://github.com/lhotse-speech/lhotse
Development installation
For development installation, you can fork/clone the GitHub repo and install with pip:
git clone https://github.com/lhotse-speech/lhotse
cd lhotse
pip install -e '.[dev]'
pre-commit install  # installs pre-commit hooks with style checks

# Running unit tests
pytest test

# Running linter checks
pre-commit run
This is an editable installation (-e option), meaning that your changes to the source code are automatically reflected when importing lhotse (no re-install needed). The [dev] part means you're installing extra dependencies that are used to run tests, build documentation, or launch Jupyter notebooks.
Environment variables
Lhotse uses several environment variables to customize its behavior. They are as follows:
- LHOTSE_REQUIRE_TORCHAUDIO - when it's set and not any of 1|True|true|yes, we'll not check for torchaudio being installed and remove it from the requirements. It will disable many functionalities of Lhotse but the basic capabilities will remain (including reading audio with soundfile).
- LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE - used when we load audio from a file and receive a different number of samples than declared in Recording.num_samples. This is sometimes necessary because different codecs (or even different versions of the same codec) may use different padding when decoding compressed audio. Typically values up to 0.1, or even 0.3 (second) are still reasonable, and anything beyond that indicates a serious issue.
- LHOTSE_AUDIO_BACKEND - may be set to any of the values returned from CLI `lhotse…
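As an illustration of the first two variables, the sketch below shows one way such settings could be interpreted. The helper names, the treatment of an unset LHOTSE_REQUIRE_TORCHAUDIO as "required", and the 0.1 second fallback tolerance are assumptions made for this example; they are not taken from Lhotse's source code.

```python
import os

# Values treated as "true" for LHOTSE_REQUIRE_TORCHAUDIO, per the description above.
TRUTHY = {"1", "True", "true", "yes"}


def torchaudio_required(env=os.environ) -> bool:
    # Assumption: an unset variable means torchaudio is required;
    # any value outside TRUTHY turns the requirement off.
    value = env.get("LHOTSE_REQUIRE_TORCHAUDIO")
    return value is None or value in TRUTHY


def duration_mismatch_ok(
    declared_num_samples: int,
    decoded_num_samples: int,
    sampling_rate: int,
    env=os.environ,
    fallback_tolerance: float = 0.1,  # illustrative value, not Lhotse's default
) -> bool:
    # Compare the mismatch, converted to seconds, against the configured tolerance.
    tolerance = float(env.get("LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE", fallback_tolerance))
    mismatch_seconds = abs(declared_num_samples - decoded_num_samples) / sampling_rate
    return mismatch_seconds <= tolerance
```

For example, with a 16 kHz recording declared as 16000 samples, decoding 16800 samples is a 0.05 second mismatch and passes under a 0.1 second tolerance, while decoding 24000 samples (0.5 seconds off) fails even with the tolerance raised to 0.3.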
Excerpt shown — open the source for the full document.
Notability
Notability: 2.0/10. Routine fork of ASR repo.