deeppavlov.models.nemo

class deeppavlov.models.nemo.asr.NeMoASR(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], **kwargs)

ASR model on NeMo modules.

__init__(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], **kwargs) → None

Initializes NeuralModules for ASR.

Parameters
  • load_path – Path to a directory with pretrained checkpoints for JasperEncoder and JasperDecoderForCTC.

  • nemo_params_path – Path to a file containing labels and params for AudioToMelSpectrogramPreprocessor, JasperEncoder, JasperDecoderForCTC and AudioInferDataLayer.

__call__(audio_batch: List[Union[str, _io.BytesIO]]) → List[str]

Transcribes an audio batch to text.

Parameters

audio_batch – Batch to be transcribed. Elements could be either paths to audio files or Binary I/O objects.

Returns

Batch of transcripts.

Return type

text_batch
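
Since audio_batch accepts Binary I/O objects as well as file paths, an in-memory WAV file can serve as a batch element. The following is a minimal sketch that builds such an object with the Python standard library only. The helper name make_wav_bytes_io, the sine-wave content, and the 16-bit mono format are assumptions made for illustration and are not part of this module.

```python
import io
import math
import struct
import wave


def make_wav_bytes_io(duration_s: float = 0.5, sample_rate: int = 16000,
                      freq_hz: float = 440.0) -> io.BytesIO:
    """Return a mono 16-bit PCM WAV file held in memory (hypothetical helper)."""
    n_frames = int(duration_s * sample_rate)
    # Half-amplitude sine wave, packed as little-endian signed 16-bit samples.
    samples = (int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
               for i in range(n_frames))
    pcm = b"".join(struct.pack("<h", sample) for sample in samples)

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    buffer.seek(0)  # rewind so that a consumer reads from the start
    return buffer


# A batch with a single in-memory element; file paths could be mixed in as well.
audio_batch = [make_wav_bytes_io()]
```

A batch built this way would then be passed to an initialized NeMoASR instance. Constructing that instance requires the pretrained checkpoints and the params file described above, so it is not shown here.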

class deeppavlov.models.nemo.tts.NeMoTTS(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], vocoder: str = 'waveglow', **kwargs)

TTS model on NeMo modules.

__init__(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], vocoder: str = 'waveglow', **kwargs) → None

Initializes NeuralModules for TTS.

Parameters
  • load_path – Path to a directory with pretrained checkpoints for TextEmbedding, Tacotron2Encoder, Tacotron2DecoderInfer, Tacotron2Postnet and, if Waveglow vocoder is selected, WaveGlowInferNM.

  • nemo_params_path – Path to a file containing sample_rate, labels and params for TextEmbedding, Tacotron2Encoder, Tacotron2Decoder, Tacotron2Postnet and TranscriptDataLayer.

  • vocoder – Vocoder used to convert from spectrograms to audio. Available options: waveglow (needs pretrained checkpoint) and griffin-lim.

__call__(text_batch: List[str], path_batch: Optional[List[str]] = None) → Union[List[_io.BytesIO], List[str]]

Creates wav files or file objects with speech.

Parameters
  • text_batch – Text from which human audible speech should be generated.

  • path_batch – The i-th element of path_batch is the path where the i-th generated speech file is saved. If the argument isn’t specified, the synthesized speech is stored in Binary I/O objects.

Returns

List of Binary I/O objects with generated speech if path_batch was not specified, list of paths to files with synthesized speech otherwise.
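
The two return modes described above can be illustrated with a small standalone sketch. The function store_speech below is a hypothetical helper, not the library's implementation. It assumes the synthesized clips are already available as raw bytes and only mirrors the documented contract: Binary I/O objects when path_batch is omitted, file paths otherwise. The length check is an extra assumption.

```python
import io
import tempfile
from pathlib import Path
from typing import List, Optional, Union


def store_speech(wav_batch: List[bytes],
                 path_batch: Optional[List[str]] = None
                 ) -> Union[List[io.BytesIO], List[str]]:
    """Hypothetical helper mirroring the documented return contract."""
    if path_batch is None:
        # No paths given: hand back in-memory file objects.
        return [io.BytesIO(wav) for wav in wav_batch]
    if len(path_batch) != len(wav_batch):
        raise ValueError("path_batch must contain one path per generated clip")
    for wav, path in zip(wav_batch, path_batch):
        Path(path).write_bytes(wav)  # the i-th clip goes to the i-th path
    return list(path_batch)


# Placeholder bytes stand in for real synthesized audio.
in_memory = store_speech([b"fake clip 1", b"fake clip 2"])

with tempfile.TemporaryDirectory() as tmp_dir:
    target = str(Path(tmp_dir) / "clip_0.wav")
    saved = store_speech([b"fake clip 1"], [target])
    saved_ok = Path(saved[0]).read_bytes() == b"fake clip 1"
```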

deeppavlov.models.nemo.common.ascii_to_bytes_io(batch: Union[str, list]) → Union[_io.BytesIO, list]

Recursively searches for strings in the input batch and converts them into the base64-encoded bytes wrapped in Binary I/O objects.

Parameters

batch – A string or an iterable container with strings at some level of nesting.

Returns

The same structure where all strings are converted into the base64-encoded bytes wrapped in Binary I/O objects.

deeppavlov.models.nemo.common.bytes_io_to_ascii(batch: Union[_io.BytesIO, list]) → Union[str, list]

Recursively searches for Binary I/O objects in the input batch and converts them into ASCII strings.

Parameters

batch – A BinaryIO object or an iterable container with BinaryIO objects at some level of nesting.

Returns

The same structure where all BinaryIO objects are converted into strings.
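
The recursive conversion performed by these two helpers can be sketched with the standard library alone. The functions to_bytes_io_sketch and to_ascii_sketch below are hypothetical stand-ins, not the library's code. They assume that the strings hold base64 text and that the Binary I/O objects hold the decoded bytes. This direction of encoding is one reading of the descriptions above and is not confirmed by them.

```python
import base64
import io
from typing import Union


def to_bytes_io_sketch(batch: Union[str, list]) -> Union[io.BytesIO, list]:
    """Replace every string in a nested structure with a BytesIO object."""
    if isinstance(batch, str):
        # Assumption: the string is base64 text describing binary data.
        return io.BytesIO(base64.b64decode(batch.encode("ascii")))
    return [to_bytes_io_sketch(item) for item in batch]


def to_ascii_sketch(batch: Union[io.BytesIO, list]) -> Union[str, list]:
    """Replace every BytesIO object in a nested structure with a string."""
    if isinstance(batch, io.BytesIO):
        # getvalue() returns the whole buffer regardless of the read position.
        return base64.b64encode(batch.getvalue()).decode("ascii")
    return [to_ascii_sketch(item) for item in batch]


encoded = base64.b64encode(b"hello").decode("ascii")
nested = [encoded, [encoded, encoded]]
# Converting there and back preserves both the nesting and the strings.
restored = to_ascii_sketch(to_bytes_io_sketch(nested))
```

Under these assumptions the two functions are inverses of each other, which is what makes such a pair convenient for moving binary audio through text-only transports such as JSON.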

class deeppavlov.models.nemo.asr.AudioInferDataLayer(*args: Any, **kwargs: Any)

Data Layer for ASR pipeline inference.

__init__(*, audio_batch: List[Union[str, _io.BytesIO]], batch_size: int = 32, sample_rate: int = 16000, int_values: bool = False, trim_silence: bool = False, **kwargs) → None

Initializes Data Loader.

Parameters
  • audio_batch – Batch to be read. Elements could be either paths to audio files or Binary I/O objects.

  • batch_size – How many samples per batch to load.

  • sample_rate – Target sampling rate for data. Audio files with a different sampling rate will be resampled to sample_rate.

  • int_values – If true, load data as 32-bit integers.

  • trim_silence – Trim leading and trailing silence from an audio signal if True.

class deeppavlov.models.nemo.tts.TextDataLayer(*args: Any, **kwargs: Any)
__init__(*, text_batch: List[str], labels: List[str], batch_size: int = 32, bos_id: Optional[int] = None, eos_id: Optional[int] = None, pad_id: Optional[int] = None, **kwargs) → None

A simple Neural Module for loading text data.

Parameters
  • text_batch – Texts to be used for speech synthesis.

  • labels – List of string labels to use for str2int translation.

  • batch_size – How many strings per batch to load.

  • bos_id – Label position of the beginning-of-string symbol. If None, it is initialized as len(labels).

  • eos_id – Label position of the end-of-string symbol. If None, it is initialized as len(labels) + 1.

  • pad_id – Label position of the pad symbol. If None, it is initialized as len(labels) + 2.
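
The default ids listed above can be made concrete with a small standalone sketch of str2int translation. The function encode_texts is hypothetical and is not the data layer's implementation. Only the three default values come from the parameter descriptions. Wrapping each sequence in bos_id and eos_id, dropping characters missing from labels, and right-padding to the longest sequence are assumptions made for illustration.

```python
from typing import List, Optional


def encode_texts(text_batch: List[str], labels: List[str],
                 bos_id: Optional[int] = None, eos_id: Optional[int] = None,
                 pad_id: Optional[int] = None) -> List[List[int]]:
    """Hypothetical str2int translation with the documented default ids."""
    bos_id = len(labels) if bos_id is None else bos_id
    eos_id = len(labels) + 1 if eos_id is None else eos_id
    pad_id = len(labels) + 2 if pad_id is None else pad_id

    label_to_id = {label: idx for idx, label in enumerate(labels)}
    encoded = [
        # Assumption: characters that are not in labels are silently skipped.
        [bos_id] + [label_to_id[ch] for ch in text if ch in label_to_id] + [eos_id]
        for text in text_batch
    ]
    # Assumption: shorter sequences are right-padded to the longest one.
    max_len = max((len(seq) for seq in encoded), default=0)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in encoded]


# With three labels the defaults are bos_id=3, eos_id=4 and pad_id=5.
ids = encode_texts(["ab", "c"], labels=["a", "b", "c"])
# ids == [[3, 0, 1, 4], [3, 2, 4, 5]]
```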

class deeppavlov.models.nemo.vocoder.WaveGlow(*, denoiser_strength: float = 0.0, n_window_stride: int = 160, **kwargs)
__init__(*, denoiser_strength: float = 0.0, n_window_stride: int = 160, **kwargs) → None

Wraps WaveGlowInferNM module.

Parameters
  • denoiser_strength – Denoiser strength for waveglow.

  • n_window_stride – Stride of window for FFT in samples used in model training.

  • kwargs – Named arguments for WaveGlowInferNM constructor.

class deeppavlov.models.nemo.vocoder.GriffinLim(*, sample_rate: float = 16000.0, n_fft: int = 1024, mag_scale: float = 2048.0, power: float = 1.2, n_iters: int = 50, **kwargs)
__init__(*, sample_rate: float = 16000.0, n_fft: int = 1024, mag_scale: float = 2048.0, power: float = 1.2, n_iters: int = 50, **kwargs) → None

Uses Griffin Lim algorithm to generate speech from spectrograms.

Parameters
  • sample_rate – Generated audio data sample rate.

  • n_fft – The number of points to use for the FFT.

  • mag_scale – Multiplied with the linear spectrogram to avoid audio sounding muted due to mel filter normalization.

  • power – The linear spectrogram is raised to this power prior to running the Griffin Lim algorithm. A power greater than 1 has been shown to improve audio quality.

  • n_iters – Number of iterations used to convert magnitude spectrograms to an audio signal.