Speech recognition and synthesis (ASR and TTS)

DeepPavlov contains models for automatic speech recognition (ASR) and speech synthesis (TTS) based on pre-built modules from NeMo (v0.10.0), NVIDIA's toolkit for defining and building Conversational AI applications. Named arguments for module initialization are taken from the NeMo config file (not to be confused with the DeepPavlov config file that defines the model pipeline).

Speech recognition

The ASR pipeline is based on Jasper, a CTC-based end-to-end model. The model transcribes speech samples without any additional alignment information. NeMoASR contains the following modules:

The NeMo config file for ASR should contain a labels argument besides the named arguments for the modules above. labels is the list of characters that the ASR model can output, as used during model training.
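
For illustration, for an English model such a list might consist of a space, the lowercase Latin letters, and an apostrophe. A hypothetical sketch, not the exact contents of any shipped config:

# Hypothetical labels list for an English ASR model:
# every character the CTC decoder is allowed to emit.
labels = [' '] + [chr(c) for c in range(ord('a'), ord('z') + 1)] + ["'"]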

Speech synthesis

The TTS pipeline that creates human-audible speech from text is based on the Tacotron 2 and WaveGlow models. NeMoTTS contains the following modules:

  • TextEmbedding - uses arguments from the TextEmbedding section of the NeMo config file. Requires a pretrained checkpoint.

  • Tacotron2Encoder - uses arguments from the Tacotron2Encoder section of the NeMo config file. Requires a pretrained checkpoint.

  • Tacotron2DecoderInfer - uses arguments from the Tacotron2Decoder section of the NeMo config file. Requires a pretrained checkpoint.

  • Tacotron2Postnet - uses arguments from the Tacotron2Postnet section of the NeMo config file. Requires a pretrained checkpoint.

  • WaveGlow - uses arguments from the WaveGlowNM section of the NeMo config file. Requires a pretrained checkpoint.

  • GriffinLim - uses arguments from the GriffinLim section of the NeMo config file.

  • TextDataLayer - uses arguments from the TranscriptDataLayer section of the NeMo config file.

The NeMo config file for TTS should contain labels and sample_rate arguments besides the named arguments for the modules above. labels is the list of characters used in TTS model training.

Audio encoding and decoding

ascii_to_bytes_io() and bytes_io_to_ascii() were added to the library to achieve uniformity when working with both text and audio data. These components can be used to encode binary data to an ASCII string and decode it back.
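
A minimal sketch of the round trip, assuming these components live in deeppavlov.models.nemo.common and convert between BytesIO objects and ASCII strings (check the import path and exact signatures in your DeepPavlov version):

from io import BytesIO

# assumed import path
from deeppavlov.models.nemo.common import ascii_to_bytes_io, bytes_io_to_ascii

with open('/path/to/wav/file/with/speech', 'rb') as fin:
    audio = BytesIO(fin.read())

ascii_str = bytes_io_to_ascii(audio)     # binary data -> ASCII string, safe to send as JSON
restored = ascii_to_bytes_io(ascii_str)  # ASCII string -> BytesIO with the original bytes

assert restored.read() == audio.getvalue()  # the round trip preserves the data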

Quick Start

Preparation

Install requirements and download model files.

python -m deeppavlov install asr_tts
python -m deeppavlov download asr_tts

The examples below use the sounddevice library. Install it with pip install sounddevice==0.3.15. You may need to install the libportaudio2 package with sudo apt-get install libportaudio2 to make sounddevice work.

Note

ASR reads and TTS generates single-channel WAV files. Files passed to ASR are resampled to the sample rate specified in the NeMo config file (16 kHz for the models from DeepPavlov configs).

Speech recognition

The DeepPavlov asr config contains a minimal pipeline for English speech recognition using the QuartzNet15x5En pretrained model. To record speech on your computer and print its transcription, run the following script:

from io import BytesIO

import sounddevice as sd
from scipy.io.wavfile import write

from deeppavlov import build_model, configs

sr = 16000  # sample rate expected by the model (see the NeMo config file)
duration = 3  # recording length in seconds

print('Recording...')
myrecording = sd.rec(duration*sr, samplerate=sr, channels=1)  # record mono audio
sd.wait()  # block until the recording is finished
print('done')

out = BytesIO()
write(out, sr, myrecording)  # wrap the recording into an in-memory WAV file
out.seek(0)  # rewind the buffer so it is read from the start

model = build_model(configs.nemo.asr)
text_batch = model([out])

print(text_batch[0])
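
If you already have a recording, the same model accepts file contents wrapped in BytesIO. A minimal sketch (the path below is a placeholder for your own single-channel WAV file):

from io import BytesIO

from deeppavlov import build_model, configs

model = build_model(configs.nemo.asr)

with open('/path/to/your/speech.wav', 'rb') as fin:  # placeholder path
    audio = BytesIO(fin.read())

print(model([audio])[0])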

Speech synthesis

The DeepPavlov tts config contains a minimal pipeline for speech synthesis using the Tacotron 2 and WaveGlow pretrained models. To generate an audio file and save it to disk, run the following script:

from deeppavlov import build_model, configs

model = build_model(configs.nemo.tts)
filepath_batch = model(['Hello world'], ['~/hello_world.wav'])  # texts to synthesize and target file paths

print(f'Generated speech has been successfully saved to {filepath_batch[0]}')
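
To check the result, you can play the generated file back with the same sounddevice library used above. A minimal sketch, assuming the file from the previous example was saved as hello_world.wav in your home directory:

from pathlib import Path

import sounddevice as sd
from scipy.io.wavfile import read

sr, data = read(str(Path('~/hello_world.wav').expanduser()))  # load the generated WAV
sd.play(data, sr)  # play it on the default output device
sd.wait()  # block until playback finishes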

Speech to speech

The previous examples assume that the files with speech to recognize and the files to be generated are located on the same system where DeepPavlov is running. The DeepPavlov asr_tts config allows you to send files with speech to recognize to another system and receive files with generated speech back. This config recognizes the received speech and synthesizes the recognized text back into speech.

Run asr_tts in REST API mode:

python -m deeppavlov riseapi asr_tts

The following Python script assumes that you already have a file with speech to recognize. You can use the code from the speech recognition example to record speech on your system. Replace 127.0.0.1 with the address of the system where DeepPavlov was started.

from base64 import encodebytes, decodebytes

from requests import post

with open('/path/to/wav/file/with/speech', 'rb') as fin:
    input_speech = fin.read()

# encode the binary audio as an ASCII string so it can be sent as JSON
input_ascii = encodebytes(input_speech).decode('ascii')

resp = post('http://127.0.0.1:5000/model', json={"speech_in_encoded": [input_ascii]})
text, generated_speech_ascii = resp.json()[0]  # transcription and re-synthesized speech
generated_speech = decodebytes(generated_speech_ascii.encode())

with open('/path/where/to/save/generated/wav/file', 'wb') as fout:
    fout.write(generated_speech)

print(f'Speech transcription is: {text}')

Warning

NeMo library v0.10.0 cannot infer batches larger than one without a compatible NVIDIA GPU.

Model training

To get your own pretrained checkpoints for the NeMo modules, see the Speech recognition and Speech Synthesis tutorials. A list of pretrained models can be found here.