NVIDIA Riva is an application framework that provides several pipelines for accomplishing conversational AI tasks. Generating high-quality, natural-sounding speech from text with low latency, also known as text-to-speech (TTS), can be one of the most computationally challenging of those tasks.

In this post, we focus on optimizations made to a TTS pipeline in Riva, as shown in Figure 1. For more information about the Riva server, see Introducing NVIDIA Riva: A GPU-Accelerated SDK for Developing Speech AI Applications.

This TTS model is composed of the Tacotron2 network, which maps character sequences to mel-scale spectrograms, followed by the NVIDIA WaveGlow network, which generates time-domain waveforms from the mel-scale spectrograms. For more information about the networks, as well as how to train them using PyTorch, see Generate Natural Sounding Speech from Text in Real-Time.

Our goal in creating the Riva TTS pipeline was to enable conversational AIs to respond with natural-sounding speech in as little time as possible, making for an engaging user experience. Below, we detail the effort of creating a high-performance TTS inference implementation using NVIDIA TensorRT and CUDA.

In a previous post, How to Deploy Real-Time Text-to-Speech Applications on GPUs using TensorRT, you learned how to import a TTS model from PyTorch into TensorRT to perform faster inference with minimal effort. For this implementation, we wanted the lowest-latency TTS inference we could get. To accomplish this, we made several decisions that, while requiring more effort, result in additional performance. The implementation discussed in this post is available as part of the NVIDIA Deep Learning Examples GitHub repository.

To start with, we used the C++ TensorRT interface rather than the Python bindings. This helped reduce the CPU overhead needed to coordinate and launch work on the GPU. That overhead is particularly important in Tacotron2, where we must launch a network execution to generate each mel-scale spectrogram frame, of which there are roughly 86 per second of audio. For more information about creating and running networks with the C++ API, see Using the C++ API.

To gain more flexibility in the Tacotron2 network, instead of parsing the exported ONNX model and having TensorRT automatically create the network, we manually constructed the network through the IBuilder interface of TensorRT. This enabled us to make several modifications, including allowing variable-length sequences in the same batch to be processed.

To build the network manually, we first needed an easy way to get the weights from PyTorch to C++. For ease of use and readability, we used a single-level JSON structure. Another option would be to get the weights using the PyTorch C++ API. To export the Tacotron2 PyTorch model to JSON, first load the saved state dictionary:

```python
import torch

# Read the trained checkpoint as a flat dict of parameter names to tensors.
statedict = dict(torch.load(statedict_path))
```

To get the weights into a format that the TensorRT C++ API can consume, we created the JSONModelImporter and LayerData classes, which handle reading and storing the weights as Weights.
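On the C++ side, the following is a minimal sketch of what such an importer can look like. The JsonWeightStore class and the use of the nlohmann/json library are assumptions made for illustration; the actual JSONModelImporter and LayerData classes in the repository are organized differently.

```cpp
// Minimal sketch: load a single-level JSON file of the form
// {"layer.weight": [0.1, -0.2, ...], ...} into TensorRT Weights.
#include <NvInfer.h>
#include <nlohmann/json.hpp>  // assumed JSON library for this sketch

#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

class JsonWeightStore {
 public:
  explicit JsonWeightStore(const std::string& path) {
    std::ifstream fin(path);
    nlohmann::json doc = nlohmann::json::parse(fin);
    for (auto& [name, values] : doc.items()) {
      // Keep the floats alive in owned storage: TensorRT does not copy
      // Weights until the engine is built.
      data_[name] = values.get<std::vector<float>>();
    }
  }

  // Hand out a non-owning TensorRT view of a named tensor.
  nvinfer1::Weights get(const std::string& name) const {
    const std::vector<float>& v = data_.at(name);
    return nvinfer1::Weights{nvinfer1::DataType::kFLOAT, v.data(),
                             static_cast<int64_t>(v.size())};
  }

 private:
  std::unordered_map<std::string, std::vector<float>> data_;
};
```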
As in the ONNX-parsed implementation of Tacotron2, we split the model into three subnetworks: the encoder, decoder, and post-net, which we created with the EncoderBuilder, DecoderPlainBuilder, and PostNetBuilder classes, respectively. At the start of each build method, we created the INetworkDefinition object from the IBuilder and added the network inputs.

The encoder network consists of an embedding layer, followed by convolution layers with activations, and ends with a bidirectional LSTM. Each of these can be expressed with TensorRT layers. Because the encoder is run only one time during inference, it tends to take a small fraction of the runtime, less than five percent in most cases.
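As a sketch of how such a build method can be structured, the hypothetical buildEncoder function below creates the network definition, adds the input, and maps the embedding and the bidirectional LSTM onto TensorRT layers. It reuses the JsonWeightStore sketch from above; the dimensions are merely illustrative of Tacotron2, and the real EncoderBuilder differs in its details (batch normalization and per-gate LSTM weight wiring are omitted here).

```cpp
// Hypothetical encoder build method, simplified from the pattern described
// in the text; not the actual EncoderBuilder implementation.
#include <NvInfer.h>

using namespace nvinfer1;

INetworkDefinition* buildEncoder(IBuilder& builder,
                                 const JsonWeightStore& weights,
                                 int maxInputLength) {
  INetworkDefinition* network = builder.createNetworkV2(0U);

  // Character IDs enter as a sequence of integers.
  ITensor* input =
      network->addInput("input", DataType::kINT32, Dims2(1, maxInputLength));

  // TensorRT has no dedicated embedding layer, so the lookup is expressed
  // as a constant table plus a gather over axis 0.
  const int numSymbols = 148, embedDim = 512;  // illustrative Tacotron2 sizes
  IConstantLayer* table = network->addConstant(
      Dims2(numSymbols, embedDim), weights.get("embedding.weight"));
  IGatherLayer* embed = network->addGather(*table->getOutput(0), *input, 0);

  // The convolution layers with activations would be added here, for
  // example with addConvolutionNd() followed by
  // addActivation(ActivationType::kRELU).

  // Bidirectional LSTM over the embedded sequence. The per-gate weights
  // are attached with setWeightsForGate()/setBiasForGate(), omitted here.
  IRNNv2Layer* lstm =
      network->addRNNv2(*embed->getOutput(0), /*layerCount=*/1,
                        /*hiddenSize=*/256, maxInputLength,
                        RNNOperation::kLSTM);
  lstm->setDirection(RNNDirection::kBIDIRECTION);

  network->markOutput(*lstm->getOutput(0));
  return network;
}
```

Expressing the embedding as a constant table plus a gather is a common pattern for lookup tables in TensorRT, since there is no dedicated embedding layer.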
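To turn a network definition like this into something runnable, the usual TensorRT pattern is to build an engine and then enqueue executions on a CUDA stream. The sketch below is again an illustration under assumptions rather than the repository's code: buffer allocation and the decoder's feedback inputs are elided, but it shows why per-launch CPU overhead matters when one execution is enqueued per spectrogram frame.

```cpp
// Hedged sketch: build an engine from a network definition and launch one
// execution per output frame, as the Tacotron2 decoder loop requires.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

using namespace nvinfer1;

void buildAndRun(IBuilder& builder, INetworkDefinition& network,
                 int numFrames,
                 void** bindings /* device buffers, one per I/O tensor */) {
  IBuilderConfig* config = builder.createBuilderConfig();
  config->setMaxWorkspaceSize(1ULL << 30);  // 1 GiB of scratch space
  ICudaEngine* engine = builder.buildEngineWithConfig(network, *config);
  IExecutionContext* context = engine->createExecutionContext();

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Roughly 86 mel-scale spectrogram frames per second of audio, each
  // requiring its own launch; this loop is where the C++ API's lower
  // launch overhead pays off.
  for (int frame = 0; frame < numFrames; ++frame) {
    context->enqueueV2(bindings, stream, nullptr);
    // In the real decoder, each output frame feeds the next iteration.
  }
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
}
```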