by ryan | May 18, 2024
This post summarizes my work during my first four months as a PhD student, which was published at the IEEE ICASSP 2024 SASB Workshop as Open Implementation and Study of BEST-RQ for Speech Processing (link to paper here). Code here (this is a pull request and I'm currently working on cleaning it up).
BEST-RQ is a recent model for automatic speech recognition (ASR), i.e. a computer takes in audio and generates a transcription. It was developed by researchers from Google (2022) and is the model behind Google USM (2023).
My PhD topic is about Efficient and Effective Self-Supervised Learning Models for Speech.
If this doesn't make sense to you that's ok, just know my topic involves efficiency (i.e. the speed of these models, the time it takes to train them, the amount of data these models need, …).
So when you start a PhD, what do you do? I don't know what others do, but I decided to make use of these smart people called my advisors and ask them for advice. They recommended looking at the recent research in this field and starting from there.
So after discussions with my advisors and going through around 100 papers (for most of them I just looked at the title and abstract, so this didn't take too long), I thought Google's BEST-RQ would be a good place to start.
I struggled quite a bit with getting this to work, and in my paper I couldn't go into much detail. So, I thought this walkthrough could serve as an easy-to-read and more in-depth description of my work.
In this section I give an overview of the architecture of BEST-RQ along with how I implemented each part. For my experiments, I compare my implementation of BEST-RQ with wav2vec 2.0 (a widely used model invented by Meta researchers). After describing the implementation, I conclude with a bullet-point summary of my experiments and results.
The first step of the model is to turn the raw audio into mel filterbank features, which I compute with SpeechBrain's Fbank class. These features are then normalized with SpeechBrain's InputNormalization class, using global as the norm_type, which calculates a moving mean and standard deviation over the whole dataset (with sentence as the norm_type, each utterance would instead be normalized on its own). In other words, with global you will try to normalize with the mean and standard deviation of the training dataset. Both components are defined in a .yaml file which will be used by SpeechBrain to create python objects.
# in yaml file
# define variables
sample_rate: 16000
n_fft: 400
n_mels: 80

# for calculating mel-spectrogram
compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global
    update_until_epoch: 4
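A small detail: update_until_epoch: 4 tells InputNormalization to keep updating its running statistics only during the first four epochs; after that, the normalization statistics are frozen.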
These objects are then called in the train.py file in the following way.
# in train.py
feats = self.hparams.compute_features(wavs)
# get current epoch
current_epoch = self.hparams.epoch_counter.current
# normalize input
feats = self.modules.normalize(feats, wav_lens, epoch=current_epoch)
In BEST-RQ, the authors describe a ‘masking’ strategy where they randomly choose one percent of frames to be ‘starting’ frames for a mask; each chosen frame and the following three frames are then ‘masked’.
In the case of BEST-RQ, ‘to mask’ means to replace the selected frames with random noise drawn from a normal distribution (mean of 0 and standard deviation of 0.1). Basically, the goal of the pre-training is to get the model to use the unmasked audio sections to reconstruct the masked sections. The idea is that if a model can reconstruct or ‘guess’ a masked section, then the model must have learned good representations of audio that can then be used for transcribing or other audio tasks.
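The replacement itself only takes a couple of lines of code. Here is a minimal sketch (not the exact code from the repository), assuming feats holds a batch of Fbank features of shape (batch, time, mels) and mask_idx is a tensor of the frame indices selected for masking:

# replace the selected frames with Gaussian noise (mean 0, std 0.1)
B, T, C = feats.shape
feats[:, mask_idx, :] = torch.randn(B, len(mask_idx), C, device=feats.device) * 0.1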
Although conceptually simple to understand, the masking part was one of the hardest parts for me to code. The trouble for me was that the dimensions get reduced 4x by the CNN layers. My biggest question was: if we randomly mask frames and then reduce the dimensionality by 4x, there is a chance that a mask will fall between indices. For example, if you randomly select frame 2 to be the starting mask frame and then mask the next 3 frames, you will mask frames [2,3,4,5]. But then the dimensions get reduced, and we want to predict the mask targets at this reduced resolution (more on how these targets are created in the next section). So do we predict for index 0 (originally consisting of frames [0,1,2,3]) or index 1 (originally consisting of frames [4,5,6,7])? We could even predict both!?
We asked the authors (#thanksauthorsforbeingsoresponsive), and they said they only predicted on indices that were completely masked. So I decided to code up my mask in the simplest way I could think of: mask starts are only chosen at frame indices divisible by four, so each masked chunk of four input frames lines up exactly with one downsampled index. This way, every index that the model predicts on will be a fully masked section. And to make sure that everything behaves well, we pad the spectrogram, if needed, to always have a length divisible by 4.
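For illustration, here is a minimal sketch of that idea (the make_aligned_mask helper is hypothetical; the actual implementation is in mask.py):

import torch

def make_aligned_mask(num_frames, mask_prob=0.01, chunk=4):
    # work in chunks of 4 frames so every mask lines up with one downsampled index
    num_chunks = num_frames // chunk
    starts = torch.rand(num_chunks) < mask_prob  # select roughly 1% of chunks as mask starts
    chunk_idx = torch.nonzero(starts).squeeze(-1)
    # expand each selected chunk back to the 4 input frames it covers
    frame_idx = (chunk_idx.unsqueeze(1) * chunk + torch.arange(chunk)).flatten()
    return frame_idx  # e.g. tensor([4, 5, 6, 7, 20, 21, 22, 23])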
Here is the code for the padding in the train.py file (for the code for the mask, see mask.py).
# Calculate the amount of padding needed to make the tensor divisible by 4
# (dim_to_pad is the index of the time dimension in the feats tensor)
current_dim_size = feats.shape[dim_to_pad]
padding_needed = (4 - (current_dim_size % 4)) % 4 # Ensure positive padding
# Define the padding
padding = [0, 0, 0, 0, 0, 0] # Initialize padding for all dimensions
padding[dim_to_pad * 2] = padding_needed # Set padding for the chosen dimension
# add in padding to features and mask
feats = torch.nn.functional.pad(feats, padding)
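As a quick sanity check: if feats has 103 frames, then padding_needed = (4 - 103 % 4) % 4 = 1, so one frame of zeros is appended and the padded 104 frames reduce cleanly to 26 downsampled steps.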
To create the training targets, the unmasked features are stacked in groups of four frames with the .view() function and then given to the quantizer (this is found in the train.py file).

💡 Tricky parts to remember: the projection matrix used in (x @ self.P) and the codebook are both registered as buffers, so they stay frozen during training, and both the projected features and the codebook are normalized before searching for the nearest codebook entry.
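As a rough sketch of that reshaping step (assuming feats has already been padded so its length is divisible by 4, and quantizer is the module defined below):

B, T, C = feats.shape                    # e.g. (8, 104, 80)
stacked = feats.view(B, T // 4, C * 4)   # (8, 26, 320): four frames stacked per step
targets = quantizer(stacked)             # one codebook index per downsampled step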
Here is the code.
# quantizer.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.linalg import vector_norm
class RandomProjectionQuantizer(nn.Module):
def __init__(self, input_dim, cb_dim, cb_vocab):
super().__init__()
self.input_dim = input_dim
self.cb_dim = cb_dim
self.cb_vocab = cb_vocab
# Section 3.1 "projection matrix A use Xavier initialization"
P_init = torch.empty((input_dim, cb_dim))
self.register_buffer("P", nn.init.xavier_uniform_(P_init))
# normalize random matrix for codebook
self.register_buffer("CB", F.normalize(torch.randn(cb_vocab, cb_dim)))
def forward(self, x):
x = F.normalize(x @ self.P, dim=2)
return vector_norm((self.CB.unsqueeze(1) - x.unsqueeze(1)), dim=-1).argmin(dim=1)
# this line of code above is very condensed and thus confusing
# basically this is to avoid doing for loops and find the closest
# entry in the code book for each projected frame
# this single line of code honestly merits its own tutorial
# let me know if you'd like me to write one :)
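To make the shapes concrete, here is a small usage example with illustrative values (80 mels × 4 stacked frames = 320 input dimensions; the codebook size and dimension here are just example values, not necessarily the ones from my experiments):

quantizer = RandomProjectionQuantizer(input_dim=320, cb_dim=16, cb_vocab=5000)
x = torch.randn(8, 26, 320)  # (batch, downsampled steps, stacked features)
targets = quantizer(x)       # shape (8, 26): one codebook index in [0, 5000) per step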
Next up is the encoder: a convolutional front end followed by a Conformer, which I define in the .yaml file and then use in the train.py.
# Transformer parameters
d_model: 576
nhead: 8
num_encoder_layers: 12
num_decoder_layers: 0
d_ffn: 2048
transformer_dropout: 0.1
activation: !name:torch.nn.GELU
output_neurons: 5000
encoder_layerdrop: 0.05

CNN: !new:speechbrain.lobes.models.convolution.ConvolutionFrontEnd
    input_shape: (8, 10, 80)
    num_blocks: 2
    num_layers_per_block: 1
    out_channels: (128, 32)
    kernel_sizes: (3, 3)
    strides: (2, 2)
    residuals: (False, False)

Transformer: !new:speechbrain.lobes.models.transformer.TransformerASR.TransformerASR # yamllint disable-line rule:line-length
    input_size: 640
    tgt_vocab: !ref <output_neurons>
    d_model: !ref <d_model>
    nhead: !ref <nhead>
    num_encoder_layers: !ref <num_encoder_layers>
    num_decoder_layers: !ref <num_decoder_layers>
    d_ffn: !ref <d_ffn>
    dropout: !ref <transformer_dropout>
    activation: !ref <activation>
    encoder_module: conformer
    attention_type: RelPosMHAXL
    normalize_before: True
    causal: False
    layerdrop_prob: !ref <encoder_layerdrop>

# I use the following wrapper so the decoder isn't run
# This is because by default the TransformerASR will try to run a decoder
# but we don't have any decoder layers
wrapper: !new:speechbrain.lobes.models.transformer.TransformerASR.EncoderWrapper
    transformer: !ref <Transformer>
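A quick note on input_size: 640. The two CNN blocks each use a stride of 2, so along the feature axis the 80 mel bins are reduced to 80 / 4 = 20, and with 32 output channels the flattened dimension entering the transformer is 20 × 32 = 640. The same 4x reduction happens along the time axis, which is exactly the reduction that made the masking tricky earlier.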
These modules are then called in train.py:
# in train.py
# convolutions
src = self.modules.CNN(feats)
# conformer layers (use wrapper so that decoder isn't used)
enc_out = self.modules.wrapper(src, wav_lens)
Finally, a linear layer maps the encoder output to logits over the codebook vocabulary. It is defined in the .yaml file:

linear: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <d_model>
    n_neurons: !ref <cb_vocab>
# in compute_forward function
# linear layer to get logits
logits = self.modules.linear(enc_out)
# get starting indices of masked area
mask_idx = mask[::divis_by] // divis_by
# get logits of masked area
logits = logits[:,mask_idx,:]
# get targets of masked area
targets = targets[:,mask_idx]
B, T, C = logits.shape
# reshape, flattening out the batch dimension
# this makes it easier to use the loss function
# then we return these two values (the logits and targets)
# to be passed on to the compute_objectives function
return logits.view(B * T, C), targets.view(B*T)
# in compute_objectives function
pred, targets = predictions
loss = F.cross_entropy(pred, targets)
return loss
And voila! Those are all the main components of BEST-RQ.
The details of the experiments and the results are described in the paper.
Although conceptually simple, BEST-RQ has a lot of tricky details that, if not implemented properly, can make it so the model doesn't perform well. I hope this post can help clarify some of those details.
In future work, I would like to scale up my BEST-RQ implementation and experiment with altering the architecture and with different datasets.