To calculate word-level confidence scores, Kaldi uses a method called MBR (Minimum Bayes Risk) decoding. MBR decoding is a decoding process that minimizes the word-level error rate (instead of minimizing the whole-utterance cost) to calculate the result. This may not give the most accurate result, but it can be used to calculate a confidence score to some degree. Just don’t expect too much, as the scores are not very accurate.
Here are some key concepts:
1. Levenshtein Distance: The Levenshtein distance, or edit distance, measures the difference between two sentences by counting how many word edits (substitutions, insertions and deletions) separate them. Let’s say X and Y are the two word sequences shown below. The Levenshtein distance would be 3, where Ɛ represents the empty word.
To calculate the Levenshtein distance you can use the following recursive algorithm, where A and B are word sequences of length N+1.
As with any recursive algorithm, to cut down on duplicate computation Kaldi uses memoization and stores the above three cases in a1, a2 and a3 respectively.
2. Forward-Backward Algorithm: Let’s say you want to calculate the probability of seeing a waveform (or MFCC features) given a path in a lattice (or in an HMM FST). The Forward-Backward algorithm is nothing more than an optimized way to compute this probability.
3. Gamma Calculation: TBA
4. MBR Decoding: TBA
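The edit-distance recursion from item 1 can be sketched in a few lines of Python. This is a minimal sketch, not Kaldi's code: the mapping of a1, a2 and a3 to substitution, deletion and insertion is an assumption (the original listing of the three cases is not shown here), and `lru_cache` stands in for whatever memoization table Kaldi uses internally.

```python
from functools import lru_cache


def levenshtein(a, b):
    """Word-level edit distance between two word sequences."""
    a, b = tuple(a), tuple(b)

    @lru_cache(maxsize=None)  # memoization: each (i, j) pair is computed once
    def dist(i, j):
        # Distance between the first i words of a and the first j words of b.
        if i == 0:
            return j  # only insertions are left
        if j == 0:
            return i  # only deletions are left
        # Assumed naming: a1 = substitution (or match), a2 = deletion, a3 = insertion.
        a1 = dist(i - 1, j - 1) + (a[i - 1] != b[j - 1])
        a2 = dist(i - 1, j) + 1
        a3 = dist(i, j - 1) + 1
        return min(a1, a2, a3)

    return dist(len(a), len(b))


print(levenshtein("a b c".split(), "a x c d".split()))
```

The recursive form is fine for sentence-length inputs; for very long sequences an iterative table would avoid Python's recursion limit.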
Delta-Delta features were proposed by S. Furui in 1986 and by Hermann Ney in 1990. The idea is simply to add the first and second derivatives of the cepstrum to the feature vector. By doing that, they say, the features can capture spectral dynamics and improve overall accuracy.
The only problem is that in a discrete signal space, taking the derivative of the signal increases the spontaneous noise level, so instead of a simple first- and second-order derivative HTK proposed a differentiation filter. This filter is basically a low-pass filter convolved on top of the discrete signal derivative to smooth out the result and remove unwanted noise. In Fig 1 you can see the result of a simple second derivative vs the proposed filter.
The HTK filter for a Delta-Delta feature (order=2, window=2) is a 9-element FIR filter with the following coefficients (Θ is the window size, which is 2 in HTK)
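As a rough illustration of where the nine taps come from, the sketch below assumes the usual HTK regression formula for a delta, d_t = Σ θ·(c_{t+θ} − c_{t−θ}) / (2·Σ θ²) with θ running from 1 to Θ, and builds the Delta-Delta filter by applying that delta kernel twice. The function names are made up for this example.

```python
def delta_kernel(window):
    """Regression weights theta / (2 * sum(theta^2)) for theta = -window..window."""
    norm = 2 * sum(t * t for t in range(1, window + 1))
    return [t / norm for t in range(-window, window + 1)]


def convolve(x, y):
    """Full discrete convolution of two coefficient lists."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


delta = delta_kernel(2)               # 5 taps for window = 2
delta_delta = convolve(delta, delta)  # applying the delta twice gives 9 taps

print(len(delta_delta))
print([round(c, 2) for c in delta_delta])
```

With a window of 2 each delta kernel has 5 taps, so the doubled filter has 5 + 5 − 1 = 9 taps, which matches the 9-element filter mentioned above.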
|• Reverberation:||The effect of sound bouncing off the walls of a room and coming back. The time is roughly between 1 and 2 seconds in an ordinary room. You can use a compressor filter in speech instead of CMVN to normalize in real time.|
IEEE ICASSP ’86 – Isolated Word Recognition Based on Emphasized Spectral Dynamics
IEEE ICASSP ’90 – Experiments on mixture-density phoneme-modelling for 1000-word DARPA task
Desh Raj Blog – Award-winning classic papers in ML and NLP
|• Lattices:||Are a graph containing
|• Arcs:||Go from one state to another state. The arcs of a state can be accessed with an arc iterator, and an arc only retains its next state. Each arc has a weight and an input and output label.|
|• States:||Are simple decimal number starting from
|• Topological Sort:||An FST is
|• Note 1:||You can get max state with
|• Note 2:||You can prune lattices by creating dead-end paths. A dead-end path is a path that never ends up at a final state. After that
|• Link:||Same as arc|
|• Token:||Same as a state. Tokens have costs.|
|• FrameToks:||A linked list that contains all tokens in a single frame|
|• Adaptive Beam:||Used for pruning before the lattice is created and during decoding|
|• NEmitting Tokens:||Non-emitting tokens, or NEmitting tokens, are tokens that are generated from an emitting token in the same frame and have
|• Emitting Tokens:||Emitting tokens are tokens that pass from one frame to the next|
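To make the Arcs and Note 2 entries above more concrete, here is a small Python sketch of an FST whose arcs only know their next state, plus a helper that finds dead-end states (states from which no final state can be reached). The `Arc`, `Fst` and `dead_end_states` names are hypothetical and only mirror the glossary; this is not the OpenFst or Kaldi API.

```python
from dataclasses import dataclass, field


@dataclass
class Arc:
    ilabel: int      # input label
    olabel: int      # output label
    weight: float
    nextstate: int   # an arc only retains where it goes, not where it came from


@dataclass
class Fst:
    arcs: dict = field(default_factory=dict)   # state id -> list of outgoing Arc
    finals: set = field(default_factory=set)   # ids of final states

    def add_arc(self, state, arc):
        self.arcs.setdefault(state, []).append(arc)
        self.arcs.setdefault(arc.nextstate, [])  # make sure the target state exists


def dead_end_states(fst):
    """Return the states from which no final state can be reached."""
    alive = set(fst.finals)
    changed = True
    while changed:  # grow the set of 'alive' states until nothing changes
        changed = False
        for state, arcs in fst.arcs.items():
            if state not in alive and any(a.nextstate in alive for a in arcs):
                alive.add(state)
                changed = True
    return set(fst.arcs) - alive


fst = Fst(finals={2})
fst.add_arc(0, Arc(1, 1, 0.5, 1))
fst.add_arc(1, Arc(2, 2, 0.3, 2))
fst.add_arc(1, Arc(3, 3, 0.9, 3))  # state 3 has no way to a final state
print(dead_end_states(fst))
```

Removing the states this helper reports (and the arcs leading to them) is one way to think about pruning dead-end paths.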
A Simplified Block Diagram of ASR Process in Kaldi
|• Costs:||Negative log probabilities, so a higher cost means a lower probability.|
|• Frame:||Each 10 ms of audio is turned, using MFCC, into a fixed-size vector called a frame.|
|• Beam:||Cutoff would be
|• Cutoff:||The maximum allowed cost. Anything with a cost higher than this value is removed and not processed.|
|• Epsilon:||The zero label in
|• Lattices:||Are the same as FSTs, except that each token is kept in a frame-based array called
|• Rescoring:||A language-model scoring step that is applied after the final state to improve the final result by using a stronger LM than
|• HCLG(FST):||The main FST used in decoding. The iLabels in this FST are TransitionIDs.|
|• Model(MDL):||A model used to convert sound into acoustic costs and TransitionIDs.|
|• TransitionIDs:||A number that contains information about a state and its corresponding PDF id.|
|• Emitting States:||States that have PDFs associated with them and emit phonemes. In other words, states that have their
|• Bakis Model:||An HMM whose state transitions proceed from left to right. In a Bakis HMM, no transitions go from a higher-numbered state to a lower-numbered state.|
|• Max Active:||Used to calculate the cutoff so as to determine the maximum number of tokens that will be processed in the emitting step.|
|• Graph Cost:||The sum of the LM cost, the (weighted) transition probabilities, and any pronunciation cost.|
|• Acoustic Cost:||The cost obtained from the decodable object.|
|• Acoustic Scale:||A floating-point number that multiplies all log-likelihoods (inside the decodable object).|
Fig. 1. Demonstration of Finite State Automata vs Lattices, Courtesy of Peter F. Brown
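The cost-related entries in the glossary above can be tied together with a short Python sketch. The function names are hypothetical, and adding the graph cost to the scaled acoustic cost to get a token's total cost is stated here as an assumption of the sketch. How the cutoff itself is derived from the beam is deliberately left out.

```python
import math


def cost(prob):
    """Cost is the negative log probability: higher cost, lower probability."""
    return -math.log(prob)


def total_cost(graph_cost, log_likelihood, acoustic_scale):
    """Assumed combination: graph cost plus the scaled, negated log-likelihood."""
    acoustic_cost = -acoustic_scale * log_likelihood
    return graph_cost + acoustic_cost


def prune(token_costs, cutoff):
    """Drop every token whose cost is higher than the cutoff."""
    return {state: c for state, c in token_costs.items() if c <= cutoff}


print(cost(0.9) < cost(0.5))            # the likelier event has the lower cost
print(total_cost(1.0, -20.0, 0.1))
print(prune({1: 2.0, 2: 5.0}, cutoff=3.0))
```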
The command below generates a tone signal out of the speaker and receives it back through the mic. Measuring the phase diff will reveal the round-trip latency.
alsa_delay hw:1,0 hw:0,0 44100 256 2 1 1
hw:1,0 refers to the recording device, which can be retrieved from arecord -l, and hw:0,0 refers to the playback device, which again can be retrieved from aplay -l.
44100 is the sampling rate.
256 is the buffer size.
256 works best for me. Lower numbers corrupt the test and higher numbers just bring more latency to the table. I don’t know exactly what
output arguments are but
1 respectively works magically for me. I was just tinkering around and found these numbers. No other numbers work for me.
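To get a feel for why larger buffers bring more latency, here is a back-of-the-envelope calculation in Python. It only converts a buffer size in frames into milliseconds at a given sampling rate; it says nothing about how alsa_delay uses the buffer internally, and `buffer_ms` is a name made up for this example.

```python
def buffer_ms(frames, sample_rate):
    """Duration of one buffer of audio, in milliseconds."""
    return 1000.0 * frames / sample_rate


for frames in (64, 128, 256, 512, 1024):
    print(f"{frames:5d} frames @ 44100 Hz = {buffer_ms(frames, 44100):.2f} ms")
```

At 44100 Hz a 256-frame buffer holds roughly 5.8 ms of audio, and every doubling of the buffer size doubles that figure.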
1. Focusrite Scarlett Solo Latency: 2.5ms
2. Shure SM57 Mic Latency: 2.5ms
3. Overall Delay: 14ms in non-RT mode
You can experiment with the effect of latency using
pactl load-module module-loopback latency_msec=15
To end the loopback mode
pactl unload-module module-loopback
As Always Useful links
Arun Raghavan – Beamforming in PulseAudio
Arch Linux Wiki – Professional Audio, Realtime kernel
Let’s enhance Kaldi. Here are some links along the way. It looks like YouTube has progressed a lot over the last couple of years, so this is basically just a bunch of random videos making up my favorite playlist for learning all the cool stuff under Kaldi’s hood.
Lattices: A more complex form of FSTs. The first-version decoders were based on FSTs (like online decoders). For Minimum Bayes Risk calculation, using Lattices will give you a better-paved way.
faster-decoder: The old decoder; a very simple way to understand how the decoding process is done.
lattice-faster-decoder: The general decoder; same as faster-decoder but outputs lattices instead of
DecodableInterface: An interface that connects the decoder to the features. The decoder uses this Decodable object to pull CMVN features from it.
BestPath: An FST constructed from the best path (the path with maximum likelihood) in the decoded FST.
nBestPath: An FST constructed from the top N best paths in the decoded FST.
GetLinearSymbolSequence: The final step in the recognition process. It takes a BestPath FST or Lattice and outputs the recognized words along with the path weight.
CompactLattices need to be converted using
Strongly Connected Component: A set of states in which every member is reachable from every other member (in both directions).
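The GetLinearSymbolSequence entry above can be illustrated with a small Python sketch. It is not Kaldi's implementation: `linear_symbol_sequence` and its dictionary-based FST are hypothetical, the FST is assumed to be linear (at most one outgoing arc per state, no cycles), label 0 is treated as epsilon, and weights are treated as costs that add up along the path.

```python
def linear_symbol_sequence(arcs, start, final_weights):
    """Walk a linear FST and collect its labels and total weight.

    arcs: dict mapping state -> (ilabel, olabel, weight, nextstate);
          the last state of the path has no entry.
    final_weights: dict mapping final state -> final weight.
    """
    ilabels, olabels, total = [], [], 0.0
    state = start
    while state in arcs:
        ilabel, olabel, weight, nextstate = arcs[state]
        if ilabel != 0:          # skip epsilon input labels
            ilabels.append(ilabel)
        if olabel != 0:          # skip epsilon output labels
            olabels.append(olabel)
        total += weight          # costs add up along the path
        state = nextstate
    total += final_weights.get(state, 0.0)
    return ilabels, olabels, total


# A three-arc path: input labels play the role of TransitionIDs, output labels of words.
path = {0: (5, 0, 1.0, 1), 1: (6, 10, 0.5, 2), 2: (0, 11, 0.25, 3)}
print(linear_symbol_sequence(path, start=0, final_weights={3: 0.25}))
```

In this toy path the output labels would then be mapped back to words through a symbol table to get the recognized text.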
Thanks to this marvelous framework, a trained model is now available with a WER of absolute zero percent over a 10-minute continuous speech file. The final piece of this puzzle would be implementing a semi-online decoding tool using GStreamer. As always, useful links for further inspection
Here I am, once more pursuing old-fashioned machine learning. I’ll keep it short and write down useful links