Speech Recognition II (Developing Kaldi)

Posted by Bijan in Speech Recognition

Let’s enhance Kaldi. Here are some links I collected along the way. YouTube seems to have improved a lot over the last couple of years, so this is basically a bunch of random videos that make up my favorite playlist for learning all the cool stuff under Kaldi’s hood.

YouTube

Keith Chugg (USC) – Viterbi Algorithm
Lim Zhi Hao (NTU) – WFST: A Nice Channel On Weighted Finite State Transducers
Dan Povey (JHU) – ICASSP 2011 Kaldi Workshop: Dan Explaining Kaldi Basics
Luis Serrano – The Covariance Matrix: To Understand GMM Acoustic Modeling

Kaldi

Mehryar Mohri (NYU) – Speech Recognition with WFST: A joint work of RWTH and NYU
Mehryar Mohri (NYU), Afshin Rostamizadeh – Foundations of Machine Learning
George Doddington (US DoD) ICASSP 2011 – Human Assisted Speaker Recognition
GitHub Kaldi – TED-LIUM Result: GMM, SGMM, Triple Deltas Comparison
EE Columbia University – Speech Recognition Spring 2016
D. Povey – Generating Lattices in the WFST: For understanding LatticeFasterDecoder

Notes

Lattices: A more complex form of FST. The first-generation decoders (such as faster-decoder and the online decoders) were based on plain FSTs. For Minimum Bayes Risk (MBR) calculations, working with lattices gives you a much smoother path.
faster-decoder: The old decoder. It is very simple, which makes it a good place to see how the decoding process works.
lattice-faster-decoder: The general-purpose decoder. Same as faster-decoder, but it outputs lattices instead of FSTs.
DecodableInterface: An interface that connects the decoder to the acoustic model and the features. The decoder uses the Decodable object to pull per-frame log-likelihoods, which are computed from the (CMVN-normalized) features.
BestPath: An FST constructed from the best path (the path with maximum likelihood) in the decoded FST.
nBestPath: An FST constructed from the N best paths in the decoded FST.
GetLinearSymbolSequence: The final step in the recognition process. It takes a BestPath FST or Lattice and outputs the recognized words along with the path weight. CompactLattices need to be converted first using ConvertLattice.
Strongly Connected Component: A set of states in which every member is reachable from every other member (in both directions).
The main function in the decoder is ProcessEmitting, which pulls log-likelihoods from the decodable object.
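
To make the BestPath and GetLinearSymbolSequence notes a bit more concrete, here is a minimal sketch in plain Python. It is not Kaldi or OpenFst code: the lattice layout, the `best_path` function, the `<eps>` label, and the toy costs are all made up for illustration. It assumes an acyclic lattice whose state ids are already topologically sorted and whose arc costs are negative log-likelihoods (lower is better), then picks the cheapest path and reads off its words with the total path weight.

```python
from typing import Dict, List, Tuple

# Arc = (next_state, word, cost). Cost is a negative log-likelihood,
# so the "best" path is the one with the lowest total cost.
Arc = Tuple[int, str, float]
EPS = "<eps>"  # hypothetical label for arcs that emit no word


def best_path(arcs: Dict[int, List[Arc]], start: int, final: int) -> Tuple[List[str], float]:
    """Return (words, total_cost) of the cheapest start-to-final path.

    Assumes the lattice is acyclic and that every arc goes from a lower
    state id to a higher one, which holds for the toy lattice below.
    """
    best_cost = {start: 0.0}
    back = {}  # state -> (previous_state, word on the arc taken)
    for state in sorted(arcs):
        if state not in best_cost:
            continue  # not reachable from the start state
        for nxt, word, cost in arcs[state]:
            cand = best_cost[state] + cost
            if cand < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = cand
                back[nxt] = (state, word)
    if final not in best_cost:
        raise ValueError("final state is not reachable")

    # Walk the back-pointers to get the linear word sequence, skipping epsilons.
    words = []
    state = final
    while state != start:
        state, word = back[state]
        if word != EPS:
            words.append(word)
    words.reverse()
    return words, best_cost[final]


# Toy lattice with two competing words at each end and an epsilon arc between.
LATTICE = {
    0: [(1, "hello", 1.0), (1, "yellow", 1.5)],
    1: [(2, EPS, 0.25)],
    2: [(3, "world", 0.5), (3, "word", 0.75)],
}

words, cost = best_path(LATTICE, 0, 3)
print(words, cost)  # ['hello', 'world'] 1.75
```

Real decoders do the same kind of bookkeeping frame by frame while the graph is being explored, and an N-best variant would keep several back-pointers per state instead of one.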


