This is the golden formula of speech recognition:

W* = argmax_W P(X|W) P(W)

The argmax means: find the word sequence W that maximizes the score. Here X is the observed acoustic signal, P(X|W) is the probability of that signal given a word sequence, and P(W) is the language-model prior. Naively we would enumerate every possible word sequence and, for each one, compute the probability of seeing the observed acoustic signal. This is an extremely computation-intensive process, so HMMs and CTC are used to shrink the search space. The process of finding the best sequence is called decoding in the speech recognition research field.
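The argmax over candidate sequences can be sketched directly. This is a toy illustration with made-up scores (the sequences, the acoustic log-likelihoods, and the language-model log-priors are all hypothetical); a real decoder searches an HMM/WFST graph instead of enumerating every sequence.

```python
# Hypothetical toy scores: acoustic log p(x|w) and language-model log p(w)
# for three candidate word sequences (all values invented for illustration).
log_p_x_given_w = {"hello world": -12.3, "hollow word": -11.9, "yellow whirled": -15.2}
log_p_w = {"hello world": -2.1, "hollow word": -6.8, "yellow whirled": -9.5}

def decode(candidates):
    """Brute-force argmax: pick the w maximizing log p(x|w) + log p(w).
    Real decoders never enumerate like this; HMM/WFST search makes it tractable."""
    return max(candidates, key=lambda w: log_p_x_given_w[w] + log_p_w[w])

print(decode(log_p_x_given_w))  # "hello world" (-12.3 + -2.1 = -14.4 is the best total)
```

Note that "hollow word" has the best acoustic score alone; the language-model prior is what tips the decision to "hello world".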
Transition Matrix
An HMM is just a bunch of states with transitions from one state to another. A transition is taken on every emitting step, and all of the transition probabilities can be collected into a matrix called the transition matrix.
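A minimal sketch of such a matrix, assuming a hypothetical 3-state left-to-right topology (the probabilities are invented): entry A[i][j] is the probability of moving from state i to state j, so each row must sum to 1.

```python
# Hypothetical 3-state left-to-right HMM transition matrix.
A = [
    [0.7, 0.3, 0.0],  # state 0: self-loop (0.7) or advance to state 1 (0.3)
    [0.0, 0.6, 0.4],  # state 1: self-loop (0.6) or advance to state 2 (0.4)
    [0.0, 0.0, 1.0],  # state 2: absorbing final state
]

# Each row is a probability distribution over next states.
for row in A:
    assert abs(sum(row) - 1.0) < 1e-9

def transition_prob(i, j):
    return A[i][j]

print(transition_prob(0, 1))  # 0.3
```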
• Occupation counts:
.occs is the file of per-transition-id occupation counts. They are rarely needed; e.g. they might be used somewhere in the basis-fMLLR scripts.
• fMLLR:
Feature-space Maximum Likelihood Linear Regression: a speaker-adaptation technique that applies an affine transform to acoustic features (such as MFCCs), rather than a feature extraction method itself.
• Beam:
The pruning cutoff is the best cost plus the beam (with costs in -log space; the beam is typically around 10 to 16). Partial paths whose cost falls outside the cutoff are discarded.
• Deterministic FST:
An FST in which each state has at most one transition for any given input label, and there are no input epsilon labels.
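The two determinism conditions above can be checked mechanically. A minimal sketch, assuming arcs are stored as (state, input_label, output_label, next_state) tuples and that label 0 denotes epsilon (an assumed convention, borrowed from OpenFst):

```python
# Check the two determinism conditions: no input epsilons, and no two arcs
# leaving the same state with the same input label.
def is_deterministic(arcs):
    seen = set()
    for state, ilabel, _olabel, _next_state in arcs:
        if ilabel == 0:              # input epsilon label: not allowed
            return False
        if (state, ilabel) in seen:  # duplicate input label from this state
            return False
        seen.add((state, ilabel))
    return True

det_arcs = [(0, 1, 10, 1), (0, 2, 11, 2), (1, 3, 12, 2)]
nondet_arcs = [(0, 1, 10, 1), (0, 1, 11, 2)]  # two arcs from state 0 with input 1
print(is_deterministic(det_arcs), is_deterministic(nondet_arcs))  # True False
```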
Questions
- Why use negative log probabilities: for numerical stability. Products of many tiny probabilities underflow in floating point, while sums of their negative logs do not; multiplications also become cheaper additions.
- What's the difference between a WFSA and a WFST: an acceptor has a single label per arc, while a transducer has both an input and an output label per arc.
- In implementations, a WFSA is sometimes represented as a WFST in which every arc's input and output labels are identical. This lets the WFSA reuse the WFST implementation without any changes.
- What are the input and output labels in a WFST: inputs are usually phonemes and outputs are words. Along a sequence of phoneme arcs, the output is usually epsilon everywhere except the arc that emits the word.
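The numerical-stability point above is easy to demonstrate: multiplying many small probabilities underflows to 0.0, while summing their negative logs stays well-behaved (the per-frame probability and frame count below are invented for illustration).

```python
import math

p = 1e-5   # hypothetical per-frame probability
n = 100    # hypothetical number of frames

# Direct product: 1e-5 ** 100 = 1e-500, far below the smallest
# positive double (~5e-324), so it underflows to exactly 0.0.
direct = 1.0
for _ in range(n):
    direct *= p

# Negative-log form: the product becomes a stable sum of costs.
neg_log_cost = sum(-math.log(p) for _ in range(n))

print(direct)        # 0.0 (underflow)
print(neg_log_cost)  # ~1151.29, i.e. -log of the true product 1e-500
```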
References
- OxinaBox: Kaldi-Notes, Train
- Vassil Panayotov: Decoding graph construction in Kaldi: A visual walkthrough
- Jonathan Hui (Medium): Speech Recognition: GMM-HMM
- Mehryar Mohri: Weighted finite-state transducers in speech recognition