The command below plays a tone signal out of the speaker and records it back through the mic. Measuring the phase difference reveals the round-trip latency.
alsa_delay hw:1,0 hw:0,0 44100 256 2 1 1
Here hw:1,0 refers to the recording device, which can be found with arecord -l, and hw:0,0 refers to the playback device, which can likewise be found with aplay -l.
44100 is the sampling rate and 256 is the buffer size. 256 works best for me: lower numbers corrupt the test and higher numbers just bring more latency to the table. I don't know exactly what the nfrags, input and output arguments are, but 2, 1 and 1 respectively work magically for me. I was just tinkering around and found these numbers; no others work for me.
1. Focusrite Scarlett Solo latency: 2.5 ms
2. Shure SM57 mic latency: 2.5 ms
3. Overall delay: 14 ms in non-RT mode
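As a rough back-of-the-envelope check (my own arithmetic, not something alsa_delay reports), a single 256-frame period at 44100 Hz already holds about 5.8 ms of audio, and up to nfrags of those periods can be buffered, so the buffer settings alone account for a good chunk of that overall delay:
echo "scale=2; 256 * 1000 / 44100" | bc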
You can tinker with the effect of latency using
pactl load-module module-loopback latency_msec=15
To end the loopback mode
pactl unload-module module-loopback
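Side note: if more than one loopback is loaded, pactl load-module prints the index of the module it just created, so you can capture it and unload only that instance (a small sketch using the same 15 ms example value):
idx=$(pactl load-module module-loopback latency_msec=15)
pactl unload-module "$idx"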
As always, useful links:
Arun Raghavan – Beamforming in PulseAudio
Arch Linux Wiki – Professional Audio, Realtime kernel
Let's enhance Kaldi. Here are some links I picked up along the way. It looks like YouTube has progressed a lot over the last couple of years, so basically here is a bunch of videos making up my favorite playlist for learning all the cool stuff under Kaldi's hood.
LatticeFasterDecoder and related concepts:
Lattices: A more complex form of FSTs. The first-version decoders were based on FSTs (like faster-decoder and the online decoders). For Minimum Bayes Risk calculation, using lattices gives you a better-paved way.
faster-decoder: The old decoder, very simple for understanding how the decoding process is done.
lattice-faster-decoder: The general decoder; same as faster-decoder but outputs lattices instead of FSTs.
DecodableInterface: An interface that connects the decoder to the features; the decoder uses this Decodable object to pull CMVN features from it.
BestPath: An FST constructed from the best path (the path with maximum likelihood) in the decoded FST.
nBestPath: An FST constructed from the top N best paths in the decoded FST.
GetLinearSymbolSequence: The final step in the recognition process; it takes a BestPath FST or Lattice and outputs the recognized words together with the path weight. CompactLattices need to be converted first using ConvertLattice.
Strongly Connected Component: A set in which every component is reachable (in both directions) from its members.
ProcessEmitting: The step that pulls log-likelihoods from the decodable object.

Thanks to this marvelous framework, a trained model is at my disposal with a WER of absolutely zero percent over a 10-minute continuous speech file. The final piece of this puzzle will be implementing a semi-online decoding tool using GStreamer; see the command-line sketch below for how the lattice pieces fit together. As always, useful links for further inspection.
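A hedged sketch, assuming standard Kaldi binaries and decoded lattices already on disk; lat.1 and words.txt are placeholder file names, not from my setup:
# Single best path per utterance (maximum likelihood), printed as words via the symbol table
lattice-best-path ark:lat.1 ark,t:- | utils/int2sym.pl -f 2- words.txt
# Top 10 paths instead; nbest-to-linear splits every n-best entry into alignments and word ids
lattice-to-nbest --n=10 ark:lat.1 ark:- | nbest-to-linear ark:- ark:/dev/null ark,t:- | utils/int2sym.pl -f 2- words.txt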
gst_caps_to_string(caps) converts a GstCaps structure into a readable string, which is handy for debugging what the pipeline actually negotiated.

Here I am, pursuing old-fashioned machine learning once more. I'll keep it short and just write down the useful links.