DNN or GMM decoding errors with leading silence

Question

I found that when running the decoder (either GMM or DNN) on audio that contains long silence at the beginning and end, there are a lot of deletions in the result. Just by manually removing the silence, the WER dropped from 65% to 15%.

I guess this might be a problem with feature extraction. Do you have a good solution for this that does not involve manipulating the audio?


Answer

I'm surprised it would cause deletions and not insertions. My expectation would be that the cepstral mean subtraction would normalize the signal to be louder than normal, generating insertions rather than deletions.
Is it deleting occasional words, or whole chunks? Is it deleting the last part of the sentence? If it is deleting the last part of the sentence, make sure to use an up-to-date version of the code; there was a decoder bug-fix that is possibly relevant to this.

Answer

Thanks for your quick response.

there was a decoder bug-fix that is possibly relevant to this.

I will try updating the code and see if it fixes the problem.

My expectation would be that the cepstral mean subtraction would normalize the signal to be louder than normal, generating insertions rather than deletions.

But the DNN decoder does not perform online CMVN, does it?

Is it deleting occasional words, or whole chunks?

It deleted whole chunks, sometimes the whole utterance.

Answer

If you still have a problem after updating the code, please respond with more details. There are many different decoders and different setups, and without more details it's not possible to respond.

Answer

I've updated the Kaldi code, and the problem is still there.

WER:
- Audio with long silence: %WER 66.91 [ 550 / 822, 81 ins, 180 del, 289 sub ]
- Audio without long silence: %WER 15.69 [ 129 / 822, 32 ins, 25 del, 72 sub ]
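For anyone reading the two score lines above, here is a small sketch of the arithmetic behind them: WER is the sum of insertions, deletions and substitutions divided by the number of reference words. The function name `wer_percent` is just a label chosen for this illustration, not part of any Kaldi tool.

```python
def wer_percent(insertions: int, deletions: int, substitutions: int, ref_words: int) -> float:
    """Word error rate in percent: (ins + del + sub) / reference word count."""
    if ref_words <= 0:
        raise ValueError("ref_words must be positive")
    return 100.0 * (insertions + deletions + substitutions) / ref_words


# Counts copied from the two score lines in the post.
with_silence = wer_percent(81, 180, 289, 822)
trimmed = wer_percent(32, 25, 72, 822)

print(f"with long silence: {with_silence:.2f}%")
print(f"silence removed:   {trimmed:.2f}%")
```

Deletions account for the largest single drop between the two runs (180 versus 25), which matches the "whole chunks" behaviour described earlier in the thread.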

Decoding conditions:
--- Acoustic model: DNN - trained by using script local/online/run_nnet2.sh.
--- Decoder: online2-wav-nnet2-latgen-faster (using script steps/online/nnet2/decode.sh with --per-utt true)

iVector extractor config (kept the script defaults, nothing changed):
--num-gselect=5
--min-post=0.025
--posterior-scale=0.1
--max-remembered-frames=1000

Answer

I can confirm that this does happen. The explanation (when using CMN) is that with very long silences, the accumulated CMN stats reflect the silence more than the speech. So the feature values the model expects in the spoken part differ from what it actually receives. CMN is not the same as gain control. The training conditions are now mismatched with the testing conditions, so a whole bunch of low-scoring hypotheses stay active. The simple solution to this problem is to run a cheap silence detector to strip the silences (but leave some around the utterance) before decoding.
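A toy illustration of the CMN effect described above, assuming plain per-utterance mean subtraction on a single feature dimension. The numbers are made up for the example and do not come from real MFCCs.

```python
from statistics import mean


def cmn(frames):
    """Subtract the mean over all frames (per-utterance mean normalization, one dimension)."""
    mu = mean(frames)
    return [x - mu for x in frames]


# Hypothetical toy values: speech frames sit around 10, silence frames at 0.
speech = [10.0, 12.0, 8.0, 10.0]
silence = [0.0] * 36

speech_only = cmn(speech)
padded = cmn(silence + speech)

# Mean of the speech frames after normalization, in each case.
print(mean(speech_only))              # speech alone is centred on zero
print(mean(padded[-len(speech):]))    # with long silence the same frames are shifted
```

With the silence included, the mean is dominated by silence frames, so the speech frames end up offset relative to what a model trained on mostly-speech utterances would see.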

I have noticed this very consistently, not just with Kaldi but with a plethora of other decoders. It's absolutely normal.
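As a rough sketch of the "cheap silence detector" idea, here is a simple frame-energy trimmer. Everything in it is an assumption made for illustration: samples are taken to be floats in [-1, 1], and the frame length, RMS threshold and padding are placeholder values that would need tuning on real audio. It is not a Kaldi tool.

```python
def trim_silence(samples, sample_rate, frame_ms=25, threshold=0.01, pad_ms=200):
    """Drop leading/trailing low-energy audio, keeping pad_ms of context around speech."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    voiced_starts = []
    # Mark each non-overlapping frame whose RMS energy reaches the threshold.
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= threshold:
            voiced_starts.append(start)
    if not voiced_starts:
        return samples  # nothing above threshold; leave the audio untouched
    pad = int(sample_rate * pad_ms / 1000)
    begin = max(0, voiced_starts[0] - pad)
    end = min(len(samples), voiced_starts[-1] + frame_len + pad)
    return samples[begin:end]


# Toy signal at a hypothetical 1 kHz rate: 2 s silence, 0.5 s "speech", 2 s silence.
speech = [0.5 if i % 2 == 0 else -0.5 for i in range(500)]
audio = [0.0] * 2000 + speech + [0.0] * 2000
trimmed = trim_silence(audio, sample_rate=1000)
print(len(audio), len(trimmed))
```

The padding keeps a little silence around the utterance, as suggested above, so the decoder still sees a short stretch of silence at each end.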

Answer

The CMVN in the online-nnet2 decoder uses a short sliding window, so I don't think that can be the only reason.
I think the silence is throwing off the iVector extraction. We have seen this before in the ASPIRE challenge. If you see silence that is different from what you saw in training, it can cause weird effects in iVector extraction. I actually have some changes to the online-decoding setup to support estimating the iVector only on speech, to try to solve this problem, but they have not been checked in yet. Vijay and I need to do some more testing first.
For now I suggest that you add the option --max-count=100 (or maybe even smaller, --max-count=10, which has a stronger effect) in the iVector extraction config file (it's not the top-level config file, but it's mentioned in that file).
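A minimal sketch of applying the suggested option, assuming the iVector extraction config is a plain text file with one `--name=value` option per line, as in the listing posted earlier in the thread. The helper `set_kaldi_option` is hypothetical and works on the config text only; the actual file path depends on your setup and is not shown here.

```python
def set_kaldi_option(config_text: str, name: str, value) -> str:
    """Set --name=value in option-per-line config text, replacing any existing entry."""
    prefix = f"--{name}="
    new_line = f"{prefix}{value}"
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(prefix):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Config contents as posted earlier in the thread.
config = """--num-gselect=5
--min-post=0.025
--posterior-scale=0.1
--max-remembered-frames=1000
"""

updated = set_kaldi_option(config, "max-count", 100)
print(updated, end="")
```

Calling it again with a different value (for example 10) overwrites the existing `--max-count` line instead of adding a second one.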

BTW, Vijay, something we should also look into at some point is modifying the training to include longer periods of silence. I think one problem might be that the iVector extractor never saw extended periods of silence, and due to the interaction with the sliding-window CMN, there are certain types of data that it never saw.

Answer

Thanks for your advice. I set max-count=5, and it does help.
Looking forward to the changes in the online-decoding setup.
