
Optimizing long-audio handling in Kaldi nnet2 online speech recognition


We are now using the nnet2 code and want to perform online recognition of conversations. The online-wav-gmm-decode-faster uses OnlineFasterDecoder, which is able to detect end of utterance (as opposed to just endpointing the recording). Is there a way to perform the utterance detection in online decoding using nnet2 models? Please advise. Thanks.


I think what you are looking for is the --do-endpointing flag in online2-wav-nnet2-latgen-faster.

Dan


online2-wav-nnet2-latgen-faster using "SingleUtterance" decoder -- need to
run utterance detection on conversations
https://sourceforge.net/p/kaldi/discussion/1355348/thread/d4658e60/?limit=25#0f6e



Looking at the code in the program (https://github.com/vimal-manohar91/kaldi-git/blob/master/src/online2bin/online2-wav-nnet2-latgen-faster.cc), when an endpoint is detected, the data processing for "utt" ends; but the "utt" corresponds identically to the entire wave file (spk2utt = utt2spk = "wav1 wav1\nwav2 wav2\n..."). Logic returns to the main loop and gets the next utterance, in our case the next wave file. Am I missing something?


Look at the lines:
if (do_endpointing && decoder.EndpointDetected(endpoint_config))
break;
In this case it may quit before the end of the wave file.
Dan
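EndpointDetected() in Kaldi applies configurable rules over the decoder traceback (for example, how much trailing silence has been decoded). As a rough illustration of the general idea only, and not Kaldi's implementation (which inspects decoded frames, not raw energy), a trailing-silence check over audio samples could look like this; all names and thresholds here are made up:

```python
def trailing_silence_seconds(samples, sample_rate=16000,
                             frame_ms=10, energy_thresh=1e-4):
    """Duration of consecutive low-energy frames at the end of `samples`
    (a sequence of floats in [-1, 1]).

    Toy stand-in for Kaldi's EndpointDetected(), which actually works on
    the decoder traceback rather than on signal energy.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    silent = 0
    # Walk backwards from the end, counting frames below the energy threshold.
    for i in reversed(range(n_frames)):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy < energy_thresh:
            silent += 1
        else:
            break
    return silent * frame_len / sample_rate

def endpoint_detected(samples, min_trailing_silence=0.5, sample_rate=16000):
    """Declare an endpoint once trailing silence exceeds a threshold."""
    return trailing_silence_seconds(samples, sample_rate) >= min_trailing_silence
```

In a real-time loop, a `True` result would be the point to break out of decoding, emit the lattice for the utterance so far, and start a fresh decode, which is what the `break` in the snippet above achieves.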


Hi Dan,
I'm having a similar problem: online2-wav-nnet2-latgen-faster is not giving the full, expected output for a 6-minute WAV. I found this thread in my efforts to understand the issue.

I followed the directions @ http://kaldi.sourceforge.net/online_decoding.html (Excellent documentation BTW!) and was able to get everything working! Then I swapped in my 8k mono WAV file (6mins), in lieu of the ENG_M (16 seconds) and I'm having trouble getting an output lattice for anything more than the first 45 seconds.

To debug, I stepped through the code that sections the WAV into parts and passes them to the frontend. I tried out the online flag so that the chunk_length would get set correctly. I turned off the code you listed above (decoder.EndpointDetected && do_endpointing) so that it wouldn't quit before the end of the file. I put in debugging logs so I could see the progress of the while loop that decodes the wave in order. I also stepped through and logged the while loop in decoder.AdvanceDecoding(), and all seem to decode and report "num frames decoded" correctly.

Questions:

  • Is there a limit to the amount of audio features that SingleUtterance can decode? I didn't seem to see any limitations, especially considering that a while loop to progressively decode is perfect.

  • More precisely, is there a maximum file size that I can use as input? Is this why there are provisions to pass multiple (pre-segmented) files in via the command-line?

  • Is the concept of an utterance tied to a VAD opening, or is it a programmatic notion, so that a single WAV passed in with multiple VAD open/closes will be considered a single utterance (i.e. uttlist.size() == 1)?


Thanks again for any guidance, and your efforts on this great project!
cheers


Hi,
There are various reasons why long files can fail to be decoded correctly.
The programs are routinely tested for files up to 20 seconds or so. The right solution is to break your input up into chunks first, or in real-time applications, modify the decoder to terminate the processing when you've reached a silence (using the endpoint detection) and start a new decoding.

Reasons why decoding might fail in general for longer utterances include:
- Roundoff errors (might be fixed by compiling in double)
- Graph search errors due to pruning, where you get stuck in a part of the graph that can't reach back to the main loop (might be fixed by increasing --min-active from its default value of 200).
However, for the online-nnet2 setup there is another reason decoding might fail for longer utterances, which we only recently discovered. It seems that the iVector estimation when applied to longer utterances can produce iVectors which are somehow not typical of those seen in training, and which can mess up the decode.
I recently added the --max-count option, which you have to set in the iVector extraction config, not on the main command line of the program. Try setting it to 100 or 200, and see if it helps. I will be interested to hear how you do.
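The chunking workaround Dan describes (splitting long input into shorter pieces before decoding) can be sketched with Python's standard-library wave module. This is a minimal sketch with illustrative filenames; fixed-length boundaries can cut through words, so in practice you would cut at silences found by endpointing or a VAD:

```python
import wave

def split_wav(path, chunk_seconds=20, out_prefix="chunk"):
    """Split a WAV file into consecutive pieces of at most `chunk_seconds`,
    writing chunk_0000.wav, chunk_0001.wav, ...

    Each piece can then be decoded separately, e.g. by listing the pieces
    in the wav.scp passed to online2-wav-nnet2-latgen-faster.
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        idx = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{idx:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)      # same rate/width/channels as source
                dst.writeframes(frames)    # wave fixes up nframes on close
            out_paths.append(out_path)
            idx += 1
    return out_paths
```

The 20-second default follows Dan's remark that the programs are routinely tested on files of up to about that length.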

Dan


Hello Daniel!

I've met a similar problem with recognition of long audio files being truncated. I'm using mostly unchanged decoding configurations from the WSJ recipe, but recognition fails for files longer than 8-10 seconds.

For instance, I recorded audio counting from one to twenty. After running online recognition I got numbers recognized from ONE to SIXTEEN only. Even if I crop the beginning so that counting starts from "two", "three" or even "four", the last recognized word is still SIXTEEN.

Unfortunately, the following workarounds also had no effect:

  1. Recompiling Kaldi with KALDI_DOUBLEPRECISION=1 didn't help.

  2. I tried various combinations of --min-active (100, 200, 1000) and --max-count (100, 200, 1000), but it only made the recognition results worse.

I see the same behaviour with online2-wav-nnet2-latgen-faster and nnet-latgen-faster.

The code is updated from trunk and current as of today (2 March).

It should be noted that my training set consists of phrases no longer than 5-7 seconds. If I extend the set with longer cases, would that help?

Thank you for your help!


Hm. Try decoding just the part of the file after SIXTEEN and see what happens. Is it correctly recognized then?
Also you could try running decoding with --verbose=6. You should see lines like this.
VLOG[6]
(online2-wav-nnet2-latgen-threaded:GetCutoff():lattice-faster-online-decoder.cc:826)
Number of tokens active on frame 35 is 6454
If towards the end of the file you see a small number of tokens active (e.g. less than 200) then it is definitely a search failure.
BTW, make sure to do "make depend -j 20" before re-making... sometimes dependencies are not tracked correctly.
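To check for this kind of search failure automatically, the --verbose=6 output can be scanned for frames where the active-token count drops low. A small sketch (the log-line format is copied from Dan's example above; the function name and threshold are illustrative):

```python
import re

# Matches log lines such as:
#   Number of tokens active on frame 35 is 6454
TOKEN_RE = re.compile(r"Number of tokens active on frame (\d+) is (\d+)")

def low_token_frames(log_text, threshold=200):
    """Return (frame, token_count) pairs where the decoder had fewer than
    `threshold` active tokens, which would suggest a search failure."""
    hits = []
    for m in TOKEN_RE.finditer(log_text):
        frame, count = int(m.group(1)), int(m.group(2))
        if count < threshold:
            hits.append((frame, count))
    return hits
```

You would feed this the stderr captured from a decoding run with --verbose=6; an empty result toward the end of the file argues against a search failure, as in the report below.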


Daniel, thanks for your reply.

Yes, counting started from sixteen was recognized up to TWENTY successfully (sort of). I made a small set of audio files, each counting upward from a different starting value between 1 and 20. The results are in the attachment.

I also ran decoding with the --verbose=6 option and saw that the number of active tokens is 2000 or more up to the very last frame.

Re-running "make depend; make" didn't help.



OK, so it's likely not a search error.
What it could be is that the iVector extraction is causing a problem.
Something we've seen is that when test data is very different from training data, the iVectors don't always get estimated in a way that leads to good results. Try, in the iVector extraction config (one of the config files provided on the command line, the same one in which the --ivector-extractor option is provided), adding the option "--max-count=100". This increases the scale of the prior term in iVector extraction when the data count gets large, and keeps the iVector to a more sane value.
I am currently working on estimating the iVector only from the speech (excluding the silence), which seems to increase the robustness to unseen acoustic conditions.

Hm... You know, setting --max-count helped, but in a slightly unexpected way.

I tried setting different values for --max-count as you advised earlier in this thread; I used 100, 200 and even 1000 but got no results. Today I decided to try again and set --max-count=1 just to see what would happen and, you know, the problem is gone!

I did some investigation: when --max-count is more than 60, or is zero, the tail of the audio file is not recognized. A value of about 10 fixes the problem with optimal recognition results.

OK. A smaller max-count (if nonzero) has more effect, so this is not surprising. It likely means your test data is too mismatched to your training data, or your training data had too little of some important variability (e.g. volume). How much training data did you have for this system, and was it similar to the test data? Dan

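The effect Dan describes can be caricatured with a one-dimensional MAP estimate: capping the frame count bounds the weight of the data relative to the prior, so a smaller cap shrinks the estimate more. This is only an illustrative sketch, not Kaldi's actual iVector math, and every name and number below is made up:

```python
def effective_count(count, max_count):
    """With a --max-count-style cap, sufficient statistics are scaled down
    once the frame count exceeds max_count, so the prior keeps a fixed
    minimum relative weight. max_count == 0 disables the cap."""
    if max_count > 0 and count > max_count:
        return max_count
    return count

def shrunk_estimate(data_mean, count, max_count,
                    prior_mean=0.0, prior_count=1.0):
    """MAP-style estimate pulled toward the prior; a smaller max_count
    gives the prior more relative weight."""
    c = effective_count(count, max_count)
    return (c * data_mean + prior_count * prior_mean) / (c + prior_count)
```

With data_mean = 5.0 over 60000 frames (roughly a 10-minute file), max_count = 0 leaves the estimate near 5.0, max_count = 100 pulls it to about 4.95, and max_count = 10 pulls it to about 4.55, consistent with the observation above that only small values changed the results noticeably.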

Hi Dan, thanks for your great answer! I added the --max-count option to my iVector extraction config and tried both 100 and 200; however, that shortened the length of the transcription. Do you have an example audio file where the iVectors generated are not typical of those from training?

So I've started exploring segmentation of the input file and have been reading the docs on src/featbin/extract-segments. Do you have any good examples of piped usage in concert with online2-wav-nnet2-latgen-faster?
