Kaldi voice activity detection (VAD) and noise-handling ideas


Background

The Kaldi open-source speech recognition project is broad and deep. 锐英源 has been studying it closely and shares these study notes here, with thanks to the Kaldi open-source project team.



Is there any tool, script, or model in Kaldi for voice activity detection? If such tools exist, how can we use them? It would be very useful to recognize only the speech segments in audio files that also contain music or noise.

 

Not an expert here, so there may be one built in. However, it is fairly simple to pipe the audio through Sox for VAD and build that into the recipes, or even into the "wav.scp", with fairly good results.
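
As a hedged illustration of that trick: Kaldi accepts shell pipelines in wav.scp (an entry ending in "|" is executed as a command), and Sox's vad effect can sit in that pipeline. The utterance ID and file path below are placeholders; note that vad only trims leading silence, hence the reverse/vad/reverse idiom to trim the tail as well.

# Sketch: one wav.scp entry whose pipeline lets sox strip leading and
# trailing silence before Kaldi reads the audio (names are placeholders).
echo 'utt001 sox /data/utt001.wav -t wav - vad reverse vad reverse |' \
  > data/train/wav.scp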

 

Has anyone tried to use the Kaldi decoders in noisy conditions without VAD?
Currently, we are using Kaldi in a dialogue system with users often calling
us from noisy streets, and we would like to eliminate our VAD.
Can anyone suggest how to model the silence and noise and how to train the
corresponding models (estimate the silence model more robustly)?
Is any special training available in the Kaldi toolkit? If not, how hard
would it be to implement?

 

The basic approach we've used for things like BABEL is to train a system
that has a silence phone, and possibly noise phones also (but these would
need to be marked as words in the training transcripts). You can have
different noise phones for different categories of noise: things like
cough, laugh, non-speech noise, etc. Then just run the recognizer on the
data.
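
To make "marked as words in the training transcripts" concrete, here is a minimal sketch of the dict-directory entries this could involve; the tags (<noise>, <laugh>, <cough>) and phone names (NSN, LAU, CGH) are placeholder choices, and the other files prepare_lang.sh needs (nonsilence_phones.txt etc.) are omitted.

# Noise events become words with dedicated phones, so they are trained
# like any other unit (all names here are illustrative).
mkdir -p data/local/dict
cat > data/local/dict/lexicon.txt <<EOF
<cough> CGH
<laugh> LAU
<noise> NSN
hello HH AH L OW
EOF
cat > data/local/dict/silence_phones.txt <<EOF
SIL
CGH
LAU
NSN
EOF
echo SIL > data/local/dict/optional_silence.txt
# The transcripts then carry the markers as ordinary words, e.g.:
#   utt001 hello <noise> hello <laugh>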

A couple of people have offered to contribute code for voice activity
detection, but I have been cautious about accepting it since whatever is in
there will be hard to take out or stop supporting, and I'd rather wait till
it seems like there is a standard, definitive approach.

 

 

So we can train models for different categories of noise, but we don't know how to do it. For example, for the silence model we don't do anything, we don't write any silence phones into the transcriptions, and yet it seems to get trained.
We have some audio files with non-clean speech or noise, so how should we best process them?

We found some code in Kaldi for VAD (compute-vad.cc): http://kaldi.sourceforge.net/tools.html. Is there any example that uses this functionality?
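
For what it's worth, the speaker-ID recipes drive that binary through sid/compute_vad_decision.sh; a direct invocation looks roughly like the sketch below (paths are placeholders, and the option values are ones commonly seen in those recipes).

# Energy-based, frame-by-frame VAD over existing MFCC features; writes a
# 0/1 decision per frame (paths are placeholders).
compute-vad --vad-energy-threshold=5.5 --vad-energy-mean-scale=0.5 \
  scp:data/train/feats.scp ark,scp:exp/vad/vad.ark,exp/vad/vad.scp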

 

 

The silence phone is handled and inserted by the scripts when the model is
constructed. Look at the lexicon FST construction.
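
For readers who want to trace this, the insertion happens when the lang directory is compiled; a minimal sketch of the standard call (directory names are the conventional ones):

# prepare_lang.sh builds L.fst from the dict directory, wiring the
# optional-silence phone in between words; "<unk>" is the OOV entry.
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang_tmp data/lang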

If some of the data has high-quality transcriptions and the rest is low
quality (without noise markers), there may be an opportunity to create
scripts that leverage the high-quality speech data to get noise markings
onto the low-quality data.

 

 

BTW, that VAD code is not intended for speech recognition applications;
it's for speaker and language ID. It's extremely basic, energy-based, and
doesn't ensure a minimum segment length (the decision is made frame by frame).

 

 

Sorry to revive an old thread, but the question keeps coming up.
Wouldn't it make more sense to have a Kaldi-internal VAD?
Implementing a high-quality VAD would require re-generating all the features Kaldi uses internally and inspecting the recognition state: did we recognize a phoneme? Are we inside a word or between words?
It is obviously not just about power and smoothing.

Did anything change in the past few months?

What is the recommended best practice (assuming we are online-decoding)?

 

 

Yes it would, and it's on my TODO list; I have a student working on it.
Right now in the online-decoding setup, you are supposed to use the
backtrace of the decoder to work out whether you are in a silence.
Look for an option with "endpointing" in the name.
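
As a pointer for current code: the online2 decoders expose such options. A hedged example with the nnet3 online decoder follows; the model paths are placeholders, and the --endpoint.silence-phones list must match your phones.txt.

# --do-endpointing terminates an utterance from the decoder backtrace once
# the best path has been in silence phones long enough (see the
# --endpoint.* rules). All paths below are placeholders.
online2-wav-nnet3-latgen-faster --do-endpointing=true \
  --endpoint.silence-phones=1:2:3:4:5 \
  --config=conf/online.conf \
  exp/chain/tdnn/final.mdl exp/chain/tdnn/graph/HCLG.fst \
  ark:data/test/spk2utt scp:data/test/wav.scp ark:/dev/null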
