
High WER with online2-wav-nnet2-latgen-threaded

 

I have compiled and run the new threaded decoder (online2-wav-nnet2-latgen-threaded) and tested it against the single-threaded version (online2-wav-nnet2-latgen-faster). On my test set the WER is higher with the threaded version, by about 0.5% absolute (from ~11.5% to ~12%). I am using the same parameters in both decoders (the single-threaded decoder uses the online option, and all numeric parameters are the same), on the same machine. Does that make any sense? Or is there another parameter I should tune for the threaded decoder?
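For reference, the kind of command pair being compared would look roughly like this (a sketch only; the model, graph, config, and data paths are placeholders, and the exact options should be taken from the actual decoding script):

  # Single-threaded decoder in online mode (paths are hypothetical placeholders)
  online2-wav-nnet2-latgen-faster --online=true \
    --config=conf/online_nnet2_decoding.conf --acoustic-scale=0.1 \
    --word-symbol-table=exp/tri3b/graph/words.txt \
    exp/nnet2_online/final.mdl exp/tri3b/graph/HCLG.fst \
    ark:data/test/spk2utt scp:data/test/wav.scp \
    "ark:|gzip -c > lat.regular.gz"

  # Threaded decoder with the same model, graph, and numeric parameters
  online2-wav-nnet2-latgen-threaded \
    --config=conf/online_nnet2_decoding.conf --acoustic-scale=0.1 \
    --word-symbol-table=exp/tri3b/graph/words.txt \
    exp/nnet2_online/final.mdl exp/tri3b/graph/HCLG.fst \
    ark:data/test/spk2utt scp:data/test/wav.scp \
    "ark:|gzip -c > lat.threaded.gz"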

 

 

I believe it should be the same. Did you notice whether the output tends to be different under certain circumstances, e.g. at the beginning vs. the end of sentences, or in the first utterance of a speaker vs. later utterances?

 

 

There is a difference in handling the start/end of sentences. The threaded decoder seems to miss words at the start of sentences, while the regular decoder is worse at the end of sentences. Some typical examples:

mc allen texas (regular - correct)
allen texas (threaded)
august eleven (regular)
august eleventh (threaded - correct)

Overall, although the threaded decoder is sometimes correct, the regular decoder is better and has far fewer deletions:

regular - {'correct': 5090, 'substitution': 366, 'deletion': 287, 'insertion': 72}
threaded - {'correct': 5064, 'substitution': 356, 'deletion': 323, 'insertion': 65}
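For context, WER is derived from such counts as (substitutions + deletions + insertions) / (correct + substitutions + deletions). A quick check on the raw "regular" counts above; it will not match the scored ~11.5% exactly, presumably because the scored numbers were computed after filtering out tokens such as [noise]/[oov]:

  awk 'BEGIN { C=5090; S=366; D=287; I=72; printf("WER = %.2f%%\n", 100*(S+D+I)/(C+S+D)) }'
  # prints roughly 12.6% on these raw counts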

Both decoders were run with the same VAD, which should be deterministic, and the difference in WER was reproduced in a second attempt, so I find it hard to believe that is what is happening here.

 

 

I'd like you to try something else before I start investigating more closely. Please try setting --use-most-recent-ivector=false in the iVector extraction configs for both decoders, and report the WERs you get. This should in theory make the decoders more comparable. What this relates to is whether, when computing iVectors, the decoders are allowed to look ahead to the stats of the most recent frame available, or whether they are limited to data up to the frame they are currently decoding. If --use-most-recent-ivector=true (the default), the outcome of decoding can depend on things like how loaded the machine was, which makes results hard to compare.
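In practice this is a single extra line in the iVector extraction config that both decoders read; the file name below (conf/ivector_extractor.conf) is just the conventional one from the online-nnet2 recipes and is an assumption about the actual setup:

  # Append to whatever file --ivector-extraction-config points at (path assumed)
  echo "--use-most-recent-ivector=false" >> conf/ivector_extractor.conf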

 

 

I have run two tests with --use-most-recent-ivector=false, and also disabled my VAD and left the [oov] and [noise] tokens in (which I usually filter out). The WER is very similar now, but the outputs still differ significantly. The regular decoder has many more words in its output transcriptions (9308 vs. 9181); many of them are '[noise]' at the end of sentences, and some are other differences. About 10% of the transcriptions differ (200 out of 2000), but only 66 of those 200 have differences that are not noise.

 

 

OK, that's interesting. Can you get the CTM output by piping through lattice-1best, then lattice-align-words, then nbest-to-ctm (or use steps/get_ctm.sh), and tell me what the duration of the word-final [noise] typically is?
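A sketch of that pipeline on a single lattice archive (the lattice, model, and word-boundary paths are placeholders from a typical recipe layout, and the acoustic scale should match whatever was used for decoding):

  # best path -> word-aligned lattice -> CTM lines of the form: utt chan start dur word
  lattice-1best --acoustic-scale=0.1 "ark:gunzip -c exp/nnet2_online/decode/lat.1.gz |" ark:- | \
    lattice-align-words data/lang/phones/word_boundary.int exp/nnet2_online/final.mdl ark:- ark:- | \
    nbest-to-ctm ark:- decode.ctm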

 

Sorry for the late response, but here is what I get from the CTM:

regular:
%WER 16.70 [ 960 / 5749, 172 ins, 301 del, 487 sub ] exp/nnet1/results-regular/wer_21
number of utterances which ended in [noise]: 928/2067
average noise length (at EOS): 0.8993965517241379
average noise start time: 2.92693965517241
threaded:
%WER 16.47 [ 947 / 5749, 164 ins, 315 del, 468 sub ] exp/nnet1/results-threaded/wer_21
number of utterances which ended in [noise]: 945
average noise length (at EOS): 0.835407407407406
average noise start time: 2.9096296296296256
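The kind of one-liner behind those numbers could look like this (an assumption about the analysis, relying on the standard CTM columns utt/channel/start/duration/word and on the CTM being time-ordered within each utterance):

  # Keep the last word of each utterance; average duration/start of the ones that are [noise]
  awk '{ last_word[$1] = $5; last_dur[$1] = $4; last_start[$1] = $3 }
       END { for (u in last_word) if (last_word[u] == "[noise]") { n++; d += last_dur[u]; s += last_start[u] }
             if (n) printf("%d utts end in [noise], avg length %.3f, avg start %.3f\n", n, d/n, s/n) }' decode.ctm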

 

 

Are you setting the acoustic scale on the command line, or leaving it at its default value? I'm wondering whether there could be a bug where, in one case, the acoustic scale is not being passed in correctly.

 

 

I'm setting it from the command line in both cases.

 

 

To what value? The complete command lines would be helpful.

 

 

Also, please do "svn up", "make depend -j 8", "make -j 8", then rerun and make sure the results are unchanged. I want to make sure it's not a compilation problem or out-of-date code.

 

 

I will create a fresh installation of Kaldi and see if the problem can be reproduced. Just before I do this, my settings include:

+ ./configure --shared --threaded-math=y
+ ATLAS 3.11.31 and openFST 1.4.1
+ changing compilation flags from -msse/-msse2 to -mavx/-O3 for optimization

Maybe something related to the threaded math libraries?

 

 

I doubt it's related to the threaded math libraries.

 

 

Just to recap in case anyone gets to this thread: there was a bug in the way the threaded decoder handled the last few frames of likelihood, and it was fixed by Dan in commit r4923.
