
Bus error in gmm-acc-stats-ali


I have been trying to run Kaldi on the NASA HECC clusters through a PBS job submission system. The code calls utils/run.pl to parallelize the jobs across a single node at the moment.

The default WSJ script used to run on the clusters with no issues, but after updating the code and recompiling the tools and src, I am facing a problem with steps/train_mono.sh:

steps/train_mono.sh --boost-silence 1.25 --nj 10 --cmd run.pl data/train_si84_2kshort data/lang_nosp exp/mono0a
steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
steps/train_mono.sh: Aligning data equally (pass 0)
run.pl: 10 / 10 failed, log is in exp/mono0a/log/align.0.*.log

Looking at the log files, I see:

# align-equal-compiled "ark:gunzip -c exp/mono0a/fsts.1.gz|" "ark,s,cs:apply-cmvn  --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:- | add-deltas ark:- ark:- |" ark,t:- | gmm-\
acc-stats-ali --binary=true exp/mono0a/0.mdl "ark,s,cs:apply-cmvn  --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:- | add-deltas ark:- ark:- |" ark:- exp/mono0a/0.1.acc
# Started at Wed Jun 17 22:51:12 PDT 2015
#
gmm-acc-stats-ali --binary=true exp/mono0a/0.mdl 'ark,s,cs:apply-cmvn  --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:- | add-deltas ark:- ark:- |' ark:- exp/mono0a/0.1.acc
align-equal-compiled 'ark:gunzip -c exp/mono0a/fsts.1.gz|' 'ark,s,cs:apply-cmvn  --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:- | add-deltas ark:- ark:- |' ark,t:-
add-deltas ark:- ark:-
apply-cmvn --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:-
add-deltas ark:- ark:-
apply-cmvn --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:-
bash: line 1: 14949 Broken pipe             align-equal-compiled "ark:gunzip -c exp/mono0a/fsts.1.gz|" "ark,s,cs:apply-cmvn  --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:- \
| add-deltas ark:- ark:- |" ark,t:-
     14950 Bus error               (core dumped) | gmm-acc-stats-ali --binary=true exp/mono0a/0.mdl "ark,s,cs:apply-cmvn  --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:- | a\
dd-deltas ark:- ark:- |" ark:- exp/mono0a/0.1.acc
# Accounting: time=0 threads=1
# Ended (code 135) at Wed Jun 17 22:51:12 PDT 2015, elapsed time 0 seconds

Could anyone help me with figuring out how to go about identifying the exact problem and any possible solutions?

Usually the only thing that can cause a bus error is recompiling code while something is running.
Other than that, it's possible it could be caused by compiling for the wrong hardware, but I haven't seen that. I've also, IIRC, had people complain that they had a core dump while using Docker, and they felt it was caused by some kind of file system issue.

 

Hello Prof. Povey, thanks very much for the response. I think I can safely reject the first and second hypotheses (recompiling while running, and wrong hardware). The cluster seems to be uniform in terms of hardware, whether it is the node I am compiling on or the node I am running on.

With regards to the third hypothesis, I guess I will contact the cluster sysadmins and see if they can help me with that.


Hello Prof. Povey,

Based on the advice I received from the cluster's tech support, I did a complete core dump during the failure and isolated the problem to the gmm-acc-stats-ali binary.

On performing a gdb debug with the core file and the binary, I found the following backtrace:

gdb ~/Workspace/Software/kaldi/src/gmmbin/gmm-acc-stats-ali core.14727
...
...
(gdb) where
#0  0x00000000005ac1b6 in kaldi::VectorBase::ApplyPow (this=, this@entry=, 
    power=, power@entry=) at kaldi-vector.cc:441
Cannot access memory at address 0x2f3a6e69622f727b
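
For completeness, a minimal sketch of how a core file like this can be produced and inspected, assuming bash and that the cluster writes core files into the working directory (the core file name depends on the system's core_pattern setting):

# Allow core files of unlimited size in this shell before reproducing the failure.
ulimit -c unlimited
# Re-run the failing stage so the crashing binary dumps core.
steps/train_mono.sh --boost-silence 1.25 --nj 10 --cmd run.pl \
  data/train_si84_2kshort data/lang_nosp exp/mono0a
# Open the core file together with the unstripped binary and print the backtrace.
gdb ~/Workspace/Software/kaldi/src/gmmbin/gmm-acc-stats-ali core.14727
(gdb) where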

The only place I can find ApplyPow in this binary is in:

 src/gmm/mle-diag-gmm.cc L:185 function:AccumulateFromPosteriors
#called by
src/gmm/mle-diag-gmm.cc L:202 function:AccumulateFromDiag
#called by
src/mle/mle-am-diag-gmm.cc L:74 function:AccumulateForGmm
#called by
src/gmmbin/gmm-acc-stats-ali.cc L:98

I am not sure what to make of the error, or why it is accessing incorrect memory at all.

That line of the program is this:

#ifdef HAVE_MKL
template<>
void VectorBase<float>::ApplyPow(float power) { vsPowx(dim_, data_, power, data_); }
template<>
void VectorBase<double>::ApplyPow(double power) { vdPowx(dim_, data_, power, data_); }

So it looks like you are using MKL (you must have configured for MKL). Possibly it's some mismatch in MKL version between the machine you compiled on and the one where you ran. Run "make test" in the matrix/ directory on the machine where you compiled, and on the machine where you are running.
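
A minimal sketch of that check, assuming the Kaldi tree is at ~/Workspace/Software/kaldi (the path used with gdb above) and that an interactive PBS session on a compute node can be obtained with qsub -I:

# On the machine where the binaries were compiled:
cd ~/Workspace/Software/kaldi/src/matrix
make test

# Repeat on a compute node where the jobs actually run,
# e.g. from an interactive PBS session started with: qsub -I
cd ~/Workspace/Software/kaldi/src/matrix
make test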

Something else to bear in mind is that if you don't do "make depend" before make, dependencies can become out of date and not trigger recompilation. Doing "make clean" and "make" would be the safest thing to do, though.
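
A minimal sketch of that clean rebuild, run from the top of the src/ tree (the path is the one used earlier; a -j option can be added to make if the build machine has spare cores):

cd ~/Workspace/Software/kaldi/src
# Remove stale objects, regenerate dependency files, then rebuild everything.
make clean
make depend
make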

Apologies for the delay in response. I tried "make test", and the tests run successfully both on the machine where I compile the binaries and on the machines where I run the actual code.

One thing to note is that the binaries work fine when I don't run them from scripts like train_mono.sh; however, they fail with the core dump error when run through the script via run.pl.

When I put a KALDI_LOG / KALDI_VLOG statement at line 441 of kaldi-vector.cc, that code runs fine, but I then face a segfault in another binary. I guess this indicates a much more fundamental error in the build, and that something is wrong in the environment where the binaries are being compiled / run.

In any case, the normal build works perfectly fine on a local machine.

If you have any other options, I can try them out. If not, thank you for responding so quickly to my queries.

It could possibly be some difference in your environment that makes the difference; perhaps your default shell is not bash? (echo $SHELL to check). In that case, for example, your .cshrc would not be invoked because run.pl invokes bash directly. You could also try things like

which gmm-acc-stats-ali
ldd gmm-acc-stats-ali
utils/run.pl foo.log which gmm-acc-stats-ali
utils/run.pl bar.log ldd gmm-acc-stats-ali

and check that they give the same output.


Akshay's reply won't have shown up due to a bug in Sourceforge whereby moderated messages don't get posted to the list. It is below.

So there are no obvious differences in how the programs are linked or which ones are used. Now, in general Kaldi won't work correctly if your default shell is not bash (you might have to ask to change it, or "chsh -s bash" or "ypchsh -s bash" might work), but it's not obvious to me right this second why it would cause the problems you are experiencing. My guess is some weird filesystem or OS behavior where it is caching something, or possibly an effect due to user limits (e.g. process limits or memory limits), but that wouldn't normally cause a core dump.

Try to run the complete pipeline of commands from csh, and then from bash (by starting a bash shell by typing "bash", sourcing path.sh and running them), and see if the behavior is different. You can also try changing the --nj parameter and seeing if this makes a difference; if it does, user limits might be responsible (you can type "ulimit -a" to see what limits apply to you). Please don't send the complete output; try to see if anything seems weird first.
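
A minimal sketch of that check, assuming it is run from the top of the WSJ recipe directory; the pipeline is the same one shown in the failing log above:

# From the csh login shell, start bash and pick up the Kaldi paths.
bash
. ./path.sh

# Show the per-user resource limits that apply in this shell.
ulimit -a

# Re-run one of the failing pipelines by hand and see whether it still crashes.
align-equal-compiled "ark:gunzip -c exp/mono0a/fsts.1.gz|" \
  "ark,s,cs:apply-cmvn --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:- | add-deltas ark:- ark:- |" \
  ark,t:- | \
gmm-acc-stats-ali --binary=true exp/mono0a/0.mdl \
  "ark,s,cs:apply-cmvn --utt2spk=ark:data/train_si84_2kshort/split10/1/utt2spk scp:data/train_si84_2kshort/split10/1/cmvn.scp scp:data/train_si84_2kshort/split10/1/feats.scp ark:- | add-deltas ark:- ark:- |" \
  ark:- exp/mono0a/0.1.acc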

Hello Prof. Povey,

The default shell is csh. However, the environment variables propagate through the scripts and seem to point correctly, with the only difference being SHLVL.

Here are the commands and their subsequent outputs on the compile machine and on one of the machines where the code is run using qsub:

Script:

which gmm-acc-stats-ali
ldd `which gmm-acc-stats-ali`
utils/run.pl foo.log which gmm-acc-stats-ali
utils/run.pl bar.log ldd `which gmm-acc-stats-ali`

Output (from compile machine):

 /u/achandra/Workspace/Software/kaldi/src/gmmbin/gmm-acc-stats-ali
        linux-vdso.so.1 =>  (0x00007fffedb06000)
        libfst.so.1 => /u/achandra/Workspace/Software/kaldi/tools/openfst/lib/libfst.so.1 (0x00007fffed6f0000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fffed4aa000)
        libmkl_intel_lp64.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_intel_lp64.so (0x00007fffed1a4000)
        libmkl_sequential.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_sequential.so (0x00007fffed016000)
        libmkl_core.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_core.so (0x00007fffece45000)
    libiomp5.so => /nasa/mw/2014b/sys/os/glnxa64/libiomp5.so (0x00007fffecb2a000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fffec90c000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fffec693000)
        libstdc++.so.6 => /nasa/pkgsrc/2014Q2/gcc47/lib64/libstdc++.so.6 (0x00007fffec38d000)
    libgcc_s.so.1 => /nasa/mw/2014b/sys/os/glnxa64/libgcc_s.so.1 (0x00007fffec176000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fffebdfa000)
        /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)

foo.log:
/u/achandra/Workspace/Software/kaldi/src/gmmbin/gmm-acc-stats-ali
bar.log:
        linux-vdso.so.1 =>  (0x00007fffedb06000)
        libfst.so.1 => /u/achandra/Workspace/Software/kaldi/tools/openfst/lib/libfst.so.1 (0x00007fffed6f0000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fffed4aa000)
        libmkl_intel_lp64.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_intel_lp64.so (0x00007fffed1a4000)
        libmkl_sequential.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_sequential.so (0x00007fffed016000)
        libmkl_core.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_core.so (0x00007fffece45000)
    libiomp5.so => /nasa/mw/2014b/sys/os/glnxa64/libiomp5.so (0x00007fffecb2a000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fffec90c000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fffec693000)
        libstdc++.so.6 => /nasa/pkgsrc/2014Q2/gcc47/lib64/libstdc++.so.6 (0x00007fffec38d000)
    libgcc_s.so.1 => /nasa/mw/2014b/sys/os/glnxa64/libgcc_s.so.1 (0x00007fffec176000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fffebdfa000)
        /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)

Output (from run machine):

 /u/achandra/Workspace/Software/kaldi/src/gmmbin/gmm-acc-stats-ali
    linux-vdso.so.1 =>  (0x00002aaaaaaab000)
    libfst.so.1 => /u/achandra/Workspace/Software/kaldi/tools/openfst/lib/libfst.so.1 (0x00002aaaaaaae000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaaaefa000)
    libmkl_intel_lp64.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_intel_lp64.so (0x00002aaaab0fe000)
    libmkl_sequential.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_sequential.so (0x00002aaaab404000)
    libmkl_core.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_core.so (0x00002aaaab592000)
    libiomp5.so => /nasa/mw/2014b/sys/os/glnxa64/libiomp5.so (0x00002aaaab763000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaaba7e000)
    libm.so.6 => /lib64/libm.so.6 (0x00002aaaabc9c000)
    libstdc++.so.6 => /nasa/pkgsrc/2014Q2/gcc47/lib64/libstdc++.so.6 (0x00002aaaabf15000)
    libgcc_s.so.1 => /nasa/mw/2014b/sys/os/glnxa64/libgcc_s.so.1 (0x00002aaaac21b000)
    libc.so.6 => /lib64/libc.so.6 (0x00002aaaac432000)
    /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
foo.log:
/u/achandra/Workspace/Software/kaldi/src/gmmbin/gmm-acc-stats-ali

bar.log:
        linux-vdso.so.1 =>  (0x00002aaaaaaab000)
        libfst.so.1 => /u/achandra/Workspace/Software/kaldi/tools/openfst/lib/libfst.so.1 (0x00002aaaaaaae000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaaaefa000)
        libmkl_intel_lp64.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_intel_lp64.so (0x00002aaaab0fe000)
        libmkl_sequential.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_sequential.so (0x00002aaaab404000)
        libmkl_core.so => /nasa/intel/mkl/10.0.011/lib/em64t/libmkl_core.so (0x00002aaaab592000)
        libiomp5.so => /nasa/mw/2014b/sys/os/glnxa64/libiomp5.so (0x00002aaaab763000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaaba7e000)
    libm.so.6 => /lib64/libm.so.6 (0x00002aaaabc9c000)
        libstdc++.so.6 => /nasa/pkgsrc/2014Q2/gcc47/lib64/libstdc++.so.6 (0x00002aaaabf15000)
    libgcc_s.so.1 => /nasa/mw/2014b/sys/os/glnxa64/libgcc_s.so.1 (0x00002aaaac21b000)
        libc.so.6 => /lib64/libc.so.6 (0x00002aaaac432000)
        /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)