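This article uses two servers on a local network, 192.168.1.82 and 192.168.1.39, as an example to outline how to set up a multi-machine, multi-GPU environment for Kaldi. Server 82 acts as both the SGE master and the node1 worker; server 39 acts as the node2 worker.
Step1. NFS installation and configuration:
On 82, install the NFS server and edit /etc/exports so that the relevant directory can be mounted by the NFS client.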
sudo apt-get install nfs-kernel-server
vim /etc/exports
### Edit the file and add the kaldi path
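The exports entry itself is not shown here. As a minimal sketch, assuming the worker at 192.168.1.39 should get read-write access and that the common options rw, sync and no_subtree_check are acceptable (the option list and the helper name exports_entry are assumptions, not settings taken from this setup), the line to add could be generated like this:

```shell
# Hypothetical helper: format one /etc/exports entry.
# $1 = directory to export, $2 = client allowed to mount it.
# The option list (rw,sync,no_subtree_check) is an assumption; adjust as needed.
exports_entry() {
  printf '%s %s(rw,sync,no_subtree_check)\n' "$1" "$2"
}

# Print the entry for the kaldi directory and the 39 worker.
exports_entry /home/dev/kaldi 192.168.1.39
```

After the line has been added to /etc/exports on 82, running sudo exportfs -ra makes the NFS server re-read the file.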
On 39, install the NFS client and mount the kaldi path from 82 at the same directory path on 39.
sudo apt-get install nfs-common
cd /home/dev
mkdir kaldi
sudo mount -t nfs 192.168.1.82:/home/dev/kaldi /home/dev/kaldi
### If the wrong path was mounted, unmount it
sudo umount -f -l /home/dev/kaldi
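Since a missing mount later shows up as a "No such file or directory" error from SGE, it can help to check the mount before submitting jobs. The following is only a sketch: the helper name is_mounted and the sample table are illustrative assumptions, and on a real Linux machine the function reads /proc/mounts by default.

```shell
# Hypothetical helper: succeed if $1 is a mount point listed in a mounts table.
# $2 is the table to read; it defaults to /proc/mounts on Linux.
is_mounted() {
  awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Demo against a sample table (same column layout as /proc/mounts).
sample=$(mktemp)
printf '%s\n' '192.168.1.82:/home/dev/kaldi /home/dev/kaldi nfs4 rw 0 0' > "$sample"
if is_mounted /home/dev/kaldi "$sample"; then
  echo "kaldi is mounted"
else
  echo "kaldi is NOT mounted"
fi
```

On the 39 worker, calling is_mounted /home/dev/kaldi without the second argument checks the live mount table.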
Step2. SGE installation:
### Before installing, check the hostname and IP address; both are needed later
hostname -I
vim /etc/hosts
Because 82 serves as both the master and worker node1, install the master and client packages together.
sudo apt-get install gridengine-master gridengine-client
sudo service gridengine-master restart
# Repair gridengine-client (only if needed)
cd Downloads
mkdir gec
dpkg -X gridengine-client_8.1.9+dfsg-9_amd64.deb gec
cd gec/usr/lib/gridengine
sudo cp spooldefaults.bin /usr/lib/gridengine
cd /usr/lib/x86_64-linux-gnu
sudo ln -s
sudo apt install gridengine-master gridengine-qmon gridengine-exec
sudo apt-get install gridengine-client gridengine-exec
cd Downloads
mkdir gec
dpkg -X gridengine-client_8.1.9+dfsg-9_amd64.deb gec
cd gec/usr/lib/gridengine
sudo cp spooldefaults.bin /usr/lib/gridengine
cd /usr/lib/x86_64-linux-gnu
sudo ln -s
sudo apt-get install gridengine-qmon gridengine-exec
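Keep the cell name at its default.
The host name must match this machine's hostname.
(PS: It is easy to enter something wrong during a first install. If you do not know how to correct it, uninstall and reinstall following the steps above.)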
apt-get --purge remove -y gridengine-client
apt-get --purge remove -y gridengine-master
apt-get --purge remove -y gridengine-common
apt-get --purge remove -y gridengine-exec
rm -rf `locate gridengine`
vim /etc/hosts
# Add the new entry and delete the original your_hostname entry
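The exact entries are not given above. One common situation on Debian/Ubuntu is that the installer maps the hostname to a loopback address such as 127.0.1.1, which may be what needs removing here; that reading is an assumption. A small sketch for inspecting what a hosts file currently maps a name to (the helper name hosts_ips_for and the sample content are hypothetical):

```shell
# Hypothetical helper: print every address that a hosts file maps to a name.
# $1 = hostname, $2 = hosts file (defaults to /etc/hosts).
hosts_ips_for() {
  awk -v h="$1" '
    /^[[:space:]]*#/ { next }   # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == h) print $1 }
  ' "${2:-/etc/hosts}"
}

# Demo on a sample file; host1 stands in for the real hostname.
hosts_sample=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' '127.0.1.1 host1' '192.168.1.82 host1' > "$hosts_sample"
hosts_ips_for host1 "$hosts_sample"
```

If a loopback address is printed alongside the LAN address, other nodes may be unable to reach the master under that name.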
Step3. SGE configuration:
qconf -am hostname ##(master node)
qconf -aq
Add the RAM and GPU attributes to the complex values
qconf -mc
#name shortcut type relop requestable consumable default urgency
mem_free mf MEMORY <= YES YES 1G 0
#name shortcut type relop requestable consumable default urgency
gpu g INT <= YES YES 0 10000
ram_free ram_free MEMORY <= YES JOB 1G 0
Add the admin host, the two submit hosts, and the two execution hosts
qconf -ah hostname ##(master)
qconf -as hostname ##(master)
qconf -as hostname ##(client)
qconf -ae
hostname master_hostname
complex_values ram_free=160G,gpu=4
qconf -ae
hostname client_hostname
complex_values ram_free=160G,gpu=4
qconf -ap smp
pe_name smp
slots 9999
qconf -mq all.q
pe_list make smp
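On the Kaldi side, queue.pl turns options such as --mem, --gpu and --num-threads into qsub flags through conf/queue.conf, which is where the mem_free, ram_free and gpu resources and the smp parallel environment defined above get used. The sketch below is modeled on queue.pl's default configuration; dropping the "-q g.q" part of the gpu line is an assumption made because this setup only defines all.q. It writes into a scratch directory rather than a real recipe.

```shell
# Write a queue.conf sketch into a scratch directory (not into a real recipe).
conf_dir=$(mktemp -d)
cat > "$conf_dir/queue.conf" <<'EOF'
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0
option num_threads=* -pe smp $0
option num_threads=1
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0
option gpu=* -l gpu=$0
EOF

# Show the result; in a recipe this file would live at conf/queue.conf.
cat "$conf_dir/queue.conf"
```

With a file like this, a job submitted through queue.pl with --gpu 1 would request one unit of the gpu consumable from the scheduler.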
Modify the prolog of all.q
qconf -mq all.q
prolog /var/lib/gridengine/default/common/
Create prolog.sh in that directory and make it executable with chmod 777 (the client machine needs it as well)
vim /var/lib/gridengine/default/common/
chmod 777 /var/lib/gridengine/default/common/
#!/bin/bash
# Give NFS a moment to catch up: wait if the job script or the
# stderr directory is not yet visible on this node.
function test_ok {
  if [ ! -z "$JOB_SCRIPT" ] && [ "$JOB_SCRIPT" != QLOGIN ] && [ "$JOB_SCRIPT" != QRLOGIN ]; then
    if [ ! -f "$JOB_SCRIPT" ]; then
      echo "$0: warning: no such file $JOB_SCRIPT, will wait" 1>&2
      return 1;
    fi
  fi
  if [ ! -z "$SGE_STDERR_PATH" ]; then
    if [ ! -d "`dirname $SGE_STDERR_PATH`" ]; then
      echo "$0: warning: no such directory $JOB_SCRIPT, will wait." 1>&2
      return 1;
    fi
  fi
  return 0;
}
# Retry with increasing delays, then let the job start regardless.
if ! test_ok; then
  sleep 2;
  if ! test_ok; then
    sleep 4;
    if ! test_ok; then
      sleep 8;
    fi
  fi
fi
exit 0;
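The prolog is essentially a retry loop with growing delays that gives NFS time to make the job script visible on the worker. The same pattern can be sketched as a generic function; wait_for_path is a hypothetical name, and unlike the prolog (which always exits 0) this version reports whether the path ever appeared.

```shell
# Hypothetical helper: wait for a path to appear, sleeping between checks.
# $1 = path; remaining arguments = delays in seconds (the prolog uses 2 4 8).
wait_for_path() {
  path=$1
  shift
  for delay in "$@"; do
    [ -e "$path" ] && return 0
    sleep "$delay"
  done
  [ -e "$path" ]   # final check after the last sleep
}

# Demo with zero-second delays so it finishes immediately.
probe=$(mktemp)
wait_for_path "$probe" 0 0 && echo "found $probe"
```

Calling wait_for_path "$JOB_SCRIPT" 2 4 8 would reproduce the prolog's timing for the job-script check.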
Add every execution node to the host group (@allhosts)
qconf -ahgrp @allhosts
group_name @allhosts
hostlist master_hostname client_hostname
hostlist @allhosts
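Instead of typing the host group into the interactive editor, the same two fields can be generated as text and loaded with qconf -Ahgrp, which reads the definition from a file. This is a sketch only; the helper name hostgroup_conf is an assumption, and master_hostname and client_hostname are the same placeholders used above.

```shell
# Hypothetical helper: print a host group definition.
# $1 = group name, remaining arguments = member hosts.
hostgroup_conf() {
  name=$1
  shift
  printf 'group_name %s\n' "$name"
  printf 'hostlist %s\n' "$*"
}

# Print the definition for the two hosts of this setup.
hostgroup_conf @allhosts master_hostname client_hostname
```

Redirecting the output to a file and passing that file to qconf -Ahgrp would register the group without opening an editor.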
The `qhost -q` command shows the configuration of each host in all.q
qhost -q
global - - - - - - - - - -
host1 lx-amd64 64 2 32 64 10.40 251.8G 30.7G 61.0G 0.0
all.q BIP 0/4/64
host2 lx-amd64 64 2 32 64 9.40 251.8G 8.9G 95.4G 0.0
all.q BIP 0/6/64
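The `qstat -f` command shows the state of each execution host. If the states column shows E, au, or another such state, there is an error, which can be handled with the qmod command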
queuename qtype resv/used/tot. load_avg arch states
all.q@host1 BIP 0/4/64 10.92 lx-amd64 E
all.q@host2 BIP 0/6/64 9.81 lx-amd64 E
qmod -d all.q
qmod -d all.q@host1
qmod -d all.q@host2
qmod -d all.q@host1 disables the node; states shows d
qmod -e all.q@host1 enables the node; states shows no flag
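With more than a couple of hosts it is easier to filter the qstat -f listing than to read it by eye. The sketch below picks out queue instances whose states column contains E; the helper name queues_in_error is hypothetical, and the sample input simply mimics the layout of the listing above.

```shell
# Hypothetical helper: read `qstat -f` output on stdin and print the queue
# instances whose states column (6th field) contains E.
queues_in_error() {
  awk '$1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Demo on sample text in the same layout as the qstat -f listing.
queues_in_error <<'EOF'
queuename qtype resv/used/tot. load_avg arch states
all.q@host1 BIP 0/4/64 10.92 lx-amd64 E
all.q@host2 BIP 0/6/64 9.81 lx-amd64
EOF
```

On a live cluster the function would be used as qstat -f | queues_in_error. Once the underlying cause is fixed, qmod -c on the listed queue instance clears the error state.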
qstat -u '*'
`qstat -j jobid`
`qdel job-ID` deletes a job; for example, to delete job 22:
qdel 22
`qdel -u username` deletes all of that user's jobs
Use qalter -h u jobid to place a user hold on a job,
and qalter -h U jobid to release it
Category | State | SGE Letter Code |
Pending | pending; pending, user hold | qw |
Pending | pending, user and system hold | hqw |
Pending | pending, user and system hold, re-queue | hRwq |
Running | running | r |
Running | transferring | t |
Running | running, re-submit | Rr |
Running | transferring, re-submit | Rt |
Suspended | job suspended | s, ts |
Suspended | queue suspended | S, tS |
Suspended | queue suspended by alarm | T, tT |
Suspended | all suspended with re-submit | Rs, Rts, RS, RtS, RT, RtT |
Error | all pending states with error | Eqw, Ehqw, EhRqw |
Deleted | all running and suspended states with deletion | dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT |
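When scripting around qstat, the letter codes in the table can be folded back into their categories. The function below is a sketch of one way to do that; the name classify_state and the matching order are assumptions that cover only the codes listed in the table.

```shell
# Hypothetical helper: map an SGE state code from the table to its category.
# Order matters: deletion and error prefixes are checked first.
classify_state() {
  case "$1" in
    d*)       echo Deleted ;;
    E*)       echo Error ;;
    *[sST]*)  echo Suspended ;;
    *qw|*wq)  echo Pending ;;
    *[rt]*)   echo Running ;;
    *)        echo Unknown ;;
  esac
}

# Demo on a few codes from the table.
for code in qw r tS Eqw dr; do
  printf '%s -> %s\n' "$code" "$(classify_state "$code")"
done
```

Codes outside the table fall through to Unknown, which is a safer default than guessing a category.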
Common Errors:
#1 @allhosts is configured incorrectly
commlib error: can't resolve host name (can't resolve rdata hostname "host1")
#2 Configuration problem
can't get password entry for user "host1". Either user does not exist or error with NIS/LDAP etc
#3 Edit the cluster configuration with qconf -mconf and lower min_uid
job rejected: your user id 0 is lower than minimum user id 1000 of cluster configuration
#4 Directory not found; the NFS mount has disappeared and must be remounted
error: can't chdir to /home/user/egs/aishell/s5: No such file or directory
#5 Permission problem: chmod 777, or edit the configuration with vim /etc/exports?
error: can't open output file ".../make_mfcc/train/q/make_mfcc_pitch_train.log": Permission denied
#6 qstat -u '*' always shows state qw
#7 In /var/lib/gridengine/default/common/act_qmaster, change master to the host name
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "host2": got send error
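For error #7, a quick way to see whether act_qmaster names the machine that actually runs the master is to compare its first line with the expected hostname. This is a minimal sketch; the helper name act_qmaster_matches is hypothetical and host1 is only a placeholder for the real master hostname.

```shell
# Hypothetical helper: succeed if the act_qmaster file names the expected host.
# $1 = expected master hostname, $2 = file (defaults to the path used above).
act_qmaster_matches() {
  file=${2:-/var/lib/gridengine/default/common/act_qmaster}
  [ "$(head -n 1 "$file")" = "$1" ]
}

# Demo on a temporary file standing in for act_qmaster.
demo=$(mktemp)
echo host1 > "$demo"
act_qmaster_matches host1 "$demo" && echo "act_qmaster points at host1"
```

Running the check on every node, without the second argument, shows whether all of them agree on the master.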
