T630 GPU server crash and automatic reboot: log record

A T630 GPU server crashed and rebooted automatically. The log recorded: A fatal error was detected on a component at bus 128 device 3 function 0

Cause of the failure:

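The machine went down because, with multiple GPUs running under heavy load, the GPU temperature reached the threshold (95 °C), which triggered a bus fatal error and caused the crash and reboot.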

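The root cause was an abnormal iDRAC thermal control process: it could not report the GPUs' actual operating temperature accurately and in real time, so the fans did not compensate and the GPUs overheated until the server went down.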
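
Because iDRAC was not reporting GPU temperature reliably, one way to cross-check is to read the temperature on the host itself. The following is a minimal sketch, assuming NVIDIA GPUs with `nvidia-smi` installed (the original note does not name the GPU vendor). The function name `hot_gpus` and the 90-degree warning margin in the usage comment are illustrative assumptions, not part of the original procedure.

```shell
# Sketch: flag GPUs whose temperature is at or above a limit.
# Assumption: NVIDIA GPUs; input lines look like "index, temperature"
# (for example "0, 83"), as printed by:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits

# hot_gpus LIMIT
# Reads "index, temp" lines on stdin and prints those with temp >= LIMIT.
hot_gpus() {
    awk -F', *' -v limit="$1" \
        '$2 + 0 >= limit + 0 { print "GPU " $1 ": " $2 " C" }'
}

# Example use on a live host (hypothetical 90-degree warning margin,
# below the 95-degree threshold mentioned above):
#   nvidia-smi --query-gpu=index,temperature.gpu \
#       --format=csv,noheader,nounits | hot_gpus 90
```

Keeping the parsing in a function that reads standard input means it can be exercised with sample data on a machine that has no GPU.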

Adjusting the fan speed directly with racadm:

Check the current value:

[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx get System.ThermalSettings.FanSpeedOffset

Security Alert: Certificate is invalid - self signed certificate

Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.

[Key=System.Embedded.1#ThermalSettings.1]

FanSpeedOffset=Off

Set the fan speed offset to 3 [0: low fan speed, 1: medium fan speed, 2: high fan speed, 3: max fan speed]:

[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx set System.ThermalSettings.FanSpeedOffset 3

Security Alert: Certificate is invalid - self signed certificate

Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.

[Key=System.Embedded.1#ThermalSettings.1]

Object value modified successfully

After the change, check the value again:

[root@xxxxx ~]# racadm -r BMCIP -u xxx -p xxx get System.ThermalSettings.FanSpeedOffset

Security Alert: Certificate is invalid - self signed certificate

Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.

[Key=System.Embedded.1#ThermalSettings.1]

FanSpeedOffset=Max Fan Speed

After the fan speed was raised, the server ran normally.
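
The check, set, and verify steps above could be wrapped in a small script so they can be repeated across several servers. This is only a sketch under stated assumptions: `BMC_IP`, `BMC_USER`, and `BMC_PASS` are hypothetical environment variables, the function names are illustrative, and the parsing assumes racadm prints a `FanSpeedOffset=<value>` line exactly as in the transcripts above.

```shell
# Sketch: apply the max fan speed offset only when it is not already set.
# Assumptions: BMC_IP, BMC_USER, BMC_PASS are hypothetical environment
# variables; racadm output contains a "FanSpeedOffset=<value>" line.

# fan_offset_value
# Reads racadm "get" output on stdin and prints the text after
# "FanSpeedOffset=", dropping any carriage return.
fan_offset_value() {
    sed -n 's/^FanSpeedOffset=//p' | tr -d '\r'
}

# ensure_max_fan
# Returns 0 if the offset is (or ends up as) "Max Fan Speed", non-zero otherwise.
ensure_max_fan() {
    key="System.ThermalSettings.FanSpeedOffset"
    current=$(racadm -r "$BMC_IP" -u "$BMC_USER" -p "$BMC_PASS" get "$key" | fan_offset_value)
    if [ "$current" = "Max Fan Speed" ]; then
        echo "already at max: nothing to do"
        return 0
    fi
    # 3 is the value the procedure above uses for max fan speed.
    racadm -r "$BMC_IP" -u "$BMC_USER" -p "$BMC_PASS" set "$key" 3 || return 1
    # Re-read the setting to confirm the change took effect.
    current=$(racadm -r "$BMC_IP" -u "$BMC_USER" -p "$BMC_PASS" get "$key" | fan_offset_value)
    [ "$current" = "Max Fan Speed" ]
}

# Example use (values are placeholders):
#   BMC_IP=BMCIP BMC_USER=xxx BMC_PASS=xxx ensure_max_fan
```

Checking the current value first keeps the script safe to rerun, and the final re-read mirrors the manual verification step in the procedure.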

Copyright notice:
Author: 郭靖
Link: https://www.sxszhian.com/archives/8190
Source: 上海永驰网络科技有限公司
Copyright belongs to the author. Do not reproduce without permission.
