Is my graphics card broken? - V2EX
Is my graphics card broken?

  •   sty Aug 12, 2024 3284 views
    $ nvidia-smi
    Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

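    I ran inference with a 7B model for more than 20 hours straight. Since then this error keeps coming back, although a reboot always clears it. Is the GPU hardware failing?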

    17 replies    2024-08-12 19:42:15 +08:00
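    GoRoad  #1  Aug 12, 2024
    Consumer cards are not industrial-grade, so all kinds of problems can show up under prolonged load. If it works normally after a reboot, it most likely isn't dead yet. It could be overheating or something similar.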
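    sty (OP)  #2  Aug 12, 2024
    @GoRoad It has been over a week and I have to reboot every day. Updating the driver didn't help. Could some memory blocks have gone bad, so that it has to run for a while before it hits one of them?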
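    DigitalG  #3  Aug 12, 2024
    When you say it "keeps coming back", how often is that? I have seen the reboot-fixes-it pattern before; it can be caused by the NVIDIA driver updating itself automatically. Check whether the driver version has changed, or look through the system logs, and then turn automatic updates off.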
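A minimal sketch of the version check suggested above, assuming a Linux host where the loaded kernel module reports itself in `/proc/driver/nvidia/version` and the module installed on disk can be queried with `modinfo -F version nvidia`. The helper names are made up for illustration, and taking "the first dotted number in the text" as the driver version is an assumption about the file layout, not a documented format.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional, Tuple

# First dotted number in the text, e.g. 550.107.02 (assumed to be the driver version).
VERSION_RE = re.compile(r"\b(\d+\.\d+(?:\.\d+)?)\b")


def extract_driver_version(text: str) -> Optional[str]:
    """Return the first dotted version number found in text, or None."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


def versions_differ(loaded_text: str, installed_text: str) -> bool:
    """True only when both versions could be read and they are not equal."""
    loaded = extract_driver_version(loaded_text)
    installed = extract_driver_version(installed_text)
    return loaded is not None and installed is not None and loaded != installed


def current_versions() -> Tuple[str, str]:
    """Read the raw text for the loaded module and the module installed on disk."""
    loaded = Path("/proc/driver/nvidia/version").read_text()
    installed = subprocess.run(
        ["modinfo", "-F", "version", "nvidia"],
        capture_output=True, text=True, check=True,
    ).stdout
    return loaded, installed
```

If `versions_differ(*current_versions())` returns True, the package on disk was probably updated after the module was loaded, which would fit the "a reboot fixes it" pattern.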
    HojiOShi  #4  Aug 12, 2024
    Which card is it? Is it an ex-mining card?
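    sty (OP)  #5  Aug 12, 2024
    @DigitalG I updated the driver myself after the problem started. There are no errors under load; a 3-hour training run, for example, finishes fine. The error above shows up when the card is idle instead, about once or twice a day, every day.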
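Since the failures happen while idle, it may help to record exactly when they occur so they can be matched against kernel log timestamps. The following is a rough sketch, not a tested monitoring tool: it runs `nvidia-smi` once and classifies the result by looking for the error text quoted in the original post. The function names and status labels are invented for illustration. Note that the probe itself touches the driver, so it is not a purely passive observer.

```python
import subprocess
from datetime import datetime

# Error text quoted in the original post.
DEVICE_HANDLE_ERROR = "Unable to determine the device handle"


def classify(returncode: int, output: str) -> str:
    """Map an nvidia-smi result to a coarse status label (labels are hypothetical)."""
    if DEVICE_HANDLE_ERROR in output:
        return "device_handle_error"
    if returncode != 0:
        return "other_error"
    return "ok"


def probe_once() -> str:
    """Run nvidia-smi once and return a timestamped status line for a log file."""
    proc = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    status = classify(proc.returncode, proc.stdout + proc.stderr)
    return f"{datetime.now().isoformat(timespec='seconds')} {status}"
```

Calling `probe_once()` from cron or a systemd timer every few minutes and appending the line to a file would give a timeline of when the card drops out.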
    sty (OP)  #6  Aug 12, 2024
    @HojiOShi A 3090 Ti. I bought it more than three years ago but barely used it; I only started using it in the last three months.
    cinlen  #7  Aug 12, 2024
    Check the kernel log with dmesg for anything abnormal.
    rickiey  #8  Aug 12, 2024
    Monitor the temperature, clock frequency, VRAM usage, and power draw.
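One way to collect those numbers is `nvidia-smi --query-gpu=... --format=csv`. The sketch below builds such a command and parses one CSV line into a dict. The choice of fields and the helper names are assumptions made for illustration; values the driver reports as unavailable (the table posted in this thread shows power usage as N/A) are stored as `None` rather than guessed.

```python
from typing import Dict, List, Optional, Union

# Fields to sample; this selection is an assumption, adjust as needed.
QUERY_FIELDS = [
    "timestamp", "temperature.gpu", "clocks.sm",
    "clocks.mem", "memory.used", "power.draw",
]


def build_query_command(interval_s: int = 5) -> List[str]:
    """Command that prints one CSV line of QUERY_FIELDS every interval_s seconds."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),
    ]


def parse_sample(line: str) -> Dict[str, Union[str, Optional[float]]]:
    """Parse one CSV line; non-numeric readings become None."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    sample: Dict[str, Union[str, Optional[float]]] = {"timestamp": values[0]}
    for name, raw in zip(QUERY_FIELDS[1:], values[1:]):
        try:
            sample[name] = float(raw)
        except ValueError:
            sample[name] = None  # reading not available on this card/driver
    return sample
```

Piping the command's output to a file and parsing it afterwards keeps the sampling loop independent of Python, so the log survives even if the analysis script is not running when the card fails.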
    sty (OP)  #9  Aug 12, 2024
    @cinlen
    [ 2.018550] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
    [ 369.857712] NVRM: GPU 0000:01:00.0: Failed to enable MSI; falling back to PCIe virtual-wire interrupts.
    [ 493.216012] NVRM: GPU 0000:01:00.0: Failed to enable MSI; falling back to PCIe virtual-wire interrupts.
    [ 1537.808965] NVRM: GPU 0000:01:00.0: Failed to enable MSI; falling back to PCIe virtual-wire interrupts.
    [ 1764.689999] NVRM: GPU 0000:01:00.0: Failed to enable MSI; falling back to PCIe virtual-wire interrupts.
    [ 1766.588211] NVRM: GPU 0000:01:00.0: Failed to enable MSI; falling back to PCIe virtual-wire interrupts.
    [ 1775.551022] NVRM: GPU 0000:01:00.0: Failed to enable MSI; falling back to PCIe virtual-wire interrupts.
    Could you take a look at this?
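To see how often the NVRM message repeats, the dmesg text can be reduced to (timestamp, message) pairs. This is only a sketch: the regular expression assumes the default dmesg format with seconds-since-boot in square brackets, as in the lines pasted above, and the function names are made up.

```python
import re
from typing import List, Tuple

# "[ 369.857712] NVRM: <message>" with seconds since boot in brackets.
NVRM_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+NVRM:\s+(.*)$")


def nvrm_events(dmesg_text: str) -> List[Tuple[float, str]]:
    """Return (seconds_since_boot, message) for every NVRM line."""
    events = []
    for line in dmesg_text.splitlines():
        match = NVRM_LINE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def gaps_between(events: List[Tuple[float, str]]) -> List[float]:
    """Seconds elapsed between consecutive NVRM events."""
    times = [t for t, _ in events]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]
```

Feeding it the output of `dmesg` (or `journalctl -k`) gives a list that can be lined up against the times when `nvidia-smi` starts failing.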
    sty (OP)  #10  Aug 12, 2024
    @rickiey nvidia-smi
    ```
    Mon Aug 12 15:30:33 2024
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 3090 Ti     Off |   00000000:01:00.0 Off |                  Off |
    | 30%   41C    P0              N/A / 450W |       1MiB /  24564MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    ```
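    cinlen  #11  Aug 12, 2024
    Run lspci -s 01:00.0 -nnDk once while the card is healthy and once while it is failing, and check which kernel driver the card is bound to in each case. I have an NVIDIA Tesla that has hit 90 °C without ever showing this problem.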
    sty (OP)  #12  Aug 12, 2024
    @cinlen Under normal conditions, lspci -s 01:00.0 -nnDk gives:
    0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2203] (rev a1)
    Subsystem: Device [7377:2000]
    Kernel driver in use: nvidia
    libkmod: kmod_config_parse: /etc/modprobe.d/blacklist-nouveau.conf line 1: ignoring bad line starting with 'cklist'
    Kernel modules: nouveau, nvidia_drm, nvidia
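To compare the healthy and failing states without eyeballing two pastes, the `lspci -k` output can be reduced to the bound driver and the candidate modules. A small sketch with invented helper names; it only looks at the two `Kernel ...` lines and ignores everything else, including stray warnings such as the libkmod line above.

```python
from typing import Dict, List, Optional, Union

LspciInfo = Dict[str, Union[Optional[str], List[str]]]


def parse_lspci_k(output: str) -> LspciInfo:
    """Extract the bound driver and the listed kernel modules from `lspci -k` text."""
    info: LspciInfo = {"driver_in_use": None, "kernel_modules": []}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Kernel driver in use:"):
            info["driver_in_use"] = line.split(":", 1)[1].strip()
        elif line.startswith("Kernel modules:"):
            modules = line.split(":", 1)[1].split(",")
            info["kernel_modules"] = [m.strip() for m in modules if m.strip()]
    return info


def driver_changed(healthy_output: str, failing_output: str) -> bool:
    """True when the bound driver differs between the two captures."""
    healthy = parse_lspci_k(healthy_output)["driver_in_use"]
    failing = parse_lspci_k(failing_output)["driver_in_use"]
    return healthy != failing
```

If the failing capture has no "Kernel driver in use" line at all, `driver_in_use` is `None` and `driver_changed` reports a difference, which is the case worth looking into.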
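    daishuge  #13  Aug 12, 2024 via Android
    Outsider here. I'd like to ask whether something like this would be covered by warranty if the card was bought through an official channel. Thanks.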
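    lsp7572  #14  Aug 12, 2024
    I searched a bit and other people have run into this, with causes such as power supply problems. Have you searched for it or tried any of those fixes yourself? The post doesn't say what you have already tried.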
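    sty (OP)  #15  Aug 12, 2024
    @lsp7572 I have tried every software fix I could find. The machine sits in a server room and getting physical access involves a tedious process, so if it can't be solved in software I will file a hardware ticket.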
    huaijin  #16  Aug 12, 2024
    Open Device Manager and check whether the GPU driver is corrupted.
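    sweelia  #17  Aug 12, 2024
    I have a 2080 Ti modded to 22 GB. Training would abort after a few days with communication/IO-related errors; the kernel driver ended up in a bad state and only a reboot recovered it.
    Thinking I was being clever, I assumed it was a driver compatibility issue and wrote a script that rebooted automatically and resumed training.
    A little over two weeks later the driver stopped recognizing the card altogether. A careful inspection showed cold solder joints on the VRAM. After the chips were removed and resoldered the card worked again. I then capped the maximum power and added cooling, and for several months now I haven't hit anything that needed a reboot.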
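For the power cap mentioned above, `nvidia-smi -pl <watts>` sets a board power limit (it normally needs root). The sketch below only builds the command and refuses values outside a range supplied by the caller. The helper name and the numbers in the usage note are hypothetical; the real limits for a given card can be read with `nvidia-smi --query-gpu=power.min_limit,power.max_limit --format=csv`.

```python
from typing import List


def power_cap_command(watts: int, min_w: int, max_w: int, gpu_index: int = 0) -> List[str]:
    """Build an `nvidia-smi -pl` command, rejecting values outside [min_w, max_w]."""
    if not min_w <= watts <= max_w:
        raise ValueError(f"{watts} W is outside the allowed range {min_w}-{max_w} W")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]
```

For example, `power_cap_command(300, 100, 450)` yields `["nvidia-smi", "-i", "0", "-pl", "300"]`, which can then be passed to `subprocess.run`. The limit typically does not persist across reboots, so it would have to be reapplied at startup.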