NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Question

I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe GPU utilization while training my TensorFlow models. I get an error when I run 'nvidia-smi'.

    ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
    ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
    nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
    nvidia-cuda-mps-control  nvidia-persistenced
    nvidia-cuda-mps-server   nvidia-smi
    ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

    ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia
    ii  nvidia-346             352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-346
    ii  nvidia-346-dev         346.46-0ubuntu1          amd64  NVIDIA binary Xorg driver development files
    ii  nvidia-346-uvm         346.96-0ubuntu0.0.1      amd64  Transitional package for nvidia-346
    ii  nvidia-352             375.26-0ubuntu1          amd64  Transitional package for nvidia-375
    ii  nvidia-375             375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary driver - version 375.39
    ii  nvidia-375-dev         375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary Xorg driver development files
    ii  nvidia-modprobe        375.26-0ubuntu1          amd64  Load the NVIDIA kernel driver and create device files
    ii  nvidia-opencl-icd-346  352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-opencl-icd-352
    ii  nvidia-opencl-icd-352  375.26-0ubuntu1          amd64  Transitional package for nvidia-opencl-icd-375
    ii  nvidia-opencl-icd-375  375.39-0ubuntu0.14.04.1  amd64  NVIDIA OpenCL ICD
    ii  nvidia-prime           0.6.2.1                  amd64  Tools to enable NVIDIA's Prime
    ii  nvidia-settings        375.26-0ubuntu1          amd64  Tool for configuring the NVIDIA graphics driver

    ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
    00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

    $ inxi -G
    Graphics:  Card-1: Cirrus Logic GD 5446
               Card-2: NVIDIA GK104GL [GRID K520]
               X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

    $ lspci -k | grep -A 2 -E "(VGA|3D)"
    00:02.0 VGA compatible controller: Cirrus Logic GD 5446
            Subsystem: XenSource, Inc. Device 0001
            Kernel driver in use: cirrus
    00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
            Subsystem: NVIDIA Corporation Device 1014
    00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
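
In the 'lspci -k' output, the Cirrus device reports 'Kernel driver in use: cirrus', while the GRID K520 entry has no such line, which suggests that no kernel module is bound to the GPU. As a rough sketch, that check could be automated as follows. The function name 'unbound_gpus' is made up for illustration, and the code assumes the usual 'lspci -k' layout of an unindented device line followed by indented detail lines.

```shell
# Hypothetical helper: reads `lspci -k` output on stdin and prints every
# VGA/3D controller that has no "Kernel driver in use" line.
unbound_gpus() {
    awk '
        # A device header starts in column 0; detail lines are indented.
        /^[^ \t]/ {
            if (is_gpu && !bound) print header
            header = $0
            is_gpu = ($0 ~ /VGA compatible controller|3D controller/)
            bound = 0
            next
        }
        /Kernel driver in use:/ { bound = 1 }
        END { if (is_gpu && !bound) print header }
    '
}

# Usage on a real machine:
#   lspci -k | unbound_gpus
```

Any device printed by this helper is a candidate for a missing or unloaded driver module; an empty result only means that some driver is bound, not that it is the intended one.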

I followed these instructions to install CUDA 7 and cuDNN:

    $ sudo apt-get -q2 update
    $ sudo apt-get upgrade
    $ sudo reboot

=====================================================================

After the reboot, update the initramfs by running '$ sudo update-initramfs -u'

Now edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.

    blacklist nouveau
    blacklist lbm-nouveau
    options nouveau modeset=0
    alias nouveau off
    alias lbm-nouveau off

Save and exit the file.
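
The manual edit can also be scripted. The following is only a sketch: the function name 'blacklist_nouveau' is hypothetical, and it simply appends each of the five lines listed above if the file does not already contain it, so running it twice does not create duplicates.

```shell
# Hypothetical helper: append the nouveau blacklist lines to the file
# given as $1, skipping any line that is already present.
blacklist_nouveau() {
    conf=$1
    for line in \
        'blacklist nouveau' \
        'blacklist lbm-nouveau' \
        'options nouveau modeset=0' \
        'alias nouveau off' \
        'alias lbm-nouveau off'
    do
        # -q quiet, -x whole-line match, -F fixed string (no regex)
        grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done
}

# Usage (as root), followed by the initramfs update from the instructions:
#   blacklist_nouveau /etc/modprobe.d/blacklist.conf
#   update-initramfs -u
```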

Now install the build-essential tools, update the initramfs, and reboot as below:

    $ sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
    $ sudo update-initramfs -u
    $ sudo reboot

======================================================================

After the reboot, run the following commands to install Nvidia.

    $ sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
    $ sudo chmod 700 ./cuda_7.0.28_linux.run
    $ sudo ./cuda_7.0.28_linux.run
    $ sudo update-initramfs -u
    $ sudo reboot

======================================================================

Now that everything is installed, verify the installation by running the following.

    $ sudo modprobe nvidia
    $ sudo nvidia-smi -q | head

You should see output like 'nvidia.png'.

Now run the following commands.

    $ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
    $ make
    $ ./deviceQuery

However, 'nvidia-smi' still doesn't show any GPU activity while TensorFlow is training models:

    ubuntu@ip-10-0-1-48:~$ ipython
    Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec 6 2015, 18:08:32)
    Type "copyright", "credits" or "license" for more information.

    IPython 4.1.2 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]: import tensorflow as tf
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally

    ubuntu@ip-10-0-1-48:~$ nvidia-smi
    Thu Mar 30 05:45:26 2017
    +------------------------------------------------------+
    | NVIDIA-SMI 346.46     Driver Version: 346.46         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
    | N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
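
To watch the GPU while a training job runs, it can be easier to poll a few fields than to reread the full table. A minimal sketch, assuming the installed driver supports the '--query-gpu' option of 'nvidia-smi' (some older drivers or boards may print '[Not Supported]' for individual fields). The helper name 'summarize_gpu' is hypothetical.

```shell
# Hypothetical helper: turns "index, utilization, memory" CSV rows from
# nvidia-smi into one readable line per sample.
summarize_gpu() {
    awk -F', *' '{ printf "GPU %s: %s%% util, %s MiB used\n", $1, $2, $3 }'
}

# Usage (needs a working driver); -l 1 repeats the query every second:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits -l 1 | summarize_gpu
```

If utilization and memory stay near zero while the training script is running, the job is most likely executing on the CPU.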

nuicca · Answer

I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950m and Ubuntu 18.04 by disabling Secure Boot in the BIOS.
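
With Secure Boot enabled, the kernel may refuse to load unsigned modules, which can include a locally built NVIDIA module; that is presumably why disabling it helped here. A small sketch for checking the current state, assuming the 'mokutil' tool is installed and reports either 'SecureBoot enabled' or 'SecureBoot disabled'. The wrapper name 'secure_boot_state' is hypothetical.

```shell
# Hypothetical helper: reads `mokutil --sb-state` output on stdin and
# prints "enabled", "disabled", or "unknown".
secure_boot_state() {
    case $(cat) in
        *"SecureBoot enabled"*)  echo enabled ;;
        *"SecureBoot disabled"*) echo disabled ;;
        *)                       echo unknown ;;
    esac
}

# Usage:
#   mokutil --sb-state | secure_boot_state
```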

Heapify · Answer

I was getting the same error on Ubuntu 16.04 (Linux kernel 4.14) on Google Compute Engine with a K80 GPU. Upgrading the kernel to 4.15 solved the problem. Here is how I upgraded my Linux kernel from 4.14 to 4.15:

Step 1: Check the existing kernel of your Ubuntu Linux:

    uname -a

Step 2: Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3: Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:

    wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
    wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
    wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4: Install all the downloaded deb files:

    sudo dpkg -i *.deb

Step 5: Reboot your machine and check whether the kernel has been updated:

    uname -a

You should see that your kernel has been updated, and nvidia-smi should now work.
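
The check after the reboot can be scripted instead of read off 'uname -a' by eye. This is a sketch only: the helper name 'version_at_least' is hypothetical, and it relies on the version ordering of GNU 'sort -V', which ships with Ubuntu's coreutils.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2,
# using GNU sort's version ordering.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: compare the running kernel against the release installed above.
#   if version_at_least "$(uname -r)" 4.15; then
#       echo "kernel is 4.15 or newer"
#   fi
```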

gowin · Answer

Run the following to find the right NVIDIA driver:

    sudo ubuntu-drivers devices

Then pick the right one and run:

    sudo apt install <driver-package>
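
Picking the driver can be done by eye, or roughly automated. The sketch below assumes that 'ubuntu-drivers devices' prints lines of the form 'driver : <package> - distro non-free recommended', which is how recent Ubuntu releases mark the suggested package; the helper name 'recommended_driver' and the package name in the usage comment are illustrative only.

```shell
# Hypothetical helper: reads `ubuntu-drivers devices` output on stdin and
# prints the package name of every driver flagged as recommended.
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3 }'
}

# Usage:
#   ubuntu-drivers devices | recommended_driver
#   # prints something like: nvidia-driver-390 (illustrative name)
```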

dbl001 · Answer

I had to install NVIDIA driver 367.57 and CUDA 7.5 with TensorFlow on the g2.2xlarge Ubuntu 14.04 LTS instance, e.g. nvidia-graphics-drivers-367_367.57.orig.tar

Now the GRID K520 GPU is in use while I train TensorFlow models:

    ubuntu@ip-10-0-1-70:~$ nvidia-smi
    Sat Apr 1 18:03:32 2017
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
    | N/A   39C    P8    43W / 125W |   3800MiB /  4036MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0      2254    C   python                                        3798MiB |
    +-----------------------------------------------------------------------------+

    ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
    ./deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 1 CUDA Capable device(s)

    Device 0: "GRID K520"
      CUDA Driver Version / Runtime Version          8.0 / 7.0
      CUDA Capability Major/Minor version number:    3.0
      Total amount of global memory:                 4036 MBytes (4232052736 bytes)
      ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
      GPU Max Clock rate:                            797 MHz (0.80 GHz)
      Memory Clock rate:                             2500 Mhz
      Memory Bus Width:                              256-bit
      L2 Cache Size:                                 524288 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
    Result = PASS