Graphics Drivers

nVidia makes both the best and the worst graphics hardware for Linux. The worst because the drivers are fraught with proprietary nonsense, and the best because, well, it works pretty well.

If you need a system where you can audit all the source code, nVidia hardware may not be an option. But if you just need some simple Linux workstations for 3d graphics, it might be the simplest option.

I find that using nVidia’s automagical installer/driver just works. Usually.

I have separate notes for CUDA.

Also, for CentOS-specific packaging related to Nvidia drivers, see my CentOS notes.

Drivers

At the current time (late 2012) the Linux drivers live here. Note that "Linux x86/IA32" is for 32-bit systems (check yours with something like file /sbin/init). These days you probably want "Linux x86_64/AMD64/EM64T".
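
For example (uname -m is just another standard way to check; not something from the driver page):

file /sbin/init    # look for "64-bit" in the output
uname -m           # x86_64 means you want the 64 bit driver, i686 the 32 bit one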

What version are you currently using? Check with this.

cat /proc/driver/nvidia/version

Installing and Updates

It turns out that GPU drivers are deeply in touch with the kernel. The driver itself is a kernel module. This module must match the kernel and must be built to fit. The Nvidia installer automagically takes care of all this (assuming you have a build environment with a compiler, kernel headers, etc).
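
On CentOS that build environment is roughly something like this (the usual package names; adjust for your distribution):

sudo yum install gcc make kernel-devel kernel-headers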

The problem is that whenever you update your machine and there is a kernel update (which is about every two weeks in my experience), the graphics will stop working. You must reboot into the new kernel (you can’t fix it right after doing the update while still running the previous kernel). Then you’ll be in some no-man’s-land text console with no prompt (CentOS6). Use "Alt-F2" to go to a console with a getty login prompt. Log in and reinstall the Nvidia driver. This is also the process after you first install CentOS.

I find that I do this so often that I have a tiny script to make it automatic so I don’t have to answer questions and generally hold its hand. My little script looks like:

#!/bin/bash
sh /pro/nvidia/current -a -q -X --ui=none -f -n

For Debian-style distributions this works.

#!/bin/bash

echo "Shutting down X server..."
sudo service lightdm stop

echo "Running NVIDIA kernel module installer..."
sudo sh ~/src/NVIDIA-Linux-x86_64-304.117.run -a -q -X --ui=none -f -n

And that lives in a directory with an assortment of drivers where current is a link to the one I need most often:

:->[host][~]$ ls /usr/local/nvidia/
NVIDIA-Linux-x86-304.64.run             NVIDIA-Linux-x86_64-304.64.run
NVIDIA-Linux-x86_64-173.14.22-pkg2.run  current
NVIDIA-Linux-x86_64-190.53-pkg2.run     nvfix
NVIDIA-Linux-x86_64-195.36.15-pkg2.run

Update Process

When I update I usually do it remotely. I log in and do sudo yum -y update. Then, if a new kernel has been installed, I do sudo reboot, wait a couple of minutes (sleep 111), and log in again. This time everything seems fine and is updated, but the users sitting at the workstation will find a confusing text screen with no prompt, because graphics are actually dead. This is when you need to run the nvfix script shown above (that’s sudo /usr/local/nvidia/nvfix of course, since it must be run as root). Then you must sudo reboot again. At that point everything should be cool. It’s a good idea to wait and log back in when it comes up; I’ve had machines mysteriously not wake up after the reboot.
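
Condensed into a sketch (the host name is hypothetical; the rest is just the sequence described above):

ssh theworkstation            # hypothetical host name
sudo yum -y update            # if this pulls in a new kernel...
sudo reboot
sleep 111                     # give it a couple of minutes
ssh theworkstation
sudo /usr/local/nvidia/nvfix  # rebuild the nvidia module against the new kernel
sudo reboot                   # graphics should be back after this one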

ElRepo

It might be smarter these days to try to use prepackaged proprietary drivers from the ElRepo repository.
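
I have not fully switched to this, but the rough shape is something like the following (the release RPM URL and the kmod-nvidia package name are from memory; check ElRepo’s own instructions):

sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm
sudo yum install kmod-nvidia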

One problem I had after upgrading from 7.x to 7.4 is that although the modules seemed inserted and everything seemed fine, no graphics happened. This talks about it and has some good general troubleshooting tips. It seems that lightdm wasn’t starting or staying started. Doing systemctl start lightdm got it going and systemctl enable lightdm seems to have cured it.

Nouveau Issues

In CentOS 6 and later the default on installation is to use the new open source Nouveau drivers. That’s nice and I’m glad that someone’s working on a wholesome alternative. But the problem is that these drivers underperform, by a factor of two in my tests. Test it yourself before committing.

Now the really gruesome bit is that you can’t easily install the proprietary drivers while the Nouveau ones are in. Maybe nVidia will fix their installer to be less stupid, but for now it’s quite a chore to extricate the Nouveau driver. The best plan is often to reinstall CentOS and make sure you select the reduced graphics mode. I forget exactly what it’s called, but it doesn’t just affect the installation graphics, it affects what drivers get installed. With that basic graphics mode (or whatever it’s called), the normal non-accelerated X drivers are installed and those can be replaced by the nVidia installer.
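
If you’re stuck extricating Nouveau from an already-installed system, the usual recipe (a sketch I don’t use routinely; CentOS 7 assumed) is to blacklist the module, rebuild the initramfs, and reboot to a text console before running the nVidia installer:

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo dracut --force                            # rebuild the initramfs without nouveau
sudo systemctl set-default multi-user.target   # boot to a text console (CentOS 7)
sudo reboot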

Legacy

Sometimes you’ll have an older machine:

:->[ws9-ablab.ucsd.edu][~]$ lspci | grep -i [n]vi
01:00.0 VGA compatible controller: NVIDIA Corporation NV43
[GeForce 6600 GT] (rev a2)

And running the normal installer fails with some kind of message about legacy drivers. On the machine above I had to run NVIDIA-Linux-x86_64-304.64.run and then it worked. This version was found on the driver page above, called Latest Legacy GPU version (304.xx series). There are other legacy series like 71.86.xx, 96.43.xx, and 173.14.xx. Use whichever one the installer suggests.

Manual Tweaking With xrandr

I had two vertical 1080x1920 monitors and the "Display" program in Mate was just garbling them. Here’s what I did to sort that out.

xrandr --fb 2160x1920   \
       --output HDMI-1  \
       --auto \
       --pos 0x0 \
       --output DVI-I-1 \
       --auto \
       --pos 1080x0

Or more recently with a different card…

xrandr --fb 2160x1920 \
    --output HDMI-0 --auto --rotate left --pos 0x0 \
    --output DVI-D-0 --auto --rotate right --pos 1080x0

Here’s another example of my 3 vertical HP monitor setup which each have the slightly unusual resolution of 1920x1200.

xrandr --fb 3600x1920 \
       --output VGA-0   --auto --pos 0x0 \
       --output DVI-D-0 --auto --pos 1200x0 \
       --output HDMI-0  --auto --pos 2400x0

Also note these options, which I did not need, in case your setup requires them.

--rotate left
--output A --left-of B
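
Output names vary by card and driver (HDMI-1 vs HDMI-0, DVI-I-1 vs DVI-D-0 in the examples above), so check what yours are actually called first.

xrandr --query | grep -w connected   # list connected outputs and their modes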

In CentOS 7’s Mate I’m finding that the System->Preferences->Hardware->Displays tool just can’t put my vertical monitors together properly. What works is to close that and use an xrandr command as shown above. Then go back to the Displays GUI tool when everything is correct. It will come up detected correctly, and this is when you want to click "Apply" and then "Apply system-wide". I don’t know what that writes, but once it’s written, things work as they should. Well, not the display manager of course, but who cares about that?

Dummy

From the xpra Xdummy documentation. "Proprietary drivers often install their own copy of libGL which conflicts with the use of software GL rendering. You cannot use this GL library to render directly on Xdummy (or Xvfb)."

This is why you might have trouble using non-interactive rendering tools.

Here is one way Andrey solved this problem. First he grabbed a libGL.so.1 from a Mesa system (no nvidia drivers). That can be stored locally with no special privileges.

Then run the application with something like this.

LD_PRELOAD=/home/${USER}/tmp/libGL.so.1 /usr/bin/Xvfb :96 -cc 4 -screen 0 1024x768x16
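
Once that Xvfb is running on :96, point your rendering client at it. glxinfo here is just an illustrative test client, and the client may also need the same LD_PRELOAD:

LD_PRELOAD=/home/${USER}/tmp/libGL.so.1 DISPLAY=:96 glxinfo | grep -i opengl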

AMD

Just some quick notes on AMD/ATI drivers. AMD tries to match nVidia, but they’re a bit behind. However, here are some programs that might come in handy.

amdcccle
fglrxinfo
fgl_glxgears

CUDA And GPU Programming

Resources

Setup

You might need one or more of these.

apt install nvidia-driver
apt install nvidia-dev
apt install nvidia-support
apt install nvidia-cuda-toolkit

nvidia-smi - Checking GPU Action

The "smi" stands for System Management Interface. This command is important for (1) seeing what kind of GPU your system thinks it has access to and (2) how hard that GPU is working right now.

cat /proc/driver/nvidia/version
sudo apt install nvidia-smi
nvidia-smi
nvidia-smi -l 1 # One second refresh.

What processes are using your nvidia device? This can be interesting.

$ sudo fuser -v /dev/nvidia*
USER        PID ACCESS COMMAND
/dev/nvidia0:        root        752 F...m Xorg
/dev/nvidiactl:      root        752 F...m Xorg
/dev/nvidia-modeset: root        752 F.... Xorg

Also for real time monitoring try this.

nvidia-smi pmon

Oh and check out nvtop! That’s a very nice visualization tool.

sudo apt install nvtop

CUDA Specs From Software

Writing a program that needs some CUDA? How can you check if what you have is sufficient? After stumbling into some kind of bug with the photogrammetry project Meshroom, I wanted to know how to check my CUDA Compute Capability, whatever the hell that is. I dug into the AliceVision source code and pulled out the offending checks that said I did not have a CUDA-capable card. Specifically from here. I distilled it into the following short program which does all the checks Meshroom seems to know about. These checks seem generally useful so here is the program.

ckgpu.cpp
// Compile with `g++ -o ckgpu ckgpu.cpp -lcudart`
#include <string>
#include <iostream>
#include <sstream>
#include <cuda_runtime.h>

// ================== gpuSupportCUDA ==================
bool gpuSupportCUDA(int minComputeCapabilityMajor,
    int minComputeCapabilityMinor,
    int minTotalDeviceMemory=0) {
    int nbDevices = 0;
    cudaError_t success;
    success = cudaGetDeviceCount(&nbDevices);
    if (success != cudaSuccess) {
        std::cout << "cudaGetDeviceCount failed: " << cudaGetErrorString(success);
        nbDevices = 0;
    }

    if(nbDevices > 0) {
        for(int i = 0; i < nbDevices; ++i) {
            cudaDeviceProp deviceProperties;
            if(cudaGetDeviceProperties(&deviceProperties, i) != cudaSuccess) {
                std::cout << "Cannot get properties for CUDA gpu device " << i;
                continue;
            }
            if((deviceProperties.major > minComputeCapabilityMajor ||
                (deviceProperties.major == minComputeCapabilityMajor &&
                 deviceProperties.minor >= minComputeCapabilityMinor)) &&
                deviceProperties.totalGlobalMem >= (minTotalDeviceMemory*1024*1024)) {
                std::cout << "Supported CUDA-Enabled GPU detected." << std::endl;
                return true;
            }
            else {
                std::cout << "CUDA-Enabled GPU detected, but the compute capabilities is not enough.\n"
                    << " - Device " << i << ": " << deviceProperties.major << "." << deviceProperties.minor
                    << ", global memory: " << int(deviceProperties.totalGlobalMem / (1024*1024)) << "MB\n"
                    << " - Requirements: " << minComputeCapabilityMajor << "." << minComputeCapabilityMinor
                    << ", global memory: " << minTotalDeviceMemory << "MB\n";
            }
        } // End for i<nbDevices
        std::cout << ("CUDA-Enabled GPU not supported.");
    } // End if nbDevices
    else { std::cout << ("Can't find CUDA-Enabled GPU."); }
    return false;
} // End gpuSupportCUDA()

// ================== gpuInformationCUDA ==================
std::string gpuInformationCUDA() {
    std::string information;
    int nbDevices = 0;
    if( cudaGetDeviceCount(&nbDevices) != cudaSuccess ) {
        std::cout << ( "Could not determine number of CUDA cards in this system" );
        nbDevices = 0;
    }
    if(nbDevices > 0) {
        information = "CUDA-Enabled GPU.\n";
        for(int i = 0; i < nbDevices; ++i) {
            cudaDeviceProp deviceProperties;
            if(cudaGetDeviceProperties( &deviceProperties, i) != cudaSuccess ) {
                std::cout << "Cannot get properties for CUDA gpu device " << i;
                continue;
            }
            if( cudaSetDevice( i ) != cudaSuccess ) {
                std::cout << "Device with number " << i << " does not exist" ;
                continue;
            }
            std::size_t avail;
            std::size_t total;
            if(cudaMemGetInfo(&avail, &total) != cudaSuccess) { // if the card does not provide this information.
                avail = 0;
                total = 0;
                std::cout << "Cannot get available memory information for CUDA gpu device " << i << ".";
            }
            std::stringstream deviceSS;
            deviceSS << "Device information:" << std::endl
                << "\t- id:                      " << i << std::endl
                << "\t- name:                    " << deviceProperties.name << std::endl
                << "\t- compute capability:      " << deviceProperties.major << "." << deviceProperties.minor << std::endl
                << "\t- total device memory:     " << deviceProperties.totalGlobalMem / (1024 * 1024) << " MB " << std::endl
                << "\t- device memory available: " << avail / (1024 * 1024) << " MB " << std::endl
                << "\t- per-block shared memory: " << deviceProperties.sharedMemPerBlock << std::endl
                << "\t- warp size:               " << deviceProperties.warpSize << std::endl
                << "\t- max threads per block:   " << deviceProperties.maxThreadsPerBlock << std::endl
                << "\t- max threads per SM(X):   " << deviceProperties.maxThreadsPerMultiProcessor << std::endl
                << "\t- max block sizes:         "
                << "{" << deviceProperties.maxThreadsDim[0]
                << "," << deviceProperties.maxThreadsDim[1]
                << "," << deviceProperties.maxThreadsDim[2] << "}" << std::endl
                << "\t- max grid sizes:          "
                << "{" << deviceProperties.maxGridSize[0]
                << "," << deviceProperties.maxGridSize[1]
                << "," << deviceProperties.maxGridSize[2] << "}" << std::endl
                << "\t- max 2D array texture:    "
                << "{" << deviceProperties.maxTexture2D[0]
                << "," << deviceProperties.maxTexture2D[1] << "}" << std::endl
                << "\t- max 3D array texture:    "
                << "{" << deviceProperties.maxTexture3D[0]
                << "," << deviceProperties.maxTexture3D[1]
                << "," << deviceProperties.maxTexture3D[2] << "}" << std::endl
                << "\t- max 2D linear texture:   "
                << "{" << deviceProperties.maxTexture2DLinear[0]
                << "," << deviceProperties.maxTexture2DLinear[1]
                << "," << deviceProperties.maxTexture2DLinear[2] << "}" << std::endl
                << "\t- max 2D layered texture:  "
                << "{" << deviceProperties.maxTexture2DLayered[0]
                << "," << deviceProperties.maxTexture2DLayered[1]
                << "," << deviceProperties.maxTexture2DLayered[2] << "}" << std::endl
                << "\t- number of SM(x)s:        " << deviceProperties.multiProcessorCount << std::endl
                << "\t- registers per SM(x):     " << deviceProperties.regsPerMultiprocessor << std::endl
                << "\t- registers per block:     " << deviceProperties.regsPerBlock << std::endl
                << "\t- concurrent kernels:      " << (deviceProperties.concurrentKernels ? "yes":"no") << std::endl
                << "\t- mapping host memory:     " << (deviceProperties.canMapHostMemory ? "yes":"no") << std::endl
                << "\t- unified addressing:      " << (deviceProperties.unifiedAddressing ? "yes":"no") << std::endl
                << "\t- texture alignment:       " << deviceProperties.textureAlignment << " byte" << std::endl
                << "\t- pitch alignment:         " << deviceProperties.texturePitchAlignment << " byte" << std::endl;
            information += deviceSS.str();
        } // End for i<nbDevices
    } // End nbDevices>0
    else { information = "No CUDA-Enabled GPU."; }
    return information;
} // End gpuInformationCUDA()

int main(int argc, char **argv){
    gpuSupportCUDA(2,0);
    std::cout << gpuInformationCUDA();
    return 0;
}

As you can see, contrary to what Meshroom believes for some erroneous reason, I do have a GPU that can pass the very same checks that software uses.

$ g++ -o ckgpu ckgpu.cpp  -lcudart
$ ./ckgpu
Supported CUDA-Enabled GPU detected.
CUDA-Enabled GPU.
Device information:
        - id:                      0
        - name:                    GeForce GTX 1050 Ti
        - compute capability:      6.1
        - total device memory:     4039 MB
        - device memory available: 3797 MB
        - per-block shared memory: 49152
        - warp size:               32
        - max threads per block:   1024
        - max threads per SM(X):   2048
        - max block sizes:         {1024,1024,64}
        - max grid sizes:          {2147483647,65535,65535}
        - max 2D array texture:    {131072,65536}
        - max 3D array texture:    {16384,16384,16384}
        - max 2D linear texture:   {131072,65000,2097120}
        - max 2D layered texture:  {32768,32768,2048}
        - number of SM(x)s:        6
        - registers per SM(x):     65536
        - registers per block:     65536
        - concurrent kernels:      yes
        - mapping host memory:     yes
        - unified addressing:      yes
        - texture alignment:       512 byte
        - pitch alignment:         32 byte

Compiling

Using CUDA is pretty well behaved because it is easiest (maybe required) to use nvcc, which is essentially a wrapper around gcc that pulls in all the right CUDA headers and libraries.

nvcc -O3 -arch sm_30 -lineinfo -DDEBUG -c kernel.cu
nvcc -O3 -arch sm_30 -lineinfo -DDEBUG -o x.naive_transpose kernel.o
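
You generally want the -arch flag to match your card’s compute capability, which the ckgpu program above reports; compute capability 6.1, for example, means -arch sm_61.

nvcc --version                        # check the toolkit version
./ckgpu | grep 'compute capability'   # 6.1 -> use -arch sm_61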

Concurrency

The GPU can concurrently do any of the following.

  • Compute

  • move data from host to device

  • move data from device to host

  • 4-way concurrency would also have CPU involved

  • each thread can do the basic 3-way concurrency, so there are many more parallel possibilities

This is serial (input, compute, output).

iiiiiiccccccoooooo

This is with concurrencies.

iiicccooo
   iiicccooo

Nvidia offers a fancy visual profiler that visualizes these timelines quite nicely to help optimize concurrency.
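
The relevant tool names, from memory: nvprof/nvvp on older toolkits, Nsight Systems on newer ones.

nvprof ./x.naive_transpose        # command line profiling (older toolkits)
nvvp                              # the visual profiler GUI (older toolkits)
nsys profile ./x.naive_transpose  # Nsight Systems on newer toolkits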

Organizational Structures

  • SM - Streaming Multiprocessors

    • Scalar processors or "cores" (32 or so out of maybe 512)

    • Shared memory

    • L1 data cache

    • Shared registers

    • Special Function Units

    • Clocks

  • blocks

  • warp

    • executed in parallel (SIMD)

    • contains 32 threads

  • threads

  • registers are 32bit

  • global memory is not really system wide global.

    • coalesce loads and stores

  • shared memory

    • the /cfs of GPUs

    • 32 banks of 4 bytes

    • Needs __syncthreads()

  • banks

    • like a doorway to provide access to threads

    • two threads accessing one bank get serialized

    • best to get each thread accessing their own unique bank

  • streams

    • a queue of work

    • ordered list of commands

    • FIFO

    • multiple streams have no ordering between them

    • if not specified, goes to default stream, 0.

    • multistream programming needs >0 stream for async

  • kernel - is the callback like function that runs in the CUDA cores.

    • __global__ void mykernelfn(const int a, const int b){...}

    • kernel<<<blocks,threads,[smem[,stream]]>>>();

Examples

Example Performance of Transposing a Matrix

Using GPU 0: Tesla K80
Matrix size is 16000
Total memory required per matrix is 2048.000000 MB
Total time CPU is 1.255781 sec
Performance is 3.261715 GB/s
Total time GPU is 0.067238 sec
Performance is 60.918356 GB/s

Same Matrix Transpose Optimizing Concurrent Memory Access

Using GPU 0: Tesla K80
Matrix size is 16000
Total memory required per matrix is 2048.000000 MB
Total time CPU is 1.256058 sec
Performance is 3.260996 GB/s
Total time GPU is 0.035628 sec
Performance is 114.964519 GB/s

Machine Learning And Jetson

Nvidia is into machine learning in a big way. They have specialized products and dev kits.

Jetson Nano

  • Model P3448-0000

  • Gigabit ethernet

  • (x4) USB3.0 ports

  • 4K HDMI and DisplayPort connector (groan)

  • MIPI CSI (Mobile Industry Processor Interface Camera Serial Interface) - listed as working with Raspberry Pi Camera Module V2

  • Dedicated UART header

  • 40 pin header (GPIO, I2C, UART)

  • J48 jumper - connected means micro-usb2.0 jack operates in device mode, otherwise power supply

  • J40 jumpers - power, reset, etc

  • J15 PWM fan header

  • J18 M.2 Key E connector

Tools

sudo nvpmodel -q # Check active power mode.
tegrastats # Sort of a top for jetson. Includes power too.

GPIO

echo 38 > /sys/class/gpio/export # Map GPIO pin
echo out > /sys/class/gpio/gpio38/direction # Set direction
echo 1 > /sys/class/gpio/gpio38/value # Bit banging
echo 38 > /sys/class/gpio/unexport  # Unmap GPIO
cat /sys/kernel/debug/gpio # Diagnostic

Video

Argus (libargus) = Nvidia’s library

12 CSI lanes.

nvarguscamerasrc

nvgstcapture # Camera view application

v4l2 puts video streams on /dev/video

  • nvhost-msenc

  • nvhost-nvdec

  • gstinspect
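
A minimal camera test pipeline using nvarguscamerasrc might look like this; the caps and the nvoverlaysink element are from memory, so verify them with gst-inspect-1.0:

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! nvoverlaysink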