Graphics Drivers

nVidia makes both the best and the worst graphics cards for Linux. The worst because everything is fraught with proprietary nonsense, and the best, well, because the hardware works pretty well.

If you need a system where you can audit all the source code, nVidia hardware may not be an option. But if you just need some simple Linux workstations for 3d graphics, it might be the simplest option.

I find that using nVidia’s automagical installer/driver just works. Usually.

Also, for CentOS-specific packaging of Nvidia drivers, see my CentOS notes.

Drivers

At the current time (late 2012) the Linux drivers live here. Note that "Linux x86/IA32" is for 32 bit systems. (Check yours with something like file /sbin/init). These days, you probably want "Linux x86_64/AMD64/EM64T".
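
For example, a quick check (uname -m is just another way to ask the same question):

file /sbin/init   # Look for "ELF 64-bit" vs "ELF 32-bit".
uname -m          # x86_64 means 64 bit; i686 means 32 bit.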

What version are you currently using? Check with this.

cat /proc/driver/nvidia/version

Installing and Updates

It turns out that GPU drivers are deeply in touch with the kernel. The driver itself is a kernel module. This module must match the kernel and must be built to fit. The Nvidia installer automagically takes care of all this (assuming you have a build environment with a compiler, etc.).

The problem is that whenever you update your machine and there is a kernel update (which is about every two weeks in my experience), the graphics will stop working. You must reboot into the new kernel (you can’t fix it right after doing the update while running the previous kernel). Then you’ll be in some no-man’s-land text console with no prompt (CentOS 6). Use "Alt-F2" to go to a console with a getty login prompt. Log in and reinstall the Nvidia driver. This is also the process after you first install CentOS.

I find that I do this so often that I have a tiny script to make it automatic so I don’t have to answer questions and generally hold its hand. My little script looks like:

#!/bin/bash
sh /pro/nvidia/current -a -q -X --ui=none -f -n

For Debian-style distributions, this works.

#!/bin/bash

echo "Shutting down X server..."
sudo service lightdm stop

echo "Running NVIDIA kernel module installer..."
sudo sh ~/src/NVIDIA-Linux-x86_64-304.117.run -a -q -X --ui=none -f -n

And that lives in a directory with an assortment of drivers where current is a link to the one I need most often:

:->[host][~]$ ls /usr/local/nvidia/
NVIDIA-Linux-x86-304.64.run             NVIDIA-Linux-x86_64-304.64.run
NVIDIA-Linux-x86_64-173.14.22-pkg2.run  current
NVIDIA-Linux-x86_64-190.53-pkg2.run     nvfix
NVIDIA-Linux-x86_64-195.36.15-pkg2.run

Update Process

When I update I usually do it remotely. I log in and do sudo yum -y update. Then if a new kernel has been installed, I do sudo reboot. Then wait a couple of minutes (sleep 111). And then log in again. This time everything seems fine and is updated, but the users sitting at the workstation will find a confusing text screen with no prompt. This is because graphics are actually dead. This is when you need to run the nvfix script shown above, that’s sudo /usr/local/nvidia/nvfix of course since it must be run as root. Then you must sudo reboot again. At that point everything should be cool. It’s a good idea to wait and log back in when it comes up. I’ve had machines mysteriously not wake up after the reboot.
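
That whole dance, condensed into a rough sketch of what the prose above says (run on the workstation):

sudo yum -y update             # If this pulled in a new kernel...
sudo reboot                    # ...reboot into it; graphics will now be dead.
# Log back in a couple of minutes later, then:
sudo /usr/local/nvidia/nvfix   # Reinstall the Nvidia kernel module.
sudo reboot                    # Graphics should be back after this one.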

ElRepo

It might be smarter these days to try to use prepackaged proprietary drivers from the ElRepo repository.

One problem I had after upgrading from 7.x to 7.4 is that although the modules seem inserted and everything seems fine, no graphics happen. This talks about it and has some good general troubleshooting tips. It seems that lightdm wasn’t starting or staying started. But doing systemctl start lightdm seems to have started it and systemctl enable lightdm seems to have cured it.
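
In other words:

sudo systemctl start lightdm    # Get graphics going right now.
sudo systemctl enable lightdm   # Keep it starting after future reboots.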

Nouveau Issues

In CentOS 6 and later the default thing to do on installation is to use the new open source Nouveau drivers. That’s nice and I’m glad that someone’s working on a wholesome alternative. But the problem is that these drivers under-perform, by a factor of 2 in my tests. Test it yourself before committing.

Now the really gruesome bit is that you can’t easily install the proprietary drivers while the Nouveau ones are in. Maybe nVidia will fix their installer to be less stupid, but for now it’s quite a chore to extricate the Nouveau driver. Often the best plan is to reinstall CentOS and make sure you select the reduced graphics mode. I forget what it’s called, but it doesn’t just affect the installation graphics, it affects which drivers are installed. With the low quality (or whatever it’s called) mode, the normal non-accelerated X drivers are installed and those can be replaced by the nVidia installer.
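
If reinstalling isn’t appealing, the other commonly recommended approach (a sketch, not something covered in the notes above) is to blacklist Nouveau, rebuild the initramfs, and then run the Nvidia installer from a text console:

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo dracut --force    # Rebuild the initramfs so Nouveau doesn’t come back at boot.
sudo reboot            # Come back up without Nouveau and run the Nvidia installer.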

Legacy

Sometimes you’ll have an older machine:

:->[ws9-ablab.ucsd.edu][~]$ lspci | grep -i [n]vi
01:00.0 VGA compatible controller: NVIDIA Corporation NV43
[GeForce 6600 GT] (rev a2)

And running the normal installer fails with some kind of message about legacy drivers. On the machine above I had to run NVIDIA-Linux-x86_64-304.64.run and then it worked. This version was found on the driver page above and called Latest Legacy GPU version (304.xx series). There are other legacy series like 71.86.xx, 96.43.xx, and 173.14.xx. Use what the installers suggest.

Manual Tweaking With xrandr

I had two vertical 1080x1920 monitors and the "Display" program in Mate was just garbling them. Here’s what I did to sort that out.

xrandr --fb 2160x1920   \
       --output HDMI-1  \
       --auto \
       --pos 0x0 \
       --output DVI-I-1 \
       --auto \
       --pos 1080x0

Or more recently with a different card…

xrandr --fb 2160x1920 \
    --output HDMI-0 --auto --rotate left --pos 0x0 \
    --output DVI-D-0 --auto --rotate right --pos 1080x0

Here’s another example of my 3 vertical HP monitor setup which each have the slightly unusual resolution of 1920x1200.

xrandr --fb 3600x1920 \
       --output VGA-0   --auto --pos 0x0 \
       --output DVI-D-0 --auto --pos 1200x0 \
       --output HDMI-0  --auto --pos 2400x0

Also note these options, which I did not need, in case you need to be more explicit.

--rotate left
--output A --left-of B

In CentOS 7’s Mate I’m finding that the System->Preference->Hardware->Displays tool just can’t put my vertical monitors together properly. What works is to close that and use an xrandr command as shown above. Then go back to the Displays GUI tool when everything is correct. It will come up detected correctly, and this is when you want to click "Apply" and then "Apply system-wide". I don’t know what that writes, but once it’s written, things work as they should. Well, not the display manager of course, but who cares about that?

Dummy

From the xpra Xdummy documentation. "Proprietary drivers often install their own copy of libGL which conflicts with the use of software GL rendering. You cannot use this GL library to render directly on Xdummy (or Xvfb)."

This is why you might have trouble using non-interactive rendering tools.

Here is one way Andrey solved this problem. First he grabbed a libGL.so.1 from a Mesa system (no nvidia drivers). That can be stored locally with no special privileges.

Then run the application with something like this.

LD_PRELOAD=/home/${USER}/tmp/libGL.so.1 /usr/bin/Xvfb :96 -cc 4 -screen 0 1024x768x16

AMD

Just some quick notes on AMD/ATI drivers. AMD tries to match nVidia, but they’re a bit behind. However, here are some programs that might come in handy.

amdcccle
fglrxinfo
fgl_glxgears

CUDA And GPU Programming

Resources

Setup

You might need one or more of these.

apt install nvidia-driver
apt install nvidia-dev
apt install nvidia-support
apt install nvidia-cuda-toolkit

Invalid (Rotated) Repo Signing Keys

I recently (it turns out, after 2022-04-27) did a normal apt update on an unremarkable Ubuntu machine and got this disturbing and unexpected error.

http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease' is no longer signed.

It looks like Nvidia is revoking repo keys on some kind of annoyingly frequent schedule now. I’m not exactly clear on the practical real security benefits. But whatever.

If you see this, here is what fixed it for me.

distro=ubuntu2004
arch=x86_64
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo mv -v /etc/apt/sources.list.d/cuda.list /tmp/
sudo apt update # Works now.
sudo apt upgrade

The solution is described here.

nvidia-smi - Checking GPU Action

The "smi" stands for System Management Interface. This command is important for (1) seeing what kind of GPU your system thinks it has access to and (2) how hard that GPU is working right now.

cat /proc/driver/nvidia/version
sudo apt install nvidia-smi
nvidia-smi
nvidia-smi -l 1 # One second refresh.

What processes are using your nvidia device? This can be interesting.

$ sudo fuser -v /dev/nvidia*
USER        PID ACCESS COMMAND
/dev/nvidia0:        root        752 F...m Xorg
/dev/nvidiactl:      root        752 F...m Xorg
/dev/nvidia-modeset: root        752 F.... Xorg

Also for real time monitoring try this.

nvidia-smi pmon

Oh and check out nvtop! That’s a very nice visualization tool.

sudo apt install nvtop

Of course if you’re using an Nvidia computer with an Nvidia distro, this won’t be available! Gah! Maybe install from source? Yes, that seems to work. Here’s the process that worked for me.

sudo apt install cmake libncurses-dev
cd ~/src
git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
cmake .. -DNVIDIA_SUPPORT=ON
make
sudo make install

Or maybe no GPU shows up with lspci. Good to check that first.
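
Something like this will tell you quickly:

lspci | grep -i nvidia    # No output means no Nvidia GPU is visible on the bus.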

Or maybe you need to ensure this library (or one like it) is present; nvtop needs it too.

LD_PRELOAD=/usr/lib/nvidia-367/libnvidia-ml.so nvidia-smi

This is the Nvidia Management Library. This page offers downloads that may be helpful. Or they may be malware. Hard to say.

CUDA Specs From Software

Writing a program that needs some CUDA? How can you check if what you have is sufficient? After stumbling into some kind of bug with the photogrammetry project Meshroom, I wanted to know how to check my CUDA Compute Capability, whatever the hell that is. I dug into the AliceVision source code and pulled out the offending checks that said I did not have a CUDA-capable card. Specifically from here. I distilled it into the following short program which does all the checks Meshroom seems to know about. These checks seem generally useful so here is the program.

ckgpu.cpp
// Compile with `g++ -o ckgpu ckgpu.cpp -lcudart`
#include <string>
#include <iostream>
#include <sstream>
#include <cuda_runtime.h>

// ================== gpuSupportCUDA ==================
bool gpuSupportCUDA(int minComputeCapabilityMajor,
    int minComputeCapabilityMinor,
    int minTotalDeviceMemory=0) {
    int nbDevices = 0;
    cudaError_t success;
    success = cudaGetDeviceCount(&nbDevices);
    if (success != cudaSuccess) {
        std::cout << "cudaGetDeviceCount failed: " << cudaGetErrorString(success);
        nbDevices = 0;
    }

    if(nbDevices > 0) {
        for(int i = 0; i < nbDevices; ++i) {
            cudaDeviceProp deviceProperties;
            if(cudaGetDeviceProperties(&deviceProperties, i) != cudaSuccess) {
                std::cout << "Cannot get properties for CUDA gpu device " << i;
                continue;
            }
            if((deviceProperties.major > minComputeCapabilityMajor ||
                (deviceProperties.major == minComputeCapabilityMajor &&
                 deviceProperties.minor >= minComputeCapabilityMinor)) &&
                deviceProperties.totalGlobalMem >= (minTotalDeviceMemory*1024*1024)) {
                std::cout << "Supported CUDA-Enabled GPU detected." << std::endl;
                return true;
            }
            else {
                std::cout << "CUDA-Enabled GPU detected, but the compute capabilities is not enough.\n"
                    << " - Device " << i << ": " << deviceProperties.major << "." << deviceProperties.minor
                    << ", global memory: " << int(deviceProperties.totalGlobalMem / (1024*1024)) << "MB\n"
                    << " - Requirements: " << minComputeCapabilityMajor << "." << minComputeCapabilityMinor
                    << ", global memory: " << minTotalDeviceMemory << "MB\n";
            }
        } // End for i<nbDevices
        std::cout << ("CUDA-Enabled GPU not supported.");
    } // End if nbDevices
    else { std::cout << ("Can't find CUDA-Enabled GPU."); }
    return false;
} // End gpuSupportCUDA()

// ================== gpuInformationCUDA ==================
std::string gpuInformationCUDA() {
    std::string information;
    int nbDevices = 0;
    if( cudaGetDeviceCount(&nbDevices) != cudaSuccess ) {
        std::cout << ( "Could not determine number of CUDA cards in this system" );
        nbDevices = 0;
    }
    if(nbDevices > 0) {
        information = "CUDA-Enabled GPU.\n";
        for(int i = 0; i < nbDevices; ++i) {
            cudaDeviceProp deviceProperties;
            if(cudaGetDeviceProperties( &deviceProperties, i) != cudaSuccess ) {
                std::cout << "Cannot get properties for CUDA gpu device " << i;
                continue;
            }
            if( cudaSetDevice( i ) != cudaSuccess ) {
                std::cout << "Device with number " << i << " does not exist" ;
                continue;
            }
            std::size_t avail;
            std::size_t total;
            if(cudaMemGetInfo(&avail, &total) != cudaSuccess) { // if the card does not provide this information.
                avail = 0;
                total = 0;
                std::cout << "Cannot get available memory information for CUDA gpu device " << i << ".";
            }
            std::stringstream deviceSS;
            deviceSS << "Device information:" << std::endl
                << "\t- id:                      " << i << std::endl
                << "\t- name:                    " << deviceProperties.name << std::endl
                << "\t- compute capability:      " << deviceProperties.major << "." << deviceProperties.minor << std::endl
                << "\t- total device memory:     " << deviceProperties.totalGlobalMem / (1024 * 1024) << " MB " << std::endl
                << "\t- device memory available: " << avail / (1024 * 1024) << " MB " << std::endl
                << "\t- per-block shared memory: " << deviceProperties.sharedMemPerBlock << std::endl
                << "\t- warp size:               " << deviceProperties.warpSize << std::endl
                << "\t- max threads per block:   " << deviceProperties.maxThreadsPerBlock << std::endl
                << "\t- max threads per SM(X):   " << deviceProperties.maxThreadsPerMultiProcessor << std::endl
                << "\t- max block sizes:         "
                << "{" << deviceProperties.maxThreadsDim[0]
                << "," << deviceProperties.maxThreadsDim[1]
                << "," << deviceProperties.maxThreadsDim[2] << "}" << std::endl
                << "\t- max grid sizes:          "
                << "{" << deviceProperties.maxGridSize[0]
                << "," << deviceProperties.maxGridSize[1]
                << "," << deviceProperties.maxGridSize[2] << "}" << std::endl
                << "\t- max 2D array texture:    "
                << "{" << deviceProperties.maxTexture2D[0]
                << "," << deviceProperties.maxTexture2D[1] << "}" << std::endl
                << "\t- max 3D array texture:    "
                << "{" << deviceProperties.maxTexture3D[0]
                << "," << deviceProperties.maxTexture3D[1]
                << "," << deviceProperties.maxTexture3D[2] << "}" << std::endl
                << "\t- max 2D linear texture:   "
                << "{" << deviceProperties.maxTexture2DLinear[0]
                << "," << deviceProperties.maxTexture2DLinear[1]
                << "," << deviceProperties.maxTexture2DLinear[2] << "}" << std::endl
                << "\t- max 2D layered texture:  "
                << "{" << deviceProperties.maxTexture2DLayered[0]
                << "," << deviceProperties.maxTexture2DLayered[1]
                << "," << deviceProperties.maxTexture2DLayered[2] << "}" << std::endl
                << "\t- number of SM(x)s:        " << deviceProperties.multiProcessorCount << std::endl
                << "\t- registers per SM(x):     " << deviceProperties.regsPerMultiprocessor << std::endl
                << "\t- registers per block:     " << deviceProperties.regsPerBlock << std::endl
                << "\t- concurrent kernels:      " << (deviceProperties.concurrentKernels ? "yes":"no") << std::endl
                << "\t- mapping host memory:     " << (deviceProperties.canMapHostMemory ? "yes":"no") << std::endl
                << "\t- unified addressing:      " << (deviceProperties.unifiedAddressing ? "yes":"no") << std::endl
                << "\t- texture alignment:       " << deviceProperties.textureAlignment << " byte" << std::endl
                << "\t- pitch alignment:         " << deviceProperties.texturePitchAlignment << " byte" << std::endl;
            information += deviceSS.str();
        } // End for i<nbDevices
    } // End nbDevices>0
    else { information = "No CUDA-Enabled GPU."; }
    return information;
} // End gpuInformationCUDA()

int main(int argc, char **argv){
    gpuSupportCUDA(2,0);
    std::cout << gpuInformationCUDA();
    return 0;
}

As you can see, contrary to what Meshroom believes for some erroneous reason, I do have a GPU that can pass the very same checks that this software uses.

$ g++ -o ckgpu ckgpu.cpp  -lcudart
$ ./ckgpu
Supported CUDA-Enabled GPU detected.
CUDA-Enabled GPU.
Device information:
    - id:                      0
    - name:                    GeForce GTX 1050 Ti
    - compute capability:      6.1
    - total device memory:     4039 MB
    - device memory available: 3797 MB
    - per-block shared memory: 49152
    - warp size:               32
    - max threads per block:   1024
    - max threads per SM(X):   2048
    - max block sizes:         {1024,1024,64}
    - max grid sizes:          {2147483647,65535,65535}
    - max 2D array texture:    {131072,65536}
    - max 3D array texture:    {16384,16384,16384}
    - max 2D linear texture:   {131072,65000,2097120}
    - max 2D layered texture:  {32768,32768,2048}
    - number of SM(x)s:        6
    - registers per SM(x):     65536
    - registers per block:     65536
    - concurrent kernels:      yes
    - mapping host memory:     yes
    - unified addressing:      yes
    - texture alignment:       512 byte
    - pitch alignment:         32 byte

vulkaninfo

This seems to be like the OpenGL testers (glxinfo etc.) but it shows what kind of Vulkan features seem to be supported.

$ vulkaninfo | grep -A7 VkPhysicalDeviceProperties
VkPhysicalDeviceProperties:
 --------------------------
        apiVersion     = 4206797 (1.3.205)
        driverVersion  = 142622784 (0x8804040)
        vendorID       = 0x10de
        deviceID       = 0xa5ba03d7
        deviceType     = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName     = NVIDIA Tegra Xavier (nvgpu)

Note this can take up to 8 seconds to run, and on the Jetson Xavier it was being run at every shell launch.
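
If you want to hunt down what is launching it, these are a guess at the usual suspects:

grep -rn vulkaninfo ~/.bashrc ~/.profile /etc/profile.d/ 2>/dev/null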

lshw

What interesting thing does lshw say about your system?

$ sudo lshw -C system
mic-730ai
    description: Computer
    product: Jetson-AGX
    vendor: Unknown
    version: Not Specified
    serial: 1421121018345
    width: 64 bits
    capabilities: smbios-3.0.0 dmi-3.0.0 smp cp15_barrier setend swp tagged_addr_disabled
    configuration: boot=normal family=Unknown sku=Unknown

/proc/driver/nvidia

I think this only works if there is an nvidia module showing up in your lsmod output.

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  460.91.03  Fri Jul  2 06:04:10 UTC 2021
GCC version:  gcc version 10.2.1 20210110 (Debian 10.2.1-6)

Here are some other things to maybe look for.

dmesg | grep -iE 'nvidia|nvrm|agp|vga'
ls -l /dev/dri/* /dev/nvidia*

jetson-release

This produces a lot of useful looking information, but I wonder if it’s just using envp and jtop.

$ jetson_release -v
'DISPLAY' environment variable not set... skipping surface info
 - NVIDIA Jetson UNKNOWN
   * Jetpack UNKNOWN [L4T 34.1.1]
   * NV Power Mode: MODE_15W_DESKTOP - Type: 7
   * jetson_stats.service: active
 - Board info:
   * Type: UNKNOWN
   * SOC Family: tegra194 - ID:
   * Module: UNKNOWN - Board: P2822-0000
   * Code Name: galen
   * CUDA GPU architecture (ARCH_BIN): NONE
   * Serial Number: 1421121018345
 - Libraries:
   * CUDA: NOT_INSTALLED
   * cuDNN: NOT_INSTALLED
   * TensorRT: NOT_INSTALLED
   * Visionworks: NOT_INSTALLED
   * OpenCV: 4.5.4 compiled CUDA: NO
   * VPI: NOT_INSTALLED
   * Vulkan: 1.3.203
 - jetson-stats:
   * Version 3.1.4
   * Works on Python 3.8.10

jetsonUtilities

This can be a helpful diagnostic.

git clone https://github.com/jetsonhacks/jetsonUtilities
cd ./jetsonUtilities
./jetsonInfo.py

Here’s what you don’t want to see.

NVIDIA Jetson UNKNOWN
 L4T 34.1.1 [ JetPack UNKNOWN ]
   Ubuntu 20.04.4 LTS
   Kernel Version: 5.10.65-tegra
'DISPLAY' environment variable not set... skipping surface info
 CUDA NOT_INSTALLED
   CUDA Architecture: NONE
 OpenCV version: 4.5.4
   OpenCV Cuda: NO
 CUDNN: NOT_INSTALLED
 TensorRT: NOT_INSTALLED
 Vision Works: NOT_INSTALLED
 VPI: NOT_INSTALLED
 Vulcan: 1.3.203

GPU Compiler Version

Is nvcc installed and working reasonably? Note this is probably not in a default path.

$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_11_23:44:05_PST_2021
Cuda compilation tools, release 11.4, V11.4.166
Build cuda_11.4.r11.4/compiler.30645359_0

deviceQuery

If that looks a little too tricky, here’s another approach to finding out exactly what you have. This requires that you have a working nvcc, but other than that, this procedure was quite painless, produced a lot of good information, compiled on exotic architectures (such as Nvidia’s own aarch64 Jetson products), and was in the helpful form of an illustrative code example. I approve!

git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples/1_Utilities/deviceQuery
make
./deviceQuery

Here’s the kind of output you should be hoping to see.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1050 Ti"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 4040 MBytes (4235919360 bytes)
  (006) Multiprocessors, (128) CUDA Cores/MP:    768 CUDA Cores
  GPU Max Clock rate:                            1392 MHz (1.39 GHz)
  Memory Clock rate:                             3504 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.2, NumDevs = 1
Result = PASS

Kernel Module Checks

Obviously start with lsmod and maybe lsmod | grep nv to have a look at what kernel modules are active. You might be able to learn more with something like this.

$ modinfo /usr/lib/modules/5.10.65-tegra/kernel/drivers/gpu/nvgpu/nvgpu.ko | grep version
vermagic:       5.10.65-tegra SMP preempt mod_unload modversions aarch64

jetson-stats

This one seems to require a service, but it installed ok.

sudo pip3 install jetson-stats
sudo systemctl restart jetson_stats.service
jtop

It actually required a reboot but after that, yes, it is very nice! Kind of an htop sort of thing for the GPU cores.

Compiling

Using CUDA is pretty well behaved because it is easiest (maybe required) to use nvcc, which seems to be a gcc wrapper that just includes all the right stuff properly.

nvcc -O3 -arch sm_30 -lineinfo -DDEBUG -c kernel.cu
nvcc -O3 -arch sm_30 -lineinfo -DDEBUG -o x.naive_transpose kernel.o

Concurrency

A CUDA GPU can concurrently do any of the following.

  • Compute

  • move data from host to device

  • move data from device to host

  • 4-way concurrency would also have CPU involved

  • each thread can do the basic 3-way concurrency, so many more parallel concurrencies are possible

This is serial (input, compute, output).

iiiiiiccccccoooooo

This is with concurrencies.

iiicccooo
   iiicccooo

Nvidia offers a fancy visual profiler that does the visualizations quite nicely to optimize concurrency.

Organizational Structures

  • SM - Streaming Multiprocessors

    • Scalar processors or "cores" (32 or so out of maybe 512)

    • Shared memory

    • L1 data cache

    • Shared registers

    • Special Function Units

    • Clocks

  • blocks

  • warp

    • executed in parallel (SIMD)

    • contains 32 threads

  • threads

  • registers are 32bit

  • global memory is not really system wide global.

    • coalesce loads and stores

  • shared memory

    • the /cfs of GPUs

    • 32 banks of 4 bytes

    • Needs __syncthreads()

  • banks

    • like a doorway to provide access to threads

    • two threads accessing one bank get serialized

    • best to get each thread accessing their own unique bank

  • streams

    • a queue of work

    • ordered list of commands

    • FIFO

    • multiple streams have no ordering between them

    • if not specified, goes to default stream, 0.

    • multistream programming needs >0 stream for async

  • kernel - the callback-like function that runs on the CUDA cores.

    • __global__ void mykernelfn(const int a, const int b){...}

    • kernel<<<blocks,threads,[smem[,stream]]>>>();

Examples

Example Performance of Transposing a Matrix

Using GPU 0: Tesla K80
Matrix size is 16000
Total memory required per matrix is 2048.000000 MB
Total time CPU is 1.255781 sec
Performance is 3.261715 GB/s
Total time GPU is 0.067238 sec
Performance is 60.918356 GB/s

Same Matrix Transpose Optimizing Concurrent Memory Access

Using GPU 0: Tesla K80
Matrix size is 16000
Total memory required per matrix is 2048.000000 MB
Total time CPU is 1.256058 sec
Performance is 3.260996 GB/s
Total time GPU is 0.035628 sec
Performance is 114.964519 GB/s

Machine Learning And Jetson

Nvidia is into machine learning in a big way. They have specialized products and dev kits.

Jetson Nano

  • Model P3448-0000

  • Gigabit ethernet

  • (x4) USB3.0 ports

  • 4K HDMI and DisplayPort connector (groan)

  • MIPI CSI (Mobile Industry Processor Interface Camera Serial Interface) - listed as working with Raspberry Pi Camera Module V2

  • Dedicated UART header

  • 40 pin header (GPIO, I2C, UART)

  • J48 jumper - connected means micro-usb2.0 jack operates in device mode, otherwise power supply

  • J40 jumpers - power, reset, etc

  • J15 PWM fan header

  • J18 M.2 Key E connector

Tools

sudo nvpmodel -q # Check active power mode.
tegrastats # Sort of a top for jetson. Includes power too.

GPIO

echo 38 > /sys/class/gpio/export # Map GPIO pin
echo out > /sys/class/gpio/gpio38/direction # Set direction
echo 1 > /sys/class/gpio/gpio38/value # Bit banging
echo 38 > /sys/class/gpio/unexport  # Unmap GPIO
cat /sys/kernel/debug/gpio # Diagnostic

Video

Argus (libargus) = Nvidia’s library

12 CSI lanes.

nvarguscamerasrc

nvgstcapture # Camera view application

v4l2 puts video streams on /dev/video
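
For example, a quick sanity check of a CSI camera might look something like this (the caps and the sink are guesses that depend on your camera and display setup):

v4l2-ctl --list-devices
gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! nvvidconv ! xvimagesink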

  • nvhost-msenc

  • nvhost-nvdec

  • gstinspect