Resources

Checking GPU Action

The name nvidia-smi is not obvious (it stands for NVIDIA System Management Interface), but the command is important for (1) seeing what kind of GPU your system thinks it has access to and (2) seeing how hard that GPU is working right now.

cat /proc/driver/nvidia/version
sudo apt-get install nvidia-smi
nvidia-smi
nvidia-smi -l 1 # One second refresh.

What processes are using your nvidia device? This can be interesting.

$ sudo fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root        752 F...m Xorg
/dev/nvidiactl:      root        752 F...m Xorg
/dev/nvidia-modeset: root        752 F.... Xorg
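
The "what GPU does my system think it has" question can also be asked from inside a program. This is a minimal sketch using the CUDA runtime API. The file name devquery.cu and the helper list_devices() are made up for illustration, and the fields printed are just the ones that relate to the rest of these notes.

```cuda
// devquery.cu: list the CUDA devices the runtime can see.
// Build (assumed file name): nvcc -o x.devquery devquery.cu
#include <cstdio>
#include <cuda_runtime.h>

// Print a short summary of every visible device. Returns the device count,
// or -1 if the runtime reports an error (for example, no driver loaded).
int list_devices() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        if (cudaGetDeviceProperties(&p, dev) != cudaSuccess) continue;
        printf("GPU %d: %s\n", dev, p.name);
        printf("  compute capability : %d.%d\n", p.major, p.minor);
        printf("  multiprocessors    : %d\n", p.multiProcessorCount);
        printf("  warp size          : %d\n", p.warpSize);
        printf("  global memory      : %.1f MB\n", p.totalGlobalMem / 1.0e6);
        printf("  shared mem / block : %zu bytes\n", p.sharedMemPerBlock);
        printf("  registers / block  : %d\n", p.regsPerBlock);
    }
    return count;
}

int main() {
    return list_devices() < 0 ? 1 : 0;
}
```

The compute capability printed here is the number that goes after sm_ in the nvcc -arch option.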

Compiling

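Compiling CUDA is pretty well behaved because it is easiest (and the standard route) to use nvcc, NVIDIA's compiler driver. It separates device code from host code and hands the host part to a host compiler such as gcc, including all the right stuff properly. The -arch value should match your GPU's compute capability; sm_30 is an early Kepler target that CUDA 11 and later no longer accept.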

nvcc -O3 -arch sm_30 -lineinfo -DDEBUG -c kernel.cu
nvcc -O3 -arch sm_30 -lineinfo -DDEBUG -o x.naive_transpose kernel.o

Concurrency

The GPU can do any of the following concurrently.

  • compute

  • move data from host to device

  • move data from device to host

  • 4-way concurrency also has the CPU doing work at the same time

  • each thread can do the basic 3-way overlap, so many more concurrent combinations are possible

This is serial (input, compute, output).

iiiiiiccccccoooooo

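This is with concurrency. The work is split in two, so the second half's input overlaps the first half's compute, and the second half's compute overlaps the first half's output.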

iiicccooo
   iiicccooo

NVIDIA offers a fancy visual profiler (nvvp) that draws these timelines quite nicely, which helps when optimizing concurrency.
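
The overlapped picture above is usually achieved with streams. Here is a minimal sketch, assuming two streams and a toy kernel. The names scale(), run_overlap(), NSTREAMS and the chunk size are hypothetical and chosen only for illustration. Whether the copies and kernels really overlap depends on the hardware, so it is worth confirming in the profiler.

```cuda
// overlap.cu: split one job into chunks and pipeline them over streams.
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with -1 on any CUDA error.
// (Cleanup on the error path is skipped to keep the sketch short.)
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    return -1; } } while (0)

const int NSTREAMS = 2;   // two overlapping lanes, as in the sketch above

// Hypothetical compute step: double every element.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Pushes NSTREAMS chunks of `chunk` floats through input, compute, output.
// Returns the number of wrong results, or -1 on a CUDA error.
int run_overlap(int chunk) {
    const int n = NSTREAMS * chunk;
    const size_t chunk_bytes = (size_t)chunk * sizeof(float);
    float *h = nullptr, *d = nullptr;

    // Async copies can only overlap with compute when host memory is pinned.
    CHECK(cudaMallocHost(&h, NSTREAMS * chunk_bytes));
    CHECK(cudaMalloc(&d, NSTREAMS * chunk_bytes));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[NSTREAMS];
    for (int k = 0; k < NSTREAMS; ++k) CHECK(cudaStreamCreate(&s[k]));

    // Each stream runs its own i -> c -> o sequence in FIFO order.
    // Different streams have no ordering between them and may overlap.
    for (int k = 0; k < NSTREAMS; ++k) {
        size_t off = (size_t)k * chunk;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunk_bytes,
                              cudaMemcpyHostToDevice, s[k]));
        scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d + off, chunk);
        CHECK(cudaMemcpyAsync(h + off, d + off, chunk_bytes,
                              cudaMemcpyDeviceToHost, s[k]));
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());   // wait for every stream to drain

    int bad = 0;
    for (int i = 0; i < n; ++i) if (h[i] != 2.0f) ++bad;

    for (int k = 0; k < NSTREAMS; ++k) CHECK(cudaStreamDestroy(s[k]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return bad;
}

int main() {
    int bad = run_overlap(1 << 20);
    printf("%d wrong values\n", bad);
    return bad != 0;
}
```

The pinned allocation (cudaMallocHost) matters here: with ordinary pageable host memory the async copies generally cannot overlap with the kernels.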

Organizational Structures

  • SM - Streaming Multiprocessors

    • Scalar processors or "cores" (32 or so per SM, out of maybe 512 on the whole device)

    • Shared memory

    • L1 data cache

    • Shared registers

    • Special Function Units

    • Clocks

  • blocks

  • warp

    • executed in parallel (SIMD)

    • contains 32 threads

  • threads

  • registers are 32-bit

  • global memory is global to the GPU device, not really system-wide global.

    • coalesce loads and stores

  • shared memory

    • the /cfs of GPUs

    • 32 banks of 4 bytes

    • Needs __syncthreads() between one thread writing and another thread reading

  • banks

    • like a doorway to provide access to threads

    • two threads in a warp accessing different addresses in one bank get serialized (a bank conflict)

    • best to get each thread in a warp accessing its own unique bank

  • streams

    • a queue of work

    • ordered list of commands

    • FIFO

    • multiple streams have no ordering between them

    • if not specified, work goes to the default stream, 0.

    • multistream programming needs non-default streams (not stream 0) for async work

  • kernel - the callback-like function that runs on the CUDA cores.

    • __global__ void mykernelfn(const int a, const int b){...}

    • mykernelfn<<<blocks,threads[,smem[,stream]]>>>(a, b);
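
Several items from the list above (kernel, blocks, threads, shared memory, __syncthreads(), streams) fit in one small program. This is a minimal sketch, not a tuned kernel. The names reverse_blocks() and run_reverse() are hypothetical, and the array length is assumed to be an exact multiple of the block size.

```cuda
// reverse.cu: each block reverses its own slice of an array in shared memory.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumes the array length is gridDim.x * blockDim.x (no bounds check).
__global__ void reverse_blocks(int *d) {
    extern __shared__ int tile[];          // sized by the 3rd launch parameter
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = d[base + t];                 // consecutive threads, consecutive banks
    __syncthreads();                       // every write to tile must finish first
    d[base + t] = tile[blockDim.x - 1 - t];
}

// Returns the number of misplaced elements, or -1 on a CUDA error.
int run_reverse(int blocks, int threads) {
    const int n = blocks * threads;
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    cudaStream_t stream;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(d); return -1; }

    // Everything below is queued on one stream, so it runs in FIFO order.
    cudaMemcpyAsync(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice, stream);
    // <<<blocks, threads per block, shared memory bytes, stream>>>
    reverse_blocks<<<blocks, threads, threads * sizeof(int), stream>>>(d);
    cudaMemcpyAsync(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err == cudaSuccess) err = cudaGetLastError();

    cudaStreamDestroy(stream);
    cudaFree(d);
    if (err != cudaSuccess) return -1;

    // Element t of block b should now hold what element (threads-1-t) held.
    int bad = 0;
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t)
            if (h[b * threads + t] != b * threads + (threads - 1 - t)) ++bad;
    return bad;
}

int main() {
    int bad = run_reverse(4, 128);
    printf("%d misplaced elements\n", bad);
    return bad != 0;
}
```

Without the __syncthreads() a thread could read a tile slot before the thread responsible for it has written it, since only threads within one warp run in lockstep.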

Examples

Example Performance of Transposing a Matrix

Using GPU 0: Tesla K80
Matrix size is 16000
Total memory required per matrix is 2048.000000 MB
Total time CPU is 1.255781 sec
Performance is 3.261715 GB/s
Total time GPU is 0.067238 sec
Performance is 60.918356 GB/s

Same Matrix Transpose Optimizing Concurrent Memory Access

Using GPU 0: Tesla K80
Matrix size is 16000
Total memory required per matrix is 2048.000000 MB
Total time CPU is 1.256058 sec
Performance is 3.260996 GB/s
Total time GPU is 0.035628 sec
Performance is 114.964519 GB/s
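
The kernels behind those numbers are not shown in these notes. A common way to get this kind of speedup is to stage tiles in shared memory so that both the global reads and the global writes are coalesced, with one padding column to avoid bank conflicts. The sketch below assumes that approach and assumes doubles (which matches 8 bytes per element in the output above). The names transpose_naive(), transpose_tiled(), run_transpose() and the TILE size are hypothetical.

```cuda
// transpose.cu: naive vs. shared-memory tiled transpose of an n x n matrix.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 32

// Naive: reads are coalesced, but writes stride through memory n elements apart.
__global__ void transpose_naive(const double *in, double *out, int n) {
    int x = blockIdx.x * TILE + threadIdx.x;   // column of in
    int y = blockIdx.y * TILE + threadIdx.y;   // row of in
    if (x < n && y < n) out[(size_t)x * n + y] = in[(size_t)y * n + x];
}

// Tiled: stage a tile in shared memory so reads and writes are both coalesced.
// The extra column is the usual padding trick against bank conflicts.
__global__ void transpose_tiled(const double *in, double *out, int n) {
    __shared__ double tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[(size_t)y * n + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;       // block indices swapped for output
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n) out[(size_t)y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Runs one kernel, prints its bandwidth, and returns the number of wrong
// elements (or -1 on a CUDA error).
int run_transpose(int n, bool tiled) {
    size_t bytes = (size_t)n * n * sizeof(double);
    std::vector<double> h_in((size_t)n * n), h_out((size_t)n * n);
    for (size_t i = 0; i < h_in.size(); ++i) h_in[i] = (double)i;

    double *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(TILE, TILE);
    dim3 blocks((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    if (tiled) transpose_tiled<<<blocks, threads>>>(d_in, d_out, n);
    else       transpose_naive<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaGetLastError();
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (h_out[(size_t)r * n + c] != h_in[(size_t)c * n + r]) ++bad;
    // Bandwidth counts one read plus one write of the whole matrix.
    printf("%s: %.3f ms, %.2f GB/s, %d wrong\n", tiled ? "tiled" : "naive",
           ms, 2.0 * bytes / (ms * 1.0e6), bad);
    return bad;
}

int main() {
    int a = run_transpose(2000, false);
    int b = run_transpose(2000, true);
    return (a != 0 || b != 0);
}
```

The first kernel launched in a process may include some one-time startup cost, so timings from a single run like this are only a rough guide.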