Resources

Compiling

Using CUDA is pretty well behaved because it is easiest (required maybe) to use nvcc which seems to be a gcc wrapper that just includes all the right stuff properly.

nvcc -O3 -arch sm_30 -lineinfo -DDEBUG -c kernel.cu
nvcc -O3 -arch sm_30 -lineinfo -DDEBUG -o x.naive_transpose kernel.o

Concurrency

Can concurrently do any of the following. * Compute * move data from host to device * move data from device to host * 4-way concurrency would also have CPU involved * each thread can do basic 3-way so many more parallel concurrencies

This is serial (input, compute, output).

iiiiiiccccccoooooo

This is with concurrencies.

iiicccooo
   iiicccooo

Nvidia offers a fancy visual profiler that does the visualizations quite nicely to optimize concurrency.

Organizational Structures

  • SM - Streaming Multiprocessors

    • Scalar processors or "cores" (32 or so out of maybe 512)

    • Shared memory

    • L1 data cache

    • Shared registers

    • Special Function Units

    • Clocks

  • blocks

  • warp

    • executed in parallel (SIMD)

    • contains 32 threads

  • threads

  • registers are 32bit

  • global memory is not really system wide global.

    • coalesce loads and stores

  • shared memory

    • the /cfs of GPUs

    • 32 banks of 4 bytes

    • Needs syncthreads()

  • banks

    • like a doorway to provide access to threads

    • two threads accessing one bank get serialized

    • best to get each thread accessing their own unique bank

  • streams

    • a queue of work

    • ordered list of commands

    • FIFO

    • multiple streams have no ordering between them

    • if not specified, goes to default stream, 0.

    • multistream programming needs >0 stream for async

  • kernel - is the callback like function that runs in the CUDA cores.

    • __global__ void mykernelfn(const int a, const int b){...}

    • kernel<<<blocks,threads,[smem[,stream]]>>>();

Examples

Example Performance of Transposing a Matrix
Using GPU 0: Tesla K80
Matrix size is 16000
Total memory required per matrix is 2048.000000 MB
Total time CPU is 1.255781 sec
Performance is 3.261715 GB/s
Total time GPU is 0.067238 sec
Performance is 60.918356 GB/s
Same Matrix Transpose Optimizing Concurrent Memory Access
Using GPU 0: Tesla K80
Matrix size is 16000
Total memory required per matrix is 2048.000000 MB
Total time CPU is 1.256058 sec
Performance is 3.260996 GB/s
Total time GPU is 0.035628 sec
Performance is 114.964519 GB/s