# Introduction to CUDA Programming

Hemant Shukla

hshukla@lbl.gov



### **Trends**

### Scientific Data Deluge

LSST 0.5 PB/month

JGI 5 TB/yr \* LOFAR 500 GB/s

SKA 100 x LOFAR

### **Energy Efficiency**

Exascale will need 1000x Performance enhancement with 10x energy consumption Flops/watt Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith





 $<sup>^</sup>st$  Jeff Broughton (NERSC) and JGI

# **Developments**

### **Industry**

Emergence of more cores on single chips

Number of cores per chip double every two years

Systems with millions of concurrent threads

Systems with inter and intra-chip parallelism

Architectural designs driven by reduction in Energy Consumption

New Parallel Programming models, languages, frameworks, ...

#### **Academia**

Graphical Processing Units (GPUs) are adopted as co-processors for high performance computing



### **Architectural Differences**



Less than 20 cores
1-2 threads per core
Latency is hidden by large cache



512 cores 10s to 100s of threads per core Latency is hidden by fast context switching

GPUs don't run without CPUs



### **Results**

### Large scale Cosmological Simulations with GAMER





### **Applications**

# X-ray computed tomography



Alain Bonissent et al.

Total volume 560 x 560 x 960 pixels 360 projections Speed up = 110x

# EoR with diesel powered radio interferometry



Lincoln Greenhill et al.

512 antennas, correlated visibilities for 130,000 baseline pairs, each with 768 channels and 4 polarizations ~ 20 Tflops. Power budget 20 kW.

INTEL Core2 Quad 2.66GHz = 1121 ms NVIDIA GPU C1060 = 103.4 ms

### N-body with SCDM



K. Nitadori et al.

4.5 giga-particles, R = 630 Mpc 2000x more volume than Kawai et al.



# **GPU**



# **GPU H/W Example**

### **NVIDIA FERMI**



16 Stream Multiprocessors (SM)
512 CUDA cores (32/SM)
IEEE 754-2008 floating point (DP and SP)
6 GB GDDR5 DRAM (Global Memory)
ECC Memory support
Two DMA interface



Reconfigurable L1 Cache and Shared Memory 48 KB / 16 KB

L2 Cache 768 KB



Load/Store address width 64 bits. Can calculate addresses of 16 threads per clock.



# **Programming Models**

**CUDA** (Compute Unified Device Architecture)

**OpenCL** 

Microsoft's DirectCompute

Third party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, MATLAB and IDL, and Mathematica

Compilers from PGI, RCC, HMPP, Copperhead



### **CUDA**

Parallel Computing Architecture

Application

C/C++ DX OpenCL FORTRAN Java Python

NVIDIA CUDA Compatible GPU

CUDA Device Driver
CUDA Toolkit (compiler, debugger, profiler, lib)
CUDA SDK (examples)
Windows, Mac OS, Linux



Libraries – FFT, Sparse Matrix, BLAS, RNG, CUSP, Thrust...



### **Dataflow**





### S/W Abstraction

Grids



**Threads** 



Blocks



512-1024 threads / block

Kernel is executed by threads processed by CUDA Core

Maximum 8 blocks per SM 32 parallel threads are executed at the same time in a *WARP* 

One grid per kernel with multiple concurrent kernels









# **Memory Hierarchy**

#### **Private memory**

Visible only to the thread

#### **Shared memory**

Visible to all the threads in a block

#### **Global memory**

Visible to all the threads
Visible to host
Accessible to multiple kernels
Data is stored in row major order

### **Constant memory (Read Only)**

Visible to all the threads in a block





# **CUDA API Examples**



### Which GPU do I have?

```
#include <stdio.h>
int main()
{
     int noOfDevices;
     /* get no. of device */
     cudaGetDeviceCount (&noOfDevice);
     cudaDeviceProp prop;
     for (int i = 0; i < noOfDevices; i++)
          /*get device properties */
          cudaGetDeviceProperties (&prop, i );
          printf ("Device Name:\t %s\n", prop.name);
          printf ("Total global memory:\t %ld\n",
                    prop.totalGlobalMem);
          printf ("No. of SMs:\t %d\n",
                    prop.multiProcessorCount);
          printf ("Shared memory / SM:\t %ld\n",
                    prop.sharedMemPerBlock);
          printf("Registers / SM:\t %d\n",
                    prop.reasPerBlock);
     return 1;
```

#### Use

# cudaGetDeviceCount cudaGetDeviceProperties

#### Compilation

> nvcc whatDevice.cu -o whatDevice

#### Output

Device Name: Tesla C2050
Total global memory: 2817720320
No. of SMs: 14
Shared memory / SM: 49152
Registers / SM: 32768

For more properties see struct cudaDeviceProp

For details see CUDA Reference Manual



# Timing with CUDA Event API

```
int main ()
                                              CUDA Event API Timer are,
    cudaEvent_t start, stop;
    float time:
                                             - OS independent
    cudaEventCreate (&start);
                                             - High resolution
    cudaEventCreate (&stop);
                                             - Useful for timing asynchronous calls
    cudaEventRecord (start, 0);
    someKernel <<<qrids, blocks, 0, 0>>> (...);
    cudaEventRecord (stop, 0);
    cudaEventSynchronize (stop); — Ensures kernel execution has completed
    cudaEventElapsedTime (&time, start, stop);
    cudaEventDestroy (start);
    cudaEventDestroy (stop);
    printf ("Elapsed time %f sec\n", time*.001);
    return 1;
                                       Standard CPU timers will not measure the
                                       timing information of the device.
```



# **Memory Allocations / Copies**

```
int main ()
 float host_signal[N]; host_result[N];
                                         Host and device have separate physical memory
 float *device_signal, *device_result;
 //allocate memory on the device (GPU)
 cudaMalloc ((void**) &device_signal, N * sizeof(float));
 cudaMalloc ((void**) &device_result, N * sizeof(float));
  ... Get data for the host_signal array
 // copy host_signal array to the device
 cudaMemcpy (device_signal, host_signal, N * sizeof(float),
               cudaMemcpyHostToDevice);
 someKernel <<<< >>> (...);
 //copy the result back from device to the host
 cudaMemcpy (host_result, device_result, N * sizeof(float),
               cudaMemcpvDeviceToHost);
                                                                Cannot dereference host
 //display the results
                                                                pointers on device and
 cudaFree (device_signal); cudaFree (device_result) ;
                                                                vice versa
```

# **Basic Memory Methods**

cudaError\_t cudaMalloc (void \*\* devPtr, size\_t size)

Allocates size bytes of linear memory on the device and returns in \*devPtr a pointer to the allocated memory. In case of failure cudaMalloc() returns cudaErrorMemoryAllocation.

### Blocking call

Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst. The argument kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy.

### Non-Blocking call

cudaMemcpyAsync() is asynchronous with respect to the host. The call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input.

See also, cudaMemset, cudaFree, ...



### **Kernel**

### The CUDA kernel is,

Run on device

```
Defined by __global__ qualifier and does not return anything __global__ void someKernel ();
```

Executed asynchronously by the host with <<< >>> qualifier, for example,

```
someKernel <<<nGrid, nBlocks, sharedMemory, streams>>> (...)
someKernel <<<nGrid, nBlocks>>> (...)
```

The kernel launches a 1- or 2-D **grid** of 1-, 2- or 3-D **blocks** of **threads**Each thread executes the same kernel in parallel (SIMT)
Threads within blocks can communicate via shared memory
Threads within blocks can be synchronized

Grids and blocks are of type struct dim3

Built-in variables gridDim, blockDim, threadIdx, blockIdx are used to traverse across the device memory space with multi-dimensional indexing



# **Grids, Blocks and Threads**

#### Grid





```
someKernel <<< 1, 1 >>> ();
gridDim.x = 1
blockDim.x = 1
blockIdx.x = 0
threadIdx.x = 0

dim3 blocks (2,1,1);
someKernel <<< (blocks, 4) >>> ();
gridDim.x = 2;
blockDim.x = 4;
blockIdx.x = 0,1;
threadIdx.x = 0,1,2,3,0,1,2,3
```

<<< number of blocks in a grid, number of threads per block >>>

Useful for multidimensional indexing and creating unique thread IDs int index = threadIdx.x + blockDim.x \* blockIdx.x;



# **Example - Inner Product**

### Matrix-multiplication

Each element of product matrix **C** is generated by row column multiplication and reduction of matrices **A** and **B**. This operation is similar to inner product of the vector multiplication kind also known as vector dot product.



For size N × N matrices the matrix-multiplication  $\mathbf{C} = \mathbf{A} \cdot \mathbf{B}$  will be equivalent to N<sup>2</sup> independent (hence parallel) inner products.



### Serial representation

$$c = \sum_{i} a_{i} b_{i}$$

### Simple parallelization strategy



Multiplications are done in parallel



Summation is sequential

#### **CUDA Kernel**

alleu i

Called in the host code



}

```
__global__ void innerProduct (int *a, int *b, int *c) {
  int product[SIZE];
                                                            Qualifier __global__ encapsulates
                                                            device specific code that runs on the
  int i = threadIdx.x;
                                                            device and is called by the host
  if (i < SIZE)
                                                            Other qualifiers are,
     product[i] = a[i] * b[i];
                                                            __device__, __host__,
                                                            host__and__device
                        threadIdx is a built in iterator for
                       threads. It has 3 dimensions x, y and
                        Z.
                                                Each thread with a unique threadIdx.x
}
                                                runs the kernel code in parallel.
```

```
__global__ void innerProduct (int *a, int *b, int *c) {
  int product[SIZE];
  int i = threadIdx.x;
  if (i < SIZE)
     product[i] = a[i] * b[i];
                                          Now we can sum the all the products to get
                                          the scalar c
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += product[k];
                                              Unfortunately this won't work for following reasons,
        *c = sum;
                                              - product[i] is local to each thread
}
                                              - Threads are not visible to each other
```

```
__global__ void innerProduct (int *a, int *b, int *c)
{
                                           First we make the product[i] visible to all the
  __shared__ int product[SIZE];
                                           threads by copying it to shared memory
  int i = threadIdx.x;
  if (i < SIZE)
                                           Next we make sure that all the threads are
     product[i] = a[i] * b[i];
                                           synchronized. In other words each thread has
                                           finished its workload before we move ahead. We do
  __syncthreads();
                                           this by calling __syncthreads()
    if (threadIdx.x == 0)
                                           Finally we assign summation to one thread
                                           (extremely inefficient reduction)
        int sum = 0;
        for (int k = 0; k < SIZE; k++)
            sum += product[k];
        *c = sum;
     }
}
                                              Aside: cudaThreadSynchronize() is used
                                              on the host side to synchronize host and device
```

```
__global___ void innerProduct (int *a, int *b, int *c)
{
    __shared__ int product[SIZE];
    int i = threadIdx.x;
    if (i < SIZE)
        product[i] = a[i] * b[i];
    __syncthreads();
    // Efficient reduction call
    *c = someEfficientLibrary_reduce (product);
}</pre>
```



# **Performance Considerations**



# **Memory Bandwidth**

Memory bandwidth – rate at which the data is transferred – is a valuable metric to gauge the performance of an application

#### **Theoretical Bandwidth**

Memory bandwidth (GB/s) = Memory clock rate (Hz) × interface width (bytes) / 109

### **Real Bandwidth (Effective Bandwidth)**

Bandwidth (GB/s) = [(bytes read + bytes written) / 10<sup>9</sup>] / execution time

If real bandwidth is much lower than the theoretical then code may need review Optimize on Real Bandwidth

May also use profilers to estimate bandwidth and bottlenecks



# **Arithmetic Intensity**

Memory access bandwidth of GPUs is limited compared to the peak compute throughput

High arithmetic intensity (arithmetic operations per memory access) algorithms perform well on such architectures

### **Example**

Fermi peak throughput for SP is 1 TFLOP/s and DP is 0.5 TFLOP/s Global memory (off-chip) bandwidth is 144 GB/s

For every 4 byte single precision floating point operand bandwidth is 36 GB/s and 18 GB/s for double precision

To obtain peak throughout will require 1000/36 ~ 28 SP (14 DP) arithmetic operations



# **Example revisited**

```
__global__ void innerProduct (int *a, int *b, int *c)
  __shared__ int product[SIZE];
  int i = threadIdx.x;
                                   Contrast this with inner product example where for
                                   every 2 memory (data a_i and b_i) accesses only two
 if (i < SIZE)
                                   operations (multiplication and add) are performed.
    product[i] = a[i] * b[i];
                                   That is ratio of 1 as opposed to 28 that is required for
  __syncthreads();
                                   peak throughput.
    if (threadIdx.x == 0)
                                                     Room for algorithm improvement!
       int sum = 0;
       for (int k = 0; k < SIZE; k++)
           sum += product[k];
        *c = sum;
}
```

Aside: Not all performance will be peak performance



### **Benchmarks**

### Relative Performance of Algorithms





**Arithmetic Intensity** 

Courtesy - Sam Williams



# **Optimization Strategies**

Coalesced memory data accesses (use faster memories like shared memory)

Minimize data transfer over PCIe (~ 5 GB/s)

Overlap data transfers and computations with asynchronous calls

Use fast page-locked memory (pinned memory – host memory guaranteed to device)

**Judiciously** 

Threads in a block should be multiples of 32 (warp size). Experiment with your device Smaller thread-blocks better than large many threads blocks when resource limited

Fast libraries (cuBLAS, Thrust, CUSP, cuFFT,...)

Built-in arithmetic instructions



### **CUDA Streams**

Stream is defined as sequence of device operations executed in order

Stream 1 Do memCopy Start timer Launch kernel Stop timer cudaStream\_t stream0, stream1; cudaStreamCreate (&stream0); cudaMemCopyAsync (..., stream0); someKernel<<<..., stream0>>>(); cudaMemCopyAsync (..., stream1); someKernel<<<..., stream1>>>(); cudaStreamSynchronize (stream0); Time Task (stream ID) Down (2) Down (3) Down (N) **Example** Down (1) N streams performing Ker (N-1) Ker (N) Ker (1) Ker (2) 3 tasks Up (N-2) Up (N-1) **Up (N)** Up (1)



# References

#### **CUDA**

http://developer.nvidia.com/category/zone/cuda-zone

### **OpenCL**

http://www.khronos.org/opencl/

#### **GPGPU**

http://www.gpucomputing.net/

### **Advanced topics from Jan 2011 ICCS Summer School**

http://iccs.lbl.gov/workshops/tutorials.html



### Conclusion

If you have parallel code you may benefit from GPUs

In some cases algorithms written on sequential machines may not migrate efficiently and require reexamination and rewrite

If you have short-term goal(s) it may be worthwhile looking into CUDA etc

CUDA provides better performance over OpenCL (Depends)

Most efficient codes optimally use the entire system and not just parts

Heterogeneous computing and parallel programming are here to stay

Number one 2-PetaFlop/s HPC machine in the world (Tianhe-1 in China) is a heterogeneous cluster with 7k+ NVIDIA GPUs and 14k Intel CPUs

