
CUDA: performance aspects of typical tasks

by admin

Before implementing a computational algorithm on a graphics card, it is worth considering whether we will get the performance gain we want or just lose time. Despite the manufacturer’s promises of hundreds of GFLOPS, current-generation cards have their own problems, which it is better to know about in advance. I won’t go deep into theory; instead I will go over a few significant practical points and formulate some useful conclusions.
Let’s assume you have a rough idea of how CUDA works and have already downloaded a stable version of the CUDA Toolkit.
I will torture a now mid-range GTX 460 paired with a Core 2 Duo E8400.

Function call

Yes, if we want to compute anything, we cannot do without calling a function on the card. So let’s write a simple test function:

__global__ void stubCUDA(unsigned short *output)
{
    // the most valid function: yep, does nothing.
}

Recall that the __global__ specifier allows you to execute a function on the GPU by calling it from the CPU:

cudaThreadSynchronize();
stubCUDA<<<GRID, THREADS>>>(0);
cudaThreadSynchronize();

All function calls are asynchronous by default, so the calls to cudaThreadSynchronize() are needed to wait for the called function to complete.
Let’s try running this block in a loop: we get on the order of 15,000 calls per second for GRID=160, THREADS=96.
That is not much at all: even the simplest function, which does nothing, takes about 0.07 ms per call.
The first guess is that most of the time goes into thread synchronization, and that asynchronous calls would work much faster (although applying them is more task-specific).
Let’s check: without synchronization we managed about 73,100 calls per second. The result, it must be said, is still not impressive.
And a final test: let’s run the function with GRID=THREADS=1, which would seem to eliminate the overhead of spawning a bunch of threads inside the card. It doesn’t: we get the same 73,000-73,500 calls per second.
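For reference, a minimal sketch of such a measurement loop might look like this (GRID, THREADS and the role of synchronization are as above; the event-based timing and the iteration count are just one possible way to set up the measurement, not my exact harness):

#include <cstdio>
#include <cuda_runtime.h>

#define GRID 160
#define THREADS 96
#define ITERATIONS 10000

__global__ void stubCUDA(unsigned short *output)
{
    // does nothing
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < ITERATIONS; i++)
    {
        stubCUDA<<<GRID, THREADS>>>(0);
        cudaThreadSynchronize();   // drop this line to measure the asynchronous case
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.0f calls per second\n", ITERATIONS / (ms / 1000.0f));
    return 0;
}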
So, the moral:

  • It makes absolutely no sense to offload to the card tasks that take only milliseconds on the CPU anyway.
  • Thread synchronization after a call reduces performance only insignificantly on medium-sized tasks.
  • The number of threads and the grid size have no effect on the total number of calls per second (of course, this is not true for "useful" functions that actually do something).

External memory access

To compute anything useful we need input and output data, so we have to understand how fast data transfers to and from the graphics card are. We use the following function:

cudaMemcpy(data_cuda, image, data_cuda_size, cudaMemcpyHostToDevice);

Yes, CUDA also offers asynchronous transfer facilities, but, looking ahead, their performance here turns out to be no different from the synchronous function.
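For completeness, the asynchronous variant looks roughly like this (a sketch: it needs a page-locked host buffer and a stream, and in this test it gave no advantage):

unsigned short *pinned_image;
cudaMallocHost((void **)&pinned_image, data_cuda_size);   // page-locked host memory
// ... fill pinned_image with the data to send ...

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(data_cuda, pinned_image, data_cuda_size, cudaMemcpyHostToDevice, stream);
// the call returns immediately; kernels can be queued in the same stream here
cudaStreamSynchronize(stream);   // wait for the copy to actually finish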
Copying large blocks: in both directions, cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, we get on the order of 2 GB/s for large blocks (over 100 MB). All in all, that’s pretty good.
Things are much worse with very small structures: transferring 4 bytes at a time, we get no more than 22,000 calls per second, i.e. about 88 KB/s.
Moral:

  • It is desirable to group data into large blocks and transfer them with a single call to cudaMemcpy (a sketch follows below).
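For example, instead of thousands of tiny transfers, pack the values into one host array and send it in one go (a sketch; the buffer names are arbitrary, and params_cuda is assumed to be a device buffer of N floats allocated with cudaMalloc):

const int N = 1024;
float params[N];
for (int i = 0; i < N; i++)
    params[i] = 0.0f;   // fill with the real values on the host

cudaMemcpy(params_cuda, params, N * sizeof(float), cudaMemcpyHostToDevice);   // one transfer instead of N tiny ones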

Access to memory from inside

Now that the data is on the card, we can start working with it. Let’s estimate the approximate speed of access to video memory. To do this, we write the following function:

__global__ void accessTestCUDA(unsigned short *output, unsigned short *data, int blockcount, int blocksize)
{
    // just for a test of max access speed: does nothing useful
    unsigned short temp;
    for (int i = blockIdx.x; i < blockcount; i += gridDim.x)
    {
        int vectorBase = i * blocksize;
        int vectorEnd = vectorBase + blocksize;
        for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
        {
            temp = data[j];
        }
    }
    output[0] = temp;
}

The GRID and THREADS parameters are already put to work here; I won’t explain why just yet, but trust me, everything is as it should be. Picky readers will say that the result is not written back correctly because of the lack of synchronization, but we don’t need that here.
So, we get on the order of 42 GB/s for reading. Now that’s not bad at all.
Now let’s modify the function to copy the input data to the output. That makes no practical sense, but it lets us estimate the speed of writing to video memory (the modification is trivial; a sketch of it is shown below).
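For clarity, the modification amounts to roughly this (a sketch of the write-test kernel):

__global__ void copyTestCUDA(unsigned short *output, unsigned short *data, int blockcount, int blocksize)
{
    // same traversal as accessTestCUDA, but every value read is written back out
    for (int i = blockIdx.x; i < blockcount; i += gridDim.x)
    {
        int vectorBase = i * blocksize;
        int vectorEnd = vectorBase + blocksize;
        for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
        {
            output[j] = data[j];
        }
    }
}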
We get about 30 GB/s for combined reading and writing. That’s not bad either.
A correction should be made for the fact that we actually used more or less sequential memory access. For truly arbitrary access the numbers can degrade by up to a factor of two, but that is not a disaster either, is it?
Moral:

  • Due to the very high memory access speed on these cards, it is efficient to implement algorithms that use memory intensively.

Arithmetic operations

Let’s skip the trivial examples and do something useful: normalize an image (pixel[t] := (pixel[t] - sub) * factor). The actual code is:

__global__ void normalizeCUDA(unsigned short *data, int blockcount, int blocksize, float sub, float factor)
{
    for (int i = blockIdx.x; i < blockcount; i += gridDim.x)
    {
        int vectorBase = i * blocksize;
        int vectorEnd = vectorBase + blocksize;
        for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
        {
            register float d = (float)data[j];
            d = (d - sub) * factor;
            data[j] = (unsigned short)d;
        }
    }
}

There are three seemingly costly operations here: conversion to floating point, an ADDMUL, and conversion back to integer. On forums people warn that the integer-to-float conversion is painfully slow; perhaps that was true for older generations of cards, but it is not true now.
Total processing speed: 26 GB/s. The three operations degraded performance relative to plain copying by only about 13%.
If you look closely, the code does not normalize quite correctly: before the value is stored back as an integer, it should be rounded, for example with round(). Don’t do that; in fact, try never to use round() at all!
round(d): 20 GB/s, another 23% lost.
(unsigned short)(d + 0.5): 26 GB/s; the time is within measurement error, it hasn’t changed at all.
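In code, the two variants differ by a single line; the cheap one is valid here only because the values are known to be non-negative (the helper names are just for illustration):

__device__ unsigned short toPixelSlow(float d)
{
    return (unsigned short)round(d);      // library rounding: measurably slower
}

__device__ unsigned short toPixelFast(float d)
{
    return (unsigned short)(d + 0.5f);    // add 0.5 and truncate: essentially free
}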
Moral:

  • Arithmetic operations are really fast!
  • For the simplest image processing algorithms you can expect speeds of 10-20 GB/s.
  • It is better to avoid using round().

Logical operations

Let’s try to evaluate how fast logical operations are, and at the same time do one more useful thing: find the minimum and maximum values of an array. This step usually precedes normalization (that is what it was written for), but here the order is reversed because this one is more complicated. Here is the working code:

__global__ void getMinMaxCUDA(unsigned short *output, unsigned short *data, int blockcount, int blocksize)
{
    __shared__ unsigned short sMins[MAX_THREADS];
    __shared__ unsigned short sMaxs[MAX_THREADS];
    sMins[threadIdx.x] = data[0];
    sMaxs[threadIdx.x] = data[0];
    for (int i = blockIdx.x; i < blockcount; i += gridDim.x)
    {
        int vectorBase = i * blocksize;
        int vectorEnd = vectorBase + blocksize;
        for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
        {
            register unsigned short d = data[j];
            if (d < sMins[threadIdx.x])
                sMins[threadIdx.x] = d;
            if (d > sMaxs[threadIdx.x])
                sMaxs[threadIdx.x] = d;
        }
    }
    __syncthreads();
    if (threadIdx.x == 0)
    {
        register unsigned short min = sMins[0];
        for (int j = 1; j < blockDim.x; j++)
            if (sMins[j] < min)
                min = sMins[j];
        if (min < output[0])
            output[0] = min;
    }
    if (threadIdx.x == 1)
    {
        register unsigned short max = sMaxs[0];
        for (int j = 1; j < blockDim.x; j++)
            if (sMaxs[j] > max)
                max = sMaxs[j];
        if (max > output[1])
            output[1] = max;
    }
    __syncthreads();
}

Here we cannot do without thread synchronization and shared memory.
Final speed: 29 GB/s, even faster than normalization.
Why did I combine the minimum and maximum code? You usually need both, and calling them separately is a waste of time (see the first section).
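To tie this together with the previous section, the host-side glue might look roughly like this (a sketch; minmax_cuda, data_cuda, blockcount and blocksize are assumed to be allocated and set up elsewhere, and output[0]/output[1] must be pre-initialized because the kernel only ever tightens them):

unsigned short initMinMax[2] = { 0xFFFF, 0 };   // "min so far" and "max so far"
cudaMemcpy(minmax_cuda, initMinMax, 2 * sizeof(unsigned short), cudaMemcpyHostToDevice);

getMinMaxCUDA<<<GRID, THREADS>>>(minmax_cuda, data_cuda, blockcount, blocksize);
cudaThreadSynchronize();

unsigned short minmax[2];
cudaMemcpy(minmax, minmax_cuda, 2 * sizeof(unsigned short), cudaMemcpyDeviceToHost);

// normalize to the full 16-bit range: pixel := (pixel - min) * 65535 / (max - min)
float sub = (float)minmax[0];
float factor = 65535.0f / (float)(minmax[1] - minmax[0]);
normalizeCUDA<<<GRID, THREADS>>>(data_cuda, blockcount, blocksize, sub, factor);
cudaThreadSynchronize();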
Anyway, throw a stone at whoever told you that video cards handle conditionals poorly: I only managed to slow this fragment down by almost a factor of two artificially, and that required nesting conditions four deep: if () if () if () if () else if ()…
Moral:

  • On modern cards the logic is generally not that bad, but you should avoid deeply nested conditions.

Complex data structures

Guided by the idea that algorithms and data structures are closely related (remember N. Wirth, if nothing else), we should check how things stand with more complex data structures.
This is where a problem arises: when passing data into functions we can only use two kinds of objects, constant integral types (plain numbers) and pointers to blocks of video memory.
The idea of building, for example, trees out of pointers is ruled out immediately:

  • we cannot allocate memory from a function running on the card;
  • any allocation and copying of small amounts of data is very slow (see the section on memory transfer above).

Thus, complex data structures have to be represented as a continuous block of memory plus an array of references (offsets) into that block. This way you can easily represent a hash table, a tree, or an index structure over an array of data.
The price for such tricks is the need to use double indexing:

for (int j = vectorBase + threadIdx.x; j < vectorEnd; j += blockDim.x)
{
    temp = data[index[j] + i];
}

This fragment runs at 10 to 30 GB/s depending on how the index and the data are sized and filled. The memory layout can be optimized, but even in the best case we lose about 25% of the access speed. Triple indexing behaves even worse, losing 40-60% of performance.
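As a concrete illustration of the "continuous block plus index array" idea, the host-side layout might look roughly like this (a sketch; the names and the structure itself are purely illustrative):

// All payload elements stored back to back, plus an offset table into them.
// Inside a kernel, element k of logical entry e is then nodes[index[e] + k].
unsigned short *nodes_cuda;
int *index_cuda;
cudaMalloc((void **)&nodes_cuda, totalElements * sizeof(unsigned short));
cudaMalloc((void **)&index_cuda, entryCount * sizeof(int));

// hostNodes / hostIndex are built on the CPU in the usual way, then sent in two large copies
cudaMemcpy(nodes_cuda, hostNodes, totalElements * sizeof(unsigned short), cudaMemcpyHostToDevice);
cudaMemcpy(index_cuda, hostIndex, entryCount * sizeof(int), cudaMemcpyHostToDevice);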

We learned a lot today

If you use the video card properly, you can get unprecedented performance in tasks such as image, sound, and video processing: wherever there are large volumes of data, a need for nontrivial arithmetic, and no complex data structures.
If you like this topic, I will write about how to compute some useful things on the graphics card (Distance Map, morphology, search indexes) and show some interesting data structures that work fast enough and do not create unnecessary synchronization problems.
