
Features of memory allocation in OpenCL

by admin


Hello, dear readers.
In this post I will try to look at the specifics of memory allocation for OpenCL objects.
OpenCL is a cross-platform standard for heterogeneous computing. It is no secret that programs are written in it when they need to be executed quickly. As a rule, such code needs to be comprehensively optimized. Any GPGPU developer knows that memory operations are often the weakest link in a program’s performance. Since there are a great many OpenCL-compatible hardware platforms, the organization of memory objects is often a headache. What works well on an Nvidia Tesla, equipped with local memory and connected to global memory by a wide bus, refuses to show acceptable performance on a SoC with a completely different architecture.
This post is dedicated to the specifics of memory allocation for systems where the CPU and GPU share memory. Let’s leave Image memory types aside and focus on the most commonly used Buffer memory type. We will target version 1.1 of the standard as the most widely used one. We begin with a short theory refresher and then look at some examples.


Memory is allocated by calling the API function clCreateBuffer. Its signature is as follows:

cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret)

We are primarily interested in the flags argument, which determines how the memory is allocated. The following values are allowed:


CL_MEM_READ_WRITE / CL_MEM_WRITE_ONLY / CL_MEM_READ_ONLY
The simplest variant. The memory will be allocated on the OpenCL Device side in read-write, write-only, or read-only mode respectively.


CL_MEM_ALLOC_HOST_PTR
The memory for the object will be allocated from Host memory, i.e., from RAM. This flag is of interest for systems where the CPU and GPU share memory.


CL_MEM_USE_HOST_PTR
The object will use memory already allocated (and used by the program) on the Host side at the specified address. The standard allows the implementation to allocate an intermediate buffer on the Device side. This flag and CL_MEM_ALLOC_HOST_PTR are mutually exclusive. It is useful when you are adding OpenCL support to an existing application and want the Device to work with memory the application already owns.


CL_MEM_COPY_HOST_PTR
With this flag, creating the object performs the equivalent of a memcpy from the specified address.


Let’s find out in practice which memory allocation variant suits the traditional case of a discrete graphics card with its own memory on board, and which suits the case where the GPU uses a portion of RAM. The following computers serve as test systems:

  • System with a discrete graphics chip: Intel Core i5 4200U, 4 GB DDR3-1600, Radeon 8670M 128-bit GDDR3 1800 MHz
  • System with an integrated video chip: AMD A6700, 8 GB DDR3-1800, Radeon 7660D 128-bit

The operating system in both cases is Windows 7 SP1; the development environment is Visual Studio 2013 Express + AMD APP SDK 2.9.
As a test load we will perform read/write and mapping/unmapping of memory objects of different sizes, from 65 KB to 65 MB.
Without further ado, let’s move on to the graphs. In all cases the abscissa axis shows the amount of memory in bytes and the ordinate axis shows the operation execution time in microseconds. The video card with discrete memory is labeled "discrete GPU" in the chart titles; the video card that shares memory with the CPU is labeled "Integrated GPU".

Discrete graphics card

This graph shows a linear relationship between data transfer time and volume. The adapter uses its own memory, so the results are stable.
In this case, the memory for the object was allocated from RAM, so we have a slightly larger scatter of values.
The graphs illustrate mapping/unmapping of a buffer allocated from GPU memory. Mapping is the procedure of exposing a region of Device address space in Host address space; unmapping is the reverse process. Since these address spaces are physically separate on a GPU with its own memory, performing the mapping involves reading/writing through temporary buffers.
When using existing memory allocated on the Host side, the time variation is more significant. There can be many reasons for this: memory aligned to different boundaries, competing loads on the memory controller, and more.

Integrated graphics card

In the case of memory shared between Host and Device, the scatter of the results is larger. This is easily explained by contention for the shared memory and the increased load on the memory controller.
When the memory is allocated from RAM, the Device and Host memory objects reside in the same physical address space but in different virtual ones. The address translation time is therefore roughly constant and does not depend on the size of the object.
However, if we allocate memory for the object from the GPU’s portion of memory, the dependence of execution time on object size is similar to that observed with a discrete graphics card, allowing for the increased scatter of results.
Again, if memory allocated on the Host side is used, we get near-zero execution time regardless of buffer size.


It is worth noting the following patterns identified during the experiment:

  • Using a graphics card with discrete memory gives results with less variation, thanks to its separate, dedicated memory. Using shared memory, on the contrary, results in a larger scatter.
  • In both cases, using the CL_MEM_USE_HOST_PTR flag increases the variability of the results.

While dedicated memory on a discrete graphics card is preferable for OpenCL kernels that access it heavily, shared memory in some cases allows mapping/unmapping to be performed in near-zero time regardless of buffer size.
The mapping technique can be used in both cases: on a system with shared memory it gives the benefits described above, while on a system with a discrete graphics card it simply works in the same linear time as the classic read/write scheme.
