
Digging into OpenCL

posted 22 Dec 2011, 01:03 by John Stavrakakis
Recently I have begun to port my compression algorithm to OpenCL. Although OpenCL C is C99 friendly, there are some gotchas.

1) Implicit float and double conversion is not present; the cast has to be written out. E.g.

double *vecd;
float4 vecf;

vecf.x = vecd[0];        /* NOOO! */
vecf.x = (float)vecd[0]; /* OK: explicit cast */

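2) The compiler is very pedantic about the parameters passed between functions. Explicitly casting const __global float *modelview to (const float *) modelview does not fly: the cast drops the __global address space qualifier, so the qualifier has to be carried through every function signature instead.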

float4 _mesa_gluProjectf(float4 v,
                         const __global float *modelview,
                         const __global float *projection,
                         const __global int *viewport);

__kernel void projection_test(const __global float *modelview,
                              const __global float *projection,
                              const __global int *viewport,
                              const __global float *vert)
{
    uint gid = get_global_id(0);

    /* Vertices are stored as 4 floats each; w is forced to 1. */
    float4 v;
    v.x = vert[4*gid+0];
    v.y = vert[4*gid+1];
    v.z = vert[4*gid+2];
    v.w = 1.0f;

    /* The __global pointers are passed straight through, with no cast. */
    float4 wv = _mesa_gluProjectf(v, modelview, projection, viewport);
}
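The post only declares _mesa_gluProjectf, so its body is not shown. For checking kernel output on the CPU, here is a minimal sketch in plain C of the usual gluProject arithmetic (column-major matrices, perspective divide, viewport mapping). The names mad_f, mat4_mul_vec4 and project_vertex are hypothetical, and mad_f is only a stand-in for OpenCL's built-in mad(a, b, c); this is not the author's implementation.

```c
#include <assert.h>

/* Stand-in for OpenCL's built-in mad(a, b, c), i.e. a * b + c. */
float mad_f(float a, float b, float c)
{
    return a * b + c;
}

/* out = m * in, with m a column-major 4x4 matrix (OpenGL layout). */
void mat4_mul_vec4(const float m[16], const float in[4], float out[4])
{
    for (int row = 0; row < 4; ++row) {
        float acc = m[row] * in[0];           /* column 0 */
        acc = mad_f(m[4 + row],  in[1], acc); /* column 1 */
        acc = mad_f(m[8 + row],  in[2], acc); /* column 2 */
        acc = mad_f(m[12 + row], in[3], acc); /* column 3 */
        out[row] = acc;
    }
}

/* Hypothetical CPU reference: object coordinates -> window coordinates.
 * Returns 0 if the projected w is zero, 1 on success. */
int project_vertex(const float obj[4],
                   const float modelview[16],
                   const float projection[16],
                   const int viewport[4],
                   float win[3])
{
    float eye[4], clip[4];

    assert(viewport[2] > 0 && viewport[3] > 0); /* width and height */

    mat4_mul_vec4(modelview, obj, eye);
    mat4_mul_vec4(projection, eye, clip);
    if (clip[3] == 0.0f)
        return 0;

    for (int i = 0; i < 3; ++i) {
        float ndc = clip[i] / clip[3];    /* perspective divide: [-1, 1] */
        win[i] = mad_f(ndc, 0.5f, 0.5f);  /* remap to [0, 1] */
    }
    /* Scale x and y into the viewport rectangle; z stays in [0, 1]. */
    win[0] = mad_f(win[0], (float)viewport[2], (float)viewport[0]);
    win[1] = mad_f(win[1], (float)viewport[3], (float)viewport[1]);
    return 1;
}
```

With identity matrices and a 640 by 480 viewport at the origin, the vertex (0, 0, 0, 1) should land in the centre of the window at half depth, which makes for a quick sanity check before comparing against the GPU.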



3) Byte addressable memory needs to be explicitly enabled!

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

This pragma enables the application to write 1 byte data types into global memory. This is very important for manipulating strings or monochrome pictures effectively.
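Where the extension is not available, one possible workaround is to pack bytes into 32-bit words and only ever store whole words. The following is a sketch of that idea in plain C, not something taken from the post; set_byte and pack_bytes are hypothetical names, and the byte order (byte 0 in the least significant position) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace byte number `index` (0 = least significant) of a 32-bit word. */
uint32_t set_byte(uint32_t word, unsigned index, uint8_t value)
{
    assert(index < 4);
    unsigned shift = index * 8u;
    uint32_t mask = (uint32_t)0xFFu << shift;
    return (word & ~mask) | ((uint32_t)value << shift);
}

/* Pack n bytes into 32-bit words so that only whole words are stored.
 * `words` must have room for (n + 3) / 4 elements. */
void pack_bytes(const uint8_t *src, size_t n, uint32_t *words)
{
    size_t n_words = (n + 3) / 4;

    for (size_t i = 0; i < n_words; ++i)
        words[i] = 0;
    for (size_t i = 0; i < n; ++i)
        words[i / 4] = set_byte(words[i / 4], (unsigned)(i % 4), src[i]);
}
```

In a kernel, the read-modify-write in set_byte is only safe if each work-item owns a whole word; two work-items updating different bytes of the same word would race.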

Here is a quick performance test of a single projection operation on different 3D models, run on my MacBook Pro (Core 2 Duo, Nvidia 320M):

Verts      GPU Loop - Unoptimized    CPU Loop - gluProject
12         0.136173512               1.81e-05
1760       0.079442842               0.000334405
2144       0.082229331               0.00050072
8728       0.104535711               0.0015727
22998      0.097971984               0.004996595
32328      0.091570921               0.006260859
543652     0.414667467               0.109198138
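To put the figures above side by side, the ratio of GPU time to CPU time can be computed per model. The sketch below simply hard-codes the measurements from this post; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One measurement from the post: vertex count, GPU time, CPU time
 * (times in the units reported above). */
struct timing {
    long   verts;
    double gpu;
    double cpu;
};

const struct timing timings[] = {
    {     12, 0.136173512, 1.81e-05    },
    {   1760, 0.079442842, 0.000334405 },
    {   2144, 0.082229331, 0.00050072  },
    {   8728, 0.104535711, 0.0015727   },
    {  22998, 0.097971984, 0.004996595 },
    {  32328, 0.091570921, 0.006260859 },
    { 543652, 0.414667467, 0.109198138 },
};
const size_t n_timings = sizeof timings / sizeof timings[0];

/* How many times slower the GPU run was than the CPU run. */
double gpu_slowdown(const struct timing *t)
{
    assert(t->cpu > 0.0);
    return t->gpu / t->cpu;
}
```

The ratio falls from roughly 7500 at 12 vertices to just under 4 at 543652 vertices, while the GPU time itself barely moves for the models up to 32328 vertices. That pattern is what one would expect if a fixed per-launch cost dominates the small cases.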

The results are far from stellar: the GPU version is slower than the CPU loop at every model size. The likely performance bottlenecks are:
  • the added time to transfer all that data to the compute units
  • vertices are transferred as 4 floating point numbers instead of 3
  • everything lives in global memory; nothing makes use of local (shared) memory
  • the very fast mad operations (multiply-add) that are built into OpenCL are not utilised
  • the index space is not divided very well: there is only one work-group, and it remains a naive parallel implementation
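Two of these points come down to simple host-side arithmetic. As a rough sketch in plain C (helper names are hypothetical, and a work-group size of 64 is only an assumed example), the global size handed to clEnqueueNDRangeKernel can be rounded up to a multiple of the work-group size so that the index space is split into many groups, and the upload size for 3 versus 4 floats per vertex can be compared directly.

```c
#include <assert.h>
#include <stddef.h>

/* Round the global work size up to the next multiple of the work-group
 * size, as needed when an explicit local size is passed to
 * clEnqueueNDRangeKernel in OpenCL 1.x. */
size_t round_up_global_size(size_t n_items, size_t local_size)
{
    assert(local_size > 0);
    size_t groups = (n_items + local_size - 1) / local_size;
    return groups * local_size;
}

/* Bytes needed to upload n_verts vertices of floats_per_vert floats each. */
size_t vertex_buffer_bytes(size_t n_verts, size_t floats_per_vert)
{
    return n_verts * floats_per_vert * sizeof(float);
}
```

For the largest model, round_up_global_size(543652, 64) pads the index space to 8495 groups of 64, so the kernel would need a guard that returns early when get_global_id(0) is past the real vertex count. With 4-byte floats, sending 3 floats per vertex instead of 4 cuts the upload by a quarter.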

OpenCL works best when you make each kernel work hard. As the port of this algorithm goes on, the load per kernel will increase and the benefits should start to appear. In the meantime, OpenCL at least offloads this work from the CPU.

I've only just obtained another copy of the AMD APP SDK for OpenCL development, as the file they posted on their website was corrupt (219 MB != 7.4 MB). I would like to see whether this simple routine fares any better on the bigger machine running Windows.

I've found the CMSoft website to be the most useful resource. It covers a few of the things that you would otherwise only find out through painful trial and error.