While we wait – Approaching Zero Driver Overhead (11 Jan 2015)

As I (ed.: Jesper Børlum, previous employee) was looking through the presentations from SIGGRAPH Asia 2014, one presentation in particular caught my eye: Tristan Lorach's presentation on Nvidia's upcoming Command-List OpenGL extension. With all of last year's focus on reducing CPU-side driver overhead in the current graphics APIs, and with new rendering APIs on the way (AMD's Mantle, Microsoft's DirectX 12, Apple's Metal), I decided to write an overview of the current recommendations for scene rendering using core OpenGL and take a poke at Nvidia's new extension. This first article looks at the core OpenGL recommendations; the next article will cover Nvidia's new extension. I am writing this article because I wanted to get a better grasp of the implementation details in the excellent GTC / SIGGRAPH performance presentations found here:
http://on-demand.gputechconf.com/gtc/2013/presentations/S3032-Advanced-Scenegraph-Rendering-Pipeline.pdf
http://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering-techniques.pdf
http://www.slideshare.net/tlorach/opengl-nvidia-commandlistapproaching-zerodriveroverhead

For performance results and shader code please refer to the Nvidia presentations.

Disclaimer – This post is a simplification of a complex topic. If you feel I have left out important details, please add them to the comments at the end or write me.


Modern GPUs are absolute beasts. It never ceases to amaze me how much raw processing power they deliver – even standard gaming hardware. However, scene requirements are getting increasingly complex: more geometry, more distinct material types, and new and complex render effects. The GPU driver often ends up being a serious performance bottleneck when handling this complexity, which means that no matter how much GPU power you throw at the rendering, the overall performance is not going to increase.
A lot of stuff eats up CPU performance: scenegraph traversal, animation, render-list generation, sorting by state, all the driver interactions, etc.
Current driver performance culprits are:

  • Frequent GPU state changes (shader, parameters, textures, framebuffer etc.).
  • Draw commands.
  • Geometry stream changes.
  • Data transfers (uploads / read-backs).

All of these boil down to the driver eating up your precious CPU clock cycles.
Using the techniques below, most of this CPU driver overhead can be reduced to almost zero. In the following sections, I will be looking at several methods for reducing the overhead. Most achieve this simply by calling the driver less. Seems simple enough, but handling material changes, texture changes, buffer changes and state changes between the draw calls can get tricky. Also, note that most of these methods require a newer version of OpenGL. Some of the functions only just made it into the core specification (OpenGL 4.4 / 4.5).

A scene, in the context of this post, is a collection of objects, each consisting of sub-objects. A sub-object is a material plus a draw command. Objects are logical collections of sub-objects, each object with its own world transform matrix. A material consists of a shader program, the parameters for that program, and a collection of OpenGL render state.
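For concreteness, a minimal sketch of how this data model could look in C – the type and field names are mine, and Parameters, RenderState and Geometry stand in for whatever your engine uses:

typedef struct
{
    GLuint program;         // shader program
    Parameters params;      // uniform values for the program
    RenderState state;      // blend / depth / cull settings etc.
} Material;

typedef struct
{
    Material material;
    Geometry geometry;      // vertex / index buffers plus draw range
} SubObject;

typedef struct
{
    mat4 transform;         // world transform shared by all sub-objects
    SubObject *subObjects;
    size_t numSubObjects;
} Object;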

I have provided two naïve approaches below – one to scene rendering and one to uploading shader parameters – the two areas we will be focusing on.


Naïve scene rendering
This will act as the baseline for performance, and is what each improvement will try to improve on.

foreach(object in scene)
{
    foreach(subobject in object)
    {
        // Attaches the vertex and index buffers to the pipeline.
        SetupGeometry(subobject.geometry);

        // Updates active shaders if changed.
        // Uploads the material parameters.
        // Uploads the world transform parameter.
        SetMaterialIfChanged(subobject.material, object.transform);

        // Dispatch the draw call.
        Draw();
    }
}

This method imposes a large number of driver interactions:

  • Geometry streams are changed per sub-object.
  • Shaders are changed per sub-object, if different from current.
  • Shader parameters are uploaded per draw.
  • A draw call per sub-object.

Naïve parameter update
Uploading parameters – also known as uniform parameters – to shaders can impose a significant number of driver calls, especially if uploaded "the old-fashioned way", where each parameter upload is a separate call to glUniform. Like the rendering loop above, this will act as a baseline that the improvements below try to improve on.

foreach(object in scene)
{
    ...
    foreach(batch in object.materialBatches)
    {
        if (batch.material != currentMaterial)
        {
            // Apply the active program to the pipeline.
            glUseProgram(batch.material.program);

            // Uniforms are program object state, which needs to be updated for each program!
            glUniform(transformLoc, object.transform);
            glUniform(diffuseColorLoc, batch.material.diffuseColor);
            glUniform(...);
            ...
        }

        // Dispatch draw.
    }
}

This technique has several weaknesses. It issues many separate driver calls, which the driver cannot predict. To make it even worse, we need to re-upload all the parameters each time we change the shader program – uniform values are stored in the shader program object, not in the general OpenGL state. In the past, I have solved this by maintaining a CPU-side parameter state cache (a proxy) per shader program; the proxy is then responsible for re-uploading a uniform when it becomes dirty. This is a workable solution if you cannot use buffer objects, which trivialize the sharing of parameter data across shader programs, as seen later in this post.


Improvement 1 – Single buffer per object
The obvious improvement to the naïve scene rendering is to move the buffers from the sub-objects into a collection of collapsed buffers in the containing object. This allows us to move the buffer bind call from the inner loop to the outer loop, which dramatically lowers the number of geometry driver calls in a scene where each object contains many sub-objects. Each sub-object now needs to know the correct stream offset into the collapsed buffers to be able to draw correctly. When loading geometry you will need to collapse all sub-object buffers and offset the vertex indices to reflect the new positions in the collapsed buffer (a sketch of this load-time merge follows the render loop below).

foreach(object in scene)
{
    // Attaches the vertex and index buffers to the pipeline.
    SetupGeometry(object.geometry);

    foreach(subobject in object)
    {
        // Updates active shaders if changed.
        // Uploads the material parameters.
        // Uploads the world transform parameter.
        SetMaterialIfChanged(subobject.material, object.transform);

        // Dispatch the draw call.
        Draw();
    }
}
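
The load-time merge could look something like this sketch, assuming one shared vertex format per object; the names (mergedVertices, firstIndex etc.) are illustrative:

// Append each sub-object's data to one collapsed vertex/index buffer
// pair, offsetting its indices by the number of vertices already written.
size_t vertexCount = 0, indexCount = 0;
for (size_t i = 0; i < object->numSubObjects; ++i)
{
    SubObject *sub = &object->subObjects[i];
    sub->firstIndex = indexCount;   // where this sub-object's range starts

    for (size_t j = 0; j < sub->numIndices; ++j)
        mergedIndices[indexCount++] = sub->indices[j] + (GLuint)vertexCount;

    memcpy(&mergedVertices[vertexCount], sub->vertices,
           sub->numVertices * sizeof(Vertex));
    vertexCount += sub->numVertices;
}
// Upload mergedVertices / mergedIndices into a single VBO / IBO pair.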


Improvement 2 – Sort sub-objects by material
Sorting by complete materials (same shaders, render state and material parameters – for now) achieves two things: we can draw several sub-objects at a time, and we avoid costly shader changes.
The main difference from the previous render loop is that instead of looping over each sub-object, we now loop over material batches. A material batch contains the material information, along with information about which parts of the geometry are to be rendered using that material setup.
During geometry load, you will need to sort by materials so that each batch contains enough information to render all sub-objects it contains.
You can opt to rearrange the vertex buffer data so that the draw command ranges can be “grown” to draw several sub-objects in a single command.
When drawing, you can choose between two approaches:

  • Looping over each of the sub-object buffer ranges in the batch, drawing each with glDrawElements.
  • Submitting all draw calls in one call using the slightly improved glMultiDrawElements.

The second multi-draw approach will execute the loop for you inside the driver – hence only a slight improvement. A sketch of the multi-draw variant follows the render loop below.

foreach(object in scene)
{
    // Attaches the vertex and index buffers to the pipeline.
    SetupGeometry(object.geometry);

    foreach(batch in object.materialBatches)
    {
        // Updates active shaders if changed.
        // Uploads the material parameters.
        // Uploads the world transform parameter.
        SetMaterialIfChanged(batch.material, object.transform);

        // Dispatch the draw call.
        foreach(range in batch.ranges)
            glDrawElements(GL_TRIANGLES, range.count, GL_UNSIGNED_INT, range.offset);
    }
}
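
For reference, the multi-draw variant of the inner loop could look like this sketch, assuming the counts and byte offsets are gathered per batch at load time:

// Replace the per-range loop with a single driver call. The driver
// still loops internally, but the call overhead is paid only once.
glMultiDrawElements(GL_TRIANGLES,
                    batch.counts,                       // GLsizei array, one count per range
                    GL_UNSIGNED_INT,
                    (const void* const*)batch.offsets,  // byte offsets into the bound IBO
                    batch.numRanges);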


Improvement 3 – Buffers for uniforms
Instead of uploading each uniform separately as shown in the naïve parameter update, OpenGL allows you to store uniforms in buffer objects – so-called Uniform Buffer Objects (UBOs). Instead of a glUniform call per parameter, you can upload a chunk of uniforms using a buffer upload like glBufferData or glBufferSubData. It is important to group uniforms according to their frequency of change when uploading data into buffers. A practical grouping of uniforms could look something like the following:

  • Scene globals – camera etc.
  • Active lights.
  • Material parameters.
  • Object specifics – transform etc.

Grouping parameters allows you to leave infrequently changed data on the GPU, while only the dynamic data is re-uploaded. A key UBO feature is that, unlike glUniform, they allow parameter sharing across shader programs. I am not going to write a full usage guide on UBOs – one can be found here.
There are different ways to use Uniform Buffer Objects. The recommended way depends on whether the data you are using is fairly static or dynamic. Below are examples of both. Note – you can mix the methods as best fits your use case.
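
First, for reference, the shader side: a uniform block could be declared like this minimal GLSL sketch. The block and member names are mine; std140 layout keeps the CPU-side memory layout predictable:

layout(std140) uniform MaterialBlock
{
    vec4 diffuseColor;
    vec4 specularColor;
    float shininess;
};

The block is associated with a binding slot via glUniformBlockBinding (or a layout(binding = N) qualifier in GL 4.2+), and a buffer is attached to that slot with glBindBufferBase or glBindBufferRange.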

Static buffer data:
If the data changes infrequently, upload the parameters for all the sub-objects in one go into a large UBO, then target the correct parameters using glBindBufferRange calls as shown below. Note that the offsets passed to glBindBufferRange must be multiples of the GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT limit, so pad your per-object data accordingly:

#define UBO_GLOBAL_SLOT 0
#define UBO_TRANS_SLOT 1
#define UBO_MAT_SLOT 2

// Update combined uniform buffers for all objects.
UpdateUniformBuffers();

// Bind global uniform buffers.
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_GLOBAL_SLOT, uboGlobal);

foreach(object in scene)
{
    ...
    // Bind object uniform buffer.
    glBindBufferRange(GL_UNIFORM_BUFFER, UBO_TRANS_SLOT, uboTransforms, object.transformOffset, matrixSize);

    foreach(batch in object.materialBatches)
    {
        // Bind material uniform buffer.
        glBindBufferRange(GL_UNIFORM_BUFFER, UBO_MAT_SLOT, uboMaterials, batch.materialOffset, mtlSize);

        if (batch.material.program != currentProgram)
        {
            // Apply the active program to the pipeline.
            glUseProgram(batch.material.program);
        }

        // Draw.
    }
}

Dynamic buffer data:
If the data changes frequently, upload the parameters into a small UBO for each material batch. The example below takes advantage of the new direct state access (DSA) functions introduced in OpenGL 4.5, and shows how such a render loop could look.

#define UBO_GLOBAL_SLOT 0
#define UBO_TRANS_SLOT 1
#define UBO_MAT_SLOT 2

// Bind buffers to their respective slots.
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_GLOBAL_SLOT, uboGlobal);
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_TRANS_SLOT, uboTransforms);
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_MAT_SLOT, uboMaterials);

foreach(object in scene)
{
    ...
    // Upload object transform.
    glNamedBufferSubData(uboTransforms, 0, matrixSize, object.transform);

    foreach(batch in object.materialBatches)
    {
        // Upload batch material.
        glNamedBufferSubData(uboMaterials, 0, mtlSize, batch.material);

        if (batch.material.program != currentProgram)
        {
            // Apply the active program to the pipeline.
            glUseProgram(batch.material.program);
        }

        // Draw.
    }
}


Note – Uploading scattered data changes to a static buffer using compute + SSBO
Nvidia mentioned a cute way to scatter data into a buffer. Normally you need to upload using a series of smaller glBufferSubData calls if the changes are non-contiguous in memory; alternatively, you could re-upload the entire buffer from scratch. Both could potentially degrade performance significantly. They suggest placing all the changes in an SSBO and performing the scatter-write using a compute shader. A shader storage buffer object (SSBO) is just a user-defined OpenGL buffer object that can be read/written from compute shaders. I have yet to try this technique out, so I cannot comment on whether the performance makes it feasible. I really like the idea though.
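
A minimal sketch of how such a scatter shader could look, assuming each update is a destination index plus a vec4 payload – the update layout is my assumption, not Nvidia's:

#version 430
layout(local_size_x = 64) in;

struct Update
{
    uint dstIndex;   // element index in the target buffer
    vec4 value;      // payload (padded to a 16-byte boundary under std430)
};

layout(std430, binding = 0) readonly buffer Updates { Update updates[]; };
layout(std430, binding = 1) buffer Target { vec4 target[]; };

uniform uint numUpdates;

void main()
{
    uint i = gl_GlobalInvocationID.x;
    if (i >= numUpdates)
        return;
    target[updates[i].dstIndex] = updates[i].value;  // the scatter-write
}

Dispatch with glDispatchCompute, and issue a glMemoryBarrier (e.g. GL_UNIFORM_BARRIER_BIT if the target buffer is later read as a UBO) before consuming the result.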


Improvement 4 – Shader-based material / transform lookup
Improvement 3 introduced the notion of using UBOs to improve uniform upload performance. Unfortunately, there are still many glBindBufferRange operations. It is possible to remove those binds by binding the entire buffer once and then having the shader index into it. The index is communicated through a generic vertex attribute, as shown below. Note that uniform blocks have a size limit (GL_MAX_UNIFORM_BLOCK_SIZE, commonly 64 KB), so very large scenes may need to split the arrays across several blocks.

#define UBO_GLOBAL_SLOT 0
#define UBO_TRANS_SLOT 1
#define UBO_MAT_SLOT 2

// Update combined uniform buffers for all objects.
UpdateUniformBuffers();

// Bind buffers to their respective slots.
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_GLOBAL_SLOT, uboGlobal);
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_TRANS_SLOT, uboTransforms);
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_MAT_SLOT, uboMaterials);

foreach(object in scene)
{
    ...

    foreach(batch in object.materialBatches)
    {
        if (batch.material.program != currentProgram)
        {
            // Apply the active program to the pipeline.
            glUseProgram(batch.material.program);
        }

        // Set buffer indices - shader program specific location!
        glVertexAttribI2i(indexAttribLoc, object.transformLoc, batch.materialLoc);

        // Draw.
    }
}

Inside the shader, you read a generic vertex attribute just like any other vertex attribute – a sketch follows below.
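
A sketch of how the shader-side lookup could look; the array sizes, binding slots and names are illustrative assumptions:

#version 440
layout(location = 0) in vec3 inPosition;
layout(location = 5) in ivec2 inIndices;   // x: transform index, y: material index

struct Material
{
    vec4 diffuseColor;
    vec4 specularColor;
};

layout(std140, binding = 0) uniform GlobalBlock    { mat4 viewProj; };
layout(std140, binding = 1) uniform TransformBlock { mat4 transforms[256]; };
layout(std140, binding = 2) uniform MaterialBlock  { Material materials[256]; };

flat out int materialIndex;                // passed on to the fragment stage

void main()
{
    materialIndex = inIndices.y;
    gl_Position = viewProj * transforms[inIndices.x] * vec4(inPosition, 1.0);
}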


Improvement 5 – Bindless resources
Changing texture state has until recently been a major headache when it comes to batching efficiently. Sure, it is possible to store several textures inside an array texture and then index into the different layers, but there are several limitations and it is generally a pain to work with. OpenGL requires the application to bind textures to texture slots prior to dispatching the draw calls. Textures are merely CPU-side handles, like all other OpenGL objects, but the new extension ARB_bindless_texture allows the application to retrieve a unique 64-bit GPU handle that the shader can use to look up texture data without binding first. Unlike the CPU-side handles, these GPU handles can be stored in uniform buffers. GPU handles can be set like any other uniform using glUniformHandleui64ARB, but it is strongly recommended to use UBOs (or similar – see Improvement 3). It is the application's responsibility to make sure textures are resident before dispatching the draw call; a sketch follows below. More information can be found in the extension spec here.
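
A minimal sketch of the host-side flow, assuming ARB_bindless_texture is available and tex is an ordinary, fully set-up texture:

// Retrieve a 64-bit GPU handle for the texture and make it resident.
GLuint64 handle = glGetTextureHandleARB(tex);
glMakeTextureHandleResidentARB(handle);   // required before any draw that samples it

// The handle can now be written into a UBO/SSBO like any other value;
// in GLSL (with ARB_bindless_texture enabled) a sampler2D member of a
// uniform block is constructed directly from the stored handle, and no
// glBindTexture calls are needed at draw time.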
Nvidia also has an extension that allows bindless buffers – more information can be found here. This is something we will have a look at in the next article on the new Nvidia command-list extension.


Improvement 6 – The indirect draw commands
A new addition to the numerous ways to draw in OpenGL is the indirect draw commands. Rather than submitting each draw call from the CPU, it is now possible to store all the draw information inside a buffer, which the GPU then loops through when drawing. The buffer contains an array of predefined structures, which in the case of glMultiDrawElementsIndirect looks like this:

typedef struct
{
    uint count;
    uint instanceCount;
    uint firstIndex;
    uint baseVertex;
    uint baseInstance;
} DrawElementsIndirectCommand;

Using an indirect draw command works much like the glMultiDrawElements call described in Improvement 2. An added benefit is that you can create your GPU work list directly on the GPU – you can, for example, use this to cull your scene from a compute shader rather than on the CPU. Filling the command buffer could look like the sketch below.
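
A sketch of filling the command array on the CPU, assuming the collapsed buffers from Improvement 1 (so baseVertex can stay zero); the names are illustrative, and the baseInstance packing is explained further down:

// Build one indirect command per sub-object range in a material batch.
// The array is then uploaded into the GL_DRAW_INDIRECT_BUFFER.
for (size_t i = 0; i < batch->numRanges; ++i)
{
    DrawElementsIndirectCommand *cmd = &cmds[i];
    cmd->count         = batch->ranges[i].count;
    cmd->instanceCount = 1;                           // draw each range once
    cmd->firstIndex    = batch->ranges[i].firstIndex;
    cmd->baseVertex    = 0;                           // indices already offset at load time
    cmd->baseInstance  = (batch->ranges[i].transformIndex << 16)
                       |  batch->ranges[i].materialIndex;
}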

There is a special bind target for indirect buffers called GL_DRAW_INDIRECT_BUFFER; the driver reads the draw data from the bound buffer. It is illegal to submit an indirect draw call using client memory.
Using indirect draws, you no longer need a separate draw command for each sub-object in a material batch as described in Improvement 2. To draw efficiently, you only have to create a buffer filled with the structs that describe the ranges of the objects you wish to draw using the active shader. This can be a huge reduction in draw call overhead. I have yet to test whether you get improved performance by growing the draw ranges through physically rearranging the vertex buffers.
Which material parameters and matrix to use when drawing each of the sub-objects can be handled much like in Improvement 4, through a matrix / material array index. However, the method is a bit different, as we are no longer able to set a generic vertex attribute between each drawn sub-object. The indirect struct contains a lot of information, not all of which we need – the baseInstance member, for example. By using it, we can communicate both the material and the matrix index, so the shader program can get the data it needs (see the shader sketch after the render loop below). How you choose to split the bits comes down to how much you need to draw.

#define UBO_GLOBAL_SLOT 0
#define UBO_TRANS_SLOT 1
#define UBO_MAT_SLOT 2

// Update combined uniform buffers for all objects.
UpdateUniformBuffers();

// Bind buffers to their respective slots.
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_GLOBAL_SLOT, uboGlobal);
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_TRANS_SLOT, uboTransforms);
glBindBufferBase(GL_UNIFORM_BUFFER, UBO_MAT_SLOT, uboMaterials);

// Bind indirect buffer for entire scene.
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, scene.indirectBuffer);

foreach(object in scene)
{
    ...
    
    foreach(batch in object.materialBatches)
    {
        if (batch.material.program != currentProgram)
        {
            // Apply the active program to the pipeline.
            glUseProgram(batch.material.program);
        }

        // Draw batch.
        glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, batch.indirectOffset, batch.numIndirects, 0);
    }
}
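
On the shader side, unpacking the indices could look like this sketch, assuming the ARB_shader_draw_parameters extension (which exposes the command's baseInstance as gl_BaseInstanceARB) and the 16/16 bit split from the fill sketch above:

#version 440
#extension GL_ARB_shader_draw_parameters : require

layout(location = 0) in vec3 inPosition;

layout(std140, binding = 0) uniform GlobalBlock    { mat4 viewProj; };
layout(std140, binding = 1) uniform TransformBlock { mat4 transforms[256]; };

flat out int materialIndex;

void main()
{
    int matrixIndex = gl_BaseInstanceARB >> 16;      // high bits: transform index
    materialIndex   = gl_BaseInstanceARB & 0xFFFF;   // low bits: material index
    gl_Position = viewProj * transforms[matrixIndex] * vec4(inPosition, 1.0);
}

Without that extension, the same effect can be had by sourcing an instanced integer vertex attribute (divisor 1) from a buffer, which baseInstance then offsets into.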

Unfortunately, it is not yet possible to change state (render state and shaders) between indirect draw commands. This is something I am going to look at in the next article on the Nvidia CommandList extension.


This post turned out to be bigger than I had first anticipated, but efficient drawing is tricky. If you made it this far – Good for you! I hope to get time to write the follow up article as soon as real life allows me.

Procedural Rock Modeling (9 Sep 2013)

Modeling boulders and other detailed rock formations can be a challenging and time-consuming task for 3D artists. Funded by the Innovationsnetværket VISUEL VÆKST, we have built a plugin for Maya which enables 3D artists to quickly and painlessly generate artist-controllable rocks from arbitrary 3D input models. We achieve this by leveraging the base technologies built in the Elements – Environmental Visual Effects through Result-oriented Design project. As in the Elements project, the artists have full control while modeling, and all changes to the geometry are reflected in real time in the Maya Viewport 2.0. However, unlike the Elements project, we are able to store the procedurally generated geometry to disk for later use.

If you want to participate by testing the plugin please contact jesper.borlum@alexandra.dk.

The image above shows the final results.

Real-time volume rendering sandbox using GLSL (18 Jun 2012)

As part of an ongoing research project we decided to see how far we could push real-time volume rendering using only GLSL shaders. The video shown here demonstrates some of the supported features, such as:

  • Multiple iso-surface shading
  • Density plotting
  • Arbitrary oriented contour planes
  • Arbitrary oriented cutting plane

The video shows an explicit 256³ dataset of 4×16-bit floating-point data, running on an Nvidia GTX 470 graphics card.

 

Video: http://vimeo.com/44234825

Smoke rendering software (15 May 2009)

Screenshot from the demo

Here is a tech demo of our CUDA smoke visualizer software. The software demonstrates real-time interaction and visualization of a smoke data set. It is possible to adjust several parameters, such as density and light position. To download press HERE.

Screenshot of a more complex model.

Additional screenshot from the demo


Cardiac Surgery Simulator (6 May 2009)

Easy GPGPU program – OpenGL and Cg (29 Apr 2009)

As part of the GPGPU course at the University of Aarhus in 2005, we developed a very simple set of base classes for general-purpose computation on the graphics processing unit (GPGPU) through OpenGL, Nvidia Cg, and either framebuffer objects or PBuffers for render-to-texture functionality. Today you should ideally use Nvidia CUDA or OpenCL for GPGPU – but the code might still be of interest for older hardware or a pure OpenGL/Cg-based approach to GPGPU:

SimpleReactionDiffusion (framebuffer_object).zip

The archive includes the EasyGPUProgram class, which has methods to initialize data in a 2D grid layout, do the computation (as a Cg fragment shader), and retrieve the data. We have included a reaction-diffusion example based on GPU Gems 2, chapter 31, using the EasyGPUProgram class.

GPU Raycasting Tutorial (28 Apr 2009)

The famous Stanford dragon rendered using GPU raycasting.

This post will try to explain how to implement a GPU-based raycasting renderer using OpenGL and Nvidia's Cg. The tutorial assumes some experience with OpenGL and vertex/fragment shaders.

First of all, why do we need this algorithm? Because it is a smart way to achieve high-quality volume rendering, and the raycasting algorithm is well suited to modern GPUs – especially the new 8800 series, thanks to its unified shader architecture.

The reason behind this tutorial is to help people get started with GPU raycasting, because there are some technical difficulties that have to be addressed in order to render volumetric data like in the picture above.

The core of the algorithm is to send one ray per screen pixel and trace that ray through the volume. This can be implemented in a fragment program, and the rendering runs in real time. The technique is pretty flexible – for instance, effects like shadows can be implemented with a few lines of code.


Here is a conceptual image of the raycasting algorithm, where one ray per pixel is spawned and traced through the volume.

In order to generate the necessary rays, we use a clever trick that exploits OpenGL's ability to render geometry. How can this help us, you might ask? Well, listen up, my young apprentice. First we define a ray:

  • A ray is just an origin point o and a direction vector dir.
  • A ray describes a line in 3D space by the formula P(t) = o + dir * t.
  • So to generate a ray we need to find the origin point and the direction vector.

This can be done by rendering a cube whose colors represent coordinates, and letting OpenGL's interpolation take care of the rest. The way to do this is to render the front and the back side of a unit cube, as illustrated just below.

[Image: the unit cube rendered with colors encoding its coordinates – front faces on the left, back faces on the right]

If we subtract the backface (on the right) from the frontface we get a direction vector for each pixel. This is the direction of our ray. The origin is just the frontface values of the cube. So we have to do two render passes, one for the front and one for the back. To render the back side we enable front-face culling. In my implementation I use an OpenGL framebuffer object (FBO) to store the result of rendering the back side, and use the front-face rendering to generate the fragments that start the raycasting process. If you are unfamiliar with framebuffer objects, check out this link.

So to perform the raycasting we need to create the ray and then step through the volume texture. This is all done in a single fragment program executed on the GPU. The fragment program is fairly simple; the only real issue is calculating the texture coordinates used to index the backface buffer in order to get the ray exit point. These texture coordinates are referred to as normalized device coordinates, and in the implementation we find the corresponding pixel in the backface buffer by this calculation:

float2 texc = ((IN.Pos.xy / IN.Pos.w) + 1) / 2;

Here IN.Pos is the model-view-projection transformed position. This calculation gives us the fragment's screen position in the interval [0,1]. The ray exit position is then found by using texc to index into the backface buffer like this:

float4 exit_position  = tex2D(backface_buffer, texc);

Now we create the ray and use the shader model 3.0 looping capabilities to write a for loop. This loop steps through the volume with a certain step size delta, and we can accumulate opacity and color values according to the nature of our volume data set (a sketch of such a loop follows below). In the demo implementation the ray terminates when it leaves the volume or when the accumulated opacity reaches a high enough value, but there are many possibilities. For instance, if we terminate the ray when a certain opacity threshold is first encountered, the result will be a kind of iso-surface rendering.
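
To give a feel for the structure, here is a simplified sketch of such a raymarching loop in Cg. entry_position is the frontface value, exit_position comes from the backface buffer, and the variable names are mine – see the linked shader below for the real thing:

float3 dir = exit_position.xyz - entry_position;
float  len = length(dir);
dir = normalize(dir);

float3 pos = entry_position;          // current sample position
float4 dst = float4(0, 0, 0, 0);      // accumulated color and opacity

for (float t = 0; t < len; t += delta)
{
    float4 src = tex3D(volume_texture, pos);   // sample the volume

    // Front-to-back compositing.
    dst.rgb += (1 - dst.a) * src.a * src.rgb;
    dst.a   += (1 - dst.a) * src.a;

    if (dst.a > 0.95) break;          // early ray termination
    pos += dir * delta;
}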

Here is a link to the raycasting shader.

The raycasting technique can be used for many types of rendering problems where polygon-based rendering has a hard time. For instance, effects like smoke and glass can be rendered in real time. Maybe I will do a tutorial on these subjects in the near future.

To get you started with this cool rendering algorithm, I have made a simple GPU raycasting implementation that hopefully will clear up the details. That is just the kind of guy I am :-)

The demo contains a Windows executable and source code that implements the GPU raycasting algorithm. To compile you need Nvidia's Cg and GLEW.

The demo just shows a volume of colors from which some spheres have been subtracted. The step size of the ray can be adjusted by pressing the "w" and "e" keys. Note that this demo might be hard on your graphics card – try pressing "w" a lot; this results in a big step size, and the raycasting will update more rapidly.

[Screenshot from the raycasting demo]

To download windows demo and source code click here.

As a last comment: this is just a tutorial to get you started with GPU raycasting; the technique was introduced back in 2003 by these guys:

J. Krüger and R. Westermann. Acceleration techniques for GPU-based volume rendering. In Proceedings of IEEE Visualization 2003, pages 287–292, 2003.

GPU raycasting is an active research area, and if you want to learn more about it, here are a couple of references:

A more recent article that explains a shader model 3.0 implementation can be found at this URL: http://www.vis.uni-stuttgart.de/ger/research/fields/current/spvolren/

VRVis has posted a lot of really good papers on the subject, so that has been my greatest resource. Visit them at this address: http://medvis.vrvis.at/home/

If you find some errors or make some cool improvements please let me know at peter.trier@alexandra.dk. Have fun!!
