Inside Nifty's RenderDevice and how to speed it up (Part 1/2)

When I started Nifty my mindset was like, "well, games can render millions of polys each frame so throwing a couple of hundred textured polys at the GPU shouldn't hurt performance that much".

Well, I was wrong.

To achieve somewhat high performance you still have to play by the GPU's rules. Simply throwing polys at the GPU and expecting the best rendering performance doesn't really work. In this two-part series of blog posts I'll try to explain what basically sucks about the way we currently render things and what we've done over the last couple of months to achieve better rendering performance.

There is of course always room for more improvement. One thing I'd like to tackle in the future, especially in Nifty 2.0, is rendering only the parts of a scene that have changed, since the best performance you can ever get is not to render at all :) But since this is a bit more involved, for now we're stuck with the "render the whole GUI each frame" approach of current-generation Nifty. BUT at least we can make Nifty render fast. Very fast.

So here we go: a trip through the way Nifty currently renders its GUI and how we can be a bit smart to optimize it a lot. In this first part we'll look at how the current renderer works and how bad some of the decisions have been with regard to rendering performance. In the second part we'll look at how we can make everything better and speed it up.

When it comes to rendering, Nifty only knows three somewhat high-level primitives:

- render a quad in a single color or with different colors at each vertex (for gradient support)
- render a textured image
- render text with a given font

So all of the elements on your Nifty screen and all effects you apply to them will end up as a number of colored quads, textured images or text renderings.

To actually perform all of this Nifty provides an SPI in the de.lessvoid.nifty.spi.render package. If you implement the four simple interfaces for the rendering system of your choice you're done and Nifty can be used with your rendering system. Nifty already provides native LWJGL, JOGL, jME3, Slick2D and even Java2D adapter implementations.
So let's take a look at the main interface of the SPI, de.lessvoid.nifty.spi.render.RenderDevice. There are methods to load images and fonts and methods to let Nifty request the size of the screen. Other methods let the implementation know when a render frame begins and when it ends. However, the core of the RenderDevice interface is a couple of methods to render colored quads, images and text.
The render*() methods contain almost all the state that is required to perform the render directly as method parameters. Things like where to render the quad on the screen and which width, height and color to use are given as parameters.

Besides those parameters there are two additional pieces of state that Nifty can modify by calling RenderDevice methods:
- enable or disable clipping - to restrict rendering to a certain rectangle on the screen. Everything outside this clipping rectangle will not be rendered.
- change blending - to control how newly rendered pixels are combined with what is already on the screen.

So Nifty calls beginFrame() and then, for everything it needs to render, sets the state (clipping and blending) and calls renderQuad(), renderImage() or renderFont(). Finally it calls endFrame(). In each render*() call the implementation has to ensure that the correct texture is set, or that texturing is disabled when rendering plain colored quads. In the case of font rendering the correct bitmap font texture needs to be selected so that the text can be rendered properly, and so on.
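
To make that call sequence a bit more concrete, here is a minimal sketch in Java. Please note that SketchRenderDevice is a made-up, stripped-down stand-in and not the real de.lessvoid.nifty.spi.render.RenderDevice: only beginFrame(), endFrame() and the render*() names are taken from the text above, while the clip and blend method names, all parameter lists and the little example GUI are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, stripped-down stand-in for the RenderDevice SPI.
// The clip/blend method names and all parameter lists are assumptions.
interface SketchRenderDevice {
  void beginFrame();
  void endFrame();
  void setBlendMode(String mode);
  void enableClip(int x0, int y0, int x1, int y1);
  void disableClip();
  void renderQuad(int x, int y, int width, int height);
  void renderImage(String image, int x, int y, int width, int height);
  void renderFont(String font, String text, int x, int y);
}

// Records the name of every call so the per-frame sequence becomes visible.
class RecordingRenderDevice implements SketchRenderDevice {
  final List<String> calls = new ArrayList<>();

  @Override public void beginFrame() { calls.add("beginFrame"); }
  @Override public void endFrame() { calls.add("endFrame"); }
  @Override public void setBlendMode(String mode) { calls.add("setBlendMode"); }
  @Override public void enableClip(int x0, int y0, int x1, int y1) { calls.add("enableClip"); }
  @Override public void disableClip() { calls.add("disableClip"); }
  @Override public void renderQuad(int x, int y, int width, int height) { calls.add("renderQuad"); }
  @Override public void renderImage(String image, int x, int y, int width, int height) { calls.add("renderImage"); }
  @Override public void renderFont(String font, String text, int x, int y) { calls.add("renderFont"); }
}

class FrameSketch {
  // One frame of a tiny made-up GUI: a background quad, an icon and a label.
  static void renderOneFrame(SketchRenderDevice device) {
    device.beginFrame();

    // State first, then the render call that depends on it.
    device.setBlendMode("BLEND");
    device.disableClip();
    device.renderQuad(0, 0, 200, 100);

    device.enableClip(10, 10, 190, 90);
    device.renderImage("icon.png", 10, 10, 32, 32);
    device.renderFont("console.fnt", "Hello", 50, 20);

    device.endFrame();
  }

  public static void main(String[] args) {
    RecordingRenderDevice device = new RecordingRenderDevice();
    renderOneFrame(device);
    device.calls.forEach(System.out::println);
  }
}
```

Even in this toy frame, three of the eight calls only exist to change state before something can be rendered.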
 
And here is the main issue with the naive implementations that have been used so far, especially in the LWJGL renderer. Changing state costs performance, since each state switch results in quite a lot of processing on its way through all the different layers between our code and the GPU. There are driver calls, OS calls, state checks, command queues and so on before the GPU is finally in the state we need to render our triangles. If you're interested in all of the details there is a great series of blog posts by Fabian Giesen called "A trip through the Graphics Pipeline 2011".
So the first issue with the current way Nifty renders stuff is that we change state quite a lot each frame. If we need to render a single-colored untextured quad we'll need to disable texturing. If we need to render a certain image next, we'll need to enable texturing again and make sure the texture of the image we need to render is bound. The same happens with clipping and blending, which need to be enabled or disabled as well. So we're constantly changing state which, well, hurts performance.
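
To get a feeling for how quickly this adds up, here is a small Java sketch that simply counts texture state switches for a sequence of draws. Everything in it is made up for illustration: the class, the texture names and the draw order are assumptions, and it only models the texture binding (not clipping or blending).

```java
import java.util.Arrays;
import java.util.List;

class StateSwitchCounter {
  // Marker for "texturing disabled", as needed for plain colored quads.
  static final String NO_TEXTURE = "<none>";

  // Counts how often the bound texture has to change while walking the draws
  // in order. Assumes texturing is disabled when the frame starts.
  static int countTextureSwitches(List<String> texturePerDraw) {
    int switches = 0;
    String bound = NO_TEXTURE;
    for (String texture : texturePerDraw) {
      if (!bound.equals(texture)) {
        switches++;
        bound = texture;
      }
    }
    return switches;
  }

  public static void main(String[] args) {
    // Two buttons, each drawn as: colored background, image, text.
    List<String> asIssued = Arrays.asList(
        NO_TEXTURE, "button.png", "font.png",
        NO_TEXTURE, "button.png", "font.png");

    // The same six draws if they happened to arrive grouped by texture.
    List<String> grouped = Arrays.asList(
        NO_TEXTURE, NO_TEXTURE,
        "button.png", "button.png",
        "font.png", "font.png");

    System.out.println(countTextureSwitches(asIssued)); // prints 5
    System.out.println(countTextureSwitches(grouped));  // prints 2
  }
}
```

The grouped order is only there to show how much the count depends on draw order. Simply reordering draws is not always possible, because overlapping elements have to be rendered in the right sequence.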
 
A second issue that is especially apparent in the native LWJGL renderer is the way the actual vertex data is submitted to the GPU. The classic (and very old) way of submitting vertex data to OpenGL has been used: immediate mode. This means each vertex is sent with multiple OpenGL calls. Here is an example:
 
 // code to render a single quad with vertex color - DON'T DO THAT!
 GL11.glBegin(GL11.GL_QUADS);
 GL11.glColor4f(topLeft.getRed(), topLeft.getGreen(), topLeft.getBlue(), topLeft.getAlpha());
 GL11.glVertex2i(x, y);
 GL11.glColor4f(topRight.getRed(), topRight.getGreen(), topRight.getBlue(), topRight.getAlpha());
 GL11.glVertex2i(x + width, y);
 GL11.glColor4f(bottomRight.getRed(), bottomRight.getGreen(), bottomRight.getBlue(), bottomRight.getAlpha());
 GL11.glVertex2i(x + width, y + height);
 GL11.glColor4f(bottomLeft.getRed(), bottomLeft.getGreen(), bottomLeft.getBlue(), bottomLeft.getAlpha());
 GL11.glVertex2i(x, y + height);
 GL11.glEnd();
That's really bad. There are lots of calls to the GL and each of them has to get through the different layers and checks again. This is only OK if you send very few vertices, but as the vertex count increases the overhead of the individual method calls adds up and hurts performance as well.
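
As a quick back-of-the-envelope check, the quad code above makes one glBegin(), four glColor4f(), four glVertex2i() and one glEnd() call, so ten GL calls for a single quad. The following Java sketch just does that arithmetic. The quad count of 500 and the 60 frames per second are assumed numbers for illustration, not measurements.

```java
class ImmediateModeCallCount {
  static final int VERTICES_PER_QUAD = 4;
  static final int CALLS_PER_VERTEX = 2; // glColor4f + glVertex2i
  static final int BEGIN_END_CALLS = 2;  // glBegin + glEnd

  // GL calls needed for one vertex-colored quad in immediate mode.
  static int callsPerQuad() {
    return BEGIN_END_CALLS + VERTICES_PER_QUAD * CALLS_PER_VERTEX;
  }

  // GL calls per frame if every quad is submitted on its own.
  static long callsPerFrame(int quads) {
    return (long) quads * callsPerQuad();
  }

  public static void main(String[] args) {
    int quads = 500; // assumed GUI size
    int fps = 60;    // assumed frame rate
    System.out.println(callsPerQuad());             // prints 10
    System.out.println(callsPerFrame(quads));       // prints 5000
    System.out.println(callsPerFrame(quads) * fps); // prints 300000
  }
}
```

Textured quads need an additional texture coordinate call per vertex on top of that, and none of this includes the state changes in between.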
 
 So these are the main issues we'll need to solve to improve rendering performance:
 
- reduce state switches
- reduce draw calls to send vertex data

The second part of this mini blog series will explain how we solve those two issues by providing a special RenderDevice implementation. Interestingly enough, this special RenderDevice solves additional issues and makes implementing a new renderer for Nifty easier as well :)
 
Instead of hundreds of glVertex() calls we can render the whole GUI in very few draw calls, and most of the time even in a single one! And all of this with no changes to the rest of Nifty or your code. In most cases you'll be able to use the new special RenderDevice for a performance boost and that's it. Nifty! :)
 
Curious? See you in the next blog post!
 
 void :D
