Optimizing Dreamcast Microsoft Direct3D Performance
By Sebastian Wloch
Kalisto Entertainment
March 1999
Summary: This article provides guidelines for achieving high performance
for Microsoft® Windows® CE-based game applications. Game developers share
useful implementations for those who want to write an efficient 3-D engine,
based on Microsoft Direct3D® and the Windows CE operating system for the
Dreamcast. The article discusses performance techniques, optimization
methods, geometries and textures, and solutions to problems. (11 printed
pages)
Contents
Introduction
Taking Advantage of the Power of the Dreamcast 3-D Chip
Improving Performance
Working with Geometry and Performance
Optimizing a Game
Summary
Introduction
While developing a Microsoft® Windows® CE–based game on the Sega Dreamcast,
we discovered several techniques that help to optimize game code and make
the best use of the Microsoft Direct3D® API. This article documents what
we learned.
A game developer might think that Direct3D techniques would be the same,
whether you're developing your game for the PC or for the Dreamcast. However,
in reality, Microsoft optimized Direct3D specifically for the Dreamcast
hardware. Therefore, to obtain the best performance, you need to pay attention
to Dreamcast-specific issues. In other words, you need to understand the
Dreamcast hardware and the Direct3D for Dreamcast implementation.
This article presents an overview of what we as game developers consider
useful for anyone who wants to write an efficient 3-D engine, based on
Direct3D and the Windows CE operating system for the Dreamcast. First,
we will cover features of the Dreamcast's 3-D hardware. Then we will provide
tips to help you implement the following techniques, which can improve
the overall performance of your 3-D game engine.
- Send less geometry to Direct3D.
- Choose the best way to send geometry to Direct3D.
- Test different optimizations, and then view the results by using the
performance viewer tool of Direct3D.
Taking Advantage of the Power of the Dreamcast 3-D Chip
As triangles are sent to it, the Dreamcast hardware 3-D chip does not
render the triangles scan line by scan line. Instead, it stores the triangles
in video memory as they are sent. Once the entire scene has been collected,
the hardware sends all triangles to the screen tile by tile, not triangle
by triangle.
Every tile is 32 x 32 pixels. For each tile, the hardware selects the
triangles that intersect the tile and, for each pixel, retrieves the
triangle closest to the camera (viewport). Then this pixel is rendered to
the screen by completing the interpolations, reading the texel, and so on.
Thus, every pixel on the screen is actually rendered to the screen buffer
only once. Other 3-D hardware systems render a pixel once for every
triangle that covers it, but not the Dreamcast hardware.
By using this method, the hardware is not limited by the fill rate. No
matter how many triangles cover a single pixel, that single pixel is
rendered only once. Therefore, with the Dreamcast hardware, you don't
need a Z-Buffer, because only the closest triangle is rendered.
In addition, with the Dreamcast hardware, you don't need to clip the
triangles to the screen viewport, so there is no need for clipping tests
and calculations. This is because the hardware renders graphics tile by
tile. As a result, you don't need to test primitives, nor do you need
to break up primitives into smaller primitives that fit on the screen.
The Dreamcast hardware does have to do several passes to render transparency,
which slows down the rendering process a little. However, during that
process, the hardware sorts the transparent triangles automatically, so
your game engine does not need to sort them. Because your game engine
doesn't have to do the memory manipulations that come with sorting, your
3-D pipeline is not slowed down. Even if the polygons
intersect, there won't be any artifacts because the translucency sorting
is done for each pixel by the hardware.
Not all transparent modes need several passes. The 5551 (Punch Through)
mode does not need to combine the most recently rendered pixel with the
pixel previously rendered to the screen buffer because 1 bit of alpha
channel does not allow any degree of translucency. Such triangles are
rendered with the same speed as opaque triangles—in a single pass.
Another feature of the Dreamcast hardware is that it has SH4 native operations
that are fully supported by a set of intrinsics. The ones that we use
the most are the dot product and the reciprocal square root. One special
function that computes the sine and cosine of an angle is also very useful
for character animation and camera movement calculations.
You can also apply the following Dreamcast hardware features to each
pixel the hardware renders to the screen:
- Use a special surface mode to perform realistic bump mapping.
- Use a special texture mode (VQ compression) to compress textures at
an 8:1 ratio, plus 2 KB of overhead for the codebook.
- Test the on-screen pixel with a set of volumes, and apply a specific
operation to the pixels inside or outside of the volume (color modification,
transparency, or the texture ID). This makes shadows, lighting, and
other special effects easy, and it doesn't break up the 3-D geometry
pipeline.
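To see what the 8:1 ratio and the 2 KB codebook mean in practice, the following sketch estimates the size of a VQ-compressed texture. It assumes the uncompressed texture uses 16 bits per texel; the function names are ours and are not part of any API.

```cpp
#include <cstddef>

// Fixed codebook overhead quoted above: 2 KB per VQ texture.
constexpr std::size_t kCodebookBytes = 2 * 1024;

// Assumption for this sketch: the source texture uses 16 bits (2 bytes) per texel.
constexpr std::size_t uncompressedBytes(std::size_t width, std::size_t height) {
    return width * height * 2;
}

// 8:1 compression of the texel data, plus the codebook.
constexpr std::size_t vqCompressedBytes(std::size_t width, std::size_t height) {
    return uncompressedBytes(width, height) / 8 + kCodebookBytes;
}
```

Under these assumptions, a 256 x 256 texture shrinks from 131,072 bytes to 18,432 bytes. For a 32 x 32 texture, the codebook alone is as large as the uncompressed data, so VQ compression pays off mainly on larger textures.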
Improving Performance
Usually in games, the complete scene is much larger than the part a game
user actually sees on the screen. Therefore, sending every triangle of
the scene to Direct3D would waste resources and slow down performance.
So, cull the triangles that are not currently visible from the triangle
set sent to Direct3D.
To eliminate the geometry that is outside the viewable area, you need
to build efficient tests that meet all of the following rules:
- They are called as infrequently as possible.
- They are as fast as possible.
- They eliminate as many triangles as possible.
Tests are designed to eliminate the following three kinds of geometry:
- Triangles off the screen—To test for this condition, apply view frustum
elimination. That is, test every triangle, primitive, or object against
the viewing frustum pyramid, and then eliminate the triangle, primitive,
or object if it is outside the viewing frustum pyramid. This test generally
eliminates a lot of triangles by using only a few tests.
- Triangles not facing the screen—To test for this condition, apply
backface culling. That is, test every triangle or group of triangles
to see if it faces the screen, and eliminate the geometry that is not
facing the screen, such as the back of a person's head. This test generally
eliminates 10-50 percent of the geometry, but the cost and overhead
may be huge. The efficiency depends on the geometry; the more strips
you find, the better.
- Triangles completely hidden by other objects—In this case, create
an advanced scene organization to determine rapidly which triangles
are hidden. This test generally eliminates 10-50 percent of the triangle
geometry, but the performance depends on the geometrical organization.
This method is not discussed in this article because it depends on the
type of game. For example, there is a big difference between exteriors
and interiors.
To apply viewing frustum elimination, you need a test that rapidly determines
whether or not a triangle is in the viewing frustum. The easiest way is
to group the triangles into objects or primitives, and then test all the
triangles of an object or a primitive together. Compute a bounding sphere
that encloses all of the object's triangles, and test whether the bounding
sphere touches the viewing frustum, is completely inside the frustum, or
is completely outside the frustum. The center of the sphere can simply be
the barycenter (centroid) of the triangles.
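A minimal sketch of such a bounding-sphere test follows. It is not taken from our engine: the types, the plane convention (unit normals pointing into the frustum), and the function names are assumptions made for illustration.

```cpp
#include <array>

struct Vec3 { float x, y, z; };

// Plane in the form dot(n, p) + d = 0. Assumption: n is unit length and
// points toward the inside of the frustum.
struct Plane { Vec3 n; float d; };

struct Sphere { Vec3 center; float radius; };

enum class Visibility { Outside, Intersecting, Inside };

inline float signedDistance(const Plane& p, const Vec3& v) {
    return p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d;
}

// Classify a bounding sphere against the six planes of the viewing frustum.
inline Visibility classifySphere(const std::array<Plane, 6>& frustum,
                                 const Sphere& s) {
    bool straddles = false;
    for (const Plane& p : frustum) {
        const float dist = signedDistance(p, s.center);
        if (dist < -s.radius) return Visibility::Outside;  // fully behind one plane
        if (dist < s.radius) straddles = true;             // crosses this plane
    }
    return straddles ? Visibility::Intersecting : Visibility::Inside;
}
```

The test is conservative: a sphere near a frustum corner may be reported as intersecting even though it is just outside, which costs a little extra geometry but never drops a visible object.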
It is also very efficient to group primitives together into objects.
Then you need only test the primitives if the object is on the edge of
the viewing frustum. If the object is completely inside or outside the
viewing frustum, you know that all the primitives share their container
object's property.
Direct3D already does backface culling very efficiently. In some cases,
you can also group triangles and treat them together. For a series of
connected triangles (a strip, for example) that lie completely or almost
on the same plane, you can:
- Calculate an average normal vector.
- Compute the backface culling on the average vector.
- Use a tolerance value to know if the whole set of triangles is in
the viewable area or not.
By using this process, instead of testing each triangle, you can eliminate
a strip of 10 triangles with a single test.
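The three steps above can be sketched as follows. This is an illustration under our own assumptions: the structure and function names are hypothetical, the eye position is expressed in the strip's object space, and the tolerance is a value you would tune per strip according to how far it deviates from a plane.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

inline Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 normalize(const Vec3& v) {
    const float len = std::sqrt(dot(v, v));
    return len > 0.0f ? Vec3{v.x / len, v.y / len, v.z / len} : Vec3{0.0f, 0.0f, 0.0f};
}

// Data precomputed once per strip (hypothetical layout).
struct StripCullInfo {
    Vec3 averageNormal;  // unit-length average of the face normals
    Vec3 anchor;         // a point on or near the strip's plane
    float tolerance;     // slack for strips that are only almost planar
};

inline StripCullInfo buildCullInfo(const std::vector<Vec3>& faceNormals,
                                   const Vec3& anchor, float tolerance) {
    Vec3 sum{0.0f, 0.0f, 0.0f};
    for (const Vec3& n : faceNormals) { sum.x += n.x; sum.y += n.y; sum.z += n.z; }
    return {normalize(sum), anchor, tolerance};
}

// Returns true when the whole strip can be skipped with a single test.
inline bool stripFacesAway(const StripCullInfo& strip, const Vec3& eye) {
    const Vec3 toEye = normalize(sub(eye, strip.anchor));
    return dot(strip.averageNormal, toEye) < -strip.tolerance;
}
```

A strip seen almost edge-on falls inside the tolerance band and is still sent to Direct3D, which then culls its triangles individually.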
If an object is getting very big and contains a lot of primitives or
triangles, you may find it worthwhile to subdivide the object into a hierarchy
of smaller objects. Indeed, a large object often does touch the viewing
frustum even if only a small piece of it really intersects the frustum.
This results in sending a large invisible piece to Direct3D for nothing.
To solve this problem, you can apply a subdivision technique such as an
Octree or a SEAD to test each piece of the large object. The idea is to
create subgroups of objects based on a regular (SEAD) or irregular (Octree)
subdivision. You could also use the logical hierarchy of the scene. For
example, the hierarchy of a single character—if the arm isn't on the screen,
you don't need to check to see if the hand is on the screen.
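The hierarchy idea can be sketched as a recursive walk. The node layout and the classify callback below are hypothetical; in a real engine the callback would run a bounding-volume test against the viewing frustum.

```cpp
#include <functional>
#include <vector>

enum class Visibility { Outside, Intersecting, Inside };

// Hypothetical scene node: leaves are primitives, inner nodes are groups.
struct Node {
    int id;                      // identifies the node's bounding volume
    std::vector<Node> children;  // empty for a leaf primitive
};

// Collects the ids of the leaf primitives to draw and counts the tests run.
void collectVisible(const Node& node,
                    const std::function<Visibility(int)>& classify,
                    bool parentFullyInside,
                    std::vector<int>& drawList,
                    int& testsPerformed) {
    Visibility v = Visibility::Inside;
    if (!parentFullyInside) {
        v = classify(node.id);
        ++testsPerformed;
        if (v == Visibility::Outside) return;  // the whole subtree is rejected
    }
    if (node.children.empty()) {
        drawList.push_back(node.id);
        return;
    }
    // Children inherit "fully inside" and are then accepted without any test.
    const bool inside = (v == Visibility::Inside);
    for (const Node& child : node.children)
        collectVisible(child, classify, inside, drawList, testsPerformed);
}
```

With this scheme, children are tested individually only when their parent straddles the edge of the frustum.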
All these elimination techniques are based on grouping triangles or primitives
together. They are inefficient if applied to small groups of triangles
or, worse, to single triangles.
Summary
- Do the fewest possible tests per triangle to eliminate it (one
bounding-sphere test for a 1000-triangle object amounts to 1/1000th of
a test per triangle).
- Create hierarchies to reduce the number of tests for each object.
- Subdivide objects that are too large into smaller hierarchies, so
that you don't end up with one DrawPrimitive call for 10,000 triangles
when only 1000 of the triangles are actually in the viewable area.
Working with Geometry and Performance
The way you store geometry and send it to Direct3D affects performance.
In some games, you'll find that triangle lists provide better performance.
In others, you'll find that triangle strips provide better performance.
Test your situation to determine the best approach to use.
Strips share vertices. Therefore, in very large strips, the number of
vertices in the primitive tends towards the number of triangles, so a
large strip represents about one-third of the data that a list of the
same number of triangles would send to Direct3D. Direct3D therefore
transforms, lights, and sends one-third as much data to the hardware.
This is why strips are much faster than single triangles.
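The arithmetic behind the one-third figure is simple enough to write down. The helper names below are ours.

```cpp
#include <cstddef>

// A triangle list sends three vertices per triangle.
constexpr std::size_t listVertexCount(std::size_t triangles) {
    return triangles * 3;
}

// A single strip sends two vertices to start, then one per triangle.
constexpr std::size_t stripVertexCount(std::size_t triangles) {
    return triangles == 0 ? 0 : triangles + 2;
}
```

For 100 triangles, a list sends 300 vertices and a single strip sends 102. The ratio approaches one-third as the strip grows, which is why short strips gain much less.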
One difficulty with strips is that triangles must share the same state
(texture and effects) and the adjacent vertices must be identical (xyz,
rgb, normal vector, and so on). Those constraints are very important and
the quality of the meshes directly influences the size and number of strips
that can be found. To get the best results, you should ensure that meshes
use as few different textures as possible and that texture mapping is
done so that all adjacent vertices share the UV coordinates.
There are two different ways to send geometry to Direct3D. You can use
DrawPrimitive or DrawIndexedPrimitive. If you use the DrawPrimitive function,
you should send triangles in the D3DPT_TRIANGLESTRIP mode, especially
if you can do a simple backface culling test for the whole strip. Avoid
using the D3DPT_TRIANGLELIST mode with the DrawPrimitive function.
If you simply want to send a list of triangles, use DrawIndexedPrimitive
instead. It is the best solution if you can't do backface culling on large
groups of triangles. With DrawIndexedPrimitive, Direct3D automatically
generates strips from the triangle list wherever the list of indexes makes
it possible.
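One way to prepare data for an indexed call is to weld identical vertices in an offline tool so that triangles reference shared entries. The sketch below is an assumption about how such a tool could look, not Direct3D code: the Vertex layout and the function names are hypothetical, and indices are stored as 16-bit values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical vertex layout: position, normal, color, texture coordinates.
struct Vertex {
    float x, y, z;
    float nx, ny, nz;
    std::uint32_t color;
    float u, v;
};

struct IndexedMesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Two vertices can be shared only if every component is identical.
inline bool sameVertex(const Vertex& a, const Vertex& b) {
    return a.x == b.x && a.y == b.y && a.z == b.z &&
           a.nx == b.nx && a.ny == b.ny && a.nz == b.nz &&
           a.color == b.color && a.u == b.u && a.v == b.v;
}

// Converts an unindexed triangle list (three vertices per triangle) into an
// indexed list. The linear search is quadratic, which is acceptable for an
// offline step but not for per-frame work. Assumes fewer than 65,536 unique
// vertices, because indices are 16 bits wide.
inline IndexedMesh buildIndexedList(const std::vector<Vertex>& triangleSoup) {
    IndexedMesh mesh;
    for (const Vertex& v : triangleSoup) {
        std::size_t i = 0;
        while (i < mesh.vertices.size() && !sameVertex(mesh.vertices[i], v)) ++i;
        if (i == mesh.vertices.size()) mesh.vertices.push_back(v);
        mesh.indices.push_back(static_cast<std::uint16_t>(i));
    }
    return mesh;
}
```

A quad made of two triangles collapses from six vertices to four. If the two triangles map the shared edge with different UV coordinates, the vertices can no longer be welded, which is the texture-mapping constraint described above.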
Regarding the type of vertex data sent, generally, D3D_LVERTEX (lit by
the game but transformed by Direct3D) is faster than D3D_TLVERTEX (lit
and transformed by the game) because Direct3D has very efficient transformation
code. But if you already have the screen coordinates (for an on-screen
display, for example) or if you can generate the geometry in screen space
(for Bezier patches, for example), then you might prefer D3D_TLVERTEX.
A problem may occur if you group several objects into a single list and
these objects are positioned differently (different limbs of a character
for example). In this case, the only way you can have Direct3D carry out
the transformations is to split the triangle list into several smaller
lists. This reduces performance because Direct3D is faster with large
lists. It may be impossible to create some lists if several vertices of
a triangle don't share the same matrix, which happens when you are putting
skin on characters. In those cases, it is usually more efficient to do
the transformation in the game code (for example, with the animation)
and send the transformed vertices in larger lists by using the D3D_TLVERTEX
type.
While the Dreamcast hardware does the viewport clipping, Direct3D does
the near plane clipping unless the DONOTCLIP flag is set. The DONOTCLIP
flag tells Direct3D not to do clipping calculations. It is best to set
the DONOTCLIP flag whenever possible. Test each object to see if it
touches the near plane. If it does, send its triangles without the
DONOTCLIP flag so that Direct3D clips them; otherwise, set the flag.
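The per-object decision can be sketched with a bounding sphere in view space. Everything here is an assumption for illustration: kFlagDoNotClip is a placeholder constant rather than the real Direct3D flag value, the view space is taken to have +Z pointing into the screen, and the object is assumed to have already passed frustum elimination.

```cpp
#include <cstdint>

// Placeholder standing in for the DONOTCLIP draw flag (hypothetical value).
constexpr std::uint32_t kFlagDoNotClip = 1u << 0;

// Chooses the draw flags for one object from its bounding sphere.
// centerViewZ: depth of the sphere center in view space (+Z into the screen).
// nearZ:       distance of the near clipping plane.
inline std::uint32_t drawFlagsFor(float centerViewZ, float radius, float nearZ) {
    const bool touchesNearPlane = (centerViewZ - radius) <= nearZ;
    return touchesNearPlane ? 0u : kFlagDoNotClip;
}
```

Only the few objects that actually reach the near plane pay for clipping; everything else is sent with the flag set.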
Our final issue with geometry involves data locality and alignment. To
be as efficient as possible, align all vertex data to 32 bytes. If the
vertex data is misaligned, Direct3D has to copy the data to another memory
block that is aligned to 32 bytes. An important thing to consider is that
a block allocated with the malloc function is only aligned to 4 bytes.
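Because malloc alone does not guarantee 32-byte alignment, one portable approach is to over-allocate and round the pointer up. The helper names below are ours, and your toolchain may already provide an aligned allocator that you would prefer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kVertexAlignment = 32;

// Returns a block of at least `bytes` bytes whose address is a multiple of 32,
// or nullptr on failure. Release it with freeAligned32, never with free.
inline void* allocAligned32(std::size_t bytes) {
    // Reserve room to round the address up and to remember the raw pointer.
    void* raw = std::malloc(bytes + kVertexAlignment + sizeof(void*));
    if (raw == nullptr) return nullptr;
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    const std::uintptr_t aligned =
        (start + kVertexAlignment - 1) & ~static_cast<std::uintptr_t>(kVertexAlignment - 1);
    // Store the raw pointer just below the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

inline void freeAligned32(void* p) {
    if (p != nullptr) std::free(static_cast<void**>(p)[-1]);
}
```

Allocating every vertex buffer through a helper like this keeps the data in place, so no extra copy to an aligned block is needed.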
Also, you should not generate primitives on the fly. It is much faster
to have everything ready in the final format. Then you can simply call
the DrawPrimitive function. You should use D3D_VERTEX (transformed and
lit by Direct3D) wherever possible.
Finally, don't store the primitives in a random order. Try to group them
in the same order that you're going to render them. This will be faster
due to better cache coherence.
Summary
- Send as many vertices as possible in a single DrawPrimitive call.
This is the most important optimization you can do. Do everything you
can to keep from breaking up primitives.
- Do the transformation yourself if it would make you break up primitives,
because vertices have different matrices.
- Try to share all states for the triangles you send.
- Group the triangles per state and matrix, but don't sort them on the
fly in real time. If you arrange them by matrix and state beforehand,
then object by object is fine.
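The grouping step can be done once at load time, for example as sketched below. The Primitive and Batch records are hypothetical, and merging by simply adding vertex counts assumes triangle lists, because separate strips cannot be concatenated this way.

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical record for one primitive as exported by the art pipeline.
struct Primitive {
    int textureId;            // stands in for the full render state
    int matrixId;             // which transformation matrix the primitive uses
    std::size_t vertexCount;
};

// One batch corresponds to one draw call.
struct Batch {
    int textureId;
    int matrixId;
    std::size_t vertexCount;
};

// Run once at load time, not every frame: sort by state and matrix, then merge
// neighbors that share both into a single larger batch.
inline std::vector<Batch> buildBatches(std::vector<Primitive> prims) {
    std::stable_sort(prims.begin(), prims.end(),
                     [](const Primitive& a, const Primitive& b) {
                         return std::tie(a.textureId, a.matrixId) <
                                std::tie(b.textureId, b.matrixId);
                     });
    std::vector<Batch> batches;
    for (const Primitive& p : prims) {
        if (!batches.empty() && batches.back().textureId == p.textureId &&
            batches.back().matrixId == p.matrixId) {
            batches.back().vertexCount += p.vertexCount;
        } else {
            batches.push_back({p.textureId, p.matrixId, p.vertexCount});
        }
    }
    return batches;
}
```

In this model, five small primitives that use two textures and two matrices collapse into three draw calls.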
Optimizing a Game
The Windows CE performance viewer is an important tool that you can use
for all the optimization work on a game. To turn it on, select it from the
Monitor drop-down menu in the Dreamcast Tool, but only after you have
launched the game.
When you activate the Windows CE performance viewer, you will see three
horizontal bars on the screen. The first bar (light blue) represents the
time the hardware takes to render the scene. The second bar (gray with
red, green, or blue vertical lines) represents the time spent either in
the application or in Direct3D. The third bar (purple) represents the
frame rate.
The three bars grow from left to right. The slower a part is, the longer
its bar will be. On the second bar, you can differentiate between the
time spent in the application (gray) and in Direct3D (colored lines).
You can see the results of every optimization explained in this article
by looking at the bars displayed by the Windows CE performance viewer.
An efficient elimination algorithm reduces the time spent in Direct3D,
so you'll see fewer colored lines and more gray. If the gray part of the
bar grows by more than the colored lines shrink, then the game code took
more time to eliminate the triangles than Direct3D would have taken to
render them, which increases the overall time for each frame.
Because each DrawPrimitive and DrawIndexedPrimitive call is represented
by one colored line, if a geometry is rendered triangle by triangle, a
large part of the bar will alternate between gray and colored lines. If
the geometry is rendered with only one DrawIndexedPrimitive call, there
will be one wide colored line, but it will be much narrower than the
previous alternating part. This shows how it can take less time to render
the same number of triangles if they are sent together in one large list.
If a geometry can be automatically transformed into strips by the DrawIndexedPrimitive
call, the large colored block will shrink, and the overall performance
will be better. This is because the number of vertices in the mesh will
be reduced and because the size of the colored line depends directly
on the number of vertices sent.
With this tool, it is very easy to try out different modes and flags and
to measure the difference between them precisely. We really appreciated
the direct feedback this tool can deliver. You can switch off a piece of
functionality by pressing a key and immediately see the bar shrink.
Examples from Optimization Process
The following screen shots come from the optimization process of our
technical demo game. At the bottom of each screen shot, notice the bars
of the performance monitor. Figure 1 illustrates the performance monitor,
so that you can better understand the screen shots in Figures 2 through 5.
Figure 1. Performance monitor
In the first screen shot in Figure 2, none of the optimizations has been
implemented. The game is sending a lot of small primitives, as shown by
every little red or blue line.
Figure 2. No optimizations implemented
In Figure 3, primitives are aligned to 32 bytes, lined up one behind
the other.
Figure 3. Primitives aligned
In Figure 4, triangles are grouped by render state to reduce the number
of primitives.
Figure 4. Triangles grouped by render state
In Figure 5, strips were generated to reduce the number of vertices.
Figure 5. Strips generated to reduce vertices
Summary
By following the guidelines in this article, you will be able to achieve
very high performance for your Windows CE–based game application with
Direct3D.
When we first launched our PC application on the Dreamcast, performance
was worse than 10 frames per second. But after we applied the techniques
explained in this article, performance improved significantly. Now the
performance is close to 60 frames per second, and we still have more optimizations
to do. We plan to increase the size of our primitives even further and
use fewer textures for our objects. We are confident that, with these
additional optimizations, we will be able to achieve a performance of
better than 60 frames per second.
The solutions discussed in this article don't all bring the same performance
improvement, but the basic idea remains the same. Try to send as many
triangles using as few DrawPrimitive or DrawIndexedPrimitive calls as
possible. Once you've achieved that, reduce the number of vertices sent
by sharing the vertices that you do send.
It is very important to choose the right method for each kind of geometry
(humans, animals, cars, and so on) and to train artists to create clean
geometries that use just a few different textures with texture coordinates
that can be shared by the vertices.
--------------------------------------------
This document is provided for informational purposes only. MICROSOFT
MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
Microsoft, Direct3D and Windows are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.
Other product or company names mentioned herein may be the trademarks
of their respective owners.
© 1999 Microsoft Corporation. All rights reserved.