FAQ-11 Performance

Posted on 2006-12-15 18:12:00
A moderator has posted this section before; this is just a lightly revised version.

Performance

I'm using glDrawPixels and glReadPixels in OpenGL. I'm seeing poor performance. What should I do?
I upgraded my application to DirectX9 and my hardware shadow maps no longer work! What's up?
Render-to-texture seems to really slow down my Direct3D application, even when I do it infrequently and render very little geometry to a small surface. Is this normal?
My application is slow. How can I figure out what's causing the slowdown?
Ok, now I know what my application's bottleneck is. How do I get rid of it and make my application run faster?
How do I time my rendering code? (How do I know how long it takes the GPU to render something?)


Posted on 2006-12-15 18:12:00 (OP)

Re: FAQ-11 Performance

Performance

I'm using glDrawPixels and glReadPixels in OpenGL. I'm seeing poor performance. What should I do?

BGRA is and always has been the fastest format to use. (There are some cases where RGBA is OK, and usually BGR is better than RGB, but in general, BGRA is the safest mode.)

The fastest readback performance you'll get is approximately 160-180 MB/s (~45 MPix/s) for RGBA/BGRA, which is the GPU hardware limit (due to PCI reads on the memory interface). This is with a P4 1.5 GHz and above class system. The readback rate doesn't change significantly with the GeForce FX family. Note that you'll get the highest performance when you read back large areas as opposed to small ones.
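
A minimal sketch of the recommended readback path, assuming an existing GL context and a window-sized framebuffer (the function name and buffer handling are ours, not from the FAQ; on Windows, include windows.h before GL/gl.h):

#include <GL/gl.h>
#include <vector>

#ifndef GL_BGRA                       // may live in glext.h on older headers
#define GL_BGRA 0x80E1
#endif

std::vector<unsigned char> ReadBackFrame(int width, int height)
{
    std::vector<unsigned char> pixels(width * height * 4);
    glPixelStorei(GL_PACK_ALIGNMENT, 4);      // BGRA rows are already 4-byte aligned
    glReadPixels(0, 0, width, height,         // one large read beats many small ones
                 GL_BGRA, GL_UNSIGNED_BYTE, &pixels[0]);
    return pixels;
}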

For glDrawPixels(), performance depends on the GPU, the path, and the texture size, but for NV28GL we achieve ~130 MPix/s RGBA (520 MB/s) and for Quadro FX we achieve ~170 MPix/s RGBA (680 MB/s) for images >128x128.
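
A companion sketch for the upload direction, under the same assumptions as the readback example:

#include <GL/gl.h>

#ifndef GL_BGRA
#define GL_BGRA 0x80E1
#endif

// 'data' is assumed to point at width * height * 4 bytes of BGRA pixels.
void DrawImage(int width, int height, const unsigned char* data)
{
    glPixelStorei(GL_UNPACK_ALIGNMENT, 4);
    glRasterPos2i(0, 0);                      // image origin in current coordinates
    glDrawPixels(width, height, GL_BGRA, GL_UNSIGNED_BYTE, data);
}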

These numbers will vary based on your particular system, so you may want to measure them yourself using GLperf.

Using pixel data range on an AGP 8x system we can achieve writes at ~1.7GB/s and ~960 MB/s on an AGP 4x system. More information is available at: http://www.nvidia.com/dev_content/nvopenglspecs/GL_NV_pixel_data_range.txt.

I upgraded my application to DirectX9 and my hardware shadow maps no longer work! What's up?

We've changed the behavior of hardware shadow maps between the DirectX8 and DirectX9 interfaces. In DirectX8, you're required to scale the interpolated z component (that will be compared with the value in the shadow map) by the bit depth of the shadow map itself. Starting with the DirectX9 interfaces, we've changed this behavior to no longer require this scale, so the z value should be in the range [0..1], regardless of bit depth. Basically, we wanted to implement this new cleaner behavior, but didn't want to break shipping apps that rely on the old behavior, so we changed it only for the new DX9 interfaces.
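
A hedged illustration of the difference. The matrix below is the usual [-1,1] to [0,1] projective-texturing remap, not code from the FAQ; only the zScale term changes between the two interface generations:

#include <d3dx9.h>

D3DXMATRIX MakeShadowTexMatrix(bool dx9Interfaces, int shadowMapBits)
{
    float zScale = dx9Interfaces
        ? 1.0f                                    // DX9: compare z directly in [0..1]
        : (float)((1u << shadowMapBits) - 1);     // DX8: scale by bit depth, e.g. 24
    return D3DXMATRIX(0.5f,  0.0f, 0.0f,   0.0f,
                      0.0f, -0.5f, 0.0f,   0.0f,
                      0.0f,  0.0f, zScale, 0.0f,
                      0.5f,  0.5f, 0.0f,   1.0f);
}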

Render-to-texture seems to really slow down my Direct3D application, even when I do it infrequently and render very little geometry to a small surface. Is this normal?

Make sure you're not creating your texture in the D3DPOOL_MANAGED pool. The Direct3D runtime needs a local copy of all MANAGED textures to restore them on mode switch or for texture management, so rendering to these implies that a readback from local video memory to system memory will occur, dramatically hurting performance. For render targets, use D3DPOOL_DEFAULT instead.
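
A minimal sketch of creating a render-target texture correctly; 'device' is assumed to be an existing IDirect3DDevice9*, and the size and format are examples:

#include <d3d9.h>

IDirect3DTexture9* CreateRenderTargetTexture(IDirect3DDevice9* device)
{
    IDirect3DTexture9* rtt = NULL;
    device->CreateTexture(256, 256, 1,            // width, height, one mip level
                          D3DUSAGE_RENDERTARGET,  // this usage requires DEFAULT pool
                          D3DFMT_A8R8G8B8,
                          D3DPOOL_DEFAULT,        // NOT D3DPOOL_MANAGED
                          &rtt, NULL);
    return rtt;                                   // NULL on failure; check the HRESULT in real code
}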

My application is slow. How can I figure out what's causing the slowdown?

The key is to identify your application's bottleneck. There are several ways to do this (a minimal frame-timing helper for comparing these experiments is sketched after the list):

Eliminate all file accesses. Any hard disk access will surely kill your frame rate. This is easy enough to detect: just take a look at your computer's "hard disk in use" light.
Run identical GPUs on CPUs of different speeds. If the frame rate varies, your application is CPU-limited.
Decrease your AGP speed from your system BIOS. If the frame rate varies, your application is AGP bandwidth-limited.
Reduce your GPU's core clock. If the slower core clock reduces performance, then your application is limited by the vertex shader, rasterization, or the fragment shader (i.e., shader-limited).
Reduce your GPU's memory clock. If the slower memory clock affects performance, your application is limited by texture or frame-buffer bandwidth (GPU bandwidth-limited).
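
The frame-timing helper mentioned above, a hypothetical Windows sketch: it averages over many frames so runs with different CPUs or clocks can be compared (the renderFrame callback is ours and must draw and present one frame):

#include <windows.h>

double AverageFrameMs(void (*renderFrame)(), int frames)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < frames; ++i)
        renderFrame();                    // draws and presents one frame
    QueryPerformanceCounter(&t1);
    return 1000.0 * (double)(t1.QuadPart - t0.QuadPart)
                  / ((double)freq.QuadPart * (double)frames);
}
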
Ok, now I know what my application's bottleneck is. How do I get rid of it and make my application run faster?

If you are CPU-limited: Try running VTune or a similar performance tool to find out where most of your time is being spent. Note that the graphics driver is a potential CPU consumer, particularly if you are using the GPU in non-standard ways. One common way to lose parallelism between the CPU and the GPU is locking resources (vertex buffers or textures), or reading back data from the GPU to the CPU.
If you are AGP-bandwidth-limited: Make sure your AGP settings are maximized.  Transfer less data per frame to the GPU.  Today, we see very few applications that are AGP-bandwidth-limited.
If you are shader-limited: Make sure you've balanced the workload between the vertex and fragment programs. For example, calculations that can be linearly interpolated belong in the vertex shader, not in the fragment shader. Use only the amount of precision that you need (choose between float, half, and fixed data types prudently). Try encoding functions in textures (see the sketch after this list).
If you are GPU bandwidth-limited: Try reducing the size of your textures. You may also be performing too many blending operations, which are costly.
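
The texture-encoding idea from the shader-limited item, sketched in legacy OpenGL: bake pow(x, 16) into a 1D lookup texture once at load time, so the fragment stage does a cheap fetch instead of evaluating the function. The exponent and table size are illustrative:

#include <GL/gl.h>
#include <cmath>

GLuint MakePowLookupTexture()
{
    unsigned char table[256];
    for (int i = 0; i < 256; ++i) {
        double x = i / 255.0;
        table[i] = (unsigned char)(std::pow(x, 16.0) * 255.0 + 0.5);
    }
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_1D, tex);
    glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage1D(GL_TEXTURE_1D, 0, GL_LUMINANCE, 256, 0,
                 GL_LUMINANCE, GL_UNSIGNED_BYTE, table);
    return tex;
}
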
How do I time my rendering code? (How do I know how long it takes the GPU to render something?)

The wrong answer is to time all Direct3D or OpenGL calls. This simply times how long it takes the driver to submit the rendering request to the push-buffer. The actual rendering work is done asynchronously and later. There is no direct way for you to measure how long the GPU takes to process a particular rendering call.
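
That said, a common approximation (our suggestion, not the FAQ's) is to drain the pipeline with glFinish() around the workload. This destroys CPU/GPU parallelism, so it is only suitable for isolated measurements, never for shipping code:

#include <windows.h>
#include <GL/gl.h>

double TimeWorkloadMs(void (*submitWork)())
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    glFinish();                        // drain anything already queued
    QueryPerformanceCounter(&t0);
    submitWork();                      // the rendering calls to measure
    glFinish();                        // block until the GPU has finished them
    QueryPerformanceCounter(&t1);
    return 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}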
