Overhaul of the Integration Kernels
Since the beginning of Octane, the integration kernels had one CUDA thread calculate one complete sample. We changed this for various reasons, the main one being that the integration kernels had become huge and impossible to optimize. OSL and OpenCL are also pretty much impossible to implement this way. To solve the problem, we split the big task of calculating a sample into smaller steps, which are then processed one by one by the CUDA threads. I.e. many more kernel calls happen now than in the past.
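The idea of splitting one big per-sample kernel into small per-step launches can be sketched as follows. This is a minimal illustration in Python, not Octane's actual code: the step names, state fields and "hit" logic are all invented for the example.

```python
# Illustrative sketch: instead of one huge kernel doing a whole sample,
# several small "step" kernels are launched one after another, each
# operating on the batch of in-flight sample states.
# All names and fields here are hypothetical, not Octane internals.

def generate_camera_rays(states):
    # Step 1: every in-flight sample gets a primary ray.
    for s in states:
        s["ray"] = ("camera", s["pixel"])

def intersect_scene(states):
    # Step 2: trace all rays in one batch (uniform work per launch).
    for s in states:
        s["hit"] = hash(s["ray"]) % 2 == 0  # stand-in for a real BVH query

def shade(states):
    # Step 3: accumulate radiance for the samples that hit something.
    for s in states:
        s["radiance"] = 1.0 if s["hit"] else 0.0

def render_parallel_samples(pixels):
    # The per-sample state kept alive *between* launches is what costs
    # the extra GPU memory discussed below.
    states = [{"pixel": p} for p in pixels]
    for step in (generate_camera_rays, intersect_scene, shade):
        step(states)   # one kernel launch per step instead of one mega-kernel
    return [s["radiance"] for s in states]
```

Each launch now does one narrow, coherent job, which is also what makes the individual kernels small enough to optimize.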
There are two major consequences of this new approach: Octane needs to keep information for every sample that is calculated in parallel between kernel calls, which requires additional GPU memory. And the CPU is stressed a bit more, since it has to perform many more kernel launches. To give you some control over the kernel execution, we added two options to the direct lighting / path tracing / info channel kernel nodes:
+ "Parallel samples" controls how many samples we calculate in parallel. If you set it to a small value, Octane requires less memory to store the sample states, but will most likely render a bit slower. If you set it to a high value, more graphics memory is needed, but rendering becomes faster. The change in performance depends on the scene, the GPU architecture and the number of shader processors the GPU has.
+ "Max. tile samples" controls the number of samples per pixel Octane renders before it takes the result and stores it in the film buffer. A higher number means that results arrive less often in the film buffer, but it reduces the CPU overhead during rendering and can consequently improve performance, too.
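The memory side of the "parallel samples" trade-off is simple arithmetic: the extra VRAM is roughly the per-sample state size times the number of samples kept in flight. The state size below is a made-up figure for illustration, not Octane's actual footprint:

```python
MB = 1024 * 1024

def extra_vram_mb(parallel_samples, state_bytes_per_sample=512):
    # state_bytes_per_sample is a hypothetical figure; the real per-sample
    # state size depends on the kernel type and enabled features.
    return parallel_samples * state_bytes_per_sample / MB

# e.g. 2 million samples in flight at 512 bytes each would need ~1 GB of
# extra VRAM, while a quarter of that sample count needs a quarter of it.
```

Lowering the setting frees memory at the cost of keeping fewer shader processors busy, which is why the sweet spot depends on the GPU and the scene.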
Comparison of VRAM/RAM Usage Capabilities
Here is a comparison table between V2 and V3:
|                | V2                                      | V3                                    |
| -------------- | --------------------------------------- | ------------------------------------- |
| Render buffers | VRAM (GPU)                              | VRAM (GPU) + RAM (system)             |
| Textures       | VRAM; out-of-core: VRAM + RAM           | VRAM; out-of-core: VRAM + RAM         |
| Geometry       | VRAM; triangle count: max. 19.6 million | VRAM; triangle count: max. 76 million |
Speed
It's hard to quantify the performance impact, but what we have seen during testing is that in simple scenes (like the chess set or Cornell boxes) the old system was hard to beat. That is because in these scenes the samples of neighbouring pixels are very coherent (similar), which GPUs can process very fast, since the CUDA threads do almost the same work and don't have to wait for each other. In these cases you usually have plenty of VRAM left, which means you can bump up the "parallel samples" to the maximum, making the new system as fast or almost as fast as the old system.
The problem is that in real production scenes the execution of CUDA threads diverges very quickly, causing CUDA threads to wait a long time for other CUDA threads to finish some work, i.e. to twiddle their thumbs. For these more complex scenes the new system usually works better, since coherence is increased by the way each step is processed. And we can optimize the kernels more, because the scope of each task is much narrower. So you usually see a speed-up for complex scenes, even with the default parallel samples setting or a lower value (in case you are struggling with memory).
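The cost of that thumb-twiddling can be shown with toy arithmetic: threads in a warp effectively run in lockstep, so with one thread per complete sample, threads with short paths idle until the longest path in the batch finishes. The path costs below are invented for illustration:

```python
def megakernel_utilization(path_costs):
    # One thread per sample for the whole path: threads that finish a
    # short path idle until the most expensive path in the batch is done.
    busy = sum(path_costs)
    occupied = len(path_costs) * max(path_costs)
    return busy / occupied

# Invented workload: one hard pixel among many easy neighbours.
costs = [2, 2, 2, 2, 2, 2, 2, 16]
# Utilization is sum(costs) / (8 * 16) = 30/128, i.e. under 25% --
# most threads spend most of their time waiting.
```

With the split-step approach, finished samples can be swapped out between launches, so each launch runs on a full batch of live, similar work instead.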
TLDR Version
In simple scenes where you've got plenty of VRAM left: increase "parallel samples" to the maximum.
In complex scenes where VRAM is scarce: set it to the highest value that doesn't run out of memory. Rendering should usually still be faster than before, or at least roughly the same speed.
Moved Film Buffers to the Host and Tiled Rendering
The second major refactoring in the render core was the way we store render results. Until v3, each GPU had its own film buffer, where part of the calculated samples were aggregated. This has various drawbacks: for example, a CUDA error usually means that you lose the samples calculated by that GPU, and a crashing or disconnected slave means you lose its samples. Another problem was that large images mean a large film buffer, especially if you enable render passes. And deep image rendering would have been pretty much impossible, since it is very memory-hungry. Implementing save-and-resume would have been a pain, too.
To solve these issues, we moved the film buffer into host memory. That doesn't sound exciting, but it has some major consequences. The biggest one is that Octane now has to deal with the huge amount of data that the GPUs produce, especially in multi-GPU setups or when network rendering is used. As a solution, we introduced tiled rendering for all integration kernels except PMC (where tiled rendering is not possible). The tiles are relatively large (compared to most other renderers), and we tried to hide tiled rendering as much as we could.
Of course, keeping the film buffer in system memory means more memory usage, so make sure that you have enough RAM installed before you crank up the resolution (which is now straightforward to do). Another consequence is that the CPU has to merge render results from the various sources, like local GPUs or net render slaves, into the film buffers, which requires some computational power. We tried to optimize that area, but there is obviously an impact on CPU usage. Let us know if you run into issues here. Again, increasing the "max. tile samples" option in the kernels allows you to reduce the overhead accordingly (see above). Info passes are now rendered in parallel, too, since we can reuse the same tile buffer on the GPU that is used for rendering beauty passes.
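Conceptually, the merge work the CPU now performs is a weighted accumulation of tile results into the host buffer. The sketch below is heavily simplified (real film buffers hold many channels and passes, and the class and method names are made up):

```python
class HostFilmBuffer:
    """Simplified host-side film buffer: per pixel, a radiance sum and a
    sample count, so tiles from any GPU or slave can merge in any order."""

    def __init__(self, width, height):
        self.sum = [[0.0] * width for _ in range(height)]
        self.count = [[0] * width for _ in range(height)]

    def merge_tile(self, x0, y0, tile_sums, tile_counts):
        # Accumulate a finished tile (e.g. from a GPU or a net render
        # slave) at offset (x0, y0) into the full-resolution buffer.
        for dy, row in enumerate(tile_sums):
            for dx, s in enumerate(row):
                self.sum[y0 + dy][x0 + dx] += s
                self.count[y0 + dy][x0 + dx] += tile_counts[dy][dx]

    def pixel(self, x, y):
        # Resolve to a displayable value: the mean of all samples so far.
        c = self.count[y][x]
        return self.sum[y][x] / c if c else 0.0
```

Because each pixel stores a running sum and count rather than a finished color, results from different GPUs or slaves can arrive in any order and still average correctly.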
Overhauled Work Distribution in Network Rendering
We also had to modify how render work is distributed to net render slaves and how their results are sent back, to make it work with the new film buffer. The biggest problem to solve was the fact that transmitting samples to the master is one to two orders of magnitude slower than generating them on the slave. The only way to solve this is to aggregate samples on the slaves and decouple the work distribution from the result transmission. This has the nice side effect that rendering at large resolutions (like stereo GearVR cube maps) doesn't throttle the slaves anymore.
Of course, caching results on the slaves means that they require more system memory than in the past, and if the tiles rendered by a slave are distributed uniformly, the slave will produce a big pile of cached tiles that eventually needs to be transmitted to the master. I.e. after all samples have been rendered, the master still needs to receive all those cached results from the slaves, which can take quite some time. To solve this problem we introduced an additional option for the kernel nodes that support tiled rendering:
"Minimize net traffic", if enabled, distributes only the same tile to the net render slaves until the max. samples/pixel has been reached for that tile, and only then distributes the next tile to the slaves. Work done by local GPUs is not affected by this option. This way a slave can merge all its results into the same cached tile until the master switches to a different tile. Of course, you should set the maximum samples/pixel to something reasonable, or network rendering will focus on the first tile for a very long time.
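The difference between the two distribution strategies can be sketched as a work scheduler. Tile IDs, pass sizes and the function name below are illustrative, not Octane's actual scheduler:

```python
def schedule_tiles(tile_ids, max_samples_per_pixel, samples_per_pass,
                   minimize_net_traffic):
    """Yield (tile_id, pass_index) work items for net render slaves.

    With minimize_net_traffic=True the same tile is handed out until its
    max. samples/pixel budget is exhausted, so a slave keeps merging
    results into one cached tile before the master moves on."""
    passes = max_samples_per_pixel // samples_per_pass
    if minimize_net_traffic:
        for tile in tile_ids:          # finish one tile completely...
            for p in range(passes):    # ...before touching the next
                yield tile, p
    else:
        for p in range(passes):        # sweep all tiles on every pass
            for tile in tile_ids:
                yield tile, p

order = [t for t, _ in schedule_tiles([0, 1], 4, 2, True)]
# order is [0, 0, 1, 1]: tile 0 is fully rendered before tile 1 is handed out
```

This also makes clear why a sensible max. samples/pixel matters with the option enabled: the budget is what tells the master when to finally move past the first tile.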