Ryujinx/src/Ryujinx.Graphics.Vulkan/Shaders/ConvertD32S8ToD24S8ShaderSource.comp

#version 450 core

#extension GL_EXT_scalar_block_layout : require

layout (local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout (std430, set = 0, binding = 0) uniform stride_arguments
{
    int pixelCount;
    int dstStartOffset;
};

layout (std430, set = 1, binding = 1) buffer in_s
{
    uint[] in_data;
};

layout (std430, set = 1, binding = 2) buffer out_s
{
    uint[] out_data;
};

void main()
{
    // Determine what slice of the pixel conversions this invocation will perform.
    int invocations = int(gl_WorkGroupSize.x * gl_NumWorkGroups.x);

    int copiesRequired = pixelCount;

    // Find the copies that this invocation should perform.
    
    // - Copies that all invocations perform.
    int allInvocationCopies = copiesRequired / invocations;

    // - Extra remainder copy that this invocation performs.
    int index = int(gl_GlobalInvocationID.x);
    int extra = (index < (copiesRequired % invocations)) ? 1 : 0;

    int copyCount = allInvocationCopies + extra;

    // Finally, get the starting offset. Make sure to count extra copies.

    int startCopy = allInvocationCopies * index + min(copiesRequired % invocations, index);
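
    // Worked example (illustrative values, not taken from the source): with
    // pixelCount = 10 and 4 invocations in total, allInvocationCopies = 2 and
    // the remainder is 2, so the work splits as:
    //   index 0: copyCount 3, startCopy 0
    //   index 1: copyCount 3, startCopy 3
    //   index 2: copyCount 2, startCopy 6
    //   index 3: copyCount 2, startCopy 8
    // The ranges are contiguous and cover all 10 pixels exactly once.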

    int srcOffset = startCopy * 2;
    int dstOffset = dstStartOffset + startCopy;

    // Perform the conversion for this region.
    for (int i = 0; i < copyCount; i++)
    {
        float depth = uintBitsToFloat(in_data[srcOffset++]);
        uint stencil = in_data[srcOffset++];

        uint rescaledDepth = uint(clamp(depth, 0.0, 1.0) * 16777215.0);

        out_data[dstOffset++] = (rescaledDepth << 8) | (stencil & 0xff);
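
        // Worked example (illustrative values, not taken from the source):
        //   depth = 1.0, stencil = 0x01 -> rescaledDepth = 0xFFFFFF, packed = 0xFFFFFF01
        //   depth = 0.5, stencil = 0x7F -> rescaledDepth = 0x7FFFFF (8388607.5 truncated),
        //                                  packed = 0x7FFFFF7F
        // Depth lands in the upper 24 bits and stencil in the low 8 bits.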
    }
}