Compare commits
6 Commits
dynarmic-d ... 3096/qcom/

| Author | SHA1 | Date |
|---|---|---|
|  | c0a6eb8fdd |  |
|  | dec889adb1 |  |
|  | b8b4fe9a45 |  |
|  | 270fdb978d |  |
|  | 19863048b4 |  |
|  | fa0859814c |  |
@@ -13,4 +13,3 @@ This contains documentation created by developers. This contains build instructi
|
||||
- **[The NVIDIA SM86 (Maxwell) GPU](./NvidiaGpu.md)**
|
||||
- **[User Handbook](./user)**
|
||||
- **[Release Policy](./ReleasePolicy.md)**
|
||||
- **[Dynarmic](./dynarmic)**
|
||||
|
||||
@@ -49,7 +49,7 @@ Important API Changes in v6.x Series
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Design documentation can be found at [./Design.md](./Design.md).
|
||||
Design documentation can be found at [docs/Design.md](docs/Design.md).
|
||||
|
||||
|
||||
Usage Example
|
||||
@@ -343,279 +343,3 @@ SetTerm(IR::Term::If{cond, term_then, term_else})
|
||||
|
||||
This terminal instruction conditionally executes one terminal or another depending
|
||||
on the run-time state of the ARM flags.
|
||||
|
||||
# Register Allocation (x64 Backend)
|
||||
|
||||
`HostLoc`s contain values. A `HostLoc` ("host value location") is either a host CPU register or a host spill location.
|
||||
|
||||
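As an illustration only (this is not dynarmic's actual definition), a `HostLoc` can be pictured as an enumeration over the host registers followed by spill slots:

```c++
// Illustrative sketch: a value lives in exactly one HostLoc at any given time.
enum class HostLoc {
    // Host general-purpose registers
    RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
    R8, R9, R10, R11, R12, R13, R14, R15,
    // Host XMM registers
    XMM0, XMM1, XMM2, /* ... */
    // Spill slots in the JIT state follow the register entries
    FirstSpill,
};
```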
Values once set cannot be changed. Values can however be moved by the register allocator between `HostLoc`s. This is
|
||||
handled by the register allocator itself and code that uses the register allocator need not and should not move values
|
||||
between registers.
|
||||
|
||||
The register allocator is based on three concepts: `Use`, `Define` and `Scratch`.
|
||||
|
||||
* `Use`: The use of a value.
|
||||
* `Define`: The definition of a value, this is the only time when a value is set.
|
||||
* `Scratch`: Allocate a register that can be freely modified as one wishes.
|
||||
|
||||
Note that `Use`ing a value decrements its `use_count` by one. When the `use_count` reaches zero the value is discarded and no longer exists.
|
||||
|
||||
The member functions on `RegAlloc` are just a combination of the above concepts.
|
||||
|
||||
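As an illustration, a hypothetical emitter function might combine the three concepts like this. Everything other than the `RegAlloc` calls (the `EmitAdd32` name, the `args` helper, the surrounding context) is invented for this sketch and follows the signatures listed below:

```c++
// Hypothetical sketch: emit a 32-bit add using Use, a scratch-style Use, and Define.
void EmitX64::EmitAdd32(RegAlloc& reg_alloc, IR::Inst* inst) {
    auto args = reg_alloc.GetArgumentInfo(inst);  // assumed helper for this sketch

    // UseScratch: the first operand's register will be overwritten with the result.
    Xbyak::Reg32 result = reg_alloc.UseScratchGpr(args[0]).cvt32();
    // Use: the second operand is only read, so its register must not be modified.
    Xbyak::Reg32 addend = reg_alloc.UseGpr(args[1]).cvt32();

    code->add(result, addend);

    // Define: the single point at which the value of `inst` is set.
    reg_alloc.DefineValue(inst, result);
}
```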
The following registers are reserved for internal use and should NOT participate in register allocation:
|
||||
- `%xmm0`, `%xmm1`, `%xmm2`: Used as scratch in exclusive memory access.
|
||||
- `%rsp`: Stack pointer.
|
||||
- `%r15`: JIT state pointer.
|
||||
- `%r14`: Page table pointer.
|
||||
- `%r13`: Fastmem pointer.
|
||||
|
||||
The layout designates `%r15` as the JIT state pointer - while it may be tempting to turn it into a synthetic pointer, dedicating an entire register (out of the 12 available) is preferable to inlining a directly computed immediate.
|
||||
|
||||
Never modify `%r15`; this register must be treated as "immutable" for the entire duration of a JIT block.
|
||||
|
||||
### `Scratch`
|
||||
|
||||
```c++
|
||||
Xbyak::Reg64 ScratchGpr(HostLocList desired_locations = any_gpr);
|
||||
Xbyak::Xmm ScratchXmm(HostLocList desired_locations = any_xmm);
|
||||
```
|
||||
|
||||
At runtime, allocate one of the registers in `desired_locations`. You are free to modify the register. The register is discarded at the end of the allocation scope.
|
||||
|
||||
### Pure `Use`
|
||||
|
||||
```c++
|
||||
Xbyak::Reg64 UseGpr(Argument& arg);
|
||||
Xbyak::Xmm UseXmm(Argument& arg);
|
||||
OpArg UseOpArg(Argument& arg);
|
||||
void Use(Argument& arg, HostLoc host_loc);
|
||||
```
|
||||
|
||||
At runtime, the value corresponding to `arg` will be placed in a register. The actual register is determined by
|
||||
which one of the above functions is called. `UseGpr` places it in an unused GPR, `UseXmm` places it
|
||||
in an unused XMM register, `UseOpArg` might be in a register or might be a memory location, and `Use` allows
|
||||
you to specify a specific register (GPR or XMM) to use.
|
||||
|
||||
This register **must not** have its value changed.
|
||||
|
||||
### `UseScratch`
|
||||
|
||||
```c++
|
||||
Xbyak::Reg64 UseScratchGpr(Argument& arg);
|
||||
Xbyak::Xmm UseScratchXmm(Argument& arg);
|
||||
void UseScratch(Argument& arg, HostLoc host_loc);
|
||||
```
|
||||
|
||||
At runtime, the value corresponding to `arg` will be placed in a register. The actual register is determined by
|
||||
which one of the above functions is called. `UseScratchGpr` places it in an unused GPR, `UseScratchXmm` places it
|
||||
in an unused XMM register, and `UseScratch` allows you to specify a specific register (GPR or XMM) to use.
|
||||
|
||||
The return value is the register allocated to you.
|
||||
|
||||
You are free to modify the value in the register. The register is discarded at the end of the allocation scope.
|
||||
|
||||
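One case where the specific-`HostLoc` overloads are genuinely needed is an x64 instruction that hard-codes a register: variable shifts, for example, read their shift amount from `CL`. A hypothetical handler (all names besides the `RegAlloc` calls are invented for this sketch) might look like:

```c++
// Hypothetical sketch: x64 variable shifts take their count in CL (the low byte of RCX).
void EmitX64::EmitLogicalShiftLeft32(RegAlloc& reg_alloc, IR::Inst* inst) {
    using namespace Xbyak::util;  // provides cl, as in the other snippets in these docs

    auto args = reg_alloc.GetArgumentInfo(inst);  // assumed helper for this sketch

    // The operand being shifted is overwritten in place, so it is a UseScratch.
    Xbyak::Reg32 result = reg_alloc.UseScratchGpr(args[0]).cvt32();
    // The shift amount is only read, but it must live in RCX so that CL holds it.
    reg_alloc.Use(args[1], HostLoc::RCX);

    code->shl(result, cl);

    reg_alloc.DefineValue(inst, result);
}
```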
### `Define` as register
|
||||
|
||||
A `Define` is the definition of a value. This is the only time when a value may be set.
|
||||
|
||||
```c++
|
||||
void DefineValue(IR::Inst* inst, const Xbyak::Reg& reg);
|
||||
```
|
||||
|
||||
By calling `DefineValue`, you are stating that you wish to define the value for `inst`, and you have written the
|
||||
value to the specified register `reg`.
|
||||
|
||||
### `Define`ing as an alias of a different value
|
||||
|
||||
Adding a `Define` to an existing value.
|
||||
|
||||
```c++
|
||||
void DefineValue(IR::Inst* inst, Argument& arg);
|
||||
```
|
||||
|
||||
You are declaring that the value for `inst` is the same as the value for `arg`. No host machine instructions are
|
||||
emitted.
|
||||
|
||||
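For example, an IR operation whose result is bit-identical to its input can simply forward its argument (a hypothetical sketch; only `DefineValue` comes from the interface described above):

```c++
// Hypothetical sketch: no host instructions are emitted; the existing value
// merely gains an additional Define for `inst`.
void EmitX64::EmitIdentity(RegAlloc& reg_alloc, IR::Inst* inst) {
    auto args = reg_alloc.GetArgumentInfo(inst);  // assumed helper for this sketch
    reg_alloc.DefineValue(inst, args[0]);
}
```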
## When to use each?
|
||||
|
||||
* Prefer `Use` to `UseScratch` where possible.
|
||||
* Prefer the `OpArg` variants where possible.
|
||||
* Prefer to **not** use the specific `HostLoc` variants where possible.
|
||||
|
||||
# Return Stack Buffer Optimization (x64 Backend)
|
||||
|
||||
One of the optimizations that dynarmic does is block-linking. Block-linking is done when
|
||||
the destination address of a jump is available at JIT-time. Instead of returning to the
|
||||
dispatcher at the end of a block we can perform block-linking: just jump directly to the
|
||||
next block. This is beneficial because returning to the dispatcher can often be quite
|
||||
expensive.
|
||||
|
||||
What should we do in cases where we can't predict the destination address? The canonical
|
||||
example is when executing a return statement at the end of a function; the return address
|
||||
is not statically known at compile time.
|
||||
|
||||
We deal with this by using a return stack buffer: When we execute a call instruction,
|
||||
we push our prediction onto the RSB. When we execute a return instruction, we pop a
|
||||
prediction off the RSB. If the prediction is a hit, we immediately jump to the relevant
|
||||
compiled block. Otherwise, we return to the dispatcher.
|
||||
|
||||
This is the essential idea behind this optimization.
|
||||
|
||||
## `UniqueHash`
|
||||
|
||||
One complication dynarmic has is that a compiled block is not uniquely identifiable by
|
||||
the PC alone, but bits in the FPSCR and CPSR are also relevant. We resolve this by
|
||||
computing a 64-bit `UniqueHash` that is guaranteed to uniquely identify a block.
|
||||
|
||||
```c++
|
||||
u64 LocationDescriptor::UniqueHash() const {
|
||||
// This value MUST BE UNIQUE.
|
||||
// This calculation has to match up with EmitX64::EmitTerminalPopRSBHint
|
||||
u64 pc_u64 = u64(arm_pc) << 32;
|
||||
u64 fpscr_u64 = u64(fpscr.Value());
|
||||
u64 t_u64 = cpsr.T() ? 1 : 0;
|
||||
u64 e_u64 = cpsr.E() ? 2 : 0;
|
||||
return pc_u64 | fpscr_u64 | t_u64 | e_u64;
|
||||
}
|
||||
```
|
||||
|
||||
## Our implementation isn't actually a stack
|
||||
|
||||
Dynarmic's RSB isn't actually a stack. It was implemented as a ring buffer because
|
||||
that showed better performance in tests.
|
||||
|
||||
### RSB Structure
|
||||
|
||||
The RSB is implemented as a ring buffer. `rsb_ptr` is the index of the insertion
|
||||
point. Each element in `rsb_location_descriptors` is a `UniqueHash` and they
|
||||
each correspond to an element in `rsb_codeptrs`. `rsb_codeptrs` contains the
|
||||
host addresses of the corresponding compiled blocks.
|
||||
|
||||
`RSBSize` was chosen by performance testing. Note that this is bigger than the
|
||||
size of the real RSB in hardware (which has 3 entries). RSBs larger than 8
|
||||
showed degraded performance.
|
||||
|
||||
```c++
|
||||
struct JitState {
|
||||
// ...
|
||||
|
||||
static constexpr size_t RSBSize = 8; // MUST be a power of 2.
|
||||
u32 rsb_ptr = 0;
|
||||
std::array<u64, RSBSize> rsb_location_descriptors;
|
||||
std::array<u64, RSBSize> rsb_codeptrs;
|
||||
void ResetRSB();
|
||||
|
||||
// ...
|
||||
};
|
||||
```
|
||||
|
||||
### RSB Push
|
||||
|
||||
We insert our prediction at the insertion point iff the RSB doesn't already
|
||||
contain a prediction with the same `UniqueHash`.
|
||||
|
||||
```c++
|
||||
void EmitX64::EmitPushRSB(IR::Block&, IR::Inst* inst) {
|
||||
using namespace Xbyak::util;
|
||||
|
||||
ASSERT(inst->GetArg(0).IsImmediate());
|
||||
u64 imm64 = inst->GetArg(0).GetU64();
|
||||
|
||||
Xbyak::Reg64 code_ptr_reg = reg_alloc.ScratchGpr(code, {HostLoc::RCX});
|
||||
Xbyak::Reg64 loc_desc_reg = reg_alloc.ScratchGpr(code);
|
||||
Xbyak::Reg32 index_reg = reg_alloc.ScratchGpr(code).cvt32();
|
||||
u64 code_ptr = unique_hash_to_code_ptr.find(imm64) != unique_hash_to_code_ptr.end()
|
||||
? u64(unique_hash_to_code_ptr[imm64])
|
||||
: u64(code->GetReturnFromRunCodeAddress());
|
||||
|
||||
code->mov(index_reg, dword[code.ABI_JIT_PTR + offsetof(JitState, rsb_ptr)]);
|
||||
code->add(index_reg, 1);
|
||||
code->and_(index_reg, u32(JitState::RSBSize - 1));
|
||||
|
||||
code->mov(loc_desc_reg, u64(imm64));
|
||||
CodePtr patch_location = code->getCurr<CodePtr>();
|
||||
patch_unique_hash_locations[imm64].emplace_back(patch_location);
|
||||
code->mov(code_ptr_reg, u64(code_ptr)); // This line has to match up with EmitX64::Patch.
|
||||
code->EnsurePatchLocationSize(patch_location, 10);
|
||||
|
||||
Xbyak::Label label;
|
||||
for (size_t i = 0; i < JitState::RSBSize; ++i) {
|
||||
code->cmp(loc_desc_reg, qword[code.ABI_JIT_PTR + offsetof(JitState, rsb_location_descriptors) + i * sizeof(u64)]);
|
||||
code->je(label, code->T_SHORT);
|
||||
}
|
||||
|
||||
code->mov(dword[code.ABI_JIT_PTR + offsetof(JitState, rsb_ptr)], index_reg);
|
||||
code->mov(qword[code.ABI_JIT_PTR + index_reg.cvt64() * 8 + offsetof(JitState, rsb_location_descriptors)], loc_desc_reg);
|
||||
code->mov(qword[code.ABI_JIT_PTR + index_reg.cvt64() * 8 + offsetof(JitState, rsb_codeptrs)], code_ptr_reg);
|
||||
code->L(label);
|
||||
}
|
||||
```
|
||||
|
||||
In pseudocode:
|
||||
|
||||
```c++
|
||||
for (i := 0 .. RSBSize-1)
|
||||
if (rsb_location_descriptors[i] == imm64)
|
||||
goto label;
|
||||
rsb_ptr++;
|
||||
rsb_ptr %= RSBSize;
|
||||
rsb_location_descriptors[rsb_ptr] = imm64; //< The UniqueHash
|
||||
rsb_codeptrs[rsb_ptr] = /* codeptr corresponding to the UniqueHash */;
|
||||
label:
|
||||
```
|
||||
|
||||
## RSB Pop
|
||||
|
||||
To check if a prediction is in the RSB, we linearly scan the RSB.
|
||||
|
||||
```c++
|
||||
void EmitX64::EmitTerminalPopRSBHint(IR::Term::PopRSBHint, IR::LocationDescriptor initial_location) {
|
||||
using namespace Xbyak::util;
|
||||
|
||||
// This calculation has to match up with IREmitter::PushRSB
|
||||
code->mov(ecx, MJitStateReg(Arm::Reg::PC));
|
||||
code->shl(rcx, 32);
|
||||
code->mov(ebx, dword[code.ABI_JIT_PTR + offsetof(JitState, FPSCR_mode)]);
|
||||
code->or_(ebx, dword[code.ABI_JIT_PTR + offsetof(JitState, CPSR_et)]);
|
||||
code->or_(rbx, rcx);
|
||||
|
||||
code->mov(rax, u64(code->GetReturnFromRunCodeAddress()));
|
||||
for (size_t i = 0; i < JitState::RSBSize; ++i) {
|
||||
code->cmp(rbx, qword[code.ABI_JIT_PTR + offsetof(JitState, rsb_location_descriptors) + i * sizeof(u64)]);
|
||||
code->cmove(rax, qword[code.ABI_JIT_PTR + offsetof(JitState, rsb_codeptrs) + i * sizeof(u64)]);
|
||||
}
|
||||
|
||||
code->jmp(rax);
|
||||
}
|
||||
```
|
||||
|
||||
In pseudocode:
|
||||
|
||||
```c++
|
||||
rbx := ComputeUniqueHash()
|
||||
rax := ReturnToDispatch
|
||||
for (i := 0 .. RSBSize-1)
|
||||
if (rbx == rsb_location_descriptors[i])
|
||||
rax = rsb_codeptrs[i]
|
||||
goto rax
|
||||
```
|
||||
|
||||
# Fast memory (Fastmem)
|
||||
|
||||
The main way of accessing memory in JITed programs is via invoked functions, say `Read()` and `Write()`. In our translator, such calls usually take a sizable amount of code space (push + call + pop), thrash the i-cache (due to the indirect call), and overall make code emission more bloated.
|
||||
|
||||
The solution? Delegate invalid accesses to a dedicated arena, similar to a swap. The main idea behind this mechanism is to let the OS deliver page faults from invalid accesses directly to the JIT translator, bypassing explicit address-space calls. While this sacrifices some i-cache coherency, it allows for smaller code size and "faster" throughput.
|
||||
|
||||
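To make the contrast with the call-based approach concrete, here is a hedged sketch (not dynarmic's actual emitter code) of the two strategies, in the same Xbyak style used elsewhere in these docs. Every name except `%r13`, the reserved fastmem arena pointer, is a placeholder:

```c++
// Conventional path: marshal the address and call an out-of-line helper.
void EmitRead32_CallPath(Xbyak::CodeGenerator* code, Xbyak::Reg64 guest_addr, const void* read32_helper) {
    using namespace Xbyak::util;
    code->mov(rdi, guest_addr);   // first argument register in the System V ABI
    code->call(read32_helper);    // push/call/ret overhead plus i-cache pressure
}

// Fastmem path: a single load through the arena base kept in %r13. An access to
// an unmapped guest page faults, and the host fault handler redirects execution
// to the slow path.
void EmitRead32_Fastmem(Xbyak::CodeGenerator* code, Xbyak::Reg32 result, Xbyak::Reg64 guest_addr) {
    using namespace Xbyak::util;
    code->mov(result, dword[r13 + guest_addr]);
}
```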
Many kernels, however, do not support fast signal dispatching (Solaris, OpenBSD, FreeBSD). Only Linux and Windows support relatively "fast" signal dispatching, so this feature is best suited to those platforms.
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
On x86_64, for example, when a page fault occurs the CPU transmits the appropriate arguments for a page fault handler via control registers and the stack (see `IRETQ`); the OS then transforms that into something that can be delivered to userspace.
|
||||
|
||||
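Continuing the x86_64/Linux example, here is a minimal, self-contained sketch of the host-side mechanism (illustrative only, not dynarmic's implementation): the arena is reserved as inaccessible memory, and a `SIGSEGV` handler checks whether the faulting address falls inside the arena. A real JIT would redirect execution to the slow `Read()`/`Write()` path at that point; the sketch simply maps the page and retries.

```c++
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <signal.h>
#include <sys/mman.h>

namespace {

std::uint8_t* g_arena = nullptr;
constexpr std::size_t g_arena_size = std::size_t{1} << 32;  // a full 32-bit guest space

void FaultHandler(int, siginfo_t* info, void*) {
    auto* addr = static_cast<std::uint8_t*>(info->si_addr);
    if (addr >= g_arena && addr < g_arena + g_arena_size) {
        // A guest access hit an unmapped page. A real JIT would patch the access
        // or branch to the slow path here; this sketch just maps the page (assuming
        // 4 KiB pages) so the faulting instruction can be re-executed.
        void* page = reinterpret_cast<void*>(
            reinterpret_cast<std::uintptr_t>(addr) & ~std::uintptr_t{0xFFF});
        mmap(page, 0x1000, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        return;
    }
    // Not an arena fault: restore default behaviour and re-raise.
    signal(SIGSEGV, SIG_DFL);
    raise(SIGSEGV);
}

}  // namespace

int main() {
    // Reserve (but do not commit) the arena; every page starts out inaccessible.
    void* reservation = mmap(nullptr, g_arena_size, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (reservation == MAP_FAILED) {
        return 1;
    }
    g_arena = static_cast<std::uint8_t*>(reservation);

    struct sigaction sa{};
    sa.sa_sigaction = FaultHandler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);

    g_arena[0x1234] = 42;  // faults once; the handler maps the page and the store retries
    std::printf("%d\n", g_arena[0x1234]);
    return 0;
}
```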
Most modern OSes implement kernel page-table isolation, which means that some system calls (the rarely used ones) require a full context switch, whereas others (the frequently used ones, served by the small kernel portion mapped into the process address space) are handled without one. This cost can be mitigated on systems with PCID (up to 4096 unique IDs).
|
||||
|
||||
Signal dispatching takes a performance hit from reloading `%cr3`, but Linux does something cleverer to avoid reloads: the vDSO takes care of the entire thing in the same address space, making dispatching about as costly as an indirect call, without the hazards of increased code size.
|
||||
|
||||
The main downside is the constant i-cache thrashing and the pipeline hazards introduced by the vDSO signal handlers. However, on most benchmarks fastmem performs better than the fallback (Linux only). The approach also exploits the contiguous emulated address space by backing it with an arena, which can then potentially be transparently mapped onto huge pages, reducing TLB walk times.
|
||||
src/dynarmic/docs/Doxyfile (2474 lines, normal file): diff suppressed because it is too large
src/dynarmic/docs/FastMemory.md (19 lines, normal file)
@@ -0,0 +1,19 @@
|
||||
# Fast memory (Fastmem)
|
||||
|
||||
The main way of accessing memory in JITed programs is via invoked functions, say `Read()` and `Write()`. In our translator, such calls usually take a sizable amount of code space (push + call + pop), thrash the i-cache (due to the indirect call), and overall make code emission more bloated.
|
||||
|
||||
The solution? Delegate invalid accesses to a dedicated arena, similar to a swap. The main idea behind this mechanism is to let the OS deliver page faults from invalid accesses directly to the JIT translator, bypassing explicit address-space calls. While this sacrifices some i-cache coherency, it allows for smaller code size and "faster" throughput.
|
||||
|
||||
Many kernels, however, do not support fast signal dispatching (Solaris, OpenBSD, FreeBSD). Only Linux and Windows support relatively "fast" signal dispatching, so this feature is best suited to those platforms.
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
On x86_64, for example, when a page fault occurs the CPU transmits the appropriate arguments for a page fault handler via control registers and the stack (see `IRETQ`); the OS then transforms that into something that can be delivered to userspace.
|
||||
|
||||
Most modern OSes implement kernel page-table isolation, which means that some system calls (the rarely used ones) require a full context switch, whereas others (the frequently used ones, served by the small kernel portion mapped into the process address space) are handled without one. This cost can be mitigated on systems with PCID (up to 4096 unique IDs).
|
||||
|
||||
Signal dispatching takes a performance hit from reloading `%cr3`, but Linux does something cleverer to avoid reloads: the vDSO takes care of the entire thing in the same address space, making dispatching about as costly as an indirect call, without the hazards of increased code size.
|
||||
|
||||
The main downside is the constant i-cache thrashing and the pipeline hazards introduced by the vDSO signal handlers. However, on most benchmarks fastmem performs better than the fallback (Linux only). The approach also exploits the contiguous emulated address space by backing it with an arena, which can then potentially be transparently mapped onto huge pages, reducing TLB walk times.
|
||||
|
Two image files (128 KiB and 98 KiB): width, height, and size unchanged.
src/dynarmic/docs/RegisterAllocator.md (97 lines, normal file)
@@ -0,0 +1,97 @@
|
||||
# Register Allocation (x64 Backend)
|
||||
|
||||
`HostLoc`s contain values. A `HostLoc` ("host value location") is either a host CPU register or a host spill location.
|
||||
|
||||
Values once set cannot be changed. Values can however be moved by the register allocator between `HostLoc`s. This is
|
||||
handled by the register allocator itself and code that uses the register allocator need not and should not move values
|
||||
between registers.
|
||||
|
||||
The register allocator is based on three concepts: `Use`, `Define` and `Scratch`.
|
||||
|
||||
* `Use`: The use of a value.
|
||||
* `Define`: The definition of a value, this is the only time when a value is set.
|
||||
* `Scratch`: Allocate a register that can be freely modified as one wishes.
|
||||
|
||||
Note that `Use`ing a value decrements its `use_count` by one. When the `use_count` reaches zero the value is discarded and no longer exists.
|
||||
|
||||
The member functions on `RegAlloc` are just a combination of the above concepts.
|
||||
|
||||
The following registers are reserved for internal use and should NOT participate in register allocation:
|
||||
- `%xmm0`, `%xmm1`, `%xmm2`: Used as scratch in exclusive memory access.
|
||||
- `%rsp`: Stack pointer.
|
||||
- `%r15`: JIT state pointer.
|
||||
- `%r14`: Page table pointer.
|
||||
- `%r13`: Fastmem pointer.
|
||||
|
||||
The layout designates `%r15` as the JIT state pointer - while it may be tempting to turn it into a synthetic pointer, dedicating an entire register (out of the 12 available) is preferable to inlining a directly computed immediate.
|
||||
|
||||
Never modify `%r15`; this register must be treated as "immutable" for the entire duration of a JIT block.
|
||||
|
||||
### `Scratch`
|
||||
|
||||
```c++
|
||||
Xbyak::Reg64 ScratchGpr(HostLocList desired_locations = any_gpr);
|
||||
Xbyak::Xmm ScratchXmm(HostLocList desired_locations = any_xmm);
|
||||
```
|
||||
|
||||
At runtime, allocate one of the registers in `desired_locations`. You are free to modify the register. The register is discarded at the end of the allocation scope.
|
||||
|
||||
### Pure `Use`
|
||||
|
||||
```c++
|
||||
Xbyak::Reg64 UseGpr(Argument& arg);
|
||||
Xbyak::Xmm UseXmm(Argument& arg);
|
||||
OpArg UseOpArg(Argument& arg);
|
||||
void Use(Argument& arg, HostLoc host_loc);
|
||||
```
|
||||
|
||||
At runtime, the value corresponding to `arg` will be placed in a register. The actual register is determined by
|
||||
which one of the above functions is called. `UseGpr` places it in an unused GPR, `UseXmm` places it
|
||||
in an unused XMM register, `UseOpArg` might be in a register or might be a memory location, and `Use` allows
|
||||
you to specify a specific register (GPR or XMM) to use.
|
||||
|
||||
This register **must not** have its value changed.
|
||||
|
||||
### `UseScratch`
|
||||
|
||||
```c++
|
||||
Xbyak::Reg64 UseScratchGpr(Argument& arg);
|
||||
Xbyak::Xmm UseScratchXmm(Argument& arg);
|
||||
void UseScratch(Argument& arg, HostLoc host_loc);
|
||||
```
|
||||
|
||||
At runtime, the value corresponding to `arg` will be placed in a register. The actual register is determined by
|
||||
which one of the above functions is called. `UseScratchGpr` places it in an unused GPR, `UseScratchXmm` places it
|
||||
in an unused XMM register, and `UseScratch` allows you to specify a specific register (GPR or XMM) to use.
|
||||
|
||||
The return value is the register allocated to you.
|
||||
|
||||
You are free to modify the value in the register. The register is discarded at the end of the allocation scope.
|
||||
|
||||
### `Define` as register
|
||||
|
||||
A `Define` is the definition of a value. This is the only time when a value may be set.
|
||||
|
||||
```c++
|
||||
void DefineValue(IR::Inst* inst, const Xbyak::Reg& reg);
|
||||
```
|
||||
|
||||
By calling `DefineValue`, you are stating that you wish to define the value for `inst`, and you have written the
|
||||
value to the specified register `reg`.
|
||||
|
||||
### `Define`ing as an alias of a different value
|
||||
|
||||
Adding a `Define` to an existing value.
|
||||
|
||||
```c++
|
||||
void DefineValue(IR::Inst* inst, Argument& arg);
|
||||
```
|
||||
|
||||
You are declaring that the value for `inst` is the same as the value for `arg`. No host machine instructions are
|
||||
emitted.
|
||||
|
||||
## When to use each?
|
||||
|
||||
* Prefer `Use` to `UseScratch` where possible.
|
||||
* Prefer the `OpArg` variants where possible.
|
||||
* Prefer to **not** use the specific `HostLoc` variants where possible.
|
||||
src/dynarmic/docs/ReturnStackBufferOptimization.md (157 lines, normal file)
@@ -0,0 +1,157 @@
|
||||
# Return Stack Buffer Optimization (x64 Backend)
|
||||
|
||||
One of the optimizations that dynarmic does is block-linking. Block-linking is done when
|
||||
the destination address of a jump is available at JIT-time. Instead of returning to the
|
||||
dispatcher at the end of a block we can perform block-linking: just jump directly to the
|
||||
next block. This is beneficial because returning to the dispatcher can often be quite
|
||||
expensive.
|
||||
|
||||
What should we do in cases where we can't predict the destination address? The canonical
|
||||
example is when executing a return statement at the end of a function; the return address
|
||||
is not statically known at compile time.
|
||||
|
||||
We deal with this by using a return stack buffer: When we execute a call instruction,
|
||||
we push our prediction onto the RSB. When we execute a return instruction, we pop a
|
||||
prediction off the RSB. If the prediction is a hit, we immediately jump to the relevant
|
||||
compiled block. Otherwise, we return to the dispatcher.
|
||||
|
||||
This is the essential idea behind this optimization.
|
||||
|
||||
## `UniqueHash`
|
||||
|
||||
One complication dynarmic has is that a compiled block is not uniquely identifiable by
|
||||
the PC alone, but bits in the FPSCR and CPSR are also relevant. We resolve this by
|
||||
computing a 64-bit `UniqueHash` that is guaranteed to uniquely identify a block.
|
||||
|
||||
```c++
|
||||
u64 LocationDescriptor::UniqueHash() const {
|
||||
// This value MUST BE UNIQUE.
|
||||
// This calculation has to match up with EmitX64::EmitTerminalPopRSBHint
|
||||
u64 pc_u64 = u64(arm_pc) << 32;
|
||||
u64 fpscr_u64 = u64(fpscr.Value());
|
||||
u64 t_u64 = cpsr.T() ? 1 : 0;
|
||||
u64 e_u64 = cpsr.E() ? 2 : 0;
|
||||
return pc_u64 | fpscr_u64 | t_u64 | e_u64;
|
||||
}
|
||||
```
|
||||
|
||||
## Our implementation isn't actually a stack
|
||||
|
||||
Dynarmic's RSB isn't actually a stack. It was implemented as a ring buffer because
|
||||
that showed better performance in tests.
|
||||
|
||||
### RSB Structure
|
||||
|
||||
The RSB is implemented as a ring buffer. `rsb_ptr` is the index of the insertion
|
||||
point. Each element in `rsb_location_descriptors` is a `UniqueHash` and they
|
||||
each correspond to an element in `rsb_codeptrs`. `rsb_codeptrs` contains the
|
||||
host addresses of the corresponding compiled blocks.
|
||||
|
||||
`RSBSize` was chosen by performance testing. Note that this is bigger than the
|
||||
size of the real RSB in hardware (which has 3 entries). RSBs larger than 8
|
||||
showed degraded performance.
|
||||
|
||||
```c++
|
||||
struct JitState {
|
||||
// ...
|
||||
|
||||
static constexpr size_t RSBSize = 8; // MUST be a power of 2.
|
||||
u32 rsb_ptr = 0;
|
||||
std::array<u64, RSBSize> rsb_location_descriptors;
|
||||
std::array<u64, RSBSize> rsb_codeptrs;
|
||||
void ResetRSB();
|
||||
|
||||
// ...
|
||||
};
|
||||
```
|
||||
|
||||
### RSB Push
|
||||
|
||||
We insert our prediction at the insertion point iff the RSB doesn't already
|
||||
contain a prediction with the same `UniqueHash`.
|
||||
|
||||
```c++
|
||||
void EmitX64::EmitPushRSB(IR::Block&, IR::Inst* inst) {
|
||||
using namespace Xbyak::util;
|
||||
|
||||
ASSERT(inst->GetArg(0).IsImmediate());
|
||||
u64 imm64 = inst->GetArg(0).GetU64();
|
||||
|
||||
Xbyak::Reg64 code_ptr_reg = reg_alloc.ScratchGpr(code, {HostLoc::RCX});
|
||||
Xbyak::Reg64 loc_desc_reg = reg_alloc.ScratchGpr(code);
|
||||
Xbyak::Reg32 index_reg = reg_alloc.ScratchGpr(code).cvt32();
|
||||
u64 code_ptr = unique_hash_to_code_ptr.find(imm64) != unique_hash_to_code_ptr.end()
|
||||
? u64(unique_hash_to_code_ptr[imm64])
|
||||
: u64(code->GetReturnFromRunCodeAddress());
|
||||
|
||||
code->mov(index_reg, dword[code.ABI_JIT_PTR + offsetof(JitState, rsb_ptr)]);
|
||||
code->add(index_reg, 1);
|
||||
code->and_(index_reg, u32(JitState::RSBSize - 1));
|
||||
|
||||
code->mov(loc_desc_reg, u64(imm64));
|
||||
CodePtr patch_location = code->getCurr<CodePtr>();
|
||||
patch_unique_hash_locations[imm64].emplace_back(patch_location);
|
||||
code->mov(code_ptr_reg, u64(code_ptr)); // This line has to match up with EmitX64::Patch.
|
||||
code->EnsurePatchLocationSize(patch_location, 10);
|
||||
|
||||
Xbyak::Label label;
|
||||
for (size_t i = 0; i < JitState::RSBSize; ++i) {
|
||||
code->cmp(loc_desc_reg, qword[code.ABI_JIT_PTR + offsetof(JitState, rsb_location_descriptors) + i * sizeof(u64)]);
|
||||
code->je(label, code->T_SHORT);
|
||||
}
|
||||
|
||||
code->mov(dword[code.ABI_JIT_PTR + offsetof(JitState, rsb_ptr)], index_reg);
|
||||
code->mov(qword[code.ABI_JIT_PTR + index_reg.cvt64() * 8 + offsetof(JitState, rsb_location_descriptors)], loc_desc_reg);
|
||||
code->mov(qword[code.ABI_JIT_PTR + index_reg.cvt64() * 8 + offsetof(JitState, rsb_codeptrs)], code_ptr_reg);
|
||||
code->L(label);
|
||||
}
|
||||
```
|
||||
|
||||
In pseudocode:
|
||||
|
||||
```c++
|
||||
for (i := 0 .. RSBSize-1)
|
||||
if (rsb_location_descriptors[i] == imm64)
|
||||
goto label;
|
||||
rsb_ptr++;
|
||||
rsb_ptr %= RSBSize;
|
||||
rsb_location_descriptors[rsb_ptr] = imm64; //< The UniqueHash
|
||||
rsb_codeptrs[rsb_ptr] = /* codeptr corresponding to the UniqueHash */;
|
||||
label:
|
||||
```
|
||||
|
||||
## RSB Pop
|
||||
|
||||
To check if a prediction is in the RSB, we linearly scan the RSB.
|
||||
|
||||
```c++
|
||||
void EmitX64::EmitTerminalPopRSBHint(IR::Term::PopRSBHint, IR::LocationDescriptor initial_location) {
|
||||
using namespace Xbyak::util;
|
||||
|
||||
// This calculation has to match up with IREmitter::PushRSB
|
||||
code->mov(ecx, MJitStateReg(Arm::Reg::PC));
|
||||
code->shl(rcx, 32);
|
||||
code->mov(ebx, dword[code.ABI_JIT_PTR + offsetof(JitState, FPSCR_mode)]);
|
||||
code->or_(ebx, dword[code.ABI_JIT_PTR + offsetof(JitState, CPSR_et)]);
|
||||
code->or_(rbx, rcx);
|
||||
|
||||
code->mov(rax, u64(code->GetReturnFromRunCodeAddress()));
|
||||
for (size_t i = 0; i < JitState::RSBSize; ++i) {
|
||||
code->cmp(rbx, qword[code.ABI_JIT_PTR + offsetof(JitState, rsb_location_descriptors) + i * sizeof(u64)]);
|
||||
code->cmove(rax, qword[code.ABI_JIT_PTR + offsetof(JitState, rsb_codeptrs) + i * sizeof(u64)]);
|
||||
}
|
||||
|
||||
code->jmp(rax);
|
||||
}
|
||||
```
|
||||
|
||||
In pseudocode:
|
||||
|
||||
```c++
|
||||
rbx := ComputeUniqueHash()
|
||||
rax := ReturnToDispatch
|
||||
for (i := 0 .. RSBSize-1)
|
||||
if (rbx == rsb_location_descriptors[i])
|
||||
rax = rsb_codeptrs[i]
|
||||
goto rax
|
||||
```
|
||||
@@ -433,6 +433,12 @@ void BufferCache<P>::SetComputeUniformBufferState(u32 mask,
|
||||
|
||||
template <class P>
|
||||
void BufferCache<P>::UnbindGraphicsStorageBuffers(size_t stage) {
|
||||
if constexpr (requires { runtime.ShouldLimitDynamicStorageBuffers(); }) {
|
||||
if (runtime.ShouldLimitDynamicStorageBuffers()) {
|
||||
channel_state->total_graphics_storage_buffers -=
|
||||
static_cast<u32>(std::popcount(channel_state->enabled_storage_buffers[stage]));
|
||||
}
|
||||
}
|
||||
channel_state->enabled_storage_buffers[stage] = 0;
|
||||
channel_state->written_storage_buffers[stage] = 0;
|
||||
}
|
||||
@@ -440,8 +446,26 @@ void BufferCache<P>::UnbindGraphicsStorageBuffers(size_t stage) {
|
||||
template <class P>
|
||||
bool BufferCache<P>::BindGraphicsStorageBuffer(size_t stage, size_t ssbo_index, u32 cbuf_index,
|
||||
u32 cbuf_offset, bool is_written) {
|
||||
const bool already_enabled =
|
||||
((channel_state->enabled_storage_buffers[stage] >> ssbo_index) & 1U) != 0;
|
||||
if constexpr (requires { runtime.ShouldLimitDynamicStorageBuffers(); }) {
|
||||
if (runtime.ShouldLimitDynamicStorageBuffers() && !already_enabled) {
|
||||
const u32 max_bindings = runtime.GetMaxDynamicStorageBuffers();
|
||||
if (channel_state->total_graphics_storage_buffers >= max_bindings) {
|
||||
LOG_WARNING(HW_GPU,
|
||||
"Skipping graphics storage buffer {} due to driver limit {}",
|
||||
ssbo_index, max_bindings);
|
||||
return false;
|
||||
}
|
||||
}
|
||||
}
|
||||
channel_state->enabled_storage_buffers[stage] |= 1U << ssbo_index;
|
||||
channel_state->written_storage_buffers[stage] |= (is_written ? 1U : 0U) << ssbo_index;
|
||||
if constexpr (requires { runtime.ShouldLimitDynamicStorageBuffers(); }) {
|
||||
if (runtime.ShouldLimitDynamicStorageBuffers() && !already_enabled) {
|
||||
++channel_state->total_graphics_storage_buffers;
|
||||
}
|
||||
}
|
||||
|
||||
const auto& cbufs = maxwell3d->state.shader_stages[stage];
|
||||
const GPUVAddr ssbo_addr = cbufs.const_buffers[cbuf_index].address + cbuf_offset;
|
||||
@@ -472,6 +496,12 @@ void BufferCache<P>::BindGraphicsTextureBuffer(size_t stage, size_t tbo_index, G
|
||||
|
||||
template <class P>
|
||||
void BufferCache<P>::UnbindComputeStorageBuffers() {
|
||||
if constexpr (requires { runtime.ShouldLimitDynamicStorageBuffers(); }) {
|
||||
if (runtime.ShouldLimitDynamicStorageBuffers()) {
|
||||
channel_state->total_compute_storage_buffers -=
|
||||
static_cast<u32>(std::popcount(channel_state->enabled_compute_storage_buffers));
|
||||
}
|
||||
}
|
||||
channel_state->enabled_compute_storage_buffers = 0;
|
||||
channel_state->written_compute_storage_buffers = 0;
|
||||
channel_state->image_compute_texture_buffers = 0;
|
||||
@@ -485,8 +515,26 @@ void BufferCache<P>::BindComputeStorageBuffer(size_t ssbo_index, u32 cbuf_index,
|
||||
ssbo_index);
|
||||
return;
|
||||
}
|
||||
const bool already_enabled =
|
||||
((channel_state->enabled_compute_storage_buffers >> ssbo_index) & 1U) != 0;
|
||||
if constexpr (requires { runtime.ShouldLimitDynamicStorageBuffers(); }) {
|
||||
if (runtime.ShouldLimitDynamicStorageBuffers() && !already_enabled) {
|
||||
const u32 max_bindings = runtime.GetMaxDynamicStorageBuffers();
|
||||
if (channel_state->total_compute_storage_buffers >= max_bindings) {
|
||||
LOG_WARNING(HW_GPU,
|
||||
"Skipping compute storage buffer {} due to driver limit {}",
|
||||
ssbo_index, max_bindings);
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
channel_state->enabled_compute_storage_buffers |= 1U << ssbo_index;
|
||||
channel_state->written_compute_storage_buffers |= (is_written ? 1U : 0U) << ssbo_index;
|
||||
if constexpr (requires { runtime.ShouldLimitDynamicStorageBuffers(); }) {
|
||||
if (runtime.ShouldLimitDynamicStorageBuffers() && !already_enabled) {
|
||||
++channel_state->total_compute_storage_buffers;
|
||||
}
|
||||
}
|
||||
|
||||
const auto& launch_desc = kepler_compute->launch_description;
|
||||
if (((launch_desc.const_buffer_enable_mask >> cbuf_index) & 1) == 0) {
|
||||
@@ -857,9 +905,23 @@ void BufferCache<P>::BindHostGraphicsUniformBuffer(size_t stage, u32 index, u32
|
||||
const u32 size = (std::min)(binding.size, (*channel_state->uniform_buffer_sizes)[stage][index]);
|
||||
Buffer& buffer = slot_buffers[binding.buffer_id];
|
||||
TouchBuffer(buffer, binding.buffer_id);
|
||||
const bool use_fast_buffer = binding.buffer_id != NULL_BUFFER_ID &&
|
||||
size <= channel_state->uniform_buffer_skip_cache_size &&
|
||||
!memory_tracker.IsRegionGpuModified(device_addr, size);
|
||||
const bool has_host_buffer = binding.buffer_id != NULL_BUFFER_ID;
|
||||
const u32 offset = has_host_buffer ? buffer.Offset(device_addr) : 0;
|
||||
const bool needs_alignment_stream = [&]() {
|
||||
if constexpr (IS_OPENGL) {
|
||||
return false;
|
||||
} else {
|
||||
if (!has_host_buffer) {
|
||||
return false;
|
||||
}
|
||||
const u32 alignment = runtime.GetUniformBufferAlignment();
|
||||
return alignment > 1 && (offset % alignment) != 0;
|
||||
}
|
||||
}();
|
||||
const bool use_fast_buffer = needs_alignment_stream ||
|
||||
(has_host_buffer &&
|
||||
size <= channel_state->uniform_buffer_skip_cache_size &&
|
||||
!memory_tracker.IsRegionGpuModified(device_addr, size));
|
||||
if (use_fast_buffer) {
|
||||
if constexpr (IS_OPENGL) {
|
||||
if (runtime.HasFastBufferSubData()) {
|
||||
@@ -898,7 +960,6 @@ void BufferCache<P>::BindHostGraphicsUniformBuffer(size_t stage, u32 index, u32
|
||||
if (!needs_bind) {
|
||||
return;
|
||||
}
|
||||
const u32 offset = buffer.Offset(device_addr);
|
||||
if constexpr (IS_OPENGL) {
|
||||
// Mark the index as dirty if offset doesn't match
|
||||
const bool is_copy_bind = offset != 0 && !runtime.SupportsNonZeroUniformOffset();
|
||||
@@ -1015,9 +1076,30 @@ void BufferCache<P>::BindHostComputeUniformBuffers() {
|
||||
TouchBuffer(buffer, binding.buffer_id);
|
||||
const u32 size =
|
||||
(std::min)(binding.size, (*channel_state->compute_uniform_buffer_sizes)[index]);
|
||||
const bool has_host_buffer = binding.buffer_id != NULL_BUFFER_ID;
|
||||
const u32 offset = has_host_buffer ? buffer.Offset(binding.device_addr) : 0;
|
||||
const bool needs_alignment_stream = [&]() {
|
||||
if constexpr (IS_OPENGL) {
|
||||
return false;
|
||||
} else {
|
||||
if (!has_host_buffer) {
|
||||
return false;
|
||||
}
|
||||
const u32 alignment = runtime.GetUniformBufferAlignment();
|
||||
return alignment > 1 && (offset % alignment) != 0;
|
||||
}
|
||||
}();
|
||||
if constexpr (!IS_OPENGL) {
|
||||
if (needs_alignment_stream) {
|
||||
const std::span<u8> span =
|
||||
runtime.BindMappedUniformBuffer(0, binding_index, size);
|
||||
device_memory.ReadBlockUnsafe(binding.device_addr, span.data(), size);
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
SynchronizeBuffer(buffer, binding.device_addr, size);
|
||||
|
||||
const u32 offset = buffer.Offset(binding.device_addr);
|
||||
buffer.MarkUsage(offset, size);
|
||||
if constexpr (NEEDS_BIND_UNIFORM_INDEX) {
|
||||
runtime.BindComputeUniformBuffer(binding_index, buffer, offset, size);
|
||||
|
||||
@@ -8,6 +8,7 @@
|
||||
|
||||
#include <algorithm>
|
||||
#include <array>
|
||||
#include <bit>
|
||||
#include <functional>
|
||||
#include <memory>
|
||||
#include <mutex>
|
||||
@@ -132,6 +133,9 @@ public:
|
||||
u32 enabled_compute_storage_buffers = 0;
|
||||
u32 written_compute_storage_buffers = 0;
|
||||
|
||||
u32 total_graphics_storage_buffers = 0;
|
||||
u32 total_compute_storage_buffers = 0;
|
||||
|
||||
std::array<u32, NUM_STAGES> enabled_texture_buffers{};
|
||||
std::array<u32, NUM_STAGES> written_texture_buffers{};
|
||||
std::array<u32, NUM_STAGES> image_texture_buffers{};
|
||||
|
||||
@@ -198,6 +198,10 @@ public:
|
||||
return device.CanReportMemoryUsage();
|
||||
}
|
||||
|
||||
u32 GetUniformBufferAlignment() const {
|
||||
return static_cast<u32>(device.GetUniformBufferAlignment());
|
||||
}
|
||||
|
||||
u32 GetStorageBufferAlignment() const {
|
||||
return static_cast<u32>(device.GetShaderStorageBufferAlignment());
|
||||
}
|
||||
|
||||
@@ -333,6 +333,13 @@ BufferCacheRuntime::BufferCacheRuntime(const Device& device_, MemoryAllocator& m
|
||||
staging_pool{staging_pool_}, guest_descriptor_queue{guest_descriptor_queue_},
|
||||
quad_index_pass(device, scheduler, descriptor_pool, staging_pool,
|
||||
compute_pass_descriptor_queue) {
|
||||
const VkDriverIdKHR driver_id = device.GetDriverID();
|
||||
limit_dynamic_storage_buffers = driver_id == VK_DRIVER_ID_QUALCOMM_PROPRIETARY ||
|
||||
driver_id == VK_DRIVER_ID_MESA_TURNIP ||
|
||||
driver_id == VK_DRIVER_ID_ARM_PROPRIETARY;
|
||||
if (limit_dynamic_storage_buffers) {
|
||||
max_dynamic_storage_buffers = device.GetMaxDescriptorSetStorageBuffersDynamic();
|
||||
}
|
||||
if (device.GetDriverID() != VK_DRIVER_ID_QUALCOMM_PROPRIETARY) {
|
||||
// TODO: FixMe: Uint8Pass compute shader does not build on some Qualcomm drivers.
|
||||
uint8_pass = std::make_unique<Uint8Pass>(device, scheduler, descriptor_pool, staging_pool,
|
||||
@@ -408,6 +415,10 @@ bool BufferCacheRuntime::CanReportMemoryUsage() const {
|
||||
return device.CanReportMemoryUsage();
|
||||
}
|
||||
|
||||
u32 BufferCacheRuntime::GetUniformBufferAlignment() const {
|
||||
return static_cast<u32>(device.GetUniformBufferAlignment());
|
||||
}
|
||||
|
||||
u32 BufferCacheRuntime::GetStorageBufferAlignment() const {
|
||||
return static_cast<u32>(device.GetStorageBufferAlignment());
|
||||
}
|
||||
|
||||
@@ -6,6 +6,8 @@
|
||||
|
||||
#pragma once
|
||||
|
||||
#include <limits>
|
||||
|
||||
#include "video_core/buffer_cache/buffer_cache_base.h"
|
||||
#include "video_core/buffer_cache/memory_tracker_base.h"
|
||||
#include "video_core/buffer_cache/usage_tracker.h"
|
||||
@@ -94,6 +96,8 @@ public:
|
||||
|
||||
bool CanReportMemoryUsage() const;
|
||||
|
||||
u32 GetUniformBufferAlignment() const;
|
||||
|
||||
u32 GetStorageBufferAlignment() const;
|
||||
|
||||
[[nodiscard]] StagingBufferRef UploadStagingBuffer(size_t size);
|
||||
@@ -155,6 +159,14 @@ public:
|
||||
guest_descriptor_queue.AddTexelBuffer(buffer.View(offset, size, format));
|
||||
}
|
||||
|
||||
bool ShouldLimitDynamicStorageBuffers() const {
|
||||
return limit_dynamic_storage_buffers;
|
||||
}
|
||||
|
||||
u32 GetMaxDynamicStorageBuffers() const {
|
||||
return max_dynamic_storage_buffers;
|
||||
}
|
||||
|
||||
private:
|
||||
void BindBuffer(VkBuffer buffer, u32 offset, u32 size) {
|
||||
guest_descriptor_queue.AddBuffer(buffer, offset, size);
|
||||
@@ -194,6 +206,9 @@ private:
|
||||
|
||||
std::unique_ptr<Uint8Pass> uint8_pass;
|
||||
QuadIndexedPass quad_index_pass;
|
||||
|
||||
bool limit_dynamic_storage_buffers = false;
|
||||
u32 max_dynamic_storage_buffers = std::numeric_limits<u32>::max();
|
||||
};
|
||||
|
||||
struct BufferCacheParams {
|
||||
|
||||
@@ -269,8 +269,9 @@ size_t GetTotalPipelineWorkers() {
|
||||
const size_t max_core_threads =
|
||||
std::max<size_t>(static_cast<size_t>(std::thread::hardware_concurrency()), 2ULL) - 1ULL;
|
||||
#ifdef ANDROID
|
||||
// Leave at least a few cores free in android
|
||||
constexpr size_t free_cores = 3ULL;
|
||||
// Leave at least a few cores free on Android; reserve two instead of three so
|
||||
// pipeline compilation can consume one more worker thread for testing.
|
||||
constexpr size_t free_cores = 2ULL;
|
||||
if (max_core_threads <= free_cores) {
|
||||
return 1ULL;
|
||||
}
|
||||
@@ -767,6 +768,19 @@ std::unique_ptr<ComputePipeline> PipelineCache::CreateComputePipeline(
|
||||
}
|
||||
|
||||
auto program{TranslateProgram(pools.inst, pools.block, env, cfg, host_info)};
|
||||
const VkDriverIdKHR driver_id = device.GetDriverID();
|
||||
const bool needs_shared_mem_clamp =
|
||||
driver_id == VK_DRIVER_ID_QUALCOMM_PROPRIETARY ||
|
||||
driver_id == VK_DRIVER_ID_ARM_PROPRIETARY;
|
||||
const u32 max_shared_memory = device.GetMaxComputeSharedMemorySize();
|
||||
if (needs_shared_mem_clamp && program.shared_memory_size > max_shared_memory) {
|
||||
LOG_WARNING(Render_Vulkan,
|
||||
"Compute shader 0x{:016x} requests {}KB shared memory but device max is {}KB - clamping",
|
||||
key.unique_hash,
|
||||
program.shared_memory_size / 1024,
|
||||
max_shared_memory / 1024);
|
||||
program.shared_memory_size = max_shared_memory;
|
||||
}
|
||||
const std::vector<u32> code{EmitSPIRV(profile, program, this->optimize_spirv_output)};
|
||||
device.SaveShader(code);
|
||||
vk::ShaderModule spv_module{BuildShader(device, code)};
|
||||
|
||||
@@ -1516,6 +1516,10 @@ bool TextureCacheRuntime::CanReportMemoryUsage() const {
|
||||
return device.CanReportMemoryUsage();
|
||||
}
|
||||
|
||||
std::optional<size_t> TextureCacheRuntime::GetSamplerHeapBudget() const {
|
||||
return device.GetSamplerHeapBudget();
|
||||
}
|
||||
|
||||
void TextureCacheRuntime::TickFrame() {}
|
||||
|
||||
Image::Image(TextureCacheRuntime& runtime_, const ImageInfo& info_, GPUVAddr gpu_addr_,
|
||||
|
||||
@@ -62,6 +62,8 @@ public:
|
||||
|
||||
bool CanReportMemoryUsage() const;
|
||||
|
||||
std::optional<size_t> GetSamplerHeapBudget() const;
|
||||
|
||||
void BlitImage(Framebuffer* dst_framebuffer, ImageView& dst, ImageView& src,
|
||||
const Region2D& dst_region, const Region2D& src_region,
|
||||
Tegra::Engines::Fermi2D::Filter filter,
|
||||
|
||||
@@ -6,6 +6,8 @@
|
||||
|
||||
#pragma once
|
||||
|
||||
#include <limits>
|
||||
#include <optional>
|
||||
#include <unordered_set>
|
||||
#include <boost/container/small_vector.hpp>
|
||||
|
||||
@@ -1736,11 +1738,89 @@ SamplerId TextureCache<P>::FindSampler(const TSCEntry& config) {
|
||||
}
|
||||
const auto [pair, is_new] = channel_state->samplers.try_emplace(config);
|
||||
if (is_new) {
|
||||
EnforceSamplerBudget();
|
||||
pair->second = slot_samplers.insert(runtime, config);
|
||||
}
|
||||
return pair->second;
|
||||
}
|
||||
|
||||
template <class P>
|
||||
std::optional<size_t> TextureCache<P>::QuerySamplerBudget() const {
|
||||
if constexpr (requires { runtime.GetSamplerHeapBudget(); }) {
|
||||
return runtime.GetSamplerHeapBudget();
|
||||
} else {
|
||||
return std::nullopt;
|
||||
}
|
||||
}
|
||||
|
||||
template <class P>
|
||||
void TextureCache<P>::EnforceSamplerBudget() {
|
||||
const auto budget = QuerySamplerBudget();
|
||||
if (!budget) {
|
||||
return;
|
||||
}
|
||||
if (slot_samplers.size() < *budget) {
|
||||
return;
|
||||
}
|
||||
if (!channel_state) {
|
||||
return;
|
||||
}
|
||||
if (last_sampler_gc_frame == frame_tick) {
|
||||
return;
|
||||
}
|
||||
last_sampler_gc_frame = frame_tick;
|
||||
TrimInactiveSamplers(*budget);
|
||||
}
|
||||
|
||||
template <class P>
|
||||
void TextureCache<P>::TrimInactiveSamplers(size_t budget) {
|
||||
if (channel_state->samplers.empty()) {
|
||||
return;
|
||||
}
|
||||
static constexpr size_t SAMPLER_GC_SLACK = 1024;
|
||||
auto mark_active = [](auto& set, SamplerId id) {
|
||||
if (!id || id == CORRUPT_ID || id == NULL_SAMPLER_ID) {
|
||||
return;
|
||||
}
|
||||
set.insert(id);
|
||||
};
|
||||
std::unordered_set<SamplerId> active;
|
||||
active.reserve(channel_state->graphics_sampler_ids.size() +
|
||||
channel_state->compute_sampler_ids.size());
|
||||
for (const SamplerId id : channel_state->graphics_sampler_ids) {
|
||||
mark_active(active, id);
|
||||
}
|
||||
for (const SamplerId id : channel_state->compute_sampler_ids) {
|
||||
mark_active(active, id);
|
||||
}
|
||||
|
||||
size_t removed = 0;
|
||||
auto& sampler_map = channel_state->samplers;
|
||||
for (auto it = sampler_map.begin(); it != sampler_map.end();) {
|
||||
const SamplerId sampler_id = it->second;
|
||||
if (!sampler_id || sampler_id == CORRUPT_ID) {
|
||||
it = sampler_map.erase(it);
|
||||
continue;
|
||||
}
|
||||
if (active.find(sampler_id) != active.end()) {
|
||||
++it;
|
||||
continue;
|
||||
}
|
||||
slot_samplers.erase(sampler_id);
|
||||
it = sampler_map.erase(it);
|
||||
++removed;
|
||||
if (slot_samplers.size() + SAMPLER_GC_SLACK <= budget) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
if (removed != 0) {
|
||||
LOG_WARNING(HW_GPU,
|
||||
"Sampler cache exceeded {} entries on this driver; reclaimed {} inactive samplers",
|
||||
budget, removed);
|
||||
}
|
||||
}
|
||||
|
||||
template <class P>
|
||||
ImageViewId TextureCache<P>::FindColorBuffer(size_t index) {
|
||||
const auto& regs = maxwell3d->regs;
|
||||
|
||||
@@ -429,6 +429,9 @@ private:
|
||||
|
||||
void QueueAsyncDecode(Image& image, ImageId image_id);
|
||||
void TickAsyncDecode();
|
||||
void EnforceSamplerBudget();
|
||||
void TrimInactiveSamplers(size_t budget);
|
||||
std::optional<size_t> QuerySamplerBudget() const;
|
||||
|
||||
Runtime& runtime;
|
||||
|
||||
@@ -500,6 +503,7 @@ private:
|
||||
|
||||
u64 modification_tick = 0;
|
||||
u64 frame_tick = 0;
|
||||
u64 last_sampler_gc_frame = (std::numeric_limits<u64>::max)();
|
||||
|
||||
Common::ThreadWorker texture_decode_worker{1, "TextureDecoder"};
|
||||
std::vector<std::unique_ptr<AsyncDecodeContext>> async_decodes;
|
||||
|
||||
@@ -579,6 +579,18 @@ Device::Device(VkInstance instance_, vk::PhysicalDevice physical_, VkSurfaceKHR
|
||||
if (version < VK_MAKE_API_VERSION(0, 255, 615, 512)) {
|
||||
has_broken_parallel_compiling = true;
|
||||
}
|
||||
const size_t sampler_limit = properties.properties.limits.maxSamplerAllocationCount;
|
||||
if (sampler_limit > 0) {
|
||||
constexpr size_t MIN_SAMPLER_BUDGET = 1024U;
|
||||
const size_t reserved = sampler_limit / 4U;
|
||||
const size_t derived_budget =
|
||||
(std::max)(MIN_SAMPLER_BUDGET, sampler_limit - reserved);
|
||||
sampler_heap_budget = derived_budget;
|
||||
LOG_WARNING(Render_Vulkan,
|
||||
"Qualcomm driver reports max {} samplers; reserving {} (25%) and "
|
||||
"allowing Eden to use {} (75%) to avoid heap exhaustion",
|
||||
sampler_limit, reserved, sampler_heap_budget);
|
||||
}
|
||||
}
|
||||
|
||||
if (extensions.sampler_filter_minmax && is_amd) {
|
||||
@@ -1273,6 +1285,13 @@ void Device::SetupFamilies(VkSurfaceKHR surface) {
|
||||
}
|
||||
}
|
||||
|
||||
std::optional<size_t> Device::GetSamplerHeapBudget() const {
|
||||
if (sampler_heap_budget == 0) {
|
||||
return std::nullopt;
|
||||
}
|
||||
return sampler_heap_budget;
|
||||
}
|
||||
|
||||
u64 Device::GetDeviceMemoryUsage() const {
|
||||
VkPhysicalDeviceMemoryBudgetPropertiesEXT budget;
|
||||
budget.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT;
|
||||
|
||||
@@ -6,6 +6,7 @@
|
||||
|
||||
#pragma once
|
||||
|
||||
#include <optional>
|
||||
#include <set>
|
||||
#include <span>
|
||||
#include <string>
|
||||
@@ -310,6 +311,16 @@ public:
|
||||
return properties.properties.limits.maxComputeSharedMemorySize;
|
||||
}
|
||||
|
||||
/// Returns the maximum number of dynamic storage buffer descriptors per set.
|
||||
u32 GetMaxDescriptorSetStorageBuffersDynamic() const {
|
||||
return properties.properties.limits.maxDescriptorSetStorageBuffersDynamic;
|
||||
}
|
||||
|
||||
/// Returns the maximum number of dynamic uniform buffer descriptors per set.
|
||||
u32 GetMaxDescriptorSetUniformBuffersDynamic() const {
|
||||
return properties.properties.limits.maxDescriptorSetUniformBuffersDynamic;
|
||||
}
|
||||
|
||||
/// Returns float control properties of the device.
|
||||
const VkPhysicalDeviceFloatControlsPropertiesKHR& FloatControlProperties() const {
|
||||
return properties.float_controls;
|
||||
@@ -631,6 +642,8 @@ public:
|
||||
return has_broken_parallel_compiling;
|
||||
}
|
||||
|
||||
std::optional<size_t> GetSamplerHeapBudget() const;
|
||||
|
||||
/// Returns the vendor name reported from Vulkan.
|
||||
std::string_view GetVendorName() const {
|
||||
return properties.driver.driverName;
|
||||
@@ -849,6 +862,7 @@ private:
|
||||
bool dynamic_state3_blending{}; ///< Has all blending features of dynamic_state3.
|
||||
bool dynamic_state3_enables{}; ///< Has all enables features of dynamic_state3.
|
||||
bool supports_conditional_barriers{}; ///< Allows barriers in conditional control flow.
|
||||
size_t sampler_heap_budget{}; ///< Sampler budget for buggy drivers (0 = unlimited).
|
||||
u64 device_access_memory{}; ///< Total size of device local memory in bytes.
|
||||
u32 sets_per_pool{}; ///< Sets per Description Pool
|
||||
NvidiaArchitecture nvidia_arch{NvidiaArchitecture::Arch_AmpereOrNewer};
|
||||
|
||||