Hi, I am using xCode26.x. But my Metal4 classes are not compiling. I downloaded the sample code from Apple's website - https://developer.apple.com/documentation/Metal/processing-a-texture-in-a-compute-function. For example, I am getting errors like "Cannot find protocol declaration for 'MTL4CommandQueue';
I have hit a deadline. Any recommendations are very welcome.
I have downloaded the Metal Tool chain. When I run the following commands on the terminal - xcodebuild -showComponent metalToolchain ; xcrun -f metal ; xcrun metal --version
I get the following response -
Asset Path: /System/Library/AssetsV2/com_apple_MobileAsset_MetalToolchain/86fbaf7b114a899754307896c0bfd52ffbf4fded.asset/AssetData
Build Version: 17A321
Status: installed
Toolchain Identifier: com.apple.dt.toolchain.Metal.32023
Toolchain Search Path: /Users/private/Library/Developer/DVTDownloads/MetalToolchain/mounts/86fbaf7b114a899754307896c0bfd52ffbf4fded
/Users/private/Library/Developer/DVTDownloads/MetalToolchain/mounts/86fbaf7b114a899754307896c0bfd52ffbf4fded/Metal.xctoolchain/usr/bin/metal
Apple metal version 32023.830 (metalfe-32023.830.2)
Target: air64-apple-darwin24.6.0
Thread model: posix
InstalledDir: /Users/private/Library/Developer/DVTDownloads/MetalToolchain/mounts/86fbaf7b114a899754307896c0bfd52ffbf4fded/Metal.xctoolchain/usr/metal/current/bin
Metal
RSS for tagRender advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.
Selecting any option will automatically load the page
Post
Replies
Boosts
Views
Activity
I am puzzled by the setAddress(_:attributeStride:index:) of MTL4ArgumentTable. Can anyone please explain what the attributeStride parameter is for? The doc says that it is "The stride between attributes in the buffer." but why?
Who uses this for what? On the C++ side in the shaders the stride is determined by the C++ type, as far as I know. What am I missing here?
Thanks!
I have something like this drawing in an MTKView (see at bottom).
I am finding it difficult to figure out when can the Swift-land resources used in making the MTLBuffer(s) be released? Below, for example, is it ok if args goes out of scope (or is otherwise deallocated) at point 1, 2, or 3? Or perhaps even earlier, as soon as argsBuffer has been created?
I have been reading through various articles such as
Setting resource storage modes
Choosing a resource storage mode for Apple GPUs
Copying data to a private resource
but it's a lot to absorb and I haven't been really able to find an authoritative description of the required lifetime of the resources in CPU land.
I should mention that this is Metal 4 code. In previous versions of Metal, the MTLCommandBuffer had the ability to add a completion handler to be called by the GPU after it has finished running the commands in the buffer but in Metal 4 there is no such thing (it it were even needed for the purpose I am interested in).
Any advice and/or pointers to the definitive literature will be appreciated.
guard let argsBuffer = device.makeBuffer(bytes: &args,...
argumentTable.setAddress(argsBuffer.gpuAddress, ...
encoder.setArgumentTable(argumentTable, stages: .vertex)
// encode drawing
renderEncoder.draw...
...
encoder.endEncoding() // 1
commandBuffer.endCommandBuffer() // 2
commandQueue.waitForDrawable(drawable)
commandQueue.commit([commandBuffer]) // 3
commandQueue.signalDrawable(drawable)
drawable.present()
Topic:
Graphics & Games
SubTopic:
Metal
Is the pseudocode below thread-safe? Imagine that the Main thread sets the CAMetalLayer's drawableSize to a new size meanwhile the rendering thread is in the middle of rendering into an existing MTLDrawable which does still have the old size.
Is the change of metalLayer.drawableSize thread-safe in the sense that I can present an old MTLDrawable which has a different resolution than the current value of metalLayer.drawableSize? I assume that setting the drawableSize property informs Metal that the next MTLDrawable offered by the CAMetalLayer should have the new size, right?
Is it valid to assume that "metalLayer.drawableSize = newSize" and "metalLayer.nextDrawable()" are internally synchronized, so it cannot happen that metalLayer.nextDrawable() would produce e.g. a MTLDrawable with the old width but with the new height (or a completely invalid resolution due to potential race conditions)?
func onWindowResized(newSize: CGSize) {
// Called on the Main thread
metalLayer.drawableSize = newSize
}
func onVsync(drawable: MTLDrawable) {
// Called on a background rendering thread
renderer.renderInto(drawable: drawable)
}
Now the examples of metal-cpp are target on desktop and using AppKit which is not supported on iOS. Is there any tips for developing with metal-cpp on mobile device?
How many 32-bit variables can I use concurrently in a single thread of a Metal compute kernel without worrying about the variables getting spilled into the device memory? Alternatively: how many 32-bit registers does a single thread have available for itself?
Let's say that each thread of my compute kernel needs to store and work with its own array of N float variables, where N can be 128, 256, 512 or more. To achieve maximum possible performance, I do not want to the local thread variables to get spilled into the slow device memory. I want all N variables to be stored "on-chip", in the thread memory space.
To make my question more concrete, let's say there is an array thread float localArray[N]. Assuming an unrealistic hypothetical scenario where localArray is the only variable in the whole kernel, what is the maximum value of N for which no portion of localArray would get spilled into the device memory?
I searched in the Metal feature set tables, but I could not find any details.
In my project I need to do the following:
In runtime create metal Dynamic library from source.
In runtime create metal Executable library from source and Link it with my previous created Dynamic library.
Create compute pipeline using those two libraries created above.
But I get the following error at the third step:
Error Domain=AGXMetalG15X_M1 Code=2 "Undefined symbols:
_Z5noisev, referenced from: OnTheFlyKernel
" UserInfo={NSLocalizedDescription=Undefined symbols:
_Z5noisev, referenced from: OnTheFlyKernel
}
import Foundation
import Metal
class MetalShaderCompiler {
let device = MTLCreateSystemDefaultDevice()!
var pipeline: MTLComputePipelineState!
func compileDylib() -> MTLDynamicLibrary {
let source = """
#include <metal_stdlib>
using namespace metal;
half3 noise() {
return half3(1, 0, 1);
}
"""
let option = MTLCompileOptions()
option.libraryType = .dynamic
option.installName = "@executable_path/libFoundation.metallib"
let library = try! device.makeLibrary(source: source, options: option)
let dylib = try! device.makeDynamicLibrary(library: library)
return dylib
}
func compileExlib(dylib: MTLDynamicLibrary) -> MTLLibrary {
let source = """
#include <metal_stdlib>
using namespace metal;
extern half3 noise();
kernel void OnTheFlyKernel(texture2d<half, access::read> src [[texture(0)]],
texture2d<half, access::write> dst [[texture(1)]],
ushort2 gid [[thread_position_in_grid]]) {
half4 rgba = src.read(gid);
rgba.rgb += noise();
dst.write(rgba, gid);
}
"""
let option = MTLCompileOptions()
option.libraryType = .executable
option.libraries = [dylib]
let library = try! self.device.makeLibrary(source: source, options: option)
return library
}
func runtime() {
let dylib = self.compileDylib()
let exlib = self.compileExlib(dylib: dylib)
let pipelineDescriptor = MTLComputePipelineDescriptor()
pipelineDescriptor.computeFunction = exlib.makeFunction(name: "OnTheFlyKernel")
pipelineDescriptor.preloadedLibraries = [dylib]
pipeline = try! device.makeComputePipelineState(descriptor: pipelineDescriptor, options: .bindingInfo, reflection: nil)
}
}
There is a sample project from Apple here. It has a scene of a city at night and you can move in it.
It basically has 2 parts:
application code written in what looks like Objective-C (I am more familiar with C++), which inherits from things like NSObject, MTKView, NSViewController and so on - it processes input and all app-related and window-related stuff.
rendering code that also looks like Objective-C. Btw both parts are mostly in .mm files (Obj-C++ AFAIK). The application part directly uses only one class from the rendering part - AAPLRenderer.
I want to move the rendering part to C++ using metal-cpp. For that I need to link metal-cpp to the project. I did it successfully with blank projects several times before using this tutorial. But with this sample project Xcode can't find Foundation/Foundation.hpp (and other metal-cpp headers). The error says this:
Did not find header 'Foundation.hpp' in framework 'Foundation' (loaded from '/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks')
Pls help
I am building a MacOS desktop app (https://anukari.com) that is using Metal compute to do real-time audio/DSP processing, as I have a problem that is highly parallelizable and too computationally expensive for the CPU.
However it seems that the way in which I am using the GPU, even when my app is fully compute-limited, the OS never increases the power/performance state. Because this is a real-time audio synthesis application, it's a huge problem to not be able to take advantage of the full clock speeds that the GPU is capable of, because the app can't keep up with real-time.
I discovered this issue while profiling the app using Instrument's Metal tracing (and Game tracing) modes. In the profiling configuration under "Metal Application" there is a drop-down to select the "Performance State." If I run the application under Instruments with Performance State set to Maximum, it runs amazingly well, and all my problems go away.
For comparison, when I run the app on its own, outside of Instruments, the expensive GPU computation it's doing takes around 2x as long to complete, meaning that the app performs half as well.
I've done a ton of work to micro-optimize my Metal compute code, based on every scrap of information from the WWDC videos, etc. A problem I'm running into is that I think that the more efficient I make my code, the less it signals to the OS that I want high GPU clock speeds!
I think part of why the OS is confused is that in most use cases, my computation can be done using only a small number of Metal threadgroups. I'm guessing that the OS heuristics see that only a small fraction of the GPU is saturated and fail to scale up the power/clock state.
I'm not sure what to do here; I'm in a bit of a bind. One possibility is that I intentionally schedule busy work -- spin threadgroups just to waste energy and signal to the OS that I need higher clock speeds. This is obviously a really bad idea, but it might work.
Is there any other (better) way for my app to signal to the OS that it is doing real-time latency-sensitive computation on the GPU and needs the clock speeds to be scaled up?
Note that game mode is not really an option, as my app also runs as an AU plugin inside hosts like Garageband, so it can't be made fullscreen, etc.
Hi! I just installed GPTK2 on my new Mac , but the Terminal gave “Error:OpenSSL1.1 has been disabled.”
How should I fix it?Or waiting for the GPTK2 beta4?
Thanks.
I am making a framework in C++ using metal-cpp, basically a small game engine. I am also consequently using metal-cpp-extensions provided in LearnMetalCPP to make applications work.
For one of my classes, I needed to add AppKit.hpp inside a public header file, so I moved it and its associate headers(NSApplication.hpp, NSMenu.hpp, etc.) from Project headers to Public in Build Phases' Headers, however, it started giving me the error "cast of C pointer type 'void *' to Objective-C pointer type 'Class' requires a bridged cast" at several points in the AppKit headers. They don't appear when AppKit and its associates are in the Project headers, or when they are in the Private headers and no headers import it.
I imagined that disabling Objective-C ARC and Using __bridge casts outside of ARC in Build Settings would solve it, but it didn't budge.
I imagined it wouldn't involve actively changing the headers would be the answer, but even if I try to put __bridge before the problematic casts, it didn't recognize __bridge.
How do I solve this? And why is it only happening in Public and not Project headers?
I have an M1 Pro with a 16-core GPU. When I run a shader with 8193 threads, atomic_thread_fence is violated across the boundary between thread 8191 (the last thread in the 7th threadgroup) and 8192 (the first thread in the 9th threadgroup).
I've attached the Metal and Swift files, but I'll repost the relevant kernel here. It's a function that launches N threads to iterate through a binary tree from the leaves, where the first thread to reach the parent terminates and the second one populates it with the sum of the nodes two children.
// clang-format off
void sum(device const int& size,
device const int* __restrict__ in,
device int* __restrict__ out,
device atomic_int* visited,
uint i [[thread_position_in_grid]]) {
// clang-format on
int val = in[i];
uint cur = (size + i - 1);
out[cur] = val;
atomic_thread_fence(mem_flags::mem_device, memory_order_seq_cst);
cur = (cur - 1) / 2;
int proceed = atomic_fetch_add_explicit(&visited[cur], 1, memory_order_relaxed);
while (proceed == 1) {
uint left = 2 * cur + 1;
uint right = 2 * cur + 2;
uint val_left = out[left];
uint val_right = out[right];
uint val_cur = val_left + val_right;
out[cur] = val_cur;
if (cur == 0) {
break;
}
cur = (cur - 1) / 2;
atomic_thread_fence(mem_flags::mem_device, memory_order_seq_cst);
proceed = atomic_fetch_add_explicit(&visited[cur], 1, memory_order_relaxed);
}
}
What I'm observing is that thread 8192 hits the atomic_fetch_add first and terminates, while thread 8191 hits it second (observes that thread 8192 had incremented it by 1) and proceeds into the loop. Thread 8191 reads out[16383] (which it populated with 8191) and out[16384] (which thread 8192 populated with 8192 prior to the atomic_thread_fence). Instead of reading 8192 from out[16384] though, it reads 0.
Maybe I'm missing something but this seems like a pretty clear violation of the atomic_thread_fence which (I thought) was supposed to guarantee that the write from thread 8192 to out[16384] would be visible to any thread observing the effects of the following atomic_fetch_add. Is atomic_fetch_add not a store operation? Modifying it to an atomic_store or atomic_exchange still results in the bug. Adding another atomic_thread_fence between the atomic_fetch_add and reading of out also doesn't change anything.
I only begin to observe this on grid sizes of 8193 and upwards. That's 9 threadgroups per grid, which I assume could be related to my M1 Pro GPU having 16 cores.
Running the same example on an A17 Pro GPU doesn't show any of this behavior up through a tested grid size of 4194303 (2^22-1), at which point testing larger grid sizes starts to run into other issues so I can't test anything larger.
Removing the atomic_thread_fences on both the M1 and A17 cause the test to fail at much smaller grid sizes, as expected.
sum.metal
main.swift
I am working on a custom resolve tile shader for a client. I see a big difference in performance depending on where we write to:
1- the resolve texture of the color attachment
2- a rw tile shader texture set via [renderEncoder setTileTexture: myResolvedTexture]
Option 2 is more than twice as slow than option 1.
Our compute shader writes to 4 UAVs so just using the resolve texture entry is not possible.
Why such a difference as there is no more data being written? Can option 2 be as fast as option 1?
I can demonstrate the issue in a modified version of the Multisample code sample.
Is there a working example of imageblock_slice with implicit layout somewhere?
I get a compilation error when i write this:
imageblock_slilce color_slice = img_blk.slice(frag->color);
Error:
No matching member function for call to 'slice'
candidate template ignored: couldn't infer template argument 'E'
candidate function template not viable: requires 2 arguments, but 1 was provided
Too few template arguments for class template 'imageblock_slice'
It seems the syntax has changed since the Imageblocks presentation https://developer.apple.com/videos/play/tech-talks/603/
I tried supplying the struct type of the image block between <> but it still does not work.
I'm trying to pass a buffer of float2 items from CPU to GPU.
In the kernel, I can provide a parameter for the buffer:
device const float2* values
for example.
How do I specify float2 as the type for the MTL::Buffer?
I managed to get the code to work by "cheating" by defining a simple class that has the same data members as a float2, but there is probably a better way.
class Coord_f { public: float x{0.0f}; float y{0.0f}; };
then using code to allocate like this:
NS::TransferPtr(device->newBuffer(n_elements * sizeof(Coord_f), MTL::ResourceStorageModeManaged))
The headers for metal-cpp do not appear to define vector objects like float2, but I'm doubtless missing something.
Thanks.
Hello, Apple!
This post is a bug report for Metal driver in MacOS Sequoia.
I'm working on opensource game engine and one of my users reported a bug, on "MacOS Sequoia" + "AMD Radeon RX 6900 XT". Engine crashes, when liking a compute pipeline, with following NSError:
"Compiler encountered an internal error: I"
Offended shader (depth aware blur): https://shader-playground.timjones.io/27565de40391f62f078c891077ba758c
On my end, compiling same shader on M1 (Sonoma) or with offline compiler doesn't reproduce the issue.
Post with bug-report on github: https://github.com/Try/OpenGothic/issues/712
Looking forward for your help and driver fix ;)
Topic:
Graphics & Games
SubTopic:
Metal
I have a bare-bones Metal app setup where I attach a CAMetalLayer to a window that inherits from a NSWindow with a custom delegate. Everything else is vanilla. I'm also using metal-cpp and metal shader converter.
I'm running into a issue where the application runs fine in the beginning, but once I resize the window, it starts hitching. It turns out that [CAMetalLayer nextDrawable:] frequently (but not always) takes around a full second (plus or minus a few milliseconds) to return once drawableSize has been updated.
I've tried setting allowsNextDrawableTimeout to false which doesn't work; it returns a valid drawable after a second instead of nil. Setting displaySyncEnabled to false reduces the likelihood of this happening to around 50% from 90%+ but does not eliminate it. Setting maximumDrawableCount to 2 or 3 does not seem to make a difference.
By dumping the resource IDs of the returned textures I've noticed something interesting: Before resizing, the layer seems to shuffle between 2 textures or at least 2 resource IDs, but after resizing it starts to create new textures for each returned drawable. Occasionally it seems to reuse a previous resource ID, but it does not seem to have anything to do with whether the method returns quickly or not.
Why does this happen, and how can I fix it? Should I create a new CAMetalLayer when resizing the window instead of updating drawableSize?
I am interested in learning the Metal framework for rendering development. However, most of Apple’s official documentation uses Objective-C code. Therefore, I am seeking guidance on whether it is more advantageous for me to focus solely on learning Swift to gain proficiency in Metal.
When inspecting the geometry in Xcode's metal debugger, I noticed that the shown "frustrum box" didn't make sense. Since Metal uses depth range 0,1 in NDC space, I would expect a vertex that is projected to z:0 to be on the front clipping plane of the frustrum shown in the geometry inspector. This is however not the case. A vertex with ndc z:0 is shown halfway inside the frustrum. Vertices with ndc z less than 0 are correctly culled during rendering, while the geometry inspector's frustrum shows that the vertex is stil inside the frustrum.
The image shows vertices that are visually in the middle of the frustrum on z axis, but at the same time the out position shows that they are projected to z:0. How is this possible, unless there's a bug in the geometry inspector?
Hey everyone,
I’m trying to run Kingdom Come: Deliverance 2 using the Game Porting Toolkit, but I’m encountering a black screen when launching the game. From what I know about the game’s requirements, it might be using Shader Model 6.5, which supports advanced features like DirectX Raytracing (DXR) Tier 1.1. This leads me to suspect that the issue could be related to missing support for DirectX 12.1 Features or Shader Model 6.5 in GPTK.
Does anyone know if these features are currently supported by GPTK? If not, are there any plans to implement them in future updates? Alternatively, is there any workaround for games that rely on Shader Model 6.5 and ray tracing?
Thanks a lot for your help!