Render advanced 3D graphics and perform data-parallel computations using graphics processors with Metal.

Metal Documentation

Posts under Metal subtopic

Post | Replies | Boosts | Views | Created

Metal texture allocated size versus actual image data size
Hello. In the iOS app I'm working on we are very tight on memory budget, and I was looking at ways to reduce our texture memory usage. However, I noticed that comparing ASTC8x8 to ASTC12x12, there is no actual difference in allocated memory for most of our textures, despite ASTC12x12 having less than half the bpp of 8x8. The difference between the two only becomes apparent for textures 1024x1024 and larger, and even in that case the actual texture data is sometimes only 60% of the allocation size. I understand there must be some alignment and padding going on, but this seems extreme. For an example scene in my app with ASTC12x12 for most textures, there is over a 100 MB difference between the ASTC size on disk and the size when loaded, so I would love to be able to recover even a portion of that memory. Here is some test code with some measurements I've taken on an iPhone 11:

for(int i = 0; i < 11; i++) {
    MTLTextureDescriptor *texDesc = [[MTLTextureDescriptor alloc] init];
    texDesc.pixelFormat = MTLPixelFormatASTC_12x12_LDR;
    int dim = 12;
    int n = 2 << i;
    int mips = i+1;
    texDesc.width = n;
    texDesc.height = n;
    texDesc.mipmapLevelCount = mips;
    texDesc.resourceOptions = MTLResourceStorageModeShared;
    texDesc.usage = MTLTextureUsageShaderRead;

    // Calculate the equivalent astc texture size
    int blocks = 0;
    if(mips == 1) {
        blocks = n/dim + (n%dim>0? 1 : 0);
        blocks *= blocks;
    } else {
        for(int j = 0; j < mips; j++) {
            int a = 2 << j;
            int cur = a/dim + (a%dim>0? 1 : 0);
            blocks += cur*cur;
        }
    }

    auto tex = [objCObj newTextureWithDescriptor:texDesc];
    printf("%dx%d, mips %d, Astc: %d, Metal: %d\n", n, n, mips, blocks*16, (int)tex.allocatedSize);
}

MTLPixelFormatASTC_12x12_LDR
128x128, mips 7, Astc: 2768, Metal: 6016
256x256, mips 8, Astc: 10512, Metal: 32768
512x512, mips 9, Astc: 40096, Metal: 98304
1024x1024, mips 10, Astc: 158432, Metal: 262144
128x128, mips 1, Astc: 1936, Metal: 4096
256x256, mips 1, Astc: 7744, Metal: 16384
512x512, mips 1, Astc: 29584, Metal: 65536
1024x1024, mips 1, Astc: 118336, Metal: 147456

MTLPixelFormatASTC_8x8_LDR
128x128, mips 7, Astc: 5488, Metal: 6016
256x256, mips 8, Astc: 21872, Metal: 32768
512x512, mips 9, Astc: 87408, Metal: 98304
1024x1024, mips 10, Astc: 349552, Metal: 360448
128x128, mips 1, Astc: 4096, Metal: 4096
256x256, mips 1, Astc: 16384, Metal: 16384
512x512, mips 1, Astc: 65536, Metal: 65536
1024x1024, mips 1, Astc: 262144, Metal: 262144

I also tried using MTLHeaps (placement and automatic), hoping they might be better, but saw nearly the same numbers. Is there any way to have Metal allocate these textures in a more compact way to save on memory?
8 | 0 | 2.8k | Feb ’22
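A minimal sketch related to the question above (not from the post; written in Swift rather than Objective-C, with illustrative texture dimensions) of one way to inspect the allocation size and alignment Metal wants for a given descriptor, using MTLDevice.heapTextureSizeAndAlign(descriptor:), and to compare it against the raw ASTC payload:

import Metal

// Compare Metal's required allocation against the hand-computed ASTC payload.
// Assumes a square, mipmapped ASTC 12x12 LDR texture; the sizes are illustrative.
let device = MTLCreateSystemDefaultDevice()!

let desc = MTLTextureDescriptor()
desc.pixelFormat = .astc_12x12_ldr
desc.width = 1024
desc.height = 1024
desc.mipmapLevelCount = 10
desc.storageMode = .shared
desc.usage = .shaderRead

// Size and alignment Metal would use if this texture were sub-allocated from a heap.
let sizeAlign = device.heapTextureSizeAndAlign(descriptor: desc)

// Raw ASTC payload: one 16-byte block per 12x12 texel tile, summed over all mip levels.
var astcBytes = 0
for level in 0..<desc.mipmapLevelCount {
    let w = max(desc.width >> level, 1)
    let h = max(desc.height >> level, 1)
    astcBytes += ((w + 11) / 12) * ((h + 11) / 12) * 16
}

print("ASTC payload: \(astcBytes) bytes, heap size: \(sizeAlign.size) bytes, alignment: \(sizeAlign.align)")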
Metal and Swift Concurrency
Hi, introducing Swift Concurrency to my Metal app has been a bit challenging, as Swift Concurrency is limited by the cooperative thread pool. GPU work is obviously not CPU-bound and can block forward progress, especially when using waitUntilCompleted on the command buffer. For concurrent render work this has the potential of underutilizing the CPU and even creating deadlocks. My question is: what is the Metal team's general recommendation when it comes to concurrency? It seems to me that Dispatch or OperationQueues are still the preferred way to schedule Metal-bound tasks for maximum performance. To integrate with Swift Concurrency, my idea is to use continuations that kick off render jobs via Dispatch or OperationQueues. Would this be the best way to bridge async tasks with Metal work? Thanks!
5 | 0 | 1.1k | Jun ’24
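A minimal sketch of the continuation-based bridging idea described in the post above, resuming from the command buffer's completion handler instead of blocking on waitUntilCompleted; the function name, dispatch queue label, and error type are illustrative assumptions, not an official recommendation:

import Dispatch
import Metal

// Illustrative error for the sketch; not part of any Metal API.
enum GPUJobError: Error { case commandBufferCreationFailed }

// Dedicated queue for CPU-side encoding work, outside the cooperative pool.
let encodeQueue = DispatchQueue(label: "com.example.metal-encode")  // illustrative label

// Bridge a Metal command buffer into Swift Concurrency: the awaiting task suspends,
// and the GPU completion handler resumes it, so no cooperative-pool thread is blocked.
func runGPUJob(on queue: MTLCommandQueue,
               encode: @escaping (MTLCommandBuffer) -> Void) async throws {
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<Void, Error>) in
        encodeQueue.async {
            guard let commandBuffer = queue.makeCommandBuffer() else {
                continuation.resume(throwing: GPUJobError.commandBufferCreationFailed)
                return
            }
            encode(commandBuffer)
            // Resume when the GPU finishes instead of calling waitUntilCompleted.
            commandBuffer.addCompletedHandler { buffer in
                if let error = buffer.error {
                    continuation.resume(throwing: error)
                } else {
                    continuation.resume()
                }
            }
            commandBuffer.commit()
        }
    }
}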
In Metal compute kernels, when do thread variables get spilled into the device memory?
How many 32-bit variables can I use concurrently in a single thread of a Metal compute kernel without worrying about the variables getting spilled into device memory? Alternatively: how many 32-bit registers does a single thread have available for itself? Let's say that each thread of my compute kernel needs to store and work with its own array of N float variables, where N can be 128, 256, 512 or more. To achieve maximum possible performance, I do not want the local thread variables to get spilled into the slow device memory. I want all N variables to be stored "on-chip", in the thread memory space. To make my question more concrete, let's say there is an array declared as thread float localArray[N]. Assuming an unrealistic hypothetical scenario where localArray is the only variable in the whole kernel, what is the maximum value of N for which no portion of localArray would get spilled into device memory? I searched the Metal feature set tables, but I could not find any details.
1 | 0 | 600 | Aug ’24
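Metal does not publish per-thread register counts, but a rough, indirect signal is available after compiling a pipeline: maxTotalThreadsPerThreadgroup tends to drop as per-thread register pressure rises, and actual spills only show up in the GPU profiling tools rather than in any API. A minimal sketch of inspecting those occupancy-related properties, assuming a kernel named localArrayKernel exists in the default library (the kernel and function names are illustrative):

import Foundation
import Metal

// Inspect occupancy-related properties of a compiled compute pipeline as an
// indirect indicator of register pressure; spilling itself is visible only in profiling tools.
func inspectOccupancy(device: MTLDevice) throws {
    let library = try device.makeDefaultLibrary(bundle: .main)
    guard let function = library.makeFunction(name: "localArrayKernel") else { return }  // illustrative name

    let pipeline = try device.makeComputePipelineState(function: function)

    // A maximum well below the device limit usually means the compiler needed
    // more registers (or threadgroup memory) per thread for this kernel.
    print("maxTotalThreadsPerThreadgroup: \(pipeline.maxTotalThreadsPerThreadgroup)")
    print("threadExecutionWidth: \(pipeline.threadExecutionWidth)")
    print("staticThreadgroupMemoryLength: \(pipeline.staticThreadgroupMemoryLength)")
}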
Metal runtime shader library compilation and linking issue
In my project I need to do the following:

At runtime, create a Metal dynamic library from source.
At runtime, create a Metal executable library from source and link it with the previously created dynamic library.
Create a compute pipeline using the two libraries created above.

But I get the following error at the third step:

Error Domain=AGXMetalG15X_M1 Code=2 "Undefined symbols: _Z5noisev, referenced from: OnTheFlyKernel" UserInfo={NSLocalizedDescription=Undefined symbols: _Z5noisev, referenced from: OnTheFlyKernel}

import Foundation
import Metal

class MetalShaderCompiler {
    let device = MTLCreateSystemDefaultDevice()!
    var pipeline: MTLComputePipelineState!

    func compileDylib() -> MTLDynamicLibrary {
        let source = """
        #include <metal_stdlib>
        using namespace metal;

        half3 noise() {
            return half3(1, 0, 1);
        }
        """

        let option = MTLCompileOptions()
        option.libraryType = .dynamic
        option.installName = "@executable_path/libFoundation.metallib"

        let library = try! device.makeLibrary(source: source, options: option)
        let dylib = try! device.makeDynamicLibrary(library: library)
        return dylib
    }

    func compileExlib(dylib: MTLDynamicLibrary) -> MTLLibrary {
        let source = """
        #include <metal_stdlib>
        using namespace metal;

        extern half3 noise();

        kernel void OnTheFlyKernel(texture2d<half, access::read> src [[texture(0)]],
                                   texture2d<half, access::write> dst [[texture(1)]],
                                   ushort2 gid [[thread_position_in_grid]]) {
            half4 rgba = src.read(gid);
            rgba.rgb += noise();
            dst.write(rgba, gid);
        }
        """

        let option = MTLCompileOptions()
        option.libraryType = .executable
        option.libraries = [dylib]

        let library = try! self.device.makeLibrary(source: source, options: option)
        return library
    }

    func runtime() {
        let dylib = self.compileDylib()
        let exlib = self.compileExlib(dylib: dylib)

        let pipelineDescriptor = MTLComputePipelineDescriptor()
        pipelineDescriptor.computeFunction = exlib.makeFunction(name: "OnTheFlyKernel")
        pipelineDescriptor.preloadedLibraries = [dylib]

        pipeline = try! device.makeComputePipelineState(descriptor: pipelineDescriptor, options: .bindingInfo, reflection: nil)
    }
}
4 | 0 | 889 | Sep ’24
Can't link metal-cpp to Modern Rendering With Metal sample
There is a sample project from Apple here. It has a scene of a city at night and you can move around in it. It basically has two parts:

application code written in what looks like Objective-C (I am more familiar with C++), which inherits from things like NSObject, MTKView, NSViewController and so on; it processes input and all app-related and window-related stuff.

rendering code that also looks like Objective-C.

By the way, both parts are mostly in .mm files (Objective-C++, as far as I know). The application part directly uses only one class from the rendering part, AAPLRenderer. I want to move the rendering part to C++ using metal-cpp. For that I need to link metal-cpp to the project. I did it successfully with blank projects several times before, using this tutorial. But with this sample project, Xcode can't find Foundation/Foundation.hpp (and other metal-cpp headers). The error says this:

Did not find header 'Foundation.hpp' in framework 'Foundation' (loaded from '/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks')

Please help.
2 | 0 | 849 | Sep ’24
OS choosing performance state poorly for GPU use case
I am building a macOS desktop app (https://anukari.com) that is using Metal compute to do real-time audio/DSP processing, as I have a problem that is highly parallelizable and too computationally expensive for the CPU. However, it seems that with the way I am using the GPU, the OS never increases the power/performance state, even when my app is fully compute-limited. Because this is a real-time audio synthesis application, it's a huge problem to not be able to take advantage of the full clock speeds the GPU is capable of, because the app can't keep up with real time. I discovered this issue while profiling the app using Instruments' Metal tracing (and Game tracing) modes. In the profiling configuration under "Metal Application" there is a drop-down to select the "Performance State." If I run the application under Instruments with Performance State set to Maximum, it runs amazingly well, and all my problems go away. For comparison, when I run the app on its own, outside of Instruments, the expensive GPU computation it's doing takes around 2x as long to complete, meaning that the app performs half as well. I've done a ton of work to micro-optimize my Metal compute code, based on every scrap of information from the WWDC videos, etc. A problem I'm running into is that I think the more efficient I make my code, the less it signals to the OS that I want high GPU clock speeds! I think part of why the OS is confused is that in most use cases my computation can be done using only a small number of Metal threadgroups. I'm guessing that the OS heuristics see that only a small fraction of the GPU is saturated and fail to scale up the power/clock state. I'm not sure what to do here; I'm in a bit of a bind. One possibility is to intentionally schedule busy work, spinning threadgroups just to waste energy and signal to the OS that I need higher clock speeds. This is obviously a really bad idea, but it might work. Is there any other (better) way for my app to signal to the OS that it is doing real-time, latency-sensitive computation on the GPU and needs the clock speeds scaled up? Note that Game Mode is not really an option, as my app also runs as an AU plugin inside hosts like GarageBand, so it can't be made fullscreen, etc.
6 | 0 | 904 | Nov ’24
MDLAsset loads texture in usdz file loaded with wrong colorspace
I have a very basic usdz file from this repo. I call loadTextures() after loading the usdz via MDLAsset. Inspecting the MDLTexture object, I can tell it is assigning a colorspace of linear RGB instead of sRGB, although the image file in the usdz is sRGB. This causes the textures to ultimately render as oversaturated. In the code I later convert the MDLTexture to an MTLTexture via MTKTextureLoader, but if I set the sRGB option it seems to be ignored. This significantly impacts the usefulness of Model I/O if it can't load a simple usdz texture correctly. Am I missing something? Thanks!
3 | 3 | 850 | Nov ’24
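One workaround sometimes used for this kind of colorspace mismatch is to reinterpret the loaded texture with the sRGB variant of its pixel format via a texture view, so sampling applies the sRGB-to-linear conversion. A minimal sketch, assuming the loaded texture comes back as rgba8Unorm; the function and variable names are illustrative, and this is not an official Model I/O fix:

import MetalKit
import ModelIO

// Load an MDLTexture through MTKTextureLoader, then reinterpret the result as sRGB
// with a texture view if the loader's .SRGB option did not take effect.
func makeSRGBTexture(from mdlTexture: MDLTexture, device: MTLDevice) throws -> MTLTexture {
    let loader = MTKTextureLoader(device: device)
    let texture = try loader.newTexture(texture: mdlTexture, options: [.SRGB: true])

    // Only the gamma interpretation changes; no pixel data is copied or converted.
    if texture.pixelFormat == .rgba8Unorm,
       let srgbView = texture.makeTextureView(pixelFormat: .rgba8Unorm_srgb) {
        return srgbView
    }
    return texture
}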
Jurassic World Evolution 2 Likely Fails Due to Missing Tiled Resources Support
I’ve been trying to run Jurassic World Evolution 2 using the Game Porting Toolkit on macOS, but the game doesn’t launch and crashes immediately. Based on the error and research, it seems the issue is related to missing support for D3D12_TILED_RESOURCES_TIER_2 in the Metal API. If this is the case, does anyone know if support for tiled resources is planned for future updates of the toolkit? Or are there any potential workarounds for bypassing this limitation?
1 | 0 | 731 | Dec ’24
Metal-cpp-extensions isn't working inside frameworks
I am making a framework in C++ using metal-cpp, basically a small game engine. I am also, consequently, using the metal-cpp-extensions provided in LearnMetalCPP to make applications work. For one of my classes I needed to add AppKit.hpp inside a public header file, so I moved it and its associated headers (NSApplication.hpp, NSMenu.hpp, etc.) from Project to Public in the Build Phases' Headers section. However, it started giving me the error "cast of C pointer type 'void *' to Objective-C pointer type 'Class' requires a bridged cast" at several points in the AppKit headers. The errors don't appear when AppKit.hpp and its associated headers are in the Project headers, or when they are in the Private headers and no headers import them. I imagined that disabling Objective-C ARC and enabling "Use __bridge casts outside of ARC" in Build Settings would solve it, but it didn't budge. I assumed the answer wouldn't involve actively changing the headers, but even when I tried putting __bridge before the problematic casts, the compiler didn't recognize __bridge. How do I solve this? And why is it only happening with Public and not Project headers?
2 | 0 | 806 | Dec ’24
How to use MTKTextureLoader to load png data
I am trying to load some PNG data with MTKTextureLoader newTextureWithData, but the result shows wrong in the alpha area. Here is the code. I have an image URL; after it downloads successfully, I try to use the data directly or UIImagePNGRepresentation(image), and they all show wrong.

UIImage *tempImg = [UIImage imageWithData:data];
CGImageRef cgRef = tempImg.CGImage;
MTKTextureLoader *loader = [[MTKTextureLoader alloc] initWithDevice:device];

id<MTLTexture> temp1 = [loader newTextureWithData:data
                                           options:@{MTKTextureLoaderOptionSRGB: @(NO),
                                                     MTKTextureLoaderOptionTextureUsage: @(MTLTextureUsageShaderRead),
                                                     MTKTextureLoaderOptionTextureCPUCacheMode: @(MTLCPUCacheModeWriteCombined)}
                                             error:nil];

NSData *tempData = UIImagePNGRepresentation(tempImg);
id<MTLTexture> temp2 = [loader newTextureWithData:tempData
                                           options:@{MTKTextureLoaderOptionSRGB: @(NO),
                                                     MTKTextureLoaderOptionTextureUsage: @(MTLTextureUsageShaderRead),
                                                     MTKTextureLoaderOptionTextureCPUCacheMode: @(MTLCPUCacheModeWriteCombined)}
                                             error:nil];

id<MTLTexture> temp3 = [loader newTextureWithCGImage:cgRef
                                              options:@{MTKTextureLoaderOptionSRGB: @(NO),
                                                        MTKTextureLoaderOptionTextureUsage: @(MTLTextureUsageShaderRead),
                                                        MTKTextureLoaderOptionTextureCPUCacheMode: @(MTLCPUCacheModeWriteCombined)}
                                                error:nil];
}] resume];
5 | 0 | 625 | Dec ’24
M1 GPU violates atomic_thread_fence across threadgroups
I have an M1 Pro with a 16-core GPU. When I run a shader with 8193 threads, atomic_thread_fence is violated across the boundary between thread 8191 (the last thread in the 8th threadgroup) and thread 8192 (the first thread in the 9th threadgroup). I've attached the Metal and Swift files, but I'll repost the relevant kernel here. It's a function that launches N threads to iterate through a binary tree from the leaves, where the first thread to reach a parent terminates and the second one populates it with the sum of the node's two children.

// clang-format off
void sum(device const int& size,
         device const int* __restrict__ in,
         device int* __restrict__ out,
         device atomic_int* visited,
         uint i [[thread_position_in_grid]]) {
  // clang-format on
  int val = in[i];
  uint cur = (size + i - 1);
  out[cur] = val;
  atomic_thread_fence(mem_flags::mem_device, memory_order_seq_cst);
  cur = (cur - 1) / 2;
  int proceed = atomic_fetch_add_explicit(&visited[cur], 1, memory_order_relaxed);

  while (proceed == 1) {
    uint left = 2 * cur + 1;
    uint right = 2 * cur + 2;
    uint val_left = out[left];
    uint val_right = out[right];
    uint val_cur = val_left + val_right;
    out[cur] = val_cur;

    if (cur == 0) {
      break;
    }

    cur = (cur - 1) / 2;
    atomic_thread_fence(mem_flags::mem_device, memory_order_seq_cst);
    proceed = atomic_fetch_add_explicit(&visited[cur], 1, memory_order_relaxed);
  }
}

What I'm observing is that thread 8192 hits the atomic_fetch_add first and terminates, while thread 8191 hits it second (observes that thread 8192 had incremented it by 1) and proceeds into the loop. Thread 8191 reads out[16383] (which it populated with 8191) and out[16384] (which thread 8192 populated with 8192 prior to the atomic_thread_fence). Instead of reading 8192 from out[16384], though, it reads 0. Maybe I'm missing something, but this seems like a pretty clear violation of the atomic_thread_fence, which (I thought) was supposed to guarantee that the write from thread 8192 to out[16384] would be visible to any thread observing the effects of the following atomic_fetch_add. Is atomic_fetch_add not a store operation? Modifying it to an atomic_store or atomic_exchange still results in the bug. Adding another atomic_thread_fence between the atomic_fetch_add and the reading of out also doesn't change anything. I only begin to observe this at grid sizes of 8193 and upwards. That's 9 threadgroups per grid, which I assume could be related to my M1 Pro GPU having 16 cores. Running the same example on an A17 Pro GPU doesn't show any of this behavior up through a tested grid size of 4194303 (2^22-1), at which point testing larger grid sizes starts to run into other issues, so I can't test anything larger. Removing the atomic_thread_fences on both the M1 and A17 causes the test to fail at much smaller grid sizes, as expected.

sum.metal
main.swift
2 | 0 | 532 | Dec ’24
Tile Shaders performance when writing to tile texture vs. resolve texture
I am working on a custom resolve tile shader for a client. I see a big difference in performance depending on where we write to:

1. the resolve texture of the color attachment
2. a read-write tile shader texture set via [renderEncoder setTileTexture: myResolvedTexture]

Option 2 is more than twice as slow as option 1. Our compute shader writes to 4 UAVs, so just using the resolve texture entry is not possible. Why is there such a difference when there is no more data being written? Can option 2 be made as fast as option 1? I can demonstrate the issue in a modified version of the Multisample code sample.
5 | 0 | 568 | Dec ’24
How to use imageblock_slice
Is there a working example of imageblock_slice with implicit layout somewhere? I get a compilation error when I write this:

imageblock_slice color_slice = img_blk.slice(frag->color);

Error:
No matching member function for call to 'slice'
candidate template ignored: couldn't infer template argument 'E'
candidate function template not viable: requires 2 arguments, but 1 was provided
Too few template arguments for class template 'imageblock_slice'

It seems the syntax has changed since the Imageblocks presentation https://developer.apple.com/videos/play/tech-talks/603/ I tried supplying the struct type of the image block between <>, but it still does not work.
1 | 0 | 671 | Dec ’24
What are the CAMetalLayer.nextDrawable threading rules?
What evidence exists that it's safe to call nextDrawable() on CAMetalLayer off the main thread? I have seen developers claiming that it's OK, but the official docs are silent on the topic. Attempting to do so with Strict Concurrency Checking set to Complete complains that CAMetalLayer is not Sendable. I want to call it off the main thread since there doesn't seem to be any way to prevent it from blocking the UI for up to a second. I have read hints and allegations that this won't happen if you avoid asking for too many drawables, but that doesn't seem to be true 100% of the time in my experience. Supposing it is allowed, I wonder how races are handled, such as when the layer's size is changed on the main thread or the layer is removed from the layer hierarchy.
0 | 0 | 515 | Dec ’24
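A minimal sketch of the off-main-thread pattern the post above is asking about: rendering on a private serial queue and snapshotting the drawable size on the main thread before dispatching. This assumes such use is permitted, which, as the post notes, the documentation does not explicitly confirm; the class name, method names, and queue label are illustrative:

import Dispatch
import Metal
import QuartzCore

final class OffMainRenderer {
    private let renderQueue = DispatchQueue(label: "com.example.render")  // illustrative label
    private let layer: CAMetalLayer
    private let commandQueue: MTLCommandQueue

    init(layer: CAMetalLayer, device: MTLDevice) {
        self.layer = layer
        self.commandQueue = device.makeCommandQueue()!
    }

    // Called on the main thread; capture main-thread state before hopping queues.
    func renderFrame() {
        let targetSize = layer.drawableSize
        renderQueue.async { [layer, commandQueue] in
            // nextDrawable() may block for up to a second; off the main thread it
            // at least does not stall the UI while it waits.
            guard layer.drawableSize == targetSize,
                  let drawable = layer.nextDrawable(),
                  let commandBuffer = commandQueue.makeCommandBuffer() else { return }

            // ... encode render passes targeting drawable.texture here ...

            commandBuffer.present(drawable)
            commandBuffer.commit()
        }
    }
}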
Texture Definitions for MPSSVGF Denoise
I am trying to use the SVGF denoiser to denoise my ray-traced shadows (and later other textures as well). I do get a smoothed image, but with wonky denoising. I need the depth-normal textures and motion textures for the SVGF and assume that these are badly filled in my case. However, neither the linked documentation nor the WWDC19 video explains how they should be defined. I am looking for answers to:

Is depth in the red or alpha channel of the depth-normal texture?
Are the normals in screen space?
Is depth linear? Is it distance or the z coordinate in view space? Or even logarithmically scaled or something else?
Are the motion vectors supposed to be in pixels per frame? What is the orientation of the axes? Is y up or down?
Are there other restrictions on the formats?

Also the linked code did not help me (I have not found any SVGF so far; also, all the code is in Objective-C++, not Swift, but that's a different topic). So how should I fill these textures? Can someone point me to documentation where these kinds of questions are answered?
0 | 0 | 558 | Dec ’24
metal-cpp syntax for MTL::Buffer float2 parameter
I'm trying to pass a buffer of float2 items from the CPU to the GPU. In the kernel, I can provide a parameter for the buffer, for example device const float2* values. How do I specify float2 as the type for the MTL::Buffer? I managed to get the code to work by "cheating": defining a simple class that has the same data members as a float2, but there is probably a better way.

class Coord_f {
public:
    float x{0.0f};
    float y{0.0f};
};

Then allocating like this:

NS::TransferPtr(device->newBuffer(n_elements * sizeof(Coord_f), MTL::ResourceStorageModeManaged))

The headers for metal-cpp do not appear to define vector types like float2, but I'm doubtless missing something. Thanks.
2 | 0 | 653 | Dec ’24
Shader compiler crash on MacOS Sequoia+Radeon
Hello, Apple! This post is a bug report for the Metal driver in macOS Sequoia. I'm working on an open-source game engine, and one of my users reported a bug on macOS Sequoia with an AMD Radeon RX 6900 XT. The engine crashes when linking a compute pipeline, with the following NSError:

"Compiler encountered an internal error: I"

Offending shader (depth-aware blur): https://shader-playground.timjones.io/27565de40391f62f078c891077ba758c

On my end, compiling the same shader on M1 (Sonoma) or with the offline compiler doesn't reproduce the issue. Post with the bug report on GitHub: https://github.com/Try/OpenGothic/issues/712 Looking forward to your help and a driver fix ;)
2 | 0 | 695 | Jan ’25