Hello Triangle

This tutorial explains how to draw a simple triangle using Metal and the Metal-C++ API. It assumes you already know how to create a window with GLFW (covered in the previous tutorial) and focuses specifically on Metal’s rendering pipeline.

Understanding The Rendering Pipeline

Before diving into code, it’s important to understand how the rendering pipeline works and why it’s designed that way. Metal’s pipeline transforms and rasterizes geometry into pixels displayed on the screen.

Here’s a simplified overview:

flowchart LR
    A["MTLBuffer (Vertex Data)"] --> B{Vertex Shader}
    B --> C["Tessellation Shader (optional)"]
    C --> D[Rasterization]
    D --> E{Fragment Shader}
    E --> F[Framebuffer]

Unlike OpenGL, Metal has no geometry shader stage; similar effects are achieved with tessellation or compute shaders.

Each stage plays a critical role:

The vertex shader processes each vertex.
The rasterizer converts the assembled triangle into fragments, one for each pixel it covers.
The fragment shader determines the color of each fragment, which is then written to the framebuffer.

The Metal Core Objects

Several key Metal objects form the backbone of the rendering process:

Object	Purpose
`MTL::Device`	Represents the GPU device
`CA::MetalLayer`	Connects Metal rendering to the window’s view layer
`MTL::Buffer`	Holds vertex data on the GPU
`MTL::Library`	Contains compiled Metal shaders
`MTL::CommandQueue`	Manages GPU command buffers
`MTL::RenderPipelineState`	Encapsulates the GPU rendering pipeline (shaders, formats, etc.)
`CA::MetalDrawable`	Represents the drawable framebuffer surface
`MTL::CommandBuffer`	Holds GPU commands to be executed

Because the rendering pipeline can seem complex at first, we’ll break it down into sections to make it easier to follow.

Preparing The Triangle’s Vertex Data

First, we define the triangle’s vertices in 3D space using x, y, z components.

The triangle consists of three vertices defined in normalized device coordinates (NDC), ranging from -1 to 1:

simd::float3 triangleVertices[] =
{
    {-0.5f, -0.5f, 0.0f},
    { 0.5f, -0.5f, 0.0f},
    { 0.0f,  0.5f, 0.0f}
};

We create a vertex buffer by passing a pointer to this data, specifying its size in bytes, and setting the storage mode to shared, which allows both CPU and GPU to access it — ideal for simple or dynamic data.

triangleVertexBuffer = metalDevice->newBuffer(&triangleVertices,
                                              sizeof(triangleVertices),
                                              MTL::ResourceStorageModeShared);
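One detail worth checking here is the buffer size. simd::float3 is padded to 16 bytes, the same size and alignment as float3 in the Metal Shading Language, so three vertices occupy 48 bytes rather than 36. The sketch below illustrates this in plain C++. PaddedFloat3 and the two helper functions are hypothetical stand-ins, used so the example compiles without Apple’s headers:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for simd::float3: three floats padded and aligned to 16 bytes.
struct alignas(16) PaddedFloat3
{
    float x, y, z;
};

static_assert(sizeof(PaddedFloat3) == 16, "three floats plus 4 bytes of padding");

// Same vertex data as the tutorial's triangle.
static const PaddedFloat3 kTriangleVertices[] =
{
    {-0.5f, -0.5f, 0.0f},
    { 0.5f, -0.5f, 0.0f},
    { 0.0f,  0.5f, 0.0f}
};

// Size in bytes that would be passed as the buffer length.
inline std::size_t vertexBufferSizeInBytes()
{
    return sizeof(kTriangleVertices);
}

// Byte offset of a given vertex inside the buffer.
inline std::size_t vertexOffsetInBytes(std::size_t index)
{
    assert(index < 3);
    return index * sizeof(PaddedFloat3);
}
```

If the shader declared its input as packed_float3 instead, each vertex would be 12 bytes and the CPU-side array would need a matching layout.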

Writing the Shader

Shaders are small programs that run on the GPU and control how vertices and pixels are processed during rendering.
They’re essential in modern graphics programming because they determine how objects appear on screen.

The most common types are:

Vertex Shader: Processes each vertex, transforming positions and attributes (e.g., from model space to screen space).
Fragment Shader: Runs per pixel, determining the pixel’s final color, often using lighting or textures.
Compute and Tessellation Shaders: Compute shaders run general-purpose parallel work outside the rendering pipeline, while tessellation is an optional render stage that subdivides geometry on the GPU.

While the term “shader” originally referred to simple shading routines, today shaders can implement entire physically based rendering (PBR) pipelines.

flowchart TB
    subgraph VertexStage [Vertex Shader Stage]
        V1["Vertex 0"] --> VS1["vertexShader() thread 0"]
        V2["Vertex 1"] --> VS2["vertexShader() thread 1"]
        V3["Vertex 2"] --> VS3["vertexShader() thread 2"]
    end

    subgraph Rasterizer
        VS1 --> R
        VS2 --> R
        VS3 --> R
    end

    subgraph FragmentStage [Fragment Shader Stage]
        R --> F1["fragmentShader() thread A"]
        R --> F2["fragmentShader() thread B"]
        R --> F3["fragmentShader() thread C"]
        R --> F4["...many more"]
    end

The GPU launches many threads in parallel — one per vertex or pixel — running each shader function instance independently.

What Are Normalized Device Coordinates (NDC)?

After the vertex shader runs, each vertex is represented in clip space as a 4D coordinate (x, y, z, w). The GPU then performs perspective division, dividing x, y, and z by w:

(x, y, z, w) → (x/w, y/w, z/w)

The result is in Normalized Device Coordinates (NDC), a space where:

x ∈ [-1, 1] → left to right
y ∈ [-1, 1] → bottom to top
z ∈ [0, 1] → near to far (unlike OpenGL, where z ranges from -1 to 1)

These coordinates are independent of screen resolution — for example, (0, 0) is always the center of the screen.

The rasterizer maps NDC to actual screen pixels using the viewport size, producing the final visible triangle.

flowchart LR
    A["Clip Space (float4)"] --> B["NDC Space (-1 to 1)"]
    B --> C["Viewport Transform"]
    C --> D["Screen Pixels"]

Geometry outside the visible volume is clipped before rasterization.
Clipping is performed in clip space, and the GPU then carries out the perspective division automatically; the shader never divides by w itself.
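The two transforms described in this section can be sketched on the CPU in a few lines. This is only an illustration of the arithmetic, not how the GPU is programmed: the struct and function names are hypothetical, and the viewport is assumed to start at the top-left corner of the window with pixel y pointing down.

```cpp
#include <cassert>

struct Float4 { float x, y, z, w; };
struct Float3 { float x, y, z; };
struct Float2 { float x, y; };

// Clip space -> NDC: divide x, y and z by w.
inline Float3 perspectiveDivide(Float4 clip)
{
    assert(clip.w != 0.0f);
    return { clip.x / clip.w, clip.y / clip.w, clip.z / clip.w };
}

// NDC -> pixel coordinates for a viewport whose origin is the top-left corner.
// NDC y points up while pixel y points down, hence the flip.
inline Float2 ndcToPixels(Float3 ndc, float viewportWidth, float viewportHeight)
{
    return { (ndc.x + 1.0f) * 0.5f * viewportWidth,
             (1.0f - ndc.y) * 0.5f * viewportHeight };
}
```

For an 800×600 viewport, the triangle’s top vertex (0, 0.5) lands at pixel (400, 150) and its bottom-left vertex (-0.5, -0.5) at (200, 450). Because the tutorial’s vertex shader sets w to 1.0, the division leaves the coordinates unchanged.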

GPU Parallelism

Modern GPUs are designed to process thousands of operations at the same time. This parallelism is what makes real-time rendering fast and efficient.

When you render a triangle:

The vertex shader runs once per vertex, so for example, in a triangle with 3 vertices, the vertex shader is executed 3 times — all in parallel.
The fragment shader runs once per pixel that the triangle covers on screen - that could mean thousands of invocations, depending on its size.

This parallelism allows Metal (and the GPU) to render complex scenes quickly by executing many small shader programs concurrently, rather than one after another like a CPU.

GPUs optimize for throughput (many operations in parallel), not latency (the speed of any single operation). This is why each shader invocation is independent: it runs at the same time as many others and cannot rely on their results or on global state shared between them.
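To get a feel for the difference in invocation counts, the sketch below counts the pixels whose centers fall inside a triangle given in pixel coordinates. It is a serial CPU approximation with hypothetical names; a real GPU rasterizer works in parallel and applies fill rules to pixels that lie exactly on an edge, so its count can differ slightly.

```cpp
#include <cassert>

struct Point { float x, y; };

// Signed area test: tells which side of the edge a->b the point p lies on.
inline float edgeFunction(Point a, Point b, Point p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Counts pixels whose center lies inside triangle abc (pixel coordinates).
// Each counted pixel corresponds to one fragment shader invocation.
inline int countCoveredPixels(Point a, Point b, Point c, int width, int height)
{
    assert(width > 0 && height > 0);
    int covered = 0;
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            const Point center = { x + 0.5f, y + 0.5f };
            const float e0 = edgeFunction(a, b, center);
            const float e1 = edgeFunction(b, c, center);
            const float e2 = edgeFunction(c, a, center);
            // Accept either winding order: all edge tests must agree in sign.
            const bool inside = (e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                                (e0 <= 0 && e1 <= 0 && e2 <= 0);
            if (inside)
            {
                ++covered;
            }
        }
    }
    return covered;
}
```

For the tutorial’s triangle in an 800×600 viewport, with corners at (200, 450), (600, 450), and (400, 150), the count is roughly 60,000, the triangle’s area in pixels, compared with only 3 vertex shader invocations.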

Vertex Shader

Here’s a simple vertex shader in Metal Shading Language (MSL):

#include <metal_stdlib>
using namespace metal;

struct VertexOut
{
    float4 position [[position]];
};

vertex VertexOut vertexShader(uint vertexID [[vertex_id]], constant float3* vertexPositions [[buffer(0)]])
{
    VertexOut out;
    out.position = float4(vertexPositions[vertexID], 1.0);
    return out;
}

The vertexShader function runs in parallel on the GPU, one instance per vertex. It receives:

vertexID: the index of the current vertex being processed.
vertexPositions: a pointer to a buffer (constant float3*) that holds all vertex positions. The constant keyword is an address space qualifier: it marks the buffer as read-only, allowing the GPU to optimize access.

We use vertexID to look up the correct vertex in the buffer.

Why is the position converted from float3 to float4?

Metal requires vertex shaders to output positions in clip space, a 4D coordinate system. The fourth component, w, is what the GPU divides x, y, and z by during perspective division to produce correct depth and perspective. Setting w to 1.0 leaves our coordinates unchanged, so the positions we supply are already the final NDC positions.

You can read more about why and how that happens in this wonderful article.

The result is then returned in a VertexOut struct, where position is tagged with [[position]] to tell Metal it’s the vertex’s position.

Fragment Shader

The fragmentShader function takes the interpolated output of the vertex stage (VertexOut in) and returns a float4 color. In this example, it returns the same orange for every pixel of the triangle:

fragment float4 fragmentShader(VertexOut in [[stage_in]])
{
    return float4(1.0, 0.5, 0.2, 1.0); // Orange color
}

Here the float4 is an RGBA color (red, green, blue, alpha), not a position.

Loading Shaders and Creating the Rendering Pipeline

To use the shaders, we first load them from the Metal shader library compiled by Xcode, then create a render pipeline state object.

The render pipeline state encapsulates the GPU’s rendering state: the shaders, the vertex input, and various graphics settings. It is configured once and can be reused to render similar objects efficiently. In this example, it is set up specifically for drawing our triangle.

The first step is to load the default library and retrieve the vertex and fragment shader functions by name:

metalDefaultLibrary = metalDevice->newDefaultLibrary();

MTL::Function* vertexShader = metalDefaultLibrary->newFunction(NS::String::string("vertexShader", NS::ASCIIStringEncoding));
MTL::Function* fragmentShader = metalDefaultLibrary->newFunction(NS::String::string("fragmentShader", NS::ASCIIStringEncoding));

newFunction returns nullptr if the name doesn’t match a function defined in the .metal file, so double-check the spelling.

Next, we create a RenderPipelineDescriptor, which holds the configuration for the GPU’s rendering pipeline. We label it to make debugging easier, then assign the vertex and fragment shaders that define how vertices and pixels are processed.

We then set the pixel format to match our render target’s color buffer and compile the descriptor into a RenderPipelineState object, the optimized pipeline that is reused every frame to encode rendering commands.

A larger application would create one pipeline state per rendering setup; here, a single one for drawing the triangle is enough.

First, set up the pipeline descriptor:

MTL::RenderPipelineDescriptor* renderPipelineDescriptor = MTL::RenderPipelineDescriptor::alloc()->init();

renderPipelineDescriptor->setLabel(NS::String::string("Triangle Rendering Pipeline", NS::ASCIIStringEncoding));
renderPipelineDescriptor->setVertexFunction(vertexShader);
renderPipelineDescriptor->setFragmentFunction(fragmentShader);
renderPipelineDescriptor->colorAttachments()->object(0)->setPixelFormat(metalLayer->pixelFormat());

Then create the optimized pipeline state:

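NS::Error* error = nullptr;

metalRenderPSO = metalDevice->newRenderPipelineState(renderPipelineDescriptor, &error);

// The descriptor is no longer needed once the pipeline state has been created.
renderPipelineDescriptor->release();

if (!metalRenderPSO)
{
    // error can still be null if Metal did not report a reason.
    const char* reason = error ? error->localizedDescription()->utf8String() : "unknown error";
    std::cerr << "[ERROR] Failed to create Render Pipeline State: " << reason << std::endl;
    std::exit(EXIT_FAILURE);
}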

Rendering Loop

For each frame you:

Get the next drawable from the Metal layer.
Create a command buffer and render pass descriptor.
Set clear color and configure render target.
Create a render command encoder.
Bind the pipeline state and vertex buffer.
Issue a draw call for 3 vertices (the triangle).
End encoding, present drawable, commit and wait for completion.

Example snippet inside the render loop:

@autoreleasepool
{
    metalDrawable = metalLayer->nextDrawable();

    if (!metalDrawable)
    {
        std::cerr << "[WARNING] metalDrawable is null. Possibly due to invalid layer size or window not drawable." << std::endl;
        continue;
    }

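The @autoreleasepool block drains autoreleased Objective-C objects, such as the drawable and the command buffer, at the end of every frame, so they don’t accumulate while the loop runs. Because @autoreleasepool is Objective-C syntax, this file must be compiled as Objective-C++ (.mm).

Metal’s C++ wrapper follows Cocoa’s ownership conventions rather than RAII: objects obtained from methods beginning with alloc, new, copy, or create are owned by the caller and must be released, while all other objects are autoreleased and must not be released unless you retain them first. A __bridge_retained cast, where one is used to hand an Objective-C object to C++ code, transfers ownership in the same way, so the receiver is responsible for releasing it.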

In Metal, a render pass is a sequence of rendering commands that process input resources—such as textures and buffers—through the graphics pipeline to produce a final output, typically an image.

To start a render pass, you create a command encoder configured with a render pass descriptor. This encoder records draw commands, sets the rendering pipeline state, and binds resources for the GPU. Once all commands are encoded, you end the encoding session, and the GPU executes the render pass after the command buffer is committed.

    metalCommandBuffer = metalCommandQueue->commandBuffer();

    MTL::RenderPassDescriptor* renderPassDescriptor = MTL::RenderPassDescriptor::alloc()->init();
    MTL::RenderPassColorAttachmentDescriptor* cd = renderPassDescriptor->colorAttachments()->object(0);

    cd->setTexture(metalDrawable->texture());
    cd->setLoadAction(MTL::LoadActionClear);
    cd->setClearColor(MTL::ClearColor(41.0f/255.0f, 42.0f/255.0f, 48.0f/255.0f, 1.0));
    cd->setStoreAction(MTL::StoreActionStore);

We begin encoding rendering commands into a command buffer by creating a render command encoder with the render pass descriptor.

Before issuing any draw calls, we must bind the render pipeline state—in this case, our previously created pipeline configured to draw triangles.

Next, we bind the vertex buffer containing the triangle’s vertex data. The draw call then specifies the primitive type (triangle), the starting vertex, and the number of vertices to draw.

Finally, we issue the draw call to render the triangle and end encoding, which finishes the render pass.

    MTL::RenderCommandEncoder* renderCommandEncoder = metalCommandBuffer->renderCommandEncoder(renderPassDescriptor);

    renderCommandEncoder->setRenderPipelineState(metalRenderPSO);
    renderCommandEncoder->setVertexBuffer(triangleVertexBuffer, 0, 0);

    renderCommandEncoder->drawPrimitives(MTL::PrimitiveTypeTriangle, 0, 3);
    renderCommandEncoder->endEncoding();

What Happens When You Call `drawPrimitives()`?

When you issue a draw call like this:

renderCommandEncoder->drawPrimitives(MTL::PrimitiveTypeTriangle, 0, 3);

you’re telling the GPU:

“Start rendering a triangle using 3 vertices starting from index 0.”

Here’s what Metal and the GPU do behind the scenes:

flowchart LR
    A["drawPrimitives()"] --> B["Vertex Fetch"]
    B --> C["Vertex Shader"]
    C --> D["Primitive Assembly"]
    
    D --> E["Clipping / Culling"]
    E --> F["Rasterization"]
    
    F --> G["Fragment Shader"]
    G --> H["Blending & Depth"]
    H --> I["Framebuffer"]

Vertex Fetch: Vertex data is read from your MTLBuffer. In this tutorial, the vertex function does this itself by indexing the buffer with vertexID.
Vertex Shader Execution: Runs your vertexShader() in parallel, once per vertex. Each invocation returns a clip-space position (and possibly other attributes).
Primitive Assembly: The GPU groups vertices into primitives, in this case one triangle.
Clipping & Culling: Primitives entirely outside the view volume are discarded, and those that cross its boundary are clipped to it.
Rasterization: The triangle is converted into a grid of fragments (potential pixels).
Fragment Shader Execution: Your fragmentShader() runs once per covered pixel, determining the final color.
Depth & Blending (optional): If depth testing or blending is enabled, the GPU performs those operations.
Framebuffer Write: The final pixel color is written to the render target (your drawable texture).

If you switch to indexed drawing later (using drawIndexedPrimitives()), the vertex fetch stage will use an index buffer to reuse vertices — more efficient for complex geometry.
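The saving from indexed drawing is easy to see with a quad, which needs two triangles. The sketch below is plain C++ with hypothetical names and no Metal calls; it assumes 16-byte vertices, like simd::float3, and 16-bit indices, and mimics what vertex fetch does conceptually when it resolves indices into vertices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical vertex type, padded to 16 bytes to mirror simd::float3.
struct Vertex { float x, y, z, padding; };

// A quad needs only 4 unique corner vertices...
inline std::vector<Vertex> quadVertices()
{
    return {
        {-0.5f, -0.5f, 0.0f, 0.0f},
        { 0.5f, -0.5f, 0.0f, 0.0f},
        { 0.5f,  0.5f, 0.0f, 0.0f},
        {-0.5f,  0.5f, 0.0f, 0.0f}
    };
}

// ...and 6 indices describing its two triangles; corners 0 and 2 are shared.
inline std::vector<std::uint16_t> quadIndices()
{
    return { 0, 1, 2, 2, 3, 0 };
}

// Conceptually what indexed vertex fetch does: look each vertex up through the index buffer.
inline std::vector<Vertex> expandIndexed(const std::vector<Vertex>& vertices,
                                         const std::vector<std::uint16_t>& indices)
{
    std::vector<Vertex> expanded;
    expanded.reserve(indices.size());
    for (std::uint16_t index : indices)
    {
        assert(index < vertices.size());
        expanded.push_back(vertices[index]);
    }
    return expanded;
}

// Total bytes needed for the vertex buffer plus the index buffer.
inline std::size_t indexedSizeInBytes(const std::vector<Vertex>& vertices,
                                      const std::vector<std::uint16_t>& indices)
{
    return vertices.size() * sizeof(Vertex) + indices.size() * sizeof(std::uint16_t);
}
```

Here the indexed quad takes 76 bytes (4 vertices plus 6 indices) against 96 bytes for 6 unindexed vertices. The gap widens as vertices carry more attributes and as more triangles share corners.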

After encoding draw commands, we instruct the command buffer to present the drawable, which schedules the rendered image to be displayed on screen.

We then commit the command buffer, sending all commands to the GPU for execution. To keep this example simple, we block until the GPU has finished before starting the next frame.

Finally, we release the render pass descriptor, which we created with alloc. The drawable and the command buffer are autoreleased, so the enclosing @autoreleasepool frees them.

    metalCommandBuffer->presentDrawable(metalDrawable);
    metalCommandBuffer->commit();
    metalCommandBuffer->waitUntilCompleted();

    renderPassDescriptor->release();
}

Cleanup

Always release Metal resources when closing your app to avoid memory leaks:

if (metalRenderPSO)        { metalRenderPSO->release();        metalRenderPSO = nullptr; }
if (triangleVertexBuffer)  { triangleVertexBuffer->release();  triangleVertexBuffer = nullptr; }
if (metalCommandQueue)     { metalCommandQueue->release();     metalCommandQueue = nullptr; }
if (metalDefaultLibrary)   { metalDefaultLibrary->release();   metalDefaultLibrary = nullptr; }
if (metalDevice)           { metalDevice->release();           metalDevice = nullptr; }

Render Pipeline Compilation & Optimization

In Metal, the render pipeline state is created using a MTLRenderPipelineDescriptor. It includes the shaders, pixel format, and other GPU state. Once you call:

metalRenderPSO = metalDevice->newRenderPipelineState(renderPipelineDescriptor, &error);

Metal compiles the pipeline into a MTLRenderPipelineState object — this is a GPU-optimized binary that tells the hardware exactly how to execute your shaders.

Why This Matters

Pipeline creation is expensive: Metal has to compile the shaders for the specific GPU, which can take milliseconds, so never do it inside your frame loop.

The pipeline state is immutable once created. If you need a different shader or pixel format, create a new pipeline state.

Create your pipelines during initialization and reuse them across frames.
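One way to guarantee that each configuration is built only once is to keep pipeline states in a small cache keyed by name. The sketch below shows the idea in plain C++. PipelineState and PipelineCache are hypothetical stand-ins, not part of Metal; in a real application the builder would call newRenderPipelineState and the cache would be filled during initialization.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for an object that is expensive to create,
// such as MTL::RenderPipelineState.
struct PipelineState
{
    std::string label;
};

class PipelineCache
{
public:
    using Builder = std::function<PipelineState(const std::string&)>;

    explicit PipelineCache(Builder builder) : m_builder(std::move(builder))
    {
        assert(static_cast<bool>(m_builder));
    }

    // Returns the cached state for this key, building it on first use only.
    const PipelineState& get(const std::string& key)
    {
        auto it = m_states.find(key);
        if (it == m_states.end())
        {
            ++m_buildCount;
            it = m_states.emplace(key, m_builder(key)).first;
        }
        return it->second;
    }

    // Number of times the expensive builder has actually run.
    int buildCount() const { return m_buildCount; }

private:
    Builder m_builder;
    std::map<std::string, PipelineState> m_states;
    int m_buildCount = 0;
};
```

Requesting the same key every frame then costs only a map lookup, and buildCount() makes accidental rebuilds easy to spot while debugging.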

Can You Cache It?

Yes.

Metal keeps a system-managed shader cache, so pipelines compiled in one run usually load faster the next time the app launches.
For explicit control, MTLBinaryArchive (macOS 11 and iOS 14 or later) lets you serialize compiled pipelines to disk and load them on later launches instead of recompiling.

Debug Tip

If your app stutters the first time something is rendered, or whenever a new shader is used, you’re probably creating pipelines at runtime.

Triple Buffering & The Drawable Lifecycle

When you render a frame in Metal, you’re not drawing directly to the screen. Instead, you draw into a drawable, a texture that CAMetalLayer hands out from a small pool (three by default). Rotating through three drawables is known as triple buffering.

Why Triple Buffering?

Triple buffering allows the GPU, CPU, and display to work in parallel without waiting for each other:

Frame N: GPU is rendering.
Frame N+1: CPU is preparing the next frame.
Frame N-1: Display is showing the last completed frame.

This reduces screen tearing and stalls, keeping frame delivery smooth.
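Note that the render loop in this tutorial calls waitUntilCompleted() every frame, so the CPU and GPU never actually overlap. To let the CPU run ahead, an application has to track how many frames are in flight and which set of per-frame resources each one uses. The sketch below shows only that bookkeeping. FramesInFlight is a hypothetical class; real code would typically block on a semaphore signalled from the command buffer’s completion handler instead of returning -1.

```cpp
#include <cassert>

// Tracks how many frames the CPU has submitted that the GPU has not finished yet.
class FramesInFlight
{
public:
    explicit FramesInFlight(int maxFrames) : m_max(maxFrames)
    {
        assert(maxFrames > 0);
    }

    // Call before encoding a frame. Returns the per-frame resource slot to use,
    // or -1 if the CPU is already maxFrames ahead and has to wait.
    int tryBegin()
    {
        if (m_inFlight == m_max)
        {
            return -1;
        }
        const int slot = static_cast<int>(m_frameNumber % static_cast<unsigned long>(m_max));
        ++m_frameNumber;
        ++m_inFlight;
        return slot;
    }

    // Call when the GPU reports that a frame has completed.
    void frameCompleted()
    {
        assert(m_inFlight > 0);
        --m_inFlight;
    }

private:
    int m_max;
    int m_inFlight = 0;
    unsigned long m_frameNumber = 0;
};
```

With a limit of 3, the CPU can begin frames in slots 0, 1, and 2 back to back; a fourth attempt is refused until the GPU finishes one, after which slot 0 is safe to reuse.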

Lifecycle of a Drawable

Acquire a drawable:

metalDrawable = metalLayer->nextDrawable();

This gives you a texture to render into.

Render to it using a command buffer.

Present it:

metalCommandBuffer->presentDrawable(metalDrawable);

The drawable returns to the layer’s pool automatically once it has been displayed; you never release it yourself.

flowchart LR
    A["nextDrawable()"] --> B["Render to Texture"]
    B --> C["presentDrawable()"]
    C --> D["Displayed on Screen"]

If nextDrawable() returns nullptr, it usually means your window is minimized or has zero size — skip the frame in that case.

Build And Run

Build (Cmd+B) and Run (Cmd+R) the project. If everything is set up correctly, you should see a nice triangle pop up in the middle of your window!

Download The Project Files

If you encounter issues with the code or simply want to test with the correct setup, you can download the latest project files below:

Download Project Files
