Running ONNX Models on Apple Silicon

Updated May 24, 2026
Tags: #AI #ML #ONNX #Swift #Python #macOS

ONNX is the closest thing ML has to a universal binary format. You train in PyTorch, export to ONNX, and then deploy anywhere — Windows, Linux, browsers, or a Mac app. The “anywhere” part gets interesting on Apple Silicon because the hardware has three distinct compute units (CPU, GPU, Neural Engine) and ONNX Runtime can target all of them through the CoreML Execution Provider.

This guide covers two paths: running ONNX models from Python for prototyping, and embedding them into a macOS app for production use.


Python: The Quick Path

Install ONNX Runtime with the CoreML backend:

pip install onnxruntime

No separate GPU package is needed on macOS — the standard onnxruntime wheel includes CoreML support. Load a model and select the provider:

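import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CoreMLExecutionProvider"]
)

# Look up the input name instead of assuming the model calls it "input".
input_name = session.get_inputs()[0].name

# Run inference on a random NCHW batch.
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: input_data})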

That’s it. ONNX Runtime detects Apple Silicon at runtime and dispatches compatible ops to CoreML, which in turn routes them to the GPU or ANE depending on the model.
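If the same script also has to run on machines without the CoreML provider (a Linux CI runner, say), it can help to build the provider list from what is actually available and to print what the session registered. A minimal sketch: `choose_providers` and `open_session` are hypothetical helper names, while `get_available_providers()` and `get_providers()` are the ONNX Runtime calls they rely on.

```python
def choose_providers(available, coreml_options=None):
    """Prefer CoreML when it is available and always end with the CPU provider."""
    providers = []
    if "CoreMLExecutionProvider" in available:
        if coreml_options:
            providers.append(("CoreMLExecutionProvider", dict(coreml_options)))
        else:
            providers.append("CoreMLExecutionProvider")
    providers.append("CPUExecutionProvider")
    return providers


def open_session(model_path):
    # Imported lazily so choose_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    session = ort.InferenceSession(model_path, providers=providers)
    # get_providers() lists the providers this session actually registered.
    print("Session providers:", session.get_providers())
    return session
```

Calling `open_session("model.onnx")` on an Apple Silicon Mac should list the CoreML provider first; on other machines the list contains only the CPU provider.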

CoreML EP Configuration

The default settings work, but you’ll want to tune them for production:

providers = [
    (
        "CoreMLExecutionProvider",
        {
            "ModelFormat": "MLProgram",
            "MLComputeUnits": "CPUAndNeuralEngine",
            "RequireStaticInputShapes": "1",
        },
    )
]
  • ModelFormat: "MLProgram" uses the modern MLProgram format (Core ML 5+, macOS 12+). The older "NeuralNetwork" format is still useful for compatibility with older OS releases, but MLProgram usually has better operator coverage for current Apple Silicon models.
  • MLComputeUnits: "CPUAndNeuralEngine" routes supported ops to the ANE for low-power inference. "CPUAndGPU" avoids the ANE and can be faster for certain graph shapes. "ALL" lets Core ML decide.
  • RequireStaticInputShapes: "1" only lets CoreML EP claim nodes with static input shapes. Dynamic shapes are allowed by default, but ONNX Runtime warns that they can hurt performance.
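Because provider options are plain strings, a typo in a key or value is easy to miss. The sketch below validates the three options before they reach ONNX Runtime. `coreml_options` and `ALLOWED` are hypothetical names, and the "CPUOnly" and "0" values are assumptions added beyond the list above.

```python
# Values from the option list above, plus "CPUOnly" and "0" (assumed, not from the text).
ALLOWED = {
    "ModelFormat": {"MLProgram", "NeuralNetwork"},
    "MLComputeUnits": {"CPUOnly", "CPUAndNeuralEngine", "CPUAndGPU", "ALL"},
    "RequireStaticInputShapes": {"0", "1"},
}


def coreml_options(**options):
    """Validate CoreML EP options and return them as a str -> str dict."""
    checked = {}
    for key, value in options.items():
        if key not in ALLOWED:
            raise ValueError(f"unknown CoreML EP option: {key}")
        value = str(value)  # provider options are passed as strings
        if value not in ALLOWED[key]:
            raise ValueError(f"{key}={value!r} is not one of {sorted(ALLOWED[key])}")
        checked[key] = value
    return checked


providers = [
    (
        "CoreMLExecutionProvider",
        coreml_options(
            ModelFormat="MLProgram",
            MLComputeUnits="CPUAndNeuralEngine",
            RequireStaticInputShapes=1,  # the int is normalized to "1"
        ),
    ),
    "CPUExecutionProvider",
]
```

Failing early in your own code is usually easier to track down than a misconfigured session discovered during benchmarking.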

Smoke Test with a Real Model

Before tuning, run the same model through CPU EP and CoreML EP. Any image classifier with a fixed 1x3x224x224 input works for this check, such as MobileNet, SqueezeNet, or ResNet exported to ONNX:

import time

import numpy as np
import onnxruntime as ort


def benchmark(model_path, providers, runs=30):
    session = ort.InferenceSession(model_path, providers=providers)
    input_meta = session.get_inputs()[0]
    input_name = input_meta.name

    # Most ONNX image classifiers use NCHW layout.
    image = np.random.randn(1, 3, 224, 224).astype(np.float32)

    session.run(None, {input_name: image})  # warmup

    start = time.perf_counter()
    for _ in range(runs):
        outputs = session.run(None, {input_name: image})
    latency_ms = (time.perf_counter() - start) * 1000 / runs

    return latency_ms, [output.shape for output in outputs]


coreml_providers = [
    (
        "CoreMLExecutionProvider",
        {
            "ModelFormat": "MLProgram",
            "MLComputeUnits": "ALL",
            "RequireStaticInputShapes": "1",
        },
    ),
    "CPUExecutionProvider",
]

print("Available providers:", ort.get_available_providers())
print("CPU:", benchmark("model.onnx", ["CPUExecutionProvider"]))
print("CoreML:", benchmark("model.onnx", coreml_providers))

Keep CPUExecutionProvider as the fallback provider. If CoreML EP cannot claim part of the graph, ONNX Runtime can still finish the run on CPU instead of failing outright. For a useful comparison, ignore the first run because Core ML may compile the model before the first inference.
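A single mean can hide a slow outlier, so it may be worth timing each run separately and reporting the median and a high percentile. This is a sketch under that assumption; `time_calls` and `summarize` are hypothetical helpers that work with any zero-argument callable.

```python
import math
import statistics
import time


def time_calls(fn, runs=30, warmup=1):
    """Call fn warmup times untimed, then return per-call latencies in ms."""
    for _ in range(warmup):
        fn()  # discarded: may include Core ML compilation
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return samples


def summarize(samples_ms):
    """Return median, nearest-rank 95th percentile, and max latency."""
    ordered = sorted(samples_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


# Usage with a session from the benchmark above (names as in that function):
# samples = time_calls(lambda: session.run(None, {input_name: image}))
# print(summarize(samples))
```

A large gap between the median and the p95 is a hint that something other than steady-state compute, such as compilation or fallback transfers, is showing up in some runs.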

Partition Fragmentation: The Silent Killer

The CoreML EP does not support every ONNX op. Unsupported ops fall back to the CPU, which forces a data transfer between CPU and ANE/GPU memory. Each fallback creates a new partition boundary, so a model with many unsupported ops can fragment into a dozen or more partitions, each requiring a round trip.

The most common culprits on Apple Silicon:

  • Pad(mode=reflect) — CoreML supports only a subset of padding cases. Reflect padding often forces CPU fallback; in one production TTS model, 14 Pad nodes created 14 partition boundaries and degraded inference by 36% compared to a fused equivalent.
  • Resize with dynamic scales — Dynamic Resize (bilinear upsampling with a scale factor) also falls back to CPU. Convert to fixed-shape resize or export with explicit output sizes.
  • Gather / ScatterND — These ops can trigger partition splits in some model architectures. If the graph shape allows it, replacing a Gather pattern with Split + Squeeze during export is worth benchmarking.
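The culprits listed above can be flagged before any benchmarking by walking the exported graph. The sketch below assumes the `onnx` package is installed; `load_nodes` and `find_risky_nodes` are hypothetical helpers. It only marks candidates, since whether the CoreML EP actually rejects a node depends on the ONNX Runtime version.

```python
def load_nodes(model_path):
    """Read (name, op_type, attrs, dynamic_inputs) tuples from an ONNX file."""
    import onnx  # lazy import: only needed when reading a real model
    from onnx import helper

    graph = onnx.load(model_path).graph
    constant_names = {init.name for init in graph.initializer}
    constant_names.update(
        out for node in graph.node if node.op_type == "Constant" for out in node.output
    )
    nodes = []
    for node in graph.node:
        attrs = {a.name: helper.get_attribute_value(a) for a in node.attribute}
        # Inputs after the first that are not constants are computed at run time.
        dynamic = [i for i in node.input[1:] if i and i not in constant_names]
        nodes.append((node.name, node.op_type, attrs, dynamic))
    return nodes


def find_risky_nodes(nodes):
    """Return (name, reason) pairs for ops that often fall back to the CPU."""
    flagged = []
    for name, op_type, attrs, dynamic_inputs in nodes:
        mode = attrs.get("mode", b"")
        if isinstance(mode, bytes):  # onnx returns string attributes as bytes
            mode = mode.decode()
        if op_type == "Pad" and mode == "reflect":
            flagged.append((name, "Pad(mode=reflect)"))
        elif op_type == "Resize" and dynamic_inputs:
            flagged.append((name, "Resize with run-time scales or sizes"))
        elif op_type in ("Gather", "ScatterND"):
            flagged.append((name, op_type + " (benchmark before rewriting)"))
    return flagged
```

Running `find_risky_nodes(load_nodes("model.onnx"))` gives a short list of nodes to look at first when the verbose log shows CPU fallbacks.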

Diagnose fragmentation with verbose logging:

ort.set_default_logger_severity(0)  # verbose
session = ort.InferenceSession("model.onnx", providers=["CoreMLExecutionProvider"])

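Look for lines like Node(s) placed on [CPUExecutionProvider]. The nodes reported there fell back to the CPU, and each group of them that sits between CoreML-claimed nodes adds a partition boundary.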
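Verbose output is long, so a small filter helps. ONNX Runtime writes its log from native code to stderr, so one option is to redirect it to a file (for example `python script.py 2> ort.log`) and scan that. The sketch assumes only that placement lines contain the `placed on [ProviderName]` fragment quoted above; the rest of the line format is not relied on, and `providers_in_log` is a hypothetical helper.

```python
import re

# Matches the fragment quoted above, e.g. "placed on [CPUExecutionProvider]".
PLACEMENT = re.compile(r"placed on \[(\w+)\]")


def providers_in_log(log_text):
    """Count placement lines per execution provider in captured log text."""
    counts = {}
    for line in log_text.splitlines():
        match = PLACEMENT.search(line)
        if match:
            provider = match.group(1)
            counts[provider] = counts.get(provider, 0) + 1
    return counts


# Usage, assuming the log was redirected to ort.log:
# with open("ort.log") as handle:
#     print(providers_in_log(handle.read()))
```

If CPUExecutionProvider shows up in the result for a session that was meant to run on CoreML, at least part of the graph fell back.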

When to Use CPU EP Instead

The CPU Execution Provider on Apple Silicon is faster than you’d expect. Apple’s BNNS and AMX paths accelerate matrix operations directly on the CPU. For models with many small ops, the cost of ANE dispatch overhead can outweigh the per-op speedup, making CPU EP the faster choice.

A useful heuristic from profiling app workloads: if CoreML EP is only marginally faster than CPU EP in steady state, and ANE/GPU time is a small slice of total runtime, partition overhead is probably eating your gains. Drop to CPU EP or switch to MLX.
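The heuristic can be written down as a small decision helper. This is a sketch with loud assumptions: the thresholds `min_speedup` and `min_share` are hypothetical placeholders for "marginally faster" and "a small slice", and the event fields (`cat`, `dur`, `args.provider`) are what ONNX Runtime's profiling JSON is assumed to contain, so check them against a real profile before trusting the numbers.

```python
def provider_time_share(events):
    """Return each provider's share of total per-node time in a profile."""
    totals = {}
    for event in events:
        if event.get("cat") != "Node":
            continue  # skip session-level and other non-node events
        provider = event.get("args", {}).get("provider", "unknown")
        totals[provider] = totals.get(provider, 0) + event.get("dur", 0)
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {provider: total / grand_total for provider, total in totals.items()}


def suggest_provider(cpu_ms, coreml_ms, coreml_share, min_speedup=1.2, min_share=0.5):
    """Suggest CPU EP when CoreML is barely faster and does little of the work."""
    speedup = cpu_ms / coreml_ms
    if speedup < min_speedup and coreml_share < min_share:
        return "CPUExecutionProvider"
    return "CoreMLExecutionProvider"


# Getting events from a real run (standard ONNX Runtime profiling calls):
#   options = ort.SessionOptions(); options.enable_profiling = True
#   session = ort.InferenceSession("model.onnx", options, providers=...)
#   ... run inference ...
#   events = json.load(open(session.end_profiling()))
```

The output is only a suggestion to benchmark the other configuration, not a verdict; the steady-state latency comparison remains the deciding measurement.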


Inside a macOS App: Swift Bindings

Shipping an ONNX model inside a macOS app means using ONNX Runtime’s Objective-C bindings via Swift Package Manager. Microsoft publishes onnxruntime-swift-package-manager as a binary XCFramework, so there’s no source compilation — just add the dependency.

// Package.swift
dependencies: [
    .package(url: "https://github.com/microsoft/onnxruntime-swift-package-manager",
             from: "1.24.2"),
],
targets: [
    .target(
        name: "MyApp",
        dependencies: [
            .product(
                name: "onnxruntime",
                package: "onnxruntime-swift-package-manager"
            )
        ]
    ),
]

Or in Xcode: File → Add Package Dependencies → enter the URL above.

The package provides the same Objective-C bindings as the CocoaPods distribution, exposed to Swift through OnnxRuntimeBindings. Configure the CoreML Execution Provider via appendExecutionProvider:

import Foundation
import OnnxRuntimeBindings

class ModelRunner {
    private var ortEnv: ORTEnv
    private var ortSession: ORTSession

    init(modelPath: String) throws {
        ortEnv = try ORTEnv(loggingLevel: ORTLoggingLevel.warning)
        let options = try ORTSessionOptions()
        try options.appendExecutionProvider(
            "CoreML",
            providerOptions: [
                "ModelFormat": "MLProgram",
                "MLComputeUnits": "CPUAndNeuralEngine",
                "RequireStaticInputShapes": "1",
            ]
        )
        // The Objective-C error parameter is bridged to `throws`, so no `error:` argument.
        try options.setGraphOptimizationLevel(ORTGraphOptimizationLevel.all)
        ortSession = try ORTSession(
            env: ortEnv,
            modelPath: modelPath,
            sessionOptions: options
        )
    }

    func predict(input: [Float]) throws -> [Float] {
        let shape: [NSNumber] = [1, 3, 224, 224]
        let inputData = NSMutableData(
            bytes: input,
            length: input.count * MemoryLayout<Float>.stride
        )
        let inputTensor = try ORTValue(
            tensorData: inputData,
            elementType: ORTTensorElementDataType.float,
            shape: shape
        )
        let outputs = try ortSession.run(
            withInputs: ["input": inputTensor],
            outputNames: Set(["output"]),
            runOptions: nil
        )
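        // Copy the "output" tensor back into a Swift array (assumes a float32 output).
        guard let outputTensor = outputs["output"] else {
            throw NSError(domain: "ModelRunner", code: 1)
        }
        let outputData = try outputTensor.tensorData() as Data
        return outputData.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }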
    }
}

The appendExecutionProvider method takes a provider name string and a dictionary of options. Use the new CoreML provider option keys (ModelFormat, MLComputeUnits, RequireStaticInputShapes) so the Swift and Python paths stay aligned.

Memory Management with autoreleasepool

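The ORTValue objects returned by the Objective-C bindings are not necessarily freed as soon as they go out of scope; they can stay alive until the enclosing autorelease pool drains. If you’re running a long generation loop (like autoregressive speech or text), tensors accumulate unless you scope each step in its own pool: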

for step in 0..<maxSteps {
    let nextToken: Int64 = try autoreleasepool {
        let outputs = try ortSession.run(
            withInputs: buildInputs(for: step),
            outputNames: Set(["logits"]),
            runOptions: nil
        )
        return sampleToken(from: outputs)
    }
}

Without autoreleasepool, memory can climb linearly with generation length. In a 1024-step speech generation loop, this was the difference between staying around 2GB and climbing past 16GB.

CoreML Compilation on First Launch

The first time you call ortSession.run() with the CoreML EP, ONNX Runtime compiles the MLProgram. This can take 10–15 seconds on an M-series Mac. Two strategies to handle this:

  1. Background warmup: run a dummy inference on a background thread immediately after app launch, so the first real inference does not pay the compilation cost.
  2. Model caching: enable the CoreML EP model cache. ONNX Runtime persists compiled Core ML artifacts there, skipping compilation on subsequent launches.

The cache is enabled through the ModelCacheDirectory provider option:

try options.appendExecutionProvider(
    "CoreML",
    providerOptions: [
        "ModelFormat": "MLProgram",
        "MLComputeUnits": "ALL",
        "ModelCacheDirectory": cacheDirectory.path,
    ]
)
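The same option works from Python. One thing to watch with any compile cache is staleness after the model file changes. The sketch below sidesteps that by keying the cache directory on a SHA-256 of the model file, so a re-exported model gets a fresh directory. The directory layout and the helper names `cache_dir_for` and `cached_coreml_providers` are assumptions for illustration, not something ONNX Runtime requires.

```python
import hashlib
from pathlib import Path


def cache_dir_for(model_path, cache_root):
    """Return (and create) a cache directory unique to the model's contents."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as handle:
        # Hash in 1 MiB chunks so large models are not read into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    directory = Path(cache_root) / digest.hexdigest()[:16]
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def cached_coreml_providers(model_path, cache_root):
    """Build a provider list that points the CoreML EP at a per-model cache."""
    return [
        (
            "CoreMLExecutionProvider",
            {
                "ModelFormat": "MLProgram",
                "MLComputeUnits": "ALL",
                "ModelCacheDirectory": str(cache_dir_for(model_path, cache_root)),
            },
        ),
        "CPUExecutionProvider",
    ]
```

Old directories are never cleaned up in this sketch, so a shipping app would want to prune entries that no longer match the bundled model.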

Thread Safety

ONNX Runtime can run work in parallel internally, and some APIs allow concurrent run calls. In a Swift app, the safer pattern is still to isolate each session behind one concurrency boundary, especially when the model uses CoreML EP and shares mutable app-level buffers. An actor is the cleanest version of that:

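actor InferenceActor {
    private let session: ORTSession

    // A stored `let` needs an initializer; the session is created by the caller.
    init(session: ORTSession) {
        self.session = session
    }

    func run(inputs: [String: ORTValue]) throws -> [String: ORTValue] {
        try session.run(
            withInputs: inputs,
            outputNames: Set(["output"]),
            runOptions: nil
        )
    }
}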

This serializes access without locks and integrates naturally with Swift Concurrency. If you need true parallel inference, benchmark separate ORTSession instances instead of sharing one session across tasks.
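When prototyping the same design in Python, the one-boundary-per-session idea can be approximated with a lock around the run call. This is not an actor (Python has no direct equivalent, so a plain `threading.Lock` is swapped in), and `SerializedRunner` is a hypothetical wrapper name.

```python
import threading


class SerializedRunner:
    """Let only one caller at a time into the wrapped run function."""

    def __init__(self, run_fn):
        self._run_fn = run_fn
        self._lock = threading.Lock()

    def run(self, *args, **kwargs):
        # Callers from other threads block here until the current run finishes.
        with self._lock:
            return self._run_fn(*args, **kwargs)


# Usage with an existing session:
# runner = SerializedRunner(lambda feeds: session.run(None, feeds))
# outputs = runner.run({"input": input_data})
```

As with the actor, this trades throughput for predictability; separate sessions remain the thing to benchmark if parallel inference matters.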


Performance Table

Example steady-state measurements for a ResNet-50 style model on M3 Max:

Configuration            | Latency | Notes
CPU EP                   | 45 ms   | Good baseline
CoreML (NeuralNetwork)   | 52 ms   | Old format, Cast overhead
CoreML (MLProgram, GPU)  | 18 ms   | GPU path, fast compute
CoreML (MLProgram, ANE)  | 12 ms   | Low power, great for sustained workloads

In this benchmark, the ANE path wins for sustained inference workloads. The GPU path can be better for bursty, high-throughput scenarios, so treat this table as a starting point rather than a rule.
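To make the rows easier to compare, the lines below turn each latency into a speedup over the CPU EP baseline and an approximate single-stream throughput. The numbers are copied from the table; nothing here is a new measurement.

```python
baseline_ms = 45.0  # CPU EP row

latencies_ms = {
    "CoreML (NeuralNetwork)": 52.0,
    "CoreML (MLProgram, GPU)": 18.0,
    "CoreML (MLProgram, ANE)": 12.0,
}

# Speedup > 1 means faster than the CPU EP baseline.
speedups = {name: baseline_ms / ms for name, ms in latencies_ms.items()}

for name, factor in speedups.items():
    per_second = 1000 / latencies_ms[name]
    print(f"{name}: {factor:.2f}x vs CPU EP, about {per_second:.0f} inferences/s")

# CoreML (NeuralNetwork): 0.87x vs CPU EP, about 19 inferences/s
# CoreML (MLProgram, GPU): 2.50x vs CPU EP, about 56 inferences/s
# CoreML (MLProgram, ANE): 3.75x vs CPU EP, about 83 inferences/s
```

The NeuralNetwork row comes out slower than the CPU baseline, which is why the model format matters as much as the choice of compute unit.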

ONNX vs MLX for macOS Apps

MLX is Apple’s research framework — excellent for training and experimentation on Apple Silicon, but it lacks a streamlined deploy story. Exporting MLX models into apps currently requires either converting to CoreML or writing the inference loop manually.

ONNX Runtime fills that gap. It’s the production path: export once, then run the same model file on macOS, iOS, Windows, and Linux. The tradeoff is that MLX can be faster for models that exploit its unified memory model directly, but ONNX Runtime gives you cross-platform portability out of the box.

ONNX Runtime also fits cleanly beside native media frameworks. Use AVFoundation, Vision, or NaturalLanguage for capture and preprocessing, then pass tensors into ONNX Runtime. If you need direct Vision/Core ML model APIs, convert the model to Core ML instead.


Real-World Example: Spokio

Spokio is a macOS TTS app that runs Chatterbox Turbo entirely on-device using ONNX Runtime’s Swift bindings. The model has four ONNX components (speech encoder, embedding table, autoregressive language model, conditional decoder), each loaded as a separate session. The CoreML EP routes the encoder and decoder to the ANE while the LM runs on the GPU — the best of both compute units for different parts of the same pipeline.

The generation loop runs 1024 autoregressive steps with per-step autoreleasepool scoping, keeping memory flat at ~2GB. Without that scoping, memory climbs past 16GB and latency spikes to 50s. The full inference pipeline is described in detail in Writing a Swift Inference Pipeline for Chatterbox Turbo.

This is the pattern that makes ONNX Runtime compelling: one model format, three compute units, and a shipping app that users download from the Mac App Store with no server component.


Summary

Use Case                        | Recommended Approach
Prototyping / research          | Python + CoreMLExecutionProvider with ModelFormat="MLProgram"
Shipping a Mac App              | SPM onnxruntime-swift-package-manager, wrap session in Swift actor
Model has many unsupported ops  | Profile partition count; consider CPU EP or re-export
Cross-platform deployment       | ONNX format; one export, many runtimes
Pure Apple Silicon research     | MLX (training/flexibility)

ONNX Runtime on Apple Silicon is pragmatic and mature. The CoreML EP gives you hardware acceleration without vendor lock-in, and the Python and Swift paths cover the full development lifecycle from notebook to shipped product.