ONNX Runtime vs Core ML on Apple Silicon

Updated Jun 02, 2026
#ai #ml #coreml #onnx #silicon

Core ML is Apple’s native inference framework. ONNX Runtime is Microsoft’s cross-platform inference engine, with official Apple Silicon support via its Core ML execution provider. Both can run models on the same hardware, but they take different paths to get there.

This comparison covers the technical differences: how each framework maps operations to Apple Silicon hardware, conversion workflows, operator coverage, performance characteristics, and real-world tradeoffs for shipping a model on macOS or iOS.

Architecture overview

Core ML

Core ML takes a model in its compiled .mlmodelc format and distributes operations across the CPU, GPU, and Neural Engine using a runtime planner. The planner decides which ops run on which processor based on the model’s structure and the capabilities of the device.

Source model (PyTorch/TF)
  → coremltools converts it to a .mlpackage
  → Xcode compiles the .mlpackage to .mlmodelc at build time (or the app compiles it on device)
  → Core ML runtime dispatches to CPU/GPU/ANE

The key advantage: the conversion + compilation step allows Core ML to optimize the execution plan, including fusing adjacent operations and selecting compatible compute units.

ONNX Runtime

ONNX Runtime loads a model in ONNX format and executes it through a configurable set of execution providers. On Apple Silicon, the relevant providers are:

Core ML provider — delegates supported ops to Core ML, which then routes to ANE/GPU/CPU as appropriate.
CPU provider — default fallback for ops that are not assigned to another provider.
XNNPACK provider — lightweight CPU backend optimized for mobile and ARM.

ONNX model (.onnx)
  → ONNX Runtime loads the model
  → Session configured with execution providers
  → Each op is dispatched to the first provider that handles it

The typical configuration on Apple Silicon uses the Core ML provider as the primary execution backend, falling back to CPU for ops Core ML does not support.
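
The dispatch rule above can be illustrated with a toy model. This sketch is not ONNX Runtime code: `assign_nodes` and the per-provider op sets are hypothetical, and the real partitioner also checks attributes and shapes and claims whole subgraphs rather than single nodes.

```python
from typing import Dict, List, Set

# Hypothetical op sets, for illustration only. The real supported-op lists
# depend on the ONNX Runtime version, model format, attributes, and shapes.
PROVIDER_OPS: Dict[str, Set[str]] = {
    "CoreML": {"Conv", "Relu", "MatMul", "Add"},
    "CPU": {"Conv", "Relu", "MatMul", "Add", "NonZero", "TopK"},
}


def assign_nodes(op_types: List[str], priority: List[str]) -> List[str]:
    """Assign each node to the first provider in priority order that supports it."""
    assignment = []
    for op in op_types:
        for provider in priority:
            if op in PROVIDER_OPS[provider]:
                assignment.append(provider)
                break
        else:
            # No provider in the list claimed this op.
            raise ValueError(f"no registered provider supports {op}")
    return assignment


graph = ["Conv", "Relu", "NonZero", "MatMul", "Add"]
print(assign_nodes(graph, ["CoreML", "CPU"]))
# ['CoreML', 'CoreML', 'CPU', 'CoreML', 'CoreML']
```

In this toy graph, a single op outside the Core ML set is enough to route one node to the CPU while its neighbors stay on Core ML.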

Conversion workflow

Core ML conversion

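import coremltools as ct
import torch

# torch_model is your trained torch.nn.Module (placeholder name).
torch_model.eval()
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(torch_model, example_input)

# Convert the TorchScript model
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS18,
)

mlmodel.save("Model.mlpackage")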

The unified coremltools conversion API supports PyTorch and TensorFlow source models. The older ONNX-to-Core ML converter is frozen and no longer maintained. If your source model is ONNX, either export from the original framework for Core ML conversion or run the ONNX model with ONNX Runtime.

ONNX Runtime setup

Microsoft publishes an official ONNX Runtime Swift package that exposes the Objective-C bindings to Swift:

dependencies: [
    .package(
        url: "https://github.com/microsoft/onnxruntime-swift-package-manager",
        from: "1.24.2"
    )
]

import OnnxRuntimeBindings

let env = try ORTEnv(loggingLevel: .warning)
let options = try ORTSessionOptions()
try options.appendExecutionProvider(
    "CoreML",
    providerOptions: ["ModelFormat": "MLProgram"]
)
let session = try ORTSession(
    env: env,
    modelPath: modelPath,
    sessionOptions: options
)

ONNX Runtime loads the .onnx file directly. The Core ML provider converts and compiles supported subgraphs internally, while unsupported nodes fall back to the default CPU provider.

Performance comparison

Benchmark results vary by model architecture. General observations:

Scenario	Core ML	ONNX Runtime (Core ML EP)
ANE-compatible model (e.g., MobileNet)	Can use ANE	Can delegate supported subgraphs to Core ML
GPU-compatible model	Can use GPU	Can delegate supported subgraphs to Core ML
Unsupported operations	May require a different export strategy	Uses CPU fallback when a compatible kernel exists
Mixed-precision quantization	Yes (FP16, INT8, palettization)	Depends on model and provider support
Startup time	Compile before shipping or when loading a model	Core ML EP compilation can add startup cost
Binary size	Small runtime footprint (system framework)	Larger (includes ONNX Runtime library + `.onnx`)

When Core ML is faster

Core ML has direct access to the Neural Engine planner. For models that fit the ANE’s constraints, Core ML can be faster and more power-efficient than CPU fallback.

ONNX Runtime with the Core ML provider can approach native Core ML performance when most of the graph is delegated. Measure your actual model: unsupported nodes can split the graph into partitions and add overhead.
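
To see why unsupported nodes matter, it can help to count the partitions that a node-to-provider assignment produces. A minimal sketch, assuming a linear graph listed in topological order; `count_partitions` is a hypothetical helper, not an ONNX Runtime API, and the real partitioner works on the full graph.

```python
from itertools import groupby
from typing import List, Tuple


def count_partitions(assignment: List[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive nodes on the same provider into (provider, node_count) runs."""
    return [(provider, len(list(run))) for provider, run in groupby(assignment)]


# One unsupported node in the middle splits a single Core ML partition into two,
# with a CPU partition between them.
fully_delegated = ["CoreML"] * 5
one_fallback = ["CoreML", "CoreML", "CPU", "CoreML", "CoreML"]

print(count_partitions(fully_delegated))  # [('CoreML', 5)]
print(count_partitions(one_fallback))     # [('CoreML', 2), ('CPU', 1), ('CoreML', 2)]
```

Each extra partition boundary is a point where tensors cross between providers, which is why the partition count is worth checking before comparing latencies.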

When ONNX Runtime is faster

For models with operations that the Core ML provider cannot delegate, ONNX Runtime can still execute remaining nodes with its default CPU provider. This makes ONNX Runtime more flexible, but mixed-provider execution is not automatically faster.

ONNX Runtime also supports dynamic shapes. The Core ML provider allows dynamic shapes by default, but its documentation warns that they may reduce performance. Benchmark variable-length workloads on target devices.
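
One way to benchmark variable-length workloads is to time the same call across several input lengths and compare medians. The harness below is a sketch using only the standard library; `benchmark_lengths` and `fake_inference` are hypothetical names, and the stand-in workload should be replaced with a call that runs your own session on an input of the given length.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def benchmark_lengths(
    run: Callable[[int], object],
    lengths: Iterable[int],
    warmup: int = 2,
    repeats: int = 10,
) -> Dict[int, float]:
    """Return the median latency in milliseconds for each input length."""
    results = {}
    for length in lengths:
        # Warm-up runs keep one-time setup costs out of the measured samples.
        for _ in range(warmup):
            run(length)
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(length)
            samples.append((time.perf_counter() - start) * 1000.0)
        results[length] = statistics.median(samples)
    return results


# Stand-in workload (assumption): replace with real inference on your model.
def fake_inference(length: int) -> int:
    return sum(range(length * 1000))


for length, ms in benchmark_lengths(fake_inference, [16, 64, 256]).items():
    print(f"length={length:4d}  median={ms:.3f} ms")
```

Running the harness once with dynamic shapes allowed and once with a fixed, padded shape gives a direct comparison on the target device.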

Operator support

Core ML and ONNX use different model representations and operator sets. The coremltools converter handles many common PyTorch and TensorFlow operations, but unsupported operations may require a different export strategy or custom handling.

ONNX Runtime supports a broad range of ONNX operators. If an op is not handled by the Core ML provider, it falls through to the default CPU provider when a compatible CPU kernel exists. Custom operators still require registration.

Core ML provider limitations

The ONNX Runtime Core ML provider publishes separate supported-op lists for its NeuralNetwork and MLProgram formats. Support also depends on attributes and shape constraints. Common limitations include:

Op / Pattern	Core ML provider limitation
Convolution and pooling	Some variants and dimensions are unsupported
Resize	Only specific combinations of attributes are supported
Slice	Several inputs must be constant
MatMul / Gemm	Some inputs or attributes are constrained
Control flow	Delegation inside `If`, `Loop`, and `Scan` requires `EnableOnSubgraphs`
Dynamic shape inputs	Allowed by default, but may reduce performance

Execution provider configuration

ONNX Runtime lets you prioritize providers and configure per-provider options:

let options = try ORTSessionOptions()

try options.appendExecutionProvider("CoreML", providerOptions: [
    "ModelFormat": "MLProgram",
    "MLComputeUnits": "ALL",
    "RequireStaticInputShapes": "0",
    "EnableOnSubgraphs": "0",
])

// CPU is the default fallback for nodes not assigned to Core ML.

Execution-provider priority matters when you register multiple providers. Each provider claims compatible nodes or subgraphs in priority order. The default CPU provider handles compatible nodes that remain unassigned.

For complex models, set ModelCacheDirectory so the compiled Core ML subgraphs can be reused. Without a cache, Core ML EP compilation can add significant startup time.
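
For reference, the same options can be expressed with the ONNX Runtime Python package, which is convenient for testing a cache setup on a Mac before wiring it into an app. This is a sketch under assumptions: `coreml_providers` is a hypothetical helper, the cache path is a placeholder, and the session call is left as a comment because it needs `onnxruntime` and a real model file.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, str]]]


def coreml_providers(cache_dir: Path) -> List[Provider]:
    """Build a provider list with the Core ML EP first and CPU as fallback."""
    # Create the cache directory up front so the path is known to exist.
    cache_dir.mkdir(parents=True, exist_ok=True)
    coreml_options = {
        "ModelFormat": "MLProgram",
        "MLComputeUnits": "ALL",
        "ModelCacheDirectory": str(cache_dir),
    }
    return [("CoreMLExecutionProvider", coreml_options), "CPUExecutionProvider"]


# Placeholder location for the demo; a real app would use a persistent
# directory so the cache survives between launches.
providers = coreml_providers(Path(tempfile.gettempdir()) / "coreml_cache_demo")
print(providers)

# Assumed usage with the onnxruntime Python package on macOS:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Timing session creation on a first launch and again on a second launch is a simple check that the cache is actually being reused.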

Binary size impact

Component	App bundle impact
Core ML	Runtime is provided by the operating system
ONNX Runtime	Runtime library must be bundled with the app
Model file	Measure the exported `.onnx` or compiled Core ML model for your model

Core ML wins on binary size because it is a system framework. ONNX Runtime must be bundled with the app.
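
Measuring the model artifact itself is straightforward. An `.onnx` model is a single file, while `.mlpackage` and `.mlmodelc` are directories, so a size check needs to handle both. A minimal sketch; `bundle_size_bytes` is a hypothetical helper and the demo uses a throwaway directory rather than a real model.

```python
import tempfile
from pathlib import Path


def bundle_size_bytes(path: Path) -> int:
    """Total size of a file, or of every file under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


# Demo on a temporary directory standing in for a compiled model bundle.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "Model.mlmodelc"
    bundle.mkdir()
    (bundle / "weights.bin").write_bytes(b"\0" * 2048)
    (bundle / "metadata.json").write_bytes(b"{}")
    print(bundle_size_bytes(bundle))  # 2050
```

For ONNX Runtime, add the size of the bundled runtime library to the model size to get the full app bundle impact.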

When to choose which

Choose Core ML when

You are shipping a model in an iOS/macOS app and want the smallest binary size.
Your model is a standard architecture (MobileNet, ResNet, BERT, ViT) with static input shapes.
You want Neural Engine acceleration for power efficiency.
You are willing to run a conversion step and debug conversion issues.
Model protection (encryption) is a requirement.

Choose ONNX Runtime when

You already have a model in ONNX format and want to avoid the conversion step.
Your model has dynamic shapes or variable-length inputs.
Your model uses ops that Core ML does not support natively.
You are building a cross-platform app and want a single inference backend for all platforms.
You need to switch between ONNX models without maintaining separate Core ML exports.

Use both when

A practical pattern: use ONNX Runtime as the primary engine for flexibility, with the Core ML execution provider for hardware acceleration of supported subgraphs. Profile the result because graph partitioning and Core ML compilation can outweigh the acceleration benefit for some models.

Example: Running a model with ONNX Runtime on macOS

import OnnxRuntimeBindings
import Foundation

class InferenceService {
    private let env: ORTEnv
    private var session: ORTSession?

    init(modelURL: URL) throws {
        env = try ORTEnv(loggingLevel: .warning)
        let options = try ORTSessionOptions()
        try options.appendExecutionProvider("CoreML", providerOptions: [
            "ModelFormat": "MLProgram",
            "MLComputeUnits": "ALL",
        ])

        session = try ORTSession(
            env: env,
            modelPath: modelURL.path,
            sessionOptions: options
        )
    }

    func run(input: [Float], shape: [NSNumber]) throws -> [Float] {
        guard let session else { throw ServiceError.notInitialized }

        let inputData = NSMutableData(
            bytes: input,
            length: input.count * MemoryLayout<Float>.stride
        )
        let inputTensor = try ORTValue(
            tensorData: inputData,
            elementType: .float,
            shape: shape
        )

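        // "input" and "output" must match the tensor names in your ONNX model.
        let outputs = try session.run(
            withInputs: ["input": inputTensor],
            outputNames: Set(["output"]),
            runOptions: nil
        )

        guard let outputTensor = outputs["output"] else {
            throw ServiceError.inferenceFailed
        }

        // tensorData() returns NSMutableData; bridge it to Data to read the floats.
        // Using try (not try?) propagates the underlying ONNX Runtime error.
        let outputData = try outputTensor.tensorData() as Data
        return outputData.withUnsafeBytes { buffer in
            Array(buffer.bindMemory(to: Float.self))
        }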
    }

    enum ServiceError: Error {
        case notInitialized
        case inferenceFailed
    }
}