Apple Foundation Models with Swift Examples

Jun 01, 2026
#ai #ml #apple #foundation-models #swift

Apple Intelligence is not a single model — it is a layered system built on Apple Foundation Models (AFM). At WWDC 2025, Apple opened these models to developers through the FoundationModels framework, letting any app call the same on-device LLM that powers Writing Tools, summarization, and smart replies.

This article covers the architecture of the models, the framework API surface, and practical Swift examples you can use today.

Model Architecture

Apple ships two model tiers:

Model	Parameters	Compression	Runs On
AFM-on-device	~3B	2-bit QAT (weights), 4-bit (embeddings)	iPhone, iPad, Mac Neural Engine/GPU/CPU
AFM-server	MoE (undisclosed size)	3.56-bit ASTC	Apple Silicon in Private Cloud Compute

On-Device Model

The on-device model is a decoder-only transformer with grouped-query attention, SwiGLU activation, and RoPE positional embeddings. Apple splits the model into two blocks at a 5:3 depth ratio — the second block reuses KV caches from the first block’s final layer, cutting KV cache memory by 37.5% and improving time-to-first-token.
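The 37.5% figure follows from the split itself: if only the first block's layers keep their own keys and values, the cache shrinks by the second block's share of the depth. A quick arithmetic check, assuming every layer would otherwise store the same amount of KV cache (the function name is illustrative, not part of any Apple API):

```swift
/// Fraction of KV cache memory saved when the second block reuses the
/// first block's cache. Assumes each layer would otherwise store the same
/// amount of keys and values.
func kvCacheSavings(firstBlockParts: Int, secondBlockParts: Int) -> Double {
    Double(secondBlockParts) / Double(firstBlockParts + secondBlockParts)
}

let savings = kvCacheSavings(firstBlockParts: 5, secondBlockParts: 3)
print("KV cache reduction: \(savings * 100)%")  // 3 / 8 = 37.5%
```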

The aggressive 2-bit quantization-aware training (QAT) is what makes a ~3B model fit within device memory and bandwidth budgets. Apple applies low-rank adapters post-quantization to recover most of the quality loss (they report a ~4.6% regression on MGSM and a 1.5% improvement on MMLU, so recovery is uneven).
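To get a feel for why the bit width matters, here is a back-of-the-envelope estimate of weight storage. It rounds "~3B" to exactly 3 billion parameters and uses a 16-bit baseline purely as an assumed point of comparison; it ignores the 4-bit embeddings, the adapters, and runtime buffers, so treat the numbers as rough orders of magnitude.

```swift
/// Approximate storage in gigabytes for `parameterCount` weights stored at
/// `bitsPerWeight`. Ignores embeddings, adapters, and runtime buffers.
func weightStorageGB(parameterCount: Double, bitsPerWeight: Double) -> Double {
    parameterCount * bitsPerWeight / 8 / 1_000_000_000
}

let parameters = 3_000_000_000.0  // "~3B", rounded for illustration
print(weightStorageGB(parameterCount: parameters, bitsPerWeight: 16))  // 6.0
print(weightStorageGB(parameterCount: parameters, bitsPerWeight: 2))   // 0.75
```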

Server Model (PT-MoE)

When a request is too complex for on-device inference, Apple Intelligence system features (such as Siri or Writing Tools) may route to Private Cloud Compute. The server model uses a Parallel Track Mixture-of-Experts (PT-MoE) design — multiple smaller transformer tracks process tokens independently with synchronization only at block boundaries. This avoids the all-reduce overhead of traditional tensor parallelism. The Foundation Models framework described in this article exclusively uses the on-device model.

The server weights use Adaptive Scalable Texture Compression (ASTC) at 3.56 bpw, which Apple GPUs can decompress in hardware at inference time — no additional compute overhead.

The Foundation Models Framework

The FoundationModels framework (iOS 26+, macOS 26+, iPadOS 26+, visionOS 26+) is the Swift API to the on-device LLM. You import it, check availability, create a session, and send prompts.

Minimum requirements to run the examples in this article:

Xcode 26+ (from WWDC 2025)
A device that supports Apple Intelligence (A17 Pro or later, or M1 or later)
Apple Intelligence enabled in System Settings
The on-device model downloads automatically after Apple Intelligence is enabled (may take time on first use; downloading depends on network and battery conditions)

Checking Availability

Not every device can run the model. Apple Intelligence must be enabled, the device must be eligible, and the model may still be downloading on first use. Check SystemLanguageModel.default.availability to determine the current state.

import FoundationModels
import SwiftUI

struct ChatView: View {
    var body: some View {
        switch SystemLanguageModel.default.availability {
        case .available:
            ChatUI()
        case .unavailable(let reason):
            ContentUnavailableView(
                message(for: reason),
                systemImage: "apple.intelligence.badge.xmark"
            )
        }
    }

    private func message(for reason: SystemLanguageModel.UnavailableReason) -> String {
        switch reason {
        case .appleIntelligenceNotEnabled:
            "Enable Apple Intelligence in Settings"
        case .deviceNotEligible:
            "This device does not support Apple Intelligence"
        case .modelNotReady:
            "Downloading the language model — try again soon"
        @unknown default:
            "Unavailable"
        }
    }
}

The Session

You interact with the model through LanguageModelSession. It holds conversation history, system instructions, and tool registrations.

let session = LanguageModelSession {
    """
    You are a concise technical assistant. Answer in plain English.
    When asked about code, show Swift examples.
    """
}

You can prewarm the model to load it into memory before the user types:

session.prewarm(promptPrefix: "You are helping with")

Check session.isResponding to disable input during generation.

Generating Text

The simplest call is respond(to:) with a string:

let response = try await session.respond(
    to: "Explain the difference between `let` and `var` in Swift."
)
print(response.content)

You can pass GenerationOptions to control behavior:

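// Greedy sampling always picks the most likely token, so a temperature has
// no effect on it. Leave `sampling` unset here so the temperature applies;
// pass `sampling: .greedy` instead when you want deterministic output.
let options = GenerationOptions(
    temperature: 0.3,
    maximumResponseTokens: 500
)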

let response = try await session.respond(
    to: "Summarize this meeting transcript...",
    options: options
)
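To see what the sampling and temperature settings control, the sketch below uses the textbook definitions of temperature-scaled softmax and greedy selection. It is not the framework's implementation, and the scores are made up; it only illustrates that a lower temperature concentrates probability on the top candidate, while greedy selection ignores the temperature altogether.

```swift
import Foundation

/// Softmax over raw scores after dividing each one by `temperature`.
func softmax(_ logits: [Double], temperature: Double) -> [Double] {
    let scaled = logits.map { $0 / temperature }
    let maxValue = scaled.max() ?? 0  // subtract the max for numerical stability
    let exps = scaled.map { exp($0 - maxValue) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// Greedy decoding: the index of the highest-scoring candidate, if any.
func greedyIndex(_ logits: [Double]) -> Int? {
    logits.indices.max { logits[$0] < logits[$1] }
}

let logits = [2.0, 1.0, 0.5]  // made-up scores for three candidate tokens
print(softmax(logits, temperature: 1.0))  // flatter distribution
print(softmax(logits, temperature: 0.3))  // sharper, favors index 0 more strongly
print(greedyIndex(logits) ?? -1)          // 0
```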

For real-time output, use streamResponse(to:generating:) with a @Generable type. The framework streams snapshots of the partially generated struct rather than raw token deltas. See the next section for details.

Streaming Structured Output

When you use @Generable types, streaming returns partially generated snapshots where every property is optional until the model fills it in:

let stream = session.streamResponse(
    to: "Generate meeting notes from this transcript...",
    generating: MeetingNotes.self,
    options: GenerationOptions(temperature: 0.2)
)

for try await partial in stream {
    // Each snapshot fills in more fields; `content` holds the
    // partially generated value (MeetingNotes.PartiallyGenerated)
    meetingNotes = partial.content
}

This lets your SwiftUI view animate in data as it arrives, rather than blocking until the full response is ready.
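A minimal sketch of such a view follows. The `Recap` type and `RecapView` are hypothetical names invented for this example, and the code assumes a deployment target of iOS 26 or macOS 26. Each streamed snapshot's `content` is the partially generated value, so the view falls back to placeholders for fields that have not arrived yet.

```swift
// Guarded so the file still compiles on platforms without the framework.
#if canImport(FoundationModels)
import FoundationModels
import SwiftUI

// Hypothetical type for this sketch; any @Generable struct works the same way.
@Generable
struct Recap {
    let title: String
    let highlights: [String]
}

struct RecapView: View {
    let transcript: String
    // Macro-generated partial type: every property is optional.
    @State private var recap: Recap.PartiallyGenerated?

    var body: some View {
        VStack(alignment: .leading) {
            Text(recap?.title ?? "Generating title…")
                .font(.headline)
            ForEach(recap?.highlights ?? [], id: \.self) { highlight in
                Text("• \(highlight)")
            }
        }
        .task {
            let session = LanguageModelSession()
            let stream = session.streamResponse(
                to: "Summarize this transcript: \(transcript)",
                generating: Recap.self
            )
            do {
                for try await snapshot in stream {
                    recap = snapshot.content  // triggers a SwiftUI update
                }
            } catch {
                // Keep whatever arrived before the failure.
                print("Streaming failed: \(error)")
            }
        }
    }
}
#endif
```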

Guided Generation with `@Generable` and `@Guide`

The standout feature of the framework is guided generation — you define a Swift struct, mark it @Generable, optionally annotate fields with @Guide, and the model returns instances of your type directly, without fragile string parsing.

import FoundationModels

@Generable
struct MeetingNotes: Equatable {
    let title: String
    let date: String
    let attendees: [String]
    let summary: String
    let actionItems: [ActionItem]
}

@Generable
struct ActionItem: Equatable {
    let owner: String
    let task: String
    let deadline: String
}

// Generate structured output
let notes: MeetingNotes = try await session.respond(
    to: "Generate meeting notes from this transcript...",
    generating: MeetingNotes.self
).content

The @Guide macro constrains individual fields:

@Generable
struct MovieRecommendation: Equatable {
    let title: String

    @Guide(.anyOf(["Action", "Comedy", "Drama", "Sci-Fi", "Horror"]))
    let genre: String

    @Guide(description: "Rotten Tomatoes score out of 100")
    let rating: Int

    @Guide(.count(3))
    let similarTitles: [String]
}

Use @Guide for:

.anyOf([...]) — restrict to a fixed set of values
.count(n) — fix array length
description: — hint what the field means
Regex patterns on string fields

Under the hood, the macro generates a JSON Schema during compilation. The framework injects the schema into the prompt, and a constrained decoding daemon enforces that the model output matches the schema — guaranteeing schema-conformant structured output.
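The idea behind constrained decoding can be shown with a toy example. This is not Apple's implementation: the real system works on model tokens and full schemas, while this sketch works on characters and a single `.anyOf`-style constraint. At each step it computes which next characters keep the output on track to become one of the allowed values; a decoder would mask out everything else.

```swift
/// Characters that may legally follow `prefix` when the finished value
/// must be one of `allowed`. An empty result means `prefix` is either
/// complete or already invalid.
func allowedNextCharacters(after prefix: String, in allowed: [String]) -> Set<Character> {
    var next = Set<Character>()
    for value in allowed where value.count > prefix.count && value.starts(with: prefix) {
        let index = value.index(value.startIndex, offsetBy: prefix.count)
        next.insert(value[index])
    }
    return next
}

let genres = ["Action", "Comedy", "Drama", "Sci-Fi", "Horror"]
print(allowedNextCharacters(after: "", in: genres).sorted())     // ["A", "C", "D", "H", "S"]
print(allowedNextCharacters(after: "Sci", in: genres).sorted())  // ["-"]
```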

Tool Calling

Tools let the model fetch data or call your app logic. Conform to the Tool protocol:

import FoundationModels
import EventKit

final class CalendarTool: Tool {
    let name = "getCalendarEvents"
    let description = "Fetch calendar events for a given date range."

    @Generable
    struct Arguments {
        @Guide(description: "ISO 8601 start date")
        let startDate: String
        @Guide(description: "ISO 8601 end date")
        let endDate: String
    }

    // The return type can be any PromptRepresentable value; a String is the simplest.
    func call(arguments: Arguments) async throws -> String {
        let store = EKEventStore()
        // iOS 17+ API; requires NSCalendarsFullAccessUsageDescription in Info.plist.
        let granted = try await store.requestFullAccessToEvents()
        guard granted else { throw CalendarError.accessDenied }

        let formatter = ISO8601DateFormatter()
        guard let startDate = formatter.date(from: arguments.startDate),
              let endDate = formatter.date(from: arguments.endDate) else {
            throw CalendarError.invalidDate
        }

        let predicate = store.predicateForEvents(
            withStart: startDate,
            end: endDate,
            calendars: nil
        )
        // One line per event, which the model reads as plain text.
        let lines = store.events(matching: predicate).map { event in
            "\(event.title ?? "Untitled"): \(event.startDate.ISO8601Format())"
        }
        return lines.isEmpty ? "No events found." : lines.joined(separator: "\n")
    }
}

enum CalendarError: Error {
    case accessDenied
    case invalidDate
}

Pass tools into the session:

let session = LanguageModelSession(tools: [CalendarTool()]) {
    "You are a scheduling assistant. Use the calendar tool when asked about events."
}

let response = try await session.respond(to: "What meetings do I have tomorrow?")

The model decides whether to call the tool based on the conversation context. Tool calls and results appear in session.transcript, and errors throw LanguageModelSession.ToolCallError with the specific underlyingError.
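One way to surface tool failures separately from other generation errors is sketched below. The helper names `toolFailureMessage` and `ask` are hypothetical; only `LanguageModelSession.ToolCallError`, its `tool`, and its `underlyingError` come from the framework.

```swift
/// Builds a user-facing message for a failed tool call (hypothetical helper).
func toolFailureMessage(toolName: String, underlying: any Error) -> String {
    "The \(toolName) tool failed: \(String(describing: underlying))"
}

// Guarded so the file still compiles on platforms without the framework.
#if canImport(FoundationModels)
import FoundationModels

/// Sends a prompt and reports tool failures separately from other errors.
func ask(_ session: LanguageModelSession, _ prompt: String) async -> String {
    do {
        return try await session.respond(to: prompt).content
    } catch let error as LanguageModelSession.ToolCallError {
        // `tool` is the tool that threw; `underlyingError` is what it threw.
        return toolFailureMessage(toolName: error.tool.name, underlying: error.underlyingError)
    } catch {
        return "Generation failed: \(error)"
    }
}
#endif
```

Keeping the message formatting in a plain function makes it easy to unit test without a model present.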

Use Cases

Beyond the general model, you can request a specialized built-in use case:

let taggingModel = SystemLanguageModel(useCase: .contentTagging)
let session = LanguageModelSession(model: taggingModel)

The contentTagging use case uses a specialized adapter for tag generation, entity extraction, and topic detection.
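Combined with guided generation, the tagging model can return tags as a typed array. In this sketch, `TagResult` and `suggestTags(for:)` are hypothetical names, and the limit of five tags is an arbitrary choice for illustration.

```swift
// Guarded so the file still compiles on platforms without the framework.
#if canImport(FoundationModels)
import FoundationModels

// Hypothetical output type for this sketch.
@Generable
struct TagResult {
    @Guide(description: "Short topic tags for the text", .maximumCount(5))
    let tags: [String]
}

/// Asks the content-tagging model for topic tags describing `text`.
func suggestTags(for text: String) async throws -> [String] {
    let model = SystemLanguageModel(useCase: .contentTagging)
    let session = LanguageModelSession(model: model)
    let response = try await session.respond(to: text, generating: TagResult.self)
    return response.content.tags
}
#endif
```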

Performance Characteristics

The on-device model was trained on sequences up to 65K tokens of context. The runtime context limit is exposed via SystemLanguageModel.contextSize. First response latency depends on whether the session has been prewarmed — calling prewarm(promptPrefix:) ahead of time loads the model into memory so the first prompt skips model loading.
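When a conversation outgrows the context limit, the session throws `LanguageModelSession.GenerationError.exceededContextWindowSize`. The simplest recovery, sketched below with a hypothetical `Conversation` wrapper, is to start a fresh session and retry once. This drops the earlier turns; carrying over a condensed version of the transcript is an alternative this sketch does not attempt.

```swift
// Guarded so the file still compiles on platforms without the framework.
#if canImport(FoundationModels)
import FoundationModels

@MainActor
final class Conversation {
    private(set) var session = LanguageModelSession()

    /// Sends `prompt`; if the context window is full, retries once in a
    /// new session with an empty transcript.
    func send(_ prompt: String) async throws -> String {
        do {
            return try await session.respond(to: prompt).content
        } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
            // Earlier turns are lost at this point.
            session = LanguageModelSession()
            return try await session.respond(to: prompt).content
        }
    }
}
#endif
```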

Apple’s own human evaluations show the on-device model is competitive with Qwen-2.5-3B and Gemma-3-4B, while the server model beats Qwen-2.5-VL at under half the inference FLOPs. Neither matches GPT-4o on general reasoning, but that is not the design goal — these models are optimized for on-device latency, privacy, and battery life.

Design Guidance

The WWDC sessions emphasize that the on-device model is a device-scale model — it excels at summarization, extraction, classification, and short-form generation, but is not designed for world knowledge or advanced reasoning. Some practical guidelines:

Break tasks down — decompose complex requests into smaller, focused prompts
Avoid code generation and math — the model was not trained for reliable code or arithmetic output; use standard computation instead
No real-time data — training data has a cutoff; the model does not browse the web
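Guardrails are mandatory for guided generation: the default guardrails always apply. For plain text generation, the permissiveContentTransformations guardrails relax the checks for tasks that transform user-supplied text, such as summarizing an article, but the model may still decline some requests.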
Treat output as a draft — Apple’s Human Interface Guidelines suggest having users review generated content rather than treating it as authoritative

For tasks requiring deeper reasoning, consider routing to a server-scale model or combining the on-device model with tool calling to fetch external data.

Putting It All Together

Here is a complete minimalist chat view that uses the Foundation Models framework:

import SwiftUI
import FoundationModels

struct AIChatView: View {
    @State private var session = LanguageModelSession(tools: [CalendarTool()]) {
        "You are a helpful assistant. Be concise."
    }
    @State private var input = ""

    var body: some View {
        VStack {
            List(session.transcript) { entry in
                switch entry {
                case .prompt(let prompt):
                    SegmentsView(segments: prompt.segments, isUser: true)
                case .response(let response):
                    SegmentsView(segments: response.segments, isUser: false)
                default:
                    EmptyView()
                }
            }
            .listStyle(.plain)

            HStack {
                TextField("Ask something...", text: $input, axis: .vertical)
                    .textFieldStyle(.roundedBorder)
                    .disabled(session.isResponding)
                Button("Send") {
                    let text = input
                    input = ""
                    Task {
                        try? await session.respond(to: text)
                    }
                }
                .disabled(input.isEmpty || session.isResponding)
            }
            .padding()
        }
    }
}

struct SegmentsView: View {
    let segments: [Transcript.Segment]
    let isUser: Bool

    var body: some View {
        ForEach(segments, id: \.id) { segment in
            switch segment {
            case .text(let text):
                Text(LocalizedStringKey(text.content))
                    .padding(10)
                    .background(isUser ? Color.blue.opacity(0.15) : Color(.systemGray6))
                    .cornerRadius(12)
                    .frame(maxWidth: .infinity, alignment: isUser ? .trailing : .leading)
            case .structure:
                EmptyView()
            @unknown default:
                EmptyView()
            }
        }
    }
}

Key Takeaways

Apple Foundation Models come in two flavors: a ~3B on-device model (2-bit QAT compressed) and a PT-MoE server model running on Private Cloud Compute.
The FoundationModels framework provides SystemLanguageModel, LanguageModelSession, @Generable, and the Tool protocol, all in pure Swift.
Guided generation with @Generable and @Guide gives you type-safe structured output without string parsing.
Streaming works via PartiallyGenerated snapshots — every property is optional until the model fills it in.
Tools let the model call into your app (HealthKit, Calendar, network requests, etc.) with automatic argument generation.
The framework is privacy-first: inference runs entirely on device, so the framework itself does not send prompts or responses to a server.