Apple Intelligence is not a single model; it is a layered system built on Apple Foundation Models (AFM). At WWDC 2025, Apple opened these models to developers through the FoundationModels framework, letting any app call the same on-device LLM that powers Writing Tools, summarization, and smart replies.
This article covers the architecture of the models, the framework API surface, and practical Swift examples you can use today.
Apple ships two model tiers:
| Model | Parameters | Compression | Runs On |
|---|---|---|---|
| AFM-on-device | ~3B | 2-bit QAT (weights), 4-bit (embeddings) | iPhone, iPad, Mac Neural Engine/GPU/CPU |
| AFM-server | MoE (undisclosed size) | 3.56-bit ASTC | Apple Silicon in Private Cloud Compute |
The on-device model is a decoder-only transformer with grouped-query attention, SwiGLU activation, and RoPE positional embeddings. Apple splits the model into two blocks at a 5:3 depth ratio: the second block reuses the KV caches from the first block's final layer, cutting KV cache memory by 37.5% and improving time-to-first-token.
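The 37.5% figure follows directly from the 5:3 split: the second block holds 3 of every 8 layers, and those layers store no KV cache of their own. A quick check of that arithmetic, where the total of 56 layers is a hypothetical value chosen only because it divides evenly by 8:

```swift
// Hypothetical total depth; only the 5:3 ratio comes from the text.
let totalLayers = 56
let block1Layers = totalLayers * 5 / 8          // layers that store their own KV cache
let block2Layers = totalLayers - block1Layers   // layers that reuse block 1's cache

// Fraction of per-layer KV caches that no longer need to be stored.
let savings = Double(block2Layers) / Double(totalLayers)
print("Block 1: \(block1Layers) layers, block 2: \(block2Layers) layers")
print("KV cache saved: \(savings * 100)%")
```

The result depends only on the ratio, so any depth divisible by 8 gives the same percentage.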
The aggressive 2-bit quantization-aware training (QAT) is what makes a ~3B model fit within device memory and bandwidth budgets. Apple applies low-rank adapters post-quantization to recover most of the quality loss (they report a ~4.6% regression on MGSM and a 1.5% improvement on MMLU, so recovery is uneven).
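A back-of-the-envelope estimate shows why 2 bits per weight matters. The sketch below assumes exactly 3 billion weights and ignores the 4-bit embedding table, the adapters, and runtime activations, so the numbers are rough orders of magnitude rather than figures reported by Apple:

```swift
/// Approximate weight storage in gigabytes (10^9 bytes) at a given bit width.
func weightGigabytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8 / 1_000_000_000
}

// Assumption: a round 3 billion weights; the text only says ~3B.
let parameterCount = 3_000_000_000.0

let fp16 = weightGigabytes(parameters: parameterCount, bitsPerWeight: 16)   // half precision
let twoBit = weightGigabytes(parameters: parameterCount, bitsPerWeight: 2)  // 2-bit QAT
print("16-bit: \(fp16) GB, 2-bit: \(twoBit) GB, ratio: \(fp16 / twoBit)x")
```

Under these assumptions the weights shrink by a factor of eight, which also reduces the memory bandwidth needed per generated token.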
When a request is too complex for on-device inference, Apple Intelligence system features (such as Siri or Writing Tools) may route to Private Cloud Compute. The server model uses a Parallel Track Mixture-of-Experts (PT-MoE) design: multiple smaller transformer tracks process tokens independently, with synchronization only at block boundaries. This avoids the all-reduce overhead of traditional tensor parallelism. The Foundation Models framework described in this article exclusively uses the on-device model.
The server weights use Adaptive Scalable Texture Compression (ASTC) at 3.56 bpw, which Apple GPUs can decompress in hardware at inference time, with no additional compute overhead.
The FoundationModels framework (iOS 26+, macOS 26+, iPadOS 26+, visionOS 26+) is the Swift API to the on-device LLM. You import it, check availability, create a session, and send prompts.
Minimum requirements to run the examples in this article:
Not every device can run the model. Apple Intelligence must be enabled, the device must be eligible, and the model may still be downloading on first use. Check SystemLanguageModel.default.availability to determine the current state.
```swift
import FoundationModels
import SwiftUI

struct ChatView: View {
    var body: some View {
        switch SystemLanguageModel.default.availability {
        case .available:
            ChatUI()  // the app's own chat view, not shown here
        case .unavailable(let reason):
            ContentUnavailableView(
                message(for: reason),
                systemImage: "apple.intelligence.badge.xmark"
            )
        }
    }

    // Maps each unavailability reason to a user-facing message.
    private func message(for reason: SystemLanguageModel.Availability.UnavailableReason) -> String {
        switch reason {
        case .appleIntelligenceNotEnabled:
            "Enable Apple Intelligence in Settings"
        case .deviceNotEligible:
            "This device does not support Apple Intelligence"
        case .modelNotReady:
            "Downloading the language model. Try again soon."
        @unknown default:
            "Unavailable"
        }
    }
}
```
You interact with the model through LanguageModelSession. It holds conversation history, system instructions, and tool registrations.
```swift
let session = LanguageModelSession {
    """
    You are a concise technical assistant. Answer in plain English.
    When asked about code, show Swift examples.
    """
}
```
You can prewarm the model to load it into memory before the user types:
```swift
session.prewarm(promptPrefix: "You are helping with")
```
Check session.isResponding to disable input during generation.
The simplest call is respond(to:) with a string:
```swift
let response = try await session.respond(
    to: "Explain the difference between `let` and `var` in Swift."
)
print(response.content)
```
You can pass GenerationOptions to control behavior:
```swift
let options = GenerationOptions(
    sampling: .greedy,
    temperature: 0.3,
    maximumResponseTokens: 500
)
let response = try await session.respond(
    to: "Summarize this meeting transcript...",
    options: options
)
```
For real-time output, use streamResponse(to:generating:) with a @Generable type. The framework streams snapshots of the partially generated struct rather than raw token deltas. See the next section for details.
When you use @Generable types, streaming returns partially generated snapshots where every property is optional until the model fills it in:
```swift
let stream = session.streamResponse(
    to: "Generate meeting notes from this transcript...",
    generating: MeetingNotes.self,
    options: GenerationOptions(temperature: 0.2)
)
for try await partial in stream {
    // Each snapshot fills in more fields. `meetingNotes` is assumed to be
    // view state of type MeetingNotes.PartiallyGenerated?.
    meetingNotes = partial.content
}
```
This lets your SwiftUI view animate in data as it arrives, rather than blocking until the full response is ready.
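To make the snapshot idea concrete without the framework, here is a hand-written stand-in for a partially generated value. PartialNotes and render are hypothetical names used for illustration only; the real partial type is produced by the @Generable macro and is not reproduced here.

```swift
// Hand-written stand-in: every field is optional until it has been generated.
struct PartialNotes {
    var title: String?
    var summary: String?
    var attendees: [String]?
}

/// Renders whatever is available so far, with a placeholder for missing fields.
func render(_ notes: PartialNotes) -> String {
    let title = notes.title ?? "(pending)"
    let summary = notes.summary ?? "(pending)"
    let attendees = notes.attendees?.joined(separator: ", ") ?? "(pending)"
    return "\(title) | \(summary) | \(attendees)"
}

// Simulated snapshots, in the order a stream might deliver them.
let snapshots = [
    PartialNotes(title: "Sprint review"),
    PartialNotes(title: "Sprint review", summary: "Shipped v2"),
    PartialNotes(title: "Sprint review", summary: "Shipped v2", attendees: ["Ana", "Ben"]),
]
for snapshot in snapshots {
    print(render(snapshot))
}
```

A view bound to the latest snapshot can show placeholders for missing fields and replace them as later snapshots arrive.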
@Generable and @Guide
The standout feature of the framework is guided generation: you define a Swift struct, mark it @Generable, optionally annotate fields with @Guide, and the model returns instances of your type directly, without fragile string parsing.
```swift
import FoundationModels

@Generable
struct MeetingNotes: Equatable {
    let title: String
    let date: String
    let attendees: [String]
    let summary: String
    let actionItems: [ActionItem]
}

@Generable
struct ActionItem: Equatable {
    let owner: String
    let task: String
    let deadline: String
}

// Generate structured output; the prompt is passed alongside the target type.
let notes: MeetingNotes = try await session.respond(
    to: "Generate meeting notes from this transcript...",
    generating: MeetingNotes.self
).content
```
The @Guide macro constrains individual fields:
```swift
@Generable
struct MovieRecommendation: Equatable {
    let title: String
    @Guide(.anyOf(["Action", "Comedy", "Drama", "Sci-Fi", "Horror"]))
    let genre: String
    @Guide(description: "Rotten Tomatoes score out of 100")
    let rating: Int
    @Guide(.count(3))
    let similarTitles: [String]
}
```
Use @Guide for:
- .anyOf([...]): restrict to a fixed set of values
- .count(n): fix array length
- description: hint what the field means
Under the hood, the macro generates a JSON Schema during compilation. The framework injects the schema into the prompt, and a constrained decoding daemon enforces that the model output matches the schema, guaranteeing schema-conformant structured output.
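Constrained decoding is easy to picture with a toy example. The sketch below is not Apple's implementation: real constrained decoding works token by token, whereas this version collapses the idea to whole values for an anyOf-style constraint, with made-up scores standing in for model logits. Candidates the schema forbids are masked out before the best remaining one is chosen.

```swift
/// Picks the highest-scoring candidate that the schema permits.
/// `scores` stands in for model logits; `allowed` for the values the schema accepts.
func constrainedPick(scores: [String: Double], allowed: Set<String>) -> String? {
    // Mask everything the schema forbids, then take the best remaining option.
    let permitted = scores.filter { allowed.contains($0.key) }
    return permitted.max { $0.value < $1.value }?.key
}

let genreScores = ["Romance": 0.9, "Drama": 0.6, "Comedy": 0.3]
let allowedGenres: Set<String> = ["Action", "Comedy", "Drama", "Sci-Fi", "Horror"]

// "Romance" scores highest but is not in the allowed set, so it can never be emitted.
let genre = constrainedPick(scores: genreScores, allowed: allowedGenres)
print(genre ?? "no valid candidate")
```

Because invalid candidates are removed before selection rather than rejected afterwards, the output conforms by construction and needs no retry loop.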
Tools let the model fetch data or call your app logic. Conform to the Tool protocol:
```swift
import FoundationModels
import EventKit

final class CalendarTool: Tool {
    let name = "getCalendarEvents"
    let description = "Fetch calendar events for a given date range."

    @Generable
    struct Arguments {
        @Guide(description: "ISO 8601 start date")
        let startDate: String
        @Guide(description: "ISO 8601 end date")
        let endDate: String
    }

    // A tool can return any PromptRepresentable value; a plain String is the simplest choice.
    func call(arguments: Arguments) async throws -> String {
        let store = EKEventStore()
        // requestAccess(to:) is deprecated since iOS 17; reading events needs full access.
        guard try await store.requestFullAccessToEvents() else {
            throw CalendarError.accessDenied
        }
        let formatter = ISO8601DateFormatter()
        guard let startDate = formatter.date(from: arguments.startDate),
              let endDate = formatter.date(from: arguments.endDate) else {
            throw CalendarError.invalidDate
        }
        let predicate = store.predicateForEvents(
            withStart: startDate,
            end: endDate,
            calendars: nil
        )
        // One line per event: "<title> at <ISO 8601 start>".
        let lines = store.events(matching: predicate).map { event in
            "\(event.title ?? "Untitled") at \(event.startDate.ISO8601Format())"
        }
        return lines.isEmpty ? "No events found." : lines.joined(separator: "\n")
    }
}

enum CalendarError: Error {
    case accessDenied
    case invalidDate
}
```
Pass tools into the session:
```swift
let session = LanguageModelSession(tools: [CalendarTool()]) {
    "You are a scheduling assistant. Use the calendar tool when asked about events."
}
let response = try await session.respond(to: "What meetings do I have tomorrow?")
```
The model decides whether to call the tool based on the conversation context. Tool calls and results appear in session.transcript, and errors throw LanguageModelSession.ToolCallError with the specific underlyingError.
Beyond the general model, you can request a specialized built-in use case:
```swift
let taggingModel = SystemLanguageModel(useCase: .contentTagging)
let session = LanguageModelSession(model: taggingModel)
```
The contentTagging use case uses a specialized adapter for tag generation, entity extraction, and topic detection.
The on-device model was trained on sequences up to 65K tokens of context. The runtime context limit is exposed via SystemLanguageModel.contextSize. First response latency depends on whether the session has been prewarmed: calling prewarm(promptPrefix:) ahead of time loads the model into memory so the first prompt skips model loading.
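Because the context window is finite, it can help to budget input before sending it. The helper below is a rough sketch: the ratio of 4 characters per token is a common rule of thumb for English text, not a property of Apple's tokenizer, and the function names and the budget value are hypothetical.

```swift
/// Rough token estimate. Assumption: about 4 characters per token.
func estimatedTokens(in text: String, charactersPerToken: Int = 4) -> Int {
    (text.count + charactersPerToken - 1) / charactersPerToken  // round up
}

/// Truncates `text` so its estimated token count stays within `budget`.
func truncated(_ text: String, toTokenBudget budget: Int, charactersPerToken: Int = 4) -> String {
    String(text.prefix(budget * charactersPerToken))
}

let transcript = String(repeating: "word ", count: 1_000)  // 5,000 characters
let budget = 1_000                                          // hypothetical prompt budget
let fitted = truncated(transcript, toTokenBudget: budget)
print(estimatedTokens(in: transcript), estimatedTokens(in: fitted))
```

A character-based estimate is only a guard against obviously oversized input; the actual token count depends on the tokenizer and the language of the text.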
Apple's own human evaluations show the on-device model is competitive with Qwen-2.5-3B and Gemma-3-4B, while the server model beats Qwen-2.5-VL at under half the inference FLOPs. Neither matches GPT-4o on general reasoning, but that is not the design goal: these models are optimized for on-device latency, privacy, and battery life.
The WWDC sessions emphasize that the on-device model is a device-scale model: it excels at summarization, extraction, classification, and short-form generation, but is not designed for world knowledge or advanced reasoning. Some practical guidelines:
- permissiveContentTransformations to skip safety checks, with caveats.
For tasks requiring deeper reasoning, consider routing to a server-scale model or combining the on-device model with tool calling to fetch external data.
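One way to act on this guidance is to decide up front which path a request takes. The sketch below uses hypothetical task categories and a made-up routing rule purely for illustration; the framework itself does not provide such a router.

```swift
// Hypothetical task categories, loosely following the strengths and limits described above.
enum TaskKind {
    case summarization, extraction, classification, shortGeneration
    case worldKnowledge, advancedReasoning
}

enum Route: Equatable {
    case onDevice            // handle with the on-device model alone
    case onDeviceWithTools   // on-device model plus a tool that fetches external data
    case server              // hand off to a server-scale model
}

/// Made-up rule: device-scale tasks stay local, knowledge lookups use tools,
/// and heavy reasoning goes to a server-scale model.
func route(for task: TaskKind) -> Route {
    switch task {
    case .summarization, .extraction, .classification, .shortGeneration:
        return .onDevice
    case .worldKnowledge:
        return .onDeviceWithTools
    case .advancedReasoning:
        return .server
    }
}

print(route(for: .summarization), route(for: .worldKnowledge), route(for: .advancedReasoning))
```

Keeping the decision in one exhaustive switch means a new task category cannot be added without the compiler asking where it should go.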
Here is a complete minimalist chat view that uses the Foundation Models framework:
```swift
import SwiftUI
import FoundationModels

struct AIChatView: View {
    @State private var session = LanguageModelSession(tools: [CalendarTool()]) {
        "You are a helpful assistant. Be concise."
    }
    @State private var input = ""

    var body: some View {
        VStack {
            // The transcript holds every prompt and response of the session.
            List(session.transcript) { entry in
                switch entry {
                case .prompt(let prompt):
                    SegmentsView(segments: prompt.segments, isUser: true)
                case .response(let response):
                    SegmentsView(segments: response.segments, isUser: false)
                default:
                    EmptyView()
                }
            }
            .listStyle(.plain)

            HStack {
                TextField("Ask something...", text: $input, axis: .vertical)
                    .textFieldStyle(.roundedBorder)
                    .disabled(session.isResponding)
                Button("Send") {
                    let text = input
                    input = ""
                    Task {
                        try? await session.respond(to: text)
                    }
                }
                .disabled(input.isEmpty || session.isResponding)
            }
            .padding()
        }
    }
}

struct SegmentsView: View {
    let segments: [Transcript.Segment]
    let isUser: Bool

    var body: some View {
        ForEach(segments, id: \.id) { segment in
            switch segment {
            case .text(let text):
                Text(LocalizedStringKey(text.content))
                    .padding(10)
                    .background(isUser ? Color.blue.opacity(0.15) : Color(.systemGray6))
                    .cornerRadius(12)
                    .frame(maxWidth: .infinity, alignment: isUser ? .trailing : .leading)
            case .structure:
                EmptyView()
            @unknown default:
                EmptyView()
            }
        }
    }
}
```
- The FoundationModels framework provides SystemLanguageModel, LanguageModelSession, @Generable, and the Tool protocol, all pure Swift.
- @Generable and @Guide give you type-safe structured output without string parsing.
- Streaming returns PartiallyGenerated snapshots: every property is optional until the model fills it in.