
How Apple's Foundation Models framework turns every iPhone into a private AI engine

If you’ve been watching the AI landscape from the sidelines — waiting for costs to drop, privacy concerns to settle, or the right SDK to land — iOS 26 might be the nudge you needed. Apple has quietly shipped a full LLM in every device running Apple Intelligence, and with the Foundation Models framework, it’s yours to use. No API keys. No servers. No cost per request. And most importantly: no user data ever leaves the device.

With the release of iOS 26 and macOS 26, Apple introduced the Foundation Models framework — a Swift-native API that gives developers direct access to the same on-device large language model powering Apple Intelligence features like Writing Tools, Genmoji, and Image Playground. This isn’t a wrapper around a cloud API. It’s a model baked into the operating system, running inference locally on the Neural Engine. The implications are significant: your app can now summarize text, extract structured data, answer user questions, and generate content without ever making a network call. Let’s explore what this looks like in practice.

The Basics: A Session, Some Instructions, a Prompt

The programming model is deliberately simple. You create a session with a set of instructions that define the model’s role. Then you pass a prompt — typically user-generated content — and get a response.

import FoundationModels

let session = LanguageModelSession(
    model: SystemLanguageModel.default,
    instructions: "You are an expert headline writer. Generate a concise title for the provided meeting notes."
)

let response = try await session.respond(to: transcript)
print(response.content)

The distinction between instructions and prompts matters more than it might seem. Instructions have precedence over prompts — the model treats them as the ground truth. This separation is Apple’s answer to prompt injection: user-generated content goes in the prompt, never in the instructions. It’s a small architectural choice that prevents a class of problems that cloud-based chat applications have been wrestling with for years.

Guided Generation: Where Things Get Interesting

If you’ve worked with LLMs, you know the pain of parsing free-form text. You ask for a title, and the model replies: “Sure! Here’s a great headline for your notes: Project Kickoff Review.” Now you need to strip the preamble. Apple’s solution is elegant. Instead of asking for a string and hoping for the best, you tell the model to generate a specific Swift type.

@Generable
struct RecordingTitle {
    var title: String
}

let answer = try await session.respond(
    to: transcript,
    generating: RecordingTitle.self
)
// answer.content.title — just the title, nothing else

The @Generable macro generates a schema for your struct at compile time. The framework passes this schema to the model and constrains decoding so that the output is guaranteed to conform to your type. No parsing, no regex, no hoping the model follows instructions. This gets powerful quickly. Consider extracting a project timeline from meeting notes:

@Generable
struct Timeline {
    @Guide(description: "Milestones and tasks from meeting transcripts")
    var items: [TimelineItem]

    @Guide(description: "Ambiguities encountered during extraction")
    var extractionNotes: String?

    @Generable
    struct TimelineItem {
        var title: String

        @Guide(description: "Due date or timeframe, preserved as stated")
        var date: String

        var owner: String?
        var priority: Priority?

        @Generable
        enum Priority { case critical, high, medium, low }
    }
}

Nested types, enumerations, optionals, constraints — the schema language is expressive enough to model complex data structures. The @Guide macro adds semantic context for the model. You can also enforce hard constraints, such as capping an array at a maximum number of items with .maximumCount(10). One pattern worth highlighting: adding an optional extractionNotes property. This acts as a window into the model’s reasoning — it might note ambiguities like “dates are relative” or “no specific assignee was mentioned.” It’s a lightweight way to build trust and transparency into your AI features.
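As a sketch of how a count constraint and a description combine in a single guide (the MeetingDigest type and its fields are illustrative, not from the framework):

@Generable
struct MeetingDigest {
    // Constrained to at most five entries — the schema enforces this, not the prompt
    @Guide(description: "The most important decisions made in the meeting", .maximumCount(5))
    var keyDecisions: [String]

    @Guide(description: "One-sentence overall summary")
    var summary: String
}

let digest = try await session.respond(to: transcript, generating: MeetingDigest.self)
// digest.content.keyDecisions contains at most five entries by construction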

The 4096-Token Elephant in the Room

The on-device model has a fixed context window of 4096 tokens. That includes instructions, the prompt, the schema, and the generated output. It’s roughly 12,000–16,000 characters in English — enough for a short meeting transcript, but not for a day’s worth of notes. This is the fundamental tradeoff. The model is private, free, and instant — but it can’t hold a novel in its head. The practical solution is semantic chunking. Using Apple’s NaturalLanguage framework, you split your text along paragraph boundaries and process chunks recursively:

  1. Try to process the full text
  2. If it exceeds the context window, split in half at the nearest paragraph boundary
  3. Process each half independently
  4. Combine the results

import NaturalLanguage

let tokenizer = NLTokenizer(unit: .paragraph)
tokenizer.string = longText
let paragraphRanges = tokenizer.tokens(for: longText.startIndex..<longText.endIndex)

The recursive approach is elegant because it automatically finds the right chunk size without you needing to predict token counts (which you can’t do ahead of time — tokenization varies by language, and the only way to know is to actually tokenize). A practical optimization: if the input is more than 3x the context window, skip the tokenization attempt entirely and go straight to splitting. It saves several seconds of wasted work.
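A minimal sketch of the recursive strategy, assuming a summarize(_:) helper that wraps the session call and an illustrative ContextExceeded error (the real framework surfaces context overflow as a generation error; the helper names here are assumptions):

// Illustrative error and helpers — names are assumptions, not framework API.
struct ContextExceeded: Error {}

func summarizeChunked(_ text: String) async throws -> String {
    do {
        return try await summarize(text)               // succeeds when the text fits
    } catch is ContextExceeded {
        // Split at the paragraph boundary nearest the midpoint (via NLTokenizer, as above)
        let mid = nearestParagraphBoundary(in: text)
        async let head = summarizeChunked(String(text[..<mid]))
        async let tail = summarizeChunked(String(text[mid...]))
        return try await head + "\n" + tail            // combine the partial summaries
    }
}

The two halves are processed concurrently with async let, so the deeper the recursion, the more parallelism you get for free.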

Streaming: Show Tokens as They Arrive

For any user-facing generation, waiting for the complete response before updating the UI is a poor experience. The streaming API lets you display partial results as the model generates them:

let stream = session.streamResponse(to: text, generating: Summary.self)
for try await partial in stream {
    displayedText = partial.content.summary ?? ""
}

This works with guided generation too — you get progressively more complete instances of your @Generable type. For a Q&A feature, the citation might appear before the full answer is complete, giving the user confidence that the model is working with real sources.

Tool-Calling: When the Model Needs Real-Time Data

The model knows what it was trained on, but it doesn’t know what’s on the user’s calendar. Tool-calling bridges this gap by letting the LLM invoke Swift functions during inference.

struct AvailabilityTool: Tool {
    let name = "checkAvailability"
    let description = "Checks availability on a given date."

    @Generable
    struct Arguments {
        @Guide(description: "The date to check — relative or absolute")
        let date: String
    }

    func call(arguments: Arguments) async throws -> String {
        // Query calendar, return result
        return "You are available on March 25."
    }
}

You register tools when creating a session, and the model decides autonomously when to call them. You can’t force a tool call — the model evaluates whether it needs external data based on the conversation context. This is similar to MCP (Model Context Protocol) in the cloud LLM world, but running entirely on-device.
The tool executes locally, its output feeds back into the session, and the model uses it to formulate a response. The user never knows a function was called.
One caveat: tool calls consume tokens. Each invocation adds the tool description, arguments, and result to the context window.
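Registering the tool happens at session creation; a sketch might look like this (the instructions string and prompt are illustrative):

let session = LanguageModelSession(
    model: SystemLanguageModel.default,
    tools: [AvailabilityTool()],
    instructions: "You are a scheduling assistant. Use the availability tool when the user asks about dates."
)

let reply = try await session.respond(to: "Can I meet Ann next Tuesday?")
// If the model decides it needs calendar data, it invokes checkAvailability
// behind the scenes and folds the result into its answer.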

Combining Frameworks: The Real Power Move

Foundation Models becomes truly compelling when combined with Apple’s other ML frameworks. Each framework handles a different modality, and Foundation Models ties them together with reasoning.
Speech → Foundation Models: Record a meeting using the Speech framework for live transcription (with timestamp attributes for tap-to-play), then pass the transcript to Foundation Models for summarization, action item extraction, or Q&A.
Vision → Foundation Models: Scan a whiteboard using VisionKit’s document camera, extract text with VNRecognizeTextRequest (parallelized across pages with TaskGroup), then reason on the extracted text.
Foundation Models → Image Playground: Use the LLM to generate a visual description from meeting notes, then pass it to ImageCreator for programmatic image generation. A meeting about “a house overlooking a lake in the mountains” produces a contextual sketch automatically.

The pipeline for a meeting assistant app looks something like:

Audio → Speech (transcribe) → Foundation Models (title, summary, Q&A)
Photos → Vision (OCR) → Foundation Models (extract action items)
Text → Foundation Models (describe) → Image Playground (generate visual)

Each step runs on-device. The entire pipeline works offline. No data leaves the phone.
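The first pipeline can be sketched end to end as follows (the transcribe helper stands in for the Speech framework step, and the MeetingSummary type is illustrative):

@Generable
struct MeetingSummary {
    var title: String
    var summary: String
}

func processMeeting(audioURL: URL) async throws -> MeetingSummary {
    // Speech framework step, represented here by an assumed transcribe helper
    let transcript = try await transcribe(audioURL)

    let session = LanguageModelSession(
        instructions: "You summarize meeting transcripts."
    )
    let result = try await session.respond(to: transcript, generating: MeetingSummary.self)
    return result.content
}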

Prompt Engineering for a Smaller Model

The on-device LLM is not GPT-4. It’s a smaller, purpose-built model that trades breadth for speed and privacy. Prompting techniques that are optional with large cloud models become essential here.

  • Assign a persona: “You are an expert headline writer” produces notably better titles than “Generate a title.” The persona gives the model a frame of reference for quality and style.
  • Use ALL-CAPS for emphasis: The model was trained to interpret capitalization as emphasis. “You MUST return only the title” is more reliable than a polite lowercase request.
  • Prefer examples over rules: Few-shot prompting — providing concrete examples of desired output — is often more effective and more token-efficient than verbose instructions. With guided generation, you can pass @Generable instances as examples directly in the instruction builder.
  • Be explicit about what you don’t want: “Output ONLY a string, no explanations, no bullet points” prevents the chatty preamble that LLMs tend to add.
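An instruction that combines several of these techniques, persona, ALL-CAPS emphasis, and few-shot examples, might look like this (the example pairs are illustrative):

let session = LanguageModelSession(instructions: """
    You are an expert headline writer. You MUST return only the title.

    Example input: "...we agreed to move the launch to May and Ann owns the beta..."
    Example output: Launch Moved to May; Ann Owns Beta

    Example input: "...budget review went long, no decisions, reconvene Friday..."
    Example output: Budget Review Stalls, Continues Friday
    """)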

The Guardrails

Foundation Models includes built-in safety guardrails that prevent the generation of inappropriate content. For most use cases, these work transparently. But for content transformation tasks — like summarizing user-generated meeting notes that might discuss sensitive topics — the default guardrails can be too aggressive. Apple provides a relaxation mechanism:

let model = SystemLanguageModel(
    useCase: .general,
    guardrails: .permissiveContentTransformations
)

This signals that you’re transforming existing user content, not generating new content from scratch. It’s a thoughtful distinction that acknowledges the difference between summarizing a discussion about workplace safety and generating that content from nothing.

What This Changes

The availability of a free, private, zero-latency LLM changes the calculus for many app features that previously required a cloud dependency.

  • Note-taking apps can offer real-time summarization and smart tagging without ever uploading the user’s private thoughts to a server.
  • Expense trackers can combine Vision (to scan receipts) with Foundation Models (to extract amounts, vendors, and categories) entirely on-device.
  • Productivity apps can enable natural language search — “show me tasks about the redesign that are overdue” — by extracting structured filters from free-form queries.
  • Accessibility features can simplify complex text for different audiences, acting as a cognitive accessibility tool that works without network connectivity.

The model updates with the OS, so it improves over time. But this also means your carefully tuned prompts might behave differently after an update. Apple’s recommendation is to build an evaluation dataset — essentially unit tests for your prompts — and run them periodically to catch regressions. Think of it as CI/CD for prompt engineering.
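Such an evaluation suite can be as simple as a handful of XCTest cases over known inputs, run after each OS beta (the fixture and thresholds here are illustrative):

import XCTest
import FoundationModels

final class TitlePromptTests: XCTestCase {
    func testTitleStaysShortAndNonEmpty() async throws {
        let session = LanguageModelSession(
            instructions: "You are an expert headline writer. Generate a concise title."
        )
        // Fixtures.kickoffTranscript is an assumed test fixture
        let answer = try await session.respond(to: Fixtures.kickoffTranscript,
                                               generating: RecordingTitle.self)
        XCTAssertFalse(answer.content.title.isEmpty)
        XCTAssertLessThan(answer.content.title.count, 80)
    }
}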

The Tradeoffs, Honestly

This isn’t a replacement for cloud LLMs. The 4096-token context window is a hard constraint. Complex multi-step reasoning, code generation, and tasks requiring broad world knowledge still belong in the cloud.
But for the vast majority of app-level AI features — the ones that process a user’s own data to extract value for that same user — an on-device model that’s free, private, and instant is a better fit than a cloud API that is expensive, latency-bound, and a privacy liability.
A hybrid architecture makes a lot of sense: handle what you can locally, hand off to the cloud when the task exceeds the device’s capabilities. You get the best of both worlds, and the user gets privacy by default with power on demand.

Getting Started

The Foundation Models framework ships with iOS 26 and macOS 26. You need an Apple Silicon Mac, Xcode 26, and Apple Intelligence enabled on your device (the simulator uses the host Mac’s model). Check availability before using any feature:

guard SystemLanguageModel.default.isAvailable else {
    // Guide user to enable Apple Intelligence
    return
}

Start simple — add a title generator or a summarizer to an existing view. Use guided generation from day one to avoid the parsing trap. Profile with the Foundation Models Instrument to understand token usage and inference time. And separate your instructions from user content, always. The LLM is already on the device. Your users are waiting for you to use it.