On-Device AI Analysis with Apple Intelligence
How we added AI-powered report analysis to CodeFrog’s Mega Report using Apple’s Foundation Models framework — processing everything on-device with zero data leaving your Mac.
The Challenge
CodeFrog’s Mega Report runs 19 test categories in parallel — accessibility, security headers, SEO, HTML validation, broken links, secrets detection, supply chain vulnerabilities, and more. A typical scan generates dozens or hundreds of findings across these categories, each with different severity levels.
Users needed prioritized, actionable guidance: which section should I fix first? What are the highest-impact changes? Without this, developers would often stare at a wall of findings without a clear starting point.
The obvious solution — sending findings to a cloud AI service like OpenAI or Anthropic — was unacceptable. Mega Report scans can contain detected secrets (API keys, tokens), security vulnerability details, internal URLs, and source code metadata. For a security scanning tool, sending this data to a third-party cloud service would undermine the very trust the tool is designed to build.
Why On-Device AI
We evaluated cloud AI services, self-hosted models, and on-device inference. On-device AI via Apple Intelligence was the clear winner for our use case:
- Privacy guarantee: Security scans may contain detected secrets, API keys, vulnerability details, and internal URLs. On-device processing means none of this data ever leaves the machine.
- Offline capability: The AI works without an internet connection — valuable for developers working in air-gapped environments or on flights.
- No API costs: No per-token billing, no API key management, no usage quotas. The model is built into macOS.
- No vendor lock-in: Uses Apple’s built-in Foundation Models framework, which ships with macOS 26. No external dependencies to manage.
- Low latency: On-device inference on the Neural Engine is fast — no network round-trip, results in seconds.
- Verifiable trust: The “Generated on-device by Apple Intelligence” attribution gives users confidence that no network calls were made.
The Approach
Architecture: Foundation Models Integration
We integrated Apple’s Foundation Models framework via a Flutter plugin (foundation_models_framework) that bridges Dart to the native Swift API using Pigeon-generated method channels. The plugin provides availability checking, single-prompt requests, and streaming responses.
At widget initialization, the app checks whether Apple Intelligence is available on the current device. If unavailable (older macOS, Apple Intelligence disabled, non-Apple hardware), the AI buttons are disabled with a descriptive tooltip explaining why.
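On the Swift side of that plugin, the check is a thin wrapper around the framework's availability API. A minimal sketch, with the Pigeon plumbing back to Dart omitted and the function name and tooltip strings as illustrative stand-ins rather than CodeFrog's actual code:

```swift
import FoundationModels

/// Returns nil when the on-device model is ready, or a human-readable
/// reason to show in the disabled button's tooltip otherwise.
func appleIntelligenceUnavailableReason() -> String? {
    switch SystemLanguageModel.default.availability {
    case .available:
        return nil
    case .unavailable(.deviceNotEligible):
        return "This Mac does not support Apple Intelligence."
    case .unavailable(.appleIntelligenceNotEnabled):
        return "Apple Intelligence is turned off in System Settings."
    case .unavailable(.modelNotReady):
        return "The on-device model is still downloading."
    case .unavailable(let reason):
        return "Apple Intelligence is not available (\(String(describing: reason)))."
    }
}
```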
Mode 1: Overall Score Improvement Plan
The overall mode analyzes severity counts across all completed sections to generate a prioritized improvement plan. It never sees individual findings — only aggregated counts.
Sections are sorted by a severity weight formula before being sent to the model, ensuring the AI focuses on the most critical areas first:
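The exact coefficients matter less than the ordering they produce. A sketch of the idea in Swift, with made-up weights (the real formula in CodeFrog may differ):

```swift
/// Aggregated severity counts for one report section (e.g. "Security Headers").
struct SectionSummary {
    let name: String
    let critical: Int
    let high: Int
    let medium: Int
    let low: Int

    /// Weighted score used only for ordering; the weights here are
    /// illustrative assumptions, not CodeFrog's exact values.
    var severityWeight: Int {
        critical * 100 + high * 20 + medium * 5 + low * 1
    }
}

/// Worst sections first, so the model spends its limited output on them.
func sortedForPrompt(_ sections: [SectionSummary]) -> [SectionSummary] {
    sections.sorted { $0.severityWeight > $1.severityWeight }
}
```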
The prompt includes the current grade letter, the worst-performing section, and exact severity counts per section. The system instruction enforces factual analysis: “Reference the exact section names and severity counts provided. Do not give generic advice.”
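Put together, an overall-mode request looks roughly like the following, reusing SectionSummary and sortedForPrompt from the sketch above. The function name and prompt wording are paraphrased for illustration; only the quoted system-instruction constraints come from the feature itself:

```swift
import FoundationModels

/// Asks the on-device model for a prioritized improvement plan based on
/// the grade letter and per-section severity counts (never raw findings).
func overallImprovementPlan(grade: String,
                            sections: [SectionSummary]) async throws -> String {
    let ordered = sortedForPrompt(sections)
    let counts = ordered
        .map { "\($0.name): critical \($0.critical), high \($0.high), medium \($0.medium), low \($0.low)" }
        .joined(separator: "\n")

    let session = LanguageModelSession(instructions: """
        You are a web quality expert. Respond in plain text, not markdown. \
        Reference the exact section names and severity counts provided. \
        Do not give generic advice.
        """)

    let prompt = """
        Overall grade: \(grade)
        Worst section: \(ordered.first?.name ?? "none")
        Severity counts per section:
        \(counts)

        Give a prioritized improvement plan.
        """

    return try await session.respond(to: prompt).content
}
```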
Mode 2: Section-Specific Fix Suggestions
The section mode analyzes individual findings within a specific test category. Each finding is compressed into a one-liner format that maximizes information density within the token budget:
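In sketch form, each finding is reduced to a severity tag, a rule ID, and a short message; the type and field names below are illustrative, not CodeFrog's actual models:

```swift
/// A single finding from one test category, reduced to what the model needs.
struct Finding {
    let severity: String   // "critical", "high", "medium", "low"
    let ruleId: String     // e.g. "missing-csp"
    let message: String    // short human-readable summary
}

/// Compress findings into one-liners such as
/// "[HIGH] missing-csp: Add Content-Security-Policy header".
func promptLines(for findings: [Finding], cap: Int = 30) -> [String] {
    findings
        .prefix(cap)   // hard cap keeps the prompt inside the token budget
        .map { "[\($0.severity.uppercased())] \($0.ruleId): \($0.message)" }
}
```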
Findings are capped at 30 per section. The prompt instructs the model to explain each specific finding, how to fix it, and its severity impact — and to reference only the provided findings, never inventing issues.
Constraint: Token Budget Management
The on-device model has a ~4096 token context window — significantly smaller than those of cloud models. We designed the entire prompt strategy around this constraint:
- Overall mode: Sends only severity counts, never individual findings. A report with all 19 sections fits comfortably in the budget.
- Section mode: Caps findings at 30 and uses one-liner formatting. Each finding averages 50–80 characters, keeping total input under 3000 characters (see the sketch after this list).
- System instructions: Kept minimal (“web quality expert, plain text, not markdown”) to reserve tokens for the actual analysis.
- Plain text output: Requesting plain text instead of markdown eliminates formatting tokens (headers, bullets, code fences), leaving more room for substance.
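The bullets above keep the prompt in budget by construction. The same ceiling can also be enforced explicitly with a small guard before the request is sent; this is a sketch of that idea, not a description of CodeFrog's production code:

```swift
/// Drop trailing one-liners until the prompt body fits the character ceiling.
/// With the 30-finding cap and 50–80 character lines, the input normally
/// fits without any trimming; this guard just makes the budget explicit.
func trimmedToBudget(_ lines: [String], ceiling: Int = 3000) -> [String] {
    var total = 0
    return Array(lines.prefix { line in
        total += line.count + 1   // +1 for the newline joining lines
        return total <= ceiling
    })
}
```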
Privacy by Design
Privacy is not a feature we added — it is a constraint we designed around from the start. Here is exactly what the AI model processes:
What the AI Sees
- Overall mode: Grade letter (e.g., “B”), section names, severity counts (e.g., “Security: critical 1, high 3, medium 7”), and skipped section names
- Section mode: One-liner finding summaries (e.g., “[HIGH] missing-csp: Add Content-Security-Policy header”) and severity counts
What the AI Never Sees
- Raw HTML source code or page content
- Full URLs with query parameters or authentication tokens
- Request/response bodies from security scans
- The actual secret values detected by Gitleaks (only the rule ID and file location)
- Personally identifiable information
- Source code beyond file path and line number references
No Persistence
AI suggestions are held in widget state only. They are not written to the database, not included in report exports, and not cached between sessions. Every click of the button generates a fresh analysis.
Results
- Actionable prioritization: Users get a clear starting point after every Mega Report — which section to fix first, which findings are highest impact
- Privacy guarantee: Zero data leaves the device, and the “Generated on-device by Apple Intelligence” attribution on every result makes that guarantee visible to users
- Offline support: Works without an internet connection — no API calls, no network dependency
- No cost: No API subscriptions, no per-token charges, no usage quotas
- Graceful degradation: On devices without Apple Intelligence, buttons are disabled with clear tooltips explaining why — no errors, no broken UI
- Fast inference: On-device processing on the Neural Engine delivers responses in a few seconds
Lessons Learned
- Design prompts around the constraint, not the ideal input. With ~4096 tokens, we could not send full finding details. The one-liner format was born from necessity — and it turned out to be surprisingly effective. The model does not need verbose descriptions to give useful advice.
- Sort inputs by importance. Sorting sections by severity weight before prompting ensures the model focuses on the most critical areas. Without this, the model might spend its limited output on low-severity sections.
- “Do not give generic advice” dramatically improves output. Adding this constraint to the system instructions forces the model to reference specific findings from the scan, producing targeted recommendations instead of boilerplate advice.
- Plain text output saves tokens. Requesting plain text instead of markdown eliminates formatting overhead (headers, code fences, bullet characters) and produces output that renders cleanly in a Flutter text widget.
- Always check availability at runtime. Not all macOS 26 devices have Apple Intelligence enabled. A tooltip explaining “Apple Intelligence is not available” is far better than a crash or mysterious empty state.
- On-device AI is viable for analysis tasks. The ~4096 token window is a real constraint, but for structured input (severity counts, one-liner findings), it is more than sufficient. The model consistently produces useful, grounded analysis when given specific data to reference.