Implementation Documentation for Agentic LLM Workflow: macOS ScreenMate (SwiftUI First - Direct VLM, In-Memory Screenshot, Custom Prompts)
Develop a native macOS application ("ScreenMate") that:
- Runs as a menubar accessory application (no Dock icon).
- Provides advanced image understanding functionality triggered by a screenshot, capturing the image into memory (as an `NSImage`) and processing it using a locally loaded Vision Language Model (VLM) via MLX Swift, with an option for users to provide custom prompts. (OCR is one of its capabilities.)
- Features a main interface in a menubar popover panel.
- Features a "Custom Prompt" floating panel allowing users to input their own VLM prompts for image processing.
- Allows configuration for auto-starting at login and selecting a VLM model from a predefined list.
- Uses SwiftUI for UI components where feasible, and AppKit for system integrations and panel management.
- Language: Swift.
- UI Framework: SwiftUI first. Use AppKit for system integration.
- ML Framework: MLX Swift (including `MLXLLM`, `MLXLMCommon`, `Tokenizers`, `Hub`) for direct VLM loading and inference.
- Screenshot Handling: Capture screenshots directly into memory as `NSImage`.
- Bridging (UI): `NSHostingView` for SwiftUI in AppKit panels.
- State Management (SwiftUI): Standard SwiftUI state management. `ScreenMateEngine` and `AppSettings` will be `ObservableObject`.
- Modularity: Distinct classes/structs.
- Error Handling: Robust error handling for VLM operations, UI feedback.
- Asynchronous Operations: `async/await`, `Task.detached` for VLM. UI updates on main thread.
- File Structure: Adhere to suggested structure (`UI`, `SystemIntegration`, `Shared` subfolders).
Component ID: C000 (New)
Name: App Settings (`Shared/AppSettings.swift`)
- Purpose: An `ObservableObject` to store shared application settings, like the selected VLM model identifier and potentially the last custom prompt.
- Key Technologies: Swift, Combine (`ObservableObject`, `@Published`).
- Core Responsibilities/Tasks for LLM:
  - Create `AppSettings.swift` (in a `Shared` group).
  - Define a class `AppSettings` conforming to `ObservableObject`.
  - Add `@Published var selectedVLMModelIdentifier: String` initialized with a default from `ScreenMateEngine.supportedVLMModels` (C006).
  - Add `@Published var lastCustomPrompt: String = ""`.
- Inputs/Dependencies: `ScreenMateEngine.supportedVLMModels` (C006) for default.
- Outputs/Deliverables: An `AppSettings` class.
- Interface with Other Components: Injected as an `@StateObject` or `@EnvironmentObject` into `MenubarContentView` (C004), `SettingsView` (C008), and `CustomPromptView` (C012).
Component ID: C001
Name: Application Core Setup (`AppDelegate.swift`)
- Purpose: Initialize the application, set it as an accessory app, and manage its lifecycle. Instantiate core managers.
- Key Technologies: Swift, AppKit (`NSApplicationDelegate`, `NSApp`).
- Core Responsibilities/Tasks for LLM:
  - Create/Update `AppDelegate.swift`.
  - Ensure conformance: `NSObject`, `NSApplicationDelegate`.
  - Implement `applicationDidFinishLaunching(_:)`:
    - Call `NSApp.setActivationPolicy(.accessory)`.
    - Instantiate `MenuBarManager` (C002) and retain it.
  - Add/Modify method `showCustomPromptPanel()` (renamed from `showSpotlight`) to instantiate and show `CustomPromptPanelController` (C011).
  - Implement `applicationWillTerminate(_:)`.
  - Implement `applicationShouldTerminateAfterLastWindowClosed(_:)` returning `false`.
  - Ensure the `@main` app struct in `ScreenMateApp.swift` uses `@NSApplicationDelegateAdaptor(AppDelegate.self)`.
- Inputs/Dependencies: None initially. Will create `MenuBarManager`.
- Outputs/Deliverables: A functioning `AppDelegate.swift`.
- Interface with Other Components: Instantiates `MenuBarManager`. Provides `showCustomPromptPanel()`.
- Success Criteria/Verification: App launches as accessory. `MenuBarManager` initialized. `showCustomPromptPanel()` is callable.
- Reference File(s): `ScreenMate/ScreenMate/AppDelegate.swift`.
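The delegate responsibilities listed for C001 could look roughly like the sketch below. `MenuBarManager` and `CustomPromptPanelController` are empty placeholders here, their initializers are assumptions, and the panel size and style mask are illustrative values rather than anything the plan specifies.

```swift
import AppKit

// Placeholder stand-ins for C002 and C011 so the sketch compiles alone.
final class MenuBarManager: NSObject {}
final class CustomPromptPanelController: NSWindowController {}

final class AppDelegate: NSObject, NSApplicationDelegate {
    // Retained for the app's lifetime so the status item stays alive.
    private var menuBarManager: MenuBarManager?
    private var customPromptPanelController: CustomPromptPanelController?

    func applicationDidFinishLaunching(_ notification: Notification) {
        // Accessory apps have no Dock icon and no main menu.
        NSApp.setActivationPolicy(.accessory)
        menuBarManager = MenuBarManager()
    }

    func applicationWillTerminate(_ notification: Notification) {
        menuBarManager = nil
    }

    func applicationShouldTerminateAfterLastWindowClosed(_ sender: NSApplication) -> Bool {
        // Closing a panel must not quit a menubar app.
        false
    }

    func showCustomPromptPanel() {
        if customPromptPanelController == nil {
            // Assumed size and style; C010 defines the real panel.
            let panel = NSPanel(
                contentRect: NSRect(x: 0, y: 0, width: 480, height: 360),
                styleMask: [.titled, .closable, .nonactivatingPanel],
                backing: .buffered,
                defer: false
            )
            customPromptPanelController = CustomPromptPanelController(window: panel)
        }
        customPromptPanelController?.showWindow(nil)
    }
}
```

Keeping the controller in a property means repeated calls reuse one panel instead of stacking new windows.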
Component ID: C002
Name: MenuBar Management (`MenuBarManager.swift`)
- Purpose: Create and manage the system menubar icon and its actions.
- Key Technologies: Swift, AppKit (`NSStatusItem`, `NSStatusBar`, `NSImage`, `NSObject`).
- Core Responsibilities/Tasks for LLM: (As per original plan; ensure the image name is "TrayIcon" or a new generic icon for ScreenMate.) Default panel dimensions might be around `width: 340, height: 550` to accommodate more UI.
- Inputs/Dependencies: `PanelController` (C003), "TrayIcon" in `Assets.xcassets`.
- Outputs/Deliverables: `MenuBarManager.swift`.
- Success Criteria: Menubar icon appears and toggles panel.
- Reference File(s): `ScreenMate/ScreenMateCore/MenuBarManager.swift`.
Component ID: C003
Name: Menubar Panel Controller (`PanelController.swift`)
- Purpose: Manage the `NSPanel` for the menubar popover, hosting SwiftUI content.
- Key Technologies: Swift, AppKit (`NSWindowController`, `NSPanel`, `NSWindowDelegate`), SwiftUI (`NSHostingView`).
- Core Responsibilities/Tasks for LLM (Updates):
  6. Implement `private func embedSwiftUIView(in panel: NSPanel)`:
    - Instantiate `AppSettings` (C000) and `AutostartManager` (C007).
    - Instantiate `MenubarContentView` (C004), injecting both `appSettings` and `autostartManager` as `.environmentObject()`.
- Inputs/Dependencies: `MenubarContentView` (C004), `AppSettings` (C000), `AutostartManager` (C007).
- Outputs/Deliverables: `PanelController.swift`.
- Success Criteria: Panel works, hosts `MenubarContentView` with injected environment objects.
- Reference File(s): `ScreenMate/ScreenMateCore/PanelController.swift`.
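As a rough illustration of the `embedSwiftUIView(in:)` step, the sketch below hosts a SwiftUI view in the panel through `NSHostingView` and injects both objects. The three types at the top are empty placeholders for C000, C007, and C004, and the `convenience init(panel:)` is a hypothetical entry point added only so the private method has a caller.

```swift
import AppKit
import SwiftUI

// Placeholder stand-ins so the sketch compiles on its own.
final class AppSettings: ObservableObject {}
final class AutostartManager: ObservableObject {}

struct MenubarContentView: View {
    @EnvironmentObject var appSettings: AppSettings
    @EnvironmentObject var autostartManager: AutostartManager
    var body: some View { Text("ScreenMate") }
}

final class PanelController: NSWindowController {
    // Stored here so both objects live as long as the controller does.
    private let appSettings = AppSettings()
    private let autostartManager = AutostartManager()

    // Hypothetical initializer; the original plan may construct the panel itself.
    convenience init(panel: NSPanel) {
        self.init(window: panel)
        embedSwiftUIView(in: panel)
    }

    private func embedSwiftUIView(in panel: NSPanel) {
        let rootView = MenubarContentView()
            .environmentObject(appSettings)
            .environmentObject(autostartManager)
        // NSHostingView bridges the SwiftUI hierarchy into AppKit.
        panel.contentView = NSHostingView(rootView: rootView)
    }
}
```

A view that declares `@EnvironmentObject` crashes at render time if the object was never injected, which is why both injections happen before the hosting view is created.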
Component ID: C004
Name: Menubar Content View (`UI/MenubarContentView.swift`)
- Purpose: Main UI. Interacts with `ScreenshotManager`, `ScreenMateEngine`. Manages VLM model loading based on `AppSettings`. No Copy button.
- Key Technologies: Swift, SwiftUI.
- Core Responsibilities/Tasks for LLM:
  - Create `MenubarContentView.swift` in `UI/`.
  - State Management:
    - `@EnvironmentObject var appSettings: AppSettings` (C000).
    - `@StateObject private var screenshotManager = ScreenshotManager()` (C005).
    - `@StateObject private var screenMateEngine = ScreenMateEngine()` (C006 - Renamed from `OCREngine`).
    - `@State private var processedTextResult: String = "Select a VLM model in Settings and click Load."`.
    - `@State private var showingSettings = false`.
    - `@State private var lastScreenshotPreviewImage: Image?`.
  - Body Layout (VStack):
    - Display `screenMateEngine.currentStatusMessage`, `screenMateEngine.loadedModelNameDisplay`.
    - "Load/Change VLM Model" `Button` (or `TextField` for ID + Load `Button`). Action: `Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }`. Disable based on `screenMateEngine.isLoadingModel`.
    - "Process Screenshot" `Button` (renamed from "Take Screenshot & OCR"). Action: `processScreenshotWithDefaultPrompt()`. Disable appropriately.
    - `ScrollView` for `processedTextResult` (selectable, monospaced).
    - Optional `Image` for `lastScreenshotPreviewImage`.
    - `HStack` with:
      - (Copy Button Removed)
      - "Settings" `Button` (with `.popover` for `SettingsView` C008).
      - "Custom Prompt" `Button`. Action: `(NSApp.delegate as? AppDelegate)?.showCustomPromptPanel()`.
  - `.onChange(of: appSettings.selectedVLMModelIdentifier)`: If selection changes, call `Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }`.
  - Private Methods:
    - `processScreenshotWithDefaultPrompt()`: (was `takeScreenshotAndOCR`)
      - Guard model loaded/engine state.
      - Call `screenshotManager.takeScreenshotToImage(...)`.
      - On `NSImage` received, update preview.
      - Call `screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: screenMateEngine.getDefaultOCRPrompt(), ...)` (the C006 replacement for `performOCR`).
      - Handle `Result`, update `processedTextResult`.
- Inputs/Dependencies: `AppSettings` (C000), `ScreenshotManager` (C005), `ScreenMateEngine` (C006), `SettingsView` (C008), `NotificationManager` (C009), `AppDelegate` (C001).
- Outputs/Deliverables: `MenubarContentView.swift`.
- Success Criteria: Model loading via `appSettings`. Default image processing works. Custom prompt panel invoked.
- Reference File(s): `ScreenMate/ScreenMateCore/UI/MenubarContentView.swift`.
Component ID: C005
Name: Screenshot Manager (In-Memory) (`ScreenshotManager.swift`)
- (No functional changes from the "In-Memory Screenshot" plan.)
- Reference File(s): `ScreenMate/ScreenMateCore/ScreenshotManager.swift`.
Component ID: C006
Name: ScreenMate Engine (Direct VLM with Custom Prompt Support) (`ScreenMateEngine.swift`) (Renamed from OCREngine)
- Purpose: Load, manage, run VLM for image processing (including OCR) on `NSImage`, supporting custom user prompts.
- Key Technologies: Swift, MLX, MLXLLM, MLXLMCommon, Tokenizers, Hub, AppKit.
- Core Responsibilities/Tasks for LLM:
  - Rename file and class from `OCREngine` to `ScreenMateEngine`.
  - (Structure, Published Properties, Error Enum, `init`, `loadModel` as per previous "Direct VLM & In-Memory" plan for C006, using `ScreenMateEngineError`).
  - Add `static let supportedVLMModels: [String: String]` property: A dictionary of `["Display Name": "hub_or_path_identifier"]` (e.g., `["Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit", "Custom Moondream": "/path/to/moondream"]`). Initialize with at least one valid VLM.
  - Add `getDefaultOCRPrompt() -> String` method: Returns a default prompt specifically tuned for OCR (e.g., "Extract all text from this image...").
  - Modify `performOCR` method signature (or rename to a more generic `processImage`): `func processImage(onNSImage nsImage: NSImage, prompt: String, completion: @escaping (Result<String, ScreenMateEngineError>) -> Void)`
  - Inside `processImage(...)`:
    - Use the provided `prompt` when constructing `UserInput` messages. The prompt should already include the VLM-specific image placeholder (e.g., `<image>\nUser's custom prompt here`). The `ScreenMateEngine` might offer a helper to prepend this placeholder if the user prompt is raw text.
    - (Image Handling for `UserInput` and inference logic remains similar, focusing on in-memory `NSImage` to `UserInput.Image` conversion. Agent must prioritize in-memory `UserInput.Image` creation from `NSImage` data, using temporary files only as a last resort if MLX libraries are restrictive).
- Inputs/Dependencies: `NSImage`, VLM model identifier, custom text prompt string. MLX Swift packages.
- Outputs/Deliverables: `ScreenMateEngine.swift`.
- Success Criteria: `loadModel` works. `processImage` uses the custom prompt.
- Reference File(s): Detailed `OCREngine.swift` example, renamed to `ScreenMateEngine.swift` and adapted for custom prompts.
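The placeholder helper mentioned under `processImage(...)` might be sketched as below. `ScreenMatePrompt` and both function names are hypothetical, `<image>` is assumed to be the right token for the selected model, and the default OCR wording is illustrative because the plan gives only an abbreviated example. No MLX calls are involved; this covers just the string handling.

```swift
import Foundation

// Hypothetical helper namespace; the engine could expose these as methods instead.
enum ScreenMatePrompt {
    // Assumed placeholder token; it depends on the model's prompt format.
    static let imagePlaceholder = "<image>"

    // Illustrative wording only; the plan's own example is abbreviated.
    static func defaultOCRPrompt() -> String {
        withImagePlaceholder("Extract all text from this image.")
    }

    // Prepends the placeholder unless the prompt already starts with it.
    static func withImagePlaceholder(_ rawPrompt: String) -> String {
        let trimmed = rawPrompt.trimmingCharacters(in: .whitespacesAndNewlines)
        if trimmed.hasPrefix(imagePlaceholder) {
            return trimmed
        }
        return imagePlaceholder + "\n" + trimmed
    }
}
```

Making the helper idempotent means either the view or the engine can call it without producing a doubled placeholder.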
Component ID: C007
Name: Autostart Manager (`SystemIntegration/AutostartManager.swift`)
- Core Responsibilities/Tasks for LLM (Updates):
  - Ensure `appBundleIdentifier` and `appName` in `init()` are correctly derived from `Bundle.main` for "ScreenMate" or set to the new app's values.
- Reference File(s): `ScreenMate/ScreenMateCore/SystemIntegration/AutostartManager.swift`.
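One way to derive the two values from `Bundle.main` is sketched below. `AppIdentity` is a hypothetical helper type, and the fallback strings are assumptions taken from the plan's own example bundle ID and app name; the login-item registration itself is not shown.

```swift
import Foundation

// Hypothetical value type; AutostartManager could compute these in init().
struct AppIdentity {
    let appBundleIdentifier: String
    let appName: String

    init(bundle: Bundle = .main,
         fallbackIdentifier: String = "com.yourcompany.ScreenMate",
         fallbackName: String = "ScreenMate") {
        // bundleIdentifier is nil when no Info.plist is available.
        appBundleIdentifier = bundle.bundleIdentifier ?? fallbackIdentifier
        // CFBundleName is the short bundle name from Info.plist.
        appName = (bundle.object(forInfoDictionaryKey: "CFBundleName") as? String)
            ?? fallbackName
    }
}
```

Reading the values at runtime avoids leaving a stale "OCRToolbox" string behind after the rename.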
Component ID: C008
Name: Settings View (`UI/SettingsView.swift`)
- Purpose: UI for settings, including VLM model selection from a predefined list.
- Key Technologies: Swift, SwiftUI.
- Core Responsibilities/Tasks for LLM:
  - Create `SettingsView.swift`.
  - `@EnvironmentObject var appSettings: AppSettings` (C000).
  - `@EnvironmentObject var autostartManager: AutostartManager` (C007).
  - Body Layout (VStack):
    - Title, `Toggle` for autostart.
    - VLM Model Selection Section:
      - `Text("Select VLM Model:")`.
      - `Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier)`:
        - Iterate over `ScreenMateEngine.supportedVLMModels.keys.sorted()`.
        - For each key (display name), use `ScreenMateEngine.supportedVLMModels[key]!` as the tag (identifier string).
    - App version display.
- Inputs/Dependencies: `AppSettings` (C000), `AutostartManager` (C007), `ScreenMateEngine.supportedVLMModels` (C006).
- Outputs/Deliverables: `SettingsView.swift` with VLM model picker.
- Success Criteria: Autostart toggle. VLM model selection updates `appSettings.selectedVLMModelIdentifier`.
- Reference File(s): `ScreenMate/ScreenMateCore/UI/SettingsView.swift`.
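The model selection section described above could be sketched as follows. `ModelPickerSection` is a hypothetical name for just that part of `SettingsView`, and `AppSettings` and `ScreenMateEngine` are reduced to stand-ins (with the single model entry copied from the C006 example) so the block compiles alone.

```swift
import SwiftUI

// Stand-ins for C000 and C006 so the sketch compiles on its own.
final class AppSettings: ObservableObject {
    @Published var selectedVLMModelIdentifier =
        "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
}

enum ScreenMateEngine {
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
    ]
}

struct ModelPickerSection: View {
    @EnvironmentObject var appSettings: AppSettings

    var body: some View {
        VStack(alignment: .leading) {
            Text("Select VLM Model:")
            Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier) {
                ForEach(ScreenMateEngine.supportedVLMModels.keys.sorted(), id: \.self) { name in
                    // The tag must be a String to match the selection binding.
                    // The force unwrap is safe: `name` comes from the same dictionary.
                    Text(name).tag(ScreenMateEngine.supportedVLMModels[name]!)
                }
            }
        }
    }
}
```

If the tag type differs from the binding's type (for example `String?` against `String`), the picker shows the rows but never updates the selection, so the tag is kept a plain `String`.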
Component ID: C009
Name: Notification Manager (`SystemIntegration/NotificationManager.swift`)
- Reference File(s): `ScreenMate/ScreenMateCore/SystemIntegration/NotificationManager.swift`.
Component ID: C010
Name: Custom Prompt Panel Appearance (`CustomPromptPanel.swift`) (Renamed from SpotlightPanel)
- Purpose: Define the custom `NSPanel` for the floating "Custom Prompt" window.
- Key Technologies: Swift, AppKit (`NSPanel`).
- Core Responsibilities/Tasks for LLM: (Implement as per original C010 plan, renaming file and class to `CustomPromptPanel`).
- Reference File(s): Rename `SpotlightPanel.swift` to `CustomPromptPanel.swift`.
Component ID: C011
Name: Custom Prompt Panel Controller (`CustomPromptPanelController.swift`) (Renamed)
- Purpose: Manage the `CustomPromptPanel` window and host its SwiftUI content.
- Key Technologies: Swift, AppKit (`NSWindowController`), SwiftUI (`NSHostingView`).
- Core Responsibilities/Tasks for LLM:
  - Create/Rename to `CustomPromptPanelController.swift`.
  - Subclass `NSWindowController`.
  - `convenience init(customPromptPanel: CustomPromptPanel)`.
  - In `AppDelegate.showCustomPromptPanel()` (C001):
    - Create `AppSettings` (C000) and `ScreenMateEngine` (C006) instances, or retrieve shared instances if they are singletons or globally managed. Because `ScreenMateEngine` is already a `@StateObject` in `MenubarContentView`, decide how to share it or pass the necessary data. One option is for `CustomPromptView` to take `AppSettings`, create its own `ScreenshotManager`, and call a global/shared `ScreenMateEngine` instance or a method that uses it. For now, assume it can get the main `ScreenMateEngine` instance.
    - Create `CustomPromptView` (C012). Inject `appSettings` and `screenMateEngine` as `.environmentObject()`.
    - Host in `NSHostingView`, set as `customPromptPanel.contentView`.
    - Instantiate `CustomPromptPanelController(customPromptPanel: panel)`.
- Inputs/Dependencies: `CustomPromptPanel` (C010), `CustomPromptView` (C012), `AppSettings` (C000), `ScreenMateEngine` (C006).
- Reference File(s): Rename `SpotlightPanelController.swift` to `CustomPromptPanelController.swift`.
Component ID: C012
Name: Custom Prompt Content View (`UI/CustomPromptView.swift`) (Renamed from SpotlightContentView)
- Purpose: SwiftUI interface for user to input a custom VLM prompt and process a new screenshot.
- Key Technologies: Swift, SwiftUI.
- Core Responsibilities/Tasks for LLM:
  - Create `CustomPromptView.swift` in `UI/`.
  - State Management:
    - `@EnvironmentObject var appSettings: AppSettings` (C000).
    - `@EnvironmentObject var screenMateEngine: ScreenMateEngine` (C006).
    - `@StateObject private var screenshotManager = ScreenshotManager()` (C005).
    - `@State private var userPromptText: String = ""`. (Initialize with `appSettings.lastCustomPrompt` in `.onAppear`).
    - `@State private var customProcessingInProgress: Bool = false`.
    - `@State private var customProcessingResultText: String = ""`.
    - `@State private var screenshotForCustomPromptPreview: Image?`.
  - Body Layout (VStack):
    - `Text("Custom VLM Prompt")`.
    - `TextEditor(text: $userPromptText)` for multi-line input. Min height, resizable.
    - `Button("Take Screenshot & Process with This Prompt")`. Action: `processScreenshotWithCustomPrompt()`. Disable if `screenMateEngine.modelContainer == nil`, `screenMateEngine.isLoadingModel`, or `customProcessingInProgress`.
    - Optional `Image` view for `screenshotForCustomPromptPreview`.
    - `ScrollView` to display `customProcessingResultText`.
  - Private Methods:
    - `processScreenshotWithCustomPrompt()`:
      - Set `customProcessingInProgress = true`. Update `appSettings.lastCustomPrompt = userPromptText`.
      - Ensure `userPromptText` includes the VLM-specific image placeholder (e.g., `<image>\n`). The view could prepend this if `userPromptText` is just the raw question.
      - Call `screenshotManager.takeScreenshotToImage(...)`.
      - On `NSImage` received, update `screenshotForCustomPromptPreview`.
      - Call `screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: userPromptTextWithPlaceholder, ...)`
      - Handle `Result`, update `customProcessingResultText`. Set `customProcessingInProgress = false`.
- Inputs/Dependencies: `AppSettings` (C000), `ScreenMateEngine` (C006), `ScreenshotManager` (C005).
- Outputs/Deliverables: `CustomPromptView.swift`.
- Interface with Other Components: Hosted by `CustomPromptPanelController`.
- Success Criteria: User inputs prompt, triggers screenshot, VLM processes with custom prompt, results displayed.
- Reference File(s): `SpotlightContentView.swift` to be heavily adapted into `CustomPromptView.swift`.
Component ID: C013 (Optional - Low Priority)
Name: Workspace Monitor (`SystemIntegration/WorkspaceMonitor.swift`)
- (No change to its own spec, but its utility might be higher with custom, context-aware prompts).
- Reference File(s): `ScreenMate/ScreenMateCore/SystemIntegration/WorkspaceMonitor.swift`.
Component ID: C014
Name: Project Setup & Configuration
- Core Responsibilities/Tasks for LLM (Updates):
  - Rename existing project files/targets from "OCRToolbox" to "ScreenMate" (careful: manual steps are often needed here first).
  - Update Bundle Identifier to reflect "ScreenMate" (e.g., `com.yourcompany.ScreenMate`).
  - (MLX Dependencies remain essential).
- Reference File(s): Project build settings, `Info.plist`.
- Project Renaming (Manual/Guided First): Rename `.xcodeproj`, schemes, targets to "ScreenMate".
- C014 (Project Setup): Verify Bundle ID for "ScreenMate". Ensure MLX dependencies are linked.
- C000 (AppSettings): Create shared settings.
- C001 (AppDelegate): Update for `showCustomPromptPanel()`.
- C005 (ScreenshotManager - In-Memory).
- C009 (NotificationManager), C007 (AutostartManager - Update for new Bundle ID/App Name).
- C006 (ScreenMateEngine - Renamed from OCREngine): Implement `supportedVLMModels`, `getDefaultOCRPrompt()`. Adapt `processImage` (was `performOCR`) to take `prompt: String`.
- C008 (SettingsView): Implement with VLM model `Picker` using `ScreenMateEngine.supportedVLMModels` and binding to `appSettings.selectedVLMModelIdentifier`.
- C004 (MenubarContentView):
  - Inject `AppSettings`. Rename `ocrEngine` to `screenMateEngine`.
  - Implement UI for loading model based on `appSettings.selectedVLMModelIdentifier`.
  - "Process Screenshot" button calls `screenMateEngine.processImage` with `screenMateEngine.getDefaultOCRPrompt()`.
  - Button to invoke `AppDelegate.showCustomPromptPanel()`. Remove Copy button.
- C010 (CustomPromptPanel - Renamed), C012 (CustomPromptView - Renamed), C011 (CustomPromptPanelController - Renamed): Implement the custom prompt UI and its interactions.
- C003 (PanelController): Ensure `AppSettings` and `AutostartManager` are correctly injected into `MenubarContentView`.
. - C002 (MenuBarManager).
- Core functionality fully testable: VLM selection, default processing, custom prompt processing.
- Refinements & Testing.
- Renaming: Be meticulous with renaming "OCRToolbox" to "ScreenMate" and "OCREngine" to "ScreenMateEngine" throughout the codebase, including filenames, class names, variable names, comments, and log messages.
- `AppSettings` (C000): This is a new central piece for settings.
- `ScreenMateEngine` (C006):
  - Add `supportedVLMModels` static property.
  - `processImage` (renamed from `performOCR`) must accept a `prompt: String`.
- `MenubarContentView` (C004): Remove Copy button. Drive model loading via `appSettings`.
- `SettingsView` (C008): Implement `Picker` for VLM selection.
- Custom Prompt Feature (C010, C011, C012): This is a significant UI and logic addition. `CustomPromptView` will need to manage its own screenshot and prompt, then call `screenMateEngine.processImage`. Ensure the VLM-specific image placeholder (e.g. `<image>\n`) is correctly prepended to the user's custom text prompt before sending to `ScreenMateEngine`.