@zz2115
Created May 25, 2025 20:29

Implementation Documentation for Agentic LLM Workflow: macOS ScreenMate (SwiftUI First - Direct VLM, In-Memory Screenshot, Custom Prompts)

1. Overall Project Goal:

Develop a native macOS application ("ScreenMate") that:

  • Runs as a menubar accessory application (no Dock icon).
  • Provides advanced image understanding functionality triggered by a screenshot, capturing the image into memory (as an NSImage) and processing it using a locally loaded Vision Language Model (VLM) via MLX Swift, with an option for users to provide custom prompts. (OCR is one of its capabilities).
  • Features a main interface in a menubar popover panel.
  • Features a "Custom Prompt" floating panel allowing users to input their own VLM prompts for image processing.
  • Allows configuration for auto-starting at login and selecting a VLM model from a predefined list.
  • Uses SwiftUI for UI components where feasible, and AppKit for system integrations and panel management.

2. Core Principles & Constraints for LLM Agent:

  • Language: Swift.
  • UI Framework: SwiftUI first. Use AppKit for system integration.
  • ML Framework: MLX Swift (including MLXLLM, MLXLMCommon, Tokenizers, Hub) for direct VLM loading and inference.
  • Screenshot Handling: Capture screenshots directly into memory as NSImage.
  • Bridging (UI): NSHostingView for SwiftUI in AppKit panels.
  • State Management (SwiftUI): Standard SwiftUI state management. ScreenMateEngine and AppSettings will be ObservableObject.
  • Modularity: Distinct classes/structs.
  • Error Handling: Robust error handling for VLM operations, UI feedback.
  • Asynchronous Operations: async/await, Task.detached for VLM. UI updates on main thread.
  • File Structure: Adhere to suggested structure (UI, SystemIntegration, Shared subfolders).
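
The asynchronous-operations principle above can be sketched roughly as follows. This is a minimal illustration, not the engine itself: StatusModel and runHeavyWork are hypothetical names, and the summation is only a placeholder for VLM inference.

```swift
import Foundation

// Hypothetical stand-in for UI-facing state (for example, an engine's status message).
@MainActor
final class StatusModel {
    var statusMessage = "Idle"
}

// Hypothetical helper: runs expensive work off the main actor,
// then publishes the result back on the main actor.
func runHeavyWork(updating model: StatusModel) async {
    // Task.detached does not inherit the caller's actor, so the closure
    // runs on the global concurrent executor rather than the main thread.
    let result = await Task.detached(priority: .userInitiated) { () -> String in
        // Placeholder for VLM inference.
        String((1...1000).reduce(0, +))
    }.value

    // Hop back to the main actor before touching UI-facing state.
    await MainActor.run {
        model.statusMessage = "Done: \(result)"
    }
}
```

The same shape should apply when the detached closure wraps a real model call; only the state update needs the main actor.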

3. Component Breakdown & Implementation Tasks:


Component ID: C000 (New) Name: App Settings (Shared/AppSettings.swift)

  • Purpose: An ObservableObject to store shared application settings, like the selected VLM model identifier and potentially the last custom prompt.
  • Key Technologies: Swift, Combine (ObservableObject, @Published).
  • Core Responsibilities/Tasks for LLM:
    1. Create AppSettings.swift (in a Shared group).
    2. Define a class AppSettings conforming to ObservableObject.
    3. Add @Published var selectedVLMModelIdentifier: String initialized with a default from ScreenMateEngine.supportedVLMModels (C006).
    4. Add @Published var lastCustomPrompt: String = "".
  • Inputs/Dependencies: ScreenMateEngine.supportedVLMModels (C006) for default.
  • Outputs/Deliverables: An AppSettings class.
  • Interface with Other Components: Injected as an @StateObject or @EnvironmentObject into MenubarContentView (C004), SettingsView (C008), and CustomPromptView (C012).
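
A minimal sketch of what C000 might look like. The ScreenMateEngine enum here is only a stand-in so the snippet compiles on its own, and picking the first display name in sorted order as the default is an assumption; the spec only asks for "a default" from the supported list.

```swift
import Combine
import Foundation

// Stand-in for the real engine (C006); only the static model table matters here.
// The entry is the example identifier given in this document.
enum ScreenMateEngine {
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
    ]
}

final class AppSettings: ObservableObject {
    @Published var selectedVLMModelIdentifier: String
    @Published var lastCustomPrompt: String = ""

    init() {
        // Dictionary order is unspecified, so sort the display names
        // to get a stable default (assumed policy, not from the spec).
        let models = ScreenMateEngine.supportedVLMModels
        let firstName = models.keys.sorted().first
        selectedVLMModelIdentifier = firstName.flatMap { models[$0] } ?? ""
    }
}
```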

Component ID: C001 Name: Application Core Setup (AppDelegate.swift)

  • Purpose: Initialize the application, set it as an accessory app, and manage its lifecycle. Instantiate core managers.
  • Key Technologies: Swift, AppKit (NSApplicationDelegate, NSApp).
  • Core Responsibilities/Tasks for LLM:
    1. Create/Update AppDelegate.swift.
    2. Ensure conformance: NSObject, NSApplicationDelegate.
    3. Implement applicationDidFinishLaunching(_:):
      • Set NSApp.setActivationPolicy(.accessory).
      • Instantiate MenuBarManager (C002) and retain it.
    4. Add/Modify method showCustomPromptPanel() (renamed from showSpotlight) to instantiate and show CustomPromptPanelController (C011).
    5. Implement applicationWillTerminate(_:).
    6. Implement applicationShouldTerminateAfterLastWindowClosed(_:) returning false.
    7. Ensure @main in ScreenMateApp.swift uses @NSApplicationDelegateAdaptor(AppDelegate.self).
  • Inputs/Dependencies: None initially. Will create MenuBarManager.
  • Outputs/Deliverables: A functioning AppDelegate.swift.
  • Interface with Other Components: Instantiates MenuBarManager. Provides showCustomPromptPanel().
  • Success Criteria/Verification: App launches as accessory. MenuBarManager initialized. showCustomPromptPanel() is callable.
  • Reference File(s): ScreenMate/ScreenMate/AppDelegate.swift.
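
As a rough skeleton of the tasks above: the MenuBarManager class below is an empty placeholder for C002, and the body of showCustomPromptPanel() is intentionally left as a comment because the panel classes (C010 to C012) are specified separately.

```swift
import AppKit

// Empty placeholder for the real MenuBarManager (C002).
final class MenuBarManager: NSObject {}

final class AppDelegate: NSObject, NSApplicationDelegate {
    // Retained for the lifetime of the app so the status item stays alive.
    private var menuBarManager: MenuBarManager?

    func applicationDidFinishLaunching(_ notification: Notification) {
        // Accessory apps have no Dock icon and no main menu bar presence.
        NSApp.setActivationPolicy(.accessory)
        menuBarManager = MenuBarManager()
    }

    func applicationWillTerminate(_ notification: Notification) {
        menuBarManager = nil
    }

    func applicationShouldTerminateAfterLastWindowClosed(_ sender: NSApplication) -> Bool {
        // Closing a panel must not quit a menubar app.
        false
    }

    func showCustomPromptPanel() {
        // Would create and show CustomPromptPanelController (C011).
    }
}
```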

Component ID: C002 Name: MenuBar Management (MenuBarManager.swift)

  • Purpose: Create and manage the system menubar icon and its actions.
  • Key Technologies: Swift, AppKit (NSStatusItem, NSStatusBar, NSImage, NSObject).
  • Core Responsibilities/Tasks for LLM: Implement as per the original plan. Ensure the status item image is named "TrayIcon", or use a new generic icon for ScreenMate. Default panel dimensions of roughly width: 340, height: 550 should accommodate the additional UI.
  • Inputs/Dependencies: PanelController (C003), "TrayIcon" in Assets.xcassets.
  • Outputs/Deliverables: MenuBarManager.swift.
  • Success Criteria: Menubar icon appears and toggles panel.
  • Reference File(s): ScreenMate/ScreenMateCore/MenuBarManager.swift.

Component ID: C003 Name: Menubar Panel Controller (PanelController.swift)

  • Purpose: Manage the NSPanel for the menubar popover, hosting SwiftUI content.
  • Key Technologies: Swift, AppKit (NSWindowController, NSPanel, NSWindowDelegate), SwiftUI (NSHostingView).
  • Core Responsibilities/Tasks for LLM (Updates):
    6. Implement private func embedSwiftUIView(in panel: NSPanel):
      • Instantiate AppSettings (C000) and AutostartManager (C007).
      • Instantiate MenubarContentView (C004), injecting both appSettings and autostartManager as .environmentObject().
  • Inputs/Dependencies: MenubarContentView (C004), AppSettings (C000), AutostartManager (C007).
  • Outputs/Deliverables: PanelController.swift.
  • Success Criteria: Panel works, hosts MenubarContentView with injected environment objects.
  • Reference File(s): ScreenMate/ScreenMateCore/PanelController.swift.

Component ID: C004 Name: Menubar Content View (UI/MenubarContentView.swift)

  • Purpose: Main UI. Interacts with ScreenshotManager, ScreenMateEngine. Manages VLM model loading based on AppSettings. No Copy button.
  • Key Technologies: Swift, SwiftUI.
  • Core Responsibilities/Tasks for LLM:
    1. Create MenubarContentView.swift in UI/.
    2. State Management:
      • @EnvironmentObject var appSettings: AppSettings (C000).
      • @StateObject private var screenshotManager = ScreenshotManager() (C005).
      • @StateObject private var screenMateEngine = ScreenMateEngine() (C006 - Renamed from OCREngine).
      • @State private var processedTextResult: String = "Select a VLM model in Settings and click Load.".
      • @State private var showingSettings = false.
      • @State private var lastScreenshotPreviewImage: Image?.
    3. Body Layout (VStack):
      • Display screenMateEngine.currentStatusMessage, screenMateEngine.loadedModelNameDisplay.
      • "Load/Change VLM Model" Button (or TextField for ID + Load Button). Action: Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }. Disable based on screenMateEngine.isLoadingModel.
      • "Process Screenshot" Button (renamed from "Take Screenshot & OCR"). Action: processScreenshotWithDefaultPrompt(). Disable appropriately.
      • ScrollView for processedTextResult (selectable, monospaced).
      • Optional Image for lastScreenshotPreviewImage.
      • HStack with:
        • (Copy Button Removed)
        • "Settings" Button (with .popover for SettingsView C008).
        • "Custom Prompt" Button. Action: (NSApp.delegate as? AppDelegate)?.showCustomPromptPanel().
    4. .onChange(of: appSettings.selectedVLMModelIdentifier): If selection changes, call Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }.
    5. Private Methods:
      • processScreenshotWithDefaultPrompt(): (was takeScreenshotAndOCR)
        • Guard model loaded/engine state.
        • Call screenshotManager.takeScreenshotToImage(...).
        • On NSImage received, update preview.
        • Call screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: screenMateEngine.getDefaultOCRPrompt(), ...)
        • Handle Result, update processedTextResult.
  • Inputs/Dependencies: AppSettings (C000), ScreenshotManager (C005), ScreenMateEngine (C006), SettingsView (C008), NotificationManager (C009), AppDelegate (C001).
  • Outputs/Deliverables: MenubarContentView.swift.
  • Success Criteria: Model loading via appSettings. Default image processing works. Custom prompt panel invoked.
  • Reference File(s): ScreenMate/ScreenMateCore/UI/MenubarContentView.swift.

Component ID: C005 Name: Screenshot Manager (In-Memory) (ScreenshotManager.swift)

  • (No functional changes from the "In-Memory Screenshot" plan.)
  • Reference File(s): ScreenMate/ScreenMateCore/ScreenshotManager.swift.

Component ID: C006 Name: ScreenMate Engine (Direct VLM with Custom Prompt Support) (ScreenMateEngine.swift) (Renamed from OCREngine)

  • Purpose: Load, manage, run VLM for image processing (including OCR) on NSImage, supporting custom user prompts.
  • Key Technologies: Swift, MLX, MLXLLM, MLXLMCommon, Tokenizers, Hub, AppKit.
  • Core Responsibilities/Tasks for LLM:
    1. Rename file and class from OCREngine to ScreenMateEngine.
    2. (Structure, Published Properties, Error Enum, init, loadModel as per previous "Direct VLM & In-Memory" plan for C006, using ScreenMateEngineError).
    3. Add static let supportedVLMModels: [String: String] property: A dictionary of ["Display Name": "hub_or_path_identifier"] (e.g., ["Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit", "Custom Moondream": "/path/to/moondream"]). Initialize with at least one valid VLM.
    4. Add getDefaultOCRPrompt() -> String method: Returns a default prompt specifically tuned for OCR (e.g., "Extract all text from this image...").
    5. Rename performOCR to the more generic processImage and change its signature to accept a prompt: func processImage(onNSImage nsImage: NSImage, prompt: String, completion: @escaping (Result<String, ScreenMateEngineError>) -> Void)
    6. Inside processImage(...):
      • Use the provided prompt when constructing UserInput messages. The prompt should already include the VLM-specific image placeholder (e.g., <image>\nUser's custom prompt here). The ScreenMateEngine might offer a helper to prepend this placeholder if the user prompt is raw text.
      • (Image Handling for UserInput and inference logic remains similar, focusing on in-memory NSImage to UserInput.Image conversion. Agent must prioritize in-memory UserInput.Image creation from NSImage data, using temporary files only as a last resort if MLX libraries are restrictive).
  • Inputs/Dependencies: NSImage, VLM model identifier, custom text prompt string. MLX Swift packages.
  • Outputs/Deliverables: ScreenMateEngine.swift.
  • Success Criteria: loadModel works. processImage uses the custom prompt.
  • Reference File(s): Detailed OCREngine.swift example, renamed to ScreenMateEngine.swift and adapted for custom prompts.
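
The placeholder-prepending helper mentioned in task 6 could look roughly like this. PromptFormatter, the emptyPrompt case, and the choice to reject blank prompts are all assumptions for illustration; the "<image>" token is the example used in this document, and the real token depends on the selected VLM.

```swift
import Foundation

// Hypothetical cases; the spec names the enum but does not list its cases.
enum ScreenMateEngineError: Error, Equatable {
    case modelNotLoaded
    case emptyPrompt
}

// Hypothetical helper type, not part of the spec.
enum PromptFormatter {
    // Example token from this document; model-specific in practice.
    static let imagePlaceholder = "<image>"

    // Returns the prompt with the image placeholder prepended,
    // unless the prompt already contains it.
    static func withImagePlaceholder(_ rawPrompt: String) throws -> String {
        let trimmed = rawPrompt.trimmingCharacters(in: .whitespacesAndNewlines)
        guard !trimmed.isEmpty else { throw ScreenMateEngineError.emptyPrompt }
        if trimmed.contains(imagePlaceholder) { return trimmed }
        return imagePlaceholder + "\n" + trimmed
    }
}
```

Keeping this in one place means both the default OCR prompt and user-supplied prompts go through the same normalization before reaching processImage.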

Component ID: C007 Name: Autostart Manager (SystemIntegration/AutostartManager.swift)

  • Core Responsibilities/Tasks for LLM (Updates):
    • Ensure appBundleIdentifier and appName in init() are correctly derived from Bundle.main for "ScreenMate" or set to the new app's values.
  • Reference File(s): ScreenMate/ScreenMateCore/SystemIntegration/AutostartManager.swift.

Component ID: C008 Name: Settings View (UI/SettingsView.swift)

  • Purpose: UI for settings, including VLM model selection from a predefined list.
  • Key Technologies: Swift, SwiftUI.
  • Core Responsibilities/Tasks for LLM:
    1. Create SettingsView.swift.
    2. @EnvironmentObject var appSettings: AppSettings (C000).
    3. @EnvironmentObject var autostartManager: AutostartManager (C007).
    4. Body Layout (VStack):
      • Title, Toggle for autostart.
      • VLM Model Selection Section:
        • Text("Select VLM Model:").
        • Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier):
          • Iterate over ScreenMateEngine.supportedVLMModels.keys.sorted().
          • For each key (display name), use ScreenMateEngine.supportedVLMModels[key]! as the tag (identifier string).
      • App version display.
  • Inputs/Dependencies: AppSettings (C000), AutostartManager (C007), ScreenMateEngine.supportedVLMModels (C006).
  • Outputs/Deliverables: SettingsView.swift with VLM model picker.
  • Success Criteria: Autostart toggle. VLM model selection updates appSettings.selectedVLMModelIdentifier.
  • Reference File(s): ScreenMate/ScreenMateCore/UI/SettingsView.swift.
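
One way to prepare the Picker rows without force-unwrapping dictionary lookups is to convert the model table into a sorted array first. ModelOption and modelOptions(from:) are hypothetical names used only for this sketch.

```swift
// Hypothetical row type for the picker; identifier doubles as the stable id.
struct ModelOption: Identifiable, Equatable {
    let displayName: String
    let identifier: String
    var id: String { identifier }
}

// Converts ["Display Name": "identifier"] into rows sorted by display name.
func modelOptions(from models: [String: String]) -> [ModelOption] {
    models.keys.sorted().compactMap { name in
        models[name].map { ModelOption(displayName: name, identifier: $0) }
    }
}
```

Inside the Picker, the rows could then be rendered with ForEach(options) { Text($0.displayName).tag($0.identifier) }, so that each tag matches the String type of appSettings.selectedVLMModelIdentifier.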

Component ID: C009 Name: Notification Manager (SystemIntegration/NotificationManager.swift)

  • Reference File(s): ScreenMate/ScreenMateCore/SystemIntegration/NotificationManager.swift.

Component ID: C010 Name: Custom Prompt Panel Appearance (CustomPromptPanel.swift) (Renamed from SpotlightPanel)

  • Purpose: Define the custom NSPanel for the floating "Custom Prompt" window.
  • Key Technologies: Swift, AppKit (NSPanel).
  • Core Responsibilities/Tasks for LLM: (Implement as per original C010 plan, renaming file and class to CustomPromptPanel).
  • Reference File(s): Rename SpotlightPanel.swift to CustomPromptPanel.swift.

Component ID: C011 Name: Custom Prompt Panel Controller (CustomPromptPanelController.swift) (Renamed)

  • Purpose: Manage the CustomPromptPanel window and host its SwiftUI content.
  • Key Technologies: Swift, AppKit (NSWindowController), SwiftUI (NSHostingView).
  • Core Responsibilities/Tasks for LLM:
    1. Create/Rename to CustomPromptPanelController.swift.
    2. Subclass NSWindowController.
    3. convenience init(customPromptPanel: CustomPromptPanel).
    4. In AppDelegate.showCustomPromptPanel() (C001):
      • Obtain the AppSettings (C000) and ScreenMateEngine (C006) instances to inject. Prefer sharing the instances already in use over creating new ones: a fresh ScreenMateEngine would start with no model loaded. Note that MenubarContentView (C004) currently owns its engine as a private @StateObject, so the engine must first be made shareable (for example, a globally managed or shared instance). For now, assume the main ScreenMateEngine instance is available here. CustomPromptView creates its own ScreenshotManager.
      • Create CustomPromptView (C012). Inject appSettings and screenMateEngine as .environmentObject().
      • Host in NSHostingView, set as customPromptPanel.contentView.
      • Instantiate CustomPromptPanelController(customPromptPanel: panel).
  • Inputs/Dependencies: CustomPromptPanel (C010), CustomPromptView (C012), AppSettings (C000), ScreenMateEngine (C006).
  • Reference File(s): Rename SpotlightPanelController.swift to CustomPromptPanelController.swift.

Component ID: C012 Name: Custom Prompt Content View (UI/CustomPromptView.swift) (Renamed from SpotlightContentView)

  • Purpose: SwiftUI interface for user to input a custom VLM prompt and process a new screenshot.
  • Key Technologies: Swift, SwiftUI.
  • Core Responsibilities/Tasks for LLM:
    1. Create CustomPromptView.swift in UI/.
    2. State Management:
      • @EnvironmentObject var appSettings: AppSettings (C000).
      • @EnvironmentObject var screenMateEngine: ScreenMateEngine (C006).
      • @StateObject private var screenshotManager = ScreenshotManager() (C005).
      • @State private var userPromptText: String = "". (Initialize with appSettings.lastCustomPrompt in .onAppear).
      • @State private var customProcessingInProgress: Bool = false.
      • @State private var customProcessingResultText: String = "".
      • @State private var screenshotForCustomPromptPreview: Image?.
    3. Body Layout (VStack):
      • Text("Custom VLM Prompt").
      • TextEditor(text: $userPromptText) for multi-line input. Min height, resizable.
      • Button("Take Screenshot & Process with This Prompt"). Action: processScreenshotWithCustomPrompt(). Disable if screenMateEngine.modelContainer == nil, screenMateEngine.isLoadingModel, or customProcessingInProgress.
      • Optional Image view for screenshotForCustomPromptPreview.
      • ScrollView to display customProcessingResultText.
    4. Private Methods:
      • processScreenshotWithCustomPrompt():
        • Set customProcessingInProgress = true. Update appSettings.lastCustomPrompt = userPromptText.
        • Ensure userPromptText includes the VLM-specific image placeholder (e.g., <image>\n). The view could prepend this if userPromptText is just the raw question.
        • Call screenshotManager.takeScreenshotToImage(...).
        • On NSImage received, update screenshotForCustomPromptPreview.
        • Call screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: userPromptTextWithPlaceholder, ...)
        • Handle Result, update customProcessingResultText. Set customProcessingInProgress = false.
  • Inputs/Dependencies: AppSettings (C000), ScreenMateEngine (C006), ScreenshotManager (C005).
  • Outputs/Deliverables: CustomPromptView.swift.
  • Interface with Other Components: Hosted by CustomPromptPanelController.
  • Success Criteria: User inputs prompt, triggers screenshot, VLM processes with custom prompt, results displayed.
  • Reference File(s): SpotlightContentView.swift to be heavily adapted into CustomPromptView.swift.
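
The control flow of processScreenshotWithCustomPrompt() can be sketched independently of AppKit and MLX. In this illustration CustomPromptFlow and FlowError are hypothetical, the capture and process closures stand in for ScreenshotManager and ScreenMateEngine.processImage, and the generic Screenshot parameter stands in for NSImage. A real view would also make sure the state updates happen on the main thread.

```swift
import Foundation

// Hypothetical error type for this sketch only.
enum FlowError: Error {
    case captureFailed
    case modelNotLoaded
}

// Hypothetical coordinator mirroring the steps listed above.
final class CustomPromptFlow<Screenshot> {
    private(set) var inProgress = false
    private(set) var resultText = ""

    func run(
        prompt: String,
        capture: (@escaping (Screenshot?) -> Void) -> Void,
        process: @escaping (Screenshot, String, @escaping (Result<String, FlowError>) -> Void) -> Void
    ) {
        // Ignore re-entrant requests while a run is active.
        guard !inProgress else { return }
        inProgress = true

        capture { [weak self] screenshot in
            guard let self else { return }
            guard let screenshot else {
                self.finish(with: .failure(.captureFailed))
                return
            }
            process(screenshot, prompt) { result in
                self.finish(with: result)
            }
        }
    }

    // Stores the outcome and clears the in-progress flag on every path.
    private func finish(with result: Result<String, FlowError>) {
        switch result {
        case .success(let text): resultText = text
        case .failure(let error): resultText = "Error: \(error)"
        }
        inProgress = false
    }
}
```

Routing every exit through finish(with:) keeps the button from staying disabled after a failed capture.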

Component ID: C013 (Optional - Low Priority) Name: Workspace Monitor (SystemIntegration/WorkspaceMonitor.swift)

  • (No change to its own spec, but its utility might be higher with custom, context-aware prompts).
  • Reference File(s): ScreenMate/ScreenMateCore/SystemIntegration/WorkspaceMonitor.swift.

Component ID: C014 Name: Project Setup & Configuration

  • Core Responsibilities/Tasks for LLM (Updates):
    1. Rename existing project files/targets from "OCRToolbox" to "ScreenMate" (careful, manual steps often needed here first).
    2. Update Bundle Identifier to reflect "ScreenMate" (e.g., com.yourcompany.ScreenMate).
    3. (MLX Dependencies remain essential).
  • Reference File(s): Project build settings, Info.plist.

4. Suggested Agentic Workflow (Adjusted for Renaming, VLM Selection & Custom Prompts):

  1. Project Renaming (Manual/Guided First): Rename .xcodeproj, schemes, targets to "ScreenMate".
  2. C014 (Project Setup): Verify Bundle ID for "ScreenMate". Ensure MLX dependencies are linked.
  3. C000 (AppSettings): Create shared settings.
  4. C001 (AppDelegate): Update for showCustomPromptPanel().
  5. C005 (ScreenshotManager - In-Memory).
  6. C009 (NotificationManager), C007 (AutostartManager - Update for new Bundle ID/App Name).
  7. C006 (ScreenMateEngine - Renamed from OCREngine): Implement supportedVLMModels, getDefaultOCRPrompt(). Adapt processImage (was performOCR) to take prompt: String.
  8. C008 (SettingsView): Implement with VLM model Picker using ScreenMateEngine.supportedVLMModels and binding to appSettings.selectedVLMModelIdentifier.
  9. C004 (MenubarContentView):
    • Inject AppSettings. Rename ocrEngine to screenMateEngine.
    • Implement UI for loading model based on appSettings.selectedVLMModelIdentifier.
    • "Process Screenshot" button calls screenMateEngine.processImage with screenMateEngine.getDefaultOCRPrompt().
    • Button to invoke AppDelegate.showCustomPromptPanel(). Remove Copy button.
  10. C010 (CustomPromptPanel - Renamed), C012 (CustomPromptView - Renamed), C011 (CustomPromptPanelController - Renamed): Implement the custom prompt UI and its interactions.
  11. C003 (PanelController): Ensure AppSettings and AutostartManager are correctly injected into MenubarContentView.
  12. C002 (MenuBarManager).
    • At this point the core functionality is fully testable: VLM selection, default processing, custom prompt processing.
  13. Refinements & Testing.

5. Agent Instructions (Updates):

  • Renaming: Be meticulous with renaming "OCRToolbox" to "ScreenMate" and "OCREngine" to "ScreenMateEngine" throughout the codebase, including filenames, class names, variable names, comments, and log messages.
  • AppSettings (C000): This is a new central piece for settings.
  • ScreenMateEngine (C006):
    • Add supportedVLMModels static property.
    • processImage (renamed from performOCR) must accept a prompt: String.
  • MenubarContentView (C004): Remove Copy button. Drive model loading via appSettings.
  • SettingsView (C008): Implement Picker for VLM selection.
  • Custom Prompt Feature (C010, C011, C012): This is a significant UI and logic addition. CustomPromptView will need to manage its own screenshot and prompt, then call screenMateEngine.processImage. Ensure the VLM-specific image placeholder (e.g. <image>\n) is correctly prepended to the user's custom text prompt before sending to ScreenMateEngine.