@mzbac
Created May 25, 2025 15:18
ScreenMate Implementation Documentation
# Implementation Documentation for Agentic LLM Workflow: macOS ScreenMate (SwiftUI First - Direct VLM, In-Memory Screenshot, Custom Prompts)
## 1. Overall Project Goal:
Develop a native macOS application ("ScreenMate") that:
* Runs as a menubar accessory application (no Dock icon).
* Provides advanced image understanding functionality triggered by a screenshot, capturing the image **into memory (as an `NSImage`)** and processing it using a **locally loaded Vision Language Model (VLM) via MLX Swift**, with an option for users to provide **custom prompts**. (OCR is one of its capabilities).
* Features a main interface in a menubar popover panel.
* Features a "Custom Prompt" floating panel allowing users to input their own VLM prompts for image processing.
* Allows configuration for auto-starting at login and **selecting a VLM model from a predefined list**.
* Uses SwiftUI for UI components where feasible, and AppKit for system integrations and panel management.
## 2. Core Principles & Constraints for LLM Agent:
* **Language:** Swift.
* **UI Framework:** SwiftUI first. Use AppKit for system integration.
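* **ML Framework:** **MLX Swift** (including `MLXLLM`, `MLXLMCommon`, `Tokenizers`, `Hub`) for direct VLM loading and inference. Note that in the mlx-swift-examples package the vision-language model support lives in the `MLXVLM` library, so that product is likely needed as well.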
* **Screenshot Handling:** Capture screenshots directly into memory as `NSImage`.
* **Bridging (UI):** `NSHostingView` for SwiftUI in AppKit panels.
* **State Management (SwiftUI):** Standard SwiftUI state management. `ScreenMateEngine` and `AppSettings` will be `ObservableObject`.
* **Modularity:** Distinct classes/structs.
* **Error Handling:** Robust error handling for VLM operations, UI feedback.
* **Asynchronous Operations:** `async/await`, `Task.detached` for VLM. UI updates on main thread.
* **File Structure:** Adhere to suggested structure (`UI`, `SystemIntegration`, `Shared` subfolders).
## 3. Component Breakdown & Implementation Tasks:
---
**Component ID: C000 (New)**
**Name: App Settings (`Shared/AppSettings.swift`)**
- **Purpose:** An `ObservableObject` to store shared application settings, like the selected VLM model identifier and potentially the last custom prompt.
- **Key Technologies:** Swift, Combine (`ObservableObject`, `@Published`).
- **Core Responsibilities/Tasks for LLM:**
1. Create `AppSettings.swift` (in a `Shared` group).
2. Define a class `AppSettings` conforming to `ObservableObject`.
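3. Add `@Published var selectedVLMModelIdentifier: String`, initialized with a default identifier (a dictionary value, not a display-name key) from `ScreenMateEngine.supportedVLMModels` (C006). Dictionaries are unordered, so pick the default deterministically, e.g. the entry whose display name sorts first.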
4. Add `@Published var lastCustomPrompt: String = ""`.
- **Inputs/Dependencies:** `ScreenMateEngine.supportedVLMModels` (C006) for default.
- **Outputs/Deliverables:** An `AppSettings` class.
- **Interface with Other Components:** Injected as an `@StateObject` or `@EnvironmentObject` into `MenubarContentView` (C004), `SettingsView` (C008), and `CustomPromptView` (C012).
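
A minimal sketch of C000 follows, assuming a stand-in `SupportedModels` enum in place of `ScreenMateEngine.supportedVLMModels` so the snippet compiles on its own. The `defaultIdentifier` helper and the sorted-name rule for choosing the default are illustrative assumptions, not part of the spec.

```swift
import Combine

// Stand-in for `ScreenMateEngine.supportedVLMModels` (C006); hypothetical name.
// The single entry reuses the example identifier given in C006.
enum SupportedModels {
    static let all: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
    ]

    // Dictionaries are unordered, so sort display names to get a stable default.
    static var defaultIdentifier: String {
        all.keys.sorted().first.flatMap { all[$0] } ?? ""
    }
}

final class AppSettings: ObservableObject {
    // Identifier (hub id or local path) of the VLM the user selected.
    @Published var selectedVLMModelIdentifier: String
    // Last prompt typed into the Custom Prompt panel.
    @Published var lastCustomPrompt: String = ""

    init(selectedVLMModelIdentifier: String = SupportedModels.defaultIdentifier) {
        self.selectedVLMModelIdentifier = selectedVLMModelIdentifier
    }
}
```

Neither property is persisted here; whether to back them with `UserDefaults` is left open by the spec.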
---
**Component ID: C001**
**Name: Application Core Setup (`AppDelegate.swift`)**
- **Purpose:** Initialize the application, set it as an accessory app, and manage its lifecycle. Instantiate core managers.
- **Key Technologies:** Swift, AppKit (`NSApplicationDelegate`, `NSApp`).
- **Core Responsibilities/Tasks for LLM:**
1. Create/Update `AppDelegate.swift`.
2. Ensure conformance: `NSObject`, `NSApplicationDelegate`.
3. Implement `applicationDidFinishLaunching(_:)`:
- Set `NSApp.setActivationPolicy(.accessory)`.
- Instantiate `MenuBarManager` (C002) and retain it.
- (Separate method, not part of `applicationDidFinishLaunching(_:)`.) Add/modify `showCustomPromptPanel()` (renamed from `showSpotlight`) to instantiate and show `CustomPromptPanelController` (C011).
4. Implement `applicationWillTerminate(_:)`.
5. Implement `applicationShouldTerminateAfterLastWindowClosed(_:)` returning `false`.
6. Ensure `@main` in `ScreenMateApp.swift` uses `@NSApplicationDelegateAdaptor(AppDelegate.self)`.
- **Inputs/Dependencies:** None initially. Will create `MenuBarManager`.
- **Outputs/Deliverables:** A functioning `AppDelegate.swift`.
- **Interface with Other Components:** Instantiates `MenuBarManager`. Provides `showCustomPromptPanel()`.
- **Success Criteria/Verification:** App launches as accessory. `MenuBarManager` initialized. `showCustomPromptPanel()` is callable.
- **Reference File(s):** `ScreenMate/ScreenMate/AppDelegate.swift`.
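
The lifecycle tasks above could look roughly like the sketch below. `MenuBarManager` is reduced to an empty placeholder and the body of `showCustomPromptPanel()` is left as a comment, since both belong to other components; the `@main` entry point in `ScreenMateApp.swift` is omitted.

```swift
import AppKit

// Hypothetical placeholder for C002; the real class creates the NSStatusItem.
final class MenuBarManager: NSObject {}

final class AppDelegate: NSObject, NSApplicationDelegate {
    // Retained for the app's lifetime so the status item is not deallocated.
    private(set) var menuBarManager: MenuBarManager?

    func applicationDidFinishLaunching(_ notification: Notification) {
        // Accessory policy: menubar presence only, no Dock icon.
        NSApp.setActivationPolicy(.accessory)
        menuBarManager = MenuBarManager()
    }

    func applicationWillTerminate(_ notification: Notification) {
        menuBarManager = nil
    }

    // Closing the panels must not quit a menubar app.
    func applicationShouldTerminateAfterLastWindowClosed(_ sender: NSApplication) -> Bool {
        false
    }

    func showCustomPromptPanel() {
        // Wiring for CustomPromptPanelController (C011) goes here.
    }
}
```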
---
**Component ID: C002**
**Name: MenuBar Management (`MenuBarManager.swift`)**
- **Purpose:** Create and manage the system menubar icon and its actions.
- **Key Technologies:** Swift, AppKit (`NSStatusItem`, `NSStatusBar`, `NSImage`, `NSObject`).
- **Core Responsibilities/Tasks for LLM:** (As per original plan, ensure image name is "TrayIcon" or a new generic icon for ScreenMate). Default panel dimensions might be around `width: 340, height: 550` to accommodate more UI.
- **Inputs/Dependencies:** `PanelController` (C003), "TrayIcon" in `Assets.xcassets`.
- **Outputs/Deliverables:** `MenuBarManager.swift`.
- **Success Criteria:** Menubar icon appears and toggles panel.
- **Reference File(s):** `ScreenMate/ScreenMateCore/MenuBarManager.swift`.
---
**Component ID: C003**
**Name: Menubar Panel Controller (`PanelController.swift`)**
- **Purpose:** Manage the `NSPanel` for the menubar popover, hosting SwiftUI content.
- **Key Technologies:** Swift, AppKit (`NSWindowController`, `NSPanel`, `NSWindowDelegate`), SwiftUI (`NSHostingView`).
- **Core Responsibilities/Tasks for LLM (Updates):**
6. Implement `private func embedSwiftUIView(in panel: NSPanel)`:
- Instantiate `AppSettings` (C000) and `AutostartManager` (C007).
- Instantiate `MenubarContentView` (C004), injecting both `appSettings` and `autostartManager` as `.environmentObject()`.
- **Inputs/Dependencies:** `MenubarContentView` (C004), `AppSettings` (C000), `AutostartManager` (C007).
- **Outputs/Deliverables:** `PanelController.swift`.
- **Success Criteria:** Panel works, hosts `MenubarContentView` with injected environment objects.
- **Reference File(s):** `ScreenMate/ScreenMateCore/PanelController.swift`.
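
As an illustration of the injection step, the sketch below hosts a SwiftUI view in an `NSPanel` through `NSHostingView`. The three stub types stand in for C000, C007, and C004, and the `convenience init(panel:)` is an assumption about how the controller is constructed.

```swift
import AppKit
import SwiftUI

// Stubs so the sketch compiles on its own; the real types are C000, C007, C004.
final class AppSettings: ObservableObject {}
final class AutostartManager: ObservableObject {}

struct MenubarContentView: View {
    @EnvironmentObject var appSettings: AppSettings
    @EnvironmentObject var autostartManager: AutostartManager
    var body: some View { Text("ScreenMate") }
}

final class PanelController: NSWindowController {
    // Owned here so both objects live as long as the panel does.
    private let appSettings = AppSettings()
    private let autostartManager = AutostartManager()

    // Hypothetical initializer: takes an already configured panel.
    convenience init(panel: NSPanel) {
        self.init(window: panel)
        embedSwiftUIView(in: panel)
    }

    private func embedSwiftUIView(in panel: NSPanel) {
        let rootView = MenubarContentView()
            .environmentObject(appSettings)
            .environmentObject(autostartManager)
        panel.contentView = NSHostingView(rootView: rootView)
    }
}
```

Creating `AppSettings` inside the controller keeps this component self-contained, but any other view that needs the same settings would have to receive this instance rather than a fresh one.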
---
**Component ID: C004**
**Name: Menubar Content View (`UI/MenubarContentView.swift`)**
- **Purpose:** Main UI. Interacts with `ScreenshotManager`, `ScreenMateEngine`. Manages VLM model loading based on `AppSettings`. **No Copy button.**
- **Key Technologies:** Swift, SwiftUI.
- **Core Responsibilities/Tasks for LLM:**
1. Create `MenubarContentView.swift` in `UI/`.
2. **State Management:**
- `@EnvironmentObject var appSettings: AppSettings` (C000).
- `@StateObject private var screenshotManager = ScreenshotManager()` (C005).
- `@StateObject private var screenMateEngine = ScreenMateEngine()` (C006 - Renamed from `OCREngine`).
- `@State private var processedTextResult: String = "Select a VLM model in Settings and click Load."`.
- `@State private var showingSettings = false`.
- `@State private var lastScreenshotPreviewImage: Image?`.
3. **Body Layout (VStack):**
- Display `screenMateEngine.currentStatusMessage`, `screenMateEngine.loadedModelNameDisplay`.
- "Load/Change VLM Model" `Button` (or `TextField` for ID + Load `Button`). Action: `Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }`. Disable based on `screenMateEngine.isLoadingModel`.
- "Process Screenshot" `Button` (renamed from "Take Screenshot & OCR"). Action: `processScreenshotWithDefaultPrompt()`. Disable appropriately.
- `ScrollView` for `processedTextResult` (selectable, monospaced).
- Optional `Image` for `lastScreenshotPreviewImage`.
- `HStack` with:
- **(Copy Button Removed)**
- "Settings" `Button` (with `.popover` for `SettingsView` C008).
- "Custom Prompt" `Button`. Action: `(NSApp.delegate as? AppDelegate)?.showCustomPromptPanel()`.
4. **`.onChange(of: appSettings.selectedVLMModelIdentifier)`:** If selection changes, call `Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }`.
5. **Private Methods:**
- `processScreenshotWithDefaultPrompt()`: (was `takeScreenshotAndOCR`)
- Guard model loaded/engine state.
- Call `screenshotManager.takeScreenshotToImage(...)`.
- On `NSImage` received, update preview.
- Call `screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: screenMateEngine.getDefaultOCRPrompt(), ...)` (renamed from `performOCR`; see C006).
- Handle `Result`, update `processedTextResult`.
- **Inputs/Dependencies:** `AppSettings` (C000), `ScreenshotManager` (C005), `ScreenMateEngine` (C006), `SettingsView` (C008), `NotificationManager` (C009), `AppDelegate` (C001).
- **Outputs/Deliverables:** `MenubarContentView.swift`.
- **Success Criteria:** Model loading via `appSettings`. Default image processing works. Custom prompt panel invoked.
- **Reference File(s):** `ScreenMate/ScreenMateCore/UI/MenubarContentView.swift`.
---
**Component ID: C005**
**Name: Screenshot Manager (In-Memory) (`ScreenshotManager.swift`)**
- **(No functional changes from the "In-Memory Screenshot" plan.)**
- **Reference File(s):** `ScreenMate/ScreenMateCore/ScreenshotManager.swift`.
---
**Component ID: C006**
**Name: ScreenMate Engine (Direct VLM with Custom Prompt Support) (`ScreenMateEngine.swift`)** (Renamed from OCREngine)
- **Purpose:** Load, manage, run VLM for image processing (including OCR) on `NSImage`, **supporting custom user prompts.**
- **Key Technologies:** Swift, MLX, MLXLLM, MLXLMCommon, Tokenizers, Hub, AppKit.
- **Core Responsibilities/Tasks for LLM:**
1. Rename file and class from `OCREngine` to `ScreenMateEngine`.
2. (Structure, Published Properties, Error Enum, `init`, `loadModel` as per previous "Direct VLM & In-Memory" plan for C006, using `ScreenMateEngineError`).
3. **Add `static let supportedVLMModels: [String: String]` property:** A dictionary of `["Display Name": "hub_or_path_identifier"]` (e.g., `["Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit", "Custom Moondream": "/path/to/moondream"]`). Initialize with at least one valid VLM.
4. **Add `getDefaultOCRPrompt() -> String` method:** Returns a default prompt specifically tuned for OCR (e.g., "Extract all text from this image...").
5. **Modify `performOCR` method signature (or rename to a more generic `processImage`):**
`func processImage(onNSImage nsImage: NSImage, prompt: String, completion: @escaping (Result<String, ScreenMateEngineError>) -> Void)`
6. **Inside `processImage(...)`:**
- Use the provided `prompt` when constructing `UserInput` messages. The prompt should already include the VLM-specific image placeholder (e.g., `<image>\nUser's custom prompt here`). The `ScreenMateEngine` might offer a helper to prepend this placeholder if the user prompt is raw text.
- (Image Handling for `UserInput` and inference logic remains similar, focusing on in-memory `NSImage` to `UserInput.Image` conversion. Agent must prioritize in-memory `UserInput.Image` creation from `NSImage` data, using temporary files only as a last resort if MLX libraries are restrictive).
- **Inputs/Dependencies:** `NSImage`, VLM model identifier, **custom text prompt string**. MLX Swift packages.
- **Outputs/Deliverables:** `ScreenMateEngine.swift`.
- **Success Criteria:** `loadModel` works. `processImage` uses the custom prompt.
- **Reference File(s):** Detailed `OCREngine.swift` example, renamed to `ScreenMateEngine.swift` and adapted for custom prompts.
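
The prompt-related pieces of C006 can be sketched without any MLX code. In the snippet below the model dictionary copies the two example entries from task 3, the default prompt wording is a placeholder, and `PromptBuilder` and `withImagePlaceholder(_:)` are hypothetical names for the helper mentioned in task 6. Whether a literal `<image>` token is required may depend on the chosen model's processor, so treat the placeholder as an assumption too.

```swift
import Foundation

enum PromptBuilder {
    // Display name -> hub id or local path (example entries from the spec).
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit",
        "Custom Moondream": "/path/to/moondream"
    ]

    // Assumed placeholder token; adjust per model.
    static let imagePlaceholder = "<image>"

    // Default prompt for plain OCR; the exact wording is illustrative.
    static func getDefaultOCRPrompt() -> String {
        withImagePlaceholder("Extract all text from this image.")
    }

    // Prepends the placeholder unless the prompt already starts with it.
    static func withImagePlaceholder(_ rawPrompt: String) -> String {
        let trimmed = rawPrompt.trimmingCharacters(in: .whitespacesAndNewlines)
        if trimmed.hasPrefix(imagePlaceholder) { return trimmed }
        return imagePlaceholder + "\n" + trimmed
    }
}
```

Keeping the helper idempotent means both the default path (C004) and the custom path (C012) can call it without checking whether the user already typed the placeholder.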
---
**Component ID: C007**
**Name: Autostart Manager (`SystemIntegration/AutostartManager.swift`)**
- **Core Responsibilities/Tasks for LLM (Updates):**
- Ensure `appBundleIdentifier` and `appName` in `init()` are correctly derived from `Bundle.main` for "ScreenMate" or set to the new app's values.
- **Reference File(s):** `ScreenMate/ScreenMateCore/SystemIntegration/AutostartManager.swift`.
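
One way to derive these values is shown below, as a small sketch. `AppIdentity` and its fallback parameters are hypothetical; the fallback identifier reuses the example from C014.

```swift
import Foundation

// Hypothetical helper that AutostartManager.init() could use.
struct AppIdentity {
    let bundleIdentifier: String
    let name: String

    init(bundle: Bundle = .main,
         fallbackIdentifier: String = "com.yourcompany.ScreenMate",
         fallbackName: String = "ScreenMate") {
        // Both lookups can return nil (e.g. when run outside an app bundle),
        // so fall back to the values configured for ScreenMate.
        bundleIdentifier = bundle.bundleIdentifier ?? fallbackIdentifier
        name = (bundle.object(forInfoDictionaryKey: "CFBundleName") as? String) ?? fallbackName
    }
}
```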
---
**Component ID: C008**
**Name: Settings View (`UI/SettingsView.swift`)**
- **Purpose:** UI for settings, **including VLM model selection from a predefined list.**
- **Key Technologies:** Swift, SwiftUI.
- **Core Responsibilities/Tasks for LLM:**
1. Create `SettingsView.swift`.
2. `@EnvironmentObject var appSettings: AppSettings` (C000).
3. `@EnvironmentObject var autostartManager: AutostartManager` (C007).
4. **Body Layout (VStack):**
- Title, `Toggle` for autostart.
- **VLM Model Selection Section:**
- `Text("Select VLM Model:")`.
- `Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier)`:
- Iterate over `ScreenMateEngine.supportedVLMModels.keys.sorted()`.
- For each key (display name), use `ScreenMateEngine.supportedVLMModels[key]!` as the tag (identifier string).
- App version display.
- **Inputs/Dependencies:** `AppSettings` (C000), `AutostartManager` (C007), `ScreenMateEngine.supportedVLMModels` (C006).
- **Outputs/Deliverables:** `SettingsView.swift` with VLM model picker.
- **Success Criteria:** Autostart toggle. VLM model selection updates `appSettings.selectedVLMModelIdentifier`.
- **Reference File(s):** `ScreenMate/ScreenMateCore/UI/SettingsView.swift`.
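
The picker described above might be built as in this sketch. `ModelCatalog`, `ModelOption`, and `ModelPickerSection` are hypothetical names, and the `AppSettings` stub only carries the one property the picker binds to. Mapping the dictionary into sorted `ModelOption` values avoids the force unwrap while keeping the same display order.

```swift
import SwiftUI

// Stub of C000 with just the bound property.
final class AppSettings: ObservableObject {
    @Published var selectedVLMModelIdentifier =
        "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
}

struct ModelOption: Identifiable {
    let name: String        // display name shown in the picker
    let identifier: String  // hub id or path used as the tag
    var id: String { identifier }
}

// Stand-in for `ScreenMateEngine.supportedVLMModels` (example entries from C006).
enum ModelCatalog {
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit",
        "Custom Moondream": "/path/to/moondream"
    ]

    // Sorted by display name so the menu order is stable.
    static var options: [ModelOption] {
        supportedVLMModels
            .map { ModelOption(name: $0.key, identifier: $0.value) }
            .sorted { $0.name < $1.name }
    }
}

struct ModelPickerSection: View {
    @EnvironmentObject var appSettings: AppSettings

    var body: some View {
        VStack(alignment: .leading) {
            Text("Select VLM Model:")
            Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier) {
                ForEach(ModelCatalog.options) { option in
                    // The tag type must match the selection type (String).
                    Text(option.name).tag(option.identifier)
                }
            }
        }
    }
}
```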
---
**Component ID: C009**
**Name: Notification Manager (`SystemIntegration/NotificationManager.swift`)**
- **Reference File(s):** `ScreenMate/ScreenMateCore/SystemIntegration/NotificationManager.swift`.
---
**Component ID: C010**
**Name: Custom Prompt Panel Appearance (`CustomPromptPanel.swift`)** (Renamed from SpotlightPanel)
- **Purpose:** Define the custom `NSPanel` for the floating "Custom Prompt" window.
- **Key Technologies:** Swift, AppKit (`NSPanel`).
- **Core Responsibilities/Tasks for LLM:** (Implement as per original C010 plan, renaming file and class to `CustomPromptPanel`).
- **Reference File(s):** Rename `SpotlightPanel.swift` to `CustomPromptPanel.swift`.
---
**Component ID: C011**
**Name: Custom Prompt Panel Controller (`CustomPromptPanelController.swift`)** (Renamed)
- **Purpose:** Manage the `CustomPromptPanel` window and host its SwiftUI content.
- **Key Technologies:** Swift, AppKit (`NSWindowController`), SwiftUI (`NSHostingView`).
- **Core Responsibilities/Tasks for LLM:**
1. Create/Rename to `CustomPromptPanelController.swift`.
2. Subclass `NSWindowController`.
3. `convenience init(customPromptPanel: CustomPromptPanel)`.
4. In `AppDelegate.showCustomPromptPanel()` (C001):
- Obtain `AppSettings` (C000) and `ScreenMateEngine` (C006) instances, preferably by retrieving shared instances rather than creating new ones, so that the custom prompt panel sees the same model selection and loaded model as the menubar panel. Because `ScreenMateEngine` is already a private `@StateObject` in `MenubarContentView` (C004), decide how to share it or pass the necessary data: for example, keep a single globally managed instance and hand it to both views. `CustomPromptView` still takes `AppSettings` and creates its own `ScreenshotManager`. For now, assume `showCustomPromptPanel()` can reach the main `ScreenMateEngine` instance.
- Create `CustomPromptView` (C012). Inject `appSettings` and `screenMateEngine` as `.environmentObject()`.
- Host in `NSHostingView`, set as `customPromptPanel.contentView`.
- Instantiate `CustomPromptPanelController(customPromptPanel: panel)`.
- **Inputs/Dependencies:** `CustomPromptPanel` (C010), `CustomPromptView` (C012), `AppSettings` (C000), `ScreenMateEngine` (C006).
- **Reference File(s):** Rename `SpotlightPanelController.swift` to `CustomPromptPanelController.swift`.
---
**Component ID: C012**
**Name: Custom Prompt Content View (`UI/CustomPromptView.swift`)** (Renamed from SpotlightContentView)
- **Purpose:** SwiftUI interface for user to input a custom VLM prompt and process a new screenshot.
- **Key Technologies:** Swift, SwiftUI.
- **Core Responsibilities/Tasks for LLM:**
1. Create `CustomPromptView.swift` in `UI/`.
2. **State Management:**
- `@EnvironmentObject var appSettings: AppSettings` (C000).
- `@EnvironmentObject var screenMateEngine: ScreenMateEngine` (C006).
- `@StateObject private var screenshotManager = ScreenshotManager()` (C005).
- `@State private var userPromptText: String = ""`. (Initialize with `appSettings.lastCustomPrompt` in `.onAppear`).
- `@State private var customProcessingInProgress: Bool = false`.
- `@State private var customProcessingResultText: String = ""`.
- `@State private var screenshotForCustomPromptPreview: Image?`.
3. **Body Layout (VStack):**
- `Text("Custom VLM Prompt")`.
- `TextEditor(text: $userPromptText)` for multi-line input. Min height, resizable.
- `Button("Take Screenshot & Process with This Prompt")`. Action: `processScreenshotWithCustomPrompt()`. Disable if `screenMateEngine.modelContainer == nil`, `screenMateEngine.isLoadingModel`, or `customProcessingInProgress`.
- Optional `Image` view for `screenshotForCustomPromptPreview`.
- `ScrollView` to display `customProcessingResultText`.
4. **Private Methods:**
- `processScreenshotWithCustomPrompt()`:
- Set `customProcessingInProgress = true`. Update `appSettings.lastCustomPrompt = userPromptText`.
- Ensure `userPromptText` includes the VLM-specific image placeholder (e.g., `<image>\n`). The view could prepend this if `userPromptText` is just the raw question.
- Call `screenshotManager.takeScreenshotToImage(...)`.
- On `NSImage` received, update `screenshotForCustomPromptPreview`.
- Call `screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: userPromptTextWithPlaceholder, ...)`
- Handle `Result`, update `customProcessingResultText`. Set `customProcessingInProgress = false`.
- **Inputs/Dependencies:** `AppSettings` (C000), `ScreenMateEngine` (C006), `ScreenshotManager` (C005).
- **Outputs/Deliverables:** `CustomPromptView.swift`.
- **Interface with Other Components:** Hosted by `CustomPromptPanelController`.
- **Success Criteria:** User inputs prompt, triggers screenshot, VLM processes with custom prompt, results displayed.
- **Reference File(s):** `SpotlightContentView.swift` to be heavily adapted into `CustomPromptView.swift`.
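
The control flow of `processScreenshotWithCustomPrompt()` can be sketched independently of SwiftUI and MLX. Because the actual signatures of `takeScreenshotToImage(...)` and `processImage(...)` are only partly given in this document, the sketch below injects them as closures; `CustomPromptFlow`, `Capture`, and `Process` are hypothetical names, and the generic `Image` parameter stands in for `NSImage`.

```swift
import Foundation

final class CustomPromptFlow<Image> {
    // Stand-in for ScreenshotManager: yields nil if the capture is cancelled.
    typealias Capture = (@escaping (Image?) -> Void) -> Void
    // Stand-in for ScreenMateEngine.processImage(onNSImage:prompt:completion:).
    typealias Process = (Image, String, @escaping (Result<String, Error>) -> Void) -> Void

    private(set) var inProgress = false
    private(set) var resultText = ""
    private let capture: Capture
    private let process: Process

    init(capture: @escaping Capture, process: @escaping Process) {
        self.capture = capture
        self.process = process
    }

    func run(userPrompt: String) {
        guard !inProgress else { return }   // ignore re-entrant taps
        inProgress = true
        // Prepend the assumed image placeholder if the user typed raw text.
        let prompt = userPrompt.hasPrefix("<image>") ? userPrompt : "<image>\n" + userPrompt
        capture { [weak self] image in
            guard let self else { return }
            guard let image else {
                self.resultText = "Screenshot cancelled."
                self.inProgress = false
                return
            }
            self.process(image, prompt) { result in
                switch result {
                case .success(let text): self.resultText = text
                case .failure(let error): self.resultText = "Error: \(error)"
                }
                self.inProgress = false
            }
        }
    }
}
```

In the real view these state changes would be `@State` updates made on the main thread, and the same flag would drive the button's disabled state.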
---
**Component ID: C013 (Optional - Low Priority)**
**Name: Workspace Monitor (`SystemIntegration/WorkspaceMonitor.swift`)**
- **(No change to its own spec, but its utility might be higher with custom, context-aware prompts).**
- **Reference File(s):** `ScreenMate/ScreenMateCore/SystemIntegration/WorkspaceMonitor.swift`.
---
**Component ID: C014**
**Name: Project Setup & Configuration**
- **Core Responsibilities/Tasks for LLM (Updates):**
1. Rename existing project files/targets from "OCRToolbox" to "ScreenMate" (take care: manual steps in Xcode are often needed before the agent proceeds).
2. Update Bundle Identifier to reflect "ScreenMate" (e.g., `com.yourcompany.ScreenMate`).
3. (MLX Dependencies remain essential).
- **Reference File(s):** Project build settings, `Info.plist`.
---
## 4. Suggested Agentic Workflow (Adjusted for Renaming, VLM Selection & Custom Prompts):
1. **Project Renaming (Manual/Guided First):** Rename `.xcodeproj`, schemes, targets to "ScreenMate".
2. **C014 (Project Setup):** Verify Bundle ID for "ScreenMate". Ensure MLX dependencies are linked.
3. **C000 (AppSettings):** Create shared settings.
4. **C001 (AppDelegate):** Update for `showCustomPromptPanel()`.
5. **C005 (ScreenshotManager - In-Memory).**
6. **C009 (NotificationManager), C007 (AutostartManager - Update for new Bundle ID/App Name).**
7. **C006 (ScreenMateEngine - Renamed from OCREngine):** Implement `supportedVLMModels`, `getDefaultOCRPrompt()`. Adapt `processImage` (was `performOCR`) to take `prompt: String`.
8. **C008 (SettingsView):** Implement with VLM model `Picker` using `ScreenMateEngine.supportedVLMModels` and binding to `appSettings.selectedVLMModelIdentifier`.
9. **C004 (MenubarContentView):**
* Inject `AppSettings`. Rename `ocrEngine` to `screenMateEngine`.
* Implement UI for loading model based on `appSettings.selectedVLMModelIdentifier`.
* "Process Screenshot" button calls `screenMateEngine.processImage` with `screenMateEngine.getDefaultOCRPrompt()`.
* Button to invoke `AppDelegate.showCustomPromptPanel()`. **Remove Copy button.**
10. **C010 (CustomPromptPanel - Renamed), C012 (CustomPromptView - Renamed), C011 (CustomPromptPanelController - Renamed):** Implement the custom prompt UI and its interactions.
11. **C003 (PanelController):** Ensure `AppSettings` and `AutostartManager` are correctly injected into `MenubarContentView`.
12. **C002 (MenuBarManager).**
* *Core functionality fully testable: VLM selection, default processing, custom prompt processing.*
13. **Refinements & Testing.**
## 5. Agent Instructions (Updates):
* **Renaming:** Be meticulous with renaming "OCRToolbox" to "ScreenMate" and "OCREngine" to "ScreenMateEngine" throughout the codebase, including filenames, class names, variable names, comments, and log messages.
* **`AppSettings` (C000):** This is a new central piece for settings.
* **`ScreenMateEngine` (C006):**
* Add `supportedVLMModels` static property.
* `processImage` (renamed from `performOCR`) must accept a `prompt: String`.
* **`MenubarContentView` (C004):** Remove Copy button. Drive model loading via `appSettings`.
* **`SettingsView` (C008):** Implement `Picker` for VLM selection.
* **Custom Prompt Feature (C010, C011, C012):** This is a significant UI and logic addition. `CustomPromptView` will need to manage its own screenshot and prompt, then call `screenMateEngine.processImage`. Ensure the VLM-specific image placeholder (e.g. `<image>\n`) is correctly prepended to the user's custom text prompt before sending to `ScreenMateEngine`.