Created May 25, 2025 15:18
ScreenMate Implementation Documentation
# Implementation Documentation for Agentic LLM Workflow: macOS ScreenMate (SwiftUI First - Direct VLM, In-Memory Screenshot, Custom Prompts)
## 1. Overall Project Goal:
Develop a native macOS application ("ScreenMate") that:
* Runs as a menubar accessory application (no Dock icon).
* Provides image understanding triggered by a screenshot: the image is captured **into memory (as an `NSImage`)** and processed by a **locally loaded Vision Language Model (VLM) via MLX Swift**, optionally with a user-supplied **custom prompt**. OCR is one of its capabilities.
* Features a main interface in a menubar popover panel.
* Features a "Custom Prompt" floating panel where users enter their own VLM prompts for image processing.
* Allows configuring auto-start at login and **selecting a VLM model from a predefined list**.
* Uses SwiftUI for UI components where feasible, and AppKit for system integrations and panel management.
## 2. Core Principles & Constraints for LLM Agent:
* **Language:** Swift.
* **UI Framework:** SwiftUI first. Use AppKit for system integration.
* **ML Framework:** **MLX Swift** (including `MLXLLM`, `MLXLMCommon`, `Tokenizers`, `Hub`) for direct VLM loading and inference.
* **Screenshot Handling:** Capture screenshots directly into memory as `NSImage`.
* **Bridging (UI):** `NSHostingView` for SwiftUI in AppKit panels.
* **State Management (SwiftUI):** Standard SwiftUI state management. `ScreenMateEngine` and `AppSettings` will be `ObservableObject`.
* **Modularity:** Keep each component in its own class or struct.
* **Error Handling:** Robust error handling for VLM operations, surfaced to the user as UI feedback.
* **Asynchronous Operations:** `async/await`, with `Task.detached` for VLM work. UI updates happen on the main thread.
* **File Structure:** Adhere to suggested structure (`UI`, `SystemIntegration`, `Shared` subfolders).
## 3. Component Breakdown & Implementation Tasks:
---
**Component ID: C000 (New)**
**Name: App Settings (`Shared/AppSettings.swift`)**
- **Purpose:** An `ObservableObject` that stores shared application settings, such as the selected VLM model identifier and the last custom prompt.
- **Key Technologies:** Swift, Combine (`ObservableObject`, `@Published`).
- **Core Responsibilities/Tasks for LLM:**
    1. Create `AppSettings.swift` (in a `Shared` group).
    2. Define a class `AppSettings` conforming to `ObservableObject`.
    3. Add `@Published var selectedVLMModelIdentifier: String` initialized with a default from `ScreenMateEngine.supportedVLMModels` (C006).
    4. Add `@Published var lastCustomPrompt: String = ""`.
- **Inputs/Dependencies:** `ScreenMateEngine.supportedVLMModels` (C006) for default.
- **Outputs/Deliverables:** An `AppSettings` class.
- **Interface with Other Components:** Injected as an `@EnvironmentObject` (or held as a `@StateObject`) in `MenubarContentView` (C004), `SettingsView` (C008), and `CustomPromptView` (C012).
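
The C000 tasks above could be sketched roughly as follows. This is a minimal illustration, not the project's actual file: `ModelCatalog` is a hypothetical stand-in for `ScreenMateEngine.supportedVLMModels` so the snippet compiles on its own, and picking the default by sorting display names is an assumption (dictionary order is unspecified, so some deterministic rule is needed).

```swift
import Combine

// Hypothetical stand-in for ScreenMateEngine.supportedVLMModels (C006).
enum ModelCatalog {
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
    ]
}

final class AppSettings: ObservableObject {
    @Published var selectedVLMModelIdentifier: String
    @Published var lastCustomPrompt: String = ""

    init(models: [String: String] = ModelCatalog.supportedVLMModels) {
        // Assumption: use the alphabetically first display name as the default,
        // because Dictionary iteration order is not stable.
        let firstName = models.keys.sorted().first
        selectedVLMModelIdentifier = firstName.flatMap { models[$0] } ?? ""
    }
}
```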
---
**Component ID: C001**
**Name: Application Core Setup (`AppDelegate.swift`)**
- **Purpose:** Initialize the application, set it as an accessory app, and manage its lifecycle. Instantiate core managers.
- **Key Technologies:** Swift, AppKit (`NSApplicationDelegate`, `NSApp`).
- **Core Responsibilities/Tasks for LLM:**
    1. Create/Update `AppDelegate.swift`.
    2. Ensure conformance: `NSObject`, `NSApplicationDelegate`.
    3. Implement `applicationDidFinishLaunching(_:)`:
        - Set `NSApp.setActivationPolicy(.accessory)`.
        - Instantiate `MenuBarManager` (C002) and retain it.
    4. Add or update the method `showCustomPromptPanel()` (renamed from `showSpotlight`) to instantiate and show `CustomPromptPanelController` (C011).
    5. Implement `applicationWillTerminate(_:)`.
    6. Implement `applicationShouldTerminateAfterLastWindowClosed(_:)` returning `false`.
    7. Ensure `@main` in `ScreenMateApp.swift` uses `@NSApplicationDelegateAdaptor(AppDelegate.self)`.
- **Inputs/Dependencies:** None initially. Will create `MenuBarManager`.
- **Outputs/Deliverables:** A functioning `AppDelegate.swift`.
- **Interface with Other Components:** Instantiates `MenuBarManager`. Provides `showCustomPromptPanel()`.
- **Success Criteria/Verification:** App launches as accessory. `MenuBarManager` initialized. `showCustomPromptPanel()` is callable.
- **Reference File(s):** `ScreenMate/ScreenMate/AppDelegate.swift`.
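
As a rough illustration of the C001 tasks, the delegate might be shaped like the sketch below. It is deliberately incomplete: `menuBarManager` is typed as `AnyObject?` only because `MenuBarManager` (C002) is not defined here, and the body of `showCustomPromptPanel()` is left as a comment since the panel classes (C010, C011) live elsewhere.

```swift
import AppKit

final class AppDelegate: NSObject, NSApplicationDelegate {
    // Placeholder type: in the real project this would be MenuBarManager (C002).
    private var menuBarManager: AnyObject?

    func applicationDidFinishLaunching(_ notification: Notification) {
        // Accessory policy: the app shows no Dock icon.
        NSApp.setActivationPolicy(.accessory)
        // Here the plan creates MenuBarManager and stores it in
        // `menuBarManager` so it stays alive for the app's lifetime.
    }

    func applicationWillTerminate(_ notification: Notification) {
        // Release resources here if needed.
    }

    func applicationShouldTerminateAfterLastWindowClosed(_ sender: NSApplication) -> Bool {
        // A menubar app must keep running when its panels close.
        false
    }

    func showCustomPromptPanel() {
        // Would build CustomPromptPanel (C010) and show it through
        // CustomPromptPanelController (C011).
    }
}
```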
---
**Component ID: C002**
**Name: MenuBar Management (`MenuBarManager.swift`)**
- **Purpose:** Create and manage the system menubar icon and its actions.
- **Key Technologies:** Swift, AppKit (`NSStatusItem`, `NSStatusBar`, `NSImage`, `NSObject`).
- **Core Responsibilities/Tasks for LLM:** As per the original plan. Ensure the status item image is named "TrayIcon" (or use a new generic ScreenMate icon). Default panel dimensions of roughly `width: 340, height: 550` leave room for the additional UI.
- **Inputs/Dependencies:** `PanelController` (C003), "TrayIcon" in `Assets.xcassets`.
- **Outputs/Deliverables:** `MenuBarManager.swift`.
- **Success Criteria:** Menubar icon appears and toggles panel.
- **Reference File(s):** `ScreenMate/ScreenMateCore/MenuBarManager.swift`.
---
**Component ID: C003**
**Name: Menubar Panel Controller (`PanelController.swift`)**
- **Purpose:** Manage the `NSPanel` for the menubar popover, hosting SwiftUI content.
- **Key Technologies:** Swift, AppKit (`NSWindowController`, `NSPanel`, `NSWindowDelegate`), SwiftUI (`NSHostingView`).
- **Core Responsibilities/Tasks for LLM (Updates):**
    6. Implement `private func embedSwiftUIView(in panel: NSPanel)`:
        - Instantiate `AppSettings` (C000) and `AutostartManager` (C007).
        - Instantiate `MenubarContentView` (C004), injecting both `appSettings` and `autostartManager` as `.environmentObject()`.
- **Inputs/Dependencies:** `MenubarContentView` (C004), `AppSettings` (C000), `AutostartManager` (C007).
- **Outputs/Deliverables:** `PanelController.swift`.
- **Success Criteria:** Panel works, hosts `MenubarContentView` with injected environment objects.
- **Reference File(s):** `ScreenMate/ScreenMateCore/PanelController.swift`.
---
**Component ID: C004**
**Name: Menubar Content View (`UI/MenubarContentView.swift`)**
- **Purpose:** Main UI. Interacts with `ScreenshotManager`, `ScreenMateEngine`. Manages VLM model loading based on `AppSettings`. **No Copy button.**
- **Key Technologies:** Swift, SwiftUI.
- **Core Responsibilities/Tasks for LLM:**
    1. Create `MenubarContentView.swift` in `UI/`.
    2. **State Management:**
        - `@EnvironmentObject var appSettings: AppSettings` (C000).
        - `@StateObject private var screenshotManager = ScreenshotManager()` (C005).
        - `@StateObject private var screenMateEngine = ScreenMateEngine()` (C006 - Renamed from `OCREngine`).
        - `@State private var processedTextResult: String = "Select a VLM model in Settings and click Load."`.
        - `@State private var showingSettings = false`.
        - `@State private var lastScreenshotPreviewImage: Image?`.
    3. **Body Layout (VStack):**
        - Display `screenMateEngine.currentStatusMessage`, `screenMateEngine.loadedModelNameDisplay`.
        - "Load/Change VLM Model" `Button` (or `TextField` for ID + Load `Button`). Action: `Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }`. Disable while `screenMateEngine.isLoadingModel` is true.
        - "Process Screenshot" `Button` (renamed from "Take Screenshot & OCR"). Action: `processScreenshotWithDefaultPrompt()`. Disable while no model is loaded or a model is loading.
        - `ScrollView` for `processedTextResult` (selectable, monospaced).
        - Optional `Image` for `lastScreenshotPreviewImage`.
        - `HStack` with:
            - **(Copy Button Removed)**
            - "Settings" `Button` (with `.popover` for `SettingsView` C008).
            - "Custom Prompt" `Button`. Action: `(NSApp.delegate as? AppDelegate)?.showCustomPromptPanel()`.
    4. **`.onChange(of: appSettings.selectedVLMModelIdentifier)`:** If selection changes, call `Task { await screenMateEngine.loadModel(modelIdentifier: appSettings.selectedVLMModelIdentifier) }`.
    5. **Private Methods:**
        - `processScreenshotWithDefaultPrompt()` (was `takeScreenshotAndOCR`):
            - Guard model loaded/engine state.
            - Call `screenshotManager.takeScreenshotToImage(...)`.
            - On `NSImage` received, update preview.
            - Call `screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: screenMateEngine.getDefaultOCRPrompt(), ...)`.
            - Handle `Result`, update `processedTextResult`.
- **Inputs/Dependencies:** `AppSettings` (C000), `ScreenshotManager` (C005), `ScreenMateEngine` (C006), `SettingsView` (C008), `NotificationManager` (C009), `AppDelegate` (C001).
- **Outputs/Deliverables:** `MenubarContentView.swift`.
- **Success Criteria:** Model loading via `appSettings`. Default image processing works. Custom prompt panel invoked.
- **Reference File(s):** `ScreenMate/ScreenMateCore/UI/MenubarContentView.swift`.
---
**Component ID: C005**
**Name: Screenshot Manager (In-Memory) (`ScreenshotManager.swift`)**
- **(No functional changes from the "In-Memory Screenshot" plan.)**
- **Reference File(s):** `ScreenMate/ScreenMateCore/ScreenshotManager.swift`.
---
**Component ID: C006**
**Name: ScreenMate Engine (Direct VLM with Custom Prompt Support) (`ScreenMateEngine.swift`)** (Renamed from OCREngine)
- **Purpose:** Load, manage, and run a VLM for image processing (including OCR) on an `NSImage`, **supporting custom user prompts.**
- **Key Technologies:** Swift, MLX, MLXLLM, MLXLMCommon, Tokenizers, Hub, AppKit.
- **Core Responsibilities/Tasks for LLM:**
    1. Rename file and class from `OCREngine` to `ScreenMateEngine`.
    2. (Structure, Published Properties, Error Enum, `init`, `loadModel` as per previous "Direct VLM & In-Memory" plan for C006, using `ScreenMateEngineError`).
    3. **Add `static let supportedVLMModels: [String: String]` property:** A dictionary of `["Display Name": "hub_or_path_identifier"]` (e.g., `["Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit", "Custom Moondream": "/path/to/moondream"]`). Initialize with at least one valid VLM.
    4. **Add `getDefaultOCRPrompt() -> String` method:** Returns a default prompt specifically tuned for OCR (e.g., "Extract all text from this image...").
    5. **Rename `performOCR` to the more generic `processImage` and update its signature:**
       `func processImage(onNSImage nsImage: NSImage, prompt: String, completion: @escaping (Result<String, ScreenMateEngineError>) -> Void)`
    6. **Inside `processImage(...)`:**
        - Use the provided `prompt` when constructing `UserInput` messages. The prompt should already include the VLM-specific image placeholder (e.g., `<image>\nUser's custom prompt here`). `ScreenMateEngine` may offer a helper that prepends this placeholder when the user prompt is raw text.
        - Image handling for `UserInput` and the inference logic remain similar, focusing on in-memory `NSImage` to `UserInput.Image` conversion. The agent must prioritize creating `UserInput.Image` in memory from `NSImage` data, using temporary files only as a last resort if the MLX libraries are restrictive.
- **Inputs/Dependencies:** `NSImage`, VLM model identifier, **custom text prompt string**. MLX Swift packages.
- **Outputs/Deliverables:** `ScreenMateEngine.swift`.
- **Success Criteria:** `loadModel` works. `processImage` uses the custom prompt.
- **Reference File(s):** Detailed `OCREngine.swift` example, renamed to `ScreenMateEngine.swift` and adapted for custom prompts.
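
The prompt-related parts of C006 can be illustrated with the sketch below. It only models the prompt plumbing: the class is named `ScreenMateEngineSketch` to make clear it is not the real engine, inference is replaced by a stub string (no MLX calls are shown), and the error cases, the `isModelLoaded` flag, the `withImagePlaceholder` helper, and the exact default prompt wording are all assumptions made for illustration. The `<image>` placeholder follows the example in the plan; the actual token depends on the model.

```swift
import AppKit

// Hypothetical error cases; the real ScreenMateEngineError is defined in C006.
enum ScreenMateEngineError: Error, Equatable {
    case modelNotLoaded
    case emptyPrompt
}

final class ScreenMateEngineSketch {
    static let supportedVLMModels: [String: String] = [
        "Llava Phi-3 Mini": "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
    ]
    // Assumption: "<image>" is the placeholder; real models may differ.
    static let imagePlaceholder = "<image>"

    var isModelLoaded = false

    // Hypothetical helper: prepend the placeholder unless it is already there.
    static func withImagePlaceholder(_ prompt: String) -> String {
        let trimmed = prompt.trimmingCharacters(in: .whitespacesAndNewlines)
        if trimmed.hasPrefix(imagePlaceholder) { return trimmed }
        return imagePlaceholder + "\n" + trimmed
    }

    func getDefaultOCRPrompt() -> String {
        // Illustrative wording only.
        Self.withImagePlaceholder("Extract all text from this image.")
    }

    func processImage(onNSImage nsImage: NSImage,
                      prompt: String,
                      completion: @escaping (Result<String, ScreenMateEngineError>) -> Void) {
        guard isModelLoaded else { completion(.failure(.modelNotLoaded)); return }
        guard !prompt.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty else {
            completion(.failure(.emptyPrompt)); return
        }
        let finalPrompt = Self.withImagePlaceholder(prompt)
        // Stub: the real engine would convert nsImage for the VLM and run
        // inference off the main thread before calling completion.
        completion(.success("[stub] would run the VLM with prompt: \(finalPrompt)"))
    }
}
```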
---
**Component ID: C007**
**Name: Autostart Manager (`SystemIntegration/AutostartManager.swift`)**
- **Core Responsibilities/Tasks for LLM (Updates):**
    - Ensure `appBundleIdentifier` and `appName` in `init()` are correctly derived from `Bundle.main` for "ScreenMate" or set to the new app's values.
- **Reference File(s):** `ScreenMate/ScreenMateCore/SystemIntegration/AutostartManager.swift`.
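
One way to derive the two values from `Bundle.main` is sketched below. `AppIdentity` is a hypothetical helper type, not part of the plan, and the fallback values simply reuse the example bundle identifier and app name that appear elsewhere in this document.

```swift
import Foundation

// Hypothetical helper that AutostartManager.init() could use.
struct AppIdentity {
    let appBundleIdentifier: String
    let appName: String

    init(bundle: Bundle = .main,
         fallbackIdentifier: String = "com.yourcompany.ScreenMate",
         fallbackName: String = "ScreenMate") {
        appBundleIdentifier = bundle.bundleIdentifier ?? fallbackIdentifier
        // Prefer the user-visible display name, then the short bundle name.
        let displayName = bundle.object(forInfoDictionaryKey: "CFBundleDisplayName") as? String
        let bundleName = bundle.object(forInfoDictionaryKey: "CFBundleName") as? String
        appName = displayName ?? bundleName ?? fallbackName
    }
}
```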
---
**Component ID: C008**
**Name: Settings View (`UI/SettingsView.swift`)**
- **Purpose:** UI for settings, **including VLM model selection from a predefined list.**
- **Key Technologies:** Swift, SwiftUI.
- **Core Responsibilities/Tasks for LLM:**
    1. Create `SettingsView.swift`.
    2. `@EnvironmentObject var appSettings: AppSettings` (C000).
    3. `@EnvironmentObject var autostartManager: AutostartManager` (C007).
    4. **Body Layout (VStack):**
        - Title, `Toggle` for autostart.
        - **VLM Model Selection Section:**
            - `Text("Select VLM Model:")`.
            - `Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier)`:
                - Iterate over `ScreenMateEngine.supportedVLMModels.keys.sorted()`.
                - For each key (display name), use `ScreenMateEngine.supportedVLMModels[key]!` as the tag (identifier string).
        - App version display.
- **Inputs/Dependencies:** `AppSettings` (C000), `AutostartManager` (C007), `ScreenMateEngine.supportedVLMModels` (C006).
- **Outputs/Deliverables:** `SettingsView.swift` with VLM model picker.
- **Success Criteria:** Autostart toggle. VLM model selection updates `appSettings.selectedVLMModelIdentifier`.
- **Reference File(s):** `ScreenMate/ScreenMateCore/UI/SettingsView.swift`.
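
The model picker described above might look roughly like this. The sketch is self-contained, so `PickerSettings` stands in for `AppSettings` and the model dictionary is passed in rather than read from `ScreenMateEngine`; `ModelOption` and `modelOptions(from:)` are hypothetical helpers that pair each sorted display name with its identifier, which avoids the force unwrap while producing the same tags.

```swift
import SwiftUI

// Stand-in for AppSettings (C000) so the sketch compiles alone.
final class PickerSettings: ObservableObject {
    @Published var selectedVLMModelIdentifier = "mlx-community/llava-phi-3-mini-128k-instruct-4bit"
}

// Hypothetical helper type: one row of the picker.
struct ModelOption: Identifiable, Equatable {
    let name: String        // display name (dictionary key)
    let identifier: String  // hub or path identifier (dictionary value)
    var id: String { name }
}

// Sort by display name so the picker order is stable across launches.
func modelOptions(from models: [String: String]) -> [ModelOption] {
    models.keys.sorted().compactMap { name in
        models[name].map { ModelOption(name: name, identifier: $0) }
    }
}

struct ModelPickerSection: View {
    @EnvironmentObject var appSettings: PickerSettings
    let models: [String: String]

    var body: some View {
        VStack(alignment: .leading) {
            Text("Select VLM Model:")
            Picker("VLM Model", selection: $appSettings.selectedVLMModelIdentifier) {
                ForEach(modelOptions(from: models)) { option in
                    // The tag type (String) must match the selection binding.
                    Text(option.name).tag(option.identifier)
                }
            }
        }
    }
}
```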
---
**Component ID: C009**
**Name: Notification Manager (`SystemIntegration/NotificationManager.swift`)**
- **Reference File(s):** `ScreenMate/ScreenMateCore/SystemIntegration/NotificationManager.swift`.
---
**Component ID: C010**
**Name: Custom Prompt Panel Appearance (`CustomPromptPanel.swift`)** (Renamed from SpotlightPanel)
- **Purpose:** Define the custom `NSPanel` for the floating "Custom Prompt" window.
- **Key Technologies:** Swift, AppKit (`NSPanel`).
- **Core Responsibilities/Tasks for LLM:** (Implement as per original C010 plan, renaming file and class to `CustomPromptPanel`).
- **Reference File(s):** Rename `SpotlightPanel.swift` to `CustomPromptPanel.swift`.
---
**Component ID: C011**
**Name: Custom Prompt Panel Controller (`CustomPromptPanelController.swift`)** (Renamed)
- **Purpose:** Manage the `CustomPromptPanel` window and host its SwiftUI content.
- **Key Technologies:** Swift, AppKit (`NSWindowController`), SwiftUI (`NSHostingView`).
- **Core Responsibilities/Tasks for LLM:**
    1. Create/Rename to `CustomPromptPanelController.swift`.
    2. Subclass `NSWindowController`.
    3. `convenience init(customPromptPanel: CustomPromptPanel)`.
    4. In `AppDelegate.showCustomPromptPanel()` (C001):
        - Obtain `AppSettings` (C000) and `ScreenMateEngine` (C006) instances, either by creating them or by retrieving shared instances if they are singletons or globally managed. Because `ScreenMateEngine` is already a `@StateObject` in `MenubarContentView`, consider how to share it or pass the necessary data. One option is for `CustomPromptView` to take `AppSettings`, create its own `ScreenshotManager`, and call a global/shared `ScreenMateEngine` instance (or a method that uses it). For now, assume the main `ScreenMateEngine` instance is reachable here.
        - Create `CustomPromptView` (C012). Inject `appSettings` and `screenMateEngine` as `.environmentObject()`.
        - Host in `NSHostingView`, set as `customPromptPanel.contentView`.
        - Instantiate `CustomPromptPanelController(customPromptPanel: panel)`.
- **Inputs/Dependencies:** `CustomPromptPanel` (C010), `CustomPromptView` (C012), `AppSettings` (C000), `ScreenMateEngine` (C006).
- **Reference File(s):** Rename `SpotlightPanelController.swift` to `CustomPromptPanelController.swift`.
---
**Component ID: C012**
**Name: Custom Prompt Content View (`UI/CustomPromptView.swift`)** (Renamed from SpotlightContentView)
- **Purpose:** SwiftUI interface where the user enters a custom VLM prompt and processes a new screenshot with it.
- **Key Technologies:** Swift, SwiftUI.
- **Core Responsibilities/Tasks for LLM:**
    1. Create `CustomPromptView.swift` in `UI/`.
    2. **State Management:**
        - `@EnvironmentObject var appSettings: AppSettings` (C000).
        - `@EnvironmentObject var screenMateEngine: ScreenMateEngine` (C006).
        - `@StateObject private var screenshotManager = ScreenshotManager()` (C005).
        - `@State private var userPromptText: String = ""`. (Initialize with `appSettings.lastCustomPrompt` in `.onAppear`).
        - `@State private var customProcessingInProgress: Bool = false`.
        - `@State private var customProcessingResultText: String = ""`.
        - `@State private var screenshotForCustomPromptPreview: Image?`.
    3. **Body Layout (VStack):**
        - `Text("Custom VLM Prompt")`.
        - `TextEditor(text: $userPromptText)` for multi-line input. Min height, resizable.
        - `Button("Take Screenshot & Process with This Prompt")`. Action: `processScreenshotWithCustomPrompt()`. Disable if `screenMateEngine.modelContainer == nil`, `screenMateEngine.isLoadingModel`, or `customProcessingInProgress`.
        - Optional `Image` view for `screenshotForCustomPromptPreview`.
        - `ScrollView` to display `customProcessingResultText`.
    4. **Private Methods:**
        - `processScreenshotWithCustomPrompt()`:
            - Set `customProcessingInProgress = true`. Update `appSettings.lastCustomPrompt = userPromptText`.
            - Build `userPromptTextWithPlaceholder`: ensure the prompt includes the VLM-specific image placeholder (e.g., `<image>\n`), prepending it if `userPromptText` is just the raw question.
            - Call `screenshotManager.takeScreenshotToImage(...)`.
            - On `NSImage` received, update `screenshotForCustomPromptPreview`.
            - Call `screenMateEngine.processImage(onNSImage: receivedNSImage, prompt: userPromptTextWithPlaceholder, ...)`.
            - Handle `Result`, update `customProcessingResultText`. Set `customProcessingInProgress = false`.
- **Inputs/Dependencies:** `AppSettings` (C000), `ScreenMateEngine` (C006), `ScreenshotManager` (C005).
- **Outputs/Deliverables:** `CustomPromptView.swift`.
- **Interface with Other Components:** Hosted by `CustomPromptPanelController`.
- **Success Criteria:** User inputs prompt, triggers screenshot, VLM processes with custom prompt, results displayed.
- **Reference File(s):** `SpotlightContentView.swift` to be heavily adapted into `CustomPromptView.swift`.
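
The state transitions in `processScreenshotWithCustomPrompt()` can be sketched outside SwiftUI as below. This is only a model of the control flow: `CustomPromptFlow` is a hypothetical class, the screenshot and engine are injected as closures, and `Data` stands in for `NSImage` so the snippet needs only Foundation. In the real view the same steps would update `@State` properties on the main thread, and the failure message shown here is illustrative.

```swift
import Foundation

// Hypothetical model of the CustomPromptView flow (C012).
final class CustomPromptFlow {
    var userPromptText = ""
    private(set) var customProcessingInProgress = false
    private(set) var customProcessingResultText = ""
    private(set) var lastCustomPrompt = ""

    // Stand-ins for ScreenshotManager (C005) and ScreenMateEngine (C006).
    private let capture: (@escaping (Data?) -> Void) -> Void
    private let process: (Data, String, @escaping (Result<String, Error>) -> Void) -> Void

    init(capture: @escaping (@escaping (Data?) -> Void) -> Void,
         process: @escaping (Data, String, @escaping (Result<String, Error>) -> Void) -> Void) {
        self.capture = capture
        self.process = process
    }

    func processScreenshotWithCustomPrompt() {
        guard !customProcessingInProgress else { return }
        customProcessingInProgress = true
        lastCustomPrompt = userPromptText

        // Assumption: "<image>" is the model's image placeholder.
        let prompt = userPromptText.hasPrefix("<image>")
            ? userPromptText
            : "<image>\n" + userPromptText

        capture { [weak self] image in
            guard let self = self else { return }
            guard let image = image else {
                self.customProcessingResultText = "Screenshot cancelled or failed."
                self.customProcessingInProgress = false
                return
            }
            self.process(image, prompt) { result in
                switch result {
                case .success(let text):
                    self.customProcessingResultText = text
                case .failure(let error):
                    self.customProcessingResultText = "Error: \(error.localizedDescription)"
                }
                self.customProcessingInProgress = false
            }
        }
    }
}
```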
---
**Component ID: C013 (Optional - Low Priority)**
**Name: Workspace Monitor (`SystemIntegration/WorkspaceMonitor.swift`)**
- **(No change to its own spec, but its utility might be higher with custom, context-aware prompts).**
- **Reference File(s):** `ScreenMate/ScreenMateCore/SystemIntegration/WorkspaceMonitor.swift`.
---
**Component ID: C014**
**Name: Project Setup & Configuration**
- **Core Responsibilities/Tasks for LLM (Updates):**
    1. Rename existing project files/targets from "OCRToolbox" to "ScreenMate" (take care: some steps usually have to be done manually first).
    2. Update Bundle Identifier to reflect "ScreenMate" (e.g., `com.yourcompany.ScreenMate`).
    3. (MLX Dependencies remain essential).
- **Reference File(s):** Project build settings, `Info.plist`.
---
## 4. Suggested Agentic Workflow (Adjusted for Renaming, VLM Selection & Custom Prompts):
1. **Project Renaming (Manual/Guided First):** Rename `.xcodeproj`, schemes, targets to "ScreenMate".
2. **C014 (Project Setup):** Verify Bundle ID for "ScreenMate". Ensure MLX dependencies are linked.
3. **C000 (AppSettings):** Create shared settings.
4. **C001 (AppDelegate):** Update for `showCustomPromptPanel()`.
5. **C005 (ScreenshotManager - In-Memory).**
6. **C009 (NotificationManager), C007 (AutostartManager - Update for new Bundle ID/App Name).**
7. **C006 (ScreenMateEngine - Renamed from OCREngine):** Implement `supportedVLMModels`, `getDefaultOCRPrompt()`. Adapt `processImage` (was `performOCR`) to take `prompt: String`.
8. **C008 (SettingsView):** Implement with VLM model `Picker` using `ScreenMateEngine.supportedVLMModels` and binding to `appSettings.selectedVLMModelIdentifier`.
9. **C004 (MenubarContentView):**
    * Inject `AppSettings`. Rename `ocrEngine` to `screenMateEngine`.
    * Implement UI for loading model based on `appSettings.selectedVLMModelIdentifier`.
    * "Process Screenshot" button calls `screenMateEngine.processImage` with `screenMateEngine.getDefaultOCRPrompt()`.
    * Button to invoke `AppDelegate.showCustomPromptPanel()`. **Remove Copy button.**
10. **C010 (CustomPromptPanel - Renamed), C012 (CustomPromptView - Renamed), C011 (CustomPromptPanelController - Renamed):** Implement the custom prompt UI and its interactions.
11. **C003 (PanelController):** Ensure `AppSettings` and `AutostartManager` are correctly injected into `MenubarContentView`.
12. **C002 (MenuBarManager).**
    * *Core functionality fully testable: VLM selection, default processing, custom prompt processing.*
13. **Refinements & Testing.**
## 5. Agent Instructions (Updates):
* **Renaming:** Be meticulous with renaming "OCRToolbox" to "ScreenMate" and "OCREngine" to "ScreenMateEngine" throughout the codebase, including filenames, class names, variable names, comments, and log messages.
* **`AppSettings` (C000):** This is a new central piece for settings.
* **`ScreenMateEngine` (C006):**
    * Add `supportedVLMModels` static property.
    * `processImage` (renamed from `performOCR`) must accept a `prompt: String`.
* **`MenubarContentView` (C004):** Remove Copy button. Drive model loading via `appSettings`.
* **`SettingsView` (C008):** Implement `Picker` for VLM selection.
* **Custom Prompt Feature (C010, C011, C012):** This is a significant UI and logic addition. `CustomPromptView` will need to manage its own screenshot and prompt, then call `screenMateEngine.processImage`. Ensure the VLM-specific image placeholder (e.g. `<image>\n`) is correctly prepended to the user's custom text prompt before sending to `ScreenMateEngine`.