┌──────────────────────────────────────────────────────────────┐
│                     MICROCONTROLLER UNIT                     │
│                                                              │
│ [CAMERA MODULE]                                              │
│        │                                                     │
│  ┌─────▼─────┐    ┌───────────┐    ┌──────────────┐          │
│  │   Image   │───▶│    MCU    │───▶│  USB-C Port  │──────────┼──▶ Android Phone
│  │  Sensor   │    │ (ESP32-S3 │    │   (Power +   │          │
│  └───────────┘    │ or RP2040 │    │    Data)     │          │
│                   │  based)   │    └──────────────┘          │
│                   └─────┬─────┘                              │
│                         │                                    │
│           ┌─────────────┼─────────────┐                      │
│           │             │             │                      │
│      [BTN_MAIN]    [BTN_MODE]   [BTN_CONTEXT]                │
│      (Top Face)    (Left Side)   (Right Side)                │
│                                                              │
│                  [LED INDICATOR]                             │
│                   (Status Ring)                              │
└──────────────────────────────────────────────────────────────┘
Why 3 buttons instead of 1:
A single button creates complex multi-tap patterns that are unreliable for visually impaired users under stress. Three buttons with distinct physical positions allow muscle memory navigation without looking.
┌───────────────────────────────────────────────────┐
│             BUTTON RESPONSIBILITY MAP             │
├───────────────┬───────────────────────────────────┤
│ BUTTON        │ PHYSICAL LOCATION (ROLE)          │
├───────────────┼───────────────────────────────────┤
│ BTN_MAIN      │ Top face, largest, tactile bump   │
│               │ (Primary action trigger)          │
├───────────────┼───────────────────────────────────┤
│ BTN_MODE      │ Left side, medium, ribbed texture │
│               │ (Mode switching / navigation)     │
├───────────────┼───────────────────────────────────┤
│ BTN_CONTEXT   │ Right side, small, smooth         │
│               │ (Contextual / interrupt)          │
└───────────────┴───────────────────────────────────┘
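Every button supports four single-button gestures: tap, hold, double tap, and a long hold of 3 seconds or more. This gives 12 distinct commands from 3 buttons without complex patterns. Chord combinations (two or more buttons pressed together) add a few advanced commands on top.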
┌───────────────────────────────────────────────────────────────────────┐
│                          GESTURE VOCABULARY                           │
├──────────────┬────────────────────────────────────────────────────────┤
│ GESTURE TYPE │ DEFINITION                                             │
├──────────────┼────────────────────────────────────────────────────────┤
│ TAP          │ Press and release in under 300ms                       │
│ HOLD         │ Press and hold for 300ms–1500ms                        │
│ LONG HOLD    │ Press and hold for 3s or longer (HOLD 3s+)             │
│ DOUBLE TAP   │ Two taps within 400ms                                  │
│ CHORD        │ Two or more buttons pressed at the same time           │
│              │ (e.g. BTN_MODE + BTN_MAIN; unlocks advanced commands)  │
└──────────────┴────────────────────────────────────────────────────────┘
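The timing thresholds above can be sketched as a small classifier. This is only an illustration: the function and constant names are hypothetical, the 3000 ms threshold comes from the "HOLD 3s+" rows of the command map, presses lasting between 1500 ms and 3000 ms are assumed to count as HOLD (the table leaves that range undefined), and the 400 ms double-tap window is assumed to run from the first press to the second press.

```python
TAP_MAX_MS = 300           # a tap is released in under 300 ms
LONG_HOLD_MIN_MS = 3000    # "HOLD 3s+" in the command map
DOUBLE_TAP_WINDOW_MS = 400 # assumption: measured press-to-press


def classify_press(duration_ms):
    """Classify one press of one button by how long it was held."""
    if duration_ms < TAP_MAX_MS:
        return "TAP"
    if duration_ms >= LONG_HOLD_MIN_MS:
        return "LONG_HOLD"
    # Assumption: 1500-3000 ms is not defined in the table; treated as HOLD.
    return "HOLD"


def classify_sequence(presses):
    """Turn (press_ms, release_ms) pairs for ONE button into gestures.

    Two consecutive taps whose presses start within the double-tap
    window are merged into a single DOUBLE_TAP.
    """
    gestures = []
    i = 0
    while i < len(presses):
        down, up = presses[i]
        kind = classify_press(up - down)
        if kind == "TAP" and i + 1 < len(presses):
            next_down, next_up = presses[i + 1]
            is_tap = classify_press(next_up - next_down) == "TAP"
            if is_tap and next_down - down <= DOUBLE_TAP_WINDOW_MS:
                gestures.append("DOUBLE_TAP")
                i += 2
                continue
        gestures.append(kind)
        i += 1
    return gestures
```

A real implementation also has to wait out the double-tap window before reporting a lone TAP, which adds up to 400 ms of latency to every single tap; that trade-off is not covered by this sketch.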
┌────────────────────────────────────────────────────────────────────────────────┐
│                            FULL BUTTON COMMAND MAP                             │
├────────────────┬─────────────────┬─────────────────────────────────────────────┤
│ BUTTON         │ GESTURE         │ ACTION                                      │
├────────────────┼─────────────────┼─────────────────────────────────────────────┤
│ BTN_MAIN       │ TAP             │ Capture frame → Run active mode pipeline    │
│ (Top Face)     │ HOLD            │ Start continuous streaming mode             │
│                │ DOUBLE TAP      │ Repeat last audio output                    │
│                │ HOLD 3s+        │ Emergency SOS trigger                       │
├────────────────┼─────────────────┼─────────────────────────────────────────────┤
│ BTN_MODE       │ TAP             │ Cycle forward through Feature Modes         │
│ (Left Side)    │ HOLD            │ Cycle backward through Feature Modes        │
│                │ DOUBLE TAP      │ Jump to Home / Default Mode                 │
│                │ HOLD 3s+        │ Open Settings narration                     │
├────────────────┼─────────────────┼─────────────────────────────────────────────┤
│ BTN_CONTEXT    │ TAP             │ INTERRUPT → Silence audio immediately       │
│ (Right Side)   │ HOLD            │ Repeat at slower speech rate                │
│                │ DOUBLE TAP      │ Ask follow-up ("Tell me more")              │
│                │ HOLD 3s+        │ Toggle Caregiver live link                  │
├────────────────┼─────────────────┼─────────────────────────────────────────────┤
│ CHORD COMBOS   │                 │                                             │
├────────────────┼─────────────────┼─────────────────────────────────────────────┤
│ MODE + MAIN    │ Simultaneous    │ Toggle continuous obstacle detection ON/OFF │
│ MODE + CONTEXT │ Simultaneous    │ Switch language / translation target        │
│ ALL THREE      │ Simultaneous 2s │ Hard reboot MCU + reconnect USB             │
└────────────────┴─────────────────┴─────────────────────────────────────────────┘
The device operates in discrete Feature Modes. BTN_MODE cycles through them. The user always knows which mode they are in because the system announces it on each switch.
┌──────────────────────────────────────────────────────────────────────────────┐
│                              FEATURE MODE RING                               │
│                                                                              │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐                │
│  │  MODE 0  │───▶│  MODE 1  │───▶│  MODE 2  │───▶│  MODE 3  │──┐             │
│  │   HOME   │    │   READ   │    │ NAVIGATE │    │  PEOPLE  │  │             │
│  │(Default) │    │  (OCR)   │    │ (Detect) │    │  (Face)  │  │             │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │             │
│       ▲                                                        │             │
│       │          ┌──────────┐    ┌──────────┐    ┌──────────┐  │             │
│       └──────────│  MODE 6  │◀───│  MODE 5  │◀───│  MODE 4  │◀─┘             │
│                  │ DESCRIBE │    │  LEARN   │    │  MONEY   │                │
│                  │  (VLM)   │    │  (Edu)   │    │(Currency)│                │
│                  └──────────┘    └──────────┘    └──────────┘                │
│                                                                              │
│  BTN_MODE TAP = Move clockwise    BTN_MODE HOLD = Move counter-clockwise     │
│  BTN_MODE DOUBLE TAP = Jump to MODE 0                                        │
└──────────────────────────────────────────────────────────────────────────────┘
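The ring navigation can be sketched as modular arithmetic over the seven modes. The class and method names below are hypothetical, and the returned announcement is shortened to "<Name> Mode" rather than the full per-mode announcement text.

```python
MODES = ["HOME", "READ", "NAVIGATE", "PEOPLE", "MONEY", "LEARN", "DESCRIBE"]


class ModeRing:
    """Tracks the active Feature Mode and applies BTN_MODE gestures."""

    def __init__(self, start=0):
        self.index = start

    def on_mode_gesture(self, gesture):
        """Apply a BTN_MODE gesture and return the text to announce."""
        if gesture == "TAP":            # clockwise
            self.index = (self.index + 1) % len(MODES)
        elif gesture == "HOLD":         # counter-clockwise
            self.index = (self.index - 1) % len(MODES)
        elif gesture == "DOUBLE_TAP":   # jump to MODE 0
            self.index = 0
        return f"{MODES[self.index].title()} Mode"


ring = ModeRing()
print(ring.on_mode_gesture("HOLD"))        # wraps backward from HOME to DESCRIBE
print(ring.on_mode_gesture("DOUBLE_TAP"))  # back to HOME
```

Because the index wraps with the modulo operator, a HOLD in Home Mode lands on Describe Mode and a TAP in Describe Mode lands back on Home Mode.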
┌──────────────────────────────────────────────────────────────────────────────┐
│ MODE 0: HOME                                                                 │
│ Announced as: "Home Mode"                                                    │
├──────────────────────────────────────────────────────────────────────────────┤
│ BTN_MAIN TAP     → Scene snapshot: "Describe what is in front of me"         │
│ BTN_MAIN HOLD    → Begin continuous scene awareness (10s intervals)          │
│ BTN_CONTEXT TAP  → Silence                                                   │
│ BTN_CONTEXT D.T  → "Tell me more about what you just described"              │
│                                                                              │
│ PIPELINE:                                                                    │
│ Camera Frame → Scene Classifier → Object Detector → LLM Composer             │
│   → TTS Engine → Phone Speaker / Headphones                                  │
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ MODE 1: READ (OCR MODE)                                                      │
│ Announced as: "Read Mode. Point at text and press the top button."           │
├──────────────────────────────────────────────────────────────────────────────┤
│ BTN_MAIN TAP     → Capture → OCR → Read all detected text aloud              │
│ BTN_MAIN HOLD    → Continuous reading (for long documents, scroll by tilt)   │
│ BTN_CONTEXT TAP  → Silence mid-read                                          │
│ BTN_CONTEXT HOLD → Re-read last block at half speed                          │
│ BTN_CONTEXT D.T  → Translate last read text to target language               │
│                                                                              │
│ SUB-MODES within READ (auto-detected by content type):                       │
│ • Label Mode    → Product label detected, reads name + key warnings first    │
│ • Document Mode → Detects table/form layout, reads by section                │
│ • Currency Mode → Detects currency, announces denomination                   │
│                                                                              │
│ PIPELINE:                                                                    │
│ Camera Frame → Pre-processor (deskew, contrast boost)                        │
│   → OCR Engine (on-device Tesseract or ML Kit)                               │
│   → Layout Classifier (label / document / currency)                          │
│   → Text Formatter → TTS → Speaker                                           │
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ MODE 2: NAVIGATE (OBSTACLE + SPATIAL DETECTION)                              │
│ Announced as: "Navigate Mode. Continuous obstacle detection active."         │
├──────────────────────────────────────────────────────────────────────────────┤
│ On Mode Entry    → Automatically begins continuous frame processing          │
│ BTN_MAIN TAP     → Snapshot for detailed scene read ("What is near me?")     │
│ BTN_MAIN HOLD    → Increase detection frequency (from 2fps to 5fps)          │
│ BTN_CONTEXT TAP  → Silence current alert (obstacle alerts resume next hit)   │
│ BTN_CONTEXT D.T  → Ask "How many people are in front of me?"                 │
│                                                                              │
│ SPATIAL AUDIO ROUTING:                                                       │
│ Objects LEFT of center  → Audio panned LEFT channel                          │
│ Objects RIGHT of center → Audio panned RIGHT channel                         │
│ Objects CENTER + CLOSE  → Alert beep + voice (headphones required for pan)   │
│                                                                              │
│ PIPELINE:                                                                    │
│ Continuous Frames → Object Detector (YOLO on-device)                         │
│   → Bounding Box Extractor → Spatial Mapper (left/center/right zones)        │
│   → Proximity Estimator → Audio Panner → Speaker / Headphones                │
└──────────────────────────────────────────────────────────────────────────────┘
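The Spatial Mapper and Audio Panner stages can be sketched as follows. Everything numeric here is an assumption made for illustration: the frame is split into equal thirds, "close" is approximated by the bounding box covering at least half of the frame height, and pan runs from -1.0 (left) to 1.0 (right). The function names are hypothetical.

```python
def zone_for_box(x_min, x_max, frame_width):
    """Map a bounding box to LEFT / CENTER / RIGHT by its horizontal center."""
    center = (x_min + x_max) / 2 / frame_width
    if center < 1 / 3:
        return "LEFT"
    if center > 2 / 3:
        return "RIGHT"
    return "CENTER"


def route_alert(box, frame_width, frame_height, close_fraction=0.5):
    """Decide pan position and whether to add the alert beep.

    box is (x_min, y_min, x_max, y_max) in pixels.
    Assumption: box height relative to the frame stands in for proximity.
    """
    x_min, y_min, x_max, y_max = box
    zone = zone_for_box(x_min, x_max, frame_width)
    is_close = (y_max - y_min) / frame_height >= close_fraction
    pan = {"LEFT": -1.0, "CENTER": 0.0, "RIGHT": 1.0}[zone]
    return {"zone": zone, "pan": pan, "beep": zone == "CENTER" and is_close}


print(route_alert((0, 0, 100, 100), 640, 480))      # small object on the left
print(route_alert((250, 100, 390, 400), 640, 480))  # large object dead ahead
```

On phone-speaker output the pan value would simply be ignored, since panning needs headphones.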
┌──────────────────────────────────────────────────────────────────────────────┐
│ MODE 3: PEOPLE (FACE + SOCIAL LAYER)                                         │
│ Announced as: "People Mode."                                                 │
├──────────────────────────────────────────────────────────────────────────────┤
│ BTN_MAIN TAP     → Capture → Face detection → Name announcement              │
│ BTN_MAIN HOLD    → Scan room, list all detected faces                        │
│ BTN_CONTEXT TAP  → Silence                                                   │
│ BTN_CONTEXT D.T  → "Describe their expression and body language"             │
│ BTN_MODE + MAIN  → Register new face: "Say the name of this person"          │
│                                                                              │
│ PIPELINE:                                                                    │
│ Camera Frame → Face Detector → Face Embedder                                 │
│   → Local Embedding DB Lookup → Name Resolver                                │
│   [If unknown] → LLM: "Unknown person. Male, approximately 30s, smiling."    │
│   → Emotion Classifier (optional, Phase 3) → TTS                             │
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ MODE 4: MONEY (CURRENCY DETECTION)                                           │
│ Announced as: "Money Mode. Hold a note or coin in front of the camera."      │
├──────────────────────────────────────────────────────────────────────────────┤
│ BTN_MAIN TAP     → Capture → Identify denomination → Read aloud              │
│ BTN_MAIN HOLD    → Continuous mode: reads each note as it passes camera      │
│ BTN_CONTEXT D.T  → "Calculate total from last 3 identified notes"            │
│                                                                              │
│ PIPELINE:                                                                    │
│ Camera Frame → Currency Classifier (lightweight CNN)                         │
│   → Denomination + Currency Type → TTS ("Twenty Dollar Bill")                │
│   [Running Total Engine] → TTS on request                                    │
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ MODE 5: LEARN (EDUCATIONAL / REASONING)                                      │
│ Announced as: "Learn Mode. Point at anything to get an explanation."         │
├──────────────────────────────────────────────────────────────────────────────┤
│ BTN_MAIN TAP     → Capture → Full VLM analysis → Detailed explanation        │
│ BTN_CONTEXT D.T  → "Go deeper" / step-by-step breakdown                      │
│ BTN_CONTEXT HOLD → Re-read at half speed                                     │
│                                                                              │
│ CONTENT AUTO-ROUTING:                                                        │
│ Math/Diagram detected → "This is a bar chart comparing..."                   │
│ Handwriting detected  → OCR → Read → "This appears to be a sticky note"      │
│ Whiteboard detected   → Summarize key points: "3 main bullet points..."      │
│                                                                              │
│ PIPELINE:                                                                    │
│ Camera Frame → Content Type Classifier                                       │
│   → Route: [OCR Path] or [Object Path] or [VLM Direct Path]                  │
│   → LLM Reasoning Layer → Structured Response Formatter → TTS                │
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ MODE 6: DESCRIBE (RICH NARRATION / DELIGHT)                                  │
│ Announced as: "Describe Mode. For art, photos, and rich descriptions."       │
├──────────────────────────────────────────────────────────────────────────────┤
│ BTN_MAIN TAP     → Capture → Full poetic/detailed scene description          │
│ BTN_MAIN HOLD    → Photo album mode: narrate continuously as frames change   │
│ BTN_CONTEXT D.T  → "What colors dominate this scene?"                        │
│                                                                              │
│ LLM PROMPT STYLE: Descriptive, emotional, painterly language                 │
│ "A warm, sun-drenched kitchen. A wooden table with a blue ceramic bowl..."   │
│                                                                              │
│ PIPELINE:                                                                    │
│ Camera Frame → High-res capture → VLM (cloud or on-device)                   │
│   → Narrative Prompt Template → LLM → TTS (natural, expressive voice)        │
└──────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│                           COMPLETE SYSTEM LAYERS                           │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ LAYER 6: OUTPUT LAYER                                                │  │
│  │ Phone Speaker (mono) / Headphones (stereo spatial audio)             │  │
│  │ TTS Engine → [Google TTS / On-device neural TTS]                     │  │
│  │ Audio Panner (for directional Navigate Mode alerts)                  │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                     ▲                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ LAYER 5: RESPONSE COMPOSER                                           │  │
│  │ Assembles structured output from pipeline results                    │  │
│  │ Applies Mode-specific language style (terse vs. descriptive)         │  │
│  │ Manages audio queue (priority interrupts, don't stack alerts)        │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                     ▲                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ LAYER 4: AI / REASONING LAYER                                        │  │
│  │ On-device Models: YOLO (objects), FaceNet (faces), Tesseract (OCR)   │  │
│  │ On-device LLM: small quantized model for offline reasoning           │  │
│  │ Cloud VLM (optional, for MODE 5/6): GPT-4V or Gemini Vision          │  │
│  │ Memory Manager: short-term buffer + long-term SQLite DB              │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                     ▲                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ LAYER 3: FEATURE MODE ROUTER                                         │  │
│  │ Active Mode State Machine                                            │  │
│  │ Routes incoming frame + button event to correct pipeline             │  │
│  │ Manages mode transitions and announcements                           │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                     ▲                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ LAYER 2: USB COMMUNICATION LAYER (Android App Core)                  │  │
│  │ USB Host Manager    → Opens connection to MCU                        │  │
│  │ Frame Receiver      → MJPEG or raw frame stream parsing              │  │
│  │ Button Event Parser → Decodes button ID + gesture type from MCU      │  │
│  │ Power Monitor       → Tracks USB power delivery to MCU               │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                     ▲                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ LAYER 1: HARDWARE LAYER (Microcontroller)                            │  │
│  │ Camera Sensor → Frame Buffer → USB Bulk Transfer to Android          │  │
│  │ BTN_MAIN, BTN_MODE, BTN_CONTEXT → Debounced → Gesture Classifier     │  │
│  │ Gesture events transmitted to Android as USB HID or serial packets   │  │
│  │ Receives USB power (5V) from Android via OTG → powers camera + MCU   │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
The MCU communicates two types of data to the Android app over USB:
┌────────────────────────────────────────────────────────────────────┐
│                       USB DATA PACKET TYPES                        │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  TYPE A: FRAME PACKET (Camera Data)                                │
│  ┌───────────┬──────────┬─────────────────────────────┐            │
│  │ Header    │ Frame ID │ MJPEG compressed frame data │            │
│  │ 0xAA 0xBB │ 4 bytes  │ Variable length             │            │
│  └───────────┴──────────┴─────────────────────────────┘            │
│  Sent at: 5–15fps depending on active mode                         │
│                                                                    │
│  TYPE B: BUTTON EVENT PACKET                                       │
│  ┌───────────┬───────────┬──────────────┬───────────┐              │
│  │ Header    │ Button ID │ Gesture Type │ Timestamp │              │
│  │ 0xCC 0xDD │ 1 byte    │ 1 byte       │ 4 bytes   │              │
│  └───────────┴───────────┴──────────────┴───────────┘              │
│                                                                    │
│  Button ID values: 0x01=MAIN, 0x02=MODE, 0x03=CONTEXT              │
│  Gesture Type:     0x01=TAP, 0x02=HOLD, 0x03=DOUBLE_TAP,           │
│                    0x04=LONG_HOLD, 0x05=CHORD                      │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
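Decoding the 8-byte button event packet might look like the sketch below. The layout and code values are taken from the table; the little-endian byte order of the timestamp is an assumption (the byte order and the timestamp unit are not specified), and the function name is hypothetical.

```python
import struct

BUTTON_HEADER = b"\xCC\xDD"
BUTTONS = {0x01: "MAIN", 0x02: "MODE", 0x03: "CONTEXT"}
GESTURES = {
    0x01: "TAP",
    0x02: "HOLD",
    0x03: "DOUBLE_TAP",
    0x04: "LONG_HOLD",
    0x05: "CHORD",
}


def parse_button_packet(data):
    """Decode one TYPE B packet into (button, gesture, timestamp)."""
    if len(data) != 8:
        raise ValueError(f"expected 8 bytes, got {len(data)}")
    if data[:2] != BUTTON_HEADER:
        raise ValueError("not a button event packet")
    # Assumption: 1-byte button ID, 1-byte gesture, 4-byte little-endian timestamp.
    button_id, gesture_id, timestamp = struct.unpack("<BBI", data[2:])
    if button_id not in BUTTONS or gesture_id not in GESTURES:
        raise ValueError("unknown button or gesture code")
    return BUTTONS[button_id], GESTURES[gesture_id], timestamp


packet = bytes([0xCC, 0xDD, 0x02, 0x03]) + (1234).to_bytes(4, "little")
print(parse_button_packet(packet))  # BTN_MODE double tap, timestamp 1234
```

The frame packet is not covered here: it carries a variable-length payload with no length field in the table, so the receiver would need some other way to find the end of each frame.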
┌────────────────────────────────────────────────────────────────────────┐
│                         ANDROID APP MODULE MAP                         │
├─────────────────────────┬──────────────────────────────────────────────┤
│ MODULE                  │ RESPONSIBILITY                               │
├─────────────────────────┼──────────────────────────────────────────────┤
│ UsbHostModule           │ Detect MCU connection, open USB interface,   │
│                         │ manage bulk transfer endpoints               │
├─────────────────────────┼──────────────────────────────────────────────┤
│ FrameReceiverModule     │ Parse incoming byte stream into valid        │
│                         │ JPEG/frame objects, timestamp them           │
├─────────────────────────┼──────────────────────────────────────────────┤
│ ButtonEventModule       │ Parse button packets, implement gesture      │
│                         │ timing logic (tap vs hold threshold),        │
│                         │ emit named events to Mode Router             │
├─────────────────────────┼──────────────────────────────────────────────┤
│ ModeStateMachine        │ Tracks current active mode, handles          │
│                         │ mode transitions, announces mode changes     │
├─────────────────────────┼──────────────────────────────────────────────┤
│ PipelineOrchestrator    │ On frame + command received, selects which   │
│                         │ AI pipeline to invoke based on active mode   │
├─────────────────────────┼──────────────────────────────────────────────┤
│ OCRPipeline             │ Pre-processes frame, runs ML Kit OCR,        │
│                         │ classifies layout type, formats output       │
├─────────────────────────┼──────────────────────────────────────────────┤
│ ObjectDetectionPipeline │ Runs YOLO inference, maps bounding boxes     │
│                         │ to spatial zones, generates proximity alerts │
├─────────────────────────┼──────────────────────────────────────────────┤
│ FaceRecognitionPipeline │ Detects faces, generates embeddings,         │
│                         │ queries local face DB, returns names         │
├─────────────────────────┼──────────────────────────────────────────────┤
│ CurrencyPipeline        │ Runs lightweight currency CNN classifier,    │
│                         │ maintains running total state                │
├─────────────────────────┼──────────────────────────────────────────────┤
│ VLMPipeline             │ Encodes frame to base64, sends to cloud VLM  │
│                         │ or on-device model, returns description      │
├─────────────────────────┼──────────────────────────────────────────────┤
│ MemoryModule            │ Short-term: rolling 15-min visual buffer     │
│                         │ Long-term: SQLite for faces, routines        │
├─────────────────────────┼──────────────────────────────────────────────┤
│ AudioOutputModule       │ Manages TTS queue, priority interrupts,      │
│                         │ spatial panning for headphone output,        │
│                         │ speech rate control                          │
├─────────────────────────┼──────────────────────────────────────────────┤
│ SettingsModule          │ Voice-navigable settings: language, speed,   │
│                         │ mode defaults, caregiver link toggle         │
├─────────────────────────┼──────────────────────────────────────────────┤
│ CaregiverModule         │ Opt-in remote monitoring, SOS trigger,       │
│                         │ sends compressed frame + location to contact │
└─────────────────────────┴──────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                          AUDIO PRIORITY QUEUE                           │
├─────────────┬───────────────────────────────────────────────────────────┤
│ PRIORITY    │ EXAMPLE AND BEHAVIOR                                      │
├─────────────┼───────────────────────────────────────────────────────────┤
│ P0 URGENT   │ "STOP. Obstacle directly ahead."                          │
│             │ Interrupts everything. Cannot be silenced mid-word.       │
├─────────────┼───────────────────────────────────────────────────────────┤
│ P1 ALERT    │ "Stair detected on your left."                            │
│             │ Interrupts P2/P3. BTN_CONTEXT TAP silences it.            │
├─────────────┼───────────────────────────────────────────────────────────┤
│ P2 RESPONSE │ Response to user-triggered action (OCR result, face name) │
│             │ Plays after current P2 finishes. TAP to interrupt.        │
├─────────────┼───────────────────────────────────────────────────────────┤
│ P3 AMBIENT  │ Periodic scene descriptions in continuous mode            │
│             │ First to be dropped if queue is full.                     │
└─────────────┴───────────────────────────────────────────────────────────┘
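One way to model these priority rules is a small heap-backed queue, sketched below. The class name, the `max_pending` limit, and the return strings are hypothetical. Two behaviors are assumptions rather than stated rules: an interrupted utterance is discarded instead of resumed, and a P2 response waits for whatever is currently playing (including P3 ambient speech) rather than cutting it off.

```python
import heapq
import itertools


class AudioQueue:
    """Priority queue for speech output; lower number = more urgent."""

    def __init__(self, max_pending=4):
        self.max_pending = max_pending
        self.current = None               # (priority, text) now playing
        self._pending = []                # heap of (priority, seq, text)
        self._seq = itertools.count()     # keeps FIFO order within a priority

    def submit(self, priority, text):
        """Returns 'playing', 'interrupt', 'queued' or 'dropped'."""
        playing = self.current
        # P0/P1 cut off anything less urgent; an idle queue plays at once.
        if playing is None or (priority <= 1 and priority < playing[0]):
            self.current = (priority, text)
            return "interrupt" if playing else "playing"
        entry = (priority, next(self._seq), text)
        heapq.heappush(self._pending, entry)
        if len(self._pending) > self.max_pending:
            # Queue full: drop the least urgent, most recent item first.
            worst = max(self._pending)
            self._pending.remove(worst)
            heapq.heapify(self._pending)
            if worst is entry:
                return "dropped"
        return "queued"

    def finish(self):
        """Call when playback ends; starts the most urgent pending item."""
        self.current = None
        if self._pending:
            priority, _, text = heapq.heappop(self._pending)
            self.current = (priority, text)
        return self.current
```

With this shape, ambient P3 descriptions are the first items lost when alerts pile up, which matches the intent of not stacking alerts on top of each other.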
SPEAKER vs HEADPHONE BEHAVIOR:
• Phone Speaker Only  → Mono output, no spatial panning, louder voice
• Headphones Detected → Stereo spatial panning enabled for Navigate Mode
                        Quieter, more intimate voice style
                        Navigate Mode becomes fully directional (left/right beeps)
┌──────────────────────────────────────────────────────────────────────────────┐
│                               VOICE INPUT FLOW                               │
│                                                                              │
│ User says "Hey Glasses"                                                      │
│      │                                                                       │
│      ▼                                                                       │
│ On-device Wake Word Detector (Porcupine or equivalent)                       │
│      │                                                                       │
│      ▼                                                                       │
│ Phone Microphone activates (500ms window)                                    │
│      │                                                                       │
│ User speaks command: "What is the expiry date on this?"                      │
│      │                                                                       │
│      ▼                                                                       │
│ On-device STT (Speech-to-Text) → Intent Classifier                           │
│      │                                                                       │
│      ├──▶ Detected Mode Switch Intent → ModeStateMachine                     │
│      ├──▶ Detected Query Intent       → PipelineOrchestrator                 │
│      ├──▶ Detected Settings Intent    → SettingsModule                       │
│      └──▶ Detected Unknown / Unclear  → "Sorry, say that again."             │
│                                                                              │
│ BTN_MAIN HOLD = Alternative to wake word (always reliable, no mic needed)    │
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│                           SETTINGS MENU STRUCTURE                            │
│                    Activated by: BTN_MODE HOLD 3 seconds                     │
│                                                                              │
│ System announces: "Settings. Say a setting name or press Mode to browse."    │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────┐               │
│  │ SETTING 1: Speech Speed    → Slow / Normal / Fast         │               │
│  │ SETTING 2: Language        → English / [other options]    │               │
│  │ SETTING 3: Default Mode    → Which mode on startup        │               │
│  │ SETTING 4: Obstacle Alerts → Always On / Only in Navigate │               │
│  │ SETTING 5: Caregiver Link  → Enable / Disable             │               │
│  │ SETTING 6: Face Database   → Add / Remove / Clear         │               │
│  │ SETTING 7: Voice Input     → Enable / Disable wake word   │               │
│  │ SETTING 8: Exit Settings                                  │               │
│  └───────────────────────────────────────────────────────────┘               │
│                                                                              │
│ Navigation: BTN_MODE cycles through settings                                 │
│ Activation: BTN_MAIN TAP selects and cycles value                            │
│ Exit:       BTN_MODE DOUBLE TAP returns to active mode                       │
└──────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                     LED STATUS PATTERNS                     │
├──────────────────┬──────────────────────────────────────────┤
│ LED PATTERN      │ MEANING                                  │
├──────────────────┼──────────────────────────────────────────┤
│ Slow blue pulse  │ Connected to Android, idle               │
│ Solid green      │ Active, processing frame                 │
│ Fast white flash │ Button event registered                  │
│ Solid red        │ USB disconnected or power error          │
│ Amber pulse      │ Low battery (if battery option added)    │
│ Purple flash x3  │ Mode changed (matches mode number)       │
│ All white spin   │ SOS triggered, caregiver being contacted │
└──────────────────┴──────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ EXAMPLE: User is in READ MODE, presses BTN_MAIN TAP on a medicine bottle     │
│                                                                              │
│ [1]  User presses BTN_MAIN                                                   │
│   │                                                                          │
│ [2]  MCU: Debounce → Gesture Timer → Confirms TAP (released < 300ms)         │
│   │                                                                          │
│ [3]  MCU: Sends Button Event Packet {ID:MAIN, GESTURE:TAP} over USB          │
│   │                                                                          │
│ [4]  MCU: Captures camera frame → Compresses to MJPEG → Sends Frame Packet   │
│   │                                                                          │
│ [5]  Android ButtonEventModule: Parses packet → Emits event MAIN_TAP         │
│   │                                                                          │
│ [6]  ModeStateMachine: Current mode = READ → routes to OCRPipeline           │
│   │                                                                          │
│ [7]  OCRPipeline receives frame:                                             │
│      → Pre-process (deskew, sharpen, contrast)                               │
│      → Run OCR → Extract text blocks                                         │
│      → Layout Classifier: "This is a product label"                          │
│      → Priority formatter: Drug name first, then dosage, then warnings       │
│   │                                                                          │
│ [8]  Response Composer: Builds structured audio script                       │
│      "Paracetamol 500mg. Take 1 to 2 tablets every 4 to 6 hours.             │
│       Warning: Do not exceed 8 tablets in 24 hours."                         │
│   │                                                                          │
│ [9]  AudioOutputModule: Queues as P2, plays via TTS on speaker/headphones    │
│   │                                                                          │
│ [10] User taps BTN_CONTEXT → Audio silences immediately                      │
│   │                                                                          │
│ [11] User double-taps BTN_CONTEXT → "Tell me more"                           │
│      → OCRPipeline re-queries LLM: "Any allergen warnings on this label?"    │
│      → Response: "Contains sodium, not suitable for low-sodium diets."       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
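Steps [5], [6] and [10] of the walkthrough amount to a lookup from (active mode, button, gesture) to a target module. The sketch below is illustrative only: the function name and the action strings are hypothetical, and the mode-to-pipeline table lists only the pairings that follow directly from the mode descriptions. HOME and LEARN are left out because they chain several stages before a pipeline is chosen.

```python
# Assumed pairings, read off the per-mode PIPELINE sections.
PIPELINE_FOR_MODE = {
    "READ": "OCRPipeline",
    "NAVIGATE": "ObjectDetectionPipeline",
    "PEOPLE": "FaceRecognitionPipeline",
    "MONEY": "CurrencyPipeline",
    "DESCRIBE": "VLMPipeline",
}


def route_event(mode, button, gesture):
    """Return (module, action) for a decoded button event, or None."""
    if button == "CONTEXT" and gesture == "TAP":
        # BTN_CONTEXT TAP silences audio in every mode.
        return ("AudioOutputModule", "silence")
    if button == "MAIN" and gesture == "TAP":
        pipeline = PIPELINE_FOR_MODE.get(mode)
        if pipeline is None:
            raise KeyError(f"no pipeline mapped for mode {mode!r}")
        return (pipeline, "process_frame")
    return None  # other gestures are outside this sketch


print(route_event("READ", "MAIN", "TAP"))     # step [6]
print(route_event("READ", "CONTEXT", "TAP"))  # step [10]
```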
┌─────────┬──────────────────────────────────────────────────────────────────┐
│ PHASE   │ WHAT GETS BUILT                                                  │
├─────────┼──────────────────────────────────────────────────────────────────┤
│ PHASE 1 │ UsbHostModule, FrameReceiverModule, ButtonEventModule,           │
│         │ ModeStateMachine (modes 0–2 only), OCRPipeline,                  │
│         │ ObjectDetectionPipeline (basic), AudioOutputModule,              │
│         │ CurrencyPipeline, basic VoiceInput                               │
├─────────┼──────────────────────────────────────────────────────────────────┤
│ PHASE 2 │ Full scene understanding in Mode 0, document layout in Mode 1,   │
│         │ MemoryModule (short-term), TranslationModule, Mode 5 skeleton    │
├─────────┼──────────────────────────────────────────────────────────────────┤
│ PHASE 3 │ FaceRecognitionPipeline, Mode 3 fully live, EmotionClassifier,   │
│         │ BodyLanguageDetector, crowd mapping                              │
├─────────┼──────────────────────────────────────────────────────────────────┤
│ PHASE 4 │ Full Mode 5 (math, handwriting, whiteboard), VLMPipeline cloud,  │
│         │ interactive tutoring conversation loop                           │
├─────────┼──────────────────────────────────────────────────────────────────┤
│ PHASE 5 │ Mode 6 fully live (poetic narration), photo/art/sports modes,    │
│         │ richer audio expressiveness                                      │
├─────────┼──────────────────────────────────────────────────────────────────┤
│ PHASE 6 │ MemoryModule (long-term), PersonalizationProfiles,               │
│         │ CaregiverModule, PredictiveAssistance, haptic/smartwatch link    │
└─────────┴──────────────────────────────────────────────────────────────────┘