Here is the Finalized Project Design Document. This version integrates the core architecture with the critical safety, performance, and self-learning modules (Teach Mode, Caching, Kill Switch) we discussed.
Project Name: "Visual Core" Desktop Automation Assistant¶
Deployment Type: Local Standalone Application (Hybrid Cloud Intelligence)
1. Executive Summary¶
We are building a desktop productivity assistant that floats over professional software (like Femap, Photoshop, Affinity). It accepts Voice (English/Chinese) or Text commands and cursor location to fully understand context.
Key Features: * Smart Context Switching: Automatically identifies the target software and loads its specific icons and manuals. * Dynamic Knowledge (RAG): Ingests PDF manuals (e.g., "Femap 2026 Guide") to learn new workflows on the fly. * Self-Healing Vision: If the app cannot find a specific button (due to UI updates), it enters "Teach Mode," allowing the user to point it out once, after which the app saves the asset and remembers it forever. * Safety & Efficiency: Features a Workflow Cache to repeat common tasks instantly without API costs, and a hardware-level Kill Switch to immediately halt control in emergencies.
2. User Experience Flow¶
- Context Handshake:
- User triggers assistant over "Femap."
- Status: "Target Locked: Femap. Manuals Loaded."
- Input & Cache Check:
- User: "Mesh this surface."
- Fast Path: App checks local Cache. If this exact command was verified before, it executes the stored macro immediately.
- Slow Path: If new, it queries the RAG Engine (Manuals) + Gemini API to formulate a plan.
- Plan Rephrase & Confirmation:
- App: "Plan: Click Mesh Tool > Select Surface. Confirm?"
- User confirms.
- Visual Execution (The "Self-Healing" Loop):
- App scans for "Mesh Tool" icon.
- Scenario A (Found): It clicks and verifies.
- Scenario B (Not Found): App pauses: "I can't find the 'Mesh Tool'. Point to it and press F1."
- Teach Mode: User hovers and presses F1. App captures the icon, saves it to
/assets/femap/, and resumes execution.
- Review & Focus Return:
- Task done. Focus returns to Femap.
- If user rejects result, App performs Undo.
3. Technical Architecture (Local Deployment)¶
The Tech Stack¶
- Language: Python 3.10+
- GUI: PyQt6 (Overlay).
- OS Control: pywin32 / Quartz (Context detection).
- Vision: OpenCV (Template Matching) & MSS (Screen Capture).
- Knowledge: LangChain/FAISS (PDF Manual Indexing).
- AI: Google Gemini API & Speech-to-Text.
- Safety: Keyboard library (Global Hooks).
4. Detailed Feature Specifications¶
A. Smart Context Switching (App Detector)¶
- Logic: Identify active window process (e.g.,
femap.exe). - ROI Optimization: Define "Regions of Interest" per app.
- Example: For Femap, only search for "Tools" in the top 20% of the screen (Toolbar area) to speed up vision processing by 4x.
B. Dynamic Knowledge Loading (RAG)¶
- Ingestion: Settings menu allows PDF upload.
- Retrieval: Searches local vector database for keywords to augment the AI prompt with current software documentation.
C. Self-Healing Visual Engine ("Teach Mode")¶
- Problem: Software updates change icons; Resolution scaling makes icons look different.
- Solution:
- If
vision_enginereturns confidence < 0.8: - Trigger Snipping Tool: Overlay dims the screen.
- User Action: User draws a box around the new button OR hovers and hits a hotkey.
- Asset Save: Image is cropped and saved to
/assets/[app_name]/user_defined/. - Retry: The logic loop restarts immediately using the new asset.
- If
D. Workflow Caching (The "Speed Memory")¶
- Structure: JSON Database (
macros.json). - Key:
Hash(App_Name + User_Command + Context_Type) - Value:
List[Action_Steps] - Logic: Successful executions are saved. Future requests skip the Gemini API call and execute the local list directly.
E. Safety & Rollback¶
- Emergency Kill Switch:
- A separate thread monitors
Ctrl + Alt + Esc. - If pressed: Immediately terminate
PyAutoGUIthread, release all mouse buttons, and reset UI.
- A separate thread monitors
- Undo Stack: Logs every destructive click. Triggers native
Ctrl+Zon the target app if "Review" fails.
5. Deployment Procedures¶
Step 1: Secure API Keys¶
- Use
keyringfor Gemini/Google Cloud keys.
Step 2: Packaging (PyInstaller)¶
- Must include new dependencies for PDF parsing and Caching.
Step 3: Permissions¶
- Run as Administrator (Windows) to allow the "Teach Mode" hook to intercept clicks over other apps.
6. Summary of Work Required (Module Breakdown)¶
To build this, you need 8 distinct Python modules:
overlay_ui.py: The floating interface.app_detector.py: Identifies target app and sets "Regions of Interest" (ROI).knowledge_loader.py: Parses PDFs and handles RAG queries.vision_engine.py: The "Eyes" (OpenCV) with ROI support.snipping_tool.py: (New) Handles the "Teach Mode" / Screen capture for new assets.cache_manager.py: (New) Saves and retrieves verified macro workflows.input_monitor.py: (New) Background thread for the "Kill Switch."main.py: The Controller (Orchestrates the entire loop).