
This is the finalized Project Design Document. It integrates the core architecture with the critical safety, performance, and self-learning modules (Teach Mode, Caching, Kill Switch).


Project Name: "Visual Core" Desktop Automation Assistant

Deployment Type: Local Standalone Application (Hybrid Cloud Intelligence)

1. Executive Summary

We are building a desktop productivity assistant that floats over professional software (such as Femap, Photoshop, or Affinity). It accepts voice (English/Chinese) or text commands and combines them with the current cursor location to understand the context of each request.

Key Features:

  • Smart Context Switching: Automatically identifies the target software and loads its specific icons and manuals.
  • Dynamic Knowledge (RAG): Ingests PDF manuals (e.g., "Femap 2026 Guide") to learn new workflows on the fly.
  • Self-Healing Vision: If the app cannot find a specific button (for example, after a UI update), it enters "Teach Mode." The user points the button out once, and the app saves the asset and reuses it from then on.
  • Safety & Efficiency: A Workflow Cache repeats common tasks instantly without API costs, and a global-hotkey Kill Switch immediately halts control in emergencies.

2. User Experience Flow

  1. Context Handshake:
    • User triggers assistant over "Femap."
    • Status: "Target Locked: Femap. Manuals Loaded."
  2. Input & Cache Check:
    • User: "Mesh this surface."
    • Fast Path: App checks local Cache. If this exact command was verified before, it executes the stored macro immediately.
    • Slow Path: If new, it queries the RAG Engine (Manuals) + Gemini API to formulate a plan.
  3. Plan Rephrase & Confirmation:
    • App: "Plan: Click Mesh Tool > Select Surface. Confirm?"
    • User confirms.
  4. Visual Execution (The "Self-Healing" Loop):
    • App scans for "Mesh Tool" icon.
    • Scenario A (Found): It clicks and verifies.
    • Scenario B (Not Found): App pauses: "I can't find the 'Mesh Tool'. Point to it and press F1."
    • Teach Mode: User hovers and presses F1. App captures the icon, saves it to /assets/femap/user_defined/, and resumes execution.
  5. Review & Focus Return:
    • Task done. Focus returns to Femap.
    • If user rejects result, App performs Undo.
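
The flow above can be sketched as one controller function. This is only a minimal sketch: the callables `plan_with_ai`, `confirm`, `execute`, and `undo` are hypothetical stand-ins for the RAG/Gemini planner, the confirmation prompt, the visual executor (returning True when the user accepts the result), and the undo step. None of these names come from the design itself.

```python
from typing import Callable, Dict, List

Plan = List[str]  # a plan is an ordered list of action descriptions


def handle_command(
    command: str,
    cache: Dict[str, Plan],
    plan_with_ai: Callable[[str], Plan],
    confirm: Callable[[Plan], bool],
    execute: Callable[[Plan], bool],
    undo: Callable[[], None],
) -> str:
    """Run one command through the fast path (cache) or slow path (AI)."""
    plan = cache.get(command)
    if plan is not None:
        # Fast path: a verified macro exists, so the AI call is skipped.
        source = "cache"
    else:
        # Slow path: build a plan, then ask the user to confirm it.
        plan = plan_with_ai(command)
        source = "ai"
        if not confirm(plan):
            return "cancelled"
    if execute(plan):
        cache[command] = plan  # only accepted runs are stored
        return f"done ({source})"
    undo()  # the user rejected the result during review
    return "undone"
```

Passing the collaborators in as arguments keeps the control flow testable without a screen, a microphone, or network access.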

3. Technical Architecture (Local Deployment)

The Tech Stack

  • Language: Python 3.10+
  • GUI: PyQt6 (Overlay).
  • OS Control: pywin32 / Quartz (Context detection); PyAutoGUI (mouse and keyboard control, see Section 4E).
  • Vision: OpenCV (Template Matching) & MSS (Screen Capture).
  • Knowledge: LangChain/FAISS (PDF Manual Indexing).
  • AI: Google Gemini API & Speech-to-Text.
  • Safety: keyboard library (Global Hooks).

4. Detailed Feature Specifications

A. Smart Context Switching (App Detector)

  • Logic: Identify active window process (e.g., femap.exe).
  • ROI Optimization: Define "Regions of Interest" per app.
    • Example: For Femap, search for "Tools" only in the top 20% of the screen (the toolbar area). This shrinks the search area to about one fifth of the screen, so vision processing has proportionally less to scan.
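
The ROI idea can be illustrated with a small lookup that turns per-app screen fractions into pixel regions. This is a sketch under assumptions: the table `ROI_FRACTIONS`, the region name "toolbar", and both function names are hypothetical; only the "top 20% for Femap" value comes from the example above.

```python
from typing import Dict, Tuple

Region = Tuple[int, int, int, int]  # left, top, width, height in pixels
Fractions = Tuple[float, float, float, float]  # x, y, w, h as screen fractions

# Hypothetical per-app table, keyed by process name, then by region name.
ROI_FRACTIONS: Dict[str, Dict[str, Fractions]] = {
    "femap.exe": {"toolbar": (0.0, 0.0, 1.0, 0.2)},  # top 20% of the screen
}

FULL_SCREEN: Fractions = (0.0, 0.0, 1.0, 1.0)


def roi_pixels(process: str, name: str, screen_w: int, screen_h: int) -> Region:
    """Convert a fractional ROI to pixels; unknown apps fall back to full screen."""
    fx, fy, fw, fh = ROI_FRACTIONS.get(process, {}).get(name, FULL_SCREEN)
    return (
        int(fx * screen_w),
        int(fy * screen_h),
        int(fw * screen_w),
        int(fh * screen_h),
    )


def area_ratio(region: Region, screen_w: int, screen_h: int) -> float:
    """Fraction of the full screen that the region covers."""
    return (region[2] * region[3]) / (screen_w * screen_h)
```

On a 1920x1080 display, `roi_pixels("femap.exe", "toolbar", 1920, 1080)` covers one fifth of the screen. The tuple maps directly onto the left/top/width/height region that a screen-capture call would take.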

B. Dynamic Knowledge Loading (RAG)

  • Ingestion: Settings menu allows PDF upload.
  • Retrieval: Searches the local vector database for manual passages relevant to the user's command, then adds the best matches to the AI prompt so the plan reflects the current software documentation.
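
The retrieve-then-augment step can be sketched with the standard library alone. Note the substitution: the design calls for a FAISS vector index, whereas this sketch ranks chunks by cosine similarity over plain word counts as a stand-in, so it shows the shape of the step rather than the real retrieval quality. The function names and the prompt layout are assumptions.

```python
import re
from collections import Counter
from math import sqrt
from typing import List


def _vec(text: str) -> Counter:
    """Bag-of-words vector (a stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k manual chunks most similar to the query."""
    q = _vec(query)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vec(c)), reverse=True)
    return ranked[:k]


def build_prompt(command: str, chunks: List[str], k: int = 2) -> str:
    """Prepend the retrieved excerpts to the user's command."""
    context = "\n".join(f"- {c}" for c in retrieve(command, chunks, k))
    return f"Manual excerpts:\n{context}\n\nUser command: {command}"
```

In a fuller build, `retrieve` would be replaced by a similarity search against the indexed PDF chunks, while `build_prompt` could stay much the same.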

C. Self-Healing Visual Engine ("Teach Mode")

  • Problem: Software updates change icons; Resolution scaling makes icons look different.
  • Solution:
    • Trigger: vision_engine returns a match confidence below 0.8.
    • Snipping Tool: The overlay dims the screen.
    • User Action: The user draws a box around the new button, or hovers over it and presses the hotkey (F1).
    • Asset Save: The image is cropped and saved to /assets/[app_name]/user_defined/.
    • Retry: The search loop restarts immediately using the new asset.
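
The fallback loop above can be sketched independently of the vision code. This is a minimal sketch, assuming a hypothetical `match` callable that returns a confidence and a click position (in practice it might wrap OpenCV template matching) and a hypothetical `teach` callable that runs the snipping step and saves the asset. The 0.8 threshold and the user_defined folder come from the spec; every other name is illustrative.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.8  # from the spec above

Position = Tuple[int, int]
Match = Tuple[float, Position]  # (confidence, click position)


def user_asset_path(root: Path, app_name: str, icon_name: str) -> Path:
    """Where a user-taught icon would be stored (file naming is an assumption)."""
    safe = icon_name.lower().replace(" ", "_")
    return root / app_name / "user_defined" / f"{safe}.png"


def locate_or_teach(
    icon_name: str,
    match: Callable[[str], Match],
    teach: Callable[[str], None],
    max_teach_attempts: int = 1,
) -> Optional[Position]:
    """Find an icon; on low confidence, ask the user to teach it, then retry."""
    for attempt in range(max_teach_attempts + 1):
        confidence, position = match(icon_name)
        if confidence >= CONFIDENCE_THRESHOLD:
            return position
        if attempt < max_teach_attempts:
            teach(icon_name)  # user points out the icon; the asset is saved
    return None  # still not found after teaching
```

Capping the number of teach attempts keeps the loop from prompting the user indefinitely if the freshly captured asset still fails to match.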

D. Workflow Caching (The "Speed Memory")

  • Structure: JSON Database (macros.json).
  • Key: Hash(App_Name + User_Command + Context_Type)
  • Value: List[Action_Steps]
  • Logic: Successful executions are saved. Future requests skip the Gemini API call and execute the local list directly.
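
A possible shape for the cache, sketched with the standard library. The spec does not name a hash function or a normalization rule, so SHA-256 and the lowercase/trim normalization below are assumptions, as are the class and method names.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List, Optional


def cache_key(app_name: str, command: str, context_type: str) -> str:
    """Hash(App_Name + User_Command + Context_Type), with light normalization."""
    parts = [
        app_name.strip().lower(),
        command.strip().lower().rstrip("."),  # assumption: ignore case and final period
        context_type.strip().lower(),
    ]
    # A separator character avoids collisions such as "ab"+"c" vs "a"+"bc".
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class MacroCache:
    """Tiny JSON-backed store mapping cache keys to lists of action steps."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._data: Dict[str, List[str]] = {}
        if path.exists():
            self._data = json.loads(path.read_text(encoding="utf-8"))

    def get(self, key: str) -> Optional[List[str]]:
        return self._data.get(key)

    def put(self, key: str, steps: List[str]) -> None:
        """Store a verified workflow and persist the whole file."""
        self._data[key] = steps
        self.path.write_text(json.dumps(self._data, indent=2), encoding="utf-8")
```

Typical use would be to call `get` before contacting the AI and `put` only after the user accepts the result, so that unverified plans never reach macros.json.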

E. Safety & Rollback

  • Emergency Kill Switch:
    • A separate thread monitors Ctrl + Alt + Esc.
    • If pressed: Signal the automation (PyAutoGUI) thread to stop at once, release all mouse buttons, and reset the UI.
  • Undo Stack: Logs every destructive click. Triggers native Ctrl+Z on the target app if "Review" fails.
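
Because a Python thread cannot be forcibly killed from outside, one common way to build the Kill Switch is a shared stop flag that the automation loop checks before every step. The sketch below assumes that approach; the function names and the `release_inputs` callable are hypothetical, and the hotkey wiring is shown only as a comment because it needs the third-party keyboard library and a real desktop session.

```python
import threading
from typing import Callable, List

stop_event = threading.Event()  # shared flag between hotkey thread and worker


def on_kill_hotkey() -> None:
    """Hotkey callback: only sets a flag, so it returns immediately."""
    stop_event.set()


# Assumed wiring with the keyboard library (not executed in this sketch):
#   keyboard.add_hotkey("ctrl+alt+esc", on_kill_hotkey)


def run_steps(steps: List[Callable[[], None]], release_inputs: Callable[[], None]) -> int:
    """Run steps until finished or the kill switch fires; return steps completed."""
    done = 0
    try:
        for step in steps:
            if stop_event.is_set():
                break  # stop before starting another action
            step()
            done += 1
    finally:
        # Always runs, even on errors: release mouse buttons, reset the UI.
        release_inputs()
    return done
```

The flag is checked between steps, so a long-running single step would delay the stop. Keeping each step small (one click or one keystroke) keeps the reaction time short.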

5. Deployment Procedures

Step 1: Secure API Keys

  • Use keyring for Gemini/Google Cloud keys.
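
A minimal sketch of reading the key at startup, assuming the keyring package's `get_password(service, username)` call. The service and entry names are hypothetical, and the lookup function is injectable so the logic can be exercised without touching the real OS credential store.

```python
from typing import Callable, Optional

SERVICE_NAME = "VisualCoreAI"  # hypothetical service label
KEY_NAME = "gemini_api_key"    # hypothetical entry name

Lookup = Callable[[str, str], Optional[str]]


def load_api_key(get_password: Optional[Lookup] = None) -> str:
    """Fetch the API key from the OS credential store, or fail loudly."""
    if get_password is None:
        import keyring  # third-party; imported lazily so tests can skip it

        get_password = keyring.get_password
    key = get_password(SERVICE_NAME, KEY_NAME)
    if not key:
        raise RuntimeError(
            f"No API key stored for {SERVICE_NAME}/{KEY_NAME}; "
            "save one in the Settings menu first."
        )
    return key
```

The key would be written once, for example from a settings dialog, with `keyring.set_password(SERVICE_NAME, KEY_NAME, value)`, so it never appears in source files or in the packaged build.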

Step 2: Packaging (PyInstaller)

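  • Must include the new dependencies for PDF parsing and caching. The command below uses the Windows form of --add-data (";" separator); on macOS or Linux the separator is ":". The trailing backslashes are POSIX-shell line continuations, so in Windows cmd use ^ instead or put the command on a single line.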
    pyinstaller --noconfirm --onedir --windowed --name "VisualCoreAI" \
    --add-data "assets;assets" \
    --hidden-import "langchain" \
    --hidden-import "faiss" \
    --hidden-import "keyboard" \
    --icon "app_icon.ico" \
    main.py
    

Step 3: Permissions

  • Run as Administrator (Windows) to allow the "Teach Mode" hook to intercept clicks over other apps.

6. Summary of Work Required (Module Breakdown)

To build this, you need 8 distinct Python modules:

  1. overlay_ui.py: The floating interface.
  2. app_detector.py: Identifies target app and sets "Regions of Interest" (ROI).
  3. knowledge_loader.py: Parses PDFs and handles RAG queries.
  4. vision_engine.py: The "Eyes" (OpenCV) with ROI support.
  5. snipping_tool.py: (New) Handles the "Teach Mode" / Screen capture for new assets.
  6. cache_manager.py: (New) Saves and retrieves verified macro workflows.
  7. input_monitor.py: (New) Background thread for the "Kill Switch."
  8. main.py: The Controller (Orchestrates the entire loop).