Building the Demo System

Based on the research you helped, i plan to go this approach. 1. build a sample Embedding vector database based on text from a sample book. 2. Build an agent (see 3) to inquire knowledge. 3. convert the output from AI (local LLM on an apple macbook pro on local network) to audio file and play automatically. All is completed on win11/wsl2 system. 4. Add user audio input function

This is a clear, phased, and technically sound plan. It correctly prioritizes proving the core RAG and voice components before moving to the complex mobile deployment.

Here is a breakdown of your plan, the recommended tools for each step within your Win11/WSL2 environment, and key considerations for success.

Phase 1: Building the Sample Embedding Vector Database (RAG Foundation)¶

This is the data preparation and retrieval phase.

Task	Recommended Tools (WSL/Python)	Key Steps
1. Data Loading & Chunking	LangChain or LlamaIndex + PyPDFLoader (or similar file loaders).	Load the sample book (PDF/TXT), then split the text into meaningful chunks (e.g., 500 characters with 100-character overlap).
2. Local Embedding	Sentence-Transformers library + a model like `bge-small-en-v1.5` or `all-MiniLM-L6-v2`.	Run the model to convert every chunk of text into a vector (a list of numbers). This must be done on your Mac/WSL, not the final mobile model.
3. Local Vector DB Storage	Faiss (for speed) or ChromaDB (for easier setup and query).	Store the vectors and their corresponding original text chunks into the database index file.
4. Retrieval Test	Custom Python script.	Write a script that takes a question, converts the question to a vector (using the same model), queries the index, and returns the top 3 text chunks.
Goal: A fast, reliable script that can find the most relevant text from your book based on a question.

Phase 2: Core AI Agent Implementation and Audio Output (LLM & TTS)¶

This is the intelligence and voice output phase.

Task	Recommended Tools (WSL/Python)	Key Steps
1. Local LLM Setup	`llama-cpp-python` (installed inside your WSL environment).	Load a small LLM (e.g., Llama 3 8B or a smaller Qwen-VL model) in GGUF format for desktop testing.
2. RAG Agent Creation	Custom Python script orchestrating the flow.	The Triple-Step Flow: a) Prompt Template: Create the template that embeds the retrieved context chunks (from Phase 1) into the prompt. b) Inference: Run the combined prompt through the local LLM (`llama-cpp-python`).
3. Text-to-Speech (TTS) Conversion	MeloTTS or Coqui TTS (CPU mode).	The most difficult part of this phase: The generated text is passed to the TTS library. The TTS library generates a `.wav` or raw audio buffer.
4. Automatic Audio Playback	`pyaudio` or `soundfile` + `playsound` library.	Play the generated audio file automatically through the WSL audio bridge to your Windows speakers.
Goal: A single Python script that takes a question, runs RAG, generates the answer text, and speaks the answer automatically.

Key Considerations for Your Setup¶

1. The Local Network Bridge (Win11/WSL2 to MacBook Pro)¶

You mentioned running the local LLM on a MacBook Pro on the local network. This is an alternative, but unnecessary, complexity given your plan.

Recommendation: Try to run the LLM directly inside your WSL2 environment first. Your Win11 machine likely has comparable or superior specs (especially GPU access) to the MacBook Pro, which will simplify the setup immensely.
- If you must use the MacBook: You would use an HTTP API (e.g., llama-cpp-python's server mode) running on the MacBook. Your WSL script would then become an HTTP client sending the prompt to the MacBook's IP and port.

2. STT (Speech-to-Text) Omission¶

Your current plan skips the user's voice question. To truly prove the whole system, you should add a simple STT step (as per the earlier advice):

New Step: Before Phase 2, integrate Vosk to take a microphone input and convert it to text.
- Rationale: This proves the entire input chain, which is critical for a hands-free app.

3. Latency Check¶

As you execute Phase 2, measure the time taken for the full RAG/LLM/TTS process. If the time-to-audio is above 8 seconds, you will need to re-quantize your GGUF model (to a smaller, faster setting) or look for a smaller VLM.

Setup LLM on Local Network for Dedicated Resources¶

That is a solid approach for leveraging the dedicated resources of your MacBook Pro while keeping your development workflow anchored in WSL2. Using a remote HTTP API for the LLM is a standard practice in distributed AI development.

Here is the detailed method for setting up and using the llama-cpp-python server on your MacBook Pro, and how to access it from your WSL2 environment.

Part 1: Setting up the LLM Server on the MacBook Pro¶

The goal here is to turn your MacBook into a simple, headless inference API service.

1. Prerequisite Setup (MacBook)¶

Install necessary tools: Ensure Python 3, pip, and Homebrew (for dependencies) are installed.
Install llama-cpp-python: This library has a built-in server mode designed to mimic the OpenAI API structure, which is very convenient.
```
# On the MacBook terminal
pip install llama-cpp-python
```
Get the GGUF Model: Download or copy your GGUF-quantized Qwen or Llama model file (e.g., model.gguf) to a known path on the MacBook.

2. Launch the API Server (MacBook)¶

The llama-cpp-python package includes a command-line tool to run the server.

Command: Use the --host 0.0.0.0 flag to allow connections from your WSL machine (or any machine on the local network).

# On the MacBook terminal
# Replace /path/to/your/model.gguf with your actual file path
# --n_gpu_layers (optional) is crucial if you want to use the Apple Silicon GPU (M-series)
# The default port is 8000, which we will use.
python3 -m llama_cpp.server --model /path/to/your/model.gguf --host 0.0.0.0 --port 8000

Verification: After launching, you should see output indicating that the server is running, likely reporting: Uvicorn running on http://0.0.0.0:8000

3. Note the MacBook's Local IP Address¶

You need the IP address of your MacBook on your home network for the WSL instance to connect.

Find IP: On the MacBook, open System Settings \(\rightarrow\) Network and note the local IP (e.g., 192.168.1.105).

Part 2: Connecting to the API from WSL2 (The Client)¶

The goal here is for your RAG script in WSL2 to send the enriched prompt to the MacBook and get the answer back.

1. The Client Tool (WSL2)¶

You will use the standard requests library in your WSL Python environment, as the llama-cpp-python server uses standard JSON/HTTP protocols.

# Inside your WSL terminal and Python environment
pip install requests

2. The Python Client Script (WSL2 RAG Agent)¶

Your Python RAG agent script will perform the retrieval, format the prompt, and then send the request to the MacBook.

import requests
import json

# 1. Configuration (Use the IP from your MacBook)
MACBOOK_IP = "192.168.1.105"  # <-- IMPORTANT: Change this to your MacBook's actual IP
API_ENDPOINT = f"http://{MACBOOK_IP}:8000/v1/completions"

def get_llm_response_from_macbook(enriched_prompt):
    headers = {
        "Content-Type": "application/json"
    }

    # The request body mimics the standard OpenAI Chat Completion API format
    data = {
        "prompt": enriched_prompt,
        "max_tokens": 512,  # Set max output length
        "temperature": 0.5, # Adjust for creativity
        "stop": ["\nUser:", "Observation:"] # Define stop sequences
    }

    try:
        response = requests.post(API_ENDPOINT, headers=headers, data=json.dumps(data))
        response.raise_for_status() # Raises HTTPError for bad responses

        # Parse the JSON response
        result = response.json()

        # Extract the generated text
        llm_text = result['choices'][0]['text']
        return llm_text.strip()

    except requests.exceptions.RequestException as e:
        print(f"Error connecting to MacBook LLM server: {e}")
        return "Sorry, the AI server is offline or unreachable."

# --- Example Usage in your RAG Agent ---

# 1. (RAG Phase): You retrieve your context chunks here
context_chunks = "Context: The Venus de Milo was found on April 8, 1820..."

# 2. (Prompt Engineering): Create the final prompt
user_question = "When was the statue found?"
enriched_prompt = f"Based on the following context: {context_chunks}\n\nAnswer the user's question: {user_question}"

# 3. (LLM Inference): Call the remote API
answer = get_llm_response_from_macbook(enriched_prompt)

# 4. (TTS Phase): Pass the answer to your local TTS model
# print("LLM Answer:", answer)
# play_audio(answer) # <-- Your next step

Key Pay-Attention Points¶

Firewall (Crucial): The most common failure point is the MacBook's firewall blocking incoming connections on port 8000. You must ensure port 8000 is open to incoming connections on your MacBook.
IP Address Changes: If your MacBook uses DHCP (standard), its local IP (192.168.x.x) can change when it connects to a new Wi-Fi network or restarts. You'll need to update the MACBOOK_IP in your WSL script if this happens.
No Internet Required: This entire process uses your local Wi-Fi network only (MacBook \(\leftrightarrow\) Router \(\leftrightarrow\) Windows PC/WSL2). No external internet is needed, preserving the "offline" core of your design.
OpenAI Format: The endpoint is /v1/completions, which is the standard LLM API format. This makes it easy to switch to a cloud-based service later (like OpenAI or Gemini) if you ever choose to do so, by simply changing the API_ENDPOINT and the API key.