
Hybrid LLM Method at Ingestion 2

Here is the Mega-Prompt to build the optimized 2-Pass Pipeline (The "Miner & Manager" method) using Google Antigravity.

This method is faster and creates a cleaner final graph on your M1 Max.


Copy and paste this into Google Antigravity:

Act as a Senior AI Data Engineer. I am building the Ingestion Pipeline for "The Professor" using Python, FalkorDBLite, and Ollama.

We will use a **2-Pass Architecture** to optimize for Apple Silicon hardware:
1. **Pass 1 (The Miner):** High-speed extraction using `qwen2.5:7b` on small text chunks.
2. **Pass 2 (The Manager):** Graph refactoring and organization using `qwen2.5:72b` (or 32b) on the extracted node list.

### FILE 1: `prompts.py`
Create two distinct system prompts:

**1. `get_miner_prompt(chunk_text)` (For Qwen-7B)**
   - **Role:** Data Entry Clerk.
   - **Task:** Analyze the text chunk (approx 1000 words).
   - **Requirements:**
     1. Create a `(:Chunk)` node containing the raw text.
     2. Extract technical `(:Concept)` nodes found in the text.
     3. Create edges: `(:Chunk)-[:EXPLAINS]->(:Concept)`.
     4. If clear relationships exist between concepts, create `(:Concept)-[:RELATED_TO]->(:Concept)`.
   - **Output:** Strict JSON containing nodes and edges. Do NOT filter or organize. Just extract.

**2. `get_manager_prompt(concept_list)` (For Qwen-72B)**
   - **Role:** Curriculum Architect.
   - **Task:** You will receive a raw list of Concept names extracted from the book. You must clean and structure them.
   - **Actions:**
     1. **Merge Synonyms:** Identify duplicates (e.g., "Newton's Second Law" vs "F=ma") and generate `MERGE` logic.
     2. **Build Hierarchy:** Identify logical dependencies and generate `(:Concept)-[:PREREQUISITE]->(:Concept)` edges.
     3. **Garbage Collection:** Identify generic/useless nodes (e.g., "Introduction", "Chapter 4") and generate `DETACH DELETE` logic.
   - **Output:** A list of raw Cypher commands to execute.

### FILE 2: `ingest_pass1.py` (The Miner)
- **Libraries:** `pymupdf` (for reading the PDF), `langchain.text_splitter` (`RecursiveCharacterTextSplitter`).
- **Chunking Logic:**
  - Use `RecursiveCharacterTextSplitter`.
  - **Chunk Size:** 1000 words. The splitter counts characters by default, so set `chunk_size=6000` (approx 6 characters per word).
  - **Chunk Overlap:** 100 words (set `chunk_overlap=600`). *Critical to ensure concepts that straddle a chunk boundary are not lost.*
- **Ingestion Loop:**
  1. Iterate through chunks.
  2. Embed text with `nomic-embed-text`.
  3. Send to Qwen-7B.
  4. Parse JSON.
  5. Save `(:Chunk)` with vector and `(:Concept)` nodes to FalkorDBLite.

### FILE 3: `refine_pass2.py` (The Manager)
- **Library:** `ollama` (model='qwen2.5:72b').
- **Logic:**
  1. Query FalkorDBLite: `MATCH (c:Concept) RETURN c.name`.
  2. Batch these names (groups of 50-100) and send each batch to the LLM with `get_manager_prompt`.
  3. Receive the Cypher commands back from the LLM.
  4. Execute the Cypher against FalkorDBLite to restructure the graph.

### EXECUTION INSTRUCTIONS
- Generate `prompts.py` first.
- Generate `ingest_pass1.py` with robust JSON error handling (if 7B outputs bad JSON, skip or retry).
- Generate `refine_pass2.py`.
- Ensure all database connections use `FalkorDBLite`.
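
For reference when reviewing what the tool generates (this is not part of the prompt to paste): the "skip or retry" handling for bad 7B output could look roughly like the sketch below. It is a minimal illustration, not the code Antigravity will produce. The function names, the three-attempt limit, and the assumed reply shape (an object with `nodes` and `edges` lists) are all hypothetical, and `generate` stands in for whatever function actually calls Ollama.

```python
import json
from typing import Callable, Optional


def parse_miner_output(raw: str) -> Optional[dict]:
    """Return the parsed graph dict, or None if the reply is unusable."""
    # Small models often wrap JSON in prose or code fences, so keep only
    # the span from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Assumed schema: an object with list-valued "nodes" and "edges".
    if not isinstance(data, dict):
        return None
    if not isinstance(data.get("nodes"), list):
        return None
    if not isinstance(data.get("edges"), list):
        return None
    return data


def mine_chunk(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the model up to max_attempts times; None means skip the chunk."""
    for _ in range(max_attempts):
        parsed = parse_miner_output(generate(prompt))
        if parsed is not None:
            return parsed
    return None
```

A chunk that returns `None` can still be saved as a `(:Chunk)` node with its embedding, so the author's text is kept even when concept extraction fails for that chunk.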

How this Architecture Protects "Original Meaning"

Since you were concerned about losing the book's original thought:

  1. The Anchor (Pass 1): The 7B model creates a hard link: (:Chunk {text: "Original Author's Words"})-[:EXPLAINS]->(:Concept).
    • Pass 2 only receives and restructures Concept names, so no matter how the 72B model later reorganizes the hierarchy, the author's raw text stays untouched in the (:Chunk) nodes.
  2. The Lesson: When "The Professor" teaches you later, it will run a vector search over the (:Chunk) nodes, so it can quote the author directly rather than hallucinating a generic definition.
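
The anchor described in point 1 can be sketched as a single parameterized Cypher statement. This is one possible shape under stated assumptions, not necessarily what the generated `ingest_pass1.py` will contain: the `id` property on `(:Chunk)`, the `build_anchor` helper, and the parameter names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# One parameterized statement: upsert the chunk, then link it to each concept.
# The `id` property is an assumption; the prompt only requires the raw text.
ANCHOR_QUERY = (
    "MERGE (k:Chunk {id: $chunk_id}) "
    "SET k.text = $text "
    "WITH k "
    "UNWIND $concepts AS name "
    "MERGE (c:Concept {name: name}) "
    "MERGE (k)-[:EXPLAINS]->(c)"
)


def build_anchor(
    chunk_id: str,
    text: str,
    concepts: List[str],
) -> Tuple[str, Dict[str, Any]]:
    """Return (query, params) that tie one chunk's raw text to its concepts."""
    # Trim whitespace, drop blanks and duplicates, keep first-seen order.
    cleaned = list(dict.fromkeys(c.strip() for c in concepts if c.strip()))
    params = {"chunk_id": chunk_id, "text": text, "concepts": cleaned}
    return ANCHOR_QUERY, params
```

Passing the text as a query parameter instead of pasting it into the Cypher string stores the author's words exactly as extracted and avoids quoting problems. How the pair is executed depends on the client. A call of the form `graph.query(query, params)` is assumed here and should be checked against the FalkorDBLite documentation.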