Hybrid LLM Method at Ingestion 2
Here is the Mega-Prompt to build the optimized 2-Pass Pipeline (The "Miner & Manager" method) using Google Antigravity.
This method is faster and creates a cleaner final graph on your M1 Max.
Copy and paste this into Google Antigravity:
Act as a Senior AI Data Engineer. I am building the Ingestion Pipeline for "The Professor" using Python, FalkorDBLite, and Ollama.
We will use a **2-Pass Architecture** to optimize for Apple Silicon hardware:
1. **Pass 1 (The Miner):** High-speed extraction using `qwen2.5:7b` on small text chunks.
2. **Pass 2 (The Manager):** Graph refactoring and organization using `qwen2.5:72b` (or 32b) on the extracted node list.
### FILE 1: `prompts.py`
Create two distinct system prompts:
**1. `get_miner_prompt(chunk_text)` (For Qwen-7B)**
- **Role:** Data Entry Clerk.
- **Task:** Analyze the text chunk (approx 1000 words).
- **Requirements:**
1. Create a `(:Chunk)` node containing the raw text.
2. Extract technical `(:Concept)` nodes found in the text.
3. Create edges: `(:Chunk)-[:EXPLAINS]->(:Concept)`.
4. If clear relationships exist between concepts, create `(:Concept)-[:RELATED_TO]->(:Concept)`.
- **Output:** Strict JSON containing nodes and edges. Do NOT filter or organize. Just extract.
**2. `get_manager_prompt(concept_list)` (For Qwen-72B)**
- **Role:** Curriculum Architect.
- **Task:** You will receive a raw list of Concept names extracted from the book. You must clean and structure them.
- **Actions:**
1. **Merge Synonyms:** Identify duplicates (e.g., "Newton's Second Law" vs "F=ma") and generate `MERGE` logic.
2. **Build Hierarchy:** Identify logical dependencies and generate `(:Concept)-[:PREREQUISITE]->(:Concept)` edges.
3. **Garbage Collection:** Identify generic/useless nodes (e.g., "Introduction", "Chapter 4") and generate `DETACH DELETE` logic.
- **Output:** A list of raw CYPHER commands to execute.
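For reference, a minimal sketch of what `prompts.py` might look like under the requirements above. Only the two function names come from this spec. The JSON key names (`nodes`, `edges`, `label`, `source`, `target`) and the exact wording of the prompts are illustrative assumptions.

```python
import json

# Illustrative output shape for the Miner. The key names below are
# assumptions for this sketch, not a schema fixed by the pipeline.
MINER_OUTPUT_EXAMPLE = {
    "nodes": [
        {"label": "Chunk", "id": "chunk-1", "text": "<raw chunk text>"},
        {"label": "Concept", "name": "Newton's Second Law"},
    ],
    "edges": [
        {"type": "EXPLAINS", "source": "chunk-1", "target": "Newton's Second Law"},
    ],
}


def get_miner_prompt(chunk_text: str) -> str:
    """Build the Pass 1 prompt: extract everything, filter nothing."""
    return (
        "You are a Data Entry Clerk. Analyze the text chunk below.\n"
        "1. Create a (:Chunk) node containing the raw text.\n"
        "2. Extract technical (:Concept) nodes found in the text.\n"
        "3. Create edges: (:Chunk)-[:EXPLAINS]->(:Concept).\n"
        "4. If clear relationships exist between concepts, create "
        "(:Concept)-[:RELATED_TO]->(:Concept).\n"
        "Return strict JSON only, in this shape:\n"
        f"{json.dumps(MINER_OUTPUT_EXAMPLE, indent=2)}\n"
        "Do NOT filter or organize. Just extract.\n\n"
        f"TEXT CHUNK:\n{chunk_text}"
    )


def get_manager_prompt(concept_list: list[str]) -> str:
    """Build the Pass 2 prompt from a batch of raw Concept names."""
    names = "\n".join(f"- {name}" for name in concept_list)
    return (
        "You are a Curriculum Architect. Clean and structure these Concept names.\n"
        "1. Merge synonyms and generate MERGE logic.\n"
        "2. Generate (:Concept)-[:PREREQUISITE]->(:Concept) edges "
        "for logical dependencies.\n"
        "3. Generate DETACH DELETE logic for generic or useless nodes.\n"
        "Output one raw Cypher command per line and nothing else.\n\n"
        f"CONCEPTS:\n{names}"
    )
```

Embedding a concrete JSON example in the Miner prompt is one way to nudge a 7B model toward consistent output. Adjust the shape to whatever `ingest_pass1.py` ends up parsing.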
### FILE 2: `ingest_pass1.py` (The Miner)
- **Libraries:** `pymupdf` (for reading PDFs), `langchain.text_splitter` (`RecursiveCharacterTextSplitter`; newer LangChain releases expose it from `langchain_text_splitters`).
- **Chunking Logic:**
- Use `RecursiveCharacterTextSplitter`.
- **Chunk Size:** approx. 1000 words. The splitter counts characters by default, so set `chunk_size=6000`.
- **Chunk Overlap:** approx. 100 words (`chunk_overlap=600`). *Critical to ensure concepts split across chunk boundaries are not lost.*
- **Ingestion Loop:**
1. Iterate through chunks.
2. Embed text with `nomic-embed-text`.
3. Send to Qwen-7B.
4. Parse JSON.
5. Save `(:Chunk)` with vector and `(:Concept)` nodes to FalkorDBLite.
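To make the chunking and loop requirements concrete, here is a simplified sketch. It uses a plain sliding character window as a stand-in for `RecursiveCharacterTextSplitter` (which additionally tries to break on paragraph and sentence separators), and the `embed`, `extract`, and `save` callables are hypothetical placeholders for the `nomic-embed-text` call, the Qwen-7B call, and the FalkorDBLite writer.

```python
def chunk_text(text: str, chunk_size: int = 6000, overlap: int = 600) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def ingest(chunks, embed, extract, save) -> int:
    """Run the ingestion loop and return how many chunks were saved.

    embed(chunk)   -> vector          (hypothetical: nomic-embed-text)
    extract(chunk) -> dict or None    (hypothetical: Qwen-7B plus JSON parsing)
    save(chunk, vector, graph)        (hypothetical: FalkorDBLite writer)
    """
    saved = 0
    for chunk in chunks:
        vector = embed(chunk)
        graph = extract(chunk)
        if graph is None:
            continue  # bad JSON from the model: skip this chunk
        save(chunk, vector, graph)
        saved += 1
    return saved
```

With the default sizes, each window starts 5400 characters after the previous one, so the last 600 characters of one chunk reappear at the start of the next.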
### FILE 3: `refine_pass2.py` (The Manager)
- **Library:** `ollama` (model='qwen2.5:72b').
- **Logic:**
1. Query FalkorDB: `MATCH (c:Concept) RETURN c.name`.
2. Batch these names (groups of 50-100) and send to the LLM with `get_manager_prompt`.
3. Receive the Cypher commands back from the LLM.
4. Execute the Cypher against FalkorDB to restructure the graph.
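The batching step, plus an optional guard before executing model-generated Cypher, could be sketched as follows. The helper names `batch_names` and `is_allowed_command` are hypothetical, and the keyword allowlist is an assumption about what the Manager should be permitted to run, not part of the spec above.

```python
from typing import Iterator


def batch_names(names: list[str], batch_size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `batch_size` concept names."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(names), batch_size):
        yield names[i:i + batch_size]


# Assumption: Manager commands should only start with these clauses.
ALLOWED_PREFIXES = ("MATCH", "MERGE")


def is_allowed_command(command: str) -> bool:
    """Return True if the command starts with an allowlisted Cypher clause."""
    return command.strip().upper().startswith(ALLOWED_PREFIXES)
```

A prefix check like this is deliberately crude. It will not catch every unwanted statement, but it gives the script a place to log and skip output that is clearly not a graph refactoring command.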
### EXECUTION INSTRUCTIONS
- Generate `prompts.py` first.
- Generate `ingest_pass1.py` with robust JSON error handling (if 7B outputs bad JSON, skip or retry).
- Generate `refine_pass2.py`.
- Ensure all database connections use `FalkorDBLite`.
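For the "skip or retry" JSON handling, one possible shape is sketched below. `call_model` is a hypothetical callable that sends a prompt to Qwen-7B and returns its raw text reply, and the required `nodes` and `edges` keys are an assumed schema.

```python
import json


def parse_miner_json(raw: str):
    """Return the parsed dict, or None if the reply is not usable JSON.

    Slicing from the first "{" to the last "}" tolerates code fences or
    chatter that small models sometimes wrap around the JSON.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Assumed schema: a dict with both "nodes" and "edges" keys.
    if not isinstance(data, dict) or "nodes" not in data or "edges" not in data:
        return None
    return data


def extract_with_retry(call_model, prompt: str, max_attempts: int = 3):
    """Call the model up to `max_attempts` times; return None to signal 'skip'."""
    for _ in range(max_attempts):
        data = parse_miner_json(call_model(prompt))
        if data is not None:
            return data
    return None
```

Returning `None` instead of raising lets the ingestion loop skip a bad chunk and keep going, which matters on a long book where a single malformed reply should not abort the run.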
How this Architecture Protects "Original Meaning"
Since you were concerned about losing the book's original thought:
- The Anchor (Pass 1): The 7B model creates a hard link: `(:Chunk {text: "Original Author's Words"})-[:EXPLAINS]->(:Concept)`. No matter what the 72B model does later to organize the hierarchy, the definition of the concept remains tied to the raw text.
- The Lesson: When "The Professor" teaches you later, it will perform a Vector Search on the `(:Chunk)` nodes, ensuring it quotes the author directly, rather than hallucinating a generic definition.
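To illustrate the retrieval idea, here is a pure-Python sketch of ranking chunks by cosine similarity to a query embedding. In the real pipeline the graph database's vector index would do this work. The in-memory list of `(text, vector)` pairs and the function names are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query_vec, chunks, k: int = 3) -> list[str]:
    """Return the raw text of the k chunks most similar to the query vector.

    chunks: list of (text, vector) pairs, standing in for stored (:Chunk) nodes.
    """
    ranked = sorted(
        chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

Because the function returns the stored chunk text itself, whatever is shown to the learner can be traced back to the author's original wording.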