Hybrid LLM Method at Ingestion 3
Here is the updated Mega-Prompt and code structure. I have rewritten the prompts and logic to reflect our decision: No Sampling (Full Scan) using the fast 7B model, followed by a "Reducer" phase using the 72B model to clean up the resulting schema.
Copy and paste this into Google Antigravity or your IDE.
Hybrid LLM Method: The "Full Scan" Pipeline
Strategy: We utilize the speed of the 7B model to read 100% of the book (Full Scan), ensuring no detail is lost. Because this creates "duplicate" concepts (e.g., "Car" vs "Auto"), we use the 72B model in Pass 2 as a "Reducer" to merge synonyms and build the hierarchy.
FILE 1: prompts.py
def get_miner_prompt(chunk_text):
    """
    PASS 1: The Miner (Qwen-2.5-7B)
    Strategy: Open Extraction. High speed. Catch everything.
    """
    return f"""
Act as a Data Extraction Engine.
Analyze the following text chunk (~1000 words) from a technical book.

GOAL: Extract valid technical entities and explicit relationships.

RULES:
1. **Granularity:** Extract specific technical terms (e.g., "Vector Search", "Cosine Similarity") rather than vague terms (e.g., "Search", "Math").
2. **Nodes:** Identify Key Concepts.
3. **Edges:** Identify relationships ONLY if explicitly stated in the text.
4. **Format:** Output STRICT JSON. No markdown, no commentary.

JSON STRUCTURE:
{{
  "concepts": [
    {{ "name": "Concept Name", "type": "Concept" }}
  ],
  "relationships": [
    {{ "source": "Concept Name", "target": "Concept Name", "relation": "RELATION_TYPE" }}
  ]
}}

TEXT TO ANALYZE:
{chunk_text}
"""

def get_manager_prompt(concept_list_text):
    """
    PASS 2: The Manager (Qwen-2.5-72B)
    Strategy: Entity Resolution & Schema Cleaning.
    Input: A raw list of concept names found by the Miner.
    """
    return f"""
Act as a Graph Database Architect.
You are given a raw list of "Concepts" extracted by a smaller AI from a whole book.
Because the extraction was open, the list contains SYNONYMS, DUPLICATES, and NOISE.

YOUR TASK: Generate Cypher queries to clean this data.

STEP 1: MERGE SYNONYMS (Entity Resolution)
- Look for variations: "Vector Database", "Vector DB", "Vector Store".
- Choose the best canonical name (e.g., "Vector Database").
- Generate Cypher to merge them into the canonical node.
- Syntax: `MATCH (a:Concept {{name: 'Vector DB'}}), (b:Concept {{name: 'Vector Database'}}) CALL apoc.refactor.mergeNodes([b,a]) YIELD node RETURN count(*);`

STEP 2: DELETE NOISE (Garbage Collection)
- Identify generic/meaningless terms: "Chapter 1", "Introduction", "The Author", "Diagram", "Summary".
- Generate Cypher: `MATCH (n:Concept {{name: 'Chapter 1'}}) DETACH DELETE n;`

STEP 3: HIERARCHY
- If you see clear parent/child relationships (e.g., "Cat" and "Animal"), link them.
- Generate Cypher: `MATCH (a:Concept {{name:'Cat'}}), (b:Concept {{name:'Animal'}}) MERGE (a)-[:IS_A]->(b);`

INPUT CONCEPTS:
{concept_list_text}

OUTPUT:
Return ONLY valid Cypher commands. Separated by newlines. No markdown.
"""
FILE 2: ingest_pass1.py (The Miner)
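import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from falkordb import FalkorDB

from prompts import get_miner_prompt  # FILE 1, used by the Miner call in step B

# 1. SETUP
db = FalkorDB(host='localhost', port=6379)
g = db.select_graph('TheProfessor')
loader = PyMuPDFLoader("my_book.pdf")
docs = loader.load()  # one Document per PDF page

# 2. CHUNKING (Standardize for Context Window)
# chunk_size and chunk_overlap are measured in characters, not words.
# split_documents never merges text across Documents, so a page shorter
# than chunk_size becomes a single chunk.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=6000,   # ~1000 words at roughly 6 characters per word
    chunk_overlap=600  # ~100 words of overlap to catch edge cases at chunk boundaries
)
chunks = text_splitter.split_documents(docs)
print(f"Total Chunks to Process: {len(chunks)}")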

# 3. FULL SCAN LOOP (No Sampling)
# We rely on the speed of the 7B model to process EVERYTHING.
for i, chunk in enumerate(chunks):
    print(f"Processing Chunk {i+1}/{len(chunks)}...")

    # A. Create the Anchor Node (Raw Text)
    # This preserves the "Original Meaning" for Vector Search later.
    # The text is passed as a query parameter rather than escaped by hand,
    # so quotes, backslashes, and newlines in the PDF text cannot break the query.
    g.query(
        "CREATE (c:Chunk {text: $text, page: $page})",
        {"text": chunk.page_content, "page": chunk.metadata.get("page", 0)},
    )

    # B. Call 7B LLM (The Miner)
    # response = call_llm(model="qwen2.5:7b", prompt=get_miner_prompt(chunk.page_content))
    # json_data = parse_json(response)

    # C. Write Concepts & Edges to FalkorDB
    # (Pseudo-code for writing the extracted JSON to the graph)
    # Link every Concept back to the Chunk: (Chunk)-[:MENTIONS]->(Concept)
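
Steps B and C above are left as pseudo-code. A minimal sketch of how they might be filled in follows. Only the name `parse_json` comes from the script's commented-out call; `_as_list`, `safe_rel_type`, `build_write_ops`, the `chunk_id` argument, and the `RELATED_TO` fallback are hypothetical names introduced here for illustration. It assumes the Miner sometimes wraps its JSON in a markdown fence despite the prompt, and that the chunk's internal id is available (for example by appending `RETURN id(c)` to the `CREATE` statement). Relationship types cannot be passed as Cypher parameters, so the sketch sanitizes them before interpolating.

```python
import json
import re


def _as_list(value):
    """Treat anything that is not a list (None, str, dict) as empty."""
    return value if isinstance(value, list) else []


def parse_json(response: str) -> dict:
    """Parse the Miner's reply, tolerating markdown fences or stray commentary."""
    empty = {"concepts": [], "relationships": []}
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)  # outermost {...} span
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    # Keep only well-formed entries so the graph writes never hit missing keys.
    concepts = [
        c for c in _as_list(data.get("concepts"))
        if isinstance(c, dict) and isinstance(c.get("name"), str) and c["name"].strip()
    ]
    relationships = [
        r for r in _as_list(data.get("relationships"))
        if isinstance(r, dict)
        and all(isinstance(r.get(k), str) and r[k].strip()
                for k in ("source", "target", "relation"))
    ]
    return {"concepts": concepts, "relationships": relationships}


def safe_rel_type(relation: str) -> str:
    """Turn free text such as 'is used by' into a Cypher-safe type (IS_USED_BY)."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", relation).strip("_").upper()
    if not cleaned:
        return "RELATED_TO"  # assumed fallback, not part of the original design
    # Unescaped Cypher identifiers must not start with a digit.
    return f"REL_{cleaned}" if cleaned[0].isdigit() else cleaned


def build_write_ops(chunk_id: int, data: dict) -> list:
    """Return (cypher, params) pairs for one chunk's parsed extraction."""
    ops = []
    for concept in data.get("concepts", []):
        ops.append((
            "MATCH (c:Chunk) WHERE id(c) = $chunk_id "
            "MERGE (n:Concept {name: $name}) "
            "MERGE (c)-[:MENTIONS]->(n)",
            {"chunk_id": chunk_id, "name": concept["name"].strip()},
        ))
    for rel in data.get("relationships", []):
        rel_type = safe_rel_type(rel["relation"])
        ops.append((
            "MERGE (a:Concept {name: $source}) "
            "MERGE (b:Concept {name: $target}) "
            f"MERGE (a)-[:{rel_type}]->(b)",
            {"source": rel["source"].strip(), "target": rel["target"].strip()},
        ))
    return ops
```

Inside the loop, each pair could then be executed with `g.query(cypher, params)`. Using `MERGE` rather than `CREATE` for concepts keeps exact-name repeats from multiplying across chunks, which leaves only genuine variants for the Manager to resolve.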
FILE 3: refine_pass2.py (The Manager)
from falkordb import FalkorDB

from prompts import get_manager_prompt  # FILE 1, used by the Manager call below

# Same graph that ingest_pass1.py wrote to
db = FalkorDB(host='localhost', port=6379)
g = db.select_graph('TheProfessor')

# 1. FETCH RAW CONCEPTS
# Grab all distinct concept names generated by the Miner
res = g.query("MATCH (n:Concept) RETURN DISTINCT n.name").result_set
all_concepts = [row[0] for row in res if row[0]]  # skip nodes without a name

# 2. BATCH PROCESS (The Reducer)
# We send batches of ~50-100 concepts to the 72B model to find duplicates.
batch_size = 50
for i in range(0, len(all_concepts), batch_size):
    batch = all_concepts[i:i + batch_size]

    # Call 72B LLM
    # prompt = get_manager_prompt(str(batch))
    # cypher_commands = call_llm(model="qwen2.5:72b", prompt=prompt)

    # Execute Cleaning Queries
    # for cmd in cypher_commands.splitlines():
    #     if "MATCH" in cmd:
    #         g.query(cmd)
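
The `"MATCH" in cmd` check above accepts any line that mentions MATCH, including explanatory sentences or read-only queries the Manager might emit. One possible tightening is sketched below: it keeps only lines that start with `MATCH` and contain one of the three write operations the Manager prompt asks for. The function name `extract_cypher_commands` and the allow-list pattern are assumptions for illustration. Note also that APOC is a Neo4j plugin, so whether the `apoc.refactor.mergeNodes` call executes depends on the target database; the filter only decides which lines are forwarded.

```python
import re

# Shapes requested by the Manager prompt: merge synonyms, delete noise, link hierarchy.
_ALLOWED = re.compile(
    r"^MATCH\s.*\b(CALL\s+apoc\.refactor\.mergeNodes|DETACH\s+DELETE|MERGE)\b",
    flags=re.IGNORECASE,
)


def extract_cypher_commands(llm_output: str) -> list:
    """Return only the lines that look like one of the three expected commands."""
    commands = []
    for line in llm_output.splitlines():
        cmd = line.strip()
        if cmd and _ALLOWED.search(cmd):
            commands.append(cmd)
    return commands
```

The execution loop would then iterate over `extract_cypher_commands(cypher_commands)` instead of the raw `splitlines()` output, so markdown fences, commentary, and unexpected statements are dropped before anything reaches the graph.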
Summary of Changes for "Full Scan"
- Prompts (`prompts.py`):
  - Miner: Removed any filtering logic. It is now a "greedy" extractor.
  - Manager: Added specific instructions for `apoc.refactor.mergeNodes` (Entity Resolution). This is critical because the full scan will create duplicates (e.g., "Node A" and "Node_A").
- Ingestion (`ingest_pass1.py`):
  - Removed `random.sample`. The loop now iterates `enumerate(chunks)` to cover 100% of the book.
- Refinement (`refine_pass2.py`):
  - Focused entirely on cleaning up the "mess" left by the greedy Miner.
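
Because the Manager only sees one batch at a time, variants such as "Node A" and "Node_A" can only be merged if they land in the same batch. A small sketch of one way to raise those odds is to sort the concept list by a normalized key before slicing it into batches. The helper names `normalize` and `order_for_batching` are hypothetical, and this step is an optional addition rather than part of the pipeline as described.

```python
import re


def normalize(name: str) -> str:
    """Collapse case, underscores, hyphens, and extra spaces: 'Node_A' -> 'node a'."""
    return re.sub(r"[\s_\-]+", " ", name).strip().lower()


def order_for_batching(concepts: list) -> list:
    """Sort names so trivially different spellings sit next to each other."""
    return sorted(concepts, key=lambda name: (normalize(name), name))


print(order_for_batching(["Vector DB", "Node_A", "Cat", "Node A"]))
# ['Cat', 'Node A', 'Node_A', 'Vector DB']
```

This only groups surface-level variants. True synonyms such as "Car" and "Auto" still rely on the 72B model recognizing them, and a pair can still straddle a batch boundary.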
This architecture gives you the completeness of a full read with the cleanliness of a curated schema.