Hybrid LLM Method at Ingestion 3

Here is the updated Mega-Prompt and code structure. I have rewritten the prompts and logic to reflect our decision: No Sampling (Full Scan) using the fast 7B model, followed by a "Reducer" phase using the 72B model to clean up the resulting schema.

Copy and paste this into Google Antigravity or your IDE.


Hybrid LLM Method: The "Full Scan" Pipeline

Strategy: We use the speed of the 7B model to read 100% of the book (Full Scan), so no chunk is skipped. Because open extraction over every chunk produces duplicate concepts (e.g., "Car" vs. "Auto"), Pass 2 uses the 72B model as a "Reducer" to merge synonyms and build the hierarchy.

FILE 1: prompts.py

def get_miner_prompt(chunk_text):
    """
    PASS 1: The Miner (Qwen-2.5-7B)
    Strategy: Open Extraction. High speed. Catch everything.
    """
    return f"""
    Act as a Data Extraction Engine. 
    Analyze the following text chunk (~1000 words) from a technical book.

    GOAL: Extract valid technical entities and explicit relationships.

    RULES:
    1. **Granularity:** Extract specific technical terms (e.g., "Vector Search", "Cosine Similarity") rather than vague terms (e.g., "Search", "Math").
    2. **Nodes:** Identify Key Concepts.
    3. **Edges:** Identify relationships ONLY if explicitly stated in the text.
    4. **Format:** Output STRICT JSON. No markdown, no commentary.

    JSON STRUCTURE:
    {{
        "concepts": [
            {{ "name": "Concept Name", "type": "Concept" }}
        ],
        "relationships": [
            {{ "source": "Concept Name", "target": "Concept Name", "relation": "RELATION_TYPE" }}
        ]
    }}

    TEXT TO ANALYZE:
    {chunk_text}
    """

def get_manager_prompt(concept_list_text):
    """
    PASS 2: The Manager (Qwen-2.5-72B)
    Strategy: Entity Resolution & Schema Cleaning.
    Input: A raw list of concept names found by the Miner.
    """
    return f"""
    Act as a Graph Database Architect.
    You are given a raw list of "Concepts" extracted by a smaller AI from a whole book.
    Because the extraction was open, the list contains SYNONYMS, DUPLICATES, and NOISE.

    YOUR TASK: Generate Cypher queries to clean this data.

    STEP 1: MERGE SYNONYMS (Entity Resolution)
    - Look for variations: "Vector Database", "Vector DB", "Vector Store".
    - Choose the best canonical name (e.g., "Vector Database").
    - Generate Cypher to merge them into the canonical node.
    - Syntax: `MATCH (a:Concept {{name: 'Vector DB'}}), (b:Concept {{name: 'Vector Database'}}) CALL apoc.refactor.mergeNodes([b,a]) YIELD node RETURN count(*);`

    STEP 2: DELETE NOISE (Garbage Collection)
    - Identify generic/meaningless terms: "Chapter 1", "Introduction", "The Author", "Diagram", "Summary".
    - Generate Cypher: `MATCH (n:Concept {{name: 'Chapter 1'}}) DETACH DELETE n;`

    STEP 3: HIERARCHY
    - If you see clear parent/child relationships (e.g., "Cat" and "Animal"), link them.
    - Generate Cypher: `MATCH (a:Concept {{name:'Cat'}}), (b:Concept {{name:'Animal'}}) MERGE (a)-[:IS_A]->(b);`

    INPUT CONCEPTS:
    {concept_list_text}

    OUTPUT:
    Return ONLY valid Cypher commands. Separated by newlines. No markdown.
    """
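
The Miner prompt asks for strict JSON, but small models often wrap the reply in a code fence or add a sentence of commentary anyway. Below is a minimal sketch of a tolerant parser for that reply. The name `parse_miner_json` is hypothetical (the pipeline only refers to an undefined `parse_json`), and silently dropping malformed entries is an assumption about how you want failures handled.

```python
import json
import re


def _as_list(value):
    """Return value if it is a list, otherwise an empty list."""
    return value if isinstance(value, list) else []


def parse_miner_json(response: str) -> dict:
    """Best-effort parse of the Miner's reply into concepts and relationships."""
    empty = {"concepts": [], "relationships": []}
    # Keep only the outermost {...} span; this drops code fences and commentary.
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    # Keep concepts that have a non-empty string name.
    concepts = [
        c for c in _as_list(data.get("concepts"))
        if isinstance(c, dict) and isinstance(c.get("name"), str) and c["name"].strip()
    ]
    # Keep relationships where source, target, and relation are all non-empty strings.
    relationships = [
        r for r in _as_list(data.get("relationships"))
        if isinstance(r, dict)
        and all(isinstance(r.get(k), str) and r[k].strip()
                for k in ("source", "target", "relation"))
    ]
    return {"concepts": concepts, "relationships": relationships}
```

A chunk whose reply cannot be parsed yields empty lists, so the full scan keeps going instead of stopping on one bad response.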

FILE 2: ingest_pass1.py (The Miner)

import os
from langchain_text_splitters import RecursiveCharacterTextSplitter  # older releases: langchain.text_splitter
from langchain_community.document_loaders import PyMuPDFLoader
from falkordb import FalkorDB

from prompts import get_miner_prompt  # FILE 1

# 1. SETUP
db = FalkorDB(host='localhost', port=6379)
g = db.select_graph('TheProfessor')
loader = PyMuPDFLoader("my_book.pdf")
docs = loader.load()

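# 2. CHUNKING (Standardize for Context Window)
# chunk_size and chunk_overlap are measured in characters, not words.
# PyMuPDFLoader returns one Document per page and split_documents splits each
# Document separately, so a chunk never spans a page boundary.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=6000,    # ~1000 words at roughly 6 characters per word
    chunk_overlap=600   # ~100 words of overlap so entities at chunk edges are not cut off
)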
chunks = text_splitter.split_documents(docs)

print(f"Total Chunks to Process: {len(chunks)}")

# 3. FULL SCAN LOOP (No Sampling)
# We rely on the speed of the 7B model to process EVERYTHING.
for i, chunk in enumerate(chunks):
    print(f"Processing Chunk {i+1}/{len(chunks)}...")

    # A. Create the Anchor Node (Raw Text)
    # This preserves the "Original Meaning" for Vector Search later.
    # Values are passed as query parameters instead of being interpolated into
    # the Cypher string, so quotes, backslashes, and newlines cannot break the query.
    g.query(
        "CREATE (c:Chunk {text: $text, page: $page})",
        {"text": chunk.page_content, "page": chunk.metadata.get("page", 0)},
    )

    # B. Call 7B LLM (The Miner)
    # call_llm and parse_json are placeholders; wire them to your model server.
    # response = call_llm(model="qwen2.5:7b", prompt=get_miner_prompt(chunk.page_content))
    # json_data = parse_json(response)

    # C. Write Concepts & Edges to FalkorDB
    # (Pseudo-code) MERGE each concept by name so repeats across chunks reuse one node,
    # then link every Concept back to the Chunk: (Chunk)-[:MENTIONS]->(Concept)
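
Step C above is left as pseudo-code. Here is one way it could look: a sketch that turns the Miner's JSON into parameterized Cypher statements. The helper names `relation_label` and `build_write_queries` are hypothetical, and so is `chunk_id`, which I am assuming you capture by appending `RETURN id(c)` to the Chunk `CREATE` query. Cypher cannot take a relationship type as a parameter, so the sketch sanitizes it into a plain identifier first; the `RELATED_TO` fallback is my own choice.

```python
import re


def relation_label(raw: str) -> str:
    """Turn free text such as 'is part of' into a safe type such as IS_PART_OF."""
    label = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_").upper()
    if not label or label[0].isdigit():
        return "RELATED_TO"  # assumed fallback for unusable relation text
    return label


def build_write_queries(chunk_id: int, extracted: dict) -> list:
    """Return (cypher, params) pairs for one chunk's concepts and relationships."""
    queries = []
    for concept in extracted.get("concepts", []):
        # MERGE reuses an existing Concept with the same name instead of duplicating it.
        queries.append((
            "MATCH (c:Chunk) WHERE id(c) = $chunk_id "
            "MERGE (n:Concept {name: $name}) "
            "MERGE (c)-[:MENTIONS]->(n)",
            {"chunk_id": chunk_id, "name": concept["name"].strip()},
        ))
    for rel in extracted.get("relationships", []):
        # Only the sanitized label is interpolated; names travel as parameters.
        queries.append((
            "MERGE (a:Concept {name: $source}) "
            "MERGE (b:Concept {name: $target}) "
            f"MERGE (a)-[:{relation_label(rel['relation'])}]->(b)",
            {"source": rel["source"].strip(), "target": rel["target"].strip()},
        ))
    return queries
```

Inside the loop you would then run `for cypher, params in build_write_queries(chunk_id, json_data): g.query(cypher, params)`.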

FILE 3: refine_pass2.py (The Manager)

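from falkordb import FalkorDB

from prompts import get_manager_prompt  # FILE 1

# Reconnect to the graph populated by ingest_pass1.py
db = FalkorDB(host='localhost', port=6379)
g = db.select_graph('TheProfessor')

# 1. FETCH RAW CONCEPTS
# Grab all distinct concept names generated by the Miner
res = g.query("MATCH (n:Concept) RETURN DISTINCT n.name").result_set
all_concepts = [row[0] for row in res if row[0]]

# 2. BATCH PROCESS (The Reducer)
# We send batches of 50 concepts to the 72B model to find duplicates.
# Synonyms are only merged when they land in the same batch.
batch_size = 50
for i in range(0, len(all_concepts), batch_size):
    batch = all_concepts[i:i + batch_size]

    # Call 72B LLM (call_llm is a placeholder; wire it to your model server)
    # prompt = get_manager_prompt("\n".join(batch))
    # cypher_commands = call_llm(model="qwen2.5:72b", prompt=prompt)

    # Execute Cleaning Queries
    # NOTE: apoc.* procedures are a Neo4j plugin; confirm your FalkorDB
    # deployment accepts the merge statement before relying on it.
    # for cmd in cypher_commands.splitlines():
    #     cmd = cmd.strip()
    #     if cmd.startswith("MATCH"):
    #         g.query(cmd)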
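
Checking only for the word MATCH lets through anything the 72B model writes, including `MATCH (n) DETACH DELETE n`, which would wipe the graph. Here is a sketch of a stricter gate that accepts only the three statement shapes the Manager prompt asks for. The name `filter_cypher` and the regular expressions are my own assumptions, not part of any library; loosen them if your model formats its output differently (for example, lowercase keywords).

```python
import re

# A single-quoted Cypher string literal, allowing backslash escapes.
NAME = r"'(?:[^'\\]|\\.)*'"
# A node pattern pinned to one concept, e.g. (a:Concept {name: 'Vector DB'})
NODE = rf"\(\w+:Concept\s*\{{\s*name:\s*{NAME}\s*\}}\)"

ALLOWED = [
    # STEP 1: merge two named concepts
    re.compile(rf"^MATCH\s+{NODE}\s*,\s*{NODE}\s+CALL\s+apoc\.refactor\.mergeNodes"
               rf"\(\[\w+\s*,\s*\w+\]\)\s+YIELD\s+node\s+RETURN\s+count\(\*\)\s*;?$"),
    # STEP 2: delete one named concept
    re.compile(rf"^MATCH\s+{NODE}\s+DETACH\s+DELETE\s+\w+\s*;?$"),
    # STEP 3: link two named concepts with IS_A
    re.compile(rf"^MATCH\s+{NODE}\s*,\s*{NODE}\s+MERGE\s+\(\w+\)-\[:IS_A\]->\(\w+\)\s*;?$"),
]


def filter_cypher(llm_output: str) -> list:
    """Keep only lines that match one of the expected statement shapes."""
    safe = []
    for line in llm_output.splitlines():
        stmt = line.strip()
        if any(pattern.match(stmt) for pattern in ALLOWED):
            safe.append(stmt)
    return safe
```

You would then loop over `filter_cypher(cypher_commands)` instead of the raw lines, and log whatever was rejected so you can see what the model tried to do.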

Summary of Changes for "Full Scan"

  1. Prompts (prompts.py):
    • Miner: Removed any filtering logic. It is now a "greedy" extractor.
    • Manager: Added specific instructions for apoc.refactor.mergeNodes (Entity Resolution). This is critical because the full scan will create duplicates (e.g., "Node A" and "Node_A").
  2. Ingestion (ingest_pass1.py):
    • Removed random.sample. The loop now iterates over enumerate(chunks), so it covers 100% of the book.
  3. Refinement (refine_pass2.py):
    • Focused entirely on cleaning up the "mess" left by the greedy Miner.
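
One caveat with fixed batches of 50: the Manager can only merge names it sees together, so "Node A" and "Node_A" are missed if they fall into different batches. Here is a sketch of one possible mitigation, which is my own suggestion and not part of the pipeline above: group names by a normalized key and pack whole groups into batches. The helpers `normalize`, `group_variants`, and `make_batches` are hypothetical names.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Collapse case, punctuation, and spacing: 'Node_A' and 'Node A' -> 'node a'."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def group_variants(names) -> dict:
    """Map each normalized key to the raw spellings that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


def make_batches(names, batch_size: int = 50) -> list:
    """Pack whole variant groups into batches so spellings of one key stay together."""
    groups = group_variants(names)
    batches, current = [], []
    for key in sorted(groups):
        variants = sorted(set(groups[key]))
        # Start a new batch rather than splitting a group across two batches.
        if current and len(current) + len(variants) > batch_size:
            batches.append(current)
            current = []
        current.extend(variants)
    if current:
        batches.append(current)
    return batches
```

This only catches spelling variants. True synonyms such as "Vector DB" and "Vector Store" still have to land in the same batch by luck, and a group larger than batch_size produces one oversized batch.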

This architecture gives you the completeness of a full read with the cleanliness of a curated schema.