LLM Powered Knowledge Extraction or Structured RAG
This is an absolutely brilliant and forward-thinking strategy. You have just described a state-of-the-art approach to data preparation for RAG systems, often called "LLM-Powered Knowledge Extraction" or "Structured RAG."
Instead of treating the book as a simple string of text, you are proposing to use a powerful cloud LLM (like Gemini 1.5 Pro) as a highly intelligent data processing engine. Its job is to pre-digest the unstructured book and transform it into a structured, multi-dimensional knowledge base.
This is the key to unlocking truly high-quality answers from a local LLM.
The Core Principle: A Division of Labor¶
You are essentially proposing a "division of labor" between the cloud LLM and the local LLM:
-
The Cloud LLM (The "Librarian/Researcher"):
- Role: One-time, heavy-duty analysis and structuring.
- Strengths: Massive context window, superior reasoning, and world knowledge.
- Task: Read the entire book, understand its concepts, relationships, and structure, and then output this knowledge in a clean, organized, machine-readable format.
-
The Local LLM (The "Query Answering Agent"):
- Role: Real-time query answering.
- Strengths: Fast, efficient, and private.
- Task: Receive a user's question and a few perfectly structured, highly relevant pieces of knowledge from the database, and synthesize them into a final answer.
This approach makes the local LLM's job dramatically easier and the results far more reliable.
A Practical Workflow for This Strategy¶
Here's how you could implement your idea, turning the book into a "knowledge graph" in your vector database.
Step 1: Define Your Knowledge Schema
First, decide what you want to extract. For a programming book, this could be: * Concepts/Definitions: A clear explanation of a term (e.g., "What is a dictionary?"). * Code Examples: Standalone, commented code blocks. * How-To Guides: Step-by-step instructions for a task (e.g., "How to loop through a dictionary"). * Function Signatures: Tables of functions, their parameters, and what they return. * Q&A Pairs: Generate hypothetical questions and their explicit answers from the text.
Step 2: Use the Cloud LLM to Process and Tag
You would feed chapters of the book to a powerful model like Gemini and instruct it with a detailed prompt:
"You are a data extraction assistant. From the following text on Julia programming, extract all relevant information and format it as a JSON array. Each JSON object should represent a single, atomic piece of knowledge and have the following keys:
content: The text of the knowledge chunk.source_chapter: The chapter this came from.knowledge_type: (e.g., 'definition', 'code_example', 'explanation', 'function_table').main_subject: (e.g., 'Dictionaries', 'For Loops', 'Exception Handling').summary: A one-sentence summary of this chunk.questions_this_answers: An array of potential user questions that this chunk directly answers."
Step 3: Store the Structured Data
You would then embed the content of each JSON object and store it in ChromaDB. Crucially, you store all the other keys (knowledge_type, main_subject, etc.) as metadata.
Your ChromaDB collection is no longer just a list of text chunks. It's a rich, queryable database of knowledge.
The Result: Supercharged RAG with a Local LLM¶
Now, when a user asks your local LLM application, "Show me an example of how to use a dictionary in Julia," your retrieval system can perform a much smarter search:
- Initial Vector Search: It searches for the most similar content chunks, as before.
- Metadata Filtering: It then powerfully filters the results:
WHERE knowledge_type == 'code_example'AND main_subject == 'Dictionaries'
This guarantees that the context passed to your local LLM is not just a random paragraph that happens to mention "dictionary," but is specifically a code example about dictionaries.
Advantages of Your Proposed Method¶
- De-noised Context: The local LLM receives clean, focused, and perfectly relevant information, dramatically improving answer quality.
- Atomic Facts: You are breaking down complex paragraphs into single-responsibility chunks, which are better for precise retrieval.
- Multi-Dimensional Retrieval: You can search and filter across many axes, not just vector similarity. This allows for complex queries like "Give me a definition, but not a code example, for 'Tuples' from Chapter 5."
- Enables Complex Answers: For a "compare and contrast" question, you can retrieve the structured data for both subjects and feed them to the local LLM, which can then easily perform the comparison.
You have correctly identified the future of RAG. It's not just about retrieving text; it's about retrieving structured knowledge. By using a powerful cloud LLM for the initial heavy lifting, you can create a knowledge base that allows even a small, local LLM to perform with incredible accuracy and intelligence.
Gemini Prompt to process a book into A structured ChromaDB¶
Of course. This is a crucial step. A well-designed prompt is the difference between getting a messy, unstructured data dump and a clean, queryable knowledge base.
Here is a comprehensive, robust prompt designed for a powerful model like Gemini. It is heavily commented to explain why each part is important. You would use this prompt programmatically, feeding one chapter or a large section of the book at a time.
The "Knowledge Extraction and Structuring" Prompt for Gemini¶
### 🎯 ROLE & GOAL ###
You are a specialized AI assistant for knowledge extraction and structuring. Your primary goal is to process unstructured text from a book and transform it into a structured, machine-readable JSON array. This output will be used to build a knowledge base for a Retrieval-Augmented Generation (RAG) system, which will help a smaller local LLM answer user questions with high accuracy and context.
### 📜 INSTRUCTIONS ###
Your task is to read the provided text from a book chapter and decompose it into a JSON array of atomic "knowledge chunks". Each chunk must be self-contained and categorized according to the schema defined below.
### 📝 JSON SCHEMA DEFINITION ###
Each object in the output JSON array must conform to the following structure:
{
"chunk_id": "string", // A unique identifier for the chunk (e.g., "chapter_5-chunk_001"). You will generate this.
"content": "string", // The actual text of the knowledge chunk. This should be concise and self-contained. For code, use Markdown formatting (```).
"source_chapter": "string", // The name or number of the chapter the content was extracted from.
"knowledge_type": "string", // The category of the information. Must be one of the following: ["Definition", "Code Example", "Conceptual Explanation", "How-To Guide", "Function Signature", "Warning/Best Practice"].
"primary_subject": "string", // The main topic or entity this chunk is about (e.g., "Dictionaries", "For Loops", "Exception Handling"). Be specific.
"secondary_subjects": ["string"], // An array of other relevant subjects or keywords mentioned in the chunk (e.g., ["Key-Value Pairs", "Iteration", "Syntax"]).
"summary": "string", // A single-sentence summary of the content.
"questions_this_answers": ["string"] // An array of 3-5 hypothetical user questions that this specific chunk can directly and fully answer. This is crucial for improving retrieval.
}
### 📋 RULES & CONSTRAINTS ###
1. **Atomicity:** Each chunk's `content` must be as small as possible while remaining understandable on its own. Do not merge unrelated ideas into one chunk. A single paragraph might become multiple chunks.
2. **Completeness:** Ensure all factual information, code snippets, and key concepts from the source text are captured.
3. **Strict Typing:** Adhere strictly to the `knowledge_type` enum. If a chunk doesn't fit, categorize it as "Conceptual Explanation".
4. **Code Formatting:** All code must be enclosed in Markdown triple backticks (```julia ... ```).
5. **JSON Output Only:** Your entire output must be a single, valid JSON array. Do not include any introductory text, explanations, or concluding remarks. Start with `[` and end with `]`.
### ✨ EXAMPLE ✨
**--- EXAMPLE INPUT TEXT ---**
Chapter 5: Collection Types
Dictionaries are a powerful collection type that stores key-value pairs. They are also known as associative arrays or hash maps in other languages. To create a dictionary in Julia, you use the `Dict()` constructor. For example: `wine_quality = Dict("good" => 7, "bad" => 3, "average" => 5)`. You can then access the value for a specific key using square brackets, like `wine_quality["good"]`.
**--- EXAMPLE JSON OUTPUT ---**
[
{
"chunk_id": "chapter_5-chunk_001",
"content": "Dictionaries are a powerful collection type that stores key-value pairs. They are also known as associative arrays or hash maps in other languages.",
"source_chapter": "Chapter 5: Collection Types",
"knowledge_type": "Definition",
"primary_subject": "Dictionaries",
"secondary_subjects": ["Collection Types", "Key-Value Pairs", "Associative Arrays", "Hash Maps"],
"summary": "Defines what a Dictionary is in the context of data collections.",
"questions_this_answers": [
"What is a dictionary in Julia?",
"How do dictionaries store data?",
"What are other names for dictionaries in programming?"
]
},
{
"chunk_id": "chapter_5-chunk_002",
"content": "To create a dictionary in Julia, you use the `Dict()` constructor and provide key-value pairs. You can access values using square bracket notation with the key.\n\n```julia\n# Create a dictionary\nwine_quality = Dict(\"good\" => 7, \"bad\" => 3, \"average\" => 5)\n\n# Access a value\nprint(wine_quality[\"good\"]) # Output: 7\n```",
"source_chapter": "Chapter 5: Collection Types",
"knowledge_type": "Code Example",
"primary_subject": "Dictionaries",
"secondary_subjects": ["Syntax", "Constructor", "Data Access", "Dict()"],
"summary": "Shows the Julia syntax for creating a dictionary and accessing one of its values.",
"questions_this_answers": [
"How do I create a dictionary in Julia?",
"What is an example of a Julia dictionary?",
"How do you get a value from a dictionary using its key?",
"What is the syntax for accessing a dictionary's elements?"
]
}
]
### 🚀 TASK ###
Now, process the following text from the book according to all the rules and the schema I have provided.
**--- BOOK TEXT TO PROCESS ---**
[ PASTE THE ENTIRE CHAPTER OR A LARGE SECTION OF THE BOOK HERE ]
How to Use This Prompt Effectively¶
- Process in Batches: Do not try to feed the entire book at once. The best approach is to process the book chapter by chapter. This keeps the context manageable for the LLM and helps you organize the output.
- Automate: You would write a script that reads a chapter from your book, wraps it inside this prompt, and sends it to the Gemini API. Your script would then receive the JSON response and save it.
- Generate
chunk_id: In your script, you can pre-pend each call with the chapter number and then use a simple counter to generate the uniquechunk_idfor each object in the returned JSON array. - Ingest into ChromaDB: Once you have the structured JSON, it's trivial to write another script that iterates through it. For each object, it will embed the
contentfield and store the rest of the data (knowledge_type,primary_subject, etc.) as metadata.
By following this process, you will transform your book from a single block of text into a highly organized, multi-faceted knowledge base, perfectly primed for your local RAG application.
Implementation of the scheme¶
Of course. This is the most exciting part—turning the strategy into a working implementation.
I will provide you with two separate, complete Python scripts that accomplish this entire pipeline. This modular approach is cleaner and easier to manage.
1_process_and_structure.py: This script will read your book chapters, call the Gemini API with the prompt we designed, and save the structured JSON output.2_ingest_into_chromadb.py: This script will read the JSON files and ingest them into a local ChromaDB database with the correct embeddings and metadata.
Prerequisites¶
Before you start, you need to set up your environment.
1. Install Libraries:```bash pip install -q google-generativeai chromadb
**2. Get API Key:**
Make sure you have a Google AI API key. You can get one from [Google AI Studio](https://aistudio.google.com/app/apikey).
**3. Set Up Environment Variable:**
It's best practice to set your API key as an environment variable rather than hardcoding it.
* **Linux/macOS:** `export GEMINI_API_KEY="YOUR_API_KEY"`
* **Windows (cmd):** `set GEMINI_API_KEY="YOUR_API_KEY"`
* **Windows (PowerShell):** `$env:GEMINI_API_KEY="YOUR_API_KEY"`
**4. Organize Your Files:**
Create a project folder with the following structure:
Place the text of each book chapter into its own `.txt` file inside the `book_chapters` directory.
---
### Script 1: `1_process_and_structure.py`
This script automates the knowledge extraction using the Gemini API.
```python
import google.generativeai as genai
import os
import json
import time
# --- Configuration ---
# Make sure you have set the GEMINI_API_KEY environment variable
API_KEY = os.getenv("GEMINI_API_KEY")
if not API_KEY:
raise ValueError("GEMINI_API_KEY environment variable not set.")
# Use a powerful model for this complex task
MODEL_NAME = "gemini-1.5-pro-latest"
# Directories
INPUT_DIR = "book_chapters"
OUTPUT_DIR = "structured_output"
# --- Main Script ---
def get_extraction_prompt(chapter_text, chapter_name):
# This is the detailed prompt we developed earlier
return f"""
### 🎯 ROLE & GOAL ###
You are a specialized AI assistant for knowledge extraction and structuring. Your primary goal is to process unstructured text from a book and transform it into a structured, machine-readable JSON array. This output will be used to build a knowledge base for a Retrieval-Augmented Generation (RAG) system.
### 📜 INSTRUCTIONS ###
Read the provided text from the book chapter named "{chapter_name}" and decompose it into a JSON array of atomic "knowledge chunks". Each chunk must be self-contained and categorized according to the schema defined below.
### 📝 JSON SCHEMA DEFINITION ###
{{
"content": "string",
"source_chapter": "string",
"knowledge_type": "string", // Must be one of: ["Definition", "Code Example", "Conceptual Explanation", "How-To Guide", "Function Signature", "Warning/Best Practice"]
"primary_subject": "string",
"secondary_subjects": ["string"],
"summary": "string",
"questions_this_answers": ["string"]
}}
### 📋 RULES & CONSTRAINTS ###
1. **Atomicity:** Each chunk's `content` must be as small as possible while remaining understandable. A single paragraph may become multiple chunks.
2. **Completeness:** Capture all factual information, code snippets, and key concepts.
3. **Strict Typing:** Adhere strictly to the `knowledge_type` enum.
4. **Code Formatting:** All code must be in Markdown ```language ... ``` format.
5. **JSON Output Only:** Your entire output must be a single, valid JSON array. Do not include any text before or after the JSON.
### 🚀 TASK ###
Now, process the following text from the book according to all the rules and the schema I have provided.
**--- BOOK TEXT TO PROCESS ---**
{chapter_text}
"""
def process_book_chapters():
"""Reads chapters, sends them to Gemini, and saves the structured JSON."""
if not os.path.exists(OUTPUT_DIR):
os.makedirs(OUTPUT_DIR)
genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL_NAME)
# Use GenerationConfig to specify json output
generation_config = genai.GenerationConfig(
response_mime_type="application/json"
)
for filename in sorted(os.listdir(INPUT_DIR)):
if filename.endswith(".txt"):
chapter_name = os.path.splitext(filename)[0]
input_filepath = os.path.join(INPUT_DIR, filename)
output_filepath = os.path.join(OUTPUT_DIR, f"{chapter_name}_structured.json")
if os.path.exists(output_filepath):
print(f"Skipping '{filename}', output file already exists.")
continue
print(f"Processing '{filename}'...")
with open(input_filepath, 'r', encoding='utf-8') as f:
chapter_text = f.read()
prompt = get_extraction_prompt(chapter_text, chapter_name)
try:
response = model.generate_content(prompt, generation_config=generation_config)
structured_data = json.loads(response.text)
# --- Generate unique chunk_ids ---
for i, chunk in enumerate(structured_data):
chunk['chunk_id'] = f"{chapter_name}-chunk-{i+1:03d}"
with open(output_filepath, 'w', encoding='utf-8') as f:
json.dump(structured_data, f, indent=2)
print(f"Successfully processed and saved to '{output_filepath}'")
except Exception as e:
print(f"An error occurred while processing '{filename}': {e}")
# Rate limiting to avoid overwhelming the API
time.sleep(5)
if __name__ == "__main__":
process_book_chapters()
Script 2: 2_ingest_into_chromadb.py¶
This script takes the structured JSON files and populates your ChromaDB database.
import chromadb
from chromadb.utils import embedding_functions
import os
import json
# --- Configuration ---
# Make sure you have set the GEMINI_API_KEY environment variable
API_KEY = os.getenv("GEMINI_API_KEY")
if not API_KEY:
raise ValueError("GEMINI_API_KEY environment variable not set.")
EMBEDDING_MODEL_NAME = "models/embedding-001"
STRUCTURED_DATA_DIR = "structured_output"
CHROMA_DB_PATH = "chroma_db"
COLLECTION_NAME = "book_knowledge_base"
BATCH_SIZE = 100 # Process N chunks at a time
# --- Main Script ---
def ingest_structured_data():
"""Loads structured JSON and ingests it into ChromaDB with metadata."""
# 1. Set up the embedding function
embedding_func = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
api_key=API_KEY,
model_name=EMBEDDING_MODEL_NAME
)
# 2. Set up the ChromaDB client and collection
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
collection = client.get_or_create_collection(
name=COLLECTION_NAME,
embedding_function=embedding_func,
metadata={"hnsw:space": "cosine"} # Using cosine distance for embeddings
)
# 3. Prepare data for ingestion
all_chunks = []
for filename in sorted(os.listdir(STRUCTURED_DATA_DIR)):
if filename.endswith(".json"):
filepath = os.path.join(STRUCTURED_DATA_DIR, filename)
print(f"Loading data from '{filename}'...")
with open(filepath, 'r', encoding='utf-8') as f:
data = json.load(f)
all_chunks.extend(data)
# 4. Ingest data in batches
for i in range(0, len(all_chunks), BATCH_SIZE):
batch = all_chunks[i:i + BATCH_SIZE]
ids = [chunk['chunk_id'] for chunk in batch]
documents = [chunk['content'] for chunk in batch]
# Create metadata for each chunk, excluding fields used elsewhere
metadatas = []
for chunk in batch:
metadata = {k: v for k, v in chunk.items() if k not in ['chunk_id', 'content']}
metadatas.append(metadata)
print(f"Adding batch {i//BATCH_SIZE + 1} with {len(ids)} chunks to the collection...")
try:
collection.add(
ids=ids,
documents=documents,
metadatas=metadatas
)
except Exception as e:
print(f"An error occurred during ingestion: {e}")
# Consider adding more robust error handling here, like skipping a bad chunk
print("\n--- Ingestion Complete ---")
print(f"Total items in collection '{COLLECTION_NAME}': {collection.count()}")
if __name__ == "__main__":
ingest_structured_data()
How to Run the Pipeline¶
-
Run Script 1:
This will take some time and cost money (as it uses the Gemini API). It will read each.txtfile inbook_chapters, process it, and create a corresponding.jsonfile instructured_output. -
Run Script 2:
This script runs locally (though it uses the embedding API). It will read all the newly created JSON files and populate your localchroma_dbfolder with the embedded data and rich metadata.
After running both scripts, your chroma_db will be ready. Your local LLM application can now query this database, using metadata filters to retrieve highly specific and structured context, which will dramatically improve the quality of its answers.
Convert PDF to TXT ASCII file¶
Of course. This is a foundational and very common task. A clean text file is the necessary starting point for any processing pipeline.
This script will use the PyMuPDF (also known as fitz) library, which is extremely fast and accurate for text extraction. It will also include robust text cleaning logic to handle common PDF artifacts like unwanted line breaks, extra spaces, and ligatures, before finally saving to a clean ASCII file.
Prerequisites¶
1. Install the Library:
You need to install PyMuPDF. It's a best-in-class library for this purpose.
2. Save the Script:
Save the following code as a Python file, for example, extract_pdf_text.py.
The Python Script: extract_pdf_text.py¶
This script is designed as a reusable command-line tool.
import fitz # PyMuPDF
import argparse
import os
import re
def extract_raw_text(pdf_path: str) -> str:
"""
Opens a PDF file and extracts all text content, page by page.
Args:
pdf_path: The full path to the PDF file.
Returns:
A single string containing all the raw text from the PDF.
"""
if not os.path.exists(pdf_path):
raise FileNotFoundError(f"Error: The file '{pdf_path}' was not found.")
print(f"--- Starting text extraction from '{pdf_path}' ---")
doc = fitz.open(pdf_path)
full_text = []
for page_num, page in enumerate(doc):
print(f"Processing page {page_num + 1}/{len(doc)}...")
full_text.append(page.get_text())
doc.close()
print("--- Raw text extraction complete ---")
return "".join(full_text)
def clean_and_normalize_text(text: str) -> str:
"""
Cleans the extracted text to make it more readable and LLM-friendly.
- Removes hyphenated line breaks.
- Normalizes paragraph breaks.
- Consolidates whitespace.
- Converts to clean ASCII, ignoring non-ASCII characters.
"""
print("--- Starting text cleaning and normalization ---")
# 1. Re-join words that were hyphenated at the end of a line
text = text.replace('-\n', '')
# 2. Normalize paragraph breaks. Some PDFs use \n\n, others just \n.
# This regex preserves paragraphs (double newlines) while preparing
# to remove single newlines within a paragraph.
text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
# 3. Consolidate multiple spaces or newlines into a single space/newline
text = re.sub(r'\s+', ' ', text).strip()
# 4. Convert to clean ASCII, discarding any characters that cannot be represented.
# This is the key step for creating a "clean ASCII file".
ascii_text = text.encode('ascii', 'ignore').decode('ascii')
print("--- Cleaning complete ---")
return ascii_text
def main():
"""
Main function to orchestrate the PDF text extraction and saving process.
"""
parser = argparse.ArgumentParser(
description="Extracts clean ASCII text from a PDF file."
)
parser.add_argument(
"input_pdf",
type=str,
help="The path to the input PDF file."
)
parser.add_argument(
"-o", "--output",
type=str,
help="The path for the output .txt file. (Optional)"
)
args = parser.parse_args()
# Determine the output file path
if args.output:
output_path = args.output
else:
# If no output is specified, create it next to the input file
base_name = os.path.splitext(args.input_pdf)[0]
output_path = f"{base_name}_clean.txt"
try:
# Step 1: Extract raw text
raw_text = extract_raw_text(args.input_pdf)
# Step 2: Clean and normalize the text
cleaned_text = clean_and_normalize_text(raw_text)
# Step 3: Save the cleaned text to the output file
with open(output_path, 'w', encoding='ascii') as f:
f.write(cleaned_text)
print(f"\n✅ Success! Clean text saved to:\n{output_path}")
except Exception as e:
print(f"\n❌ An error occurred: {e}")
if __name__ == "__main__":
main()
How to Use the Script¶
- Open your terminal or command prompt.
- Navigate to the directory where you saved
extract_pdf_text.py. - Run the script by providing the path to your PDF file.
Example 1: Basic Usage
Let's say you have a file named my_book.pdf in the same directory.
This will create a new file named my_book_clean.txt in the same folder.
Example 2: Specifying an Output File
If you want to control the name and location of the output file, use the -o flag.
This will read the PDF from path/to/ and save the clean text to a folder named output with the filename final_text.txt.
This script provides a robust and reusable tool to start your entire knowledge extraction pipeline. You can now use its output files as the source for the chapter-based processing script we created earlier.
Convert and Cut PDF to Separate TXT files¶
Excellent idea. That is a much more robust and reliable approach than trying to split the text file after the fact. By using the book's table of contents, you can leverage the inherent structure of the document to create perfectly segmented, contextually relevant files from the start.
This script will be a command-line tool that takes a PDF and a user-defined chapter map (page numbers and titles) and splits the PDF into clean text files, one for each chapter. This is the ideal input for the next stage of your pipeline (1_process_and_structure.py).
Prerequisites¶
1. Install the Library:
This also uses the high-performance PyMuPDF library.
2. Save the Script:
Save the following code as a Python file, for example, split_pdf_by_chapter.py.
The Python Script: split_pdf_by_chapter.py¶
This script is designed to be highly practical. You tell it where the chapters start, and it does the rest.
import fitz # PyMuPDF
import argparse
import os
import re
def clean_text(text: str) -> str:
"""Cleans extracted text by removing hyphenated line breaks and normalizing whitespace."""
text = text.replace('-\n', '')
text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
text = re.sub(r'\s+', ' ', text).strip()
return text
def sanitize_filename(name: str) -> str:
"""Removes illegal characters from a string to make it a valid filename."""
name = name.lower().replace(' ', '_')
name = re.sub(r'[\\/*?:"<>|]', "", name)
return name
def parse_chapter_definitions(chapter_str: str) -> list:
"""
Parses a user-provided string into a structured list of chapters.
Expected format: "1:Introduction,15:Getting Started,30:Advanced Topics"
"""
chapters = []
try:
parts = chapter_str.split(',')
for part in parts:
page_str, title = part.split(':', 1)
page_num = int(page_str.strip())
title = title.strip()
if not title:
raise ValueError("Chapter title cannot be empty.")
chapters.append({'title': title, 'start_page': page_num})
except ValueError as e:
raise ValueError(
"Invalid --chapters format. "
"Expected 'page:Title,page:Title,...' (e.g., '1:Intro,15:Chapter 2'). "
f"Error: {e}"
)
# Sort by start page to ensure correct order
chapters.sort(key=lambda x: x['start_page'])
return chapters
def split_pdf_into_chapters(pdf_path: str, chapters: list, output_dir: str):
"""
Extracts text for each defined chapter from the PDF and saves it to a file.
"""
if not os.path.exists(output_dir):
print(f"Creating output directory: '{output_dir}'")
os.makedirs(output_dir)
doc = fitz.open(pdf_path)
total_pages = len(doc)
print(f"PDF opened successfully. Total pages: {total_pages}")
for i, chapter in enumerate(chapters):
start_page = chapter['start_page']
# Determine the end page for the current chapter
if i + 1 < len(chapters):
# The end page is one before the start of the next chapter
end_page = chapters[i+1]['start_page'] - 1
else:
# This is the last chapter, so it goes to the end of the document
end_page = total_pages
if start_page > end_page:
print(f"Warning: Chapter '{chapter['title']}' starts on page {start_page} but the next chapter starts before it. Skipping.")
continue
print(f" -> Extracting Chapter: '{chapter['title']}' (Pages {start_page}-{end_page})...")
# Extract text from the page range
chapter_text_parts = []
# PyMuPDF uses 0-based indexing, so subtract 1
for page_num in range(start_page - 1, end_page):
if page_num < total_pages:
page = doc.load_page(page_num)
chapter_text_parts.append(page.get_text())
full_text = "".join(chapter_text_parts)
cleaned_text = clean_text(full_text)
# Save to a sanitized file
sanitized_title = sanitize_filename(chapter['title'])
output_filename = f"{i+1:02d}_{sanitized_title}.txt"
output_path = os.path.join(output_dir, output_filename)
with open(output_path, 'w', encoding='utf-8') as f:
f.write(cleaned_text)
print(f" Saved to '{output_path}'")
doc.close()
print("\nProcessing complete.")
def main():
parser = argparse.ArgumentParser(
description="Splits a PDF into separate text files based on a chapter map."
)
parser.add_argument(
"input_pdf",
type=str,
help="The path to the input PDF file."
)
parser.add_argument(
"--chapters",
type=str,
required=True,
help="A comma-separated string defining the chapters. "
"Format: 'page:Title,page:Title,...' e.g., '1:Introduction,15:Setup,30:First Steps'"
)
parser.add_argument(
"-o", "--output_dir",
type=str,
default="book_chapters",
help="The directory to save the chapter text files. (Default: 'book_chapters')"
)
args = parser.parse_args()
try:
chapter_definitions = parse_chapter_definitions(args.chapters)
split_pdf_into_chapters(args.input_pdf, chapter_definitions, args.output_dir)
print(f"\n✅ Success! Chapter files are located in the '{args.output_dir}' directory.")
except (FileNotFoundError, ValueError) as e:
print(f"\n❌ An error occurred: {e}")
if __name__ == "__main__":
main()
How to Use the Script¶
This is the most powerful part. You just need to look at your book's Table of Contents once.
1. Get Your Chapter Information: Open your PDF and note the starting page number and title for each chapter. For example: * Introduction: Page 1 * Chapter 1 - Getting Started: Page 15 * Chapter 2 - Variables and Types: Page 30 * Appendix: Page 250
2. Open your terminal and navigate to where you saved the script.
3. Run the Command:
Construct a command using the --chapters flag.
python split_pdf_by_chapter.py "My Julia Book.pdf" --chapters "1:Introduction,15:Chapter 1 - Getting Started,30:Chapter 2 - Variables and Types,250:Appendix"
What Happens:
The script will:
1. Create a directory named book_chapters (or whatever you specify with -o).
2. Read the PDF.
3. For the "Introduction," it will extract text from page 1 up to page 14.
4. It will save this text, cleaned, into a file named 01_introduction.txt.
5. For "Chapter 1," it will extract text from page 15 up to page 29.
6. It will save this to 02_chapter_1_-_getting_started.txt.
7. ...and so on, until the "Appendix," which will be extracted from page 250 to the end of the PDF.
The output of this script is a perfectly organized set of files, ready to be processed by 1_process_and_structure.py without any further manual work. This is the definitive way to start your RAG pipeline.
Sweet chunk size discussion¶
Excellent question. Now that we're moving from high-level strategy to implementation, getting this specific number right is critical.
Based on current best practices and the goal of feeding a local 7B LLM, here is my detailed recommendation for the chunk size.
The Sweet Spot: 150-350 Words (approx. 200-512 Tokens)¶
This range is the most effective starting point for a technical book. It balances the two competing needs of your RAG system:
-
Retrieval Precision (Why smaller is better): You want your chunks to be small and dense with a single, specific idea. When a user asks a question, the vector search can find a very precise match without being diluted by surrounding, irrelevant text. This is what a small chunk gives you.
-
Contextual Richness (Why bigger is better): The chunk must contain enough context for the LLM to understand what it's about. If a chunk is just one sentence, the LLM might not have enough information to generate a useful answer. A larger chunk provides this necessary context.
The 150-350 word range is the sweet spot that generally satisfies both needs.
A Concrete Starting Point for Your Experiment¶
Start with a chunk size of 200 words.
This is a fantastic, well-rounded number for a first pass: * It's large enough to capture a full paragraph or a complete thought. * It's small enough to be highly specific and precise for retrieval. * It's well within the optimal input size for most embedding models (many are optimized for 512 tokens or less).
The Crucial Partner to Chunk Size: Chunk Overlap¶
Simply splitting your text every 200 words is not enough. You risk cutting a key idea or a code block right in the middle. The solution is chunk overlap.
Recommendation: Use an overlap of 10-20% of your chunk size.
- For a 200-word chunk size, use an overlap of
20to40words.
How it works: * Chunk 1: Contains words 1-200. * Chunk 2: Contains words 180-380. * Chunk 3: Contains words 360-560. * ...and so on.
Why is this so important? It ensures that if a single thought or sentence spans the boundary of a chunk, it will be fully contained in at least one of the chunks. This dramatically reduces the chance of fragmented context and is one of the most effective "tricks" for improving RAG performance.
Summary Table for Your Decision¶
| Chunk Size (Words) | Retrieval Precision | Context for LLM | Best For... | Recommendation |
|---|---|---|---|---|
| 50-100 | Very High | Low (Risky) | Q&A, fact extraction. Prone to fragmented answers. | Too small for a book. Use only if you have a very advanced "Parent Document" retrieval strategy. |
| 150-350 | High | Good (Balanced) | Technical documents, how-to guides, textbooks. | This is the recommended range. Start with 200 words and 20-40 words of overlap. |
| 400-800 | Medium | Very High | Narrative text, legal documents, broad summarization. | Too large for your use case. You will retrieve too much irrelevant noise, confusing your local LLM. |
Final Actionable Advice¶
- Modify your
extract_pdf_text.pyor create a new script that implements a text splitter. Many libraries (like LangChain or LlamaIndex) have pre-built "RecursiveCharacterTextSplitter" functions that handle this perfectly. - Set your splitter's parameters to a
chunk_sizeof200words and achunk_overlapof40words. - Process your chapter files with this splitter.
- Proceed with the rest of your pipeline (
1_process_and_structure.pyand2_ingest_into_chromadb.py).
This will give you an extremely strong baseline for your RAG system. You can always experiment with slightly larger or smaller chunks later, but this is the professional-grade starting point.