
Prompts to create a tokenized text file from a PDF book

This task calls for a multi-step approach because the Gemini model has to write an application that combines file I/O, text cleaning, chunking logic, and a specific AI task (meaning-based tokenization).

Here is a sequence of prompts you can use to guide the Gemini model to write the complete application. This assumes you are using a chat session or iterating on a single file.


1. Environment & Setup Prompt (Initial Code)

This first prompt asks Gemini to set up the file handling and initial text preparation, which is best done with a specialized library before passing the clean text to the model.

PROMPT 1: Create the Python Project Structure and PDF Text Extractor

You are an expert Python developer specializing in the Google GenAI SDK. 

Your first task is to write the complete Python code for the initial setup.

1.  **Imports:** Include all necessary imports for `google-genai` (`from google import genai`), file handling, and the `pymupdf4llm` library (it extracts PDF text as Markdown, which suits LLM input).
2.  **Client Initialization:** Initialize the `genai.Client()`.
3.  **`extract_and_clean_pdf(file_path)` function:**
    *   It must take a PDF file path as input.
    *   It should use `pymupdf4llm.to_markdown()` to extract text, which is an excellent way to clean text and preserve structural context (headers, lists) for the LLM.
    *   The function should return the entire cleaned text content as a single string.
4.  **`chunk_text(text, word_count)` function:**
    *   It must take the full cleaned text and a `word_count` (e.g., 500) as input.
    *   It should split the text into chunks, ensuring that each chunk is as close to the target word count as possible without breaking major semantic boundaries (like double newlines `\n\n`).
    *   The function should return a list of text chunks.
5.  **Main Execution Block:** Define the PDF book file path (`PDF_FILE_PATH = "book.pdf"`), the chunk size (`CHUNK_SIZE = 500`), and call the functions to get the list of chunks. Print the number of generated chunks.

**Focus on clean, runnable code and include a clear Python docstring for each function.**
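For orientation, the chunking step requested in Prompt 1 could look roughly like the sketch below. It is only one possible reading of the requirement, not the code Gemini will necessarily produce: it assumes that "major semantic boundaries" means blank-line paragraph breaks and that a single paragraph longer than the target is kept whole as an oversized chunk.

```python
def chunk_text(text: str, word_count: int = 500) -> list[str]:
    """Group paragraphs into chunks of roughly `word_count` words.

    Paragraphs are split on blank lines and never cut in half, so a chunk
    can fall short of the target, or exceed it when one paragraph alone
    is longer than `word_count`.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0

    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if this paragraph would overshoot the target.
        if current and current_words + n > word_count:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n

    # Flush whatever is left after the last paragraph.
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

In the full script, `text` would be the Markdown string returned by `pymupdf4llm.to_markdown(PDF_FILE_PATH)`.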

2. Core Logic Prompt (Meaning-Based Tokenization)

This prompt asks Gemini to define the core AI logic: the system instruction, the structured output (JSON schema) for the meaning-based tokens, and the function that calls the Gemini API.

PROMPT 2: Implement the Gemini API Tokenization and Structured Output

Continue with the Python code from the previous step. Your next task is to implement the meaning-based tokenization and structured data generation.

1.  **Define a save function:** Create a simple Python function called `save_chunk_data(data)` that takes a string of data and appends it to a file named `tokens_for_vector_db.txt`. This function will be exposed to the model as a tool for function calling.
2.  **Define Function Declaration:** Create the necessary `FunctionDeclaration` for `save_chunk_data` to make it available as a tool to the model.
3.  **Define the JSON Output Schema:** Create a `response_schema` (Pydantic or `types.Schema`) that the model must follow. The schema should have two main fields:
    *   `expanded_chunk` (string): The original text chunk (approximately 500 words).
    *   `semantic_tokens` (list of strings): A list of key terms, phrases, or conceptual units that represent the core meaning of the chunk.
4.  **`get_meaning_tokens(chunk)` function:**
    *   It must take a single `chunk` of text as input.
    *   It should use a strong, reasoning-focused model like `gemini-2.5-pro`.
    *   The prompt (as a `system_instruction`) must be: **"You are a sophisticated RAG data pre-processor. Your job is to semantically chunk the provided text by extracting all core concepts, proper nouns, technical terms, and high-level summaries into a list of 'semantic tokens.' Each semantic token should capture a key idea from the text for improved vector embedding and retrieval. You must strictly adhere to the provided JSON schema."**
    *   The function should return the structured JSON object from the model's response.
5.  **Update Main Block:** In the main block, loop through the list of chunks, call `get_meaning_tokens()` for each, and then use the output to call the `save_chunk_data` function.

**Ensure the final code is complete and ready to run, incorporating the functions from Prompt 1.**
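Whatever code Gemini generates for Prompt 2, it can be useful to check each response against the two-field shape described above before saving it. The sketch below does this with the standard library only; the helper name `parse_chunk_record` is hypothetical and is not part of the prompts or of the Google GenAI SDK.

```python
import json


def parse_chunk_record(raw_json: str) -> dict:
    """Parse a model response and check it has the two fields Prompt 2 asks for.

    Raises ValueError if the text is not valid JSON or a field has the
    wrong type.
    """
    record = json.loads(raw_json)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(record.get("expanded_chunk"), str):
        raise ValueError("'expanded_chunk' must be a string")
    tokens = record.get("semantic_tokens")
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise ValueError("'semantic_tokens' must be a list of strings")
    return record
```

A check like this catches malformed or truncated responses early, so a bad record never reaches `tokens_for_vector_db.txt`.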

3. Final Review & Consolidation Prompt (Refinement)

This final prompt asks Gemini to clean up the code, add error handling, and write the output in a format suited to loading into a vector database.

PROMPT 3: Review, Error Handling, and Final Output Format

Review the complete Python code written so far.

1.  **Add a `try...except` block:** Wrap the Gemini API call in `get_meaning_tokens` in error handling so that oversized chunks or API failures do not crash the whole run.
2.  **Optimize Saving:** Ensure that the data saved to `tokens_for_vector_db.txt` is written in a clean, easily parsable JSON Lines format, with each line being a complete JSON object corresponding to one chunk's structured output. JSON Lines is convenient for bulk loading into a vector database because records can be read one line at a time.
3.  **Final Code:** Provide the single, complete, and fully commented Python script that combines all the functions and logic defined in Prompt 1 and Prompt 2.

**The final output must be a single, complete Python script.**
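The two refinements requested in Prompt 3 can be sketched as follows. This is an illustration under stated assumptions rather than the script Gemini will return: `save_chunk_data` is assumed here to accept a dict and an optional path (Prompt 2 describes it as taking a string), `process_chunks` is a hypothetical helper, and `tokenize` stands in for `get_meaning_tokens`.

```python
import json


def save_chunk_data(record: dict, path: str = "tokens_for_vector_db.txt") -> None:
    """Append one record to the output file as a single JSON Lines row."""
    with open(path, "a", encoding="utf-8") as f:
        # json.dumps escapes embedded newlines, so each record stays on one line.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def process_chunks(chunks, tokenize, path: str = "tokens_for_vector_db.txt") -> list[int]:
    """Run `tokenize` on every chunk, saving successes and skipping failures.

    `tokenize` is any callable that maps a chunk to a dict; in the real
    script it would be get_meaning_tokens. Returns the indices of the
    chunks that failed so they can be retried later.
    """
    failed: list[int] = []
    for i, chunk in enumerate(chunks):
        try:
            record = tokenize(chunk)
        except Exception as exc:  # broad on purpose: API and parsing errors alike
            print(f"chunk {i} failed: {exc}")
            failed.append(i)
            continue
        save_chunk_data(record, path)
    return failed
```

Returning the failed indices instead of raising keeps a long run going when a single request fails, and leaves a clear list of chunks to retry.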