Update QWen LLM to Speed up Teaching

Since your app relies on FalkorDB to provide the facts, the LLM does not need to "know" the physics or history itself; it just needs to be smart enough to read the context and follow instructions (teach/grade).

This lowers the intelligence requirement significantly. You do not need the massive 72B model.

The fastest viable model for your M1 Max (64GB) is:

The Winner: Qwen2.5-14B-Instruct¶

Speed on M1 Max: ~40 to 50 tokens/second.
- Experience: This generates text faster than you can comfortably read. It feels "Instant."
Size: ~9 GB (at 4-bit quantization).
Why it works:
- RAG Proficiency: The 14B model is excellent at "Context Adherence." If FalkorDB gives it a chunk of text, it sticks to it without hallucinating.
- Instruction Following: It is smart enough to output the valid JSON needed for your UI (Mermaid diagrams, Grading scores), which smaller models (7B/3B) often mess up.

How to get it:

ollama pull qwen2.5:14b

The "Speed Demon" Alternative: Qwen2.5-7B-Instruct¶

If you want "Video Game Speed" (where the text flies onto the screen), you can drop down to the 7B model.

Speed on M1 Max: ~80 to 100+ tokens/second.
Size: ~5 GB.
The Risk: It is "chattier." It might sometimes ignore your instruction to "be concise" or it might make small syntax errors in the JSON grading.
Use Case: Use this for the Lectures (Phase 2), but stick to the 14B for the Grading (Phase 3) where logic matters more than speed.

My Recommendation¶

Start with Qwen2.5-14B. It is the perfect balance. It is faster than human speech (great for TTS) but smart enough to act like a Professor.

Update your app.py or prompts.py to point to: model="qwen2.5:14b"