Museum Personal Guide
This is a fantastic concept for a museum app. The requirement for local, small LLM is key, as it provides instant, offline, and private interaction, which is a significant advantage in large museum settings (where Wi-Fi/cellular might be spotty) or for historical/private collections.
Here is the recommended local LLM and a detailed implementation path.
Recommended Local Multimodal LLM¶
The core of your application requires a Vision-Language Model (VLM) that can take an image and a user's question (in text form, after Speech-to-Text) and generate a relevant text answer.
| Component | Recommended Model(s) | Key Feature for Museum App |
|---|---|---|
| Core VLM (Image + Text) | Qwen2.5-VL-7B-Instruct | A compact 7B-parameter vision-language model explicitly optimized for mobile deployment and visual comprehension (analyzing images and text within them). |
| Alternative VLM | Gemma-3n (Multimodal) | Designed for Edge AI/mobile and supports vision, text, and audio, making it a powerful, pre-integrated multimodal option. |
| LLM Inference Engine | llama.cpp / MLLM | The foundational C++ library for high-speed, quantized model inference on resource-constrained devices (supports many models including Llama and Qwen). |
The "Local" Constraint Requires Three Components:¶
To achieve your full vision (Picture \(\rightarrow\) AI \(\rightarrow\) Audio \(\rightarrow\) Follow-up Question), you need to chain three distinct models locally:
- Speech-to-Text (STT): User's audio question \(\rightarrow\) Text.
- Vision-Language Model (VLM): User's Picture + Text \(\rightarrow\) AI Text Answer.
- Text-to-Speech (TTS): AI Text Answer \(\rightarrow\) Audio Output.
Implementation Path: A 4-Stage Architecture¶
Your implementation path should focus on integrating these three local models using an efficient mobile deployment framework.
Stage 1: The Foundation (Mobile Framework & LLM Inference)¶
This is the glue that holds the entire local AI system together.
- Choose a Cross-Platform Framework: For rapid development and cross-OS compatibility, consider React Native (using bindings like
llama.rn) or a native development approach (Swift for iOS, Kotlin/Java for Android). - Select the Inference Engine: Implement the core model logic using llama.cpp. This C++ library is the industry standard for fast, on-device LLM inference and is optimized for models saved in the GGUF format, which is a highly-quantized format for CPU/NPU use on mobile.
- Model Preparation: Download and quantize your chosen VLM (e.g., Qwen2.5-VL-7B-Instruct) into a high-efficiency format (like GGUF or a format optimized for the MLLM engine) that fits within the memory limits of a modern smartphone (typically aiming for under 4GB total model size).
Stage 2: The Input—Speech-to-Text (STT)¶
You need to convert the user's spoken question ("What is this painting about?") into text that the VLM can process.
- Recommended STT Library: Use Vosk API. Vosk is an open-source, offline speech recognition toolkit with small, portable models (around 50MB per language) that run on both Android and iOS with low latency.
- Implementation: The app records the user's voice, passes the audio to the local Vosk model, and receives the transcribed text.
Stage 3: The Core Logic—Vision-Language Model (VLM)¶
This is where the intelligence lies, combining the image and text question.
- Input Flow: The app takes the camera photo and the transcribed text from the STT model (Stage 2).
- VLM Processing: The VLM (e.g., Qwen2.5-VL-7B-Instruct running on
llama.cppor MLLM) processes both the image pixels (or a visual embedding derived from the image) and the text prompt to generate a textual response.- Prompt Example:
[Image of a painting] + "What is the historical significance of the person wearing red in this painting?" - Output:
"The person wearing red is likely a patron or donor..."
- Prompt Example:
Stage 4: The Output—Text-to-Speech (TTS)¶
You need to convert the VLM's text answer into clear, natural-sounding audio.
- Recommended TTS Library: The most compelling local, open-source choice is NeuTTS Air or a comparable lightweight alternative like MeloTTS. NeuTTS Air is built off a small LLM backbone and is specifically optimized for on-device deployment in the GGUF/GGML format, ensuring real-time, natural-sounding audio output without a cloud connection.
- Implementation: The AI's generated text answer is streamed or passed as a block to the local TTS model, which generates the audio waveform and plays it through the device's speaker.
Architectural Summary (User Experience Flow)¶
- User Action: User takes a picture of a museum object and taps a "Speak" button.
- Initial Input (STT): User speaks their question. (Vosk runs locally)
- Processing (VLM): The picture and the transcribed text are fed to the local Qwen2.5-VL-7B-Instruct. (llama.cpp/MLLM runs locally)
- AI Output (TTS): The VLM's text response is fed to NeuTTS Air, which generates and streams the audio. (NeuTTS Air runs locally)
- Loop for Follow-up: The app remains open for the user to ask a follow-up question, restarting the cycle from Step 2, ensuring a fluid, conversational experience.