Museum Personal Guide

This is a fantastic concept for a museum app. The requirement for local, small LLM is key, as it provides instant, offline, and private interaction, which is a significant advantage in large museum settings (where Wi-Fi/cellular might be spotty) or for historical/private collections.

Here is the recommended local LLM and a detailed implementation path.

Recommended Local Multimodal LLM¶

The core of your application requires a Vision-Language Model (VLM) that can take an image and a user's question (in text form, after Speech-to-Text) and generate a relevant text answer.

Component	Recommended Model(s)	Key Feature for Museum App
Core VLM (Image + Text)	Qwen2.5-VL-7B-Instruct	A compact 7B-parameter vision-language model explicitly optimized for mobile deployment and visual comprehension (analyzing images and text within them).
Alternative VLM	Gemma-3n (Multimodal)	Designed for Edge AI/mobile and supports vision, text, and audio, making it a powerful, pre-integrated multimodal option.
LLM Inference Engine	llama.cpp / MLLM	The foundational C++ library for high-speed, quantized model inference on resource-constrained devices (supports many models including Llama and Qwen).

The "Local" Constraint Requires Three Components:¶

To achieve your full vision (Picture \(\rightarrow\) AI \(\rightarrow\) Audio \(\rightarrow\) Follow-up Question), you need to chain three distinct models locally:

Speech-to-Text (STT): User's audio question \(\rightarrow\) Text.
Vision-Language Model (VLM): User's Picture + Text \(\rightarrow\) AI Text Answer.
Text-to-Speech (TTS): AI Text Answer \(\rightarrow\) Audio Output.

Implementation Path: A 4-Stage Architecture¶

Your implementation path should focus on integrating these three local models using an efficient mobile deployment framework.

Stage 1: The Foundation (Mobile Framework & LLM Inference)¶

This is the glue that holds the entire local AI system together.

Choose a Cross-Platform Framework: For rapid development and cross-OS compatibility, consider React Native (using bindings like llama.rn) or a native development approach (Swift for iOS, Kotlin/Java for Android).
Select the Inference Engine: Implement the core model logic using llama.cpp. This C++ library is the industry standard for fast, on-device LLM inference and is optimized for models saved in the GGUF format, which is a highly-quantized format for CPU/NPU use on mobile.
Model Preparation: Download and quantize your chosen VLM (e.g., Qwen2.5-VL-7B-Instruct) into a high-efficiency format (like GGUF or a format optimized for the MLLM engine) that fits within the memory limits of a modern smartphone (typically aiming for under 4GB total model size).

Stage 2: The Input—Speech-to-Text (STT)¶

You need to convert the user's spoken question ("What is this painting about?") into text that the VLM can process.

Recommended STT Library: Use Vosk API. Vosk is an open-source, offline speech recognition toolkit with small, portable models (around 50MB per language) that run on both Android and iOS with low latency.
Implementation: The app records the user's voice, passes the audio to the local Vosk model, and receives the transcribed text.

Stage 3: The Core Logic—Vision-Language Model (VLM)¶

This is where the intelligence lies, combining the image and text question.

Input Flow: The app takes the camera photo and the transcribed text from the STT model (Stage 2).
VLM Processing: The VLM (e.g., Qwen2.5-VL-7B-Instruct running on llama.cpp or MLLM) processes both the image pixels (or a visual embedding derived from the image) and the text prompt to generate a textual response.
- Prompt Example: [Image of a painting] + "What is the historical significance of the person wearing red in this painting?"
- Output: "The person wearing red is likely a patron or donor..."

Stage 4: The Output—Text-to-Speech (TTS)¶

You need to convert the VLM's text answer into clear, natural-sounding audio.

Recommended TTS Library: The most compelling local, open-source choice is NeuTTS Air or a comparable lightweight alternative like MeloTTS. NeuTTS Air is built off a small LLM backbone and is specifically optimized for on-device deployment in the GGUF/GGML format, ensuring real-time, natural-sounding audio output without a cloud connection.
Implementation: The AI's generated text answer is streamed or passed as a block to the local TTS model, which generates the audio waveform and plays it through the device's speaker.

Architectural Summary (User Experience Flow)¶

User Action: User takes a picture of a museum object and taps a "Speak" button.
Initial Input (STT): User speaks their question. (Vosk runs locally)
Processing (VLM): The picture and the transcribed text are fed to the local Qwen2.5-VL-7B-Instruct. (llama.cpp/MLLM runs locally)
AI Output (TTS): The VLM's text response is fed to NeuTTS Air, which generates and streams the audio. (NeuTTS Air runs locally)
Loop for Follow-up: The app remains open for the user to ask a follow-up question, restarting the cycle from Step 2, ensuring a fluid, conversational experience.