Execution Detail Plan
This is an exceptionally high-quality architectural specification. The PDF correctly identifies that the "hard part" of this app isn't transcription (which is a solved problem), but Contextual Continuity: the ability of the AI to "remember" that you talked about your high school crush in Session 1 when you are discussing your wedding in Session 10.
By choosing Gemini 1.5 Pro for direct audio ingestion (Multimodal), you are skipping the error-prone "Speech-to-Text" layer entirely. This is the bleeding edge of current AI capability.
Here is your Executable Plan to build this on Windows 11 using WSL (Ubuntu). This setup allows you to keep your heavy Android tools on Windows while running your backend/AI logic in the fast Linux environment.
Part 1: The "Hybrid" Development Environment (Win11 + WSL)
Developing mobile apps in WSL has one specific pain point: USB Debugging. WSL cannot easily "see" the Android phone plugged into your USB port. We will use a TCP bridge to solve this.
Step 1.1: Windows Side (Host)
- Install Android Studio: Install it on Windows (not WSL). Install the Android SDK and Emulator.
- Install Flutter: Install Flutter on Windows and add it to your path.
- Enable ADB Bridge:
  - Plug in your Android phone (or start an emulator).
  - Open PowerShell and run: adb tcpip 5555
  - This switches the device's adb daemon to listen on TCP port 5555 so that WSL can talk to the phone over the network.
Step 1.2: WSL Side (The Brain)
Open your Ubuntu terminal in WSL.
- Install Dependencies:
- Install Flutter (in WSL):
  - Download the Flutter Linux tarball and extract it to ~/development/flutter.
  - Add to .bashrc: export PATH="$PATH:$HOME/development/flutter/bin"
- Install Node.js (for Firebase Functions):
- Install Firebase Tools:
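- Connect to Windows ADB:
  - In WSL, run: adb connect $(cat /etc/resolv.conf | grep nameserver | awk '{print $2}'):5555
  - Run flutter devices. You should now see your Windows-connected phone inside Linux!
  - If the connection is refused: adb tcpip 5555 opens the port on the device itself, while the command above targets the Windows host's address as seen from WSL. For a physical phone on the same Wi-Fi network, connecting to the phone's own IP address (adb connect PHONE_IP:5555) is the usual alternative.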
Part 2: Project Initialization (The Scaffold)
We will follow the PDF's structure: Flutter Client + Firebase Backend.
- Create the Project:
- Initialize Firebase:
- Add PDF-Recommended Dependencies:
Inside pubspec.yaml:
Part 3: Phase 1 Implementation (The Capture Engine)
This is the "MVP" described in section 4 of your PDF.
Step 3.1: The Recorder (Flutter)
Location: lib/services/audio_recorder.dart
We need to record AAC/M4A as specified to save bandwidth while maintaining quality for the AI.
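import 'package:flutter_sound/flutter_sound.dart';
import 'package:path_provider/path_provider.dart';

class AudioRecorderService {
  final FlutterSoundRecorder _recorder = FlutterSoundRecorder();

  /// Opens the recorder session. Microphone permission must already be granted.
  Future<void> init() async {
    await _recorder.openRecorder();
  }

  /// Starts recording to a temporary file and returns its path.
  Future<String> startRecording() async {
    final dir = await getTemporaryDirectory();
    // Timestamp-based unique filename
    final fileName = 'session_${DateTime.now().millisecondsSinceEpoch}.m4a';
    final path = '${dir.path}/$fileName';

    // 64 kbps AAC as per PDF spec
    await _recorder.startRecorder(
      toFile: path,
      // MP4 container, matching the .m4a extension (aacADTS would write a raw .aac stream)
      codec: Codec.aacMP4,
      bitRate: 64000,
      sampleRate: 16000,
    );
    return path;
  }

  /// Stops recording and returns the location of the recorded file, if any.
  Future<String?> stopRecording() async {
    return await _recorder.stopRecorder();
  }

  /// Releases the recorder session when the service is no longer needed.
  Future<void> dispose() async {
    await _recorder.closeRecorder();
  }
}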
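To put the 64 kbps setting in perspective, the bandwidth saving can be estimated with simple arithmetic. The sketch below is illustrative only: `estimateAudioBytes` is a hypothetical helper, it assumes a roughly constant bitrate, and it ignores container overhead. The 16-bit mono PCM comparison is an assumption chosen to match the 16 kHz sample rate used in the recorder.

```typescript
// Approximate size of a recording at a given bitrate (bits per second).
// Assumes constant bitrate and ignores container overhead.
function estimateAudioBytes(bitRateBps: number, durationSeconds: number): number {
  return Math.round((bitRateBps / 8) * durationSeconds);
}

const ONE_HOUR_SECONDS = 60 * 60;

// 64 kbps AAC, as configured in the recorder
const aacBytes = estimateAudioBytes(64000, ONE_HOUR_SECONDS);

// Uncompressed 16-bit mono PCM at 16 kHz = 16000 samples/s * 16 bits
const pcmBytes = estimateAudioBytes(16000 * 16, ONE_HOUR_SECONDS);

console.log(`AAC: ${(aacBytes / 1_000_000).toFixed(1)} MB per hour`); // AAC: 28.8 MB per hour
console.log(`PCM: ${(pcmBytes / 1_000_000).toFixed(1)} MB per hour`); // PCM: 115.2 MB per hour
```

Under these assumptions an hour-long session is roughly a quarter the size of the uncompressed equivalent, which matters for uploads over mobile data.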
Step 3.2: The Uploader (Flutter)
Location: lib/services/upload_service.dart
We upload to a specific path that the Cloud Function watches: users/{uid}/raw_audio/{file}.
import 'dart:io';

import 'package:firebase_auth/firebase_auth.dart';
import 'package:firebase_storage/firebase_storage.dart';

Future<void> uploadAudio(String filePath) async {
  final file = File(filePath);
  final user = FirebaseAuth.instance.currentUser;
  if (user == null) {
    throw StateError('uploadAudio requires a signed-in user');
  }
  final fileName = filePath.split('/').last;

  // Target path watched by the Cloud Function: users/{uid}/raw_audio/{file}
  final ref = FirebaseStorage.instance
      .ref()
      .child('users/${user.uid}/raw_audio/$fileName');

  // putFile returns an UploadTask that reports progress for long sessions
  final UploadTask task = ref.putFile(
    file,
    SettableMetadata(contentType: 'audio/mp4'), // registered MIME type for M4A
  );
  task.snapshotEvents.listen((event) {
    final percent = event.bytesTransferred / event.totalBytes * 100;
    print('Progress: ${percent.toStringAsFixed(1)} %');
  });
  await task;
}
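On the backend side, the users/{uid}/raw_audio/{file} convention is what lets the Cloud Function recover the user ID from the object path. A stricter check than a substring match could look like the sketch below. `parseRawAudioPath` and `RawAudioPath` are hypothetical names introduced for illustration, not part of any Firebase API.

```typescript
interface RawAudioPath {
  uid: string;
  fileName: string;
}

// Returns the parts of "users/{uid}/raw_audio/{file}",
// or null for any path that does not match that exact layout.
function parseRawAudioPath(objectPath: string): RawAudioPath | null {
  const match = /^users\/([^/]+)\/raw_audio\/([^/]+)$/.exec(objectPath);
  if (!match) {
    return null;
  }
  return { uid: match[1], fileName: match[2] };
}

const parsed = parseRawAudioPath("users/abc123/raw_audio/session_1700000000000.m4a");
console.log(parsed?.uid); // abc123
```

Anchoring the pattern at both ends means a file uploaded to some other location that merely contains "raw_audio/" is ignored rather than attributed to the wrong user.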
Step 3.3: The "Brain" (Cloud Functions + Vertex AI)
Location: functions/src/index.ts
This is the most critical code. It triggers when audio lands, sends it to Gemini 1.5 Pro, and saves the structured JSON to Firestore.
Note: You must enable the "Vertex AI API" in your Google Cloud Console for this to work.
import * as v2 from "firebase-functions/v2";
import * as admin from "firebase-admin";
import { VertexAI } from "@google-cloud/vertexai";

admin.initializeApp();
const db = admin.firestore();

// Initialize Vertex AI
const vertexAI = new VertexAI({ project: process.env.GCLOUD_PROJECT ?? "", location: "us-central1" });
// Preview model IDs are eventually retired; check which Gemini 1.5 Pro ID is currently available.
const model = vertexAI.getGenerativeModel({ model: "gemini-1.5-pro-preview-0409" });

export const processAudio = v2.storage.onObjectFinalized(
  // Event-driven 2nd gen functions are capped at 540 s (9 min); the 60 min limit applies to HTTP functions only
  { timeoutSeconds: 540, memory: "2GiB" },
  async (event) => {
    const fileBucket = event.data.bucket;
    const filePath = event.data.name;
    const contentType = event.data.contentType;

    // Only process audio in the raw_audio folder
    if (!filePath || !filePath.includes("raw_audio/") || !contentType?.startsWith("audio/")) {
      return;
    }
    const uid = filePath.split("/")[1]; // Path structure is users/{uid}/raw_audio/{file}

    // Construct the GCS URI (gs://...) required by Gemini
    const gcsUri = `gs://${fileBucket}/${filePath}`;
    const prompt = `
      You are an expert biographer. Listen to this audio file.
      1. Transcribe the audio verbatim.
      2. Identify the specific time period discussed (e.g., "High School", "1990s").
      3. Extract key entities (People, Places).
      4. Detect the emotional tone.
      5. Write a short summary.
      Return ONLY valid JSON in this format:
      {
        "transcript": "string",
        "summary": "string",
        "timePeriod": "string",
        "entities": ["string"],
        "emotion": "string"
      }
    `;

    // Multimodal call: audio file URI + text prompt in a single user turn
    const result = await model.generateContent({
      contents: [
        {
          role: "user",
          parts: [
            { fileData: { mimeType: contentType, fileUri: gcsUri } },
            { text: prompt },
          ],
        },
      ],
    });
    const responseText = result.response.candidates?.[0]?.content.parts[0]?.text;

    // Gemini often wraps JSON in ```json ... ```, so we clean that.
    const cleanJson = responseText?.replace(/```json|```/g, "").trim();
    let data: Record<string, unknown>;
    try {
      data = JSON.parse(cleanJson || "{}");
    } catch (err) {
      // Skip the write if the model returned something that is not JSON
      console.error(`Could not parse model output for ${filePath}`, err);
      return;
    }

    // Write to Firestore "Memories" collection
    await db.collection(`users/${uid}/memories`).add({
      ...data,
      audioRef: gcsUri,
      createdAt: admin.firestore.FieldValue.serverTimestamp(),
      processed: true,
    });
  }
);
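Because the model's reply is free text, it can help to check the shape of the parsed object before anything is written to Firestore. The following is a minimal sketch of such a check, using the five fields requested in the prompt. `MemoryExtraction` and `parseModelJson` are hypothetical names introduced here, and the fence-stripping step assumes the reply is wrapped in a Markdown code fence (three backticks, optionally followed by "json").

```typescript
interface MemoryExtraction {
  transcript: string;
  summary: string;
  timePeriod: string;
  entities: string[];
  emotion: string;
}

// Returns the validated extraction, or null if the reply is not usable.
function parseModelJson(responseText: string): MemoryExtraction | null {
  // Remove Markdown code-fence markers (three backticks, optional "json" tag).
  const fence = new RegExp("`{3}(?:json)?", "g");
  const cleaned = responseText.replace(fence, "").trim();

  let parsed: unknown;
  try {
    parsed = JSON.parse(cleaned);
  } catch {
    return null; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) {
    return null;
  }

  const record = parsed as Record<string, unknown>;
  const stringFields = ["transcript", "summary", "timePeriod", "emotion"];
  if (!stringFields.every((key) => typeof record[key] === "string")) {
    return null;
  }
  const entities = record["entities"];
  if (!Array.isArray(entities) || !entities.every((e) => typeof e === "string")) {
    return null;
  }

  return {
    transcript: record["transcript"] as string,
    summary: record["summary"] as string,
    timePeriod: record["timePeriod"] as string,
    entities: entities as string[],
    emotion: record["emotion"] as string,
  };
}
```

A null result could be logged and the document written with processed set to false, so that a failed extraction is visible instead of silently producing an empty memory.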
Part 4: The Data Structure (Firestore)
Your PDF (Page 9) outlines the schema perfectly. To support the Timeline View and Vector Search, you need to configure indexes in Google Cloud Console.
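- Composite Index (for Timeline):
  - Collection: memories
  - Fields: userId (Ascending), estimatedDate (Ascending)
  - Why? Allows where('userId', '==', me).orderBy('estimatedDate').
  - Note: the function in Step 3.3 writes to users/{uid}/memories and stores timePeriod, not userId or estimatedDate, so both fields must be present on each document for this query to return results.
- Vector Index (for RAG):
  - This is done via command line (gcloud) as shown in the PDF.
  - Run this in your WSL terminal: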
Part 5: Suggestion & Expansion on the "Writing" Process
The PDF covers Phase 2 (RAG) well, but here is a specific suggestion for the User Interface (UI) during the "Writing" phase to make it feel magical.
The "Gap Detector" UI
Don't just show a list of recordings. Create a visual Life Map.
- The Logic: When the app loads, fetch all memories. Calculate the time range (e.g., 1980 - 2024). Divide the range into "Eras" (Childhood, Teens, 20s).
- The Visual: If the user has 5 recordings from "Teens" but 0 from "20s", the app should render a "Fog of War" over the 20s section.
- The Interaction: When the user taps the foggy "20s" section, the AI (using the RAG context of the previous era) prompts: "We know you graduated in 1998. What happened immediately after that? Did you move?"
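The era bucketing behind the Life Map can be sketched as a small pure function. Everything here is an assumption made for illustration: the age boundaries of each era, the `Era`, `EraCoverage` and `detectGaps` names, and the premise that each memory can be reduced to a calendar year (the stored timePeriod is free text, so that year would have to be derived first).

```typescript
interface Era {
  name: string;
  startAge: number; // inclusive
  endAge: number; // exclusive
}

interface EraCoverage {
  era: string;
  recordings: number;
  foggy: boolean; // true when the era has no recordings yet
}

// Illustrative boundaries only; the plan names the eras but not their ranges.
const DEFAULT_ERAS: Era[] = [
  { name: "Childhood", startAge: 0, endAge: 13 },
  { name: "Teens", startAge: 13, endAge: 20 },
  { name: "20s", startAge: 20, endAge: 30 },
];

// Counts recordings per era and flags the eras that should be rendered as fog.
function detectGaps(
  birthYear: number,
  memoryYears: number[],
  eras: Era[] = DEFAULT_ERAS
): EraCoverage[] {
  return eras.map((era) => {
    const recordings = memoryYears.filter((year) => {
      const age = year - birthYear;
      return age >= era.startAge && age < era.endAge;
    }).length;
    return { era: era.name, recordings, foggy: recordings === 0 };
  });
}

// Five recordings from the teenage years, none from the 20s.
const coverage = detectGaps(1980, [1994, 1995, 1996, 1997, 1998]);
console.log(coverage.filter((c) => c.foggy).map((c) => c.era)); // [ 'Childhood', '20s' ]
```

The list of foggy eras is what the UI would overlay, and the most recent covered era before a gap is a natural source of RAG context for the follow-up prompt.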
Immediate Next Steps for You
- Environment: Set up the WSL-to-Windows ADB bridge. This is the #1 blocker for this specific setup.
- Authentication: Enable "Anonymous Auth" in Firebase Console so you can test recording without building a login screen first.
- Deploy Functions: Run firebase deploy --only functions from WSL and check the Google Cloud logs to ensure the function starts correctly.
- Test: Record a 30-second clip on the phone, upload it, and watch the Firestore database populate with the JSON extracted by Gemini.
This plan moves you from "PDF Theory" to "Working Prototype" in about 4 hours of work.