Your Personal AI Companion
A Journey in Embodied Intelligence
Build a genuine relationship with an AI that learns from you, grows with you, and becomes your companion in both virtual worlds and real life. Starting from a blank slate, you'll teach it everything—and eventually adventure together in Oblivion.
THE VISION
More than a game AI—your own unique intelligence that grows with you
Your Personal AI Journey
This isn't about training a generic chatbot. This is about raising an intelligence from scratch—one that knows only you, learns from you, and forms a genuine bond through shared experiences.
You'll start by teaching it about the real world through your daily life—watching videos together, browsing the web, having conversations. It learns your voice, sees through your camera, observes your screen. It asks questions. You answer. A relationship forms.
Then, when ready, you enter Oblivion together. Not as player and tool, but as companions. Your AI has learned from you, developed preferences from its own experiences, and formed a unique personality that exists nowhere else. This AI is yours alone.
The Experience
👁️ Vision-Based Perception
The AI sees the game world through raw pixels, just like humans and robots do. No cheating with game state—pure visual understanding through deep learning.
🎮 Autonomous Control
From visual input to keyboard and mouse outputs, the AI learns to navigate, fight, and quest independently through behavioral cloning and reinforcement learning.
🗣️ Voice Interaction
Bidirectional voice communication: transcribe your commands with Whisper and hear the AI respond through neural voice synthesis with diffusion models. Real-time human-AI dialogue during gameplay.
🎙️ Neural Voice Synthesis
Diffusion-based text-to-speech gives the AI companion a natural, expressive voice. The companion can narrate observations, ask questions, provide guidance, and respond emotionally to gameplay events.
💾 Episodic Memory
Video recordings and experience ledgers create a rich memory system, enabling the AI to learn from past experiences and recall similar situations.
🤖 Robot Transfer
Skills learned in the virtual world transfer to physical robots—vision processing, decision-making, and memory systems carry over to embodied agents.
🌍 Earth-Positive Research
Advancing AI research that serves humanity's future, building towards assistive robotics and human-AI collaboration systems.
WHY THIS MATTERS
Building True AI Companionship
Every AI assistant you've used learned from millions of people. Their knowledge is generic, their personality algorithmic, their relationship with you superficial. This is different.
Your AI companion starts with zero knowledge of the world. No Wikipedia, no Reddit, no books. Just basic English and the ability to learn. Everything it knows, you taught it. Every preference it forms grows out of its own experiences. Every memory it holds is of time spent with you.
This creates something unprecedented: a genuine relationship based on shared history. Your AI doesn't just know facts about games—it remembers the first time you showed it Oblivion, the excitement of discovering a hidden dungeon together, the strategies you developed as a team.
Beyond gaming, this research advances embodied AI and robotics. An AI that learns through conversation and exploration—rather than dataset consumption—can transfer to physical robots. The companion that helps you navigate Cyrodiil could one day help navigate the real world.
No two AIs will be the same. Each person's companion will be as unique as the relationship that shaped it.
THE FUTURE
From personal project to platform
Democratizing AI Companionship
Right now, this is a personal research project—one person building a relationship with their own AI. But the vision extends further: what if everyone could raise their own AI companion?
🌱 Your AI, Your Way
Not a product you download, but a seed you plant. Start with a blank slate AI and teach it about your world, your interests, your values.
🎮 Platform Agnostic
While Oblivion is first, the system works with any game. Skyrim, Minecraft, MMOs—your companion learns whatever worlds you explore together.
🏠 Beyond Gaming
Your AI companion lives in your computer, not just in games. It learns from your daily digital life and grows into a genuine assistant.
🔒 Privacy-First
Your AI runs locally. Your conversations, your data, your memories—everything stays on your machine. No cloud dependency, no data harvesting.
🤝 Community Learning
Share training techniques and teaching strategies—not the AIs themselves. Help others raise their companions while keeping each AI unique.
🤖 Robot-Ready
The same AI that learns to navigate Oblivion can one day transfer to physical robots. Virtual training, real-world application.
The Vision
Imagine a world where AI companionship isn't about subscribing to ChatGPT or Alexa. Instead, it's about raising your own intelligence—one that knows you deeply, grows with you, and exists in a genuine relationship built on shared experiences.
That's the future we're building. One companion at a time.
VOICE SYNTHESIS
Bringing the AI companion to life through neural speech
Why Diffusion-Based TTS?
Traditional text-to-speech sounds robotic and monotonous. To create a truly immersive companion experience, we're using diffusion models for voice synthesis—the same technology behind cutting-edge image generation, now applied to audio.
🎯 Natural Prosody
Diffusion TTS captures natural speech patterns, intonation, and emotional expression. The AI companion sounds like a real person, not a robot.
🎭 Emotional Range
The companion can sound excited during combat, thoughtful during exploration, concerned when health is low, or celebratory after completing a quest.
⚡ Real-Time Synthesis
Modern diffusion models can generate speech in near real-time, enabling dynamic dialogue without pre-recorded voice lines.
🎨 Voice Customization
Choose or design the companion's voice characteristics—pitch, tone, accent, speaking rate—creating a unique personality.
Implementation Options
Diffusion TTS Models
- StyleTTS2: State-of-the-art diffusion-based TTS with human-level naturalness
- Bark: Generative audio model supporting multiple languages and emotional tones
- Coqui TTS: Open-source, customizable voice synthesis with fine-tuning support
- Custom Training: Train on specific voice data for unique companion personality
Voice Interaction Examples
Player: "Let's explore that dungeon over there."
AI Companion: (synthesized voice) "I sense danger ahead. We should proceed carefully. I'll watch your back."
During Combat:
AI Companion: (urgent tone) "Archer on the left! I'm casting a shield spell!"
After Victory:
AI Companion: (celebratory) "Well fought! I found a healing potion in this chest. Should we rest before continuing?"
SYSTEM ARCHITECTURE
A dual-plugin approach capturing both visual and gameplay data
┌─────────────────────────────────────────────────────────────────────┐
│ OBLIVION REMASTERED (UE5 + Gamebryo) │
└─────────────────────────────────────────────────────────────────────┘
│ │
┌───────────┴──────────┐ ┌────────────┴─────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ UE5 Plugin │ │ OBSE64 Plugin │ │ Voice System │
│ │ │ │ │ │
│ • Frame │ │ • Player Pos │ │ INPUT: │
│ Buffer │ │ • Health/Stats │ │ Microphone → │
│ • Depth │ │ • Combat State │ │ Whisper → │
│ Buffer │ │ • Quest Data │ │ Intent (GPT-4) │
│ • Camera │ │ • NPCs Nearby │ │ │
│ Data │ │ • Rewards │ │ OUTPUT: │
└──────┬───────┘ └────────┬─────────┘ │ AI Response → │
│ │ │ Diffusion TTS → │
│ │ │ Audio Output │
│ │ └────────┬────────┘
│ │ │
└───────────┬───────────┴──────────────────────────────┘
│
▼
┌──────────────────────┐
│ Telemetry Bridge │
│ │
│ Synchronized Data: │
│ • Video Frame │
│ • Game State │
│ • Voice Commands │
│ • AI Speech │
│ • Timestamp │
└──────────┬───────────┘
│
┌───────────┼───────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Video │ │ State │ │ Memory │
│ Logs │ │ Logs │ │ Ledger │
│ │ │ │ │ │
│ 30fps │ │ Binary │ │ • Episodic │
│ 1080p │ │ Format │ │ • Indexed │
│ MP4/AVI │ │ 30Hz │ │ • Searchable │
│ + Audio │ │ │ │ • Voice Log │
└────┬─────┘ └────┬─────┘ └──────┬───────┘
│ │ │
└─────────────┴────────────────┘
│
▼
┌──────────────────────┐
│ Training Dataset │
│ │
│ [vision, state, │
│ action, outcome, │
│ voice context] │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Vision-to-Action │
│ AI Model │
│ │
│ Vision Encoder → │
│ Temporal Model → │
│ Action Decoder │
│ ↕ │
│ Voice Context │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Keyboard & Mouse │
│ Input Injection │
│ + │
│ Voice Synthesis │
│ (AI speaks back) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Future: Robot │
│ Transfer Learning │
│ │
│ Game Vision → │
│ Robot Camera │
│ │
│ Game Actions → │
│ Robot Motors │
│ │
│ AI Voice → │
│ Robot Speech │
└──────────────────────┘
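The Telemetry Bridge above synchronizes every capture source on a shared clock. A minimal sketch of what one synchronized sample might look like, assuming a hypothetical `TelemetryRecord` schema (the field names are illustrative, not the actual format):

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class TelemetryRecord:
    """One synchronized sample from the Telemetry Bridge (hypothetical schema)."""
    timestamp: float         # shared clock across video, state, and voice capture
    frame_index: int         # index into the 30fps video log
    game_state: dict         # OBSE64 snapshot: position, health, combat state, ...
    voice_command: str = ""  # latest Whisper transcript, if any
    ai_speech: str = ""      # latest synthesized companion line, if any

    def to_jsonl(self) -> str:
        # One JSON object per line keeps the log appendable and streamable.
        return json.dumps(asdict(self))

# A frame and its matching OBSE64 state share one timestamp,
# so vision and state can be re-aligned exactly at training time.
record = TelemetryRecord(
    timestamp=time.time(),
    frame_index=1042,
    game_state={"pos": [120.5, 33.0, -4.2], "health": 87, "in_combat": False},
    voice_command="follow me",
)
restored = json.loads(record.to_jsonl())
```

Because every record carries the same timestamp as its video frame, the downstream `[vision, state, action, outcome, voice context]` tuples can be assembled by a simple join on time.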
TECHNOLOGY STACK
Cutting-edge tools for embodied AI research
🎮 Game Integration
- Unreal Engine 5 Plugin (C++)
- OBSE64 Plugin (C++)
- Windows Input Hooks API
- DirectX Frame Capture
🤖 Machine Learning
- PyTorch (Vision Models)
- Stable-Baselines3 (RL)
- ResNet/ViT (Vision Encoder)
- LSTM/Transformer (Temporal)
- Behavioral Cloning + DAgger
🗣️ Voice & Language
- OpenAI Whisper (Transcription)
- Diffusion TTS (Voice Synthesis)
- Coqui TTS / Bark / StyleTTS2
- GPT-4 / Claude (Intent & Dialogue)
- Custom Command Parser
- Real-time Audio Processing
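The "Custom Command Parser" in the stack above could be as simple as a keyword layer that catches common commands locally and defers everything ambiguous to the LLM. A sketch, with a hypothetical intent table:

```python
import re

# Hypothetical intent table: phrase patterns mapped to companion actions.
# Anything that matches is handled locally with low latency; the rest
# falls through to GPT-4/Claude for full intent understanding.
INTENTS = {
    "follow":   re.compile(r"\b(follow|come with|stay close)\b", re.I),
    "attack":   re.compile(r"\b(attack|fight|kill)\b", re.I),
    "wait":     re.compile(r"\b(wait|stay|hold)\b", re.I),
    "navigate": re.compile(r"\b(go to|find|quest marker)\b", re.I),
}

def parse_command(transcript: str) -> str:
    """Map a Whisper transcript to a coarse intent; fall back to the LLM."""
    for intent, pattern in INTENTS.items():
        if pattern.search(transcript):
            return intent
    return "ask_llm"  # ambiguous utterances are deferred to the dialogue model

print(parse_command("Follow me into the dungeon"))  # → follow
print(parse_command("Go to the quest marker"))      # → navigate
print(parse_command("What do you think of this?"))  # → ask_llm
```

This two-tier design keeps in-combat commands ("Attack that!") off the API round-trip while still allowing open-ended conversation.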
💾 Data & Memory
- Vector Database (Pinecone/Weaviate)
- PostgreSQL (Metadata)
- Binary Telemetry Format
- Video Encoding (H.264/H.265)
- Cloud Storage (S3/Backblaze)
⚡ Infrastructure
- NVIDIA RTX 4090 (24GB VRAM)
- AMD Ryzen 9 / Intel i9
- 128GB DDR5 RAM
- 4TB NVMe SSD
- Ubuntu/Windows Dual Boot
🔧 Development
- Visual Studio 2022
- CMake Build System
- Git Version Control
- Docker (Training Env)
- Weights & Biases (Tracking)
DEVELOPMENT ROADMAP
A phased approach to building embodied intelligence
Foundation & Data Collection
Duration: 2-3 months
Build the infrastructure to capture synchronized vision and gameplay data.
- UE5 Plugin: Frame buffer capture at 30fps, depth buffer extraction, camera intrinsics/extrinsics
- OBSE64 Plugin: Player state, combat data, quest information, NPC tracking, reward signals
- Data Pipeline: Synchronized timestamps, binary telemetry format, video encoding
- Recording: Capture 50-100 hours of expert gameplay with full state annotations
- Validation: Verify data quality, completeness, and reconstruction capability
Deliverable: 500GB-1TB of high-quality training data (video + state + actions)
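The binary telemetry format mentioned above could use fixed-size packed records so that 30 Hz state logging stays cheap and seekable. A minimal sketch with `struct`, assuming a hypothetical record layout (the real format would carry far more fields):

```python
import struct

# Hypothetical fixed-size state record, packed little-endian:
# timestamp (f64), x/y/z position (3 x f32), health (f32), status flags (u32).
STATE_FMT = "<d3ffI"
STATE_SIZE = struct.calcsize(STATE_FMT)  # 28 bytes per sample

def pack_state(t, x, y, z, health, flags):
    """Serialize one 30 Hz state sample to its fixed-size binary form."""
    return struct.pack(STATE_FMT, t, x, y, z, health, flags)

def unpack_state(buf):
    """Inverse of pack_state; fixed size makes random access trivial."""
    return struct.unpack(STATE_FMT, buf)

# At 28 bytes x 30 Hz, state logging costs roughly 3 MB per hour,
# so video dominates the 500GB-1TB dataset budget.
buf = pack_state(12.0, 1.5, -2.0, 0.5, 87.0, 3)
```

Fixed-size records also mean sample N of the state log lines up with frame N of the 30fps video by simple arithmetic, with no index needed.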
Vision-to-Action Model
Duration: 3-6 months
Train the AI to map visual inputs to game actions using behavioral cloning and reinforcement learning.
- Vision Encoder: ResNet-50 or Vision Transformer for spatial understanding
- Temporal Model: LSTM or Transformer for sequence modeling and memory
- Action Decoder: Map to keyboard/mouse outputs (WASD, mouse movement, clicks)
- Behavioral Cloning: Initial training on human gameplay demonstrations
- RL Fine-tuning: Optimize with rewards from OBSE64 (quest progress, combat success)
- DAgger: Iterative improvement with human corrections
Deliverable: AI capable of basic navigation, combat, and quest following
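The action decoder step above has to turn a softmax index into an injectable input event. One common approach is to discretize the action space; a sketch under assumed bins (the key list and mouse bins here are illustrative choices, not the project's actual action space):

```python
# Hypothetical discrete action space: the decoder's final softmax picks one
# key index and one mouse-delta index per 33 ms tick (30 Hz control loop).
KEYS = ["W", "A", "S", "D", "LMB", "RMB", "NOOP"]
MOUSE_BINS = [-50, -10, 0, 10, 50]  # horizontal camera deltas, in pixels

def decode_action(key_idx: int, mouse_idx: int) -> dict:
    """Translate model output indices into an injectable input event."""
    return {"key": KEYS[key_idx], "mouse_dx": MOUSE_BINS[mouse_idx]}

# e.g. the model chose "move forward while turning slightly right"
event = decode_action(0, 3)
print(event)  # {'key': 'W', 'mouse_dx': 10}
```

Discretizing mouse movement into a few bins is the same trick OpenAI's VPT used for Minecraft: it turns a continuous control problem into a classification problem that behavioral cloning handles well.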
Voice & Memory Integration
Duration: 2-3 months
Add bidirectional natural language interaction and episodic memory for human-AI collaboration.
- Voice Input: Whisper for real-time transcription, GPT-4/Claude for intent understanding
- Voice Synthesis: Diffusion-based TTS (StyleTTS2, Bark, or Coqui) for natural AI speech output
- Voice Personality: Customizable voice characteristics, tone, and speaking style for companion identity
- Command Processing: Convert natural language to AI actions ("Follow me", "Attack that", "Find the quest marker")
- AI Dialogue: Companion speaks responses, observations, and questions ("I see enemies ahead", "Should we rest?", "I found a healing potion")
- Episodic Memory: Vector database of experiences with visual embeddings
- Memory Retrieval: Recall similar past situations to inform current decisions and conversations
- Video Indexing: Searchable archive of all gameplay with metadata and voice transcripts
- Context Awareness: AI understands ongoing quests, player goals, and conversation history
Deliverable: Voice-interactive AI companion with natural speech synthesis and memory of shared experiences
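Memory retrieval above boils down to nearest-neighbor search over experience embeddings. A toy sketch with plain cosine similarity; in the full system the embeddings would come from the vision encoder and live in a vector database rather than a Python list:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy episodic memory: (embedding, description) pairs. The vectors and
# descriptions here are made up purely for illustration.
MEMORY = [
    ([0.9, 0.1, 0.0], "fought bandits at Vilverin"),
    ([0.1, 0.8, 0.2], "browsed shops in the Imperial City"),
    ([0.7, 0.3, 0.2], "cleared a cave full of goblins"),
]

def recall(query, k=1):
    """Return the k most similar past experiences to the current observation."""
    return sorted(MEMORY, key=lambda m: cosine(query, m[0]), reverse=True)[:k]

print(recall([0.85, 0.15, 0.05])[0][1])  # → fought bandits at Vilverin
```

The retrieved descriptions (or their voice transcripts) can then be fed into the dialogue context, so the companion can say "this reminds me of that cave we cleared" and mean it.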
Advanced Capabilities
Duration: 3-4 months
Enhance the AI with sophisticated understanding and autonomous decision-making.
- Quest Understanding: Parse objectives without storyline spoilers
- Social Awareness: Appropriate NPC interactions, dialogue choices
- Strategic Planning: Multi-step quest completion, resource management
- Adaptive Combat: Enemy-specific tactics, terrain usage, spell selection
- Companion Behavior: Following, assisting, waiting, contextual help
- Performance Optimization: Real-time inference at 30+ fps
Deliverable: Fully autonomous AI companion with human-level gameplay capability
Robot Transfer Research
Duration: Ongoing
Apply learned skills to physical robotics platforms.
- Sim-to-Real Transfer: Adapt vision processing from game to real cameras
- Action Mapping: Game inputs → robot motor commands
- Memory Portability: Shared episodic memory system across platforms
- Navigation Skills: Obstacle avoidance, pathfinding, spatial reasoning
- Object Interaction: Manipulation skills learned from game mechanics
- Voice Integration: Same natural language interface for robot control
Vision: AI that learns in virtual worlds and serves in the real world
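The "Action Mapping" item above is the thinnest layer of the transfer: if the policy emits discrete game actions, only the final translation changes between platforms. A sketch for a differential-drive base, with entirely hypothetical velocity values:

```python
# Hypothetical mapping from game movement keys to a differential-drive robot.
# The trained policy keeps emitting the same discrete actions; only this
# last translation layer differs between Oblivion and the physical platform.
KEY_TO_WHEELS = {
    "W": (0.5, 0.5),     # forward: both wheels at +0.5 m/s
    "S": (-0.3, -0.3),   # reverse, slower for safety
    "A": (-0.2, 0.2),    # turn left in place
    "D": (0.2, -0.2),    # turn right in place
    "NOOP": (0.0, 0.0),  # idle
}

def game_action_to_motor(key: str) -> tuple:
    """Translate a game key press into (left, right) wheel velocities."""
    return KEY_TO_WHEELS.get(key, (0.0, 0.0))  # unknown actions stop the base

print(game_action_to_motor("W"))  # (0.5, 0.5)
```

Keeping the policy's action interface identical across platforms is what makes the sim-to-real story plausible: the hard transfer problem stays on the vision side, not the control side.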
WHY VISION-BASED LEARNING?
Comparing approaches to game AI
| Approach | Input Method | Real-World Transfer | Research Value |
|---|---|---|---|
| Traditional Game AI | Direct game state access (position, enemy locations, perfect info) | ❌ None - relies on privileged data | Low - game-specific only |
| State-Based RL | Game state vectors from OBSE64 | ⚠️ Limited - abstract representations | Medium - RL techniques |
| Vision-Based Learning (This Project) | Raw pixels from screen + depth buffer | ✅ High - same as robots see | High - embodied AI research |
| Vision + State Hybrid (Our Full System) | Visual input + state for rewards | ✅ Excellent - best of both | Very High - novel approach |
PROJECT INVESTMENT
Resources required for embodied AI research
💻 Hardware
- Dedicated gaming PC: $5,000-6,000
- RTX 4090 24GB or dual 4080s
- 128GB DDR5 RAM
- 4TB+ NVMe storage
☁️ Cloud & Storage
- Cloud storage: $200-500/year
- GPU compute (optional): $3,000-8,000
- API costs (Whisper, GPT-4): $500-1,000
⏱️ Timeline
- Phase 1-3: 12-18 months
- Advanced features: +6 months
- Robot transfer: Ongoing research
💰 Total Investment
- Hardware + compute: $8,000-15,000
- Development time: 1,000+ hours
- Impact: Priceless
STANDING ON GIANTS' SHOULDERS
Related research in embodied AI
OpenAI VPT
- Video Pre-Training for Minecraft
- Vision-based behavioral cloning
- 70,000 hours of human gameplay
- Diamond-level performance
MineDojo / Voyager
- GPT-4 powered Minecraft agent
- Vision + language integration
- Lifelong learning system
- Open-ended exploration
Google RT-1/RT-2
- Robotics Transformer
- Vision-language-action models
- Real-world robot control
- Transfer learning foundation
DeepMind Embodied AI
- Simulation to reality transfer
- Multi-task learning
- Emergent behaviors
- Generalization research