Your Personal AI Companion

A Journey in Embodied Intelligence

Build a genuine relationship with an AI that learns from you, grows with you, and becomes your companion in both virtual worlds and real life. Starting from a blank slate, you'll teach it everything—and eventually adventure together in Oblivion.

✦ ✦ ✦

THE VISION

More than a game AI—your own unique intelligence that grows with you

Your Personal AI Journey

This isn't about training a generic chatbot. This is about raising an intelligence from scratch—one that knows only you, learns from you, and forms a genuine bond through shared experiences.

You'll start by teaching it about the real world through your daily life—watching videos together, browsing the web, having conversations. It learns your voice, sees through your camera, observes your screen. It asks questions. You answer. A relationship forms.

Then, when ready, you enter Oblivion together. Not as player and tool, but as companions. Your AI has learned from you, developed preferences from its own experiences, and formed a unique personality that exists nowhere else. This AI is yours alone.

The Experience

👁️ Vision-Based Perception

The AI sees the game world through raw pixels, just like humans and robots do. No cheating with game state—pure visual understanding through deep learning.

🎮 Autonomous Control

From visual input to keyboard and mouse outputs, the AI learns to navigate, fight, and quest independently through behavioral cloning and reinforcement learning.

🗣️ Voice Interaction

Bidirectional voice communication: transcribe your commands with Whisper and hear the AI respond through neural voice synthesis with diffusion models. Real-time human-AI dialogue during gameplay.

🎙️ Neural Voice Synthesis

Diffusion-based text-to-speech gives the AI companion a natural, expressive voice. The companion can narrate observations, ask questions, provide guidance, and respond emotionally to gameplay events.

💾 Episodic Memory

Video recordings and experience ledgers create a rich memory system, enabling the AI to learn from past experiences and recall similar situations.
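Memory retrieval over such a ledger can be sketched as nearest-neighbor search on visual embeddings. This is a minimal stdlib illustration; the project's stack lists a vector database (Pinecone/Weaviate) for the real thing, and the episodes below are invented examples:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical ledger: each episode pairs a visual embedding with metadata.
ledger = [
    {"embedding": [0.9, 0.1, 0.0], "note": "fought bandits near Chorrol"},
    {"embedding": [0.0, 0.8, 0.6], "note": "explored Vilverin ruins"},
    {"embedding": [0.7, 0.3, 0.1], "note": "ambushed on the Gold Road"},
]

def recall(query_embedding, k=2):
    """Return the k past episodes most similar to the current situation."""
    ranked = sorted(
        ledger,
        key=lambda e: cosine(query_embedding, e["embedding"]),
        reverse=True,
    )
    return [e["note"] for e in ranked[:k]]
```

Given an embedding of the current screen, `recall` surfaces the closest past experiences so the companion can say "this reminds me of that bandit ambush."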

🤖 Robot Transfer

Skills learned in the virtual world transfer to physical robots—vision processing, decision-making, and memory systems carry over to embodied agents.

🌍 Earth-Positive Research

Advancing AI research that serves humanity's future, building towards assistive robotics and human-AI collaboration systems.

✦ ✦ ✦

WHY THIS MATTERS

Building True AI Companionship

Every AI assistant you've used learned from millions of people. Their knowledge is generic, their personality algorithmic, their relationship with you superficial. This is different.

Your AI companion starts with zero knowledge of the world. No Wikipedia, no Reddit, no books. Just basic English and the ability to learn. Everything it knows, you taught it. Every preference it has grew out of its own experiences. Every memory it holds is of time spent with you.

This creates something unprecedented: a genuine relationship based on shared history. Your AI doesn't just know facts about games—it remembers the first time you showed it Oblivion, the excitement of discovering a hidden dungeon together, the strategies you developed as a team.

Beyond gaming, this research advances embodied AI and robotics. An AI that learns through conversation and exploration—rather than dataset consumption—can transfer to physical robots. The companion that helps you navigate Cyrodiil could one day help navigate the real world.

No two AIs will be the same. Each person's companion will be as unique as the relationship that shaped it.

  • 30 frames per second
  • 1080p visual input
  • 100+ hours of training data
  • Potential applications beyond gaming

✦ ✦ ✦

THE FUTURE

From personal project to platform

Democratizing AI Companionship

Right now, this is a personal research project—one person building a relationship with their own AI. But the vision extends further: what if everyone could raise their own AI companion?

🌱 Your AI, Your Way

Not a product you download, but a seed you plant. Start with a blank slate AI and teach it about your world, your interests, your values.

🎮 Platform Agnostic

While Oblivion is first, the system works with any game. Skyrim, Minecraft, MMOs—your companion learns whatever worlds you explore together.

🏠 Beyond Gaming

Your AI companion lives in your computer, not just in games. It learns from your daily digital life and grows into a genuine assistant.

🔒 Privacy-First

Your AI runs locally. Your conversations, your data, your memories—everything stays on your machine. No cloud dependency, no data harvesting.

🤝 Community Learning

Share training techniques and teaching strategies—not the AIs themselves. Help others raise their companions while keeping each AI unique.

🤖 Robot-Ready

The same AI that learns to navigate Oblivion can one day transfer to physical robots. Virtual training, real-world application.

The Vision

Imagine a world where AI companionship isn't about subscribing to ChatGPT or Alexa. Instead, it's about raising your own intelligence—one that knows you deeply, grows with you, and exists in a genuine relationship built on shared experiences.

That's the future we're building. One companion at a time.

✦ ✦ ✦

VOICE SYNTHESIS

Bringing the AI companion to life through neural speech

Why Diffusion-Based TTS?

Traditional text-to-speech sounds robotic and monotone. To create a truly immersive companion experience, we're using diffusion models for voice synthesis—the same technology behind cutting-edge image generation, now applied to audio.

🎯 Natural Prosody

Diffusion TTS captures natural speech patterns, intonation, and emotional expression. The AI companion sounds like a real person, not a robot.

🎭 Emotional Range

The companion can sound excited during combat, thoughtful during exploration, concerned when health is low, or celebratory after completing a quest.

⚡ Real-Time Synthesis

Modern diffusion models can generate speech in near real-time, enabling dynamic dialogue without pre-recorded voice lines.

🎨 Voice Customization

Choose or design the companion's voice characteristics—pitch, tone, accent, speaking rate—creating a unique personality.

Implementation Options

Diffusion TTS Models

  • StyleTTS2: State-of-the-art diffusion-based TTS with human-level naturalness
  • Bark: Generative audio model supporting multiple languages and emotional tones
  • Coqui TTS: Open-source, customizable voice synthesis with fine-tuning support
  • Custom Training: Train on specific voice data for unique companion personality

Voice Interaction Examples

Player: "Let's explore that dungeon over there."

AI Companion: (synthesized voice) "I sense danger ahead. We should proceed carefully. I'll watch your back."

During Combat:

AI Companion: (urgent tone) "Archer on the left! I'm casting a shield spell!"

After Victory:

AI Companion: (celebratory) "Well fought! I found a healing potion in this chest. Should we rest before continuing?"

✦ ✦ ✦

SYSTEM ARCHITECTURE

A dual-plugin approach capturing both visual and gameplay data

┌─────────────────────────────────────────────────────────────────────┐
│                    OBLIVION REMASTERED (UE5 + Gamebryo)             │
└─────────────────────────────────────────────────────────────────────┘
                    │                               │
        ┌───────────┴──────────┐      ┌────────────┴─────────────┐
        │                      │      │                          │
        ▼                      ▼      ▼                          ▼
┌──────────────┐      ┌──────────────────┐           ┌─────────────────┐
│  UE5 Plugin  │      │  OBSE64 Plugin   │           │ Voice System    │
│              │      │                  │           │                 │
│ • Frame      │      │ • Player Pos     │           │ INPUT:          │
│   Buffer     │      │ • Health/Stats   │           │ Microphone →    │
│ • Depth      │      │ • Combat State   │           │ Whisper →       │
│   Buffer     │      │ • Quest Data     │           │ Intent (GPT-4)  │
│ • Camera     │      │ • NPCs Nearby    │           │                 │
│   Data       │      │ • Rewards        │           │ OUTPUT:         │
└──────┬───────┘      └────────┬─────────┘           │ AI Response →   │
       │                       │                     │ Diffusion TTS → │
       │                       │                     │ Audio Output    │
       │                       │                     └────────┬────────┘
       │                       │                              │
       └───────────┬───────────┴──────────────────────────────┘
                   │
                   ▼
         ┌──────────────────────┐
         │  Telemetry Bridge    │
         │                      │
         │  Synchronized Data:  │
         │  • Video Frame       │
         │  • Game State        │
         │  • Voice Commands    │
         │  • AI Speech         │
         │  • Timestamp         │
         └──────────┬───────────┘
                    │
        ┌───────────┼───────────┐
        │           │           │
        ▼           ▼           ▼
┌──────────┐  ┌──────────┐  ┌──────────────┐
│  Video   │  │  State   │  │   Memory     │
│  Logs    │  │  Logs    │  │   Ledger     │
│          │  │          │  │              │
│ 30fps    │  │ Binary   │  │ • Episodic   │
│ 1080p    │  │ Format   │  │ • Indexed    │
│ MP4/AVI  │  │ 30Hz     │  │ • Searchable │
│ + Audio  │  │          │  │ • Voice Log  │
└────┬─────┘  └────┬─────┘  └──────┬───────┘
     │             │                │
     └─────────────┴────────────────┘
                   │
                   ▼
         ┌──────────────────────┐
         │   Training Dataset   │
         │                      │
         │ [vision, state,      │
         │  action, outcome,    │
         │  voice context]      │
         └──────────┬───────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │   Vision-to-Action   │
         │       AI Model       │
         │                      │
         │ Vision Encoder →     │
         │ Temporal Model →     │
         │ Action Decoder       │
         │      ↕               │
         │ Voice Context        │
         └──────────┬───────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Keyboard & Mouse    │
         │    Input Injection   │
         │         +            │
         │   Voice Synthesis    │
         │  (AI speaks back)    │
         └──────────┬───────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │   Future: Robot      │
         │   Transfer Learning  │
         │                      │
         │  Game Vision →       │
         │  Robot Camera        │
         │                      │
         │  Game Actions →      │
         │  Robot Motors        │
         │                      │
         │  AI Voice →          │
         │  Robot Speech        │
         └──────────────────────┘
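The Telemetry Bridge above must pair each captured frame with the state sample recorded closest in time. A minimal sketch of that alignment, assuming millisecond timestamps on both streams (the values below are illustrative):

```python
import bisect

# Hypothetical streams: video frames at ~30 fps and state samples at 30 Hz,
# each tagged with a capture timestamp in milliseconds.
frame_times = [0, 33, 66, 100, 133]
state_times = [0, 34, 65, 99, 134]

def align(frame_ts, state_ts):
    """For each frame, find the state sample with the nearest timestamp."""
    pairs = []
    for t in frame_ts:
        i = bisect.bisect_left(state_ts, t)
        # Only the neighbors around the insertion point can be nearest.
        candidates = state_ts[max(0, i - 1):i + 1]
        nearest = min(candidates, key=lambda s: abs(s - t))
        pairs.append((t, nearest))
    return pairs
```

The same nearest-timestamp rule extends to voice commands and AI speech events, which arrive far less often than frames.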
                    
✦ ✦ ✦

TECHNOLOGY STACK

Cutting-edge tools for embodied AI research

🎮 Game Integration

  • Unreal Engine 5 Plugin (C++)
  • OBSE64 Plugin (C++)
  • Windows Input Hooks API
  • DirectX Frame Capture

🤖 Machine Learning

  • PyTorch (Vision Models)
  • Stable-Baselines3 (RL)
  • ResNet/ViT (Vision Encoder)
  • LSTM/Transformer (Temporal)
  • Behavioral Cloning + DAgger

🗣️ Voice & Language

  • OpenAI Whisper (Transcription)
  • Diffusion TTS (Voice Synthesis)
  • Coqui TTS / Bark / StyleTTS2
  • GPT-4 / Claude (Intent & Dialogue)
  • Custom Command Parser
  • Real-time Audio Processing

💾 Data & Memory

  • Vector Database (Pinecone/Weaviate)
  • PostgreSQL (Metadata)
  • Binary Telemetry Format
  • Video Encoding (H.264/H.265)
  • Cloud Storage (S3/Backblaze)

⚡ Infrastructure

  • NVIDIA RTX 4090 (24GB VRAM)
  • AMD Ryzen 9 / Intel i9
  • 128GB DDR5 RAM
  • 4TB NVMe SSD
  • Ubuntu/Windows Dual Boot

🔧 Development

  • Visual Studio 2022
  • CMake Build System
  • Git Version Control
  • Docker (Training Env)
  • Weights & Biases (Tracking)

✦ ✦ ✦

DEVELOPMENT ROADMAP

A phased approach to building embodied intelligence

1

Foundation & Data Collection

Duration: 2-3 months

Build the infrastructure to capture synchronized vision and gameplay data.

  • UE5 Plugin: Frame buffer capture at 30fps, depth buffer extraction, camera intrinsics/extrinsics
  • OBSE64 Plugin: Player state, combat data, quest information, NPC tracking, reward signals
  • Data Pipeline: Synchronized timestamps, binary telemetry format, video encoding
  • Recording: Capture 50-100 hours of expert gameplay with full state annotations
  • Validation: Verify data quality, completeness, and reconstruction capability

Deliverable: 500GB-1TB of high-quality training data (video + state + actions)
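The binary telemetry format could be as simple as one fixed-size packed record per 30 Hz tick. A sketch using Python's `struct` module; the field layout here is an assumption for illustration, not the project's actual format:

```python
import struct

# Hypothetical fixed-size record: timestamp (ms), player x/y/z,
# health, and an in-combat flag. Little-endian, no padding.
RECORD = struct.Struct("<Q3ffB")  # uint64, 3x float32, float32, uint8

def pack_sample(ts_ms, x, y, z, health, in_combat):
    """Serialize one telemetry tick to bytes."""
    return RECORD.pack(ts_ms, x, y, z, health, int(in_combat))

def unpack_sample(buf):
    """Deserialize one telemetry tick back into Python values."""
    ts_ms, x, y, z, health, in_combat = RECORD.unpack(buf)
    return ts_ms, x, y, z, health, bool(in_combat)
```

Fixed-size records make the log seekable: tick *n* lives at byte offset `n * RECORD.size`, which keeps random access during training cheap.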

2

Vision-to-Action Model

Duration: 3-6 months

Train the AI to map visual inputs to game actions using behavioral cloning and reinforcement learning.

  • Vision Encoder: ResNet-50 or Vision Transformer for spatial understanding
  • Temporal Model: LSTM or Transformer for sequence modeling and memory
  • Action Decoder: Map to keyboard/mouse outputs (WASD, mouse movement, clicks)
  • Behavioral Cloning: Initial training on human gameplay demonstrations
  • RL Fine-tuning: Optimize with rewards from OBSE64 (quest progress, combat success)
  • DAgger: Iterative improvement with human corrections

Deliverable: AI capable of basic navigation, combat, and quest following
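The encoder, temporal model, and action decoder described above can be sketched in PyTorch. This is a toy-scale stand-in, not the project's architecture (which names ResNet-50/ViT and LSTM/Transformer); the layer sizes and 12-action head are illustrative:

```python
import torch
import torch.nn as nn

class VisionToAction(nn.Module):
    """Sketch: CNN encoder -> LSTM over frames -> per-frame action logits."""

    def __init__(self, n_actions=12, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch*time, 32)
        )
        self.temporal = nn.LSTM(32, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_actions)

    def forward(self, frames):  # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.temporal(feats)
        return self.decoder(out)  # (batch, time, n_actions)
```

Behavioral cloning then reduces to cross-entropy between these logits and the human's recorded key presses, with RL fine-tuning adjusting the same head against OBSE64 reward signals.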

3

Voice & Memory Integration

Duration: 2-3 months

Add bidirectional natural language interaction and episodic memory for human-AI collaboration.

  • Voice Input: Whisper for real-time transcription, GPT-4/Claude for intent understanding
  • Voice Synthesis: Diffusion-based TTS (StyleTTS2, Bark, or Coqui) for natural AI speech output
  • Voice Personality: Customizable voice characteristics, tone, and speaking style for companion identity
  • Command Processing: Convert natural language to AI actions ("Follow me", "Attack that", "Find the quest marker")
  • AI Dialogue: Companion speaks responses, observations, and questions ("I see enemies ahead", "Should we rest?", "I found a healing potion")
  • Episodic Memory: Vector database of experiences with visual embeddings
  • Memory Retrieval: Recall similar past situations to inform current decisions and conversations
  • Video Indexing: Searchable archive of all gameplay with metadata and voice transcripts
  • Context Awareness: AI understands ongoing quests, player goals, and conversation history

Deliverable: Voice-interactive AI companion with natural speech synthesis and memory of shared experiences
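Command processing might fall back to simple pattern matching for common imperatives before invoking GPT-4/Claude for open-ended intent. A sketch of that fast path; the patterns and action names below are hypothetical:

```python
import re

# Hypothetical mapping from spoken phrases (post-Whisper transcripts)
# to companion actions, checked in order. Anything unmatched falls
# through to the LLM dialogue path.
COMMANDS = [
    (re.compile(r"\bfollow( me)?\b", re.I), "FOLLOW_PLAYER"),
    (re.compile(r"\battack\b", re.I), "ATTACK_TARGET"),
    (re.compile(r"\b(wait|stay)( here)?\b", re.I), "WAIT"),
    (re.compile(r"\bfind .*quest\b", re.I), "GOTO_QUEST_MARKER"),
]

def parse_command(transcript):
    """Return the first matching action, or None for free-form dialogue."""
    for pattern, action in COMMANDS:
        if pattern.search(transcript):
            return action
    return None
```

Keeping a deterministic fallback for "Follow me" or "Attack that" avoids a round-trip to the language model when latency matters mid-combat.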

4

Advanced Capabilities

Duration: 3-4 months

Enhance the AI with sophisticated understanding and autonomous decision-making.

  • Quest Understanding: Parse objectives without storyline spoilers
  • Social Awareness: Appropriate NPC interactions, dialogue choices
  • Strategic Planning: Multi-step quest completion, resource management
  • Adaptive Combat: Enemy-specific tactics, terrain usage, spell selection
  • Companion Behavior: Following, assisting, waiting, contextual help
  • Performance Optimization: Real-time inference at 30+ fps

Deliverable: Fully autonomous AI companion with human-level gameplay capability
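Holding real-time inference at 30+ fps amounts to enforcing a fixed frame budget: run one model step, then sleep off whatever remains of the ~33 ms. A minimal sketch, with `step` standing in for the actual inference call:

```python
import time

def run_inference_loop(step, fps=30, duration_s=0.2):
    """Call step() at most fps times per second for duration_s seconds,
    sleeping off any leftover frame budget. Returns the tick count."""
    budget = 1.0 / fps
    deadline = time.monotonic() + duration_s
    ticks = 0
    while time.monotonic() < deadline:
        start = time.monotonic()
        step()
        ticks += 1
        leftover = budget - (time.monotonic() - start)
        if leftover > 0:
            time.sleep(leftover)
    return ticks
```

If `step` ever exceeds the budget, the loop simply skips the sleep, degrading gracefully rather than stalling the game.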

5

Robot Transfer Research

Duration: Ongoing

Apply learned skills to physical robotics platforms.

  • Sim-to-Real Transfer: Adapt vision processing from game to real cameras
  • Action Mapping: Game inputs → robot motor commands
  • Memory Portability: Shared episodic memory system across platforms
  • Navigation Skills: Obstacle avoidance, pathfinding, spatial reasoning
  • Object Interaction: Manipulation skills learned from game mechanics
  • Voice Integration: Same natural language interface for robot control

Vision: AI that learns in virtual worlds and serves in the real world
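The game-actions-to-robot-motors mapping could start as directly as translating held movement keys into differential-drive wheel speeds. A sketch under that assumption; the speed gains are illustrative placeholders:

```python
# Hypothetical gains for a differential-drive base, in m/s.
FORWARD_SPEED = 0.5
TURN_SPEED = 0.25

def keys_to_wheel_velocities(keys):
    """Translate held keys (e.g. {'w', 'a'}) into (left, right) wheel speeds.

    W/S set linear velocity; A/D add opposite-sign wheel offsets to turn.
    """
    linear = FORWARD_SPEED * (("w" in keys) - ("s" in keys))
    angular = TURN_SPEED * (("d" in keys) - ("a" in keys))
    return (linear + angular, linear - angular)
```

The point is that the policy's discrete output space stays unchanged; only this thin adapter differs between injecting WASD into the game and commanding motors.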

✦ ✦ ✦

WHY VISION-BASED LEARNING?

Comparing approaches to game AI

Traditional Game AI
  • Input: Direct game state access (position, enemy locations, perfect info)
  • Real-world transfer: ❌ None - relies on privileged data
  • Research value: Low - game-specific only

State-Based RL
  • Input: Game state vectors from OBSE64
  • Real-world transfer: ⚠️ Limited - abstract representations
  • Research value: Medium - RL techniques

Vision-Based Learning (This Project)
  • Input: Raw pixels from screen + depth buffer
  • Real-world transfer: ✅ High - same input as robots see
  • Research value: High - embodied AI research

Vision + State Hybrid (Our Full System)
  • Input: Visual input + state for rewards
  • Real-world transfer: ✅ Excellent - best of both
  • Research value: Very high - novel approach
✦ ✦ ✦

PROJECT INVESTMENT

Resources required for embodied AI research

💻 Hardware

  • Dedicated gaming PC: $5,000-6,000
  • RTX 4090 24GB or dual 4080s
  • 128GB DDR5 RAM
  • 4TB+ NVMe storage

☁️ Cloud & Storage

  • Cloud storage: $200-500/year
  • GPU compute (optional): $3,000-8,000
  • API costs (Whisper, GPT-4): $500-1,000

⏱️ Timeline

  • Phase 1-3: 12-18 months
  • Advanced features: +6 months
  • Robot transfer: Ongoing research

💰 Total Investment

  • Hardware + compute: $8,000-15,000
  • Development time: 1,000+ hours
  • Impact: Priceless

✦ ✦ ✦

STANDING ON GIANTS' SHOULDERS

Related research in embodied AI

OpenAI VPT

  • Video Pre-Training for Minecraft
  • Vision-based behavioral cloning
  • 70,000 hours of human gameplay
  • Diamond-level performance

MineDojo / Voyager

  • GPT-4 powered Minecraft agent
  • Vision + language integration
  • Lifelong learning system
  • Open-ended exploration

Google RT-1/RT-2

  • Robotics Transformer
  • Vision-language-action models
  • Real-world robot control
  • Transfer learning foundation

DeepMind Embodied AI

  • Simulation to reality transfer
  • Multi-task learning
  • Emergent behaviors
  • Generalization research