In brief
- X-OmniClaw is an open-source Android AI agent developed by Oppo that runs its core logic directly on your device and only connects to the cloud for complex reasoning tasks.
- The system creates a long-term semantic memory using your photo gallery and past interactions, enabling it to function as a persistent assistant instead of a one-time chatbot.
- A behavior cloning feature allows users to record a navigation path once so the agent can replay it instantly via Android deeplink, skipping multi-step app navigation in future sessions.
Your phone already has a camera, a microphone, and a screen. It can see what you’re looking at in real life and what’s happening on its own display. Now, the AI team from Chinese smartphone manufacturer Oppo has realized that all that hardware, mostly underused, is exactly what you need to build a genuinely useful mobile AI agent.
That project is X-OmniClaw, published by the Multi-X Team. It’s an open-source AI agent framework for Android that turns your phone into a hands-free, context-aware assistant capable of running real tasks across real apps, without routing everything through a cloud copy of your device.
Most mobile AI systems don’t actually run on your phone. They run on cloud servers that host virtual copies of Android, letting an AI tap and scroll through apps remotely. The result: no access to your real camera, your actual photos, or your local files—just a stranger using a copy of your phone.
X-OmniClaw takes the opposite approach. According to the technical report, it introduces “an edge-native architecture that executes directly on the user’s physical device, thereby eliminating the gap between simulated environments and real-world interaction contexts.”
The report uses a car analogy: The smartphone is “the vehicle,” X-OmniClaw is “the internal engine for control and perception,” and the cloud-based language model is only called in as “the fuel” when heavy reasoning is needed. Everything else stays local.
How the Oppo AI phone agent works
According to Oppo, X-OmniClaw’s architecture rests on three pillars (Omni Perception, Omni Action, and Omni Memory) that work as one continuous loop, with cloud LLMs called in only for heavy reasoning.
Omni Perception covers everything the phone can sense. It combines camera feeds, screen content, and voice input into a single pipeline. A vision-language model interprets the scene before the agent does anything else. So if you point your camera at a bottle and ask, “how much does this cost?”, the agent first figures out what you’re looking at, then opens the relevant shopping app and starts searching. No guessing required.
Omni Memory is the key feature that sets X-OmniClaw apart from a basic, single-response chatbot. The agent keeps track of context as you move between tasks, switch apps, and start new sessions. On top of that, it creates a long-term semantic memory by analyzing your photo gallery, converting unstructured images into organized notes about objects, scenes, and events. According to the report, “runtime continuity is what enables X-OmniClaw to function as a persistent device agent instead of a one-off response tool.”
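To make the gallery-memory idea concrete, here is a minimal sketch of turning photos into structured notes and searching them later. Everything in it is an assumption for illustration: the `PhotoNote` fields, the `GalleryMemory` class, and the sample file names are hypothetical, and plain keyword overlap stands in for whatever vision-language model and retrieval method X-OmniClaw actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PhotoNote:
    """Structured note for one gallery image (hypothetical schema)."""
    path: str
    objects: set[str] = field(default_factory=set)
    scene: str = ""
    event: str = ""


class GalleryMemory:
    """Toy long-term memory: stores notes, retrieves them by keyword overlap."""

    def __init__(self) -> None:
        self._notes: list[PhotoNote] = []

    def add(self, note: PhotoNote) -> None:
        self._notes.append(note)

    def search(self, query: str) -> list[str]:
        """Return paths of photos whose note shares a word with the query."""
        terms = set(query.lower().split())
        scored = []
        for note in self._notes:
            words = {w.lower() for w in note.objects}
            words |= set(note.scene.lower().split())
            words |= set(note.event.lower().split())
            score = len(terms & words)
            if score:
                scored.append((score, note.path))
        # Highest overlap first; ties are broken by path for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [path for _, path in scored]


memory = GalleryMemory()
memory.add(PhotoNote("IMG_001.jpg", {"parrot", "cage"}, "living room", "pet day"))
memory.add(PhotoNote("IMG_002.jpg", {"bottle"}, "kitchen"))
memory.add(PhotoNote("IMG_003.jpg", {"parrot"}, "park", "weekend walk"))
print(memory.search("parrot park"))  # ['IMG_003.jpg', 'IMG_001.jpg']
```

A production system would more plausibly rank notes by embedding similarity than by shared words, but the flow is the same: notes are written once, then queried whenever a later task needs them.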
Omni Action takes care of the actual execution. It merges XML interface data with an on-device visual model and OCR (optical character recognition) to determine precisely where to tap, even on cluttered, ad-filled screens where interface structure alone falls short. It also features behavior cloning: navigate to a hard-to-reach app page once while the system records it, and the agent can instantly replay that path next time using an Android deeplink shortcut.
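The record-once, replay-later idea can be sketched as a small lookup table from a task label to a saved deeplink. This is an illustration, not Oppo's implementation: the `DeeplinkCache` class, the task label, and the `exampleapp://` URI are hypothetical, and the sketch builds an `adb shell am start` command as it would be issued from a development machine, whereas the real agent runs on the device itself.

```python
from __future__ import annotations

import shlex
from typing import Optional


class DeeplinkCache:
    """Maps a task label to a deeplink URI captured during a recorded run."""

    def __init__(self) -> None:
        self._links: dict[str, str] = {}

    def record(self, task: str, uri: str) -> None:
        """Store the deeplink observed at the end of a manual navigation."""
        self._links[task] = uri

    def replay_command(self, task: str) -> Optional[list[str]]:
        """Build an adb command that opens the saved deeplink, if one exists."""
        uri = self._links.get(task)
        if uri is None:
            return None  # Nothing recorded: fall back to step-by-step navigation.
        return [
            "adb", "shell", "am", "start",
            "-a", "android.intent.action.VIEW",
            # Quote for the device-side shell so characters like & survive.
            "-d", shlex.quote(uri),
        ]


cache = DeeplinkCache()
cache.record("open video editor", "exampleapp://editor/new")  # hypothetical URI
print(cache.replay_command("open video editor"))
print(cache.replay_command("unknown task"))  # None
```

Returning `None` for an unrecorded task keeps the fast path optional: the agent can still navigate screen by screen when no shortcut has been captured.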
What the Oppo AI agent is capable of doing

Oppo outlined several real-world tasks the agent can handle. For instance, it recognizes a physical product through the camera, launches Taobao, browses through search results, and delivers a price comparison, all without any manual typing.
Oppo also showcased a floating on-screen assistant that walks a user through math problems one step at a time: it independently reads the screen, works through each question, and moves forward once completed.
Another demo involved a user asking the agent to compile a highlight reel from parrot-themed photos. The system searches the gallery, locates relevant images using its semantic memory, opens CapCut’s video editor through a deeplink, batch-selects the chosen files, and produces the video. A task that previously took “several minutes or more” is now reduced to a few automated actions.

2026: The year of agentic AI
AI agents have emerged as one of the hottest topics in the tech world. OpenClaw, the open-source agent framework that amassed over 373,000 GitHub stars and eventually received backing from OpenAI, kicked off the current wave by demonstrating what persistent, locally run agents could accomplish on PCs. Hermes Agent by Nous Research pushed things even further with a self-improving learning loop that builds on its capabilities over time.
Both of these primarily run on desktop hardware. X-OmniClaw brings the same architecture to the device you carry with you every day. The team built on the open-source HermesApp codebase, with the paper directly acknowledging OpenClaw’s structured skill model as a foundational influence, and then tailored that foundation for the multimodal, always-on demands of a smartphone.
The code is now available on GitHub. Oppo has committed to releasing all project assets and continuing to update the system as it develops further.