Inspiration

The idea came from a frustrating afternoon trying to automate a job application on a popular mobile job board. Every tool we tried failed within minutes, blocked by anti-bot fingerprinting, randomized DOM class names, and behavioral detection. At some point it clicked: the entire mobile-first internet is effectively invisible to AI.

Modern automation tools are built on APIs: structured gates that developers deliberately open for machines. But the apps that actually matter in people's daily lives (gig work platforms, job boards, messaging apps, mobile banking) either have no public API, offer only a heavily rate-limited one, or lock it behind expensive enterprise pricing. If a human can see it and tap it, why can't an AI?

That question became the whole point of Iron Claw. Instead of teaching AI to knock on developer-built API doors, we wanted to teach it to walk through any door, the same way a human does, by looking at the screen.


What We Learned

The Accessibility Layer Is a Hidden Superpower

Android's AccessibilityService API was originally built for screen readers. What we found is that it exposes a structured XML hierarchy of every UI element on screen, with metadata like resource-id, content-desc, and is-clickable, which is far more semantically useful than a raw screenshot. This tree is what Droidrun reads, and it turns out to be a surprisingly clean input for an LLM trying to figure out what to do next.
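As an illustration, here is a minimal sketch of turning a uiautomator-style XML dump into compact node dicts for an LLM prompt. The attribute names follow typical uiautomator output and are assumptions, not Droidrun's exact schema:

```python
# Minimal sketch: flatten a uiautomator-style XML dump into the fields an LLM
# actually needs. Attribute names are assumptions based on typical dumps.
import xml.etree.ElementTree as ET

def flatten_tree(xml_dump: str) -> list[dict]:
    """Extract id, label, text, clickability, and bounds from each UI node."""
    root = ET.fromstring(xml_dump)
    nodes = []
    for node in root.iter("node"):
        nodes.append({
            "resource_id": node.get("resource-id", ""),
            "content_desc": node.get("content-desc", ""),
            "text": node.get("text", ""),
            "clickable": node.get("clickable") == "true",
            "bounds": node.get("bounds", ""),
        })
    return nodes
```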

UI-Native Agents Break Less

Traditional bots fail the moment a site changes its HTML class names or restructures its DOM. A UI-native agent doesn't care. To the LLM, a button labeled "Apply" is an "Apply" button regardless of what the underlying markup looks like. The same quality that makes GUIs intuitive for humans is what makes this approach durable.

The Brain and the Body Need to Be Separate

Early on we tried running inference closer to the device. It throttled within minutes from heat. The lesson was obvious in hindsight: heavy LLM reasoning lives in the cloud as the Control Plane, while the Android device (local or via MobileRun Cloud) is purely the Execution Plane. That separation shaped every architectural decision after it.

Automation Has a Real Cost Function

Given a task \(T\) decomposed into \(n\) atomic UI actions, the total cost looks like this:

$$ C(T) = \sum_{i=1}^{n} \left( \alpha \cdot l_i + \beta \cdot e_i \right) $$

where \(l_i\) is inference latency at step \(i\), \(e_i\) is the error probability at that step, and \(\alpha, \beta\) are tunable weights. Minimizing \(C(T)\) pushed us toward concrete decisions: use Android Intents for precision tasks like alarms rather than tapping through the UI, prune accessibility trees before sending them to the LLM (cutting token costs by roughly 60%), and reserve visual navigation for cases where structured shortcuts don't exist.
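A toy calculation makes the trade-off concrete. The weights and per-step numbers below are illustrative placeholders, not measurements from our system:

```python
# Minimal sketch of the cost function above with made-up numbers.
def task_cost(steps: list[tuple[float, float]], alpha: float = 1.0, beta: float = 5.0) -> float:
    """steps is a list of (latency_seconds, error_probability) per atomic UI action."""
    return sum(alpha * latency + beta * error_prob for latency, error_prob in steps)

# Example: an Intent-based alarm (one step) vs. tapping through the Clock UI (four steps).
intent_path = [(1.2, 0.01)]
ui_path = [(1.5, 0.05), (1.4, 0.05), (1.6, 0.08), (1.5, 0.05)]
assert task_cost(intent_path) < task_cost(ui_path)
```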

The Agent Loop Is Just a Feedback System

The core execution model is a closed loop:

$$ s_{t+1} = f(s_t, a_t), \quad a_t = \pi_\theta(s_t) $$

At each step, the agent reads the current UI state \(s_t\) from the accessibility snapshot, passes it to the LLM policy \(\pi_\theta\), picks an action \(a_t\), executes it, and observes the new state \(s_{t+1}\). No hardcoded scripts, no brittle selectors. Just observe, think, act, repeat.
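In code, the loop is only a few lines. This is a minimal sketch; the three callables are hypothetical stand-ins for the real Droidrun/Gateway integrations:

```python
from typing import Callable

# Minimal sketch of the observe-think-act loop.
def run_agent(
    goal: str,
    capture_ui_state: Callable[[], dict],     # s_t: pruned accessibility snapshot
    llm_policy: Callable[[str, dict], dict],  # a_t = pi_theta(s_t)
    execute_action: Callable[[dict], None],   # tap, type, swipe, intent, ...
    max_steps: int = 30,
) -> bool:
    state = capture_ui_state()
    for _ in range(max_steps):
        action = llm_policy(goal, state)
        if action.get("type") == "done":
            return True
        execute_action(action)                # apply a_t
        state = capture_ui_state()            # observe s_{t+1} = f(s_t, a_t)
    return False
```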


How We Built It

The Stack

Iron Claw is a distributed system with three layers.

Gateway (Control Plane) is a FastAPI server in Python 3.11+ that acts as the brain. It doesn't execute actions itself; it decides what actions are needed. It handles intent parsing, task scheduling via APScheduler, chat interfaces over Telegram and the web UI, and Bio-Memory, a local vector store holding the user's parsed resume, preferences, and credentials that the LLM can pull from mid-task.
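As a rough illustration of the Gateway's shape, here is a minimal sketch of a FastAPI route that accepts a task and defers it through APScheduler. The route name, request model, and dispatch_to_device helper are assumptions, not the real code:

```python
# Minimal sketch of the Control Plane's task entry point (assumed names).
from fastapi import FastAPI
from pydantic import BaseModel
from apscheduler.schedulers.asyncio import AsyncIOScheduler

app = FastAPI()
scheduler = AsyncIOScheduler()

class TaskRequest(BaseModel):
    instruction: str              # e.g. "apply to backend roles in Berlin"
    run_at: str | None = None     # ISO timestamp; None means run immediately

@app.on_event("startup")
async def start_scheduler() -> None:
    scheduler.start()

@app.post("/tasks")
async def create_task(req: TaskRequest) -> dict:
    if req.run_at:
        # Deferred: let APScheduler fire the agent run at the requested time.
        scheduler.add_job(dispatch_to_device, "date", run_date=req.run_at, args=[req.instruction])
        return {"status": "scheduled"}
    await dispatch_to_device(req.instruction)
    return {"status": "started"}

async def dispatch_to_device(instruction: str) -> None:
    """Placeholder for routing the parsed intent to the Execution Plane."""
    ...
```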

Execution Plane is MobileRun Cloud running the persistent Android environment. Droidrun sits on top, installing a portal.apk that uses AccessibilityService to capture the full UI tree at every step. ADB traffic between the Gateway and the device runs through a Tailscale tunnel so screen data never hits the public internet unencrypted.

Web UI is a React/Vite frontend with a chat interface, action pills, a thread sidebar with keyboard shortcuts (Cmd+Shift+O for new chat, Cmd+Shift+S to toggle sidebar), and real-time device mirroring over WebRTC. The mirror service relays signaling between the Portal app's built-in H.264 stream and the browser, which got us out of the 1-3 FPS ceiling that MJPEG was stuck at.

The Three Core Modules

Module A: The Job Hunter bypasses anti-bot defenses by operating at the presentation layer. It ingests a resume (parsed to structured JSON via LLM extraction), pushes the PDF to the device via adb push so it's physically present when file upload dialogs appear, then drives mobile Chrome through Google Jobs with visual filtering: tapping filter chips, reading job cards, and filling application forms by mapping Bio-Memory fields to whatever's on screen. A Flask microservice handles orchestration and logs applications to MongoDB and Google Sheets.
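Two of those steps are easy to sketch. The device path and field names below are illustrative assumptions; in the real system the LLM does the label-to-field mapping:

```python
# Minimal sketch: stage the resume on the device, then naively match
# Bio-Memory fields to whatever labels are read off the current form.
import subprocess

def stage_resume(local_pdf: str, device_path: str = "/sdcard/Download/resume.pdf") -> None:
    """Push the resume so it is physically present when a file-upload dialog opens."""
    subprocess.run(["adb", "push", local_pdf, device_path], check=True)

def map_fields(bio_memory: dict, on_screen_labels: list[str]) -> dict:
    """Naive substring matching as a stand-in for the LLM's field mapping."""
    filled = {}
    for label in on_screen_labels:
        for key, value in bio_memory.items():
            if key.replace("_", " ") in label.lower():
                filled[label] = value
    return filled

# Example usage:
# stage_resume("resume.pdf")
# map_fields({"full_name": "Ada Lovelace", "email": "ada@example.com"},
#            ["Full name", "Email address"])
```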

Module B: The Temporal Guardian uses a deliberate hybrid approach. For precision tasks like alarms, the agent uses Android Intents via ADB shell (android.intent.action.SET_ALARM) rather than tapping through the Clock UI, which is more reliable and less prone to mis-taps. For calendar tasks, Droidrun opens Google Calendar visually, reads existing time blocks before scheduling to catch conflicts, and handles complex recurrence UIs that the Calendar API doesn't expose.
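The Intent path for alarms is small enough to show in full. This is a minimal sketch assuming an adb binary on PATH; the extras are the standard AlarmClock intent extras:

```python
# Minimal sketch of setting an alarm via android.intent.action.SET_ALARM
# instead of tapping through the Clock UI.
import subprocess

def set_alarm(hour: int, minute: int, message: str = "Wake-up") -> None:
    subprocess.run([
        "adb", "shell", "am", "start",
        "-a", "android.intent.action.SET_ALARM",
        "--ei", "android.intent.extra.alarm.HOUR", str(hour),
        "--ei", "android.intent.extra.alarm.MINUTES", str(minute),
        "--es", "android.intent.extra.alarm.MESSAGE", message,
        "--ez", "android.intent.extra.alarm.SKIP_UI", "true",
    ], check=True)

# set_alarm(2, 0)
```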

Module C: The Active Interrupter handles wake-up calls via Vapi with dynamic timezone resolution. Every hour, the agent polls adb shell dumpsys location, extracts GPS coordinates, maps them to a timezone string via timezonefinder, and updates the APScheduler trigger so the Vapi call fires at 2:00 AM in the user's current local time, even after travel. The Vapi assistant includes a cognitive check (solve a math problem) to verify the user is actually awake, not just half-asleep saying "yeah I'm up."
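The hourly GPS-to-timezone resolution looks roughly like the sketch below. The dumpsys parsing regex is an assumption; real output varies by Android version and OEM:

```python
# Minimal sketch of resolving the user's current timezone from device GPS.
import re
import subprocess
from timezonefinder import TimezoneFinder

_tf = TimezoneFinder()

def current_timezone() -> str | None:
    dump = subprocess.run(
        ["adb", "shell", "dumpsys", "location"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Grab the last "lat,lon" pair in the dump (format differs across devices).
    pairs = re.findall(r"(-?\d{1,3}\.\d+),(-?\d{1,3}\.\d+)", dump)
    if not pairs:
        return None                              # GPS off: fall back to IP geolocation
    lat, lon = map(float, pairs[-1])
    return _tf.timezone_at(lat=lat, lng=lon)     # e.g. "Europe/Berlin"
```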

OpenClaw and Creao-Powered Tools

To give the agent reach beyond the device itself, Iron Claw integrates with OpenClaw, a webhook-based task queue that lets external orchestrators dispatch structured tasks and poll their status.

For live web intelligence, we used Creao to wire two tools into the OpenClaw integration:

  • Web Search lets the agent run structured queries and pull current info, mainly used during job hunting to validate company details, check posting freshness, and cross-reference role requirements before starting an application.
  • Web Scraping lets the agent pull structured data from dynamically rendered pages or session-dependent sites, useful for grabbing full job descriptions or form field schemas before the mobile agent opens the app.

Both tools register as processors in the OpenClaw webhook handler. When a task payload includes a web_search or scrape step, the Gateway routes it through the Creao client, gets the result back, and injects it into the LLM's context before the next device action. The agent effectively does its research in parallel with its phone-based execution.
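A minimal sketch of that routing pattern is below. The step names mirror the payload fields above; the registry, decorator, and stubbed processors are illustrative, not the real handler:

```python
# Minimal sketch of routing payload steps to registered processors.
from typing import Callable

PROCESSORS: dict[str, Callable[[dict], dict]] = {}

def register(step_type: str):
    def wrap(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        PROCESSORS[step_type] = fn
        return fn
    return wrap

@register("web_search")
def web_search(step: dict) -> dict:
    # Would call the Creao client here; stubbed for the sketch.
    return {"results": f"search results for {step['query']!r}"}

@register("scrape")
def scrape(step: dict) -> dict:
    return {"page": f"scraped content from {step['url']}"}

def run_steps(task_payload: dict, llm_context: list[dict]) -> None:
    """Execute web steps and inject results into the LLM context before device actions."""
    for step in task_payload.get("steps", []):
        if step["type"] in PROCESSORS:
            llm_context.append(PROCESSORS[step["type"]](step))
```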

OpenClaw also handles task lifecycle: requests create TaskInfo objects that move through queued, running, and completed/failed states, return a runId immediately for async polling, and optionally send Telegram notifications for visibility on long-running jobs.
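In outline, the lifecycle object looks something like this minimal sketch; the field names are assumptions:

```python
# Minimal sketch of the task lifecycle described above.
import uuid
from dataclasses import dataclass, field
from enum import Enum

class TaskState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class TaskInfo:
    instruction: str
    state: TaskState = TaskState.QUEUED
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # returned immediately for async polling
```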

Human-in-the-Loop (HITL)

When the agent hits something it can't resolve on its own (a visual CAPTCHA, an unexpected dialog, a multi-factor auth prompt), it pauses and sends a Telegram message with a screenshot and three buttons: Retry, Abort, or "I solved it." The user can open the MobileRun WebRTC stream, fix it manually on the live device, then tap "I solved it" to resume. That handoff is what makes the system usable for anything high-stakes.
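The escalation itself is just a photo message with an inline keyboard. Here is a minimal sketch against the raw Telegram Bot API; the token/chat placeholders and callback strings are illustrative:

```python
# Minimal sketch of the HITL escalation via the Telegram Bot API.
import json
import requests

def escalate(bot_token: str, chat_id: str, screenshot_path: str, reason: str) -> None:
    keyboard = {"inline_keyboard": [[
        {"text": "Retry", "callback_data": "retry"},
        {"text": "Abort", "callback_data": "abort"},
        {"text": "I solved it", "callback_data": "resolved"},
    ]]}
    with open(screenshot_path, "rb") as photo:
        requests.post(
            f"https://api.telegram.org/bot{bot_token}/sendPhoto",
            data={
                "chat_id": chat_id,
                "caption": f"Agent paused: {reason}",
                "reply_markup": json.dumps(keyboard),
            },
            files={"photo": photo},
            timeout=30,
        )
```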


Challenges Faced

1. ADB Sends Everything in Plaintext

ADB transmits keystrokes and screen dumps in cleartext by default. Routing that over a public cloud connection to MobileRun was out of the question, given that the screen content could include passwords, 2FA codes, and resume files. We set up a Tailscale mesh network between the Gateway and the MobileRun instance to encrypt the tunnel. The agent connects to localhost:5555 rather than any public IP.

2. Accessibility Trees Are Huge

Feeding a raw accessibility tree to an LLM at every step is expensive and slow. Trees in content-heavy apps can have hundreds of nodes, which blew through token budgets and hurt reasoning quality. We wrote a pruning heuristic that strips layout-only nodes (frames, containers with no semantic content) while keeping all interactive and textual elements. That got average tree size down by about 60% without losing anything actionable.
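The heuristic itself is simple. A minimal sketch, assuming the flattened node format from the earlier example:

```python
# Minimal sketch of the pruning heuristic: drop layout-only nodes,
# keep anything interactive or textual.
def is_actionable(node: dict) -> bool:
    return bool(
        node.get("clickable")
        or node.get("text", "").strip()
        or node.get("content_desc", "").strip()
    )

def prune(nodes: list[dict]) -> list[dict]:
    """Keep interactive and textual elements; strip frames and empty containers."""
    return [n for n in nodes if is_actionable(n)]
```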

3. UIs Change Without Warning

A button that existed on Tuesday might be a menu item by Thursday. Hardcoded navigation paths break constantly. We handled this with a retry-with-replan loop: if an action fails or an expected element isn't there, the agent re-snapshots the screen and rethinks a path to the goal from the current state instead of throwing an error. That made the system tolerant of minor layout changes with no manual fixes.
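A minimal sketch of that retry-with-replan loop is below; the callables stand in for the real snapshot/plan/execute integrations and ActionFailed is a hypothetical error type:

```python
from typing import Callable

class ActionFailed(Exception):
    pass

def act_with_replan(
    goal: str,
    snapshot: Callable[[], dict],
    plan_next_action: Callable[[str, dict], dict],
    execute: Callable[[dict], None],
    max_attempts: int = 3,
) -> None:
    for _ in range(max_attempts):
        state = snapshot()                      # re-read the screen every attempt
        action = plan_next_action(goal, state)  # re-plan from the *current* state
        try:
            execute(action)
            return
        except ActionFailed:
            continue                            # element moved or vanished: try again
    raise ActionFailed(f"could not reach goal after {max_attempts} replans: {goal}")
```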

4. Android 16 Broke Screen Recording

An OEM hardening change on Android 16 (OnePlus CPH2585) broke adb screenrecord stdout piping, which the legacy H.264/fMP4 mirror server depended on. We rebuilt device mirroring from scratch using the Portal app's native WebRTC stack (WebRtcManager.kt). The app already had a full WebRTC implementation with MediaProjectionAutoAccept, so the new mirror service just acts as a WebSocket signaling relay between the Portal app and the browser, with H.264 video flowing peer-to-peer. That got us from 1-3 FPS to smooth real-time preview.
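A signaling relay in that style is small. Here is a minimal sketch using FastAPI WebSockets; the endpoint path and two-peer model are assumptions, not the real mirror service:

```python
# Minimal sketch of a WebSocket signaling relay between the Portal app and the
# browser; the H.264 video itself flows peer-to-peer over WebRTC.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
peers: dict[str, WebSocket] = {}   # "portal" (device) and "browser"

@app.websocket("/signal/{role}")
async def signal(ws: WebSocket, role: str) -> None:
    await ws.accept()
    peers[role] = ws
    other = "browser" if role == "portal" else "portal"
    try:
        while True:
            message = await ws.receive_text()          # SDP offers/answers, ICE candidates
            if other in peers:
                await peers[other].send_text(message)  # just forward; no media touches the relay
    except WebSocketDisconnect:
        peers.pop(role, None)
```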

5. Accessibility Permissions Mean Full Device Access

Granting AccessibilityService access to Droidrun is essentially giving it God Mode on the device: banking apps, 2FA codes, private messages, all of it. We added a strict package whitelist enforced at the Gateway level. If the agent navigates outside the allowed app list (com.android.chrome, com.google.android.calendar, com.google.android.deskclock), the Gateway fires adb shell input keyevent HOME immediately and kills the session. We also use ephemeral sessions, Chrome Incognito mode, and delete resume.pdf from the device right after upload.
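A minimal sketch of that check is below. The foreground-app query via dumpsys window is an assumption about how the current package gets detected:

```python
# Minimal sketch of the whitelist enforcement at the Gateway level.
import re
import subprocess

ALLOWED_PACKAGES = {
    "com.android.chrome",
    "com.google.android.calendar",
    "com.google.android.deskclock",
}

def foreground_package() -> str | None:
    dump = subprocess.run(
        ["adb", "shell", "dumpsys", "window"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"mCurrentFocus=.*\s([\w.]+)/", dump)
    return match.group(1) if match else None

def enforce_whitelist() -> None:
    pkg = foreground_package()
    if pkg and pkg not in ALLOWED_PACKAGES:
        # Kick back to the launcher; the Gateway then kills the session.
        subprocess.run(["adb", "shell", "input", "keyevent", "HOME"], check=True)
        raise RuntimeError(f"agent left the whitelist: {pkg}")
```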

6. Timezone Scheduling Is Trickier Than It Looks

A static cron job for a 2 AM wake call breaks as soon as the user travels. We poll GPS coordinates every hour, map them to a timezone string via timezonefinder, and hot-update the APScheduler trigger. The edge cases added up fast: GPS disabled (IP geolocation fallback), airplane mode, and the brief window where location data is stale but the scheduler has already fired.
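The hot-update itself is a small APScheduler call. A minimal sketch, where the timezone string comes from the hourly GPS poll and the job id is illustrative:

```python
# Minimal sketch of hot-updating the 2 AM wake-call trigger for a new timezone.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

scheduler = BackgroundScheduler()
scheduler.start()

def place_wake_call() -> None:
    """Placeholder for triggering the Vapi call."""
    ...

def refresh_wake_call(tz_name: str, job_id: str = "wake_call") -> None:
    """tz_name comes from the hourly GPS poll, e.g. "Europe/Berlin"."""
    trigger = CronTrigger(hour=2, minute=0, timezone=tz_name)
    if scheduler.get_job(job_id):
        scheduler.reschedule_job(job_id, trigger=trigger)
    else:
        scheduler.add_job(place_wake_call, trigger, id=job_id)
```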


Iron Claw is a bet that the next wave of personal AI doesn't live in a chat window. It lives in your phone, running the same apps you use every day, while you sleep.
