midscene-android-automation

AI-powered Android device automation using Midscene. Control Android devices with natural language commands via ADB. Perform taps, swipes, text input, app launches, and more.
Install via ClawdBot CLI:
clawdbot install quanru/midscene-android-automation

CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:
1. Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
2. Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
3. Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
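For example (a minimal sketch; the backgrounded line illustrates the anti-pattern, not something to run):

```sh
# Correct: run in the foreground and read the output before deciding the next action
npx @midscene/android@1 take_screenshot

# Wrong: backgrounding breaks the screenshot-analyze-act loop
npx @midscene/android@1 take_screenshot &
```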
Automate Android devices using npx @midscene/android@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
Example: Gemini (Gemini-3-Flash)
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
Example: Qwen3-VL
MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
MIDSCENE_MODEL_NAME="qwen/qwen3-vl-235b-a22b-instruct"
MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
MIDSCENE_MODEL_FAMILY="qwen3-vl"
Example: Doubao Seed 1.6
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-1-6-250615"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-vision"
Commonly used models: Doubao Seed 1.6, Qwen3-VL, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
If the model is not configured, ask the user to set it up. See Model Configuration for supported providers.
npx @midscene/android@1 connect
npx @midscene/android@1 connect --deviceId emulator-5554
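If several devices or emulators are attached, list their serials with adb first and pass the one you want to --deviceId (emulator-5554 below is the default serial of the first local emulator):

```sh
adb devices   # lists attached devices and their serials
npx @midscene/android@1 connect --deviceId emulator-5554
```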
npx @midscene/android@1 take_screenshot
After taking a screenshot, read the saved image file to understand the current screen state before deciding the next action.
Use act to interact with the device and get the result. It autonomously handles all UI interactions internally — tapping, typing, scrolling, swiping, waiting, and navigating — so you should give it complex, high-level tasks as a whole rather than breaking them into small steps. Describe what you want to do and the desired effect in natural language:
# specific instructions
npx @midscene/android@1 act --prompt "type hello world in the search field and press Enter"
npx @midscene/android@1 act --prompt "long press the message bubble and tap Delete in the popup menu"
# or target-driven instructions
npx @midscene/android@1 act --prompt "open Settings and navigate to Wi-Fi settings, tell me the connected network name"
npx @midscene/android@1 disconnect
Since CLI commands are stateless between invocations, follow this pattern:
1. connect to the target device.
2. take_screenshot and read the saved image to understand the current screen.
3. act to perform the desired action or target-driven instructions.
4. take_screenshot again to verify the result.
5. Repeat steps 2–4 until the task is complete, then disconnect.

Tips:

1. Launch apps directly with ADB (e.g., adb shell am start -n <package>/<activity>) before invoking any midscene commands. Then take a screenshot to confirm the app is actually in the foreground. Only after visual confirmation should you proceed with UI automation using this skill. ADB commands are significantly faster than using midscene to navigate to and open apps.
2. Describe UI elements specifically: say "the Wi-Fi toggle switch on the right side" instead of "the toggle".
3. Use visual and positional cues ("the search icon at the top right", "the third item in the list").
4. Batch consecutive operations into a single act command: when performing consecutive operations within the same app, combine them into one act prompt instead of splitting them into separate commands. For example, "open Settings, tap Wi-Fi, and toggle it on" should be a single act call, not three. This reduces round-trips, avoids unnecessary screenshot-analyze cycles, and is significantly faster. A full session following this pattern is sketched after the examples below.

Example — Popup menu interaction:
npx @midscene/android@1 act --prompt "long press the message bubble and tap Delete in the popup menu"
npx @midscene/android@1 take_screenshot
Example — Form interaction:
npx @midscene/android@1 act --prompt "fill in the username field with 'testuser' and the password field with 'pass123', then tap the Login button"
npx @midscene/android@1 take_screenshot
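Putting the whole pattern together, here is a minimal end-to-end session sketch. The Settings component name and the prompt are illustrative; substitute your own target app and task:

```sh
# Launch the target app directly via ADB (faster than navigating there with midscene)
adb shell am start -n com.android.settings/.Settings

# Connect, then visually confirm the app is in the foreground
npx @midscene/android@1 connect
npx @midscene/android@1 take_screenshot

# Perform the batched task as a single act call
npx @midscene/android@1 act --prompt "tap Wi-Fi and toggle it on"

# Verify the result, then release the device
npx @midscene/android@1 take_screenshot
npx @midscene/android@1 disconnect
```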
| Problem | Solution |
|---|---|
| ADB not found | Install Android SDK Platform Tools: brew install android-platform-tools (macOS) or download from developer.android.com. |
| Device not listed | Check USB connection, ensure USB debugging is enabled in Developer Options, and run adb devices. |
| Device shows "unauthorized" | Unlock the device and accept the USB debugging authorization prompt. Then run adb devices again. |
| Device shows "offline" | Disconnect and reconnect the USB cable. Run adb kill-server && adb start-server. |
| Command timeout | The device screen may be off or locked. Wake the device with adb shell input keyevent KEYCODE_WAKEUP and unlock it. |
| API key error | Check that the .env file contains a valid MIDSCENE_MODEL_API_KEY. See Model Configuration. |
| Wrong device targeted | If multiple devices are connected, use --deviceId flag with the connect command. |
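For the screen-off case, a typical wake sequence looks like this (assumes a non-secure swipe lockscreen; a device with a PIN or pattern must be unlocked by hand):

```sh
adb shell input keyevent KEYCODE_WAKEUP   # wake the screen
adb shell input keyevent KEYCODE_MENU     # dismiss a swipe-only lockscreen
adb devices                               # confirm the device reports "device", not "offline"
```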
Automates UI testing for Android apps by performing taps, swipes, and text input based on visual analysis of screenshots. Ideal for QA teams to run regression tests without writing code, reducing manual effort and ensuring consistency across device states.
Enables support agents to remotely guide users through app troubleshooting by automating steps like navigating to settings or resetting preferences. Uses natural language commands to interact with the user's device via ADB, improving resolution times and accuracy.
Automates repetitive tasks in social media or content management apps, such as deleting posts or flagging inappropriate content. Leverages screenshot analysis to identify and interact with specific UI elements, streamlining moderation processes.
Guides new users through setup processes in e-commerce or banking apps by automating taps to complete forms or enable features. Helps reduce drop-off rates during onboarding by providing hands-free, visual-driven interactions.
Offers the automation skill as a cloud-based service with tiered pricing based on usage volume or features. Targets businesses needing scalable Android automation, generating recurring revenue through monthly or annual subscriptions.
Sells perpetual licenses to large organizations for internal use, such as in QA departments or IT support teams. Includes customization, support, and training, providing high-value deals with upfront and maintenance revenue.
Provides a free basic version with limited commands or device connections, encouraging adoption. Monetizes through premium upgrades for advanced features like batch processing or priority support, driving conversion from individual users to teams.
💬 Integration Tip
Ensure environment variables for the AI model are configured before use, and always run commands synchronously to maintain the screenshot-analyze-act loop.
- Full Windows desktop control. Mouse, keyboard, screenshots - interact with any Windows application like a human.
- Control Android devices via ADB with support for UI layout analysis (uiautomator) and visual feedback (screencap). Use when you need to interact with Android apps, perform UI automation, take screenshots, or run complex ADB command sequences.
- Build, test, and ship iOS apps with Swift, Xcode, and App Store best practices.
- Control macOS GUI apps visually — take screenshots, click, scroll, type. Use when the user asks to interact with any Mac desktop application's graphical interface.
- Best practices and example-driven guidance for building SwiftUI views and components. Use when creating or refactoring SwiftUI UI, designing tab architecture with TabView, composing screens, or needing component-specific patterns and examples.
- Write safe Swift code avoiding memory leaks, optional traps, and concurrency bugs.