midscene-ios-automation
AI-powered iOS device automation using Midscene CLI. Control iOS devices and simulators with natural language commands via WebDriverAgent. Triggers: ios, iph...
Install via ClawdBot CLI:
clawdbot install quanru/midscene-ios-automation

CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:
1. Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
2. Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
3. Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
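The rules above can be enforced with a small shell wrapper. This is a sketch: the `run_midscene` name, the `MIDSCENE_CMD` variable, and the 180-second budget are our illustration conventions, not part of the Midscene CLI.

```shell
# Sketch: run one Midscene command at a time, in the foreground, with a timeout.
# MIDSCENE_CMD and the 180-second budget are assumptions for illustration.
MIDSCENE_CMD="${MIDSCENE_CMD:-npx @midscene/ios@1}"

run_midscene() {
  # Runs synchronously so output (and the screenshot path) can be read
  # before the next action; timeout guards against a hung command.
  timeout 180 $MIDSCENE_CMD "$@"
}
```

Call `run_midscene act --prompt "..."`, read its output, and only then issue the next command — never launch a second command while one is still running.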
Automate iOS devices using npx @midscene/ios@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
Example: Gemini (Gemini-3-Flash)
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
Example: Qwen3-VL
MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
MIDSCENE_MODEL_NAME="qwen/qwen3-vl-235b-a22b-instruct"
MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
MIDSCENE_MODEL_FAMILY="qwen3-vl"
Example: Doubao Seed 1.6
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-1-6-250615"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-vision"
Commonly used models: Doubao Seed 1.6, Qwen3-VL, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
If the model is not configured, ask the user to set it up. See Model Configuration for supported providers.
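Before running any command, it can help to confirm all four variables are present. A minimal preflight sketch in shell (`check_midscene_env` is a hypothetical helper, not part of the CLI; it checks the process environment, so variables from a .env file must already be exported):

```shell
# Verify the four required Midscene model variables are set before running
# any command. check_midscene_env is our helper name, not a Midscene tool.
check_midscene_env() {
  local missing=0
  for v in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
           MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    if [ -z "${!v}" ]; then
      echo "missing: $v"
      missing=1
    fi
  done
  return "$missing"
}
```

If the function reports a missing variable, stop and ask the user to configure the model rather than attempting a command that will fail with an authentication error.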
Connect to a device:
npx @midscene/ios@1 connect
Take a screenshot:
npx @midscene/ios@1 take_screenshot
After taking a screenshot, read the saved image file to understand the current screen state before deciding the next action.
Use act to interact with the device and get the result. It autonomously handles all UI interactions internally — tapping, typing, scrolling, swiping, waiting, and navigating — so you should give it complex, high-level tasks as a whole rather than breaking them into small steps. Describe what you want to do and the desired effect in natural language:
# specific instructions
npx @midscene/ios@1 act --prompt "type hello world in the search field and press Enter"
npx @midscene/ios@1 act --prompt "tap Delete, then confirm in the alert dialog"
# or target-driven instructions
npx @midscene/ios@1 act --prompt "open Settings and navigate to Wi-Fi, tell me the connected network name"
Disconnect when finished:
npx @midscene/ios@1 disconnect
Since CLI commands are stateless between invocations, follow this pattern: connect, take a screenshot and read it, use act to perform the desired action, take another screenshot to verify the result, and disconnect when finished.

Tips for writing act prompts:
- Be specific about targets: "the Settings icon in the top-right corner" instead of "the icon".
- Include location cues where helpful ("the search icon at the top right", "the third item in the list").
- Prefer one combined act command: when performing consecutive operations within the same app, combine them into one act prompt instead of splitting them into separate commands. For example, "open Settings, tap Wi-Fi, and check the connected network" should be a single act call, not three. This reduces round-trips, avoids unnecessary screenshot-analyze cycles, and is significantly faster.

Example — Alert dialog interaction:
npx @midscene/ios@1 act --prompt "tap the Delete button and confirm in the alert dialog"
npx @midscene/ios@1 take_screenshot
Example — Form interaction:
npx @midscene/ios@1 act --prompt "fill in the username field with 'testuser' and the password field with 'pass123', then tap the Login button"
npx @midscene/ios@1 take_screenshot
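The connect → act → screenshot → disconnect session can be sketched as a script. The `step` wrapper and `DRY_RUN` flag are our illustration conventions (not Midscene flags); `DRY_RUN` defaults to on here so the sketch prints each command instead of executing it against a device.

```shell
# Sketch of one full session: connect -> act -> screenshot -> disconnect.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "npx @midscene/ios@1 $*"
  else
    npx @midscene/ios@1 "$@"
  fi
}

step connect
step act --prompt "open Settings and navigate to Wi-Fi"
step take_screenshot   # read the saved image before deciding the next action
step disconnect
```

Set `DRY_RUN=0` to execute for real; in that mode each `step` still runs synchronously, preserving the screenshot-analyze-act loop.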
Troubleshooting:
Symptom: Connection refused or timeout errors.
Solution:
Symptom: No device detected or connection errors.
Solution:
Symptom: Authentication or model errors.
Solution: Check that the .env file contains MIDSCENE_MODEL_API_KEY=.

Generated Mar 1, 2026
Use cases:
Automate functional and regression testing for iOS applications by simulating user interactions like tapping buttons, entering text, and navigating through screens. This reduces manual testing effort and ensures consistent test execution across different app versions.
Guide users through initial device setup processes, such as configuring Wi-Fi, installing apps, or adjusting settings, by automating taps and swipes based on visual cues. This streamlines deployment for enterprises or customer support teams.
Automate the review of user-generated content within iOS apps by navigating through feeds, tapping on reported items, and taking screenshots for documentation. This helps scale moderation efforts while maintaining visual context.
Test iOS apps for accessibility compliance by automating interactions with screen elements to ensure they are usable with assistive technologies. This includes checking contrast, navigation flows, and element visibility.
Automate common e-commerce tasks like product search, adding items to cart, and checkout processes on iOS apps. This can be used for demo purposes, user training, or testing payment integrations.
Monetization ideas:
Offer a cloud-based platform where testing teams subscribe to access iOS automation tools, with tiered pricing based on usage, number of devices, or features. Revenue comes from monthly or annual subscriptions.
Provide bespoke automation solutions for businesses needing tailored iOS workflows, such as app-specific testing scripts or device management setups. Revenue is generated through project-based fees or hourly consulting rates.
Release a free version of the automation tool with basic capabilities, while charging for advanced features like parallel testing, detailed analytics, or priority support. This model attracts users and upsells to premium tiers.
💬 Integration Tip
Ensure environment variables for model APIs are correctly set before execution to avoid failures, and always run commands synchronously to maintain the screenshot-analyze-act loop.
Related skills:
Full Windows desktop control. Mouse, keyboard, screenshots - interact with any Windows application like a human.
Control Android devices via ADB with support for UI layout analysis (uiautomator) and visual feedback (screencap). Use when you need to interact with Android apps, perform UI automation, take screenshots, or run complex ADB command sequences.
Build, test, and ship iOS apps with Swift, Xcode, and App Store best practices.
Control macOS GUI apps visually — take screenshots, click, scroll, type. Use when the user asks to interact with any Mac desktop application's graphical interface.
Best practices and example-driven guidance for building SwiftUI views and components. Use when creating or refactoring SwiftUI UI, designing tab architecture with TabView, composing screens, or needing component-specific patterns and examples.
Write safe Swift code avoiding memory leaks, optional traps, and concurrency bugs.