computer-use: Full Desktop GUI Control for Headless Linux Servers
With 10,600+ downloads and 93 installs, computer-use is one of the most infrastructure-heavy skills on ClawHub. It solves a specific but common problem: how do you run and control desktop GUI applications on a Linux VPS that has no physical monitor? The answer is a virtual display, and this skill sets one up automatically, with a minimal XFCE desktop, VNC access, and scripts covering 17 mouse, keyboard, and screen actions.
The Problem It Solves
Most cloud servers are headless — no monitor, no GPU, no display. This means GUI applications (browsers, desktop apps, Electron tools, legacy software with no CLI) simply won't run. The workaround — Xvfb (X Virtual Framebuffer) — exists, but setting it up reliably with a desktop environment, stable VNC, auto-restart on crash, and consistent display settings takes hours of configuration.
computer-use does all of it in one setup script. After running setup-vnc.sh, you have a fully functional virtual desktop your agent can control like a real computer.
Core Concept
The skill creates a virtual X display at :99 with 1024×768 resolution (Anthropic's recommended resolution for computer-use applications). On top of this display sits a minimal XFCE desktop — just the window manager and panel, no desktop icons or bloat. Then x11vnc and noVNC provide remote viewing.
Xvfb (:99) → XFCE4 (minimal) → x11vnc → noVNC (browser access)
↑
Your agent's scripts interact here
Every action — screenshot, click, type, scroll — goes through dedicated bash scripts that speak directly to the X display.
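As a rough illustration of what "speaking directly to the X display" can look like, the sketch below builds xdotool arguments for the click variants. The skill's requirements list xdotool, but the actual contents of click.sh are not shown in this article, so the function name `xdotool_click_args` and the mapping are assumptions, not the skill's real implementation.

```shell
# Hypothetical helper: print the xdotool arguments for one click.
# xdotool numbers the buttons 1 = left, 2 = middle, 3 = right.
xdotool_click_args() (
  x=$1; y=$2; kind=${3:-left}
  case $kind in
    left)   spec="click 1" ;;
    middle) spec="click 2" ;;
    right)  spec="click 3" ;;
    double) spec="click --repeat 2 1" ;;
    triple) spec="click --repeat 3 1" ;;
    *) echo "unknown click kind: $kind" >&2; exit 2 ;;
  esac
  echo "mousemove $x $y $spec"
)

# Print the arguments only; nothing is sent to a display here.
xdotool_click_args 512 384 double
```

On a server where the virtual display is running, the printed arguments could be passed along as `DISPLAY=:99 xdotool $(xdotool_click_args 512 384 double)`.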
Deep Dive
Setup
./scripts/setup-vnc.sh
This single command installs and configures:
- Xvfb on display :99 with 1024×768
- XFCE4 minimal desktop (xfwm4 + panel, no xfdesktop)
- x11vnc with stability flags (-forever -shared -nopw)
- noVNC for browser-based remote viewing
- systemd services for all three — auto-start on boot, auto-restart on crash
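To give a feel for what the setup script saves you from, here is a dry-run sketch that only prints command lines roughly equivalent to the components listed above. It is not the skill's setup script: the colour depth, the VNC port 5900, port 6080 for websockify, and the `/usr/share/novnc` path are assumptions based on common defaults, and `print_stack_commands` is a hypothetical name.

```shell
# Print (do not run) commands that would start a comparable stack by hand.
print_stack_commands() (
  display=${1:-:99}
  geometry=${2:-1024x768}
  depth=24        # assumption: the article does not state a colour depth
  vnc_port=5900   # assumption: the conventional first VNC port
  web_port=6080   # assumption: the usual noVNC/websockify port
  echo "Xvfb $display -screen 0 ${geometry}x${depth}"
  echo "DISPLAY=$display xfwm4"
  echo "DISPLAY=$display xfce4-panel"
  echo "x11vnc -display $display -forever -shared -nopw -rfbport $vnc_port"
  # assumption: noVNC's web files live in /usr/share/novnc on this system
  echo "websockify --web /usr/share/novnc $web_port localhost:$vnc_port"
)

print_stack_commands
```

Running each of these by hand leaves out the part the skill handles with systemd: starting on boot and restarting after a crash.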
After setup, set the display variable:
export DISPLAY=:99
The 17 Actions Reference
| Action | Script | Arguments |
|---|---|---|
| screenshot | screenshot.sh | — |
| cursor_position | cursor_position.sh | — |
| mouse_move | mouse_move.sh | x y |
| left_click | click.sh | x y left |
| right_click | click.sh | x y right |
| middle_click | click.sh | x y middle |
| double_click | click.sh | x y double |
| triple_click | click.sh | x y triple |
| left_click_drag | drag.sh | x1 y1 x2 y2 |
| left_mouse_down | mouse_down.sh | — |
| left_mouse_up | mouse_up.sh | — |
| type | type_text.sh | "text" |
| key | key.sh | "combo" |
| hold_key | hold_key.sh | "key" seconds |
| scroll | scroll.sh | dir amount [x y] |
| wait | wait.sh | seconds |
| zoom | zoom.sh | x1 y1 x2 y2 |
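The table maps 17 action names onto a smaller set of scripts, so an agent harness needs some kind of dispatcher. A minimal sketch of one follows; `run_action`, `SCRIPTS_DIR`, and `DRY_RUN` are hypothetical names introduced here, and the argument order simply follows the table.

```shell
# Hypothetical dispatcher: translate an action name into a script call.
# With DRY_RUN set, the command is printed instead of executed.
run_action() (
  action=$1; shift
  dir=${SCRIPTS_DIR:-./scripts}
  run=${DRY_RUN:+echo}
  case $action in
    left_click|right_click|middle_click|double_click|triple_click)
      # click.sh takes the click kind as its third argument
      $run "$dir/click.sh" "$1" "$2" "${action%_click}" ;;
    left_click_drag) $run "$dir/drag.sh" "$@" ;;
    left_mouse_down) $run "$dir/mouse_down.sh" ;;
    left_mouse_up)   $run "$dir/mouse_up.sh" ;;
    type)            $run "$dir/type_text.sh" "$@" ;;
    screenshot|cursor_position|mouse_move|key|hold_key|scroll|wait|zoom)
      # these actions share their name with the script
      $run "$dir/$action.sh" "$@" ;;
    *) echo "unknown action: $action" >&2; exit 2 ;;
  esac
)

# Dry-run examples: print the commands without touching the display.
DRY_RUN=1
run_action left_click 512 384   # → ./scripts/click.sh 512 384 left
run_action key "ctrl+c"         # → ./scripts/key.sh ctrl+c
```

Keeping the mapping in one place means the agent only ever emits action names, and the script layout can change without touching the agent's prompts.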
The Core Workflow Loop
Every GUI automation task follows the same pattern:
# 1. Always start with a screenshot
./scripts/screenshot.sh
# → Returns base64 PNG — agent "sees" the screen
# 2. Analyze what's visible
# Agent identifies UI elements and their approximate coordinates
# 3. Act
./scripts/click.sh 512 384 left # Click a button
./scripts/type_text.sh "hello" # Type text
./scripts/key.sh "Return" # Press Enter
# 4. Screenshot again to verify
./scripts/screenshot.sh
This see-analyze-act loop is the fundamental pattern for any multi-step GUI task.
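When debugging the loop it can help to save what the agent saw at each step. The sketch below assumes screenshot.sh prints the base64 PNG on stdout (the article says it "returns" base64 but does not say how), and `decode_png` is a hypothetical helper name.

```shell
# Hypothetical helper: decode base64 from stdin into the file named by $1
# and succeed only if the result starts with the PNG signature bytes.
decode_png() (
  out=$1
  base64 -d > "$out" || exit 1
  # Every PNG file begins with the bytes 0x89 'P' 'N' 'G'.
  sig=$(head -c 4 "$out" | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "89504e47" ]
)

# Intended usage on the server (not run here):
#   ./scripts/screenshot.sh | decode_png step-01.png || echo "no valid screenshot"
```

Checking the signature catches the case where the script printed an error message instead of image data, which would otherwise be handed to the model as a broken image.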
Working with Keyboard Input
# Regular typing
./scripts/type_text.sh "Search query here"
# Sends in 50-character chunks with 12ms delay for reliability
# Key combinations
./scripts/key.sh "ctrl+c" # Copy
./scripts/key.sh "ctrl+v" # Paste
./scripts/key.sh "alt+F4" # Close window
./scripts/key.sh "super" # Open app launcher
./scripts/key.sh "ctrl+alt+t" # Open terminal (most Linux desktops)
# Hold a key
./scripts/hold_key.sh "shift" 2 # Hold shift for 2 seconds
The Zoom Action for Precision
# Get a cropped screenshot of a specific region
./scripts/zoom.sh 100 200 400 300
# → Captures the rectangle from (100,200) to (400,300)
zoom is essential for reading small UI text or examining specific areas without relying on coordinate estimation from a full 1024×768 screenshot.
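Coordinates found inside a zoomed crop are relative to the crop, so they have to be translated back before clicking. A small sketch of that arithmetic follows; it assumes the crop is returned at native scale unless an integer magnification factor is passed, which the article does not specify, and `zoom_to_screen` is a hypothetical helper.

```shell
# Hypothetical helper: map a point (cx, cy) inside a zoomed crop whose
# top-left corner is (x1, y1) back to full-screen coordinates.
# The optional fifth argument is an integer magnification factor.
zoom_to_screen() (
  x1=$1; y1=$2; cx=$3; cy=$4; scale=${5:-1}
  echo "$(( x1 + cx / scale )) $(( y1 + cy / scale ))"
)

# A button seen at (30, 40) in the crop that starts at (100, 200):
zoom_to_screen 100 200 30 40   # → 130 240
```

The result can then be passed to click.sh as the x and y arguments.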
VNC Access for Monitoring
While the agent operates via scripts, humans can watch via browser:
# noVNC runs on port 6080 by default
# Access at: http://your-server-ip:6080/vnc.html
This lets you verify what the agent is doing in real time, which is invaluable for debugging complex GUI workflows.
Comparison: Desktop Automation Approaches
| Approach | computer-use | Playwright/Puppeteer | Selenium | Remote Desktop (RDP) |
|---|---|---|---|---|
| Headless server support | ✅ | ✅ | ✅ | ❌ |
| Any GUI application | ✅ | ❌ (browser only) | ❌ (browser only) | ✅ |
| Agent-native scripts | ✅ | ⚠️ | ⚠️ | ❌ |
| Setup complexity | Medium | Low | Low | High |
| Visual debugging | ✅ (noVNC) | ⚠️ | ⚠️ | ✅ |
| Model-agnostic | ✅ | ✅ | ✅ | ✅ |
computer-use's key advantage over Playwright/Selenium: it works with any GUI application, not just browsers. If your automation target is a desktop app (LibreOffice, a legacy ERP system, a native game), browser automation tools simply don't apply.
How to Install
clawhub install computer-use
Requires a Linux VPS/server with:
- Root access (for systemd service installation)
- X11 support packages (the setup script installs these: xvfb, xfce4, x11vnc, xdotool)
- Python 3 for noVNC
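After setup, a quick sanity check can confirm that the pieces listed above actually landed on the machine. This is a sketch, not part of the skill: `missing_tools` is a hypothetical helper, and the binary names (`Xvfb`, `xfwm4`, and so on) are the usual ones shipped by those packages.

```shell
# Hypothetical helper: print each named tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Binaries normally provided by the packages the setup script installs.
missing=$(missing_tools Xvfb xfwm4 x11vnc xdotool python3)
if [ -n "$missing" ]; then
  echo "not found on PATH:" $missing
else
  echo "all expected tools are installed"
fi

# Installing systemd services needs root, so report that as well.
if [ "$(id -u)" -eq 0 ]; then
  echo "running as root"
else
  echo "not running as root"
fi
```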
Practical Tips
- Always screenshot first: never assume the screen state; take a screenshot before every action sequence
- Use zoom for small UI elements: coordinates from a full screenshot may be imprecise for small buttons; zoom in for accuracy
- Triple-click for text selection: triple_click selects the current line; useful before typing to replace existing content
- 50-char chunks in type_text: the script handles chunking automatically, but very long inputs may need to be split across several calls
- Open terminal via key combo: ctrl+alt+t opens a terminal in most XFCE setups; from there your agent has full CLI access without more GUI navigation
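For the long-input case in the chunking tip, one option is to split the text yourself and call the typing script once per piece. The sketch below assumes single-line text; `type_long`, `split_text`, `TYPE_CMD`, and the 500-character `CHUNK` size are all hypothetical choices, not limits documented by the skill.

```shell
# Hypothetical wrapper for typing very long single-line text.
TYPE_CMD=${TYPE_CMD:-./scripts/type_text.sh}
CHUNK=${CHUNK:-500}   # assumption: an arbitrary per-call size

# Print the text as lines of at most CHUNK columns. Note that fold may
# count bytes rather than characters on some systems, so multibyte text
# could be split mid-character.
split_text() {
  printf '%s\n' "$1" | fold -w "${2:-$CHUNK}"
}

# Call the typing command once per piece, stopping at the first failure.
type_long() (
  split_text "$1" | while IFS= read -r piece; do
    # stdin is redirected so the command cannot swallow the remaining pieces
    "$TYPE_CMD" "$piece" </dev/null || exit 1
  done
)

# Intended usage on the server (not run here):
#   type_long "$(cat long-paragraph.txt)"
```

Taking a screenshot between pieces is a cheap way to confirm that focus did not move while the text was being entered.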
Considerations
- Linux only: The skill depends on Xvfb, X11 tooling, and systemd services, and its setup script targets Linux. macOS and Windows VMs require different approaches.
- Resolution fixed at 1024×768: This is hardcoded to match Anthropic's recommended computer-use resolution. Applications that require higher resolution may not render optimally.
- Performance on low-end VPS: Xvfb + XFCE consumes ~200-400 MB RAM. On 512 MB VPS instances, this may be tight alongside the application you're automating.
- No OCR built-in: The agent "sees" the screen through base64 PNG screenshots. Extracting text from the image requires a multimodal vision model (such as Claude or GPT-4V), and not all agents support this.
The Bigger Picture
computer-use represents a different paradigm from API-based automation: instead of finding an API, you teach the agent to use the GUI like a human would. This opens up automation for the vast category of software that has no API — legacy enterprise tools, desktop applications, web apps that aggressively block automation. Wherever a human can click it, computer-use can automate it.
View the skill on ClawHub: computer-use