
computer-use: Full Desktop GUI Control for Headless Linux Servers

March 17, 2026 · 6 min read

With 10,600+ downloads and 93 installs, computer-use is one of the most infrastructure-heavy skills on ClawHub. It solves a specific but common problem: how do you run and control desktop GUI applications on a Linux VPS that has no physical monitor? The answer is a virtual display, and this skill sets one up automatically with a minimal XFCE desktop, VNC access, and a set of 17 actions (mouse, keyboard, screenshot) exposed through bash scripts.

The Problem It Solves

Most cloud servers are headless — no monitor, no GPU, no display. This means GUI applications (browsers, desktop apps, Electron tools, legacy software with no CLI) simply won't run. The workaround — Xvfb (X Virtual Framebuffer) — exists, but setting it up reliably with a desktop environment, stable VNC, auto-restart on crash, and consistent display settings takes hours of configuration.

computer-use does all of it in one setup script. After running setup-vnc.sh, you have a fully functional virtual desktop your agent can control like a real computer.

Core Concept

The skill creates a virtual X display at :99 with 1024×768 resolution (Anthropic's recommended resolution for computer-use applications). On top of this display sits a minimal XFCE desktop — just the window manager and panel, no desktop icons or bloat. Then x11vnc and noVNC provide remote viewing.

Xvfb (:99) → XFCE4 (minimal) → x11vnc → noVNC (browser access)
             ↑
    Your agent's scripts interact here

Every action — screenshot, click, type, scroll — goes through dedicated bash scripts that speak directly to the X display.

Deep Dive

Setup

./scripts/setup-vnc.sh

This single command installs and configures:

Xvfb on display :99 with 1024×768
XFCE4 minimal desktop (xfwm4 + panel, no xfdesktop)
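x11vnc with flags for unattended use (-forever -shared -nopw): keep listening after a viewer disconnects, allow several viewers at once, and skip the no-password warning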
noVNC for browser-based remote viewing
systemd services for these components, so they start on boot and restart on crash
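
For orientation, the components above correspond roughly to the manual commands below. This is a sketch under stated assumptions, not the contents of setup-vnc.sh: the 24-bit colour depth, VNC port 5900, and the /usr/share/novnc web root are illustrative guesses, the XFCE session is left out, and the skill's real systemd units may differ.

```shell
DISPLAY_NUM=":99"
GEOMETRY="1024x768x24"          # assumption: 24-bit colour depth
VNC_PORT=5900                   # assumption: conventional VNC port
NOVNC_PORT=6080                 # noVNC port named in this article
NOVNC_WEB="/usr/share/novnc"    # assumption: where the noVNC files live

# Each helper prints the command line one service would run.
xvfb_cmd()   { printf 'Xvfb %s -screen 0 %s\n' "$DISPLAY_NUM" "$GEOMETRY"; }
x11vnc_cmd() { printf 'x11vnc -display %s -forever -shared -nopw -rfbport %s\n' "$DISPLAY_NUM" "$VNC_PORT"; }
novnc_cmd()  { printf 'websockify --web %s %s localhost:%s\n' "$NOVNC_WEB" "$NOVNC_PORT" "$VNC_PORT"; }

# Print the commands for review instead of launching anything.
xvfb_cmd
x11vnc_cmd
novnc_cmd
```

Printing the commands rather than running them keeps the sketch safe to paste on any machine; on a real server each line would become the ExecStart of its own unit.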

After setup, set the display variable:

export DISPLAY=:99
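
Before sending any actions it can help to confirm that the X server behind DISPLAY is actually up. The sketch below relies on the common convention that an X server for display :N creates a Unix socket at /tmp/.X11-unix/XN; the helper names are hypothetical and not part of the skill.

```shell
# Map an X display name such as ":99" to its Unix socket path.
display_socket_path() {
  num=${1#:}        # strip the leading colon
  num=${num%%.*}    # drop an optional ".screen" suffix
  printf '/tmp/.X11-unix/X%s\n' "$num"
}

# Succeeds once the X server for the given display has created its socket.
# Falls back to $DISPLAY when no argument is passed.
display_ready() {
  [ -S "$(display_socket_path "${1:-$DISPLAY}")" ]
}
```

A typical use would be `display_ready :99 || echo "Xvfb is not up yet"` at the start of an agent session.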

The 17 Actions Reference

Action	Script	Arguments
screenshot	`screenshot.sh`	—
cursor_position	`cursor_position.sh`	—
mouse_move	`mouse_move.sh`	x y
left_click	`click.sh`	x y left
right_click	`click.sh`	x y right
middle_click	`click.sh`	x y middle
double_click	`click.sh`	x y double
triple_click	`click.sh`	x y triple
left_click_drag	`drag.sh`	x1 y1 x2 y2
left_mouse_down	`mouse_down.sh`	—
left_mouse_up	`mouse_up.sh`	—
type	`type_text.sh`	"text"
key	`key.sh`	"combo"
hold_key	`hold_key.sh`	"key" seconds
scroll	`scroll.sh`	dir amount [x y]
wait	`wait.sh`	seconds
zoom	`zoom.sh`	x1 y1 x2 y2
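
An agent harness usually receives an action name and has to turn it into one of these script calls. The dispatcher below is a minimal sketch of that mapping built only from the table above; the function name action_to_command is hypothetical, and it prints the command line for logging rather than executing it.

```shell
# Translate an action name plus its arguments into the script invocation
# listed in the table. Prints the command line; returns 1 for unknown actions.
action_to_command() {
  action=$1
  shift
  case "$action" in
    left_click|right_click|middle_click|double_click|triple_click)
      # click.sh takes the click kind as its third argument
      set -- ./scripts/click.sh "$1" "$2" "${action%_click}" ;;
    left_click_drag) set -- ./scripts/drag.sh "$@" ;;
    left_mouse_down) set -- ./scripts/mouse_down.sh ;;
    left_mouse_up)   set -- ./scripts/mouse_up.sh ;;
    type)            set -- ./scripts/type_text.sh "$@" ;;
    screenshot|cursor_position|mouse_move|key|hold_key|scroll|wait|zoom)
      # these actions share their name with the script
      set -- "./scripts/$action.sh" "$@" ;;
    *)
      echo "unknown action: $action" >&2
      return 1 ;;
  esac
  # Joined with spaces for display only; quoting of arguments is not preserved.
  printf '%s\n' "$*"
}
```

For example, `action_to_command left_click 512 384` prints `./scripts/click.sh 512 384 left`. To execute instead of print, the final printf could be replaced with `"$@"`.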

The Core Workflow Loop

Every GUI automation task follows the same pattern:

# 1. Always start with a screenshot
./scripts/screenshot.sh
# → Returns base64 PNG — agent "sees" the screen
 
# 2. Analyze what's visible
# Agent identifies UI elements and their approximate coordinates
 
# 3. Act
./scripts/click.sh 512 384 left    # Click a button
./scripts/type_text.sh "hello"     # Type text
./scripts/key.sh "Return"          # Press Enter
 
# 4. Screenshot again to verify
./scripts/screenshot.sh

This see-analyze-act loop is the fundamental pattern for any multi-step GUI task.
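
The verify step can be partly automated by comparing the screen before and after an action. The following is a rough sketch, assuming screenshot.sh writes the image to stdout as described above; SCREENSHOT_CMD, SETTLE_SECONDS, and both function names are hypothetical. A changed checksum only shows that something on screen changed (a clock tick would count), so the agent still needs to inspect the new screenshot.

```shell
# Command that prints the current screen to stdout (base64 PNG per the skill).
SCREENSHOT_CMD="${SCREENSHOT_CMD:-./scripts/screenshot.sh}"

# Reduce a screenshot to a short checksum so two screens can be compared.
screen_hash() {
  $SCREENSHOT_CMD | cksum | cut -d ' ' -f 1
}

# Run an action, then report whether the screen changed.
# Returns 0 if it changed, 1 if it did not, 2 if the action itself failed.
act_and_verify() {
  before=$(screen_hash)
  "$@" || return 2
  sleep "${SETTLE_SECONDS:-0}"   # hypothetical knob: give the UI time to redraw
  after=$(screen_hash)
  [ "$before" != "$after" ]
}
```

A call such as `act_and_verify ./scripts/click.sh 512 384 left || echo "click had no visible effect"` would flag clicks that landed on nothing.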

Working with Keyboard Input

# Regular typing
./scripts/type_text.sh "Search query here"
# Sends in 50-character chunks with 12ms delay for reliability
 
# Key combinations
./scripts/key.sh "ctrl+c"          # Copy
./scripts/key.sh "ctrl+v"          # Paste
./scripts/key.sh "alt+F4"          # Close window
./scripts/key.sh "super"           # Open app launcher
./scripts/key.sh "ctrl+alt+t"      # Open terminal (most Linux desktops)
 
# Hold a key
./scripts/hold_key.sh "shift" 2    # Hold shift for 2 seconds
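
To make the chunking behaviour concrete, here is a minimal sketch of how text might be split into 50-character pieces and typed with a 12 ms per-key delay. It assumes xdotool is the underlying tool (it appears in the skill's package list) and single-line ASCII input; the real type_text.sh may work differently.

```shell
CHUNK_SIZE=50

# Split a single-line string into lines of at most CHUNK_SIZE characters.
chunk_text() {
  printf '%s\n' "$1" | fold -w "$CHUNK_SIZE"
}

# Type each chunk in turn; --delay is the pause between keystrokes in ms.
type_chunked() {
  chunk_text "$1" | while IFS= read -r chunk; do
    xdotool type --delay 12 "$chunk"
  done
}
```

Smaller chunks with a short delay reduce the chance of dropped keystrokes when the target application is slow to process input.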

The Zoom Action for Precision

# Get a cropped screenshot of a specific region
./scripts/zoom.sh 100 200 400 300
# → Captures the rectangle from (100,200) to (400,300)

zoom is essential for reading small UI text or examining specific areas without relying on coordinate estimation from a full 1024×768 screenshot.
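
The two corner points passed to zoom.sh describe a rectangle, which image tools typically express as WIDTHxHEIGHT+X+Y. The helper below is a sketch of that conversion; zoom_geometry is a hypothetical name, and whether zoom.sh uses this geometry form internally is an assumption.

```shell
# Convert corner coordinates (x1 y1 x2 y2) into WIDTHxHEIGHT+X+Y geometry.
zoom_geometry() {
  x1=$1 y1=$2 x2=$3 y2=$4
  printf '%sx%s+%s+%s\n' "$((x2 - x1))" "$((y2 - y1))" "$x1" "$y1"
}

zoom_geometry 100 200 400 300   # prints 300x100+100+200
```

The same arithmetic runs in reverse when acting on a zoomed image: a point found inside the crop has to be offset by (x1, y1) to get full-screen coordinates for click.sh.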

VNC Access for Monitoring

While the agent operates via scripts, humans can watch via browser:

# noVNC runs on port 6080 by default
# Access at: http://your-server-ip:6080/vnc.html

This lets you verify what the agent is doing in real-time — invaluable for debugging complex GUI workflows.

Comparison: Desktop Automation Approaches

Approach	computer-use	Playwright/Puppeteer	Selenium	Remote Desktop (RDP)
Headless server support	✅	✅	✅	❌
Any GUI application	✅	❌ (browser only)	❌ (browser only)	✅
Agent-native scripts	✅	⚠️	⚠️	❌
Setup complexity	Medium	Low	Low	High
Visual debugging	✅ (noVNC)	⚠️	⚠️	✅
Model-agnostic	✅	✅	✅	✅

computer-use's key advantage over Playwright/Selenium: it works with any GUI application, not just browsers. If your automation target is a desktop app (LibreOffice, a legacy ERP system, a native game), browser automation tools simply don't apply.

How to Install

clawhub install computer-use

Requires a Linux VPS/server with:

Root access (for systemd service installation)
X11 support packages (the setup script installs these: xvfb, xfce4, x11vnc, xdotool)
Python 3 for noVNC
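
After installation, a quick check that the expected binaries are on the PATH can save debugging time. This is a small sketch; missing_tools is a hypothetical helper, and the binary names in the usage note (Xvfb, x11vnc, xdotool, python3) are the usual ones for the packages listed above.

```shell
# Print the name of every given tool that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools Xvfb x11vnc xdotool python3` prints nothing when everything is installed, and otherwise lists the missing names one per line.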

Practical Tips

Always screenshot first: never assume the screen state; take a screenshot before every action sequence
Use zoom for small UI elements: coordinates estimated from a full screenshot may be imprecise for small buttons, so zoom in first
Triple-click for text selection: triple_click selects the current line, which is useful before typing to replace existing content
50-char chunks in type_text: the script handles chunking automatically, but very long inputs may need to be split across several calls
Open a terminal via key combo: ctrl+alt+t opens a terminal in most XFCE setups; from there your agent has full CLI access without further GUI navigation

Considerations

Linux only: The skill assumes a Linux host with X11, Xvfb, and systemd. macOS and Windows VMs require different approaches.
Resolution fixed at 1024×768: This is hardcoded to match Anthropic's recommended computer-use resolution. Applications that require higher resolution may not render optimally.
Performance on low-end VPS: Xvfb + XFCE consumes ~200-400 MB RAM. On 512 MB VPS instances, this may be tight alongside the application you're automating.
No OCR built-in: The agent "sees" the screen through base64 PNG screenshots. Extracting text from the image requires multimodal vision (Claude, GPT-4V) — not all agents support this.
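
The memory consideration can be checked up front on a small VPS. The sketch below reads the MemAvailable field that Linux exposes in /proc/meminfo; the function names are hypothetical, and the 400 MB default threshold is simply the upper end of the estimate above.

```shell
# Read /proc/meminfo-style text on stdin and print MemAvailable in whole MB.
mem_available_mb() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }'
}

# Succeeds if at least $2 MB (default 400) are available.
# $1 is the meminfo file to read (default /proc/meminfo).
has_headroom() {
  avail=$(mem_available_mb < "${1:-/proc/meminfo}")
  [ "${avail:-0}" -ge "${2:-400}" ]
}
```

For instance, `has_headroom || echo "less than 400 MB free; the desktop may struggle"` could run before setup-vnc.sh.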

The Bigger Picture

computer-use represents a different paradigm from API-based automation: instead of finding an API, you teach the agent to use the GUI like a human would. This opens up automation for the vast category of software that has no API — legacy enterprise tools, desktop applications, web apps that aggressively block automation. Wherever a human can click it, computer-use can automate it.

View the skill on ClawHub: computer-use
