ClawHub Skills Lib

computer-use: Full Desktop GUI Control for Headless Linux Servers

March 17, 2026·6 min read

With 10,600+ downloads and 93 installs, computer-use is one of the most infrastructure-heavy skills on ClawHub. It solves a specific but common problem: how do you run and control desktop GUI applications on a Linux VPS that has no physical monitor? The answer is a virtual display, and this skill sets one up automatically with a minimal XFCE desktop, VNC access, and a set of 17 mouse and keyboard action scripts.

The Problem It Solves

Most cloud servers are headless — no monitor, no GPU, no display. This means GUI applications (browsers, desktop apps, Electron tools, legacy software with no CLI) simply won't run. The workaround — Xvfb (X Virtual Framebuffer) — exists, but setting it up reliably with a desktop environment, stable VNC, auto-restart on crash, and consistent display settings takes hours of configuration.

computer-use does all of it in one setup script. After running setup-vnc.sh, you have a fully functional virtual desktop your agent can control like a real computer.

Core Concept

The skill creates a virtual X display at :99 with 1024×768 resolution (Anthropic's recommended resolution for computer-use applications). On top of this display sits a minimal XFCE desktop — just the window manager and panel, no desktop icons or bloat. Then x11vnc and noVNC provide remote viewing.

Xvfb (:99) → XFCE4 (minimal) → x11vnc → noVNC (browser access)
             ↑
    Your agent's scripts interact here

Every action — screenshot, click, type, scroll — goes through dedicated bash scripts that speak directly to the X display.
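The post does not show what the action scripts contain. Since xdotool is among the packages the setup script installs, a click wrapper could plausibly reduce to a single xdotool call. The sketch below is an assumption, not the skill's actual click.sh: the hypothetical function build_click_cmd only prints the command it would run, and it rejects coordinates outside the 1024×768 screen.

```shell
# Hypothetical sketch, NOT the skill's real click.sh.
# Prints the xdotool command a click wrapper might run.
build_click_cmd() {
  local x="$1" y="$2" kind="${3:-left}" button=1 repeat=1

  # Reject coordinates outside the 1024x768 virtual screen.
  if [ "$x" -lt 0 ] || [ "$x" -ge 1024 ] || [ "$y" -lt 0 ] || [ "$y" -ge 768 ]; then
    echo "coordinates out of range: $x,$y" >&2
    return 1
  fi

  # Map the click type to an X button number and a repeat count.
  case "$kind" in
    left)   button=1 ;;
    middle) button=2 ;;
    right)  button=3 ;;
    double) repeat=2 ;;
    triple) repeat=3 ;;
    *) echo "unknown click type: $kind" >&2; return 1 ;;
  esac

  echo "xdotool mousemove $x $y click --repeat $repeat $button"
}
```

To actually send the click, the printed command could be run against the virtual display, for example with DISPLAY=:99 sh -c "$(build_click_cmd 512 384 left)".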

Deep Dive

Setup

./scripts/setup-vnc.sh

This single command installs and configures:

  • Xvfb on display :99 with 1024×768
  • XFCE4 minimal desktop (xfwm4 + panel, no xfdesktop)
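  • x11vnc with the flags -forever -shared -nopw (keep serving after a client disconnects, allow several viewers at once, and suppress the no-password warning)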
  • noVNC for browser-based remote viewing
  • systemd services for all three — auto-start on boot, auto-restart on crash

After setup, set the display variable:

export DISPLAY=:99
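Before sending any actions, it can help to confirm that the virtual display is really up. The sketch below assumes the X server listens on the conventional Unix socket under /tmp/.X11-unix (the usual default, though the skill's configuration is not shown in the post). The function names x_socket_path and display_ready are hypothetical helpers, not part of the skill.

```shell
# Map an X display name such as ":99" to its Unix socket path.
# Assumes the conventional /tmp/.X11-unix location.
x_socket_path() {
  local display="${1:-$DISPLAY}" num
  num="${display#*:}"   # drop everything up to the colon: "99" or "99.0"
  num="${num%%.*}"      # drop an optional ".screen" suffix: "99"
  echo "/tmp/.X11-unix/X$num"
}

# Succeeds only if the socket for the given display exists.
display_ready() {
  [ -S "$(x_socket_path "$1")" ]
}

# Usage: display_ready :99 || echo "Xvfb on :99 is not running" >&2
```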

The 17 Actions Reference

| Action | Script | Arguments |
|---|---|---|
| screenshot | screenshot.sh | (none) |
| cursor_position | cursor_position.sh | (none) |
| mouse_move | mouse_move.sh | x y |
| left_click | click.sh | x y left |
| right_click | click.sh | x y right |
| middle_click | click.sh | x y middle |
| double_click | click.sh | x y double |
| triple_click | click.sh | x y triple |
| left_click_drag | drag.sh | x1 y1 x2 y2 |
| left_mouse_down | mouse_down.sh | (none) |
| left_mouse_up | mouse_up.sh | (none) |
| type | type_text.sh | "text" |
| key | key.sh | "combo" |
| hold_key | hold_key.sh | "key" seconds |
| scroll | scroll.sh | dir amount [x y] |
| wait | wait.sh | seconds |
| zoom | zoom.sh | x1 y1 x2 y2 |
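Because several action names share one script, an agent harness needs a small mapping from action name to invocation. The dispatcher below is a sketch of that mapping as read from the table. The function run_action and the variables SCRIPTS_DIR and RUN are hypothetical names introduced here. Setting RUN=echo turns it into a dry run that prints the command instead of executing it.

```shell
# Hypothetical dispatcher: maps an action name to its script call.
# SCRIPTS_DIR defaults to ./scripts; RUN=echo prints instead of executing.
run_action() {
  local action="$1" dir="${SCRIPTS_DIR:-./scripts}"
  shift
  case "$action" in
    left_click|right_click|middle_click|double_click|triple_click)
      # click.sh takes the click type as its third argument: x y <type>
      ${RUN:-} "$dir/click.sh" "$1" "$2" "${action%_click}" ;;
    left_click_drag) ${RUN:-} "$dir/drag.sh" "$@" ;;
    left_mouse_down) ${RUN:-} "$dir/mouse_down.sh" ;;
    left_mouse_up)   ${RUN:-} "$dir/mouse_up.sh" ;;
    type)            ${RUN:-} "$dir/type_text.sh" "$@" ;;
    screenshot|cursor_position|mouse_move|key|hold_key|scroll|wait|zoom)
      # These actions share their name with the script.
      ${RUN:-} "$dir/$action.sh" "$@" ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
}

# Usage (dry run): RUN=echo run_action double_click 512 384
```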

The Core Workflow Loop

Every GUI automation task follows the same pattern:

# 1. Always start with a screenshot
./scripts/screenshot.sh
# → Returns base64 PNG — agent "sees" the screen
 
# 2. Analyze what's visible
# Agent identifies UI elements and their approximate coordinates
 
# 3. Act
./scripts/click.sh 512 384 left    # Click a button
./scripts/type_text.sh "hello"     # Type text
./scripts/key.sh "Return"          # Press Enter
 
# 4. Screenshot again to verify
./scripts/screenshot.sh

This see-analyze-act loop is the fundamental pattern for any multi-step GUI task.
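For the verify step, it can be handy to keep screenshots on disk and compare them. The sketch below assumes screenshot.sh writes its base64 PNG to stdout (the post says it returns base64 but does not say where). The helpers save_screenshot and screen_changed are hypothetical names, not part of the skill.

```shell
# Decode a base64 PNG from stdin into a file and confirm it is a PNG.
# Assumes the base64 text arrives on stdin (e.g. piped from screenshot.sh).
save_screenshot() {
  local out="$1" magic
  base64 -d > "$out" || return 1
  # Every PNG file starts with the same 8-byte signature.
  magic="$(head -c 8 "$out" | od -An -tx1 | tr -d ' \n')"
  [ "$magic" = "89504e470d0a1a0a" ]
}

# Succeeds when two saved screenshots differ byte for byte.
screen_changed() {
  ! cmp -s "$1" "$2"
}

# Usage:
#   ./scripts/screenshot.sh | save_screenshot before.png
#   ./scripts/click.sh 512 384 left
#   ./scripts/screenshot.sh | save_screenshot after.png
#   screen_changed before.png after.png || echo "click had no visible effect"
```

A byte comparison is deliberately crude: a blinking cursor or a clock in the panel also counts as a change, so it only catches the case where nothing happened at all.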

Working with Keyboard Input

# Regular typing
./scripts/type_text.sh "Search query here"
# Sends in 50-character chunks with 12ms delay for reliability
 
# Key combinations
./scripts/key.sh "ctrl+c"          # Copy
./scripts/key.sh "ctrl+v"          # Paste
./scripts/key.sh "alt+F4"          # Close window
./scripts/key.sh "super"           # Open app launcher (if the desktop binds Super)
./scripts/key.sh "ctrl+alt+t"      # Open terminal (most Linux desktops)
 
# Hold a key
./scripts/hold_key.sh "shift" 2    # Hold shift for 2 seconds
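The chunking mentioned above can be pictured with a short sketch. How type_text.sh implements it is not shown in the post, so this is an assumption: chunk_text and type_chunked are hypothetical helpers, fold is used for splitting, and the 12 ms figure is interpreted here as xdotool's per-keystroke delay.

```shell
# Split text into chunks of at most 50 bytes, one chunk per output line.
# Byte-based, so best suited to ASCII input.
chunk_text() {
  printf '%s\n' "$1" | fold -b -w "${2:-50}"
}

# Type each chunk through xdotool with a 12 ms delay between keystrokes.
# Requires DISPLAY to point at the virtual display.
type_chunked() {
  chunk_text "$1" | while IFS= read -r chunk; do
    xdotool type --delay 12 -- "$chunk"
  done
}

# Usage: DISPLAY=:99 type_chunked "a long piece of text ..."
```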

The Zoom Action for Precision

# Get a cropped screenshot of a specific region
./scripts/zoom.sh 100 200 400 300
# → Captures the rectangle from (100,200) to (400,300)

zoom is essential for reading small UI text or examining specific areas without relying on coordinate estimation from a full 1024×768 screenshot.
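The zoom arguments are two corners, while most cropping tools expect a width, a height, and an offset. The conversion is simple arithmetic, sketched here with a hypothetical helper rect_to_geometry. The WxH+X+Y form is the geometry syntax ImageMagick accepts for -crop. Whether zoom.sh uses ImageMagick is an assumption.

```shell
# Convert corner coordinates (x1 y1 x2 y2) into a WxH+X+Y crop geometry.
rect_to_geometry() {
  local x1="$1" y1="$2" x2="$3" y2="$4"
  # The second corner must lie below and to the right of the first.
  if [ "$x2" -le "$x1" ] || [ "$y2" -le "$y1" ]; then
    echo "empty region: ($x1,$y1)-($x2,$y2)" >&2
    return 1
  fi
  echo "$(( $x2 - $x1 ))x$(( $y2 - $y1 ))+$x1+$y1"
}

# Usage (assumes ImageMagick is installed and full.png is a saved screenshot):
#   convert full.png -crop "$(rect_to_geometry 100 200 400 300)" +repage region.png
```

For the example above, the region from (100,200) to (400,300) is 300 pixels wide and 100 pixels tall, so the geometry is 300x100+100+200.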

VNC Access for Monitoring

While the agent operates via scripts, humans can watch via browser:

# noVNC runs on port 6080 by default
# Access at: http://your-server-ip:6080/vnc.html

This lets you verify what the agent is doing in real-time — invaluable for debugging complex GUI workflows.
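A quick way to confirm the viewer is up before opening a browser is to request the page from the command line. The helpers below are hypothetical: novnc_url builds the address from the default port mentioned above, and novnc_reachable assumes curl is installed.

```shell
# Build the noVNC viewer URL for a host (port defaults to 6080).
novnc_url() {
  local host="$1" port="${2:-6080}"
  echo "http://$host:$port/vnc.html"
}

# Succeeds if the viewer page answers within 3 seconds (requires curl).
novnc_reachable() {
  curl -fsS -o /dev/null --max-time 3 "$(novnc_url "$@")"
}

# Usage: novnc_reachable your-server-ip && echo "noVNC is up"
```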

Comparison: Desktop Automation Approaches

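| Approach | computer-use | Playwright/Puppeteer | Selenium | Remote Desktop (RDP) |
|---|---|---|---|---|
| Headless server support | ✅ | ✅ | ✅ | ❌ |
| Any GUI application | ✅ | ❌ (browser only) | ❌ (browser only) | ✅ |
| Agent-native scripts | ✅ | ⚠️ | ⚠️ | ❌ |
| Setup complexity | Medium | Low | Low | High |
| Visual debugging | ✅ (noVNC) | ⚠️ | ⚠️ | ✅ |
| Model-agnostic | ✅ | ✅ | ✅ | ✅ |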

computer-use's key advantage over Playwright/Selenium: it works with any GUI application, not just browsers. If your automation target is a desktop app (LibreOffice, a legacy ERP system, a native game), browser automation tools simply don't apply.

How to Install

clawhub install computer-use

Requires a Linux VPS/server with:

  • Root access (for systemd service installation)
  • X11 support packages (the setup script installs these: xvfb, xfce4, x11vnc, xdotool)
  • Python 3 for noVNC

Practical Tips

  1. Always screenshot first: never assume the screen state; take a screenshot before every action sequence.
  2. Use zoom for small UI elements: coordinates estimated from a full screenshot may be imprecise for small buttons, so zoom in first.
  3. Triple-click to select text: triple_click selects the current line, which is useful before typing to replace existing content.
  4. Mind the 50-character chunks in type_text: the script chunks input automatically, but very long inputs may need to be split across several calls.
  5. Open a terminal via key combo: ctrl+alt+t opens a terminal in most XFCE setups, and from there your agent has full CLI access without further GUI navigation.
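Tip 3 combines naturally with typing into a small "replace this field" helper. The sketch below uses only the click.sh and type_text.sh calls documented in the actions table. The function name replace_field and the variables SCRIPTS_DIR and RUN are hypothetical, and RUN=echo gives a dry run that prints the commands.

```shell
# Replace the contents of a single-line text field at (x, y).
# RUN=echo prints the commands instead of executing them.
replace_field() {
  local x="$1" y="$2" text="$3" dir="${SCRIPTS_DIR:-./scripts}"
  ${RUN:-} "$dir/click.sh" "$x" "$y" triple   # select the current line
  ${RUN:-} "$dir/type_text.sh" "$text"        # typing replaces the selection
}

# Usage: replace_field 300 120 "new value"
```

As with any action sequence, a screenshot afterwards confirms that the field really holds the new text.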

Considerations

  • Linux only: The skill is built on Xvfb, X11, and systemd services, a stack its setup script targets on Linux. macOS and Windows VMs require different approaches.
  • Resolution fixed at 1024×768: This is hardcoded to match Anthropic's recommended computer-use resolution. Applications that require higher resolution may not render optimally.
  • Performance on low-end VPS: Xvfb + XFCE consumes roughly 200-400 MB of RAM. On a 512 MB VPS, that leaves little headroom for the application you're automating.
  • No built-in OCR: The agent "sees" the screen through base64 PNG screenshots. Extracting text from the image requires a multimodal vision model (Claude, GPT-4V), which not all agents support.
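Given the memory figures above, a pre-flight check before running setup on a small instance is cheap. This sketch reads MemAvailable from /proc/meminfo, which Linux exposes in kB. The helpers mem_available_mb and enough_memory are hypothetical, and the 400 MB default simply mirrors the upper end of the range quoted above.

```shell
# Print available memory in whole MB, read from a meminfo-style file.
mem_available_mb() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeeds if at least need_mb (default 400) MB are available.
enough_memory() {
  local need_mb="${1:-400}" have_mb
  have_mb="$(mem_available_mb "${2:-/proc/meminfo}")"
  [ -n "$have_mb" ] && [ "$have_mb" -ge "$need_mb" ]
}

# Usage: enough_memory 400 || echo "less than 400 MB free; setup may be tight" >&2
```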

The Bigger Picture

computer-use represents a different paradigm from API-based automation: instead of finding an API, you teach the agent to use the GUI like a human would. This opens up automation for the large category of software that has no API: legacy enterprise tools, desktop applications, and web apps that aggressively block automation. In principle, anything a human can click through on that desktop, computer-use can drive.


View the skill on ClawHub: computer-use
