Agentbrisk
Tags: autonomous, computer-use, API · Featured · Status: active

Anthropic Computer Use

Claude's computer-use capability that powers desktop and browser agents


Anthropic Computer Use is Claude's native capability to see and control a desktop environment through screenshots, mouse actions, keyboard input, and shell commands. First released in public beta in October 2024 alongside Claude 3.5 Sonnet, it has since expanded to Claude Opus 4.7 and Sonnet 4.6 with an updated tool schema and new actions including pixel-level zoom. It's not a product you can buy; it's an API capability that developers wire into their own agent loops, Docker containers, or desktop applications. Products like Cline, Open Interpreter, and many internal enterprise agents run on top of it. If you're building something that needs a model to operate software the way a person would, this is the lowest-level primitive available from Anthropic.

When you ask what powers the wave of desktop AI agents that have shipped in the past 18 months, the answer keeps coming back to the same place: Anthropic Computer Use. Claude’s computer-use capability, first released in public beta in October 2024, is the API primitive that lets Claude see a screen and operate it. Not through a special browser extension or a proprietary desktop client — through a raw API that any developer can call. Understanding Anthropic Computer Use means understanding what most of the interesting computer-controlling AI products are actually built on.

Quick verdict

If you’re a developer building an agent that needs to operate software with a GUI, computer use is one of the two serious options at the API level (the other being OpenAI’s computer-use model, the one behind Operator). For end users who just want an AI to click around for them, you want a wrapper product, not the API itself. The capability is real but still in beta, and “cumbersome and error-prone” is Anthropic’s own characterization in their docs.

What is Anthropic Computer Use, exactly?

The product confusion starts with the name. Anthropic Computer Use isn’t a product like Claude Desktop or Claude Code. It’s a capability baked into the Anthropic Messages API. When you enable it, you get access to three tool types you can hand to Claude in a single API call: the computer tool (screenshot capture, mouse, keyboard), the bash tool (shell command execution), and the text editor tool (direct file reading and string-replace edits).

The key thing to internalize is that Claude doesn’t directly connect to your machine or any display. The model tells you what it wants to do — “click at coordinates [450, 300]”, “type this text”, “take a screenshot” — and your application executes those instructions in whatever environment you’ve set up. Then your app sends the results (a new screenshot, command output, file content) back to Claude. That cycle repeats until the task is finished. Anthropic calls this the agent loop.
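The agent loop described above can be sketched in a few lines. This is an illustrative skeleton, not Anthropic's reference code: `run_model` and `execute_action` are hypothetical stand-ins for your Messages API call and your environment driver.

```python
# Minimal sketch of the agent loop pattern. run_model and execute_action
# are hypothetical stand-ins: run_model would call the Messages API and
# execute_action would drive your display/shell and return the result.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                                # e.g. "screenshot", "left_click", "type", "done"
    payload: dict = field(default_factory=dict)

def agent_loop(task, run_model, execute_action, max_steps=50):
    """Repeat: ask the model for an action, execute it, feed the result back."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = run_model(history)          # model decides the next action
        if action.kind == "done":
            return history
        result = execute_action(action)      # your app performs it
        history.append({"role": "assistant", "content": action.kind})
        history.append({"role": "user", "content": result})
    raise RuntimeError("iteration limit reached")
```

The `max_steps` guard is the kind of cost control that is entirely your responsibility in this architecture; nothing in the API stops a loop from running indefinitely.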

This architecture is intentional and consequential. It means computer use is infrastructure, not software. The capability runs inside whatever environment you provision: a Docker container, a VM, a remote Linux desktop, even a sandboxed browser. Anthropic provides a Docker-based reference implementation on GitHub with a working agent loop, virtual X11 display, and a web interface you can use to watch Claude work. But the environment, the security model, the cost controls, and the integration layer are all yours to build and own.

This is why so many developer tools — Cline, Open Interpreter, and countless internal enterprise agents — list Anthropic as an underlying provider. They’re calling this API and wrapping it in a product experience. Computer use is the substrate, not the surface.

The capability became generally available for production use in 2025 and has since been versioned across three tool schemas. The current version, computer_20251124, runs on Claude Opus 4.7 and Sonnet 4.6 and adds capabilities the earlier schema didn’t have. If you’re starting a new project today, that’s the version you want.

The features that make it the developer foundation

Screenshot understanding plus action

The computer tool captures whatever is currently displayed on the virtual screen and sends it to Claude as an image. Claude reasons about what it sees and responds with a specific action: a click at pixel coordinates, text to type, a key combination to press. On the computer_20251124 schema, there’s a zoom action that lets Claude request a high-resolution view of a specific rectangular region before deciding where to click. This matters in practice because UI elements in dense applications — spreadsheet cells, small buttons, nested menus — are easy to misread at full screenshot scale.

The full action set covers left, right, and middle clicks; double and triple click; click-and-drag; scroll in any direction with configurable amount; modifier-key chords (shift-click for range selection, ctrl-click for multi-select); and a wait action for pacing interactions with slow applications. That’s a full input model. The gap between “Claude can click” and “Claude reliably clicks the right thing” is still real, but the action primitives themselves aren’t the bottleneck.
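Because the model only names actions and your application executes them, a common pattern is a dispatcher mapping action names to handlers. The sketch below uses stub handlers and assumed action-name strings; a real backend would drive xdotool, pyautogui, or similar, and the official schema should be consulted for the exact names.

```python
# Illustrative dispatcher for an action set like the one described above.
# The backend is any object exposing click/type/scroll/etc. methods; the
# action-name strings here are assumptions, not the official schema.
def make_dispatcher(backend):
    return {
        "left_click":   lambda p: backend.click(*p["coordinate"], button="left"),
        "right_click":  lambda p: backend.click(*p["coordinate"], button="right"),
        "double_click": lambda p: backend.click(*p["coordinate"], button="left", count=2),
        "type":         lambda p: backend.type_text(p["text"]),
        "key":          lambda p: backend.press(p["text"]),          # e.g. "ctrl+s"
        "scroll":       lambda p: backend.scroll(p["direction"], p.get("amount", 3)),
        "wait":         lambda p: backend.wait(p.get("duration", 1.0)),
    }

def dispatch(dispatcher, action):
    handler = dispatcher.get(action["name"])
    if handler is None:
        raise ValueError(f"unsupported action: {action['name']}")
    return handler(action["input"])
```

Raising on unknown actions (rather than silently skipping) matters when a new schema version ships actions your handler table doesn't cover yet.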

Bash and computer tool integration

The bash tool runs in a persistent shell session alongside the visual control loop. This matters for workflows that aren’t purely GUI-driven. You can have Claude inspect a file with cat, write a script, check environment variables, or run a test suite — all within the same API conversation as the visual actions. The bash tool’s session persists across multiple tool calls in a single agent loop, so Claude can set an environment variable and reference it ten steps later.
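A persistent session is what makes that possible: the shell process outlives any single command. One way to sketch this (an assumption about implementation, not how Anthropic's reference code does it) is a long-lived bash subprocess with a sentinel marking the end of each command's output.

```python
# Sketch of a persistent shell session: state such as environment
# variables survives across calls because the bash process is reused.
# A unique sentinel marks where each command's output ends.
import subprocess
import uuid

class PersistentShell:
    def __init__(self):
        self.proc = subprocess.Popen(
            ["/bin/bash"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True, bufsize=1,
        )

    def run(self, command):
        sentinel = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(command + f"\necho {sentinel}\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.strip() == sentinel:
                break
            lines.append(line)
        return "".join(lines)
```

A production version would also need timeouts and a way to restart a wedged session, which the reference implementation handles.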

The text editor tool completes the triad. Rather than having Claude type into a text editor application pixel by pixel, the text editor tool reads file contents directly and performs string-replacement edits. For code-heavy workflows this is dramatically more reliable than simulated keyboard input.
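The core of a string-replace edit is small enough to sketch. This is an illustrative reimplementation of the idea, not Anthropic's tool; the uniqueness check reflects why string replacement is reliable, since an ambiguous match is rejected rather than guessed at.

```python
# Minimal sketch of a string-replace file edit in the spirit of the text
# editor tool: the old string must occur exactly once, so the edit is
# unambiguous and never lands in the wrong place.
from pathlib import Path

def str_replace(path, old, new):
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found in file")
    if count > 1:
        raise ValueError(f"old string matches {count} times; must be unique")
    Path(path).write_text(text.replace(old, new))
```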

This combination — visual control for GUI interaction, shell access for system-level work, and direct file editing — means a single agent loop can handle tasks that span both visual and programmatic interfaces without switching models or tools.

Tool use API for orchestration

Computer use slots into the same tool use framework as every other Anthropic tool. That means you can combine computer use with custom tools in the same tools array. Your agent can call a computer action, then call a custom API endpoint, then take another screenshot, all within a single conversation. For developers already using Anthropic’s tool use system, adding computer use doesn’t require a different mental model or a different SDK.
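A tools array mixing the built-in tools with a custom one might look like the sketch below. The computer tool type follows the schema version this article names; the bash and text editor type strings, display parameters, and the custom tool are all illustrative assumptions — check the current docs for exact field names before relying on them.

```python
# Illustrative tools array combining computer-use tools with a custom tool.
# Type strings and field names are assumptions based on this article, not
# verified against the live API schema.
tools = [
    {"type": "computer_20251124", "name": "computer",
     "display_width_px": 1280, "display_height_px": 800},
    {"type": "bash_20250124", "name": "bash"},
    {"type": "text_editor_20250124", "name": "str_replace_editor"},
    {   # a hypothetical custom tool alongside the built-ins
        "name": "lookup_order",
        "description": "Fetch an order record from the internal API.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
]
```

The point is that all four entries travel in the same `tools` array of the same Messages API call, so the agent loop doesn't change shape when you add custom tools.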

The agent loop pattern that computer use encourages is also idiomatic for Claude Code and other Anthropic agent products. The same reasoning about iteration limits, cost controls, and human-in-the-loop confirmation points applies across all of them.

Sandboxed reference implementation

Anthropic maintains a reference implementation in the anthropic-quickstarts GitHub repository. It’s a Docker container running a Linux desktop with Mutter (window manager), Tint2 (panel), Firefox, LibreOffice, and common utilities pre-installed. The container exposes a web interface that shows you a live view of what Claude sees and a chat input where you issue tasks.

The reference implementation is updated to track new tool schema versions. When computer_20251124 shipped, the Docker image was updated to include the new action handlers. For most developers, this is the right way to start: pull the container, set your API key, watch Claude work, and then fork the implementation to suit your production environment. Building a custom environment from scratch is an option the docs cover in detail, but the reference implementation answers almost every question about the architecture before you have to ask it.

Integration with Claude Desktop and Code

Computer use isn’t confined to custom applications you build yourself. Claude Desktop and Claude Code both integrate computer use capabilities for their respective contexts. Claude Code uses computer use internally when operating in agentic modes that require GUI interaction. For developers already in the Anthropic ecosystem, this means computer use is available without setting up a separate Docker environment for certain workflows — though the full programmable API access still requires the custom setup path.

Pricing

Computer use is billed at standard Anthropic API token pricing, with a few specific overhead costs on top.

Every API call that includes the computer use tool adds 466 to 499 tokens to the system prompt automatically. Beyond that, each tool definition costs tokens at input pricing. For the computer use tool specifically, that’s 735 input tokens per definition. If you’re passing all three tools (computer, bash, text editor), you’re paying for all three tool definition costs plus the system prompt overhead before Claude has processed a single word of your actual prompt.

Then screenshots. Every screenshot Claude receives is an image, and images are billed at vision rates. The cost varies by image size, but a 1024x768 screenshot at standard compression runs somewhere in the range of 1,000 to 1,500 tokens. An agent loop with 30 screenshots — not unusual for a multi-step task — adds 30,000 to 45,000 input tokens just in screen captures.

Stack that onto Sonnet 4.6 pricing ($3 per million input tokens, $15 per million output tokens) and a moderately complex task can cost anywhere from $0.20 to $2.00 depending on how many screenshots the agent takes. Tasks that require Claude to navigate multiple pages or retry failed actions multiple times skew expensive.
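Those figures can be turned into a back-of-envelope estimator. The rates and overheads below come from the numbers above (Sonnet 4.6 pricing, ~735 tokens per tool definition, ~1,200 tokens per screenshot as a midpoint); treat the result as a planning number, not a billing-accurate one.

```python
# Rough cost estimate for a computer-use session, using the figures cited
# above: $3/M input, $15/M output, ~499 system-prompt tokens, ~735 tokens
# per tool definition, ~1,200 tokens per screenshot. Planning numbers only.
def estimate_cost(screenshots, other_input_tokens, output_tokens,
                  tool_defs=3, input_rate=3.0, output_rate=15.0):
    overhead = 499 + tool_defs * 735              # system prompt + tool definitions
    input_tokens = overhead + screenshots * 1200 + other_input_tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 30-screenshot task with modest prompts and output lands around $0.25,
# before retries and multi-turn accumulation push it higher.
```

Note that this ignores conversation-history accumulation: in a real loop, earlier screenshots stay in context and are re-billed on every turn unless you prune them, which is why real sessions skew toward the high end of the $0.20 to $2.00 range.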

There’s no dedicated computer use pricing tier. You pay model pricing for whichever Claude model you run, plus the token overhead described above. Opus 4.7 costs more per token than Sonnet 4.6, so choosing your model matters for cost-sensitive automation. Most teams doing high-volume computer use run Sonnet 4.6 for cost reasons and step up to Opus 4.7 for tasks requiring more complex reasoning.

There’s no free tier for the API. You need an Anthropic account with billing configured.

Where Computer Use wins and where it doesn’t

Computer use is genuinely good at multi-step browser tasks with reasonable error tolerance. The WebArena benchmark scores show state-of-the-art performance for single-agent browser navigation among publicly evaluated systems. For tasks like “fill in this form from a spreadsheet”, “download a report from this internal portal”, or “run through this test checklist in a GUI application”, it works.

Where it struggles: precision work in dense UIs, applications that move or resize frequently, anything requiring real-time speed, and any workflow that exposes the agent to untrusted web content without proper sandboxing. Spreadsheet cell selection has historically been a weak point (Anthropic acknowledges this explicitly), though the addition of fine-grained mouse controls in newer schemas has improved it. The latency per action is slower than human interaction, so “run this task in the background without hurry” is the right frame, not “speed this up.”

Prompt injection is a live concern. A webpage that contains instructions crafted to redirect the agent is a real attack surface. Anthropic’s automatic prompt injection classifiers on screenshots help, but they’re a mitigation, not an elimination. If your agent will visit untrusted URLs, environment isolation is not optional.

Who Computer Use is built for

Computer use is a developer API. The primary user is an engineer building an application or workflow that needs AI-driven desktop or browser automation.

That includes teams building internal automation tools to replace brittle RPA scripts with something that can handle UI changes without breaking. It includes product teams building developer tools like Cline or code assistants that need to operate IDEs and terminals visually. It includes QA engineers writing test suites for GUI applications that have no API surface.

If you’re a developer already familiar with the Anthropic API and you need a model to operate software with a GUI, computer use is the natural next step. If you’re looking for an off-the-shelf tool that a non-technical user can run without writing code, the raw API isn’t designed for that role. The best AI agent for coding roundup covers some of the packaged products built on top of this API that are more approachable for end users.

Computer Use vs the alternatives

vs OpenAI Operator

Operator is a browser-specific hosted product. Non-technical users can interact with it through a UI; OpenAI manages the browser environment. Computer Use is an API primitive where you own the environment and the agent loop. Operator is narrower in scope (browser only) but more accessible. Computer Use is broader (full desktop plus shell access) but requires engineering work to use. If you’re evaluating which to build on, Computer Use gives you more control; Operator gets you to a demo faster.

vs Manus

Manus is an agentic AI product that combines web browsing, code execution, and tool use in a single hosted offering. It’s a product built on top of capabilities; Anthropic Computer Use is one of those underlying capabilities. Manus handles its own infrastructure and has a higher-level task interface. Computer Use is lower in the stack: more flexible, more configurable, and less opinionated about what the agent does with its access to a computer.

vs Browser Use

Browser Use is an open-source Python framework that gives AI models programmatic browser control through Playwright, without needing visual screenshot-based interaction. It’s faster and cheaper for browser-only tasks because it doesn’t need to take screenshots and parse them; it reads the DOM directly. For workflows that stay in a browser and work with standard web pages, Browser Use is often more reliable and more economical. Anthropic Computer Use handles a broader surface (native desktop apps, arbitrary UI, shell access) that Browser Use can’t reach. They’re complementary tools for different parts of the automation problem.

Getting started

The fastest path to a working computer use setup is the Docker reference implementation. Pull the repo from anthropic-quickstarts on GitHub, navigate to the computer-use-demo directory, and follow the setup instructions. You’ll need Docker, your Anthropic API key, and about ten minutes.

git clone https://github.com/anthropics/anthropic-quickstarts.git
cd anthropic-quickstarts/computer-use-demo
docker build -t computer-use-demo .
docker run -p 8080:8080 -e ANTHROPIC_API_KEY=your_key computer-use-demo

Open localhost:8080 and you’ll see a web interface with a live view of the virtual desktop and a prompt input. Type a task and watch Claude work.

For custom implementations, the API call requires the beta header anthropic-beta: computer-use-2025-11-24 for Opus 4.7 and Sonnet 4.6. You pass the tool definitions in the tools array alongside your user message. Anthropic’s documentation covers the full action schema, coordinate scaling for displays wider than 1568 pixels, and error handling patterns in detail. The docs at the computer use page are among the better technical references Anthropic publishes: specific, code-heavy, and honest about limitations.
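The coordinate-scaling concern works roughly like this: screenshots from an oversized display are downscaled before being sent to the model, so coordinates the model returns must be mapped back up to the real display. The sketch below assumes a simple uniform-scale policy capped at the 1568-pixel width the docs mention; the exact resize behavior is something to confirm against the official documentation.

```python
# Illustrative coordinate scaling for displays wider than 1568 px. The
# screenshot is assumed to be uniformly downscaled to fit that width, so
# model coordinates are scaled back up by the inverse factor. The exact
# resize policy is an assumption here; consult the docs before relying on it.
MAX_WIDTH = 1568

def scale_factor(display_width):
    """Downscale factor applied to screenshots (1.0 for displays that fit)."""
    return min(1.0, MAX_WIDTH / display_width)

def model_to_display(x, y, display_width):
    """Map a coordinate from the downscaled screenshot back to the display."""
    s = scale_factor(display_width)
    return round(x / s), round(y / s)
```

Getting this wrong produces the classic failure mode where Claude "clicks" a few hundred pixels away from its target, so it is worth testing against a known UI element before trusting an agent loop with real tasks.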

The bottom line

Anthropic Computer Use is infrastructure. It’s the capability layer that a generation of computer-controlling AI agents runs on. For developers, it’s the right abstraction: direct access to model-driven desktop control without Anthropic making too many product decisions for you. The beta label is real; accuracy is imperfect, latency is noticeable, and the prompt injection surface requires active management. But the underlying capability has matured significantly since October 2024, and the trajectory is toward a more reliable, more capable tool with each new schema version. If you’re building in this space, you need to understand computer use whether or not you use it directly, because you’re almost certainly standing on top of it.

Key features

  • Screenshot capture with pixel-accurate coordinate targeting and zoom
  • Mouse control: click, drag, scroll, double-click, right-click
  • Keyboard input: type text, key combos, modifier-key chords
  • Bash tool for shell command execution alongside visual control
  • Text editor tool for direct file reads and string-replace edits
  • Docker-based reference implementation with agent loop and web UI
  • Zero Data Retention eligible: screenshots not stored after API response

Pros and cons

Pros

  • + Tight integration with Anthropic models means the same context window handles reasoning and screen control
  • + Three complementary tools (computer, bash, text editor) cover most automation scenarios without extra infrastructure
  • + Zero Data Retention eligibility makes it viable for sensitive enterprise workflows
  • + Active versioning (three tool schemas released in 18 months) shows genuine investment in the capability
  • + Prompt injection classifiers run automatically on screenshots to catch the most common attack vector
  • + Docker reference implementation ships with a working agent loop, web UI, and tool stubs

Cons

  • − Still in beta; Anthropic's own docs acknowledge it is "cumbersome and error-prone" on difficult tasks
  • − Each screenshot adds vision pricing on top of model token costs, making long sessions expensive
  • − Requires you to build or run the execution environment; there is no hosted sandbox you can point at a URL
  • − Latency per action is slower than human interaction, limiting real-time use cases
  • − No Windows-native display support in the reference implementation (Linux/Xvfb only)

Who is Anthropic Computer Use for?

  • Developers building custom desktop or browser agents who want Claude's reasoning paired with screen control
  • QA engineers automating GUI test suites against applications that have no API or programmatic interface
  • Enterprise teams replacing RPA scripts with AI-driven workflows that can handle UI changes gracefully
  • Researchers and tool builders who layer computer use under products like Cline or Open Interpreter

Alternatives to Anthropic Computer Use

If Anthropic Computer Use isn't quite the right fit, the closest alternatives are OpenAI Operator, Manus, and Browser Use. See our full Anthropic Computer Use alternatives page for side-by-side comparisons.

Frequently Asked Questions

What is Anthropic Computer Use?
Anthropic Computer Use is a set of API tools that lets Claude interact with a desktop environment the way a person would. You send Claude a screenshot of a screen, and it responds with actions: click at these coordinates, type this text, run this shell command. Your application executes those actions, captures a new screenshot, and sends it back. That cycle continues until the task is done. It's a capability in the Anthropic API, not a standalone product. You need to provide the environment and the agent loop yourself, or use the Docker reference implementation Anthropic publishes on GitHub.
How much does Anthropic Computer Use cost?
There is no separate tier or add-on fee for computer use. You pay standard Anthropic API rates for the model you choose, plus the token overhead the computer use tools add. Each API call with computer use enabled adds roughly 466-499 tokens for the system prompt and 735 input tokens per tool definition. Screenshots are billed at standard vision rates, which varies by image size. On Claude Sonnet 4.6 (the common choice for cost-sensitive workloads), input tokens run $3 per million and output tokens $15 per million. A session with dozens of screenshots and actions can get expensive fast.
How does Computer Use compare to OpenAI Operator?
OpenAI Operator is a hosted, browser-focused product with a UI that non-technical users can interact with directly. Anthropic Computer Use is a raw API capability aimed at developers. Operator handles the infrastructure for you but limits you to browser tasks in OpenAI's environment. Computer Use gives you full control over the desktop environment and the entire action space (including shell commands and file edits), but you build and host the environment yourself. If you need something that works out of the box for a non-technical user, Operator is closer to that. If you need a programmable primitive to build your own agent product, Computer Use is the lower-level option.
Can I use Anthropic Computer Use without writing code?
Not directly. Computer Use is an API capability, and using it requires building or running an application that handles the agent loop: taking screenshots, sending them to Claude, receiving action instructions, and executing them on a real or virtual display. Anthropic provides a Docker reference implementation with a web UI that gets you running quickly, but spinning it up requires Docker and a terminal. If you want no-code computer use, look at products built on top of the API rather than the API itself.
Is Anthropic Computer Use safe to run on my own computer?
Anthropic recommends against running computer use directly on your primary machine without isolation. The model can click, type, and run shell commands; a single misread instruction or prompt injection from a malicious webpage could cause real damage. The recommended approach is a dedicated virtual machine or Docker container with minimal privileges, no access to sensitive accounts, and restricted internet access via an allowlist. Anthropic's own reference implementation uses Docker for this reason. Prompt injection classifiers run automatically on screenshots, but those are a safety net, not a substitute for environment isolation.
What can Claude actually click on?
Claude can target any pixel coordinate on the virtual display. The computer use tool supports left click, right click, middle click, double click, triple click, click-and-drag, scroll in any direction, holding modifier keys during click actions, and, on the newest tool version (computer_20251124, available for Opus 4.7 and Sonnet 4.6), a zoom action that lets Claude inspect a specific screen region at full resolution before acting. In practice this means any GUI element is fair game: buttons, menus, form fields, checkboxes, file dialogs. The challenge is accuracy, not reach.