CUAs Can Become APIs
What would happen if you could interact programmatically with an agent controlling a computer?
AI agents can now use a computer the way a human would. This is possible thanks to the recent rise of CUAs (Computer-Using Agents) from companies such as Adept, Anthropic, and OpenAI. One goal of CUAs is to perform tasks that aren't easy to execute programmatically. In other words, one benefit of CUAs is that they work even when an API isn't available. If CUAs can work in a world with no APIs, they can be a bridge between the human and the programmatic worlds. This makes me think that CUAs themselves can also become APIs that other agents, people, or human-made software can consume. What are the implications of being able to consume a CUA programmatically? Stay with me as I explore.
Agentic workflows are the future of automation. They're agentic in the sense that they have the agency to make decisions and adapt when needed. "In this agentic future, how much control are we humans willing to hand over?" someone asked me yesterday. "How much control do you hand over to an employee?" I replied. One form of control is the ability to use your computer and access all your applications and documents. Anyone with that level of control can easily act on your behalf. Another, narrower form of control is access to certain Web apps on your behalf. One thing is clear: you should be able to define how much control you hand over to an AI agent.
All existing CUA solutions require human input, at least in the beginning, when the agent can't perform a task without help. One reason is that the agent doesn't have access to credentials. If the agent can't sign in to a Web app, it certainly can't complete the task on your behalf. But after that initial hand-holding, the agent can do pretty much everything you would. It's only after this moment that programmatic access to the CUA becomes possible. So, before exposing a CUA programmatically, you need to identify whether it can work autonomously. You can't do that completely, because you can't anticipate every action an agent will want to perform. However, you can at least verify that the agent has access to the tools it needs and can execute a "happy path" towards its goals.
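To make the verification idea concrete, here is a minimal sketch of a pre-flight check. Everything in it is an assumption for illustration: `ReadinessReport`, `check_readiness`, and the two probe functions are hypothetical names, and the probes return fixed values where a real setup would query the agent's session.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ReadinessReport:
    """Outcome of the pre-flight checks for one agent."""
    passed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)

    @property
    def autonomous(self) -> bool:
        # Ready for programmatic use only if at least one check ran and none failed.
        return bool(self.passed) and not self.failed


def check_readiness(checks: Dict[str, Callable[[], bool]]) -> ReadinessReport:
    """Run each named check; a check that raises counts as a failure."""
    report = ReadinessReport()
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        (report.passed if ok else report.failed).append(name)
    return report


# Hypothetical probes. Real ones would inspect the CUA's session.
def signed_in_to_webapp() -> bool:
    return True  # assumed: the sign-in from the initial hand-holding is still valid


def happy_path_completes() -> bool:
    return True  # assumed: the agent finished a known, harmless task end to end


report = check_readiness({
    "credentials": signed_in_to_webapp,
    "happy_path": happy_path_completes,
})
print(report.autonomous, report.failed)
```

A report like this only covers the paths you thought to test, which matches the limitation above: it tells you the agent can start working alone, not that it will never need help.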
Making a CUA available as an API also requires programmatic access to the CUA itself. While OpenAI doesn't offer programmatic access to Operator, Anthropic's Computer Use does. There's even a demo implementation you can try to see how to mold Computer Use to your own needs. There are other options, like AutoGen's WebSurfer. It's more limited, since it can only browse Web sites, but it can act on your behalf on Web apps just like Operator does. There's a great example implementation in Python that illustrates how easily you can manipulate WebSurfer. You can, in theory, make any of these solutions available through an API. It should be as simple as launching an API server and connecting its operations to the underlying CUA.
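As a rough sketch of "launching an API server and connecting its operations to the underlying CUA," the snippet below exposes a single operation through Python's standard WSGI interface. It is not how any of the products above are wired. `run_cua_task` is a hypothetical stand-in for the real call into an agent, and the `/tasks` path and JSON shape are my own assumptions.

```python
import json
from io import BytesIO
from typing import Callable, Iterable, Tuple


def run_cua_task(prompt: str) -> str:
    """Hypothetical stand-in for the call into a real CUA."""
    return f"completed: {prompt}"


def _respond(start_response: Callable, status: str, body: dict) -> Iterable[bytes]:
    data = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]


def cua_api(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """WSGI app with one operation: POST /tasks with a body like {"prompt": "..."}."""
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/tasks":
        return _respond(start_response, "404 Not Found", {"error": "unknown operation"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        prompt = payload["prompt"]
        if not isinstance(prompt, str):
            raise TypeError("prompt must be a string")
    except (ValueError, KeyError, TypeError):
        return _respond(start_response, "400 Bad Request", {"error": "expected a JSON prompt"})
    # The API operation is connected to the underlying agent here.
    return _respond(start_response, "200 OK", {"result": run_cua_task(prompt)})


def call(method: str, path: str, body: bytes = b"") -> Tuple[str, dict]:
    """Invoke the app in-process with a hand-built environ; no network involved."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    environ = {"REQUEST_METHOD": method, "PATH_INFO": path,
               "CONTENT_LENGTH": str(len(body)), "wsgi.input": BytesIO(body)}
    raw = b"".join(cua_api(environ, start_response))
    return seen["status"], json.loads(raw)


status, reply = call("POST", "/tasks", b'{"prompt": "open the billing page"}')
print(status, reply)
```

To serve it over HTTP locally, you could pass `cua_api` to `wsgiref.simple_server.make_server` from the standard library. A real agent call can take minutes, so a production design would more likely return a job ID and let the consumer poll for the result.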
Now, let's think about the implications of having a CUA available through an API. First of all, how much of the CUA would be available through the API, and to whom? Would everyone have access to everything the CUA can do? Or would access be restricted to certain tasks, depending on who the API consumer is? Speaking of tasks, would the API expose individual tasks "à la carte," or would it let consumers send any prompt to the AI agent? And would all tasks run on the same computer, either virtual or physical, or would each task run on its own instance? There are many more questions, for sure. The point I'm trying to make is not that these decisions are hard. It's that the difficult part isn't the technology. That already exists.
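One way to picture two of these choices, who may do what and "à la carte" tasks versus free-form prompts, is a small access policy. This is only a sketch under my own assumptions: the task catalog, the `Consumer` type, and `resolve_prompt` are hypothetical, and a real API would tie consumers to authenticated credentials.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Hypothetical catalog: each named task maps to a fixed prompt template.
TASK_CATALOG: Dict[str, str] = {
    "download_invoice": "Download the invoice for {month} from the billing portal.",
    "update_profile": "Set the profile display name to {display_name}.",
}


@dataclass(frozen=True)
class Consumer:
    name: str
    allowed_tasks: FrozenSet[str]
    free_prompt: bool = False  # may this consumer send arbitrary prompts?


def resolve_prompt(consumer: Consumer, task: str = "", prompt: str = "", **params: str) -> str:
    """Return the prompt the agent would receive, or raise PermissionError."""
    if prompt:
        # Free-form access: the consumer decides what the agent does.
        if not consumer.free_prompt:
            raise PermissionError(f"{consumer.name} may not send free-form prompts")
        return prompt
    # "À la carte" access: only catalog tasks this consumer is allowed to run.
    if task not in TASK_CATALOG or task not in consumer.allowed_tasks:
        raise PermissionError(f"{consumer.name} may not run task {task!r}")
    return TASK_CATALOG[task].format(**params)


billing_bot = Consumer("billing_bot", frozenset({"download_invoice"}))
admin = Consumer("admin", frozenset(TASK_CATALOG), free_prompt=True)

print(resolve_prompt(billing_bot, task="download_invoice", month="March"))
print(resolve_prompt(admin, prompt="Archive last year's reports."))
```

The sketch leaves the last question open on purpose: whether each resolved prompt runs on a shared computer or on its own instance is a deployment decision that sits behind this function.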
Opening a CUA to an API is, technically speaking, easy. In fact, if Python is your favorite programming language, you can use something like FastAPI. I mention Python because that's the language both the Anthropic Computer Use demo and WebSurfer are written in. I think it's worth trying these solutions at least once.