A framework that enables multimodal models to operate computers. Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to achieve goals.
Designed for various multimodal models; currently integrated with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVA
Compatible with macOS, Windows, and Linux (with X server installed)
Free
Automates user interface testing by simulating human operations
Helps users with limited mobility operate computers
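The core idea above — the model views a screenshot and replies with a mouse or keyboard action, which is executed before the next screenshot is taken — can be sketched as a simple loop. This is an illustrative sketch only: the action grammar (`CLICK`/`TYPE`/`PRESS`/`DONE`), the `parse_action` and `operate` functions, and the callback names are assumptions, not the framework's actual API. In a real run, `screenshot` and `execute` would be backed by a GUI-automation library and `ask_model` by a multimodal model API.

```python
import json
import re

# Assumed one-action-per-turn reply format (illustrative, not the real grammar):
#   CLICK {"x": 0.42, "y": 0.31}   -- coordinates as fractions of screen size
#   TYPE "hello world"
#   PRESS "enter"
#   DONE

def parse_action(reply: str):
    """Parse a model reply into a (kind, payload) pair."""
    reply = reply.strip()
    if reply == "DONE":
        return ("done", None)
    if reply.startswith("CLICK"):
        coords = json.loads(reply[len("CLICK"):].strip())
        return ("click", (coords["x"], coords["y"]))
    m = re.match(r'TYPE\s+"(.*)"$', reply)
    if m:
        return ("type", m.group(1))
    m = re.match(r'PRESS\s+"(.*)"$', reply)
    if m:
        return ("press", m.group(1))
    raise ValueError(f"unrecognized action: {reply!r}")

def operate(objective, ask_model, screenshot, execute, max_steps=10):
    """Screenshot -> model decision -> execute, until DONE or step limit.

    ask_model(objective, image) -> reply string
    screenshot() -> current screen image
    execute(kind, payload) -> performs the mouse/keyboard action
    """
    for _ in range(max_steps):
        kind, payload = parse_action(ask_model(objective, screenshot()))
        if kind == "done":
            return True
        execute(kind, payload)
    return False
```

Because the model and screen interactions are injected as callables, the loop can be exercised with stubs, e.g. `operate("open app", fake_model, lambda: b"", actions.append)`.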