A framework that enables multimodal models to operate computers. Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to achieve goals.
Designed for various multimodal models; currently integrated with GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVA
Compatible with macOS, Windows, and Linux (with X server installed)
Free
Automates user interface testing by simulating human operations
Helps users with limited mobility operate computers
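The core idea above — the model views a screenshot and replies with a mouse or keyboard action, which is executed before the next screenshot is taken — can be sketched as a simple loop. This is an illustrative sketch only: the action grammar (`CLICK`/`TYPE`/`PRESS`/`DONE`), the `parse_action` and `operate` functions, and the callback names are assumptions, not the framework's actual API. In a real run, `screenshot` and `execute` would be backed by a GUI-automation library and `ask_model` by a multimodal model API.

```python
import json
import re

# Assumed one-action-per-turn reply format (illustrative, not the real grammar):
#   CLICK {"x": 0.42, "y": 0.31}   -- coordinates as fractions of screen size
#   TYPE "hello world"
#   PRESS "enter"
#   DONE

def parse_action(reply: str):
    """Parse a model reply into a (kind, payload) pair."""
    reply = reply.strip()
    if reply == "DONE":
        return ("done", None)
    if reply.startswith("CLICK"):
        coords = json.loads(reply[len("CLICK"):].strip())
        return ("click", (coords["x"], coords["y"]))
    m = re.match(r'TYPE\s+"(.*)"$', reply)
    if m:
        return ("type", m.group(1))
    m = re.match(r'PRESS\s+"(.*)"$', reply)
    if m:
        return ("press", m.group(1))
    raise ValueError(f"unrecognized action: {reply!r}")

def operate(objective, ask_model, screenshot, execute, max_steps=10):
    """Screenshot -> model decision -> execute, until DONE or step limit.

    ask_model(objective, image) -> reply string
    screenshot() -> current screen image
    execute(kind, payload) -> performs the mouse/keyboard action
    """
    for _ in range(max_steps):
        kind, payload = parse_action(ask_model(objective, screenshot()))
        if kind == "done":
            return True
        execute(kind, payload)
    return False
```

Because the model and screen interactions are injected as callables, the loop can be exercised with stubs, e.g. `operate("open app", fake_model, lambda: b"", actions.append)`.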