News

Open-AutoGLM: An Open-Source Phone Agent Framework for Natural Language Control of Android and HarmonyOS Devices

"'Open Meituan and search for nearby hot pot restaurants.' 'Send a message to File Transfer Assistant: deployment successful.' — Spoken, and the phone does it automatically." This vision is now a reality.

Today, we highlight Open-AutoGLM (GitHub), an open-source project by zai-org (Zhipu AI ecosystem). This innovative phone agent framework enables natural language control over mobile devices.

Open-AutoGLM is built on two core components: a Phone Agent framework and the AutoGLM-Phone vision-language models. The Python-based Phone Agent runs on your computer and controls devices via ADB (Android) or HDC (HarmonyOS). Each step of its loop captures a device screenshot, has the visual model interpret the interface, and parses the model's output into an action (e.g., launch an app, tap coordinates, type text), which is then executed via ADB/HDC.

The AutoGLM-Phone series comprises 9B-parameter vision-language models optimized specifically for mobile interfaces. They can be accessed through the Zhipu BigModel or ModelScope APIs, or deployed on your own vLLM/SGLang service. A user can simply say "open Xiaohongshu and search for food," and the Agent completes the entire sequence autonomously.

The framework also includes sensitive-operation confirmation and human takeover, crucial for scenarios such as logins or CAPTCHA challenges. Open-AutoGLM supports Android 7.0+ and HarmonyOS NEXT, covers over 50 Android and 60 HarmonyOS applications, and integrates with tools like Midscene.js.
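The screenshot → visual model → action → ADB/HDC loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the model call is a placeholder, and the action schema (`tap`, `type`, `launch`) is an assumption for demonstration.

```python
# Sketch of the Phone Agent loop: screenshot -> model -> action -> ADB.
import base64
import subprocess


def capture_screenshot() -> bytes:
    """Grab the current screen as PNG bytes via ADB."""
    return subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        capture_output=True, check=True,
    ).stdout


def decide_action(task: str, screenshot_png: bytes) -> dict:
    """Placeholder for the vision-language model call.

    In the real framework this would send the screenshot to an
    AutoGLM-Phone endpoint (BigModel/ModelScope, or a self-hosted
    vLLM/SGLang service) and parse the returned action.
    """
    _payload = base64.b64encode(screenshot_png).decode()  # image for the model
    return {"type": "tap", "x": 540, "y": 1200}           # dummy action


def execute(action: dict) -> None:
    """Translate a parsed action into an ADB shell command."""
    if action["type"] == "tap":
        subprocess.run(["adb", "shell", "input", "tap",
                        str(action["x"]), str(action["y"])], check=True)
    elif action["type"] == "type":
        subprocess.run(["adb", "shell", "input", "text",
                        action["text"]], check=True)
    elif action["type"] == "launch":
        subprocess.run(["adb", "shell", "monkey", "-p",
                        action["package"], "1"], check=True)


def run(task: str, max_steps: int = 10) -> None:
    """Repeat the observe-decide-act cycle until done or out of steps."""
    for _ in range(max_steps):
        shot = capture_screenshot()
        action = decide_action(task, shot)
        if action["type"] == "done":
            break
        execute(action)
```

For HarmonyOS devices, the same loop would shell out to `hdc` instead of `adb`.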

To fully grasp Open-AutoGLM, it helps to understand its positioning: a Phone Agent framework paired with the AutoGLM-Phone model to turn natural language into phone operations. Key aspects include the working pipeline (screenshot, visual-model processing, action parsing, ADB/HDC execution), remote debugging, and the human-takeover mechanism. Environment setup requires Python, ADB/HDC, developer options enabled on the device, and ADB Keyboard on Android. Model acquisition and deployment options range from the Zhipu/ModelScope APIs to self-hosted vLLM/SGLang services. Finally, knowing the supported applications, the available actions, and the project structure is essential for secondary development.
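Before the Agent can drive a phone, the device must show up as authorized in ADB. A small helper like the one below can verify that setup step; the parsing logic follows the standard `adb devices` output format and is illustrative, not Open-AutoGLM's own check.

```python
# Verify the ADB connection step: list devices in the authorized state.
import subprocess


def parse_adb_devices(output: str) -> list[str]:
    """Extract serials of devices reported in the 'device' (authorized) state.

    `adb devices` prints a header line, then one line per device:
    '<serial>\t<state>', where state is e.g. 'device', 'unauthorized',
    or 'offline'.
    """
    serials = []
    for line in output.splitlines()[1:]:  # skip "List of devices attached"
        parts = line.split()
        if len(parts) == 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def connected_devices() -> list[str]:
    """Run `adb devices` and return the authorized serials."""
    out = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout
    return parse_adb_devices(out)
```

If the list comes back empty, the usual culprits are USB debugging being disabled in developer options or an unaccepted authorization prompt on the phone.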

Prerequisites for working with Open-AutoGLM include familiarity with Python 3.10+, pip, and virtual environments, along with a basic grasp of ADB or HDC for device connectivity and command execution. Self-hosting the model service benefits from foundational experience with GPUs and vLLM/SGLang; using the cloud APIs instead requires only applying for a key.

In essence, Open-AutoGLM integrates an open-source Phone Agent framework with the AutoGLM-Phone vision-language model to achieve natural language control of phones. Users issue commands from a computer, and the Agent, leveraging multimodal screen understanding and planning, orchestrates actions like app launching, tapping, typing, and swiping via ADB or HDC. The framework features built-in sensitive-operation confirmation and human takeover for tasks such as logins or CAPTCHAs, alongside support for WiFi/network remote debugging, eliminating the need for a continuous physical connection. The model suite includes AutoGLM-Phone-9B (optimized for Chinese) and AutoGLM-Phone-9B-Multilingual.
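The WiFi remote-debugging capability mentioned above follows the standard ADB wireless workflow: switch the device's ADB daemon to TCP mode once over USB, then connect over the network and unplug the cable. The sketch below wraps those commands; the port and sequence are standard ADB usage, and Open-AutoGLM may wrap this differently.

```python
# Switch ADB from USB to WiFi so the Agent can run without a cable.
import subprocess


def wifi_debug_commands(ip: str, port: int = 5555) -> list[list[str]]:
    """Build the ADB command sequence for wireless debugging."""
    return [
        ["adb", "tcpip", str(port)],         # run once while still on USB
        ["adb", "connect", f"{ip}:{port}"],  # then the cable can come off
    ]


def enable_wifi_debugging(ip: str, port: int = 5555) -> None:
    """Execute the sequence; the phone and computer must share a network."""
    for cmd in wifi_debug_commands(ip, port):
        subprocess.run(cmd, check=True)
```

After connecting, the device appears in `adb devices` as `<ip>:<port>` and the Agent loop works unchanged.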
