web-infra-dev/desktop-computer-automation

Vision-driven desktop automation using Midscene. Control your local desktop (macOS, Windows, Linux) or a remote Windows desktop over RDP with natural language commands. Operates entirely from screenshots, with no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack.

⚠️ In local mode this takes over the user's real mouse and keyboard. For web apps, prefer "Browser Automation" instead. Only use this for desktop-native apps (Electron, Qt, native macOS/Windows/Linux) that cannot run in a browser, or for driving a remote Windows host via RDP.

Triggers: open app, press key, desktop, computer, click on screen, type text, screenshot desktop, launch application, switch window, desktop automation, control computer, mouse click, keyboard shortcut, screen capture, find on screen, read screen, verify window, close app, test Electron app, rdp, remote desktop, windows server, connect via rdp.

Powered by Midscene.js (https://midscenejs.com)

Works with: Claude Code, Codex CLI, Cursor
npx skills add https://github.com/web-infra-dev/midscene-skills/tree/main/skills/desktop-computer-automation

Documentation

Individual skills in this repo

This repo contains 5 individual skills — each has its own dedicated page.

web-infra-dev/android-device-automation

Vision-driven Android device automation using Midscene. Operates entirely from screenshots, with no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control Android devices with natural language commands via ADB. Perform taps, swipes, text input, app launches, screenshots, and more.

Trigger keywords: android, phone, mobile app, tap, swipe, install app, open app on phone, android device, mobile automation, adb, launch app, mobile screen, test android app, verify mobile app, QA on phone, check the app on android, test on device, see if the app works on phone, end-to-end test on android, visual verification on mobile.

Powered by Midscene.js (https://midscenejs.com)

web-infra-dev/browser-automation

Vision-driven browser automation using Midscene. Operates from screenshots, with no DOM or accessibility labels needed. Runs in headless Puppeteer and does NOT take over the user's mouse or keyboard. Also supports CDP mode and Bridge mode to connect to an existing Chrome.

Use this skill when the user wants to:
- Browse, navigate, or open web pages
- Scrape, extract, or collect data from websites
- Fill out forms, click buttons, or interact with web elements
- Verify, validate, test, or QA frontend UI behavior
- Take screenshots of web pages
- Automate multi-step web workflows
- Test what was just built, see if it works in browser
- Connect to Chrome via CDP, DevTools Protocol, or remote debugging
- Connect to user's Chrome browser, control my browser, operate my Chrome

Powered by Midscene.js (https://midscenejs.com)

web-infra-dev/harmonyos-device-automation

Vision-driven HarmonyOS NEXT device automation using Midscene. Operates entirely from screenshots, with no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control HarmonyOS devices with natural language commands via HDC. Perform taps, swipes, text input, app launches, screenshots, and more.

Trigger keywords: harmony, harmonyos, 鸿蒙, hdc, huawei device, harmony app, harmony automation, harmony phone, harmony tablet, test harmony app, verify on harmonyos, QA on 鸿蒙, check the app on harmony, test on huawei device, see if the app works on harmony, end-to-end test on harmonyos, visual verification on 鸿蒙.

Powered by Midscene.js (https://midscenejs.com)

web-infra-dev/ios-device-automation

Vision-driven iOS device automation using Midscene CLI. Operates entirely from screenshots, with no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control iOS devices with natural language commands via WebDriverAgent.

Triggers: ios, iphone, ipad, ios app, tap on iphone, swipe, mobile app ios, ios device, ios testing, iphone automation, ipad automation, ios screen, ios navigate, test ios app, verify on iphone, QA on ipad, check the app on ios, test on ios device, see if the app works on iphone, end-to-end test on ios, visual verification on ios.

Powered by Midscene.js (https://midscenejs.com)

web-infra-dev/vitest-midscene-e2e

Enhances Vitest with Midscene for AI-powered UI testing across Web (Playwright), Android (ADB), and iOS (WDA). Scaffolds new projects, converts existing projects, and creates/updates/debugs/runs E2E tests using natural-language UI interactions. Triggers: write test, add test, create test, update test, fix test, debug test, run test, e2e test, midscene test, new project, convert project, init project, 写测试, 加测试, 创建测试, 更新测试, 修复测试, 调试测试, 运行测试, 新建工程, 转化工程.

Related Skills

Pango470/ARK-index

🗂 Generate fast, language-aware code indexes with automatic test mapping and incremental updates for efficient agent workflows.

community

okx/okx-cex-trade

Use this skill when the user asks to 'buy BTC', 'sell ETH', 'place a limit order', 'cancel my order', 'amend my order', 'long BTC perp', 'short ETH swap', 'open a position', 'close a position', 'set TP/SL', 'trailing stop', 'set leverage', 'check my orders', 'fill history', 'buy/sell call/put option', 'option chain', 'implied volatility', 'IV', 'option Greeks (delta/gamma/theta/vega)', 'delta hedge', 'option fills', 'event contract', 'buy Yes/No', 'buy Up/Down', 'BTC above', 'price above', '15min price', 'prediction market', 'browse event contracts', 'list event contracts', 'available prediction markets', or any request to browse, place, cancel, or amend spot, swap, futures, options, or event contract orders on OKX. Covers spot trading, perpetual swap, delivery futures, options (calls/puts, Greeks, IV), event contracts (Yes/No, Up/Down), and algo orders (TP/SL/trailing). Requires API credentials. Do NOT use for market data (okx-cex-market), balances/positions (okx-cex-portfolio), or bots (okx-cex-bot).

community

hongphuc5497/skills

Curated agent skills for Hermes, Claude Code, Codex and more -- tailored for my workflow

community

remorses/playwriter

Control the user's own Chrome browser via the Playwriter extension with Playwright code snippets in a stateful local JS sandbox. Use this over other Playwright MCPs to automate the browser: it connects to the user's existing Chrome instead of launching a new one. Use this CLI for navigating JS-heavy websites (Instagram, Twitter, cookie/login walls, lazy-loaded UIs) instead of webfetch/curl. ALWAYS load this skill before using any playwriter commands.

community

Dqz00116/skill-lib

A curated collection of reusable AI Agent Skills for standardized workflows, best practices, and domain expertise.

community

microsoft/Multi-Agent-Custom-Automation-Engine-Solution-Accelera

The Multi-Agent Custom Automation Engine Solution Accelerator is an AI-driven system that manages a group of AI agents to accomplish tasks based on user input. Powered by Microsoft Agent Framework, Azure Foundry, Azure Cosmos DB, and infrastructure services, it provides a reference application, allowing you to hit the ground running.

community