SCOPE is a modular, multimodal agentic system for natural-language pan-tilt-zoom (PTZ) camera control. A Small Language Model (SLM) planner orchestrates a fixed action space: a set of skills (camera-control and perception workflows) exposed through an OpenAI-compatible JSON tool schema that is identical in Blender simulation and on a physical AXIS PTZ camera. A Vision-Language Model (VLM) handles perception as a callable skill.
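To make the idea of a fixed, backend-independent action space concrete, here is a minimal sketch of how skills might be declared in the OpenAI-compatible `tools` format and dispatched to a camera backend. The skill names (`move_camera`, `ask_vlm`), their parameters, and the `FakeCamera` class are hypothetical and chosen for illustration; they are not taken from the SCOPE codebase.

```python
import json

# Hypothetical skill declarations in the OpenAI-compatible "tools" format.
# Names and parameters are illustrative assumptions, not SCOPE's real schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "move_camera",
            "description": "Move the camera to an absolute pan/tilt/zoom pose.",
            "parameters": {
                "type": "object",
                "properties": {
                    "pan_deg": {"type": "number"},
                    "tilt_deg": {"type": "number"},
                    "zoom": {"type": "number"},
                },
                "required": ["pan_deg", "tilt_deg"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ask_vlm",
            "description": "Ask the vision-language model about the current frame.",
            "parameters": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
    },
]


class FakeCamera:
    """Stand-in backend. A simulator or a physical PTZ camera would
    implement the same method names, so the planner sees one schema."""

    def __init__(self):
        self.state = {"pan_deg": 0.0, "tilt_deg": 0.0, "zoom": 1.0}

    def move_camera(self, pan_deg, tilt_deg, zoom=1.0):
        self.state = {
            "pan_deg": float(pan_deg),
            "tilt_deg": float(tilt_deg),
            "zoom": float(zoom),
        }
        return dict(self.state)

    def ask_vlm(self, question):
        # No real VLM is attached in this sketch.
        return {"answer": None, "question": question}


def dispatch(backend, tool_call):
    """Route one planner tool call to the backend method of the same name."""
    name = tool_call["function"]["name"]
    allowed = {tool["function"]["name"] for tool in TOOLS}
    if name not in allowed:
        raise ValueError(f"unknown skill: {name}")
    # In the OpenAI-compatible format, arguments arrive as a JSON string.
    args = json.loads(tool_call["function"]["arguments"])
    return getattr(backend, name)(**args)


example_call = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "move_camera",
        "arguments": json.dumps({"pan_deg": 30, "tilt_deg": -5}),
    },
}
result = dispatch(FakeCamera(), example_call)
```

Because the planner only ever emits calls that match `TOOLS`, swapping `FakeCamera` for a simulator-backed or hardware-backed class would not require changing the planner prompt or the schema in this sketch.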
The paper benchmarks 20 SLM × VLM configurations on a 541-task evaluation set spanning 8 categories, including counting, descriptor, spatial reasoning, OCR, and single- and multi-step planning. The best configuration (Qwen3-30B-A3B + Moondream3) reaches 73.8% overall accuracy with sub-second latency on a single edge GPU.
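As a rough illustration of how per-category and overall accuracy could be derived from judged task outcomes, the sketch below pools all tasks into one overall figure. Whether the paper pools tasks or averages across categories is an assumption here, and the toy results are invented for the example. Under the pooled reading, 73.8% of 541 tasks would correspond to roughly 399 correct tasks.

```python
from collections import defaultdict


def summarize(results):
    """Compute per-category accuracy and pooled overall accuracy.

    `results` is an iterable of (category, is_correct) pairs, one per task.
    Pooling over tasks is an assumption, not the paper's stated method.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_correct in results:
        counts[category][0] += int(bool(is_correct))
        counts[category][1] += 1
    if not counts:
        raise ValueError("no results to summarize")
    by_category = {cat: hits / total for cat, (hits, total) in counts.items()}
    total_hits = sum(hits for hits, _ in counts.values())
    total_tasks = sum(total for _, total in counts.values())
    return by_category, total_hits / total_tasks


# Toy data for illustration only; not taken from the benchmark.
toy_results = [
    ("counting", True),
    ("counting", False),
    ("ocr", True),
    ("ocr", True),
]
by_category, overall = summarize(toy_results)
```

With the toy data, `by_category` holds one accuracy per category and `overall` is the fraction of all tasks judged correct.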
Python, Blender 4.0+, Qwen3, Mistral, Moondream2/3, Qwen2.5-VL, Ollama, vLLM, OpenAI-compatible tools, LLM-as-Judge
Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra — Armada AI
@inproceedings{hindsbo2026scope,
  title     = {SCOPE: A Real-Time Natural Language Camera Agent at the Edge},
  author    = {Hindsbo, Nikolaj and Ehsani, Sina and Mishra, Pragyana},
  booktitle = {Proceedings of the ACM/IEEE International Conference on
               Human-Robot Interaction (HRI '26)},
  year      = {2026},
  publisher = {ACM},
  doi       = {10.1145/3757279.3785641},
}