SCOPE

01 · HRI '26 · 2026 · Armada AI
When you chain a language model and a vision model together, how do you know which one failed?

Abstract

SCOPE is a modular multimodal agentic system for natural-language control of pan-tilt-zoom (PTZ) cameras. A Small Language Model (SLM) planner orchestrates a fixed action space: a set of skills (camera-control and perception workflows) exposed through an OpenAI-compatible JSON tool schema that is identical in Blender simulation and on a physical AXIS PTZ camera. A Vision-Language Model (VLM) handles perception as a callable skill.
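To make the idea of a single tool schema shared across backends concrete, here is a minimal sketch in Python. The skill names (`pan_tilt_zoom`, `ask_vlm`), their parameters, and the `FakeBackend` class are hypothetical and chosen only for illustration; they are not SCOPE's actual schema or drivers. Only the outer `{"type": "function", "function": {...}}` layout follows the OpenAI tools format.

```python
import json

# Hypothetical skills in the OpenAI "tools" format. Names and parameters are
# illustrative assumptions, not the schema used by SCOPE.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "pan_tilt_zoom",
            "description": "Move the camera by relative pan/tilt angles and optionally set zoom.",
            "parameters": {
                "type": "object",
                "properties": {
                    "pan_deg": {"type": "number"},
                    "tilt_deg": {"type": "number"},
                    "zoom": {"type": "number"},
                },
                "required": ["pan_deg", "tilt_deg"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ask_vlm",
            "description": "Ask the vision-language model a question about the current frame.",
            "parameters": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
    },
]


class FakeBackend:
    """Stand-in for either a simulator or a physical camera driver (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.pan, self.tilt, self.zoom = 0.0, 0.0, 1.0

    def pan_tilt_zoom(self, pan_deg, tilt_deg, zoom=None):
        self.pan += pan_deg
        self.tilt += tilt_deg
        if zoom is not None:
            self.zoom = zoom
        return {"pan": self.pan, "tilt": self.tilt, "zoom": self.zoom}

    def ask_vlm(self, question):
        # A real system would send the current frame plus the question to a VLM.
        return {"answer": f"[{self.name}] no VLM attached; question was: {question}"}


def dispatch(tool_call, backend):
    """Run one tool call (skill name plus JSON-encoded arguments) on a backend."""
    name = tool_call["function"]["name"]
    allowed = {tool["function"]["name"] for tool in TOOLS}
    if name not in allowed:
        raise ValueError(f"unknown skill: {name}")
    args = json.loads(tool_call["function"]["arguments"])
    return getattr(backend, name)(**args)


# The same planner output is executed unchanged on both backends.
call = {"function": {"name": "pan_tilt_zoom", "arguments": '{"pan_deg": 15, "tilt_deg": -5}'}}
sim_result = dispatch(call, FakeBackend("sim"))
ptz_result = dispatch(call, FakeBackend("ptz"))
```

Because the planner only ever emits calls against `TOOLS`, swapping the simulator for hardware in this sketch means swapping the backend object and nothing else.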

The paper benchmarks 20 SLM × VLM configurations on a 541-task evaluation set spanning 8 categories, including counting, descriptor, spatial reasoning, OCR, and single- and multi-step planning. The best configuration (Qwen3-30B-A3B + Moondream3) reaches 73.8% overall accuracy with sub-second latency on a single edge GPU.
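A benchmark of this shape can be scored with a few lines of bookkeeping. The sketch below is an assumption about how such a grid might be tallied, not the paper's evaluation code: the model names are placeholders, the records are toy data, and "overall accuracy" is taken here to mean the fraction of all tasks judged correct.

```python
from collections import defaultdict
from itertools import product

# Placeholder model names; the actual 20 pairings are not listed in this summary.
SLMS = ["slm-a", "slm-b"]
VLMS = ["vlm-x", "vlm-y"]
grid = list(product(SLMS, VLMS))  # every SLM x VLM configuration


def score(records):
    """Tally accuracy from (slm, vlm, category, correct) judge verdicts."""
    overall = defaultdict(lambda: [0, 0])  # (slm, vlm) -> [correct, total]
    by_cat = defaultdict(lambda: [0, 0])   # (slm, vlm, category) -> [correct, total]
    for slm, vlm, category, correct in records:
        for bucket in (overall[(slm, vlm)], by_cat[(slm, vlm, category)]):
            bucket[0] += int(correct)
            bucket[1] += 1

    def to_accuracy(counts):
        return {key: c / n for key, (c, n) in counts.items()}

    return to_accuracy(overall), to_accuracy(by_cat)


# Toy verdicts, invented purely to exercise the function.
records = [
    ("slm-a", "vlm-x", "counting", True),
    ("slm-a", "vlm-x", "counting", False),
    ("slm-a", "vlm-x", "ocr", True),
    ("slm-a", "vlm-x", "ocr", True),
    ("slm-b", "vlm-x", "counting", False),
    ("slm-b", "vlm-x", "ocr", True),
]
overall, by_cat = score(records)
best = max(overall, key=overall.get)
```

Keeping the per-category table next to the overall score is what lets a reader see whether a configuration loses points mostly on perception-heavy or planning-heavy tasks.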

∗ ∗ ∗

Stack

Python · Blender 4.0+ · Qwen3 · Mistral · Moondream2/3 · Qwen2.5-VL · Ollama · vLLM · OpenAI-compatible tools · LLM-as-Judge

Authors

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra — Armada AI

Citation

@inproceedings{hindsbo2026scope,
  title  = {SCOPE: A Real-Time Natural Language Camera Agent at the Edge},
  author = {Hindsbo, Nikolaj and Ehsani, Sina and Mishra, Pragyana},
  booktitle = {Proceedings of the ACM/IEEE International Conference on
               Human-Robot Interaction (HRI '26)},
  year   = {2026},
  publisher = {ACM},
  doi    = {10.1145/3757279.3785641},
}