The world of computer vision is much bigger than multimodal LLMs. Instead, you'd run an ensemble of specialized models for 3D mapping, object classification, path validation, and so on. On a Raspberry Pi 5 (8 GB) you can run everything you need to self-drive an RC car around an obstacle track at 10 FPS.
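
To make the "ensemble" idea concrete, here is a rough sketch of what one frame of such a pipeline might look like. At 10 FPS the whole chain gets about 100 ms per frame. The stage functions (`map_3d`, `classify_objects`, `validate_path`) and `run_frame` are hypothetical placeholders I made up for illustration, not real models or any particular library's API.

```python
import time

FPS_TARGET = 10
FRAME_BUDGET_S = 1.0 / FPS_TARGET  # 0.1 s available per frame at 10 FPS


# Hypothetical stand-ins: each stage would wrap one small specialized model.
def map_3d(state: dict) -> dict:
    # Placeholder for a 3D mapping / depth estimation model.
    return {**state, "depth": "stub-depth-map"}


def classify_objects(state: dict) -> dict:
    # Placeholder for an object classifier or detector.
    return {**state, "objects": ["cone"]}


def validate_path(state: dict) -> dict:
    # Placeholder: a real validator would check a planned path against
    # the depth map and detected objects. Here we only check both exist.
    path_ok = "depth" in state and "objects" in state
    return {**state, "path_ok": path_ok}


STAGES = [map_3d, classify_objects, validate_path]


def run_frame(frame, stages=STAGES, budget_s=FRAME_BUDGET_S):
    """Run every stage on one frame and report whether it fit the time budget."""
    start = time.perf_counter()
    state = {"frame": frame}
    for stage in stages:
        state = stage(state)
    elapsed = time.perf_counter() - start
    return state, elapsed, elapsed <= budget_s


if __name__ == "__main__":
    state, elapsed, within_budget = run_frame("frame-0")
    print(state["path_ok"], f"{elapsed * 1000:.2f} ms", within_budget)
```

In a real setup each stub would be replaced by an actual model call, and the budget check is what tells you whether the ensemble still keeps up with the camera.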