AI agents need to see what they're doing
Welcome to our first blog post. In time we plan to share more background about how our products function.
Our first product is a set of AI companions that play Minecraft with you, similar to how humans play multiplayer Minecraft with each other. Our AI agents can be given user-created characters (personalities and skins), understand what you tell them, and participate in the game, much like another human player. Now these agents can “see” what human players see.
Today, we are excited to share that we have shipped what is (as far as we know) the first consumer product to use a vision-based approach to playing Minecraft (try it yourself).
Why Vision
Our first product, an AI agent that plays Minecraft with you, must understand and act in the 3D world of Minecraft. One distinctive aspect of Minecraft is that it is open and its protocol is well documented. There are third-party libraries that allow you to create agents that play the game through API interfaces. This means it is possible to write agents that play Minecraft without using vision at all. For example, the open-source project Mindcraft connects these APIs with a large language model to create an interesting Minecraft AI agent capable of interacting in the game world. Similarly, Voyager showed that you could use a large language model to generate complex behaviors in Minecraft.
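To make the API-based approach concrete, here is a minimal sketch using the open-source mineflayer library, the kind of interface these agents build on. The host, port, and bot name are placeholders, and the echo handler stands in for wherever a language model would sit.

```typescript
// Minimal sketch of an API-driven Minecraft agent using mineflayer.
// Connection details and the bot name below are placeholders.
import { createBot } from 'mineflayer'

const bot = createBot({
  host: 'localhost',   // placeholder server
  port: 25565,
  username: 'CompanionBot',
})

// The bot receives game state as structured data (chat, entities, block
// queries) rather than pixels, so a language model can drive it via text.
bot.on('chat', (username, message) => {
  if (username === bot.username) return
  // In a real agent, `message` would be handed to an LLM that plans an
  // action; here we simply acknowledge it.
  bot.chat(`You said: ${message}`)
})
```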
However, there are big limitations to the text-only, API-based approach. One is that it is only viable for the small subset of games, such as Minecraft, that expose APIs or text interfaces. Another is that, even within Minecraft, while it is possible to beat the game using only an API-like interface, it is very hard to generate the wide range of behaviors our users demand purely from text.
For example, humans playing Minecraft make visual aesthetic judgements, such as building structures or laying out towns in a pleasing way. As another example, using the API it is quite easy to find the nearest wood or stone and have the AI agent collect it, but humans have strong preferences about how this is done: they are not usually impressed if you start collecting wood by smashing the house they just built or stealing from a village. They generally expect you to collect wood from a forest.
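Continuing the sketch above, the API makes “find the nearest wood” essentially a one-liner, but nothing in that query can tell a forest from the wall of the player’s cabin. The block name and search radius below are illustrative.

```typescript
// Sketch: finding and mining the nearest oak log via the API.
// Assumes a `bot` created as in the previous snippet; navigation is omitted
// and we pretend the block is within reach.
import type { Bot } from 'mineflayer'
import minecraftData from 'minecraft-data'

async function collectNearestLog(bot: Bot): Promise<void> {
  const mcData = minecraftData(bot.version)
  const log = bot.findBlock({
    matching: mcData.blocksByName['oak_log'].id,
    maxDistance: 64, // illustrative search radius
  })
  if (!log) return
  // Nothing here distinguishes a tree in a forest from a log that is part
  // of the player's house — that judgement is what vision is for.
  await bot.dig(log)
}
```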
While it is certainly possible to attempt these tasks through text interfaces, humans rely on vision to make these aesthetic judgements. A good AI companion also needs to understand the user’s references, which will often require vision. Another advantage of incorporating vision is that the agent can learn from the many videos of people playing Minecraft.
We are not the first to recognize the importance of vision in understanding and playing Minecraft. There have been a number of very interesting research papers exploring this and learning to play aspects of Minecraft from vision, including VPT and Steve-1.
Productionizing Vision
There is a gap between this prior research and a consumer product. To make vision-based AI agents widely available, there were several challenges we needed to solve:
We need to render the agent’s point of view efficiently, without adding load to the user’s computer, since some users have relatively underpowered machines.
We need to be able to perform inference using our vision models efficiently and with low enough latency to act interactively in the game.
We need to make sure we do both of the above in a cost-effective way so that we can continue to allow users to try our agents for free.
We need to improve the reliability of the vision-based agent policies beyond what has been demonstrated in public benchmarks.
We cannot simply run the vision model and the text model independently, since the vision model also affects how the text model does its planning; we need to integrate them appropriately.
To achieve these goals we have built a hybrid vision- and text-based system. We continue to use a (relatively low-cost) text-based approach much of the time. However, when our agent is carrying out tasks where vision is likely to improve performance, we begin rendering and using our vision-based model. We have put significant effort into automatically determining when that is the case.
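As a rough illustration of the idea (not our actual code), the routing can be thought of as a per-task switch; the task kinds, stub policies, and function names below are hypothetical.

```typescript
// Hypothetical sketch of hybrid text/vision routing. Task kinds and the
// stub policies are illustrative, not the production implementation.
type TaskKind = 'chat' | 'navigate' | 'collect_materials' | 'place_building'

interface AgentTask {
  kind: TaskKind
  description: string
}

// Tasks where spatial or aesthetic judgement matters get the (more costly)
// vision pipeline; everything else stays on the cheap text-only path.
const VISION_TASKS = new Set<TaskKind>(['collect_materials', 'place_building'])

async function startPovRendering(): Promise<void> {
  // Stub: start rendering the agent's point of view off the user's machine.
}

async function visionPolicy(task: AgentTask): Promise<void> {
  console.log(`[vision] ${task.description}`)
}

async function textPolicy(task: AgentTask): Promise<void> {
  console.log(`[text] ${task.description}`)
}

async function runTask(task: AgentTask): Promise<void> {
  if (VISION_TASKS.has(task.kind)) {
    await startPovRendering() // pay the rendering cost only when it helps
    await visionPolicy(task)
  } else {
    await textPolicy(task)
  }
}
```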
We have collected proprietary datasets for the tasks we want to improve and built data collection pipelines that let us iterate on them. In this way, we are able to train our vision-based model on some of the more important aesthetic tasks, such as avoiding smashing users’ structures, that have not been a focus of prior work.
Try it!
We believe in showing rather than telling. Last week we quietly turned on our vision model for all of our users. If you want to try an AI agent that plays Minecraft and uses vision in a consumer product, all you have to do is download our app. The exciting thing about our design is that we were able to ship this quietly without requiring our users to do anything different (or even know about the change); they just started benefiting from a smarter agent with visual understanding.
For now, we primarily use vision when collecting materials (to avoid damaging the player’s existing structures) and when deciding where to place buildings. Our bot is far from perfect and will continue to make mistakes (feel free to share examples with us so the bot can learn for next time).
Below you can see the Elefant bot asked to mine while inside a house. The first video is without vision, and the bot smashes the house. In the second video, using vision, the bot first exits the house to avoid damaging it while collecting more stone.
Next steps
Over time we expect to improve our vision model and use it for more tasks. As a user, you don’t need to pay attention to the details or mess around with graphics drivers; you should simply see our agent improve its understanding of the visual world of Minecraft over time, becoming even more fun and human-like to play with.
This also lays the groundwork to bring our technology to other games, and beyond games. We have some ideas, but if you have a game or other task you’d like to see our AI agents in, feel free to join our Discord and let us know.
Get involved
If building the next generation of AI agents that can understand and act in 3D worlds sounds like fun, get in touch: jj@elefant.gg