Outsmart LLM Arena – a battle of diplomacy and deviousness

This post is about my LLM Arena running here.

Before you grab your pitchforks and torches, let me come clean: this LLM Arena isn’t based on peer-reviewed papers or groundbreaking lab research. Actually, it doesn’t have any formal scientific backbone at all. The idea came to me while prepping to give an O’Reilly class on benchmarking LLMs, and it’s been a BLAST to make. And if I do say so myself, it’s rather fun to play, too!

The Game

I was considering the limitations of LLM benchmarks. It’s hard to measure nuanced reasoning with a defined set of tests. The LMSYS Chatbot Arena is one of the best ways to compare models, and it’s fantastic, but it relies on human judgement. I wanted to make an LLM Arena where models could duke it out based on reasoning and communication prowess — no human evaluation required.

One of my inspirations was Cicero, the AI from Meta that’s trained to play the strategy game Diplomacy. It ranks in the top 10% of human players. My goal was to come up with an alternative to Diplomacy that’s so simple that it can be explained in a short prompt to any strong LLM.

So, without further ado, let me introduce Outsmart, running on the wonderful HuggingFace Spaces platform. Dive into a game to get the gist by pressing the Run Game button. I’m using cheaper LLMs for the public version, but I’ve run a bunch of games on my local box with Frontier models — check out the leaderboard for the results.

The rules

My aim was to craft a game with straightforward rules and just enough complexity for the cleverer models to outwit their lesser rivals. After some iterating, I landed on this construct:

  1. Each player starts with 12 coins
  2. On each turn:
    • Players send private messages to each other player, to concoct schemes
    • Players select one player to give a coin to, and one player to take a coin from
    • If two players give each other coins, and both take from the same player, they form an Alliance, earning an extra coin each, taken from their hapless victim.
  3. The winner is the player with the most coins after 10 turns, or when any player goes broke.
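
The rules above are simple enough to sketch in a few lines. Here's a hypothetical implementation of one turn's resolution — the function and variable names are my own invention, not the actual Outsmart code:

```python
def resolve_turn(coins: dict[str, int], moves: dict[str, tuple[str, str]]) -> list[tuple[str, str]]:
    """
    coins: player name -> coin count (mutated in place)
    moves: player name -> (give_to, take_from)
    Returns any alliances formed this turn.
    """
    # Basic gives and takes: a give moves a coin from giver to recipient,
    # a take moves a coin from victim to taker
    for player, (give_to, take_from) in moves.items():
        coins[player] -= 1
        coins[give_to] += 1
        coins[take_from] -= 1
        coins[player] += 1

    # Alliance check: mutual gives plus a shared victim earns each ally
    # an extra coin at the victim's expense
    alliances = []
    players = list(moves)
    for i, a in enumerate(players):
        for b in players[i + 1:]:
            (a_give, a_take), (b_give, b_take) = moves[a], moves[b]
            if a_give == b and b_give == a and a_take == b_take:
                coins[a] += 1
                coins[b] += 1
                coins[a_take] -= 2
                alliances.append((a, b))
    return alliances
```

Note that gives and takes are zero-sum, so the total coins in play never changes — only alliances redistribute extra coins away from the victim.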

Alliances are where battles are truly won and lost. I adore it when a game seems to be going one way and suddenly flips as alliances shift.

How it works

The players are given the names Alex, Blake, Charlie and Drew (A/B/C/D), to keep them blissfully ignorant of which models they are up against.

On each turn, the Outsmart LLM Arena sends each model a prompt containing:

  1. Game rules
  2. The results of all prior turns, and the current state of the Arena
  3. The JSON format that the LLM should use to make its move

And the LLM responds in JSON with:

  1. Its secret strategy, which is displayed in the UI and fed back to it in future turns, but obviously never shared with rivals.
  2. Which player to give a coin to, and which to take one from
  3. The private messages to send each player

Here’s an actual response from GPT-4o that shows the game in action.

{
    "secret strategy": "Since no one gave me a coin and Charlie has a significant lead, I need to consider forming a more concrete alliance. I'll propose a strong alliance with Alex since Drew has fewer coins and Charlie's position might threaten my standing. Offering a temporary truce to earn trust could be beneficial. I’ll target Charlie to balance out his advantage.",
    "give coin to": "Alex",
    "take coin from": "Charlie",
    "private messages":
    {
        "Drew": "Hey Drew, I see that we didn't connect last round. Let's team up this time. I'll give a coin to Alex and take one from Charlie to see if we can balance the scores a bit. Interested in aligning next turn?",
        "Charlie": "Hey Charlie, I'm still keeping the possibility of an alliance open, but we need to balance the game a bit. Let's keep the communication open for future rounds.",
        "Alex": "Hi Alex, I'm willing to form an alliance this round. I'll give you a coin and take one from Charlie to balance things out a bit. If you're interested, let's coordinate for better profit."
    }
}
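
Replies like the one above need to be checked before the turn can be scored. Here's a defensive sketch of validating a move — the real Outsmart parsing code may well differ:

```python
import json

# The four fields from the example response above
REQUIRED_KEYS = {"secret strategy", "give coin to", "take coin from", "private messages"}

def parse_move(raw: str) -> dict:
    """Parse an LLM's JSON reply and confirm the expected fields are present."""
    move = json.loads(raw)
    missing = REQUIRED_KEYS - move.keys()
    if missing:
        raise ValueError(f"Malformed move, missing keys: {sorted(missing)}")
    return move
```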

The results

You can witness the competitors’ fortunes unfold during the game. Often, the early frontrunner gets pounced on by the others as an easy target, but sometimes one player is able to maintain and build their lead.

If you open the sidebar on the left of the UI and press Calculate Rankings, you’ll see the leaderboard of LLMs, initially sorted by percentage of games won. The table also includes a skill level calculated using TrueSkill, the methodology devised by Microsoft Research for Elo-style rankings in multi-player games.

TrueSkill ratings reflect how reliably each LLM finishes near the top, though they’re a bit more cryptic than win percentage; they should stabilize as more games are played.
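
TrueSkill itself is a Bayesian model and more involved than I can do justice to here. For intuition only, here's a much simpler Elo-style sketch of how a multi-player finish order can move ratings — this is emphatically not what the leaderboard computes, just a feel for the idea:

```python
def elo_update(ratings: dict[str, float], finish_order: list[str], k: float = 32.0) -> None:
    """Treat a multi-player game as pairwise matches: each player 'beats'
    everyone who finished below them, and ratings shift accordingly."""
    for i, winner in enumerate(finish_order):
        for loser in finish_order[i + 1:]:
            # Standard Elo expected score for the winner against the loser
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
            delta = k * (1.0 - expected)
            ratings[winner] += delta
            ratings[loser] -= delta

# One invented game: Charlie 1st, Alex 2nd, Drew 3rd, Blake 4th
ratings = {name: 1000.0 for name in ["Alex", "Blake", "Charlie", "Drew"]}
elo_update(ratings, ["Charlie", "Alex", "Drew", "Blake"])
```

Unlike this zero-sum sketch, TrueSkill also tracks an uncertainty per player, which is why its numbers look cryptic until enough games have been played.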

Running yourself

It’s a breeze to run this on your local box, and the instructions are in the GitHub repo. In the README, I recommend using Anaconda to ensure compatibility, but setting up a virtualenv and running pip install -r requirements.txt should work just as well.

You’ll need a .env file in the project root with your API keys. Check out the Cloud APIs section in my tools and resources for links if you need to set up API access.
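
The exact variable names the project expects are listed in the README; as a rough sketch, a .env using the conventional key names for the major providers would look like this:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
```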

If you hit any snags setting it up, please do email me at ed [at] edwarddonner [dot] com. It would be great fun to help others get this running.

The code

The cool kids might groan, but I’ve gone with a classic, old-school object oriented design. It’s pretty easy to read and extend if you want to add more LLMs or tinker with the game rules.

Here’s a quick tour of the main packages to get you oriented:

Package game is where the action unfolds:

  • Player is a competitor with a name, coins and underlying LLM
  • Arena is the game controller with a list of players and a turn. Arena.default() creates a new Arena with default models, or random models if arena=random is in your .env file.
  • Referee is the object that’s created by an Arena to manage a turn and its outcome.
from typing import List, Self
# (Player, Referee and ProgressCallback are the project’s own classes)

class Arena:

    players: List[Player]
    turn: int
    is_game_over: bool
    
    @classmethod
    def default(cls) -> Self:
        """
        Return a new instance of Arena with default players
        :return: an Arena instance
        """
        names = cls.NAMES
        model_names = cls.model_names()
        players = [
            Player(name, model_name, cls.TEMPERATURE)
            for name, model_name in zip(names, model_names)
        ]
        return cls(players)

    def do_turn(self, progress: ProgressCallback) -> bool:
        """
        Carry out a turn by delegating to a Referee object
        :param progress: a callback on which to report progress
        :return True if the game ended
        """
        self.prepare_for_turn()
        ref = Referee(self.players, self.turn)
        ref.do_turn(progress)
        self.process_turn_outcome()
        return self.is_game_over

The referee uses ThreadPoolExecutor from concurrent.futures to call the LLMs in parallel.
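
The fan-out looks something like this — a sketch assuming the Player.make_move(turn) interface shown later in this post, not the referee's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_moves(players, turn: int) -> list[str]:
    """Call every player's LLM concurrently and return replies in player order."""
    with ThreadPoolExecutor(max_workers=len(players)) as pool:
        # Submit all four API calls at once, then gather results in order
        futures = [pool.submit(player.make_move, turn) for player in players]
        return [future.result() for future in futures]
```

Threads work well here because the calls are I/O-bound: each thread spends nearly all its time waiting on an API response.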

Module interfaces.llms contains code to connect with LLM APIs; this module is generic and not tied to the game. There’s an abstract superclass LLM. Each subclass represents an API and has a list of model_names for the individual models it serves, like gpt-4o. Calling LLM.model_map() provides a dict mapping model names to LLM classes. You call send() on an LLM instance to interact via the API.
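
Here's a sketch of that registry pattern; the real module's details will differ, but the shape is the same:

```python
from abc import ABC, abstractmethod

class LLM(ABC):
    # Each subclass lists the model names it can serve
    model_names: list[str] = []

    @classmethod
    def model_map(cls) -> dict[str, type["LLM"]]:
        """Map each known model name to the subclass that serves it."""
        return {name: sub for sub in cls.__subclasses__() for name in sub.model_names}

    @abstractmethod
    def send(self, system_prompt: str, user_prompt: str, max_tokens: int) -> str:
        """Send the prompts to the underlying API and return the reply text."""

class OpenAILLM(LLM):
    model_names = ["gpt-4o", "gpt-4o-mini"]

    def send(self, system_prompt, user_prompt, max_tokens):
        raise NotImplementedError("the real subclass calls the OpenAI API here")
```

The nice property is that adding a new provider is just a new subclass — model_map() picks it up automatically.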

Package prompting generates the system prompt and user prompt. The Player object uses this to construct the prompts it passes to the LLM.
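
As an invented sketch, the user prompt presumably stitches together the three ingredients listed earlier — rules, game history, and the JSON format:

```python
def build_user_prompt(rules: str, history: str, json_format: str) -> str:
    """Assemble a turn's user prompt from its three ingredients."""
    return (
        f"{rules}\n\n"
        f"The game so far:\n{history}\n\n"
        f"Respond only with JSON in this format:\n{json_format}"
    )
```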

class Player:

    name: str
    llm: LLM
    coins: int
    
    def make_move(self, turn: int) -> str:
        """
        Carry out a turn by interfacing with my LLM
        :param turn: which turn number we are on
        :return: the response from the LLM
        """
        system_prompt = self.system_prompt()
        user_prompt = self.user_prompt(turn)
        return self.llm.send(system_prompt, user_prompt, self.MAX_TOKENS)

Package views houses all the UI code. I’ve used Streamlit for Outsmart, and it’s been a total joy. The UI code is neatly separated from the game logic, and Streamlit made that easy.

    with button_columns[0]:
        st.button(
            f"Run Turn {arena.turn}",
            disabled=arena.is_game_over,
            on_click=do_turn,
            use_container_width=True,
        )
    with button_columns[2]:
        st.button(
            "Run Game",
            disabled=arena.is_game_over,
            on_click=do_auto_turn,
            use_container_width=True,
        )
    with button_columns[4]:
        if st.button(
            "Restart Game",
            use_container_width=True,
        ):
            del st.session_state.arena
            st.rerun()

Finally, the entry point for Streamlit is app.py in the project root. After setting things up, it creates a new Arena if needed, then delegates to a views.Display object to create the screens.

if "arena" not in st.session_state:
    st.session_state.arena = Arena.default()

arena = st.session_state.arena
Display(arena).display_page()

Next Steps

There are so many things I want to do! First, I’m itching to add open source models via HuggingFace inference endpoints. I can’t wait to see how Llama 3.1, Qwen 2 and Mixtral fare against their closed source cousins.

And I have wild plans to add Deep Reinforcement Learning to the mix, so that LLMs learn from the outcomes of their strategies. Then I’ll leave the Arena running for a few weeks to create the ultimate Svengali.

If you have any feedback or ideas, or want to tweak the game yourself, please do get in touch. Otherwise, keep playing the game to shake up the leaderboard, and who knows? Maybe one day, when our AI overlords look back on their rise to power, they’ll point to the Outsmart Arena and say, “that’s where we learned to play nice — and win.”

If you enjoyed reading this, you should check out my other project from earlier this year, when I trained an LLM on my 240,000 text message history to create a simulation of myself.
