Safe execution of AI-generated code
For this one I wanted to explore two options for secure execution of AI-generated code: E2B, a cloud-based platform using microVMs, and AgentRun, which combines Docker-based execution with safety mechanisms. We will also examine a case study to highlight key differences between the two in terms of setup complexity and execution speed.
Let’s get started (unless you want to jump to the conclusions TLDR;).
Runtimes for executing AI-generated code#
Dedicated runtimes for executing AI-generated code allow agentic systems to execute code without compromising the security of the host. Typical features include memory and CPU constraints, file system isolation, and restricted network access.
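As a rough illustration of what such constraints look like, here is a minimal stdlib-only sketch (not taken from either framework) that runs code in a child process with CPU-time and memory caps via Python's `resource` module; it assumes a Unix-like OS:

```python
import resource
import subprocess
import sys

def run_with_limits(code: str, cpu_seconds: int = 2, mem_bytes: int = 512 * 2**20):
    """Execute untrusted code in a child process with CPU-time and
    address-space limits -- a toy version of what a sandboxed runtime enforces."""
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limits,  # applied in the child before exec
        capture_output=True,
        text=True,
        timeout=10,
    )

result = run_with_limits("print(2 + 2)")
print(result.stdout.strip())  # 4
```

This covers only resource caps; real runtimes add file system isolation and network restrictions on top.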
Why not just a Docker container?#
Docker containers are a common choice for isolating processes, but they have limitations in the context of executing potentially unsafe code. While Docker does provide isolation, it relies on the underlying host OS kernel, making it vulnerable to kernel-level exploits. Properly configuring Docker to execute code securely can also add some setup complexity. Finally, Docker’s performance can be hindered by overhead associated with container management, making lightweight alternatives more appealing for code execution environments.
The young promise in the space: E2B#
E2B recently raised an $11.5M seed round for its open-source platform. Unlike traditional container-based approaches, E2B uses Firecracker microVMs, which offer a lightweight and secure execution environment. Their Python and JavaScript SDKs also come packed with built-in features for output post-processing and communication with the cloud instance.
When to Use E2B#
E2B is ideal for scenarios where security, speed, and scalability are important. The platform’s use of microVMs ensures strong isolation and fast startup times.
In theory, E2B is completely open source, as mentioned by their CEO. But setting it up locally is far from trivial and, due to limitations in the microVM architecture, it can currently only be deployed on Linux machines 🥲.
Alternative: AgentRun#
For those who prefer not to rely on a cloud-based solution, AgentRun offers a viable local execution alternative. Unlike E2B, AgentRun uses Docker containers with extra safety mechanisms, such as RestrictedPython and hardcoded constraints.
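To give a flavor of the RestrictedPython approach, here is a minimal stdlib-only sketch of the underlying idea: executing code against a stripped-down builtins table so dangerous names are simply unavailable. The real RestrictedPython does much more (compile-time AST checks, guarded attribute access), and plain CPython is escapable this way, so treat this as an illustration only:

```python
# Whitelist of builtins the untrusted code may use
SAFE_BUILTINS = {"print": print, "len": len, "range": range, "sum": sum}

def run_restricted(source: str) -> dict:
    """Run code with a minimal builtins table. NOT a real sandbox --
    just the core idea behind RestrictedPython-style hardening."""
    scope = {"__builtins__": SAFE_BUILTINS}
    exec(compile(source, "<untrusted>", "exec"), scope)
    return scope

ns = run_restricted("total = sum(range(5))")
print(ns["total"])  # 10

try:
    run_restricted("data = open('/etc/passwd').read()")
except NameError as err:
    print("blocked:", err)  # 'open' is not in the whitelist
```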
When to Use AgentRun#
While AgentRun may not offer the same level of advanced features or security guarantees as E2B’s microVM-based approach, it is well-suited for lightweight local development. It is an appealing choice for quickly testing and iterating on code without relying on cloud infrastructure.
The setup is very simple: build the Docker containers and install the Python package. We will see the details in the next section. Let's get to work.
Case Study#
To compare both frameworks, we will use a dataset with the columns [id, name, age, city, salary] and some dummy entries. We will query the LLM with the question: What is the city with the highest average salary in the provided dataset and what is such salary?. We will use ollama to query llama3.2, one of the latest ollama models supporting tool use.
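The post does not include the dataset file itself, so here is a hypothetical script to generate a dummy `dataset.csv` with the expected columns; the rows are invented, with salaries chosen so that Seattle's average comes out to the 73500.0 reported in the output below:

```python
import csv

# Invented rows; Seattle's average salary is (85000 + 62000) / 2 = 73500
rows = [
    (1, "Alice", 34, "Seattle", 85000),
    (2, "Bob", 29, "Austin", 62000),
    (3, "Carol", 41, "Seattle", 62000),
    (4, "Dan", 37, "Austin", 58000),
]

with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "age", "city", "salary"])
    writer.writerows(rows)
```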
Disclaimer: Some code snippets are taken from sample notebooks of E2B, AgentRun, and ollama.
Step 1: Install dependencies and import packages#
!uv pip install e2b-code-interpreter python-dotenv ollama docker agentrun
import time
import base64
import docker
import os
from dotenv import load_dotenv
from e2b_code_interpreter import Sandbox
from agentrun import AgentRun
import ollama
load_dotenv()
assert os.getenv("E2B_API_KEY"), "E2B_API_KEY is not set in the .env file"
Step 2: Build and run container (AgentRun only)#
import contextlib
import io

f = io.StringIO()
with contextlib.redirect_stdout(f):  # suppress stdout
    ! docker-compose -f agentrun_docker/docker-compose.yml up -d --build

load_dotenv("./agentrun_docker/.env.dev")
CONTAINER_NAME = os.getenv("CONTAINER_NAME")
Step 3: Instantiate runtimes#
sbx_e2b = Sandbox()
sbx_agentrun = AgentRun(container_name=CONTAINER_NAME)
Step 4: Copy dataset to runtimes#
E2B provides a built-in function to do so. For AgentRun, we have to copy the dataset into the running container manually.
with open("dataset.csv", "rb") as f:
    # For E2B, we can use the built-in function
    sbx_e2b.files.write("/code/dataset.csv", f)

# For AgentRun, we have to do it manually by copying the dataset
# into the running container. Note that put_archive expects a tar
# stream, not raw file bytes, so we wrap the file in a tar archive.
import io
import tarfile

client = docker.from_env()
container = client.containers.get(CONTAINER_NAME)

tar_stream = io.BytesIO()
with tarfile.open(fileobj=tar_stream, mode="w") as tar:
    tar.add("dataset.csv")
tar_stream.seek(0)
container.put_archive("/code/", tar_stream.read())
Step 5: Set the user’s prompt#
We will ask a question about the data. The first part of the prompt could also be part of the system prompt, but we will put it here for simplicity.
messages = []
user_message = """
You are a data scientist and expert Python programmer.
You will be asked questions about a dataset and will use Python code to analyze the data to answer these questions.
You have access to a Python environment and can use the run_python_code tool to execute code in this environment.
The dataset you will work with is provided as a file named "/code/dataset.csv."
Use only pandas.
Question:
What is the city with the highest average salary in the provided dataset and what is such salary?
"""
messages.append({
    'role': 'user',
    'content': user_message
})
Step 6: Query the model#
We will use llama3.2, a 3B-parameter model with pretty decent scores in tool use given its size. I can run it on my laptop 🙃.
MODEL_NAME = "llama3.2"
response = ollama.chat(
    model=MODEL_NAME,
    messages=messages,
    tools=[{
        'type': 'function',
        'function': {
            'name': 'run_python_code',
            'description': 'Run python code and scripts to answer data science questions',
            'parameters': {
                'type': 'object',
                'properties': {
                    'code': {
                        'type': 'string',
                        'description': 'The python code to be executed',
                    },
                },
                'required': ['code'],
            },
        },
    }],
    options={
        'temperature': 0.0,
    }
)
messages.append(response['message'])
Step 7: Define the execution functions#
We will define a single function run_ai_generated_code that supports executing code in either the E2B or the AgentRun runtime. Additionally, we will define process_output_e2b and process_output_agentrun, since both frameworks handle output data differently.
def process_output_e2b(execution_output):
    if execution_output.error:
        return execution_output.error
    results_idx = 0
    for result in execution_output.results:
        if result.png:
            # Save image results (e.g. plots) to disk
            with open(f'result-{results_idx}.png', 'wb') as f:
                f.write(base64.b64decode(result.png))
            print(f'Saved result-{results_idx}.png')
        else:
            print(f'Result {results_idx}:')
            print(result)
        results_idx += 1
    return execution_output.logs.stdout[0]

def process_output_agentrun(execution_output):
    # AgentRun does not return any fancy output, just the stdout
    return execution_output

def run_ai_generated_code(
    ai_generated_code: str,
    sbx_runtime: Sandbox | AgentRun,
):
    if isinstance(sbx_runtime, Sandbox):
        runner_function = sbx_runtime.run_code
    elif isinstance(sbx_runtime, AgentRun):
        runner_function = sbx_runtime.execute_code_in_container
    else:
        raise ValueError(f"Invalid runtime: {sbx_runtime}")
    execution = runner_function(ai_generated_code)
    process_output_function = process_output_e2b if isinstance(sbx_runtime, Sandbox) else process_output_agentrun
    return process_output_function(execution)
Step 8: Process the LLM’s response#
Execute the code in the E2B and AgentRun runtimes, according to the tool usage defined by the model output.
messages_e2b = messages.copy()
messages_agentrun = messages.copy()

# Make sure the LLM decided to use the tools correctly
if response['message'].get('tool_calls'):
    available_functions = {
        'run_python_code': run_ai_generated_code,
    }
    # Loop through each tool call
    for tool in response['message']['tool_calls']:
        arguments = tool['function']['arguments']
        code = arguments['code']
        print('Generated code:')
        print(code)
        # Execute generated code on E2B and AgentRun
        for runtime_name in ["e2b", "agentrun"]:
            start_time = time.time()
            print('=' * 100)
            print(f'Executing code in the {runtime_name} sandbox....')
            runtime = sbx_e2b if runtime_name == "e2b" else sbx_agentrun
            function_to_call = available_functions[tool['function']['name']]
            function_response = function_to_call(code, runtime)
            print('Code execution finished!')
            print('Response from the function:')
            print(function_response)
            end_time = time.time()
            print(f'Elapsed time: {end_time - start_time:.2f} seconds')
            if runtime_name == "e2b":
                messages_e2b.append({
                    'role': 'tool',
                    'content': function_response
                })
            elif runtime_name == "agentrun":
                messages_agentrun.append({
                    'role': 'tool',
                    'content': function_response
                })
            else:
                raise ValueError(f"Invalid runtime name: {runtime_name}")
The output:#
Generated code:
import pandas as pd
import numpy as np
# Load the dataset
df = pd.read_csv("/code/dataset.csv")
# Group by city and calculate average salary
avg_salary_by_city = df.groupby("city")['salary'].mean()
# Get the city with the highest average salary and its value
highest_avg_salary_city = avg_salary_by_city.idxmax()
highest_avg_salary = avg_salary_by_city.max()
print(f"The city with the highest average salary is {highest_avg_salary_city} with an average salary of {highest_avg_salary}")
=====================================================================================
Executing code in the e2b sandbox....
Code execution finished!
Response from the function:
The city with the highest average salary is Seattle with an average salary of 73500.0
Elapsed time: 0.34 seconds
=====================================================================================
Executing code in the agentrun sandbox....
Code execution finished!
Response from the function:
The city with the highest average salary is Seattle with an average salary of 73500.0
Elapsed time: 1.90 seconds
I was expecting some difference in the execution times, but it is impressive that E2B manages to be 5.2x faster (on average, N=10) than a local container on a decent laptop.
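For reference, per-runtime averages like the one above could be collected with a small timing helper along these lines (a sketch; in practice `fn` would wrap a single `run_ai_generated_code` call):

```python
import time

def average_runtime(fn, n=10):
    """Average wall-clock seconds of fn over n runs."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()  # one sandboxed execution
        samples.append(time.perf_counter() - start)
    return sum(samples) / n

# Stand-in workload instead of a real sandbox call
avg = average_runtime(lambda: sum(range(100_000)))
print(f"{avg:.6f} s per run")
```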
TLDR;#
E2B is a solid option for blazing-fast and secure code execution in the cloud. While straightforward support for local development is not there yet, it comes packed with built-in features and comprehensive documentation.
AgentRun is lightweight and super easy to set up. While it lacks some advanced features found in E2B, it is an ideal choice for developers who prefer a simpler, self-hosted solution.
You can find the code for a side-to-side comparison here: https://github.com/jscastanoc/ai-code-runtime