Running CheXagent Locally: a VLM on a MacBook

Have you heard about Draft Reporting and Vison Language Models (VLMs) but thought it required a GPU cluster? The models are more approachable than ever, if you have a modern MacBook you try it there.

In this guide, we'll use Gradio to build a local web interface for CheXagent, Stanford AIMI's 3-billion parameter vision-language model trained specifically for radiology report generation. The result: upload a chest X-ray, get structured findings in seconds.

CheXagent is a research model. Always consult a qualified radiologist for clinical interpretation.

What You'll Build

Upload a chest X-ray, get findings like this:

Lungs and Airways:
- Complete collapse of the right lung
- Left lung is clear

Pleura:
- Large right pneumothorax

Cardiovascular:
- Mild leftward shift of the mediastinum

Tubes, Catheters, and Support Devices:
- Endotracheal tube in place, terminating 3.5 cm above the carina
- Left internal jugular central venous catheter terminating in the upper SVC
- Enteric tube coursing below the diaphragm, tip not visualized

Musculoskeletal and Chest Wall:
- No acute osseous abnormalities

CheXagent Pneumothorax

Case courtesy of Andrew Ho, Radiopaedia.org. From the case rID: 23278

Requirements

Apple Silicon Mac (M1/M2/M3/M4/M5) with 16GB+ unified memory
macOS 13+ (Ventura or later)

First run downloads ~6GB of model weights from HuggingFace Hub.

Critical: Pin transformers to 4.39.0

The model will silently break with newer transformers versions.

Versions 4.44+ cause CheXagent to return generic, clinically useless findings ("The lungs are unremarkable") instead of detecting actual pathologies. This is a library compatibility issue with how newer transformers handles vision-language models with custom code—the model weights are fine, but the inference pipeline breaks.

Pin to 4.39.0 or you'll spend hours debugging phantom issues.

Setup

mkdir chexagent-local && cd chexagent-local

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Initialize project with Python 3.12
uv init --python 3.12

Install dependencies:

# Core ML
uv add torch torchvision
uv add "transformers==4.39.0"  # CRITICAL: Do not upgrade
uv add accelerate

# Gradio UI
uv add gradio

# Image processing (required by model's custom code)
uv add opencv-python Pillow

# Vision-language model dependencies
uv add timm einops sentencepiece "protobuf>=3.20,<5.0"
uv add pyarrow ftfy regex

# Required by model's custom HuggingFace code (imported at load time)
uv add matplotlib albumentations

The App

Save as app.py:

"""CheXagent Gradio interface for chest X-ray findings generation."""

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "StanfordAIMI/CheXagent-2-3b-srrg-findings"
MODEL_REVISION = "d18fab171a51e6e6a3f9d313f608b999bcdeb75a"  # Pin for reproducibility
PROMPT = "Structured Radiology Report Generation for Findings Section"


DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"

print(f"Loading model on {DEVICE}...")
print("(First run downloads ~6GB of model weights)")

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, revision=MODEL_REVISION, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    revision=MODEL_REVISION,
    device_map="auto",
    trust_remote_code=True,
).eval()

print("Model loaded!")


def generate_findings(image_path: str) -> str:
    """Generate radiology findings from a chest X-ray image."""
    if image_path is None:
        return "Please upload an image."

    query = tokenizer.from_list_format([
        {"image": image_path},
        {"text": PROMPT}
    ])

    conversation = [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": query},
    ]

    input_ids = tokenizer.apply_chat_template(
        conversation, add_generation_prompt=True, return_tensors="pt"
    )

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids.to(DEVICE),
            do_sample=False,
            max_new_tokens=512,
            use_cache=True,
            pad_token_id=tokenizer.eos_token_id,
        )[0]

    return tokenizer.decode(
        output_ids[input_ids.size(1):-1], skip_special_tokens=True
    )


demo = gr.Interface(
    fn=generate_findings,
    inputs=gr.Image(type="filepath", label="Chest X-Ray"),
    outputs=gr.Textbox(label="Findings", lines=10),
    title="CheXagent: Chest X-Ray Findings Generator",
    description="Upload a chest X-ray to generate structured radiology findings.",
)

if __name__ == "__main__":
    demo.launch()

Run It

uv run python app.py

Open http://localhost:7860 in your browser. Upload a chest X-ray and click Submit.

Performance:

First inference after startup: 30-60 seconds (model warmup)
Subsequent inferences: 15-30 seconds on Apple M4

CheXagent Rib Fracture

Case courtesy of Henry Knipe, Radiopaedia.org. From the case rID: 31240

Troubleshooting

"mutex lock failed" on macOS: TensorFlow conflict with Metal. Uninstall it:

uv pip uninstall tensorflow tensorflow-macos

Generic/vague findings: Wrong transformers version. Verify you're on 4.39.0:

uv run python -c "import transformers; print(transformers.__version__)"
# Must print: 4.39.0

If it prints anything else, force reinstall:

uv add "transformers==4.39.0" --reinstall

Out of memory: Close other applications. 16GB Macs should handle the 3B model comfortably, but if you're tight on memory, quit browsers and other heavy apps before running.

Not on a Mac? This guide targets Apple Silicon, but the model runs on NVIDIA GPUs and CPU-only machines too. Change device_map and consider setting torch_dtype explicitly for your hardware. See the HuggingFace model card for details.

What's Next?

You just ran Stanford's vision-language model on commodity hardware. Two years ago, this required a specialized GPU.

In the LinkedIn Comments, let me know what you're interested in next:

Avoid the Dreaded "Wall of Text" - learn how Presto integrates AI into your workflow today.
Fine-tune VLM models - how to use a base model like chexagent-srrg and fine-tune.
Process DICOM studies - converting DICOM to model-ready format has a few tricks.
Compare models - try swapping in other vision-language models to compare output quality.

Resources: