Worker Safety with Cosmos Reason 2

Automating Safety Inspections in ‘Classical’ Warehouses with Zero-Shot Video Reasoning

Computer-Vision
Industrial-AI
NVIDIA-Cosmos
Author

Renan Monteiro Barbosa

This post explores how to leverage NVIDIA Cosmos Reason 2 to automate safety inspections in challenging “brownfield” environments using zero-shot video reasoning.

In the world of industrial automation, modern “greenfield” factories are often the poster children for AI integration—pristine, well-lit, and perfectly marked. However, the reality for many businesses is the “brownfield” environment: classical warehouses with irregular layouts, faded floor markings, and worn infrastructure. In these settings, traditional computer vision models often struggle, confusing a faded yellow line with a safe walkway or failing to account for the messy visual noise of a working plant.

Important Disclosure

This post is a work in progress and a first attempt at automating content creation with agentic AI, so some steps may not be 100% accurate. Several improvements are still required.

The Goal: Zero-Shot Safety Compliance

The primary objective is to classify worker behaviors based on strict visual ground truths without the need for extensive model fine-tuning. By leveraging the reasoning capabilities of a Video Language Model (VLM), we can instruct the system to act as an “Expert Inspector” that ignores environmental noise and focuses on specific safety protocols.

Key Challenges Addressed:

  • Environmental Noise: Faded paint, unpainted areas, and complex backgrounds.
  • 2D Projection Limits: Camera angles that make workers appear “on” a path when they are actually walking alongside it.
  • Strict Compliance: Binary safety rules (e.g., “vest on” vs. “vest off”) applied to messy, real-world data.

Prerequisites and Setup

To run this pipeline, you’ll need an NVIDIA GPU with CUDA support (tested on NVIDIA RTX PRO 5000 Blackwell with CUDA 13.0).

1. System Setup (Ubuntu)

Install the necessary system packages:

```bash
sudo apt-get update && sudo apt-get install -y curl ffmpeg git git-lfs
git lfs install
```

2. Environment Configuration

Clone the repository:

```bash
git clone https://github.com/nvidia-cosmos/cosmos-reason2.git
cd cosmos-reason2
git lfs pull
```

You can set up the Python environment using either uv (recommended for speed) or conda.

Install uv if you haven’t already:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

Create the environment with CUDA 13.0 support and install extras:

```bash
# This creates a .venv directory inside the repo
uv sync --extra cu130
source .venv/bin/activate

# Install visualization and notebook tools
uv pip install -U fiftyone jupyterlab ipykernel
```

If you prefer conda, follow these steps to create a compatible environment:

```bash
# Create and activate the environment
conda create -n cosmos-reason2 python=3.12 -y
conda activate cosmos-reason2

# Install the package with CUDA 13.0 extras
pip install -e ".[cu130]"

# Install visualization and notebook tools
pip install -U fiftyone jupyterlab ipykernel
```

3. Jupyter Kernel Registration

To ensure your environment is selectable within JupyterLab, register it as a kernel:

```bash
python -m ipykernel install --user --name cosmos-reason2 --display-name "Python (cosmos-reason2)"
```

4. Authentication

You must authenticate with Hugging Face to download the Cosmos Reason 2 models:

```bash
# If using uv
uvx hf auth login

# If using conda/pip
pip install huggingface_hub
huggingface-cli login
```

5. Running Scripts from Notebooks

When running inference scripts from a notebook (e.g., in the notebooks/ directory), ensure you handle relative paths correctly to avoid “file not found” errors:

```{python}
# Recommended: Change directory to repo root before running scripts
%cd ..
!python scripts/inference_sample.py
```

The “Expert Inspector” Prompt Strategy

The core of this solution lies in prompt engineering. Instead of training a classifier, we provide the model with a “persona” and a strict set of rules.

System Instructions (The Persona)

We define a system prompt that includes Critical Negative Constraints to handle the “brownfield” limitations:

```{python}
SYSTEM_INSTRUCTIONS = """
You are an expert Industrial Safety Inspector monitoring a manufacturing facility.
Your goal is to classify the video into EXACTLY ONE of the 8 classes defined below.

CRITICAL NEGATIVE CONSTRAINTS (What to IGNORE):
1. IGNORE SITTING WORKERS:
 - If a person is SITTING at a machine board working, this is NOT an intervention class. Ignore them.
 - If a person is SITTING driving a forklift, the driver is NOT the class. Focus only on the LOAD carried.
2. IGNORE BACKGROUND:
 - The facility is old. Do not report hazards based on faded floor markings or unpainted areas.
3. SINGLE OUTPUT:
 - Even if multiple things happen, choose the MOST PROMINENT behavior.
 - Prioritize UNSAFE behaviors over SAFE behaviors if both are present.
"""
```

User Prompt (The Rules)

We provide a strict classification table in the prompt, mapping visual definitions to specific IDs:

```{python}
USER_PROMPT_CONTENT = """
Analyze the video and output a JSON object. You MUST select the class ID and Label EXACTLY from the table below.

STRICT CLASSIFICATION TABLE (Use these exact IDs and Labels):

| ID | Label | Definition (Ground Truth) | Hazard Status |
| :--- | :--- | :--- | :--- |
| 0 | Safe Walkway Violation | Worker walks OUTSIDE the designated Green Path. | TRUE (Unsafe) |
| 4 | Safe Walkway | Worker walks INSIDE the designated Green Path. | FALSE (Safe) |
| 1 | Unauthorized Intervention | Worker interacts with machine board WITHOUT a green vest. | TRUE (Unsafe) |
| 5 | Authorized Intervention | Worker interacts with machine board WITH a green vest. | FALSE (Safe) |
| 2 | Opened Panel Cover | Machine panel cover is left OPEN after intervention. | TRUE (Unsafe) |
| 6 | Closed Panel Cover | Machine panel cover is CLOSED after intervention. | FALSE (Safe) |
| 3 | Carrying Overload with Forklift | Forklift carries 3 OR MORE blocks. | TRUE (Unsafe) |
| 7 | Safe Carrying | Forklift carries 2 OR FEWER blocks. | FALSE (Safe) |

INSTRUCTIONS:
1. Identify the behavior in the video.
2. Match it to one row in the table above.
3. Output the exact "ID" and "Label" from that row.

OUTPUT FORMAT:
{
 "prediction_class_id": [Integer from Table],
 "prediction_label": "[Exact String from Table]",
 "video_description": "[Concise description of the observed action]",
 "hazard_detection": {
 "is_hazardous": [true/false based on the Hazard Status column],
 "temporal_segment": "[Start Time - End Time] or null"
 }
}
"""
```
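Because VLM outputs occasionally drift from the table (a valid ID paired with the wrong label, or an inconsistent hazard flag), it can help to validate each parsed prediction against the ground-truth mapping before storing it. The snippet below is a minimal sketch: `CLASS_TABLE` and `validate_prediction` are illustrative names, and the dict simply re-encodes the table from the prompt.

```python
# Ground-truth table from the prompt: ID -> (label, is_hazardous)
CLASS_TABLE = {
    0: ("Safe Walkway Violation", True),
    4: ("Safe Walkway", False),
    1: ("Unauthorized Intervention", True),
    5: ("Authorized Intervention", False),
    2: ("Opened Panel Cover", True),
    6: ("Closed Panel Cover", False),
    3: ("Carrying Overload with Forklift", True),
    7: ("Safe Carrying", False),
}

def validate_prediction(pred: dict) -> list[str]:
    """Return a list of consistency errors (empty list means the output is valid)."""
    errors = []
    class_id = pred.get("prediction_class_id")
    if class_id not in CLASS_TABLE:
        return [f"unknown class id: {class_id!r}"]
    expected_label, expected_hazard = CLASS_TABLE[class_id]
    if pred.get("prediction_label") != expected_label:
        errors.append(f"label mismatch for id {class_id}")
    if pred.get("hazard_detection", {}).get("is_hazardous") != expected_hazard:
        errors.append(f"hazard flag mismatch for id {class_id}")
    return errors
```

Predictions that fail validation can be flagged for manual review instead of being written into the dataset silently.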

Step-by-Step Inference Process

The following steps walk through the complete pipeline, from environment verification to batch inference and visualization. They are adapted from the official worker_safety.ipynb notebook.

1. Basic Imports and Setup

We start by importing the necessary libraries for path handling and dataset management.

```{python}
import pathlib
import shutil
import os
import torch
import fiftyone as fo
import fiftyone.utils.huggingface as fouh
import json
import warnings
from pathlib import Path
import transformers

warnings.filterwarnings("ignore")
```

2. Dataset Loading and Sample Preparation

We load the dataset from Hugging Face and prepare a reference video for testing.

```{python}
ROOT = pathlib.Path.cwd()
ASSETS = ROOT / "assets"
ASSETS.mkdir(exist_ok=True)

# Load dataset and copy the first video to assets/sample.mp4
dataset = fouh.load_from_hub("pjramg/Safe_Unsafe_Test", persistent=True)

sample = dataset.first()

if sample is None:
    raise RuntimeError("Dataset is empty")
    
dst = ASSETS / "sample.mp4"
shutil.copy2(sample.filepath, dst)
print("Reference video copied to:", dst)
```

3. CUDA / Environment Sanity Check

Before running heavy inference, it’s critical to verify that the GPU is correctly detected.

```{python}
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device count:", torch.cuda.device_count())
    print("Device name:", torch.cuda.get_device_name(0))
```
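If you want to go one step further than an availability check, the compute capability reported by `torch.cuda.get_device_capability()` can drive the precision choice: Ampere (SM 8.0) and newer GPUs support bfloat16, which is generally more numerically stable than float16 for VLM inference. The helper below is a hedged sketch; `pick_dtype` is an illustrative name and takes the capability tuple as plain arguments so the logic can be followed without a GPU.

```python
def pick_dtype(major: int, minor: int) -> str:
    """Choose a precision string from the GPU compute capability.

    Ampere (SM 8.0) and newer support bfloat16; older GPUs fall back
    to float16.
    """
    return "bfloat16" if (major, minor) >= (8, 0) else "float16"

# Usage on a CUDA machine (illustrative):
# dtype_name = pick_dtype(*torch.cuda.get_device_capability(0))
```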

4. Model Initialization

We define a helper function to load the Cosmos-Reason2-2B model with the appropriate precision and token limits.

```{python}
def load_model():
    model_name = "nvidia/Cosmos-Reason2-2B"
    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name, dtype=torch.float16, device_map="auto", attn_implementation="sdpa"
    )
    processor = transformers.Qwen3VLProcessor.from_pretrained(model_name)

    # Pixel token limits
    PIXELS_PER_TOKEN = 32**2
    min_vision_tokens, max_vision_tokens = 256, 8192
    processor.image_processor.size = processor.video_processor.size = {
        "shortest_edge": min_vision_tokens * PIXELS_PER_TOKEN,
        "longest_edge": max_vision_tokens * PIXELS_PER_TOKEN,
    }
    return model, processor
```
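To make the token limits above concrete: each vision token covers a 32x32 pixel patch, so the token bounds translate directly into per-frame pixel-area budgets. The `pixel_budget` helper below is purely illustrative arithmetic, not part of the processor API.

```python
PIXELS_PER_TOKEN = 32 ** 2  # each vision token covers a 32x32 pixel patch

def pixel_budget(min_tokens: int = 256, max_tokens: int = 8192) -> tuple[int, int]:
    """Convert vision-token limits into the pixel-area limits the processor expects."""
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN

lo, hi = pixel_budget()
# 256 tokens  -> 262,144 pixels per frame (roughly 512x512)
# 8192 tokens -> 8,388,608 pixels per frame
```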

5. Batch Processing and Inference

This is the core loop where we iterate through the FiftyOne dataset, apply our prompt strategy, and store the model’s reasoning back into the database.

```{python}
# Setup Model
model, processor = load_model()
transformers.set_seed(0)

print(f"Processing {len(dataset)} videos...")

for sample in dataset.iter_samples(progress=True):
    video_path = sample.filepath

    conversation = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_INSTRUCTIONS}]},
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": USER_PROMPT_CONTENT},
            ],
        },
    ]

    try:
        inputs = processor.apply_chat_template(
            conversation,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt",
            fps=4,
        ).to(model.device)

        # Generate Reasoning
        generated_ids = model.generate(**inputs, max_new_tokens=1024)
        generated_ids_trimmed = [
            out_ids[len(in_ids) :]
            for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )[0]

        # Parse and Save Result
        clean_json = output_text.strip().replace("```json", "").replace("```", "")
        json_data = json.loads(clean_json)

        sample["cosmos_analysis"] = json_data
        sample["safety_label"] = fo.Classification(
            label=json_data.get("prediction_label"),
        )
        sample.save()
        
    except Exception as e:
        print(f"Inference/Parsing failed for {video_path}: {e}")
```
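The fence-stripping above works when the model returns only a JSON block, but reasoning models often wrap the JSON in explanatory prose. A slightly more defensive parser is sketched below, under the assumption that the reply contains exactly one top-level JSON object; `extract_json` is an illustrative helper, not part of the Cosmos API.

```python
import json
import re

def extract_json(text: str) -> dict:
    """Extract the first JSON object from model output.

    More robust than stripping code fences: it tolerates prose or
    reasoning text before and after the JSON block.
    """
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```

Swapping `json.loads(clean_json)` in the loop for `extract_json(output_text)` makes the parsing step less sensitive to how the model formats its reply.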

6. Visualization

Finally, we launch the FiftyOne App to audit the results and inspect the “Expert Inspector” reasoning for each clip.

```{python}
session = fo.launch_app(dataset)
session.wait()
```

Conclusion

This approach demonstrates that you don’t always need a custom-trained model to solve complex industrial safety problems. By using Cosmos Reason 2 and a well-crafted “Expert Inspector” prompt, you can build a safety monitoring system that is robust enough for the messy reality of classical warehouses.

For more details and the full source code, check out the NVIDIA Cosmos Cookbook.[1]

References

1. NVIDIA. (2024). NVIDIA Cosmos Cookbook: Worker safety in a classical warehouse with Cosmos Reason 2. https://github.com/nvidia-cosmos/cosmos-cookbook