March 31, 2026

Built-in Auto-Labeling: Every File Comes Pre-Annotated

No separate labeling pipeline. No Scale AI contract. Annotations ship with every file.

The Problem: Collection and Labeling Are Two Separate Pipelines

Collecting data is only half the work. ML teams collect thousands of photos, hours of audio, gigabytes of video — and then they need to label it. Object detection bounding boxes. Transcription. Scene classification. OCR. Every annotation type means a second pipeline with a second vendor: Scale AI, Label Studio, Labelbox, Roboflow, or another round of crowdsourced microtasks.

This two-stage approach has real costs. A second vendor relationship. A second API integration. A second quality review process. A second budget line item. And latency — your data sits in a queue between collection and annotation, sometimes for days. Teams building AI training data pipelines spend as much time managing the labeling stage as they do collecting the raw media.

FirstHandAPI eliminates the second stage entirely for the most common annotation types. Every approved file ships with structured annotation metadata, generated in the same AI scoring pass that determines quality. No separate data labeling contract. No additional API calls. No wait.

How Auto-Labeling Works

When a worker submits a file to your data collection job, the AI scoring ensemble does two things simultaneously. First, it scores the submission for quality on a 1–5 star scale. Second, it generates structured annotation metadata based on the content type. Both happen in the same pipeline pass, using the same AI models that are already analyzing the file.

For images, Claude Vision analyzes the full frame and produces object labels, scene classification, OCR text extraction, color palettes, and dominant object positions. For audio, OpenAI Whisper generates a full transcript with speaker diarization, language detection, and keyword extraction. For video, ffmpeg samples keyframes, Claude Vision annotates each frame, and Whisper transcribes the audio track.

The marginal cost of annotation is minimal — it is additional output tokens on API calls that are already happening for quality scoring. That is why auto-labeling is included at no extra charge. The annotations are attached to the file object and available via API as soon as the file is approved.

Image Annotations

Every approved image includes the following annotation fields, generated by Claude Vision analyzing the full image in a single pass.

{
  "annotations": {
    "objects": [
      { "label": "storefront", "position": "center", "coverage": 0.45 },
      { "label": "signage", "position": "upper-center", "coverage": 0.15 },
      { "label": "awning", "position": "upper-left", "coverage": 0.12 },
      { "label": "pedestrian", "position": "right", "coverage": 0.08 }
    ],
    "scene": {
      "setting": "urban retail street",
      "indoor": false
    },
    "text_extraction": {
      "ocr_text": "SUNNY SIDE CAFE\nOpen Daily 7am-6pm\nFresh Roasted Coffee",
      "language": "en"
    },
    "color_palette": ["#4A90D9", "#F5F5DC", "#2C3E50", "#E74C3C", "#FFFFFF"],
    "dominant_objects": [
      { "label": "storefront", "position": "center", "coverage": 0.45 }
    ],
    "composition": "centered subject with leading lines"
  }
}

objects provides detected objects with spatial position descriptions and approximate frame coverage. scene gives a high-level setting classification and whether the image is indoor or outdoor. text_extraction runs OCR on any visible text — useful for menus, signage, documents, and product packaging. color_palette extracts dominant colors as hex values. composition describes the visual layout of the image.
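These fields are plain JSON, so filtering a batch of approved images needs no extra tooling. The sketch below works directly on annotation dicts shaped like the example above; the helper names (`images_with_object`, `all_ocr_text`) are illustrative, not part of the SDK.

```python
def images_with_object(files, label, min_coverage=0.1):
    """Yield files whose annotations include `label` at or above min_coverage."""
    for f in files:
        ann = f.get("annotations") or {}
        for obj in ann.get("objects", []):
            if obj["label"] == label and obj["coverage"] >= min_coverage:
                yield f
                break  # one match per file is enough

def all_ocr_text(files):
    """Collect non-empty OCR text from a batch of annotated images."""
    texts = []
    for f in files:
        ann = f.get("annotations") or {}
        ocr = ann.get("text_extraction", {}).get("ocr_text")
        if ocr:
            texts.append(ocr)
    return texts
```

For example, `images_with_object(files, "storefront", min_coverage=0.3)` keeps only images where a storefront fills at least 30% of the frame.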

Audio Annotations

Audio files are processed by OpenAI Whisper for transcription and speaker analysis. The annotation schema includes transcript segments with timestamps, speaker diarization, and content metadata.

{
  "annotations": {
    "speaker_count": 2,
    "language": "en",
    "topics": ["product feedback", "user experience", "checkout flow"],
    "keywords": ["fast checkout", "search", "recommendations", "mobile app"],
    "noise_level": "low",
    "transcript_segments": [
      {
        "speaker": "speaker_0",
        "start": 0.0,
        "end": 8.2,
        "text": "I have been using this app for about three months now and I think the main thing I like is how fast the checkout process is."
      },
      {
        "speaker": "speaker_1",
        "start": 8.5,
        "end": 12.4,
        "text": "What about the search? Have you had any issues finding products?"
      },
      {
        "speaker": "speaker_0",
        "start": 12.8,
        "end": 18.1,
        "text": "Yeah the search could definitely be better. I usually just browse the recommendations instead."
      }
    ]
  }
}

speaker_count and transcript_segments provide diarization — who spoke when, with precise timestamps. language uses ISO 639-1 codes for detected language. topics and keywords summarize the content for filtering and search. noise_level indicates ambient recording conditions. This is particularly useful for feedback collection jobs where you want to analyze sentiment per speaker, or interview recordings with multiple participants.
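Because segments carry both speaker labels and timestamps, per-speaker analysis is a short loop over the annotation dict. A minimal sketch, assuming segments shaped like the example above (the helper names are illustrative):

```python
from collections import defaultdict

def talk_time_by_speaker(annotations):
    """Sum speaking duration per diarized speaker from transcript_segments."""
    totals = defaultdict(float)
    for seg in annotations.get("transcript_segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)

def text_for_speaker(annotations, speaker):
    """Concatenate everything one speaker said, e.g. for per-speaker sentiment."""
    return " ".join(
        seg["text"]
        for seg in annotations.get("transcript_segments", [])
        if seg["speaker"] == speaker
    )
```

Feeding `text_for_speaker(ann, "speaker_0")` into a sentiment model gives you per-participant sentiment without re-running transcription.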

Video Annotations

Video files get the richest annotations. The pipeline uses ffmpeg to extract keyframes at scene boundaries, Claude Vision to analyze each frame, and Whisper to transcribe the audio track.

{
  "annotations": {
    "scenes": [
      {
        "description": "Close-up of product packaging held in hand, front label visible",
        "timestamp": 0.0
      },
      {
        "description": "Package rotated to show barcode and nutrition label on back",
        "timestamp": 3.5
      },
      {
        "description": "Wide shot placing product on kitchen counter next to competitors",
        "timestamp": 8.2
      }
    ],
    "actions": ["holding", "rotating", "placing", "comparing"],
    "object_tracking": [
      { "label": "product_packaging", "scene_indices": [0, 1, 2] },
      { "label": "barcode", "scene_indices": [1] },
      { "label": "kitchen_counter", "scene_indices": [2] }
    ],
    "keyframe_descriptions": [
      "Product front label in sharp focus, natural lighting",
      "Barcode and nutrition facts clearly readable",
      "Three competing products arranged side by side"
    ],
    "transcript_segments": [
      {
        "start": 0.0,
        "end": 3.2,
        "text": "Here is the front of the box."
      },
      {
        "start": 3.5,
        "end": 7.8,
        "text": "And here you can see the barcode and nutrition label on the back."
      },
      {
        "start": 8.2,
        "end": 14.1,
        "text": "Compared to these two competitors, the packaging is noticeably smaller."
      }
    ]
  }
}

scenes segments the video by visual transitions with natural language descriptions and timestamps. actions lists detected activities across the entire clip. object_tracking maps detected objects to the scene indices where they appear, so you can trace an object across the video. keyframe_descriptions provides a human-readable summary of each sampled frame. transcript_segments covers the audio track with timestamps, ready for subtitle generation or search indexing.
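Since `transcript_segments` already carry start/end timestamps, converting them to SubRip subtitles is mechanical. A sketch of that conversion (the `to_srt` helper is illustrative, not an SDK method):

```python
def to_srt(segments):
    """Render transcript_segments as an SRT subtitle file string."""
    def ts(seconds):
        # SRT timestamps are HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}")
    return "\n\n".join(blocks) + "\n"
```

Writing the result to a `.srt` file next to the downloaded video gives you ready-made subtitles for review tooling or search indexing.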

Accessing Annotations

Annotations are part of the file object. Every method that returns files includes the annotation data automatically. No separate endpoint, no additional API call.

REST API

GET /v1/jobs/job_01J5K9.../files?status=approved

{
  "data": [
    {
      "id": "file_01J5KA...",
      "content_type": "image",
      "status": "approved",
      "score": 4,
      "download_url": "https://files.firsthandapi.com/...",
      "annotations": { ... }
    }
  ]
}

TypeScript SDK

import FirstHandAPI from '@firsthandapi/sdk';

const fh = new FirstHandAPI({ apiKey: process.env.FIRSTHAND_API_KEY! });

const { data: files } = await fh.files.list({
  job_id: 'job_01J5K9...',
  status: 'approved',
});

for (const file of files) {
  console.log(file.content_type);                          // "image"
  console.log(file.annotations.objects);                   // detected objects
  console.log(file.annotations.text_extraction.ocr_text);  // extracted text
  console.log(file.download_url);                          // pre-signed URL
}

Python SDK

import os

from firsthandapi import FirstHandAPI

fh = FirstHandAPI(api_key=os.environ["FIRSTHAND_API_KEY"])

files = fh.files.list(job_id="job_01J5K9...", status="approved")

for file in files.data:
    print(file.content_type)                # "image"
    print(file.annotations["objects"])       # detected objects
    print(file.annotations["text_extraction"]["ocr_text"])  # extracted text

MCP Server

If you are using @firsthandapi/mcp-server with Claude Code or Cursor, the get_job_files tool returns files with annotations included. Your AI agent can read and reason about the annotation metadata directly.

What This Replaces

For the most common annotation use cases, FirstHandAPI auto-labeling eliminates the need for a separate data labeling tool. Here is what you no longer need a dedicated vendor for:

| Tool | What it does | Replaced by auto-labeling? |
| --- | --- | --- |
| Scale AI | Managed data annotation at scale | For object identification, OCR, transcription — yes |
| Labelbox | Labeling platform with workforce management | For classification and text extraction — yes |
| Label Studio | Open-source annotation tool | For basic annotation workflows — yes |
| Roboflow | Computer vision dataset management | For object identification and scene classification — yes |
| Supervisely | Computer vision annotation platform | For basic labeling needs — yes |
| V7 | AI-native labeling with auto-annotation | For classification and transcription — yes |

Important caveat: for pixel-precise bounding boxes, segmentation masks, polyline annotations (lane detection), or domain-specific labeling schemas (medical imaging, satellite imagery), dedicated labeling tools still have a critical role. FirstHandAPI auto-labeling covers object identification, OCR, scene classification, and transcription — the annotations that can be reliably generated by foundation models. For everything else, the dedicated tools remain the right choice.

Limitations and Accuracy

Auto-labeling is best-effort, generated by foundation models (Claude Vision, OpenAI Whisper) rather than human annotators. You should be aware of the following constraints:

  • Spatial descriptions, not pixel coordinates. Object positions are described as regions (“upper-left”, “center”) with approximate coverage percentages, not bounding box pixel coordinates. This is sufficient for filtering and search, but not for training object detection models that need precise spatial annotations.
  • Transcript accuracy depends on audio quality. Whisper performs well on clear speech but degrades with heavy background noise, overlapping speakers, or uncommon accents. The noise_level field helps you filter accordingly.
  • Annotations are null for policy violations. If a file is flagged for content policy violations during scoring, annotation fields will be null. The file will also be rejected (1-star score), so it will not appear in your approved files.
  • Confidence varies by content. Object labels for common objects (people, vehicles, buildings, text) are highly reliable. Niche or domain-specific objects may have lower confidence or be described in generic terms.
  • Video annotations sample keyframes. Not every frame is analyzed. The pipeline uses scene-change detection to select representative keyframes, so brief moments between cuts may not be annotated.

For most AI training data workflows — dataset filtering, content search, basic classification, transcription — these annotations are production-ready. For precision labeling tasks, treat them as a strong starting point that reduces manual annotation effort by 60-80%.
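One practical consequence of these constraints: gate files on both the quality score and the `noise_level` annotation before they enter a training set. A minimal sketch, assuming file dicts shaped like the API response above (`production_ready` is an illustrative helper, not an SDK method):

```python
def production_ready(files, min_score=4, allowed_noise=("low", "medium")):
    """Keep files whose score and recording conditions meet a quality bar.

    Audio and video files expose a noise_level annotation; images do not,
    so files without the field pass the noise check automatically.
    """
    kept = []
    for f in files:
        if f["score"] < min_score:
            continue
        noise = (f.get("annotations") or {}).get("noise_level")
        if noise is not None and noise not in allowed_noise:
            continue
        kept.append(f)
    return kept
```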

See auto-labeling in action

Post a test job, upload a sample file, and inspect the annotation metadata yourself. Read the quickstart tutorial or check the API reference for the full annotation schema.