logo-detector-rfdetr

Two-stage brand-logo detector for 7 brands: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban.

The pipeline started as a zero-shot detector (Stage 1) and was then replaced by a fine-tuned object-detection head (Stage 2). Both stages are shipped in this repository:

| Stage | Model | Supervision | Output |
| --- | --- | --- | --- |
| Stage 1 | OWLv2 image-guided detection | Zero-shot, uses 52 hand-labeled crops as exemplars | stage1_gallery/embeddings.pt (mean image embedding per brand) + metadata.json |
| Stage 2 | RF-DETR-Small with a frozen DINOv2 backbone | Fine-tuned on synthetic copy-paste + real images | checkpoint_best_ema.pth (RF-DETR-Small best-EMA checkpoint) |

Stage 2 outperforms Stage 1 by a wide margin: 93.23% vs. 0.46% mAP at IoU 0.50 on the same 42 images (see table below). Stage 1 artefacts are kept in the repo for reproducibility and for anyone who wants to use the OWLv2 exemplar flow directly.

Side-by-side evaluation: 42 real human-labeled images

Per-class Average Precision (%) at IoU ∈ {0.50, 0.75, 0.85, 0.95}. Both stages were evaluated on the same 42-image ground-truth set (brand_detection/data/hf_dataset/), so the numbers are directly comparable.
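
To make the four thresholds concrete, here is a minimal sketch of box IoU for (x0, y0, x1, y1) boxes. The helper names `box_iou` and `thresholds_passed` are illustrative and are not part of this repository or of pycocotools; the example boxes are made up.

```python
IOU_THRESHOLDS = (0.50, 0.75, 0.85, 0.95)

def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def thresholds_passed(pred, gt):
    """List the IoU thresholds at which `pred` would count as a match for `gt`."""
    iou = box_iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if iou >= t]

# Hypothetical 100x100 px ground-truth box and a prediction shifted by 5 px.
gt = (100, 100, 200, 200)
pred = (105, 105, 205, 205)
print(round(box_iou(pred, gt), 4))   # 0.8223
print(thresholds_passed(pred, gt))   # [0.5, 0.75]
```

A 5-pixel offset on a 100-pixel logo already fails the 0.85 threshold, which helps explain why AP drops sharply at 0.85 and 0.95 even for a well-trained detector.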

| Class | S1 AP@0.50 | S2 AP@0.50 | S1 AP@0.75 | S2 AP@0.75 | S1 AP@0.85 | S2 AP@0.85 | S1 AP@0.95 | S2 AP@0.95 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lenscrafters | 0.12 | 100.00 | 0.00 | 100.00 | 0.00 | 51.72 | 0.00 | 0.00 |
| lufthansa | 0.36 | 90.14 | 0.25 | 72.91 | 0.13 | 56.83 | 0.00 | 21.53 |
| mediamarkt | 0.01 | 82.58 | 0.01 | 82.58 | 0.01 | 74.59 | 0.00 | 2.48 |
| meta | 0.05 | 88.64 | 0.01 | 76.84 | 0.00 | 25.74 | 0.00 | 0.00 |
| nuance | 0.40 | 100.00 | 0.20 | 100.00 | 0.01 | 85.15 | 0.00 | 24.09 |
| oakley | 0.12 | 91.26 | 0.00 | 91.26 | 0.00 | 71.29 | 0.00 | 12.97 |
| ray_ban | 2.15 | 100.00 | 0.01 | 69.41 | 0.01 | 67.33 | 0.00 | 0.00 |
| mAP | 0.46 | 93.23 | 0.07 | 84.71 | 0.02 | 61.81 | 0.00 | 8.72 |

S1 = Stage 1 OWLv2 exemplar-gallery (zero-shot). S2 = Stage 2 RF-DETR-Small (fine-tuned).
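
The mAP row is the unweighted mean of the seven per-class AP values in each column. A small sketch that reproduces the AP@0.50 entries from the table (the variable and function names are illustrative):

```python
# Per-class AP@0.50 (%) copied from the table above.
S1_AP50 = {"lenscrafters": 0.12, "lufthansa": 0.36, "mediamarkt": 0.01, "meta": 0.05,
           "nuance": 0.40, "oakley": 0.12, "ray_ban": 2.15}
S2_AP50 = {"lenscrafters": 100.00, "lufthansa": 90.14, "mediamarkt": 82.58, "meta": 88.64,
           "nuance": 100.00, "oakley": 91.26, "ray_ban": 100.00}

def mean_ap(per_class_ap):
    """Unweighted mean over classes, i.e. every brand counts equally."""
    return sum(per_class_ap.values()) / len(per_class_ap)

print(round(mean_ap(S1_AP50), 2))  # 0.46
print(round(mean_ap(S2_AP50), 2))  # 93.23
```

Because the mean is unweighted, a brand with very few annotations among the 52 ground-truth boxes moves the mAP as much as a brand with many.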

Training details (Stage 2)

Dataset

The Stage 2 dataset combines the 42 hand-labeled real images with a synthetic copy-paste split built from per-brand crops pasted onto COCO val2017 backgrounds:

  • Real labeled images : 42
  • Synthetic copy-paste images : 1400 (~200 per brand × 7 brands)
  • COCO val2017 backgrounds : 50
  • YOLO split : 1156 train / 286 val (stratified by class, seed 42)
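
The labeling side of copy-paste synthesis can be sketched as follows: choose where a crop lands on a background and derive the YOLO label line from that placement. This is a minimal illustration under assumptions, not the generation script used for this dataset; the function names, image sizes and crop sizes are hypothetical, and any scaling, blending or augmentation the real pipeline applied is not shown.

```python
import random

def paste_box(bg_w, bg_h, crop_w, crop_h, rng):
    """Pick a random top-left corner so the crop fits fully inside the background."""
    x0 = rng.randint(0, bg_w - crop_w)
    y0 = rng.randint(0, bg_h - crop_h)
    return x0, y0, x0 + crop_w, y0 + crop_h

def to_yolo_label(class_id, box, bg_w, bg_h):
    """YOLO line: class, then box center and size normalized to [0, 1]."""
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2 / bg_w
    cy = (y0 + y1) / 2 / bg_h
    w = (x1 - x0) / bg_w
    h = (y1 - y0) / bg_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

rng = random.Random(42)                 # fixed seed for a repeatable placement
box = paste_box(640, 480, 120, 60, rng) # hypothetical 120x60 crop on a 640x480 background
label = to_yolo_label(1, box, 640, 480) # class 1 = lufthansa in 0-indexed order
print(label)

# Sanity check on the counts listed above: real + synthetic = train + val.
total, train, val = 42 + 1400, 1156, 286
print(total == train + val, round(val / total, 3))  # True 0.198
```

The count check shows the split is roughly 80/20.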

Hyperparameters

  • Architecture : RF-DETR-Small with DINOv2-Small (windowed) backbone
  • Backbone : frozen (lr_encoder = 0)
  • Head learning rate : 1e-4
  • Resolution : 640
  • Effective batch size : 16 (per-device 4 × grad_accum 4)
  • Epochs : 100 with early stopping (patience 20)
  • EMA : enabled, used for best-checkpoint selection
  • Seed : 42
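
For reference, the same settings written out as a plain dictionary, with the derived quantities computed from them. Only `lr_encoder` is a name taken from this card; the other keys are illustrative and may not match the argument names of the training API. The steps-per-epoch figure assumes the final partial batch is kept.

```python
import math

# Hyperparameters from the list above; key names other than lr_encoder are illustrative.
train_config = {
    "resolution": 640,
    "lr": 1e-4,            # head learning rate
    "lr_encoder": 0.0,     # frozen backbone
    "batch_size": 4,       # per-device
    "grad_accum_steps": 4,
    "epochs": 100,
    "early_stopping_patience": 20,
    "use_ema": True,
    "seed": 42,
}

effective_batch = train_config["batch_size"] * train_config["grad_accum_steps"]
train_images = 1156  # from the YOLO split in the Dataset section
steps_per_epoch = math.ceil(train_images / effective_batch)

print(effective_batch)   # 16
print(steps_per_epoch)   # 73
```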

The best EMA checkpoint reached mAP@0.50:0.95 = 0.8981 (≈ 89.81%) on the synthetic + real val split at epoch 33; this is the checkpoint shipped here (checkpoint_best_ema.pth). The training run halted shortly afterwards (around epoch 35) because of a GPU deadlock caused by the host machine going to sleep, which is unrelated to the model itself; the best EMA checkpoint had already been saved by then. No re-training was performed because the most recent epochs (32, 33) had already started to plateau and the patience-20 early-stopping criterion was almost certain to fire by epoch 53, i.e. 20 epochs after the best checkpoint.
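
The early-stopping reasoning can be illustrated with a generic patience rule. This is a sketch of the usual logic, not the trainer's actual implementation, and the validation curve below is invented purely so that its peak falls at epoch 33.

```python
def early_stop_epoch(val_scores, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    stop_epoch is None if patience never runs out within val_scores.
    """
    best_score, best_epoch = float("-inf"), None
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None

# Made-up curve: rises until epoch 33 (peaking at 0.8981), then stays slightly lower.
scores = [0.8981 * e / 33 for e in range(1, 34)] + [0.89] * 67

print(early_stop_epoch(scores, patience=20))       # (33, 53)
print(early_stop_epoch(scores[:35], patience=20))  # (33, None)
```

The second call mirrors the interrupted run: stopping at epoch 35 leaves the best checkpoint at epoch 33 intact, and a full run with no further improvement would have stopped at epoch 53 anyway.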

Model parameters : 32.11 M.

Hardware

Trained locally on an NVIDIA RTX 3060 (12 GB) with gradient checkpointing. Each epoch took ≈ 4 min 40 s; training to the best EMA checkpoint was ≈ 2.5 h.
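
The wall-clock figure follows from the per-epoch time and the best epoch (33) reported above; the 100-epoch projection is only an extrapolation that assumes the same per-epoch time.

```python
SECONDS_PER_EPOCH = 4 * 60 + 40   # 4 min 40 s
BEST_EPOCH = 33
MAX_EPOCHS = 100

hours_to_best = SECONDS_PER_EPOCH * BEST_EPOCH / 3600
hours_full_run = SECONDS_PER_EPOCH * MAX_EPOCHS / 3600

print(f"{hours_to_best:.2f} h to best checkpoint")      # 2.57 h to best checkpoint
print(f"{hours_full_run:.2f} h for all {MAX_EPOCHS} epochs")  # 7.78 h for all 100 epochs
```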

Usage

Stage 2 (recommended): run RF-DETR-Small inference

from huggingface_hub import hf_hub_download
from rfdetr import RFDETRSmall
from PIL import Image

ckpt = hf_hub_download("mettinski/logo-detector-rfdetr", "checkpoint_best_ema.pth")
model = RFDETRSmall(num_classes=7, resolution=640, pretrain_weights=ckpt)

CLASS_NAMES = [
    "lenscrafters", "lufthansa", "mediamarkt", "meta",
    "nuance", "oakley", "ray_ban",
]

img = Image.open("your_image.jpg").convert("RGB")
dets = model.predict(img, threshold=0.5)
for (x0, y0, x1, y1), c, s in zip(dets.xyxy, dets.class_id, dets.confidence):
    print(f"{CLASS_NAMES[int(c)]}: score={float(s):.3f}  box=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})")

The class_id returned by rfdetr is 0-indexed (0 = lenscrafters … 6 = ray_ban). To compare predictions against the COCO-format ground-truth file shipped with this project, add 1 to convert back to category IDs 1..7.
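
A minimal sketch of that conversion, which also reshapes a detection into the COCO results layout (boxes as [x, y, width, height]). The helper name `to_coco_result` and the example values are hypothetical.

```python
CLASS_NAMES = [
    "lenscrafters", "lufthansa", "mediamarkt", "meta",
    "nuance", "oakley", "ray_ban",
]

def to_coco_result(image_id, xyxy, class_id, score):
    """Convert one detection (0-indexed class, xyxy box) to a COCO results entry."""
    x0, y0, x1, y1 = (float(v) for v in xyxy)
    return {
        "image_id": image_id,
        "category_id": int(class_id) + 1,        # 0..6 -> 1..7
        "bbox": [x0, y0, x1 - x0, y1 - y0],      # COCO uses [x, y, w, h]
        "score": float(score),
    }

# Hypothetical detection: class 6 (ray_ban) on image 17.
result = to_coco_result(17, (10, 20, 110, 70), 6, 0.93)
print(result["category_id"], CLASS_NAMES[result["category_id"] - 1], result["bbox"])
# 7 ray_ban [10.0, 20.0, 100.0, 50.0]
```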

Stage 1: OWLv2 exemplar gallery

The mean OWLv2 image embedding for each brand (tensor of shape [hidden_dim]) is stored alongside the Stage 2 checkpoint for reproducibility. It is the input you would feed into Owlv2ForObjectDetection.image_guided_detection if you wanted to re-run the zero-shot baseline.

import json

import torch
from huggingface_hub import hf_hub_download

emb_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/embeddings.pt")
meta_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/metadata.json")
embeddings = torch.load(emb_path, map_location="cpu")   # dict[class_name] -> Tensor
with open(meta_path, encoding="utf-8") as f:            # close the file once parsed
    metadata = json.load(f)
print(metadata["classes"])              # ['lenscrafters', ..., 'ray_ban']
print(embeddings["lufthansa"].shape)    # torch.Size([hidden_dim])
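
One simple way to use such a gallery is nearest-brand lookup by cosine similarity between a query embedding and each per-brand mean. The sketch below uses plain Python lists and toy 3-dimensional vectors so it runs without downloads; this is an assumption about how the gallery could be used, not a description of how the Stage 1 numbers were produced. With real tensors, `torch.nn.functional.cosine_similarity` computes the same quantity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_brand(query, gallery):
    """Return (brand, similarity) for the gallery entry closest to `query`."""
    scores = {name: cosine(query, emb) for name, emb in gallery.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy stand-ins for the per-brand mean embeddings (real ones have hidden_dim entries).
toy_gallery = {
    "lufthansa": [1.0, 0.0, 0.0],
    "oakley": [0.0, 1.0, 0.0],
    "meta": [0.0, 0.0, 1.0],
}
brand, score = nearest_brand([0.9, 0.1, 0.0], toy_gallery)
print(brand, round(score, 4))  # lufthansa 0.9939
```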

Full stage1/metrics.md

Stage 1 (OWLv2 exemplar-gallery): detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 126731 boxes loaded from predictions.json

Per-class Average Precision (%):

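| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
| --- | --- | --- | --- | --- | --- |
| lenscrafters | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| lufthansa | 0.36 | 0.25 | 0.13 | 0.00 | 0.19 |
| mediamarkt | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 |
| meta | 0.05 | 0.01 | 0.00 | 0.00 | 0.01 |
| nuance | 0.40 | 0.20 | 0.01 | 0.00 | 0.15 |
| oakley | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| ray_ban | 2.15 | 0.01 | 0.01 | 0.00 | 0.54 |
| mAP | 0.46 | 0.07 | 0.02 | 0.00 | 0.14 |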
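
pycocotools summary (using our custom IoU thresholds, so the rows labeled IoU=0.50:0.95 average over {0.50, 0.75, 0.85, 0.95} only; values are fractions, not percentages):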
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.009
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.250
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.319
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.206

Full stage2/metrics.md

Stage 2 (RF-DETR-Small, frozen DINOv2 backbone): detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 6759 boxes loaded from predictions.json

Per-class Average Precision (%):

| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
| --- | --- | --- | --- | --- | --- |
| lenscrafters | 100.00 | 100.00 | 51.72 | 0.00 | 62.93 |
| lufthansa | 90.14 | 72.91 | 56.83 | 21.53 | 60.35 |
| mediamarkt | 82.58 | 82.58 | 74.59 | 2.48 | 60.56 |
| meta | 88.64 | 76.84 | 25.74 | 0.00 | 47.81 |
| nuance | 100.00 | 100.00 | 85.15 | 24.09 | 77.31 |
| oakley | 91.26 | 91.26 | 71.29 | 12.97 | 66.70 |
| ray_ban | 100.00 | 69.41 | 67.33 | 0.00 | 59.18 |
| mAP | 93.23 | 84.71 | 61.81 | 8.72 | 62.12 |
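
pycocotools summary (using our custom IoU thresholds, so the rows labeled IoU=0.50:0.95 average over {0.50, 0.75, 0.85, 0.95} only; values are fractions, not percentages):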
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.932
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.847
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.547
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.576
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.605
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.680
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.703
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648

Files in this repo

| Path | Purpose |
| --- | --- |
| checkpoint_best_ema.pth | RF-DETR-Small best-EMA checkpoint (Stage 2, ~370 MB) |
| stage1_gallery/embeddings.pt | Per-class OWLv2 mean image embedding (Stage 1) |
| stage1_gallery/metadata.json | Stage 1 gallery metadata (classes, exemplar counts) |
| README.md | This model card |

Evaluation results

  • mAP@0.50 on brand_detection hf_dataset (42 human-labeled images): 93.23 (self-reported)
  • mAP@0.75 on brand_detection hf_dataset (42 human-labeled images): 84.71 (self-reported)
  • mAP@0.85 on brand_detection hf_dataset (42 human-labeled images): 61.81 (self-reported)
  • mAP@0.95 on brand_detection hf_dataset (42 human-labeled images): 8.72 (self-reported)