logo-detector-rfdetr

Two-stage brand-logo detector for 7 brands — lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban.

The pipeline started as a zero-shot detector (Stage 1) and was then replaced by a fine-tuned object-detection head (Stage 2). Both stages are shipped in this repository:

Stage	Model	Supervision	Output
Stage 1	OWLv2 image-guided detection	Zero-shot, uses 52 hand-labeled crops as exemplars	`stage1_gallery/embeddings.pt` (mean image embedding per brand) + `metadata.json`
Stage 2	RF-DETR-Small with a frozen DINOv2 backbone	Fine-tuned on synthetic copy-paste + real images	`checkpoint_best_ema.pth` — RF-DETR-Small best-EMA checkpoint

Stage 2 outperforms Stage 1 by a wide margin on the 42-image real test set: mAP@0.50 is 93.23% versus 0.46% (see table below). Stage 1 artefacts are kept in the repo for reproducibility and for anyone who wants to use the OWLv2 exemplar flow directly.

Side-by-side evaluation — 42 real human-labeled images

Per-class Average Precision (%) at IoU ∈ {0.50, 0.75, 0.85, 0.95}. Both stages were evaluated on the same 42-image ground-truth set (brand_detection/data/hf_dataset/), so the numbers are directly comparable.

Class	S1 AP@0.50	S2 AP@0.50	S1 AP@0.75	S2 AP@0.75	S1 AP@0.85	S2 AP@0.85	S2 AP@0.95
lenscrafters	0.12	100.00	0.00	100.00	0.00	51.72	0.00
lufthansa	0.36	90.14	0.25	72.91	0.13	56.83	21.53
mediamarkt	0.01	82.58	0.01	82.58	0.01	74.59	2.48
meta	0.05	88.64	0.01	76.84	0.00	25.74	0.00
nuance	0.40	100.00	0.20	100.00	0.01	85.15	24.09
oakley	0.12	91.26	0.00	91.26	0.00	71.29	12.97
ray_ban	2.15	100.00	0.01	69.41	0.01	67.33	0.00
mAP	0.46	93.23	0.07	84.71	0.02	61.81	8.72

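S1 = Stage 1 OWLv2 exemplar-gallery (zero-shot). S2 = Stage 2 RF-DETR-Small (fine-tuned). The S1 AP@0.95 column is omitted because it is 0.00 for every class (see the full Stage 1 metrics further down).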
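The IoU thresholds in the table decide how closely a predicted box must overlap a ground-truth box to count as a match. The following is a minimal sketch of that overlap test for boxes in (x0, y0, x1, y1) form. The function names and the example boxes are hypothetical and are not taken from this project's evaluation code.

```python
IOU_THRESHOLDS = (0.50, 0.75, 0.85, 0.95)  # thresholds reported in the table


def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred, gt):
    """Return the IoU thresholds at which pred would still match gt."""
    iou = box_iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if iou >= t]


# Hypothetical example: a 100x100 box predicted 5 px off in both directions.
gt_box = (100, 100, 200, 200)
pred_box = (105, 105, 205, 205)
print(box_iou(pred_box, gt_box), thresholds_passed(pred_box, gt_box))
```

In this toy case the IoU is about 0.82, so the prediction matches at 0.50 and 0.75 but not at 0.85 or 0.95. That gives some intuition for why AP drops at the stricter thresholds even when the detector finds the right logo.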

Training details (Stage 2)

Dataset

The Stage 2 dataset combines the 42 hand-labeled real images with a synthetic copy-paste split built from per-brand crops pasted onto COCO val2017 backgrounds:

Real labeled images : 42
Synthetic copy-paste images : 1400 (~200 per brand × 7 brands)
COCO val2017 backgrounds : 50
YOLO split : 1156 train / 286 val (stratified by class, seed 42)
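A class-stratified split with a fixed seed could look roughly like the sketch below. It is not the project's actual split script: the function name, the 20% validation fraction, and the one-class-per-image simplification are all assumptions, so it will not necessarily reproduce the exact 1156 / 286 counts.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.2, seed=42):
    """Split (image_id, class_name) pairs into train/val lists per class.

    Assumes each image belongs to exactly one class (a simplification).
    """
    by_class = defaultdict(list)
    for image_id, class_name in samples:
        by_class[class_name].append(image_id)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, val = [], []
    for class_name in sorted(by_class):
        ids = sorted(by_class[class_name])
        rng.shuffle(ids)
        n_val = round(len(ids) * val_fraction)
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


# Hypothetical file names, 10 images for each of two brands.
demo = [(f"{brand}_{i:03d}.jpg", brand) for brand in ("oakley", "meta") for i in range(10)]
train_ids, val_ids = stratified_split(demo)
print(len(train_ids), len(val_ids))
```

Splitting per class keeps rare brands represented in both subsets, which matters here because only 42 of the images are real.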

Hyperparameters

Architecture : RF-DETR-Small with DINOv2-Small (windowed) backbone
Backbone : frozen (lr_encoder = 0)
Head learning rate : 1e-4
Resolution : 640
Effective batch size : 16 (per-device 4 × grad_accum 4)
Epochs : 100 with early stopping (patience 20)
EMA : enabled, used for best-checkpoint selection
Seed : 42
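To make the batch-size arithmetic explicit, the settings above can be collected in a plain dictionary. The key names below are illustrative and are not claimed to be the exact rfdetr argument names. The optimizer-step count also assumes a single GPU and that the last partial batch of an epoch is kept.

```python
import math

# Hyperparameters from the list above; key names are hypothetical.
config = {
    "resolution": 640,
    "lr_head": 1e-4,
    "lr_encoder": 0.0,  # frozen backbone
    "per_device_batch": 4,
    "grad_accum_steps": 4,
    "epochs": 100,
    "early_stopping_patience": 20,
    "seed": 42,
}
TRAIN_IMAGES = 1156  # train split size from the Dataset section

# Gradients are accumulated over several small batches before each update.
effective_batch = config["per_device_batch"] * config["grad_accum_steps"]
updates_per_epoch = math.ceil(TRAIN_IMAGES / effective_batch)
print(effective_batch, updates_per_epoch)
```

Gradient accumulation is what lets an effective batch of 16 fit on a 12 GB card while only 4 images are in memory at a time.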

The best EMA checkpoint reached mAP@0.50:0.95 = 0.8981 (≈ 89.81%) on the synthetic + real val split at epoch 33; this is the checkpoint shipped here (checkpoint_best_ema.pth). The training run was halted shortly afterwards (around epoch 35) by a GPU deadlock caused by the host machine going to sleep, which is unrelated to the model itself, and the best EMA checkpoint had already been saved. No re-training was performed because the validation metric had already started to plateau around the best epoch (epochs 32 and 33), so the patience-20 early-stopping criterion would most likely have fired at about epoch 53 (33 + 20) anyway.
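The early-stopping reasoning can be illustrated with a small patience counter. This is a generic sketch, not rfdetr's implementation: it assumes training stops once 20 consecutive epochs pass without a new best score and ignores any minimum-delta setting. The validation curve is made up, except for the 0.8981 peak at epoch 33.

```python
def early_stop_epoch(val_scores, patience=20):
    """Return the 1-based epoch at which training would stop, or None."""
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # `patience` epochs without improvement
    return None


# Hypothetical curve: steady rise to 0.8981 at epoch 33, then a flat plateau.
scores = [0.8981 * e / 33 for e in range(1, 34)] + [0.89] * 67
print(early_stop_epoch(scores, patience=20))
```

Under these assumptions a run that peaks at epoch 33 and never improves again ends at epoch 53, well short of the 100-epoch budget.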

Model parameters : 32.11 M.

Hardware

Trained locally on an NVIDIA RTX 3060 (12 GB) with gradient checkpointing. Each epoch took ≈ 4 min 40 s; training to the best EMA checkpoint was ≈ 2.5 h.

Usage

Stage 2 (recommended) — run RF-DETR-Small inference

from huggingface_hub import hf_hub_download
from rfdetr import RFDETRSmall
from PIL import Image

ckpt = hf_hub_download("mettinski/logo-detector-rfdetr", "checkpoint_best_ema.pth")
model = RFDETRSmall(num_classes=7, resolution=640, pretrain_weights=ckpt)

CLASS_NAMES = [
    "lenscrafters", "lufthansa", "mediamarkt", "meta",
    "nuance", "oakley", "ray_ban",
]

img = Image.open("your_image.jpg").convert("RGB")
dets = model.predict(img, threshold=0.5)
for (x0, y0, x1, y1), c, s in zip(dets.xyxy, dets.class_id, dets.confidence):
    print(f"{CLASS_NAMES[int(c)]}: score={float(s):.3f}  box=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})")

The class_id returned by rfdetr is 0-indexed (0 = lenscrafters … 6 = ray_ban). To compare predictions against the COCO-format ground-truth file shipped with this project, add 1 to convert back to category IDs 1..7.
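A small helper along these lines can do that conversion. It turns 0-indexed detections with (x0, y0, x1, y1) boxes into COCO result entries, which use 1-based category IDs here and [x, y, width, height] boxes. The function name and the image_id value are hypothetical, and the sketch assumes the ground-truth categories follow the CLASS_NAMES order.

```python
def to_coco_results(image_id, boxes_xyxy, class_ids, scores):
    """Convert detections for one image into COCO result dictionaries."""
    results = []
    for (x0, y0, x1, y1), class_id, score in zip(boxes_xyxy, class_ids, scores):
        results.append({
            "image_id": image_id,
            "category_id": int(class_id) + 1,  # 0..6 -> 1..7
            "bbox": [float(x0), float(y0), float(x1 - x0), float(y1 - y0)],
            "score": float(score),
        })
    return results


# Hypothetical detection: one ray_ban box (class_id 6) in image 7.
example = to_coco_results(7, [(10, 20, 110, 70)], [6], [0.9])
print(example)
```

With the inference snippet above, the inputs would be dets.xyxy, dets.class_id and dets.confidence for each image.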

Stage 1 — OWLv2 exemplar gallery

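The mean OWLv2 image embedding for each brand (a tensor of shape [hidden_dim]) is stored alongside the Stage 2 checkpoint for reproducibility. It serves as the per-brand query embedding for re-running the zero-shot baseline. Note that Owlv2ForObjectDetection.image_guided_detection in transformers takes query images (query_pixel_values) and computes the query embedding internally, so reusing the stored vectors means supplying them in place of that internally computed embedding.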

import json

import torch
from huggingface_hub import hf_hub_download

emb_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/embeddings.pt")
meta_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/metadata.json")

# dict[class_name] -> Tensor of shape [hidden_dim]
embeddings = torch.load(emb_path, map_location="cpu")
with open(meta_path, encoding="utf-8") as f:
    metadata = json.load(f)

print(metadata["classes"])              # ['lenscrafters', ..., 'ray_ban']
print(embeddings["lufthansa"].shape)    # torch.Size([hidden_dim])
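As a conceptual illustration of an exemplar gallery, the sketch below averages a few exemplar vectors per brand and picks the brand whose mean vector is closest to a query by cosine similarity. It uses plain Python lists and toy 3-dimensional vectors so that it runs without any model. Whether the Stage 1 pipeline scores candidates with cosine similarity is an assumption, and all names here are hypothetical.

```python
import math


def mean_embedding(vectors):
    """Element-wise mean of equally sized exemplar vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_brand(query, gallery):
    """Return the gallery key whose mean vector is most similar to query."""
    return max(gallery, key=lambda name: cosine_similarity(query, gallery[name]))


# Toy exemplars; real OWLv2 embeddings have hidden_dim entries.
gallery = {
    "oakley": mean_embedding([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "meta": mean_embedding([[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]),
}
print(best_brand([0.9, 0.1, 0.0], gallery))
```

With the real gallery, the same idea would operate on the tensors in the embeddings dictionary loaded above, for example via torch.nn.functional.cosine_similarity.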

Full stage1/metrics.md

Stage 1 — OWLv2 exemplar-gallery — detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 126731 boxes loaded from predictions.json

Per-class Average Precision (%):

class	AP@0.50	AP@0.75	AP@0.85	AP@0.95	mean
lenscrafters	0.12	0.00	0.00	0.00	0.03
lufthansa	0.36	0.25	0.13	0.00	0.19
mediamarkt	0.01	0.01	0.01	0.00	0.00
meta	0.05	0.01	0.00	0.00	0.01
nuance	0.40	0.20	0.01	0.00	0.15
oakley	0.12	0.00	0.00	0.00	0.03
ray_ban	2.15	0.01	0.01	0.00	0.54
mAP	0.46	0.07	0.02	0.00	0.14

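pycocotools summary (using our custom IoU thresholds, so the rows labeled IoU=0.50:0.95 average over {0.50, 0.75, 0.85, 0.95} rather than the standard ten thresholds):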
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.005
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.009
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.250
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.319
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.206

Full stage2/metrics.md

Stage 2 — RF-DETR-Small (frozen DINOv2 backbone) — detection metrics

Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)

Predictions: 6759 boxes loaded from predictions.json

Per-class Average Precision (%):

class	AP@0.50	AP@0.75	AP@0.85	AP@0.95	mean
lenscrafters	100.00	100.00	51.72	0.00	62.93
lufthansa	90.14	72.91	56.83	21.53	60.35
mediamarkt	82.58	82.58	74.59	2.48	60.56
meta	88.64	76.84	25.74	0.00	47.81
nuance	100.00	100.00	85.15	24.09	77.31
oakley	91.26	91.26	71.29	12.97	66.70
ray_ban	100.00	69.41	67.33	0.00	59.18
mAP	93.23	84.71	61.81	8.72	62.12
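The "mean" column and the mAP row can be cross-checked from the per-class values. The sketch below assumes the mean column is the plain average over the four IoU thresholds and the mAP row is the unweighted average over the seven classes. Because the table entries are already rounded, recomputed values are only expected to agree to within about 0.01.

```python
from statistics import mean

# Stage 2 AP (%) at IoU 0.50, 0.75, 0.85, 0.95, copied from the table above.
STAGE2_AP = {
    "lenscrafters": [100.00, 100.00, 51.72, 0.00],
    "lufthansa": [90.14, 72.91, 56.83, 21.53],
    "mediamarkt": [82.58, 82.58, 74.59, 2.48],
    "meta": [88.64, 76.84, 25.74, 0.00],
    "nuance": [100.00, 100.00, 85.15, 24.09],
    "oakley": [91.26, 91.26, 71.29, 12.97],
    "ray_ban": [100.00, 69.41, 67.33, 0.00],
}

# Row-wise mean: one value per class (the "mean" column).
per_class_mean = {name: mean(aps) for name, aps in STAGE2_AP.items()}

# Column-wise mean: one value per IoU threshold (the "mAP" row).
map_per_threshold = [mean(column) for column in zip(*STAGE2_AP.values())]

# Mean over all classes and thresholds (bottom-right cell).
overall_mean = mean(map_per_threshold)

for name, value in per_class_mean.items():
    print(f"{name}: {value:.2f}")
print([f"{v:.2f}" for v in map_per_threshold], f"{overall_mean:.2f}")
```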

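pycocotools summary (using our custom IoU thresholds, so the rows labeled IoU=0.50:0.95 average over {0.50, 0.75, 0.85, 0.95} rather than the standard ten thresholds):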
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.621
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.932
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.847
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.547
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.576
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.605
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.680
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.703
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.667
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648

Files in this repo

Path	Purpose
`checkpoint_best_ema.pth`	RF-DETR-Small best-EMA checkpoint (Stage 2, ~370 MB)
`stage1_gallery/embeddings.pt`	Per-class OWLv2 mean image embedding (Stage 1)
`stage1_gallery/metadata.json`	Stage 1 gallery metadata (classes, exemplar counts)
`README.md`	This model card

Evaluation results

Self-reported results on the brand_detection hf_dataset (42 human-labeled images):

Metric	Value (%)
mAP@0.50	93.23
mAP@0.75	84.71
mAP@0.85	61.81
mAP@0.95	8.72