logo-detector-rfdetr
Two-stage brand-logo detector for 7 brands: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban.
The pipeline started as a zero-shot detector (Stage 1) and was then replaced by a fine-tuned object-detection head (Stage 2). Both stages are shipped in this repository:
| Stage | Model | Supervision | Output |
|---|---|---|---|
| Stage 1 | OWLv2 image-guided detection | Zero-shot, uses 52 hand-labeled crops as exemplars | stage1_gallery/embeddings.pt (mean image embedding per brand) + metadata.json |
| Stage 2 | RF-DETR-Small with a frozen DINOv2 backbone | Fine-tuned on synthetic copy-paste + real images | checkpoint_best_ema.pth (RF-DETR-Small best-EMA checkpoint) |
Stage 2 far outperforms Stage 1 on the 42-image evaluation set: 93.23% vs 0.46% mAP@0.50 (see table below). Stage 1 artefacts are kept in the repo for reproducibility and for anyone who wants to use the OWLv2 exemplar flow directly.
Side-by-side evaluation: 42 real human-labeled images
Per-class Average Precision (%) at IoU ∈ {0.50, 0.75, 0.85, 0.95}. Both stages were evaluated on the same 42-image ground-truth set (brand_detection/data/hf_dataset/), so the numbers are directly comparable.
| Class | S1 AP@0.50 | S2 AP@0.50 | S1 AP@0.75 | S2 AP@0.75 | S1 AP@0.85 | S2 AP@0.85 | S1 AP@0.95 | S2 AP@0.95 |
|---|---|---|---|---|---|---|---|---|
| lenscrafters | 0.12 | 100.00 | 0.00 | 100.00 | 0.00 | 51.72 | 0.00 | 0.00 |
| lufthansa | 0.36 | 90.14 | 0.25 | 72.91 | 0.13 | 56.83 | 0.00 | 21.53 |
| mediamarkt | 0.01 | 82.58 | 0.01 | 82.58 | 0.01 | 74.59 | 0.00 | 2.48 |
| meta | 0.05 | 88.64 | 0.01 | 76.84 | 0.00 | 25.74 | 0.00 | 0.00 |
| nuance | 0.40 | 100.00 | 0.20 | 100.00 | 0.01 | 85.15 | 0.00 | 24.09 |
| oakley | 0.12 | 91.26 | 0.00 | 91.26 | 0.00 | 71.29 | 0.00 | 12.97 |
| ray_ban | 2.15 | 100.00 | 0.01 | 69.41 | 0.01 | 67.33 | 0.00 | 0.00 |
| mAP | 0.46 | 93.23 | 0.07 | 84.71 | 0.02 | 61.81 | 0.00 | 8.72 |
S1 = Stage 1 OWLv2 exemplar-gallery (zero-shot). S2 = Stage 2 RF-DETR-Small (fine-tuned).
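The IoU thresholds in the table decide whether a predicted box counts as a match for a ground-truth box. A minimal sketch of that check follows; the two boxes are made-up values chosen only to show how a small localisation offset passes the looser thresholds and fails the stricter ones.

```python
def iou_xyxy(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


THRESHOLDS = (0.50, 0.75, 0.85, 0.95)

# Hypothetical boxes: a 100x100 ground truth and a prediction shifted by 5 px.
gt = (100, 100, 200, 200)
pred = (105, 105, 205, 205)

iou = iou_xyxy(gt, pred)
passed = [t for t in THRESHOLDS if iou >= t]
print(f"IoU={iou:.4f}, counted as a match at thresholds {passed}")
```

In this toy case a 5-pixel shift on a 100-pixel box is enough to miss the 0.85 and 0.95 thresholds, which is one reason AP at those thresholds is much lower than AP@0.50.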
Training details (Stage 2)
Dataset
The Stage 2 dataset combines the 42 hand-labeled real images with a synthetic copy-paste split built from per-brand crops pasted onto COCO val2017 backgrounds:
- Real labeled images : 42
- Synthetic copy-paste images : 1400 (~200 per brand × 7 brands)
- COCO val2017 backgrounds : 50
- YOLO split : 1156 train / 286 val (stratified by class, seed 42)
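For a copy-paste sample, the label follows directly from where the crop is pasted. The sketch below shows one way to pick a paste position and derive the YOLO label line; the function names, image sizes and class index are illustrative assumptions, not the repository's actual generation script.

```python
import random


def random_paste_position(bg_w, bg_h, crop_w, crop_h, rng):
    """Pick a top-left corner so the crop lies fully inside the background."""
    if crop_w > bg_w or crop_h > bg_h:
        raise ValueError("crop is larger than the background")
    return rng.randint(0, bg_w - crop_w), rng.randint(0, bg_h - crop_h)


def paste_box_to_yolo(bg_w, bg_h, crop_w, crop_h, x0, y0, class_id):
    """Return a YOLO label line: class, then normalised centre x/y, width, height."""
    cx = (x0 + crop_w / 2) / bg_w
    cy = (y0 + crop_h / 2) / bg_h
    return f"{class_id} {cx:.6f} {cy:.6f} {crop_w / bg_w:.6f} {crop_h / bg_h:.6f}"


rng = random.Random(42)  # fixed seed so the placement is repeatable
x0, y0 = random_paste_position(640, 480, 128, 64, rng)
print(paste_box_to_yolo(640, 480, 128, 64, x0, y0, class_id=1))
```

Because the box is known exactly at paste time, synthetic samples need no manual annotation.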
Hyperparameters
- Architecture : RF-DETR-Small with DINOv2-Small (windowed) backbone
- Backbone : frozen (lr_encoder = 0)
- Head learning rate : 1e-4
- Resolution : 640
- Effective batch size : 16 (per-device 4 × grad_accum 4)
- Epochs : 100 with early stopping (patience 20)
- EMA : enabled, used for best-checkpoint selection
- Seed : 42
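The batch-size arithmetic in the list above can be written out as follows. The dictionary keys are illustrative names, not necessarily the argument names rfdetr uses, and the steps-per-epoch figure assumes a single GPU and that the final partial batch is kept.

```python
import math

# Values copied from the hyperparameter list; key names are illustrative only.
hparams = {
    "resolution": 640,
    "head_lr": 1e-4,
    "encoder_lr": 0.0,  # frozen backbone
    "per_device_batch": 4,
    "grad_accum": 4,
    "epochs": 100,
    "patience": 20,
    "seed": 42,
}
train_images = 1156  # train split size from the Dataset section

effective_batch = hparams["per_device_batch"] * hparams["grad_accum"]
# Assumption: one GPU, and the last incomplete batch still triggers an update.
steps_per_epoch = math.ceil(train_images / effective_batch)
print(f"effective batch={effective_batch}, optimizer steps per epoch={steps_per_epoch}")
```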
The best EMA checkpoint reached mAP@0.50:0.95 = 0.8981 (≈ 89.81%) on the synthetic + real val split at epoch 33; this is the checkpoint shipped here (checkpoint_best_ema.pth). The training run halted shortly afterwards (around epoch 35) because of a GPU deadlock caused by the host machine going to sleep, which is unrelated to the model itself, and the best EMA checkpoint had already been saved. No re-training was performed because the validation metric had already started to plateau (epochs 32, 33) and the patience-20 early-stopping criterion was almost certain to fire before epoch 53.
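The early-stopping reasoning can be illustrated with a small patience counter. This is a generic sketch, not rfdetr's implementation: the metric curve is invented (rising until epoch 33, slightly lower afterwards), and implementations may count epochs differently by one.

```python
class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, metric):
        """Record one epoch; return True when training should stop."""
        if metric > self.best:
            self.best, self.best_epoch, self.bad_epochs = metric, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def simulate(best_epoch=33, best_metric=0.8981, patience=20, max_epochs=100):
    """Run the stopper on a hypothetical curve that peaks at `best_epoch`."""
    stopper = EarlyStopping(patience)
    for epoch in range(1, max_epochs + 1):
        if epoch <= best_epoch:
            metric = best_metric * epoch / best_epoch  # steadily improving
        else:
            metric = best_metric - 0.001  # plateau just below the best value
        if stopper.step(epoch, metric):
            return epoch, stopper.best_epoch
    return max_epochs, stopper.best_epoch


stop_epoch, best_epoch = simulate()
print(f"best epoch={best_epoch}, training stops at epoch={stop_epoch}")
```

Under this counting convention, a run whose best epoch is 33 and that never improves again stops at epoch 53, so continuing the interrupted run would have added at most 20 further epochs.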
Model parameters : 32.11 M.
Hardware
Trained locally on an NVIDIA RTX 3060 (12 GB) with gradient checkpointing. Each epoch took ≈ 4 min 40 s; training to the best EMA checkpoint took ≈ 2.5 h.
Usage
Stage 2 (recommended): run RF-DETR-Small inference
from huggingface_hub import hf_hub_download
from rfdetr import RFDETRSmall
from PIL import Image
# Download the fine-tuned Stage 2 checkpoint from the Hub.
ckpt = hf_hub_download("mettinski/logo-detector-rfdetr", "checkpoint_best_ema.pth")
model = RFDETRSmall(num_classes=7, resolution=640, pretrain_weights=ckpt)
# Position i in this list is the name for the 0-indexed class_id i.
CLASS_NAMES = [
    "lenscrafters", "lufthansa", "mediamarkt", "meta",
    "nuance", "oakley", "ray_ban",
]
img = Image.open("your_image.jpg").convert("RGB")
# Keep only detections with confidence >= 0.5.
dets = model.predict(img, threshold=0.5)
for (x0, y0, x1, y1), c, s in zip(dets.xyxy, dets.class_id, dets.confidence):
    print(f"{CLASS_NAMES[int(c)]}: score={float(s):.3f} box=({x0:.0f},{y0:.0f},{x1:.0f},{y1:.0f})")
The class_id returned by rfdetr is 0-indexed (0 = lenscrafters … 6 = ray_ban). If you compare against the COCO-format ground-truth file shipped with this project, add +1 to convert back to category IDs 1..7.
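A minimal sketch of that conversion, producing entries in the standard COCO results layout (bbox as x, y, width, height). The helper name and the image_id value are hypothetical; image IDs must come from your own ground-truth file.

```python
def to_coco_results(image_id, boxes_xyxy, class_ids, scores):
    """Convert 0-indexed xyxy detections into COCO result dicts (category IDs 1..7)."""
    results = []
    for (x0, y0, x1, y1), c, s in zip(boxes_xyxy, class_ids, scores):
        results.append({
            "image_id": image_id,
            "category_id": int(c) + 1,  # 0-indexed class_id -> COCO category ID
            "bbox": [float(x0), float(y0), float(x1 - x0), float(y1 - y0)],  # xywh
            "score": float(s),
        })
    return results


# Hypothetical single detection of class_id 6 (ray_ban) in image 7.
example = to_coco_results(7, [(10, 20, 110, 70)], [6], [0.9])
print(example)
```

With the inference snippet above, the same function could be called as to_coco_results(image_id, dets.xyxy, dets.class_id, dets.confidence).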
Stage 1: OWLv2 exemplar gallery
The mean OWLv2 image embedding for each brand (tensor of shape [hidden_dim]) is stored alongside the Stage 2 checkpoint for reproducibility. It is the input you would feed into Owlv2ForObjectDetection.image_guided_detection if you wanted to re-run the zero-shot baseline.
import json
import torch
from huggingface_hub import hf_hub_download
emb_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/embeddings.pt")
meta_path = hf_hub_download("mettinski/logo-detector-rfdetr", "stage1_gallery/metadata.json")
embeddings = torch.load(emb_path, map_location="cpu")  # dict[class_name] -> Tensor
# Read the gallery metadata with a context manager so the file is closed.
with open(meta_path, encoding="utf-8") as f:
    metadata = json.load(f)
print(metadata["classes"]) # ['lenscrafters', ..., 'ray_ban']
print(embeddings["lufthansa"].shape) # torch.Size([hidden_dim])
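The gallery stores one mean embedding per brand. As a rough sketch of how such a mean could be formed from per-exemplar embeddings, the example below uses plain Python lists so it runs without any dependencies; the exemplar vectors and their 3-dimensional size are made up, and the repository's actual gallery-building code is not shown here. With tensors, the equivalent operation would be torch.stack(vectors).mean(dim=0).

```python
from collections import defaultdict


def mean_embeddings(exemplars):
    """Average per-exemplar vectors into one vector per brand."""
    grouped = defaultdict(list)
    for brand, vector in exemplars:
        grouped[brand].append(vector)
    return {
        brand: [sum(column) / len(vectors) for column in zip(*vectors)]
        for brand, vectors in grouped.items()
    }


# Toy exemplars: (brand, embedding) pairs with invented values.
toy_exemplars = [
    ("lufthansa", [1.0, 0.0, 2.0]),
    ("lufthansa", [3.0, 2.0, 0.0]),
    ("meta", [0.5, 0.5, 0.5]),
]
gallery = mean_embeddings(toy_exemplars)
print(gallery["lufthansa"])
```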
Full stage1/metrics.md
Stage 1 (OWLv2 exemplar-gallery): detection metrics
Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)
Predictions: 126731 boxes loaded from predictions.json
Per-class Average Precision (%):
| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
|---|---|---|---|---|---|
| lenscrafters | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| lufthansa | 0.36 | 0.25 | 0.13 | 0.00 | 0.19 |
| mediamarkt | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 |
| meta | 0.05 | 0.01 | 0.00 | 0.00 | 0.01 |
| nuance | 0.40 | 0.20 | 0.01 | 0.00 | 0.15 |
| oakley | 0.12 | 0.00 | 0.00 | 0.00 | 0.03 |
| ray_ban | 2.15 | 0.01 | 0.01 | 0.00 | 0.54 |
| mAP | 0.46 | 0.07 | 0.02 | 0.00 | 0.14 |
pycocotools summary (using our custom IoU thresholds):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.005
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.009
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.015
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.250
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.319
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.206
Full stage2/metrics.md
Stage 2 (RF-DETR-Small, frozen DINOv2 backbone): detection metrics
Evaluated on 42 images / 52 annotations (COCO categories: lenscrafters, lufthansa, mediamarkt, meta, nuance, oakley, ray_ban)
Predictions: 6759 boxes loaded from predictions.json
Per-class Average Precision (%):
| class | AP@0.50 | AP@0.75 | AP@0.85 | AP@0.95 | mean |
|---|---|---|---|---|---|
| lenscrafters | 100.00 | 100.00 | 51.72 | 0.00 | 62.93 |
| lufthansa | 90.14 | 72.91 | 56.83 | 21.53 | 60.35 |
| mediamarkt | 82.58 | 82.58 | 74.59 | 2.48 | 60.56 |
| meta | 88.64 | 76.84 | 25.74 | 0.00 | 47.81 |
| nuance | 100.00 | 100.00 | 85.15 | 24.09 | 77.31 |
| oakley | 91.26 | 91.26 | 71.29 | 12.97 | 66.70 |
| ray_ban | 100.00 | 69.41 | 67.33 | 0.00 | 59.18 |
| mAP | 93.23 | 84.71 | 61.81 | 8.72 | 62.12 |
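The mAP row and the mean column are plain arithmetic means of the per-class values, which can be checked from the table itself. The sketch below copies the Stage 2 numbers from the table above; small differences in the last digit are possible because the table entries are already rounded.

```python
# Per-class AP (%) at IoU 0.50, 0.75, 0.85, 0.95, copied from the Stage 2 table.
stage2_ap = {
    "lenscrafters": (100.00, 100.00, 51.72, 0.00),
    "lufthansa": (90.14, 72.91, 56.83, 21.53),
    "mediamarkt": (82.58, 82.58, 74.59, 2.48),
    "meta": (88.64, 76.84, 25.74, 0.00),
    "nuance": (100.00, 100.00, 85.15, 24.09),
    "oakley": (91.26, 91.26, 71.29, 12.97),
    "ray_ban": (100.00, 69.41, 67.33, 0.00),
}

# mAP row: average over classes, separately for each IoU threshold.
map_per_iou = [sum(col) / len(col) for col in zip(*stage2_ap.values())]
# mean column: average over the four IoU thresholds, separately for each class.
mean_per_class = {name: sum(aps) / len(aps) for name, aps in stage2_ap.items()}
# Bottom-right cell: average of the four mAP values.
overall = sum(map_per_iou) / len(map_per_iou)

print([f"{v:.2f}" for v in map_per_iou])
print(f"overall mean = {overall:.2f}")
```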
pycocotools summary (using our custom IoU thresholds):
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.621
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.932
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.847
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.547
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.576
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.605
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.680
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.703
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.750
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.667
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.648
Files in this repo
| Path | Purpose |
|---|---|
| checkpoint_best_ema.pth | RF-DETR-Small best-EMA checkpoint (Stage 2, ~370 MB) |
| stage1_gallery/embeddings.pt | Per-class OWLv2 mean image embedding (Stage 1) |
| stage1_gallery/metadata.json | Stage 1 gallery metadata (classes, exemplar counts) |
| README.md | This model card |
Evaluation results
- mAP@0.50 on brand_detection hf_dataset (42 human-labeled images), self-reported: 93.23
- mAP@0.75 on brand_detection hf_dataset (42 human-labeled images), self-reported: 84.71
- mAP@0.85 on brand_detection hf_dataset (42 human-labeled images), self-reported: 61.81
- mAP@0.95 on brand_detection hf_dataset (42 human-labeled images), self-reported: 8.72