Zero-Shot Object Detection Service
Habits · PyTorch, TorchVision, Transformers, Grounding DINO, FastAPI
Problem to Solve
This microservice lets Habits app users upload a photo of their meal (plate or tray) and automatically get a list of detected ingredients/foods (labels in English, translated to Spanish on the frontend), plus an optional annotated image showing where each ingredient is located.
The model is not trained specifically on those meals: it uses zero-shot, text-guided (vision–language) detection. Users can "log food" with a photo and have the app suggest ingredients to confirm or edit before saving.
1. How does it work?
The user attaches a photo in the Nutrition section; the frontend calls the Nutri-AI
Backend at /detect (ingredient list) and/or /detect/image (segmented JPEG). The
backend runs Grounding DINO with text-defined ingredient categories.
| Step | Description |
|---|---|
| 1 | User chooses "Log meal" and optionally "Attach photo" in the Nutrition section. |
| 2 | Frontend sends the image to /detect (ingredient list) and/or /detect/image (segmented JPEG). |
| 3 | Backend loads Grounding DINO (Hugging Face Transformers): text-guided object detection with natural-language categories. |
| 4 | Ingredient categories (e.g. rice, egg, chicken, lettuce, bread) are defined in text; model returns bounding boxes with label and score. |
| 5 | Thresholds (score, NMS) are applied; list and/or image is returned; frontend translates labels and lets the user confirm or remove before saving. |
The whole pipeline is zero-shot: no need to train the model on "my dishes"; it is enough to describe in text what you want to detect.
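Steps 3–5 above can be sketched with the Hugging Face Transformers API. This is an illustrative sketch, not the service's actual code: the checkpoint ID and the 0.35/0.25 thresholds are assumptions (the real values live in detection/config.py).

```python
from PIL import Image

# Assumed checkpoint; the service's actual model ID is configured elsewhere.
MODEL_ID = "IDEA-Research/grounding-dino-tiny"


def build_prompt(categories):
    """Grounding DINO expects lowercase category names separated by periods."""
    return ". ".join(c.lower() for c in categories) + "."


def detect_ingredients(image, categories):
    """Run zero-shot, text-guided detection and return label/score/box dicts."""
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID)

    inputs = processor(images=image, text=build_prompt(categories), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Step 5: score thresholds filter weak matches (values here are illustrative).
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,
        text_threshold=0.25,
        target_sizes=[image.size[::-1]],  # (height, width)
    )[0]
    return [
        {"label": label, "score": round(score.item(), 3), "box": [round(v) for v in box.tolist()]}
        for label, score, box in zip(results["labels"], results["scores"], results["boxes"])
    ]


if __name__ == "__main__":
    image = Image.open("meal.jpg").convert("RGB")
    print(detect_ingredients(image, ["rice", "egg", "chicken", "lettuce", "bread"]))
```

The key design point is that the "classes" are just the text prompt: changing the detectable ingredients means editing a list of words, not retraining anything.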
2. Implementation
The service uses PyTorch and TorchVision for computation, Transformers and Grounding DINO for zero-shot detection, and FastAPI to expose the API.
2.1 API
FastAPI with endpoints /detect (JSON: label, score, box per ingredient) and
/detect/image (segmented JPEG). The image arrives via multipart form data or the
request body; Pillow opens it and passes it to the model.
2.2 Model
Grounding DINO via Transformers (Hugging Face):
AutoModelForZeroShotObjectDetection and AutoProcessor. Model is loaded on demand
(lazy) on the first request and runs with PyTorch on CPU or GPU.
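Lazy loading can be done with a cached factory so the first request pays the download/initialization cost once. A minimal sketch, assuming the model ID shown (the real one is configured in detection/config.py):

```python
from functools import lru_cache

MODEL_ID = "IDEA-Research/grounding-dino-tiny"  # assumed checkpoint


@lru_cache(maxsize=1)
def get_model():
    """Load processor and model exactly once, on the first request that needs them."""
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    # Prefer GPU when available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID).to(device)
    model.eval()
    return processor, model, device
```

`lru_cache(maxsize=1)` means subsequent requests reuse the same processor/model pair instead of reloading weights.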
2.3 Stack & Configuration
PyTorch and TorchVision for tensors and images;
Transformers for model and processor; Pillow for I/O;
FastAPI + Uvicorn; python-multipart. Parameters like
BOX_THRESHOLD, TEXT_THRESHOLD, and model ID are in detection/config.py.
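A detection/config.py along these lines would hold those parameters. Only the names BOX_THRESHOLD, TEXT_THRESHOLD, and the idea of a configurable model ID come from the text; the environment-variable names and default values are assumptions.

```python
# detection/config.py — illustrative sketch
import os

# Hugging Face model ID for the zero-shot detector (default is an assumption).
MODEL_ID = os.getenv("NUTRI_AI_MODEL_ID", "IDEA-Research/grounding-dino-tiny")

# Minimum box confidence and text-match confidence for keeping a detection.
BOX_THRESHOLD = float(os.getenv("NUTRI_AI_BOX_THRESHOLD", "0.35"))
TEXT_THRESHOLD = float(os.getenv("NUTRI_AI_TEXT_THRESHOLD", "0.25"))
```

Keeping these in one module lets thresholds be tuned per deployment without touching the detection code.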
2.4 Deployment
Standalone Python service (Docker); the model is downloaded from Hugging Face on first use. The frontend uses
VITE_NUTRI_AI_API_URL to call this microservice.
Key Concepts
Grounding DINO
Text-guided object detection model that detects objects described in natural language without class-specific training.
Zero-Shot
No training on "my dishes"; ingredient categories are defined in text and the model generalizes from vision–language pretraining.
PyTorch & Transformers
Model runs on PyTorch; Hugging Face Transformers provides AutoModelForZeroShotObjectDetection
and processor.
FastAPI
Exposes /detect and /detect/image; receives images via multipart and returns JSON
or JPEG.