Zero-Shot Object Detection Service
Habits · PyTorch, TorchVision, Transformers, Grounding DINO, FastAPI
Problem to Solve
This microservice lets Habits app users upload a photo of their meal (plate or tray) and automatically get a list of detected ingredients/foods (labels in English, translated to Spanish on the frontend), plus an optional annotated image showing where each ingredient is located.
The model is not trained specifically on those meals: it uses zero-shot, text-guided (vision–language) detection. Users can "log food" with a photo and have the app suggest ingredients to confirm or edit before saving.
1. How does it work?
The user attaches a photo in the Nutrition section; the frontend calls the Nutri-AI
Backend at /detect (ingredient list) and/or /detect/image (segmented JPEG). The
backend runs Grounding DINO with text-defined ingredient categories.
| Step | Description |
|---|---|
| 1 | User chooses "Log meal" and optionally "Attach photo" in the Nutrition section. |
| 2 | Frontend sends the image to /detect (ingredient list) and/or /detect/image (segmented JPEG). |
| 3 | Backend loads Grounding DINO (Hugging Face Transformers): text-guided object detection with natural-language categories. |
| 4 | Ingredient categories (e.g. rice, egg, chicken, lettuce, bread) are defined in text; model returns bounding boxes with label and score. |
| 5 | Thresholds (score, NMS) are applied; list and/or image is returned; frontend translates labels and lets the user confirm or remove before saving. |
The whole pipeline is zero-shot: no need to train the model on "my dishes"; it is enough to describe in text what you want to detect.
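Steps 3–5 above can be sketched with the Hugging Face Transformers API. This is an illustrative sketch, not the service's actual code: the checkpoint ID and the 0.35/0.25 thresholds are assumptions (the real values live in detection/config.py).

```python
from PIL import Image

# Assumed checkpoint; the service's actual model ID is configured elsewhere.
MODEL_ID = "IDEA-Research/grounding-dino-tiny"


def build_prompt(categories):
    """Grounding DINO expects lowercase category names separated by periods."""
    return ". ".join(c.lower() for c in categories) + "."


def detect_ingredients(image, categories):
    """Run zero-shot, text-guided detection and return label/score/box dicts."""
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID)

    inputs = processor(images=image, text=build_prompt(categories), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Step 5: score thresholds filter weak matches (values here are illustrative).
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,
        text_threshold=0.25,
        target_sizes=[image.size[::-1]],  # (height, width)
    )[0]
    return [
        {"label": label, "score": round(score.item(), 3), "box": [round(v) for v in box.tolist()]}
        for label, score, box in zip(results["labels"], results["scores"], results["boxes"])
    ]


if __name__ == "__main__":
    image = Image.open("meal.jpg").convert("RGB")
    print(detect_ingredients(image, ["rice", "egg", "chicken", "lettuce", "bread"]))
```

The key design point is that the "classes" are just the text prompt: changing the detectable ingredients means editing a list of words, not retraining anything.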
2. Implementation
The service uses PyTorch and TorchVision for computation, Transformers and Grounding DINO for zero-shot detection, and FastAPI to expose the API.
2.1 API
FastAPI with endpoints /detect (JSON: label, score, box per ingredient) and
/detect/image (segmented JPEG). The image arrives via multipart form data or the
request body; Pillow opens it and passes it to the model.
2.2 Model
Grounding DINO via Transformers (Hugging Face):
AutoModelForZeroShotObjectDetection and AutoProcessor. Model is loaded on demand
(lazy) on the first request and runs with PyTorch on CPU or GPU.
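Lazy loading can be done with a cached factory so the first request pays the download/initialization cost once. A minimal sketch, assuming the model ID shown (the real one is configured in detection/config.py):

```python
from functools import lru_cache

MODEL_ID = "IDEA-Research/grounding-dino-tiny"  # assumed checkpoint


@lru_cache(maxsize=1)
def get_model():
    """Load processor and model exactly once, on the first request that needs them."""
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    # Prefer GPU when available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID).to(device)
    model.eval()
    return processor, model, device
```

`lru_cache(maxsize=1)` means subsequent requests reuse the same processor/model pair instead of reloading weights.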
2.3 Stack & Configuration
PyTorch and TorchVision for tensors and images;
Transformers for model and processor; Pillow for I/O;
FastAPI + Uvicorn; python-multipart. Parameters like
BOX_THRESHOLD, TEXT_THRESHOLD, and model ID are in detection/config.py.
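A detection/config.py along these lines would hold those parameters. Only the names BOX_THRESHOLD, TEXT_THRESHOLD, and the idea of a configurable model ID come from the text; the environment-variable names and default values are assumptions.

```python
# detection/config.py — illustrative sketch
import os

# Hugging Face model ID for the zero-shot detector (default is an assumption).
MODEL_ID = os.getenv("NUTRI_AI_MODEL_ID", "IDEA-Research/grounding-dino-tiny")

# Minimum box confidence and text-match confidence for keeping a detection.
BOX_THRESHOLD = float(os.getenv("NUTRI_AI_BOX_THRESHOLD", "0.35"))
TEXT_THRESHOLD = float(os.getenv("NUTRI_AI_TEXT_THRESHOLD", "0.25"))
```

Keeping these in one module lets thresholds be tuned per deployment without touching the detection code.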
2.4 Deployment
Standalone Python service (Docker); the model is downloaded from Hugging Face on first use. The frontend uses
VITE_NUTRI_AI_API_URL to call this microservice.
Key Concepts
Grounding DINO
Text-guided object detection model that detects objects described in natural language without class-specific training.
Zero-Shot
No training on "my dishes"; ingredient categories are defined in text and the model generalizes from vision–language pretraining.
PyTorch & Transformers
Model runs on PyTorch; Hugging Face Transformers provides AutoModelForZeroShotObjectDetection
and processor.
FastAPI
Exposes /detect and /detect/image; receives images via multipart and returns JSON
or JPEG.