Problem to Solve
This microservice provides a question-answering assistant inside the Habits app that answers questions about nutrition and training using your own documents: PDF guides, articles, and recipes.
Users type questions in natural language and get answers based on the content of those PDFs, not only the model's built-in knowledge. The service keeps conversation history so follow-up questions work within the same session.
1. How does it work?
The dashboard has a chat bar; the frontend calls the RAG Service with
POST /chat. The backend uses LlamaIndex to index PDFs, retrieve relevant chunks, and generate
answers with Groq (Llama 3.1).
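For example, a request to POST /chat might look like this (payload shapes follow the ChatRequest/ChatResponse models in section 2.1; the question and answer text are illustrative):

```json
{
  "message": "How much protein should I eat after a workout?",
  "chat_history": [
    { "role": "user", "content": "What should I eat before training?" },
    { "role": "assistant", "content": "A light carb-focused snack..." }
  ]
}
```

and the service replies with a single field:

```json
{ "response": "Based on the uploaded guides, aim for roughly 20-30 g of protein..." }
```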
| Stage | Description |
|---|---|
| Index | PDFs in data_source are read with SimpleDirectoryReader and pypdf, split into chunks, and embedded with a Hugging Face model (e.g. BGE) running locally; the index is persisted in storage. |
| Retrieval | For each question, the most relevant chunks are retrieved from the index by embedding similarity. |
| Generation | The retrieved context plus chat history is sent to the LLM (Groq with Llama 3.1), which generates the answer (RAG = Retrieval-Augmented Generation). |
| Response | JSON (response) is returned to the frontend; history is updated; answers stay grounded in the uploaded documents. |
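Conceptually, the Retrieval stage is a nearest-neighbor search: the question is embedded, then compared against each chunk embedding by cosine similarity. A minimal pure-Python sketch of that idea (in the real service, LlamaIndex and the BGE model produce the vectors; the 3-dimensional embeddings here are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, top_k=2):
    """Return the texts of the top_k chunks most similar to the query."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return [c["text"] for c in scored[:top_k]]

chunks = [
    {"text": "Protein intake after training...", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Recipe: oatmeal pancakes...",      "embedding": [0.1, 0.8, 0.2]},
    {"text": "Stretching routines...",           "embedding": [0.0, 0.2, 0.9]},
]

# A query vector close to the first chunk ranks it highest.
print(retrieve([0.85, 0.15, 0.05], chunks, top_k=2))
# → ['Protein intake after training...', 'Recipe: oatmeal pancakes...']
```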
2. Implementation
LlamaIndex orchestrates the RAG pipeline (index, retrieval, chat engine); Groq serves Llama 3.1 to generate answers; Hugging Face embeddings build the index locally; FastAPI and Pydantic expose the API.
2.1 API
FastAPI with endpoint POST /chat: the request body carries message and, optionally,
chat_history (a list of role + content messages). Request and response are validated with
Pydantic (ChatRequest, ChatResponse, ChatMessage).
Uvicorn serves as the ASGI server; python-dotenv loads environment variables.
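The three Pydantic models might look like the sketch below. Only the field names message, chat_history, role, content, and response come from the description above; defaults and type details are assumptions:

```python
from typing import List, Optional
from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str      # "user" or "assistant"
    content: str

class ChatRequest(BaseModel):
    message: str
    # chat_history is optional; omitted on the first turn of a session.
    chat_history: Optional[List[ChatMessage]] = None

class ChatResponse(BaseModel):
    response: str

# Validation example: bad payloads raise pydantic.ValidationError in FastAPI.
req = ChatRequest(
    message="How much protein after training?",
    chat_history=[ChatMessage(role="user", content="Hi")],
)
print(req.message)
```

In the service these models plug into the endpoint signature, e.g. a handler decorated with @app.post("/chat") that accepts a ChatRequest and returns a ChatResponse.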
2.2 RAG with LlamaIndex
LlamaIndex (e.g. llama-index-core, llama-index-readers-file) reads
PDFs, builds the vector index, and exposes a chat engine with memory. The
retriever returns the most relevant chunks; the chat engine combines context
+ history and calls the LLM.
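A sketch of the index-plus-chat-engine wiring, assuming the llama-index >= 0.10 package layout; the function name, directory defaults, and chat mode are illustrative choices, not confirmed details of this service. Imports are local to the function so the sketch loads even without llama-index installed:

```python
import os

def build_chat_engine(data_dir: str = "data_source", persist_dir: str = "storage"):
    """Build (or reload) the vector index and return a chat engine with memory."""
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        # Reload the persisted index instead of re-embedding every PDF.
        storage = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage)
    else:
        docs = SimpleDirectoryReader(data_dir).load_data()  # pypdf under the hood
        index = VectorStoreIndex.from_documents(docs)       # chunk + embed
        index.storage_context.persist(persist_dir=persist_dir)

    # "condense_plus_context" rewrites follow-up questions using the chat
    # history before retrieving, so the retriever sees a standalone question.
    return index.as_chat_engine(chat_mode="condense_plus_context")
```

Usage would then be along the lines of engine = build_chat_engine() followed by engine.chat(message), with the engine's memory carrying the session history.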
2.3 LLM & Embeddings
Groq with Llama 3.1 (llama-index-llms-groq) for fast answers.
Hugging Face embeddings (e.g. BAAI/bge-small-en-v1.5) via
llama-index-embeddings-huggingface, run locally; no external embedding API.
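Wiring both models into LlamaIndex's global Settings might look like this sketch. The Groq model id llama-3.1-8b-instant is an assumption (the text only says Llama 3.1); the embedding model name is the one mentioned above. Imports are local so the sketch loads without the packages installed:

```python
def configure_models():
    """Set Groq as the LLM and a local Hugging Face model for embeddings."""
    import os
    from llama_index.core import Settings
    from llama_index.llms.groq import Groq
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    # Requires GROQ_API_KEY in the environment (see Deployment).
    Settings.llm = Groq(
        model="llama-3.1-8b-instant",  # assumed model id
        api_key=os.environ["GROQ_API_KEY"],
    )
    # Embeddings run locally: no external embedding API, no extra key.
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```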
2.4 Deployment
PDFs are read with pypdf through the LlamaIndex reader; the data_source and
storage paths are configurable via environment variables. The service runs as a standalone Python microservice (Docker) and requires
GROQ_API_KEY. The frontend reads the service URL from VITE_RAG_API_URL.
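A deployment sketch: GROQ_API_KEY and VITE_RAG_API_URL are named above, but the path variable names, image name, and port are assumptions:

```bash
# .env for the RAG service
GROQ_API_KEY=gsk_...            # required
DATA_SOURCE_DIR=data_source     # assumed name for the PDF folder variable
STORAGE_DIR=storage             # assumed name for the persisted-index variable

# Build and run the microservice (image name and port are illustrative).
docker build -t habits-rag .
docker run --env-file .env -p 8000:8000 habits-rag
```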
Key Concepts
LlamaIndex
Orchestrates RAG: document loading, vector index, retriever, and chat engine with memory.
RAG
Retrieval-Augmented Generation: retrieve relevant chunks from your PDFs, then generate answers with the LLM using that context.
Groq & Llama 3.1
Fast inference for the chat engine; no OpenAI/Anthropic cost for the end user.
Hugging Face Embeddings
Local embedding model (e.g. BGE) builds the vector index without an external API key.