
RAG Service

Habits · LlamaIndex, Groq, Llama 3.1, Hugging Face Embeddings, FastAPI


Problem to Solve

This microservice powers a question-answering assistant inside the Habits app. It answers questions about nutrition and training using your own documents (e.g. PDFs such as guides, articles, and recipes).

Users type questions in natural language and get answers based on the content of those PDFs, not only the model's built-in knowledge. The service keeps conversation history for follow-up questions in the same session.

1. How does it work?

The dashboard has a chat bar; the frontend calls the RAG Service with POST /chat. The backend uses LlamaIndex to index PDFs, retrieve relevant chunks, and generate answers with Groq (Llama 3.1).

Index — PDFs in data_source are read with SimpleDirectoryReader and pypdf, split into chunks, and embedded locally with a Hugging Face model (e.g. BGE); the resulting index is persisted in storage.
Retrieval — For each question, the most relevant chunks are retrieved from the index by embedding similarity.
Generation — The retrieved context plus the chat history are sent to the LLM (Groq with Llama 3.1), which generates the answer (RAG = Retrieval-Augmented Generation).
Response — A JSON body (response) is returned to the frontend and the history is updated, so answers stay grounded in the uploaded documents.
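The index/retrieve/generate flow above can be sketched end to end with LlamaIndex. This is a minimal sketch, not the project's actual code: the directory names (data_source, storage) come from the text, but the function name and the sample question are illustrative assumptions.

```python
# Sketch of the indexing + chat flow: build the vector index from PDFs
# (or reload the persisted one), then answer questions against it.
import os

DATA_DIR = os.getenv("DATA_SOURCE_DIR", "data_source")   # PDFs live here
PERSIST_DIR = os.getenv("STORAGE_DIR", "storage")        # index persisted here


def build_or_load_index():
    # Heavy deps imported lazily so the module loads without them installed.
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if os.path.isdir(PERSIST_DIR) and os.listdir(PERSIST_DIR):
        # Reuse the persisted index instead of re-embedding every PDF.
        storage = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(storage)

    docs = SimpleDirectoryReader(DATA_DIR).load_data()   # pypdf under the hood
    index = VectorStoreIndex.from_documents(docs)        # chunk + embed
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    return index


if __name__ == "__main__":
    index = build_or_load_index()
    engine = index.as_chat_engine(chat_mode="context")   # retrieval + history
    print(engine.chat("How much protein should I eat after training?").response)
```

Persisting the index means embeddings are computed once at startup (or offline) rather than on every request.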

2. Implementation

LlamaIndex orchestrates the RAG pipeline (index, retrieval, chat engine); Groq serves Llama 3.1 to generate the answers; Hugging Face embeddings build the index locally; FastAPI and Pydantic expose the API.

2.1 API

FastAPI exposes a single endpoint, POST /chat, whose body carries message and, optionally, chat_history (a list of role + content pairs). Requests and responses are validated with Pydantic models (ChatRequest, ChatResponse, ChatMessage). Uvicorn serves the app as the ASGI server, and python-dotenv loads environment variables.

2.2 RAG with LlamaIndex

LlamaIndex (llama-index-core, llama-index-readers-file) reads the PDFs, builds the vector index, and exposes a chat engine with memory. The retriever returns the most relevant chunks; the chat engine combines that context with the conversation history and calls the LLM.
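The "chat engine with memory" part can be sketched as below. The helper name and token limit are assumptions; ChatMemoryBuffer and the "context" chat mode are standard LlamaIndex pieces.

```python
def make_chat_engine(index, token_limit: int = 3000):
    # Lazy import: llama-index is a heavy dependency of this sketch.
    from llama_index.core.memory import ChatMemoryBuffer

    # The buffer keeps recent turns (up to token_limit tokens) so
    # follow-up questions in the same session have context.
    memory = ChatMemoryBuffer.from_defaults(token_limit=token_limit)

    # "context" mode retrieves the top chunks for each message and injects
    # them, together with the buffered history, into the prompt.
    return index.as_chat_engine(chat_mode="context", memory=memory)
```

With the memory buffer in place, a question like "and how much of that is carbs?" resolves against the previous turn instead of arriving context-free.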

2.3 LLM & Embeddings

Groq serves Llama 3.1 (via llama-index-llms-groq) for fast answers. Hugging Face embeddings (e.g. BAAI/bge-small-en-v1.5) run locally via llama-index-embeddings-huggingface, so no external embedding API is needed.
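Wiring both models into LlamaIndex typically goes through the global Settings object. A sketch, assuming the embedding model from the text and one of Groq's Llama 3.1 model names ("llama-3.1-8b-instant"); swap in whatever your deployment actually uses.

```python
def configure_models():
    # Lazy imports: these integration packages are heavy optional deps.
    import os

    from llama_index.core import Settings
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.groq import Groq

    # Embeddings run locally -- no external embedding API or key needed.
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    # Groq needs only its API key; model name is an assumed example.
    Settings.llm = Groq(model="llama-3.1-8b-instant",
                        api_key=os.environ["GROQ_API_KEY"])
```

Once set, every VectorStoreIndex and chat engine created afterwards picks up these models without further configuration.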

2.4 Deployment

PDFs are read with pypdf through the LlamaIndex reader; the data_source and storage directories are configurable via environment variables. The service runs as a standalone Python microservice (Docker) and requires GROQ_API_KEY; the frontend reaches it through VITE_RAG_API_URL.
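The environment wiring can be sketched as follows. GROQ_API_KEY and VITE_RAG_API_URL come from the text; the directory variable names are illustrative assumptions.

```python
# Environment wiring sketch: python-dotenv for local development,
# GROQ_API_KEY as a hard requirement, directories overridable via env.
import os


def load_settings() -> dict:
    # Lazy import; load_dotenv() reads a local .env if present, else no-op.
    from dotenv import load_dotenv
    load_dotenv()

    return {
        "groq_api_key": os.environ["GROQ_API_KEY"],  # raises KeyError if unset
        "data_source_dir": os.getenv("DATA_SOURCE_DIR", "data_source"),
        "storage_dir": os.getenv("STORAGE_DIR", "storage"),
    }
```

Failing fast on a missing GROQ_API_KEY at startup beats discovering it on the first /chat request.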

Key Concepts

LlamaIndex

Orchestrates RAG: document loading, vector index, retriever, and chat engine with memory.

RAG

Retrieval-Augmented Generation: retrieve relevant chunks from your PDFs, then generate answers with the LLM using that context.

Groq & Llama 3.1

Fast inference for the chat engine; no OpenAI/Anthropic cost for the end user.

Hugging Face Embeddings

Local embedding model (e.g. BGE) builds the vector index without an external API key.