Problem to Solve
This project implements a book recommendation system that, given a specific book, suggests 5 similar books based on user rating patterns. The system uses the Book-Crossings dataset, which contains over 1.1 million ratings (scale 1-10) of 270,000 books by 90,000 users.
The objective is to develop a recommendation system using the NearestNeighbors algorithm from scikit-learn, which uses a distance metric to measure the "closeness" between instances. The system works through collaborative filtering: it analyzes how users have rated different books to find similarity patterns.
Model used: NearestNeighbors with cosine similarity
1. Dataset Information
Two CSV datasets were imported from the Book-Crossings dataset, containing information about books and user ratings.
Dataset 1: BX-Books.csv
Contains: Book information
- isbn: International Standard Book Number (unique identifier)
- title: Book title
- author: Book author
Dataset 2: BX-Book-Ratings.csv
Contains: 1,149,780 ratings
- user: User identifier
- isbn: Book identifier
- rating: User rating on a 1-10 scale (a value of 0 marks an implicit interaction with no explicit rating, as in the example rows below)
Dataset Examples
Books Dataset (first rows)
| isbn | title | author |
|---|---|---|
| 0195153448 | Classical Mythology | Mark P. O. Morford |
| 0002005018 | Clara Callan | Richard Bruce Wright |
Ratings Dataset (first rows)
| user | isbn | rating |
|---|---|---|
| 276725 | 034545104X | 0 |
| 276726 | 0155061224 | 5 |
2. Data Wrangling - Dataset Union
Both datasets were imported and merged to obtain a complete view of books and their ratings. The books dataset was joined with the ratings dataset using ISBN as the key.
Dataset Import
```python
import pandas as pd

df_books = pd.read_csv(
    books_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})
```
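The ISBN join described in this section is not shown as code; here is a minimal sketch using hypothetical toy frames standing in for the real CSVs (the merge key and column names follow the datasets above):

```python
import pandas as pd

# Toy stand-ins for df_books and df_ratings (hypothetical data
# taken from the example rows shown earlier)
df_books = pd.DataFrame({
    'isbn': ['0195153448', '0002005018'],
    'title': ['Classical Mythology', 'Clara Callan'],
    'author': ['Mark P. O. Morford', 'Richard Bruce Wright'],
})
df_ratings = pd.DataFrame({
    'user': [276725, 276726],
    'isbn': ['0195153448', '0002005018'],
    'rating': [0.0, 5.0],
})

# Join ratings with book metadata using ISBN as the key
df_merged = pd.merge(df_ratings, df_books, on='isbn', how='left')
print(df_merged[['user', 'title', 'rating']])
```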
3. Data Processing
3.1 Missing Value Treatment
Null values were identified and removed from the books dataset to ensure data quality.
```python
df_books.isnull().sum()
df_books.dropna(inplace=True)
```
3.2 Data Filtering
To improve recommendation quality and reduce noise in the data, two filters were applied:
User Filter
Only users with 200 or more ratings were kept. This ensures recommendations are based on users with sufficient rating history.
```python
user_counts = df_ratings['user'].value_counts()
users_to_keep = user_counts[user_counts >= 200].index
df_filtered_users = df_ratings[df_ratings['user'].isin(users_to_keep)]
```
Book Filter
Only books with 100 or more ratings were kept. This guarantees that recommended books have sufficient information to calculate reliable similarities.
```python
books_counts = df_ratings['isbn'].value_counts()
books_to_keep = books_counts[books_counts >= 100].index
df_filtered = df_filtered_users[df_filtered_users['isbn'].isin(books_to_keep)]
```
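The two filters can be exercised end to end on a toy frame; the thresholds are lowered to 2 here (instead of 200 and 100) so the effect is visible on a handful of hypothetical rows:

```python
import pandas as pd

# Toy ratings frame (hypothetical users and ISBNs)
df_ratings = pd.DataFrame({
    'user': [1, 1, 1, 2, 3, 3],
    'isbn': ['A', 'B', 'C', 'A', 'A', 'B'],
    'rating': [5, 3, 4, 2, 5, 4],
})

# Keep users with at least 2 ratings (here: users 1 and 3)
user_counts = df_ratings['user'].value_counts()
users_to_keep = user_counts[user_counts >= 2].index
df_filtered_users = df_ratings[df_ratings['user'].isin(users_to_keep)]

# Keep books with at least 2 ratings overall (here: A and B)
books_counts = df_ratings['isbn'].value_counts()
books_to_keep = books_counts[books_counts >= 2].index
df_filtered = df_filtered_users[df_filtered_users['isbn'].isin(books_to_keep)]

print(df_filtered)
```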
3.3 Rating Matrix Construction
A pivot matrix was created where rows represent books (indexed by ISBN) and columns represent users. Values are ratings (0 if the user didn't rate the book).
```python
df = df_filtered.pivot_table(
    index=['user'],
    columns=['isbn'],
    values='rating'
).fillna(0).T
```
The matrix is transposed (`.T`) so that books are rows and users are columns, making it easier to calculate similarities between books.
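A minimal sketch of the pivot-and-transpose step on hypothetical data:

```python
import pandas as pd

# Toy filtered ratings (hypothetical)
df_filtered = pd.DataFrame({
    'user': [1, 1, 2, 2, 3],
    'isbn': ['A', 'B', 'A', 'C', 'B'],
    'rating': [5.0, 3.0, 4.0, 2.0, 1.0],
})

# Pivot to a user-by-book grid, fill missing ratings with 0,
# then transpose so books become rows and users become columns
df = df_filtered.pivot_table(
    index=['user'],
    columns=['isbn'],
    values='rating'
).fillna(0).T

print(df)  # rows: ISBNs A, B, C; columns: users 1, 2, 3
```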
3.4 Index Replacement
ISBN indices were replaced with book titles to facilitate result interpretation.
```python
df.index = df.join(df_books.set_index('isbn'))['title']
```
This allows the recommendation function to work directly with book titles instead of ISBN codes.
4. Model Training - NearestNeighbors
A NearestNeighbors model was trained using the cosine distance metric, which is well suited to high-dimensional, sparse data such as rating matrices. This section explains the mathematical foundations of the algorithm.
4.1 Understanding the Vector Space Model
In this recommendation system, each book is represented as a vector in a high-dimensional space. Each dimension corresponds to a user, and the value in that dimension is the rating that user gave to the book (0 if they didn't rate it).
Book Vector Representation
For example, if we have 3 users and Book A received ratings [5, 0, 8], this means:
- User 1 rated Book A as 5
- User 2 didn't rate Book A (value is 0)
- User 3 rated Book A as 8
In our actual dataset, each book vector has thousands of dimensions (one per user), making it a high-dimensional space where most values are 0 (sparse data).
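The three-user example can be written out directly (a toy vector, not real dataset values):

```python
import numpy as np

# Book A's rating vector over 3 users:
# user 1 -> 5, user 2 -> no rating (0), user 3 -> 8
book_a = np.array([5.0, 0.0, 8.0])

# In the real matrix there is one dimension per user,
# and the fraction of zero entries (sparsity) is very high
sparsity = np.mean(book_a == 0)
print(sparsity)  # fraction of users who did not rate Book A
```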
4.2 Why Cosine Similarity? The Mathematical Insight
After analyzing different distance metrics, cosine similarity was chosen over Euclidean distance.
Cosine Distance vs Euclidean Distance
Euclidean Distance
Problem: Measures absolute differences. If User A rates books as [8, 9, 8] and User B rates the same books as [3, 4, 3], Euclidean distance would show them as very different, even though their preference patterns are identical!
Cosine Similarity ✓
Solution: Measures the angle between vectors, not their magnitude. This captures similarity in preference patterns regardless of rating scale differences between users.
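Both metrics can be computed for the [8, 9, 8] vs [3, 4, 3] example above to make the contrast concrete:

```python
import numpy as np

# Two rating vectors with identical preference patterns
# but different rating scales
a = np.array([8.0, 9.0, 8.0])
b = np.array([3.0, 4.0, 3.0])

# Euclidean distance: sensitive to absolute magnitude
euclidean = np.linalg.norm(a - b)

# Cosine similarity: sensitive only to direction
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(euclidean, 2))   # large: the vectors are far apart in absolute terms
print(round(cosine_sim, 3))  # near 1: the vectors point in almost the same direction
```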
4.3 Deep Dive into the Cosine Similarity Formula
The Complete Formula
cos(θ) = (A · B) / (||A|| × ||B||)
Where:
- A and B are rating vectors of two books
- A · B is the dot product (sum of element-wise products)
- ||A|| and ||B|| are the magnitudes (lengths) of the vectors
- θ is the angle between the vectors
Breaking Down Each Component
1. Dot Product (A · B)
A · B = Σ(aᵢ × bᵢ)
This measures how much the two books' ratings align. Higher values mean more users rated both books similarly. For sparse data (many zeros), this naturally focuses on users who rated both books.
Example:
Book A: [5, 0, 8, 0, 3]
Book B: [4, 0, 9, 0, 2]
A · B = (5×4) + (0×0) + (8×9) + (0×0) + (3×2) = 20 + 0 + 72 + 0 + 6 = 98
Notice how zeros don't contribute - only users who rated both books matter!
2. Vector Magnitude (||A||)
||A|| = √(Σaᵢ²) = √(a₁² + a₂² + ... + aₙ²)
This is the length of the vector, representing the "total rating strength" of the book. It accounts for how many ratings the book received and their values.
Example:
Book A: [5, 0, 8, 0, 3]
||A|| = √(5² + 0² + 8² + 0² + 3²) = √(25 + 0 + 64 + 0 + 9) = √98 ≈ 9.90
3. The Division: Normalization
cos(θ) = (A · B) / (||A|| × ||B||)
Dividing by the product of magnitudes normalizes the similarity. This is crucial because:
- It removes the effect of different numbers of ratings (a book with 1000 ratings vs 10 ratings)
- It removes the effect of different rating scales (a generous user vs a harsh user)
- It focuses purely on the direction of preference, not magnitude
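Putting the three components together for the Book A and Book B vectors used above:

```python
import numpy as np

book_a = np.array([5.0, 0.0, 8.0, 0.0, 3.0])
book_b = np.array([4.0, 0.0, 9.0, 0.0, 2.0])

dot = np.dot(book_a, book_b)          # 98, as computed above
norm_a = np.linalg.norm(book_a)       # sqrt(98)  ~ 9.90
norm_b = np.linalg.norm(book_b)       # sqrt(101) ~ 10.05
cosine_sim = dot / (norm_a * norm_b)  # ~ 0.985: very similar books

print(round(cosine_sim, 3))
```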
4.4 Geometric Intuition: Why Angle Matters
The cosine similarity literally measures the angle between two vectors in the high-dimensional space. This geometric interpretation reveals why it's perfect for recommendation systems:
Visual Understanding
- θ ≈ 0° (cos ≈ 1): Vectors point in the same direction → Books have identical rating patterns → Highly similar
- θ ≈ 90° (cos ≈ 0): Vectors are perpendicular → No correlation in ratings → Unrelated
- θ ≈ 180° (cos ≈ -1): Vectors point in opposite directions → Users who like one dislike the other → Opposite preferences
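These three cases can be checked numerically with toy vectors. (Note that because ratings are non-negative, real book vectors never produce negative cosine values; the anti-parallel case is shown only for completeness.)

```python
import numpy as np

def cos(a, b):
    # Cosine of the angle between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

same      = cos(np.array([1.0, 2.0]), np.array([2.0, 4.0]))    # parallel
unrelated = cos(np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # perpendicular
opposite  = cos(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))  # anti-parallel

print(same, unrelated, opposite)  # 1.0, 0.0, -1.0
```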
Real-World Example
Consider two books:
- Book X: Rated by users as [8, 9, 8, 0, 0]
- Book Y: Rated by users as [3, 4, 3, 0, 0]
Euclidean distance would show them as very different (large numerical differences).
Cosine similarity recognizes they have the same pattern: users 1, 2, and 3 prefer both books similarly, and users 4 and 5 haven't rated either. The angle between these vectors is small, indicating high similarity!
4.5 Implementation
Model Training Code
```python
from sklearn.neighbors import NearestNeighbors

# Initialize model with cosine metric
model = NearestNeighbors(metric='cosine')

# Fit the model to the rating matrix
# Each row in df.values is a book vector
model.fit(df.values)
```
The model learns the geometric structure of the book vectors in the high-dimensional space. When we query for similar books, it finds those with the smallest angles (highest cosine similarity) to the query book.
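A self-contained sketch of the query step on a hypothetical 3-book, 4-user matrix (not the real data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy book-by-user matrix: rows are books, columns are users
ratings = np.array([
    [8.0, 9.0, 8.0, 0.0],  # book 0
    [3.0, 4.0, 3.0, 0.0],  # book 1: same pattern as book 0, different scale
    [0.0, 0.0, 1.0, 9.0],  # book 2: very different pattern
])

model = NearestNeighbors(metric='cosine')
model.fit(ratings)

# Query with book 0: its nearest neighbor is itself (distance 0),
# then book 1, then book 2
distances, indices = model.kneighbors([ratings[0]], n_neighbors=3)
print(indices[0], distances[0])
```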
5. Recommendation Function
The get_recommends function was implemented to generate book recommendations.
Function Implementation
```python
def get_recommends(book_title=""):
    # Get the rating vector for the requested book
    book = df.loc[book_title]

    # Find the 6 nearest neighbors (including the book itself)
    distances, indices = model.kneighbors([book.values], n_neighbors=6)

    # Pair each neighbor's title with its cosine distance
    recommended_books = pd.DataFrame({
        'title': df.iloc[indices[0]].index.values,
        'distance': distances[0]
    }) \
        .sort_values(by='distance', ascending=False) \
        .head(5).values

    # Sorting in descending order pushes the queried book itself
    # (distance 0) to the end, so head(5) keeps only the recommendations
    lista = [book_title, recommended_books]
    return lista
```
How the function works:
- Receives a book title as argument
- Obtains the book's rating vector from the matrix
- Searches for the 6 nearest books (including the book itself)
- Sorts results by distance (highest to lowest)
- Selects the 5 most similar books (excluding the original book)
- Returns a list with the queried book title and an array with the 5 recommended books along with their distances
6. Results
The system was successfully tested with different books. Here are some examples of recommendations generated:
Example 1
"Where the Heart Is (Oprah's Book Club (Paperback))"
Example 2
"The Queen of the Damned (Vampire Chronicles (Paperback))"
Results Interpretation
Lower distances indicate greater similarity: cosine distance is defined as 1 − cosine similarity, so values near 0 correspond to nearly identical rating patterns. The system successfully identifies books that share similar rating patterns, meaning users who liked one book tend to like the recommended books as well.
Notice in Example 2 how the system correctly identifies other books from the same series ("Vampire Chronicles") and similar genres, demonstrating that the collaborative filtering approach effectively captures book similarities based on user preferences.
Key Concepts
Collaborative Filtering
Method that predicts a user's preferences based on the preferences of similar users. In this case, it finds books similar to a given book based on how users have rated them.
K-Nearest Neighbors
Algorithm that finds the K most similar items based on distance or similarity metrics. Here, it finds books with similar rating patterns.
Rating Matrix
Representation of user-book interactions that allows calculating similarities and generating recommendations. Each cell contains a user's rating for a book.
Cosine Similarity
Distance metric that measures the angle between two vectors, ideal for sparse and high-dimensional data like rating matrices.