
Book Recommendation System

K-Nearest Neighbors and Collaborative Filtering

Tags: KNN, Recommendation Systems, Collaborative Filtering, scikit-learn, Pandas

Problem to Solve

This project implements a book recommendation system that, given a specific book, suggests 5 similar books based on user rating patterns. The system uses the Book-Crossings dataset, which contains over 1.1 million ratings (on a 0-10 scale, where 0 denotes an implicit interaction rather than an explicit rating) of 270,000 books by 90,000 users.

The objective is to develop a recommendation system using the NearestNeighbors algorithm from scikit-learn, which measures distance to determine the "closeness" between instances. The system works through collaborative filtering, analyzing how users have rated different books to find similarity patterns.

Model used: NearestNeighbors with cosine similarity

1. Dataset Information

Two CSV datasets were imported from the Book-Crossings dataset, containing information about books and user ratings.

Dataset 1: BX-Books.csv

Contains: Book information

  • isbn: International Standard Book Number (unique identifier)
  • title: Book title
  • author: Book author

Dataset 2: BX-Book-Ratings.csv

Contains: 1,149,780 ratings

  • user: User identifier
  • isbn: Book identifier
  • rating: User rating (0-10; 0 denotes an implicit interaction)

Dataset Examples

Books Dataset (first rows)

isbn title author
0195153448 Classical Mythology Mark P. O. Morford
0002005018 Clara Callan Richard Bruce Wright

Ratings Dataset (first rows)

user isbn rating
276725 034545104X 0
276726 0155061224 5

2. Data Wrangling - Dataset Union

Both datasets were imported and merged to obtain a complete view of books and their ratings. The books dataset was joined with the ratings dataset using ISBN as the key.

Dataset Import

import pandas as pd

df_books = pd.read_csv(
    books_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})
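The merge described above can be sketched with small in-memory frames standing in for the CSV files (the data here is illustrative; the column names match the real datasets):

```python
import pandas as pd

# Illustrative stand-ins for BX-Books.csv and BX-Book-Ratings.csv
df_books = pd.DataFrame({
    'isbn':   ['0195153448', '0002005018'],
    'title':  ['Classical Mythology', 'Clara Callan'],
    'author': ['Mark P. O. Morford', 'Richard Bruce Wright']})

df_ratings = pd.DataFrame({
    'user':   [276725, 276726],
    'isbn':   ['0195153448', '0002005018'],
    'rating': [0.0, 5.0]})

# Inner join on ISBN: each rating row gains the book's title and author
df_merged = pd.merge(df_ratings, df_books, on='isbn')
print(df_merged[['user', 'title', 'rating']])
```

An inner join keeps only ratings whose ISBN appears in the books table, which is the desired behavior here since ratings for unknown books cannot be recommended.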

3. Data Processing

3.1 Missing Value Treatment

Null values were identified and removed from the books dataset to ensure data quality.

df_books.isnull().sum()        # count missing values per column
df_books.dropna(inplace=True)  # drop rows with a missing title or author

3.2 Data Filtering

To improve recommendation quality and reduce noise in the data, two filters were applied:

User Filter

Only users with 200 or more ratings were kept. This ensures recommendations are based on users with sufficient rating history.

user_counts = df_ratings['user'].value_counts()
users_to_keep = user_counts[user_counts >= 200].index
df_filtered_users = df_ratings[df_ratings['user'].isin(users_to_keep)]

Book Filter

Only books with 100 or more ratings were kept. This guarantees that recommended books have sufficient information to calculate reliable similarities.

books_counts = df_ratings['isbn'].value_counts()
books_to_keep = books_counts[books_counts >= 100].index
df_filtered = df_filtered_users[df_filtered_users['isbn'].isin(books_to_keep)]
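The two-step filter can be seen on a toy ratings frame (thresholds lowered to 2 so the effect is visible; the real code uses 200 and 100):

```python
import pandas as pd

df_ratings = pd.DataFrame({
    'user':   [1, 1, 1, 2, 2, 3],
    'isbn':   ['a', 'b', 'c', 'a', 'b', 'a'],
    'rating': [5.0, 8.0, 3.0, 4.0, 9.0, 2.0]})

# Keep users with at least 2 ratings (drops user 3)
user_counts = df_ratings['user'].value_counts()
users_to_keep = user_counts[user_counts >= 2].index
df_filtered_users = df_ratings[df_ratings['user'].isin(users_to_keep)]

# Keep books with at least 2 ratings overall (drops book 'c')
book_counts = df_ratings['isbn'].value_counts()
books_to_keep = book_counts[book_counts >= 2].index
df_filtered = df_filtered_users[df_filtered_users['isbn'].isin(books_to_keep)]

print(df_filtered)
```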

3.3 Rating Matrix Construction

A pivot matrix was created where rows represent books (indexed by ISBN) and columns represent users. Values are ratings (0 if the user didn't rate the book).

df = df_filtered.pivot_table(
    index=['user'],
    columns=['isbn'],
    values='rating'
).fillna(0).T

The matrix is transposed (`.T`) so that books are rows and users are columns, making it easier to calculate similarities between books.
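A minimal sketch of the pivot-and-transpose step on toy data (the real code operates on the filtered ratings):

```python
import pandas as pd

df_filtered = pd.DataFrame({
    'user':   [1, 1, 2, 2],
    'isbn':   ['a', 'b', 'a', 'b'],
    'rating': [5.0, 8.0, 4.0, 9.0]})

# Pivot: users become rows and books columns, then transpose
# so that books are rows and users are columns
matrix = df_filtered.pivot_table(
    index=['user'], columns=['isbn'], values='rating').fillna(0).T

print(matrix)  # rows: isbn 'a' and 'b'; columns: users 1 and 2
```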

3.4 Index Replacement

ISBN indices were replaced with book titles to facilitate result interpretation.

df.index = df.join(df_books.set_index('isbn'))['title']

This allows the recommendation function to work directly with book titles instead of ISBN codes.
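The join trick can be reproduced on a toy matrix (a hedged sketch; the ISBNs and titles are illustrative):

```python
import pandas as pd

# Toy rating matrix: books (indexed by isbn) as rows, users as columns
df = pd.DataFrame({1: [5.0, 8.0], 2: [4.0, 9.0]}, index=['a', 'b'])
df.index.name = 'isbn'

df_books = pd.DataFrame({
    'isbn':  ['a', 'b'],
    'title': ['Classical Mythology', 'Clara Callan']})

# Join on the isbn index, then take the aligned 'title' column
# as the matrix's new index
df.index = df.join(df_books.set_index('isbn'))['title']
print(df.index.tolist())
```

Because `join` aligns on the index, each row keeps its own title even if the books table is in a different order.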

4. Model Training - NearestNeighbors

A NearestNeighbors model was trained using cosine distance metric, which is ideal for high-dimensional and sparse data like rating matrices. This section explains the mathematical foundations of the algorithm.

4.1 Understanding the Vector Space Model

In this recommendation system, each book is represented as a vector in a high-dimensional space. Each dimension corresponds to a user, and the value in that dimension is the rating that user gave to the book (0 if they didn't rate it).

Book Vector Representation

For example, if we have 3 users and Book A received ratings [5, 0, 8], this means:

  • User 1 rated Book A as 5
  • User 2 didn't rate Book A (value is 0)
  • User 3 rated Book A as 8

In our actual dataset, each book vector has thousands of dimensions (one per user), making it a high-dimensional space where most values are 0 (sparse data).

4.2 Why Cosine Similarity? The Mathematical Insight

After analyzing different distance metrics, cosine similarity was chosen over Euclidean distance.

Cosine Distance vs Euclidean Distance

Euclidean Distance
d = √[(x₁-y₁)² + (x₂-y₂)² + ... + (xₙ-yₙ)²]

Problem: Measures absolute differences. If User A rates books as [8, 9, 8] and User B rates the same books as [3, 4, 3], Euclidean distance would show them as very different, even though their preference patterns are identical!

Cosine Similarity ✓
cos(θ) = (A · B) / (||A|| × ||B||)

Solution: Measures the angle between vectors, not their magnitude. This captures similarity in preference patterns regardless of rating scale differences between users.
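The [8, 9, 8] vs [3, 4, 3] example can be checked numerically with a quick NumPy sketch:

```python
import numpy as np

a = np.array([8.0, 9.0, 8.0])
b = np.array([3.0, 4.0, 3.0])

# Euclidean distance: large, because the absolute values differ
euclidean = np.linalg.norm(a - b)

# Cosine similarity: near 1, because the preference pattern is the same
cosine_sim = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"euclidean = {euclidean:.2f}, cosine similarity = {cosine_sim:.4f}")
```

Euclidean distance comes out near 8.66 while the cosine similarity exceeds 0.99, confirming that the two rating patterns are nearly parallel.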

4.3 Deep Dive into the Cosine Similarity Formula

The Complete Formula

cos(θ) = (A · B) / (||A|| × ||B||)

Where:

  • A and B are rating vectors of two books
  • A · B is the dot product (sum of element-wise products)
  • ||A|| and ||B|| are the magnitudes (lengths) of the vectors
  • θ is the angle between the vectors

Breaking Down Each Component

1. Dot Product (A · B)

A · B = Σ(aᵢ × bᵢ)

This measures how much the two books' ratings align. Higher values mean more users rated both books similarly. For sparse data (many zeros), this naturally focuses on users who rated both books.

Example:

Book A: [5, 0, 8, 0, 3]

Book B: [4, 0, 9, 0, 2]

A · B = (5×4) + (0×0) + (8×9) + (0×0) + (3×2) = 20 + 0 + 72 + 0 + 6 = 98

Notice how zeros don't contribute - only users who rated both books matter!

2. Vector Magnitude (||A||)

||A|| = √(Σaᵢ²) = √(a₁² + a₂² + ... + aₙ²)

This is the length of the vector, representing the "total rating strength" of the book. It accounts for how many ratings the book received and their values.

Example:

Book A: [5, 0, 8, 0, 3]

||A|| = √(5² + 0² + 8² + 0² + 3²) = √(25 + 0 + 64 + 0 + 9) = √98 ≈ 9.90

3. The Division: Normalization

cos(θ) = (A · B) / (||A|| × ||B||)

Dividing by the product of magnitudes normalizes the similarity. This is crucial because:

  • It removes the effect of different numbers of ratings (a book with 1000 ratings vs 10 ratings)
  • It removes the effect of different rating scales (a generous user vs a harsh user)
  • It focuses purely on the direction of preference, not magnitude
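Putting the three components together for the Book A / Book B vectors used above:

```python
import numpy as np

book_a = np.array([5.0, 0.0, 8.0, 0.0, 3.0])
book_b = np.array([4.0, 0.0, 9.0, 0.0, 2.0])

dot = book_a.dot(book_b)           # 98.0 — the zeros contribute nothing
norm_a = np.linalg.norm(book_a)    # √98  ≈ 9.90
norm_b = np.linalg.norm(book_b)    # √101 ≈ 10.05
cosine_sim = dot / (norm_a * norm_b)

print(f"cos(θ) ≈ {cosine_sim:.3f}")
```

The result is roughly 0.985, a very small angle: the two books were rated almost identically by the users who rated both.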

4.4 Geometric Intuition: Why Angle Matters

The cosine similarity literally measures the angle between two vectors in the high-dimensional space. This geometric interpretation reveals why it's perfect for recommendation systems:

Visual Understanding

  • θ ≈ 0° (cos ≈ 1): Vectors point in the same direction → Books have identical rating patterns → Highly similar
  • θ ≈ 90° (cos ≈ 0): Vectors are perpendicular → No correlation in ratings → Unrelated
  • θ ≈ 180° (cos ≈ -1): Vectors point in opposite directions → Users who like one dislike the other → Opposite preferences

Real-World Example

Consider two books:

  • Book X: Rated by users as [8, 9, 8, 0, 0]
  • Book Y: Rated by users as [3, 4, 3, 0, 0]

Euclidean distance would show them as very different (large numerical differences).

Cosine similarity recognizes they have the same pattern: users 1, 2, and 3 prefer both books similarly, and users 4 and 5 haven't rated either. The angle between these vectors is small, indicating high similarity!

4.5 Implementation

Model Training Code

from sklearn.neighbors import NearestNeighbors

# Initialize model with cosine metric
model = NearestNeighbors(metric='cosine')

# Fit the model to the rating matrix
# Each row in df.values is a book vector
model.fit(df.values)

The model learns the geometric structure of the book vectors in the high-dimensional space. When we query for similar books, it finds those with the smallest angles (highest cosine similarity) to the query book.
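A toy query against a fitted model illustrates what `kneighbors` returns (illustrative data; the real matrix has hundreds of books and thousands of users):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Three toy book vectors (rows = books, columns = users)
matrix = np.array([
    [5.0, 0.0, 8.0],    # book 0
    [4.0, 0.0, 9.0],    # book 1: same pattern as book 0
    [0.0, 7.0, 0.0]])   # book 2: unrelated pattern

model = NearestNeighbors(metric='cosine')
model.fit(matrix)

# Query with book 0: distances come back sorted ascending,
# with book 0 itself first at distance 0
distances, indices = model.kneighbors([matrix[0]], n_neighbors=3)
print(indices[0], distances[0])
```

Book 1 appears at a tiny distance (nearly parallel vector), while book 2 sits at distance 1 (orthogonal, no shared raters).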

5. Recommendation Function

The get_recommends function was implemented to generate book recommendations.

Function Implementation

def get_recommends(book_title=""):
    # Get the rating vector for the requested book
    book = df.loc[book_title]

    # Find the 6 nearest neighbors (including the book itself)
    distances, indices = model.kneighbors([book.values], n_neighbors=6)

    # Pair each neighbor's title with its cosine distance
    recommended_books = pd.DataFrame({
      'title'   : df.iloc[indices[0]].index.values,
      'distance': distances[0]
    }) \
    .sort_values(by='distance', ascending=False) \
    .head(5).values

    # Sorting descending and keeping 5 rows drops the queried
    # book itself, which sits at distance 0 (last after the sort)
    return [book_title, recommended_books]

How the function works:

  1. Receives a book title as argument
  2. Obtains the book's rating vector from the matrix
  3. Searches for the 6 nearest books (the nearest is the queried book itself, at distance 0)
  4. Sorts results by distance, from farthest to nearest
  5. Keeps the first 5 rows, which excludes the queried book (it lands last after the descending sort)
  6. Returns a list with the queried book title and an array of the 5 recommended books with their cosine distances

6. Results

The system was successfully tested with different books. Here are some examples of recommendations generated:

Example 1

"Where the Heart Is (Oprah's Book Club (Paperback))"

Rank  Recommended Book               Cosine Distance
1     "I'll Be Seeing You"           0.80
2     "The Weight of Water"          0.77
3     "The Surgeon"                  0.77
4     "I Know This Much Is True"     0.77
5     "The Lovely Bones: A Novel"    0.72

Example 2

"The Queen of the Damned (Vampire Chronicles (Paperback))"

Rank  Recommended Book                                               Cosine Distance
1     "Catch 22"                                                     0.79
2     "The Witching Hour (Lives of the Mayfair Witches)"             0.74
3     "Interview with the Vampire"                                   0.73
4     "The Tale of the Body Thief (Vampire Chronicles (Paperback))"  0.54
5     "The Vampire Lestat (Vampire Chronicles, Book II)"             0.52

Results Interpretation

The values shown are cosine distances, where cosine distance = 1 − cosine similarity: lower distances indicate greater similarity. Because the function sorts from farthest to nearest, the last entry in each list (rank 5) is the most similar book. The system successfully identifies books that share similar rating patterns, meaning users who liked one book tend to like the recommended books as well.

Notice in Example 2 how the closest neighbors (the lowest distances) are other books from the same series ("Vampire Chronicles"), alongside related titles by the same author, demonstrating that the collaborative filtering approach effectively captures book similarities based on user preferences.
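Since the model reports cosine distances, converting them to similarities is a one-liner (a small sketch using the Example 2 values above):

```python
# Cosine distances returned by the model for Example 2, farthest first
distances = [0.79, 0.74, 0.73, 0.54, 0.52]

# cosine similarity = 1 - cosine distance
similarities = [round(1 - d, 2) for d in distances]
print(similarities)  # [0.21, 0.26, 0.27, 0.46, 0.48]
```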

Key Concepts

Collaborative Filtering

Method that predicts a user's preferences based on the preferences of similar users. In this case, it finds books similar to a given book based on how users have rated them.

K-Nearest Neighbors

Algorithm that finds the K most similar items based on distance or similarity metrics. Here, it finds books with similar rating patterns.

Rating Matrix

Representation of user-book interactions that allows calculating similarities and generating recommendations. Each cell contains a user's rating for a book.

Cosine Similarity

Similarity measure based on the angle between two vectors; its complement (cosine distance = 1 − cosine similarity) is what the model reports. It is well suited to sparse, high-dimensional data like rating matrices.