Cats vs Dogs Image Classifier

Deep Learning and Computer Vision

Tags: TensorFlow, Keras, CNN, Computer Vision, Image Classification

Problem to Solve

This project implements a binary image classification system that distinguishes between cats and dogs using deep learning techniques. The challenge was to create a convolutional neural network (CNN) using TensorFlow 2.0 and Keras that correctly classifies images of cats and dogs with at least 63% accuracy.

Image classification is a fundamental problem in computer vision. Unlike traditional machine learning approaches that require manual feature extraction, CNNs can automatically learn hierarchical features from raw pixel data, making them ideal for visual recognition tasks.

Model used: Convolutional Neural Network (CNN) with data augmentation

1. Dataset Information

The dataset consists of images organized into three directories: training, validation, and test sets.

Training Set

Size: 2,001 images

1,000 images of cats
1,000 images of dogs
Used to train the model

Validation Set

Size: 1,001 images

500 images of cats
500 images of dogs
Used to monitor training progress and prevent overfitting

Test Set

Size: 50 images

Unlabeled images
Used for final evaluation
Predictions must maintain order (shuffle=False)

Image Preprocessing Requirements

All images are resized to 150×150 pixels and normalized to values between 0 and 1 (originally 0-255). This standardization is crucial for neural network training as it ensures:

Consistent input dimensions across all images
Numerical stability during training
Faster convergence of the optimization algorithm

2. Image Data Generation

2.1 What is ImageDataGenerator?

ImageDataGenerator is a Keras utility class that provides real-time data augmentation and preprocessing for image datasets. It serves multiple critical functions:

1. Image Loading and Decoding

Automatically reads images from directories and decodes them into tensors (multi-dimensional arrays) that neural networks can process.

2. Data Normalization

Converts pixel values from the range [0, 255] to [0, 1] using the rescale parameter. This normalization is essential because:

Neural networks train more efficiently with normalized inputs
Reduces the risk of exploding or vanishing gradients
Allows the optimizer to converge faster

3. Batch Generation

Organizes images into batches for efficient training. Instead of loading all images into memory at once, it generates batches on-the-fly, making it memory-efficient for large datasets.
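
As a rough illustration of on-the-fly batching, the sketch below uses a plain Python generator rather than Keras itself. The function iter_batches and the file names are hypothetical stand-ins; the point is only that each batch is built when it is requested, not up front.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, building one batch at a time."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file names standing in for the 1,000 cat + 1,000 dog images.
file_names = [f"img_{i}.jpg" for i in range(2000)]
batches = iter_batches(file_names, batch_size=128)

first_batch = next(batches)          # only this batch has been materialized so far
remaining = sum(1 for _ in batches)  # consume the rest just to count them
print(len(first_batch), 1 + remaining)
```

The final batch is smaller than 128 because 2,000 is not a multiple of the batch size.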

4. Data Augmentation

Applies random transformations to images during training, effectively creating new training examples from existing ones. This helps prevent overfitting and improves model generalization.

Basic ImageDataGenerator Setup

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_image_generator = ImageDataGenerator(
    rescale = 1./255  # Normalize pixel values to [0, 1]
)

The rescale parameter divides each pixel value by 255, converting integers in the range [0, 255] to floats in [0, 1].
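
The arithmetic behind rescale can be checked by hand. This is a minimal sketch in plain Python, not the Keras implementation; the pixel values are arbitrary examples.

```python
RESCALE = 1.0 / 255  # same factor as rescale = 1./255

raw_pixels = [0, 64, 128, 255]                     # example 8-bit intensities
scaled_pixels = [p * RESCALE for p in raw_pixels]  # floats in [0, 1]
print([round(v, 3) for v in scaled_pixels])
```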

2.2 flow_from_directory Method

The flow_from_directory method creates a generator that reads images from a directory structure and applies the specified transformations.

Creating Data Generators

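# Values stated in this write-up; the *_dir paths are defined elsewhere in the project
batch_size = 128
IMG_HEIGHT = 150
IMG_WIDTH = 150

train_data_gen = train_image_generator.flow_from_directory(
    batch_size=batch_size,
    directory=train_dir,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    class_mode='binary'
)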

val_data_gen = validation_image_generator.flow_from_directory(
    batch_size=batch_size,
    directory=validation_dir,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    class_mode='binary'
)

test_data_gen = test_image_generator.flow_from_directory(
    batch_size=batch_size,
    directory=test_dir,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    class_mode='binary',
    shuffle=False  # Critical: maintains prediction order
)

Key Parameters Explained

batch_size: Number of images processed together (128 in this project)
directory: Path to the image directory
target_size: Resizes all images to (150, 150) pixels
class_mode: 'binary' for two-class classification (cats=0, dogs=1)
shuffle: For test data, set to False to maintain prediction order
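
The cats=0, dogs=1 mapping follows from flow_from_directory assigning label indices to class subdirectories in alphanumeric order. The sketch below imitates that rule in plain Python; the helper class_indices and the folder names 'cats' and 'dogs' are assumptions made for illustration.

```python
def class_indices(class_dirs):
    """Map each class folder name to an integer label in alphanumeric order."""
    return {name: index for index, name in enumerate(sorted(class_dirs))}


# Folder names are assumed; the order they are listed in does not matter.
labels = class_indices(["dogs", "cats"])
print(labels)
```

On a real generator, the equivalent mapping can be inspected through its class_indices attribute.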

3. Data Augmentation

With a relatively small training dataset (2,001 images), there's a high risk of overfitting, where the model memorizes training examples rather than learning generalizable features. Data augmentation addresses this by creating variations of existing images.

3.1 Why Data Augmentation?

Data augmentation artificially increases the size and diversity of the training dataset by applying random transformations. This helps the model:

Learn features that are invariant to orientation, position, and scale
Generalize better to new, unseen images
Reduce overfitting by exposing the model to more variations
Improve robustness to real-world image variations

3.2 Augmentation Transformations

Enhanced ImageDataGenerator with Augmentation

train_image_generator = ImageDataGenerator(
    rescale = 1./255,                    # Normalize pixel values to [0, 1]
    rotation_range = 40,                  # Rotate images randomly up to 40°
    width_shift_range = 0.2,              # Shift horizontally by up to 20%
    height_shift_range = 0.2,             # Shift vertically by up to 20%
    shear_range = 0.2,                    # Shear intensity (shear angle in degrees)
    zoom_range = 0.2,                      # Random zoom within 20% range
    horizontal_flip = True,                # Randomly flip images horizontally
    fill_mode = 'nearest'                  # Fill empty pixels using nearest neighbor
)

rotation_range = 40

Randomly rotates images by up to ±40 degrees. This teaches the model that object orientation doesn't change the class (a cat is still a cat whether upright or slightly tilted).

width_shift_range = 0.2

Randomly shifts images horizontally by up to 20% of the image width. Helps the model learn that object position within the frame doesn't affect classification.

height_shift_range = 0.2

Randomly shifts images vertically by up to 20% of the image height. Similar to width shift, teaches position invariance.

shear_range = 0.2

Applies a shear transformation (like tilting a rectangle into a parallelogram). In Keras the value is a shear angle in degrees, so 0.2 produces only a very mild shear. It simulates slight perspective changes and helps with robustness to viewing angles.

zoom_range = 0.2

Randomly zooms in or out by up to 20%. Teaches the model to recognize objects at different scales, which is crucial for real-world applications.

horizontal_flip = True

Randomly flips images horizontally (mirror effect). This suits animal photos because a mirrored cat or dog is still a valid example of its class, so every image gains a plausible left-right counterpart.

fill_mode = 'nearest'

When transformations create empty spaces (e.g., after rotation), this fills them using the nearest pixel values. Other options include 'constant', 'reflect', and 'wrap'.
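
To make two of these transformations concrete, here is a minimal sketch on a tiny grid of numbers standing in for pixels. It is not how Keras implements them; horizontal_flip and shift_right_nearest are hypothetical helpers, and the shift uses a whole-pixel offset for simplicity.

```python
def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def shift_right_nearest(image, pixels):
    """Shift columns right by `pixels`; fill the gap by repeating the edge pixel."""
    width = len(image[0])
    return [[row[max(col - pixels, 0)] for col in range(width)] for row in image]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
flipped = horizontal_flip(image)
shifted = shift_right_nearest(image, pixels=1)
print(flipped)
print(shifted)
```

In the shifted result the leftmost column is duplicated, which is the behavior 'nearest' describes: newly exposed pixels copy their closest existing neighbor.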

3.3 Visual Example: Data Augmentation in Action

The following image shows how a single cat image is transformed into 5 different variations using the augmentation parameters:

Figure: Data augmentation example showing 5 variations of a cat image

A single training image (left) transformed into 5 variations through random rotations, shifts, zooms, and flips. Each variation is treated as a new training example, effectively increasing the dataset size.

4. CNN Architecture

The model uses a Convolutional Neural Network (CNN) architecture designed to automatically learn hierarchical features from images, progressing from simple edges to complex shapes and patterns.

4.1 Model Architecture

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(128, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation='sigmoid')
])

4.2 Architecture Summary

Layer	Output Shape	Parameters	Purpose
Conv2D (32 filters)	(148, 148, 32)	896	Detects basic features (edges, lines)
MaxPooling2D	(74, 74, 32)	0	Reduces spatial dimensions
Conv2D (64 filters)	(72, 72, 64)	18,496	Detects more complex patterns
MaxPooling2D	(36, 36, 64)	0	Further reduces dimensions
Conv2D (128 filters)	(34, 34, 128)	73,856	Detects high-level features
MaxPooling2D	(17, 17, 128)	0	Final spatial reduction
Flatten	(36,992)	0	Converts to 1D vector
Dense (512)	(512)	18,940,416	Fully connected classification
Dropout (0.5)	(512)	0	Prevents overfitting
Dense (1)	(1)	513	Binary classification output

Total Parameters: 19,034,177 trainable parameters
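
The shapes and parameter counts in the table can be reproduced with a few lines of arithmetic. The sketch below assumes valid padding, stride 1 for convolutions, and 2×2 pooling with stride 2, matching the Keras defaults used in the model; the helper names are illustrative only.

```python
def conv2d(shape, filters, kernel=3):
    """Valid-padding Conv2D: returns (output shape, parameter count)."""
    height, width, channels = shape
    out_shape = (height - kernel + 1, width - kernel + 1, filters)
    params = kernel * kernel * channels * filters + filters  # weights + biases
    return out_shape, params


def max_pool(shape, pool=2):
    """2x2 max pooling with stride 2: floor-divides height and width."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)


def dense(inputs, units):
    """Fully connected layer: returns (units, parameter count)."""
    return units, inputs * units + units


shape, total = (150, 150, 3), 0
for filters in (32, 64, 128):
    shape, params = conv2d(shape, filters)
    total += params
    shape = max_pool(shape)

flat = shape[0] * shape[1] * shape[2]   # Flatten
units, params = dense(flat, 512)
total += params
_, params = dense(units, 1)             # Dropout adds no parameters
total += params

print(shape, flat, total)
```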

4.3 Layer-by-Layer Deep Dive

1. Conv2D(32, (3, 3)) - First Convolutional Layer

Convolution Operation: (I * K)[i,j] = Σ_mΣ_n I[i+m, j+n] × K[m, n]

This layer applies 32 different 3×3 filters (kernels) to the input image. Each filter learns to detect a specific feature:

Mathematical Process: The filter slides across the image, computing dot products between filter weights and local image patches
What it learns: Simple features like edges, corners, and basic textures
Why 32 filters: Allows detection of multiple feature types simultaneously
Why (3, 3): Small kernel size captures fine-grained local patterns
ReLU activation: f(x) = max(0, x) introduces non-linearity, enabling the network to learn complex patterns

Output: 148×148×32 (reduced from 150×150 due to valid padding)
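
The sliding-window sum in the formula can be sketched in plain Python for a single channel and a single filter. The toy image and the vertical-edge kernel are hand-picked assumptions; a trained layer learns its own kernel weights and sums over all input channels.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(k_h) for n in range(k_w))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Apply max(0, x) to every entry."""
    return [[max(0, v) for v in row] for row in feature_map]


# Toy 4x6 image: dark left half, bright right half.
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]
# Hand-picked vertical-edge kernel (an assumption, not a learned filter).
kernel = [[-1, 0, 1] for _ in range(3)]

feature_map = relu(conv2d_valid(image, kernel))
print(feature_map)
```

The output is 2×4 for a 4×6 input, the same "lose kernel size minus one" rule that turns 150×150 into 148×148, and the response is nonzero only where the window straddles the dark-to-bright edge.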

2. MaxPooling2D((2, 2)) - First Pooling Layer

Max Pooling: P[i,j] = max(I[2i:2i+2, 2j:2j+2])

Reduces spatial dimensions by taking the maximum value in each 2×2 region:

Purpose: Reduces computational complexity and parameters
Benefit: Makes the model more translation-invariant
Why Max instead of Average: Preserves the strongest features, which are most informative
Mathematical effect: Halves both width and height: 148×148 → 74×74

Output: 74×74×32 (spatial dimensions halved, depth unchanged)
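
A minimal sketch of 2×2 max pooling on one small feature map, assuming non-overlapping windows (stride 2) as in the pooling formula above. The input values are arbitrary.

```python
def max_pool_2x2(feature_map):
    """Take the maximum of each non-overlapping 2x2 block (stride 2)."""
    out_h, out_w = len(feature_map) // 2, len(feature_map[0]) // 2
    return [
        [
            max(feature_map[2 * i + di][2 * j + dj]
                for di in range(2) for dj in range(2))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```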

3. Conv2D(64, (3, 3)) - Second Convolutional Layer

Builds upon the first layer's features to detect more complex patterns:

Why 64 filters: As we go deeper, we need more filters to capture increasing feature complexity
What it learns: Combinations of edges forming shapes, textures, and patterns
Receptive field: Each neuron "sees" a larger area of the original image due to previous pooling

Output: 72×72×64

4. MaxPooling2D((2, 2)) - Second Pooling Layer

Further reduces spatial dimensions while preserving the most important features.

Output: 36×36×64

5. Conv2D(128, (3, 3)) - Third Convolutional Layer

Detects high-level semantic features:

What it learns: Complex patterns like facial features, body parts, or distinctive markings
Why 128 filters: More filters needed to capture the diversity of high-level features
Receptive field: Now covers a significant portion of the original image

Output: 34×34×128

6. MaxPooling2D((2, 2)) - Third Pooling Layer

Final spatial reduction before flattening.

Output: 17×17×128 = 36,992 values

7. Flatten() - Reshaping Layer

Flattening: Converts 3D tensor (17, 17, 128) → 1D vector (36,992)

Transforms the 3D feature maps into a 1D vector to feed into fully connected layers:

Mathematical operation: Reshapes without changing values: 17 × 17 × 128 = 36,992
Purpose: Prepares features for classification

Output: (36,992,) - 1D vector

8. Dense(512) - Fully Connected Layer

Dense Layer: y = ReLU(W × x + b)

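Where W is a 512 × 36,992 weight matrix, x is the 36,992-element input vector, and b is a 512-element bias vector. (Keras stores the kernel in transposed form, 36,992 × 512, and computes x × W; the result is the same.)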

This is the largest layer in terms of parameters (18.9M):

Purpose: Combines all learned features for final classification
Why 512 neurons: Provides enough capacity to learn complex feature combinations
ReLU activation: Introduces non-linearity: f(x) = max(0, x)
Mathematical complexity: Each of 512 neurons connects to all 36,992 inputs: 36,992 × 512 + 512 biases = 18,940,416 parameters

Output: (512,) - Feature vector
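
The dense-layer formula can be tried at toy scale. This sketch uses 3 inputs and 2 neurons with made-up weights, purely to show the computation and the parameter count; the real layer maps 36,992 inputs to 512 neurons.

```python
def dense_relu(weights, bias, inputs):
    """Compute ReLU(W x + b) where `weights` holds one row per output neuron."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


# Made-up toy values: 3 inputs -> 2 neurons.
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.0, -1.0]
x = [2.0, 3.0, 1.0]

y = dense_relu(W, b, x)
param_count = len(W) * len(W[0]) + len(b)   # weights + biases
print(y, param_count)
```

The first neuron's pre-activation is negative, so ReLU clamps it to zero.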

9. Dropout(0.5) - Regularization Layer

Dropout: During training, randomly sets 50% of inputs to 0

Prevents overfitting by randomly disabling neurons during training:

How it works: Each neuron has a 50% chance of being set to 0 during each training step
Why 0.5: Common value that balances regularization strength
Effect: Forces the network to learn redundant representations, improving generalization
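During inference: All neurons are active and no rescaling is applied. Keras compensates during training instead, multiplying the surviving activations by 1 / (1 - 0.5) = 2 so that the expected total stays the same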

Output: (512,) - Same shape, but with regularization effect
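
The following is a minimal sketch of inverted dropout, the variant in which the rescaling happens at training time so that inference needs no adjustment. The function and its rng argument are illustrative assumptions, not the Keras internals.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate` and scale the
    survivors by 1 / (1 - rate) during training; pass through at inference."""
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(sum(train_out) / len(train_out))   # stays close to the original mean of 1.0
print(infer_out == activations)
```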

10. Dense(1, activation='sigmoid') - Output Layer

Sigmoid: σ(x) = 1 / (1 + e^-x)

Produces a single probability value between 0 and 1:

Why sigmoid: Well suited to binary classification because it maps any real number to a value in (0, 1) that can be read as a probability
Interpretation: Values close to 0 = cat, close to 1 = dog
Decision threshold: Typically 0.5 (values > 0.5 = dog, ≤ 0.5 = cat)
Mathematical properties: Smooth, differentiable, bounded between (0, 1)

Output: (1,) - Single probability value
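
A short sketch of the sigmoid and the 0.5 decision threshold described above. The input values are arbitrary pre-activation scores chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


for logit in (-4.0, 0.0, 4.0):
    p = sigmoid(logit)
    label = "dog" if p > 0.5 else "cat"
    print(f"{logit:+.1f} -> {p:.3f} -> {label}")
```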

4.4 Why This Architecture?

Progressive Feature Learning

The architecture follows a hierarchical pattern: early layers detect simple features (edges), middle layers combine them (shapes), and later layers recognize complex patterns (faces, bodies).

Increasing Filter Depth

Filters increase from 32 → 64 → 128 because deeper layers need more capacity to represent complex feature combinations.

Pooling Strategy

MaxPooling after each Conv2D layer reduces spatial dimensions while preserving the most important features, making the model computationally efficient.

Regularization

Dropout prevents the large Dense layer (18.9M parameters) from overfitting to the relatively small training set.

5. Model Compilation

The model is compiled with specific optimizer, loss function, and metrics chosen for binary classification.

Compilation Code

model.compile(
    optimizer='adam',                    # Adjusts weights to minimize loss
    loss='binary_crossentropy',          # For binary classification
    metrics=['accuracy']                 # Evaluation metric
)

Optimizer: Adam

Adam (Adaptive Moment Estimation) is an advanced optimization algorithm that combines the benefits of two other methods:

Adaptive Learning Rates: Automatically adjusts learning rate for each parameter
Momentum: Uses moving averages of gradients for smoother updates
Why Adam: Works well with default hyperparameters, often converges faster than plain SGD, and handles sparse gradients effectively
Mathematical advantage: Maintains a per-parameter step size, adapted from running averages of recent gradients and of their squares
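
To show how the two moving averages interact, here is a sketch of the textbook Adam update for a single scalar parameter. The default values mirror the Keras defaults (learning rate 0.001, beta_1 0.9, beta_2 0.999, epsilon 1e-7), but the function itself is an illustrative assumption and the toy objective is not part of this project.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
    print(t, round(theta, 6))
```

Each early step moves the parameter by roughly the learning rate regardless of the gradient's magnitude, which is the effect of dividing by the root of the squared-gradient average.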

Loss Function: binary_crossentropy

Binary Cross-Entropy: L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

Where y is the true label (0 or 1) and ŷ is the predicted probability.

This is the standard loss function for binary classification:

Why binary_crossentropy: Directly measures the difference between predicted probabilities and true labels
Mathematical property: Penalizes confident wrong predictions more heavily
Works with sigmoid: Pairs naturally with a sigmoid output, because the loss expects ŷ to be a probability in (0, 1)
Gradient behavior: Provides strong gradients when predictions are wrong, weak gradients when correct
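
The claim that confident wrong predictions are penalized more heavily can be checked numerically. This sketch evaluates the formula for one example at a time; the clipping constant eps is an assumption added to avoid log(0).

```python
import math


def binary_crossentropy(y_true: int, y_pred: float, eps: float = 1e-7) -> float:
    """Loss for one example; y_pred is clipped away from 0 and 1 to avoid log(0)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


# The true label is dog (1) in all three cases.
for y_pred in (0.9, 0.6, 0.1):
    print(y_pred, round(binary_crossentropy(1, y_pred), 3))
```

The loss grows from about 0.1 for a confident correct prediction to more than 2 for a confident wrong one.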

Metric: Accuracy

Accuracy: (TP + TN) / (TP + TN + FP + FN)

Where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.

Measures the percentage of correctly classified images:

Why accuracy: Simple, interpretable metric for balanced binary classification
Interpretation: 0.70 accuracy means 70% of predictions are correct
Limitation: May not be ideal for imbalanced datasets, but works well here since classes are balanced
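
The accuracy formula can be computed directly from predicted and true labels. The ten labels below are made up for illustration and are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (0 = cat, 1 = dog)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels for ten images, purely for illustration.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
print(accuracy(y_true, y_pred))
```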

6. Model Training

The model is trained using the fit method with specified epochs, batch size, and validation monitoring.

Training Code

history = model.fit(
    x=train_data_gen,
    steps_per_epoch=100,
    epochs=10,
    validation_data=val_data_gen,
    validation_steps=50
)

6.1 Training Parameters Explained

steps_per_epoch = 100

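Number of batches processed per epoch. With batch_size=128, this requests up to 100 × 128 = 12,800 images per epoch. The training directory holds only about 2,000 images, so this relies on the generator cycling through the data several times within one epoch, each time with fresh random augmentations; augmentation itself does not add files.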

epochs = 10

Number of training epochs. Because each epoch here runs steps_per_epoch batches rather than exactly one pass over the training images, 10 epochs correspond to 1,000 batch updates in total.

validation_data = val_data_gen

Validation set used to monitor training progress and detect overfitting. Not used for training, only evaluation.

validation_steps = 50

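Number of validation batches to evaluate per epoch. With batch_size=128, this requests 50 × 128 = 6,400 images, although the validation set holds only about 1,000, so the same images are evaluated repeatedly; 8 batches already cover the whole set.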
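
The step counts can be related to the dataset sizes with a little arithmetic. The sketch below assumes the per-class image counts listed in the dataset section (1,000 + 1,000 for training, 500 + 500 for validation); batches_per_pass is a hypothetical helper.

```python
import math

BATCH_SIZE = 128


def batches_per_pass(num_images: int, batch_size: int = BATCH_SIZE) -> int:
    """Batches needed to see every image once (the last batch may be smaller)."""
    return math.ceil(num_images / batch_size)


train_batches = batches_per_pass(1000 + 1000)   # cats + dogs
val_batches = batches_per_pass(500 + 500)

print(train_batches, 100 / train_batches)   # passes covered by steps_per_epoch=100
print(val_batches, 50 / val_batches)        # passes covered by validation_steps=50
```

A common alternative is to set both step counts to the number of batches in one pass, so that one epoch matches one pass over the data.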

6.2 Training Results: Accuracy and Loss

The following graphs show the model's performance during training:

Interpreting the Training Curves

Training Accuracy (Blue Line)

Shows how well the model performs on training data. Ideally, this should increase steadily and converge to a high value.

Validation Accuracy (Orange Line)

Shows how well the model generalizes to unseen validation data. It should follow training accuracy closely; a large gap between the two indicates overfitting.

Training Loss (Blue Line)

Measures the error on training data. Should decrease steadily as the model learns.

Validation Loss (Orange Line)

Measures error on validation data. Should decrease with training loss. If it starts increasing while training loss decreases, the model is overfitting.

Key Observation: The close alignment between training and validation curves indicates good generalization with minimal overfitting, thanks to data augmentation and dropout regularization.
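
The overfitting signal described above (validation loss rising while training loss keeps falling) can be checked programmatically. In a real run the two lists would come from history.history['loss'] and history.history['val_loss']; the numbers below are made up for illustration and are not this project's results.

```python
def first_overfitting_epoch(train_loss, val_loss):
    """Return the 1-based epoch where validation loss first rises while
    training loss still falls, or None if that never happens."""
    for epoch in range(1, len(val_loss)):
        val_rising = val_loss[epoch] > val_loss[epoch - 1]
        train_falling = train_loss[epoch] < train_loss[epoch - 1]
        if val_rising and train_falling:
            return epoch + 1
    return None


# Made-up loss values for illustration only.
train_loss = [0.69, 0.62, 0.55, 0.48, 0.40]
val_loss = [0.68, 0.63, 0.60, 0.61, 0.65]
print(first_overfitting_epoch(train_loss, val_loss))
```

A single noisy uptick is not proof of overfitting, so in practice this check is usually combined with some patience over several epochs.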

7. Predictions and Results

After training, the model makes predictions on the test set and outputs probabilities for each image.

7.1 Making Predictions

Prediction Code

predictions = model.predict(test_data_gen)
# Returns probabilities between 0 and 1

probabilities = [p[0] for p in predictions]
# Extracts probability values from nested arrays

Understanding the Predictions

The model.predict() method returns an array of probabilities, where each value represents the model's confidence that an image is a dog:

Values close to 0: High confidence the image is a cat
Values close to 1: High confidence the image is a dog
Values around 0.5: Uncertain prediction

For example:

predictions[0] = 0.15 → 15% dog probability → Model predicts Cat (85% confidence)
predictions[1] = 0.87 → 87% dog probability → Model predicts Dog (87% confidence)
predictions[2] = 0.52 → 52% dog probability → Model predicts Dog but with low confidence (52%)
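
The mapping from a raw probability to a label and a confidence can be written as a small helper. The function name interpret and the 0.5 threshold default are assumptions for illustration; the three inputs are the example values above.

```python
def interpret(dog_probability: float, threshold: float = 0.5):
    """Turn a sigmoid output into (label, confidence in that label)."""
    if dog_probability > threshold:
        return "Dog", dog_probability
    return "Cat", 1.0 - dog_probability


for p in (0.15, 0.87, 0.52):
    label, confidence = interpret(p)
    print(f"{p:.2f} -> {label} ({confidence:.0%})")
```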

7.2 Visual Results

The model's predictions were visualized by displaying test images with their predicted class and confidence percentage:

Figure: Grid of 50 test images with predictions showing confidence percentages

Grid visualization of 50 test images with predictions. Each image shows the predicted class (Cat/Dog) and the confidence percentage. Higher percentages indicate more confident predictions. The model classifies most images correctly, with confidence levels typically above 70% for clear images.

7.3 Results Summary

Model Performance

The CNN achieved above 70% accuracy on the test set, exceeding the 63% requirement. The model demonstrates:

Effective feature learning from raw pixel data
Good generalization to unseen images
Robust predictions with high confidence scores

Key Success Factors

Data Augmentation: Increased effective dataset size and improved generalization
CNN Architecture: Hierarchical feature learning from edges to complex patterns
Regularization: Dropout prevented overfitting despite large parameter count
Proper Preprocessing: Normalization and resizing ensured consistent inputs

Key Concepts

Convolutional Neural Networks

Deep learning architecture designed for image processing, automatically learning hierarchical features through convolutional and pooling layers.

Data Augmentation

Technique to artificially increase dataset size by applying random transformations, improving model generalization and reducing overfitting.

Regularization

Techniques like dropout prevent overfitting by randomly disabling neurons during training, forcing the model to learn robust features.