SMS Text Classifier

RNN and NLP

Tags: TensorFlow · Keras · LSTM · Text Classification

Problem to Solve

This project implements a binary text classification system that distinguishes between spam and legitimate (ham) SMS messages using deep learning techniques. The challenge was to create a Long Short-Term Memory (LSTM) neural network using TensorFlow and Keras that correctly classifies SMS messages with high accuracy.

Spam detection is a critical problem in modern communication systems. Unlike traditional rule-based filters, deep learning models can learn complex patterns in text that distinguish spam from legitimate messages.

1. Dataset

The dataset consists of SMS messages labeled as either "spam" or "ham" (legitimate messages), organized into training and test sets in TSV format.

Label   Message Text
ham     how are you doing today?
spam    sale today! to stop texts call 98912460324
ham     i dont want to go. can we try it a different day?

2. Preprocessing

Text data is preprocessed to convert raw messages into numerical representations that neural networks can process.

2.1 Tokenization and Padding

Messages are tokenized into sequences of integers and padded to a fixed length of 100 tokens:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Cap the vocabulary at 5,000 words; unseen words map to the OOV token
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(train_x)

X_sequences = tokenizer.texts_to_sequences(train_x)
X_padded = pad_sequences(X_sequences, maxlen=100)
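Conceptually, tokenization maps each word to an integer index and padding front-fills shorter sequences with zeros (the Keras default). A minimal pure-Python sketch of the same idea, using a hypothetical toy vocabulary rather than the real Keras implementation:

```python
# Toy vocabulary (hypothetical); Keras builds this from the training corpus
vocab = {'how': 1, 'are': 2, 'you': 3, 'doing': 4, 'today': 5}
OOV = 6  # index for out-of-vocabulary words

def to_sequence(text):
    """Map each word to its integer index; unknown words map to OOV."""
    return [vocab.get(word, OOV) for word in text.lower().split()]

def pad(seq, maxlen=8):
    """Pre-truncate and pre-pad with zeros, like Keras pad_sequences defaults."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

print(pad(to_sequence("how are you doing today")))
# → [0, 0, 0, 1, 2, 3, 4, 5]
```

Zero is reserved for padding, which is why real word indices start at 1.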

2.2 Label Encoding

Text labels are converted to numerical values: "ham" → 0, "spam" → 1.
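This mapping is a one-line dictionary lookup; a sketch with hypothetical sample labels (the variable names are assumptions):

```python
# Map text labels to integers: ham -> 0, spam -> 1
label_map = {'ham': 0, 'spam': 1}

train_y = ['ham', 'spam', 'ham']  # hypothetical sample labels
train_y_encoded = [label_map[y] for y in train_y]
# → [0, 1, 0]
```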

3. Model Architecture

The model uses an LSTM neural network architecture designed to learn sequential patterns in text.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(tokenizer.word_index) + 1, 32),  # word index -> 32-d vector
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')  # outputs P(spam)
])

Layer            Purpose
Embedding        Converts word indices to dense 32-dimensional vectors
LSTM             Learns sequential patterns and long-range dependencies
Dense (sigmoid)  Binary classification output (probability 0-1)
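For intuition about model size, the per-layer parameter counts follow directly from the architecture above. A sketch, where the vocabulary size of 5,000 is illustrative (in the real model it is len(tokenizer.word_index)):

```python
# Parameter counts for the three layers (vocab_size is a hypothetical example)
vocab_size = 5000
embed_dim = 32
lstm_units = 32

embedding_params = (vocab_size + 1) * embed_dim  # one 32-d vector per index, incl. padding index 0
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)  # 4 gates: input kernel + recurrent kernel + bias
dense_params = lstm_units + 1  # weights + bias for the single sigmoid unit

print(embedding_params, lstm_params, dense_params)  # 160032 8320 33
```

The embedding layer dominates the parameter count, which is typical for text models with small recurrent layers.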

4. Training and Results

The model was compiled with binary cross-entropy loss and RMSprop optimizer, then trained for 10 epochs.

model.compile(
    loss='binary_crossentropy',
    optimizer='rmsprop',
    metrics=['acc']
)

history = model.fit(X_padded, train_y_encoded, epochs=10)
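Binary cross-entropy, the loss used above, penalizes confident wrong predictions heavily. For a single example with true label y and predicted probability p, a pure-Python sketch of the formula:

```python
import math

def binary_crossentropy(y, p):
    """Loss for one example: -(y*log(p) + (1-y)*log(1-p))."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident correct prediction has low loss, a confident wrong one high loss
print(round(binary_crossentropy(1, 0.99), 4))  # 0.0101
print(round(binary_crossentropy(1, 0.01), 4))  # 4.6052
```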

Results

Test accuracy: 98.6%
Test loss (binary cross-entropy): 0.087

5. Model Testing

The model was tested with various messages to verify its functionality:

"how are you doing today"

Prediction: ham

"sale today! to stop texts call 98912460324"

Prediction: spam

"you have won £1000 cash! call to claim your prize."

Prediction: spam
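Since the sigmoid layer outputs a probability, turning a prediction into a "ham"/"spam" label comes down to thresholding at 0.5. A minimal sketch (the helper name and threshold are assumptions; in practice the probability would come from model.predict on a tokenized, padded message):

```python
def to_label(prob, threshold=0.5):
    """Convert the model's sigmoid output into a class label."""
    return 'spam' if prob >= threshold else 'ham'

print(to_label(0.02))  # ham
print(to_label(0.97))  # spam
```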

Key Concepts

LSTM Networks

Long Short-Term Memory networks are specialized RNNs that can learn long-range dependencies in sequential data.

Word Embeddings

Dense vector representations of words that capture semantic meaning and relationships.

Text Preprocessing

Converting raw text into numerical sequences through tokenization and padding.