Problem to Solve
This project implements a binary text classifier that distinguishes spam from legitimate ("ham") SMS messages using deep learning. The challenge was to build a Long Short-Term Memory (LSTM) neural network with TensorFlow and Keras that classifies SMS messages with high accuracy.
Spam detection is a critical problem in modern communication systems. Unlike traditional rule-based filters, deep learning models can learn complex patterns in text that distinguish spam from legitimate messages.
1. Dataset
The dataset consists of SMS messages labeled as either "spam" or "ham" (legitimate messages), organized into training and test sets in TSV format.
| Label | Message Text |
|---|---|
| ham | how are you doing today? |
| spam | sale today! to stop texts call 98912460324 |
| ham | i dont want to go. can we try it a different day? |
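The TSV files can be loaded with pandas. The sketch below uses an inline sample in the same two-column layout rather than the real files; the variable and column names are assumptions, not taken from the project code:

```python
import io

import pandas as pd

# Inline sample mirroring the two-column TSV layout: label <TAB> message
sample_tsv = (
    "ham\thow are you doing today?\n"
    "spam\tsale today! to stop texts call 98912460324\n"
    "ham\ti dont want to go. can we try it a different day?\n"
)

# The real dataset would be read the same way from its .tsv files
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", names=["label", "message"])

train_x = df["message"].values
train_y = df["label"].values
```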
2. Preprocessing
Text data is preprocessed to convert raw messages into numerical representations that neural networks can process.
2.1 Tokenization and Padding
Messages are tokenized into sequences of integers and padded to a fixed length of 100 tokens:
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Keep the 5,000 most frequent words; unknown words map to the OOV token
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(train_x)

# Convert messages to integer sequences, then pad/truncate to 100 tokens
X_sequences = tokenizer.texts_to_sequences(train_x)
X_padded = pad_sequences(X_sequences, maxlen=100)
```
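To see what `pad_sequences` does with its defaults (pre-padding with zeros, pre-truncation), here is a pure-Python illustration — illustrative only, not the Keras implementation:

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults: pad at the front, truncate from the front."""
    truncated = seq[-maxlen:]                      # keep only the last maxlen tokens
    padding = [value] * (maxlen - len(truncated))  # fill the front with zeros
    return padding + truncated
```

For example, `pad_pre([5, 7, 9], maxlen=6)` yields `[0, 0, 0, 5, 7, 9]`.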
2.2 Label Encoding
Text labels are converted to numerical values: "ham" → 0, "spam" → 1.
3. Model Architecture
The model uses an LSTM neural network architecture designed to learn sequential patterns in text.
```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Input dimension must cover every index the tokenizer can emit
    tf.keras.layers.Embedding(len(tokenizer.word_index) + 1, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```
| Layer | Purpose |
|---|---|
| Embedding | Converts word indices to dense 32-dimensional vectors |
| LSTM | Learns sequential patterns and long-range dependencies |
| Dense (sigmoid) | Binary classification output (probability 0-1) |
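Most of the model's parameters live in the Embedding layer: one 32-dimensional vector per vocabulary entry. As a back-of-envelope check (assuming, hypothetically, a vocabulary capped at the tokenizer's num_words=5000):

```python
vocab_size = 5000      # hypothetical cap matching num_words
embedding_dim = 32
lstm_units = 32

# One dense vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
# LSTM has 4 gates, each with input, recurrent, and bias weights
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)
# Dense layer: one weight per LSTM unit, plus a bias
dense_params = lstm_units * 1 + 1
```

This gives 160,000 embedding parameters versus 8,320 for the LSTM and 33 for the output layer, which is why capping the vocabulary matters for model size.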
4. Training and Results
The model was compiled with binary cross-entropy loss and RMSprop optimizer, then trained for 10 epochs.
```python
model.compile(
    loss='binary_crossentropy',
    optimizer='rmsprop',
    metrics=['acc']
)
history = model.fit(X_padded, train_y_encoded, epochs=10)
```
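Binary cross-entropy penalizes confident wrong predictions heavily. A worked example of the loss for a single prediction, in pure Python for illustration:

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Per-example loss: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# A confident correct prediction incurs a small loss...
low_loss = binary_crossentropy(1, 0.9)   # ~0.105
# ...while a confident wrong one is punished hard
high_loss = binary_crossentropy(1, 0.1)  # ~2.303
```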
5. Model Testing
The model was tested with various messages to verify its functionality:
- "how are you doing today"
- "sale today! to stop texts call 98912460324"
- "you have won £1000 cash! call to claim your prize."
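The sigmoid output is a probability, so classifying a message comes down to a threshold. A small helper (hypothetical, not from the original code) makes the decision explicit:

```python
def to_label(probability, threshold=0.5):
    """Map the model's sigmoid output to a class name."""
    return "spam" if probability >= threshold else "ham"
```

In practice the value fed in would come from `model.predict` on a tokenized, padded message; for the prize message, a well-trained model should output a probability near 1.0, so `to_label` returns "spam".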
Key Concepts
LSTM Networks
Long Short-Term Memory networks are specialized RNNs that can learn long-range dependencies in sequential data.
Word Embeddings
Dense vector representations of words that capture semantic meaning and relationships.
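Conceptually, an embedding layer is a lookup table: row i holds the learned vector for word index i. A NumPy sketch, with random vectors standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5000, 32
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# "Embedding" a padded sequence is just row lookup
sequence = [0, 0, 12, 845, 3]
vectors = embedding_matrix[sequence]  # one 32-dim vector per token
```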
Text Preprocessing
Converting raw text into numerical sequences through tokenization and padding.