Spam Classification using Logistic Regression
Bal Narendra Sapa
Venkata Krishna Sreekar Padakandla
Sreeja Garlapati
Bharath Kondreddy
Sai Shara Vardhan Bandi
Instructor:
Minkyu Kim
ABSTRACT
Background:
We have a dataset that includes data on emails. Because we receive a large number of emails these days, email services have implemented spam detection to filter out unwanted messages. Logistic regression is one method that can be used to identify spam emails: certain characteristics of the emails are used to build a model that classifies them as spam or not spam.
Purpose:
The goal of this project is to use various characteristics of the emails, such as the sender and recipient, to build a generalized linear model. We then apply the logistic (inverse logit) function to the model's linear predictor to transform its output into probabilities. These probabilities, generated from the email attributes, indicate whether an email is likely to be spam.
Methods:
We were provided with 21 attributes. First, we converted the categorical data into numerical data using one-hot encoding. We then divided the dataset into a training set and a testing set in an 80/20 ratio. Using the statsmodels library in Python, we built a generalized linear model with a logit link function and trained it on the training dataset.
Results:
The model produces probabilities that show whether an email is spam or not. We use a threshold of 0.50 or 50% to classify emails. The model's accuracy ranges from 90% to 95%, depending on how the data is randomly divided. We also calculate precision, sensitivity, specificity, and other metrics to evaluate the model's performance.
Conclusion:
Our model, which was built using logistic regression and various attributes, is reliable. Based on its accuracy and other metrics, we can confidently say that it is effective at distinguishing spam emails from non-spam emails.
1. Business Understanding
The purpose of this project is to develop a spam filter that can estimate the probability that an email is spam. We converted the categorical characteristics into numerical data so that logistic regression could be applied to them. Because the filter outputs probabilities, it can classify emails as spam or not spam. Such a filter lets the email software offer spam filtering to its users, improving their satisfaction and increasing the number of users of the email service, which in turn leads to higher revenue.
2. Data Understanding
Before using the data to build a model, it is important to evaluate its quality and identify any missing or irrelevant information. In this dataset, two categorical attributes (winner and number) must be converted into numerical data before they can be used in the model, and the classification threshold (0.50 in this project) is chosen once the converted data has been examined. There is also a time attribute that has to be converted, using the datetime library, into a numeric format the model can accept. After the logistic regression is fitted, the p-values of the coefficients can be examined: if a p-value is greater than 0.05, there is not enough evidence that the corresponding coefficient significantly affects the predicted probability.
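As a concrete first check (a minimal sketch; the DataFrame df is loaded at the start of Section 3), missing values and duplicated rows can be counted before any modeling:
# Basic data-quality checks on the loaded DataFrame
print(df.isna().sum())        # missing values per attribute
print(df.duplicated().sum())  # fully duplicated rows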
3. Data Preparation
Importing Libraries
import math
import numpy as np
import pandas as pd
import seaborn as sns
from sqlite3 import connect
import matplotlib.pyplot as plt
import statsmodels.api as sm
Retrieving the table from the SQLite database.
connection = connect("../../dataset/dataset.db")
query = "SELECT * FROM " + "prj2"
df = pd.read_sql(query, connection)
print(df.describe())
Summary statistics for the 19 numerical columns (transposed for readability; mean and std rounded to three decimals):
attribute      count  mean      std       min    25%    50%     75%     max
index          3921   1960.000  1132.040  0      980    1960    2940    3920
spam           3921   0.094     0.291     0      0      0       0       1
to_multiple    3921   0.158     0.365     0      0      0       0       1
from           3921   0.999     0.028     0      1      1       1       1
cc             3921   0.404     2.666     0      0      0       0       68
sent_email     3921   0.278     0.448     0      0      0       1       1
image          3921   0.048     0.451     0      0      0       0       20
attach         3921   0.133     0.719     0      0      0       0       21
dollar         3921   1.467     5.022     0      0      0       0       64
inherit        3921   0.038     0.268     0      0      0       0       9
viagra         3921   0.002     0.128     0      0      0       0       8
password       3921   0.108     0.960     0      0      0       0       28
num_char       3921   10.707    14.646    0.001  1.459  5.856   14.084  190.087
line_breaks    3921   230.659   319.305   1      34     119     298     4022
format         3921   0.695     0.460     0      0      1       1       1
re_subj        3921   0.261     0.439     0      0      0       1       1
exclaim_subj   3921   0.080     0.272     0      0      0       0       1
urgent_subj    3921   0.002     0.042     0      0      0       0       1
exclaim_mess   3921   6.584     51.480    0      0      1       4       1236
Transforming the time attribute into numerical data using the datetime library.
import datetime as dt
df['time'] = pd.to_datetime(df['time'])
# Convert each timestamp to its proleptic Gregorian ordinal (an integer day count)
df['time'] = df["time"].map(dt.datetime.toordinal)
# Drop rows whose 'number' attribute is 'none' before it is encoded below
df = df.query("number != 'none'")
Converting categorical data into numerical data using one-hot encoding.
# Keep the indicator for 'winner' == 'yes' (1 = the email mentions winning)
df['winner'] = pd.get_dummies(df['winner'])['yes']
# Keep the indicator for 'number' == 'big' (rows with 'none' were dropped above)
df['number'] = pd.get_dummies(df['number'])['big']
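For illustration (a hypothetical, self-contained example, not part of the pipeline), pd.get_dummies expands a categorical column into one 0/1 indicator column per level; the code above keeps a single level's column:
# Toy example of how get_dummies expands a categorical column
example = pd.Series(['big', 'small', 'big'])
print(pd.get_dummies(example))  # produces indicator columns 'big' and 'small'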
Exploratory Data Analysis
sns.countplot(x="spam", data=df)
[Figure: count plot of the spam attribute (x: spam, y: count)]
df.boxplot(column=['line_breaks'],by='winner',grid=False,figsize=(8,8),fontsize=20)
[Figure: box plot of line_breaks grouped by winner]
ax = df["spam"].value_counts(normalize=True).plot.pie()
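Both plots show that spam is a minority class (the mean of the spam column in the summary statistics above is roughly 0.094). Quantifying this imbalance explains why Section 5 reports metrics beyond plain accuracy:
# Roughly 91% of emails are not spam, so accuracy alone can be misleading
print(df["spam"].value_counts(normalize=True))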
4. Logistic Regression Model
Dividing the dataset into a training set and a testing set.
test_data = df.sample(frac=0.2)
train_data = df.drop(test_data.index)
print(f"No. of training examples: {train_data.shape[0]}")
print(f"No. of testing examples: {test_data.shape[0]}")
No. of training examples: 2698
No. of testing examples: 674
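Because sample() draws a different random 20% on every run, the metrics in Section 5 shift slightly from run to run (hence the 90-95% accuracy range noted in the abstract). A fixed seed (a variant not used in the run reported here) makes the split reproducible:
# Hypothetical reproducible split; random_state=42 is an arbitrary choice
test_data = df.sample(frac=0.2, random_state=42)
train_data = df.drop(test_data.index)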
Using the statsmodels library to train the model.
Xtrain = train_data[['to_multiple', 'from', 'cc', 'sent_email', 'time',
'image', 'attach', 'dollar', 'winner', 'inherit', 'viagra', 'password',
'num_char', 'line_breaks', 'format', 're_subj', 'exclaim_subj',
'urgent_subj', 'exclaim_mess', 'number']]
ytrain = train_data[['spam']]
log_reg = sm.GLM(ytrain.astype(float), Xtrain.astype(float), family = sm.families.Binomial(link = sm.families.links.logit())).fit()
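Note that no intercept column is added before fitting, so the model is estimated without a constant term (no const row appears in the summary below). A variant with an intercept (an alternative for comparison, not the model reported here) would use sm.add_constant:
# Alternative sketch: fit the same GLM with an intercept term
Xtrain_const = sm.add_constant(Xtrain.astype(float))
log_reg_const = sm.GLM(ytrain.astype(float), Xtrain_const,
                       family=sm.families.Binomial(link=sm.families.links.logit())).fit()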
Vanilla implementation of logistic regression (using only NumPy, without statsmodels)
Sigmoid function
def sigmoid(z):
    # Clip z so np.exp cannot overflow for inputs with very large magnitude
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
Prediction
def predict(X, w):
    # fit() prepends the bias column, so w[0] is the bias and w[1:] are the feature weights
    z = np.dot(X, w[1:, :]) + w[0]
    probs = sigmoid(z)
    # Classify as spam (1) when the predicted probability reaches the 0.5 threshold
    return np.where(probs >= 0.5, 1, 0)
Training the model with gradient descent (learning rate = 0.01)
def fit(X, y, learning_rate=0.01, num_iterations=10000):
    # Add a bias column to X
    X = np.c_[np.ones((X.shape[0], 1)), X]
    # Initialize the weights randomly (w[0] is the bias term)
    w = np.random.randn(X.shape[1], 1)
    for i in range(num_iterations):
        z = np.dot(X, w)
        probs = sigmoid(z)
        error = probs - y
        # Gradient of the negative log-likelihood with respect to the weights
        # (unaveraged; dividing by X.shape[0] would give the mean gradient)
        gradient = np.dot(X.T, error)
        # Update the weights using the gradient and the learning rate
        w -= learning_rate * gradient
    return w
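The update above uses the gradient of the negative log-likelihood of the logistic model, unaveraged over the training examples: $$ \nabla_w \, \ell(w) = X^\top \big( \sigma(Xw) - y \big) $$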
X = Xtrain.to_numpy()
y = ytrain.to_numpy()
w = fit(X, y)
predictions = predict(X, w)
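Because the raw features sit on very different scales (the ordinal time values are in the hundreds of thousands), full-batch gradient descent on the unscaled matrix can diverge rather than converge. A minimal standardization sketch (an assumption for illustration; it was not applied in the run reported here):
# Hypothetical preprocessing: rescale each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)  # epsilon guards near-constant columns
w_std = fit(X_std, y)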
Summary of the model generated by statsmodels
print(log_reg.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: spam No. Observations: 2698
Model: GLM Df Residuals: 2678
Model Family: Binomial Df Model: 19
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -469.01
Date: Thu, 14 Sep 2023 Deviance: 938.02
Time: 17:08:46 Pearson chi2: 3.39e+03
No. Iterations: 26 Pseudo R-squ. (CS): 0.1328
Covariance Type: nonrobust
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
to_multiple -3.7024 0.729 -5.078 0.000 -5.132 -2.273
from -925.7040 2326.983 -0.398 0.691 -5486.507 3635.099
cc -0.0114 0.054 -0.211 0.833 -0.117 0.094
sent_email -26.5907 1.55e+04 -0.002 0.999 -3.03e+04 3.03e+04
time 0.0013 0.003 0.397 0.691 -0.005 0.007
image -1.8717 0.884 -2.118 0.034 -3.604 -0.140
attach 0.6756 0.227 2.982 0.003 0.232 1.120
dollar -0.0728 0.035 -2.103 0.035 -0.141 -0.005
winner 2.1805 0.477 4.571 0.000 1.246 3.115
inherit 0.1931 0.177 1.093 0.275 -0.153 0.540
viagra 3.6670 7.34e+04 5e-05 1.000 -1.44e+05 1.44e+05
password -0.4858 0.279 -1.738 0.082 -1.034 0.062
num_char -0.0146 0.038 -0.383 0.701 -0.089 0.060
line_breaks -0.0030 0.002 -1.652 0.098 -0.007 0.001
format -0.7135 0.208 -3.431 0.001 -1.121 -0.306
re_subj -2.0241 0.526 -3.846 0.000 -3.056 -0.993
exclaim_subj 0.1083 0.317 0.342 0.733 -0.513 0.729
urgent_subj 2.6611 4.1e+05 6.49e-06 1.000 -8.03e+05 8.03e+05
exclaim_mess 0.0123 0.002 5.069 0.000 0.008 0.017
number 0.8052 0.223 3.609 0.000 0.368 1.243
================================================================================
Examining the coefficients and how they influence the output.
Hypothesis testing for the coefficients: $$ H_0 : \beta_{\text{coefficient}} = 0 \qquad H_1 : \beta_{\text{coefficient}} \neq 0 $$
In hypothesis testing for the coefficients of the model, the null hypothesis is that a coefficient has no effect on the output (i.e., it is equal to zero), while the alternative hypothesis is that it does have an effect (i.e., it is not equal to zero). If the p-value of a coefficient is greater than 0.05, there is not enough evidence to reject the null hypothesis, so we cannot conclude that the coefficient has a significant effect on the output. In the summary table, several coefficients (for example from, cc, sent_email, time, inherit, and num_char) have p-values greater than 0.05, indicating that they do not contribute significantly to the model.
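As a quick programmatic check (a minimal sketch against the fitted statsmodels results object), the coefficients that fail the 0.05 level can be listed directly:
print(log_reg.pvalues[log_reg.pvalues > 0.05])  # coefficients with p-values above 0.05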
Evaluating the model using the test dataset.
Predictions using the statsmodels model
Xtest = test_data[['to_multiple', 'from', 'cc', 'sent_email', 'time',
'image', 'attach', 'dollar', 'winner', 'inherit', 'viagra', 'password',
'num_char', 'line_breaks', 'format', 're_subj', 'exclaim_subj',
'urgent_subj', 'exclaim_mess', 'number']]
ytest = test_data['spam']
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))
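Rounding the probabilities implements the 0.50 threshold described in the abstract; an equivalent, more explicit form (a stylistic alternative) is:
prediction = (yhat >= 0.5).astype(int).tolist()  # same 0.50 cutoff, stated explicitly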
Predictions using the vanilla implementation
predictions = predict(Xtest.to_numpy(), w)
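If the standardized variant sketched in Section 4 were used, the test features would need the training-set statistics applied before prediction (hypothetical names matching that sketch):
# Hypothetical: reuse the training mean/std so train and test share one scale
Xtest_std = (Xtest.to_numpy() - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
predictions_std = predict(Xtest_std, w_std)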
5. Evaluation
1. Calculating the confusion matrix for the statsmodels model
cm = np.array([[0, 0], [0, 0]])
# Rows index the true label, columns the predicted label
for i, j in zip(ytest, prediction):
    if i == j and i == 0:
        cm[0][0] = cm[0][0] + 1    # true negative
    elif i == j and j == 1:
        cm[1][1] = cm[1][1] + 1    # true positive
    elif i != j and i == 0:
        cm[0][1] = cm[0][1] + 1    # false positive
    else:
        cm[1][0] = cm[1][0] + 1    # false negative
cm
array([[625, 2],
[ 40, 7]])
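If scikit-learn is available (an assumption; it is not imported in this project), the same matrix can be produced in a single call:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, prediction))  # rows are true labels, columns predicted labels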
TP = cm[1][1]
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
Accuracy
accuracy = (TP + TN)/(TP + TN + FP + FN)
accuracy
0.9376854599406528
Sensitivity
sensitivity = TP/(TP + FN)
sensitivity
0.14893617021276595
Specificity
specificity = TN/(TN + FP)
specificity
0.9968102073365231
Precision
Precision = TP/(TP + FP)
Precision
0.7777777777777778
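A further metric that balances precision and sensitivity (not computed in the original run) is the F1 score, their harmonic mean:
f1 = 2 * (Precision * sensitivity) / (Precision + sensitivity)
f1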
2. Calculating the confusion matrix for the vanilla implementation
cm = np.array([[0, 0], [0, 0]])
# Iterate as (true label, predicted label), matching the statsmodels matrix above;
# ravel() flattens the (n, 1) prediction array so each element is a scalar
for i, j in zip(ytest, predictions.ravel()):
    if i == j and i == 0:
        cm[0][0] = cm[0][0] + 1    # true negative
    elif i == j and j == 1:
        cm[1][1] = cm[1][1] + 1    # true positive
    elif i != j and i == 0:
        cm[0][1] = cm[0][1] + 1    # false positive
    else:
        cm[1][0] = cm[1][0] + 1    # false negative
TP = cm[1][1]
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
accuracy = (TP + TN)/(TP + TN + FP + FN)
cm
array([[627,   0],
       [ 47,   0]])
accuracy
0.9302670623145401
6. Conclusion
In this model, a threshold of 0.5 (50%) is used to predict whether a given email is spam. The accuracy of the statsmodels model on the test set is approximately 93%, which may vary slightly with the random division of the dataset. The model has a specificity of about 99.7%, meaning it almost never misclassifies a legitimate email as spam, although its sensitivity of about 14.9% shows that a large share of spam still slips through the filter. Based on these results, the logistic regression model is effective at keeping legitimate email out of the spam folder, though there is clear room to improve its ability to catch spam.