Naive Bayes Classifier Spam Filter Example : 4 Easy Steps


In probability theory, Bayes' theorem describes a conditional probability: it updates the probability of an event based on evidence that has already been observed. You can use Naive Bayes as a supervised machine learning method to predict an outcome from the evidence present in your dataset. In this tutorial, you will learn how to classify an email as spam or not spam using the Naive Bayes Classifier.

Before the coding demonstration, let's look at Naive Bayes in brief.

What is the Naive Bayes Classifier Model?

Naive Bayes is based on the popular Bayesian machine learning approach. It is called Naive because it assumes that all the predictors in the dataset are independent of each other. The Naive Bayes Classifier algorithm is mostly used for binary and multiclass classification. The formula for the conditional probability is

P(A|B) = P(A ∩ B) / P(B)

source: https://en.wikipedia.org/wiki/Conditional_probability
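
Rearranging this definition gives Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), which is what the classifier actually uses. As a quick illustration, here is a small Python sketch that applies it to a made-up spam example; the probabilities below are assumed values, not estimated from any dataset.

p_spam = 0.30             # P(spam): assumed prior probability that an email is spam
p_word_given_spam = 0.60  # assumed P(word "offer" appears | spam)
p_word_given_ham = 0.05   # assumed P(word "offer" appears | not spam)

# P(word "offer" appears), by the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word "offer" appears)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # about 0.837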

There are three types of Naive Bayes models:

Multinomial

You apply the Multinomial model when the features have discrete frequency counts. For example, to classify an email as spam or not, you can use the counts of the words in the body of the mail.
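
As a tiny, self-contained illustration (not part of the main demonstration below), you can fit MultinomialNB on word counts from a handful of made-up messages:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up messages and labels, for illustration only (1 = spam, 0 = not spam)
emails = ["win money now", "cheap money offer", "meeting at noon", "project meeting schedule"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(emails)   # discrete word-count features

model = MultinomialNB()
model.fit(X_counts, labels)
print(model.predict(vectorizer.transform(["cheap offer now"])))  # expected: [1]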

Bernoulli

It is a good choice when your dataset has binary features and you want to make predictions from those binary features. For example, whether a buyer will buy a house or not.
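
A minimal sketch of BernoulliNB on binary features, loosely following the house-buying example (the feature names and data are made-up assumptions):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Assumed binary features: [loan_approved, has_savings, visited_property]
X = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0])  # 1 = buys the house, 0 = does not

model = BernoulliNB()
model.fit(X, y)
print(model.predict([[1, 1, 1]]))  # expected: [1] on this toy data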

Gaussian

If the dataset features are continuous and approximately normally distributed, then the Gaussian model is a good choice for making predictions.
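
A minimal sketch of GaussianNB on synthetic continuous features (the data is generated only for illustration):

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two classes drawn from normal distributions with different means
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=3.0, scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB()
model.fit(X, y)
print(model.predict([[0.1, -0.2], [2.9, 3.1]]))  # expected: [0 1]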

The popular use cases of the Naive Bayes Classifier are the following:

  • Spam Detection
  • Classification of the customer
  • Loan Classification
  • Health Risk Prediction

Assumptions for the Naive Bayes Classifier

Before building the prediction model, always check the following assumptions:

1. All the predictor features or variables should be independent of each other (a quick correlation check is sketched after this list).

2. The model is based on conditional probability. Therefore the historical events matter and should remain representative when predicting present events.
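
As a rough, informal check of the first assumption, you can inspect pairwise correlations between the predictors (a low absolute correlation is only a weak hint of independence). The sketch below is illustrative and not part of the original steps; x stands for any 2-D array of predictors, such as the one built in Step 2.

import numpy as np
import pandas as pd

# `x` is assumed to be a 2-D numeric array of predictors, e.g. dataset[:, :48]
# from Step 2 below; random data is used here only so the sketch runs on its own.
x = np.random.rand(100, 5)

corr = pd.DataFrame(x).corr().abs()
# Report feature pairs whose absolute correlation exceeds 0.8
rows, cols = np.where((corr.values > 0.8) & (corr.values < 1.0))
print([(i, j) for i, j in zip(rows, cols) if i < j])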

Coding Demonstration

Step 1: Import the necessary packages and libraries

import numpy as np
import pandas as pd
import urllib.request
import sklearn
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Sklearn (scikit-learn) is a machine learning package. You will import the Gaussian, Bernoulli, and Multinomial models from sklearn.naive_bayes. Import the train_test_split function from sklearn.model_selection, and for the accuracy score import accuracy_score from sklearn.metrics. Note that the dataset is fetched with urllib.request, which is the Python 3 form of the urllib module.

Step 2: Load the Dataset

In this coding demonstration, I am using Naive Bayes for spam classification. Here I am loading the spambase dataset directly from the UCI Machine Learning Repository using Python's urllib package.

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
raw_data = urllib.request.urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
dataset[0]

(Output: the first row of the spambase dataset from UCI.)

If you look at the dataset, there are 57 predictor attributes, and the first 48 of them are percentages of word counts. We will take these 48 attributes as predictors, and the last attribute, which holds the binary values 0 (not spam) and 1 (spam), as the target.

x = dataset[:,:48]
y = dataset[:,-1]

 

Step 3: Split the dataset into train and test sets

x_train,x_test,y_train,y_test= train_test_split(x,y,test_size = 0.33, random_state = 17)

Using sklearn.model_selection, you will split the dataset into train and test sets with a test size of 0.33. Please note that to reproduce the exact output, use the same value of random_state, that is 17.

Step 4: Model the Naive Bayes prediction on the dataset

In this step, we will build all three Naive Bayes models and, after comparing them, select the best one.

Bernoulli

BernNB = BernoulliNB(binarize=True)  # binarize=True acts as a threshold of 1.0 on the features
BernNB.fit(x_train,y_train)
print(BernNB)
y_expect = y_test
y_predict = BernNB.predict(x_test)
print(accuracy_score(y_expect,y_predict))


Multinomial

MultiNB = MultinomialNB()
MultiNB.fit(x_train,y_train)
print(MultiNB)
y_expect = y_test
y_predict = MultiNB.predict(x_test)
print(accuracy_score(y_expect,y_predict))


Gaussian

GaussNB = GaussianNB()
GaussNB.fit(x_train,y_train)
print(GaussNB)
y_expect = y_test
y_predict = GaussNB.predict(x_test)
print(accuracy_score(y_expect,y_predict))


Of the three models, the Multinomial classifier gives the highest accuracy score, so we select it. You can often improve a score by modifying the argument values. For example, in the case of the Bernoulli model, if you use binarize=0.25 then its score rises to 0.8966, which is higher than the others. Thus you choose the model with the highest score.
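
For reference, here is the tuned Bernoulli model mentioned above, reusing the train/test split from Step 3:

BernNB = BernoulliNB(binarize=0.25)   # treat feature values above 0.25 as 1, else 0
BernNB.fit(x_train, y_train)
y_predict = BernNB.predict(x_test)
print(accuracy_score(y_test, y_predict))  # reported above as roughly 0.8966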


Performance Metrics for Classification

There are several performance metrics for classification models, such as the confusion matrix, the AUC-ROC curve, the F1 score, precision and recall, and accuracy. In the above demonstration, we used the accuracy metric. Which one is best depends entirely on the problem statement.
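
For example, a short sketch like the one below (not part of the original steps) computes the confusion matrix, precision, recall, F1 score, and AUC-ROC for the Multinomial model from Step 4:

from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

y_predict = MultiNB.predict(x_test)
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))                     # precision, recall, F1 score
print(roc_auc_score(y_test, MultiNB.predict_proba(x_test)[:, 1]))   # AUC-ROC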

Conclusion

Naive Bayes is a conditional-probability-based machine learning model. You can use it as a binary or multiclass classification model. In fact, choosing the model depends upon the accuracy scores of all its types: Bernoulli, Multinomial, and Gaussian. The higher the score, the more accurate the predictions. You can also tweak some of the arguments to obtain a higher score.

If you have any suggestions regarding this tutorial, then please message us on the Data Science Learner Page.

 
