Fake News Detection Using Machine Learning

1. Learn How to Implement Fake News Detection using Machine Learning

Fake news is a serious problem that spreads rapidly across a variety of platforms: it can ignite social conflict and permanently damage relationships between individuals. A great deal of research is now being done on how to classify fake news. In this article, we explain fake news detection using machine learning.

2. What is Fake News?

Fake news, a form of yellow journalism, refers to news stories that may be hoaxes and are typically disseminated through social media and other online media. It is frequently produced with a political motive, to advance or impose particular beliefs. Such news stories may make misleading or exaggerated claims, go viral through recommendation algorithms, and trap users in a distorted picture of reality.


3. What is TfidfVectorizer?

TF (Term Frequency): Term frequency measures how often a word appears in a document. A higher value means the term occurs more frequently; when that term is one of the search terms, a high TF indicates the document is a good match.

IDF (Inverse Document Frequency): A term that appears in many documents across the collection is likely not significant, so IDF assesses a term's general relevance by down-weighting words that occur in many documents. From a collection of raw documents, the TfidfVectorizer generates a matrix of TF-IDF features.
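
To make this concrete, here is a minimal sketch of TfidfVectorizer on a tiny, made-up corpus (the three sentences below are hypothetical examples, not from the article's dataset):

  from sklearn.feature_extraction.text import TfidfVectorizer

  # A tiny, hypothetical corpus for illustration only
  corpus = [
      "the president signed the bill",
      "the bill was fake news",
      "scientists published a new study",
  ]

  vectorizer = TfidfVectorizer()
  tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix of TF-IDF scores

  print(vectorizer.get_feature_names_out())        # the learned vocabulary
  print(tfidf_matrix.shape)                        # (3 documents, number of unique terms)

Common words like "the" receive low TF-IDF weights because they appear in many documents, while rarer, more informative words receive higher weights.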

4. How to Detect Fake News with Python and Machine Learning?

This Python solution for fake news detection deals with both fake and real news. Using Sklearn, we create a TfidfVectorizer for our dataset, then fit classifiers (logistic regression and a decision tree) on the vectorized text. In the end, the confusion matrix and accuracy score let us know how well our models work. A passive aggressive classifier is another popular choice for this task.
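
As a minimal sketch of that alternative (assuming the TF-IDF features x_train, x_test and labels y_train, y_test produced in Step 4 below), a passive aggressive classifier could be trained like this:

  from sklearn.linear_model import PassiveAggressiveClassifier
  from sklearn.metrics import accuracy_score

  # Online-learning linear classifier; max_iter=50 is a typical starting value
  pac = PassiveAggressiveClassifier(max_iter=50)
  pac.fit(x_train, y_train)
  print(accuracy_score(y_test, pac.predict(x_test)))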

4.1 Requirements

To implement the machine learning process for detecting fake news, we need to perform the following steps.

  • Importing Datasets and Libraries
  • Data Preprocessing
  • Preparation and examination of a news article
  • Text to vector conversion
  • Training, assessment, and prediction of models

4.2 Step 1: Importing Datasets and Libraries

We can use libraries such as:

  • Pandas for importing the dataset
  • Seaborn/Matplotlib for performing data visualization

In Python:

  import pandas as pd
  import seaborn as sns
  import matplotlib.pyplot as plt
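
If any of the packages used in this article are missing, they can usually be installed from PyPI first (the package names below are the standard PyPI ones, assumed here rather than stated in the original article):

  pip install pandas seaborn matplotlib nltk wordcloud scikit-learn tqdm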

Now, import the dataset:

  data = pd.read_csv('News.csv', index_col=0)
  data.head()

Output

[Output: first five rows of the dataset]

4.3 Step 2: Data Preprocessing

The code below can be used to determine the dataset's shape.

  data.shape

Output

  (44919, 5)

The title, subject, and date columns won't be useful in identifying whether the news is fake, so we can remove them.

  data = data.drop(["title", "subject", "date"], axis=1)

Next, we must check whether any values are null (we would drop those rows).

  data.isnull().sum()

Output

  text     0
  class    0

Thus, no null values are present.

To avoid bias in the model, we must now shuffle the dataset. After resetting the index, we drop the index column, since it is of no use to the model.

  # Shuffling
  data = data.sample(frac=1)
  data.reset_index(inplace=True)
  data.drop(["index"], axis=1, inplace=True)

Let's now examine how many samples of each class the dataset contains, using the code below.

  sns.countplot(data=data,
                x='class',
                order=data['class'].value_counts().index)

[Output: bar chart of the number of samples in each class]


4.4 Step 3: Preparation and Analysis of a News Article

First, we'll clear the text of any unnecessary spaces, punctuation, and stopwords. The NLTK library is required for that, and some of its modules must be downloaded, so execute the code below.

  from tqdm import tqdm
  import re
  import nltk
  nltk.download('punkt')
  nltk.download('stopwords')
  from nltk.corpus import stopwords
  from nltk.tokenize import word_tokenize
  from nltk.stem.porter import PorterStemmer
  from wordcloud import WordCloud

Once we have all the necessary modules, we can create a function named preprocess_text. This function will preprocess all of the input data.

  def preprocess_text(text_data):
      preprocessed_text = []
      for sentence in tqdm(text_data):
          # Strip punctuation, lowercase each token, and drop English stopwords
          sentence = re.sub(r'[^\w\s]', '', sentence)
          preprocessed_text.append(
              ' '.join(token.lower()
                       for token in str(sentence).split()
                       if token not in stopwords.words('english')))
      return preprocessed_text

Execute the code below to apply the function to all of the news items in the text column.

  preprocessed_review = preprocess_text(data['text'].values)
  data['text'] = preprocessed_review

Now, let's visualize a separate WordCloud for real and fake news.

  # Real
  consolidated = ' '.join(
      word for word in data['text'][data['class'] == 1].astype(str))
  wordCloud = WordCloud(width=1600,
                        height=800,
                        random_state=21,
                        max_font_size=110,
                        collocations=False)
  plt.figure(figsize=(15, 10))
  plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
  plt.axis('off')
  plt.show()

Output

[Output: word cloud of real news]

  # Fake
  consolidated = ' '.join(
      word for word in data['text'][data['class'] == 0].astype(str))
  wordCloud = WordCloud(width=1600,
                        height=800,
                        random_state=21,
                        max_font_size=110,
                        collocations=False)
  plt.figure(figsize=(15, 10))
  plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
  plt.axis('off')
  plt.show()

Output

[Output: word cloud of fake news]

Let's now plot the top 20 most frequently used words in a bar graph.

  from sklearn.feature_extraction.text import CountVectorizer

  def get_top_n_words(corpus, n=None):
      # Count every word in the corpus and return the n most frequent ones
      vec = CountVectorizer().fit(corpus)
      bag_of_words = vec.transform(corpus)
      sum_words = bag_of_words.sum(axis=0)
      words_freq = [(word, sum_words[0, idx])
                    for word, idx in vec.vocabulary_.items()]
      words_freq = sorted(words_freq, key=lambda x: x[1],
                          reverse=True)
      return words_freq[:n]

  common_words = get_top_n_words(data['text'], 20)
  df1 = pd.DataFrame(common_words, columns=['Review', 'count'])
  df1.groupby('Review').sum()['count'].sort_values(ascending=False).plot(
      kind='bar',
      figsize=(10, 6),
      xlabel="Top Words",
      ylabel="Count",
      title="Bar Chart of Top Words Frequency")

Output

[Output: bar chart of the top 20 word frequencies]

4.5 Step 4: Text to Vector Conversion

Before converting the text to vectors, divide the data into train and test sets.

  from sklearn.model_selection import train_test_split
  from sklearn.metrics import accuracy_score
  from sklearn.linear_model import LogisticRegression

  x_train, x_test, y_train, y_test = train_test_split(data['text'],
                                                      data['class'],
                                                      test_size=0.25)

Using TfidfVectorizer, we can now turn the training data into vectors.

  from sklearn.feature_extraction.text import TfidfVectorizer

  vectorization = TfidfVectorizer()
  x_train = vectorization.fit_transform(x_train)
  x_test = vectorization.transform(x_test)
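
As an optional sanity check (assuming scikit-learn 1.0 or later, where get_feature_names_out is available), you can inspect the resulting matrix and a few of the learned terms:

  print(x_train.shape)                               # (number of documents, number of terms)
  print(vectorization.get_feature_names_out()[:10])  # a few terms from the learned vocabulary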

4.6 Step 5: Training, Assessment, and Prediction of Models

The dataset is now prepared for model training. We will use logistic regression for training and the accuracy score to measure prediction accuracy.

  from sklearn.linear_model import LogisticRegression

  model = LogisticRegression()
  model.fit(x_train, y_train)

  # testing the model
  print(accuracy_score(y_train, model.predict(x_train)))
  print(accuracy_score(y_test, model.predict(x_test)))

Output:

  0.993766511324171
  0.9893143365983972

Now, let's train a decision tree classifier.

  from sklearn.tree import DecisionTreeClassifier

  model = DecisionTreeClassifier()
  model.fit(x_train, y_train)

  # testing the model
  print(accuracy_score(y_train, model.predict(x_train)))
  print(accuracy_score(y_test, model.predict(x_test)))

Output:

  0.9999703167205913
  0.9951914514692787

The following code can be used to plot the confusion matrix for the decision tree classifier.

  from sklearn import metrics

  cm = metrics.confusion_matrix(y_test, model.predict(x_test))
  cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                              display_labels=[False, True])
  cm_display.plot()
  plt.show()

Output:

[Output: confusion matrix of the decision tree classifier]
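
Beyond accuracy and the confusion matrix, an optional addition (not part of the original walkthrough) is sklearn's classification_report, which prints per-class precision, recall, and F1 scores:

  from sklearn.metrics import classification_report

  # Per-class precision, recall, and F1, plus overall averages
  print(classification_report(y_test, model.predict(x_test)))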

5. Conclusion

Detecting fake news with machine learning and Python has been explained with executed code. We used a TfidfVectorizer to turn the news text into features and fit our models on the dataset. We hope you enjoyed this article. For more informative blogs that are easy to execute, keep coming back to SLA blogs. Gain satisfying hands-on experience and an IBM Certification by enrolling in our Machine Learning Course in Chennai.