Fake News Detection Using Machine Learning

1. Learn How to Implement Fake News Detection using Machine Learning

Fake news is a serious problem that spreads rapidly across a variety of platforms: it can ignite social conflict and permanently damage relationships between individuals. A great deal of research is now being done on how to classify fake news. In this article, we explain fake news detection using machine learning.

2. What is Fake News?

Fake news, a form of yellow journalism, refers to news stories that may be hoaxes and are typically disseminated through social media and other online media. It is frequently produced with a political motive, to advance or impose particular beliefs. Such news stories may make misleading or exaggerated claims, go viral through recommendation algorithms, and trap users in a distorted picture of reality.


3. What is TfidfVectorizer?

TF (Term Frequency): Term frequency measures how often a word appears in a document. A higher value means the term occurs more frequently; when that term is one of the search terms, a high TF indicates the document is a good match.

IDF (Inverse Document Frequency): A term that appears in many documents across the collection is likely not significant, so IDF assesses a term's general relevance by down-weighting words that occur in many documents. From a collection of raw documents, the TfidfVectorizer generates a matrix of TF-IDF features.
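
To make this concrete, here is a minimal sketch of TfidfVectorizer on a tiny, made-up corpus (the three sentences below are hypothetical examples, not from the article's dataset):

  from sklearn.feature_extraction.text import TfidfVectorizer

  # A tiny, hypothetical corpus for illustration only
  corpus = [
      "the president signed the bill",
      "the bill was fake news",
      "scientists published a new study",
  ]

  vectorizer = TfidfVectorizer()
  tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix of TF-IDF scores

  print(vectorizer.get_feature_names_out())        # the learned vocabulary
  print(tfidf_matrix.shape)                        # (3 documents, number of unique terms)

Common words like "the" receive low TF-IDF weights because they appear in many documents, while rarer, more informative words receive higher weights.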

4. How to Detect Fake News with Python and Machine Learning?

This Python solution for fake news detection deals with both fake and real news. Using Sklearn, we create a TfidfVectorizer for our dataset, then fit classifiers (logistic regression and a decision tree) on the vectorized text. In the end, the confusion matrix and accuracy score let us know how well our models work. A passive aggressive classifier is another popular choice for this task.
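
As a minimal sketch of that alternative (assuming the TF-IDF features x_train, x_test and labels y_train, y_test produced in Step 4 below), a passive aggressive classifier could be trained like this:

  from sklearn.linear_model import PassiveAggressiveClassifier
  from sklearn.metrics import accuracy_score

  # Online-learning linear classifier; max_iter=50 is a typical starting value
  pac = PassiveAggressiveClassifier(max_iter=50)
  pac.fit(x_train, y_train)
  print(accuracy_score(y_test, pac.predict(x_test)))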

4.1 Requirements

To implement the machine learning process for detecting fake news, we need to perform the following steps.

  • Importing Datasets and Libraries
  • Data Preprocessing
  • Preparation and examination of a news article
  • Text to vector conversion
  • Training, assessment, and prediction of models

4.2 Step 1: Importing Datasets and Libraries

We can use libraries such as:

  • Pandas for importing the dataset
  • Seaborn/Matplotlib for performing data visualization

In Python:

  import pandas as pd
  import seaborn as sns
  import matplotlib.pyplot as plt
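
If any of the packages used in this article are missing, they can usually be installed from PyPI first (the package names below are the standard PyPI ones, assumed here rather than stated in the original article):

  pip install pandas seaborn matplotlib nltk wordcloud scikit-learn tqdm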

Now, import the dataset:

  data = pd.read_csv('News.csv', index_col=0)
  data.head()

Output

[Output: first five rows of the dataset]

4.3 Step 2: Data Preprocessing

The code below can be used to determine the dataset's shape.

  data.shape

Output

  (44919, 5)

The title, subject, and date columns won't be useful in identifying whether the news is fake, so we can remove them.

  data = data.drop(["title", "subject", "date"], axis=1)

Next, we must check whether any values are null (we would drop those rows).

  data.isnull().sum()

Output

  text     0
  class    0

Thus, no null values are present.

To avoid bias in the model, we must now shuffle the dataset. After resetting the index, we drop the index column, since it is of no use to the model.

  # Shuffling
  data = data.sample(frac=1)
  data.reset_index(inplace=True)
  data.drop(["index"], axis=1, inplace=True)

Let's now examine how many samples of each class the dataset contains, using the code below.

  sns.countplot(data=data,
                x='class',
                order=data['class'].value_counts().index)

[Output: bar chart of the number of samples in each class]


4.4 Step 3: Preparation and Analysis of a News Article

First, we'll clear the text of any unnecessary spaces, punctuation, and stopwords. The NLTK library is required for that, and some of its modules must be downloaded, so execute the code below.

  from tqdm import tqdm
  import re
  import nltk
  nltk.download('punkt')
  nltk.download('stopwords')
  from nltk.corpus import stopwords
  from nltk.tokenize import word_tokenize
  from nltk.stem.porter import PorterStemmer
  from wordcloud import WordCloud

Once we have all the necessary modules, we can create a function named preprocess_text. This function will preprocess all of the input data.

  def preprocess_text(text_data):
      preprocessed_text = []
      for sentence in tqdm(text_data):
          # Strip punctuation, lowercase each token, and drop English stopwords
          sentence = re.sub(r'[^\w\s]', '', sentence)
          preprocessed_text.append(
              ' '.join(token.lower()
                       for token in str(sentence).split()
                       if token not in stopwords.words('english')))
      return preprocessed_text

Execute the code below to apply the function to all of the news items in the text column.

  preprocessed_review = preprocess_text(data['text'].values)
  data['text'] = preprocessed_review

Now, let's visualize a separate WordCloud for real and fake news.

  # Real
  consolidated = ' '.join(
      word for word in data['text'][data['class'] == 1].astype(str))
  wordCloud = WordCloud(width=1600,
                        height=800,
                        random_state=21,
                        max_font_size=110,
                        collocations=False)
  plt.figure(figsize=(15, 10))
  plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
  plt.axis('off')
  plt.show()

Output

[Output: word cloud of real news]

  # Fake
  consolidated = ' '.join(
      word for word in data['text'][data['class'] == 0].astype(str))
  wordCloud = WordCloud(width=1600,
                        height=800,
                        random_state=21,
                        max_font_size=110,
                        collocations=False)
  plt.figure(figsize=(15, 10))
  plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
  plt.axis('off')
  plt.show()

Output

[Output: word cloud of fake news]

Let's now plot the top 20 most frequently used words in a bar graph.

  from sklearn.feature_extraction.text import CountVectorizer

  def get_top_n_words(corpus, n=None):
      # Count every word in the corpus and return the n most frequent ones
      vec = CountVectorizer().fit(corpus)
      bag_of_words = vec.transform(corpus)
      sum_words = bag_of_words.sum(axis=0)
      words_freq = [(word, sum_words[0, idx])
                    for word, idx in vec.vocabulary_.items()]
      words_freq = sorted(words_freq, key=lambda x: x[1],
                          reverse=True)
      return words_freq[:n]

  common_words = get_top_n_words(data['text'], 20)
  df1 = pd.DataFrame(common_words, columns=['Review', 'count'])
  df1.groupby('Review').sum()['count'].sort_values(ascending=False).plot(
      kind='bar',
      figsize=(10, 6),
      xlabel="Top Words",
      ylabel="Count",
      title="Bar Chart of Top Words Frequency")

Output

[Output: bar chart of the top 20 word frequencies]

4.5 Step 4: Text to Vector Conversion

Before converting the text to vectors, divide the data into train and test sets.

  from sklearn.model_selection import train_test_split
  from sklearn.metrics import accuracy_score
  from sklearn.linear_model import LogisticRegression

  x_train, x_test, y_train, y_test = train_test_split(data['text'],
                                                      data['class'],
                                                      test_size=0.25)

Using TfidfVectorizer, we can now turn the training data into vectors.

  from sklearn.feature_extraction.text import TfidfVectorizer

  vectorization = TfidfVectorizer()
  x_train = vectorization.fit_transform(x_train)
  x_test = vectorization.transform(x_test)
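
As an optional sanity check (assuming scikit-learn 1.0 or later, where get_feature_names_out is available), you can inspect the resulting matrix and a few of the learned terms:

  print(x_train.shape)                               # (number of documents, number of terms)
  print(vectorization.get_feature_names_out()[:10])  # a few terms from the learned vocabulary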

4.6 Step 5: Training, Assessment, and Prediction of Models

The dataset is now prepared for model training. We will use logistic regression for training and the accuracy score to measure prediction accuracy.

  from sklearn.linear_model import LogisticRegression

  model = LogisticRegression()
  model.fit(x_train, y_train)

  # testing the model
  print(accuracy_score(y_train, model.predict(x_train)))
  print(accuracy_score(y_test, model.predict(x_test)))

Output:

  0.993766511324171
  0.9893143365983972

Now, let's train a decision tree classifier.

  from sklearn.tree import DecisionTreeClassifier

  model = DecisionTreeClassifier()
  model.fit(x_train, y_train)

  # testing the model
  print(accuracy_score(y_train, model.predict(x_train)))
  print(accuracy_score(y_test, model.predict(x_test)))

Output:

  0.9999703167205913
  0.9951914514692787

The following code can be used to plot the confusion matrix for the decision tree classifier.

  from sklearn import metrics

  cm = metrics.confusion_matrix(y_test, model.predict(x_test))
  cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                              display_labels=[False, True])
  cm_display.plot()
  plt.show()

Output:

[Output: confusion matrix of the decision tree classifier]
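
Beyond accuracy and the confusion matrix, an optional addition (not part of the original walkthrough) is sklearn's classification_report, which prints per-class precision, recall, and F1 scores:

  from sklearn.metrics import classification_report

  # Per-class precision, recall, and F1, plus overall averages
  print(classification_report(y_test, model.predict(x_test)))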

5. Conclusion

Detecting fake news with machine learning and Python has been explained with executed code. We used a TfidfVectorizer to turn the news text into features and fit our models on the dataset. We hope you enjoyed this article. For more informative blogs that are easy to execute, keep coming back to SLA blogs. Gain satisfying hands-on experience and an IBM Certification by enrolling in our Machine Learning Course in Chennai.