Fake News Detection

Fake news is false or misleading information presented as news, which leads people to the wrong conclusions. Nowadays it spreads widely, and people often share it without verifying it.

Fake news detection is a serious yet challenging problem. The rapid rise of social networking platforms has not only produced an enormous increase in information but has also accelerated the spread of fake news. As a result, the impact of fake news keeps growing, sometimes extending to the offline world and threatening public safety. Given the large volume of web content, automatic fake news detection is a practical solution for online content providers, because it reduces the human time and effort needed to detect fake news and prevent its spread.

Detecting Fake News using Python

First, we will import the libraries and the dataset. The dataset we’ll use for this project, which we’ll call news.csv, has a shape of 7796×4 (7796 rows and 4 columns).

Required packages:

pip install pandas

pip install numpy

pip install scikit-learn

You can download the dataset here.

Analyzing the data:

The next step is to analyze the data using functions such as head(), describe(), and count(). After that, we clean the data.
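As a rough sketch of this inspection step, the snippet below loads a tiny in-memory stand-in for news.csv and calls the functions mentioned above. The column names (title, text, label) and the sample rows are assumptions made for illustration; the real file may differ.

```python
import io

import pandas as pd

# Tiny stand-in for news.csv. The column names and rows are assumptions
# for illustration, not the contents of the real dataset.
csv_text = """,title,text,label
0,Headline one,Some article body,FAKE
1,Headline two,Another article body,REAL
2,Headline three,,REAL
"""

# With the real file this would be pd.read_csv("news.csv").
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)                    # (rows, columns)
print(data.head())                   # first few rows
print(data.describe(include="all"))  # summary statistics for every column
print(data.count())                  # non-null values per column
```

In this stand-in, the unnamed first column is read by pandas as "Unnamed: 0", and count() reports one fewer value for the text column because of the empty field. Both are the kind of issue the cleaning step addresses.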

Data Cleaning:

Remove columns that have no significance in the prediction process, such as ID or S.No., and drop rows with missing values.

                data=data.drop(axis=1,labels="Unnamed: 0")

                data=data.dropna()

Data Processing (extracting features from the text):

CountVectorizer tokenizes the text (breaks a sentence, paragraph, or any other text into words) and performs very basic preprocessing, such as removing punctuation marks and converting all words to lowercase. It then transforms each text into a vector of word counts, with one entry for each word in the vocabulary built from the entire collection of texts.
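To make this concrete, here is a minimal sketch of CountVectorizer on two made-up sentences. The sentences are hypothetical, and the name x_mat is chosen only to match the variable used in the split step later on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences standing in for the news text column.
texts = [
    "Breaking: Aliens land in the city!",
    "The city council approved the new budget.",
]

vectorizer = CountVectorizer()
x_mat = vectorizer.fit_transform(texts)  # sparse matrix of word counts

# Column order follows the alphabetically sorted vocabulary.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(x_mat.toarray())
```

Each row of the matrix is one text and each column is one word of the vocabulary. Punctuation is gone, every word is lowercase, and "the" is counted twice in the second sentence.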

Splitting the dataset:

To train any machine learning model, irrespective of the type of dataset being used, we have to split the dataset into training data and testing data.

from sklearn.model_selection import train_test_split as tts

x_train,x_test,y_train,y_test=tts(x_mat,y,test_size=0.2,random_state=10)

Here I have used ‘train_test_split’ to split the data in an 80:20 ratio, i.e. 80% of the data is used to train the model while the remaining 20% is used to test the model built from it.
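The effect of the 80:20 split can be checked on a small stand-in dataset. In the sketch below, x_mat and y are hypothetical arrays of 10 samples rather than the real feature matrix and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split as tts

# Hypothetical stand-ins: 10 samples with 2 features each, and 10 labels.
x_mat = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

x_train, x_test, y_train, y_test = tts(x_mat, y, test_size=0.2, random_state=10)

print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the training and testing sets on every run.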

Selection of Algorithm

Decision Tree Algorithm:

A Decision Tree is a white-box type of machine learning algorithm. It exposes its internal decision-making logic, which is not available in black-box algorithms such as neural networks. It usually trains faster than a neural network. The time complexity of decision trees is a function of the number of records and the number of attributes in the given data. The decision tree is a distribution-free, or non-parametric, method: it does not depend on probability distribution assumptions. Decision trees can also handle high-dimensional data, often with good accuracy.
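The following is a minimal sketch of training a decision tree on count vectors, using scikit-learn's DecisionTreeClassifier. The headlines and labels are invented toy data, and random_state=10 is an arbitrary choice. The export_text call prints the learned rules, which illustrates the white-box property described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy headlines and labels, for illustration only.
texts = [
    "shocking miracle cure doctors hate",
    "aliens secretly control the government",
    "council approves budget for new school",
    "central bank keeps interest rates unchanged",
]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Turn the text into word-count vectors.
vectorizer = CountVectorizer()
x_mat = vectorizer.fit_transform(texts)

# Fit the tree; random_state makes tie-breaking reproducible.
model = DecisionTreeClassifier(random_state=10)
model.fit(x_mat, labels)

# Print the learned decision rules in readable form.
feature_names = sorted(vectorizer.vocabulary_)
print(export_text(model, feature_names=feature_names))

# Classify an unseen headline using the same vocabulary.
new_text = vectorizer.transform(["miracle cure approved by council"])
prediction = model.predict(new_text)
print(prediction)
```

With only four training texts the tree is trivially small, so the prediction here says nothing about real-world accuracy. On the actual dataset the model would be fitted on x_train and y_train from the split step.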

Confusion Matrix

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes.

The accuracy of the prediction can be calculated by summing the diagonal elements of the matrix (the correctly classified samples) and dividing by the sum of all the elements of the matrix.
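The accuracy calculation can be sketched as follows. The true and predicted labels are made up for illustration, and the result is compared against scikit-learn's accuracy_score as a sanity check.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for six articles.
y_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
y_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])
print(cm)
# [[2 1]
#  [1 2]]

# Accuracy = sum of the diagonal / sum of all elements.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)                        # 4 / 6
print(accuracy_score(y_true, y_pred))  # same value
```

Here N = 2, so the matrix is 2 x 2. The diagonal holds the 4 correct predictions out of 6, which gives an accuracy of about 0.67.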

Get Project Source Code Here
