Fake News Detection
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 21417 non-null object
1 text 21417 non-null object
2 subject 21417 non-null object
3 date 21417 non-null object
4 label 21417 non-null int64
dtypes: int64(1), object(4)
memory usage: 836.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 44898 entries, 0 to 23480
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 44898 non-null object
1 text 44898 non-null object
2 subject 44898 non-null object
3 date 44898 non-null object
4 label 44898 non-null int64
dtypes: int64(1), object(4)
memory usage: 2.1+ MB
None
title \
0 As U.S. budget fight looms, Republicans flip t...
1 U.S. military to accept transgender recruits o...
2 Senior U.S. Republican senator: 'Let Mr. Muell...
3 FBI Russia probe helped by Australian diplomat...
4 Trump wants Postal Service to charge 'much mor...
text subject \
0 WASHINGTON (Reuters) - The head of a conservat... politicsNews
1 WASHINGTON (Reuters) - Transgender people will... politicsNews
2 WASHINGTON (Reuters) - The special counsel inv... politicsNews
3 WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews
4 SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews
date label
0 December 31, 2017 1
1 December 29, 2017 1
2 December 31, 2017 1
3 December 30, 2017 1
4 December 29, 2017 1
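import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# The NLTK stopword list and tokenizer data must be downloaded beforehand (nltk.download)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the text
    words = word_tokenize(text)
    # Remove stopwords and apply stemming
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)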
# Apply the preprocessing to the text column and store the result in a new column, cleaned_text.
data['cleaned_text'] = data['text'].apply(preprocess)
print(data.head())
title \
0 As U.S. budget fight looms, Republicans flip t...
1 U.S. military to accept transgender recruits o...
2 Senior U.S. Republican senator: 'Let Mr. Muell...
3 FBI Russia probe helped by Australian diplomat...
4 Trump wants Postal Service to charge 'much mor...
text subject \
0 WASHINGTON (Reuters) - The head of a conservat... politicsNews
1 WASHINGTON (Reuters) - Transgender people will... politicsNews
2 WASHINGTON (Reuters) - The special counsel inv... politicsNews
3 WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews
4 SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews
date label \
0 December 31, 2017 1
1 December 29, 2017 1
2 December 31, 2017 1
3 December 30, 2017 1
4 December 29, 2017 1
cleaned_text
0 washington reuter head conserv republican fact...
1 washington reuter transgend peopl allow first ...
2 washington reuter special counsel investig lin...
3 washington reuter trump campaign advis georg p...
4 seattlewashington reuter presid donald trump c...
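Row 4 of cleaned_text starts with "seattlewashington": deleting punctuation outright fuses SEATTLE/WASHINGTON into a single token. The following is a minimal standard-library sketch of that effect and of one possible alternative, mapping punctuation to spaces. The helper names strip_punct and space_punct are hypothetical, and plain str.split stands in for word_tokenize so the example needs no NLTK data.

```python
import string

# Delete punctuation characters, as preprocess() does.
DELETE_TABLE = str.maketrans('', '', string.punctuation)
# Alternative (an assumption, not the notebook's choice): turn punctuation into spaces.
SPACE_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punct(text):
    """Lowercase, delete punctuation, split on whitespace."""
    return text.lower().translate(DELETE_TABLE).split()

def space_punct(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return text.lower().translate(SPACE_TABLE).split()

sample = "SEATTLE/WASHINGTON (Reuters) - President Donald Trump"
print(strip_punct(sample))   # 'seattle' and 'washington' end up fused
print(space_punct(sample))   # the two place names stay separate tokens
```

Whether the fused token matters for accuracy is not something the outputs here can show; it only affects which of the 5000 TF-IDF features such datelines map to.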
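from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 5000 most frequent terms as TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(data['cleaned_text'])
# The labels
y = data['label']
print(X)
print(y)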
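To make the numbers inside X less opaque, here is a small standard-library sketch of TF-IDF weighting. It assumes the defaults that scikit-learn documents for TfidfVectorizer (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row) and ignores max_features and scikit-learn's own tokenisation. The function name tfidf and the three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["washington reuter trump", "washington reuter senat", "trump tweet hoax"]
rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that occur in fewer documents (here "tweet" and "hoax") receive a larger weight than terms shared across documents (here "trump"), which is the property the classifier relies on.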
"""
Model Selection
Common algorithms for text classification:
Logistic Regression
Naive Bayes
Support Vector Machines (SVM)
Random Forest"""
# Train/test split: hold out 20% of the rows for evaluation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train)
print(X_test)
print(y_train)
print(y_test)
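The cells above fit the vectorizer on every row before splitting, so the vocabulary and IDF weights also reflect the held-out articles. One alternative, sketched here on invented toy data, is to split the raw text first and fit TF-IDF on the training portion only. The names toy_texts, toy_labels and toy_vectorizer are hypothetical stand-ins for data['cleaned_text'], data['label'] and vectorizer; the stratify argument is an addition that the notebook does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy data standing in for the cleaned articles (1 = Reuters-style, 0 = other)
toy_texts = [
    "washington reuter senat vote budget",
    "reuter presid sign tax bill",
    "washington reuter court rule appeal",
    "reuter offici said statement tuesday",
    "shock video hoax goe viral",
    "viral tweet outrag hoax claim",
    "watch shock claim video outrag",
    "tweet goe viral shock watch",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the text first...
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# ...then learn the vocabulary and IDF weights from the training rows only
toy_vectorizer = TfidfVectorizer(max_features=5000)
Xtr = toy_vectorizer.fit_transform(txt_train)
Xte = toy_vectorizer.transform(txt_test)  # reuses the fitted vocabulary

print(Xtr.shape, Xte.shape)
```

Both matrices share the same column count because transform maps test documents onto the training vocabulary; words seen only in the test rows are simply ignored.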
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)
MultinomialNB()
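Naive Bayes is only one of the four algorithms listed under Model Selection. A rough sketch of how all four could be compared under the same TF-IDF features follows. It runs on invented toy data, so the printed scores say nothing about the real dataset; the hyperparameters (max_iter, n_estimators, cv=2) and the use of LinearSVC for the SVM entry are assumptions chosen to keep the example small.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data standing in for data['cleaned_text'] and data['label']
toy_texts = [
    "washington reuter senat vote budget",
    "reuter presid sign tax bill",
    "washington reuter court rule appeal",
    "reuter offici said statement tuesday",
    "shock video hoax goe viral",
    "viral tweet outrag hoax claim",
    "watch shock claim video outrag",
    "tweet goe viral shock watch",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, clf in candidates.items():
    # A fresh TF-IDF step per fold keeps the vectorizer from seeing validation rows
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipe, toy_texts, toy_labels, cv=2).mean()
    print(f"{name}: {scores[name]:.3f}")
```

On the real data the same loop would take data['cleaned_text'] and data['label'] in place of the toy lists, with a larger cv value.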
y_pred = model.predict(X_test)
Accuracy: 0.9276169265033407
[[4344 306]
[ 344 3986]]
precision recall f1-score support
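The cell that printed these metrics is not part of the export, and the rows of the classification report are cut off after the header. The figures were presumably produced with scikit-learn's accuracy_score, confusion_matrix and classification_report, but that is an assumption. The per-class numbers can still be recovered by hand from the confusion matrix, as in this standard-library sketch. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels, classes ordered 0 then 1); precision_recall is a hypothetical helper name.

```python
# Confusion matrix as printed above.
# Assumed layout: rows = true label, columns = predicted label, classes ordered 0, 1.
cm = [[4344, 306],
      [344, 3986]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision_recall(cm, k):
    """Precision and recall for class k from a square confusion matrix."""
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything truly k (support)
    return cm[k][k] / predicted_k, cm[k][k] / actual_k

print(f"test rows: {total}")
print(f"accuracy:  {accuracy:.4f}")
for k in (0, 1):
    p, r = precision_recall(cm, k)
    print(f"class {k}: precision={p:.4f} recall={r:.4f} support={sum(cm[k])}")
```

The matrix sums to 8980 rows, which matches a 20% hold-out of the 44898 combined articles, and the recomputed accuracy agrees with the printed 0.9276.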
# Deployment: persist the trained model and the fitted vectorizer
import joblib

joblib.dump(model, 'fake_news_detector.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
['tfidf_vectorizer.pkl']
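def predict_news(text):
    # Apply the same preprocessing and TF-IDF transform used during training
    cleaned_text = preprocess(text)
    vectorized_text = vectorizer.transform([cleaned_text])
    # Returns an array holding one label (1 is the label of the Reuters articles above)
    return model.predict(vectorized_text)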
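predict_news relies on model and vectorizer already being in memory. In a fresh process the saved artifacts would first be read back with joblib.load. The sketch below shows such a round trip on invented toy data. It swaps in a scikit-learn Pipeline so that a single file carries both the vectorizer and the classifier, which is a convenience and not what the notebook saves; the file name toy_pipeline.pkl is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data standing in for the cleaned articles (1 = Reuters-style, 0 = other)
toy_texts = [
    "washington reuter senat vote",
    "reuter presid sign bill",
    "shock hoax video viral",
    "viral tweet hoax outrag",
]
toy_labels = [1, 1, 0, 0]

# One pipeline object holds both the vectorizer and the classifier
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)  # what a separate serving process would do

# Input must already be preprocessed the same way as the training text
prediction = restored.predict(["reuter senat bill"])
print(prediction)
```

With the notebook's two separate files, the equivalent step would load 'fake_news_detector.pkl' and 'tfidf_vectorizer.pkl' individually and must still apply preprocess before calling transform.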