LIBRARIES :
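➢ tkinter – Python's standard library for building graphical user interfaces (GUIs)
➢ ttk – tkinter's themed-widget module; provides widgets such as buttons, labels and entry
fields
➢ messagebox – tkinter module used to show pop-up dialogs (information, warning and error
messages)
➢ PIL – Python Imaging Library – used for loading and editing images (its ImageTk module
converts them into a form that tkinter can display)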
➢ NumPy – Numerical Python – provides fast arrays for storing data and the mathematical
operations on them
➢ pandas – Python Data Analysis/Panel Data – used to load, clean and analyse tabular data,
e.g., computing statistics and finding invalid or missing (NULL) values
➢ itertools – standard-library module of tools for building and combining iterators
➢ sklearn.model_selection – scikit-learn module with tools for splitting data and
evaluating models
➢ train_test_split – function from sklearn.model_selection that splits the data into a
TRAIN set (used to fit the model) and a TEST set (used to evaluate it)
➢ TfidfVectorizer – imported from sklearn.feature_extraction.text – converts the raw text
into a TF-IDF matrix
➢ sklearn.metrics – functions for measuring model performance, such as the accuracy score
and the confusion matrix
➢ Passive Aggressive Classifier (PAC) Algorithm – an online-learning algorithm that
classifies each news article as REAL or FAKE (FND runs on PAC, and the PAC is trained on
the TF-IDF features of the news text)
a. Passive – if the prediction is correct, the model is left unchanged
b. Aggressive – if the prediction is incorrect, the model is updated just enough to
correct the mistake
c. Classifier – the model assigns each input to a class (here, REAL or FAKE)
• The machine learning algorithm is given a set of labelled news articles. A percentage
of them is used to train the model. The remaining articles are held back, and the
trained model predicts whether each of them is true or fake.
• For example, if 1000 articles are available, 800 (with their known true/fake labels)
are used for training. The model then predicts the labels of the remaining 200, and
these predictions are compared with the actual labels.
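The passive/aggressive behaviour can be sketched with the basic update rule from the Passive-Aggressive family of algorithms. This is a minimal illustration, not the scikit-learn implementation (which, as far as I know, additionally caps the step size with its C parameter). The function name pa_update and the encoding of REAL as +1 and FAKE as -1 are assumptions made for this example.

```python
# Minimal Passive-Aggressive update for a single sample (illustrative only).
# Assumed label encoding: +1 = REAL, -1 = FAKE.

def pa_update(weights, features, label):
    """Return the weights after seeing one (features, label) pair."""
    score = sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - label * score)        # hinge loss
    if loss == 0.0:
        return list(weights)                     # passive: leave the model alone
    norm_sq = sum(x * x for x in features)
    step = loss / norm_sq                        # aggressive: smallest step that fixes it
    return [w + step * label * x for w, x in zip(weights, features)]


start = [0.0, 0.0]
after_first = pa_update(start, [1.0, 1.0], +1)        # loss > 0, weights change
after_repeat = pa_update(after_first, [1.0, 1.0], +1)  # loss == 0, weights unchanged
after_fake = pa_update(after_first, [1.0, 0.0], -1)    # wrong side, weights change again

print(after_first, after_repeat, after_fake)
```

The second call returns the same weights because the sample is already classified with enough margin, which is the "passive" case.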
➢ TF-IDF – Term Frequency – Inverse Document Frequency
➢ TF (Term Frequency) – number of times a specific term appears in a document
➢ Document Frequency – number of documents containing a specific term
➢ Inverse Document Frequency – logarithm of (total number of documents / document
frequency); it is high for rare terms and low for terms that appear in most documents
(indicates how important the term is)
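As a worked example of these definitions, the sketch below computes TF, document frequency and IDF by hand for three made-up documents. It uses the textbook formula tf × log(N / df); TfidfVectorizer uses a smoothed and normalised variant by default, so its numbers will differ. The documents and helper names are hypothetical.

```python
import math

# Three tiny made-up "documents".
docs = [
    "election results announced today",
    "aliens rigged the election",
    "aliens aliens everywhere",
]
tokenized = [d.split() for d in docs]


def tf(term, tokens):
    """Term frequency: how often the term occurs in one document."""
    return tokens.count(term)


def df(term):
    """Document frequency: how many documents contain the term."""
    return sum(1 for tokens in tokenized if term in tokens)


def idf(term):
    """Inverse document frequency: log(total documents / document frequency)."""
    return math.log(len(tokenized) / df(term))


def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)


print(tf("aliens", tokenized[2]), df("aliens"))
print(idf("today"), idf("election"))
print(tf_idf("aliens", tokenized[2]))
```

The term "today" occurs in only one document, so its IDF is higher than that of "election", which occurs in two.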
➢ accuracy_score – used to calculate the accuracy of the model’s predictions
➢ Confusion Matrix – summarises the performance of the ML algorithm by comparing the
predicted labels with the actual labels.
• For two classes it has 2 rows and 2 columns, giving four cells – TP, TN, FP, FN
(taking true news as the positive class)
1. TP (True Positive) – news which is actually true and predicted as true
2. TN (True Negative) – news which is actually false and predicted as false
3. FP (False Positive) – news which is actually false but predicted as true
4. FN (False Negative) – news which is actually true but predicted as false
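The four cells can be counted by hand on a small example. The sketch below assumes REAL is the positive class and uses made-up labels; it does not rely on scikit-learn.

```python
# Made-up actual and predicted labels for seven articles.
actual    = ["REAL", "REAL", "REAL", "FAKE", "FAKE", "FAKE", "FAKE"]
predicted = ["REAL", "REAL", "FAKE", "FAKE", "FAKE", "FAKE", "REAL"]

pairs = list(zip(actual, predicted))          # (actual, predicted) per article

tp = pairs.count(("REAL", "REAL"))   # actually true, predicted true
tn = pairs.count(("FAKE", "FAKE"))   # actually false, predicted false
fp = pairs.count(("FAKE", "REAL"))   # actually false, predicted true
fn = pairs.count(("REAL", "FAKE"))   # actually true, predicted false

accuracy = (tp + tn) / len(pairs)    # share of correct predictions
print(tp, tn, fp, fn, accuracy)
```

Accuracy is the share of articles that fall in the TP and TN cells.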
CODE :
1. Defining a function:
def accuracy( ):
• This line defines a function named accuracy.
2. Reading and Loading the Data:
df=pd.read_csv('news.csv')
• This line reads the data from a CSV file named 'news.csv' and stores it in a pandas
DataFrame called df. The DataFrame will contain the data from the CSV file,
allowing us to manipulate and analyse it.
➢ DataFrame – two-dimensional, table-like data structure provided by the pandas library –
used for data analysis
➢ .csv file – comma-separated values file – plain-text file that stores tabular data, one
row per line (unlike an Excel sheet, it holds no formatting or formulas)
3. Exploring the Data:
df.shape
df.head( )
• The shape attribute of a DataFrame returns a tuple (number of rows, number of columns).
Here its value is not assigned to any variable, so it is simply discarded.
• The head( ) method returns the first five rows of the DataFrame. The result will not be
visible on the GUI or the console because it is not passed to the print( ) function.
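The loading and exploring steps can be tried without the real file. The sketch below reads a small in-memory CSV instead of news.csv; the column names and rows are invented stand-ins, and the real file may look different.

```python
import io

import pandas as pd

# Hypothetical stand-in for news.csv (the real columns may differ).
csv_text = """title,text,label
A,Government passes budget,REAL
B,Aliens run the treasury,FAKE
C,Vaccine trial succeeds,REAL
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)     # tuple of (rows, columns)
print(df.head(2))   # first two rows; head() with no argument returns five
```

Wrapping the calls in print( ) is what makes the values visible when the code runs as a script.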
4. Preparing the Data for Machine Learning:
labels = df.label
labels.head( )
• The 'label' column of the DataFrame is extracted and assigned to a variable called
labels.
5. Splitting the Data into Training and Testing Sets:
x_train, x_test, y_train, y_test = train_test_split (df['text'], labels, test_size=0.2,
random_state=7)
• The train_test_split function from scikit-learn is used to split the dataset into
training and testing sets.
• The 'text' column of the DataFrame is used as the feature (split into x_train and
x_test) and the 'label' column is used as the target (split into y_train and y_test).
• The test_size parameter specifies the fraction of the data to be used for testing (in
this case 0.2, i.e., 20%), and random_state fixes the random shuffling so that the same
train-test split is produced every time the code is run with the same random_state value.
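A small sketch of the split, using ten made-up articles in place of the real dataset, shows the 80/20 proportion and the effect of random_state:

```python
from sklearn.model_selection import train_test_split

# Ten made-up articles with alternating labels.
texts = [f"article {i}" for i in range(10)]
labels = ["REAL" if i % 2 == 0 else "FAKE" for i in range(10)]

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=7
)

# Repeating the call with the same random_state gives the same split.
again = train_test_split(texts, labels, test_size=0.2, random_state=7)

print(len(x_train), len(x_test))
print(x_test == again[1])
```

With test_size=0.2, two of the ten articles end up in the test set and eight in the training set.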
6. TF-IDF Vectorization:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)
• TF-IDF vectorization converts the text data (news) into numerical vectors.
TfidfVectorizer from scikit-learn is used for this purpose. The parameter
stop_words='english' removes common English stop words, and max_df=0.7 sets
the maximum document frequency for the words to be included in the vocabulary
(words appearing in more than 70% of the documents will be ignored).
• The fit_transform( ) method is applied to the training data (x_train) to learn the
vocabulary and transform the text into TF-IDF vectors.
• The transform( ) method is applied to the testing data (x_test) to transform it using
the learned vocabulary.
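The effect of stop_words, max_df and the fit/transform split can be seen on a toy corpus. The documents below are invented for illustration; the word "news" appears in every training document so that max_df=0.7 removes it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training and test documents.
train_docs = [
    "news government announces budget",
    "news aliens control the government",
    "news scientists test vaccine",
    "news miracle pill cures the flu",
]
test_docs = ["government bans unicorn"]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
tfidf_test = vectorizer.transform(test_docs)        # reuses that vocabulary

vocab = vectorizer.vocabulary_
print("news" in vocab, "the" in vocab, "government" in vocab)
print(tfidf_train.shape, tfidf_test.shape)
```

Both matrices have the same number of columns because the test data is encoded with the vocabulary learned from the training data. Words seen only in the test data, such as "unicorn" here, are ignored.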
7. Training the Classifier:
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train, y_train)
• The max_iter parameter sets the maximum number of passes over the training data (here,
50). The classifier is trained using the TF-IDF vectors of the training data
(tfidf_train) and their corresponding labels (y_train).
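As a rough sketch of this step on toy data, the example below fits the classifier on four invented articles. The texts, labels and the random_state value are assumptions made only to keep the example small and repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Made-up labelled articles.
texts = [
    "government announces budget",
    "scientists test vaccine",
    "aliens control the treasury",
    "miracle pill cures everything",
]
labels = ["REAL", "REAL", "FAKE", "FAKE"]

tfidf_train = TfidfVectorizer().fit_transform(texts)

pac = PassiveAggressiveClassifier(max_iter=50, random_state=0)
pac.fit(tfidf_train, labels)

print(pac.classes_)      # class labels the model learned
print(pac.coef_.shape)   # one weight per vocabulary term
```

After fitting, the model holds one weight per TF-IDF feature; these weights are what predict( ) uses to classify new vectors.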
8. Making Predictions and Calculating Accuracy:
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
• The trained classifier is used to predict the labels for the test data (tfidf_test, the
TF-IDF vectors of x_test). The predictions are stored in the y_pred variable.
• The accuracy_score( ) function from scikit-learn is used to compare the predicted
labels (y_pred) with the actual labels (y_test) and calculate the accuracy of the
model. The accuracy value is stored in the score variable.
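The accuracy calculation itself can be checked on a handful of made-up labels, independent of any trained model:

```python
from sklearn.metrics import accuracy_score

# Made-up actual and predicted labels: three of four predictions are correct.
y_test = ["FAKE", "REAL", "REAL", "FAKE"]
y_pred = ["FAKE", "REAL", "FAKE", "FAKE"]

score = accuracy_score(y_test, y_pred)   # fraction of matching labels
accu = str(round(score * 100, 2)) + "%"  # same formatting as in the project code

print(score, accu)
```

accuracy_score returns a fraction between 0 and 1, which is why the project code multiplies it by 100 before displaying it.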
9. Displaying the Results on the GUI:
print(f'Accuracy: {round(score*100,2)}%')
CNF = confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])
print(CNF)
accu = str(round(score*100, 2)) + "%"
e1.insert(10, accu)
real = str(CNF[1, 1])
e2.insert(10, real)
fake = str(CNF[0, 0])
e3.insert(10, fake)
• The code prints the accuracy of the model and the confusion matrix to the console.
The accuracy is also inserted into an Entry widget e1 on the GUI.
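• The confusion matrix is calculated using the confusion_matrix( ) function from
scikit-learn. With labels=['FAKE', 'REAL'], CNF[0, 0] is the number of FAKE articles
correctly predicted as FAKE and CNF[1, 1] is the number of REAL articles correctly
predicted as REAL. These two diagonal counts are inserted into the e2 and e3 Entry widgets.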
• The code assumes that there are 3 Entry widgets (e1, e2, and e3) which are already
defined on the GUI, where the accuracy and confusion matrix values will be
displayed. If these Entry widgets are not defined earlier in the code, this part will
result in an error.
• e1/e2/e3 – names of the Entry widgets (they can be named anything)
• insert( ) – inserts the given text into the Entry widget
• 10 – character index at which the text is inserted; if the Entry holds fewer characters
(e.g., it is empty), the text is added at the end
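Which cell of the matrix holds which count depends on the order passed in labels. The sketch below uses made-up labels to show the layout for labels=['FAKE', 'REAL']; the variable names fake_correct and real_correct are chosen for this example only.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels.
y_test = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL"]
y_pred = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "FAKE"]

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
CNF = confusion_matrix(y_test, y_pred, labels=["FAKE", "REAL"])

fake_correct = CNF[0, 0]   # actually FAKE, predicted FAKE
real_correct = CNF[1, 1]   # actually REAL, predicted REAL

print(CNF)
print(fake_correct, real_correct)
```

The off-diagonal cells, CNF[0, 1] and CNF[1, 0], hold the misclassified articles.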
GUI :