
Building a Smarter AI-Powered Spam Classifier

Phase 3 Submission Document

Project Title: Development Part 1

Topic: Begin building your project by loading and preprocessing the dataset

INTRODUCTION:

In the realm of data-driven projects, success often hinges on the quality and readiness of the
dataset under examination. Loading and preprocessing this data is a foundational step, setting
the stage for robust analysis, modeling, and decision-making. In this section, we will delve into
the critical processes of acquiring, loading, and preparing the dataset for our project.

 Dataset Overview: We will begin by providing a brief overview of the dataset under
investigation. This includes its source, the context in which it was collected, and the
primary objective of its utilization within the project.

 Data Acquisition: This section will discuss the methods employed to obtain the dataset.
It may include data collection procedures, sources, and any ethical considerations
associated with data gathering.

 Data Loading: Loading the dataset into our analysis environment is a pivotal task. We
will discuss the tools and techniques used for importing the data, whether it be from a
database, CSV file, API, or other sources.

 Data Preprocessing: Raw data seldom arrives in the perfect format for analysis. This subsection will cover data preprocessing steps such as handling missing values, dealing with outliers, and converting data types to ensure the data is ready for analytical tasks (a minimal sketch follows this list).

 Data Exploration: While primarily an exploratory process, this phase is crucial in identifying initial patterns and trends within the data, which may inform subsequent project directions.

 Data Quality Assurance: Quality control is integral to ensuring the integrity of the
dataset. We will discuss measures taken to validate and clean the data, maintaining its
accuracy and reliability.
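To make the preprocessing steps above concrete, here is a minimal sketch in pandas. The dataframe and its 'label' and 'text' columns are hypothetical placeholders for illustration, not the project's actual variables:

import pandas as pd

# Hypothetical dataframe used only to illustrate the steps described above.
df = pd.DataFrame({
    'label': ['ham', 'spam', 'ham', None],
    'text':  ['See you at 5', 'WIN a prize now!!', 'ok lor', 'Call me'],
})

# Handling missing values: drop rows whose label or text is missing.
df = df.dropna(subset=['label', 'text'])

# Converting data types: labels as a categorical type, text as plain strings.
df['label'] = df['label'].astype('category')
df['text'] = df['text'].astype(str)

# A simple outlier check for text data: inspect message lengths.
df['length'] = df['text'].str.len()
print(df['length'].describe())  # unusually long messages may warrant review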
DATASET:

Context:

The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.

Content:
The file contains one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

This corpus has been collected from free or free-for-research sources on the Internet:

 A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the spam message itself. Identifying the text of the spam messages within the claims is a hard and time-consuming task, and it involved carefully scanning hundreds of web pages.

 A subset of 3,375 SMS ham messages randomly chosen from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university, and were collected from volunteers who were made aware that their contributions would be made publicly available.

 A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis.

 Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages, and it is publicly available.

This is an automatically generated kernel with starter code demonstrating how to read in the data and begin exploring. Click the blue "Edit Notebook" or "Fork Notebook" button at the top of this kernel to begin editing.

Acknowledgements:
The original dataset can be found at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. The creators ask that, if you find the dataset useful, you reference their paper and that web page in your own papers and research.
We offer a comprehensive study of this corpus in the following paper. This work presents a number
of statistics, studies and baseline results for several machine learning methods.

Exploratory Analysis:
To begin this exploratory analysis, first import the required libraries and define functions for plotting the data. Depending on the data, not all plots will be made. (Hey, I'm just a kerneling bot, not a Kaggle Competitions Grandmaster!)
In [1]:

from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt  # plotting
import numpy as np  # linear algebra
import os  # accessing directory structure
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

There is 1 csv file in the current version of the dataset:

In [2]:

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

The next hidden code cells define functions for plotting data. Click on the "Code" button in the published kernel to reveal the hidden code.

In [3]:

# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    # For display purposes, pick columns that have between 1 and 50 unique values
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]]
    nRow, nCol = df.shape
    columnNames = list(df)
    # Integer division so plt.subplot() receives an int row count
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow
    plt.figure(num=None, figsize=(6 * nGraphPerRow, 8 * nGraphRow), dpi=80, facecolor='w', edgecolor='k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if not np.issubdtype(type(columnDf.iloc[0]), np.number):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation=90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad=1.0, w_pad=1.0, h_pad=1.0)
    plt.show()

In [4]:

# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName
    df = df.dropna(axis='columns')  # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]]  # keep columns with more than 1 unique value
    df = df.select_dtypes(include=[np.number])  # corr() is only defined for numeric columns
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum=1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()

In [5]:

# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include=[np.number])  # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]]  # keep columns with more than 1 unique value
    columnNames = list(df)
    if len(columnNames) > 10:  # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    # plt.np was removed in recent Matplotlib releases; use numpy directly
    for i, j in zip(*np.triu_indices_from(ax, k=1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction',
                          ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()
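Note that the helpers above are written for general tabular data: for a text-only file like spam.csv, the correlation and scatter helpers find no usable numeric columns and simply print a message or plot nothing. A small synthetic frame (column names and values are illustrative only, not drawn from the corpus) is a convenient way to see each helper in action:

# Smoke test for the three plotting helpers on a small synthetic frame.
np.random.seed(0)  # reproducible illustrative values
demo = pd.DataFrame({
    'label': ['ham', 'spam'] * 50,
    'length': np.random.randint(10, 60, size=100),
    'digits': np.random.randint(0, 12, size=100),
})
demo.dataframeName = 'demo'

plotPerColumnDistribution(demo, 10, 5)  # bar chart for 'label', histograms for the numeric columns
plotCorrelationMatrix(demo, 8)          # heatmap over the numeric columns
plotScatterMatrix(demo, 9, 10)          # pairwise scatter/density plots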
Now you're ready to read in the data and use the plotting functions to visualize the data.

In [6]:

nRowsRead = 1000  # specify 'None' if want to read whole file
# spam.csv has 5572 rows in reality, but we are only loading/previewing the first 1000 rows
df1 = pd.read_csv('/kaggle/input/spam.csv', delimiter=',', nrows=nRowsRead)
df1.dataframeName = 'spam.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-6-556be88e201a> in <module>
      1 nRowsRead = 1000  # specify 'None' if want to read whole file
      2 # spam.csv has 5572 rows in reality, but we are only loading/previewing the first 1000 rows
----> 3 df1 = pd.read_csv('/kaggle/input/spam.csv', delimiter=',', nrows=nRowsRead)
      4 df1.dataframeName = 'spam.csv'
      5 nRow, nCol = df1.shape

(... pandas parser internals elided ...)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte
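The traceback above is an encoding problem rather than a parsing bug: the Kaggle copy of spam.csv is not valid UTF-8, so pandas' default decoder fails partway through the file. A common fix, sketched below, is to pass an explicit single-byte encoding to read_csv; encoding='latin-1' is the usual choice for this file, though you should verify it against your own copy of the dataset. The cells that follow were recorded before this fix was applied, which is why they fail with NameError (df1 was never created).

# spam.csv is not UTF-8 encoded; an explicit single-byte encoding avoids the
# UnicodeDecodeError. 'latin-1' works for the usual Kaggle copy of this file,
# but verify it against your own copy of the dataset.
df1 = pd.read_csv('/kaggle/input/spam.csv', delimiter=',', nrows=nRowsRead,
                  encoding='latin-1')
df1.dataframeName = 'spam.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')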

Let's take a quick look at what the data looks like:

In [7]:

df1.head(5)

---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-e55bb665ba13> in <module>
----> 1 df1.head(5)

NameError: name 'df1' is not defined

Distribution graphs (histogram/bar graph) of sampled columns:

In [8]:

plotPerColumnDistribution(df1, 10, 5)

---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-8-a0a199b2d778> in <module>
----> 1 plotPerColumnDistribution(df1, 10, 5)

NameError: name 'df1' is not defined
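Once the encoding fix shown after the first traceback is applied and the notebook re-run, df1 exists and the two failing cells above succeed. As a quick sanity check on the label balance (using the v1 label column described in the dataset overview), a follow-up cell might look like this:

# Assumes df1 was re-loaded with encoding='latin-1' as sketched earlier.
print(df1['v1'].value_counts())  # ham vs. spam counts in the previewed rows

df1['v1'].value_counts().plot.bar()
plt.title('ham vs. spam (first 1000 rows)')
plt.ylabel('counts')
plt.show()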

Conclusion:
This concludes your starter analysis! To go forward from here, click the blue "Edit Notebook" button
at the top of the kernel. This will create a copy of the code and environment for you to edit. Delete,
modify, and add code as you please. Happy Kaggling!
