AI - Phase 3
Classifier
In the realm of data-driven projects, success often hinges on the quality and readiness of the
dataset under examination. Loading and preprocessing this data is a foundational step, setting
the stage for robust analysis, modeling, and decision-making. In this section, we will delve into
the critical processes of acquiring, loading, and preparing the dataset for our project.
Dataset Overview: We will begin by providing a brief overview of the dataset under
investigation. This includes its source, the context in which it was collected, and the
primary objective of its utilization within the project.
Data Acquisition: This section will discuss the methods employed to obtain the dataset.
It may include data collection procedures, sources, and any ethical considerations
associated with data gathering.
Data Loading: Loading the dataset into our analysis environment is a pivotal task. We
will discuss the tools and techniques used for importing the data, whether it be from a
database, CSV file, API, or other sources.
Data Preprocessing: Raw data seldom arrives in the perfect format for analysis. This
subsection will cover preprocessing steps such as handling missing values, dealing with
outliers, and converting data types so the data is ready for analytical tasks (a short
loading-and-preprocessing sketch follows this list).
Data Quality Assurance: Quality control is integral to ensuring the integrity of the
dataset. We will discuss measures taken to validate and clean the data, maintaining its
accuracy and reliability.
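To make the loading and preprocessing steps concrete, here is a minimal sketch for this project's SMS data. The file name, column names, and cleaning choices are assumptions based on the dataset description below, not a fixed pipeline.

import pandas as pd

# Load: spam.csv is commonly Latin-1 encoded (see the read error later in
# this section); v1 holds the label (ham/spam) and v2 the raw message text.
df = pd.read_csv('spam.csv', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']

# Preprocess: drop missing rows and duplicates, normalize types.
df = df.dropna().drop_duplicates()
df['text'] = df['text'].str.strip()
df['label'] = df['label'].astype('category')

# Quality assurance: the labels should be exactly {ham, spam}.
assert set(df['label'].cat.categories) == {'ham', 'spam'}
print(df['label'].value_counts())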
DATASET:
Context:
The SMS Spam Collection is a set of SMS messages that have been collected for SMS spam
research. It contains 5,574 SMS messages in English, tagged according to whether they are
ham (legitimate) or spam.
Content:
The files contain one message per line. Each line is composed of two columns: v1 contains the label
(ham or spam) and v2 contains the raw text.
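Schematically, the first lines of the file look like this (illustrative rows, not quotations from the corpus):

v1,v2
ham,"Ok, see you at lunch then"
spam,"WINNER! You have been selected to receive a free prize"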
This corpus has been collected from free or free-for-research sources on the Internet:
A collection of 425 SMS spam messages was manually extracted from the Grumbletext
Web site. This is a UK forum in which cell phone users make public claims about SMS spam
messages, most of them without reporting the spam message itself. Identifying the text of
the spam messages in the claims is a hard and time-consuming task, and it involved
carefully scanning hundreds of web pages.
A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus (NSC),
which is a dataset of about 10,000 legitimate messages collected for research at the
Department of Computer Science at the National University of Singapore. The messages
largely originate from Singaporeans and mostly from students attending the University.
These messages were collected from volunteers who were made aware that their
contributions were going to be made publicly available.
A list of 450 SMS ham messages collected from Caroline Tag's PhD thesis.
Finally, we have incorporated the SMS Spam Corpus v.0.1 Big, which has 1,002 SMS ham
messages and 322 spam messages and is publicly available.
Acknowledgements:
The original dataset can be found at the creators' page below. The creators ask that, if you
find the dataset useful, you reference their paper and the web
page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers and research.
The creators offer a comprehensive study of this corpus in an accompanying paper, which presents
a number of statistics, studies, and baseline results for several machine learning methods.
Exploratory Analysis:
To begin this exploratory analysis, we first import matplotlib and define helper functions for
plotting the data. Depending on the data, not all plots will be produced.
In [1]:
In [2]:
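The contents of these first cells were not preserved here. In kernels auto-generated from this template, the setup cell typically looks like the following (a reconstruction, not the original code):

import numpy as np                 # linear algebra
import pandas as pd                # data processing, CSV file I/O
import matplotlib.pyplot as plt    # plotting
import os

# List the files available in the input directory.
for dirname, _, filenames in os.walk('../input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))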
The next hidden code cells define functions for plotting the data. (In the published kernel,
these cells are collapsed behind the "Code" button.)
In [3]:
In [4]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName  # custom attribute attached when the file is read
    df = df.dropna(axis='columns')  # drop columns with NaN values
    df = df[[col for col in df if df[col].nunique() > 1]]  # keep columns with more than 1 unique value
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum=1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()
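Only the correlation-matrix helper is visible above, but a per-column distribution helper is called later (In [8]). Here is a minimal sketch of that function, with its signature inferred from the call plotPerColumnDistribution(df1, 10, 5) and its behavior assumed from the usual auto-generated kernels:

# Distribution of each column: histogram for numeric columns,
# bar chart of value counts for categorical ones.
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    # keep columns with a readable number of distinct values
    df = df[[col for col in df if 1 < nunique[col] < 50]]
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow
    plt.figure(num=None, figsize=(6 * nGraphPerRow, 8 * nGraphRow), dpi=80)
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if not np.issubdtype(columnDf.dtype, np.number):
            columnDf.value_counts().plot.bar()  # categorical: bar chart of counts
        else:
            columnDf.hist()                     # numeric: histogram
        plt.ylabel('counts')
        plt.xticks(rotation=90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad=1.0, w_pad=1.0, h_pad=1.0)
    plt.show()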
In [5]:
In [6]:
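The cell that produced the traceback below presumably reads spam.csv. A reconstruction of the standard auto-generated pattern (assumed; the cell body was not preserved):

nRowsRead = 1000  # specify None to read the whole file
df1 = pd.read_csv('../input/spam.csv', delimiter=',', nrows=nRowsRead)
df1.dataframeName = 'spam.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')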
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683     )
    684
--> 685     return _read(filepath_or_buffer, kwds)
    686
    687 parser_f.__name__ = name

/opt/conda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    461
    462     try:
--> 463         data = parser.read(nrows)
    464     finally:
    465         parser.close()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()
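The frames bottom out in pandas._libs.parsers._string_box_utf8, which means the parser failed while decoding the file's bytes as UTF-8. The spam.csv shipped with this dataset is commonly Latin-1 (ISO-8859-1) encoded rather than UTF-8, so a likely fix (an assumption about this particular file, but the usual one for this corpus) is to pass an explicit encoding:

df1 = pd.read_csv('../input/spam.csv', delimiter=',', nrows=nRowsRead,
                  encoding='latin-1')  # the file is not valid UTF-8
df1 = df1[['v1', 'v2']]  # keep label/text; the file often carries empty extra columns
df1.dataframeName = 'spam.csv'
print(f'There are {df1.shape[0]} rows and {df1.shape[1]} columns')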
In [7]:
df1.head(5)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-e55bb665ba13> in <module>
----> 1 df1.head(5)

NameError: name 'df1' is not defined
In [8]:
plotPerColumnDistribution(df1, 10, 5)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-a0a199b2d778> in <module>
----> 1 plotPerColumnDistribution(df1, 10, 5)

NameError: name 'df1' is not defined

Both NameErrors follow from the failed read above: df1 was never created, so these cells cannot run until the encoding fix is applied.
Conclusion:
This concludes the starter exploratory analysis. From here, the natural next steps are to apply
the encoding fix so spam.csv loads cleanly, re-run the per-column distribution and correlation
plots, and proceed to the preprocessing and classification stages outlined at the start of this
section.