22mca341 - Data Science
• The essential idea behind these three topics is that we use data
in order to come up with the best model possible.
Math
• Essentially, we will use math in order to
formalize relationships between variables.
• Data mining is the part of data science where we try to find relationships
between variables.
Essential steps to perform data science
1. Asking an interesting question
2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and Visualizing the results
5 Steps of Data Science
[Figure: the five steps above, shown as a simple pictograph]
Types of Data
• Structured versus Unstructured data
(Observations and Characteristics)
• Examples:
– Most data that exists in text form, including server logs and
Facebook posts, is unstructured
– Scientific observations, as recorded by careful scientists,
are kept in a very neat and organized (structured) format
Structured Vs. Unstructured Data:
• Structured data is generally thought of as being much easier to work with and
analyze.
• Most statistical and machine learning models were built with structured data
in mind and cannot work on the loose interpretation of unstructured data.
• The natural row and column structure is easy to digest for human and
machine eyes.
• Most estimates place unstructured data as 80-90% of the world's data.
• This data exists in many forms and for the most part, goes unnoticed by
humans as a potential source of data.
• Tweets, e-mails, literature, and server logs are generally unstructured forms
of data.
• So, with most of our data existing in this free-form format, we must turn to
pre-analysis techniques, called pre-processing, in order to apply structure to
at least a part of the data for further analysis.
Quantitative versus qualitative data
Data analysis (DA) approaches
1. Classical
2. Exploratory (EDA)
3. Bayesian
DA approaches in detail
1. For classical analysis, the sequence is:
   Problem => Data => Model => Analysis => Conclusions
2. For EDA, the sequence is:
   Problem => Data => Analysis => Model => Conclusions
3. For Bayesian, the sequence is:
   Problem => Data => Model => Prior Distribution => Analysis => Conclusions
EDA GOALS
• The primary goal of EDA is to maximize the
analyst's insight into a data set and into the
underlying structure of a data set, while providing
all of the specific items that an analyst would
want to extract from a data set, such as:
1. a good-fitting, parsimonious model
2. a list of outliers
3. a sense of robustness of conclusions
4. estimates for parameters
5. uncertainties for those estimates
6. a ranked list of important factors
7. conclusions as to whether individual factors
are statistically significant
8. optimal settings.
Data Preprocessing
• Word / phrase count
• Existence of certain special characters
• Relative length of text
• Picking out topics
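As a minimal sketch of extracting these four features from a tweet (the example tweet, the 280-character limit used for relative length, and the hashtags-as-topics heuristic are all illustrative assumptions):

```python
import re

tweet = "Loving #DataScience!!! Check http://example.com :)"  # hypothetical tweet

features = {
    # Word / phrase count: tokens separated by whitespace
    "word_count": len(tweet.split()),
    # Existence of special characters: anything that is not a word char or space
    "has_special_chars": bool(re.search(r"[^\w\s]", tweet)),
    # Relative length of text, as a fraction of Twitter's 280-character limit
    "relative_length": len(tweet) / 280,
    # Picking out topics: hashtags are a crude topic signal
    "topics": re.findall(r"#(\w+)", tweet),
}
print(features)
```

Each feature turns a piece of unstructured text into a structured value (a number, a Boolean, or a small list) that a model can consume.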
Tweet example: word/phrase counts for "You were born with wings"

  Word:   You | were | born | with | wings
  Count:    1 |    1 |    1 |    1 |     1
Word/Phrase Counts in Python
• Approach 1 − Using the split() function: split() breaks the string into a list, with whitespace as the default delimiter.
• Approach 2 − Using the regex module: here the findall() function is used to count the words in the sentence by matching a word pattern.
• Approach 3 − Using the sum() + strip() + split() functions together.
PYTHON – WORD COUNT
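The three approaches above can be sketched as follows, using the slide's "wings" sentence as input:

```python
import re

sentence = "You were born with wings"

# Approach 1: split() breaks the string on whitespace
words_split = sentence.split()
print(len(words_split))  # 5

# Approach 2: regex findall() matches each run of word characters
words_regex = re.findall(r"\w+", sentence)
print(len(words_regex))  # 5

# Approach 3: sum() + strip() + split() -- add one per non-empty token
word_total = sum(1 for token in sentence.strip().split() if token)
print(word_total)  # 5
```

All three agree on simple sentences; the regex approach differs once punctuation is attached to words, which is why it is often preferred for raw text.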
Special Characters - presence
PYTHON – SPECIAL CHAR’S
# Python program to count alphabets, digits, and special characters in a string
string = input("Please Enter your Own String : ")
alphabets = digits = special = 0
for ch in string:
    if ch.isalpha():
        alphabets += 1
    elif ch.isdigit():
        digits += 1
    else:
        special += 1
print("Alphabets:", alphabets)
print("Digits:", digits)
print("Special characters:", special)
Classify as ordinal or nominal
• The origin of the beans in your cup of coffee
• The place someone finishes after completing a foot race
• The metal used to make the medal they receive after placing in that race
• The telephone number of a client
• The number of cups of coffee you drink in a day
Interval level data
• The basic difference between the ordinal and interval levels is just that: difference.
• Data at the interval level allows meaningful subtraction of data points.
Interval example
• Temperature is a great example (and the one that comes to mind immediately).
Measures of variation
• In data science, it is important to describe how the data is "spread out"; this is what measures of variation tell us.
SD (Standard Deviation)
• Data points: x1, x2, x3, ..., xn
• Mean: x̄ = (x1 + x2 + ... + xn) / n
• Working table columns: x | x − x̄ | (x − x̄)²
• Variance = Σ(x − x̄)² / n
• SD = sqrt(Variance)
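A minimal sketch of these formulas on a small hypothetical dataset:

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

n = len(data)
mean = sum(data) / n                                 # x̄ = (x1 + ... + xn) / n
variance = sum((x - mean) ** 2 for x in data) / n    # Σ(x − x̄)² / n
sd = math.sqrt(variance)                             # SD = sqrt(variance)

print(mean, variance, sd)  # 5.0 4.0 2.0
```

Note this is the population formula (dividing by n); the sample formula divides by n − 1 instead.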
IQR
• Inter-Quartile Range:
• The interquartile range (IQR) measures the
spread of the middle half of your data.
• It is the range for the middle 50% of
the sample.
IQR
• Inter-Quartile Range example (12 values):
  1, 2, 3, 6, 7, 8, 9, 11, 14, 15, 18, 20
• Median = (8 + 9) / 2 = 8.5
• Q1 = 3, Q3 = 15 (taking the 3rd value from each end of the sorted data)
• IQR = Q3 − Q1 = 15 − 3 = 12
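A sketch matching the slide's arithmetic, assuming its simple convention of taking the 3rd value from each end of the sorted data as Q1 and Q3 (other quartile conventions, such as interpolated percentiles, give slightly different values):

```python
data = sorted([1, 2, 3, 6, 7, 8, 9, 11, 14, 15, 18, 20])
n = len(data)  # 12 values

# Median of an even-length list: average of the two middle values
median = (data[n // 2 - 1] + data[n // 2]) / 2   # (8 + 9) / 2 = 8.5

# Slide's convention: Q1 is the (n/4)-th smallest, Q3 the (n/4)-th largest
q1 = data[n // 4 - 1]   # 3rd smallest -> 3
q3 = data[n - n // 4]   # 3rd largest  -> 15
iqr = q3 - q1           # 15 - 3 = 12

print(median, q1, q3, iqr)
```

Library functions such as numpy.percentile default to interpolation, so expect small differences from hand calculations like this one.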
CV
• Coefficient of Variation
• CV = SD / Mean
• where SD = sqrt(Σ(x − x̄)² / n)
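A small sketch (with a hypothetical dataset) showing why the CV is useful: because it divides the SD by the mean, it is unitless, so rescaling the data leaves it unchanged:

```python
import math

def coeff_var(data):
    """Coefficient of variation: population SD divided by the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sd / mean

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical sample
scaled = [10 * x for x in data]     # same shape, different units

print(coeff_var(data))    # 0.4
print(coeff_var(scaled))  # 0.4 -- scaling every value leaves the CV unchanged
```

This makes the CV handy for comparing spread across datasets measured on different scales.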
Python Prebuilt Modules
• pandas
• scikit-learn
• seaborn
• numpy/scipy
• requests (to mine data from the Web)
• BeautifulSoup (for parsing HTML from the Web)
Python Prebuilt Modules
• pandas: a library used for data analysis. pandas is the most widely used tool for data munging, and is commonly applied to financial time-series and economics data.
• scikit-learn: a useful library for machine learning in Python, with many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
• seaborn: a data visualization library that uses matplotlib underneath to plot graphs. It provides a high-level interface for drawing attractive and informative statistical graphics.
• numpy/scipy: NumPy and SciPy are Python libraries used for mathematical and numerical analysis. NumPy provides the array data structure and basic operations such as sorting and indexing, whereas SciPy builds on it with higher-level numerical routines.
• requests: allows you to send HTTP/1.1 requests using Python. With it, you can add content like headers, form data, multipart files, and parameters via simple Python data structures, and access the response data the same way.
• BeautifulSoup: for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, commonly saving programmers hours or days of work.
Basic questions for data exploration
• When looking at a new dataset, whether it is
familiar to you or not, it is important to use the
following questions as guidelines for your
preliminary analysis:
1. Is the data organized or not?
2. What does each row represent?
3. What does each column represent?
4. Are there any missing data points?
5. Do we need to perform any transformations on
the columns?
Case Study: Dataset 1 – Yelp
• The first dataset we will look at is a public
dataset made available by the restaurant
review site, Yelp. All personally identifiable
information has been removed.
• Let's read in the data first, as shown here:

import pandas as pd
yelp_raw_data = pd.read_csv("yelp.csv")
yelp_raw_data.head()
Explanation of steps
• Import the pandas package and nickname it as pd.
• Read in the .csv file as a DataFrame.
• Look at the head of the data (just the first few records).
Table of values
[Table: sample rows of the Yelp data — columns include business_id, date, review_id, stars, and text]
Yelp Dataset
• Is the data organized or not?
• What does each row represent?
• What does each column represent?
• Are there any missing data points?
• Do we need to perform any transformations
on the columns?
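These questions map directly onto a few pandas calls. As a self-contained sketch, a small hypothetical stand-in frame is built here; the real yelp.csv loaded above would be inspected the same way:

```python
import pandas as pd

# Hypothetical stand-in for the Yelp data, so the sketch runs on its own
yelp_raw_data = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "stars": [4, 5, None],
    "text": ["Great food", "Loved it", "Too noisy"],
})

print(yelp_raw_data.shape)           # rows x columns: each row is one review
print(yelp_raw_data.dtypes)          # quantitative vs qualitative columns
print(yelp_raw_data.isnull().sum())  # missing data points, per column
```

shape answers "what does each row/column represent" at a glance, dtypes hints at quantitative versus qualitative columns, and isnull().sum() surfaces missing data before any modeling.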
Case Study:
• Dataset 2 – Titanic Data
• Apply the following operations:
– Filtering operations
– Handling of ordinal, nominal/categorical variables
and applying mathematical, statistical functions
– Changing the data type of a column
– Handling of missing values
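The listed operations can be sketched with pandas on a tiny hypothetical Titanic frame (the real dataset would come from a CSV; the column names here follow the common Titanic schema but are assumptions):

```python
import pandas as pd

# Hypothetical mini Titanic frame standing in for pd.read_csv("titanic.csv")
titanic = pd.DataFrame({
    "Pclass": [3, 1, 2],                  # ordinal: passenger class
    "Sex": ["male", "female", "female"],  # nominal / categorical
    "Age": [22.0, None, 26.0],            # has a missing value
    "Survived": [0, 1, 1],
})

# Filtering operation: keep only the survivors
survivors = titanic[titanic["Survived"] == 1]

# Handle the nominal column by encoding it numerically
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

# Change the data type of a column (ordinal -> categorical)
titanic["Pclass"] = titanic["Pclass"].astype("category")

# Handle missing values: fill the missing Age with the column mean
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].mean())

print(survivors.shape, titanic["Age"].tolist())
```

Mean-filling is only one missing-value strategy; dropping rows or using the median are equally common choices depending on the column's level of data.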
Summary
Whenever you are faced with a new dataset, the first three
questions you should ask about it are the following:
• Is the data organized or unorganized?
– For example, does our data exist in a nice, clean row/column
structure?
• Is each column quantitative or qualitative?
– For example, are the values numbers, strings, or do they represent
quantities?
• At what level of data is each column?
– For example, are the values at the nominal, ordinal, interval, or ratio
level?