
22MCA341 - DATA SCIENCE

Introduction to Data Science & Types of Data


Data Science – Overview, Terminologies used, Steps and Life Cycle, Applications. Structured versus Unstructured Data, Quantitative versus Qualitative Data, Basics of Data Exploration and Data Pre-Processing – Examples, Levels of Data with Mathematical Operations, Other Measures on All Levels of Data. Python Programming for Data Science – Prebuilt Python Modules.
INTRODUCTION
A. 19TH CENTURY – INDUSTRY AGE
B. 20TH CENTURY – INFORMATION AGE
C. 21ST CENTURY – DATA AGE
BASIC TERMINOLOGY
“Data", we refer to a collection of information in
either an organized or unorganized format:
FORMAT -1
• Organized data: This refers to data that is
sorted into a row/column structure, where
every row represents a single observation, and
the columns represent the characteristics of
that observation.
FORMAT -2
• Unorganized data: This is the type of data that
is in the free form, usually text or raw
audio/signals that must be parsed further to
become organized.
Epitomize
• Whenever you open Excel (or any other
spreadsheet program), you are looking at a
blank row/column structure waiting for
organized data. These programs don't do well
with unorganized data.
Epitomize
• For the most part, we will deal with organized
data as it is the easiest to glean insight from,
but we will not shy away from looking at raw
text and methods of processing unorganized
forms of data.
What is Data Science?

Data science is the art and science of acquiring knowledge through data.
• Data science is about using data in order to gain new insights.
Data Science
• Data science is all about how we take data,
use it to acquire knowledge, and then use
that knowledge to do the following:
– Make decisions
– Predict the future
– Understand the past/present
– Create new industries/products
• Main Objective: To understand the methods of data science,
including how to process data, gather insights, and use those
insights to make informed decisions and predictions.

Why data science?


– Data is collected in various forms and from different sources, and often arrives very unorganized.
– Data can be missing, incomplete, or just flat-out wrong.
– Often, we have data on very different scales, and that makes it tough to compare.
Eg: Pricing used cars

• One of the main goals of data science is to make explicit practices and procedures to discover and apply these relationships in the data.
Data Science Venn diagram

Data Science is the intersection of the three key areas. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses' place in the domain we are in.

Hacking skills: to conceptualize and program complicated algorithms using computer languages.
Math & Statistics Knowledge: to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations.
Substantive Expertise (domain expertise): to apply concepts and results in a meaningful and effective way.
Data Science Venn diagram

Those with hacking skills can conceptualize and program complicated algorithms using computer languages.

Having a Math & Statistics Knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations.

Having Substantive Expertise (domain expertise) allows you to apply concepts and results in a meaningful and effective way.
The Data Science Venn Diagram
• Math/statistics: This is the use of equations
and formulas to perform analysis
• Computer programming: This is the ability to
use code to create outcomes on the computer
• Domain knowledge: This refers to
understanding the problem domain
(medicine, finance, social science, and so on)
Data Model
• A data model refers to an organized and
formal relationship between elements of
data, usually meant to simulate a real-world
phenomenon.

• The essential idea behind these three topics is that we use data
in order to come up with the best model possible.
Math
• Essentially, we will use math in order to formalize relationships between variables.
• There are many types of data models, including probabilistic and statistical models.
• Both of these are subsets of a larger paradigm, called Machine Learning (ML).
Computer Programming
• Python is an extremely simple language to read and write, even if you've never coded before.
• It is one of the most common languages, both in production and in the academic setting (one of the fastest growing).
WHY PYTHON?
• The language's online community is vast and friendly. This means that a quick Google search should yield multiple results of people who have faced and solved similar (if not exactly the same) situations.
• Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.
Python Practices
A QUICK REVIEW
Basic Logical Operators
• For these operators, keep the boolean data
type in mind.
• Every operator will evaluate to
either True or False.
Logical Operators
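A minimal sketch (the variable x and its value are made up for illustration); every expression evaluates to a boolean:

x = 7

print(x == 7)            # equality check: True
print(x > 10)            # comparison: False
print(x > 2 and x < 10)  # and: True only if both sides are True
print(x < 2 or x > 5)    # or: True if at least one side is True
print(not x == 7)        # not: negates the result, so False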
Example – Parsing a single tweet
tweet = "RT @j_o_n_dnger: $TWTR now top holding for
Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ') # list of words in tweet

for word in words_in_tweet: # for each word in list


if "$" in word: # if word has a "cashtag"
print "THIS TWEET IS ABOUT", word
# alert user
The words_in_tweet variable tokenizes the tweet into:
['RT',
 '@j_o_n_dnger:',
 '$TWTR',
 'now',
 'top',
 'holding',
 'for',
 'Andor,',
 'unseating',
 '$AAPL']
Output:
THIS TWEET IS ABOUT $TWTR
THIS TWEET IS ABOUT $AAPL
Some more terminologies
• Machine Learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer.
• Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and the creation of powerful data models.

• Types of data models:
• Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness.
• Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula.
Some more terminologies

• Exploratory data analysis (EDA) refers to preparing data in order to standardize results and gain quick insights.
• EDA is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points.
• During EDA, we will create many types of plots and use these plots to identify key features and relationships to exploit in our data models.
• Data mining is the process of finding relationships between elements of data.
• Data mining is the part of data science where we try to find relationships between variables.
Essential steps to perform data science
1. Asking an interesting question
2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and Visualizing the results
5 STEPS OF DATA SCIENCE
[Figure: the five steps of data science, shown as a simple pictograph]
Types of Data
• Structured versus Unstructured data
• Quantitative versus Qualitative data

• Structured (organized) data: This is data that can be thought of as observations and characteristics. It is usually organized using a table method (rows and columns).

• Unstructured (unorganized) data: This data exists as a free entity and does not follow any standard organization hierarchy.

• Examples:
– Most data that exists in text form, including server logs and
Facebook posts, is unstructured
– Scientific observations, as recorded by careful scientists,
are kept in a very neat and organized (structured) format
Structured Vs. Unstructured Data:
• Structured data is generally thought of as being much easier to work with and
analyze.
• Most statistical and machine learning models were built with structured data
in mind and cannot work on the loose interpretation of unstructured data.
• The natural row and column structure is easy to digest for human and
machine eyes.
• Most estimates place unstructured data as 80-90% of the world's data.
• This data exists in many forms and for the most part, goes unnoticed by
humans as a potential source of data.

• Tweets, e-mails, literature, and server logs are generally unstructured forms
of data.
• So, with most of our data existing in this free-form format, we must turn to
pre-analysis techniques, called pre-processing, in order to apply structure to
at least a part of the data for further analysis.
Quantitative versus qualitative data

• Quantitative data: This data can be described using numbers, and basic mathematical procedures, including addition, are possible on the set.

• Qualitative data: This data cannot be described using numbers and basic mathematics. This data is generally thought of as being described using "natural" categories and language.
Qualitative versus Quantitative Data
Observe and answer
• Data: COFFEE SHOP
– NAME OF COFFEE SHOP -Qualitative
– REVENUE (IN THOUSANDS OF RUPEES) -
Quantitative
– ZIP CODE – Qualitative
– AVERAGE MONTHLY CUSTOMERS – Quantitative
– COUNTRY OF COFFEE ORIGIN - Qualitative
Basics of data exploration
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:

1. maximize insight into a data set
2. uncover underlying structure
3. extract important variables
4. detect outliers and anomalies
5. test underlying assumptions
6. develop parsimonious models, and
7. determine optimal factor settings
Statistical graphics
• EDA is a philosophy as to how we dissect a data set; what we look for; how we look; and how we interpret.
• It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics.
EDA Techniques
• The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

1. Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
[Figures: examples of data traces, histograms and bihistograms, probability plots, lag plots, block plots, and Youden plots]
EDA Techniques
2. Plotting simple statistics, such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.

3. Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page.
Popular data analysis Approaches
• There are three approaches:

1. Classical
2. Exploratory (EDA)
3. Bayesian
DA approaches in detail
1. For classical analysis, the sequence is
Problem => Data => Model => Analysis =>
Conclusions
2. For EDA, the sequence is
Problem => Data => Analysis => Model =>
Conclusions
3. For Bayesian, the sequence is
Problem => Data => Model => Prior Distribution => Analysis => Conclusions
EDA GOALS
• The primary goal of EDA is to maximize the
analyst's insight into a data set and into the
underlying structure of a data set, while providing
all of the specific items that an analyst would
want to extract from a data set, such as:
1. a good-fitting, parsimonious model
2. a list of outliers
3. a sense of robustness of conclusions
4. estimates for parameters
5. uncertainties for those estimates
6. a ranked list of important factors
7. conclusions as to whether individual factors
are statistically significant
8. optimal settings.
Data Preprocessing
• Word / Phrase count
• Existence of certain special characters
• Relative length of text
• Picking out topics
Tweet
Word/Phrase Counts
• Sentence: "You were born with wings"

Word:        You   were   born   with   wings
Word count:    1      1      1      1       1
Word/Phrase Counts
PYTHON
• Approach 1 − Using the split() function. split() breaks the string into an iterable list, with space as the delimiter.
• Approach 2 − Using the regex module. Here the findall() function is used to count the number of words in the sentence.
• Approach 3 − Using the sum() + strip() + split() functions (see the sketch below).
PYTHON – WORD COUNT
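A minimal sketch of the three approaches above, using a made-up sentence; each line prints 5:

import re

sentence = "You were born with wings"

# Approach 1: split() breaks the string into a list on spaces
print(len(sentence.split(' ')))                     # 5

# Approach 2: re.findall() returns every word token matched by \w+
print(len(re.findall(r'\w+', sentence)))            # 5

# Approach 3: sum() + strip() + split()
print(sum(1 for word in sentence.strip().split()))  # 5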
Special Characters - presence
PYTHON – SPECIAL CHARACTERS

# Python program to count alphabets, digits and special characters in a string
string = input("Please Enter your Own String : ")
alphabets = digits = special = 0

for i in range(len(string)):
    if string[i].isalpha():
        alphabets = alphabets + 1   # letter
    elif string[i].isdigit():
        digits = digits + 1         # digit
    else:
        special = special + 1       # anything else counts as special
        print(string[i])            # show each special character found

print("\nTotal Number of Alphabets in this String : ", alphabets)
print("Total Number of Digits in this String : ", digits)
print("Total Number of Special Characters in this String : ", special)
FOUR LEVELS OF DATA
• NOMINAL
• ORDINAL
• INTERVAL
• RATIO
Nominal
• Nominal data is a group of non-parametric variables.
– Purely a name or category
– Gender, nationality, species
• A part of speech is also considered on the nominal level of data.
• We cannot do any arithmetic on nominal data.
Nominal Data
Examples:
– A type of animal is on the nominal level of
data. We may also say that if it is a
chimpanzee, then it belongs to the mammalian
class as well.
– A part of speech is also considered on the
nominal level of data. The word she is a
pronoun, and it is also a noun.
– Of course, being qualitative, we cannot
perform any quantitative mathematical
operations, such as addition or division.
Math operations allowed on nominal

• Basic equality and membership functions
– Being a tech entrepreneur implies being in the tech industry, but not vice versa.
– A figure described as a square falls under the description of being a rectangle, but not vice versa.
Measure of center
• A measure of center is the balance point of the data.
• Common examples are the mean, median and mode.
• In order to find the center of nominal data, we generally use the MODE (the most common element).
• Example: the most common continent surveyed for an experiment would be the natural choice of center.
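As a minimal sketch (the continent list is made up for illustration), the mode of nominal data can be found with Python's collections.Counter:

from collections import Counter

# made-up nominal data: continents surveyed for an experiment
continents = ["Asia", "Europe", "Asia", "Africa", "Asia", "Europe"]

# most_common(1) returns the single most frequent element and its count
mode, count = Counter(continents).most_common(1)[0]
print(mode, count)  # Asia 3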
Ordinal
• Ordinal data is a group of non-parametric ordered variables.
• Provides a rank order.
• The order is meaningful: one observation comes before another.
– Example: rate your satisfaction on a scale from 1 to 5 or 1 to 10
• Math operations: ordering, comparison
Measures of center in Ordinal
• The median is usually an appropriate way of defining the center of the data.
• The mean is not possible for ordinal data, since addition and division are not defined at this level.
Example
• Imagine you have conducted a survey among
employees asking
– “How happy are you to be working here?”
– On a scale from 1-5.
– Results:
– 5,4,3,4,5,3,2,5,3,1,4, 5,4,3,4,5,3,2,5,3,1,4,
5,4,3,4,5,3,2,5,3,1,4, 5,4,3,4,5,3,2,5,3,1,4
Python Implementation
import numpy
R = [5,4,3,4,5,3,2,5,3,1,4, 5,4,3,4,5,3,2,5,3,1,4,
     5,4,3,4,5,3,2,5,3,1,4, 5,4,3,4,5,3,2,5,3,1,4]
Sort_R = sorted(R)   # sorted R values are stored in Sort_R
print(Sort_R)
print(numpy.median(Sort_R))

Result: 4.0
Classify ordinal or nominal
• The origin of the beans in your cup of coffee
• The place someone receives after completing
a foot race
• The metal used to make the medal that they
receive after placing in the said race.
• The telephone number of a client
• How many cups of coffee do you drink in a day?
Interval level data
• The basic difference between the ordinal and interval levels is just that: difference.
• Data at the interval level allows meaningful subtraction of data points.
Interval example
• Temperature is a great example (the one which comes to mind immediately).
• If it is 100 degrees Fahrenheit in city-1 and 80 degrees Fahrenheit in city-2, then city-1 is 20 degrees warmer than city-2.
Math operations
• Ordering and comparison
• Addition
• Subtraction
Measures of Center
• Median, mean and mode.
• Suppose we look at the temperatures of the fridge containing a pharmaceutical company's new COVID-19 vaccine:
31,32,32,31,35,38,39,31,32,31,35,38, 38,39,31,32,31,35,38, 38,39,31,32,31,35,38,38,39,31,32,31,35,38
Python Illustration to find mean and
median
import numpy
Temps = [31,32,32,31,35,38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,38,39,31,32,31,35,38]
print(numpy.mean(Temps))
print(numpy.median(Temps))

Result: 34.45454545454545 (mean), 35.0 (median)
Measures of variation
• In data science, it is important to mention how the data is "spread out"; that is given by variation.
• A measure of variation is a number that attempts to describe how spread out the data is.
Key understanding
• Along with a measure of center, a measure of
variation can almost describe an entire
dataset with only TWO numbers.
STANDARD DEVIATION (SD)
• The most common measure of variation for data at the interval level and beyond.
• STANDARD DEVIATION: the average distance a data point is from the mean.
SD PROCEDURE
• STEP-1: Find the mean of the data.
• STEP-2: For each number in the dataset, subtract the mean from it and then square the result.
• STEP-3: Find the average of the squared differences.
• STEP-4: Take the square root of the number obtained in step three. This is the standard deviation.
SD in Python Implementation
import numpy
Temps = [31,32,32,31,35,38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,38,39,31,32,31,35,38]
Mean = numpy.mean(Temps)            # step 1: the mean
Sq_diffs = []
for t in Temps:
    Diff = t - Mean                 # step 2: difference from the mean...
    Sq_diff = Diff ** 2             # ...squared
    Sq_diffs.append(Sq_diff)
Avg_sq_diff = numpy.mean(Sq_diffs)  # step 3: average of the squared differences
SD = numpy.sqrt(Avg_sq_diff)        # step 4: square root
print(SD)
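As a sanity check, numpy's prebuilt std() function should give the same answer, since by default it also divides by n (the population standard deviation), matching the four-step procedure above:

import numpy

Temps = [31,32,32,31,35,38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,38,39,31,32,31,35,38]

# numpy.std defaults to ddof=0, i.e. it divides by n
print(numpy.std(Temps))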
Ratio
• Ratio data is defined as quantitative data, having the same properties as interval data, with an equal and definitive ratio between each data point and an absolute "zero" treated as a point of origin.
• In other words, there can be no negative numerical value in ratio data.
Ratio Examples
• Consider the variable age.
• Age is frequently collected as ratio data, but
can also be collected as ordinal data.
• This happens on surveys when they ask,
“What age group do you fall in?” There, you
wouldn't have data on your respondent's
individual ages – you'd only know how many
were between 18-24, 25-34, etc.
More examples

• A ratio variable has all the properties of an interval variable, and also has a clear definition of 0.0.
• Examples of ratio variables include: dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does mean "no heat"), and survival time.
Measures of center
• The arithmetic mean still holds meaning at this level.
• So does the geometric mean. (In statistics, the geometric mean is calculated by raising the product of a series of numbers to the power of the inverse of the series' length.)
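A minimal sketch of the geometric mean in Python, reusing the six numbers from the mean example below (math.prod requires Python 3.8+):

import math

values = [24, 56, 67, 84, 90, 12]

# geometric mean: the product of the series raised to the power 1/n
geo_mean = math.prod(values) ** (1 / len(values))
print(geo_mean)  # roughly 44.9, versus the arithmetic mean of 55.5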
Mean
• 24,56,67,84,90,12
• Total = 333
• Mean = total /6 = 333/6 = 55.5
Median
• 3,4,9,1,7,10,5,6,8,12
• Step-1: Sort the values: 1,3,4,5,6,7,8,9,10,12
• Step-2: Average the two middle values: (6+7)/2 = 13/2 = 6.5
Range
• 3,4,9,1,7,10,5,6,8,12
• Range = Max – Min = 12 – 1 = 11
Mode
• 3,4,9,1,7,10,5,6,8,12,5,6,7,5,8,3,2,5
• Mode: 5 (it appears four times, more than any other value)
SD
• Data: x1, x2, x3, …, xn
• Mean: xm = (x1 + x2 + … + xn) / n
• Work table, one row per value: x | (x − xm) | (x − xm)²
• Variance = Σ(x − xm)² / n
• SD = sqrt(Variance)
IQR
• Inter-Quartile Range:
• The interquartile range (IQR) measures the
spread of the middle half of your data.
• It is the range for the middle 50% of
the sample.
IQR
• Inter-Quartile Range:
• Data: (1, 2, 3, 6, 7) 8, 9 (11, 14, 15, 18, 20) (12 values)
• Median = (8+9)/2 = 8.5
• Q1 = 3, Q3 = 15 (the medians of the lower and upper five values)
• IQR = Q3 − Q1 = 15 − 3 = 12
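numpy can compute the quartiles directly; note that its default linear-interpolation convention differs from the hand method above, so the numbers will not match exactly:

import numpy

data = [1, 2, 3, 6, 7, 8, 9, 11, 14, 15, 18, 20]

q1 = numpy.percentile(data, 25)  # 5.25 under numpy's interpolation
q3 = numpy.percentile(data, 75)  # 14.25
print(q3 - q1)                   # 9.0, versus 12 from the hand method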
CV
• COEFFICIENT OF VARIATION

• CV = SD / MEAN

• SD = SQRT(Σ(X − XM)² / N)
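A minimal sketch, reusing the vaccine-fridge temperatures from earlier; because CV divides two quantities in the same units, the result is unit-free:

import numpy

Temps = [31,32,32,31,35,38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,38,39,31,32,31,35,38]

cv = numpy.std(Temps) / numpy.mean(Temps)  # CV = SD / MEAN
print(cv)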
Python Prebuilt Modules
• pandas
• scikit-learn
• seaborn
• numpy/scipy
• requests (to mine data from the Web)
• BeautifulSoup (for parsing HTML from the Web)
Python Prebuilt Modules
• Pandas: a library used for data analysis. Pandas is the most widely used tool for data munging, and is commonly applied to financial time series/economics data.
• scikit-learn: a useful library for machine learning in Python, with a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
• Seaborn: a data visualization library that uses matplotlib underneath to plot graphs. It provides a high-level interface for drawing attractive and informative statistical graphics.
• NumPy/SciPy: Python libraries used for mathematical and numerical analysis. NumPy contains array data and basic operations such as sorting and indexing, whereas SciPy consists of all the numerical code.
• Requests: allows you to send HTTP/1.1 requests using Python. With it, you can add content like headers, form data, multipart files, and parameters via simple Python libraries. It also allows you to access the response data in the same way.
• BeautifulSoup: a library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Basic questions for data exploration
• When looking at a new dataset, whether it is
familiar to you or not, it is important to use the
following questions as guidelines for your
preliminary analysis:
1. Is the data organized or not?
2. What does each row represent?
3. What does each column represent?
4. Are there any missing data points?
5. Do we need to perform any transformations on
the columns?
Case Study: Dataset 1 – Yelp
• The first dataset we will look at is a public dataset made available by the restaurant review site, Yelp. All personally identifiable information has been removed.
• Let's read in the data first, as shown here:

import pandas as pd
yelp_raw_data = pd.read_csv("yelp.csv")
yelp_raw_data.head()
Explanation of steps
• Import the pandas package and nickname it as pd.
• Read in the .csv file; call it yelp_raw_data.
• Look at the head of the data (just the first few records).
Table of values (column headers):
business_id | date | review_id | stars | text
Yelp Dataset
• Is the data organized or not?
• What does each row represent?
• What does each column represent?
• Are there any missing data points?
• Do we need to perform any transformations
on the columns?
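A minimal sketch of pandas calls that help answer these questions, reusing yelp_raw_data from the snippet above:

import pandas as pd

yelp_raw_data = pd.read_csv("yelp.csv")

print(yelp_raw_data.shape)           # rows (observations) and columns (characteristics)
print(yelp_raw_data.dtypes)          # each column's type: a hint at quantitative vs qualitative
print(yelp_raw_data.isnull().sum())  # missing data points per column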
Case Study:
• Dataset 2 – Titanic Data
• Apply the following operations:
– Filtering operations
– Handling of ordinal, nominal/categorical variables
and applying mathematical, statistical functions
– Changing the data type of a column
– Handling of missing values
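A minimal sketch of these operations; the column names (Sex, Embarked, Age, Pclass) follow the common Kaggle Titanic CSV and, like the filename, are assumptions here:

import pandas as pd

titanic = pd.read_csv("titanic.csv")  # hypothetical filename

# Filtering: keep only the female passengers
females = titanic[titanic['Sex'] == 'female']

# Nominal/categorical handling: frequency counts per category
print(titanic['Embarked'].value_counts())

# Mathematical/statistical functions on a quantitative column
print(titanic['Age'].mean(), titanic['Age'].median())

# Changing the data type of a column
titanic['Pclass'] = titanic['Pclass'].astype(str)

# Handling missing values: fill missing ages with the mean age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())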
Summary
Whenever you are faced with a new dataset, the first three
questions you should ask about it are the following:
• Is the data organized or unorganized?
– For example, does our data exist in a nice, clean row/column
structure?
• Is each column quantitative or qualitative?
– For example, are the values numbers, strings, or do they represent
quantities?
• At what level of data is each column?
– For example, are the values at the nominal, ordinal, interval, or ratio
level?
