
22MCA341 - DATA SCIENCE

Introduction to Data Science & Types of Data


Data Science – Overview, Terminologies used, Steps and Life Cycle, Applications. Structured versus Unstructured Data, Quantitative versus Qualitative Data, Basics of Data Exploration and Data Pre-Processing – Examples, Levels of Data with Mathematical Operations, Other Measures on All Levels of Data. Python Programming for Data Science – Prebuilt Python Modules.
INTRODUCTION
A. 19TH CENTURY – INDUSTRY AGE
B. 20TH CENTURY – INFORMATION AGE
C. 21ST CENTURY – DATA AGE
BASIC TERMINOLOGY
“Data", we refer to a collection of information in
either an organized or unorganized format:
FORMAT -1
• Organized data: This refers to data that is
sorted into a row/column structure, where
every row represents a single observation, and
the columns represent the characteristics of
that observation.
FORMAT -2
• Unorganized data: This is the type of data that
is in the free form, usually text or raw
audio/signals that must be parsed further to
become organized.
Epitomize
• Whenever you open Excel (or any other
spreadsheet program), you are looking at a
blank row/column structure waiting for
organized data. These programs don't do well
with unorganized data.
Epitomize
• For the most part, we will deal with organized
data as it is the easiest to glean insight from,
but we will not shy away from looking at raw
text and methods of processing unorganized
forms of data.
What is Data Science?

Data science is the art and science of acquiring knowledge through data.
• Data science is about using data in order to gain new insights.
Data Science
• Data science is all about how we take data,
use it to acquire knowledge, and then use
that knowledge to do the following:
– Make decisions
– Predict the future
– Understand the past/present
– Create new industries/products
• Main Objective: To understand the methods of data science,
including how to process data, gather insights, and use those
insights to make informed decisions and predictions.

Why data science?


– Data is collected in various forms and from different sources, and often arrives very unorganized.
– Data can be missing, incomplete, or just flat-out wrong.
– Often, we have data on very different scales, and that makes it tough to compare.
Eg: Pricing used cars

• One of the main goals of data science is to make explicit practices and procedures to discover and apply these relationships in the data.
Data Science Venn diagram

Data Science is the intersection of the three key areas. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses' place in the domain we are in.

Hacking skills: to conceptualize and program complicated algorithms using computer languages.
Math & Statistics Knowledge: to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations.
Substantive Expertise (domain expertise): to apply concepts and results in a meaningful and effective way.
Data Science Venn diagram

Those with hacking skills can conceptualize and program complicated algorithms using computer languages.

Having a Math & Statistics Knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations.

Having Substantive Expertise (domain expertise) allows you to apply concepts and results in a meaningful and effective way.
The Data Science Venn Diagram
• Math/statistics: This is the use of equations
and formulas to perform analysis
• Computer programming: This is the ability to
use code to create outcomes on the computer
• Domain knowledge: This refers to
understanding the problem domain
(medicine, finance, social science, and so on)
Data Model
• A data model refers to an organized and
formal relationship between elements of
data, usually meant to simulate a real-world
phenomenon.

• The essential idea behind these three topics is that we use data
in order to come up with the best model possible.
Math
• Essentially, we will use math in order to formalize relationships between variables.
• There are many types of data models, including probabilistic and statistical models.
• Both of these are subsets of a larger paradigm, called Machine Learning (ML).
Computer Programming
• Python is an extremely simple language to read and write, even if you've never coded before.
• It is one of the most common languages, both in production and in the academic setting (one of the fastest growing).
WHY PYTHON?
• The language's online community is vast and friendly. This means that a quick Google search should yield multiple results of people who have faced and solved similar (if not exactly the same) situations.
• Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.
Python Practices
A QUICK REVIEW
Basic Logical Operators
• For these operators, keep the boolean data
type in mind.
• Every operator will evaluate to
either True or False.
Logical Operators
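A minimal sketch (the variable x and its value are made up for illustration); every expression evaluates to a boolean:

x = 7

print(x == 7)            # equality check: True
print(x > 10)            # comparison: False
print(x > 2 and x < 10)  # and: True only if both sides are True
print(x < 2 or x > 5)    # or: True if at least one side is True
print(not x == 7)        # not: negates the result, so False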
Example – Parsing a single tweet
tweet = "RT @j_o_n_dnger: $TWTR now top holding for
Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ') # list of words in tweet

for word in words_in_tweet: # for each word in list


if "$" in word: # if word has a "cashtag"
print "THIS TWEET IS ABOUT", word
# alert user
The words_in_tweet variable tokenizes the tweet into:
['RT',
 '@j_o_n_dnger:',
 '$TWTR',
 'now',
 'top',
 'holding',
 'for',
 'Andor,',
 'unseating',
 '$AAPL']
Output:
THIS TWEET IS ABOUT $TWTR
THIS TWEET IS ABOUT $AAPL
Some more terminologies
• Machine Learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer.
• Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and the creation of powerful data models.

• Types of data models:
• Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness.
• Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula.
Some more terminologies

• Exploratory data analysis (EDA) refers to preparing data in order to standardize results and gain quick insights.
• EDA is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points.
• During EDA, we will create many types of plots and use these plots to identify key features and relationships to exploit in our data models.
• Data mining is the process of finding relationships between elements of data.
• Data mining is the part of data science where we try to find relationships between variables.
Essential steps to perform data science
1. Asking an interesting question
2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and Visualizing the results
5 STEPS OF DATA SCIENCE
[Figure: the five steps of data science, shown as a simple pictograph]
Types of Data
• Structured versus Unstructured data
• Quantitative versus Qualitative data

• Structured (organized) data: This is data that can be thought of as observations and characteristics. It is usually organized using a table method (rows and columns).

• Unstructured (unorganized) data: This data exists as a free entity and does not follow any standard organization hierarchy.

• Examples:
– Most data that exists in text form, including server logs and
Facebook posts, is unstructured
– Scientific observations, as recorded by careful scientists,
are kept in a very neat and organized (structured) format
Structured Vs. Unstructured Data:
• Structured data is generally thought of as being much easier to work with and
analyze.
• Most statistical and machine learning models were built with structured data
in mind and cannot work on the loose interpretation of unstructured data.
• The natural row and column structure is easy to digest for human and
machine eyes.
• Most estimates place unstructured data as 80-90% of the world's data.
• This data exists in many forms and for the most part, goes unnoticed by
humans as a potential source of data.

• Tweets, e-mails, literature, and server logs are generally unstructured forms
of data.
• So, with most of our data existing in this free-form format, we must turn to
pre-analysis techniques, called pre-processing, in order to apply structure to
at least a part of the data for further analysis.
Quantitative versus qualitative data

• Quantitative data: This data can be described using numbers, and basic mathematical procedures, including addition, are possible on the set.

• Qualitative data: This data cannot be described using numbers and basic mathematics. This data is generally thought of as being described using "natural" categories and language.
Qualitative versus Quantitative Data
Observe and answer
• Data: COFFEE SHOP
– NAME OF COFFEE SHOP -Qualitative
– REVENUE (IN THOUSANDS OF RUPEES) -
Quantitative
– ZIP CODE – Qualitative
– AVERAGE MONTHLY CUSTOMERS – Quantitative
– COUNTRY OF COFFEE ORIGIN - Qualitative
Basics of data exploration
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:

1. maximize insight into a data set
2. uncover underlying structure
3. extract important variables
4. detect outliers and anomalies
5. test underlying assumptions
6. develop parsimonious models, and
7. determine optimal factor settings
Statistical graphics
• EDA is a philosophy as to how we dissect a data set; what we look for; how we look; and how we interpret.
• It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics.
EDA Techniques
• The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

1. Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
[Figures: examples of data traces, histograms and bihistograms, probability plots, lag plots, block plots, and Youden plots]
EDA Techniques
2. Plotting simple statistics, such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.

3. Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page.
Popular data analysis Approaches
• There are three approaches:

1. Classical
2. Exploratory (EDA)
3. Bayesian
DA approaches in detail
1. For classical analysis, the sequence is
Problem => Data => Model => Analysis =>
Conclusions
2. For EDA, the sequence is
Problem => Data => Analysis => Model =>
Conclusions
3. For Bayesian, the sequence is
Problem => Data => Model => Prior Distribution => Analysis => Conclusions
EDA GOALS
• The primary goal of EDA is to maximize the
analyst's insight into a data set and into the
underlying structure of a data set, while providing
all of the specific items that an analyst would
want to extract from a data set, such as:
1. a good-fitting, parsimonious model
2. a list of outliers
3. a sense of robustness of conclusions
4. estimates for parameters
5. uncertainties for those estimates
6. a ranked list of important factors
7. conclusions as to whether individual factors
are statistically significant
8. optimal settings.
Data Preprocessing
• Word / Phrase count
• Existence of certain special characters
• Relative length of text
• Picking out topics
Tweet
Word/Phrase Counts
• Sentence: "You were born with wings"

Word:        You   were   born   with   wings
Word count:    1      1      1      1       1
Word/Phrase Counts
PYTHON
• Approach 1 − Using the split() function. split() breaks the string into an iterable list, with space as the delimiter.
• Approach 2 − Using the regex module. Here the findall() function is used to count the number of words in the sentence.
• Approach 3 − Using the sum() + strip() + split() functions (see the sketch below).
PYTHON – WORD COUNT
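A minimal sketch of the three approaches above, using a made-up sentence; each line prints 5:

import re

sentence = "You were born with wings"

# Approach 1: split() breaks the string into a list on spaces
print(len(sentence.split(' ')))                     # 5

# Approach 2: re.findall() returns every word token matched by \w+
print(len(re.findall(r'\w+', sentence)))            # 5

# Approach 3: sum() + strip() + split()
print(sum(1 for word in sentence.strip().split()))  # 5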
Special Characters - presence
PYTHON – SPECIAL CHARACTERS

# Python program to count alphabets, digits and special characters in a string
string = input("Please Enter your Own String : ")
alphabets = digits = special = 0

for i in range(len(string)):
    if string[i].isalpha():
        alphabets = alphabets + 1   # letter
    elif string[i].isdigit():
        digits = digits + 1         # digit
    else:
        special = special + 1       # anything else counts as special
        print(string[i])            # show each special character found

print("\nTotal Number of Alphabets in this String : ", alphabets)
print("Total Number of Digits in this String : ", digits)
print("Total Number of Special Characters in this String : ", special)
FOUR LEVELS OF DATA
• NOMINAL
• ORDINAL
• INTERVAL
• RATIO
Nominal
• Nominal data is a group of non-parametric variables.
– Purely a name or category
– Gender, nationality, species
• A part of speech is also considered on the nominal level of data.
• We cannot do any arithmetic on nominal data.
Nominal Data
Examples:
– A type of animal is on the nominal level of
data. We may also say that if it is a
chimpanzee, then it belongs to the mammalian
class as well.
– A part of speech is also considered on the
nominal level of data. The word she is a
pronoun, and it is also a noun.
– Of course, being qualitative, we cannot
perform any quantitative mathematical
operations, such as addition or division.
Math operations allowed on nominal

• Basic equality and membership functions
– Being a tech entrepreneur implies being in the tech industry, but not vice versa.
– A figure described as a square falls under the description of being a rectangle, but not vice versa.
Measure of center
• A measure of center is the balance point of the data.
• Common examples are the mean, median and mode.
• In order to find the center of nominal data, we generally use the MODE (the most common element).
• Example: the most common continent surveyed for an experiment would be the natural choice of center.
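As a minimal sketch (the continent list is made up for illustration), the mode of nominal data can be found with Python's collections.Counter:

from collections import Counter

# made-up nominal data: continents surveyed for an experiment
continents = ["Asia", "Europe", "Asia", "Africa", "Asia", "Europe"]

# most_common(1) returns the single most frequent element and its count
mode, count = Counter(continents).most_common(1)[0]
print(mode, count)  # Asia 3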
Ordinal
• Ordinal data is a group of non-parametric ordered variables.
• Provides a rank order.
• The order is meaningful: one observation comes before another.
– Example: rate your satisfaction on a scale from 1 to 5 or 1 to 10
• Math operations: ordering, comparison
Measures of center in Ordinal
• The median is usually an appropriate way of defining the center of the data.
• The mean is not possible for ordinal data, since addition and division are not defined at this level.
Example
• Imagine you have conducted a survey among
employees asking
– “How happy are you to be working here?”
– On a scale from 1-5.
– Results:
– 5,4,3,4,5,3,2,5,3,1,4, 5,4,3,4,5,3,2,5,3,1,4,
5,4,3,4,5,3,2,5,3,1,4, 5,4,3,4,5,3,2,5,3,1,4
Python Implementation
import numpy
R = [5,4,3,4,5,3,2,5,3,1,4, 5,4,3,4,5,3,2,5,3,1,4,
     5,4,3,4,5,3,2,5,3,1,4, 5,4,3,4,5,3,2,5,3,1,4]
Sort_R = sorted(R)   # sorted R values are stored in Sort_R
print(Sort_R)
print(numpy.median(Sort_R))

Result: 4.0
Classify ordinal or nominal
• The origin of the beans in your cup of coffee
• The place someone receives after completing
a foot race
• The metal used to make the medal that they
receive after placing in the said race.
• The telephone number of a client
• How many cups of coffee do you drink in a day?
Interval level data
• The basic difference between the ordinal and interval levels is just that: difference.
• Data at the interval level allows meaningful subtraction of data points.
Interval example
• Temperature is a great example (the one which comes to mind immediately).
• If it is 100 degrees Fahrenheit in city-1 and 80 degrees Fahrenheit in city-2, then city-1 is 20 degrees warmer than city-2.
Math operations
• Ordering and comparison
• Addition
• Subtraction
Measures of Center
• Median, mean and mode.
• Suppose we look at the temperatures of the fridge containing a pharmaceutical company's new COVID-19 vaccine:
31,32,32,31,35,38,39,31,32,31,35,38, 38,39,31,32,31,35,38, 38,39,31,32,31,35,38,38,39,31,32,31,35,38
Python Illustration to find mean and
median
import numpy
Temps = [31,32,32,31,35,38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,38,39,31,32,31,35,38]
print(numpy.mean(Temps))
print(numpy.median(Temps))

Result: 34.45454545454545 (mean), 35.0 (median)
Measures of variation
• In data science, it is important to mention how the data is "spread out"; that is given by variation.
• A measure of variation is a number that attempts to describe how spread out the data is.
Key understanding
• Along with a measure of center, a measure of
variation can almost describe an entire
dataset with only TWO numbers.
STANDARD DEVIATION (SD)
• The most common measure of variation for data at the interval level and beyond.
• STANDARD DEVIATION: the average distance a data point is from the mean.
SD PROCEDURE
• STEP-1: Find the mean of the data.
• STEP-2: For each number in the dataset, subtract the mean from it and then square the result.
• STEP-3: Find the average of the squared differences.
• STEP-4: Take the square root of the number obtained in step three. This is the standard deviation.
SD in Python Implementation
import numpy
Temps = [31,32,32,31,35,38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,38,39,31,32,31,35,38]
Mean = numpy.mean(Temps)            # step 1: the mean
Sq_diffs = []
for t in Temps:
    Diff = t - Mean                 # step 2: difference from the mean...
    Sq_diff = Diff ** 2             # ...squared
    Sq_diffs.append(Sq_diff)
Avg_sq_diff = numpy.mean(Sq_diffs)  # step 3: average of the squared differences
SD = numpy.sqrt(Avg_sq_diff)        # step 4: square root
print(SD)
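As a sanity check, numpy's prebuilt std() function should give the same answer, since by default it also divides by n (the population standard deviation), matching the four-step procedure above:

import numpy

Temps = [31,32,32,31,35,38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,38,39,31,32,31,35,38]

# numpy.std defaults to ddof=0, i.e. it divides by n
print(numpy.std(Temps))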
Ratio
• Ratio data is defined as quantitative data, having the same properties as interval data, with an equal and definitive ratio between each data point and an absolute "zero" treated as a point of origin.
• In other words, there can be no negative numerical value in ratio data.
Ratio Examples
• Consider the variable age.
• Age is frequently collected as ratio data, but
can also be collected as ordinal data.
• This happens on surveys when they ask,
“What age group do you fall in?” There, you
wouldn't have data on your respondent's
individual ages – you'd only know how many
were between 18-24, 25-34, etc.
More examples

• A ratio variable has all the properties of an interval variable, and also has a clear definition of 0.0.
• Examples of ratio variables include: dose amount, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (0.0 Kelvin really does mean "no heat"), and survival time.
Measures of center
• The arithmetic mean still holds meaning at this level.
• So does the geometric mean. (In statistics, the geometric mean is calculated by raising the product of a series of numbers to the power of the inverse of the series' length.)
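A minimal sketch of the geometric mean in Python, reusing the six numbers from the mean example below (math.prod requires Python 3.8+):

import math

values = [24, 56, 67, 84, 90, 12]

# geometric mean: the product of the series raised to the power 1/n
geo_mean = math.prod(values) ** (1 / len(values))
print(geo_mean)  # roughly 44.9, versus the arithmetic mean of 55.5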
Mean
• 24,56,67,84,90,12
• Total = 333
• Mean = total /6 = 333/6 = 55.5
Median
• 3,4,9,1,7,10,5,6,8,12
• Step-1: Sort the values: 1,3,4,5,6,7,8,9,10,12
• Step-2: Average the two middle values: (6+7)/2 = 13/2 = 6.5
Range
• 3,4,9,1,7,10,5,6,8,12
• Range = Max – Min = 12 – 1 = 11
Mode
• 3,4,9,1,7,10,5,6,8,12,5,6,7,5,8,3,2,5
• Mode: 5 (it appears four times, more than any other value)
SD
• Data: x1, x2, x3, …, xn
• Mean: xm = (x1 + x2 + … + xn) / n
• Work table, one row per value: x | (x − xm) | (x − xm)²
• Variance = Σ(x − xm)² / n
• SD = sqrt(Variance)
IQR
• Inter-Quartile Range:
• The interquartile range (IQR) measures the
spread of the middle half of your data.
• It is the range for the middle 50% of
the sample.
IQR
• Inter-Quartile Range:
• Data: (1, 2, 3, 6, 7) 8, 9 (11, 14, 15, 18, 20) (12 values)
• Median = (8+9)/2 = 8.5
• Q1 = 3, Q3 = 15 (the medians of the lower and upper five values)
• IQR = Q3 − Q1 = 15 − 3 = 12
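numpy can compute the quartiles directly; note that its default linear-interpolation convention differs from the hand method above, so the numbers will not match exactly:

import numpy

data = [1, 2, 3, 6, 7, 8, 9, 11, 14, 15, 18, 20]

q1 = numpy.percentile(data, 25)  # 5.25 under numpy's interpolation
q3 = numpy.percentile(data, 75)  # 14.25
print(q3 - q1)                   # 9.0, versus 12 from the hand method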
CV
• COEFFICIENT OF VARIATION

• CV = SD / MEAN

• SD = SQRT(Σ(X − XM)² / N)
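A minimal sketch, reusing the vaccine-fridge temperatures from earlier; because CV divides two quantities in the same units, the result is unit-free:

import numpy

Temps = [31,32,32,31,35,38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,
         38,39,31,32,31,35,38,38,39,31,32,31,35,38]

cv = numpy.std(Temps) / numpy.mean(Temps)  # CV = SD / MEAN
print(cv)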
Python Prebuilt Modules
• pandas
• scikit-learn
• seaborn
• numpy/scipy
• requests (to mine data from the Web)
• BeautifulSoup (for parsing HTML from the Web)
Python Prebuilt Modules
• Pandas: a library used for data analysis. Pandas is the most widely used tool for data munging, and is commonly applied to financial time series/economics data.
• scikit-learn: a useful library for machine learning in Python, with a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
• Seaborn: a data visualization library that uses matplotlib underneath to plot graphs. It provides a high-level interface for drawing attractive and informative statistical graphics.
• NumPy/SciPy: Python libraries used for mathematical and numerical analysis. NumPy contains array data and basic operations such as sorting and indexing, whereas SciPy consists of all the numerical code.
• Requests: allows you to send HTTP/1.1 requests using Python. With it, you can add content like headers, form data, multipart files, and parameters via simple Python libraries. It also allows you to access the response data in the same way.
• BeautifulSoup: a library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Basic questions for data exploration
• When looking at a new dataset, whether it is
familiar to you or not, it is important to use the
following questions as guidelines for your
preliminary analysis:
1. Is the data organized or not?
2. What does each row represent?
3. What does each column represent?
4. Are there any missing data points?
5. Do we need to perform any transformations on
the columns?
Case Study: Dataset 1 – Yelp
• The first dataset we will look at is a public dataset made available by the restaurant review site, Yelp. All personally identifiable information has been removed.
• Let's read in the data first, as shown here:

import pandas as pd
yelp_raw_data = pd.read_csv("yelp.csv")
yelp_raw_data.head()
Explanation of steps
• Import the pandas package and nickname it as pd.
• Read in the .csv file; call it yelp_raw_data.
• Look at the head of the data (just the first few records).
Table of values (column headers):
business_id | date | review_id | stars | text
Yelp Dataset
• Is the data organized or not?
• What does each row represent?
• What does each column represent?
• Are there any missing data points?
• Do we need to perform any transformations
on the columns?
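A minimal sketch of pandas calls that help answer these questions, reusing yelp_raw_data from the snippet above:

import pandas as pd

yelp_raw_data = pd.read_csv("yelp.csv")

print(yelp_raw_data.shape)           # rows (observations) and columns (characteristics)
print(yelp_raw_data.dtypes)          # each column's type: a hint at quantitative vs qualitative
print(yelp_raw_data.isnull().sum())  # missing data points per column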
Case Study:
• Dataset 2 – Titanic Data
• Apply the following operations:
– Filtering operations
– Handling of ordinal, nominal/categorical variables
and applying mathematical, statistical functions
– Changing the data type of a column
– Handling of missing values
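A minimal sketch of these operations; the column names (Sex, Embarked, Age, Pclass) follow the common Kaggle Titanic CSV and, like the filename, are assumptions here:

import pandas as pd

titanic = pd.read_csv("titanic.csv")  # hypothetical filename

# Filtering: keep only the female passengers
females = titanic[titanic['Sex'] == 'female']

# Nominal/categorical handling: frequency counts per category
print(titanic['Embarked'].value_counts())

# Mathematical/statistical functions on a quantitative column
print(titanic['Age'].mean(), titanic['Age'].median())

# Changing the data type of a column
titanic['Pclass'] = titanic['Pclass'].astype(str)

# Handling missing values: fill missing ages with the mean age
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())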
Summary
Whenever you are faced with a new dataset, the first three
questions you should ask about it are the following:
• Is the data organized or unorganized?
– For example, does our data exist in a nice, clean row/column
structure?
• Is each column quantitative or qualitative?
– For example, are the values numbers, strings, or do they represent
quantities?
• At what level of data is each column?
– For example, are the values at the nominal, ordinal, interval, or ratio
level?
