DATA SCIENCE
TYCS SEMESTER VI UNIT 1 By Asst Prof. Bindy Wilson
Introduction
⚫ Statistical Learning - a field of statistics that places heavy emphasis on computational considerations
⚫ Machine Learning - building computer systems that automatically improve with experience
⚫ Artificial Intelligence - the science and engineering of making intelligent machines, drawing on linguistics, philosophy, psychology, neuroscience, mathematics, computer science and so on
⚫ Data Mining - the science of knowledge discovery using database systems, ML and statistics
⚫ Data Science - the big umbrella that brings everything together, with the potential to draw insight from data and build intelligent systems on top of it

Data Science
⚫ An interdisciplinary field that uses techniques from computer science, databases, statistics & machine learning
⚫ Involves collection, preparation, analysis, visualization, management and preservation of data
⚫ Extracts meaningful information from data sources to be used for business purposes
⚫ Uses that knowledge to
  • Make decisions
  • Predict the future
Types of Data
⚫ Data is a collection of facts in a format that can be processed by a computer.
⚫ Two data types (a short sketch showing these types in code appears at the end of this section):
⚫ Quantitative data (Numerical variables): data that can be described using numbers and basic mathematical procedures. Eg: the salary of employees
⚫ Qualitative data (Categorical variables): data that cannot be described using numbers and basic mathematics. Eg: gender or country name

Categorical or Qualitative data
⚫ Classified into Nominal, Binary, and Ordinal
⚫ Nominal - variables without any regard for ordering. For example, candidate names in polling data from a survey.
⚫ Ordinal - can have two or more categories with the added condition that the categories are ordered. For example, a customer rating for a movie: the rating variable has a relative importance on a scale of 1 to 5
⚫ Binary - variables with exactly two categories, such as gender or the possible outcomes of a single coin toss

Numerical or Quantitative data
⚫ Measurable and represented as numbers, not words or text
⚫ Divided into discrete and continuous
⚫ Discrete variables take a countable set of values, e.g. days in a month. Continuous variables can take any value within a range and are subdivided into Interval and Ratio
⚫ Interval - measured along a continuous range with no true zero; 0 °C still represents a certain degree of temperature
⚫ Ratio - examples include distance, mass, and height. A value of 0 for a ratio variable means none of the quantity is present (a true zero).

Traditional vs Big Data
⚫ Traditional data - structured & stored in databases in table format, contains numeric or text values, usually managed in a single system
⚫ Big data - distributed across a network of computers and characterised by the 5 V's:
⚫ Volume - enormous volume
⚫ Variety - many sources & types: photos, videos, audio, PDF, data from sensors, monitoring devices...
⚫ Velocity - massive & continuous data flow
⚫ Veracity - uncertain, imprecise, abnormal data
⚫ Validity - whether the data is accurate for its intended use

Different types of data sources
1) Structured - always the easiest to understand, represent, store, query, and process
⚫ Data has rows and columns stored in a tabular manner
⚫ Data coming from CSV and Excel files
2) Semi-Structured - web data such as XML, HTML etc.
⚫ Data generated from Twitter and Facebook
⚫ Stored in NoSQL databases like MongoDB and Cassandra
3) Unstructured - data like images, videos, web logs and click streams, as well as non-digitized data from newspapers and books

The Five Steps of Data Science
⚫ 1. Asking an interesting question
⚫ 2. Obtaining the data - finding the right data set
⚫ 3. Exploring the data - understanding the data
⚫ 4. Modeling the data - involves the use of statistical and machine learning models
⚫ 5. Communicating and visualizing the results - conclude your results in a digestible format
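The sketch below, referenced in the Types of Data slide, shows one way the qualitative and quantitative variable types above could be represented in a pandas DataFrame; the column names and values are made up for illustration, and pandas is assumed to be available.

```python
import pandas as pd

# A small illustrative data set (all values are made up)
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],        # nominal (qualitative)
    "rating": [3, 5, 4],                       # ordinal (movie rating on a 1-5 scale)
    "is_member": [True, False, True],          # binary
    "salary": [52000.0, 61000.0, 58000.0],     # ratio (quantitative, continuous)
    "days_present": [22, 20, 25],              # discrete (quantitative)
})

# Mark qualitative columns explicitly as categorical;
# 'rating' is ordered, so it becomes an ordinal categorical
df["name"] = df["name"].astype("category")
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

print(df.dtypes)
```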
Data Collection
⚫ Primary and Secondary data
⚫ Primary data is data originated for the first time by the researcher through direct effort and experience
⚫ Also known as first-hand or raw data
⚫ Collected through surveys, observations, physical testing, mailed questionnaires, interviews
⚫ Secondary data is second-hand information which has already been collected and recorded by someone else
⚫ Readily available from various sources like censuses, government publications, internal records of the organisation, reports, books, journal articles, websites

Various types of data collection methods
⚫ 1) Companies and Proprietary Data Sources
⚫ 2) Government Data Sources
⚫ 3) Academic Data Sets
⚫ 4) Sweat Equity
⚫ 5) Scraping
Types of Observation
⚫ A) Casual & Scientific
⚫ B) Simple & Systematic
⚫ C) Subjective & Objective
⚫ D) Factual & Inferential
⚫ E) Direct & Indirect
⚫ F) Behavioral & Non-behavioral

Web scraping
⚫ Used for extracting data from websites (a small sketch follows this section)
⚫ Scraping a web page involves fetching it and extracting data from it
⚫ The content of a page may be parsed, searched, reformatted, and its data copied into a spreadsheet
⚫ Human Copy-Paste
⚫ Text pattern matching - using the regular expression facilities of programming languages
⚫ API Interface
⚫ DOM Parsing

Data wrangling or Data cleaning
⚫ Initial Data Analysis (IDA)
⚫ Removing inconsistencies from the data, such as missing values, and bringing it into a standard format
⚫ Correcting factor variables
⚫ Dealing with NAs - imputation, the process of filling in missing values
⚫ Dealing with dates and times
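The following is a rough illustration of the fetch-and-parse style of web scraping described in the Web scraping slide above. It assumes the third-party requests and BeautifulSoup (bs4) libraries are installed; the URL and the h2.title selector are placeholders, not a real site.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- substitute a real page you are allowed to scrape
url = "https://example.com/articles"

# Fetch the page
response = requests.get(url, timeout=10)
response.raise_for_status()

# DOM parsing: extract text from hypothetical <h2 class="title"> elements
soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="title")]

print(titles)
```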
Handling missing data
⚫ Heuristic-based imputation - make a reasonable guess
⚫ Mean value imputation - use the mean value of the variable
⚫ Random value imputation - select a random value from the column
⚫ Imputation by nearest neighbor - identify the record which matches most closely on all fields, and use this nearest neighbor to infer the values
⚫ Imputation by interpolation - use a method like linear regression to predict the missing value
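A minimal sketch of three of the imputation strategies above using pandas and NumPy (both assumed available); the series values are made up. Nearest-neighbour imputation is not shown here; in practice it could be done with a tool such as scikit-learn's KNNImputer.

```python
import numpy as np
import pandas as pd

# Illustrative column with missing values (NaN); the data is made up
s = pd.Series([12.0, np.nan, 15.0, 11.0, np.nan, 14.0])

# Mean value imputation: replace each NaN with the column mean
mean_filled = s.fillna(s.mean())

# Random value imputation: replace each NaN with a randomly chosen observed value
rng = np.random.default_rng(0)
observed = s.dropna().to_numpy()
random_filled = s.copy()
random_filled[s.isna()] = rng.choice(observed, size=s.isna().sum())

# Imputation by interpolation: estimate missing values from neighbouring points
interpolated = s.interpolate()

print(mean_filled.tolist())
print(random_filled.tolist())
print(interpolated.tolist())
```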
Exploratory Data Analysis (EDA)
⚫ Fundamental step after data collection and pre-processing
⚫ Most EDA techniques are graphical in nature
⚫ Objectives of EDA:
⚫ 1) Maximize insight into the data set
⚫ 2) Visualize relationships between exposure and outcome variables
⚫ 3) Detect outliers and anomalies
⚫ 4) Extract and create relevant variables
⚫ 5) Find a suitable model

EDA methods - graphical or non-graphical
⚫ Non-graphical - summary statistics such as frequency, mean, median, mode, range, interquartile range, and maximum and minimum values
⚫ Graphical - data visualization using multiple types of charts and graphs
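As a small illustration of non-graphical EDA, the sketch below computes summary statistics and a frequency table with pandas (assumed available); the data set is invented for the example.

```python
import pandas as pd

# Illustrative data set (values are made up)
df = pd.DataFrame({
    "age": [23, 31, 45, 29, 38, 51],
    "salary": [30000, 42000, 58000, 39000, 50000, 64000],
    "city": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune", "Mumbai"],
})

# Non-graphical EDA: summary statistics for the numeric columns
print(df.describe())

# Frequency table for a categorical column
print(df["city"].value_counts())

# Correlation between the numeric variables
print(df[["age", "salary"]].corr())
```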
Summary Statistics
⚫ Mean - the sum of the values divided by the number of observations
⚫ Median - the middle value in a sorted data set
⚫ Variance - a measure of the spread of the given set of numbers
⚫ Interquartile range (IQR) - the range of the data situated between the 1st and the 3rd quartiles
⚫ Skewness - measures asymmetry about the mean

Summary Statistics (contd)
⚫ Kurtosis - a measure of the peakedness and tailedness of the probability distribution of a random variable
⚫ Covariance and correlation - measure the degree of the relationship between two random variables
⚫ If two variables have a correlation close to -1, as one variable increases the other decreases; if they have a correlation close to +1, the variables move together in the same direction
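A short sketch computing the summary statistics listed above with pandas (assumed available); the two series are made-up samples.

```python
import pandas as pd

# Illustrative samples (values are made up)
x = pd.Series([4, 7, 7, 8, 9, 10, 12, 13, 15, 40])
y = pd.Series([5, 6, 8, 8, 10, 11, 12, 14, 16, 35])

print("mean:", x.mean())
print("median:", x.median())
print("variance:", x.var())                          # spread about the mean
print("IQR:", x.quantile(0.75) - x.quantile(0.25))   # middle 50% of the data
print("skewness:", x.skew())                         # asymmetry about the mean
print("kurtosis:", x.kurtosis())                     # peakedness / tailedness
print("covariance:", x.cov(y))
print("correlation:", x.corr(y))                     # close to +1: move together
```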
Data visualization
⚫ The process of creating and studying visual representations of data to bring out meaningful insights
⚫ Deals with visualizing the information in a given data set
⚫ Benefits
  • Identifying red spots (problem areas) in the data
  • Tracking and identifying relations among different attributes
  • Seeing the trend
  • Summarizing complicated long spreadsheets and databases into visual form
  • An easy to use and very impactful way to present information

Data visualization (contd)
⚫ Four types of presentation in data visualization: Comparison, Relationship, Distribution, and Composition
⚫ Comparison is used to see the differences between multiple items at a given point in time. Eg - line chart
⚫ Relationship helps in finding the correlation between two or more variables. Eg - scatter and bubble charts
⚫ Distribution charts like column and line histograms show the spread of data; skewness toward the left or right can be easily spotted
⚫ Composition refers to a stacked chart with multiple components, like a pie chart or stacked column chart

Boxplots
⚫ Boxplots are a compact way of representing the five-number summary, namely the median, the first and third quartiles (25th and 75th percentiles), and the min and max.
⚫ The upper side of the vertical rectangular box represents the third quartile and the lower side the first quartile. The difference between the two is the interquartile range, which contains 50% of the data.
⚫ A line dividing the rectangle represents the median.
⚫ The box also has a line extending on both sides (known as whiskers).
⚫ Points plotted beyond the ends of the whiskers are called outliers.
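A minimal boxplot sketch using matplotlib and NumPy (both assumed available); the two groups of values are randomly generated for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: two groups of made-up observations
rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=10, size=100)
group_b = rng.normal(loc=60, scale=15, size=100)

# Boxplot: box spans Q1 to Q3, the line inside is the median,
# whiskers extend beyond the box, points past them are outliers
plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ["Group A", "Group B"])
plt.ylabel("Value")
plt.title("Five-number summary as a boxplot")
plt.show()
```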
Line Chart
⚫ A line chart is a basic visualization chart type in which information is displayed as a series of data points connected by line segments. Line charts are used for showing trends.
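A minimal line chart sketch with matplotlib (assumed available); the monthly sales figures are invented.

```python
import matplotlib.pyplot as plt

# Illustrative data: made-up monthly sales figures to show a trend
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 160, 180]

plt.plot(months, sales, marker="o")  # data points connected by line segments
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Line chart showing a trend over time")
plt.show()
```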
Scatterplots
⚫ A scatterplot is a graph that helps identify whether there is a relationship between two variables.
⚫ Scatterplots use Cartesian coordinates to show two variables on an x- and y-axis. Higher-dimensional scatterplots are also possible.
⚫ By adding dimensions such as color, shape, or size, we can present more than two variables.
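A minimal scatterplot sketch with matplotlib and NumPy (assumed available); the height, weight and age values are simulated, with point size used as a third dimension.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: two related variables plus a third encoded as point size
rng = np.random.default_rng(2)
height = rng.normal(165, 10, 50)                 # x-axis variable
weight = 0.9 * height + rng.normal(0, 5, 50)     # y-axis variable, related to height
age = rng.integers(18, 60, 50)                   # third variable shown via marker size

plt.scatter(height, weight, s=age, alpha=0.6)    # size adds a third dimension
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatterplot of two variables")
plt.show()
```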
Correlation Plots
⚫ The best way to show how much one indicator relates to another is by computing the correlation.
⚫ In a correlation plot, the combination of color, size, and position encodes a numeric value in a visual representation.
⚫ In the ellipse style of correlation plot, the direction of each ellipse indicates a positive or negative correlation and its size represents the magnitude of the value.
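The ellipse-style plot described above comes from tools such as R's corrplot package; a common stand-in in Python, sketched below with matplotlib and pandas (assumed available), encodes the same correlation matrix as a colour heatmap. The indicator names and data are invented.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative data: three made-up indicators
rng = np.random.default_rng(3)
df = pd.DataFrame({"income": rng.normal(50, 10, 100)})
df["spending"] = 0.8 * df["income"] + rng.normal(0, 3, 100)   # positively related
df["savings"] = -0.5 * df["income"] + rng.normal(0, 3, 100)   # negatively related

corr = df.corr()  # pairwise correlation matrix

# Colour encodes the strength and sign of each correlation
plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.title("Correlation plot (heatmap form)")
plt.show()
```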
Stacked Column Charts & Histograms
⚫ Stacked column charts are an elegant way of showing the composition of the various categories that make up a particular variable.
⚫ A histogram is one of the most basic and easy to understand graphical representations of numerical data.
⚫ It consists of rectangular bars. The width of each bar covers a certain range of values and the height signifies the number of data points within that range.
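A small sketch of a stacked column chart and a histogram with matplotlib and NumPy (assumed available); the quarterly sales and the sampled values are made up.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked column chart: composition of made-up quarterly sales by product
quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = [20, 25, 30, 35]
product_b = [15, 18, 20, 22]
ax1.bar(quarters, product_a, label="Product A")
ax1.bar(quarters, product_b, bottom=product_a, label="Product B")  # stack on top
ax1.set_title("Stacked column chart")
ax1.legend()

# Histogram: each bar's width is a value range (bin), its height a count
rng = np.random.default_rng(4)
values = rng.normal(50, 10, 500)
ax2.hist(values, bins=20)
ax2.set_title("Histogram")

plt.tight_layout()
plt.show()
```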
Pie Charts, Heatmaps & Dendrograms
⚫ A pie chart is a type of graph that uses pie slices to show the relative sizes of data.
⚫ Heatmaps are visualizations of data where values are represented as different shades of a color; the darker the shade, the higher the value.
⚫ Dendrograms are visual representations specifically useful in clustering analysis. They are tree diagrams frequently used to illustrate the formation of clusters.
⚫ The y-axis in a dendrogram measures the closeness (or similarity) of clusters.
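A small sketch of a pie chart and a dendrogram; it assumes matplotlib, NumPy and SciPy are available, and the market shares and clustered points are invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart: relative sizes of made-up market shares
ax1.pie([45, 30, 15, 10], labels=["A", "B", "C", "D"], autopct="%1.0f%%")
ax1.set_title("Pie chart")

# Dendrogram: hierarchical clustering of a few made-up 2-D points;
# the y-axis shows how close (similar) clusters are when they merge
rng = np.random.default_rng(5)
points = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(5, 1, (5, 2))])
dendrogram(linkage(points, method="ward"), ax=ax2)
ax2.set_title("Dendrogram")

plt.tight_layout()
plt.show()
```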
High-level Programming Language
⚫ A high-level language (HLL) is a programming language that enables a programmer to write programs that are independent of a particular type of computer.
⚫ They are closer to human languages and further from machine languages.
⚫ Assembly languages are considered low-level because they are very close to machine languages.
⚫ Advantages of high-level languages
⚫ High-level languages are programmer-friendly. They are easy to write, debug and maintain.
⚫ They provide a higher level of abstraction from machine languages.
⚫ They are machine-independent.
⚫ Easy to learn.
⚫ Less error-prone; errors are easy to find and debug.
⚫ High-level programming results in better programming productivity.
Integrated Development Environment (IDE)
⚫ An IDE enables programmers to consolidate the different aspects of writing a computer program.
⚫ It is software for building applications that combines common developer tools into a single graphical user interface (GUI).
⚫ Development tools include text editors, code libraries, compilers and test platforms.
⚫ An IDE typically offers:
⚫ a text editor,
⚫ automated code validation,
⚫ syntax highlighting,
⚫ auto completion,
⚫ contextual suggestions,
⚫ easy access to help, and
⚫ debugging tools