TYCS DS Unit1

Data Science Notes

DATA SCIENCE

TYCS SEMESTER VI
UNIT 1
By Asst Prof. Bindy Wilson


Introduction
• Statistical Learning - a field of statistics that largely
involves computational considerations
• Machine Learning - building computer systems that
automatically improve with experience
• Artificial Intelligence - the science and engineering of
making intelligent machines, drawing on
linguistics, philosophy, psychology, neuroscience,
mathematics, computer science and so on
• Data Mining - the science of knowledge discovery using
database systems, ML and statistics
• Data Science - a big umbrella that brings everything
together, with the potential to draw insight from data
and build intelligent systems on it
Data Science
⚫ An interdisciplinary field that uses techniques
from computer science, databases, statistics and machine
learning
⚫ Involves collection, preparation, analysis,
visualization, management and preservation
of data
⚫ Extracts meaningful information from data
sources to be used for business purposes
⚫ Uses that knowledge to
   • Make decisions
   • Predict the future


Types of Data
⚫ Data is a collection of facts in a format
that can be processed by a computer.
⚫ Two data types
⚫ Quantitative data (Numerical
Variables): Data that can be described using
numbers and basic mathematical
procedures. E.g. the salary of employees
⚫ Qualitative data (Categorical
Variables): Data that cannot be described
using numbers and basic mathematics. E.g.
gender or country name
Categorical or Qualitative data
⚫ Classified into Nominal, Binary, and Ordinal
⚫ Nominal - variables whose categories have no
ordering. For example, candidate
names in polling data from a survey.
⚫ Ordinal - variables with two or more categories
that have a meaningful order. For example, a
customer rating for a movie on a scale of 1 to 5,
where each rating ranks above the one before it
⚫ Binary - variables with exactly two
categories, such as gender or the possible
outcomes of a single coin toss
Numerical or Quantitative data
⚫ Measurable and represented as numbers, not
words or text
⚫ Divided into discrete and continuous
⚫ Discrete variables take separate, countable values,
e.g. the number of days in a month. Continuous variables
can take any value within a range and are subdivided
into Interval and Ratio
⚫ Interval - measured along a continuous
range with no true zero. E.g. 0 °C is still a definite
temperature, not the absence of temperature
⚫ Ratio - includes distance, mass, and height. A
value of 0 for a ratio variable means none, i.e.
nothing to measure.
Traditional vs Big Data
⚫ Traditional data - structured and stored in
databases in table format, contains numeric or
text values, usually managed in a single system
⚫ Big data - distributed across a network of
computers and characterized by the 5 V's
⚫ Volume - enormous amounts of data
⚫ Variety - many sources and types: photos, videos,
audio, PDF, data from sensors, monitoring
devices
⚫ Velocity - massive and continuous data flow
⚫ Veracity - uncertain, imprecise, abnormal data
⚫ Validity - whether the data is accurate for its intended use
Different types of data sources
1) Structured - the easiest to understand,
represent, store, query, and process
⚫ data has rows and columns stored in a
tabular manner
⚫ e.g. data coming from CSV and Excel files
2) Semi-Structured - web data that
consists of XML, HTML etc
⚫ e.g. data generated from Twitter and Facebook
⚫ stored in NoSQL databases like MongoDB and
Cassandra
3) Unstructured - data like images, videos, web
logs, and click streams, and also non-digitized data
from newspapers and books
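As a rough illustration of the first two categories, the sketch below reads a small structured table and a small semi-structured document using only Python's standard library. The inline CSV and JSON text, the field names and the values are all made up for the example; JSON is used here as an assumed stand-in for the kind of nested records social media platforms produce.

```python
import csv
import io
import json

# Structured: fixed rows and columns, as in a CSV file.
# (Inline text stands in for a real file; the values are hypothetical.)
csv_text = "name,salary\nAsha,52000\nRavi,61000\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
salaries = [int(r["salary"]) for r in rows]

# Semi-structured: self-describing records whose fields may vary.
# The second record has no "tags" field, which a fixed table could not express directly.
json_text = '[{"user": "asha", "tags": ["ds", "ml"]}, {"user": "ravi"}]'
posts = json.loads(json_text)
tag_counts = [len(p.get("tags", [])) for p in posts]

print(salaries)
print(tag_counts)
```

The structured rows can be indexed by column name straight away, while the semi-structured records need a default (`p.get("tags", [])`) for fields that may be absent.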
The Five Steps of Data Science
⚫ 1. Asking an interesting question
⚫ 2. Obtaining the data - finding the right data
set
⚫ 3. Exploring the data - understanding the data
⚫ 4. Modeling the data - involves the use of
statistical and machine learning models
⚫ 5. Communicating and visualizing the results -
present your conclusions in a digestible format


Data Collection
⚫ Primary and Secondary data
⚫ Primary data is data originated for the first time by
the researcher through direct efforts and experience
⚫ Also known as first-hand or raw data
⚫ Collected through surveys, observations, physical
testing, mailed questionnaires, interviews
⚫ Secondary data is second-hand information that has
already been collected and recorded by someone else
⚫ Readily available from various sources like censuses,
government publications, internal records of the
organisation, reports, books, journal articles, websites
Various types of data collection methods
⚫ 1) Companies and Proprietary Data Sources
⚫ 2) Government Data Sources
⚫ 3) Academic Data Sets
⚫ 4) Sweat Equity
⚫ 5) Scraping

⚫ A) Casual & Scientific
⚫ B) Simple & Systematic
⚫ C) Subjective & Objective
⚫ D) Factual & Inferential
⚫ E) Direct & Indirect
⚫ F) Behavioral & Non-behavioral
Web scraping
⚫ Used for extracting data from websites
⚫ Scraping a web page involves fetching it and
extracting data from it
⚫ The content of a page may be parsed, searched and
reformatted, and its data copied into a spreadsheet
⚫ Techniques:
⚫ Human Copy-Paste
⚫ Text pattern matching - using the regular expression
matching facilities of programming languages
⚫ API Interface
⚫ DOM Parsing
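Two of the techniques listed above, text pattern matching and parsing the page structure, can be sketched with Python's standard library. The HTML below is a hypothetical page held in a string; a real scraper would first fetch it over HTTP. `html.parser` is event-based, so it is only a simple stand-in for full DOM parsing, and the class name `ItemParser` is made up for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would fetch this from a website.
html = """
<html><body>
  <h1>Price list</h1>
  <ul>
    <li class="item">Pen - Rs. 10</li>
    <li class="item">Notebook - Rs. 45</li>
  </ul>
</body></html>
"""

# Text pattern matching: pull the prices out with a regular expression.
prices = [int(p) for p in re.findall(r"Rs\.\s*(\d+)", html)]

class ItemParser(HTMLParser):
    """Collects the text of every <li> element by tracking open/close tags."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a list item.
        if self.in_li and data.strip():
            self.items.append(data.strip())

parser = ItemParser()
parser.feed(html)
items = parser.items

print(prices)
print(items)
```

The regular expression is shorter but depends on the exact text format, whereas the tag-based parser follows the page structure and is less likely to break when the wording changes.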
Data wrangling or Data cleaning
⚫ Initial Data Analysis (IDA)
⚫ Removing inconsistencies from the data,
like missing values, and making it follow a standard
format
⚫ Correcting Factor Variables
⚫ Dealing with NAs - imputation, the process of
filling in missing values
⚫ Dealing with Dates and Times
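A minimal sketch of these cleaning steps, assuming a tiny hand-made data set: inconsistent factor labels are mapped to a standard set of levels, and dates written in different formats are parsed into one representation, with NA values becoming `None`. The records, the `GENDER_LEVELS` mapping and the list of date formats are all hypothetical.

```python
from datetime import datetime

# Hypothetical raw records with inconsistent factor labels, an NA, and mixed date formats.
raw = [
    {"gender": "M", "joined": "2021-03-15"},
    {"gender": "male", "joined": "15/04/2021"},
    {"gender": " Female", "joined": "NA"},
]

# Correcting factor variables: map every spelling to one standard level.
GENDER_LEVELS = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# Date formats assumed to occur in this data set.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(text):
    """Return a date for any known format, or None for NA or unparseable values."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

clean = [
    {
        "gender": GENDER_LEVELS.get(r["gender"].strip().lower()),
        "joined": parse_date(r["joined"]),
    }
    for r in raw
]

for row in clean:
    print(row)
```

Missing dates are left as `None` here rather than filled in; choosing a fill value is a separate imputation decision.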


Handling missing data
⚫ Heuristic-based imputation - make a reasonable
guess
⚫ Mean value imputation - use the mean value of
a variable
⚫ Random value imputation - select a random
value from the column
⚫ Imputation by nearest neighbor - identify the
record which matches most closely on all fields,
and use this nearest neighbor to infer the values
⚫ Imputation by interpolation - use a method like
linear regression to predict the value
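Three of the strategies above can be sketched on a single column as follows. The `ages` values are hypothetical, and the interpolation shown is plain linear interpolation between neighbouring known values, used here as a simpler stand-in for a regression-based prediction.

```python
import random
from statistics import mean

ages = [25, 30, None, 40, None, 50]  # hypothetical column; None marks a missing value
observed = [a for a in ages if a is not None]

# Mean value imputation: replace each gap with the mean of the observed values.
mean_filled = [a if a is not None else mean(observed) for a in ages]

# Random value imputation: replace each gap with a randomly chosen observed value.
rng = random.Random(0)  # seeded so that repeated runs give the same result
random_filled = [a if a is not None else rng.choice(observed) for a in ages]

def interpolate(values):
    """Fill interior gaps on a straight line between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

interp_filled = interpolate(ages)

print(mean_filled)
print(interp_filled)
```

Mean imputation gives every gap the same value, while interpolation uses the position of the gap, which suits ordered data such as time series.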


Exploratory Data Analysis (EDA)
⚫ Fundamental step after data collection and
pre-processing
⚫ Most EDA techniques are graphical in nature
⚫ Objectives of EDA :
⚫ 1) Maximize insight into the database
⚫ 2) Visualize relationships between exposure
and outcome variables
⚫ 3) Detect outliers and anomalies
⚫ 4) Extract and create relevant variables
⚫ 5) Find a suitable model
⚫ EDA methods - graphical or non-graphical
⚫ Non-graphical - summary statistics, including
frequency, mean, median, mode, range,
interquartile range, maximum and minimum
values
⚫ Graphical - data visualization using multiple types
of charts and graphs


Summary Statistics
⚫ Mean - the sum of the values divided by the
number of observations
⚫ Median - the middle value in a sorted data set
⚫ Variance - a measure of the spread of the
given set of numbers about its mean
⚫ Interquartile range (IQR) - the spread of the data situated
between the 1st and the 3rd quartiles (Q3 - Q1)
⚫ Skewness - measures asymmetry about the
mean
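The measures above can be computed with Python's `statistics` module, as in the sketch below. The sample is hypothetical, the population forms of variance and skewness are an assumed convention (sample forms divide differently), and the "inclusive" quartile method is one of several in common use.

```python
from statistics import mean, median, pstdev, pvariance, quantiles

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

mu = mean(data)        # sum of values / number of observations
med = median(data)     # middle value of the sorted data
var = pvariance(data)  # population variance: mean squared deviation from the mean

# Quartiles; method="inclusive" treats the data as the whole population.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Skewness as the third standardized moment (population form).
sigma = pstdev(data)
skew = sum((x - mu) ** 3 for x in data) / (len(data) * sigma ** 3)

print(mu, med, var, iqr, skew)
```

A positive skewness, as here, means the longer tail lies to the right of the mean.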
Summary Statistics (contd)
⚫ Kurtosis - measure of peakedness and tailedness
of the probability distribution of a random
variable
⚫ Covariance and correlation - measure the
degree of the relationship between two random
variables
⚫ If two variables have a correlation close to -1, it
means that as one variable increases, the other
decreases, and if two variables have a correlation
close to +1, it means that those variables move
together in the same direction
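The following sketch computes these measures from their definitions so that each step is visible. The data are hypothetical, and the population forms (dividing by n) are an assumed convention.

```python
from math import sqrt
from statistics import mean

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and +1."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

def kurtosis(xs):
    """Fourth standardized moment (population form; a normal distribution gives 3)."""
    m = mean(xs)
    var = covariance(xs, xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * var ** 2)

hours = [1, 2, 3, 4, 5]       # hypothetical data
score = [52, 55, 61, 64, 70]  # rises as hours rise
errors = [9, 8, 6, 5, 2]      # falls as hours rise

r_up = correlation(hours, score)     # close to +1: the variables move together
r_down = correlation(hours, errors)  # close to -1: one rises as the other falls

print(round(r_up, 3), round(r_down, 3), kurtosis(hours))
```

Covariance depends on the units of the variables, which is why the correlation divides it by both standard deviations to obtain a unit-free value.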


Data visualization
⚫ the process of creating and studying the visual
representation of data to bring some meaningful
insights
⚫ deals with visualizing the information in a given data
⚫ Benefits
   • Identifying red spots in data
   • Tracking and identifying relations among different
     attributes
   • Seeing the trend
   • Summarizing complicated long spreadsheets and
     databases into visual art
   • An easy-to-use and very impactful way to store and
     present information
Data visualization
⚫ Four types of presentation in data visualization:
Comparison, Relationship, Distribution, and
Composition
⚫ Comparison is used to see the differences
between multiple items at a given point in time. E.g.
line chart
⚫ Relationship helps in finding correlation between
two or more variables. E.g. scatter and bubble charts
⚫ Distribution charts like column and line
histograms show the spread of data. Skewness
toward left or right can be easily spotted.
⚫ Composition shows how a whole is made up of
multiple components, e.g. a pie chart or stacked
column chart
Boxplots
⚫ Boxplots are a compact way of representing
the five-number summary, namely the median, the first
and third quartiles (25th and 75th percentiles),
and the min and max.
⚫ The upper side of the vertical rectangular box
represents the third quartile and the lower side, the
first quartile. The difference between the two
points is known as the interquartile range,
which covers the middle 50% of the data.
⚫ A line dividing the rectangle represents the
median.
⚫ It also contains lines (known as whiskers)
extending from both sides of the rectangle
⚫ Points plotted beyond the ends of the
whiskers are called outliers.
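The numbers a boxplot draws can be computed directly, as in the sketch below. The slides do not say how far the whiskers extend; the code assumes the common convention of fences at 1.5 times the IQR beyond the quartiles, with anything outside the fences treated as an outlier. The function name `box_stats` and the data are hypothetical.

```python
from statistics import quantiles

def box_stats(values, k=1.5):
    """Five-number summary plus outliers under the assumed k * IQR fence rule."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": data[-1],
        "whiskers": (inside[0], inside[-1]),  # whisker ends: most extreme non-outliers
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

stats = box_stats([12, 15, 14, 10, 13, 16, 14, 42])  # hypothetical values
print(stats)
```

With this rule the value 42 falls outside the upper fence, so the upper whisker stops at 16 and 42 is drawn as a separate point.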


Line Chart
⚫ A line chart is a basic visualization chart type in which
information is displayed in a series of data points
connected by line segments. Line charts are used for
showing trends


Scatterplots
⚫ A scatterplot is a graph that helps identify if there is
a relationship between two variables.
⚫ Scatterplots use Cartesian coordinates to show two
variables on an x- and y-axis. Higher dimensional
scatterplots are also possible
⚫ If we add dimensions of color, shape or size, we
can present more than two variables


Correlation Plots
⚫ The best way to show how much one indicator relates
to another is by computing the correlation.
⚫ The combination of color, size, and position
encapsulates a numeric value into a visual
representation in a Correlation plot
⚫ Direction of the ellipse represents a positive or negative
correlation & size represents the value


⚫ Stacked column charts are an elegant way of showing
the composition of various categories that make up a
particular variable
⚫ A histogram is one of the most basic and easy to
understand graphical representations of numerical data.
⚫ It consists of rectangular bars. The width of each
rectangle covers a range of values and the height signifies the
number of data points within that range.
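The counting behind a histogram can be sketched without any plotting library. The function name `histogram`, the bin width of 10 and the marks are hypothetical choices for the example; each bin is half-open, so a value on a boundary goes into the bin that starts there.

```python
def histogram(values, start, width):
    """Count the values falling in each half-open bin [lo, lo + width)."""
    counts = {}
    for v in values:
        lo = start + int((v - start) // width) * width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))

marks = [35, 42, 47, 51, 55, 58, 63, 64, 71, 88]  # hypothetical exam marks
bins = histogram(marks, start=30, width=10)

# Text rendering: one row per bin, bar length equal to the count.
for lo, count in bins.items():
    print(f"[{lo}, {lo + 10}) {'#' * count}")
```

Changing `width` changes the shape of the picture: wider bins smooth the distribution, narrower bins show more detail but more noise.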


⚫ A Pie Chart is a type of graph that uses pie slices to
show relative sizes of data.
⚫ Heatmaps are visualizations of data where values are
represented as different shades of color: the darker the
shade, the higher the value.
⚫ Dendrograms are visual representations specifically
useful in clustering analysis. They are tree diagrams
frequently used to illustrate the formation of clusters
⚫ The y-axis in dendrograms shows the distance at which
clusters merge, indicating their closeness (or similarity).


High level Programming language
⚫ A high-level language (HLL) is a programming language that
enables a programmer to write programs that are independent
of a particular type of computer.
⚫ They are closer to human languages and further from machine
languages.
⚫ Assembly languages are considered low-level because they are
very close to machine languages.
⚫ Advantages of high-level languages
⚫ They are programmer-friendly: easy to
write, debug and maintain.
⚫ They provide a higher level of abstraction from machine languages.
⚫ They are machine-independent.
⚫ They are easy to learn.
⚫ They are less error-prone, and errors are easy to find and debug.
⚫ They result in better programming
productivity.


Integrated Development Environment (IDE)
⚫ An IDE enables programmers to consolidate the different
aspects of writing a computer program.
⚫ It is software for building applications that combines common
developer tools into a single graphical user interface (GUI)
⚫ Development tools include text editors, code libraries,
compilers and test platforms
⚫ An IDE typically offers
⚫ a text editor,
⚫ automated code validation,
⚫ syntax highlighting,
⚫ auto completion,
⚫ contextual suggestions,
⚫ easy access to help, and
⚫ debugging tools
