Data Analytics With Python Lecture 1
Learning objectives
1. Define data and its importance
2. Define data analytics and its types
3. Explain why analytics is important in today’s business environment
4. Explain how statistics, analytics and data science are interrelated
5. Why Python?
6. Explain the four different levels of Data:
– Nominal
– Ordinal
– Interval and
– Ratio
1. Define data and its importance
• Variable, Measurement and Data
• What is generating so much data?
• How does data add value to the business?
• Why is data important?
1.1 Variable, Measurement and Data
• Variable – a characteristic of any entity being studied that is capable of taking on different
values
• Measurement – the use of a standard process to assign numbers to particular attributes
or characteristics of a variable
• Data – recorded measurements
1.2 What is generating so much data?
• Data can be generated by
– Humans,
– Machines or
– Human-machine combinations
• Data can arise wherever information is produced and stored, in structured or
unstructured formats
1.4 Why is data important?
• Data helps one make better decisions
• Data helps one solve problems by finding the reason for underperformance
• Data helps one evaluate performance
• Data helps one improve processes
• Data helps one understand consumers and the market
2. Define data analytics and its types
• Define data analytics
• Why is analytics important?
• Data analysis
• Data analytics vs. Data analysis
• Types of Data analytics
2.1. Define data analytics
• Analytics is defined as “the scientific process of transforming data into insights for making better
decisions”
• Analytics is the use of data, information technology, statistical analysis, quantitative methods, and
mathematical or computer-based models to help managers gain improved insight about their
business operations and make better, fact-based decisions – James Evans
• Analysis = Analytics?
2.2 Why is analytics important?
• Opportunities abound for using analytics and big data, such as:
1. Determining credit risk
2. Developing new medicines
3. Finding more efficient ways to deliver products and services
4. Preventing fraud
5. Uncovering cyber threats
6. Retaining the most valuable customers
2.3 Data analysis
• Data analysis is the process of examining, transforming, and arranging raw data in a specific way to
generate useful information from it
• Data analysis allows for the evaluation of data through analytical and logical reasoning to lead to
some sort of outcome or conclusion in some context
• Data analysis is a multi-faceted process that involves a number of steps, approaches, and diverse
techniques
Analysis ≠ Analytics
Data Analysis ≠ Data analytics
Business Analysis ≠ Business analytics
2.5 Classification of Data analytics
Based on the phase of workflow and the kind of analysis required, there are four major types of data
analytics.
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
Descriptive Analytics
• Descriptive analytics is the conventional form of business intelligence and data analysis
• It seeks to provide a depiction or “summary view” of facts and figures in an understandable format
• It either informs or prepares data for further analysis
• Descriptive analysis or statistics can summarize raw data and convert it into a form that can be
easily understood by humans
• It can describe in detail an event that has occurred in the past
Example
A common example of descriptive analytics is a company report that simply provides a historical review,
such as:
• Data queries
• Reports
• Descriptive statistics
• Data visualization
• Data dashboards
Diagnostic analytics
• Diagnostic Analytics is a form of advanced analytics which examines data or content to answer the
question “Why did it happen?”
• Diagnostic analytical tools help an analyst dig deeper into an issue and arrive at the
source of a problem
• In a structured business environment, tools for descriptive and diagnostic analytics go
hand in hand
Example
• It uses techniques such as:
1. Data discovery
2. Data mining
3. Correlations
Predictive analytics
• Predictive analytics helps to forecast trends based on current events
• Predictive analytical models can estimate the probability that an event will happen in the future, or
when it is likely to happen
• Many different but co-dependent variables are analysed to predict a trend in this type of analysis
Example
• A set of techniques that use a model constructed from past data to predict the future or
ascertain the impact of one variable on another:
1. Linear regression
2. Time series analysis and forecasting
3. Data mining
Prescriptive analytics
• A set of techniques to indicate the best course of action
• It indicates which decision to make to optimize the outcome
• The goal of prescriptive analytics is to enable:
1. Quality improvements
2. Service enhancements
3. Cost reductions
4. Productivity increases
Prescriptive analytics: Example
• Optimization model
• Simulation
• Decision analysis
3. Explain why analytics is important
• Demand for data analytics
• Elements of data analytics
Get the dtype of each column
Pandas types versus Python types
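As a minimal sketch of the two headings above, assuming pandas is installed: the `dtypes` attribute reports the pandas type of each column. The toy DataFrame and its values are hypothetical; only the column names `year` and `lifeExp` appear elsewhere in these notes.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "year": [1952, 1957, 1962],
        "lifeExp": [28.8, 30.3, 32.0],
    }
)

# dtypes returns a Series mapping each column name to its pandas dtype
print(df.dtypes)

# Integer columns play the role of Python int, float columns of Python float;
# the text column holds Python str values
year_dtype = df["year"].dtype
life_dtype = df["lifeExp"].dtype
print(year_dtype, life_dtype)
```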
With iloc, we can pass in -1 to get the last row, something we couldn’t do with loc.
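The difference can be sketched as follows, assuming pandas and a DataFrame with the default integer index. The data values are hypothetical. `iloc` works by position, so -1 means "last row"; `loc` works by label, and no row is labelled -1 here.

```python
import pandas as pd

# Hypothetical toy data with the default index labels 0, 1, 2
df = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "year": [1952, 1957, 1962],
        "lifeExp": [28.8, 30.3, 32.0],
    }
)

# Position-based: -1 counts from the end, as in Python lists
last_row = df.iloc[-1]
print(last_row)

# Label-based: there is no index label -1, so this lookup fails
try:
    df.loc[-1]
    loc_worked = True
except KeyError:
    loc_worked = False
print("loc[-1] worked:", loc_worked)
```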
Subsetting Columns
• The Python slicing syntax uses a colon, :
• If we have just a colon, the attribute refers to everything.
• So, if we just want to get the first column using the loc or iloc syntax, we can write something like
df.loc[:, [columns]] to subset the column(s).
# subset columns with loc # note the position of the colon # it is used to select all rows
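The comments above can be turned into a short runnable sketch, assuming pandas. The DataFrame values are hypothetical; the colon before the comma selects all rows, and the list after the comma names the columns.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame(
    {
        "year": [1952, 1957, 1962],
        "country": ["A", "B", "C"],
        "lifeExp": [28.8, 30.3, 32.0],
    }
)

# subset columns with loc
# note the position of the colon
# it is used to select all rows
subset_loc = df.loc[:, ["year", "lifeExp"]]

# the same idea with iloc uses integer positions instead of names;
# [0] selects the first column
subset_iloc = df.iloc[:, [0]]

print(subset_loc)
print(subset_iloc)
```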
Grouped Means
# For each year in our data, what was the average life expectancy?
# To answer this question, # we need to split our data into parts by year;
# then we get the 'lifeExp' column and calculate the mean
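One way to carry out these steps is with `groupby`, sketched here under the assumption that the data has `year` and `lifeExp` columns as the comments suggest. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two observations per year
df = pd.DataFrame(
    {
        "year": [1952, 1952, 1957, 1957],
        "lifeExp": [30.0, 40.0, 32.0, 44.0],
    }
)

# split the data into parts by year,
# take the 'lifeExp' column of each part,
# and calculate its mean
mean_by_year = df.groupby("year")["lifeExp"].mean()
print(mean_by_year)
```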
Visual Representation of the Data
• Histogram -- vertical bar chart of frequencies
• Frequency Polygon -- line graph of frequencies
• Ogive -- line graph of cumulative frequencies
• Pie Chart -- proportional representation for categories of
a whole
• Stem and Leaf Plot
• Pareto Chart
• Scatter Plot
Principles of Excellent Graphs
• The graph should not distort the data
• The graph should not contain unnecessary
adornments (sometimes referred to as chart junk)
• The scale on the vertical axis should begin at
zero
• All axes should be properly labelled
• The graph should contain a title
• The simplest possible graph should
be used for a given set of data
Lecture 4: Central Tendency and Dispersion
Lecture objectives • Central tendency • Measures of Dispersion
Median
• Middle value in an ordered array of numbers
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
• Unaffected by extremely large and extremely small values
Median: Computational Procedure
• First Procedure
– Arrange the observations in an ordered array
– If there is an odd number of terms, the median is the middle term of the ordered array
– If there is an even number of terms, the median is the average of the middle two terms
• Second Procedure
– The median’s position in an ordered array is given by (n+1)/2.
Median: Example with an Odd Number of Terms
Ordered Array 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• There are 17 terms in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9
• The median is the 9th term, 15.
• If the 22 is replaced by 100, the median is 15.
• If the 3 is replaced by -103, the median is 15.
Median: Example with an Even Number of Terms
Ordered Array 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21
• There are 16 terms in the ordered array
• Position of median = (n+1)/2 = (16+1)/2 = 8.5
• The median is between the 8th and 9th terms, 14.5
• If the 21 is replaced by 100, the median is 14.5
• If the 3 is replaced by -88, the median is 14.5
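Both examples can be checked with a short sketch that follows the (n+1)/2 position rule. The function name `median_by_position` is hypothetical; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics


def median_by_position(values):
    """Median via the (n+1)/2 position rule on an ordered array."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ordered array
    whole = int(position)
    if position == whole:
        # odd n: the median is the term at that position
        return ordered[whole - 1]
    # even n: average the two terms on either side of the position
    return (ordered[whole - 1] + ordered[whole]) / 2


odd_array = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even_array = odd_array[:-1]  # drop the 22 to get the 16-term example

print(median_by_position(odd_array))   # 9th term
print(median_by_position(even_array))  # between the 8th and 9th terms

# Replacing an extreme value leaves the median unchanged
odd_with_outlier = odd_array[:-1] + [100]
even_with_outlier = [-88] + even_array[1:]
print(median_by_position(odd_with_outlier))
print(median_by_position(even_with_outlier))
```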
Mode
• The most frequently occurring value in a data set
• Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio)
• Bimodal -- Data sets that have two modes
• Multimodal -- Data sets that contain more than two modes
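A small sketch using the standard library's `statistics.multimode` (Python 3.8+), which returns every most-frequent value and so also covers the bimodal case. The sample data are hypothetical.

```python
import statistics

# Nominal data: the mode is still defined, unlike the median
colours = ["red", "blue", "red", "green"]
colour_modes = statistics.multimode(colours)

# Bimodal data: two values tie for the highest frequency
scores = [1, 2, 2, 3, 3, 4]
score_modes = statistics.multimode(scores)

print(colour_modes)
print(score_modes)
print("bimodal:", len(score_modes) == 2)
```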
Percentiles
• Measures of central tendency that divide a group of data into 100 parts
• Example: 90th percentile indicates that at most 90% of the data lie below it, and at least 10% of
the data lie above it
• The median and the 50th percentile have the same value
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array
• Calculate the pth percentile location: $i = \frac{p}{100}(n)$
• Determine the percentile’s location and its value.
• If i is a whole number, the percentile is the average of the values at the i and (i+1) positions
• If i is not a whole number, the percentile is at the position given by the whole number portion of
(i+1) in the ordered array
Percentiles: Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile: $i = \frac{30}{100}(8) = 2.4$
• The location index, i, is not a whole number; i+1 = 2.4+1=3.4; the whole number portion is 3; the
30th percentile is at the 3rd location of the array; the 30th percentile is 13.
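The procedure and the worked example can be sketched in a few lines. The function name `percentile_by_location` is hypothetical, and the sketch assumes p is an integer strictly between 0 and 100 so that the whole-number check can use exact integer arithmetic. Library routines such as `statistics.quantiles` use interpolation and may return slightly different values.

```python
def percentile_by_location(values, p):
    """pth percentile via the location rule i = (p/100) * n."""
    if not values:
        raise ValueError("percentile of empty data is undefined")
    if not (isinstance(p, int) and 0 < p < 100):
        raise ValueError("this sketch assumes an integer p with 0 < p < 100")
    ordered = sorted(values)
    n = len(ordered)
    # i = p*n/100, split into whole part and remainder to avoid float rounding
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is a whole number: average the values at positions i and i+1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not a whole number: take the whole number portion of i+1
    # (1-based position whole+1, i.e. 0-based index whole)
    return ordered[whole]


raw_data = [14, 12, 19, 23, 5, 13, 28, 17]
p30 = percentile_by_location(raw_data, 30)
p50 = percentile_by_location(raw_data, 50)
print(p30)  # value at the 3rd location of the ordered array
print(p50)  # i = 4 is whole, so the 4th and 5th values are averaged
```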
Dispersion
• Measures of variability describe the spread or the dispersion of a set of data
• They indicate how reliable a measure of central tendency is
• They allow the dispersion of different samples to be compared
Interquartile Range
• Range of values between the first and
third quartiles
• Range of the “middle half”
• Less influenced by extremes
• Interquartile Range: Q = Q3 - Q1
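A brief sketch of Q3 - Q1 using the standard library. It assumes `statistics.quantiles` with its default interpolation method, which can give slightly different quartiles than the percentile location procedure in these notes. The function name `interquartile_range` and the outlier value are hypothetical; the data reuse the percentile example.

```python
import statistics


def interquartile_range(values):
    """Q3 - Q1, using the default method of statistics.quantiles."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


data = [5, 12, 13, 14, 17, 19, 23, 28]
# Same data with the largest value replaced by an extreme one
data_with_outlier = [5, 12, 13, 14, 17, 19, 23, 1000]

iqr = interquartile_range(data)
iqr_outlier = interquartile_range(data_with_outlier)
full_range = max(data) - min(data)
full_range_outlier = max(data_with_outlier) - min(data_with_outlier)

# The full range reacts strongly to the extreme value; the IQR does not
print(iqr, iqr_outlier)
print(full_range, full_range_outlier)
```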
Uses of Standard Deviation
• Indicator of financial risk
• Quality control
– Construction of quality control charts
– Process capability studies
• Comparing populations
– Household incomes in two cities
– Employee absenteeism at two plants
Lecture 5: Central Tendency and Dispersion - II
Coefficient of Variation
• Ratio of the standard deviation to the mean,
expressed as a percentage
• Measurement of relative dispersion
$C.V. = \frac{\sigma}{\mu}(100)$
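A minimal sketch of the formula, assuming the data are a whole population (hence `statistics.pstdev` for σ). The function name and the sample values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    mu = statistics.mean(values)
    if mu == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    sigma = statistics.pstdev(values)
    return sigma / mu * 100


# Hypothetical data with mean 5 and population standard deviation 2
sample = [2, 4, 4, 4, 5, 5, 7, 9]
cv = coefficient_of_variation(sample)
print(f"C.V. = {cv:.1f}%")
```

Because the result is a ratio, it allows the spread of data sets measured on different scales to be compared.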
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
– Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal shape
– Platykurtic: flat and spread out
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness
Skewness
The skewness of a distribution is measured by comparing the relative positions of the mean, median
and mode.
• Distribution is symmetrical
– Mean = Median = Mode
• Distribution skewed right
– Median lies between mode and mean, and mode is less than mean
• Distribution skewed left
– Median lies between mode and mean, and mode is greater than mean
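These rules can be sketched as a simple comparison of the three measures. The function name `skew_direction` and the data sets are hypothetical, and the rule is a rough guide rather than a formal skewness statistic.

```python
import statistics


def skew_direction(values):
    """Classify skew by the relative positions of mean, median and mode."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)
    if mean == median == mode:
        return "symmetrical"
    if mode < median < mean:
        return "skewed right"
    if mean < median < mode:
        return "skewed left"
    return "undetermined"


# Hypothetical data: a long tail of large values pulls the mean up
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the data above: the tail is on the small side
left_tailed = [16 - x for x in right_tailed]
balanced = [1, 2, 3, 3, 3, 4, 5]

print(skew_direction(right_tailed))
print(skew_direction(left_tailed))
print(skew_direction(balanced))
```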
Box and Whisker Plot
• Five specific values are used:
– Median, Q2
– First quartile, Q1
– Third quartile, Q3
– Minimum value in the data set
– Maximum value in the data set
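The five values can be collected with a short sketch. It assumes the default interpolation method of `statistics.quantiles` for Q1 and Q3, which may differ slightly from the percentile location procedure in these notes. The function name `five_number_summary` is hypothetical; the data reuse the 17-term ordered array from the median example.

```python
import statistics


def five_number_summary(values):
    """Minimum, Q1, median (Q2), Q3 and maximum of a data set."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "Q1": q1,
        "median": statistics.median(values),
        "Q3": q3,
        "max": max(values),
    }


ordered_array = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
summary = five_number_summary(ordered_array)
for name, value in summary.items():
    print(name, value)
```

A plotting library would draw the box from Q1 to Q3 with a line at the median, and whiskers extending toward the minimum and maximum.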