Data Analytics With Python Lecture 1

Uploaded by

Sukant Tekade
Copyright
© © All Rights Reserved

WEEK 01

Lecture 1: Introduction to data analytics


Objective of the course
• The principal focus of this course is to build conceptual understanding using simple and
practical examples rather than a repetitive, point-and-click mentality
• This course should make you comfortable using analytics in your career and your life
• You will learn how to work with real data; you may have learned many different
methodologies, but choosing the right methodology is what matters
• The danger in using quantitative methods does not generally lie in the inability to perform the
calculation
• The real threat is a lack of fundamental understanding of:
– Why to use a particular technique or procedure
– How to use it correctly and,
– How to correctly interpret the results

Learning objectives
1. Define data and its importance
2. Define data analytics and its types
3. Explain why analytics is important in today’s business environment
4. Explain how statistics, analytics and data science are interrelated
5. Why python?
6. Explain the four different levels of Data:
– Nominal
– Ordinal
– Interval and
– Ratio
Define Data and its importance
• Variable, Measurement and Data
• What is generating so much data?
• How does data add value to the business?
• Why is data important?
1.1 Variable, Measurement and Data
• Variable – a characteristic of any entity being studied that is capable of taking on different
values
• Measurement – the use of a standard process to assign numbers to particular attributes
or characteristics of a variable
• Data – recorded measurements
1.2 What is generating so much data?
• Data can be generated by
– Humans,
– Machines or
– Human-machine combinations
• It can be generated anywhere where any information is generated and stored in structured or
unstructured formats
1.4 Why is data important?
• Data helps in making better decisions
• Data helps in solving problems by finding the reason for underperformance
• Data helps one evaluate performance
• Data helps one improve processes
• Data helps one understand consumers and the market
2. Define data analytics and its types
• Define data analytics
• Why is analytics important?
• Data analysis
• Data analytics vs. Data analysis
• Types of Data analytics
2.1. Define data analytics
• Analytics is defined as “the scientific process of transforming data into insights for making better
decisions”
• Analytics is the use of data, information technology, statistical analysis, quantitative methods, and
mathematical or computer-based models to help managers gain improved insight about their
business operations and make better, fact-based decisions – James Evans
• Analysis = Analytics?
2.2 Why is analytics important?
• Opportunity abounds for the use of analytics and big data such as:
1. Determining credit risk
2. Developing new medicines
3. Finding more efficient ways to deliver products and services
4. Preventing fraud
5. Uncovering cyber threats
6. Retaining the most valuable customers
2.3 Data analysis
• Data analysis is the process of examining, transforming, and arranging raw data in a specific way to
generate useful information from it
• Data analysis allows for the evaluation of data through analytical and logical reasoning to lead to
some sort of outcome or conclusion in some context
• Data analysis is a multi-faceted process that involves a number of steps, approaches, and diverse
techniques

Analysis ≠ Analytics
Data Analysis ≠ Data analytics
Business Analysis ≠ Business analytics
2.5 Classification of Data analytics
Based on the phase of workflow and the kind of analysis required, there are four major types of data
analytics.
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
Descriptive Analytics
• Descriptive Analytics is the conventional form of Business Intelligence and data analysis
• It seeks to provide a depiction or “summary view” of facts and figures in an understandable format
• This either informs or prepares data for further analysis
• Descriptive analysis or statistics can summarize raw data and convert it into a form that can be
easily understood by humans
• They can describe in detail an event that has occurred in the past
Example
A common example of Descriptive Analytics is company reports that simply provide a historic review,
such as:
• Data Queries
• Reports
• Descriptive Statistics
• Data Visualization
• Data dashboard
Diagnostic analytics
• Diagnostic Analytics is a form of advanced analytics which examines data or content to answer the
question “Why did it happen?”
• Diagnostic analytical tools aid an analyst to dig deeper into an issue so that they can arrive at the
source of a problem
• In a structured business environment, tools for descriptive and diagnostic analytics go
hand in hand
Example
• It uses techniques such as:
1. Data Discovery
2. Data Mining
3. Correlations
Predictive analytics
• Predictive analytics helps to forecast trends based on the current events
• Predictive analytical models can estimate the probability of an event happening in the future, or
when it is likely to happen
• Many different but co-dependent variables are analysed to predict a trend in this type of analysis

Example
• Set of techniques that use a model constructed from past data to predict the future or
ascertain the impact of one variable on another:
1. Linear regression
2. Time series analysis and forecasting
3. Data mining

Prescriptive analytics
• Set of techniques to indicate the best course of action
• It tells what decision to make to optimize the outcome
• The goal of prescriptive analytics is to enable:
1. Quality improvements
2. Service enhancements
3. Cost reductions and
4. Increasing productivity
Prescriptive analytics: Example
• Optimization Model
• Simulation
• Decision Analysis
3. Explain why analytics is important
• Demand for Data Analytics
• Element of data Analytics

4. Data analyst and Data scientist


• The requisite skill set
• Difference between Data analyst and Data Scientist
6. Explain the four different levels of Data
• Types of Variables
• Levels of Data Measurement
• Compare the four different levels of Data: Nominal, Ordinal, Interval and Ratio
• Usage Potential of Various Levels of Data
• Data Level, Operations, and Statistical Methods
6.2 Levels of Data Measurement
• Nominal – Lowest level of measurement
• Ordinal
• Interval
• Ratio – Highest level of measurement
6.3.1 Nominal
• A nominal scale classifies data into distinct categories in which no ranking is implied
• Example: Gender, Marital Status

6.3.2 Ordinal scale


• An ordinal scale classifies data into distinct categories in which ranking is implied
• Example:
– Product satisfaction: Satisfied, Neutral, Unsatisfied
– Faculty rank: Professor, Associate Professor, Assistant Professor
– Student grades: A, B, C, D, F
6.3.3. Interval scale
• An interval scale is an ordered scale in which the difference between measurements is a
meaningful quantity but the measurements do not have a true zero point.
• Example – Temperature in Fahrenheit and Celsius – Y
6.3.4 Ratio scale
• A ratio scale is an ordered scale in which the difference between the measurements is a
meaningful quantity and the measurements have a true zero point.
• Example – Weight – Age – Salary
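
One way to see the difference between the interval and ratio scales is to change units and check what survives. The sketch below is illustrative only: the helper names `c_to_f` and `kg_to_lb` and the sample values are assumptions, not part of the lecture.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (interval scale: the zero is arbitrary)."""
    return celsius * 9 / 5 + 32


def kg_to_lb(kg):
    """Convert kilograms to pounds (ratio scale: zero means 'no weight')."""
    return kg * 2.20462


# Interval scale: differences are meaningful, ratios are not.
# 20 degrees C looks like "twice" 10 degrees C, but the ratio depends on the unit.
ratio_celsius = 20 / 10
ratio_fahrenheit = c_to_f(20) / c_to_f(10)   # 68 / 50
print(ratio_celsius, ratio_fahrenheit)

# Equal differences stay equal after conversion.
gap_low = c_to_f(20) - c_to_f(10)
gap_high = c_to_f(30) - c_to_f(20)
print(gap_low, gap_high)

# Ratio scale: a true zero makes ratios independent of the unit.
ratio_kg = 20 / 10
ratio_lb = kg_to_lb(20) / kg_to_lb(10)
print(ratio_kg, ratio_lb)
```

The temperature ratio changes with the unit while the weight ratio does not, which is why statements such as "twice as much" are only meaningful for ratio data.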

Lecture 2: Python – Fundamentals


Learning objectives:
1. Installing Python
2. Fundamentals of Python
3. Data Visualisation
Python Installation
Installation Process
Step 1: Type https://www.anaconda.com in the address bar of the web browser.
Step 2: Click on the download button
Step 3: Download the Python 3.7 version for Windows OS
Step 4: Double-click the downloaded file to run the installer
Step 5: Follow the instructions until the installation process is complete

About Jupyter Notebook


• Command mode allows you to edit the notebook as a whole
• To leave edit mode, press the Escape key
• Execution (three ways):
o Ctrl+Enter (runs the cell and keeps it selected)
o Shift+Enter (runs the cell and selects the next cell)
o Run button on the Jupyter interface
• A comment line is written with a preceding # symbol.

About Jupyter Notebook – Important shortcut keys
• A -> To create cell above
• Y -> For code cell
• B -> To create cell below
• D + D -> For deleting cell
• M -> For markdown cell
Fundamentals of Python
• Loading a simple delimited data file
• Counting how many rows and columns were loaded
• Determining which type of data was loaded
• Looking at different parts of the data by subsetting rows and columns

GET THE NUMBER OF ROWS AND COLUMNS

GET COLUMN NAMES

GET THE DTYPE OF EACH COLUMN PANDAS TYPES VERSUS PYTHON TYPES
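
The code cells that accompanied these headings did not survive in the slides. The following is a minimal sketch of the steps, assuming pandas is installed. The small tab-delimited table is hypothetical and stands in for the course's data file; only the column names (`country`, `continent`, `year`, `lifeExp`) are taken from the lecture, and every value is invented for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-delimited data standing in for the course's data file;
# the values are invented for illustration only.
raw = io.StringIO(
    "country\tcontinent\tyear\tlifeExp\n"
    "Alpha\tAsia\t1952\t40.0\n"
    "Alpha\tAsia\t1957\t42.0\n"
    "Beta\tEurope\t1952\t60.0\n"
    "Beta\tEurope\t1957\t62.0\n"
    "Gamma\tAfrica\t1952\t38.0\n"
    "Gamma\tAfrica\t1957\t41.0\n"
)

# Load a simple delimited file; sep="\t" says the delimiter is a tab
df = pd.read_csv(raw, sep="\t")

# Number of rows and columns, as a (rows, columns) tuple
print(df.shape)

# Column names
print(df.columns)

# dtype of each column
print(df.dtypes)
```

On pandas types versus Python types: pandas reports `int64` and `float64` where plain Python has `int` and `float`, and text columns usually appear as `object` or as a dedicated string dtype, depending on the pandas version.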

Looking At Columns, Rows, and Cells


• Get the country column and save it to its own variable
# Show the first 5 observations
# Show the last 5 observations
# Looking at country, continent, and year
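
The comments above can be turned into runnable code along these lines. This is a sketch under assumptions: the DataFrame is a hypothetical stand-in with invented values, and `country_df` and `subset` are illustrative variable names.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only
df = pd.DataFrame({
    "country": ["Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"],
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa", "Africa"],
    "year": [1952, 1957, 1952, 1957, 1952, 1957],
    "lifeExp": [40.0, 42.0, 60.0, 62.0, 38.0, 41.0],
})

# Get the country column and save it to its own variable (a Series)
country_df = df["country"]

# Show the first 5 observations
print(country_df.head())

# Show the last 5 observations
print(country_df.tail())

# Looking at country, continent, and year: pass a list of column names
subset = df[["country", "continent", "year"]]
print(subset.head())
```

A single name in square brackets returns one column; a list of names returns a DataFrame with just those columns.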

Lecture 3: Python – Fundamentals – II

With iloc, we can pass in the -1 to get the last row—something we couldn’t do with loc.
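
A short sketch of this difference, assuming a DataFrame with the default integer index. The data is hypothetical and invented for illustration.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only
df = pd.DataFrame({
    "country": ["Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"],
    "year": [1952, 1957, 1952, 1957, 1952, 1957],
    "lifeExp": [40.0, 42.0, 60.0, 62.0, 38.0, 41.0],
})

# iloc is position-based, so -1 means "the last row"
last_row = df.iloc[-1]
print(last_row)

# loc is label-based; the default index has labels 0..5, so -1 is not a label
try:
    df.loc[-1]
except KeyError:
    print("loc[-1] raises KeyError: there is no row labelled -1")

# With loc, use the actual label of the last row instead
same_row = df.loc[df.index[-1]]
print(same_row)
```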
Subsetting Columns
• The Python slicing syntax uses a colon, :
• If we have just a colon, the attribute refers to everything.
• So, if we just want to get the first column using the loc or iloc syntax, we can write something like
df.loc[:, [columns]] to subset the column(s).

# subset columns with loc
# note the position of the colon
# it is used to select all rows
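
As an illustration of the colon syntax, the sketch below subsets columns by name with `loc` and by position with `iloc`. The DataFrame is a hypothetical stand-in with invented values.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only
df = pd.DataFrame({
    "country": ["Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"],
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa", "Africa"],
    "year": [1952, 1957, 1952, 1957, 1952, 1957],
    "lifeExp": [40.0, 42.0, 60.0, 62.0, 38.0, 41.0],
})

# Subset columns with loc: the colon before the comma selects all rows,
# the list after the comma names the columns to keep
subset_loc = df.loc[:, ["year", "lifeExp"]]

# The same subset with iloc, using integer positions instead of names
subset_iloc = df.iloc[:, [2, 3]]

# The first column only, kept as a one-column DataFrame
first_col = df.iloc[:, [0]]

print(subset_loc.head())
print(first_col.head())
```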
Grouped Means
# For each year in our data, what was the average life expectancy?
# To answer this question, we need to split our data into parts by year;
# then we get the 'lifeExp' column and calculate the mean
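
The split, select, and average steps described in the comments map onto a single `groupby` chain. This is a sketch on hypothetical data; only the column names `year` and `lifeExp` come from the lecture.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration only
df = pd.DataFrame({
    "country": ["Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"],
    "year": [1952, 1957, 1952, 1957, 1952, 1957],
    "lifeExp": [40.0, 42.0, 60.0, 62.0, 38.0, 41.0],
})

# Split the data by year, take the 'lifeExp' column, and compute the mean
mean_life_by_year = df.groupby("year")["lifeExp"].mean()
print(mean_life_by_year)
```

The result is a Series indexed by year, with one mean per group.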
Visual Representation of the Data
• Histogram -- vertical bar chart of frequencies
• Frequency Polygon -- line graph of frequencies
• Ogive -- line graph of cumulative frequencies
• Pie Chart -- proportional representation for categories of a whole
• Stem and Leaf Plot
• Pareto Chart
• Scatter Plot
Principles of Excellent Graphs
• The graph should not distort the data
• The graph should not contain unnecessary adornments (sometimes referred to as chart junk)
• The scale on the vertical axis should begin at zero
• All axes should be properly labelled
• The graph should contain a title
• The simplest possible graph should be used for a given set of data
Lecture 4: Central Tendency and Dispersion
Lecture objectives
• Central tendency
• Measures of Dispersion

Measures of Central Tendency


• Measures of central tendency yield information about “particular places or locations in a group of
numbers.”
• A single number to describe the characteristics of a set of data
Summary statistics
• Central tendency or measures of location
– Arithmetic mean
– Weighted mean
– Median
– Percentile
• Dispersion
– Skewness
– Kurtosis
– Range
– Interquartile range
– Variance
– Standard score
– Coefficient of variation
Arithmetic Mean
• Commonly called ‘the mean’
• It is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data set, including extreme values
• Computed by summing all values in the data set and dividing the sum by the number of values in
the data set
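
A small sketch of the computation, using made-up numbers, that also shows how a single extreme value moves the mean:

```python
import statistics

values = [3, 4, 5, 7, 8]  # made-up sample for illustration

# Sum all values and divide the sum by the number of values
mean_manual = sum(values) / len(values)
mean_library = statistics.mean(values)
print(mean_manual, mean_library)

# Replace the largest value with an extreme one: the mean shifts noticeably
with_outlier = [3, 4, 5, 7, 100]
mean_outlier = statistics.mean(with_outlier)
print(mean_outlier)
```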
Weighted Average
• Sometimes we wish to average numbers, but we want to assign more importance, or weight, to
some of the numbers.
• The average you need is the weighted average.
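
The slides do not give a formula here. A common definition, assumed in this sketch, is the sum of each value times its weight divided by the sum of the weights. The function name `weighted_mean` and the scores and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Return sum(value * weight) / sum(weights)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical example: three scores where the second counts the most
scores = [80, 90, 70]
weights = [2, 5, 3]

weighted = weighted_mean(scores, weights)
plain = sum(scores) / len(scores)
print(weighted, plain)
```

With equal weights the weighted mean reduces to the ordinary arithmetic mean.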

Median
• Middle value in an ordered array of numbers
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
• Unaffected by extremely large and extremely small values
Median: Computational Procedure
• First Procedure
– Arrange the observations in an ordered array
– If there is an odd number of terms, the median is the middle term of the ordered array
– If there is an even number of terms, the median is the average of the middle two terms
• Second Procedure
– The median’s position in an ordered array is given by (n+1)/2.
Median: Example with an Odd Number of Terms
Ordered Array 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• There are 17 terms in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9
• The median is the 9th term, 15.
• If the 22 is replaced by 100, the median is 15.
• If the 3 is replaced by -103, the median is 15.
Median: Example with an Even Number of Terms
Ordered Array 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21
• There are 16 terms in the ordered array
• Position of median = (n+1)/2 = (16+1)/2 = 8.5
• The median is between the 8th and 9th terms, 14.5
• If the 21 is replaced by 100, the median is 14.5
• If the 3 is replaced by -88, the median is 14.5
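
The two worked examples can be checked with a short sketch of the (n+1)/2 position rule. The function name `median_by_position` is an illustrative choice; the arrays are the ones from the examples above.

```python
import statistics


def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the ordered array."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2          # 1-based position of the median
    lower = int(position)           # whole-number part of the position
    if n % 2 == 1:
        return ordered[lower - 1]   # odd n: the middle term
    # even n: average the two terms on either side of the position
    return (ordered[lower - 1] + ordered[lower]) / 2


odd_array = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even_array = odd_array[:-1]  # the same array without the final 22

median_odd = median_by_position(odd_array)
median_even = median_by_position(even_array)
print(median_odd, median_even)

# Extreme values at either end do not move the median
median_high = median_by_position(odd_array[:-1] + [100])
median_low = median_by_position([-103] + odd_array[1:])
print(median_high, median_low)

# Cross-check against the standard library
print(statistics.median(odd_array), statistics.median(even_array))
```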
Mode
• The most frequently occurring value in a data set
• Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio)
• Bimodal -- Data sets that have two modes
• Multimodal -- Data sets that contain more than two modes
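
Because the mode only counts occurrences, it works for categories as well as numbers. A minimal sketch with made-up data, using `statistics.multimode` from the standard library (Python 3.8 or later):

```python
import statistics

# Nominal data: the mode is the most frequent category
colours = ["red", "blue", "red", "green"]
colour_modes = statistics.multimode(colours)
print(colour_modes)

# Bimodal data: two values share the highest frequency
numbers = [1, 2, 2, 3, 3, 4]
number_modes = statistics.multimode(numbers)
print(number_modes)
```

`multimode` returns a list, so a data set with one mode gives a one-element list and a bimodal data set gives two elements.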

Percentiles
• Measures of central tendency that divide a group of data into 100 parts
• Example: 90th percentile indicates that at most 90% of the data lie below it, and at least 10% of
the data lie above it
• The median and the 50th percentile have the same value
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array
• Calculate the pth percentile location: $i = \frac{p}{100}(n)$
• Determine the percentile’s location and its value.
• If i is a whole number, the percentile is the average of the values at the i and (i+1) positions
• If i is not a whole number, the percentile is at the position given by the whole-number part of (i+1) in the ordered array
Percentiles: Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile: $i = \frac{30}{100}(8) = 2.4$
• The location index, i, is not a whole number; i+1 = 2.4+1=3.4; the whole number portion is 3; the
30th percentile is at the 3rd location of the array; the 30th percentile is 13.
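
The computational procedure above can be sketched as follows. The function name `percentile_by_location` is hypothetical, and the sketch assumes an integer p strictly between 0 and 100 so that the location can be computed with exact integer arithmetic.

```python
def percentile_by_location(values, p):
    """pth percentile using the location rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # whole = whole-number part of i; remainder == 0 means i is a whole number
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is whole: average the values at positions i and i + 1 (1-based)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: take the value at position whole + 1 (1-based)
    return ordered[whole]


raw_data = [14, 12, 19, 23, 5, 13, 28, 17]

p30 = percentile_by_location(raw_data, 30)
p50 = percentile_by_location(raw_data, 50)
print(p30, p50)
```

For the raw data in the example, the 30th percentile comes out at the 3rd position of the ordered array, and the 50th percentile equals the median. Library functions often interpolate differently, so their results may not match this rule exactly.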
Dispersion
• Measures of variability describe the spread or the dispersion of a set of data
• They indicate how reliable a measure of central tendency is
• They allow the dispersion of various samples to be compared

Measures of Variability or dispersion


Common Measures of Variability
• Range
• Inter-quartile range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Z scores
• Coefficient of Variation
Range – ungrouped data
• The difference between the largest and the smallest values in a set of data
• Simple to compute
• Ignores all data points except the two extremes
• Example: Range = Largest – Smallest = 48 - 35 = 13
Quartiles
• Measures of central tendency that divide a group of data into four subgroups
• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second quartile
• Q3: 75% of the data set is below the third quartile
• Q1 is equal to the 25th percentile
• Q2 is located at 50th percentile and equals the median
• Q3 is equal to the 75th percentile
• Quartile values are not necessarily members of the data set.

Interquartile Range
• Range of values between the first and third quartiles
• Range of the “middle half”
• Less influenced by extremes
• Interquartile range: Q = Q3 - Q1
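
A sketch of the range and the interquartile range on the eight values from the percentile example. It assumes quartiles are obtained with the same percentile location rule described earlier in the lecture; the helper name `percentile_by_location` is hypothetical.

```python
def percentile_by_location(values, p):
    """pth percentile (integer p, 0 < p < 100) via i = (p / 100) * n."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # i is whole: average the values at positions i and i + 1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: take the value at the next position
    return ordered[whole]


data = [5, 12, 13, 14, 17, 19, 23, 28]

# Range: largest minus smallest, using only the two extremes
data_range = max(data) - min(data)

# Quartiles as the 25th and 75th percentiles
q1 = percentile_by_location(data, 25)
q3 = percentile_by_location(data, 75)

# Interquartile range: spread of the middle half of the data
iqr = q3 - q1
print(data_range, q1, q3, iqr)
```

Here Q1 and Q3 fall between data points, which illustrates that quartile values are not necessarily members of the data set.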
Uses of Standard Deviation
• Indicator of financial risk
• Quality Control
– construction of quality control charts
– process capability studies
• Comparing populations
– household incomes in two cities
– employee absenteeism at two plants
Lecture 5: Central Tendency and Dispersion – II

Coefficient of Variation
• Ratio of the standard deviation to the mean, expressed as a percentage
• Measurement of relative dispersion

$C.V. = \frac{\sigma}{\mu}(100)$
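
A minimal sketch of the formula, treating each data set as a whole population so that the population standard deviation is used. The function name and both data sets are made up for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean, as a percentage."""
    mu = statistics.mean(values)
    if mu == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    sigma = statistics.pstdev(values)
    return sigma / mu * 100


# Two hypothetical data sets on very different scales
small_scale = [2, 4, 4, 4, 5, 5, 7, 9]
large_scale = [20, 40, 40, 40, 50, 50, 70, 90]

cv_small = coefficient_of_variation(small_scale)
cv_large = coefficient_of_variation(large_scale)
print(cv_small, cv_large)
```

The standard deviations differ by a factor of ten, yet the relative dispersion is the same, which is what makes the coefficient of variation useful for comparing data sets measured on different scales.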
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
– Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal shape
– Platykurtic: flat and spread out
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness

Skewness
The skewness of a distribution is measured by comparing the relative positions of the mean, median
and mode.
• Distribution is symmetrical
– Mean = Median = Mode
• Distribution skewed right
– Median lies between mode and mean, and mode is less than mean
• Distribution skewed left
– Median lies between mode and mean, and mode is greater than mean
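
The comparison rule can be sketched as a small classifier. This is a rule of thumb applied literally, not a formal skewness statistic; the function name `describe_skew` and the three samples are made up for illustration.

```python
import statistics


def describe_skew(values):
    """Classify skew by the relative positions of mean, median and mode."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)  # the first mode if there are several
    if mean == median == mode:
        return "symmetrical"
    if mode <= median <= mean and mode < mean:
        return "skewed right"
    if mean <= median <= mode and mean < mode:
        return "skewed left"
    return "undetermined"


symmetric = [1, 2, 3, 3, 3, 4, 5]        # mean = median = mode = 3
right_tail = [1, 2, 2, 3, 4, 5, 12]      # mode 2 < median 3 < mean
left_tail = [1, 8, 9, 10, 11, 11, 12]    # mean < median 10 < mode 11

print(describe_skew(symmetric))
print(describe_skew(right_tail))
print(describe_skew(left_tail))
```

Real data rarely gives exact equality of the three measures, so in practice a tolerance or a numerical skewness coefficient is usually preferred.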
Box and Whisker Plot
• Five specific values are used:
– Median, Q2
– First quartile, Q1
– Third quartile, Q3
– Minimum value in the data set
– Maximum value in the data set
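
The five values can be computed without drawing the plot. The sketch below assumes quartiles follow the percentile location rule from Lecture 4 and reuses the 17-term ordered array from the median example; the function names are hypothetical.

```python
def percentile_by_location(ordered, p):
    """pth percentile (integer p, 0 < p < 100) of an already ordered list."""
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # location is whole: average the values at positions i and i + 1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # location is not whole: take the value at the next position
    return ordered[whole]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    return (
        ordered[0],
        percentile_by_location(ordered, 25),
        percentile_by_location(ordered, 50),
        percentile_by_location(ordered, 75),
        ordered[-1],
    )


data = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
summary = five_number_summary(data)
print(summary)
```

In a box and whisker plot the box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.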
