S09 Notes


9/21/20

Seminar 9:
Descriptive Analysis

Descriptive Statistics

• Numbers that summarize, re-describe, and highlight data in meaningful ways.
• Ways to describe data:
– Measures of central tendency
• Identifying the central position of
data
– Measures of spread
• Identify the variability within the
data


Measures of Central Tendency


• Mean
– Average value
– For discrete (whole numbers) and continuous data (decimals).
– Mean is susceptible to outliers and skewness.
– Not for categorical data.
• Median
– Middle value.
– Divides the distribution into half. Half of the data points are less than
the median and the other half of them are more than the median.
– Less susceptible to outliers and skewness.
– Not applicable for categorical data.
• Mode
– Most frequently occurring value.
– Suitable for discrete, continuous, and categorical data.
– In some data, mode may not be a good representation of centrality.
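
A minimal sketch of the three measures, assuming a small made-up series of order quantities (the values and variable names are illustrative, not from the seminar data). It shows how a single outlier pulls the mean while the median barely moves, and that mode also works for categorical data.

```python
import pandas as pd

# Hypothetical sample: six order quantities, one of them an extreme outlier.
quantities = pd.Series([2, 3, 3, 4, 5, 100])

mean_value = quantities.mean()      # pulled upward by the outlier
median_value = quantities.median()  # middle of the sorted values
mode_values = quantities.mode()     # returned as a Series because ties are possible

# Mode is the only one of the three that applies to categorical data.
countries = pd.Series(["UK", "UK", "France"])  # hypothetical categories
country_mode = countries.mode()

print(mean_value, median_value, mode_values.tolist(), country_mode.tolist())
```

Here the mean lands far above most of the values, which is the sense in which it is susceptible to outliers.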

Measures of Spread / Dispersion


• If the spread around the mean is large, the mean may not be
representative of the data.
• Less variation/risk is preferred.

Range
– Difference between the highest and lowest values in the data.
Quartiles
– Three cut points (Q1, Q2, and Q3) that divide the sorted data into four
equal parts; Q2 is the median.
Variance
– Measures how widely the data spread around the center.
– Average squared difference between each value and the mean.
– Denotes the variability.
Standard Deviation
– Square root of the variance.
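
The measures above can be sketched with pandas on a small made-up series (values are hypothetical). One assumption to note: the "average squared difference" definition corresponds to ddof=0, whereas pandas defaults to the sample variance (ddof=1), so both are shown.

```python
import pandas as pd

# Hypothetical unit prices, chosen so the arithmetic is easy to follow.
prices = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

value_range = prices.max() - prices.min()         # highest minus lowest
q1, q2, q3 = prices.quantile([0.25, 0.5, 0.75])   # the three quartile cut points

pop_variance = prices.var(ddof=0)   # average squared difference from the mean
pop_std = prices.std(ddof=0)        # square root of that variance
sample_variance = prices.var()      # pandas default (ddof=1), slightly larger

print(value_range, q1, q2, q3, pop_variance, pop_std, sample_variance)
```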


Online Retail Example


• Look at the following data file:
OnlineRetail.csv

Online Retail Example


• Read in the data using the pandas read_csv() function.

• Understand the dimension or size of the given data using the shape attribute.

• Check the datatypes of all variables using the dtypes attribute.
– InvoiceNo etc. have been classified as object, in particular String object.
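
A minimal sketch of these three steps. Since OnlineRetail.csv itself is not reproduced in these notes, the example reads a hypothetical three-row stand-in from memory; the rows are made up and only the four column names mentioned in the seminar are used. With the real file you would pass "OnlineRetail.csv" to read_csv() instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for OnlineRetail.csv (made-up rows, assumed columns).
csv_text = """InvoiceNo,Description,Quantity,UnitPrice
536365,WHITE METAL LANTERN,6,3.39
C536379,Discount,-1,27.50
536381,HAND WARMER,12,1.85
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # one datatype per column

# InvoiceNo contains a non-numeric code here, so it is read as text,
# not as a number.
```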


Online Retail Example


• Utilize describe() from DataFrame to generate descriptive statistics that
summarize the central tendency and dispersion.
Ø As observed from the output, only numerical statistics are generated.
Non-numerical columns are not involved.
Ø It shows the count, mean, standard deviation, min, Q1, Q2, Q3, and max.
Ø Apart from generating numbers, the more important task is to make sense
out of these numbers.
Ø For example, looking at the Quantity column's mean and standard deviation,
what does it tell you?
Ø And why are the mean and standard deviation for UnitPrice so close to each
other? What could be the reason for this?
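
A short sketch of describe() on a hypothetical five-row frame (the values are made up; only the column names come from the seminar). It is meant to show which rows and columns the summary contains, not to reproduce the seminar's output.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns.
df = pd.DataFrame({
    "Description": ["LANTERN", "HAND WARMER", "MUG", "CLOCK", "PAPER CHAIN"],
    "Quantity": [6, 8, 2, 12, 480],
    "UnitPrice": [2.55, 3.39, 2.75, 7.65, 0.42],
})

summary = df.describe()
print(summary)

# The text column is skipped; each numeric column gets eight statistics.
print(list(summary.index))
```

In this made-up sample the single large Quantity value inflates the standard deviation well beyond the mean, which is the kind of pattern the questions above ask you to interpret.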

Skewness

Image source: https://upload.wikimedia.org/wikipedia/commons/c/cc/Relationship_between_mean_and_median_under_different_skewness.png



Online Retail Example


• Other basic statistics can be generated, like Pearson correlation.
• There appear to be weak negative correlations between pairs of variables.
Ø Individual statistics can be generated by calling:
§ mean()
§ var()
§ quantile() # percentiles; the NumPy equivalent is np.percentile()
§ median()
§ mode()
§ std()
§ count()
§ sum()
§ min()
§ max()
§ abs()
§ cov() # covariance matrix
§ kurt() # kurtosis value
§ skew() # skewness index, positive for right skew and negative for left skew.
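
A hedged sketch of a few of these calls on hypothetical numbers (made up so that price falls as quantity rises). corr() computes Pearson correlation by default; selecting the numeric columns first avoids passing text columns to it.

```python
import pandas as pd

# Hypothetical data, not the OnlineRetail values.
df = pd.DataFrame({
    "Description": ["A", "B", "C", "D", "E"],
    "Quantity": [1, 2, 3, 4, 50],
    "UnitPrice": [9.0, 7.5, 6.0, 4.0, 0.5],
})

numeric = df[["Quantity", "UnitPrice"]]

corr_matrix = numeric.corr()     # Pearson correlation matrix
cov_matrix = numeric.cov()       # covariance matrix
skewness = numeric.skew()        # positive means right skew
kurtosis = numeric.kurt()        # kurtosis value per column
quartiles = numeric["Quantity"].quantile([0.25, 0.5, 0.75])

print(corr_matrix)
print(skewness)
```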

Data Indexing
• Indexing refers to the position of a subset of data within an iterable
structure.
• Iterable means loop-able: you can write a for-loop to go from one element
to the next.
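
As a small illustration, a string is iterable, so a for-loop can walk through it one character at a time. The word used here is arbitrary; enumerate() is a built-in that pairs each element with its position.

```python
word = "pandas"  # arbitrary example string

# Visit each character together with its index position.
for position, character in enumerate(word):
    print(position, character)

# The same pairs collected into a list.
pairs = list(enumerate(word))
```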


String Revisit


String Indexing Revisit


• A string associates each character with an index number.
• Index numbers start from 0 at the leftmost character and increment by 1.
• Put the index number inside square brackets.
• To refer to a particular character, use the format:
– string_var_name[index]

String Indexing Common Mistake

• It is a common mistake to think that the first character of a String has
index number 1. That is wrong!
• The first character of a String has index number 0.
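
A quick sketch with an arbitrary string (the original slide's example is not reproduced here, so "Python" is an assumption):

```python
text = "Python"  # arbitrary example string

first = text[0]               # index 0 is the first character
second = text[1]              # index 1 is the second character, not the first
last = text[len(text) - 1]    # the last valid index is length minus 1

# Indexing past the end raises an IndexError.
try:
    text[len(text)]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, second, last, out_of_range)
```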


String Reverse Indexing Revisit

• Default indexing starts from the left.
• Reverse indexing starts from the right, using negative notation; the last
character has index -1.
• To refer to a particular character, use the format:
– string_var_name[-index]
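
A brief sketch of reverse indexing, again using an arbitrary string:

```python
text = "Python"  # arbitrary example string

last = text[-1]          # rightmost character
second_last = text[-2]   # one step to the left of it
first = text[-len(text)] # the most negative valid index reaches the first character

print(last, second_last, first)
```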


String Slicing Revisit


• Slicing extracts a subset of a string sequence.
• Syntax as follows:
– string_var_name[start: stop]
– start: index where extraction begins
– stop: index where extraction stops; the character at this index is excluded


String Slicing Common Mistake

• It is a common mistake to treat the stopping index as the last position
included in the extraction. That is wrong!
– In the example, index 5 (which corresponds to the character m) is excluded
from the extraction.
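
The original example image is not reproduced in these notes. As an assumption, the sketch below uses the string "Welcome", in which index 5 happens to be the character m, to show the same behaviour.

```python
word = "Welcome"  # assumed example string; index 5 is "m"

char_at_stop = word[5]      # the character sitting at the stopping index
up_to_five = word[0:5]      # indices 0, 1, 2, 3, 4 only
including_m = word[0:6]     # to include index 5, stop must be 6

print(char_at_stop, up_to_five, including_m)
```

The length of a slice is stop minus start, which is a handy way to check that the stopping index was not counted.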


String Slicing Revisit


• The default index for start is 0.
• If stop is not specified, the slice runs to the end of the string.
• Slicing also supports a step, with syntax as follows:
– string_var_name[start: stop: step]
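
A small sketch of the defaults and the step argument, using an arbitrary string:

```python
text = "DataFrame"  # arbitrary example string

head = text[:4]          # start omitted, so it defaults to 0
tail = text[4:]          # stop omitted, so it runs to the end
every_second = text[::2] # step of 2 takes indices 0, 2, 4, 6, 8
stepped = text[1:8:3]    # indices 1, 4, 7
reversed_text = text[::-1]  # a negative step walks from right to left

print(head, tail, every_second, stepped, reversed_text)
```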

Now, we look back to DataFrame


• Load a data file into the code.
• Then we will look at how to retrieve data from the dataset.

import pandas as pd
df = pd.read_csv("OnlineRetail.csv")


Try out the following pandas functions:

Function name                            Description
df                                       Display the content of the dataframe
df.head()                                See the first 5 records
df.tail()                                See the last 5 records
df.loc[0]                                See the first row of data
df.loc[1:3]                              See the second to fourth rows of data
df.loc[0, "InvoiceNo"]                   See data for row 0 in a particular column (InvoiceNo column in this example)
df.loc[0, ["InvoiceNo","Description"]]   See data for row 0 in 2 columns (InvoiceNo & Description columns)
df['colName']                            See only one column of information
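
To try these without the data file at hand, the sketch below builds a hypothetical five-row frame (made-up values, default 0-based index) and runs each call from the table. With the real data, df would come from pd.read_csv("OnlineRetail.csv") instead.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the seminar.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368", "536369"],
    "Description": ["LANTERN", "HAND WARMER", "MUG", "CLOCK", "PAPER CHAIN"],
    "Quantity": [6, 8, 2, 12, 3],
})

first_records = df.head()      # first 5 records by default
last_records = df.tail()       # last 5 records by default
first_row = df.loc[0]          # one row, returned as a Series
rows_1_to_3 = df.loc[1:3]      # rows labelled 1, 2 and 3
one_cell = df.loc[0, "InvoiceNo"]
two_cells = df.loc[0, ["InvoiceNo", "Description"]]
one_column = df["Quantity"]

print(first_row)
print(rows_1_to_3)
```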

DataFrame: useful row operations


• A row is accessed by its index; index 0 is the first row of data. Get a
specific row or rows (slicing) using loc[index]:
– Getting the first row: df.loc[0]
– Getting the second to fourth rows: df.loc[1:3]
Ø df.drop(number) # drop the row with that index label
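
A short sketch of drop() on a hypothetical three-row frame. One point worth seeing in code: by default drop() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical rows with the default 0-based index.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367"],
    "Quantity": [6, 8, 2],
})

remaining = df.drop(0)   # new frame without the row labelled 0

print(len(df), len(remaining))
print(remaining)
```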


DataFrame Slicing Common Mistake

• When using loc[start:stop], it is a common mistake to assume the stopping
index is excluded from the extraction, as it is in string slicing. That is wrong!
– In the example, index 3 (which corresponds to the 4th row) is included in
the extraction.
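
A minimal sketch contrasting the two behaviours on a hypothetical frame. iloc is not covered in this seminar; it is included here only as a comparison, because it slices by position and excludes the stop, like strings do.

```python
import pandas as pd

# Hypothetical rows with the default 0-based index.
df = pd.DataFrame({"Quantity": [6, 8, 2, 12, 3]})

by_label = df.loc[1:3]      # label-based: rows 1, 2 and 3 (stop included)
by_position = df.iloc[1:3]  # position-based: rows 1 and 2 (stop excluded)

print(list(by_label.index), list(by_position.index))
```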


Getting data from column(s)


• Getting one column

• Getting multiple columns

How can you get data from multiple rows and multiple columns?
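
One possible answer, sketched on a hypothetical frame: pass a row slice and a list of column names to loc together. The single-column and multi-column selections are shown first for comparison.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the seminar.
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["LANTERN", "HAND WARMER", "MUG", "CLOCK"],
    "Quantity": [6, 8, 2, 12],
})

one_column = df["Description"]                  # a Series
two_columns = df[["InvoiceNo", "Description"]]  # a DataFrame (note the double brackets)

# Multiple rows and multiple columns in one call: rows 1 to 3, two columns.
subset = df.loc[1:3, ["InvoiceNo", "Quantity"]]

print(subset)
```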


You have learnt...


1. Running descriptive analysis using the pandas library.
2. Selecting data.
3. Slicing data.
