Big Data
Lecture Notes
Week 5 (2024.03.31)
By Eunhui Kim
Statistical Data Types
Data Types
• Categorical (qualitative): Nominal, Ordinal
• Numerical (quantitative): Interval, Ratio
Categorical Data
• Categorical data are measurements expressed not in numbers but by means of a natural
language description
• Categorical data can take on numerical values (such as “1” indicating male and “2” indicating
female), but those numbers don’t have mathematical meaning
• Example: person’s gender, hometown, postcode, phone number, or favorite movie
• Two types of categorical data : nominal and ordinal data
Categorical Data : Nominal Data
• Nominal Variables have values that are ‘labels’ representing some category
• The values have no quantitative meaning and no relative ranking or order.
• Example:
‒ Gender (male, female)
‒ Nationality (British, American, Spanish,...)
‒ Genre/Style (Rock, Hip-Hop, Jazz, Classical,...)
• Mathematical features: values can only be compared for equality (=, ≠)
• Descriptive Statistics
‒ Frequencies : count how many you have in each category
‒ Proportions : determine how often something happens by dividing the frequency by the total number of events
‒ Percentages : transform the proportions to percentages by multiplying by 100
‒ Central point : you can determine the most common item by finding the mode
• Example: a bag of red, blue and green marbles
‒ Frequencies : 10 red, 15 blue, 5 green
‒ Proportions : total = 30, so the red proportion is 10/30, the blue proportion is 15/30 and the green proportion is 5/30
‒ Percentages : red is 100*10/30 ≈ 33.3%, blue is 100*15/30 = 50% and green is 100*5/30 ≈ 16.7%
‒ Central point : the mode (the most common marble in the bag) is blue
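The marble example can be reproduced with pandas. This is a minimal sketch; the variable names are chosen for illustration only.

```python
import pandas as pd

# The marble bag from the example: 10 red, 15 blue, 5 green.
marbles = pd.Series(["red"] * 10 + ["blue"] * 15 + ["green"] * 5)

frequencies = marbles.value_counts()                # count per category
proportions = marbles.value_counts(normalize=True)  # frequency / total
percentages = proportions * 100                     # proportions as percentages
mode = marbles.mode()[0]                            # most common category

print(frequencies, proportions, percentages, mode, sep="\n\n")
```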
• Visualization
• Dummy Variables
‒ In regression analysis, a dummy variable is one that takes a binary value (0 or 1) to indicate the absence or
presence of some categorical effect that may be expected to shift the outcome
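One common way to build dummy variables in pandas is `pd.get_dummies`. The sketch below assumes a hypothetical `genre` column; each row gets a 1 in exactly one of the new columns.

```python
import pandas as pd

# Hypothetical nominal variable (values are for illustration only).
df = pd.DataFrame({"genre": ["Rock", "Jazz", "Rock", "Classical"]})

# One 0/1 column per category.
dummies = pd.get_dummies(df["genre"], dtype=int)
print(dummies)
```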
Categorical Data : Ordinal Data
• Ordinal data is a type of categorical data in which the values follow a natural order
• The values do not have any quantitative meaning but have relative ranking or order
• There is no consistency in the relative distances between adjacent categories
‒ The difference in finishing between 1st and 2nd is not necessarily (and probably not) the same as the difference
between 2nd and 3rd
• Example:
‒ Opinion (agree, mostly agree, neutral, mostly disagree, disagree)
‒ Grade (A+, A, B+, … or 1st, 2nd, 3rd, …)
‒ Time of day (morning, noon, night)
‒ Ratings in restaurants
• Mathematical features: values can be compared for equality and for order (=, ≠, <, >)
• Descriptive Statistics
‒ Frequencies : count how many you have in each category
‒ Proportions : determine how often something happens by dividing the frequency by the total number of events
‒ Percentages : transform the proportions to percentages by multiplying by 100
‒ Central point : since there is an order to the data you can rank them and compute the median (or mode, but not
the mean) to find the central value.
‒ Summary statistics : as the data are ordered, you can use percentiles and the inter-quartile range to
summarize your data
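These statistics can be sketched with an ordered pandas `Categorical`. The answers below are made up for illustration. The median is taken as the middle element of the ranked answers, which works here because the number of answers is odd.

```python
import pandas as pd

# Ordered categories, from lowest to highest.
levels = ["disagree", "mostly disagree", "neutral", "mostly agree", "agree"]
# Hypothetical survey answers.
answers = ["agree", "neutral", "mostly agree", "agree",
           "disagree", "mostly agree", "agree"]
opinion = pd.Series(pd.Categorical(answers, categories=levels, ordered=True))

frequencies = opinion.value_counts(sort=False)                        # counts in category order
percentages = opinion.value_counts(normalize=True, sort=False) * 100

ranked = opinion.sort_values().reset_index(drop=True)  # rank by category order
median = ranked.iloc[len(ranked) // 2]                 # middle answer (odd count)
mode = opinion.mode()[0]                               # most frequent answer

print(frequencies, median, mode, sep="\n")
```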
• Visualization
• Dummy Variables
‒ In regression analysis, a dummy variable is one that takes a binary value (0 or 1) to indicate the absence or
presence of some categorical effect that may be expected to shift the outcome
Age   20’s  30’s  40’s  50’s
21    1     0     0     0
45    0     0     1     0
25    1     0     0     0
56    0     0     0     1
55    0     0     0     1
31    0     1     0     0
58    0     0     0     1
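The age table above could be produced as follows. This is a sketch that assumes the ages are first binned into decades with `pd.cut` and then encoded with `pd.get_dummies`; the bin edges and labels are chosen to match the table.

```python
import pandas as pd

ages = pd.Series([21, 45, 25, 56, 55, 31, 58], name="Age")

# Bin ages into decades: [20, 30), [30, 40), [40, 50), [50, 60).
decade = pd.cut(ages, bins=[20, 30, 40, 50, 60], right=False,
                labels=["20's", "30's", "40's", "50's"])

# One 0/1 column per decade, placed next to the original ages.
dummies = pd.get_dummies(decade.astype(str), dtype=int)
table = pd.concat([ages, dummies], axis=1)
print(table)
```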
Numerical Data
• Numerical data are measurements expressed in numbers rather than by means of a natural
language description
• They have a mathematical meaning
• Example: age, height, weight, number of students in a classroom
• Two types of numerical data : interval and ratio data
Numerical Data : Interval Data
• Interval data is measured numerical data that has equal distances between adjacent values
• There is order and the difference between two values is meaningful but not their ratio
• Example:
‒ temperature (Fahrenheit)
‒ temperature (Celsius)
‒ pH
‒ Dates (1066, 1492, 1776, ...)
• It does not have an inherently defined (true) zero value
‒ A temperature of 0 °C in a particular city does not mean that there is no temperature
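A small worked example shows why ratios are not meaningful on an interval scale. It compares 10 °C and 20 °C (values chosen for illustration): the difference converts consistently between units, but the "ratio" depends on where the scale puts its zero.

```python
def c_to_f(c):
    """Convert Celsius to Fahrenheit."""
    return c * 9 / 5 + 32


def c_to_k(c):
    """Convert Celsius to Kelvin."""
    return c + 273.15


a_c, b_c = 10.0, 20.0

# Differences are meaningful: 10 degrees C always corresponds to 18 degrees F.
diff_c = b_c - a_c
diff_f = c_to_f(b_c) - c_to_f(a_c)

# Ratios change with the unit, so "twice as hot" is not a valid statement.
ratio_c = b_c / a_c                  # 2.0
ratio_f = c_to_f(b_c) / c_to_f(a_c)  # 68 / 50 = 1.36
ratio_k = c_to_k(b_c) / c_to_k(a_c)  # 293.15 / 283.15, about 1.035

print(diff_c, diff_f, ratio_c, ratio_f, ratio_k)
```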
• Mathematical features: values can be compared for equality and order, and differences are meaningful (=, ≠, <, >, +, −)
• Descriptive Statistics
‒ Central Point : Mean (for non-skewed data), Median (for skewed data), or (sometimes) Mode
‒ Range : Minimum and maximum
‒ Spread : percentiles, inter-quartile range and standard deviation
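These statistics are available directly on a pandas Series. The temperatures below are made up for illustration; the single large value shows how an outlier pulls the mean away from the median.

```python
import pandas as pd

# Hypothetical daily temperatures in Celsius, with one outlier (35.0).
temps_c = pd.Series([18.5, 21.0, 19.5, 22.0, 20.0, 35.0, 19.0])

mean = temps_c.mean()
median = temps_c.median()
minimum, maximum = temps_c.min(), temps_c.max()
q1, q3 = temps_c.quantile(0.25), temps_c.quantile(0.75)
iqr = q3 - q1        # inter-quartile range
std = temps_c.std()  # sample standard deviation

print(mean, median, minimum, maximum, iqr, std)
```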
• Visualization
Numerical Data : Ratio Data
• Ratio data is measured numerical data that has equal distances between adjacent values and a
meaningful zero
• Example:
‒ temperature (Kelvin) : 0 K = absolute zero
‒ Age
‒ Weight
‒ Distance (measured with a ruler or other such measuring device)
‒ Time interval (measured with a stop-watch)
• Mathematical features: values can be compared for equality and order, and both differences and ratios are meaningful (=, ≠, <, >, +, −, ×, ÷)
• Descriptive Statistics
‒ Central Point : Mean (for non-skewed data), Median (for skewed data), or (sometimes) Mode
‒ Range : Minimum and maximum
‒ Spread : percentiles, inter-quartile range and standard deviation
• Visualization
Practice
Emp_ID     City       Department  Designation      Last Accessed     Salary
(Nominal)  (Nominal)  (Nominal)   (Ordinal)        (Interval)        (Ratio)
2453       Mumbai     Marketing   Vice President   2023.10.03 16:30  125000
2589       Thane      Finance     General Manager  2023.10.01 20:00  80000
3048       Surat      HR          Junior Manager   2023.10.03 18:05  50000
2985       Chennai    Operations  Asst. Manager    2023.10.03 21:00  30000
Pandas
Combining data from multiple DataFrames
• concat() : when you want to stack DataFrames along a specific axis
• merge() : combine DataFrames based on specific columns
• join() : combine DataFrames based on their index labels
concat()
• Concatenating DataFrames along a specified axis (either rows or columns)
• Combining multiple data structures into a single data structure, either by stacking them on top of
each other (along rows) or side by side (along columns)
• Pass multiple objects as a list:
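A minimal sketch of stacking two DataFrames by passing them as a list. The contents of `df1` and `df2` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# Default: stack along rows (axis=0) and keep the original index labels.
stacked = pd.concat([df1, df2])
print(stacked)
```

Note that the original index labels (0, 1, 0, 1) are kept, so they are no longer unique.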
• ignore_index: If set to True, the resulting DataFrame will have a new index that ignores the
original index values of the concatenated objects
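A short sketch of `ignore_index=True`, using made-up DataFrames: the result gets a fresh 0..n-1 index.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# Discard the original index labels and renumber the rows.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)
```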
• axis: Specifies whether the concatenation should be performed along rows (axis=0) or columns
(axis=1)
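A sketch of concatenating side by side with `axis=1`, again with made-up DataFrames. Rows are aligned on their index labels.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"C": [5, 6], "D": [7, 8]})

# Place the columns of df2 to the right of the columns of df1.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```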
• join: how the concatenation handles labels on the other axis (the columns, when axis=0) that are not shared by all DataFrames
‒ 'outer' (default): Union of all columns, resulting in NaN for missing values
‒ 'inner': Intersection of columns, only including columns that exist in all DataFrames
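The two options can be compared on DataFrames that share only one column. This is a sketch with made-up data.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# 'outer' keeps every column and fills the gaps with NaN.
outer = pd.concat([df1, df2], join="outer", ignore_index=True)

# 'inner' keeps only the columns present in both DataFrames.
inner = pd.concat([df1, df2], join="inner", ignore_index=True)

print(outer)
print(inner)
```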
merge()
• Combining DataFrames by aligning rows based on columns (known as keys)
• Resulting DataFrame will contain data from the input DataFrames that match the specified keys
• on
‒ Columns on which the DataFrames should be joined
‒ You can specify one or more column names.
‒ If the column names are the same in both DataFrames, you can simply provide the column name as a string (or a list of names for several key columns)
‒ If the column names differ, use left_on for the key column(s) of the left DataFrame and right_on for those of the
right DataFrame instead of on
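Both cases can be sketched as follows. The DataFrames and column names (`emp_id`, `employee`, and so on) are hypothetical.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Ann", "Bob", "Cho"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4], "salary": [50000, 60000, 70000]})
payroll = pd.DataFrame({"employee": [1, 2, 4], "salary": [50000, 60000, 70000]})

# Same key name in both DataFrames: use on.
same_name = pd.merge(employees, salaries, on="emp_id")

# Different key names: use left_on and right_on.
diff_name = pd.merge(employees, payroll, left_on="emp_id", right_on="employee")

print(same_name)
print(diff_name)
```

With `left_on`/`right_on`, both key columns (`emp_id` and `employee`) appear in the result.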
• how: type of join to perform and can take one of the following values:
‒ inner: inner join, which returns only the rows that have matching keys in both DataFrames (default)
‒ outer: full outer join, which returns all rows from both DataFrames, filling in missing values with NaN
‒ left: left join, which returns all rows from the left DataFrame and the matching rows from the right DataFrame.
Rows of the left DataFrame without a match will have NaN in the columns coming from the right DataFrame
‒ right: right join, which is the opposite of a left join. It returns all rows from the right DataFrame and the matching
rows from the left DataFrame. Rows of the right DataFrame without a match will have NaN in the columns coming from the left DataFrame
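The four join types can be compared on two small DataFrames that overlap on keys 2 and 3. This is a sketch with made-up data and column names.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = pd.merge(left, right, on="key", how="inner")       # keys 2, 3
outer = pd.merge(left, right, on="key", how="outer")       # keys 1, 2, 3, 4
left_join = pd.merge(left, right, on="key", how="left")    # keys 1, 2, 3
right_join = pd.merge(left, right, on="key", how="right")  # keys 2, 3, 4

for name, df in [("inner", inner), ("outer", outer),
                 ("left", left_join), ("right", right_join)]:
    print(name)
    print(df)
```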
join()
• Combining DataFrames into a single DataFrame by aligning them on their index labels
• on: Column name(s) or index level name(s) in the calling DataFrame to match against the index of the other
DataFrame. If on is not specified, the join is performed based on the indices of both DataFrames.
• how: Determines the type of join to perform and can take one of the following values:
‒ inner, left (default), right, outer
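A sketch of `join()` with made-up DataFrames: first an index-on-index join with the default `how`, then an inner join, then `on` to match a column of the caller against the index of the other DataFrame.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
df2 = pd.DataFrame({"B": [10, 20]}, index=["y", "z"])

# Index-on-index join; how="left" is the default for join().
left_joined = df1.join(df2)

# Keep only index labels present in both DataFrames.
inner_joined = df1.join(df2, how="inner")

# on: match the "key" column of the caller against the index of df2.
orders = pd.DataFrame({"key": ["z", "y", "y"], "qty": [5, 6, 7]})
with_b = orders.join(df2, on="key")

print(left_joined, inner_joined, with_b, sep="\n\n")
```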
• lsuffix and rsuffix:
‒ These parameters are used when the DataFrames have columns with the same name
‒ If there are overlapping column names, you can specify suffixes to append to the columns from the calling
DataFrame (left) and the other DataFrame (right) to make them unique
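A sketch of the suffix parameters with two made-up DataFrames that both have a `value` column. Without suffixes, `join()` raises a `ValueError` for overlapping column names.

```python
import pandas as pd

df1 = pd.DataFrame({"value": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"value": [3, 4]}, index=["a", "b"])

# Append suffixes so the overlapping column names become unique.
joined = df1.join(df2, lsuffix="_left", rsuffix="_right")
print(joined)
```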
Python for Data Visualization
• Matplotlib: Python 2D plotting library
‒ line plots, scatter plots, bar charts, histograms, pie charts etc.
• Produces publication-quality figures in a variety of hardcopy formats
• A set of functionalities similar to those of MATLAB
• Relatively low-level; some effort needed to create advanced visualization
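A minimal Matplotlib sketch of a line plot with made-up data. Labels, title and legend are each set explicitly, which illustrates the relatively low-level style; the output file name is arbitrary.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Simple line plot")
ax.legend()

# Save as a vector PDF; other formats such as PNG or SVG work the same way.
fig.savefig("line_plot.pdf")
```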
• Seaborn: Python visualization library based on Matplotlib
• Provides a high-level interface for drawing attractive statistical graphics
• Similar (in style) to the popular ggplot2 library in R
Seaborn
load_dataset()
• Load an example dataset from the online repository (requires internet)
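A sketch of loading the `tips` example dataset. Because the call needs internet access (or a cached copy), the hypothetical helper `load_tips` falls back to a few made-up rows with the same column names so the sketch still runs offline.

```python
import pandas as pd
import seaborn as sns


def load_tips():
    """Return seaborn's 'tips' dataset, or a tiny made-up stand-in if unavailable."""
    try:
        return sns.load_dataset("tips")  # downloads or reads a cached copy
    except Exception:
        # Fallback rows are invented; only the column names follow the real dataset.
        return pd.DataFrame({
            "total_bill": [16.99, 10.34],
            "tip": [1.01, 1.66],
            "sex": ["Female", "Male"],
            "smoker": ["No", "No"],
            "day": ["Sun", "Sun"],
            "time": ["Dinner", "Dinner"],
            "size": [2, 3],
        })


tips = load_tips()
print(tips.head())
```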
• Tips dataset
Bar Plot : Average total bill per Day
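One possible way to draw this plot is `sns.barplot`, whose bar height is the mean of `y` within each `x` category by default. The DataFrame below is a made-up stand-in that uses the tips column names so the sketch runs offline; `sns.load_dataset("tips")` could be used instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in with the same column names as the tips dataset.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 15.0, 25.0],
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
})

fig, ax = plt.subplots()
# Each bar shows the mean total_bill for one day.
sns.barplot(data=tips, x="day", y="total_bill", ax=ax)
ax.set_title("Average total bill per day")
```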
Count Plot
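A count plot shows the number of rows in each category. A minimal sketch with `sns.countplot`, using a made-up stand-in for the tips `day` column:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the "day" column of the tips dataset.
tips = pd.DataFrame({"day": ["Thur", "Fri", "Fri", "Sat", "Sat", "Sat"]})

fig, ax = plt.subplots()
# One bar per day; bar height = number of rows for that day.
sns.countplot(data=tips, x="day", ax=ax)
```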
Line Plot
Box Plot
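A box plot summarizes each group by its median, quartiles and outliers. A minimal sketch with `sns.boxplot` on made-up data under the tips column names:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in with the tips column names.
tips = pd.DataFrame({
    "day": ["Thur"] * 4 + ["Fri"] * 4,
    "total_bill": [10.0, 12.5, 15.0, 30.0, 18.0, 20.0, 22.5, 25.0],
})

fig, ax = plt.subplots()
# One box per day, showing the distribution of total_bill.
sns.boxplot(data=tips, x="day", y="total_bill", ax=ax)
```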
Histogram
Kernel Density Estimation
Violin Plot
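A violin plot combines a box plot with a density estimate for each group. A minimal sketch with `sns.violinplot` on made-up data under the tips column names:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in with the tips column names.
tips = pd.DataFrame({
    "day": ["Thur"] * 5 + ["Fri"] * 5,
    "total_bill": [10.0, 12.5, 15.0, 17.0, 30.0, 18.0, 20.0, 22.5, 25.0, 27.0],
})

fig, ax = plt.subplots()
# One violin per day, showing the shape of the total_bill distribution.
sns.violinplot(data=tips, x="day", y="total_bill", ax=ax)
```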
Swarm Plot
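A swarm plot draws every observation as a point and shifts points sideways so they do not overlap. A minimal sketch with `sns.swarmplot` on made-up data under the tips column names:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in with the tips column names.
tips = pd.DataFrame({
    "day": ["Thur"] * 4 + ["Fri"] * 4,
    "total_bill": [10.0, 10.2, 15.0, 30.0, 18.0, 18.1, 22.5, 25.0],
})

fig, ax = plt.subplots()
# One point per row, grouped by day.
sns.swarmplot(data=tips, x="day", y="total_bill", ax=ax)
```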
Joint Plot
Linear Model Plot
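A linear model plot draws a scatter plot with a fitted regression line. A minimal sketch with `sns.lmplot` on made-up data under the tips column names:

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in with the tips column names.
tips = pd.DataFrame({
    "total_bill": [10.0, 12.5, 15.0, 18.0, 20.0, 22.5, 25.0, 30.0],
    "tip": [1.5, 2.0, 2.2, 3.0, 3.1, 3.5, 4.0, 5.0],
})

# Returns a FacetGrid: scatter points plus a linear regression fit.
g = sns.lmplot(data=tips, x="total_bill", y="tip")
```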
Relation Plot
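`sns.relplot` is a figure-level function for relationships between two numerical variables (scatter by default, or `kind="line"`). The sketch below uses made-up data under the tips column names and colors the points by `time`.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in with the tips column names.
tips = pd.DataFrame({
    "total_bill": [10.0, 12.5, 15.0, 18.0, 20.0, 22.5, 25.0, 30.0],
    "tip": [1.5, 2.0, 2.2, 3.0, 3.1, 3.5, 4.0, 5.0],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner",
             "Dinner", "Lunch", "Dinner", "Dinner"],
})

# Returns a FacetGrid; hue assigns one color per value of "time".
g = sns.relplot(data=tips, x="total_bill", y="tip", hue="time")
```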
Categorical Plot
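`sns.catplot` is a figure-level function for plots with a categorical axis; the `kind` parameter selects the plot type (for example "strip", "swarm", "box", "violin", "bar" or "count"). A minimal sketch on made-up data under the tips column names:

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in with the tips column names.
tips = pd.DataFrame({
    "day": ["Thur"] * 4 + ["Fri"] * 4,
    "total_bill": [10.0, 12.5, 15.0, 30.0, 18.0, 20.0, 22.5, 25.0],
})

# Returns a FacetGrid; kind="box" draws one box per day.
g = sns.catplot(data=tips, x="day", y="total_bill", kind="box")
```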
Assignment 5
Submission due : April 11th, 23:55
What to submit : Notebook file (.ipynb)
• Colab : [File]-[Download]-[Download .ipynb]
• Kaggle : [File]-[Download Notebook]
IMPORTANT
• Use the seaborn library
• The design of the graph such as color or width does not need to be the same
• The type of graph must be the same
• You don’t need to clean the dataset
Problem 1: Load the 'titanic' dataset from the online repository provided by the
seaborn library
• Requires internet
Problem 2: Visualize the number of survivors by gender
Problem 3: Visualize the number of survivors by passenger class
Problem 4: Visualize the number of people per passenger class by embarked port
Problem 5: Visualize survival rate by gender and passenger class
Problem 6: Visualize age distribution by embarked port and gender
Problem 7: Visualize the survival by gender and passenger class