Bigdata Week5 Lecture Note

The document covers statistical data types, categorizing them into categorical (nominal and ordinal) and numerical (interval and ratio) data, explaining their characteristics and examples. It also discusses data manipulation using the Pandas library, including methods for combining DataFrames such as concat(), merge(), and join(). Additionally, it introduces Python libraries for data visualization, particularly Seaborn, and outlines an assignment involving the visualization of the Titanic dataset.

Uploaded by

tuanntt


Big Data

Lecture Notes
Week 5 (2024.03.31)

By Eunhui Kim
[email protected]
Statistical Data Types

Data Types
• Categorical (qualitative)
‒ Nominal
‒ Ordinal
• Numerical (quantitative)
‒ Interval
‒ Ratio

Statistical Data Types
 Categorical Data
• Categorical data are measurements expressed not in numbers but by means of a natural-language
description
• Categorical data can take on numerical values (such as “1” indicating male and “2” indicating
female), but those numbers don’t have mathematical meaning

• Example: person’s gender, hometown, postcode, phone number, or favorite movie

• Two types of categorical data : nominal and ordinal data

Statistical Data Types
 Categorical Data : Nominal Data
• Nominal Variables have values that are ‘labels’ representing some category
• The values have no quantitative meaning and no relative ranking or order.
• Example:
‒ Gender (male, female)
‒ Nationality (British, American, Spanish,...)
‒ Genre/Style (Rock, Hip-Hop, Jazz, Classical,...)
Statistical Data Types
 Categorical Data : Nominal Data
• Mathematical features:
Statistical Data Types
 Categorical Data : Nominal Data
• Descriptive Statistics
‒ Frequencies : count how many you have in each category
‒ Proportions : determine how often something happens by dividing the frequency by the total number of events
‒ Percentages : transform the proportions to percentages by multiplying by 100
‒ Central point : you can determine the most common item by finding the mode

• Example: bag of red, blue and green marbles

‒ Frequencies : 10 red, 15 blue, 5 green
‒ Proportions : total = 30, red proportion is 10/30, blue proportion is 15/30 and green proportion is 5/30
‒ Percentages : percentage of red marbles is 100*10/30, blue marbles is 100*15/30 and green is 100*5/30
‒ Central point : the mode (the most common marble in the bag) is blue
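
The marble example can be reproduced with the Python standard library. This is a minimal sketch; the variable names are chosen here for illustration and are not from the lecture.

```python
from collections import Counter

# The bag from the example: 10 red, 15 blue and 5 green marbles
marbles = ["red"] * 10 + ["blue"] * 15 + ["green"] * 5

frequencies = Counter(marbles)                  # count per category
total = sum(frequencies.values())               # total number of marbles
proportions = {c: n / total for c, n in frequencies.items()}
percentages = {c: 100 * p for c, p in proportions.items()}
mode = frequencies.most_common(1)[0][0]         # most frequent category

print(frequencies)
print(proportions)
print(percentages)
print(mode)
```

Only counting is used here. No arithmetic is performed on the category labels themselves, which is what makes these statistics valid for nominal data.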
Statistical Data Types
 Categorical Data : Nominal Data
• Visualization
Statistical Data Types
 Categorical Data : Nominal Data
• Dummy Variables
‒ In regression analysis, a dummy variable is one that takes a binary value (0 or 1) to indicate the absence or
presence of some categorical effect that may be expected to shift the outcome
Statistical Data Types
 Categorical Data : Ordinal Data
• Ordinal data is a type of categorical data in which the values follow a natural order
• The values do not have any quantitative meaning but have relative ranking or order
• There is no consistency in the relative distances between adjacent categories
‒ The difference in finishing between 1st and 2nd is not necessarily (and probably not) the same as the difference
between 2nd and 3rd
• Example:
‒ Opinion (agree, mostly agree, neutral, mostly disagree, disagree)
‒ Grade (A+, A, B+, … or 1st, 2nd, 3rd, …)
‒ Time of day (morning, noon, night)
‒ Ratings in restaurants
Statistical Data Types
 Categorical Data : Ordinal Data
• Mathematical features:
Statistical Data Types
 Categorical Data : Ordinal Data
• Descriptive Statistics
‒ Frequencies : count how many you have in each category
‒ Proportions : determine how often something happens by dividing the frequency by the total number of events
‒ Percentages : transform the proportions to percentages by multiplying by 100
‒ Central point : since there is an order to the data you can rank them and compute the median (or mode, but not
the mean) to find the central value.
‒ Summary statistics : as the data are ordered, you can use percentiles and the inter-quartile range to
summarize your data
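
A minimal sketch of finding the median of ordinal data by ranking the categories. The category order is taken from the Opinion example; the survey responses are invented for illustration.

```python
from statistics import median_low

# Ordered categories, from lowest to highest agreement
levels = ["disagree", "mostly disagree", "neutral", "mostly agree", "agree"]
rank = {level: i for i, level in enumerate(levels)}

# Hypothetical survey responses
responses = ["agree", "neutral", "mostly agree", "disagree",
             "mostly agree", "agree", "neutral"]

# Rank the responses, take the middle rank, and map it back to its label.
# median_low always returns an observed value, so no averaging of ranks occurs.
ranks = sorted(rank[r] for r in responses)
median_level = levels[median_low(ranks)]

print(median_level)
```

The ranks serve only to order the responses. Averaging them would assume equal spacing between categories, which ordinal data does not guarantee.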
Statistical Data Types
 Categorical Data : Ordinal Data
• Visualization
Statistical Data Types
 Categorical Data : Ordinal Data
• Dummy Variables
‒ In regression analysis, a dummy variable is one that takes a binary value (0 or 1) to indicate the absence or
presence of some categorical effect that may be expected to shift the outcome

Age 20’s 30’s 40’s 50’s

21 1 0 0 0
45 0 0 1 0
25 1 0 0 0
56 0 0 0 1
55 0 0 0 1
31 0 1 0 0
58 0 0 0 1
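
The table above can be rebuilt with pandas. This sketch assumes `pd.cut` for binning ages into decades and `pd.get_dummies` for the 0/1 columns; the lecture does not say which functions were used.

```python
import pandas as pd

ages = pd.Series([21, 45, 25, 56, 55, 31, 58], name="Age")

# Bin each age into its decade; right=False gives intervals [20, 30), [30, 40), ...
decade = pd.cut(ages, bins=[20, 30, 40, 50, 60], right=False,
                labels=["20's", "30's", "40's", "50's"])

# One 0/1 column per decade (plain string labels keep the column index simple)
dummies = pd.get_dummies(decade.astype(str), dtype=int)
table = pd.concat([ages, dummies], axis=1)

print(table)
```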
Statistical Data Types
 Numerical Data
• Numerical data are measurements expressed not by means of a natural-language description, but
rather in numbers
• It has a mathematical meaning

• Example: age, height, weight, number of students in a classroom

• Two types of numerical data : interval and ratio data

Statistical Data Types
 Numerical Data : Interval Data
• Interval data is measured numerical data that has equal distances between adjacent values
• There is order and the difference between two values is meaningful but not their ratio
• Example:
‒ temperature (Fahrenheit)
‒ temperature (Celsius)
‒ pH
‒ Dates (1066, 1492, 1776, ...)

• It does not have an inherently defined zero value

‒ If the temperature of a particular city is 0° C then it does not mean that temperature does not exist
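
A short sketch of why ratios are not meaningful on an interval scale. The two temperatures are arbitrary values picked for illustration.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (scale with a true zero)."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0

difference_c = t2_c - t1_c       # differences are meaningful on an interval scale
naive_ratio = t2_c / t1_c        # 2.0, but 20 °C is not "twice as hot" as 10 °C
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(difference_c, naive_ratio, round(kelvin_ratio, 3))
```

The Celsius ratio depends on where the scale happens to put its zero. The Kelvin ratio does not, because 0 K is a true zero.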
Statistical Data Types
 Numerical Data : Interval Data
• Mathematical features
Statistical Data Types
 Numerical Data : Interval Data
• Descriptive Statistics
‒ Central Point : Mean (not-skewed), Median (skewed), or (sometimes) Mode
‒ Range : Minimum and maximum
‒ Spread : percentiles, inter-quartile range and standard deviation
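
These statistics can be computed with the standard-library `statistics` module. A minimal sketch; the temperature readings are invented for illustration.

```python
import statistics

# Hypothetical daily temperatures in Celsius
temps = [12.0, 15.0, 14.0, 10.0, 18.0, 21.0, 15.0]

mean = statistics.mean(temps)
median = statistics.median(temps)
value_range = (min(temps), max(temps))

# quantiles(n=4) returns the three quartile cut points (Python 3.8+)
q1, q2, q3 = statistics.quantiles(temps, n=4)
iqr = q3 - q1                       # inter-quartile range
std = statistics.stdev(temps)       # sample standard deviation

print(mean, median, value_range, iqr, std)
```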
Statistical Data Types
 Numerical Data : Interval Data
• Visualization
Statistical Data Types
 Numerical Data : Ratio Data
• Ratio data is measured numerical data that has equal distances between adjacent values and a
meaningful zero
• Example:
‒ temperature (Kelvin) : 0 K = absolute zero
‒ Age
‒ Weight
‒ Distance (measured with a ruler or other such measuring device)
‒ Time interval (measured with a stop-watch)
Statistical Data Types
 Numerical Data : Ratio Data
• Mathematical features
Statistical Data Types
 Numerical Data : Ratio Data
• Descriptive Statistics
‒ Central Point : Mean (not-skewed), Median (skewed), or (sometimes) Mode
‒ Range : Minimum and maximum
‒ Spread : percentiles, inter-quartile range and standard deviation
Statistical Data Types
 Numerical Data : Ratio Data
• Visualization
Statistical Data Types
 Summary
Statistical Data Types
 Practice
Emp_ID   City     Department  Designation      Last Accessed     Salary
Nominal  Nominal  Nominal     Ordinal          Interval          Ratio
2453     Mumbai   Marketing   Vice President   2023.10.03 16:30  125000
2589     Thane    Finance     General Manager  2023.10.01 20:00  80000
3048     Surat    HR          Junior Manager   2023.10.03 18:05  50000
2985     Chennai  Operations  Asst. Manager    2023.10.03 21:00  30000
Pandas
 Combining data from multiple DataFrames
• concat() : when you want to stack DataFrames along a specific axis
• merge() : combine DataFrames based on specific columns
• join() : combine DataFrames based on their index labels
Pandas
 concat()
• Concatenating DataFrames along a specified axis (either rows or columns)
• Combining multiple data structures into a single data structure, either by stacking them on top of
each other (along rows) or side by side (along columns)
• Pass multiple objects as a list:
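
A minimal sketch of passing a list of DataFrames to `pd.concat`. The frames `df1` and `df2` are made-up examples.

```python
import pandas as pd

# Hypothetical frames with the same columns
df1 = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
df2 = pd.DataFrame({"id": [3, 4], "name": ["Cho", "Dev"]})

# Default axis=0: the rows of df2 are placed below the rows of df1
stacked = pd.concat([df1, df2])

print(stacked)
```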

Pandas
 concat()
• ignore_index: If set to True, the resulting DataFrame will have a new index that ignores the
original index values of the concatenated objects
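
A sketch of the effect of `ignore_index`, using two small made-up frames.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2]})
df2 = pd.DataFrame({"id": [3, 4]})

kept = pd.concat([df1, df2])                            # original labels are kept
renumbered = pd.concat([df1, df2], ignore_index=True)   # fresh 0..n-1 index

print(list(kept.index))
print(list(renumbered.index))
```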
Pandas
 concat()
• axis: Specifies whether the concatenation should be performed along rows (axis=0) or columns
(axis=1)
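
A sketch comparing the two axes on made-up frames that have different columns.

```python
import pandas as pd

left = pd.DataFrame({"name": ["Ann", "Bob"]})
right = pd.DataFrame({"score": [90, 85]})

by_rows = pd.concat([left, right], axis=0)   # stacked: 4 rows, NaN where a column is absent
by_cols = pd.concat([left, right], axis=1)   # side by side: 2 rows, aligned on the index

print(by_rows)
print(by_cols)
```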
Pandas
 concat()
• join: how the concatenation handles columns that do not appear in every DataFrame
‒ 'outer' (default): Union of all columns, resulting in NaN for missing values
‒ 'inner': Intersection of columns, only including columns that exist in all DataFrames
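
A sketch of the two `join` options on made-up frames that share only the `id` column.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
df2 = pd.DataFrame({"id": [3, 4], "city": ["Seoul", "Busan"]})

outer = pd.concat([df1, df2], join="outer")   # union of columns, gaps filled with NaN
inner = pd.concat([df1, df2], join="inner")   # only the columns present in both

print(list(outer.columns))
print(list(inner.columns))
```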
Pandas
 merge()
• Combining DataFrames by aligning rows based on columns (known as keys)
• Resulting DataFrame will contain data from the input DataFrames that match the specified keys
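
A minimal sketch of a key-based merge. The frames and the column name `emp_id` are invented for illustration.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Ann", "Bob", "Cho"]})
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [50000, 80000, 30000]})

# Rows are paired wherever emp_id matches; unmatched keys (1 and 4) are dropped
merged = pd.merge(employees, salaries, on="emp_id")

print(merged)
```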

Pandas
 merge()
• on
‒ Columns on which the DataFrames should be joined
‒ You can specify one or more column names.
‒ If the column names are the same in both DataFrames, you can simply provide the column name as a string
‒ If the key columns have different names, pass the left DataFrame's column(s) as left_on and the right
DataFrame's column(s) as right_on instead of using on.
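
A sketch of merging when the key columns are named differently. The frames and column names are hypothetical.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2], "name": ["Ann", "Bob"]})
payroll = pd.DataFrame({"employee": [1, 2], "salary": [50000, 80000]})

# The key is called emp_id on the left and employee on the right
merged = pd.merge(employees, payroll, left_on="emp_id", right_on="employee")

print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.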

Pandas
 merge()
• how: type of join to perform and can take one of the following values:
‒ inner: inner join, which returns only the rows that have matching keys in both DataFrames (default)
‒ outer: full outer join, which returns all rows from both DataFrames, filling in missing values with NaN
‒ left: left join, which returns all rows from the left DataFrame and the matching rows from the right DataFrame.
Left rows with no match get NaN in the columns that come from the right DataFrame
‒ right: right join, which is the mirror image of a left join. It returns all rows from the right DataFrame and the
matching rows from the left DataFrame. Right rows with no match get NaN in the columns that come from the left DataFrame

[Figure: diagrams of inner join (default), left join, right join and outer join]
Pandas
 merge()
• how: inner

Pandas
 merge()
• how: outer

Pandas
 merge()
• how: left

Pandas
 merge()
• how: right
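
The four `how` options can be compared side by side on one pair of frames. This is a sketch with made-up data in which keys `b` and `c` appear in both frames.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Run the same merge once per join type
results = {how: pd.merge(df1, df2, on="key", how=how)
           for how in ["inner", "outer", "left", "right"]}

for how, frame in results.items():
    print(how, frame["key"].tolist())
```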

Pandas
 join()
• Combining DataFrames into a single DataFrame by aligning them on their index labels

Pandas
 join()
• on: Specifies the column name(s) or index level(s) of the calling DataFrame that are matched against the
index of the other DataFrame. If on is not specified, the join is performed index-on-index.
• how: Determines the type of join to perform and can take one of the following values:
‒ inner, left, right, outer
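
A sketch of `DataFrame.join` with and without `on`. The frames are invented for illustration; in pandas, the default for `join()` is `how="left"`.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"y": [20, 30, 40]}, index=["b", "c", "d"])

left_join = df1.join(df2)                  # keeps every index label of df1
inner_join = df1.join(df2, how="inner")    # keeps only the labels found in both

# With `on`, a column of the caller is matched against the index of the other frame
orders = pd.DataFrame({"cust": ["b", "d", "b"], "qty": [1, 2, 3]})
with_y = orders.join(df2, on="cust")

print(left_join)
print(inner_join)
print(with_y)
```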

[Figure: diagrams of inner join, left join (default for join()), right join and outer join]
Pandas
 join()
• lsuffix and rsuffix:
‒ These parameters are used when the DataFrames have columns with the same name
‒ If there are overlapping column names, you can specify suffixes to append to the columns from the calling
DataFrame (left) and the other DataFrame (right) to make them unique
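
A sketch of resolving overlapping column names with suffixes. The frames and suffix strings are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"score": [90, 85]}, index=["Ann", "Bob"])
df2 = pd.DataFrame({"score": [70, 95]}, index=["Ann", "Bob"])

# Both frames have a "score" column, so join() needs suffixes to tell them apart;
# without them pandas raises a ValueError
joined = df1.join(df2, lsuffix="_midterm", rsuffix="_final")

print(joined)
```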

Python for Data Visualization

• Matplotlib : Python 2D plotting library
‒ line plots, scatter plots, barcharts, histograms, pie charts etc.
• Producing publication quality figures in a variety of hardcopy formats
• A set of functionalities similar to those of MATLAB

• Relatively low-level; some effort needed to create advanced visualization

Python for Data Visualization

• Seaborn : Python visualization library based on Matplotlib
• Provides high level interface for drawing attractive statistical graphics
• Similar (in style) to the popular ggplot2 library in R
Seaborn
 load_dataset()
• Load an example dataset from the online repository (requires internet)
Seaborn
 load_dataset()
• Tips dataset
Seaborn
 Bar Plot : Average total bill per Day
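
A minimal sketch of this plot with `sns.barplot`. To stay runnable offline it uses a small made-up stand-in that borrows two column names from the tips dataset; with internet access, `sns.load_dataset("tips")` would supply the real data.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips dataset (values are not real)
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 25.0, 35.0, 20.0, 40.0],
})

# barplot aggregates y within each x category; the default estimator is the mean
ax = sns.barplot(data=tips, x="day", y="total_bill")
ax.set_title("Average total bill per day")

# The bar heights correspond to these group means
means = tips.groupby("day")["total_bill"].mean()
print(means)
```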
Seaborn
 Count Plot
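
A sketch of a count plot with `sns.countplot`, again on a made-up stand-in rather than the real tips data.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up stand-in: one row per restaurant bill
tips = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sat", "Sat", "Sun", "Sun"]})

# countplot draws one bar per category, with height equal to the number of rows
ax = sns.countplot(data=tips, x="day")

# The same numbers, computed directly
counts = tips["day"].value_counts()
print(counts)
```

Unlike a bar plot, a count plot takes no y variable. It shows frequencies, which makes it suitable for nominal and ordinal data.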
Seaborn
 Line Plot
Seaborn
 Box Plot
Seaborn
 Histogram
Seaborn
 Histogram : Kernel Density Estimation
Seaborn
 Violin Plot
Seaborn
 Swarm Plot
Seaborn
 Joint Plot
Seaborn
 Linear Model Plot
Seaborn
 Relation Plot
Seaborn
 Categorical Plot
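
A sketch assuming "Categorical Plot" refers to seaborn's `catplot()`, a figure-level function whose `kind` argument selects the plot type (for example "strip", "box", "violin", "bar" or "count"). The data is a made-up stand-in.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up stand-in that borrows three column names from the tips dataset
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
    "total_bill": [20.0, 15.0, 30.0, 25.0, 22.0, 18.0, 40.0, 28.0],
})

# One box per (day, sex) combination; catplot returns a FacetGrid, not an Axes
g = sns.catplot(data=tips, x="day", y="total_bill", hue="sex", kind="box")
```

Because the result is a `FacetGrid`, passing `col=` or `row=` splits the same plot into one panel per category.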
Assignment 5
 Submission due : April 11th, 23:55
 What to submit : Notebook file (.ipynb)
• Colab : [File]-[Download]-[Download .ipynb]
• Kaggle : [File]-[Download Notebook]

 IMPORTANT
• Using the seaborn library
• The design of the graph such as color or width does not need to be the same
• The type of graph must be the same
• You don’t need to clean the dataset
Assignment 5
 Problem 1: Load the 'titanic' dataset from the online repository provided by the
seaborn library
• Requires internet
Assignment 5
 Problem 2: Visualize the number of survivors by gender
Assignment 5
 Problem 3: Visualize the number of survivors by passenger class
Assignment 5
 Problem 4: Visualize the number of people per passenger class by embarked port
Assignment 5
 Problem 5: Visualize survival rate by gender and passenger class
Assignment 5
 Problem 6: Visualize age distribution by embarked port and gender.
Assignment 5
 Problem 7: Visualize the survival by gender and passenger class
