Exploratory Data Analysis For Machine Learning

The document summarizes an exploratory data analysis of a dataset containing information on courses from the Udemy website between 2011-2020. It outlines plans to clean the data, explore key variables like number of subscribers over time, and conduct a hypothesis test to analyze if the average number of subscribers per course has significantly changed between 2011-2015 versus 2016-2020. Preliminary findings did not identify any outliers in course prices.

Uploaded by

alejuy153

2022

Rafael Rey
IBM Machine Learning
1/13/2022
Exploratory Data Analysis
Contents
Summary of the dataset
Preliminary plan for data exploration
Data cleaning and feature engineering
Key Findings and Insights
Other interesting hypotheses that can be formulated with this data set
Conducting a formal significance test for one of the hypotheses and discuss the results
Suggestions for next steps in analyzing this data
Summary

1. Summary of the dataset
The dataset contains information about courses published on the Udemy website between 2011
and 2020. It was obtained from Kaggle.com. For each course it includes, among other fields,
the title, the rating, the price, the number of subscribers, and the number of reviews.

2. Preliminary plan for data exploration

A data overview will be performed first, followed by data cleaning and feature engineering
for both the categorical and the numerical data. Once these steps are complete, a hypothesis
test will be carried out, and further questions will be left open for exploring the
relationships between the variables.

The first thing examined is the set of columns, of which only the necessary ones will be
kept.
The dataset has 13,608 rows and 20 columns:
'id' 'title' 'url' 'is_wishlisted' 'created' 'published_time'
'num_published_practice_tests' 'is_paid' 'num_subscribers' 'avg_rating'
'avg_rating_recent' 'rating'
'num_reviews' 'num_published_lectures' 'discount_price__amount'
'discount_price__currency' 'discount_price__price_string'
'price_detail__amount' 'price_detail__currency'
'price_detail__price_string'

Data columns (total 20 columns):

 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   id                             13608 non-null   int64
 1   title                          13608 non-null   object
 2   url                            13608 non-null   object
 3   is_paid                        13608 non-null   bool
 4   num_subscribers                13608 non-null   int64
 5   avg_rating                     13608 non-null   float64
 6   avg_rating_recent              13608 non-null   float64
 7   rating                         13608 non-null   float64
 8   num_reviews                    13608 non-null   int64
 9   is_wishlisted                  13608 non-null   bool
 10  num_published_lectures         13608 non-null   int64
 11  num_published_practice_tests   13608 non-null   int64
 12  created                        13608 non-null   object
 13  published_time                 13608 non-null   object
 14  discount_price__amount         12205 non-null   float64
 15  discount_price__currency       12205 non-null   object
 16  discount_price__price_string   12205 non-null   object
 17  price_detail__amount           13111 non-null   float64
 18  price_detail__currency         13111 non-null   object
 19  price_detail__price_string     13111 non-null   object
dtypes: bool(2), float64(5), int64(5), object(8)
memory usage: 1.9+ MB
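
An overview like the one above can be produced with a few pandas calls. The following is a minimal sketch, not the report's own code: the inline CSV is a made-up three-row stand-in for the Kaggle file, whose name and full contents are not given here.

```python
import io

import pandas as pd

# Made-up stand-in for the Kaggle CSV (hypothetical rows, subset of the real columns).
csv_text = """id,title,is_paid,num_subscribers,discount_price__amount
1,Course A,True,1200,455.0
2,Course B,False,300,
3,Course C,True,50,490.0
"""
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape              # number of rows and columns
column_names = list(df.columns)        # column names
missing_per_column = df.isna().sum()   # missing values per column

print(n_rows, n_cols)
print(column_names)
df.info()                              # dtypes and non-null counts per column
print(missing_per_column)
```

Comparing each non-null count with the total number of rows is a quick way to see which columns contain missing values.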

After data cleaning and feature engineering, only the following columns will remain:
num_subscribers, avg_rating, num_reviews, published_time, and discount_price__amount. The
hypothesis test carried out in this report needs only the num_subscribers and published_time
columns, but the other variables matter for the additional hypotheses raised in Section 5
(Other interesting hypotheses that can be formulated with this data set).

The description of the variables in the columns of interest is as follows:

Variable identifier      Variable name           Variable description

num_subscribers          Number of subscribers   Number of subscribers for each course on Udemy
avg_rating               Average rating          Average rating of each course (from 1 to 5 stars)
num_reviews              Number of reviews       Number of reviews for each course
published_time           Published time          Date the course was uploaded to the Udemy platform and made available for subscription
discount_price__amount   Discount price amount   Total price of each course, including any discounts that apply

The variable discount_price__amount will be renamed to 'price'.
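
Keeping only the columns of interest and renaming the price column could look like the sketch below. The two-row DataFrame is hypothetical, and the names df_raw and columns_of_interest are introduced only for illustration.

```python
import pandas as pd

# Hypothetical two-row sample carrying a few of the original columns.
df = pd.DataFrame({
    "id": [1, 2],
    "url": ["/course/a/", "/course/b/"],
    "num_subscribers": [1200, 300],
    "avg_rating": [4.5, 3.9],
    "num_reviews": [150, 20],
    "published_time": ["2014-03-01T10:00:00Z", "2018-07-15T08:30:00Z"],
    "discount_price__amount": [455.0, 490.0],
})

columns_of_interest = [
    "num_subscribers",
    "avg_rating",
    "num_reviews",
    "published_time",
    "discount_price__amount",
]

df_raw = df.copy()  # backup of the original data before dropping columns

# Select the columns of interest, then rename the price column.
df = df[columns_of_interest].rename(columns={"discount_price__amount": "price"})
print(df.columns.tolist())
```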

3. Data cleaning and feature engineering

First step: count the rows and columns. The count gave 13,608 rows and 20 columns.

Second step: print all column names.

Third step: check that all values in the 'id' column (the identifier of each Udemy course)
are unique, that is, that no 'id' value is repeated.

Fourth step: verify the data types, check for missing values, and check for Boolean columns
so that they can be converted to the numerical values 0 and 1. Verify that the number of
reviews of each course is less than or equal to its number of subscribers; if a course has
more reviews than subscribers, the corresponding row will be removed. Check that a course
listed as free has a price of 0 and that a paid course has a price other than 0. If rows are
removed because of these contradictions, the index of the remaining rows will have to be
reset.

Fifth step: make a backup of the original dataset in case it is needed later. Remove the
columns that are not necessary, as mentioned in the previous section, leaving only the
columns of interest.

Sixth step: verify that there are no outliers in the discount_price__amount column and, if
there are, correct them.
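
The uniqueness and consistency checks in the third and fourth steps can be sketched as follows. The four-row DataFrame is made up so that two rows violate the rules, and the mask names (reviews_ok, free_ok, paid_ok) are hypothetical.

```python
import pandas as pd

# Made-up rows: id 2 has more reviews than subscribers, id 4 is paid but priced at 0.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "is_paid": [True, True, False, True],
    "num_subscribers": [1200, 10, 300, 50],
    "num_reviews": [150, 25, 20, 5],
    "discount_price__amount": [455.0, 490.0, 0.0, 0.0],
})

# Third step: every course id should appear exactly once.
ids_are_unique = df["id"].is_unique

# Fourth step: convert the Boolean column to 0/1.
df["is_paid"] = df["is_paid"].astype(int)

# Consistency rules described in the fourth step.
reviews_ok = df["num_reviews"] <= df["num_subscribers"]
free_ok = (df["is_paid"] == 1) | (df["discount_price__amount"] == 0)  # free => price is 0
paid_ok = (df["is_paid"] == 0) | (df["discount_price__amount"] > 0)   # paid => price is not 0

# Drop contradictory rows and reset the index of the remaining ones.
clean = df[reviews_ok & free_ok & paid_ok].reset_index(drop=True)
print(ids_are_unique)
print(clean)
```

Passing drop=True to reset_index discards the old index instead of keeping it as an extra column.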
4. Key Findings and Insights
No outliers were found in the price of the courses.
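
The report does not say which outlier rule was applied. As an assumption, the sketch below uses the common 1.5 × IQR rule on a made-up price series that deliberately contains one extreme value.

```python
import pandas as pd

# Made-up prices; the last value is deliberately extreme.
prices = pd.Series([455.0, 490.0, 520.0, 480.0, 460.0, 510.0, 5000.0], name="price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(lower, upper)
print(outliers.tolist())
```

An empty result from this check would correspond to the finding that no outliers are present.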

The distribution of the number of subscribers shows that a small number of courses account
for a large share of all subscribers.

5. Other interesting hypotheses that can be formulated with this data set
With this data set, several hypothesis tests could be carried out to assess the validity of
the following statements:

The average number of Udemy subscribers per course changed significantly between the period
2011-2015 and the period 2016-2020.

The average price of a Udemy course changed significantly between 2011-2015 and 2016-2020.

The average revenue per course posted on Udemy changed significantly between 2011-2015 and
2016-2020.

The average rating of courses published on Udemy changed significantly between 2011-2015
and 2016-2020.

6. Significance test for one of the hypotheses

The first hypothesis will be analyzed, that is, whether the average number of subscribers
per Udemy course changed significantly between the period 2011-2015 and the period
2016-2020.

In this case, the null hypothesis is that there is no significant change in the average
number of subscribers between the two periods.

The alternative hypothesis is that the change in the average number of subscribers between
the two periods is statistically significant.

We will establish a confidence level of 95%, so that α, the probability of a Type I error,
is 5%.

The difference could be positive or negative: x̅1, the average number of Udemy subscribers
per course between 2011 and 2015, may be greater or less than x̅2, the average for the
2016-2020 period. The test is therefore two-sided, and the absolute value of the difference
x̅1 − x̅2 will be analyzed.
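
The report's own calculations are not reproduced in this text. As an illustration only, a two-sided test of this kind could be sketched as below, assuming a large-sample z-test with unequal variances. The function name two_sided_mean_test, the timestamp format, and the randomly generated records are all assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sided_mean_test(sample_1, sample_2, alpha=0.05):
    """Two-sided large-sample z-test for a difference in means (unequal variances)."""
    diff = mean(sample_1) - mean(sample_2)
    # Standard error of the difference, using each sample's own variance.
    se = sqrt(variance(sample_1) / len(sample_1) + variance(sample_2) / len(sample_2))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Confidence interval for the difference at level 1 - alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci, "reject_h0": p_value < alpha}


# Made-up (published_time, num_subscribers) records; the timestamp format is an assumption.
rng = random.Random(42)
records = [
    (f"{rng.randint(2011, 2020)}-01-01T00:00:00Z", rng.lognormvariate(7.0, 1.2))
    for _ in range(2000)
]

# Split the courses by publication year: 2011-2015 vs. 2016-2020.
period_1 = [n for ts, n in records if 2011 <= int(ts[:4]) <= 2015]
period_2 = [n for ts, n in records if 2016 <= int(ts[:4]) <= 2020]

for alpha in (0.05, 0.01):
    result = two_sided_mean_test(period_1, period_2, alpha=alpha)
    print(alpha, round(result["diff"], 2), round(result["p_value"], 4), result["reject_h0"])
```

In this formulation, the null hypothesis is rejected exactly when the confidence interval for the difference x̅1 − x̅2 excludes zero.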
Since x̅1 − x̅2 = 1257 lies in the interval from 836.54 to 1677, that is,
836.54 ≤ x̅1 − x̅2 ≤ 1677, the null hypothesis cannot be rejected. Therefore, it cannot be
ruled out that there is no significant change in the average number of subscribers between
the two periods.

The same calculation was repeated with an α of 1%.

The conclusion is the same as with an α of 5%, that is, the null hypothesis cannot be
rejected.

Therefore, at α levels of both 5% and 1%, for practical purposes it can be said that there
is no statistically significant difference in the average number of subscribers per course
on Udemy between the periods 2011-2015 and 2016-2020.

7. Suggestions for next steps in analyzing this data

As mentioned in Section 5, analyses of the other hypotheses raised could be carried out.

When performing a bivariate analysis, the following correlations were found:

                  num_subscribers   avg_rating   num_reviews   price

num_subscribers   1                 0.082050     0.784190      0.118411
avg_rating        0.082050          1            0.068626      0.119035
num_reviews       0.784190          0.068626     1             0.093628
price             0.118411          0.119035     0.093628      1
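
A correlation table of this shape can be computed with pandas. The sketch below assumes Pearson correlation, which is the default of DataFrame.corr(); the report does not state which coefficient was used, and the five rows are made up.

```python
import pandas as pd

# Made-up values for the four numerical columns of interest.
df = pd.DataFrame({
    "num_subscribers": [1200, 300, 50, 8000, 640],
    "avg_rating": [4.5, 3.9, 4.1, 4.7, 4.0],
    "num_reviews": [150, 20, 5, 900, 70],
    "price": [455.0, 490.0, 0.0, 520.0, 480.0],
})

# Pairwise correlation matrix (Pearson by default).
corr = df.corr()
print(corr.round(6))
```

The matrix is symmetric with ones on the diagonal, so only the values above or below the diagonal need to be read.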

8. Summary

• No missing data found.

• No outliers found.

• With a Type I error rate (α) of 5% and of 1%, no statistically significant difference was
found in the average number of subscribers per course when comparing the periods 2011-2015
and 2016-2020. This is because, while total subscribers to Udemy increased significantly
between the first and second periods, so did the number of courses posted on the site.

• The bivariate analysis suggests a strong relationship between the number of reviews and
the number of subscribers, as expected. The other correlations observed in the table were
weak.
