Exploratory Data Analysis For Machine Learning
Rafael Rey
IBM Machine Learning
1/13/2022
Exploratory Data Analysis
Contents
1. Summary of the dataset
2. Preliminary plan for data exploration
3. Data cleaning and feature engineering
4. Key Findings and Insights
5. Other interesting hypotheses that can be formulated with this data set
6. Conducting a formal significance test for one of the hypotheses and discussing the results
7. Suggestions for next steps in analyzing this data
8. Summary
1. Summary of the dataset
The dataset contains information about courses on the Udemy website between the
years 2011 and 2020. The data was obtained from Kaggle.com. It includes, among
other fields, the title, rating, and price of each course, along with its number of
subscribers and number of reviews.
First we examine the number of columns, so that only the necessary ones are
kept.
The dataset has 13608 rows and 20 columns:
'id' 'title' 'url' 'is_wishlisted' 'created' 'published_time'
'num_published_practice_tests' 'is_paid' 'num_subscribers' 'avg_rating'
'avg_rating_recent' 'rating' 'num_reviews' 'num_published_lectures'
'discount_price__amount' 'discount_price__currency'
'discount_price__price_string' 'price_detail__amount'
'price_detail__currency' 'price_detail__price_string'
After data cleaning and feature engineering, only the following columns will remain:
num_subscribers, avg_rating, num_reviews, published_time, and
discount_price__amount. The hypothesis test carried out in this report needs only
the num_subscribers and published_time columns, but the other variables matter
for the other hypotheses raised in the section "Other interesting hypotheses that can be
formulated with this data set".
Third step: check that all the values of the column 'id' (identifier of each Udemy course) are
unique (that is, no 'id' value appears more than once).
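The uniqueness check in this step could be sketched as follows. This is a minimal illustration in plain Python, not the code used for the report: the function name `find_duplicate_ids` and the sample rows are hypothetical, and the rows are assumed to be dictionaries keyed by column name. With pandas, `df['id'].is_unique` would give the same yes/no answer.

```python
from collections import Counter

def find_duplicate_ids(rows):
    """Return {id: count} for every 'id' value that appears more than once."""
    counts = Counter(row["id"] for row in rows)
    return {course_id: n for course_id, n in counts.items() if n > 1}

# Hypothetical sample rows (the real dataset has 13608 rows).
courses = [
    {"id": 101, "title": "Course A"},
    {"id": 102, "title": "Course B"},
    {"id": 101, "title": "Course A again"},
]

duplicates = find_duplicate_ids(courses)
print(duplicates)  # {101: 2}
```

An empty dictionary means every 'id' is unique.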
Fourth step: verify the data types, check for missing values, and convert any
Boolean values to the numerical values 0 and 1. Verify that the number of reviews
of each course is less than or equal to its number of subscribers; if a course has
more reviews than subscribers, the corresponding row will be eliminated. Check
that a course listed as free has a price of 0 and that a paid course has a nonzero
price. If rows are eliminated because of these contradictions, the remaining rows
will have to be reindexed.
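The checks in this step could look roughly like the sketch below. It rests on several assumptions that the report does not state: rows are dictionaries, the price is read from 'price_detail__amount', no values are missing, and "reindexing" means assigning consecutive positions starting at 0. The name `clean_rows` and the sample rows are hypothetical.

```python
def clean_rows(rows):
    """Apply the consistency checks and return the surviving rows, reindexed from 0."""
    kept = []
    for row in rows:
        # Convert Boolean values to 0/1 (builds a new dict, so the input is untouched).
        row = {k: int(v) if isinstance(v, bool) else v for k, v in row.items()}
        # A course cannot have more reviews than subscribers.
        if row["num_reviews"] > row["num_subscribers"]:
            continue
        price = row["price_detail__amount"]  # assumed price column
        # A free course must cost 0 and a paid course must not.
        if row["is_paid"] == 0 and price != 0:
            continue
        if row["is_paid"] == 1 and price == 0:
            continue
        kept.append(row)
    # Reindex: consecutive positions for the rows that remain.
    return [dict(row, index=i) for i, row in enumerate(kept)]

# Hypothetical sample rows.
sample = [
    {"id": 1, "is_paid": True, "num_subscribers": 100, "num_reviews": 10, "price_detail__amount": 20.0},
    {"id": 2, "is_paid": True, "num_subscribers": 5, "num_reviews": 9, "price_detail__amount": 20.0},
    {"id": 3, "is_paid": False, "num_subscribers": 50, "num_reviews": 2, "price_detail__amount": 0.0},
    {"id": 4, "is_paid": True, "num_subscribers": 50, "num_reviews": 2, "price_detail__amount": 0.0},
]

cleaned = clean_rows(sample)
print([(row["index"], row["id"]) for row in cleaned])  # [(0, 1), (1, 3)]
```

Rows 2 and 4 are dropped: the first has more reviews than subscribers, the second is marked as paid but has a price of 0.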
Fifth step: make a backup of the original dataset in case it is needed in
the future. Eliminate the columns that are not necessary, as mentioned in the previous point,
leaving only the columns of interest.
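A minimal sketch of the backup and column selection, assuming the rows are held in memory as dictionaries. The helper name `backup_and_select`, the sample row, and the timestamp format in it are hypothetical; only the five column names come from this report.

```python
import copy

# Columns the report keeps after cleaning.
COLUMNS_OF_INTEREST = [
    "num_subscribers", "avg_rating", "num_reviews",
    "published_time", "discount_price__amount",
]

def backup_and_select(rows, columns=COLUMNS_OF_INTEREST):
    """Return (backup, reduced): a deep copy of the input and a column-reduced copy."""
    backup = copy.deepcopy(rows)
    reduced = [{col: row[col] for col in columns} for row in rows]
    return backup, reduced

# Hypothetical single-row dataset for illustration.
raw = [{
    "id": 1, "title": "Course A", "num_subscribers": 100, "avg_rating": 4.5,
    "num_reviews": 10, "published_time": "2014-06-01T00:00:00Z",
    "discount_price__amount": 455.0,
}]

backup, reduced = backup_and_select(raw)
print(sorted(reduced[0]))
```

The deep copy keeps the backup independent of any later changes to the working data.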
Sixth step: check the discount_price__amount column for outliers and, if any
are found, correct them.
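The report does not say which outlier rule is applied. One common choice, used here purely as an assumption, is Tukey's rule: flag values more than 1.5 times the interquartile range below the first quartile or above the third. The function name `iqr_outliers` and the sample prices are hypothetical, and the sketch only detects outliers; how to correct them is left open.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical discount prices with one extreme value.
prices = [10, 12, 11, 13, 12, 11, 10, 200]
print(iqr_outliers(prices))  # [200]
```

An empty list corresponds to the situation reported later, where no outliers are found in the price.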
4. Key Findings and Insights
No outliers were found in the price of the courses.
From the distribution of the number of subscribers it can be deduced that very few courses
concentrate a large number of subscribers.
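One way to quantify this concentration is the share of all subscribers held by the top fraction of courses. The sketch below is illustrative only: the function name `top_share` and the subscriber counts are hypothetical, not values from the dataset.

```python
def top_share(subscribers, fraction=0.01):
    """Share of all subscribers held by the top `fraction` of courses."""
    ranked = sorted(subscribers, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # at least one course
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

# Hypothetical counts: one very popular course among ten.
counts = [10000, 50, 40, 30, 20, 10, 10, 10, 10, 10]
print(round(top_share(counts, fraction=0.1), 3))
```

A value close to 1 means a handful of courses hold almost all subscribers; for evenly spread subscribers the share equals the fraction itself.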
5. Other interesting hypotheses that can be formulated with this data set
The average number of Udemy subscribers per course has changed significantly between
the period 2011-2015 and the period 2016-2020.
The average price of a Udemy course has changed significantly in 2011-2015 vs. 2016-2020.
The average revenue per course posted on Udemy has changed significantly in 2011-2015 vs.
2016-2020.
The average rating of courses published on Udemy has changed significantly in the period
2011-2015 vs. the period 2016-2020.
6. Conducting a formal significance test for one of the hypotheses and discussing the results
In this case, the null hypothesis is that there is no significant change in the
average number of subscribers per course between the two periods.
The alternative hypothesis is that the change in the average number of subscribers
between those two periods is statistically significant.
We will establish a confidence level of 95%, so that α, the probability of a type I error, will be 5%.
The difference could be positive or negative (that is, x̅1, the average
number of Udemy subscribers per course between 2011 and 2015, may be greater or less than
x̅2, the average for the 2016-2020 period). Therefore, the absolute value of the
difference x̅1 - x̅2 will be analyzed.
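A two-sided test of this kind can be sketched as below. This is a textbook large-sample z-test for two independent means with unequal variances, offered as an assumption about the general approach rather than a reproduction of the report's calculations. The function name `two_sample_z_test` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z_test(sample_1, sample_2, alpha=0.05):
    """Two-sided large-sample test of H0: mu_1 == mu_2 (unequal variances)."""
    diff = mean(sample_1) - mean(sample_2)
    # Standard error of the difference between two independent sample means.
    se = sqrt(variance(sample_1) / len(sample_1) + variance(sample_2) / len(sample_2))
    z = diff / se
    # Two-sided critical value: alpha is split between both tails.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"diff": diff, "z": z, "z_crit": z_crit,
            "p_value": p_value, "reject_h0": abs(z) > z_crit}

# Hypothetical subscriber counts per course for two periods.
period_1 = [100, 102, 98, 101, 99] * 10
period_2 = [x + 50 for x in period_1]

same = two_sample_z_test(period_1, period_1)
shifted = two_sample_z_test(period_1, period_2)
print(same["reject_h0"], shifted["reject_h0"])  # False True
```

Comparing the absolute value of z with the critical value is what makes the test two-sided: the sign of x̅1 - x̅2 does not affect the decision.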
Since the value x̅1 - x̅2 = 1257 lies in the interval from 836.54 to 1677, that is, 836.54 ≤ x̅1 - x̅2 ≤
1677, the null hypothesis cannot be rejected. Therefore, it cannot be ruled out that there
is no significant change in the average number of subscribers between the two periods.
If we consider an α of 1%, we have the calculations shown below.
The conclusion is the same as with an α of 5%, that is, the null hypothesis cannot be rejected.
Therefore, with α levels of both 5% and 1%, for practical purposes it can be said that there is
no statistically significant difference in the average number of subscribers per course on
Udemy between the periods 2011-2015 and 2016-2020.
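The two significance levels differ only in how extreme the test statistic must be before the null hypothesis is rejected. The sketch below assumes a normal approximation (the report's own calculations may rest on a different distribution); the function name `two_sided_critical_value` is hypothetical.

```python
from statistics import NormalDist

def two_sided_critical_value(alpha):
    """Critical |z| for a two-sided test at significance level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.05, 0.01):
    z_crit = two_sided_critical_value(alpha)
    print(f"alpha = {alpha:.2f}: reject H0 when |z| > {z_crit:.3f}")
# alpha = 0.05: reject H0 when |z| > 1.960
# alpha = 0.01: reject H0 when |z| > 2.576
```

Because the 1% threshold is stricter than the 5% threshold, a null hypothesis that is not rejected at 5% cannot be rejected at 1% either.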
8. Summary
No outliers found.
With a type I error of 5% and of 1%, no statistically significant differences were found in
the average number of subscribers per course when comparing the periods 2011-2015
vs. 2016-2020. This is because while total subscribers to Udemy increased significantly
between the first and second periods, so did the number of courses posted on the site.
From the bivariate analysis it appears that there is a strong relationship between the
number of reviews and the number of subscribers, as expected. However, the other
correlations observed in the table were weak.
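The strength of a linear relationship such as the one between reviews and subscribers is commonly measured with the Pearson correlation coefficient. The sketch below assumes that is the measure behind the correlation table; the function name `pearson_r` and the sample values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values in which reviews grow in step with subscribers.
subscribers = [100, 200, 300, 400]
reviews = [10, 20, 30, 40]
print(round(pearson_r(subscribers, reviews), 3))  # 1.0
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 a weak one.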