Exploratory Data Analysis For Machine Learning

The document summarizes an exploratory data analysis of a dataset containing information on courses from the Udemy website between 2011-2020. It outlines plans to clean the data, explore key variables like number of subscribers over time, and conduct a hypothesis test to analyze if the average number of subscribers per course has significantly changed between 2011-2015 versus 2016-2020. Preliminary findings did not identify any outliers in course prices.

Uploaded by

alejuy153
Copyright
© © All Rights Reserved

2022

Rafael Rey
IBM Machine Learning
1/13/2022
Exploratory Data Analysis
Contents
Summary of the dataset
Preliminary plan for data exploration
Data cleaning and feature engineering
Key Findings and Insights
Other interesting hypotheses that can be formulated with this data set
Conducting a formal significance test for one of the hypotheses and discuss the results
Suggestions for next steps in analyzing this data
Summary

1. Summary of the dataset
The dataset contains information about courses published on the Udemy website between 2011 and
2020. It was obtained from the Kaggle.com website and includes, among other fields, the title of
the course, its rating, its price, the number of subscribers, and the number of reviews.

2. Preliminary plan for data exploration


A data overview will be performed first, followed by data cleaning and feature engineering for
both categorical and numerical data. Once these steps are complete, a hypothesis test will be
carried out, and open questions will be listed for further study of the relationships between
the variables.

We first examine the columns and keep only the ones that are needed.
The dataset has 13,608 rows and 20 columns:
'id' 'title' 'url' 'is_wishlisted' 'created' 'published_time'
'num_published_practice_tests' 'is_paid' 'num_subscribers' 'avg_rating'
'avg_rating_recent' 'rating'
'num_reviews' 'num_published_lectures' 'discount_price__amount'
'discount_price__currency' 'discount_price__price_string'
'price_detail__amount' 'price_detail__currency'
'price_detail__price_string'

Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   id                            13608 non-null  int64
 1   title                         13608 non-null  object
 2   url                           13608 non-null  object
 3   is_paid                       13608 non-null  bool
 4   num_subscribers               13608 non-null  int64
 5   avg_rating                    13608 non-null  float64
 6   avg_rating_recent             13608 non-null  float64
 7   rating                        13608 non-null  float64
 8   num_reviews                   13608 non-null  int64
 9   is_wishlisted                 13608 non-null  bool
 10  num_published_lectures        13608 non-null  int64
 11  num_published_practice_tests  13608 non-null  int64
 12  created                       13608 non-null  object
 13  published_time                13608 non-null  object
 14  discount_price__amount        12205 non-null  float64
 15  discount_price__currency      12205 non-null  object
 16  discount_price__price_string  12205 non-null  object
 17  price_detail__amount          13111 non-null  float64
 18  price_detail__currency        13111 non-null  object
 19  price_detail__price_string    13111 non-null  object
dtypes: bool(2), float64(5), int64(5), object(8)
memory usage: 1.9+ MB
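The non-null counts in this listing imply how many values are missing in the price-related columns. The short sketch below only does that arithmetic with the counts shown above; the variable names are illustrative and not part of the original analysis.

```python
# Counts copied from the column listing above.
total_rows = 13608
non_null = {
    "discount_price__amount": 12205,
    "price_detail__amount": 13111,
}

# Missing values = total rows minus non-null entries.
missing = {col: total_rows - count for col, count in non_null.items()}

# Share of rows affected, in percent.
missing_pct = {col: 100 * m / total_rows for col, m in missing.items()}

for col in missing:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]:.1f}%)")
```

With the data loaded in pandas, `df.isna().sum()` would give the same per-column counts directly.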

After the data cleaning and feature engineering, only the following columns will remain:
num_subscribers, avg_rating, num_reviews, published_time, and discount_price__amount. For the
hypothesis test carried out in this report, only the num_subscribers and published_time columns
are needed, but the other variables are relevant to the additional hypotheses raised in section 5
(Other interesting hypotheses that can be formulated with this data set).

The description of the variables in the columns of interest is as follows:

Variable identifier      Variable name           Variable description
num_subscribers          Number of subscribers   Number of subscribers for each course on Udemy
avg_rating               Average rating          Average rating of each course (from 1 to 5 stars)
num_reviews              Number of reviews       Number of reviews for each course
published_time           Published time          Date the course was uploaded to the Udemy platform and made available for subscription
discount_price__amount   Discount price amount   Total price of each course, including any discounts that apply

The variable discount_price__amount will be renamed to 'price'.

3. Data cleaning and feature engineering


First step: count the rows and columns. The dataset has 13,608 rows and 20 columns.

Second step: print all column names.

Third step: check that all values in the 'id' column (the identifier of each Udemy course) are
unique, that is, that no 'id' value is repeated.

Fourth step: verify the data types, check for missing values, and convert any Boolean values to
the numerical values 0 and 1. Verify that the number of reviews is less than or equal to the
number of subscribers for each course; if a course has more reviews than subscribers, the
corresponding row is removed. Check that a course listed as free has a price of 0 and that a paid
course has a price other than 0. If rows are removed because of these contradictions, the
remaining rows must be reindexed.

Fifth step: make a backup of the original dataset in case it is needed later. Remove the columns
that are not needed, as mentioned in the previous section, leaving only the columns of interest.

Sixth step: verify that there are no outliers in the discount_price__amount column and, if there
are, correct them.
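The uniqueness and consistency checks in the third and fourth steps can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the report: the four rows are made up, the column names follow the dataset (with discount_price__amount already renamed to 'price'), and the helper name is_consistent is hypothetical.

```python
# Hypothetical rows; the values are invented for illustration only.
rows = [
    {"id": 1, "is_paid": True,  "num_subscribers": 500, "num_reviews": 40, "price": 455.0},
    {"id": 2, "is_paid": False, "num_subscribers": 300, "num_reviews": 10, "price": 0.0},
    {"id": 3, "is_paid": True,  "num_subscribers": 20,  "num_reviews": 35, "price": 640.0},  # more reviews than subscribers
    {"id": 4, "is_paid": True,  "num_subscribers": 80,  "num_reviews": 5,  "price": 0.0},    # paid course with price 0
]

# Third step: every id must appear exactly once.
ids = [row["id"] for row in rows]
ids_are_unique = len(ids) == len(set(ids))


def is_consistent(row):
    """Return True if the row passes the fourth-step checks."""
    reviews_ok = row["num_reviews"] <= row["num_subscribers"]
    # Free courses must cost 0; paid courses must cost more than 0.
    price_ok = row["price"] > 0 if row["is_paid"] else row["price"] == 0
    return reviews_ok and price_ok


# Fourth step: drop contradictory rows and convert the Boolean to 0/1.
kept = [{**row, "is_paid": int(row["is_paid"])} for row in rows if is_consistent(row)]

# Reindex the remaining rows from 0.
cleaned = [{"index": i, **row} for i, row in enumerate(kept)]

print(ids_are_unique, [row["id"] for row in cleaned])
```

Rows with a missing price are not handled here; how they should be treated is a separate decision.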
4. Key Findings and Insights
No outliers were found in the price of the courses.
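The report does not state which outlier criterion was applied. One common choice, shown here purely as an assumption, is the interquartile-range (IQR) rule, which flags values more than 1.5 IQR below the first quartile or above the third quartile. The prices below are made up, and the function name iqr_outliers is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented prices, with one deliberately extreme value.
prices = [455, 455, 490, 520, 640, 455, 700, 480, 455, 9000]
print(iqr_outliers(prices))
```

An empty list from such a check on the real price column would be consistent with the finding above.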

The distribution of the number of subscribers shows that a small number of courses account for
a large share of all subscribers.

5. Other interesting hypotheses that can be formulated with this data set

With this dataset, several hypothesis tests could be carried out to assess the validity of the
following statements:

The average number of subscribers per Udemy course changed significantly between the period
2011-2015 and the period 2016-2020.

The average price of a Udemy course changed significantly between 2011-2015 and 2016-2020.

The average revenue per course posted on Udemy changed significantly between 2011-2015 and
2016-2020.

The average rating of courses published on Udemy changed significantly between the period
2011-2015 and the period 2016-2020.

6. Significance test for one of the hypotheses


The first hypothesis will be tested, that is, whether the average number of subscribers per
Udemy course changed significantly between the period 2011-2015 and the period 2016-2020.

The null hypothesis is that there is no significant change in the average number of subscribers
per course between the two periods.

The alternative hypothesis is that the change in the average number of subscribers per course
between the two periods is statistically significant.

We set a confidence level of 95%, so that α, the probability of a type I error, is 5%.

The difference could be positive or negative: x̅1, the average number of Udemy subscribers per
course between 2011 and 2015, may be greater or less than x̅2, the average for the 2016-2020
period. Therefore, the absolute value of the difference x̅1 - x̅2 will be analyzed.
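A two-sided comparison of two means of this kind can be sketched as below. This is a large-sample z-test with unequal variances, chosen here as an assumption; it is not necessarily the procedure used in the report. The eight rows are invented only to make the code run (far too few for a real z-test), the timestamp format is assumed to start with a four-digit year, and the names two_sided_mean_test and period_of are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def two_sided_mean_test(sample_1, sample_2, alpha=0.05):
    """Large-sample two-sided z-test for a difference in means."""
    n1, n2 = len(sample_1), len(sample_2)
    diff = mean(sample_1) - mean(sample_2)
    # Standard error of the difference, allowing unequal variances.
    se = math.sqrt(variance(sample_1) / n1 + variance(sample_2) / n2)
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Confidence interval for the difference at level 1 - alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci,
            "reject_h0": p_value < alpha}


def period_of(published_time):
    """Assign a period, assuming the string starts with the year."""
    year = int(published_time[:4])
    return "2011-2015" if year <= 2015 else "2016-2020"


# Invented (published_time, num_subscribers) pairs.
courses = [
    ("2012-05-01T10:00:00Z", 1200), ("2013-08-15T09:30:00Z", 800),
    ("2014-01-20T12:00:00Z", 1500), ("2015-11-02T08:00:00Z", 950),
    ("2016-03-10T14:00:00Z", 1100), ("2017-07-04T16:45:00Z", 700),
    ("2018-09-09T11:15:00Z", 1300), ("2019-12-25T18:00:00Z", 900),
]
early = [n for t, n in courses if period_of(t) == "2011-2015"]
late = [n for t, n in courses if period_of(t) == "2016-2020"]

result = two_sided_mean_test(early, late)
print(result)
```

Under this rule the null hypothesis is rejected when the p-value is below α, which is equivalent to 0 lying outside the confidence interval for x̅1 - x̅2.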
Since the value x̅1 - x̅2 = 1257 lies in the interval from 836.54 to 1677, that is,
836.54 ≤ x̅1 - x̅2 ≤ 1677, the null hypothesis cannot be rejected. Therefore, it cannot be ruled
out that the average number of subscribers did not change between the two periods.
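How this interval was constructed is not stated in the text, so the following reading is an assumption. Using only the three numbers quoted above, the interval is centered almost exactly on the observed difference, which is what a confidence interval for x̅1 - x̅2 would look like. Under that reading, the usual two-sided decision rule asks whether 0 lies inside the interval, since the observed difference always lies inside its own confidence interval. The sketch below performs only this arithmetic and does not reproduce the original test.

```python
# Values quoted in the text above.
observed_diff = 1257.0
lower, upper = 836.54, 1677.0

midpoint = (lower + upper) / 2    # center of the interval
half_width = (upper - lower) / 2  # margin on either side of the center

contains_observed = lower <= observed_diff <= upper
contains_zero = lower <= 0.0 <= upper

print(f"midpoint = {midpoint:.2f}, half-width = {half_width:.2f}")
print(f"contains observed difference: {contains_observed}")
print(f"contains zero: {contains_zero}")
```

Because 0 falls outside [836.54, 1677], a confidence-interval reading would point toward rejecting the null hypothesis at the 5% level, so the conclusion is worth re-checking against the original calculations. If the interval is instead an acceptance region built some other way, the conclusion stated above may stand.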

The same calculation is repeated with an α of 1%.

The conclusion is the same as with an α of 5%, that is, the null hypothesis cannot be rejected.

Therefore, at α levels of both 5% and 1%, for practical purposes it can be said that there is
no statistically significant difference in the average number of subscribers per course on
Udemy between the periods 2011-2015 and 2016-2020.

7. Suggestions for next steps in analyzing this data


As mentioned in section 5, the other hypotheses raised there could also be tested.

A bivariate analysis found the following correlations:

                 num_subscribers  avg_rating  num_reviews  price
num_subscribers  1                0.082050    0.784190     0.118411
avg_rating       0.082050         1           0.068626     0.119035
num_reviews      0.784190         0.068626    1            0.093628
price            0.118411         0.119035    0.093628     1
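The report does not say which correlation coefficient was used; the values above are assumed here to be Pearson coefficients, which is the default of pandas `DataFrame.corr()`. A minimal sketch of that coefficient for one pair of columns, using made-up numbers and a hypothetical helper named pearson:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented values for five courses.
num_subscribers = [100, 250, 400, 800, 1500]
num_reviews = [8, 20, 35, 60, 130]

r = pearson(num_subscribers, num_reviews)
print(round(r, 3))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 a weak one, which is how the table above is read.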

8. Summary

• No missing data found.

• No outliers found.

• With a type I error of 5% and of 1%, no statistically significant difference was found in
the average number of subscribers per course when comparing the periods 2011-2015 and
2016-2020. This is because, while total subscribers to Udemy increased significantly
between the first and second periods, so did the number of courses posted on the site.

• The bivariate analysis shows a strong relationship between the number of reviews and the
number of subscribers, as expected. The other correlations in the table are weak.
