Output - Exploration of Netflix Dataset in R
Overview
This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset, which can be found
here: https://www.kaggle.com/shivamb/netflix-shows, was
collected from Flixable, a third-party Netflix search engine.
This section consists of three parts: data reading, data cleaning and data visualisation.
Data Reading
Let's read the data and name it "netds" so that it is easier to refer to in later function calls. Below, we
have to write na.strings = c("", "NA") because some values in our data are empty or recorded as NA. If we do
not specify them at the beginning, in the reading function, we cannot detect the missing values in later steps.
'stringsAsFactors' is an argument to the 'data.frame()' function in R. It is a logical that indicates
whether strings in a data frame should be treated as factor variables or as just plain strings.
In the dataset there are 6234 observations of the following 12 variables describing the TV shows and movies:
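The reading code itself is not shown in this output, so here is a minimal sketch of the idea under stated assumptions: instead of the real Kaggle file, it parses a small made-up CSV string (csv_text and netds_demo are hypothetical names) to show what na.strings and stringsAsFactors do.

```r
# Hypothetical three-row sample standing in for the real CSV file.
csv_text <- "title,director,rating
Show A,Jane Doe,TV-MA
Show B,,NA
Show C,John Roe,"

# Treat both empty fields and the literal text NA as missing,
# and keep text columns as plain character vectors.
netds_demo <- read.csv(text = csv_text,
                       na.strings = c("", "NA"),
                       stringsAsFactors = FALSE)

# Count the missing values that were detected in each column.
colSums(is.na(netds_demo))
```

With the real data, the text argument would be replaced by the path to the downloaded file.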
library(plotly)
##
## Attaching package: 'plotly'
fig_table1
VARIABLES DESCRIPTION
listed_in Genre
Data Cleaning
As a first step of the cleaning part, we can remove unnecessary variables and parts of the data, such as the
show_id variable. The description variable will also not be used for the analysis or visualisation, but it can be
useful for further analysis or interpretation.
library(lubridate)
##
## Attaching package: 'lubridate'
Missing values can be a problem for the next steps. Therefore, we have to check them before the analysis, and
then we can fill in the missing values of some variables if necessary.
## Variable Missing.Values
## 1 type 0
## 2 title 0
## 3 director 1969
## 4 cast 570
## 5 country 476
## 6 date_added 11
## 7 release_year 0
## 8 rating 10
## 9 duration 0
## 10 listed_in 0
## 11 description 0
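A table like the one above can be built in a couple of lines. This is only a sketch on a made-up data frame (demo and missing_tbl are hypothetical names), not necessarily the code used for this report.

```r
# Hypothetical sample with a few missing values.
demo <- data.frame(
  type     = c("Movie", "TV Show", "Movie"),
  director = c("A", NA, NA),
  rating   = c("PG", NA, "R"),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
missing_tbl <- data.frame(
  Variable       = names(demo),
  Missing.Values = colSums(is.na(demo)),
  row.names      = NULL
)
missing_tbl
```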
We can clearly see that missing values occur in the director, cast, country, date_added and rating variables.
Since rating is a categorical variable with 14 levels, we can fill in (approximate) its missing values
with the mode.
## Variable Missing.Values
## 1 type 0
## 2 title 0
## 3 director 1969
## 4 cast 570
## 5 country 476
## 6 date_added 11
## 7 release_year 0
## 8 rating 0
## 9 duration 0
## 10 listed_in 0
## 11 description 0
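Base R has no built-in function for the statistical mode, so one possible way to do the filling is with table(). The vector below is made up for illustration and mode_value is a hypothetical name.

```r
# Hypothetical rating values with two gaps.
rating <- c("TV-MA", "PG", NA, "TV-MA", "R", NA)

# table() ignores NA by default; which.max() picks the most frequent level.
mode_value <- names(which.max(table(rating)))

# Replace the missing entries with the mode.
rating[is.na(rating)] <- mode_value
rating
```

If two levels are tied, which.max() returns the first one, so the choice is then arbitrary.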
Now we are going to drop the missing values at the points where it becomes necessary. We also drop duplicated
rows in the dataset based on the "title", "country", "type" and "release_year" variables.
library(dplyr)
##
## Attaching package: 'dplyr'
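Dropping duplicates on a subset of columns could look like the sketch below. It uses base R's duplicated() on made-up rows (demo, key_cols and demo_unique are hypothetical names); the report itself loads dplyr, so the original code may well differ.

```r
# Hypothetical sample: rows 1 and 2 agree on all four key columns.
demo <- data.frame(
  title        = c("A", "A", "B", "A"),
  country      = c("US", "US", "US", "UK"),
  type         = c("Movie", "Movie", "Movie", "Movie"),
  release_year = c(2018, 2018, 2018, 2018),
  rating       = c("PG", "R", "PG", "PG"),
  stringsAsFactors = FALSE
)

key_cols <- c("title", "country", "type", "release_year")

# duplicated() flags every repeat of a key combination after its first
# occurrence, so negating it keeps the first row of each combination.
demo_unique <- demo[!duplicated(demo[, key_cols]), ]
demo_unique
```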
The data cleaning process is done. Now we can start the visualisation.
In the first graph, the ggplot2 library is used and the data is visualised with a basic bar graph. In the code part,
some arguments of the functions will be described.
library(tibble)
library(dplyr)
library(ggplot2)
# Here we create a new table named "amount_by_type" and apply some filtering using the
# dplyr library. First, group_by() is used to select the grouping variable, and then
# summarise() with n() counts the number of TV Shows and Movies.
# In the ggplot2 library, the code is built in two parts. The first is ggplot(), where we
# specify arguments such as the data, the x and y axes and the fill type. Then we continue
# with + and add the type of graph using a geom_<graph type> function.
ggplotly(figure00, dynamicTicks = T)
[Figure: bar chart "Netflix Content by Type". x-axis: Movie, TV Show; y-axis: Amount of Netflix Content (0 to 4000).]
As we see above, there are more than twice as many Movies as TV Shows on Netflix.
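The counting and plotting steps described in the comments above can be sketched as follows. The data frame is made up, and figure_demo is a hypothetical name; the axis label and title are taken from the chart, and geom_bar() is an assumption based on the text calling this a basic bar graph.

```r
library(dplyr)
library(ggplot2)

# Hypothetical sample standing in for netds.
demo <- data.frame(
  type = c("Movie", "Movie", "TV Show", "Movie", "TV Show"),
  stringsAsFactors = FALSE
)

# group_by() picks the grouping variable; summarise() with n() counts rows per group.
amount_by_type <- demo %>%
  group_by(type) %>%
  summarise(count = n())

# ggplot() sets the data and aesthetics; the geom added with + sets the graph type.
figure_demo <- ggplot(amount_by_type, aes(x = type, y = count, fill = type)) +
  geom_bar(stat = "identity") +
  labs(x = "", y = "Amount of Netflix Content", title = "Netflix Content by Type")
```

Passing figure_demo to plotly's ggplotly(), as the report does with figure00, would turn it into an interactive chart.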
# 1: Split the countries in the country column (e.g. "United States, India, South Korea, China"
# becomes 'United States' 'India' 'South Korea' 'China') using the strsplit() function, and
# assign the result to "k" for future use.
# 2: Create a new data frame using the data.frame() function. The first column is type, the
# second is country. The type column is created with the rep() function, which replicates the
# values in netds$type according to the length of each element of k. The lengths come from
# sapply(), with k as its data: it calculates the length of each element of the list k.
# For the country column we just use unlist(), which simply converts the list to a vector
# while preserving all the atomic components.
# 4: Create a new grouped data frame named amount_by_country. The na.omit() function
# deletes the NA values in the country column. Then we group by country and type using
# group_by() (from the "dplyr" library), and use summarise() with n() to store the number
# of observations in the new "count" column.
# The reshape() function is used to reshape the grouped data, with amount_by_country as its
# data. In this function we specify the id variable, the name of the value variable, the
# time variable and the direction. Direction is a character string, partially matched to
# either "wide" to reshape to wide format, or "long" to reshape to long format. Then we
# apply arrange() to the reshaped data. The dplyr function arrange() reorders (sorts) rows
# by one or more variables. Here we sort the count.Movie column in descending order.
# For the arguments and detailed descriptions of these functions, please use the help menu
# or google.com.
# After arrange(), the top_n() function is used to list the specified number of rows.
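u <- reshape(data = data.frame(amount_by_country), idvar = "country",
             v.names = "count",
             timevar = "type",
             direction = "wide") %>%
  arrange(desc(count.Movie)) %>%
  # No wt is given, so top_n() ranks by the last column and can return
  # more than 10 rows when there are ties.
  top_n(10)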
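Steps 1 to 4 and the reshape can be tried end to end on a tiny example. The sketch below uses made-up rows and replaces the dplyr counting step with base R's aggregate(), so it is an approximation of the steps described above, not the original code; demo, long and wide are hypothetical names.

```r
# Hypothetical sample: one row lists two countries, one has no country.
demo <- data.frame(
  type    = c("Movie", "TV Show", "Movie", "Movie"),
  country = c("United States, India", "India", NA, "United States"),
  stringsAsFactors = FALSE
)

# 1: split each country string into a list of single countries.
k <- strsplit(demo$country, split = ", ")

# 2: repeat each type once per country of its row, and flatten the list.
long <- data.frame(
  type    = rep(demo$type, sapply(k, length)),
  country = unlist(k),
  stringsAsFactors = FALSE
)

# 4: drop rows without a country and count rows per country and type.
long <- na.omit(long)
long$count <- 1
amount_by_country <- aggregate(count ~ country + type, data = long, FUN = sum)

# Reshape to one row per country, with one count column per type.
wide <- reshape(amount_by_country, idvar = "country", v.names = "count",
                timevar = "type", direction = "wide")
wide <- wide[order(wide$count.Movie, decreasing = TRUE), ]
wide
```

Countries that never appear with a given type get NA in that column after the reshape, which is worth keeping in mind before adding the two columns together.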
# 6: The names of the second and third columns are changed using the names() function, as
# seen below.
# 7: In arrange() we sorted by the count.Movie column in descending order, but now we want
# the order to depend on the total of "Number of Movies" and "Number of TV Shows". To sort
# a data frame in R, use the order() function. By default, sorting is ascending, so we have
# to request descending order explicitly. + is used to add the two columns together.
# 8: Now we can create our graph using the ggplot2 library. The first argument of ggplot()
# is our data frame; then we specify our variables in aes() and colour the graph by
# country. The type of graph is geom_point, with the dot size set to 5. After that we name
# the x and y axes. The title of the graph is set with the ggtitle() function.
library(ggplot2)
ggplotly(figure000, dynamicTicks = T)
[Figure: scatter plot coloured by country. x-axis: Number of Movies (0 to 2000); y-axis: Number of TV Shows (0 to 700); legend: Australia, Canada, France, India, Japan, Mexico, South Korea, Spain, Taiwan, United Kingdom, United States.]
We see that the United States is the clear leader in the amount of content on Netflix.
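The renaming and re-sorting by total described in steps 6 and 7 can be sketched like this. The numbers are invented for illustration and are not taken from the dataset; u_demo and total are hypothetical names.

```r
# Hypothetical counts, NOT the real figures from the dataset.
u_demo <- data.frame(
  country     = c("India", "United States", "Japan"),
  count.Movie = c(700, 1900, 50),
  count.TV    = c(60, 650, 100),
  stringsAsFactors = FALSE
)

# 6: rename the second and third columns.
names(u_demo)[2:3] <- c("Number of Movies", "Number of TV Shows")

# 7: order() sorts ascending by default, so ask for decreasing order
# on the sum of the two count columns.
total <- u_demo[["Number of Movies"]] + u_demo[["Number of TV Shows"]]
u_demo <- u_demo[order(total, decreasing = TRUE), ]
u_demo
```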
# 0: To see the amount of content over time we have to create a new data frame. This process
# is a little tiring. Maybe there is a shorter way, but I couldn't find it. Let's start!
# 1: The title column is stored in our data frame as character, so I have to convert it to
# tbl_df format to apply the function below. If this column remains in character format and
# I try to apply the function, R returns an error: Error in UseMethod("group_by_") : no
# applicable method for 'group_by_' applied to an object of class "character". Therefore,
# I first assign the title column to f, convert it to a tibble, and then assign it back
# to the title column.
f <- netds$title
f <-tibble(f)
netds$title <- f
# 2: The new_date variable is created by keeping just the year. This way we can analyse and
# visualise the data more easily.
library(lubridate)
# 2: df_by_date is created as a new grouped data frame. Titles are grouped by new_date
# (year), and then na.omit() is applied to the date column to remove NA values. Finally, the
# number of titles added per year is calculated using the summarise() and n() functions.
library(ggplot2)
Type<- df_by_date$`netds$type`
ggplotly(g1, dynamicTicks = T)
[Figure: chart of content added per year by type. y-axis: Number of Content (200 to 1400); legend entry recovered: TV Show.]
From the chart above we see that, starting from 2016, the total amount of content was growing exponentially. We
also notice how fast the amount of movies on Netflix overtook the amount of TV Shows. The reason for the
decline in 2020 is that our data ends at the beginning of 2020, when only a small amount of content had
been added. Before saying anything about 2020 we have to see year-end data.
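The "keep only the year, then count" preparation behind this chart can be sketched in base R as below. The report loads lubridate, whose year() function does the same extraction; the ISO date format used here is an assumption, since the format of date_added is not shown in this output, and demo and df_by_date_demo are hypothetical names.

```r
# Hypothetical sample; the date format is assumed, not taken from the data.
demo <- data.frame(
  type       = c("Movie", "TV Show", "Movie", "Movie"),
  date_added = c("2019-09-09", "2019-01-15", NA, "2018-03-01"),
  stringsAsFactors = FALSE
)

# Keep just the year of each date (NA stays NA).
demo$new_date <- as.integer(format(as.Date(demo$date_added), "%Y"))

# Drop rows without a date, then count titles per year and type.
demo <- demo[!is.na(demo$new_date), ]
demo$added <- 1
df_by_date_demo <- aggregate(added ~ new_date + type, data = demo, FUN = sum)
df_by_date_demo
```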
# Here the plotly library is used to visualise the data. To see the graph in the chunk
# output or console, you have to assign it to an object such as "fig".
library(plotly)
# In the first part of the visualisation, again, we have to specify our data labels, values,
# x and y axes and the type of graph.
figure2
[Figure: output of figure2, a chart labelled with percentage shares; category labels not recoverable.]
# data preparation
data2 <- netds$title %>%
  group_by(netds$rating, netds$type) %>%
  summarise(content_num = n())
## [1] 0
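As a possible alternative to the tibble conversion of the title column, the same count can be obtained by grouping the data frame itself. This is a sketch on made-up rows, offered as an assumption about an equivalent approach rather than the method used in this report; demo and data2_demo are hypothetical names.

```r
library(dplyr)

# Hypothetical sample standing in for netds.
demo <- data.frame(
  title  = c("A", "B", "C", "D"),
  rating = c("PG", "PG", "TV-MA", "PG"),
  type   = c("Movie", "Movie", "TV Show", "TV Show"),
  stringsAsFactors = FALSE
)

# group_by() works directly on a data frame, so the columns can be
# referred to by name and no conversion of title is needed.
data2_demo <- demo %>%
  group_by(rating, type) %>%
  summarise(content_num = n(), .groups = "drop")
data2_demo
```

A side effect is that the grouping columns are then called rating and type rather than `netds$rating` and `netds$type`.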
# visualisation
library(plotly)
figure3
[Figure: bar chart of content count by rating. x-axis: rating (G, NC-17, NR, PG, PG-13, R, TV-14, TV-G, TV-MA, TV-PG, TV-Y, TV-Y7, TV-Y7-FV, UR); y-axis: Count (0 to 1500).]
# data preparation
library(crayon)
##
## Attaching package: 'crayon'
# Before applying the strsplit() function, we have to make sure that the variable is of
# type character.
netds$listed_in <- as.character(netds$listed_in)
## Selecting by count
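The genre preparation is not shown in this output. One way it could work, sketched on made-up values with base R only (listed_in, genres, genre_counts and top_genres are hypothetical names for this example):

```r
# Hypothetical listed_in values; each title can belong to several genres.
listed_in <- c("Dramas, International Movies", "Comedies", "Dramas, Comedies")

# Split every entry into single genres and flatten the result.
genres <- unlist(strsplit(as.character(listed_in), split = ", "))

# Count each genre and sort from most to least frequent.
genre_counts <- sort(table(genres), decreasing = TRUE)

# Keep the most frequent genres (here the top 2).
top_genres <- head(genre_counts, 2)
top_genres
```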
# visualisation
figure4
[Figure: bar chart of content count by genre. x-axis: Genre (Action & Adventure, British TV Shows, Comedies, Crime TV Shows, Documentaries, Docuseries, Dramas, Horror Movies, Independent Movies, International Movies, International TV Shows, Kids' TV, Romantic Movies, Romantic TV Shows, Stand-Up Comedy, Thrillers, TV Comedies, TV Dramas); y-axis: Count (0 to 1500).]
# data preparation
## Selecting by count
library(tibble)
# visualisation as table
fig_table2
19.04.2020 Exploration of Netflix Dataset in R
Director count
Jan Suter 21
Raúl Campos 19
Jay Karas 14
Marcus Raboy 14
Jay Chapman 12
Martin Scorsese 9
Steven Spielberg 9
David Dhawan 8
Johnnie To 8
Lance Bangs 8
Shannon Hartman 8
Umesh Mehra 8
Cathy Garcia-Molina 7
Dibakar Banerjee 7
Hakan Algül 7
Noah Baumbach 7
Quentin Tarantino 7
Robert Rodriguez 7
Ryan Polito 7