Netflix Users Analysis Using Python
📝 Project Description:
Leveraging the power of Python and cutting-edge data analysis libraries, we delved into a fascinating
dataset on Netflix users to uncover valuable insights. Explored key attributes such as Age, Gender,
Subscription Plan, Monthly Revenue, Last Date of Activity, Join Date, and Device to gain a
comprehensive understanding of user behavior and preferences. Employed advanced data visualization
techniques to present findings in an insightful and visually appealing manner. Conducted in-depth
analysis to identify trends, patterns, and correlations within the dataset, providing actionable insights for
Netflix and related stakeholders.
Import Library
In [1]: import pandas as pd
In [2]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import seaborn as sns
C:\Users\Syed Arif\anaconda3\lib\site-packages\scipy\__init__.py:146: UserWarning: A
NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected v
ersion 1.25.1
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Uploading Csv fle
In [3]: df = pd.read_csv(r"C:\Users\Syed Arif\Desktop\Netflix User Base\Netflix Userbase.csv"
Data Preprocessing
.head()
head is used show to the By default = 5 rows in the dataset
In [4]: df.head()
Out[4]:
Last
User Subscription Monthly Join Plan
Payment Country Age Gender Device
ID Type Revenue Date Duration
Date
15- United
0 1 Basic 10 10-06-23 28 Male Smartphone 1 Month
01-22 States
05-
1 2 Premium 15 22-06-23 Canada 35 Female Tablet 1 Month
09-21
28- United
2 3 Standard 12 27-06-23 42 Male Smart TV 1 Month
02-23 Kingdom
10-
3 4 Standard 12 26-06-23 Australia 51 Female Laptop 1 Month
07-22
01-
4 5 Basic 10 28-06-23 Germany 33 Male Smartphone 1 Month
05-23
.tail()
tail is used to show last rows
In [5]: df.tail()
Out[5]:
Last
User Subscription Monthly Join Plan
Payment Country Age Gender Device
ID Type Revenue Date Duration
Date
25- Smart
2495 2496 Premium 14 12-07-23 Spain 28 Female 1 Month
07-22 TV
04- Smart
2496 2497 Basic 15 14-07-23 Spain 33 Female 1 Month
08-22 TV
09- United
2497 2498 Standard 12 15-07-23 38 Male Laptop 1 Month
08-22 States
12-
2498 2499 Standard 13 12-07-23 Canada 48 Female Tablet 1 Month
08-22
13- United Smart
2499 2500 Basic 15 12-07-23 35 Female 1 Month
08-22 States TV
.shape
It show the total no of rows & Column in the dataset
In [6]: df.shape
Out[6]: (2500, 10)
.Columns
It show the no of each Column
In [7]: df.columns
Out[7]: Index(['User ID', 'Subscription Type', 'Monthly Revenue', 'Join Date',
'Last Payment Date', 'Country', 'Age', 'Gender', 'Device',
'Plan Duration'],
dtype='object')
.dtypes
This Attribute show the data type of each column
In [8]: df.dtypes
Out[8]: User ID int64
Subscription Type object
Monthly Revenue int64
Join Date object
Last Payment Date object
Country object
Age int64
Gender object
Device object
Plan Duration object
dtype: object
.unique()
In a column, It show the unique value of specific column.
In [9]: df["Country"].unique()
Out[9]: array(['United States', 'Canada', 'United Kingdom', 'Australia',
'Germany', 'France', 'Brazil', 'Mexico', 'Spain', 'Italy'],
dtype=object)
.nuique()
It will show the total no of unque value from whole data frame
In [10]: df.nunique()
Out[10]: User ID 2500
Subscription Type 3
Monthly Revenue 6
Join Date 300
Last Payment Date 26
Country 10
Age 26
Gender 2
Device 4
Plan Duration 1
dtype: int64
.describe()
It show the Count, mean , median etc
In [11]: df.describe()
Out[11]:
User ID Monthly Revenue Age
count 2500.00000 2500.000000 2500.000000
mean 1250.50000 12.508400 38.795600
std 721.83216 1.686851 7.171778
min 1.00000 10.000000 26.000000
25% 625.75000 11.000000 32.000000
50% 1250.50000 12.000000 39.000000
75% 1875.25000 14.000000 45.000000
max 2500.00000 15.000000 51.000000
.value_counts
It Shows all the unique values with their count
In [12]: df["Country"].value_counts()
Out[12]: United States 451
Spain 451
Canada 317
United Kingdom 183
Australia 183
Germany 183
France 183
Brazil 183
Mexico 183
Italy 183
Name: Country, dtype: int64
.isnull()
It shows the how many null values
In [13]: df.isnull()
Out[13]:
Last
User Subscription Monthly Join Plan
Payment Country Age Gender Device
ID Type Revenue Date Duration
Date
0 False False False False False False False False False False
1 False False False False False False False False False False
2 False False False False False False False False False False
3 False False False False False False False False False False
4 False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ...
2495 False False False False False False False False False False
2496 False False False False False False False False False False
2497 False False False False False False False False False False
2498 False False False False False False False False False False
2499 False False False False False False False False False False
2500 rows × 10 columns
In [14]: sns.heatmap(df.isnull())
Out[14]: <AxesSubplot:>
In [15]: df["Join Date"] = pd.to_datetime(df["Join Date"])
df["Last Payment Date"] = pd.to_datetime(df["Last Payment Date"])
In [16]: import pandas as pd
# Assuming 'df' is your DataFrame with a 'Join Date' column
df['Join Date'] = pd.to_datetime(df['Join Date'])
# Extract month names
df['Join Month'] = df['Join Date'].dt.month_name()
# Display the DataFrame with the added 'Join Month' column
print(df)
User ID Subscription Type Monthly Revenue Join Date Last Payment Date \
0 1 Basic 10 2022-01-15 2023-10-06
1 2 Premium 15 2021-05-09 2023-06-22
2 3 Standard 12 2023-02-28 2023-06-27
3 4 Standard 12 2022-10-07 2023-06-26
4 5 Basic 10 2023-01-05 2023-06-28
... ... ... ... ... ...
2495 2496 Premium 14 2022-07-25 2023-12-07
2496 2497 Basic 15 2022-04-08 2023-07-14
2497 2498 Standard 12 2022-09-08 2023-07-15
2498 2499 Standard 13 2022-12-08 2023-12-07
2499 2500 Basic 15 2022-08-13 2023-12-07
Country Age Gender Device Plan Duration Join Month
0 United States 28 Male Smartphone 1 Month January
1 Canada 35 Female Tablet 1 Month May
2 United Kingdom 42 Male Smart TV 1 Month February
3 Australia 51 Female Laptop 1 Month October
4 Germany 33 Male Smartphone 1 Month January
... ... ... ... ... ... ...
2495 Spain 28 Female Smart TV 1 Month July
2496 Spain 33 Female Smart TV 1 Month April
2497 United States 38 Male Laptop 1 Month September
2498 Canada 48 Female Tablet 1 Month December
2499 United States 35 Female Smart TV 1 Month August
[2500 rows x 11 columns]
Why we Use (get_continent) in Python:
This library can help you find the continent of a given country
In [17]: # Deriving some useful features using lambda function
def get_continent(country):
"""returns the continent of the given country"""
if country in {"United States", "Canada", "Mexico"}:
return "North America"
if country in {"France", "Germany", "United Kingdom", "Italy", "Spain"}:
return "Europe"
if country == "Brazil":
return "South America"
if country == "Australia":
return "Australia"
return "Africa / Asia"
def get_age_class(age):
"""returns the age class of a given age"""
return "Kid" if age < 11 \
else "Teen" if age < 20 \
else "Young" if age < 40 \
else "Senior" if age < 70 \
else "Elderly"
df["Country"] = df["Country"].apply(lambda x: get_continent(x))
df["Age"] = df["Age"].apply(lambda x : get_age_class(x))
In [18]: df
Out[18]:
Last
User Subscription Monthly Join Plan
Payment Country Age Gender Device
ID Type Revenue Date Duration
Date
2022- 2023-10- North
0 1 Basic 10 Young Male Smartphone 1 Month Ja
01-15 06 America
2021- 2023-06- North
1 2 Premium 15 Young Female Tablet 1 Month
05-09 22 America
2023- 2023-06-
2 3 Standard 12 Europe Senior Male Smart TV 1 Month Fe
02-28 27
2022- 2023-06-
3 4 Standard 12 Australia Senior Female Laptop 1 Month O
10-07 26
2023- 2023-06-
4 5 Basic 10 Europe Young Male Smartphone 1 Month Ja
01-05 28
... ... ... ... ... ... ... ... ... ... ...
2022- 2023-12-
2495 2496 Premium 14 Europe Young Female Smart TV 1 Month
07-25 07
2022- 2023-07-
2496 2497 Basic 15 Europe Young Female Smart TV 1 Month
04-08 14
2022- 2023-07- North
2497 2498 Standard 12 Young Male Laptop 1 Month Sept
09-08 15 America
2022- 2023-12- North
2498 2499 Standard 13 Senior Female Tablet 1 Month Dec
12-08 07 America
2022- 2023-12- North
2499 2500 Basic 15 Young Female Smart TV 1 Month A
08-13 07 America
2500 rows × 11 columns
In [19]: plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Gender')
plt.xlabel('Gender')
plt.ylabel('Distribuation')
plt.title('Distribuation of Gender')
plt.show()
In [20]: plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Gender', hue ="Country")
plt.xlabel('Gender')
plt.ylabel('Country')
plt.title('Gender Vise Country Subscribers')
plt.show()
In [21]: plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Country', hue ="Device")
plt.xlabel('Country')
plt.ylabel('Device')
plt.title('Country Vise Device Users')
plt.show()
In [22]: # Group the data by Feedback and calculate the count of each category
Device = df.groupby('Device').size()
# Create a pie chart
plt.figure(figsize=(8, 8))
plt.pie(Device, labels=Device.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Device')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
In [23]: plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Subscription Type')
plt.xlabel('Subscription Type')
plt.ylabel('Counts')
plt.title('Distribuation of Subscription Type')
plt.show()
In [26]: df
Out[26]:
Last
User Subscription Monthly Join Plan
Payment Country Age Gender Device
ID Type Revenue Date Duration
Date
2022- 2023-10- North
0 1 Basic 10 Young Male Smartphone 1 Month Ja
01-15 06 America
2021- 2023-06- North
1 2 Premium 15 Young Female Tablet 1 Month
05-09 22 America
2023- 2023-06-
2 3 Standard 12 Europe Senior Male Smart TV 1 Month Fe
02-28 27
2022- 2023-06-
3 4 Standard 12 Australia Senior Female Laptop 1 Month O
10-07 26
2023- 2023-06-
4 5 Basic 10 Europe Young Male Smartphone 1 Month Ja
01-05 28
... ... ... ... ... ... ... ... ... ... ...
2022- 2023-12-
2495 2496 Premium 14 Europe Young Female Smart TV 1 Month
07-25 07
2022- 2023-07-
2496 2497 Basic 15 Europe Young Female Smart TV 1 Month
04-08 14
2022- 2023-07- North
2497 2498 Standard 12 Young Male Laptop 1 Month Sept
09-08 15 America
2022- 2023-12- North
2498 2499 Standard 13 Senior Female Tablet 1 Month Dec
12-08 07 America
2022- 2023-12- North
2499 2500 Basic 15 Young Female Smart TV 1 Month A
08-13 07 America
2500 rows × 11 columns
In [27]: Joining_Months_Counts = df['Join Month'].value_counts()
Joining_Months_Counts.plot(kind='bar')
plt.xlabel('Join Month')
plt.ylabel('Count')
plt.title('Joining Counts By Months')
Out[27]: Text(0.5, 1.0, 'Joining Counts By Months')