Project Fahim Slmbi

The document presents a project by Fahim Ahmed Khan on exploratory data analysis of the Netflix dataset, focusing on movies and guest stars, submitted for an MBA in Business Analytics at Symbiosis School. It outlines the significance of Netflix as a leading streaming service, details the methodology for data analysis, and discusses the company's history, content strategy, and challenges. The project emphasizes the importance of data-driven insights in enhancing user experience and content performance on the platform.

A Project on “EXPLORATORY DATA ANALYSIS ON NETFLIX DATASET

(INVESTIGATING NETFLIX MOVIES AND GUEST STARS)”

Submitted by: Fahim Ahmed Khan

PRN No. 23029141977

Under the Guidance of “Dr Sandeep Gaikwad”

Submitted to
Symbiosis School for Online and Digital Learning (SSODL), Pune
Submitted in partial fulfilment of the requirement for the award
of

Master of Business Administration


Batch July 2023

Student Declaration
I, Fahim Ahmed Khan, hereby declare that the report entitled “Exploratory
data analysis on Netflix dataset (Investigating Netflix movies and guest
stars)”, submitted to the “Symbiosis School for Online and Digital Learning” in partial
fulfilment of the requirement for the award of the “MBA in Business Analytics”,
is my original work.

The findings in this project are based on data collected by me, and I
have not copied from any other student or any other source. This
report has not been submitted by me elsewhere.

Signature: Fahim Ahmed Khan

Name of the student: Fahim Ahmed Khan
Program: MBA in Business Analytics
PRN: 23029141977

ACKNOWLEDGEMENTS

I take this opportunity to express my gratitude to everyone who
supported me during the project. I am thankful for their inspiring
guidance, invaluable constructive criticism, and friendly advice
during the project work. I am sincerely grateful to them for sharing
their truthful and illuminating views on a number of issues related
to the project.

I express my warm thanks to my guide, Dr. Sandeep Gaikwad, for
their constant and timely support and guidance during my project.

Signature: Fahim
Name of student: Fahim Ahmed Khan
Program Name: MBA in Business Analytics
PRN No. 23029141977
INDEX

• INTRODUCTION
• COMPANY PROFILE
• LITERATURE REVIEW
• THEORETICAL REVIEW
• OBJECTIVE OF THE STUDY
• RESEARCH METHODOLOGY
• ANALYSIS
• FINDINGS
• CONCLUSION
• REFERENCES
INTRODUCTION

Netflix is an American subscription video on-demand over-the-top streaming service. The
service primarily distributes original and acquired films and television shows from various
genres, and it is available internationally in multiple languages.

Launched on January 16, 2007, nearly a decade after Netflix, Inc. began its pioneering
DVD-by-mail movie rental service, Netflix is the most-subscribed video on demand
streaming media service, with 238.39 million paid memberships in more than 190 countries.
By 2022, "Netflix Original" productions accounted for half of its library in the United States
and the namesake company had ventured into other categories, such as video game
publishing of mobile games via its flagship service.

Netflix competes in two different markets:

• DVD
• Online television and premium video industry

Here's an overview of Netflix and its key features:

1. Streaming Service: Netflix primarily operates as an online streaming service, allowing
users to watch a wide variety of content over the internet. It is available in over 190 countries,
making it accessible to a global audience.

2. Extensive Content Library: Netflix offers a vast and diverse library of content, including
movies, TV series, documentaries, and stand-up comedy specials. It features content from
various genres, languages, and countries, catering to a broad spectrum of tastes and
preferences.

3. Original Content: Netflix is known for its investment in original programming. The
company produces and releases exclusive content, including critically acclaimed series like
"Stranger Things," "The Crown," "Narcos," and original movies like "Bird Box" and "The
Irishman."

4. Personalized Recommendations: Netflix uses advanced algorithms and machine learning
to provide personalized content recommendations. The platform analyses a user's viewing
history and preferences to suggest titles that they are likely to enjoy.

5. Multi-Device Accessibility: Netflix is accessible on a wide range of devices, including
smartphones, tablets, smart TVs, gaming consoles, and web browsers. Subscribers can start
watching on one device and continue on another without losing their place.

6. Offline Viewing: Netflix allows users to download content for offline viewing, which is
especially useful for those with limited or no internet access while traveling or in remote
locations.

7. Ad-Free Experience: Unlike traditional television, Netflix offers an ad-free viewing
experience. Subscribers can watch content without interruptions from advertisements.

8. Multiple Subscription Tiers: Netflix offers different subscription tiers with varying
pricing and features, including options for single, dual, or multiple device access and
different video quality levels.

9. Global Reach: Netflix has a global presence, producing content and providing services in
multiple languages and regions. It has expanded its content library to cater to a diverse
international audience.

10. Awards and Recognition: Netflix has received numerous industry awards and
nominations, including Academy Awards (Oscars), Emmy Awards, and Golden Globe
Awards for its original content.

Netflix has had a profound impact on the entertainment industry, contributing to the shift
away from traditional cable and broadcast television toward on-demand streaming. It has a
strong focus on innovation, content creation, and user experience, making it one of the most
influential players in the media and entertainment sector. With its global reach and an ever-
expanding library of content, Netflix continues to be a major player in the digital
entertainment landscape.

Introduction to Exploratory Data Analysis (EDA) for Netflix

Exploratory Data Analysis, commonly abbreviated as EDA, is a fundamental step in data
analysis that involves the initial exploration, visualization, and understanding of data to
derive meaningful insights and patterns. In the context of Netflix, EDA refers to the process
of analysing data related to the streaming platform, user behaviour, and content performance.

Here's an introduction to EDA for Netflix data:

1. Data Sources: Netflix gathers vast amounts of data on user interactions, content
consumption, and other platform-related activities. This data includes user demographics,
viewing histories, content metadata, and more. EDA begins with the collection and
integration of these diverse data sources.

2. Data Preprocessing: Before performing any analysis, data collected from various sources
need to be pre-processed. This involves handling missing data, dealing with outliers, and
ensuring data quality. In the context of Netflix, this may include addressing incomplete user
profiles or content information.

3. Data Visualization: Visualization is a critical aspect of EDA. Analysts create visual
representations of data using charts, graphs, and plots to understand data distributions and
patterns. For Netflix, this might involve creating charts to visualize trends in user
engagement, viewing habits, and content popularity.

4. Summary Statistics: EDA often involves calculating and interpreting summary statistics,
such as mean, median, standard deviation, and percentiles. For Netflix, this might mean
calculating average viewing times, identifying the most-watched genres, or analysing trends
over time.

5. Pattern Discovery: EDA aims to uncover patterns, trends, and anomalies in the data. In
the context of Netflix, this might include identifying peak usage times, understanding how
user demographics influence content preferences, and recognizing seasonal viewing patterns.

6. Hypothesis Testing: Hypotheses can be formulated during EDA and tested using
statistical methods. For instance, analysts might investigate whether the release of certain
content correlates with changes in user behaviour or if specific genres attract more viewers.

7. Machine Learning and Predictive Analytics: EDA can serve as the foundation for more
advanced analytics, including machine learning models. For Netflix, predictive models can be
developed to forecast user preferences, improve content recommendations, or optimize
content acquisition strategies.

8. Data Storytelling: The insights gained from EDA need to be effectively communicated to
relevant stakeholders. Data storytelling through reports, presentations, and data visualization
tools helps decision-makers within Netflix leverage the findings to make informed business
decisions.

Exploratory Data Analysis for Netflix is essential in understanding user behaviour and
content performance. By employing data-driven insights, Netflix can refine its content
acquisition strategies, enhance user recommendations, and ultimately provide a more tailored
and satisfying streaming experience to its subscribers. EDA plays a crucial role in the data-
driven decision-making process for a platform that has a global audience and offers a diverse
range of content.
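
As a rough illustration of the summary-statistics step described above, the following sketch computes the mean, median, and percentiles of viewing time and finds the most-watched genre. The table and its column names (`genre`, `minutes_watched`) are made-up assumptions for illustration only; they are not Netflix data.

```python
import pandas as pd

# Hypothetical viewing log: one row per viewing session (made-up values).
sessions = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Documentary", "Comedy", "Drama"],
    "minutes_watched": [50, 22, 61, 95, 24, 48],
})

# Summary statistics for the numeric column: count, mean, std, min,
# the requested percentiles, and max.
summary = sessions["minutes_watched"].describe(percentiles=[0.25, 0.5, 0.75])
print(summary)

# Most-watched genre, measured by total minutes watched.
minutes_by_genre = sessions.groupby("genre")["minutes_watched"].sum()
top_genre = minutes_by_genre.idxmax()
print(top_genre)
```

The same two calls (`describe` and a `groupby` aggregate) apply unchanged to a real table once it is loaded into a DataFrame.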
COMPANY PROFILE

Netflix was founded by Marc Randolph and Reed Hastings on August 29, 1997, in Scotts
Valley, California. Hastings, a computer scientist and mathematician, was a co-founder of
Pure Software, which was acquired by Rational Software that year for $750 million, then the
biggest acquisition in Silicon Valley history. Randolph had worked as a marketing director for
Pure Software after Pure Atria acquired a company where Randolph worked. He was
previously a co-founder of MicroWarehouse, a computer mail-order company, as well as vice
president of marketing for Borland. Hastings and Randolph came up with the idea for Netflix
while carpooling between their homes in Santa Cruz, California, and Pure Atria's
headquarters in Sunnyvale.

In January 2007, the company launched a streaming media service, introducing video on
demand via the Internet. However, at that time it only had 1,000 films available for
streaming, compared to 70,000 available on DVD. The company had for some time
considered offering movies online, but it was only in the mid-2000s that data speeds and
bandwidth costs had improved sufficiently to allow customers to download movies from the
net. The original idea was a "Netflix box" that could download movies overnight and have them
ready to watch the next day. By 2005, Netflix had acquired movie rights and designed the box
and service. But after witnessing how popular streaming services such as YouTube were
despite the lack of high-definition content, the concept of using a hardware device was
scrapped and replaced with a streaming concept.

In January 2011, Netflix announced agreements with several manufacturers to include
branded Netflix buttons on the remote controls of devices compatible with the service, such
as Blu-ray players. By May 2011, Netflix had become the largest source of Internet streaming
traffic in North America, accounting for 30% of traffic during peak hours.

Key Points:

1. Revolutionary Streaming Service: Netflix initially started as a DVD rental service but
made a historic pivot by launching its online streaming platform in 2007. This move
disrupted traditional television and ushered in the era of on-demand streaming.
2. Global Presence: As of September 2021, Netflix was
available in over 190 countries. It has been rapidly expanding its reach to provide
content to a truly global audience while customizing its library for various regions.
3. Extensive Content Library:
   - Movies: Netflix offers a vast collection of movies, ranging from classic films
     to the latest releases.
   - TV Series: The platform includes an extensive catalogue of TV series, from
     popular network shows to original series that have garnered critical acclaim.
4. Personalized User Experience: Netflix employs sophisticated algorithms and machine
learning to provide personalized content recommendations to its subscribers. These
recommendations are based on a user's viewing history and preferences, enhancing the
overall viewing experience.
5. Multi-Device Accessibility: Netflix is accessible on a wide range of devices, including
smartphones, tablets, smart TVs, gaming consoles, and web browsers. The platform
offers a seamless experience, allowing users to start watching on one device and
continue on another without losing their place.
6. Offline Viewing: Netflix introduced the option to download content for offline viewing.
This feature is especially useful for users who want to watch content without an internet
connection, such as during travel or in areas with limited connectivity.
7. Ad-Free Environment: Netflix is known for its ad-free streaming experience, in
contrast to traditional television. This allows viewers to enjoy content without any
interruptions from advertisements.
8. Subscription Tiers: Netflix offers several subscription tiers, each with different pricing
and features. These tiers cater to various user needs and may include options for single
or multiple device access, as well as different video quality levels (e.g., Standard
Definition, High Definition, and 4K Ultra HD).
9. Awards and Recognition: Netflix's original content has received widespread critical
acclaim and numerous industry awards. This includes accolades from prestigious
institutions like the Academy Awards (Oscars), Emmy Awards, and Golden Globe
Awards.
10. Innovation and Industry Impact: Netflix's innovative approach to content creation,
technology, and user experience has made it a trailblazer in the streaming industry. It
has not only inspired the development of numerous other streaming platforms but also
contributed to the broader shift away from traditional cable and broadcast television.

Financial Performance:
- Revenue: Up to September 2021, Netflix's revenue had been
steadily increasing over the years. In 2020, the company reported annual revenue of
approximately $25 billion.
- Profitability: Netflix had been reinvesting heavily in content production and international
expansion, which sometimes resulted in lower profits. However, the company's profitability
had been improving, and it consistently reported positive net income.
- Debt: Netflix has incurred significant long-term debt to fund its content production and
global expansion. In 2020, the long-term debt stood at around $16 billion. This debt has been
considered a strategic move to secure a dominant position in the industry.

Content Production:
- Netflix invests heavily in original content creation to distinguish itself from competitors.
The company had been increasing its content budget, allocating billions of dollars each year
to produce and license content. This includes scripted and unscripted series, films,
documentaries, and more.
- The company had entered into exclusive production and distribution deals with well-known
creators and showrunners, which helped in creating unique and highly popular content.

Subscriber Growth:
- Netflix's subscriber base had been growing steadily. In Q2 2021, the company reported over
209 million paid subscribers globally.
- International expansion was a significant driver of this growth. A substantial portion of
Netflix's subscribers come from outside the United States.

Technology and User Experience:
- Netflix utilizes a range of technologies to optimize user experience, including adaptive
streaming that adjusts video quality based on internet connection speed, as well as advanced
compression algorithms to minimize data usage.
- The recommendation algorithm uses machine learning to analyze viewing history and
provide personalized content suggestions to subscribers.

Challenges:
- Netflix faces challenges related to content acquisition costs, especially as it competes with
other streaming platforms for exclusive rights to popular shows and movies.
- The competitive landscape is continuously evolving, with new entrants in the streaming
industry and traditional media companies launching their streaming services.
- Regional regulations and content restrictions can pose challenges as Netflix operates
globally and must adhere to various content guidelines and censorship rules in different
countries.

Leadership:
- Reed Hastings has been a key figure in the company's leadership, serving as Co-Founder,
Chairman, and Co-CEO. Ted Sarandos, as the Co-CEO, is responsible for content acquisition
and creation, making him a central figure in Netflix's original content strategy.
LITERATURE REVIEW

Sr. No. | Paper Name | Year of Publication | Author | Insights
1 | Analytics Vidhya | 2023 | Swapnil Vishwakarma | The analysis revealed that Netflix had added more movies than TV shows and that Netflix adds the most content in the month of July. The data analysis journey showcased the power of data in unravelling the mysteries of Netflix’s content landscape, providing valuable insights for viewers and content creators.
2 | Electronic Code Book (ECB) | 2023 | Vijay Kumar Sahu | Several interesting inferences from the research: the most common content type on Netflix is movies; the most popular director on Netflix, with the most titles, is Jan Suter; the most popular actor in Netflix movies, based on the number of titles, is Anupam Kher.
3 | Electronic Code Book (ECB) | 2023 | Moses Benjamin | The Netflix dataset analysis focuses on identifying trends in consumer habits and preferences regarding the streaming service. It examines the viewing history of subscribers, the types of content they watch, the time spent watching, and the geographic location of the viewers. The data is collected from a variety of sources and is then cleaned and analysed.
THEORETICAL REVIEW

This section reviews in more detail how Exploratory Data Analysis (EDA)
can be applied to Netflix:

1. User Behaviour Analysis: Netflix can collect and analyse a vast amount of data related to
user behaviour. EDA could be used to identify patterns, such as the most popular genres, how
long users spend on the platform, and at what times of day they are most active.
By segmenting users into different groups based on their viewing habits, Netflix can tailor its
recommendations and content suggestions more effectively.
2. Content Performance Analysis: EDA can help assess the performance of individual
movies and TV shows. Metrics like viewer ratings, watch duration, and drop-off points in a
series can be analysed.
By understanding what content resonates with the audience, Netflix can make informed
decisions about renewing, promoting, or creating new content.
3. Geographic Analysis: Geographic data can be explored to identify regional content
preferences. EDA can reveal differences in popular genres and viewing habits across
countries.
Netflix can use this information to localize its content libraries and improve content
recommendations for viewers in different regions.
4. User Retention and Churn Analysis: EDA can be applied to assess user retention and
churn. By examining user engagement patterns and subscription duration, Netflix can identify
factors that contribute to user retention.
Insights from EDA can inform strategies to reduce churn, such as personalized retention
campaigns or content recommendations.
5. Recommendation System Improvement: Netflix's recommendation system is powered
by data analysis. EDA can be used to assess the performance of the recommendation
algorithm by studying the accuracy of recommendations, click-through rates, and user
feedback.
Netflix can use EDA findings to fine-tune its recommendation system, improving the quality
of personalized content suggestions.
6. A/B Testing: EDA is essential in designing and analysing A/B tests. When Netflix makes
changes to its platform, such as altering the user interface, it can employ EDA to evaluate the
impact of these changes on user behaviour.
Netflix can perform A/B tests to assess different features, layouts, or promotional strategies,
and EDA helps in interpreting the results.
7. Content Licensing Decisions: EDA can aid in making informed content licensing
decisions. By analysing user engagement with licensed content, Netflix can determine which
types of content have the greatest impact on viewer retention and acquisition.
This data can be used during negotiations with content providers.

8. Content Creation and Investment: EDA can provide insights into the success of different
content genres and types. For original content production, Netflix can analyse the viewing
data of similar shows or movies to inform budget allocation and creative decisions.
It can also help in identifying emerging content trends and opportunities.
9. Security and Anomaly Detection: Beyond content and user behaviour, EDA can be used
for security. Netflix can analyse login and access patterns to detect unusual activities,
potentially indicating unauthorized account access or security breaches.
EDA can help in identifying and mitigating security threats in real-time.
10. Cohort Analysis: Netflix can perform cohort analysis to group users based on certain
criteria, such as the date they joined or their subscription plan. EDA can help Netflix
understand how different user cohorts behave over time, allowing them to make more
targeted decisions.
11. Content Popularity Trends: EDA can uncover trends in content popularity. By analysing
which genres or themes are trending, Netflix can adjust its content creation and licensing
strategies to meet current demands.
12. Personalization and Niche Content: EDA can identify niche content preferences within
user segments. By understanding these niches, Netflix can cater to the long tail of user
interests and recommend specialized content to niche audiences.
13. Predictive Modelling: EDA can serve as a foundation for predictive modelling. For
instance, it can be used to build predictive models for user churn, helping Netflix anticipate
and reduce subscriber attrition.
14. Viewer Demographics: EDA can help Netflix identify the demographics of its viewers.
Analysing data related to age, gender, location, and viewing habits can provide valuable
insights for content targeting and advertising.
15. Content Discovery: EDA can be used to improve content discovery. By analysing how
users navigate the platform and discover new content, Netflix can optimize its user interface,
search functionality, and content recommendations.
16. Viewer Sentiment Analysis: By analysing social media and user reviews, Netflix can
perform sentiment analysis. This can help the company understand how viewers feel about its
content and make adjustments as necessary.
17. Real-time Analytics: EDA can also be applied to real-time data streams. Netflix can use
it to monitor live events, user interactions, and platform performance in real time, enabling
quick responses and improvements.
18. Multi-device Analysis: As viewers access Netflix on various devices, EDA can help
analyse user behaviour across different platforms (smartphones, smart TVs, tablets, etc.),
providing insights into user preferences on each.
19. Quality of Experience (QOE): Netflix can use EDA to assess the quality of user
experience, including video streaming quality and buffering issues. By identifying pain
points, the company can improve QOE for its subscribers.

20. Content Cost-Benefit Analysis: EDA can help Netflix evaluate the cost-effectiveness of
content. By analysing production costs, viewership numbers, and user feedback, the company
can determine which content provides the best return on investment.

In summary, EDA is a powerful tool for Netflix to gain a deeper understanding of its users,
content, and the streaming industry. By applying EDA techniques, Netflix can make
data-driven decisions that enhance the user experience, content library, and overall business
performance.
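
To make one of the techniques above concrete, the cohort analysis in point 10 can be sketched as a simple group-by. This is a minimal sketch on an invented table: the columns `user_id`, `signup_month`, and `still_subscribed` are assumptions chosen for illustration and do not come from any Netflix dataset.

```python
import pandas as pd

# Hypothetical subscriber table (made-up values).
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "signup_month": ["2023-01", "2023-01", "2023-01",
                     "2023-02", "2023-02", "2023-02"],
    "still_subscribed": [True, False, False, True, True, False],
})

# Group users into cohorts by signup month and compute, for each cohort,
# the share of users who are still subscribed (the retention rate).
retention = users.groupby("signup_month")["still_subscribed"].mean()
print(retention)
```

Comparing the retention rate across cohorts is what lets an analyst see whether users who joined in one period behave differently from those who joined in another.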
OBJECTIVE OF THE STUDY

The objective of Exploratory Data Analysis (EDA) at Netflix is to gain a deeper
understanding of its data and leverage insights to inform decision-making, enhance user
experiences, optimize content delivery, and drive business strategies.

Here are the primary objectives of EDA at Netflix:

1. To Understand User Behaviour: Analyse how users interact with the platform, what
content they engage with, and when they use Netflix. This understanding is crucial for
personalizing recommendations, content curation, and improving the user experience.

2. To Assess Content Performance: Evaluate the success of individual movies, series, and
documentaries. Identify which content resonates with viewers, has higher viewer ratings, and
drives user retention.

3. To Predict and Reduce Churn: Identify factors that contribute to user churn and
develop strategies to reduce subscription cancellations. EDA helps in recognizing patterns
and trends that may indicate when users are likely to leave the service.

4. To Enhance User Experience: Improve the user interface and content discovery features
based on how users navigate the platform. EDA guides user interface design and content
promotion strategies for an intuitive user experience.

In essence, EDA at Netflix is a multidimensional process that provides a data-driven
foundation for a wide range of business decisions, from content creation and distribution to
user retention and security. It helps Netflix adapt to a rapidly changing and highly
competitive industry by ensuring that data insights guide its strategies and offerings.
RESEARCH METHODOLOGY

Research methodology is the specific procedure or set of techniques used to identify, select,
process, and analyse information about a topic. The methodology section allows the reader
to critically evaluate a study’s overall validity and reliability. The methodology section
answers two main questions:

• How was the data collected?
• How was it analysed?

We collected a comprehensive dataset on Netflix, including information about the content
library, user ratings, and genre classifications. The dataset covers a span of several years and
gives a robust representation of Netflix's offerings. Data collection is a crucial first step in Exploratory
Data Analysis (EDA) for Netflix or any other data analysis project. Netflix collects a vast
amount of data related to user behaviour, content, and platform performance.

1. Review Stage: We first wanted to get an overview of the dataset that we were dealing
with, so we began by loading the tidyverse for simple data analysis. We obtained the
dataset from Kaggle, and we use the data that the Kaggle website provides
to understand the trend of movies and TV shows released on the platform.
2. Import Libraries:

• import pandas
• import matplotlib.pyplot

3. Loading the Dataset: Using the pandas library, we load the CSV file.
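
A minimal sketch of the loading step is given below. The report does not state the file name, so the example reads a small in-memory stand-in with `io.StringIO`; the column names and rows are placeholders, not the actual Kaggle data. With the real file, its path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# In-memory stand-in for the downloaded CSV file. The columns and rows
# here are invented placeholders; the real file's schema may differ.
csv_text = io.StringIO(
    "title,type,release_year\n"
    "Example Movie A,Movie,2019\n"
    "Example Show B,TV Show,2020\n"
    "Example Movie C,Movie,2020\n"
)

# read_csv accepts a file path or any file-like object.
netflix_df = pd.read_csv(csv_text)

# First checks after loading: size, column types, and a preview.
print(netflix_df.shape)
print(netflix_df.dtypes)
print(netflix_df.head())
```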
DATA PROCESSING

To prepare the data for analysis, we cleaned the data by addressing missing values and
handling outliers. We also transformed the data into a format suitable for analysis, ensuring
the integrity and accuracy of our findings.

Here are some key aspects of data collection in EDA for Netflix:

1. User Interactions: Netflix records data on how users interact with the platform. This
includes information on when users log in, what they search for, what they watch,
how long they watch, and when they pause or stop content.
2. User Demographics: Data collection includes information about users'
demographics, such as age, gender, location, and language preferences. This
information can help Netflix understand its user base better.
3. Content Metadata: Netflix collects data related to its extensive content library. This
includes details about movies, TV shows, documentaries, release dates, genres, cast
and crew information, ratings, and viewer reviews.
4. Viewing History: Users' viewing history, which includes data on previously watched
content and the frequency of viewing, is recorded to provide personalized
recommendations.
5. Geographic Data: Geographic data helps Netflix understand where its users are
located. This information can be used to analyse regional content preferences and
adapt the content library to local tastes.
6. Engagement Metrics: Netflix monitors various engagement metrics, such as the time
spent on the platform, the number of episodes or movies watched in a single session,
and the frequency of logins.
7. Content Performance Data: Data on how well specific content is performing,
including metrics like the number of views, viewer ratings, and viewer reviews, are
collected. This information helps in content recommendation and decision-making for
future content production.
8. Quality of Service Metrics: Data on the quality of the streaming service, including
load times, buffering, and video quality, is collected to ensure a smooth user
experience.
9. Surveys and Feedback: In addition to passive data collection, Netflix may also
conduct user surveys and collect feedback to gain qualitative insights into user
preferences and suggestions for improvements.
10. Legal and Privacy Considerations: Netflix adheres to strict legal and privacy
regulations when collecting and handling user data. This includes obtaining user
consent, anonymizing data when necessary, and ensuring the security of user
information.

ANALYSIS

The EDA process can be summed up in three steps, which are:

1. Understanding the data
2. Cleaning the data
3. Analysis of the relationship between variables

Let us understand the process of Exploratory Data Analysis (EDA) step-by-step:

Understanding the data

Import necessary libraries

The first step is to import the required libraries. In this code block, the Pandas library is used
to read and manipulate data and the Pandas-profiling library is used for EDA. The datasets
module from the scikit-learn library is used to load the Iris dataset.

import pandas as pd
import pandas_profiling
from sklearn import datasets

Loading the dataset

Next, we have to load the dataset. Here, we will be using the multivariate dataset named the
Iris dataset.

iris = datasets.load_iris()

Converting to Pandas Data Frame

The scikit-learn dataset is loaded as a Bunch object, similar to a dictionary. To use this
dataset with Pandas, we must convert it to a Pandas Data Frame.

iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_data['target'] = iris['target']

Checking data attributes

It is always a good approach to check the attributes of the data, like its shape or the number of
rows and columns in the dataset. To check the shape of the data, run the following code:

iris_data.shape

To check the column names of the DataFrame, we use the columns attribute.

iris_data.columns

If your dataset is big, you can view the first few records of the DataFrame by running the
code below:

iris_data.head()

Cleaning the data

Once you have scanned the attributes of the data, you can make any necessary
modifications to the dataset, such as renaming the columns or the rows. Take care
not to change the values of the dataset’s variables, since that can significantly affect the final result.

Check for null values

To clean the data, you must first check for null values in the variables. If any of the
variables in a dataset have null values, the analysis results can be affected. If your dataset has
missing data, handle it through approaches such as imputation, deletion of observations or
variables, or models that can handle missing data.
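
The null check and two of the handling options mentioned above (deletion and imputation) might look like the following. This is a sketch on a small hand-made table with Iris-like column names; the missing values are inserted artificially for illustration, and median imputation is only one possible choice.

```python
import numpy as np
import pandas as pd

# Small Iris-like table with artificially inserted missing values.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 6.3],
    "petal_length": [1.4, np.nan, 4.7, 6.0],
    "target": [0, 0, 1, 2],
})

# Count the null values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Option 1 (deletion): drop every row that contains a null value.
dropped = df.dropna()

# Option 2 (imputation): fill each null with the median of its column.
imputed = df.fillna(df.median(numeric_only=True))
print(imputed)
```

Deletion is simplest but discards rows; imputation keeps every row at the cost of introducing estimated values.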

Dropping the redundant data and removing outliers

Next, if you find any redundant data in your dataset that does not add value to the output, you
can remove it from the table. All the columns and rows of the Iris
dataset used here are important, so we do not drop any data. In this step, we must also look for
any outliers in the data.
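
One common way to carry out this step is to drop exact duplicate rows and then flag outliers with the interquartile range (IQR) rule. The sketch below assumes that rule and the conventional 1.5 multiplier; the report does not say which outlier method was used, and the table (`sample_id`, `sepal_width`) is invented for illustration.

```python
import pandas as pd

# Toy table with one exact duplicate row and one extreme value.
df = pd.DataFrame({
    "sample_id": [1, 2, 3, 4, 5, 5, 6],
    "sepal_width": [3.0, 3.1, 2.9, 3.2, 3.0, 3.0, 9.5],
})

# Remove rows that are exact duplicates of an earlier row.
deduped = df.drop_duplicates()

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
q1 = deduped["sepal_width"].quantile(0.25)
q3 = deduped["sepal_width"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (deduped["sepal_width"] < lower) | (deduped["sepal_width"] > upper)
outliers = deduped[is_outlier]
print(outliers)
```

Whether a flagged value is removed, corrected, or kept is a judgment call that depends on whether it is a measurement error or a genuine observation.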

Analysis of the relationship between variables


The final step in the process of EDA is to analyse the relationship between variables. It
involves the following:

• Correlation analysis: The analyst computes the correlation matrix between variables
to identify which variables are strongly correlated with each other.
• Visualization: The data analyst creates visualizations to explore the relationship
between variables. This includes scatter plots, heatmaps, etc.
• Hypothesis testing: The analyst performs statistical tests to test hypotheses about the
relationship between variables.
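
The correlation analysis in the first bullet can be sketched with pandas alone. The few rows below are hand-typed, Iris-like measurements used as a stand-in for the full DataFrame, so the printed coefficients are illustrative rather than results for the real dataset.

```python
import pandas as pd

# A handful of Iris-like measurements (illustrative stand-in data).
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})

# Pairwise Pearson correlation matrix (the default method of DataFrame.corr).
corr = df.corr()
print(corr.round(2))

# Read off a single coefficient for a pair of variables of interest.
petal_corr = corr.loc["petal_length", "petal_width"]
print(petal_corr)
```

A coefficient close to 1 or -1 indicates a strong linear relationship; the matrix can also be passed to a heatmap for the visualization step.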

Run the following code to generate a report that includes various relationship analyses
between variables.

1. Loading the dataset using the pandas library:
2. Renaming the unnamed default column:
3. Summary:
4. Scaling the data:
5. Assigning customized colour into the range:
6. Adding size attribute:
7. Creating a column named has_guest (guest present or not):
8. Guest Stars:
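
The steps listed above can be sketched roughly as follows. This is only an illustration on a toy table: the column names (`ratings`, `guest_stars`, `episode_number`), the min-max scaling, the colour thresholds, and the marker sizes are all assumptions, not values taken from the project's dataset. The `has_guest` column is created before the size attribute here because, in this sketch, the size is derived from it.

```python
import pandas as pd

# Toy stand-in for the loaded data. "Unnamed: 0" mimics the default name
# pandas gives to an index column that was saved into a CSV file.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "ratings": [7.5, 8.3, 9.0, 8.1],  # hypothetical column
    "guest_stars": [None, "Guest A", None, "Guest B, Guest C"],  # hypothetical
})

# Rename the unnamed default column (the new name is an assumption).
df = df.rename(columns={"Unnamed: 0": "episode_number"})

# Summary of the numeric columns.
print(df.describe())

# Scale the ratings to the range [0, 1] (min-max scaling).
ratings = df["ratings"]
df["scaled_ratings"] = (ratings - ratings.min()) / (ratings.max() - ratings.min())


def colour_for(score: float) -> str:
    """Map a scaled score to a colour name (thresholds are illustrative)."""
    if score < 0.25:
        return "red"
    if score < 0.50:
        return "orange"
    if score < 0.75:
        return "lightgreen"
    return "darkgreen"


# Assign a colour to each row according to the range its score falls in.
df["colour"] = df["scaled_ratings"].apply(colour_for)

# Flag rows that list at least one guest star.
df["has_guest"] = df["guest_stars"].notna()

# Size attribute: larger markers for rows with a guest star.
df["size"] = df["has_guest"].map({True: 250, False: 25})

# Guest stars: keep only the rows that have one.
print(df.loc[df["has_guest"], ["episode_number", "guest_stars"]])
```

The `colour` and `size` columns are in a form that could be passed to the `c` and `s` arguments of matplotlib's `scatter` function if a plot is wanted.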
