
Assignment! DS EN22CS301186

The document presents two detailed case studies on Exploratory Data Analysis, focusing on air quality data and movie ratings, highlighting data description, techniques used, results, and perspectives. Additionally, it discusses the challenges and scope of managing data science projects, emphasizing issues like data quality, integration, and model interpretability, while outlining the importance of cross-functional collaboration and iterative workflows. The analysis concludes with the necessity for continuous monitoring and adaptation in data science project management.

Anushtha Rathore

EN22CS301186

Assignment 02

Name: Anushtha Rathore


Class: 5D-CSE
Enrollment No.: EN22CS301186
Data Science
Submitted to: Pramod S. Nair


INDEX

S.no Questions

01 Do two detailed case studies on Exploratory Data Analysis


 Case Study 1: Analyzing Air Quality Data
 Description of Data
 Techniques
 Results
 Perspectives
 Diagram
 Case Study 2: Exploring Movie Ratings and Box Office Performance
 Description of Data
 Techniques
 Results
 Perspectives
 Diagram

Reference

02 What are the important challenges and scope of Data Science project management?
 Important Difficulties in Project Management for Data Science:
 Scope of Data Science Project Management

Reference

Question 1) Do two detailed case studies on Exploratory Data Analysis


Case Study 1: Analyzing Air Quality Data

This case study analyzes air quality data gathered from multiple monitoring stations around a metropolitan area. The objective is to understand how different factors affect air pollution levels, identify long-term patterns, and investigate relationships between meteorological variables and air quality indicators.

Description of Data
The dataset contains hourly air quality measurements with the following fields:
 DateTime: The timestamp of the measurement.
 PM2.5: The concentration of fine particulate matter, 2.5 µm in diameter or smaller (µg/m³).
 PM10: The concentration of particulate matter 10 µm in diameter or smaller (µg/m³).
 NO2: The concentration of nitrogen dioxide (µg/m³).
 Temperature: The ambient temperature in degrees Celsius.
 Humidity: The relative humidity, as a percentage.
 Wind Speed: The wind speed in km/h.
 Location: The location of the monitoring station.

Techniques
1. Data Cleaning
 Interpolated missing values, particularly for continuous measurements.
 Converted DateTime to proper datetime objects for time series analysis.

2. Descriptive Statistics
 Computed summary statistics (mean, median, max, min) for each air quality metric.
 Computed monthly averages to assess seasonal patterns.

3. Data Visualization
 Time Series Analysis: Line plots were made to show how PM2.5 levels changed over time.
 Correlation Heatmap: A heatmap was created to investigate relationships between the air pollutants and weather conditions.

4. Trend Analysis
 Used rolling averages to smooth out daily volatility and reveal longer-term patterns.
 Examined how air quality varied across seasons and across monitoring locations.
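
The cleaning and aggregation steps above can be sketched roughly as follows. This is only an illustration using the Python standard library; the function names and the four sample readings are hypothetical and are not taken from the case study's dataset or tooling.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def interpolate_missing(values):
    """Fill None gaps linearly between the nearest known neighbours.

    Gaps at the start or end are filled with the nearest known value.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def monthly_averages(timestamps, values):
    """Average the values that fall in each (year, month) bucket."""
    buckets = defaultdict(list)
    for ts, value in zip(timestamps, values):
        buckets[(ts.year, ts.month)].append(value)
    return {month: mean(vs) for month, vs in buckets.items()}


# Hypothetical hourly readings; None marks a missing PM2.5 value.
raw_times = ["2023-01-01 00:00", "2023-01-01 01:00",
             "2023-01-01 02:00", "2023-02-01 00:00"]
raw_pm25 = [40.0, None, 60.0, 30.0]

# Convert the text timestamps to datetime objects, then clean and aggregate.
timestamps = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in raw_times]
pm25 = interpolate_missing(raw_pm25)
by_month = monthly_averages(timestamps, pm25)
```

Linear interpolation suits slowly varying continuous measurements such as pollutant concentrations; long gaps would need a more careful treatment than this sketch provides.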


Results
 Trends in Air Quality: PM2.5 levels showed a clear seasonal pattern, with higher concentrations in winter owing to stable meteorological conditions and increased heating.
 Correlation Findings: There was a strong positive correlation (0.85) between PM2.5 and NO2, suggesting that vehicle emissions contribute substantially to particulate matter pollution.
 Weather Impact: Higher humidity was associated with lower PM2.5 levels, suggesting that moisture in the air may help particulates settle.
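
A correlation figure such as the one reported above is typically a Pearson coefficient, although the case study does not state which measure was used. The sketch below shows how such a coefficient could be computed; the five paired readings are made up for illustration and will not reproduce the reported value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical paired readings (µg/m³), not the study's data.
pm25 = [35.0, 50.0, 42.0, 80.0, 65.0]
no2 = [20.0, 31.0, 25.0, 44.0, 40.0]
r = pearson(pm25, no2)
```

A value near 1 means the two pollutants rise and fall together. It does not by itself show that one causes the other, which is why the vehicle-emissions explanation remains an interpretation.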

Perspectives
According to the analysis:
 Policies that reduce vehicle emissions could greatly improve air quality.
 Public awareness initiatives should concentrate on ways to reduce pollution during the winter.
 Increased monitoring during critical months can help reduce the health hazards linked to air pollution.

Diagram 1: PM2.5 Levels Over Time

 Type: Line Plot


 Description: This line plot shows the trend of PM2.5 levels over time, with the x-axis representing the date and the y-axis representing the PM2.5 concentration (µg/m³). A rolling average can be applied to smooth out daily fluctuations.
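
The rolling-average smoothing mentioned for this plot could look like the following minimal sketch. The window length and the alternating sample series are assumptions chosen to make the effect obvious.

```python
def rolling_mean(values, window):
    """Trailing rolling mean.

    Positions that do not yet have `window` observations yield None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_total = 0.0
    for i, value in enumerate(values):
        running_total += value
        if i >= window:
            # Drop the observation that just left the window.
            running_total -= values[i - window]
        smoothed.append(running_total / window if i >= window - 1 else None)
    return smoothed


# Hypothetical daily PM2.5 values that swing between two levels.
daily_pm25 = [30.0, 90.0, 30.0, 90.0, 30.0, 90.0]
smoothed = rolling_mean(daily_pm25, window=2)
```

Plotting `smoothed` instead of `daily_pm25` would show a flat line at 60, which illustrates how the window hides short-term swings while keeping the overall level.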

Case Study 2: Exploring Movie Ratings and Box Office Performance

In this case study, we examine a dataset of movie-related data, including ratings, genres, and box office results. The goal is to identify trends in the box office performance of films and determine the factors associated with high ratings and revenue.

Description of Data
The dataset includes details on 5,000 films, including:
 MovieID: A unique identifier assigned to every film.
 Title: The film's title.
 Genre: The film's genre, such as comedy or action.
 Rating: The average user rating, out of ten.
 BoxOffice: The total box office revenue (in millions).
 ReleaseYear: The year of the film's premiere.
 Runtime: The length of the film, in minutes.
 Director: The film's director.

Techniques
1. Data Cleaning
 Removed unnecessary columns and duplicate entries.
 Handled missing values by imputing the mean or median for box office receipts and ratings.

2. Descriptive Statistics
 Examined box office results and average ratings by genre.
 Looked at runtime distributions and how they related to revenue and ratings.

3. Data Visualization
 Box Plot: Box plots were made to compare box office receipts and ratings across genres.
 Scatter Plot: Scatter plots were created to investigate the relationships between runtime, box office receipts, and ratings.

4. Trend Analysis
 Examined patterns in revenue and ratings over time to determine whether certain genres became more well-liked.
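
The cleaning and per-genre summary steps can be sketched as below. The records are invented for illustration, the helper names are hypothetical, and the sketch uses median imputation, which is one of the two options the case study mentions.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records; None marks a missing value.
movies = [
    {"MovieID": 1, "Genre": "Action", "Rating": 7.0, "BoxOffice": 300.0},
    {"MovieID": 1, "Genre": "Action", "Rating": 7.0, "BoxOffice": 300.0},
    {"MovieID": 2, "Genre": "Action", "Rating": None, "BoxOffice": 100.0},
    {"MovieID": 3, "Genre": "Romance", "Rating": 6.0, "BoxOffice": None},
    {"MovieID": 4, "Genre": "Romance", "Rating": 8.0, "BoxOffice": 50.0},
]


def drop_duplicates(rows, key="MovieID"):
    """Keep the first row seen for each key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(dict(row))  # copy so the input is not mutated
    return unique


def impute_median(rows, column):
    """Replace missing values in `column` with the column median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    for row in rows:
        if row[column] is None:
            row[column] = fill
    return rows


def mean_by_genre(rows, column):
    """Average `column` within each genre."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Genre"]].append(row[column])
    return {genre: mean(vals) for genre, vals in groups.items()}


clean = drop_duplicates(movies)
for col in ("Rating", "BoxOffice"):
    impute_median(clean, col)
avg_box_office = mean_by_genre(clean, "BoxOffice")
```

The median is often preferred for box office figures because a few very high-grossing films can pull the mean upward.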


Results
 Genre Performance: Compared to genres like romance and documentaries, action and adventure films often earned higher average ratings and performed better at the box office.
 Runtime Analysis: Mid-length films, with runtimes between 90 and 120 minutes, generated higher revenue, which suggests that they are more popular with audiences.
 Director Influence: Movies by well-known directors typically had better box office results and ratings, suggesting that a director's reputation matters in the industry.

Perspectives
According to the analysis, it appears that:
 To maximize their appeal to audiences, movie studios should concentrate on making mid-length action and adventure films.
 Working with well-known directors could improve a movie's chances of success.
 Marketing plans should highlight genre trends so that they match audience preferences.

Diagram 2: Box Office Revenue by Genre


 Type: Box Plot


 Description: This box plot compares box office revenue (in millions) across different
movie genres. The x-axis represents the genre, while the y-axis represents box office
revenue. This visualization helps to identify which genres perform better financially.
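
A box plot is drawn from a five-number summary (minimum, first quartile, median, third quartile, maximum) for each group. As a rough sketch of what lies behind such a plot, the code below computes that summary per genre. The revenue figures are hypothetical, and the "inclusive" quartile method is just one common convention.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the statistics a box plot is drawn from."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": q2, "q3": q3, "max": data[-1]}


# Hypothetical box office revenue per genre, in millions.
revenue_by_genre = {
    "Action": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Documentary": [1.0, 2.0, 3.0, 4.0, 5.0],
}
summaries = {genre: five_number_summary(vals)
             for genre, vals in revenue_by_genre.items()}
```

Comparing the medians and the spread between `q1` and `q3` across genres gives the same reading as comparing the boxes in the plot.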

Reference:
https://research.ibm.com/publications/advances-in-exploratory-data-analysis-visualisation-and-quality-for-data-centric-ai-systems

Question 2) What are the important challenges and scope of Data Science project management?

Managing a data science project requires balancing organizational, strategic, and technical aspects. This section explains the main challenges and the scope associated with managing a data science project effectively.

Important Difficulties in Project Management for Data Science:


1. Data Gathering and Quality
 Challenge: Obtaining high-quality data is a significant problem for data science efforts. Data is frequently incomplete or unreliable.
 Example: Suppose you are working on a customer segmentation project and important demographic data is missing from half of your customer profiles. This undermines your ability to build a robust model.
 Solution: The key to resolving these problems is applying data cleaning and preparation procedures (such as handling missing data and normalization).
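
As a small illustration of the preparation procedures named in the solution, the sketch below measures how much of a field is missing and applies min-max normalization. The customer records and helper names are hypothetical.

```python
def missing_share(rows, column):
    """Fraction of rows in which `column` is missing (None or absent)."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical customer profiles; half are missing the age field.
customers = [
    {"id": 1, "age": 25, "annual_spend": 200.0},
    {"id": 2, "age": None, "annual_spend": 1000.0},
    {"id": 3, "age": 40, "annual_spend": 600.0},
    {"id": 4, "age": None, "annual_spend": 400.0},
]

share_missing_age = missing_share(customers, "age")
scaled_spend = min_max_normalize([c["annual_spend"] for c in customers])
```

Checking the missing share first helps decide whether a field can be imputed or should be dropped from the segmentation altogether.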

2. Complexity of Data Integration
 Challenge: Data in businesses is frequently gathered from a variety of sources, including web scraping, databases, APIs, sensors, and more. Combining these sources can expose differences in scale, units, or formats.


 Example: Combining historical data stored on cloud servers with real-time sensor data from an IoT network can be challenging because their formats and structures differ.
 Solution: To ensure smooth access, data engineers and analysts must set up appropriate integration pipelines using ETL (Extract, Transform, Load) processes and APIs.
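
A minimal ETL sketch is shown below, assuming two hypothetical sources: historical records with ISO timestamps and Fahrenheit temperatures, and sensor readings with epoch timestamps and Celsius temperatures. The field names, the sample values, and the in-memory SQLite target are all illustrative assumptions.

```python
import sqlite3
from datetime import datetime, timezone


def extract():
    """Stand-in for reading from a cloud store and an IoT feed."""
    historical = [{"recorded_at": "2024-01-01T00:00:00+00:00", "temp_f": 50.0}]
    sensor = [{"epoch": 1704070800, "temp_c": 12.5}]
    return historical, sensor


def transform(historical, sensor):
    """Bring both sources to one schema: UTC ISO timestamp, Celsius, source."""
    rows = []
    for rec in historical:
        ts = datetime.fromisoformat(rec["recorded_at"]).astimezone(timezone.utc)
        celsius = (rec["temp_f"] - 32.0) * 5.0 / 9.0
        rows.append((ts.isoformat(), celsius, "historical"))
    for rec in sensor:
        ts = datetime.fromtimestamp(rec["epoch"], tz=timezone.utc)
        rows.append((ts.isoformat(), rec["temp_c"], "sensor"))
    return rows


def load(rows, connection):
    """Write the unified rows into a single table."""
    connection.execute(
        "CREATE TABLE readings (recorded_at TEXT, temp_c REAL, source TEXT)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)
stored = conn.execute(
    "SELECT recorded_at, temp_c, source FROM readings ORDER BY recorded_at"
).fetchall()
```

Keeping extract, transform, and load as separate functions makes it easier to add a new source later: only a new extractor and its transformation rule are needed.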

3. Selecting Appropriate Tools and Technologies
 Challenge: Selecting the best combination is difficult because of the abundance of tools, platforms (AWS, Azure), frameworks (TensorFlow, PyTorch), and programming languages (Python, R).
 Example: On a recommendation system project, you may be unsure which libraries will handle your dataset best and whether to use neural networks or collaborative filtering models.
 Solution: The choice should take into account scalability requirements, the team's skill level, and project needs. Early alignment with stakeholder expectations helps prevent misunderstandings.

4. Setting Specific Business Objectives
 Challenge: Data science initiatives frequently have unclear goals or are not aligned with business requirements. Even a technically sound model may be meaningless without a defined objective.
 Example: A retail company's goal to "predict sales" is too broad to address with machine learning. Which should take precedence: improving customer retention, optimizing pricing, or predicting seasonal trends?
 Solution: It is critical to work closely with business stakeholders to define quantifiable targets (such as a 10% increase in revenue). Regular follow-ups keep data science initiatives and corporate strategy in sync.

5. Model Trustworthiness and Interpretability
 Challenge: Some machine learning models, especially deep learning models, behave like "black boxes" that non-technical stakeholders find difficult to understand.
 Example: A financial institution uses a neural network for credit scoring, but regulators want to know how decisions are made. Without interpretability, it is difficult to justify or trust the model's predictions.


 Solution: Model interpretability tools such as SHAP and LIME allow data scientists to explain model outputs in a human-readable manner. When interpretability matters as much as accuracy, choosing a simpler model is another option.
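
To give a feel for model-agnostic interpretability without depending on any particular library, the sketch below implements permutation importance. This is a simpler technique than SHAP or LIME and is not a substitute for them. The toy model, the data, and the function names are all hypothetical.

```python
import random


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in mean absolute error when one feature column is shuffled.

    A larger increase means the model relies more on that feature.
    """
    rng = random.Random(seed)

    def mae(data):
        errors = [abs(predict(r) - t) for r, t in zip(data, targets)]
        return sum(errors) / len(errors)

    baseline = mae(rows)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by a shuffled value.
        shuffled = [r[:j] + [column[i]] + r[j + 1:] for i, r in enumerate(rows)]
        importances.append(mae(shuffled) - baseline)
    return importances


def toy_model(row):
    """Stand-in for a trained black-box model; it only uses feature 0."""
    return 3.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
targets = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, targets, n_features=2)
```

Here the second score is zero because the toy model ignores that feature, while the first is positive. A stakeholder can read the ranking directly, without needing to understand the model's internals.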

Scope of Data Science Project Management


1. Cross-Functional Collaboration
 Scope: Data science projects involve more than just data scientists. Data engineers, analysts, business teams, and occasionally the legal or compliance departments must work together to complete them.
 Impact: The scope includes establishing clear lines of communication between departments, defining their responsibilities, and making sure that their contributions support the project's objectives.

2. Adaptable and Iterative Workflow
 Scope: Because data science is experimental by nature, projects require iterative approaches. Models are regularly developed, tested, and improved in response to user feedback, new data, and shifting goals.
 Impact: Project managers should use Agile methodologies to account for the iterative nature of model development. Regular review cycles help models adapt to changing datasets or business needs.

3. End-to-End Pipeline Construction
 Scope: Project management goes beyond creating the model; it covers every step of the pipeline, including data collection, cleaning, modeling, evaluation, deployment, and monitoring.
 Impact: A well-managed project scope helps the model move from development to production deployment with minimal disruption.

4. Model Maintenance and Monitoring
 Scope: Work continues after deployment. Models must be continuously monitored and retrained so that they keep performing well as new data becomes available.


 Impact: To prevent models from becoming skewed or out of date over time, project managers should include provisions for model monitoring and retraining in the project plan.
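
One simple way to turn monitoring into a concrete check is to compare incoming data against the data the model was trained on. The sketch below flags retraining when the mean of a feature drifts by more than a chosen number of training standard deviations. The threshold, the sample values, and the function names are assumptions for illustration; production monitoring usually tracks several signals.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """Shift in the mean, measured in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation to compare against")
    return abs(mean(recent) - mean(baseline)) / spread


def needs_retraining(baseline, recent, threshold=2.0):
    """Flag the model for retraining when drift exceeds the threshold."""
    return drift_score(baseline, recent) > threshold


# Hypothetical values of one input feature.
training_values = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0]
stable_batch = [10.5, 11.5, 11.0, 10.0]
shifted_batch = [18.0, 19.0, 17.5, 20.0]

stable_flag = needs_retraining(training_values, stable_batch)
shifted_flag = needs_retraining(training_values, shifted_batch)
```

Scheduling a check like this after each new batch of data gives the project plan a defined trigger for retraining instead of relying on ad hoc reviews.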

5. Measuring Success and Impact
 Scope: Measuring the project's success is the final stage of any data science effort. This may include examining KPIs (Key Performance Indicators), cost reductions, increased revenue, or improved efficiency.
 Impact: Determining the true impact of a data science effort requires defining success criteria early and monitoring performance metrics on an ongoing basis.
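
Success criteria defined early can be checked mechanically once results come in. The sketch below compares the relative change in each KPI with its target. The KPI names, figures, and targets are hypothetical, and a negative target is taken to mean that a reduction is desired.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (0.10 means +10%)."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return (after - before) / before


def kpi_report(baseline, current, targets):
    """Compare each KPI's relative change with its target."""
    report = {}
    for name, target in targets.items():
        change = relative_change(baseline[name], current[name])
        # Positive target: need at least that growth.
        # Negative target: need at least that reduction.
        met = change >= target if target >= 0 else change <= target
        report[name] = {"change": change, "target": target, "met": met}
    return report


# Hypothetical figures measured before and after the project.
baseline = {"revenue": 200.0, "cost": 80.0}
current = {"revenue": 230.0, "cost": 76.0}
targets = {"revenue": 0.10, "cost": -0.10}

report = kpi_report(baseline, current, targets)
```

In this made-up case revenue grew by 15% against a 10% target, while cost fell by only 5% against a 10% reduction target, so the report marks the first KPI as met and the second as not met.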

6. Managing Changing Technologies
 Scope: Data science tools and methodologies evolve quickly, which leaves room for ongoing innovation in how projects are managed.
 Impact: To give their teams better performance, scalability, or more efficient processes, managers must keep up with the latest technologies and frameworks.

Managing a data science project brings a distinct set of challenges, ranging from scale and legal constraints to data collection and model interpretability. The scope of these projects is also broad, extending from cross-departmental collaboration to post-deployment monitoring and iteration. Because data science is a dynamic field, project management in this discipline must be highly collaborative, flexible, and agile.

Reference:
https://iabac.org/blog/challenges-solutions-in-implementing-data-science-projects-in-industry
