
DATA ANALYTICS PORTFOLIO

Prepared by: Prasad Shinde

Professional Background:

Hi, my name is Prasad. I'm currently working remotely as a Data Analyst Intern at Trainity. I hold a Bachelor of Engineering degree in Computer Science from the University of Mumbai with a 7.18 CGPA. My skills include Data Analysis, SQL, Microsoft Excel, Python, and Tableau.

Since I was young, I have enjoyed solving puzzles, and that is how I look at big data sets: each one is a puzzle I want to solve. Finding patterns nobody else sees is the challenge for me.

As a Data Analyst, I know how data can be used optimally within a business. Based on your objectives, we will make a plan to reach the right insights.

I would love to experience the real challenges of the corporate world and understand how things work. I am flexible and quick to learn new tools and technologies, and I am eager to apply my theoretical knowledge in practice.
Table of Contents:

Module 1 project – Data Analytics Process: Application in a Real-Life Scenario
Module 2 project – SQL Fundamentals: Instagram User Analytics
Module 3 project – Advanced SQL: Operation Analytics & Investigating Metric Spike
Module 4 project – Statistics: Hiring Process Analytics
Final Project 1 – IMDB Movie Analysis
Final Project 2 – Bank Loan Case Study
Final Project 3 – XYZ Ads Airing Report
Final Project 4 – ABC Call Volume Trend
Appendix
Module 1 Project – Data Analytics Process: Application in a Real-Life Scenario

Our task was to pick a real-life situation in which data analytics is used and map it to the steps of the data analytics process.

In conclusion, after a thorough analysis we were able to derive insights from the data. Data that at first looked useless became useful: analyzing it proved helpful in finding various issues across the courses.
Module 2 Project – SQL Fundamentals: Instagram User Analytics
We worked with Instagram's product team: the Product Manager asked us to provide insights on questions raised by the management team.

Tech-Stack Used:
MySQL by Oracle Corporation – creating the database and running SQL queries to get insights
Word by Microsoft Corporation – creating the project report

Insights & Results:
A) Marketing:
1. We found the 5 oldest users of Instagram.

2. We found the users who have never posted a single photo on Instagram.

3. We identified the winner of the contest and provided their details to the team.

4. We identified and suggested the top 5 most commonly used hashtags on the platform.

5. We determined which day of the week most users register on and provided insights on when to schedule an ad campaign. (A sketch of two of these queries follows.)
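As an illustration, here is a minimal MySQL sketch of two of these queries. The table and column names (users(id, username, created_at)) are assumptions based on a typical schema for this project, not confirmed by the report:

    -- 5 oldest users of the platform (hypothetical users table)
    SELECT username, created_at
    FROM users
    ORDER BY created_at ASC
    LIMIT 5;

    -- Registrations by day of week, busiest day first
    SELECT DAYNAME(created_at) AS weekday, COUNT(*) AS registrations
    FROM users
    GROUP BY DAYNAME(created_at)
    ORDER BY registrations DESC;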

B) Investor Metrics:
1. We calculated how many times the average user posts on Instagram, i.e. the total number of photos on Instagram divided by the total number of users.

2. We provided data on users (bots) who have liked every single photo on the site. (Sketches of both queries follow.)
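Again a minimal sketch under assumed table names (photos(id, user_id) and likes(user_id, photo_id)); the actual schema may differ:

    -- Average posts per user: total photos / total users
    SELECT (SELECT COUNT(*) FROM photos) / (SELECT COUNT(*) FROM users) AS avg_posts_per_user;

    -- Accounts that have liked every single photo (likely bots)
    SELECT u.id, u.username
    FROM users u
    JOIN likes l ON l.user_id = u.id
    GROUP BY u.id, u.username
    HAVING COUNT(DISTINCT l.photo_id) = (SELECT COUNT(*) FROM photos);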

Module 3 Project – Advanced SQL: Operation Analytics & Investigating Metric Spike
We worked as a Data Analyst Lead for a company like Microsoft. We were given several datasets and tables from which we had to derive insights and answer questions from different departments.

Tech-Stack Used:
MySQL by Oracle Corporation – importing the database and running SQL queries to get insights
Excel by Microsoft Corporation – extracting & manipulating data
Word by Microsoft Corporation – creating the project report

Insights & Results:

We executed SQL queries on the given database to create insights that help the teams make data-driven decisions.
The SQL queries gave the following insights:
A) Case Study 1 (Job Data):
1. We calculated the number of jobs reviewed per hour per day for November 2020.

2. We calculated the 7-day rolling average of throughput. We prefer a 7-day rolling average for throughput because rolling averages reveal long-term trends that occasional fluctuations would otherwise disguise (see the sketch below this list).
3. We calculated the percentage share of each language over the last 30 days.
4. We displayed duplicates from the table.
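A minimal sketch of the rolling-average calculation, assuming a hypothetical job_data table with one row per reviewed job and a ds date column (window functions require MySQL 8+):

    SELECT ds,
           AVG(jobs_reviewed) OVER (
               ORDER BY ds
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_avg_7d
    FROM (
        -- daily throughput: jobs reviewed per day
        SELECT ds, COUNT(job_id) AS jobs_reviewed
        FROM job_data
        GROUP BY ds
    ) AS daily;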
B) Case Study 2 (Investigating Metric Spike):
1. We calculated the weekly user engagement (a sketch follows this list).
2. We calculated the user growth for the product.
3. We calculated the weekly retention of users by sign-up cohort.
4. We calculated the weekly engagement per device.
5. We calculated the email engagement metrics.
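A sketch of the weekly-engagement query under an assumed events table with user_id, occurred_at, and event_type columns (names are illustrative, not confirmed by the report):

    SELECT YEARWEEK(occurred_at) AS yearweek,
           COUNT(DISTINCT user_id) AS weekly_active_users
    FROM events
    WHERE event_type = 'engagement'  -- assumed label for engagement events
    GROUP BY YEARWEEK(occurred_at)
    ORDER BY yearweek;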

In conclusion, after a thorough analysis we were able to derive insights from the data and plot various graphs with it. Data that at first looked useless became useful, revealing insights on the job data that had been a burden for Microsoft. Analyzing the data proved helpful in finding various issues in the data provided.

Module 4 Project – Statistics: Hiring Process Analytics
We worked as a lead Data Analyst for an MNC such as Google. The company provided records of its previous hiring rounds and asked us to answer certain questions by making sense of that data.
Tech-Stack Used:
Excel by Microsoft Corporation – extracting & manipulating data
Word by Microsoft Corporation – creating the project report

Insights & Results:

1. We determined the number of males and females hired.

2. We determined the average salary offered in the company.

3. We drew the class intervals for salaries in the company.

4. We drew a bar graph showing the proportion of people working in different departments.

5. We represented the different post tiers using a pie chart.

In conclusion, after a thorough analysis we were able to derive insights from the data and plot various graphs with it. Data that at first looked useless became useful, revealing insights on the hiring data that had been a burden for Google. Analyzing the data proved helpful in finding various issues in the data provided.

Final Project-1: IMDB Movie Analysis
For our final project, we were given a dataset with various columns describing IMDB movies. We were required to frame the problem ourselves: define a question we wanted to shed light on.
Tech-Stack Used:
Excel by Microsoft Corporation – cleaning, extracting & manipulating the data
Word by Microsoft Corporation – creating the project report

Insights & Results:

1. Data cleaning:
- Removed duplicates
- Removed "" from the dataset
- Dropped irrelevant columns such as color, director_facebook_likes, actor_1_facebook_likes, actor_2_facebook_likes, actor_3_facebook_likes, actor_2_name, cast_total_facebook_likes, actor_3_name, duration, facenumber_in_poster, content_rating, country, movie_imdb_link, aspect_ratio, plot_keywords
- Found and replaced "NaN" values with "English"
- Removed all remaining null values

2. Movies with the highest profit:
Created a "Profit" column, computed as Gross minus Budget, and sorted the movies with Sort & Filter.
The movies with the highest profits:
Avatar – $523,505,847
Jurassic World – $502,177,271
Titanic – $458,672,302
Star Wars: Episode IV – A New Hope – $449,935,665
E.T. the Extra-Terrestrial – $424,449,459
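As a worked check, using the gross and budget figures reported in the dataset (which may differ from other published figures): Avatar grossed $760,505,847 against a budget of $237,000,000, so Profit = $760,505,847 − $237,000,000 = $523,505,847, matching the list above.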

3. IMDB Top 250:

We extracted the IMDB Top 250 by ranking the movies on IMDB score.

4. Best Directors:
We found the best directors by the IMDB ratings of their movies:
John Blanchard – 9.5
Frank Darabont – 9.3
Francis Ford Coppola – 9.2
John Stockwell – 9.1
Christopher Nolan – 9.0
Francis Ford Coppola – 9.0
Peter Jackson – 8.9
Sergio Leone – 8.9
Steven Spielberg – 8.9
Quentin Tarantino – 8.9

5. Popular genres:
We found the most popular genres by the IMDB ratings of their movies:
Comedy – 9.5
Crime | Drama – 9.3
Crime | Drama – 9.2
Drama – 9.1
Drama – 9.1
Action – 9.1
Action | Crime | Drama | Thriller – 9.0
Crime | Drama – 9.0
Crime | Drama | Thriller – 9.0
Action | Adventure | Drama | Fantasy – 8.9

6. Critic-favourite and audience-favourite actors:

We identified the critic-favourite and audience-favourite actors: Leonardo DiCaprio is the favourite of both critics and audiences.

In conclusion, after a thorough analysis we were able to derive insights from the data and plot various tables with it. Data that at first looked useless became useful and revealed insights on the IMDB movies dataset. Analyzing the data proved helpful in finding various issues in the data provided.

Final Project-2: Bank Loan Case Study
For our final project, we were given 3 datasets of bank loan details. Lending banks find it hard to grant loans to people with insufficient or non-existent credit history, and some consumers exploit this by defaulting. Suppose we work for a consumer finance company that specialises in lending various types of loans to urban customers. We used EDA to analyse the patterns present in the data and to ensure that applicants capable of repaying a loan are not rejected.

Tech-Stack Used:

Excel by Microsoft Corporation – carrying out EDA on the datasets & visualisation
Jupyter Notebook by Project Jupyter – carrying out EDA & visualisation
Word by Microsoft Corporation – creating the project report

Insights & Results:

Current Applications (application_data.csv):
Income Groups & Gender | Income Groups & Age Groups

Observations:
- Clients in high income groups default less than those in low income groups.
- Mid-age and senior age groups, across all income groups, default less.
Recommendations:
- It is safe to grant loans to mid-age and senior clients with higher incomes.
- It is risky to grant loans to young clients with low incomes.

Family Status & Age Group | Family Status & Gender

Observations:
- Seniors, irrespective of family status, are less likely to default.
- Young clients are more likely to default across all family statuses.
- Males are more likely to default than females.
Recommendations:
- It is safe to grant loans to senior clients of any family status.
- It is risky to grant loans to young clients who are single, separated, or in a civil marriage.
Credit Amount Group & Income Group | Credit Amount Group & Age Group

Observations:
- Across all income groups, clients with medium credit amounts default the most, followed by low and high credit amounts.
- Young clients with medium and low credit amounts are the most likely to default.
Recommendations:
- It is very risky to grant medium or low credit amounts to young clients.

Educational Qualification & Gender | Profession & Gender

Observations:
- Clients with higher education default less; clients with only lower secondary education default more.
- Unemployed clients and clients on maternity leave default very often.
Recommendations:
- It is safe to grant loans to clients with higher education across professions, except unemployed clients and those on maternity leave.


Loan Application Status Relations
Current & Previous (application_data.csv & previous_application.csv):
Previous Loan Status & Gender | Previous Loan Status & Client Type

Observations:
- Previously refused and unused-offer applications defaulted more among males.
- New clients with previously unused offers default more.
Recommendations:
- It is safe to provide loans to previously approved female clients.
- It is risky to grant loans to clients whose applications were previously refused or whose offers went unused.

Age Group & Previous Loan Status | Income Group & Previous Loan Status

Observations:
- Young clients who were previously refused default at a high rate.
- Senior clients default less, irrespective of their previous loan status.
- Previously refused applicants default at a high rate across all income groups.
Recommendations:
- It is safe to grant loans to senior clients.
- It is less risky to grant loans to previously approved applicants in all income groups.

Portfolio & Previous Loan Status | External Source Score & Previous Loan Status

Observations:
- Previous applications for cards & POS defaulted the most.
- Previously refused applications for cash defaulted at a high rate.
- Clients with low external source scores default at a high rate.
Recommendations:
- It is safe to grant a loan in any portfolio to previously approved clients.
- It is highly risky to grant loans to clients with poor external source scores whose previous applications were refused, cancelled, or unused.

We carried out EDA on the given datasets using Microsoft Excel and Jupyter Notebook (Python) and arrived at the following answers, which support data-driven decisions.

Highly Recommended Groups:
- Clients approved in previous applications
- Clients with higher education and high income
- Clients with high external source scores
- Senior clients in all categories
- Married clients
- Females

High Risk Groups:
- Clients whose previous applications were refused, cancelled, or unused
- Low income groups with previously refused status
- Unemployed clients
- Clients with poor external source scores
- Young clients
- Clients educated to lower secondary & secondary level
In conclusion, after a thorough analysis we were able to derive insights from the data and plot various tables with it. Data that at first looked useless became useful and revealed insights on the current and previous loan application datasets. Analyzing the data helped us identify which clients can and cannot be given loans.

Final Project-3: XYZ Ads Airing Report
For our final project, we were given a dataset of TV ad airings: the brands, their products, and their categories. The dataset includes the network on which each ad aired, the network type (cable/broadcast), and the show on which the ad ran. It also covers dayparts, time zones, and the date and time of each airing, along with pod position (the lower the number, the more valuable the slot), the on-screen duration of each ad, equivalent (EQ) units, and the total amount spent on the ads aired.

Tech-Stack Used:

Excel by Microsoft Corporation – For analysing the data from the given
dataset Word by Microsoft Corporation – For creating the project report

Insights & Results:

1. What is pod position, and does the pod position number affect the amount a company spends on ads over a given period?

An ad pod is a term used in CTV advertising for multiple ads sequenced together and played back-to-back within a single ad break, as on traditional linear TV. Pods allow publishers to return multiple ads from a single ad request and play them in sequence.

Publishers with longer-form content can leverage the controls offered by ad podding to set up more advanced monetization strategies for their streaming content. Rather than fixing each ad slot to a set duration, advanced ad-podding solutions can return the highest-yielding ads per slot based on overall revenue. For example, a single 30-second ad could be replaced by two 15-second ads if that makes more money for the publisher, or vice versa. A publisher without an advanced ad-podding solution would miss the opportunity to capture this incremental ad revenue.

2. The share of various brands in TV airings and how it changed from Q1 to Q4 of 2021:
- The following table shows each brand's percentage share of spend, quarter by quarter. Maruti Suzuki has the highest share of spend among all brands in every quarter.

- The following table shows each brand's percentage share of ads aired, quarter by quarter. Maruti Suzuki has the highest share of ads aired in every quarter.

- The following table shows each brand's percentage share of EQ units, quarter by quarter. Maruti Suzuki has the highest share of EQ units among all brands in every quarter.

3. Competitive analysis of the brands: each brand's advertising strategy and how it differs across brands:
(Chart: Honda Cars – count of ads by daypart, Q1–Q4.)
(Chart: Hyundai Motors India – count of ads by daypart, Q1–Q4.)
(Chart: Mahindra and Mahindra – count of ads by daypart, Q1–Q4.)
(Chart: Maruti Suzuki – count of ads by daypart, Q1–Q4.)
(Chart: Tata Motors – count of ads by daypart, Q1–Q4.)
(Chart: Toyota – count of ads by daypart, Q1–Q4.)
Dayparts covered: Daytime, Early Fringe, Early Morning, Evening News, Late Fringe, Overnight, Prime Access, Prime Time, Weekend.

The first table (EQ units by pod position) shows that Maruti Suzuki has the highest contribution of EQ units in the first pod position among the top 5 pod positions.

The second table (EQ units by day of week) shows that Maruti Suzuki's share of EQ units is the highest on all days, and markedly higher on weekends, which is not the case for the other brands.

The third table (EQ units by month) shows that Maruti Suzuki holds the majority share of EQ units in JAN, MAY, AUG, and OCT, while Mahindra and Mahindra has the majority of its EQ-unit share in MAY & AUG.

4. A media plan for the CMO of Mahindra and Mahindra, and which audience they should target:
As the competitive analysis charts show, 4 out of 6 brands aired the majority of their ads during daytime in the first quarter. It would therefore be profitable to run ads in this daypart.

The table shows that all brands place the majority of their EQ-unit share in the first pod position in the first quarter. It would therefore be profitable to run the digital ad in the first pod position.

Additional actionable insights:

- We can find the shows with the most viewers across dayparts and run digital ads against similar content on other platforms.
- We can also segment age groups by show genre and target the same audiences across various platforms.

In conclusion, after a thorough analysis we were able to derive insights from the data and plot various tables with it. Data that at first looked useless became useful and revealed insights on the ads airing dataset. The analysis showed which pod positions and dayparts ads should run in, and we suggested a media plan and target audience to the CMO of Mahindra & Mahindra.

Final Project-4: ABC Call Volume Trend
For our final project, we were given 23 days of data from a Customer Experience (CX) inbound calling team. The data includes Agent_Name, Agent_ID, Queue_Time (how long a customer waits before connecting to an agent), Time (when the customer called), Time_Bucket (the call time bucketed by hour, provided for convenience), Duration (how long the customer and the agent are on the call), Call_Seconds (the duration converted to seconds, for simplicity), and call status (abandoned, answered, or transferred).

By solving our customers' problems and helping them succeed with our product or service, we can delight our customers and turn them into a growth engine for our business.
Tech-Stack Used:

Excel by Microsoft Corporation – carrying out EDA on the datasets & visualisation
Word by Microsoft Corporation – creating the project report

Insights & Results:

a. Average call duration for all incoming calls received by agents:

Average call time (seconds) in each time bucket:
9_10 – 198.7373282
10_11 – 202.5938769
11_12 – 198.6600372
12_13 – 191.1536695
13_14 – 193.2963998
14_15 – 191.9543656
15_16 – 195.8571429
16_17 – 198.2948638
17_18 – 197.8801445
18_19 – 200.1208565
19_20 – 202.4782232
20_21 – 202.5173611
Grand Total – 196.9626009

b. Total volume / number of incoming calls:

Count of Call_Status per time bucket:
9_10 – 9588
10_11 – 13313
11_12 – 14626
12_13 – 12652
13_14 – 11561
14_15 – 10561
15_16 – 9159
16_17 – 8788
17_18 – 8534
18_19 – 7238
19_20 – 6463
20_21 – 5505
Grand Total – 117988

(Bar chart: total volume of calls per time bucket.)

c. Minimum number of agents required in each time bucket:

Count of Call_Status by status:
abandon – 29.16%
answered – 69.88%
transfer – 0.96%
Grand Total – 100.00%

As we can see, the current abandon rate is ~30%. We propose a manpower plan for each time bucket (9AM–9PM) to reduce the abandon rate to 10%.

Assumptions: an agent works 6 days a week; unplanned leave averages 4 days per agent per month; an agent's workday is 9 hours, of which 1.5 hours go to lunch and snacks; on average an agent is occupied on calls for 60% of actual working hours (i.e. 60% of 7.5 hours); a month has 30 days.

Incoming calls (9AM–9PM): 8378
Working hours of each agent: 9
Average call handling time (seconds): 196
Occupancy on average: 60%
Call handling capacity: 844.736
Minimum agents required: 99.1837
Head count required: 130.505

Call handling capacity = (Total calls × AHT) / ((9 hours × 60 min × 60 s) × Occupancy)
Minimum agents required = Total calls / Call handling capacity
Head count required = Minimum agents required / 0.76 (minimum shrinkage factor of 0.76)
Manpower required per time bucket = Head count required / number of time buckets = 130.505 / 12 = 10.87

d. Manpower plan required during each time bucket at night (9PM–9AM):

Incoming calls per night (9PM–9AM): 30
Working hours of each agent: 9
Average call handling time (seconds): 196
Occupancy on average (as given): 60%
Call handling capacity: 330.612
Minimum agents required: 0.091
Head count required: 0.336

Call handling capacity = (Working hours of each agent × 60 × 60 × 60) / (Incoming calls per night × average call handling time in seconds)
Minimum agents required = Total calls / Call handling capacity = 30 / 330.612 ≈ 0.091
Head count required = Minimum agents required / 0.76 (minimum shrinkage factor of 0.76)
Manpower required per time bucket = Head count required / number of time buckets = 0.336 / 12 = 0.028

In conclusion, after a thorough analysis we were able to derive insights from the data and plot various tables with it. Data that at first looked useless became useful and revealed insights on the calls dataset. Analyzing the data proved helpful in finding the average call duration, the volume of calls, and the minimum number of agents required in each time bucket.

Appendix:
1. Google Drive link for Module 1, Data Analytics Process:
https://docs.google.com/presentation/d/1329p1Dc1swc4q8L13ailbW_veuiiIrxk/edit?usp=share_link&ouid=101056032751701574408&rtpof=true&sd=true

2. Google Drive link for Module 2, SQL Fundamentals – Instagram User Analytics:
https://drive.google.com/file/d/1Aqt4VjwbsZYXq_oYiWpyjCyiregLFQs/view?usp=share_link

3. Google Drive link for Module 3, Advanced SQL – Operation & Metric Analytics:
https://drive.google.com/file/d/1P9Kt2xT2hYml_1RTxdWXjEIlv43eC0v/view?usp=share_link

4. Google Drive link for Module 4, Statistics – Hiring Process Analytics:
https://drive.google.com/file/d/14EgUtozpQj3ZAMAKdwalGRqGJaIGG1zc/view?usp=share_link

5. Google Drive link for Final Project-1, IMDB Movie Analysis:
https://drive.google.com/file/d/1Vqt5WXoPXsmT_ywu2O6JrXTjs4EVFf7c/view?usp=share_link

6. Google Drive link for Final Project-2, Bank Loan Case Study:
https://drive.google.com/file/d/1ZIzfXhONU5ntiate0dnwbZW4eir5tvT/view?usp=share_link

7. Google Drive link for Final Project-3, XYZ Ads Airing Report:
https://drive.google.com/file/d/1oyzPQcSTg8boDk7c4P3RvCPXLRz5X_8Z/view?usp=share_link

8. Google Drive link for Final Project-4, ABC Call Volume Trend:
https://drive.google.com/file/d/14TzUrGxkQLgSZ67Cl2WQgk3Rn2K7MCj/view?usp=share_link

Google Drive link for all portfolio projects:
https://drive.google.com/drive/folders/15A1EFbT-bxsl3SeEPFdiXPmcNSp8Uz7S?usp=sharing

