
Milestone 1

The document provides an overview of data analytics, detailing the data analysis process, types of data (structured, semi-structured, unstructured), and various analytical techniques (descriptive, diagnostic, predictive, prescriptive). It emphasizes the importance of understanding business problems, data acquisition, and the role of storytelling in presenting insights. Additionally, it covers practical applications in healthcare analytics and the use of spreadsheets for data organization and analysis.

Uploaded by

asthaarya0298
Copyright
© All Rights Reserved

MILESTONE 1

Introduction To Data Analytics

-The data analysis process involves gathering all the information, cleaning
data, transforming data, modelling data and using it to find patterns and other
insights.

-Data analysis is defined as a process of cleaning, transforming, and modelling
data to discover useful information for business decision-making. The purpose of
data analysis is to extract useful information from data and make decisions
based on that analysis.

Data is a collection of unorganised facts & figures and does not provide any
further information regarding patterns, context, etc.
Facts are data but processed data is information.
Data comes before the information.

Structured data is well defined and in tabular form, with rows and
columns.
Age information can be stored in a well-defined format, such as date of birth
and age. But customer reviews, pictures and customer complaints cannot be
defined this way, as their format changes from person to person. They are
unstructured in nature (covered in later lectures).
Semi-structured data has no predefined format; its elements are separated using
tags and other markers. This data is useful in data analysis as well.

Tabular data: structured data
Image and voice: unstructured data
Emails, for example, are semi-structured by Sender, Recipient, Subject, Date, etc.,
but the content of an email is unstructured in nature.
Log data is collected by applications, websites and instant messaging platforms
to record the interactions between a user and a system. Log files hold a record of
activity on a web server. This data is semi-structured in nature.

Structured Data: This kind of data is stored in tabular format as rows and
columns.
Semi-Structured Data: It doesn’t have a defined structure, but the data is separated
using tags and markers.
Unstructured Data: It has no structure. This kind of data requires an
algorithm to convert it into some structure.
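
The three categories above can be illustrated with a small Python sketch. All field names and values here are made up for illustration; the JSON string stands in for any tagged format such as an email header or a log line.

```python
import json

# Structured: every record has the same columns, like rows in a table.
structured_rows = [
    {"patient_id": 1, "age": 42},
    {"patient_id": 2, "age": 35},
]

# Semi-structured: no fixed table, but keys/tags mark up the content.
semi_structured = '{"sender": "a@example.com", "subject": "Report", "body": "See attached."}'
email = json.loads(semi_structured)

# Unstructured: free text with no tags; some algorithm must impose structure.
unstructured = "The staff were friendly but the waiting time was too long."
word_count = len(unstructured.split())  # a very crude form of "structure"
```

Here the keys of the parsed email play the role of tags, while the review text only becomes analysable after some processing step (a simple word count in this sketch).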

Data analytic techniques enable you to take raw data and uncover patterns to
extract valuable insights from it. The data gives information, insights and helps in
making impactful decisions.

Descriptive analysis answers the “what happened” by summarizing past data.
Diagnostic analysis takes the insights found from descriptive analytics and drills
down to find the causes of those outcomes (why). Predictive analysis attempts to
answer the question “what is likely to happen” (future). Prescriptive analysis is
the frontier of data analysis, combining the insight from all previous analyses to
determine the course of action to take in a current problem or decision.

Descriptive analysis is carried out after understanding historical data. In this
scenario, old symptoms or treatments help in descriptive analysis and in
understanding the current scenario.
Descriptive analysis is about the demographics of the customers. But in option
(D), the results are analysed based on customers' interests, which is not part of
descriptive analysis.
Diagnostic analysis takes the insights found from descriptive analytics and drills
down to find the causes of those outcomes by asking the question “Why did it
happen?”
The doctor is trying to make connections between past data and identify patterns
in the behaviour of the illness, hence doing diagnostic analysis.

Analysts must identify areas that require further investigation based on the
findings of descriptive analysis, because there are questions that cannot be
answered merely by viewing the data. Diagnostic analysis helps find any gaps. It
also finds the root cause of a problem affecting the business, for example
why there has been a sudden change in traffic to a website with no apparent
cause.
Reporting traffic conditions falls under descriptive analysis, and predicting
travel time is predictive analysis. But when Google tries to find out why Google
Maps is showing a longer time for the shortest path, or why it is routing through
a narrow or non-vehicle-friendly road to reach the destination, that is
diagnostic analysis.
Predictive analysis attempts to answer the question “what is likely to happen”.
This type of analytics utilizes previous data to make predictions about future
outcomes.

Predictive analytics utilizes previous data to make predictions about future
outcomes. The doctor will compare the past and present reports of the patient and
give the necessary recommendations.
Benefits of predictive analytics include decreasing business risks, foreseeing
potential problems and predicting an effective solution to the problems.
Predictive analysis forecasts the near-future trends of any industry.

Prescriptive analysis is the frontier of data analysis, combining the insight from all
previous analyses to determine the course of action to take in a current problem
or decision. It determines what should be done.
By using all the previous analyses, prescriptive analytics helps a
business or organisation make sound decisions about its various activities.

Prescriptive Analytics combines the insight from the last step, that is Predictive
Analytics, to determine the course of action to take in a current problem or
decision.
The e-commerce clothing company is collecting data and summarising past data,
hence doing Descriptive Analysis.
Now prescriptive analytics can be of assistance on the matter and help determine
options for action. Perhaps an algorithm can detect the learners who require that
new course, but lack that particular skill, and send an automated
recommendation that they take an additional training resource to acquire the
missing skill.
By predicting what exactly you are searching and combining it with your past
results.

Structured: health records (previous).
Unstructured: images, sonography, sensors, etc. Semi-structured: temperature /
oxygen logs over time.
Healthcare analytics can find things such as patient trends (admission to discharge),
budget performance for specific departments, rate of tests, etc. Analytics in
healthcare improves diagnostics, reduces high costs and enhances operational
efficiency.
Each hospital maintains its own set of patient records. In the real-time healthcare
domain, data security and verification of all these records are huge challenges.
Moreover, as technology advances at a rapid pace, adoption of technology has
become a problem.
Before analytics techniques existed, people used to take advice from
experienced people to make decisions.
Analytics can help a business understand its customers' needs and
buying patterns in order to grow. Customer targeting methods are used by
major e-commerce websites to reach their customers.

Data Analysis Framework


There are broadly 7 steps to perform data analysis, and the first step is
understanding the business problem. It includes discussion with the SME
(subject matter expert), stakeholders or domain experts to understand the problem.

Understanding the business problem involves having a set of questions and
analysing the data to get an idea of the kind of work required.
Once the problem is fully understood in data analysis terms and all the
assumptions are checked and validated, the data acquisition phase begins.
Analysts need to decide which data sets are necessary to solve the problem,
where these data sets reside and in which format.

In the House Value Prediction dataset, the question was to find the rich people. To
solve this problem statement, we have to set a threshold for salary. People with a
salary above that threshold are considered rich.

An assumption is made so that a threshold can be chosen that gives the correct
result for the business problem. Data analysts need to make assumptions
before deciding which data set is required to solve a problem.

To perform analysis, if we need all the data in one place, the data analyst will merge
the multiple files. The purpose of data merging is to have data at a single
location, understand data in a more efficient manner and connect different
attributes from two different files.

To perform data merging, both files must have a common column / attribute
by which they can be merged. In this case, the common column is Longitude.
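
As a rough illustration of merging on a common column, the sketch below joins two small in-memory "files" on a shared longitude value. The column names other than longitude and all the numbers are hypothetical; they are not taken from the lecture dataset.

```python
# Two hypothetical "files", each a list of rows sharing a "longitude" column.
house_values = [
    {"longitude": -122.23, "median_house_value": 452600},
    {"longitude": -122.22, "median_house_value": 358500},
]
incomes = [
    {"longitude": -122.23, "median_income": 8.3252},
    {"longitude": -122.22, "median_income": 8.3014},
]

# Index one file by the common column, then attach its attributes to the other.
income_by_longitude = {row["longitude"]: row for row in incomes}
merged = [
    {**row, **income_by_longitude[row["longitude"]]}
    for row in house_values
    if row["longitude"] in income_by_longitude  # keep only rows present in both
]
```

Each merged row now carries attributes from both files, which is the point of merging: the data sits in one place and attributes from different sources can be compared directly.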

The “Admit date” column is in Char format. For better data analysis, it is better
to transform the “Admit date” column to the Date datatype.
The additional column will store days, which will always be integers. So the
datatype of this column will be int.
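
A minimal sketch of this transformation in Python follows. It assumes the dates arrive as text in `YYYY-MM-DD` form and that the extra days column is the difference between an admit date and a discharge date; the discharge column and the sample values are assumptions for illustration only.

```python
from datetime import datetime

# Hypothetical rows where dates arrive as text (Char) values.
rows = [
    {"admit_date": "2021-03-01", "discharge_date": "2021-03-06"},
    {"admit_date": "2021-03-10", "discharge_date": "2021-03-12"},
]

for row in rows:
    # Convert the text values to real date objects.
    admit = datetime.strptime(row["admit_date"], "%Y-%m-%d").date()
    discharge = datetime.strptime(row["discharge_date"], "%Y-%m-%d").date()
    # Subtracting two dates gives a timedelta; its .days attribute is an int.
    row["days_admitted"] = (discharge - admit).days
```

Once the column holds real dates rather than text, date arithmetic like this becomes possible, and the derived column is naturally an integer.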

In any dataset, there is a chance that some values are not available for an
attribute. Such data is known as missing values. Missing values can change
the overall analysis.
When there are very few rows, we cannot drop rows. We can handle the missing
values by imputing them with the mean, mode or another method.
When there are many rows, we can either drop the rows containing missing values or
impute the missing values with the mean, mode or another method.
Outliers are data points or values that do not fit in with the other
values. Outliers can also reduce the accuracy of the model.
Outliers can be an error or some exceptional case in the data. It is
always good to do the analysis both with and without the outlier to describe how
the outlier is impacting the analysis.
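
The effect of an outlier can be seen by computing the same statistic with and without it. The values below are invented for illustration.

```python
from statistics import mean

# Hypothetical monthly expenses; the last value is an obvious outlier.
expenses = [2000, 2500, 3000, 3500, 200000]

with_outlier = mean(expenses)
without_outlier = mean([x for x in expenses if x != 200000])
```

In this sketch the average drops from 42200 to 2750 once the single extreme value is excluded, which shows how strongly one outlier can distort a summary.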

Feature Engineering is an art rather than science, as there can be impactful new
features which can be derived from already existing features in this process.
We could easily identify employees who are over 35 years old if we had an Age
column. As the next step, deriving a new column, "Age", from the "Birthdate"
column makes this much easier.
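
Deriving Age from Birthdate could look like the following sketch. The employee rows, the helper name `age_on` and the fixed reference date are all hypothetical; the reference date is fixed only so the result is reproducible.

```python
from datetime import date

def age_on(birthdate: date, today: date) -> int:
    """Whole years between birthdate and today."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)

# Hypothetical employee rows with only a Birthdate column.
employees = [
    {"name": "A", "birthdate": date(1980, 5, 17)},
    {"name": "B", "birthdate": date(1995, 11, 2)},
]
reference_day = date(2024, 1, 1)
for e in employees:
    e["age"] = age_on(e["birthdate"], reference_day)  # the new feature

over_35 = [e["name"] for e in employees if e["age"] > 35]
```

With the new column in place, the "over 35" question becomes a simple filter.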

Storytelling should be done in such a way that it is understandable to
stakeholders. It should be consumable, impactful and actionable.

In the lecture, the number of bedrooms and house size were used to create new
features to get better results, depending upon the size and house price.
These were the two insights shown in the lecture after the analysis:
Bay area, high house prices
Bay area, high income

Storytelling is the last step of the data analysis, and it is performed after the Data
analysis step, once you have all the insights.

Data analytics works by analyzing large data sets with a variety of tools and
methods in order to uncover unique patterns, hidden correlations and relevant
trends, and other insights that can be used to make data-driven decisions in the
pursuit of improved results.

A variety of visualisation tools, including Power BI, Tableau, Spotfire, etc., are used
for plotting and dashboarding data to perform data analysis.
ETL and Python are used for data preparation, and Excel, SQL, Python, and
Natural Language Processing (NLP) are used for data analysis.

Mostly ETL (extract, transform and load) tools like SSIS, Informatica, Alteryx,
Tableau Prep, etc. are used for data acquisition. Python and R are used for data
preparation. PowerPoint presentations are used to present the analysis.

Project : Covid Dataset

Structured data is data stored in tabular format as rows and
columns. The Covid-19 dataset, in tabular format, is a structured dataset.
The total number of cases for each state has been provided. The total number of
new cases in India is the sum of all these cases. By dividing the observed
new cases in Kerala by the sum of the observed new cases in every state, we can
find the percentage of India's new cases that occurred in Kerala.
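
The calculation above can be sketched in a few lines. The case counts below are hypothetical and are not taken from the actual Covid-19 dataset.

```python
# Hypothetical new-case counts per state (not from the real dataset).
new_cases = {"Kerala": 250, "Maharashtra": 500, "Delhi": 250}

# India's total is the sum over all states.
total_india = sum(new_cases.values())
# Kerala's share of the national total, as a percentage.
kerala_share_pct = new_cases["Kerala"] / total_india * 100
```

With these made-up numbers Kerala accounts for 25% of the new cases.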
There is information about Covid-19 cases of India in the dataset. You do not have
any information about other countries. Thus, you cannot determine which country
has the highest cases using the provided data.
- The dataset has information related to active and death cases. It also has a date
column, which indicates the month of the observed cases. Thus, you can find a
pattern of active and death cases across different months via the three
attributes:- Date, Active, Deaths.
- As you do not have information about various countries and districts, you cannot
find the timeline view of cases across different countries, states and districts.
- As you have no information relating covid-19 with climate, you cannot find the
impact of climate on the spread of infection.
If you want to find the percentage of cases relative to a state's population, you
need the ratio of the number of confirmed cases in the state to the state's
population. The population of a state is not available at the moment, so the
ratio can be computed only after adding population data.
You can find the death rate of each state by dividing its death cases by its
population. As the computation will be carried out statewide, 'State/Union
Territory' will be the common merging column.
The "Confirmed_Indian_National" and "Confirmed_Foreign_National" Column
values for Haryana State are not provided. You must handle the missing attributes
before analysing the data.
The correct format helps in analysis and feature engineering. The “Date”
column should be in Date-type format for better analysis.
- The Date and Time attributes will help you select a period.
- To target a state, you will need state attribute data.
- To find the recovery rate, you will need active case and cured case attribute data.
To understand the trend, the government has to go through the past days' data of
covid cases. Going through past data is known as descriptive analysis.
After looking at the data, the government finds that cases are rising day
by day. To reduce the cases, it is imposing a lockdown, but first it has to
forecast the future number of cases. Forecasting the future in any analysis is
known as predictive analysis.
After seeing the trend and predicting the number of cases, the government is
now taking action to address the rising cases and deaths. Deciding on the proper
action, such as which age group should be vaccinated first, and carrying it out
comes under prescriptive analysis.
After the success of the vaccination drive, the government needs the
data/records of the drive for its own use. For this, it will need
compiled data, which comes under descriptive analysis.

Introduction To Spreadsheet
Spreadsheets provide instant visualisation and easy understanding of the data.
But we cannot store more than a million rows in spreadsheets.
In a spreadsheet, data is organised in rows and columns.
In a spreadsheet, the intersection of a row and a column is known as a cell.
In a spreadsheet, each workbook can have multiple worksheets, but a
worksheet cannot contain workbooks.
In a spreadsheet, a worksheet is a collection of data, and a collection of worksheets
is a workbook.
A spreadsheet can wrap text so it appears on multiple lines in a cell. We can format
the cell so the text wraps automatically.
At the top of the workbook, there is a ribbon where you can find the Text
Wrapping tab. Clicking on it wraps the text.
In a spreadsheet, rows are represented using numbers and columns are
represented by letters.
Red square: this tab is used to apply borders to the cell.
Yellow square: this B tab is used to change the font from normal to bold.
Blue square: this tab is used for text wrapping.
Green box: this tab is used to format the values as %.
To copy data: ctrl + c or cmd + c
To paste data: ctrl + v or cmd + v
To print data: ctrl + p or cmd + p
To replace data: ctrl + h or cmd + shift + h
**cmd is used on a MacBook.
Select the range of cells to sort
> Click on Data > Select the Sort range option.

When you freeze panes, you fix specified rows and columns so that they are
always visible on the screen. So, freezing is the correct answer.

When sorting is applied to a particular column, all the columns (the whole sheet) get
rearranged based on the sorted column.

1. ctrl + s or cmd + s : To save
2. ctrl + f or cmd + f : To find
3. ctrl + k or cmd + k : To create a hyperlink
4. ctrl + x or cmd + x : To cut any text

The Paste special option has these sub-options:
1. Values only
2. Format only
3. Formula only
4. Conditional formatting only
5. Data validation only
6. Transposed (pastes row data in column format and column data in row format)
7. Borders only

In the Data tab, under Data cleanup, there is an option to remove
duplicates.

In Conditional formatting, select Color scale.
- There we can format the data into colours based on their values.
Using conditional formatting, we have coloured the rows.
- As you can see, the condition is applied on the gender column, where rows are
coloured according to Female (yellow) and Male (green).

To launch the Conditional Formatting Rules Manager, click on the Conditional
Formatting button under the Home tab on the Ribbon and select Manage Rules
from the menu.
Change the cell range by clicking in the corresponding box in the Applies To
section and selecting a new cell range.

Every formula in a spreadsheet must start with an equals symbol.
The SUMIF function adds all numbers in a range of cells that meet a single
criterion or condition.
The syntax of SUMIF is =SUMIF(range, criteria, [sum_range]) where
range - Range to apply criteria to.
criteria - Criteria to apply.
sum_range - [optional] Range to sum.
=IF(condition, value if true, value if false)
*Condition is the criterion on which you want the result to depend
*Value if true is the output when the criterion matches
*Value if false is the output when the criterion doesn't match
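
For readers who think in code, the two functions above can be approximated in Python. This is only a sketch: the helper `sumif` is a hypothetical name, it handles plain equality criteria only (not expressions such as ">40"), and the sample data is invented.

```python
def sumif(criteria_range, criterion, sum_range=None):
    """Sum the values whose matching criteria cell equals `criterion`."""
    # Like the optional sum_range, default to summing the criteria range itself.
    if sum_range is None:
        sum_range = criteria_range
    return sum(s for c, s in zip(criteria_range, sum_range) if c == criterion)

flags = ["UNHEALTHY", "HEALTHY", "UNHEALTHY"]
salaries = [30000, 45000, 50000]

total_unhealthy = sumif(flags, "UNHEALTHY", salaries)
# IF(condition, value_if_true, value_if_false) as a conditional expression.
labels = ["high" if s >= 40000 else "low" for s in salaries]
```

The conditional expression mirrors the three IF arguments in the same order of meaning: the condition decides which of the two values is produced for each row.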

AVG is not a function in spreadsheets. The function to calculate the average in
a spreadsheet is “AVERAGE”.
TODAY() returns today’s date; DAY(TODAY()) returns only the day of the month of
today’s date.

Click on the Insert tab -> Charts.
The area where we plot the chart is known as the plot area.
A histogram is a representation of a frequency distribution by means of rectangles
whose widths represent class intervals and whose areas are proportional to the
corresponding frequencies.
A spreadsheet can be used for calculations, graphs can be easily plotted, and data
manipulation can be easily done.

Assignment-1

The manager wants to arrange the employee salaries in ascending order. To
arrange any data in ascending or descending order, use
the SORT operation.
The manager wants to find the youngest employee. To do so, he needs to sort
the age column in ascending order. The first cell of this column will then
give the youngest employee.
The manager wants to find the name of the employee with an age of less than 40 and
the highest salary.
- To do so, he needs to apply a filter to keep the details of all the
employees with an age of less than 40.
- After that, he needs to sort the salary in descending order to get the details of
the employee with the highest salary.
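
The filter-then-sort steps above can be sketched as follows. The employee rows are hypothetical.

```python
# Hypothetical employee rows.
employees = [
    {"name": "Asha", "age": 38, "salary": 52000},
    {"name": "Ravi", "age": 45, "salary": 90000},
    {"name": "Meera", "age": 29, "salary": 61000},
]

# Step 1: filter to employees younger than 40.
under_40 = [e for e in employees if e["age"] < 40]
# Step 2: sort by salary in descending order; the first row is the answer.
by_salary_desc = sorted(under_40, key=lambda e: e["salary"], reverse=True)
top_name = by_salary_desc[0]["name"]
```

Note that the order of the steps matters: sorting first and reading the top row would return the highest earner overall (Ravi here), who is not under 40.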
Explanation: there are no duplicates in the sheet. You can check this by
clicking on Data -> Data cleanup -> Remove duplicates.
Next to the existing columns, create one extra column by right-clicking and
choosing to insert a new column on the right.
- Once it is done, name it “HealthFlag”.
Apply the IF function to flag every employee as healthy or
unhealthy.
- The IF function takes a condition as its first parameter, here weight greater
than or equal to 60, followed by the results. If the condition is true, it tags
the employee as Unhealthy; if false, it tags them as Healthy.
The MIN(range) function gives the lowest value in the given column.
The COUNTIF(range, criteria) function gives the count of female employees.
Use the AVERAGE(range) function to find the average of a given range.
Use the =SUMIF(P2:P14,"UNHEALTHY",L2:L14) function to find the sum of the
unhealthy employees' salaries.
Create the FirstAlpha column with =LEFT(range,NumberOfCharacters) and the
LastAlpha column with =RIGHT(range,NumberOfCharacters). Then in the
ConcatAlpha column, use the =CONCATENATE(String1,String2) function on
FirstAlpha and LastAlpha.
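
In Python, the same three text operations map onto string slicing and concatenation. The names below are hypothetical, and one character is taken from each end purely as an example.

```python
names = ["Anita", "Rahul"]  # hypothetical first names

first_alpha = [n[:1] for n in names]    # LEFT(cell, 1)
last_alpha = [n[-1:] for n in names]    # RIGHT(cell, 1)
# CONCATENATE(FirstAlpha, LastAlpha), row by row
concat_alpha = [f + l for f, l in zip(first_alpha, last_alpha)]
```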
- Apply a filter on the HealthFlag column for Healthy cells.
- Then apply sorting (Z-A) on the salary column.
- Then look at the FirstAlpha column for the first letter of the first name.
- Write those letters together in a single string (without spaces).
Select the Salary column, and click on Insert > Charts.
Search for the histogram chart. A histogram chart shows salary ranges
and how many employees fall in each range.
Filter the HealthFlag column for Unhealthy people. Now in the salary column,
select the values between 40000 and 60000. Then create a histogram for those values.
- Filter the Gender column for Female "F".
- Apply a filter on the Department column for HR.
- Then sum the salaries of the visible rows.
Filter the "Department" column with "Marketing".
Sum the values for female and male employees using the SUMIF function.
The total salary of female employees is higher than the total salary of male
employees in the Marketing department.

Introduction To Business Problem


ABC company is finding the target audience for their newly launched product.
Hence, the campaign they are running is a targeted campaign.
Customer churn refers to customers who used to buy
from the company but have now stopped.
ABC company is facing two problems:
1. They are launching a new product, so they want to target customers who will
buy this new product.
2. Customer churn is the other problem: customers have stopped buying
products from ABC company.
Before moving forward with the analysis, a data analyst should have
an understanding of the company structure and the type of business the company
does.
ABC food company has both manufacturing and distribution units. They also sell
to customers directly through their web portal or their own franchises. They don’t
sell to other businesses, which is why it is not business to business.
To find the potential customers, you will need the details of the customers and the
amount they have spent on purchasing fruit products.
It will not depend on Recency, i.e. the number of days since the customer's last
purchase.
Converting the business problem into data analysis terms helps analysts directly
search for those datasets that can help them find insights/solutions.
Customer income, age and education are part of the customer details that
help us target customers of a given segment. But Recency is the number of days
since the last purchase, which will not help us target the correct customers.
The dt_customer column represents the customer's date of registration with the company.
While analysing the data, we can observe that customer details are on one sheet
and purchasing details are on another. Both sheets have data points required to
solve the business problem. That means the required data points are in different
sheets.
VLOOKUP is the function used to merge two sheets.
VLOOKUP can merge two sheets only when they have a common key.
Syntax is:
VLOOKUP(search_key, range, index, [is_sorted])
is_sorted is an optional parameter. It can be either TRUE or FALSE.
A FALSE value for is_sorted indicates that the first column of the range does not
need to be sorted in ascending order, so VLOOKUP searches for an
exact match of the search_key.
If more than one value equals search_key, VLOOKUP returns the value for the
first occurrence of the search_key.
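
The exact-match behaviour described above can be approximated in Python. This is a simplified sketch, not how any spreadsheet implements it: the function name `vlookup`, the table contents and the `"#N/A"` return string are assumptions for illustration, and the comparison is a plain equality check.

```python
def vlookup(search_key, table, index):
    """Exact-match lookup: scan the first column top to bottom and return the
    value from column `index` (1-based) of the first matching row."""
    for row in table:
        if row[0] == search_key:
            return row[index - 1]
    return "#N/A"  # mimics the spreadsheet error when nothing matches

# Hypothetical range: id in the first column, name in the second.
table = [
    [1, "Asha"],
    [3, "Sahil"],
    [3, "Duplicate"],  # a second match is never reached
]
found = vlookup(3, table, 2)
missing = vlookup(9, table, 2)
```

Because the scan stops at the first match, the duplicate row for id 3 is ignored, and a key that is absent from the first column produces the not-found marker.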

#N/A error -
1. Numbers formatted as text
2. The column with lookup values is not the leftmost column in your lookup table.
3. Typos or additional spaces in the lookup data
4. Typos or additional spaces in your lookup value
#REF! error -
1. The column index number is greater than the number of columns
2. The references in the VLOOKUP formula point to cells that no longer exist.

VLOOKUP is case-insensitive, meaning it treats lowercase and UPPERCASE letters
as the same characters.

VLOOKUP does not work for columns that are to the left of the
search_key column in the second sheet, as it only looks to the right of the
search_key column.
1. The key will be the id.
2. The range will be the whole dataset (except the name column).
3. The index will be the number of the column, within the range, that holds the
detail we want to find.
4. is_sorted will be zero.
=VLOOKUP(3,B1:C6,2,0)
=VLOOKUP("Sahil",A1:C6,2,0)
=VLOOKUP(B2,'Sheet1'!G1:H6,2,0)
The index is the column number within the range from which the corresponding
value (the one in the same row as search_key) should be retrieved.
In the formula, the search key should be B2 instead of B1, because B1 holds the
column name (the first row).

Missing values can be represented in many ways, such as zero, null, NA, blanks,
special characters or a very large number.
We can find missing values through the filter function and the COUNTA
function.
- The filter function shows the blank cells.
- COUNTA counts the cells that contain data (non-blank cells). If we subtract that
count from the total row count, we can find out whether there are any missing
values.
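
The COUNTA subtraction can be sketched as follows. The column is hypothetical, and `None` and the empty string stand in for blank cells.

```python
# Hypothetical column; None and "" stand in for blank cells.
experience = [3, None, 5, "", 2]

# COUNTA equivalent: count the cells that contain data.
non_blank = sum(1 for v in experience if v not in (None, ""))
# Total rows minus non-blank cells gives the number of missing values.
missing = len(experience) - non_blank
```

A result above zero means the column has missing values that need handling before analysis.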

The standard deviation is a measure of how far each observation lies from
the mean.
The median has an equal number of items on both sides after arranging them from
big to small or small to big.
When the data is spread out over a large range, the
standard deviation is high. A lower standard deviation means the data points
are very close to the mean.

- The mean is the average of the given dataset: the sum of all data points divided by
the number of data points
- The median is the middle point of the dataset
- The mean and median can be the same when the data is evenly (symmetrically)
distributed or all the values in the dataset are the same

The average is calculated by dividing the sum of the numbers by their count, the mode
is the most frequently occurring value in a dataset, the median is the middlemost
number in the dataset, the standard deviation shows how far the numbers are from the
average/mean, and the variance is mathematically the square of the standard deviation.
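
These measures can be checked with Python's standard `statistics` module. The data values are illustrative only; `stdev` and `variance` are the sample versions.

```python
from statistics import mean, median, mode, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only

avg = mean(data)          # sum of values divided by their count
mid = median(data)        # middle of the sorted values
most_common = mode(data)  # most frequently occurring value
sd = stdev(data)          # sample standard deviation
var = variance(data)      # sample variance, the square of sd
```

For this data the mean is 5, the median is 4.5 and the mode is 4, and squaring `sd` reproduces `var` up to floating-point rounding.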

The imputation method is useful here because the analysis depends on the
column that has missing values, i.e. Experience.

If the standard deviation is similar/near to the average, the
data is widely spread and not clustered near the average. This is why we
replace the missing value with the median, not with the average.
If the standard deviation is much smaller than the average, the
data is not widely spread and is clustered near the average. Because of this, we
can replace the missing value with the average (because most of the values are near
the average), which will not change the analysis.
First handle the missing values using the imputation method. Then use the Excel
function =STDEV(range) to calculate the standard deviation of the income column.

Use the Excel function =AVERAGE(range) to calculate the average of the purchase
column.
Use the Excel function =STDEV(range) to calculate the standard deviation of
Monthly expense.
Use the Excel function =VAR(range) to calculate the variance of Monthly
expense.

The standard deviation is far from the average, which means we can
replace the missing value with the average: as the two are already far from each
other, this will not make an impact and the distribution will remain the same. So
we replace it with 41664, which is the average.
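
The choice between average and median imputation can be sketched as a small rule. The notes do not give a numeric cut-off for "near" or "far", so the `spread_ratio` of 0.5 below is purely an assumption, as are the function name `impute` and the sample values.

```python
from statistics import mean, median, stdev

def impute(values, spread_ratio=0.5):
    """Fill None entries with the mean when values cluster near the mean,
    otherwise with the median. `spread_ratio` is an assumed cut-off."""
    known = [v for v in values if v is not None]
    avg, sd = mean(known), stdev(known)
    # Small standard deviation relative to the average: data is clustered.
    fill = avg if sd < spread_ratio * avg else median(known)
    return [fill if v is None else v for v in values]

clustered = impute([40, 42, None, 41, 43])  # tight values: mean is used
spread = impute([1, 2, None, 3, 100])       # wide spread: median is used
```

In the first list the gap is filled with the mean (41.5); in the second, where one large value inflates the standard deviation, it is filled with the median (2.5).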

ASSIGNMENT - 2
As the first step in solving any problem, you must understand the business problem in
detail.
The college can improve placements if it creates admission criteria and gives
admission to the people who meet the criteria.
The analyst will have to analyse the students' data, including
placement data and the students' past records, to suggest admission criteria that
improve placements.
The required details, which are the students' past records, placement status and work
experience, are important factors, and they are available in different sheets. That
is why we need to merge the data using VLOOKUP.
After merging both the sheets, apply the filter operation on the gender column
and choose “F” . Once we have filtered data, just select the complete column and
you can find the count in the right bottom corner.
After merging the data, find the name of the student using vlookup function
(=VLOOKUP("Anne McFarland",B2:N216,13,0)). From there you can see marks
scored by the student in the placement test.
Create a filter on Status column and filter out the details of placed students only,
after that count the number of rows
Or
Use =COUNTIF() function.
#N/A error occurs when the value cannot be found in the referenced data and it
can be clearly seen that the ‘Name’ column is not present in Placement Detail
sheet , hence vlookup will give us the error.
The data type of any cell can be checked by using =TYPE() function.
By applying filter on columns to filter out a particular result, the columns that
have missing values, gives us an option of “(Blanks)”, which means to filter out
the columns that have missing values
Create a filter on Salary column and filter out the Blanks only, after that count the
number of rows
Or
Use =COUNTIF() function ie

=COUNTIF(P2:P216,"").
Since the students are not employed so their salary can be taken as 0.
Apply filters on gender, specialization and status, then check the count of
visible rows shown in the bottom-right corner.
Apply filters on hsc_s, hsc_p and status, then check the count in the
bottom-right corner.
Apply filters on the workex and status columns to count the required number of
students, then check the count in the bottom-right corner.
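
Filtering on several columns and reading the row count can be sketched in Python. The rows and the helper `count_where` are hypothetical; only the column names gender, workex and status come from the notes above.

```python
# Hypothetical rows; the real sheet has 215 students.
students = [
    {"gender": "F", "workex": "Yes", "status": "Placed"},
    {"gender": "F", "workex": "No", "status": "Not Placed"},
    {"gender": "M", "workex": "Yes", "status": "Placed"},
    {"gender": "F", "workex": "Yes", "status": "Not Placed"},
    {"gender": "M", "workex": "No", "status": "Placed"},
]


def count_where(rows, **conditions):
    """Count the rows in which every named column equals the given value."""
    return sum(
        all(row.get(column) == wanted for column, wanted in conditions.items())
        for row in rows
    )


placed_with_workex = count_where(students, workex="Yes", status="Placed")
placed_females = count_where(students, gender="F", status="Placed")
print(placed_with_workex, placed_females)
```

Each keyword argument corresponds to one filter applied in the spreadsheet; the returned number corresponds to the count shown in the bottom-right corner.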
Calculate the average, median and standard deviation. They come out to be
average = 72.29 and standard deviation = 13.35.
Now since Standard deviation<
use =AVERAGE() for finding Average
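
The three summary statistics can be sketched with Python's standard `statistics` module. The marks are hypothetical and will not reproduce the 72.29 and 13.35 quoted above; `stdev` is the sample standard deviation, which is assumed here to correspond to the spreadsheet's =STDEV().

```python
from statistics import mean, median, stdev

marks = [60, 70, 80, 90]  # hypothetical marks, not the course dataset

average = mean(marks)    # spreadsheet: =AVERAGE(range)
middle = median(marks)   # spreadsheet: =MEDIAN(range)
spread = stdev(marks)    # sample standard deviation

print(average, middle, round(spread, 2))
```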

Data Processing : Business Problem

Before removing or including an outlier in any analysis, the data analyst should
confirm the scenario with the stakeholder, because removing or keeping the
outlier can change the final analysis. Whenever you encounter outliers in a
relevant feature/column, it is good practice to discuss with the stakeholder how
they will impact the analysis.

In the given dataset, the monthly expenses corresponding to id=1 is 20 and id=12
is 200000000, which are too low and too high respectively in comparison to the
other monthly expenses, so they can be counted as outliers in the dataset.

Quartile 2 returns the centre value of the sorted range, which is the median
value.

Outliers are values in our dataset that are too large or too small in comparison
to the other values and may be present due to human error while recording the
data.
A candlestick chart is a visual method to find outliers present in our data.

Syntax is:
=QUARTILE(data, quartile_number)
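
As a rough Python counterpart, `statistics.quantiles` with `method="inclusive"` interpolates between sorted values in the same way as the inclusive spreadsheet quartile. The wrapper `quartile` and the sample data are hypothetical and only mirror the syntax above.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample


def quartile(values, quartile_number):
    """Mimic =QUARTILE(data, quartile_number) for quartile numbers 0 to 4."""
    if quartile_number == 0:
        return min(values)
    if quartile_number == 4:
        return max(values)
    cut_points = quantiles(values, n=4, method="inclusive")  # [Q1, Q2, Q3]
    return cut_points[quartile_number - 1]


q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
print(q1, q2, q3)
print(q2 == median(data))  # quartile 2 is the median
```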

In the NumStorePurchases column:

- MIN is 0
- MAX value is 13
- Q1 value after applying =QUARTILE() function formula is 3
- IQR value is Q3-Q1 i.e. 5

As the graph shows, we conclude that an outlier exists because the MAX value is
very far from the IQR box, i.e. from Q3, while most of the values are clustered
in or near the IQR.
IQR is Q3-Q1, the range of any feature that holds the middle 50% of the values.
If the outlier’s value is important to our stakeholders or to us for the
analysis, we need not remove the outlier. We did the same in the ABC food
company analysis.
We use an IF condition to check whether the value lies between the lower limit
and the upper limit of the range.

=IF(OR(Cell_value<Q1-1.5*IQR, Cell_value>Q3+1.5*IQR),1,0)

This formula marks the value as "1" if it is not in the given range, else "0".
Q1-1.5*IQR and Q3+1.5*IQR are the lower and upper limits respectively.
Any value between the lower limit and the upper limit is not considered an
outlier.
50% of the data will always be in range, because the data between Q1 and Q3
always lies within the limits.
The MAX or MIN value is considered an outlier only if it is outside the range
Q1-1.5*IQR to Q3+1.5*IQR.
The median value will always be part of the IQR.
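
The IQR rule described above can be sketched in Python. Only the two extreme expenses (20 and 200000000) come from the notes; the remaining values and the helper name `flag_outliers` are hypothetical, and the quartiles use the inclusive method as an assumed match for the spreadsheet function.

```python
from statistics import quantiles

# 20 and 200000000 are the values mentioned for id=1 and id=12;
# every other expense is made up for illustration.
monthly_expenses = [20, 1500, 1800, 2000, 2100, 2200,
                    2400, 2500, 2700, 3000, 3200, 200000000]


def flag_outliers(values):
    """Return 1 for values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], else 0."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return [1 if (v < lower_limit or v > upper_limit) else 0 for v in values]


flags = flag_outliers(monthly_expenses)
outlier_count = sum(flags)  # same idea as counting the 1s with =COUNTIF()
print(flags)
print(outlier_count)
```

Whether the flagged rows are then dropped or kept is still a decision to confirm with the stakeholder.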
First calculate all the quartile values (Q1, Q2, Q3) and then find the IQR for
the column amount of sweet products.

In a new column apply this condition:

=IF(OR(Cell_value<Q1-1.5*IQR, Cell_value>Q3+1.5*IQR),1,0)

Then in the new column, apply a filter or =COUNTIF() to calculate the number of
outliers present (value 1 represents an outlier).
First calculate all the quartile values (Q1, Q2, Q3) and then find the IQR for
the column Number of web purchases.

In a new column apply this condition:

=IF(OR(Cell_value<Q1-1.5*IQR, Cell_value>Q3+1.5*IQR),1,0)

Then in the new column, apply a filter or =COUNTIF() to calculate the number of
outliers present (value 1 represents an outlier).

Feature engineering enriches our dataset with useful information and gives us
patterns to explore further. It is an independent step, but it can also be done
during analysis. Sometimes it is not possible to see in advance which feature
might come in handy during analysis.

Create a new column using the IF function so that it returns 1 if the age is 61.
Then apply a filter on the new column by selecting the value 1, and count the
number of rows.
Create a new column with the IF function, using the condition that it returns 1
if the total purchase is more than 20.
Then apply a filter on the new column to keep the rows with value 1, and count
the number of rows.

Apply a filter on the column for the total number of kids (which stores the
count of kids in a family) to keep values greater than 2, and then filter for
the people who have income less than 40000.
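
The flag-column and filter steps above can be sketched in Python. The rows and the column names `total_purchases`, `total_kids`, `income` and `high_purchaser` are hypothetical stand-ins for the sheet's columns.

```python
# Hypothetical customer rows; the column names are assumptions.
customers = [
    {"id": 1, "total_purchases": 25, "total_kids": 3, "income": 35000},
    {"id": 2, "total_purchases": 12, "total_kids": 1, "income": 60000},
    {"id": 3, "total_purchases": 21, "total_kids": 3, "income": 52000},
    {"id": 4, "total_purchases": 8, "total_kids": 4, "income": 28000},
]

# New feature: 1 if total purchases exceed 20, else 0 (an IF column in the sheet).
for row in customers:
    row["high_purchaser"] = 1 if row["total_purchases"] > 20 else 0

high_purchaser_count = sum(row["high_purchaser"] for row in customers)

# Two stacked filters: more than 2 kids and income below 40000.
large_low_income = [
    row["id"] for row in customers
    if row["total_kids"] > 2 and row["income"] < 40000
]

print(high_purchaser_count)
print(large_low_income)
```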

Assignment 3

Quartile is applied to numeric columns only. Since we have 6 numerical columns
in the dataset, our answer should be 6.
Check using quartiles with the condition Cell_value<Q1-1.5*IQR or
Cell_value>Q3+1.5*IQR by creating a new column for outliers with the formula
(=IF(OR(Cell_value<Q1-1.5*IQR, Cell_value>Q3+1.5*IQR),1,0)), or use candlestick
charts.
After that, count the number of 1's in the column to get the number of outliers
present.

You will find there is no outlier in this particular column and all the marks are in
the given range.
Use the =QUARTILE() function for every numerical column (Q1, Q2, Q3, Q4 and IQR).
Once you have all the values, check whether each value of the target column lies
in the range Q1-1.5*IQR to Q3+1.5*IQR: mark it as 0 if it does, else 1. If there
is any ‘1’ in the new column, the target column has an outlier.

Perform this for all the numerical columns.


Use the =QUARTILE(data, quartile_number) function on the hsc_p range with 3 as
the quartile number.
Use the =QUARTILE() function on the hsc_p range and find all the values Q1, Q2,
Q3, Q4 and IQR. After this, create another column that marks a value as 0 if it
is in the range Q1-1.5*IQR to Q3+1.5*IQR, else 1.
Now apply the filter option for ‘1’, then count the number of rows.
Use =Quartile() function to find the Q1 & Q3 values for the given columns
Use =Quartile() function to find the Q1 & Q3 values for the given columns
To calculate the IQR of the salary column, you will need the values of Q3 and Q1
for the salary column:

- Q3 : =QUARTILE(range,3)
- Q1 : =QUARTILE(range,1)

Then compute (Q3-Q1), which gives the IQR value.


You can create the box plot for the hsc_p column by selecting the Q1, Q2, Q3 and
Q4 values for the hsc_p column, then going to the insert chart option and
inserting a candlestick chart in the spreadsheet.
1. First create a new column Student_Rating as mentioned in the question.
2. Create a new column that converts the mba_p marks to a percentage (L
represents the mba_p column):
=> =(L2)%
3. Apply the condition on the mba_p% column to fill the Student_Rating column
(Q represents the mba_p% column):
=> =IF(Q2<60%,"Average","Good Performer")
4. Then apply a filter on the Student_Rating column to count females and males.
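
The four steps above can be sketched in Python. The rows are hypothetical; the threshold of 60 and the labels "Average" and "Good Performer" come from the formula in step 3.

```python
from collections import Counter

# Hypothetical students; mba_p is already on a 0-100 scale here,
# so the percentage conversion of step 2 is not needed.
students = [
    {"gender": "M", "mba_p": 58.8},
    {"gender": "F", "mba_p": 66.3},
    {"gender": "M", "mba_p": 61.2},
    {"gender": "F", "mba_p": 55.5},
    {"gender": "F", "mba_p": 72.0},
]

# Step 3: below 60 is "Average", otherwise "Good Performer".
for student in students:
    student["Student_Rating"] = (
        "Average" if student["mba_p"] < 60 else "Good Performer"
    )

# Step 4: count each rating per gender instead of filtering by hand.
rating_by_gender = Counter(
    (student["gender"], student["Student_Rating"]) for student in students
)
for (gender, rating), count in sorted(rating_by_gender.items()):
    print(gender, rating, count)
```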
Applying feature engineering to past academic records and analysing newly
admitted students by gender and work experience might help us divide them into
new categories. Their MBA marks will not help, because these only become
available after a student takes admission, whereas our aim is to decide the
criteria for a student who is not yet studying for the MBA, i.e. a fresher.
Create one more column based on the etest_p column, named "Clear_test", using
the criterion: if the etest_p marks are above 50%, write cleared, else write not
cleared.
Count the cleared students and divide by the total number of students who
appeared: 213/215 = 0.99069767, i.e. 99.07% (99.06% when truncated to two
decimals).
First, apply a filter for students who cleared the test and then a filter for
students who got placed. Count the students who remain after these two filters,
then divide this number by the number of students who cleared the exam.

So it becomes 148/213 = 0.694835, i.e. 69.48%.
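
The two percentages can be checked with a few lines of Python. The counts 215, 213 and 148 are the ones quoted above; the helper `clear_test` is a hypothetical name for the Clear_test rule.

```python
def clear_test(etest_p):
    """Clear_test rule: marks above 50 count as cleared."""
    return "cleared" if etest_p > 50 else "not cleared"


appeared = 215            # students who took the test
cleared = 213             # students marked "cleared"
cleared_and_placed = 148  # cleared students who were also placed

clear_rate = cleared / appeared * 100
placement_rate = cleared_and_placed / cleared * 100

print(round(clear_rate, 2))
print(round(placement_rate, 2))
```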
