Milestone 1
The data analysis process involves gathering all the information, cleaning the
data, transforming it, modelling it, and using it to find patterns and other
insights.
Data is a collection of unorganised facts and figures; on its own it does not
provide any further information about patterns, context, etc.
Facts are data, whereas processed data is information.
Data comes before information.
Structured data is well defined and stored in tabular form, with rows and
columns.
Age information can be kept in a well-defined format, such as date of birth and
age. Customer reviews, pictures and customer complaints cannot be defined this
way, because their format changes from person to person. They are unstructured
in nature.
(This is covered in the next lectures.)
Semi-structured data has no predefined format; its elements are separated using
tags and other markers. This data is also useful in data analysis.
Structured Data: This kind of data is stored in the Tabular format as row and
column.
Semi-Structured Data: It doesn’t have a defined structure, but data are separated
using tags and markers.
Unstructured Data: It has no structure. This kind of data requires an
algorithm to convert it into some structure.
Data analytic techniques enable you to take raw data and uncover patterns to
extract valuable insights from it. The data gives information, insights and helps in
making impactful decisions.
Analysts must identify areas that require further investigation based on the
findings of descriptive analysis, because some questions cannot be answered
merely by viewing the data. This investigation helps to find any gaps and the
root cause of a problem that is affecting the business, for example why there
has been a sudden change in traffic to a website with no apparent cause.
Reporting traffic conditions falls under descriptive analysis, and predicting
travel time is predictive analysis. But when Google tries to find out why Google
Maps shows a longer time for the shortest path, or why it suggests a narrow or
non-vehicle-friendly road to reach the destination, that is diagnostic analysis.
Predictive analysis attempts to answer the question “what is likely to happen”.
This type of analytics utilizes previous data to make predictions about future
outcomes.
Prescriptive analysis is the frontier of data analysis, combining the insight from all
previous analyses to determine the course of action to take in a current problem
or decision. It determines what should be done.
By drawing on all the previous analyses, prescriptive analytics helps a
business or organisation make sound decisions about its various activities.
Prescriptive Analytics combines the insight from the last step, that is Predictive
Analytics, to determine the course of action to take in a current problem or
decision.
The e-commerce clothing company is collecting data and summarising past data,
hence doing Descriptive Analysis.
Now prescriptive analytics can be of assistance on the matter and help determine
options for action. Perhaps an algorithm can detect the learners who require that
new course, but lack that particular skill, and send an automated
recommendation that they take an additional training resource to acquire the
missing skill.
By predicting what exactly you are searching and combining it with your past
results.
In the House value Prediction dataset, the question was to find the rich people. To
solve this problem statement we have to set a threshold for salary. Above that
threshold salary, people are rich.
Once the problem is fully understood in data analysis terms and all the
assumptions have been checked and validated, the data acquisition phase begins.
Analysts need to decide which data sets are necessary to solve the problem,
where these data sets reside and in which format.
Assumptions are made so that a threshold can be chosen that gives the correct
result for the business problem. Data analysts need to make assumptions
before deciding which data set is required to solve a problem.
If the analysis needs all the data in one place, the data analyst merges the
multiple files. The purpose of data merging is to have the data in a single
location, understand it more efficiently and connect different attributes from
two different files.
To perform data merging, both the files should have a common column / attribute
by which they can be merged.
For data merging to occur, both files must share a common column or attribute. In
this case, the common column is Longitude.
The “Admit date” column is in Char (text) format. For better analysis, it
should be transformed to the Date datatype.
The additional column will store days, which are always integers, so the
datatype of this column will be int.
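As a minimal sketch of this conversion in Python (one of the tools named in
these notes), the snippet below parses a text date and derives an integer day
count. The sample values, the day-month-year format and the "Discharge date"
column are assumptions made for illustration; the notes do not specify them.

```python
from datetime import datetime

# Hypothetical records; the date format and the "Discharge date" column are assumptions.
records = [
    {"Admit date": "03-01-2021", "Discharge date": "10-01-2021"},
    {"Admit date": "15-02-2021", "Discharge date": "16-02-2021"},
]

def to_date(text):
    """Convert a day-month-year string into a date object."""
    return datetime.strptime(text, "%d-%m-%Y").date()

for row in records:
    admit = to_date(row["Admit date"])
    discharge = to_date(row["Discharge date"])
    # Subtracting two dates gives a timedelta; its .days attribute is always an int.
    row["Days"] = (discharge - admit).days

days = [row["Days"] for row in records]
```

Once the text is a real date, date arithmetic works directly, which is why the
derived column can safely be stored as an integer.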
Some data may not be available for an attribute. Such data is known as missing
values. Missing values can change the overall analysis.
When there are very few rows, we cannot drop them. We can handle the missing
values by imputing them with the mean, mode or another suitable method.
When there are many rows, we can either drop the rows containing missing values
or handle the missing values by imputing them with the mean, mode or another
suitable method.
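The two options can be sketched in Python as follows. The column values are
hypothetical and None stands in for a missing cell; this is only an
illustration of dropping versus imputing, not a prescribed procedure.

```python
from statistics import mean, mode

# Hypothetical column with missing values represented as None.
ages = [25, 30, None, 30, 41, None]

present = [v for v in ages if v is not None]

# Option 1: drop the rows with missing values (reasonable when rows are plentiful).
dropped = present

# Option 2: impute with the mean or the mode of the known values.
mean_imputed = [v if v is not None else mean(present) for v in ages]
mode_imputed = [v if v is not None else mode(present) for v in ages]
```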
Outliers are data points or values that do not fit in with the other values.
Outliers can also reduce the accuracy of a model.
An outlier can be an error or an exceptional case in the data. It is always
good to run the analysis both with and without the outlier to describe how it
affects the analysis.
Feature engineering is an art rather than a science, as impactful new features
can be derived from already existing features in this process.
We could easily identify employees who are over 35 years old if we had an Age
column. As the next step, deriving a new column "Age" from the column
"Birthdate" will help with this tremendously.
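A small Python sketch of deriving Age from Birthdate is given here. The
employee records and the fixed reference date are hypothetical, chosen only so
that the result is reproducible.

```python
from datetime import date

def age_on(birthdate, today):
    """Whole years between birthdate and today."""
    years = today.year - birthdate.year
    # Subtract one year if the birthday has not happened yet this year.
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years

# Hypothetical employees and a fixed reference date.
employees = [
    {"Name": "A", "Birthdate": date(1980, 6, 15)},
    {"Name": "B", "Birthdate": date(1995, 12, 1)},
]
reference = date(2024, 1, 1)

for emp in employees:
    emp["Age"] = age_on(emp["Birthdate"], reference)

over_35 = [emp["Name"] for emp in employees if emp["Age"] > 35]
```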
In the lecture, the number of bedrooms and the house size were used to create
new features, to get better results with respect to size and house price.
These were two insights after the analysis shown in the lecture:
Bay area, High house prices
Bay area, High income
Storytelling is the last step of data analysis. It is performed after the
analysis step, once you have all the insights.
Data analytics works by analyzing large data sets with a variety of tools and
methods in order to uncover unique patterns, hidden correlations and relevant
trends, and other insights that can be used to make data-driven decisions in the
pursuit of improved results.
A variety of visualisation tools, including Power BI, Tableau and Spotfire, are
used for plotting and dashboarding data to perform data analysis.
ETL and Python are used for data preparation, and Excel, SQL, Python, and
Natural Language Processing (NLP) are used for data analysis.
Mostly, ETL (extract, transform and load) tools such as SSIS, Informatica,
Alteryx and Tableau Prep are used for data acquisition. Python and R are used
for data preparation. A variety of visualisation tools, including Power BI,
Tableau and Spotfire, are used for plotting and dashboarding data to perform
data analysis.
PowerPoint presentations are used to present the analysis.
Structured data is the data that is stored in the Tabular format as row and
column. The Covid-19 Dataset, in tabular format, is a structured dataset.
The total number of cases for all states has been provided. The total number of
new cases in India is the sum of all these cases. Dividing the observed new
cases in Kerala by the sum of the observed new cases in every state, and
multiplying by 100, gives Kerala's percentage share of new cases in India.
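The arithmetic can be sketched in Python as follows. The case counts are made
up for illustration; the real figures come from the dataset.

```python
# Hypothetical new-case counts per state.
new_cases = {"Kerala": 500, "Maharashtra": 1200, "Delhi": 300}

# Total new cases in India is the sum over all states.
india_total = sum(new_cases.values())

# Kerala's share of the national total, as a percentage.
kerala_share = new_cases["Kerala"] / india_total * 100
```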
There is information about Covid-19 cases of India in the dataset. You do not have
any information about other countries. Thus, you cannot determine which country
has the highest cases using the provided data.
- The dataset has information on active and death cases. It also has a date
column, which indicates the month of the observed cases. Thus, you can find a
pattern of active and death cases across different months using three
attributes: Date, Active, Deaths.
- As you do not have information about various countries and districts, you
cannot build a timeline view of cases across different countries, states and
districts.
- As you have no information relating Covid-19 to climate, you cannot find the
impact of climate on the spread of infection.
To find the percentage of cases relative to a state's population, you need the
ratio of the number of confirmed cases in the state to the state's population.
The population of a state is not available at the moment, so the ratio can be
computed only after adding population data.
You can find the death rate of each state by dividing its death cases by its
population. As the computation is carried out state by state, 'State/Union
Territory' will be the common merging column.
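A minimal Python sketch of merging on the common column and then computing the
death rate is given here. Both tables and all numbers are hypothetical; only
the 'State/Union Territory' key comes from the notes.

```python
# Hypothetical tables; both share the 'State/Union Territory' column.
cases = [
    {"State/Union Territory": "Kerala", "Deaths": 50},
    {"State/Union Territory": "Goa", "Deaths": 5},
]
population = [
    {"State/Union Territory": "Kerala", "Population": 1_000_000},
    {"State/Union Territory": "Goa", "Population": 200_000},
]

key = "State/Union Territory"
pop_by_state = {row[key]: row["Population"] for row in population}

merged = []
for row in cases:
    if row[key] in pop_by_state:  # keep only states present in both tables
        combined = dict(row, Population=pop_by_state[row[key]])
        combined["Death rate"] = combined["Deaths"] / combined["Population"]
        merged.append(combined)
```

Without a shared key column there would be no way to know which population
belongs to which row of case data.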
The "Confirmed_Indian_National" and "Confirmed_Foreign_National" Column
values for Haryana State are not provided. You must handle the missing attributes
before analysing the data.
The correct format helps in analysis and in feature engineering. The “Date”
column should be in Date-type format for better analysis.
- The Date and Time attributes will help you select a period.
- To target a state, you will need state attribute data.
- To find the recovery rate, you will need active case and cured case
attribute data.
To understand the trend, the government has to go through past days' data on
Covid cases. Going through past data is known as Descriptive Analysis.
After looking at the data, the government finds that cases are rising day by
day. To reduce the cases, it is imposing a lockdown, but for that it has to
forecast the future number of cases. Forecasting the future in any analysis is
known as predictive analysis.
After seeing the trend and predicting the number of cases, the government now
acts to address the rising cases and deaths. Deciding on the proper action,
such as which age group should be vaccinated first, and taking that action come
under Prescriptive analysis.
After the success of the vaccination drive, the government needs the
data/records of the drive for its own benefit. For that it will need compiled
data, which comes under Descriptive Analysis.
Introduction To Spreadsheet
Spreadsheets provide instant visualisation and easy understanding of the data.
But we cannot store more than a million rows in spreadsheets.
In a Spreadsheet, data is organised in rows and columns.
In a Spreadsheet, the intersection of a row and a column is known as a cell.
In a Spreadsheet, each workbook can have multiple worksheets, but a worksheet
cannot contain multiple workbooks.
In a Spreadsheet, a worksheet is a collection of data, and a collection of
worksheets is a workbook.
A spreadsheet can wrap text so that it appears on multiple lines in a cell. We
can format the cell so the text wraps automatically.
At the top of the workbook, there is a ribbon where you can find the Text
Wrapping tab. Clicking on it wraps the text.
In a spreadsheet, rows are represented by numbers and columns are
represented by letters.
Red Square: This tab is used to apply borders to the cell.
Yellow Square: This B tab is used to change the font from normal to bold.
Blue Square : This tab is used for text wrapping
Green box: This tab is used to format the values in %.
To Copy Data: ctrl + c or cmd + c
To Paste Data: ctrl + v or cmd + v
To Print Data: ctrl + p or cmd + p
To Replace Data: ctrl + shift + h or cmd + shift + h
**cmd is used on a MacBook.
Select range of cells to sort
> Click on Data > Select Sort range option.
When you freeze panes you can fix specified rows and columns so that they are
always visible on the screen. So, freezing is the correct answer.
When sorting is applied to a particular column, all the columns (the sheet) get
rearranged based on the sorted column.
In the Data tab, under Data cleanup, there is an option to remove
duplicates.
Assignment-1
The manager wants to arrange the employee salaries in ascending order. For
arranging any data either in ascending/descending order, you are expected to use
the SORT operation.
The manager wants to find the youngest employee. To do so, he needs to sort
the age column in ascending order. The first cell of this column will then
give the youngest employee.
The manager wants to find the name of the employee with an age of less than 40
and the highest salary.
- To do so, he needs to apply a filter to keep the details of all the
employees with an age of less than 40.
- After that, he needs to sort the salary in descending order to get the
details of the employee with the highest salary.
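The filter-then-sort sequence can be sketched in Python as follows. The
employee rows and names are hypothetical.

```python
# Hypothetical employee rows.
employees = [
    {"Name": "Asha", "Age": 35, "Salary": 52000},
    {"Name": "Ravi", "Age": 45, "Salary": 90000},
    {"Name": "Meena", "Age": 29, "Salary": 61000},
]

# Step 1: filter to employees younger than 40.
under_40 = [e for e in employees if e["Age"] < 40]

# Step 2: sort the filtered rows by salary, highest first.
by_salary = sorted(under_40, key=lambda e: e["Salary"], reverse=True)

top_name = by_salary[0]["Name"]
```

The order matters: sorting first and reading the top row would return the
highest earner overall, who may be 40 or older.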
Explanation: There are no duplicates in the sheet, You can find the same by:
Click on Data -> Data Cleanup -> Remove Duplicates
Next to the existing columns, create one extra column by right-clicking and
choosing the option to insert a new column on the right.
- Once it is done, name it “HealthFlag”.
Apply the IF function to flag every employee as healthy or unhealthy.
- The IF function takes the condition as its first parameter, here weight
greater than or equal to 60, followed by the results. If the condition is true
the employee is tagged as Unhealthy; if it is false the employee is tagged as
Healthy.
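For comparison, the same IF logic can be written as a short Python sketch.
The weights are hypothetical; the threshold of 60 and the two labels come from
the explanation above.

```python
def health_flag(weight):
    """Mirror of IF(weight >= 60, "Unhealthy", "Healthy")."""
    return "Unhealthy" if weight >= 60 else "Healthy"

# Hypothetical weights.
weights = [55, 60, 72]
flags = [health_flag(w) for w in weights]
```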
The MIN(Range) function gives the lowest value of the respective column.
The COUNTIF(Range, Criteria) function gives the count of female employees.
Use the AVERAGE(Range) function to find the average of a given range.
Use =SUMIF(P2:P14,"UNHEALTHY",L2:L14) to find the sum of the salaries of all
the unhealthy employees.
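As a rough Python equivalent of these four spreadsheet functions, assuming a
small hypothetical table (the column names and values are illustrative only):

```python
from statistics import mean

# Hypothetical rows; column names are assumptions.
rows = [
    {"Gender": "F", "Salary": 40000, "HealthFlag": "UNHEALTHY"},
    {"Gender": "M", "Salary": 55000, "HealthFlag": "HEALTHY"},
    {"Gender": "F", "Salary": 61000, "HealthFlag": "UNHEALTHY"},
]

salaries = [r["Salary"] for r in rows]

# MIN(range): lowest value in the column.
lowest = min(salaries)

# COUNTIF(range, "F"): number of rows matching the criterion.
female_count = sum(1 for r in rows if r["Gender"] == "F")

# AVERAGE(range): sum divided by count.
average_salary = mean(salaries)

# SUMIF(flag_range, "UNHEALTHY", salary_range): sum salaries where the flag matches.
unhealthy_total = sum(r["Salary"] for r in rows if r["HealthFlag"] == "UNHEALTHY")
```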
Create the FirstAlpha column with =LEFT(range,NumberofCharacter) and the
LastAlpha column with =RIGHT(range,NumberofCharacter). Then, in the
ConcatAlpha column, use the =CONCATENATE(String1,String2) function on
FirstAlpha and LastAlpha.
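The three text functions map onto simple string slicing. The Python sketch
here uses an example name, and the helper names left and right are
hypothetical stand-ins for the spreadsheet functions.

```python
def left(text, n):
    """LEFT(text, n): the first n characters."""
    return text[:n]

def right(text, n):
    """RIGHT(text, n): the last n characters."""
    return text[-n:] if n > 0 else ""

name = "Sahil"  # example value

first_alpha = left(name, 1)
last_alpha = right(name, 1)

# CONCATENATE(String1, String2): join the two strings.
concat_alpha = first_alpha + last_alpha
```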
- Apply a filter on the HealthFlag column for Healthy cells.
- Then apply sorting (Z-A) on the salary column.
- Then look at the FirstAlpha column for the first letter of each first name.
- Write those letters together in a single string (without spaces).
Select the Salary column and click Insert > Charts.
Search for the histogram chart. A histogram chart shows ranges of salaries and
how many employees fall in each range.
Filter the HealthFlag column for Unhealthy people. Now, in the salary column,
select the values between 40000 and 60000. Then create a histogram for those
values.
- Filter the Gender column for Female "F".
- Apply a filter on the Department column for HR.
- Then sum the salary of the available rows.
Filter the "Department" column with "Marketing".
Sum the values for female and male employees using the SUMIF condition.
The total salary of female employees is higher than the total salary of male
employees in the Marketing Department.
#N/A Error -
1. Numbers formatted as text
2. The column with lookup values is not farthest to the left in your lookup table.
3. Typos or additional spaces in the lookup data
4. Typos or additional spaces in your lookup value
#REF! Error -
1. Column Index Number is greater than the number of columns
2. The references in the VLOOKUP formula point to cells that no longer exist.
The VLOOKUP function does not work for columns that are to the left of the
search_key column in the second sheet, because it only looks to the right of
the search_key column.
1. Key will be id
2. Range will be the whole dataset(except name column)
3. index_column will be the same column where we want to find detail
4. is_sorted will be zero.
=VLOOKUP(3,B1:C6,2,0)
1. Key will be id
2. Range will be the whole dataset(except name column)
3. index_column will be the same column where we want to find detail
4. is_sorted will be zero.
=VLOOKUP("Sahil",A1:C6,2,0)
=VLOOKUP(B2,'Sheet1'!G1:H6,2,0)
The index is the column number within the range from which the corresponding
value (the one in the same row as search_key) should be retrieved.
In the formula, the search key should be B2 instead of B1, because B1 holds the
column name (the first row).
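To make the lookup rules concrete, here is a simplified Python sketch of an
exact-match lookup (is_sorted = 0). It is an illustration of the behaviour
described in these notes, not the spreadsheet's actual implementation, and the
table contents are hypothetical.

```python
def vlookup(search_key, table, index):
    """Exact-match lookup: scan the first column, return the cell in column `index` (1-based)."""
    if table and index > len(table[0]):
        return "#REF!"  # index is greater than the number of columns in the range
    for row in table:
        if row[0] == search_key:
            return row[index - 1]
    return "#N/A"  # key not found in the first (left-most) column

# Hypothetical two-column range (id, name) with a header row.
table = [
    ["id", "name"],
    [1, "Asha"],
    [2, "Ravi"],
    [3, "Sahil"],
]

found = vlookup(3, table, 2)
text_key = vlookup("3", table, 2)    # a number stored as text does not match
bad_index = vlookup(3, table, 3)     # only two columns exist
```

Because only the first column of the range is searched, a value sitting to the
left of the key column can never be returned, which matches the limitation
noted above.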
Missing values can be represented in many ways, such as zero, null, NA, blanks,
special characters or a very large number.
We can find missing values with the filter function and with the COUNTA
function.
- The filter function lets us isolate and count the blank cells.
- COUNTA counts the cells that contain data. If we subtract that count from
the total row count, we can find out whether there are any missing values.
COUNTA counts the cells that contain data (non-blank cells). If we subtract
that count from the total row count, we can find out whether there are any
missing values.
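The subtraction can be sketched in Python as follows. The column is
hypothetical, with None and empty strings standing in for blank cells.

```python
# Hypothetical column; None and "" stand in for blank cells.
experience = [2, None, 5, "", 7, 3]

total_rows = len(experience)

# What COUNTA would report: cells that actually contain data.
non_blank = sum(1 for v in experience if v not in (None, ""))

# Missing values are whatever is left over.
missing = total_rows - non_blank
```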
- Mean is the average of the given dataset: the sum of all data points divided
by the number of data points.
- Median is the middle point of the sorted dataset.
- Mean and median can be the same when the data is evenly distributed or all
the values in the dataset are the same.
Average is calculated by dividing the sum of the numbers by their count. Mode
is the most frequently occurring value in a dataset. Median is the middlemost
number in the dataset. Standard deviation shows how far the numbers are from
the average/mean, while variance is mathematically the square of the standard
deviation.
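These definitions can be checked with Python's standard statistics module. The
data values are hypothetical; stdev and variance here are the sample versions,
which is what the spreadsheet STDEV and VAR functions compute.

```python
from statistics import mean, median, mode, stdev, variance

data = [2, 4, 4, 5, 10]  # hypothetical values

avg = mean(data)       # sum of the values divided by their count
mid = median(data)     # middle value of the sorted data
most = mode(data)      # most frequently occurring value
sd = stdev(data)       # sample standard deviation
var = variance(data)   # sample variance, the square of the standard deviation
```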
The imputation method will be useful here because the analysis depends on the
column that has missing values, i.e. Experience.
If the standard deviation is similar or near to the average, the data is spread
out and not clustered near the average. This is why we replace the missing
value with the median and not with the average.
If the standard deviation is far from the average and also smaller than it, the
data is not very spread out and is clustered near the average. Because of this,
we can replace the missing value with the average (most of the values are near
the average), which will not change the analysis.
First handle the missing values using the imputation method. Then use the Excel
function =STDEV(range) to calculate the standard deviation of the income column.
Use the Excel function =AVERAGE(range) to calculate the average of the purchase
column.
Use the Excel function =STDEV(range) to calculate the standard deviation of
Monthly expense.
Use the Excel function =VAR(range) to calculate the variance of Monthly
expense.
The standard deviation is far from the average, which means we can replace the
missing value with the average; the two are already far from each other, so
this will not make an impact. The distribution will remain the same.
The standard deviation is far from the average, which means we can replace the
missing value with the average; the two are already far from each other, so
this will not make an impact. The distribution will remain the same. So we will
replace it with 41664, which is the average.
Assignment-2
The first step in solving any problem is to understand the business problem in
detail.
The college can improve placements by creating admission criteria and giving
admission to the people who meet the criteria.
The analyst will have to analyse the student data, which includes placement
data and the students' past records, to suggest admission criteria that improve
placements.
The required details, namely the students' past records, their placement status
and their work experience, are important factors, and they are available in
different sheets. That is why we need to merge the data using VLOOKUP.
After merging both the sheets, apply the filter operation on the gender column
and choose “F”. Once the data is filtered, select the complete column and you
can find the count in the bottom-right corner.
After merging the data, look up the name of the student using the VLOOKUP
function (=VLOOKUP("Anne McFarland",B2:N216,13,0)). From there you can see the
marks scored by the student in the placement test.
Create a filter on Status column and filter out the details of placed students only,
after that count the number of rows
Or
Use =COUNTIF() function.
The #N/A error occurs when the value cannot be found in the referenced data. It
can be clearly seen that the ‘Name’ column is not present in the Placement
Detail sheet, hence VLOOKUP gives the error.
The data type of any cell can be checked using the =TYPE() function.
When a filter is applied to a column that has missing values, the filter offers
a “(Blanks)” option, which filters for the rows with missing values.
Create a filter on the Salary column and filter for the blanks only, then count
the number of rows
Or
Use the =COUNTIF() function, i.e.
=COUNTIF(P2:P216,"").
Since the students are not employed, their salary can be taken as 0.
Apply filters on gender, specialization and status; you can then check the
count of the rows visible after filtering in the bottom-right corner tab.
Apply filters on hsc_s, hsc_p and status; you can then check the count in the
bottom-right corner tab.
Apply filters on the workex column and status to count the required number of
students; you can then check the count in the bottom-right corner tab.
Calculate the average, median and standard deviation. They come out to be
average = 72.29 and standard deviation = 13.35.
Now, since the standard deviation < average, use =AVERAGE() to find the average.
Before removing or including an outlier in any analysis, the data analyst
should confirm the scenario with the stakeholder.
Removing or keeping the outlier can change the final analysis. So whenever you
encounter outliers in the relevant feature/column, it is always good practice
to discuss with the stakeholder how they will impact the analysis.
Quartile 2 returns the centre value of the sorted range, which is the same as
the median value.
Outliers are those values in our dataset which are too large or too small in
comparison to the other values, and may be present due to some human error
while recording the data.
A candlestick chart is a visual method to find outliers present in our data.
Syntax is:
=Quartile ( data , quartile_number)
As you can see in this graph, we conclude that an outlier exists because the
MAX value is very far from the IQR box, i.e. from Q3, while most of the values
are clustered in or near the IQR.
IQR represents Q3-Q1, which is the part of any feature where the middle 50% of
the values reside.
If the outlier’s value is important to our stakeholders or us while performing some
analysis, we need not remove the outlier. The same we did in the ABC food
company analysis.
We use an IF condition to check whether the value lies between the lower limit
and the upper limit of the range.
=IF(OR(Cell_value<Q1-1.5*IQR, Cell_value>Q3+1.5*IQR),1,0)
This formula marks a value as "1" if it is not in the given range, else "0".
Q1-1.5IQR and Q3+1.5IQR are the lower and upper limits respectively.
Any value between the lower limit and the upper limit is not considered an
outlier.
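The same rule can be sketched in Python with the standard statistics module.
The data is hypothetical. The method="inclusive" option is used on the
assumption that it interpolates quartiles the same way as the spreadsheet
QUARTILE function; treat that equivalence as an assumption to verify against
your own sheet.

```python
from statistics import quantiles

values = [10, 12, 13, 15, 16, 18, 20, 95]  # hypothetical data

# Q1, Q2 (median) and Q3, using inclusive interpolation.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower limit
upper = q3 + 1.5 * iqr   # upper limit

# 1 marks an outlier, 0 marks a value inside the limits.
flags = [1 if (v < lower or v > upper) else 0 for v in values]
```

Only 95 falls outside the limits here, so it is the single value flagged as an
outlier.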
50% of the data will always be in range, as the data between Q1 and Q3 always
lies within the limits.
The MAX or MIN value is considered an outlier only if it is outside the range
Q1-1.5IQR to Q3+1.5IQR.
The median value will always be part of the IQR.
First calculate all the quartile values (Q1, Q2, Q3) and then find the IQR for
the amount of sweet products column.
Feature engineering helps to enrich our dataset with fruitful information and
gives us patterns to explore further. It is an independent step, but it can
also be done during the analysis, because it is not always possible to see in
advance which feature might come in handy.
Create a new column in which you use the IF function, so that if the age is 61
it returns 1. Then apply a filter on the new column to keep the rows that have
1, and count the number of rows.
Create a new column with the help of the IF function, using the condition that
if the total purchase is more than 20 it returns 1.
Then apply a filter on the new column to keep the rows with value 1, and count
the number of rows.
Apply a filter on the total number of kids column (which stores the count of
kids in a family) to keep values greater than 2, and then filter for the people
who have an income of less than 40000.
Assignment 3
You will find there is no outlier in this particular column and all the marks are in
the given range.
Use the =QUARTILE() function for every numerical column (Q1, Q2, Q3, Q4 and
IQR). Once you have all the values, check each value of the target column: if
it lies within the range Q1-1.5IQR to Q3+1.5IQR, mark it as 0, else 1. If there
is any ‘1’ in the new column, then the target column has an outlier.
So it becomes 148/213 ≈ 0.6948, and 0.6948 × 100 = 69.48%.