Data Analysis Using Box and Whisker Plot
Vignesh M
MS Software Engineering,
School of Information Technology and Engineering,
VIT University, Vellore, India.

Balaji R
MS Software Engineering,
School of Information Technology and Engineering,
VIT University, Vellore, India.
Abstract: In statistical analysis, we start from a collection of data and analyze it according to our requirements. Statistical analysis deals with the collection, analysis, presentation, and organization of data. With its help, we can find underlying patterns, relationships, and trends among data samples. The R system for statistical computing is an environment for data analysis and graphics. Here we implement the boxplot method and control chart methods for a lung cancer dataset. With the help of a boxplot, we can easily relate samples to one another and find the outliers.

To analyze the relevant data of the lung cancer dataset, we have applied the boxplot data analysis method, which is shown in Section 3. A boxplot is a data analysis method used to summarize the spread of the samples. With a boxplot, we can easily compare different datasets. A boxplot is also called a box and whisker plot.

Age  Smoking status  Years smoked  Average per day  Gender  Grade
68   Smoker          10            15               Male    UG
77   Former Smoker   15            10               Male    PG
68   Non Smoker      0             0                Male    PG
71   Smoker          27            10               Male    Nil
74   Smoker          10            5                Male    Nil
51   Smoker          10            3                Female  UG
54   Former Smoker   14            6                Female  PG
50   Non Smoker      0             0                Female  Nil
60   Smoker          5             5                Male    UG
54   Smoker          12            5                Male    PG
54   Non Smoker      0             0                Male    UG
56   Former Smoker   12            12               Male    Nil
87   Smoker          10            10               Male    PG
45   Non Smoker      0             0                Male    PG
76   Former Smoker   25            12               Male    UG

TABLE II. DATA SET OF 2ND PART OF LUNG CANCER ATTRIBUTES.

Race             Height  Weight  Family history  COPD  Year  Cancer
Latin            175     65      No              No    2003  No
Asian            178     59      Yes             No    2004  No
American Indian  187     70      No              No    2005  No
American Indian  187     54      Yes             Yes   2002  No
American Indian  187     56      Yes             Yes   2003  No
Asian            187     58      Yes             Yes   2001  Yes
Asian            185     89      Yes             Yes   2003  Yes
Asian            185     84      No              Yes   2002  No
Asian            185     74      No              Yes   2004  Yes

These datasets are used as the input for computing the Pearson correlation [6], [11], [14], [16], [22], [24]. Nowadays, enormous amounts of data are recorded by banks, and exploring them requires complex computations. We performed the product metric analysis on the given data set. From the data analysis [8], [9], [12], [17], [18], [20] we can decide which attribute can be considered
International Conference on Innovations in Power and Advanced Computing Technologies [i-PACT2017]
and which attribute can be removed. For instance, in the Pearson method, if the value of r is more than 0.5 the attributes are considered strongly correlated, and if it is below 0.3 the attributes are weakly correlated.

Some of the earlier methods for evaluating choices based on the relationship between their values are Spearman [11], the Analytic Hierarchy Process (AHP) [7], [13], [15], and the Traveling Salesman Problem (TSP) [26]. Sensitive information exchanged among the various components [19], [21], [28], [30], [32], [34], [36] of the bank stock model is handled by recent security systems [23], [25], [27], [29], [31], [33], [35], [37].

In the boxplot method, the input dataset is split into quartiles. A boxplot has a minimum value, lower quartile, median, upper quartile, and maximum value. It contains one box, which runs from the lower quartile to the upper quartile. The difference between the upper quartile and the lower quartile (the interquartile range, IQR) is the length of the box. Inside the box, one vertical line is drawn at the median of the dataset. The median of the lower samples is called the "lower quartile" and the median of the higher samples is called the "upper quartile". Outside the box, two more vertical lines are drawn: the one near the upper quartile is called the upper whisker and the one near the lower quartile is called the lower whisker, as shown in Fig. 1. The easiest way to find the quartiles is to first sort the data and take the minimum and maximum values as the lower bound and upper bound, respectively. The lower quartile, median, and upper quartile can be found using the methods in Section 2.

Fig. 1. Box Plot Attributes.

Five-number summary: min, Q1, M, Q3, max.
Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually.

Step 4: Calculate the outliers:
Values lying more than 1.5 x IQR below Q1 or above Q3.

III. CALCULATION AND DISCUSSIONS
This is the sample dataset that we use to show how the boxplot method works.

TABLE III. SAMPLE DATASET OF 1ST PART BETWEEN AGE 25 TO 45.

Age  Smoking status  Years smoked  Average per day  Gender  Grade
25   Smoker          12            15               Male    Nil
21   Non Smoker      0             0                Male    Nil
22   Former Smoker   5             2                Male    Nil
28   Smoker          10            8                Female  PG
35   Smoker          7             3                Male    PG
18   Former Smoker   8             2                Female  PG
19   Non Smoker      0             0                Female  PG
40   Smoker          12            6                Male    PG
45   Smoker          45            4                Female  PG
23   Smoker          2             5                Male    PG

TABLE IV. SAMPLE DATASET OF 2ND PART BETWEEN AGE 25 TO 45.

Race   Height  Weight  Family history  Cancer  Year
Asian  180     75      Yes             Yes     2005
Asian  178     80      No              No      2004
Asian  165     78      No              No      2005
Asian  178     79      Yes             No      2004
Asian  189     75      Yes             Yes     2003
Asian  175     80      Yes             No      2005
Asian  148     79      No              Yes     2005
Asian  168     72      Yes             No      2003
Asian  189     85      No              No      2004
Asian  168     69      No              No      2005

Fig. 2. Smokers by Ages.

The above boxplot shows that, compared with former smokers and nonsmokers, smokers have a higher chance of being affected by lung cancer, since the boxplot for smokers has a higher median. With respect to the age attribute, people aged 25 to 40 have a high chance of cancer.
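The five-number summary and the 1.5 x IQR outlier rule described in the text can be sketched on the Age column of Table III. This is a minimal illustration, not the authors' implementation: the paper works in R, while the sketch uses Python with only the standard library. The function names (five_number_summary, outliers, median_by_group) are hypothetical. The quartiles follow the "median of the lower samples / median of the higher samples" description, and leaving the middle value out of both halves when the sample size is odd is an assumption, since the text does not specify it.

```python
from statistics import median

# Age and smoking-status columns copied from Table III.
AGES = [25, 21, 22, 28, 35, 18, 19, 40, 45, 23]
STATUS = ["Smoker", "Non Smoker", "Former Smoker", "Smoker", "Smoker",
          "Former Smoker", "Non Smoker", "Smoker", "Smoker", "Smoker"]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumption: for an odd number of values the middle value is left
    out of both halves before their medians are taken.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]        # samples below the median position
    upper = data[n - half:]    # samples above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]


def outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def median_by_group(values, labels):
    """Return the median of the values within each label group."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: median(vs) for label, vs in groups.items()}


if __name__ == "__main__":
    print(five_number_summary(AGES))
    print(outliers(AGES))
    print(median_by_group(AGES, STATUS))
```

For the ten ages in Table III this gives min 18, Q1 21, median 24, Q3 35, and max 45, so the IQR is 14 and no age falls outside the fences at 0 and 56. The per-group medians are 31.5 for smokers and 20 for both former smokers and nonsmokers. Other quartile conventions, such as the default used by R's quantile function, can give slightly different Q1 and Q3 values.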
IV. NUMERICAL RESULT ANALYSIS
A. Boxplot for cancer in years
The above boxplot compares the years 2004 and 2005. In 2004 the chance of being affected by cancer is very low, because the median lies in the lower quartile, whereas people in 2005 have a higher chance of getting cancer, because the median is near the upper quartile. This is easy to see from the boxplot.

Fig. 6 shows that, among people aged 60 to 80, smokers and former smokers have a higher chance of getting cancer than nonsmokers.
D. Boxplot for Smoking status based on Year
I. Control chart
Fig. 10. Scatter plot of Year, Smoking Status, Years Smoked, and Age.