Boxplot
Data analysis using Box and Whisker Plot for Lung Cancer
All content following this page was uploaded by Chandra Segar Thirumalai on 18 March 2017.
Vignesh M
MS Software Engineering,
School of Information Technology and Engineering,
VIT University, Vellore, India.
[email protected]

Balaji R
MS Software Engineering,
School of Information Technology and Engineering,
VIT University, Vellore, India.
[email protected]
Abstract— In statistical analysis, we start from a collection of data and analyze it according to
our requirements. Statistical analysis deals with the collection, organization, analysis, and
presentation of data. With its help, we can find underlying patterns, relationships, and trends
between data samples. The R system for statistical computing is an environment for data analysis
and graphics. Here we implement the boxplot method and control chart methods for a lung cancer
dataset. With the help of a boxplot, we can easily relate samples to one another and find the
outliers.

Keywords: Data analysis, Lung Cancer, Decision making

I. INTRODUCTION
We have taken a lung cancer dataset with 12 primary attributes, as shown in Tables I and II.

TABLE I. DATA SET OF 1ST PART OF LUNG CANCER ATTRIBUTES.
Age  Smoking status  Years smoked  Average per day  Gender  Grade
68   Smoker          10            15               Male    UG
77   Former Smoker   15            10               Male    PG
68   Non Smoker      0             0                Male    PG
71   Smoker          27            10               Male    Nil
74   Smoker          10            5                Male    Nil
51   Smoker          10            3                Female  UG
54   Former Smoker   14            6                Female  PG
50   Non Smoker      0             0                Female  Nil
60   Smoker          5             5                Male    UG
54   Smoker          12            5                Male    PG
54   Non Smoker      0             0                Male    UG
56   Former Smoker   12            12               Male    Nil
87   Smoker          10            10               Male    PG
45   Non Smoker      0             0                Male    PG
76   Former Smoker   25            12               Male    UG

To analyze the relevant data of the lung cancer dataset, we have applied the boxplot data analysis
method, which is shown in Section 3. A boxplot is a data analysis method used to summarize the
distribution of the samples. With a boxplot, we can easily compare different datasets. The boxplot
is also called the box and whisker plot method.

TABLE II. DATA SET OF 2ND PART OF LUNG CANCER ATTRIBUTES.
Race              Height  Weight  Family history  COPD  Year  Cancer
Asian             175     85      No              Yes   2000  Yes
Asian             180     90      Yes             Yes   2001  Yes
Asian             182     57      Yes             No    2002  No
American Indian   170     80      Yes             Yes   2003  Yes
African American  182     85      No              Yes   2000  No
White             170     60      Yes             Yes   2002  No
Latin             175     65      No              No    2003  No
Asian             178     59      Yes             No    2004  No
American Indian   187     70      No              No    2005  No
American Indian   187     54      Yes             Yes   2002  No
American Indian   187     56      Yes             Yes   2003  No
Asian             187     58      Yes             Yes   2001  Yes
Asian             185     89      Yes             Yes   2003  Yes
Asian             185     84      No              Yes   2002  No
Asian             185     74      No              Yes   2004  Yes

In the boxplot method, the input dataset is split into quartiles. A boxplot shows a minimum value,
lower quartile, median, upper quartile, and maximum value. It contains one box, which extends from
the lower quartile to the upper quartile; the difference between the upper and lower quartiles is
the length of the box. Inside the box, one vertical line is drawn at the median of the dataset. The
median of the lower half of the samples is called the “lower quartile” and the median of the upper
half is called the “upper quartile”. Outside the box, two more vertical lines are drawn: the one
near the upper quartile is called the upper whisker and the one near the lower quartile is called
the lower whisker, as shown in Fig. 1. The easiest way to find the quartiles is to first sort the
data and take the minimum and maximum values as the lower bound and upper bound, respectively.
Lower quartile, median
International Conference on Innovations in Power and Advanced Computing Technologies [i-PACT2017]
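The quartile construction described in the Introduction (sort the data, take the minimum and maximum as bounds, and take the medians of the lower and upper halves as the quartiles) can be sketched as follows. This is a minimal illustration in Python rather than the R environment named in the abstract. The function name `five_number_summary` is hypothetical, and we assume that the middle value is left out of both halves when the number of samples is odd. Statistical packages use several quartile conventions, so their values may differ slightly from this sketch.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Assumes at least two values. For an odd count, the middle value is
    excluded from both halves (an assumption of this sketch).
    """
    data = sorted(values)              # step 1: sort the data
    n = len(data)
    half = n // 2
    lower = data[:half]                # samples below the median position
    upper = data[half + (n % 2):]      # samples above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])


# "Age" column of Table I
ages = [68, 77, 68, 71, 74, 51, 54, 50, 60, 54, 54, 56, 87, 45, 76]

minimum, q1, med, q3, maximum = five_number_summary(ages)
box_length = q3 - q1                   # length of the box (interquartile range)
print(minimum, q1, med, q3, maximum, box_length)
```

For the Age column this gives a box running from 54 to 74 with the median line at 60.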
In Fig. 4 above, the median makes it easy to see that the number of people affected by cancer is
higher in comparison with nonsmokers.
B. Boxplot for all the attributes in the dataset
E. GG plot for Smoking Status vs Years Smoked
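Group-wise plots such as smoking status against years smoked compare the groups mainly through their medians. The sketch below, written in Python rather than R, computes the median of the "Years smoked" column of Table I for each smoking status. The helper name `median_by_group` is hypothetical, and the sketch only produces the numbers that such a plot would display; it does not reproduce any figure of the paper.

```python
from collections import defaultdict
from statistics import median

# (smoking status, years smoked) pairs copied from Table I
records = [
    ("Smoker", 10), ("Former Smoker", 15), ("Non Smoker", 0),
    ("Smoker", 27), ("Smoker", 10), ("Smoker", 10),
    ("Former Smoker", 14), ("Non Smoker", 0), ("Smoker", 5),
    ("Smoker", 12), ("Non Smoker", 0), ("Former Smoker", 12),
    ("Smoker", 10), ("Non Smoker", 0), ("Former Smoker", 25),
]


def median_by_group(pairs):
    """Group the values by label and return the median of each group."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    return {label: median(vals) for label, vals in groups.items()}


medians = median_by_group(records)
for status, value in medians.items():
    print(f"{status}: median years smoked = {value}")
```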
I. Control chart
J. Control Chart
Fig. 15. C Chart of 300 Samples of Smokers for Lung Cancer Cause.
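For reference, the control limits of a c chart follow the standard construction: the center line is the mean count, and the limits lie three standard deviations away, where the standard deviation of a count is taken as the square root of the mean. The Python sketch below illustrates this. The function names are hypothetical, and the counts are made-up values for illustration only; they are not the 300 samples plotted in Fig. 15.

```python
from math import sqrt


def c_chart_limits(counts):
    """Return (lower control limit, center line, upper control limit)."""
    c_bar = sum(counts) / len(counts)      # center line: mean count
    spread = 3 * sqrt(c_bar)               # three-sigma band for counts
    lower = max(0.0, c_bar - spread)       # a count cannot be negative
    return (lower, c_bar, c_bar + spread)


def out_of_control(counts):
    """Return the indices of samples that fall outside the control limits."""
    lcl, _, ucl = c_chart_limits(counts)
    return [i for i, c in enumerate(counts) if c < lcl or c > ucl]


# Hypothetical counts for illustration only; NOT data from the paper
example_counts = [4, 6, 5, 3, 7, 5, 4, 16, 5, 5]

limits = c_chart_limits(example_counts)
flagged = out_of_control(example_counts)
print(limits, flagged)
```

With these illustrative counts, only the sample with count 16 lies above the upper limit.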