Data Analysis Using Box and Whisker Plot

International Conference on Innovations in Power and Advanced Computing Technologies [i-PACT2017]

Data analysis using Box and Whisker Plot for Lung Cancer

Chandrasegar Thirumalai, IEEE Member,
School of Information Technology and Engineering,
VIT University, Vellore, India.
[email protected]

Vignesh M
MS Software Engineering,
School of Information Technology and Engineering,
VIT University, Vellore, India.
[email protected]

Balaji R
MS Software Engineering,
School of Information Technology and Engineering,
VIT University, Vellore, India.
[email protected]

Abstract— Statistical analysis deals with the collection, organization, analysis, and presentation of data, and the analysis is carried out according to our requirements. With the help of statistical analysis, we can find underlying patterns, relationships, and trends between data samples. The R system for statistical computing is an environment for data analysis and graphics. Here we implement the boxplot method and control chart methods for a lung cancer dataset. With the help of a boxplot, we can easily relate samples to one another and find the outliers.

Keywords: Data analysis, Lung Cancer, Decision making

I. INTRODUCTION

We have taken a lung cancer dataset of 12 primary attributes, as shown in Tables I and II.

TABLE I. DATA SET OF 1ST PART OF LUNG CANCER ATTRIBUTES.

Age  Smoking status  Years smoked  Average per day  Gender  Grade
68   Smoker          10            15               Male    UG
77   Former Smoker   15            10               Male    PG
68   Non Smoker      0             0                Male    PG
71   Smoker          27            10               Male    Nil
74   Smoker          10            5                Male    Nil
51   Smoker          10            3                Female  UG
54   Former Smoker   14            6                Female  PG
50   Non Smoker      0             0                Female  Nil
60   Smoker          5             5                Male    UG
54   Smoker          12            5                Male    PG
54   Non Smoker      0             0                Male    UG
56   Former Smoker   12            12               Male    Nil
87   Smoker          10            10               Male    PG
45   Non Smoker      0             0                Male    PG
76   Former Smoker   25            12               Male    UG

To analyze the relevant data of the lung cancer dataset, we have applied the box plot data analysis method, which is shown in Section III. A boxplot is a data analysis method used to summarize the samples. With the use of boxplots, we can easily compare different datasets. A boxplot is also called a box and whisker plot.

TABLE II. DATA SET OF 2ND PART OF LUNG CANCER ATTRIBUTES.

Race              Height  Weight  Family history  COPD  Year  Cancer
Asian             175     85      No              Yes   2000  Yes
Asian             180     90      Yes             Yes   2001  Yes
Asian             182     57      Yes             No    2002  No
American Indian   170     80      Yes             Yes   2003  Yes
African American  182     85      No              Yes   2000  No
White             170     60      Yes             Yes   2002  No
Latin             175     65      No              No    2003  No
Asian             178     59      Yes             No    2004  No
American Indian   187     70      No              No    2005  No
American Indian   187     54      Yes             Yes   2002  No
American Indian   187     56      Yes             Yes   2003  No
Asian             187     58      Yes             Yes   2001  Yes
Asian             185     89      Yes             Yes   2003  Yes
Asian             185     84      No              Yes   2002  No
Asian             185     74      No              Yes   2004  Yes

These datasets are used as the input to compute the Pearson correlation [6], [11], [14], [16], [22], [24]. Nowadays, enormous amounts of data are recorded by the banks, and exploring them requires complex calculations. We performed the product metric analysis on the given data collection. From the data analysis [8], [9], [12], [17], [18], [20] we can decide which attributes can be considered

978-1-5090-5682-8 /17/$31.00 ©2017 IEEE


and which attributes can be removed. For instance, in the Pearson method, if the value of r is more than 0.5 the attributes are considered strongly related, and if it is below 0.3 the attributes are weakly related.

Some of the earlier methods for assessing decisions based on the relationship between attributes are Spearman [11], the Analytical Hierarchical Process (AHP) [7], [13], [15], and the Traveling Salesman Problem (TSP) [26]. The sensitive information among the various components [19], [21], [28], [30], [32], [34], [36] of the bank stock model is handled by recent secured systems [23], [25], [27], [29], [31], [33], [35], [37].

In the boxplot method, the input data set is split into quartiles. A boxplot has a minimum value, lower quartile, median, upper quartile, and maximum value. It contains one box that extends from the lower quartile to the upper quartile; the difference between the upper quartile and the lower quartile is the length of the box. Inside the box, one vertical line is drawn at the median of the dataset. The median of the lower half of the samples is called the “lower quartile” and the median of the upper half is called the “upper quartile”. Outside the box, two more vertical lines are drawn: the one near the upper quartile is called the upper whisker and the one near the lower quartile is called the lower whisker, as shown in Fig. 1. The easiest way to find the quartiles is to first sort the data and take the minimum and maximum values as the lower bound and upper bound, respectively. The lower quartile, median, and upper quartile can then be found using the method in Section II.

Fig. 1. Box Plot Attributes.

II. DATA ANALYSIS

A. Box Plot

Step 1: Sort the data on a primary attribute.
Step 2: Calculate the median.
Step 3: Calculate the quartiles.
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: the ends of the box are the quartiles, the median is marked, the whiskers are drawn, and outliers are plotted individually.
Step 4: Calculate the outliers: values more than 1.5 x IQR below Q1 or above Q3.

III. CALCULATION AND DISCUSSIONS

This sample dataset is used to show how the boxplot method works.

TABLE III. SAMPLE DATASET OF 1ST PART BETWEEN AGE 25 TO 45.

Age  Smoking status  Years smoked  Average per day  Gender  Grade
25   Smoker          12            15               Male    Nil
21   Non Smoker      0             0                Male    Nil
22   Former Smoker   5             2                Male    Nil
28   Smoker          10            8                Female  PG
35   Smoker          7             3                Male    PG
18   Former Smoker   8             2                Female  PG
19   Non Smoker      0             0                Female  PG
40   Smoker          12            6                Male    PG
45   Smoker          45            4                Female  PG
23   Smoker          2             5                Male    PG

TABLE IV. SAMPLE DATASET OF 2ND PART BETWEEN AGE 25 TO 45.

Race   Height  Weight  Family history  Cancer  Year
Asian  180     75      Yes             Yes     2005
Asian  178     80      No              No      2004
Asian  165     78      No              No      2005
Asian  178     79      Yes             No      2004
Asian  189     75      Yes             Yes     2003
Asian  175     80      Yes             No      2005
Asian  148     79      No              Yes     2005
Asian  168     72      Yes             No      2003
Asian  189     85      No              No      2004
Asian  168     69      No              No      2005

Fig. 2. Smokers by Ages.

The above boxplot shows that, compared with former smokers and nonsmokers, smokers have a higher chance of being affected by lung cancer, since the boxplot for smokers has a higher median. With respect to the age attribute, people aged 25 to 40 have a high chance of cancer.
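
The steps of Section II-A can be sketched in code. The following Python sketch is an illustration only: the paper's analysis was done in R and its code is not shown. It assumes the median-of-halves definition of the quartiles described in the Introduction, and it reads "more than 1.5 x IQR" as the usual rule of flagging values that lie more than 1.5 x IQR below Q1 or above Q3. The function names are hypothetical; the data are the Years smoked column of Table III.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two values. Hypothetical helper, not the paper's code.
    """
    data = sorted(values)                 # Step 1: sort the data
    n = len(data)
    half = n // 2
    lower = data[:half]                   # lower half of the sorted samples
    upper = data[half + n % 2:]           # upper half (middle value skipped if n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

def find_outliers(values):
    """Flag values more than 1.5 x IQR below Q1 or above Q3 (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1                         # inter-quartile range
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# "Years smoked" column of Table III, in table order
years_smoked = [12, 0, 5, 10, 7, 8, 0, 12, 45, 2]

summary = five_number_summary(years_smoked)
flagged = find_outliers(years_smoked)
print("min, Q1, median, Q3, max:", summary)
print("outliers:", flagged)
```

On this column the summary is (0, 2, 7.5, 12, 45), so IQR = 10 and the upper fence is 27; the only flagged value is 45. Other quartile conventions (for example, interpolated percentiles) can give slightly different Q1 and Q3 values on the same data.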


Fig. 3. Cancer in Years (2003 – 2005).

The above boxplot shows that, comparing the years 2004 and 2005, the chance of being affected by cancer is very low in 2004, because the median lies near the lower quartile, while people in 2005 have a higher chance of getting cancer, because the median is near the upper quartile. This is easy to see from the boxplot.

IV. NUMERICAL RESULT ANALYSIS

A. Boxplot for cancer in years

Fig. 4. Box plot for Cancer in Years.

In Fig. 4, the median makes it easy to see that the number of people affected by cancer has increased compared with nonsmokers.

B. Boxplot for all the attributes in the dataset

Fig. 5. Boxplot for Overall Attributes.

Fig. 5 shows the boxplot for all attributes, with outliers.

C. Boxplot for Smoking status based on Age

Fig. 6. Boxplot for Smokers by Ages.

Fig. 6 shows that, between the ages of 60 and 80, smokers and former smokers have a higher chance of getting cancer than nonsmokers.

D. Boxplot for Smoking status based on Year

Fig. 7. Boxplot for Smoking Status of the People (2000 – 2005).

The number of smokers increased in 2004 compared with the other years in 2000 – 2005. Former smokers also have a lower chance of getting lung cancer compared with nonsmokers and smokers.

E. GG plot for Smoking Status vs Years Smoked

Fig. 8. GG plot for Smoking Status vs Years Smoked.

Fig. 8 shows that the average number of cigarettes per day is high for smokers compared with former smokers and nonsmokers. Here, the maximum average number of cigarettes consumed per day is 20 and the minimum is 0.
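
A grouped boxplot such as the one for smoking status by age compares the same attribute across categories. As a rough sketch of that comparison, the Python code below groups the Age column by Smoking status and reports the median of each group. It uses only the 15 rows printed in Table I, which may not be the full data behind the figures, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# (age, smoking status) pairs copied from Table I
table_i = [
    (68, "Smoker"), (77, "Former Smoker"), (68, "Non Smoker"),
    (71, "Smoker"), (74, "Smoker"), (51, "Smoker"),
    (54, "Former Smoker"), (50, "Non Smoker"), (60, "Smoker"),
    (54, "Smoker"), (54, "Non Smoker"), (56, "Former Smoker"),
    (87, "Smoker"), (45, "Non Smoker"), (76, "Former Smoker"),
]

def ages_by_status(rows):
    """Group ages by smoking status and sort each group (hypothetical helper)."""
    groups = defaultdict(list)
    for age, status in rows:
        groups[status].append(age)
    return {status: sorted(ages) for status, ages in groups.items()}

groups = ages_by_status(table_i)
medians = {status: median(ages) for status, ages in groups.items()}

# One line per box: group size, range, and the median drawn inside the box
for status, ages in groups.items():
    print(f"{status:14s} n={len(ages)} min={ages[0]} "
          f"median={medians[status]} max={ages[-1]}")
```

For these 15 rows the median age is 68 for smokers, 66 for former smokers, and 52 for nonsmokers, so the smoker and former-smoker boxes sit higher on the age axis than the nonsmoker box.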


F. 3D plot for Lung Cancer

Fig. 9. The 3D plot of Lung Cancer.

Fig. 9 shows cancer, years smoked, and smoking status.

G. Scatterplot for Lung cancer

Fig. 10. Scatter plot of Year, Smoking Status, Years Smoked, and Age.

Fig. 11. Lung Cancer Causes Options.

From the scatterplot diagram in Fig. 11, we can easily relate the attributes to one another. Here we have four attributes and four columns: the first column is for year, the second for smoking status, the third for years smoked, and the fourth for age.

H. 3D Scatterplot

Fig. 12. 3D Scatter plot of Year, Years Smoked, and Age.

Fig. 13. Lung Cancer Chances.

Fig. 13 shows the chance of getting lung cancer for all the attributes in the dataset. The first plot, for age, shows that people between the ages of 55 and 90 who have smoking habits have a high chance of lung cancer.

I. Control chart

Fig. 14. C Chart for Cancer over a period of Years.

In Fig. 14, the upper control limit for age is 1563 (15.63), the control limit is 1448 (14.48), and the lower control limit is 1334 (13.34). Cancer symptoms can mostly be identified between the ages of 13 and 15.
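
The paper does not state how the limits of the C chart for cancer over a period of years were obtained. A minimal sketch, assuming the textbook c chart in which counts are treated as Poisson and the limits are the center line plus or minus three times its square root, is given below in Python. The function name is hypothetical, and the center value 1448 is taken from the limits reported for that chart.

```python
import math

def c_chart_limits(c_bar):
    """Return (LCL, center line, UCL) for a c chart with 3-sigma limits.

    Assumes Poisson counts, so the standard deviation is sqrt(c_bar).
    Hypothetical helper for illustration; not the paper's R code.
    """
    sigma = math.sqrt(c_bar)
    lcl = max(0.0, c_bar - 3 * sigma)   # a count cannot be negative
    ucl = c_bar + 3 * sigma
    return lcl, c_bar, ucl

# Center line reported for the C chart for cancer over a period of years
lcl, center, ucl = c_chart_limits(1448)
print(round(lcl, 2), center, round(ucl, 2))   # 1333.84 1448 1562.16
```

Under this assumption a center line of 1448 gives limits of about 1333.84 and 1562.16, close to the reported 1334 and 1563, which suggests the chart was built with the usual 3-sigma rule on an unrounded center value.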


J. Control Chart

Fig. 15. C Chart of 300 Samples of Smokers for Lung Cancer Cause.

In the above control chart, the upper control limit is 18 and the control limit is 2.9. It is the control chart for all the data in the dataset. In the dataset of 300 samples, some people have a high chance of getting lung cancer.

V. CONCLUSION

The purpose of this paper is to use the “box and whisker plot” method for visualizing the samples of the dataset; from the results we can easily relate the attributes to one another. From the boxplot method, we learned at which ages smokers or former smokers are likely to get lung cancer. With the help of these boxplot results, we can build a system that takes some input from the user and predicts whether the person has any chance of getting cancer.

REFERENCES

[1] Kampstra, Peter. "Beanplot: A boxplot alternative for visual comparison of distributions." Journal of statistical software 28, no. 1 (2008): 1-9.
[2] Frigge, Michael, David C. Hoaglin, and Boris Iglewicz. "Some implementations of the boxplot." The American Statistician 43, no. 1 (1989): 50-54.
[3] Benjamini, Yoav. "Opening the Box of a Boxplot." The American Statistician 42, no. 4 (1988): 257-262.
[4] Hubert, Mia, and Ellen Vandervieren. "An adjusted boxplot for skewed distributions." Computational statistics & data analysis 52, no. 12 (2008): 5186-5201.
[5] Thriumani, Reena, et al. "Cancer detection using an electronic nose: A preliminary study on detection and discrimination of cancerous cells." Biomedical Engineering and Sciences (IECBES), 2014 IEEE Conference on. IEEE, 2014.
[6] Hauke J., Kossowski T., Comparison of values of Pearson’s and Spearman’s correlation coefficient on the same sets of data. Quaestiones Geographicae 30(2), Bogucki Wydawnictwo Naukowe, Poznań 2011, pp. 87–93, 3 figs, 1 table. DOI 10.2478/v10117-011-0021-1, ISBN 978-83-62662-62-3, ISSN 0137-477X.
[7] Piovani J.I., 2008. The historical construction of correlation as a conceptual and operative instrument for empirical research. Quality & Quantity 42: 757–777.
[8] P. Dhavachelvan, Chandra Segar T, K. Satheskumar, "Evaluation of SOA Complexity Metrics Using Weyuker’s Axioms," IEEE International Advance Computing (IACC), India, pp. 2325 – 2329, March 2009
[9] Halstead Metric for Intelligence, Effort, Time predictions, DOI:10.13140/RG.2.2.17988.42881
[10] Fisher R.A., 1921. On the “probable error” of a coefficient of correlation deduced from a small sample. Metron 1: 3–32.
[11] Spearman C.E, 1904b. General intelligence, objectively determined and measured. American Journal of Psychology 15: 201–293.
[12] Software metric Numerical Data analysis using Box plot and control chart methods, VIT University, DOI:10.13140/RG.2.2.27422.95041
[13] Vaishnavi B, Karthikeyan J, Kiran Yarrakula, Chandrasegar Thirumalai, “An Assessment Framework for Precipitation Decision Making Using AHP”, International Conference on Electronics and Communication Systems (ICECS), IEEE & 978-1-4673-7832-1, Feb. 2016
[14] Griffith D.A., 2003. Spatial autocorrelation and spatial filtering. Springer, Berlin.
[15] Chandrasegar Thirumalai, Senthilkumar M, “An Assessment Framework of Intuitionistic Fuzzy Network for C2B Decision Making”, International Conference on Electronics and Communication Systems (ICECS), IEEE & 978-1-4673-7832-1, Feb. 2016
[16] Rodgers J.L. & Nicewander W.A., 1988. Thirteen ways to look at the correlation coefficient. The American Statistician 42 (1): 59–66.
[17] F. Fioravanti, P. Nesi, “A method and tool for assessing object-oriented projects and metrics management,” Journal of Systems and Software, Volume 53, Issue 2, 31 August 2000, Pages 111-136
[18] Galton F., 1875. Statistics by intercomparison. Philosophical Magazine 49: 33–46
[19] Chandrasegar Thirumalai, Viswanathan P, “Diophantine based Asymmetric Cryptomata for Cloud Confidentiality and Blind Signature applications,” JISA, Elsevier, 2017.
[20] Galton F., 1877. Typical laws of heredity. Proceedings of the Royal Institution 8: 282–301.
[21] Chandrasegar Thirumalai, Sathish Shanmugam, “Multi-key distribution scheme using Diophantine form for secure IoT communications,” IEEE IPACT 2017.
[22] Galton F., 1888. Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London 45: 135–145.
[23] Chandrasegar Thirumalai, Senthilkumar M, “Spanning Tree approach for Error Detection and Correction,” IJPT, Vol. 8, Issue No. 4, Dec-2016, pp. 5009-5020.
[24] Galton F., 1890. Kinship and correlation. North American Review 150: 419–431.
[25] Chandrasegar Thirumalai, Senthilkumar M, “Secured E-Mail System using Base 128 Encoding Scheme,” International journal of pharmacy and technology, Vol. 8 Issue 4, Dec. 2016, pp. 21797-21806.
[26] M.Senthilkumar, T.Chandrasegar, M.K. Nallakaruppan, S.Prasanna, “A Modified and Efficient Genetic Algorithm to Address a Travelling Salesman Problem,” in International Journal of Applied Engineering Research, Vol. 9 No. 10, 2014, pp. 1279-1288
[27] Nallakaruppan, M.K., Senthil Kumar, M., Chandrasegar, T., Suraj, K.A., Magesh, G., “Accident avoidance in railway tracks using Adhoc wireless networks,” 2014, IJAER, 9 (21), pp. 9551-9556.
[28] T Chandra Segar, R Vijayaragavan, “Pell's RSA key generation and its security analysis,” in Computing, Communications and Networking Technologies (ICCCNT) 2013, pp. 1-5
[29] Chandrasegar Thirumalai, Senthilkumar M, Vaishnavi B, “Physicians Medicament using Linear Public Key Crypto System,” in International conference on Electrical, Electronics, and Optimization Techniques, ICEEOT, IEEE & 978-1-4673-9939-5, March 2016.
[30] Chandrasegar Thirumalai, “Physicians Drug encoding system using an Efficient and Secured Linear Public Key Cryptosystem (ESLPKC),” International journal of pharmacy and technology, Vol. 8 Issue 3, Sep. 2016, pp. 16296-16303


[31] E Malathy, Chandra Segar Thirumalai, "Review on non-linear set associative cache design," IJPT, Dec-2016, Vol. 8, Issue No.4, pp. 5320-5330
[32] “DDoS: Survey Of Traceback Methods”, International Joint Journal Conference in Engineering 2009, ISSN 1797-9617.
[33] Chandrasegar Thirumalai, Senthilkumar M, Silambarasan R, Carlos Becker Westphall, “Analyzing the strength of Pell’s RSA,” IJPT, Vol. 8 Issue 4, Dec. 2016, pp. 21869-21874.
[34] Chandramowliswaran N, Srinivasan.S and Chandra Segar T, “A Novel scheme for Secured Associative Mapping” The International J. of Computer Science and Applications (TIJCSA) & India, TIJCSA Publishers & 2278-1080, Vol. 1, No 5 / pp. 1-7 / July 2012
[35] Chandrasegar Thirumalai, “Review on the memory efficient RSA variants,” International Journal of Pharmacy and Technology, Vol. 8 Issue 4, Dec. 2016, pp. 4907-4916.
[36] Vinothini S, Chandra Segar Thirumalai, Vijayaragavan R, Senthil Kumar M, “A Cubic based Set Associative Cache encoded mapping,” International Research Journal of Engineering and Technology (IRJET), Volume: 02 Issue: 02 May -2015
[37] Chandrasegar Thirumalai, Himanshu Kar, “Memory Efficient Multi Key (MEMK) generation scheme for secure transportation of sensitive data over Cloud and IoT devices,” IEEE IPACT 2017.
