IJERT Data Analysis Using Python
Abstract- In this paper, the analysis of data using the Python programming language is studied. The basic processes of data analysis, namely cleaning, transforming, and modeling data, are briefly explained, and the main focus is on exploratory data analysis of an existing dataset and on finding insights from it. Some graphical analysis of the data from the dataset is shown using different libraries and functions of Python. Here, a dataset named “World Happiness Report 2021” is used to analyze and extract various information in both numerical and pictorial form.

Keywords:- Data analysis; python; data visualization; pandas; seaborn; exploratory data analysis

I. INTRODUCTION
Data are raw facts and figures that carry no proper information on their own and hence need to be processed to get the desired information. Information, in contrast, is the result obtained after processing the raw data at different levels, or the conclusions extracted from a given dataset, through a process called data analysis.

Data analysis means cleaning the data, transforming it into an understandable form, and then modeling it to extract useful information for business or organizational use. It is mainly used in making business decisions. Many libraries are available for doing the analysis, for example NumPy, Pandas, Seaborn, Matplotlib, and Sklearn [7].
• NumPy: NumPy is a Python library for numerical analysis. It stores data in the form of nd-arrays (n-dimensional arrays).
• Pandas: Pandas is mainly used for converting data into tabular form, which makes the data more structured and easier to read.
• Matplotlib: Matplotlib is a data visualisation and graphical plotting package for Python and its numerical extension NumPy that runs on all platforms.
• Seaborn: Seaborn is a Python data visualisation package based on Matplotlib that is tightly connected with pandas data structures. The core purpose of Seaborn is visualisation, which aids data exploration and comprehension.
• Sklearn: Scikit-learn is a widely used library for machine learning in Python. It includes numerous useful tools for classification, regression, clustering, and dimensionality reduction.

Data visualization makes data analysis more understandable and interactive by plotting or displaying the data in pictorial form. Pandas, an open-source Python package that deals with three different data structures (series, data frames, and, in older releases, panels), meets that need for analyzing and visualizing data [2].

Data analysis using Python makes the task easier, since Python offers several advantages for this kind of work. It is a high-level programming language (the code is in human-readable form), so it is easy for any programmer or user to understand and use. Many libraries and functions for statistical and numerical analysis are available in Python. Moreover, the source code is freely available to anyone (free and open source).

This paper covers the basic terms and functions that a beginner needs in order to understand what data analysis is. The paper is divided broadly into 4 sections. In section II, the main steps in data analysis are discussed. In section III, data analysis using Python is studied, covering the basics that Python needs for data analysis, together with data visualization, which aids the analysis by representing data in picture format. In section IV, the conclusion of the paper is given.

II. MAIN PHASES IN DATA ANALYSIS
A. Data requirements
Data are the most important unit in any study. Data must be provided as inputs to the analysis based on the analysis’ requirements. The term “experimental unit” refers to the type of entity on which data are gathered (e.g., a person or population of people). Specific variables of a population (such as height, weight, age, and salary) can be identified and obtained. Data may be numerical or categorical.

B. Data collecting
The gathering of data is known as data collecting. Data is gathered from a variety of sources, including relational databases, cloud databases, and other sources, depending on the study’s needs. Field sensors, such as traffic cameras, satellites, monitoring systems, and so on, can also be used as data sources.

C. Data processing
Data that are collected must be processed or organized for analysis. For instance, this may involve arranging data into rows and columns in a table format (known as structured data) for further analysis, often through the use of spreadsheet or statistical software.

D. Data cleaning
The method of cleaning data after it has been processed and organized is known as data cleaning. It scans for data
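As a minimal sketch of the data cleaning phase (Section II-D), the snippet below counts missing values and duplicated rows in a small table and then removes them with pandas. The table, its column names, and the choice to drop rather than repair problem rows are assumptions made for illustration; they are not taken from the paper or from the World Happiness Report dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical records, invented for this sketch (not the paper's dataset).
raw = pd.DataFrame({
    "country": ["A", "B", "B", "C"],
    "score": [7.1, 6.4, 6.4, np.nan],
})

# Number of missing values in each column.
missing_per_column = raw.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(raw.duplicated().sum())

# Drop exact duplicates first, then drop rows that still contain missing values.
clean = raw.drop_duplicates().dropna().reset_index(drop=True)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(clean)
```

Dropping rows is only one possible policy; depending on the study, missing values might instead be filled in (for example with `fillna`) so that fewer records are lost.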
For some selected rows: taildata.describe()
First, a subset of the dataframe is taken to analyse or visualize.
G. Graphical EDA
Fundamentally, graphical exploratory data analysis (GEDA) is the graphical equivalent of conventional non-graphical exploratory data analysis. It examines data sets in order to summarise their statistical characteristics by focusing on the same four main features: measures of central tendency, measures of spread, distribution form, and the presence of outliers. We also divide GEDA into three categories: univariate GEDA, bivariate GEDA, and multivariate GEDA. These categories and other aspects of GEDA are covered in more detail in the following paragraphs [5].
Fig. 14. Stem plot
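The four features named for exploratory analysis (central tendency, spread, distribution form, and outliers), together with a stem plot, can be sketched for a single variable as follows. This is an illustrative sketch only: the scores are made-up values rather than data from the World Happiness Report, the 1.5 × IQR rule is one common outlier convention that the paper does not specify, and it is assumed that the stem plot of Fig. 14 is of the kind drawn by Matplotlib's `stem`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical happiness-style scores for six unnamed countries (assumed values).
scores = pd.Series([7.8, 7.6, 7.5, 7.3, 7.2, 2.5],
                   index=list("ABCDEF"), name="score")

# describe() reports count, mean, std, min, quartiles and max in one call.
summary = scores.describe()

# Distribution form: negative skewness indicates a longer tail of low values.
skewness = scores.skew()

# Outliers under the 1.5 * IQR convention (an assumption of this sketch).
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1
outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]

# Stem plot: one vertical stem per observation.
positions = np.arange(len(scores))
fig, ax = plt.subplots()
ax.stem(positions, scores.to_numpy())
ax.set_xticks(positions)
ax.set_xticklabels(list(scores.index))
ax.set_ylabel("score")

# Render to an in-memory PNG; pass a file name instead to keep the image.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

print(summary)
print("skewness:", skewness)
print(outliers)
```

In a Jupyter notebook the Agg line and the buffer can be dropped, since the figure is displayed inline.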
Fig. 16. Scatter Plot
• Heat Maps: A heatmap is a graphical depiction of data that uses a color-coding method to represent various values. It represents a two-dimensional table of color shades. This plotting technique is popularly used in biology to represent gene expression and other multivariate data [3]. A heatmap example is shown in Fig. 17.

IV. CONCLUSION
In this paper, various phases of data analysis, including data collection, cleaning, and analysis, are discussed briefly. Exploratory data analysis is the main focus. The Python programming language is used for the implementation, and Jupyter Notebook is used for the detailed research. Different Python libraries and packages are introduced. Using various analysis and visualization methods, numerous results are extracted. The dataset “World Happiness Report 2021” is used to extract important information, such as the difference in the happiness scores of different countries, the contribution of an attribute to the score, and how one variable affects another. Various graphs have been plotted using the attributes in the dataset, which makes it easy to draw conclusions.
V. ACKNOWLEDGMENT
I express my heartfelt gratitude towards my mentor Ms.
Deepika Sharma for guiding me to accomplish such a great
work. I offer my sincere appreciation towards the Head of
Department, University Institute of Sciences (Mathematics
Department), Chandigarh University for giving me such a
chance to gain a wider view of knowledge.
VI. REFERENCES
[1] Viv Bewick, Liz Cheek, and Jonathan Ball. Statistics review 7: Correlation and regression. Critical Care, 2003.
[2] Dr Ossama Embarak, Embarak, and Karkal. Data analysis and visualization using Python. Springer, 2018.
[3] Nils Gehlenborg and Bang Wong. Heat maps. Nature Methods, 2012.
[4] Michel Jambu. Exploratory and multivariate data analysis. Elsevier, 1991.
[5] Matthieu Komorowski, Dominic C Marshall, Justin D Salciccioli, and Yves Crutain. Exploratory data analysis. Secondary Analysis of Electronic Health Records, 2016.
[6] Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc., 2012.
[7] Fabio Nelli. Python data analytics: Data analysis and science using Pandas, Matplotlib and the Python Programming Language. Apress, 2015.
[8] Kabita Sahoo, Abhaya Kumar Samal, Jitendra Pramanik, and Subhendu Kumar Pani. Exploratory data analysis using Python. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 2019.
[9] Guido Van Rossum et al. Python programming language. In USENIX Annual Technical Conference, 2007.
[10] David F Williamson, Robert A Parker, and Juliette S Kendrick. The box plot: a simple visual method to interpret data. Annals of Internal Medicine, 1989.