Published by: International Journal of Engineering Research & Technology (IJERT)
www.ijert.org    ISSN: 2278-0181    Vol. 10 Issue 07, July-2021

Data Analysis using Python


Kiranbala Nongthombam
University Institute of Sciences (Mathematics Department)
Chandigarh University, Punjab, India

Deepika Sharma
University Institute of Sciences (Mathematics Department)
Chandigarh University, Punjab, India

Abstract- In this paper, the analysis of data using the Python programming language is studied. The basic processes of data analysis, such as cleaning, transforming, and modeling of data, are briefly explained, and the focus is placed on exploratory data analysis of an existing dataset and on finding insights from it. Graphical analysis of the data is shown using different libraries and functions of Python. Here, a dataset named "World Happiness Report 2021" is used to analyze and extract various information in both numerical and pictorial form.

Keywords:- Data analysis; python; data visualization; pandas; seaborn; exploratory data analysis

I. INTRODUCTION
Data are raw facts and figures that carry no proper information on their own and hence need to be processed to obtain the desired information. Information, in turn, consists of the results obtained after processing the raw data at different levels, or the conclusions extracted from a given dataset through a process called data analysis.

Data analysis is simply the analysis of data in its various forms: cleaning the data, transforming it into an understandable form, and then modeling it to extract useful information for business or organizational use. It is mainly used for making business decisions. Many libraries are available for doing the analysis, for example NumPy, Pandas, Seaborn, Matplotlib, Sklearn, etc. [7].
• NumPy: NumPy is a library written in Python, used for numerical analysis. It stores data in the form of nd-arrays (n-dimensional arrays).
• Pandas: Pandas is mainly used for converting data into tabular form and hence makes the data more structured and easier to read.
• Matplotlib: Matplotlib is a data visualisation and graphical plotting package for Python and its numerical extension NumPy that runs on all platforms.
• Seaborn: Seaborn is a Python data visualisation package based on matplotlib that is tightly integrated with pandas data structures. The core purpose of Seaborn is visualisation, which aids in data exploration and comprehension.
• Sklearn: Scikit-learn is the most widely used library for machine learning in Python. It includes numerous tools for classification, regression, clustering, and dimensionality reduction.

Data visualization helps data analysis by making the results more understandable and interactive through plotting or displaying the data in pictorial form. Pandas, a Python open-source package that deals with three different data structures (series, data frames, and panels), meets this need for analyzing and visualizing data [2].
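As a minimal illustration of the pandas data structures mentioned above (panels have since been removed from recent pandas releases, so only Series and DataFrame are sketched here, with illustrative values rather than figures taken from the dataset):

import pandas as pd

# A Series is a one-dimensional labelled array.
scores = pd.Series([7.8, 7.6, 7.5], index=["Finland", "Denmark", "Switzerland"])

# A DataFrame is a two-dimensional labelled table built from columns.
happiness = pd.DataFrame({
    "Country name": ["Finland", "Denmark", "Switzerland"],
    "Ladder score": [7.8, 7.6, 7.5],
})

print(scores)
print(happiness.head())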


Data analysis using Python makes the task easier, since the Python programming language has many advantages over other programming languages. It is a high-level language (its code is close to human-readable form), so it is easy to understand and use for any programmer or user. Many libraries and functions for statistical and numerical analysis are available in Python. Moreover, its source code is freely available to anyone (free and open source).

This paper covers the basic terms and functions that a beginner needs in order to understand what data analysis is. The paper is divided broadly into four sections. In section II, the main phases of data analysis are discussed. In section III, data analysis using Python is studied, covering the basic requirements for carrying out data analysis in Python, with data visualization aiding the analysis by representing the data in pictorial form. In section IV, the conclusion of the paper is given.

II. MAIN PHASES IN DATA ANALYSIS

A. Data requirements
Data are the most important unit in any study. Data must be provided as inputs to the analysis based on the analysis' requirements. The term "experimental unit" refers to the type of entity from which the data are gathered (e.g., a person or a population of people). Specific population variables (such as height, weight, age, and salary) can then be identified and obtained. The data may be either numerical or categorical.

B. Data collecting
The gathering of data is simply known as data collecting. Data is gathered from a variety of sources, including relational databases, cloud databases, and other sources, depending on the study's needs. Field sensors, such as traffic cameras, satellites, monitoring systems, and so on, can also be used as data sources.

C. Data processing
Data that are collected must be processed or organized for analysis. For instance, this may involve arranging the data into rows and columns in a table format (known as structured data) for further analysis, often through the use of spreadsheet or statistical software.

D. Data cleaning
The method of cleaning data after it has been processed and organized is known as data cleaning. It scans for data inconsistencies, duplicates, and errors, and then removes them. The data cleaning process includes tasks such as record matching, identifying data inaccuracies, sorting data, identifying outliers, spell-checking textual data, and maintaining data quality. As a consequence, it keeps us from obtaining unexpected outcomes and helps us deliver high-quality data, which is essential for a successful result.

E. Exploratory data analysis
Once the datasets are cleaned and free of errors, they can be analyzed. A variety of techniques can be applied, such as exploratory data analysis (understanding the messages contained within the obtained data) and descriptive statistics (finding the average, median, etc.). Data visualization is also used, in which the data is represented in a graphical format in order to obtain additional insights into the information within the data [4].

F. Modeling and algorithms
Mathematical formulas or models (known as algorithms) may be applied to the data in order to identify relationships among the variables, for example using correlation or causation.

G. Data product
A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm.

III. DATA ANALYSIS USING PYTHON
In this section, data analysis using Python is studied. The most basic questions, such as why Python is used for data analysis and how anyone can start using it, are addressed. The important libraries, the platform, and the dataset used to carry out the analysis are introduced. The use of various Python functions for numerical analysis is shown, along with various methods of plotting graphs and charts.

A. Why use Python?
Python is a high-level, interpreted, multi-purpose programming language. Many programming paradigms, such as procedural and object-oriented programming, are supported in Python. It can be used for many applications, including statistical computing with various packages and functions. Moreover, it is easy to learn and can be picked up by anyone, including those with little programming experience [9].
Some features of Python are listed below:
• Open source and free
• Interpreted language
• Dynamic typing
• Portable
• Numerous IDEs

B. Packages used:
• NumPy
• Pandas
• Seaborn
• Matplotlib

C. Platform used:
• Anaconda (Jupyter Notebook)

D. Dataset used:
• World Happiness Report 2021

Fig. 1. A view of the dataset (World Happiness Report 2021)

E. Working with the dataset
• Importing libraries:
The libraries to be used in the analysis are imported first. The code to import them is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Fig. 2. Importing libraries

• Importing the dataset:
Here, the dataset (World Happiness Report 2021) is imported in the Jupyter notebook.
mydata = pd.read_csv("World Happiness report 2021.csv")
mydata

Fig. 3. Importing the dataset
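The import and loading steps above can be combined into one small, self-contained script. This is a minimal sketch; the CSV file name follows the one used in the paper and may need to be adjusted to match the actual downloaded file:

import pandas as pd

# Load the World Happiness Report 2021 dataset from a local CSV file.
mydata = pd.read_csv("World Happiness report 2021.csv")

# Quick sanity checks on the loaded DataFrame.
print(mydata.shape)              # (number of rows, number of columns)
print(mydata.columns.tolist())   # column names
print(mydata.head())             # first five rows by default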


• Cleaning Data
Removing unwanted data or null values is done in the process of data cleaning. So, first we need to check whether the dataset contains any null values or empty cells [6].
# isnull() returns True in every entry where there is no value or a NA value. sum() is used together with isnull() to find the total number of null values in every column.
mydata.isnull().sum()

Fig. 4. Checking null values in the dataset

According to the needs of the analysis, we can extract particular rows or records from the dataset. Here is an example that extracts the top-most and last rows from the dataset.
# head() is used to extract the top-most rows of the dataset; 5 is its default value. Here, the top 10 rows of the dataset are taken.
headdata = mydata.head(10)
headdata

Fig. 5. Top 10 rows of the dataset

# tail() is used to extract the last rows of the dataset; 5 is its default value.
taildata = mydata.tail(10)
taildata

Fig. 6. Last 10 rows of the dataset

F. Exploratory Data Analysis
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments [4][8].

• Data types: Datatype refers to the type of data; int, object, and float are the basic datatypes encountered in a pandas dataframe. The types of data of all the columns in the dataset are printed using dtypes:
mydata.dtypes

Fig. 7. Datatypes of all the columns in the dataset

• Describing the dataset: Describing the data of a dataset means extracting a summary of the given dataframe, such as mean, count, min, max, etc. It can be done using the describe() function.

For the whole dataset: mydata.describe()

Fig. 8. Summary of the whole dataset

For some selected rows: taildata.describe()

Fig. 9. Summary of some selected entries (10 last rows)
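The cleaning and inspection steps above can be collected into one short sketch. The dropna() and fillna() calls are standard pandas methods added here for illustration; they do not appear in the paper's figures, and the column chosen for fillna() is only an example:

import pandas as pd

mydata = pd.read_csv("World Happiness report 2021.csv")

# Count missing values per column.
print(mydata.isnull().sum())

# Two common ways to handle missing values (illustrative only):
cleaned = mydata.dropna()   # drop every row that contains a NA value
# mydata["Ladder score"] = mydata["Ladder score"].fillna(mydata["Ladder score"].mean())

# Quick structural overview of the data.
print(cleaned.dtypes)       # column data types
print(cleaned.describe())   # count, mean, std, min, quartiles, max
print(cleaned.head(10))     # top 10 rows
print(cleaned.tail(10))     # last 10 rows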


• Correlations: Correlation shows the relation between any two variables in the dataset. The strength of a linear relation between two variables is measured by correlation. The correlation of various attributes is printed using corr() [1].
# For the whole dataset:
mydata.corr()

Fig. 10. Correlation of the whole dataset

# For some selected columns or attributes:
mydata[['Country name', 'Regional indicator', 'Ladder score', 'Standard error of ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Generosity', 'Perceptions of corruption']].corr()

Fig. 11. Correlation of some attributes in the dataset

G. Graphical EDA
Fundamentally, graphical exploratory data analysis (GEDA) is the graphical equivalent of conventional non-graphical exploratory data analysis: it examines data sets in order to summarise their statistical characteristics by focusing on the same four main features, namely measures of central tendency, measures of spread, the shape of the distribution, and the presence of outliers. GEDA can also be divided into three categories: univariate GEDA, bivariate GEDA, and multivariate GEDA. These varieties and aspects of GEDA are discussed in the following paragraphs [5].
First, a subset of the dataframe is taken to analyse and visualize.

Fig. 12. A subset of the dataframe

1. Univariate GEDA
• Histogram: A histogram is a data representation that looks like a bar graph and buckets a range of outcomes into columns along the x-axis. The y-axis shows the numerical count or percentage of occurrences in each column and can therefore be used to illustrate the distribution of the data. A histogram in Python can be drawn using matplotlib.pyplot.hist().

Fig. 13. Histogram

• Stem Plot: A stem plot draws vertical lines from a baseline up to the y value and places a marker at each x position. The x positions are optional, and the line formats can be specified as keyword arguments or as positional arguments. A stem plot in Python can be drawn using matplotlib.pyplot.stem().

Fig. 14. Stem plot

• Box Plot: A box plot is a visual representation and comparison of groups of data. The box plot depicts the level, spread, and symmetry of a data distribution by using the median, approximate quartiles, outliers, and the lowest and highest data points (extreme values) [10].

Fig. 15. Boxplot
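A minimal sketch of the three univariate plots described above, assuming the dataset is loaded into mydata as before; the choice of the "Ladder score" column, the bin count, and the side-by-side layout are illustrative:

import pandas as pd
import matplotlib.pyplot as plt

mydata = pd.read_csv("World Happiness report 2021.csv")
score = mydata["Ladder score"].dropna()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: distribution of the ladder score.
axes[0].hist(score, bins=20)
axes[0].set_title("Histogram of Ladder score")

# Stem plot: a vertical line per country with a marker at the score value.
axes[1].stem(range(len(score)), score)
axes[1].set_title("Stem plot of Ladder score")

# Box plot: median, quartiles, and outliers of the ladder score.
axes[2].boxplot(score)
axes[2].set_title("Box plot of Ladder score")

plt.tight_layout()
plt.show()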


2. Multivariate GEDA
• Scatter plot: Dots are used to indicate the values of two different numeric variables in a scatter plot. The position of each dot on the horizontal and vertical axes indicates the values of one data point. Scatter plots are used to see how variables relate to one another. Here, a scatter plot of "Ladder score" against "Standard error of ladder score" is plotted below.

Fig. 16. Scatter Plot

• Heat Maps: A heatmap is a graphical depiction of data that uses a color-coding method to represent different values. It represents a two-dimensional table of color shades. This plotting technique is popularly used in biology to represent gene expression and other multivariate data [3]. A heatmap example is shown in Fig. 17.

Fig. 17. Heatmap

• Count Plot: A Seaborn count plot is a graphical representation of the number of occurrences, or frequency, of each categorical value, using bars to depict the counts. The countplot() function is used to visualize the number of observations in each category as bars. Here, a count plot is plotted for the subdata dataframe.

Fig. 18. Countplot
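The multivariate plots and the count plot can be reproduced with a short seaborn sketch. This is only an illustrative sketch: the heatmap is drawn from the correlation matrix of the numeric columns, and the columns chosen for the subdata subset are assumed, since the exact subset used in Fig. 12 is not listed in the text:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

mydata = pd.read_csv("World Happiness report 2021.csv")

# Scatter plot of the ladder score against its standard error.
sns.scatterplot(data=mydata, x="Ladder score", y="Standard error of ladder score")
plt.show()

# Heatmap of the correlations between the numeric attributes.
corr = mydata.select_dtypes(include="number").corr()
sns.heatmap(corr, cmap="viridis")
plt.show()

# Count plot of the number of countries per region (assumed subdata columns).
subdata = mydata[["Regional indicator", "Ladder score"]]
sns.countplot(data=subdata, x="Regional indicator")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()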

IV. CONCLUSION
In this paper, various phases of data analysis, including data collection, cleaning, and analysis, are discussed briefly. Exploratory data analysis is the main subject studied here. For the implementation, the Python programming language is used, and the work is carried out in a Jupyter notebook. Different Python libraries and packages are introduced. Using various analysis and visualization methods, numerous results are extracted. The dataset "World Happiness Report 2021" is used, and important information, such as the differences in the happiness scores of different countries, the dependence of the score on individual attributes, and how one variable affects another, is extracted. Various graphs have been plotted using different attributes of the dataset, allowing conclusions to be drawn in a straightforward way.

V. ACKNOWLEDGMENT
I express my heartfelt gratitude towards my mentor Ms. Deepika Sharma for guiding me to accomplish such a great work. I offer my sincere appreciation towards the Head of Department, University Institute of Sciences (Mathematics Department), Chandigarh University for giving me the chance to gain a wider view of knowledge.


VI. REFERENCES
[1] Viv Bewick, Liz Cheek, and Jonathan Ball. Statistics review 7: Correlation and regression. Critical Care, 2003.
[2] Ossama Embarak. Data analysis and visualization using Python. Springer, 2018.
[3] Nils Gehlenborg and Bang Wong. Heat maps. Nature Methods, 2012.
[4] Michel Jambu. Exploratory and multivariate data analysis. Elsevier, 1991.
[5] Matthieu Komorowski, Dominic C Marshall, Justin D Salciccioli, and Yves Crutain. Exploratory data analysis. Secondary Analysis of Electronic Health Records, 2016.
[6] Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
[7] Fabio Nelli. Python data analytics: Data analysis and science using Pandas, Matplotlib and the Python programming language. Apress, 2015.
[8] Kabita Sahoo, Abhaya Kumar Samal, Jitendra Pramanik, and Subhendu Kumar Pani. Exploratory data analysis using Python. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 2019.
[9] Guido Van Rossum et al. Python programming language. In USENIX Annual Technical Conference, 2007.
[10] David F Williamson, Robert A Parker, and Juliette S Kendrick. The box plot: a simple visual method to interpret data. Annals of Internal Medicine, 1989.
