Published by: International Journal of Engineering Research & Technology (IJERT)
www.ijert.org    ISSN: 2278-0181    Vol. 10 Issue 07, July-2021

Data Analysis using Python


Kiranbala Nongthombam
University Institute of Sciences (Mathematics Department)
Chandigarh University, Punjab, India

Deepika Sharma
University Institute of Sciences (Mathematics Department)
Chandigarh University, Punjab, India

Abstract- In this paper, the analysis of data using the Python programming language is studied. The basic processes of data analysis, such as cleaning, transforming, and modeling of data, are briefly explained, and the focus is placed on exploratory data analysis of an existing dataset and on finding insights from it. Graphical analysis of the data is shown using different libraries and functions of Python. Here, a dataset named "World Happiness Report 2021" is used to analyze and extract various information in both numerical and pictorial form.

Keywords:- Data analysis; python; data visualization; pandas; seaborn; exploratory data analysis

I. INTRODUCTION
Data are raw facts and figures that carry no proper information on their own and hence need to be processed to obtain the desired information. Information, in turn, consists of the results obtained after processing the raw data at different levels, or the conclusions extracted from a given dataset through a process called data analysis.

Data analysis is simply the analysis of data in its various forms: cleaning the data, transforming it into an understandable form, and then modeling it to extract useful information for business or organizational use. It is mainly used for making business decisions. Many libraries are available for doing the analysis, for example NumPy, Pandas, Seaborn, Matplotlib, Sklearn, etc. [7].
• NumPy: NumPy is a library written in Python, used for numerical analysis. It stores data in the form of nd-arrays (n-dimensional arrays).
• Pandas: Pandas is mainly used for converting data into tabular form and hence makes the data more structured and easier to read.
• Matplotlib: Matplotlib is a data visualisation and graphical plotting package for Python and its numerical extension NumPy that runs on all platforms.
• Seaborn: Seaborn is a Python data visualisation package based on matplotlib that is tightly integrated with pandas data structures. The core purpose of Seaborn is visualisation, which aids in data exploration and comprehension.
• Sklearn: Scikit-learn is the most widely used library for machine learning in Python. It includes numerous tools for classification, regression, clustering, and dimensionality reduction.

Data visualization helps data analysis by making the results more understandable and interactive through plotting or displaying the data in pictorial form. Pandas, a Python open-source package that deals with three different data structures (series, data frames, and panels), meets this need for analyzing and visualizing data [2].
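As a minimal illustration of the pandas data structures mentioned above (panels have since been removed from recent pandas releases, so only Series and DataFrame are sketched here, with illustrative values rather than figures taken from the dataset):

import pandas as pd

# A Series is a one-dimensional labelled array.
scores = pd.Series([7.8, 7.6, 7.5], index=["Finland", "Denmark", "Switzerland"])

# A DataFrame is a two-dimensional labelled table built from columns.
happiness = pd.DataFrame({
    "Country name": ["Finland", "Denmark", "Switzerland"],
    "Ladder score": [7.8, 7.6, 7.5],
})

print(scores)
print(happiness.head())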


Data analysis using Python makes the task easier, since the Python programming language has many advantages over other programming languages. It is a high-level language (its code is close to human-readable form), so it is easy to understand and use for any programmer or user. Many libraries and functions for statistical and numerical analysis are available in Python. Moreover, its source code is freely available to anyone (free and open source).

This paper covers the basic terms and functions that a beginner needs in order to understand what data analysis is. The paper is divided broadly into four sections. In section II, the main phases of data analysis are discussed. In section III, data analysis using Python is studied, covering the basic requirements for carrying out data analysis in Python, with data visualization aiding the analysis by representing the data in pictorial form. In section IV, the conclusion of the paper is given.

II. MAIN PHASES IN DATA ANALYSIS

A. Data requirements
Data are the most important unit in any study. Data must be provided as inputs to the analysis based on the analysis' requirements. The term "experimental unit" refers to the type of entity from which the data are gathered (e.g., a person or a population of people). Specific population variables (such as height, weight, age, and salary) can then be identified and obtained. The data may be either numerical or categorical.

B. Data collecting
The gathering of data is simply known as data collecting. Data is gathered from a variety of sources, including relational databases, cloud databases, and other sources, depending on the study's needs. Field sensors, such as traffic cameras, satellites, monitoring systems, and so on, can also be used as data sources.

C. Data processing
Data that are collected must be processed or organized for analysis. For instance, this may involve arranging the data into rows and columns in a table format (known as structured data) for further analysis, often through the use of spreadsheet or statistical software.

D. Data cleaning
The method of cleaning data after it has been processed and organized is known as data cleaning. It scans for data inconsistencies, duplicates, and errors, and then removes them. The data cleaning process includes tasks such as record matching, identifying data inaccuracies, sorting data, identifying outliers, spell-checking textual data, and maintaining data quality. As a consequence, it keeps us from obtaining unexpected outcomes and helps us deliver high-quality data, which is essential for a successful result.

E. Exploratory data analysis
Once the datasets are cleaned and free of errors, they can be analyzed. A variety of techniques can be applied, such as exploratory data analysis (understanding the messages contained within the obtained data) and descriptive statistics (finding the average, median, etc.). Data visualization is also used, in which the data is represented in a graphical format in order to obtain additional insights into the information within the data [4].

F. Modeling and algorithms
Mathematical formulas or models (known as algorithms) may be applied to the data in order to identify relationships among the variables, for example using correlation or causation.

G. Data product
A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm.

III. DATA ANALYSIS USING PYTHON
In this section, data analysis using Python is studied. The most basic questions, such as why Python is used for data analysis and how anyone can start using it, are addressed. The important libraries, the platform, and the dataset used to carry out the analysis are introduced. The use of various Python functions for numerical analysis is shown, along with various methods of plotting graphs and charts.

A. Why use Python?
Python is a high-level, interpreted, multi-purpose programming language. Many programming paradigms, such as procedural and object-oriented programming, are supported in Python. It can be used for many applications, including statistical computing with various packages and functions. Moreover, it is easy to learn and can be picked up by anyone, including those with little programming experience [9].
Some features of Python are listed below:
• Open source and free
• Interpreted language
• Dynamic typing
• Portable
• Numerous IDEs

B. Packages used:
• NumPy
• Pandas
• Seaborn
• Matplotlib

C. Platform used:
• Anaconda (Jupyter Notebook)

D. Dataset used:
• World Happiness Report 2021

Fig. 1. A view of the dataset (World Happiness Report 2021)

E. Working with the dataset
• Importing libraries:
The libraries to be used in the analysis are imported first. The code to import them is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Fig. 2. Importing libraries

• Importing the dataset:
Here, the dataset (World Happiness Report 2021) is imported in the Jupyter notebook.
mydata = pd.read_csv("World Happiness report 2021.csv")
mydata

Fig. 3. Importing the dataset
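The import and loading steps above can be combined into one small, self-contained script. This is a minimal sketch; the CSV file name follows the one used in the paper and may need to be adjusted to match the actual downloaded file:

import pandas as pd

# Load the World Happiness Report 2021 dataset from a local CSV file.
mydata = pd.read_csv("World Happiness report 2021.csv")

# Quick sanity checks on the loaded DataFrame.
print(mydata.shape)              # (number of rows, number of columns)
print(mydata.columns.tolist())   # column names
print(mydata.head())             # first five rows by default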


• Cleaning Data
Removing unwanted data or null values is done in the process of data cleaning. So, first we need to check whether the dataset contains any null values or empty cells [6].
# isnull() returns True in every entry where there is no value or a NA value. sum() is used together with isnull() to find the total number of null values in every column.
mydata.isnull().sum()

Fig. 4. Checking null values in the dataset

According to the needs of the analysis, we can extract particular rows or records from the dataset. Here is an example that extracts the top-most and last rows from the dataset.
# head() is used to extract the top-most rows of the dataset; 5 is its default value. Here, the top 10 rows of the dataset are taken.
headdata = mydata.head(10)
headdata

Fig. 5. Top 10 rows of the dataset

# tail() is used to extract the last rows of the dataset; 5 is its default value.
taildata = mydata.tail(10)
taildata

Fig. 6. Last 10 rows of the dataset

F. Exploratory Data Analysis
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments [4][8].

• Data types: Datatype refers to the type of data; int, object, and float are the basic datatypes encountered in a pandas dataframe. The types of data of all the columns in the dataset are printed using dtypes:
mydata.dtypes

Fig. 7. Datatypes of all the columns in the dataset

• Describing the dataset: Describing the data of a dataset means extracting a summary of the given dataframe, such as mean, count, min, max, etc. It can be done using the describe() function.

For the whole dataset: mydata.describe()

Fig. 8. Summary of the whole dataset

For some selected rows: taildata.describe()

Fig. 9. Summary of some selected entries (10 last rows)
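The cleaning and inspection steps above can be collected into one short sketch. The dropna() and fillna() calls are standard pandas methods added here for illustration; they do not appear in the paper's figures, and the column chosen for fillna() is only an example:

import pandas as pd

mydata = pd.read_csv("World Happiness report 2021.csv")

# Count missing values per column.
print(mydata.isnull().sum())

# Two common ways to handle missing values (illustrative only):
cleaned = mydata.dropna()   # drop every row that contains a NA value
# mydata["Ladder score"] = mydata["Ladder score"].fillna(mydata["Ladder score"].mean())

# Quick structural overview of the data.
print(cleaned.dtypes)       # column data types
print(cleaned.describe())   # count, mean, std, min, quartiles, max
print(cleaned.head(10))     # top 10 rows
print(cleaned.tail(10))     # last 10 rows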


• Correlations: Correlation shows the relation between any two variables in the dataset. The strength of a linear relation between two variables is measured by correlation. The correlation of various attributes is printed using corr() [1].
# For the whole dataset:
mydata.corr()

Fig. 10. Correlation of the whole dataset

# For some selected columns or attributes:
mydata[['Country name', 'Regional indicator', 'Ladder score', 'Standard error of ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Generosity', 'Perceptions of corruption']].corr()

Fig. 11. Correlation of some attributes in the dataset

G. Graphical EDA
Fundamentally, graphical exploratory data analysis (GEDA) is the graphical equivalent of conventional non-graphical exploratory data analysis: it examines data sets in order to summarise their statistical characteristics by focusing on the same four main features, namely measures of central tendency, measures of spread, the shape of the distribution, and the presence of outliers. GEDA can also be divided into three categories: univariate GEDA, bivariate GEDA, and multivariate GEDA. These varieties and aspects of GEDA are discussed in the following paragraphs [5].
First, a subset of the dataframe is taken to analyse and visualize.

Fig. 12. A subset of the dataframe

1. Univariate GEDA
• Histogram: A histogram is a data representation that looks like a bar graph and buckets a range of outcomes into columns along the x-axis. The y-axis shows the numerical count or percentage of occurrences in each column and can therefore be used to illustrate the distribution of the data. A histogram in Python can be drawn using matplotlib.pyplot.hist().

Fig. 13. Histogram

• Stem Plot: A stem plot draws vertical lines from a baseline up to the y value and places a marker at each x position. The x positions are optional, and the line formats can be specified as keyword arguments or as positional arguments. A stem plot in Python can be drawn using matplotlib.pyplot.stem().

Fig. 14. Stem plot

• Box Plot: A box plot is a visual representation and comparison of groups of data. The box plot depicts the level, spread, and symmetry of a data distribution by using the median, approximate quartiles, outliers, and the lowest and highest data points (extreme values) [10].

Fig. 15. Boxplot
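A minimal sketch of the three univariate plots described above, assuming the dataset is loaded into mydata as before; the choice of the "Ladder score" column, the bin count, and the side-by-side layout are illustrative:

import pandas as pd
import matplotlib.pyplot as plt

mydata = pd.read_csv("World Happiness report 2021.csv")
score = mydata["Ladder score"].dropna()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: distribution of the ladder score.
axes[0].hist(score, bins=20)
axes[0].set_title("Histogram of Ladder score")

# Stem plot: a vertical line per country with a marker at the score value.
axes[1].stem(range(len(score)), score)
axes[1].set_title("Stem plot of Ladder score")

# Box plot: median, quartiles, and outliers of the ladder score.
axes[2].boxplot(score)
axes[2].set_title("Box plot of Ladder score")

plt.tight_layout()
plt.show()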


2. Multivariate GEDA
• Scatter plot: Dots are used to indicate the values of two different numeric variables in a scatter plot. The position of each dot on the horizontal and vertical axes indicates the values of one data point. Scatter plots are used to see how variables relate to one another. Here, a scatter plot of "Ladder score" against "Standard error of ladder score" is plotted below.

Fig. 16. Scatter Plot

• Heat Maps: A heatmap is a graphical depiction of data that uses a color-coding method to represent different values. It represents a two-dimensional table of color shades. This plotting technique is popularly used in biology to represent gene expression and other multivariate data [3]. A heatmap example is shown in Fig. 17.

Fig. 17. Heatmap

• Count Plot: A Seaborn count plot is a graphical representation of the number of occurrences, or frequency, of each categorical value, using bars to depict the counts. The countplot() function is used to visualize the number of observations in each category as bars. Here, a count plot is plotted for the subdata dataframe.

Fig. 18. Countplot
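The multivariate plots and the count plot can be reproduced with a short seaborn sketch. This is only an illustrative sketch: the heatmap is drawn from the correlation matrix of the numeric columns, and the columns chosen for the subdata subset are assumed, since the exact subset used in Fig. 12 is not listed in the text:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

mydata = pd.read_csv("World Happiness report 2021.csv")

# Scatter plot of the ladder score against its standard error.
sns.scatterplot(data=mydata, x="Ladder score", y="Standard error of ladder score")
plt.show()

# Heatmap of the correlations between the numeric attributes.
corr = mydata.select_dtypes(include="number").corr()
sns.heatmap(corr, cmap="viridis")
plt.show()

# Count plot of the number of countries per region (assumed subdata columns).
subdata = mydata[["Regional indicator", "Ladder score"]]
sns.countplot(data=subdata, x="Regional indicator")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()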

IV. CONCLUSION
In this paper, various phases of data analysis, including data collection, cleaning, and analysis, are discussed briefly. Exploratory data analysis is the main subject studied here. For the implementation, the Python programming language is used, and the work is carried out in a Jupyter notebook. Different Python libraries and packages are introduced. Using various analysis and visualization methods, numerous results are extracted. The dataset "World Happiness Report 2021" is used, and important information, such as the differences in the happiness scores of different countries, the dependence of the score on individual attributes, and how one variable affects another, is extracted. Various graphs have been plotted using different attributes of the dataset, allowing conclusions to be drawn in a straightforward way.

V. ACKNOWLEDGMENT
I express my heartfelt gratitude towards my mentor Ms. Deepika Sharma for guiding me to accomplish such a great work. I offer my sincere appreciation towards the Head of Department, University Institute of Sciences (Mathematics Department), Chandigarh University for giving me the chance to gain a wider view of knowledge.


VI. REFERENCES
[1] Viv Bewick, Liz Cheek, and Jonathan Ball. Statistics review 7: Correlation and regression. Critical Care, 2003.
[2] Ossama Embarak. Data analysis and visualization using Python. Springer, 2018.
[3] Nils Gehlenborg and Bang Wong. Heat maps. Nature Methods, 2012.
[4] Michel Jambu. Exploratory and multivariate data analysis. Elsevier, 1991.
[5] Matthieu Komorowski, Dominic C Marshall, Justin D Salciccioli, and Yves Crutain. Exploratory data analysis. Secondary Analysis of Electronic Health Records, 2016.
[6] Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.
[7] Fabio Nelli. Python data analytics: Data analysis and science using Pandas, Matplotlib and the Python programming language. Apress, 2015.
[8] Kabita Sahoo, Abhaya Kumar Samal, Jitendra Pramanik, and Subhendu Kumar Pani. Exploratory data analysis using Python. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 2019.
[9] Guido Van Rossum et al. Python programming language. In USENIX Annual Technical Conference, 2007.
[10] David F Williamson, Robert A Parker, and Juliette S Kendrick. The box plot: a simple visual method to interpret data. Annals of Internal Medicine, 1989.
