Exploratory Data Analysis (EDA) with NumPy, Pandas, Matplotlib and Seaborn

Last Updated : 23 Jul, 2025

Exploratory Data Analysis (EDA) is the foundation of any data science project. It is the step where data scientists investigate datasets to understand their structure, identify patterns, and uncover insights. Data preparation involves several steps, including cleaning, transforming, and exploring data to make it suitable for analysis.

Why is EDA important in Data Science?

To work effectively with data, you first need to understand its nature and structure. EDA helps answer critical questions about the dataset and guides the preprocessing steps needed before applying any algorithms. For instance:

- What type of data do we have? Are we working with numbers, text, or dates?
- Are there outliers? These are unusual values that are very different from the rest.
- Is anything missing? Are some parts of the dataset empty or incomplete?

Imagine you're working with a student performance dataset. If some rows are missing test scores, or the names of subjects are inconsistently spelled (e.g., "Math" and "Mathematics"), you'll need to address these issues before proceeding. EDA helps identify such problems so the data can be cleaned for reliable analysis.

Now, we will look at the core packages for exploratory data analysis (EDA): NumPy, Pandas, Matplotlib, and Seaborn.

1. NumPy for Numerical Operations

NumPy is used for working with numerical data in Python.

- Handles Large Datasets Efficiently: NumPy works with large, multi-dimensional arrays and matrices of numerical data, and provides functions for mathematical operations such as linear algebra and statistical analysis.
- Facilitates Data Transformation: Helps in sorting, reshaping, and aggregating data.
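The sorting, reshaping, and aggregating mentioned above can be sketched in a few lines. This is only a minimal illustration, not a prescribed workflow: the `scores` values and the 2-by-3 layout (two classes of three students each) are hypothetical.

```python
import numpy as np

# Hypothetical data: six exam scores, not taken from any real dataset
scores = np.array([70, 40, 85, 60, 90, 75])

# Sorting: returns a new array in ascending order -> [40 60 70 75 85 90]
sorted_scores = np.sort(scores)

# Reshaping: view the same six values as 2 rows (classes) x 3 columns (students)
grid = scores.reshape(2, 3)

# Aggregating: mean along each row gives one value per class -> [65. 75.]
class_means = grid.mean(axis=1)

# Aggregating over the whole array
overall_max = grid.max()

print(sorted_scores, class_means, overall_max)
```

The `axis` argument decides which dimension is collapsed: `axis=1` aggregates across the columns of each row, while `axis=0` would give one value per column instead.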
Example: Let's consider a simple example where we analyze the distribution of a dataset containing students' exam scores using NumPy:

Python

    import numpy as np

    # Dataset: Exam scores
    scores = np.array([45, 50, 55, 60, 65, 70, 75, 80, 200])  # Note: One extreme value (200)

    # Calculate basic statistics
    mean_score = np.mean(scores)
    median_score = np.median(scores)
    std_dev_score = np.std(scores)

    print(f"Mean: {mean_score}, Median: {median_score}, Standard Deviation: {std_dev_score}")

Output

    Mean: 77.77777777777777, Median: 65.0, Standard Deviation: 44.541560561838764

This example demonstrates how NumPy can quickly compute statistics. Note that the single extreme value (200) pulls the mean (about 77.8) well above the median (65.0). We can also detect such anomalies using the z-score, which measures how many standard deviations a value lies from the mean.

The following resources cover NumPy in more depth:

- Introduction to NumPy
- Basics of NumPy Arrays
- Data types and type casting
- Accessing and Modifying Data - Indexing and slicing
- Broadcasting - Perform operations on arrays with different shapes
- Linear algebra operations: Solving Mathematical Problems
- Saving and loading NumPy arrays

2. Pandas for Data Manipulation

Built on top of NumPy, Pandas excels at handling tabular data (data organized in rows and columns) through its core data structures: Series (1D) and DataFrame (2D). Pandas simplifies working with structured data, starting with easy loading and saving of datasets in formats like CSV, Excel, SQL, or JSON:

- Read Dataset with Pandas
- Save DataFrame as CSV file for further use
- Reading from JSON files into Pandas DataFrame
- Working with Excel files
- Data Processing with Pandas
- Slicing rows with pandas Indexing
- Data Aggregation and Grouping
- Working with Date and Time

3. Matplotlib for Data Visualization

Matplotlib is a powerful and versatile open-source plotting library for Python, designed to help users visualize data in a variety of formats.

- Introduction to Matplotlib
- Pyplot in Matplotlib
- Matplotlib – Axes Class
- Matplotlib for 3D Plotting
- Exploratory Data Analysis with matplotlib
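As a rough sketch of how Pandas and Matplotlib fit together, the snippet below loads the exam scores from the NumPy example into a DataFrame, flags outliers by z-score, aggregates by group, and draws a histogram. The `subject` labels, the z-score cutoff of 2, and the output filename are assumptions made purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Scores from the NumPy example; the subject labels are hypothetical
df = pd.DataFrame({
    "subject": ["Math"] * 3 + ["Science"] * 3 + ["Art"] * 3,
    "score": [45, 50, 55, 60, 65, 70, 75, 80, 200],
})

# Z-score: distance from the mean in units of standard deviation.
# ddof=0 gives the population standard deviation, matching np.std's default.
df["z_score"] = (df["score"] - df["score"].mean()) / df["score"].std(ddof=0)

# Flag values more than 2 standard deviations from the mean (assumed cutoff)
outliers = df[df["z_score"].abs() > 2]

# Aggregation and grouping: mean and maximum score per subject
summary = df.groupby("subject")["score"].agg(["mean", "max"])

# Histogram of the score distribution
fig, ax = plt.subplots()
ax.hist(df["score"], bins=10)
ax.set_xlabel("Exam score")
ax.set_ylabel("Number of students")
ax.set_title("Distribution of exam scores")
fig.savefig("score_distribution.png")

print(outliers)
print(summary)
```

Only the score of 200 is flagged here. The cutoff is a judgment call: with nine values, the largest possible population z-score is about 2.83, so the often-quoted cutoff of 3 could never flag anything in a dataset this small.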
4. Seaborn for Statistical Data Visualization

Seaborn is built on top of Matplotlib and is specifically designed for statistical data visualization. It provides a high-level interface for drawing attractive and informative statistical graphics.

- Introduction to Seaborn
- Types Of Seaborn Plots
- Pairplot function in seaborn
- FacetGrid in Seaborn
- Time Series Visualization with Seaborn : Line Plot

Complete EDA Workflow Using NumPy, Pandas, and Seaborn

Let's implement a complete workflow for performing EDA: starting with numerical analysis using NumPy and Pandas, followed by visualizations using Seaborn to support data-driven decisions.

- Performing EDA with Numpy and Pandas - Set 1
- After analysis: Visualizing with seaborn - Set 2

For more hands-on implementation, explore the projects below:

- Titanic Data EDA using Seaborn
- Uber Rides Data Analysis
- Zomato Data Analysis Using Python
- Global Covid-19 Data Analysis and Visualizations
- iPhone Sales Analysis
- Google Search Analysis

Web Scraping For EDA

What is web scraping? It is the automated process of extracting data from websites for later analysis.

- How to Extract Weather Data from Google in Python?
- Movies Review Scraping And Analysis
- Product Price Scraping and Analysis
- News Scraping and Analysis
- Real-time Share Price scraping and analysis
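To make the idea of extracting data from a web page concrete, here is a minimal sketch that uses only Python's standard library. The HTML snippet, the `product` and `price` class names, and the `PriceParser` helper are all hypothetical; a real scraper would first download the page and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser
from statistics import mean

# Hypothetical page content; a real scraper would download this from a website
PAGE = """
<table>
  <tr><td class="product">Keyboard</td><td class="price">25.50</td></tr>
  <tr><td class="product">Mouse</td><td class="price">12.00</td></tr>
  <tr><td class="product">Monitor</td><td class="price">149.99</td></tr>
</table>
"""


class PriceParser(HTMLParser):
    """Collect the text of <td> cells, grouped by their class attribute."""

    def __init__(self):
        super().__init__()
        self.current = None  # class of the <td> currently open, if any
        self.cells = {"product": [], "price": []}

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "td":
            self.current = None

    def handle_data(self, data):
        # Keep text only while inside a cell whose class we care about
        if self.current in self.cells:
            self.cells[self.current].append(data.strip())


parser = PriceParser()
parser.feed(PAGE)

products = parser.cells["product"]
prices = [float(p) for p in parser.cells["price"]]  # text -> numbers for analysis

print(products)
print(prices, "average:", round(mean(prices), 2))
```

Once the values are in plain Python lists like these, they can be passed to NumPy or loaded into a Pandas DataFrame for the same kind of analysis shown earlier in the article.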