The document outlines key steps in data exploration, pre-processing, and visualization, focusing on handling missing values, categorical data, and outlier detection. It discusses types of missingness (MCAR, MAR, MNAR) and various techniques for treating missing values, including imputation and deletion. Additionally, it covers methods for detecting and treating outliers to ensure robust machine learning model performance.
Data Exploration,
Pre-processing & Visualization
Prepared By: Piyush Kumar Soni

Contents
• Missing Values Treatment
• Handling Categorical data:
  • Mapping ordinal features
  • Encoding class labels
  • Performing one-hot encoding on nominal features
• Outlier Detection and Treatment
• Feature Engineering
• Variable Transformation and Variable Creation
• Selecting meaningful features
Missing Values Treatment
• Missing data can arise from various sources:
  • A survey was conducted and values were simply missed when being entered into the computer.
  • A respondent chooses not to respond to a question like 'Have you ever recreationally used opioids?'.
  • You decide to start collecting a new variable (due to new circumstances, like a pandemic) partway through the data collection of a study.
  • You want to measure the speed of meteors, and some observations are just 'too quick' to be measured properly.
Types of Missingness
• Missing Completely at Random (MCAR)
  • Definition: Data is MCAR when the missingness is independent of both the observed and unobserved data. In other words, the likelihood of a value being missing is the same for all data points, without any underlying pattern or reason.
  • Characteristics:
    • No systematic difference between the missing data and the observed data.
    • Handling MCAR data does not introduce bias if the missing values are ignored or imputed.
  • Example:
    • A researcher conducts a survey, but some participants forget to answer a question due to oversight. There's no reason to believe that those who didn't answer systematically differ from those who did.
    • In a sensor network, a random sensor fails temporarily, causing some missing temperature readings.
  • Handling MCAR:
    • Simple imputation techniques or deletion (listwise or pairwise) can be applied without biasing the results.
• Missing at Random (MAR)
  • Definition: Data is MAR when the missingness is related to the observed data but not to the missing values themselves. The reason for the missing data can be explained by the observed data.
  • Characteristics:
    • There is a systematic relationship between missingness and observed data.
    • MAR requires more sophisticated imputation techniques to avoid bias.
  • Example:
    • In a medical study, patients who are older (observed variable) are less likely to respond to certain survey questions, but the likelihood of missingness is unrelated to the unobserved value of the missing response.
    • In a customer database, high-income individuals (observed variable) may choose not to disclose their spending habits (missing value).
  • Handling MAR:
    • Techniques like multiple imputation or model-based approaches (e.g., maximum likelihood estimation) are suitable for handling MAR data.
• Missing Not at Random (MNAR)
  • Definition: Data is MNAR when the missingness depends on the value of the missing data itself or other unobserved factors. The missingness introduces systematic bias because the missing data is not random and cannot be explained by the observed data.
  • Characteristics:
    • MNAR data is the most challenging to handle because the reasons for missingness are tied to the missing values themselves.
    • It requires domain knowledge to model the missingness mechanism.
  • Example:
    • In a survey about income, people with very high or very low incomes may be less likely to report their earnings due to privacy concerns. Here, the likelihood of missing data depends on the income itself (the missing variable).
    • In a health study, patients with severe symptoms might be less likely to attend follow-up appointments, leading to missing health outcome data.
  • Handling MNAR:
    • Requires external information, assumptions, or domain-specific knowledge to model the missingness mechanism accurately.
    • Sensitivity analysis or pattern-mixture models can be used.
Handling Missing Values
• Handling missing values is a crucial step in the data preprocessing phase for machine learning. Missing data can adversely affect the performance and accuracy of machine learning models. Here are some common techniques for treating missing values:
• Identifying Missing Values:
  • Start by identifying the missing values in your dataset. This can be done using functions like isnull() or info() in pandas for Python, or summary() in R.
• Remove Missing Values:
  • If the proportion of missing values for a particular feature is small and the missing values are randomly distributed, you may choose to remove the rows with missing values using the dropna() function. However, be cautious about removing too many rows, as it may lead to loss of valuable information.
    df.dropna(inplace=True)
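A minimal sketch of identifying and dropping missing values in pandas, using a small hypothetical DataFrame (the column names and values are illustrative):

    import numpy as np
    import pandas as pd

    # Hypothetical dataset with some missing entries
    df = pd.DataFrame({"age": [25, np.nan, 31, 40], "city": ["Pune", "Delhi", None, "Mumbai"]})

    print(df.isnull().sum())   # number of missing values per column
    df.info()                  # non-null counts and dtypes

    df_clean = df.dropna()     # drop rows containing any missing value (use with care)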
• Imputation:
  • Imputation involves replacing missing values with estimated or calculated values. Common imputation methods include:
  • Mean/Median Imputation: Replace missing values with the mean or median of the observed values in that column.
    df['column_name'].fillna(df['column_name'].mean(), inplace=True)
  • Mode Imputation: For categorical data, replace missing values with the mode (most frequent value) of the column.
    df['column_name'].fillna(df['column_name'].mode()[0], inplace=True)
  • Forward Fill or Backward Fill: Use the values from the previous or next row to fill missing values in time-series data.
    df.fillna(method='ffill', inplace=True)  # forward fill
    df.fillna(method='bfill', inplace=True)  # backward fill
• Predictive Modeling:
  • Use machine learning algorithms to predict missing values based on the other features in the dataset. This approach can be powerful but may require more computational resources.
• Indicator Variables:
  • Create an indicator variable to signify whether a value was missing. This way, you retain information about the missingness, which may be useful for some models.
    df['column_name_missing'] = df['column_name'].isnull().astype(int)
• Domain-Specific Imputation:
  • Depending on the nature of the data, domain knowledge can be used to impute missing values. For example, if data represents a time series, missing values might be imputed differently than in cross-sectional data.
• Multiple Imputation:
  • Generate multiple imputations to account for uncertainty in the imputation process. This involves creating several datasets with different imputed values and combining the results.
• The choice of method depends on the nature of your data and the underlying assumptions of your analysis.
• It's essential to carefully evaluate the impact of missing value treatment on your model and choose the method that best suits your specific use case.
Choosing the Best Method
• Percentage of missing values: If less than 5%, deletion might be acceptable.
• Pattern of missingness: Understanding the type (MCAR, MAR, MNAR) is crucial.
• Variable importance: More important variables might warrant more sophisticated imputation.
• Algorithm sensitivity: Some algorithms are more sensitive to missing data than others.
• Domain knowledge: Insights into reasons for missingness can guide appropriate methods.
Imputation through Modeling • How do we use models to fill in missing data?
• How do we use models to fill in missing data? Using k-NN with k = 2?
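A minimal sketch of k-NN-based imputation using scikit-learn's KNNImputer with k = 2; the array values are illustrative:

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
    imputer = KNNImputer(n_neighbors=2)
    X_filled = imputer.fit_transform(X)   # the NaN is replaced by the mean of its 2 nearest neighbours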
• How do we use models to fill in missing data? Using linear regression?
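A hedged sketch of regression-based imputation with scikit-learn's IterativeImputer, which by default models each feature with missing values as a function of the other features; the array is illustrative:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0], [np.nan, 3.0], [7.0, np.nan]])
    imputer = IterativeImputer(max_iter=10, random_state=0)
    X_filled = imputer.fit_transform(X)   # missing entries predicted from the observed columns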
Handling Categorical Data
• Handling categorical data is an important aspect of data preprocessing in machine learning.
• Categorical data represents variables that can take on one of a limited and usually fixed number of possible values, such as colors, gender, or country names.
• Types of categorical data:
  • Nominal: data that lacks any intrinsic order, such as colors, genders, or animal species.
  • Ordinal: data that is naturally ranked or ordered, such as customer satisfaction levels or educational attainment.
• Encoding class labels: • Encodes target labels with values between 0 and n_classes-1.
• It can also be used to transform non-numerical labels (such as strings) into numerical labels:
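A minimal sketch of LabelEncoder, following the scikit-learn documentation's string-label example:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    le.fit(["paris", "paris", "tokyo", "amsterdam"])
    print(list(le.classes_))                          # ['amsterdam', 'paris', 'tokyo']
    print(le.transform(["tokyo", "tokyo", "paris"]))  # [2 2 1]
    print(le.inverse_transform([2, 2, 1]))            # ['tokyo' 'tokyo' 'paris']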
• Mapping ordinal features:
  • Ordinal encoding is a popular technique for encoding categorical data where each category is given a numerical value based on its rank or order.
  • The categories with the lowest values receive the smallest integers, while those with the highest values receive the largest integers.
  • This strategy is useful when the categories have a natural order, such as ratings (poor, fair, good, outstanding) or educational attainment (high school, college, graduate school).
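A minimal sketch of mapping an ordinal feature with a hand-built dictionary; the column name and ordering are illustrative:

    import pandas as pd

    df = pd.DataFrame({"rating": ["poor", "good", "fair", "outstanding"]})
    order = {"poor": 0, "fair": 1, "good": 2, "outstanding": 3}   # ranks encode the natural order
    df["rating_encoded"] = df["rating"].map(order)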
No  Name   Gender  Blood  Grade  Height  Study
1   Tom    M       O      56     160     Math
2   Harry  M       A      76     192     Math
3   John   M       A      45     178     English
4   Nancy  F       B      78     157     Biology
5   Mike   M       O      79     167     Math
6   Kate   F       AB     66     156     English
7   Mary   F       O      99     166     Science
After label-encoding the Gender and Study columns, the same table becomes:

No  Name   Gender  Blood  Grade  Height  Study
1   Tom    0       O      56     160     0
2   Harry  0       A      76     192     0
3   John   0       A      45     178     1
4   Nancy  1       B      78     157     2
5   Mike   0       O      79     167     0
6   Kate   1       AB     66     156     1
7   Mary   1       O      99     166     3
• Performing one-hot encoding on nominal features:
  • One-Hot Encoding is a technique used to convert categorical data into numerical format.
  • It creates a binary vector for each category in the dataset. The vector contains a 1 for the category it represents and 0s for all other categories.
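A minimal sketch using pandas get_dummies on the Blood column from the table above (scikit-learn's OneHotEncoder would be an equivalent alternative):

    import pandas as pd

    df = pd.DataFrame({"Blood": ["O", "A", "A", "B", "O", "AB", "O"]})
    one_hot = pd.get_dummies(df["Blood"], prefix="Blood")   # columns Blood_A, Blood_AB, Blood_B, Blood_O
    df = pd.concat([df, one_hot], axis=1)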
Outlier Detection and Treatment
• Outlier detection and treatment are important steps in the process of building machine learning models.
• Outlier:
  • Outliers are those data points that are significantly different from the rest of the dataset.
  • They are often abnormal observations that skew the data distribution.
  • They arise due to inconsistent data entry or erroneous observations.
• To ensure that the trained model generalizes well to the valid range of test inputs, it's important to detect and remove outliers.
• Outliers can negatively impact the performance of machine learning models in several ways:
  • Overfitting: Models can focus on fitting the outliers rather than the underlying patterns in the majority of the data.
  • Reduced accuracy: Outliers can pull the model's predictions towards themselves, leading to inaccurate predictions for other data points.
  • Unstable models: The presence of outliers can make the model's predictions sensitive to small changes in the data.
Outlier Detection
• Statistical Methods
  • Box Plot
  • Z-Score
  • IQR (Interquartile Range)
  • Outlier Detection Using Percentile
  • Distance from the mean
• Distance-based Methods
  • Distance to Centroid
  • Nearest Neighbors
• Density-Based Methods
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • LOF (Local Outlier Factor)
Box Plot
• Box plots are a visual method to identify outliers.
• Box plots are one of the many ways to visualize data distribution.
• A box plot plots q1 (25th percentile), q2 (50th percentile or median) and q3 (75th percentile) of the data, along with (q1 - 1.5*(q3 - q1)) and (q3 + 1.5*(q3 - q1)).
• Outliers, if any, are plotted as points above and below the plot.
IQR method
• IQR stands for interquartile range, which is the difference between q3 (75th percentile) and q1 (25th percentile). The IQR method computes a lower bound and an upper bound to identify outliers.
    Lower Bound = q1 - 1.5*IQR
    Upper Bound = q3 + 1.5*IQR
• Any values below the lower bound or above the upper bound are considered to be outliers.
• Examples
  • 27, 2, 22, 29, 19, 30, 32, 59, 52, 35
  • 78, 74, 88, 90, 94, 90, 98, 80
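A minimal sketch of the IQR rule applied to the first example list above:

    import numpy as np

    data = np.array([27, 2, 22, 29, 19, 30, 32, 59, 52, 35])
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = data[(data < lower) | (data > upper)]   # values outside [lower, upper]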
Outlier Detection Using Percentile
• Define a custom range that accommodates all data points that lie anywhere between, for example, the 0.5th and 99.5th percentiles of the dataset.
• Observations outside this range are treated as outliers.
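A minimal sketch of the percentile-based cut-off (the 0.5/99.5 range and the data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=50, scale=10, size=1000)
    low, high = np.percentile(data, [0.5, 99.5])     # custom percentile bounds
    outliers = data[(data < low) | (data > high)]    # roughly 1% of points flagged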
Z-score method
• The Z-score method is generally used when a variable's distribution looks close to Normal.
• A Z-score is the number of standard deviations a value of a variable is away from the variable's mean.
    Z-Score = (X - mean) / Standard deviation
• When the values of a variable are converted to Z-scores, the distribution of the variable is called the standard normal distribution, with mean = 0 and standard deviation = 1.
• For data that is normally distributed, around 68.2% of the data will lie within one standard deviation from the mean. Close to 95.4% and 99.7% of the data lie within two and three standard deviations from the mean, respectively. • One approach to outlier detection is to set the lower limit to three standard deviations below the mean (μ - 3*σ), and the upper limit to three standard deviations above the mean (μ + 3*σ). Any data point that falls outside this range is detected as an outlier.
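A minimal sketch of the Z-score rule with a 3-sigma cut-off (the data is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=0, scale=1, size=1000)
    z = (data - data.mean()) / data.std()   # Z-Score = (X - mean) / standard deviation
    outliers = data[np.abs(z) > 3]          # points outside mean +/- 3*std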
Distance from the mean
• Unlike the previous methods, this method considers multiple variables in a data set to detect outliers.
• This method calculates the Euclidean distance of the data points from their mean and converts the distances into absolute z-scores. Any z-score greater than the pre-specified cut-off is considered to be an outlier.
Outlier Treatment
• Removal:
  • The simplest approach is to remove outliers from the dataset. However, this should be done carefully, as it may lead to information loss.
• Transformation:
  • Apply mathematical transformations to make the distribution more symmetrical, such as logarithmic or square root transformations.
• Imputation:
  • Replace outlier values with a measure of central tendency, such as mean, median, or mode.
• Winsorizing:
  • Replace extreme values with values within a certain percentile range. This helps to limit the impact of outliers without entirely removing them.
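A minimal sketch of winsorizing by capping values at the 5th and 95th percentiles (the percentile range and data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=1000)
    low, high = np.percentile(data, [5, 95])
    capped = np.clip(data, low, high)   # extreme values pulled back to the percentile bounds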
• Binning:
  • Group outlier values into a specific bin or category, treating them as a separate group.
• Model-Specific Approaches:
  • Some models have built-in methods to handle outliers. For example, decision trees and random forests are often less sensitive to outliers.
Feature Engineering
• Feature engineering is the process of transforming raw data into features that are suitable for machine learning models.
• In other words, it is the process of selecting, extracting, and transforming the most relevant features from the available data to build more accurate and efficient machine learning models.
• The success of machine learning models heavily depends on the quality of the features used to train them.
• Feature engineering involves a set of techniques that enable us to create new features by combining or transforming the existing ones.
• These techniques help to highlight the most important patterns and relationships in the data, which in turn helps the machine learning model to learn from the data more effectively.
What is a Feature?
• In the context of machine learning, a feature (also known as a variable or attribute) is an individual measurable property or characteristic of a data point that is used as input for a machine learning algorithm.
• Features can be numerical, categorical, or text-based, and they represent different aspects of the data that are relevant to the problem at hand.
• The choice and quality of features are critical in machine learning, as they can greatly impact the accuracy and performance of the model.
Feature Creation
• Feature Creation is the process of generating new features based on domain knowledge or by observing patterns in the data.
• It is a form of feature engineering that can significantly improve the performance of a machine-learning model.
• Types of Feature Creation:
  • Domain-Specific: Creating new features based on domain knowledge, such as creating features based on business rules or industry standards.
  • Data-Driven: Creating new features by observing patterns in the data, such as calculating aggregations or creating interaction features.
  • Synthetic: Generating new features by combining existing features or synthesizing new data points.
Feature Transformation
• Feature Transformation is the process of transforming the features into a more suitable representation for the machine learning model.
• This is done to ensure that the model can effectively learn from the data.
• Types of Feature Transformation:
  • Transformation: Transforming the features using mathematical operations to change the distribution or scale of the features. Examples are logarithmic, square root, and reciprocal transformations.
  • Encoding Categorical Variables: Converting categorical variables into a numerical format that can be fed into machine learning algorithms. Common methods include one-hot encoding, label encoding, or target encoding.
  • Scaling and Normalization: Scaling features to ensure that they are on similar scales. Common methods include MaxAbs scaling, Min-Max scaling, Standard scaling (Z-score normalization), and robust scaling.
  • Binning or Discretization: Grouping continuous numerical features into discrete bins or intervals. This can help capture non-linear relationships and patterns in the data.
  • Polynomial Features: Introducing interaction terms or polynomial features to capture non-linear relationships between variables.
  • Text Data Features: Extracting features from text data, such as word frequency, TF-IDF scores, or word embeddings.
Feature Selection
• Feature Selection is the process of selecting a subset of relevant features from the dataset to be used in a machine-learning model.
• It is an important step in the feature engineering process as it can have a significant impact on the model's performance.
• Common feature selection techniques include:
  • Filter methods:
    • Assess feature relevance independent of the model using statistical measures.
    • Examples: correlation analysis, chi-square test, ANOVA, mutual information.
  • Wrapper methods:
    • Evaluate different feature subsets using the performance of a specific machine learning model.
    • Examples: forward selection, backward elimination, recursive feature elimination (RFE).
• Embedded methods: • Incorporate feature selection into the model training process. • Examples: L1 regularization (Lasso regression), tree-based models (random forests, decision trees) that inherently provide feature importance scores.
Libraries: scikit-learn for Pre-processing
• The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
• Standardization, or mean removal and variance scaling:
  • Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
  • In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
  • The preprocessing module provides the StandardScaler utility class, which is a quick and easy way to perform the following operation on an array-like dataset:
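A minimal sketch of StandardScaler on a small illustrative array:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]])
    scaler = StandardScaler().fit(X)
    X_scaled = scaler.transform(X)
    print(X_scaled.mean(axis=0))   # approximately 0 per feature
    print(X_scaled.std(axis=0))    # 1 per feature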
• Scaling features to a range:
  • An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.
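A minimal sketch of both scalers on the same illustrative array:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

    X = np.array([[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]])
    X_minmax = MinMaxScaler().fit_transform(X)   # each feature rescaled to [0, 1]
    X_maxabs = MaxAbsScaler().fit_transform(X)   # each feature divided by its maximum absolute value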
• Normalization:
  • Normalization is the process of scaling individual samples to have unit norm.
  • The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, either using the l1, l2, or max norms:
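A minimal sketch of the normalize function on an illustrative array:

    import numpy as np
    from sklearn.preprocessing import normalize

    X = np.array([[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]])
    X_l2 = normalize(X, norm='l2')   # each sample (row) scaled to unit L2 norm
    X_l1 = normalize(X, norm='l1')   # each sample scaled to unit L1 norm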
• Encoding categorical features- • Label encoding • Ordinal encoding • One hot encoding • Target encoding
• Discretization:
  • Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values.
  • Certain datasets with continuous features may benefit from discretization because discretization can transform the dataset of continuous attributes to one with only nominal attributes.
  • K-bins discretization:
• For a worked example, the fitted bin intervals and the transformed X can be inspected as in the sketch below.
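The original slide's example is not reproduced here; a minimal sketch along the same lines, with illustrative values and bin counts (following the scikit-learn documentation's example):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.array([[-3.0, 5.0, 15.0], [0.0, 6.0, 14.0], [6.0, 3.0, 11.0]])
    est = KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal', strategy='uniform').fit(X)
    print(est.bin_edges_)     # the intervals learned for each feature
    print(est.transform(X))   # X encoded as bin indices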
• Feature binarization- • Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume that the input data is distributed according to a multi-variate Bernoulli distribution. • It is also common among the text processing community to use binary feature values (probably to simplify the probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly better in practice.
• It is possible to adjust the threshold of the binarizer:
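A minimal sketch of Binarizer with the default and a custom threshold (the array values follow the scikit-learn documentation's example):

    import numpy as np
    from sklearn.preprocessing import Binarizer

    X = np.array([[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]])
    print(Binarizer().fit_transform(X))               # default threshold 0.0
    print(Binarizer(threshold=1.1).fit_transform(X))  # values > 1.1 become 1, others 0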
• Imputation of missing values (see the Missing Values Treatment section earlier).
• Generating polynomial features:
  • Often it's useful to add complexity to a model by considering nonlinear features of the input data.
  • A simple and common method is polynomial features, which can capture the features' higher-order and interaction terms. It is implemented in PolynomialFeatures.
• In some cases, only interaction terms among features are required; these can be obtained by setting interaction_only=True:
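A minimal sketch of PolynomialFeatures, with and without interaction_only (the input array is illustrative):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.arange(6).reshape(3, 2)
    poly = PolynomialFeatures(degree=2)
    print(poly.fit_transform(X))        # columns: 1, x1, x2, x1^2, x1*x2, x2^2

    poly_int = PolynomialFeatures(degree=2, interaction_only=True)
    print(poly_int.fit_transform(X))    # columns: 1, x1, x2, x1*x2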
• Custom transformers- • You can implement a transformer from an arbitrary function with FunctionTransformer. For example, to build a transformer that applies a log transformation in a pipeline, do:
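A minimal sketch of FunctionTransformer with a log transformation (np.log1p, as in the scikit-learn documentation):

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer

    transformer = FunctionTransformer(np.log1p, validate=True)
    X = np.array([[0, 1], [2, 3]])
    print(transformer.transform(X))   # log(1 + x) applied element-wise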
Matplotlib for Data Visualization
• Data Visualization is the process of presenting data in the form of graphs or charts.
• It helps to understand large and complex amounts of data very easily. It allows decision-makers to make decisions very efficiently and also allows them to identify new trends and patterns very easily.
• Matplotlib:
  • Matplotlib is a low-level library of Python which is used for data visualization.
  • It is easy to use and emulates MATLAB-like graphs and visualization.
  • This library is built on top of NumPy arrays and consists of several plots like line chart, bar chart, histogram, etc.
• Import Essential Packages
    import numpy as np
    import pandas as pd
    import matplotlib
    import matplotlib.pyplot as plt
    import matplotlib.image as mpimg
    %matplotlib inline
    import os
• Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is designed to be as usable as MATLAB, with the ability to use Python and the advantage of being free and open-source. • Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. • The various plots we can utilize using Pyplot are Line Plot, Histogram, Scatter, 3D Plot, Image, Contour, and Polar.
• How to create a simple plot?
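A minimal sketch of a simple line plot (the data points are illustrative):

    import matplotlib.pyplot as plt

    x = [1, 2, 3, 4, 5]
    y = [1, 4, 9, 16, 25]
    plt.plot(x, y)    # draw a line through the (x, y) points
    plt.show()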
• Adding Title-
• Adding X Label and Y Label-
• Adding Legends-
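A minimal sketch covering the three decorations above (title, X/Y labels, and a legend); the data and label strings are illustrative:

    import matplotlib.pyplot as plt

    x = [1, 2, 3, 4, 5]
    plt.plot(x, [v ** 2 for v in x], label="squares")
    plt.plot(x, [v ** 3 for v in x], label="cubes")
    plt.title("Simple growth curves")   # adding a title
    plt.xlabel("x")                     # X label
    plt.ylabel("f(x)")                  # Y label
    plt.legend()                        # legend built from the label= arguments
    plt.show()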
• Axes class is the most basic and flexible unit for creating sub-plots. A given figure may contain many axes, but a given axes can only be present in one figure. • The plt.subplots() method is the best way to handle several subplots at once.
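A minimal sketch of plt.subplots() with two axes in one figure (the curves are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 100)
    fig, axes = plt.subplots(1, 2, figsize=(8, 3))   # one figure, two axes side by side
    axes[0].plot(x, np.sin(x))
    axes[0].set_title("sin")
    axes[1].plot(x, np.cos(x))
    axes[1].set_title("cos")
    plt.tight_layout()
    plt.show()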
• Creating a pie chart:
  • To create a pie chart using matplotlib, refer to the pie() function.
• Labels:
  • Adding labels to a pie chart is straightforward: just pass a list of strings with labels corresponding to the list of values:
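A minimal sketch of a labelled pie chart (the values and labels are illustrative):

    import matplotlib.pyplot as plt

    values = [40, 30, 20, 10]
    labels = ["Math", "English", "Biology", "Science"]
    plt.pie(values, labels=labels, autopct="%1.1f%%")   # labels matched to values by position
    plt.show()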
• A heatmap is a graph that extensively uses color for data visualization. The colors depend on several independent variables.
• Adding labels-
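A minimal sketch of a heatmap with row/column labels and per-cell annotations; the data is random, for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.rand(4, 4)
    rows = ["A", "B", "C", "D"]
    cols = ["w", "x", "y", "z"]
    fig, ax = plt.subplots()
    im = ax.imshow(data, cmap="viridis")   # colors encode the cell values
    ax.set_xticks(range(len(cols)))
    ax.set_xticklabels(cols)
    ax.set_yticks(range(len(rows)))
    ax.set_yticklabels(rows)
    for i in range(len(rows)):             # annotate each cell with its value
        for j in range(len(cols)):
            ax.text(j, i, f"{data[i, j]:.2f}", ha="center", va="center", color="w")
    fig.colorbar(im)
    plt.show()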
• A bar chart is a diagram where variables are represented as rectangular bars; the taller or longer the bar, the higher the value it represents.
• Usually, one axis of a bar chart represents a category, and the other its value.
• A bar chart is used to compare discrete data, such as occurrences or proportions.
• Plotting multiple bars next to each other can come in handy when we need to compare two or more data series that share categories.
• Stacked bar plot
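A minimal sketch covering the simple, side-by-side, and stacked bar variants above; the categories and values are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt

    categories = ["A", "B", "C"]
    series1 = [5, 7, 3]
    series2 = [4, 6, 8]
    x = np.arange(len(categories))
    width = 0.35

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].bar(categories, series1)                           # simple bar chart
    axes[1].bar(x - width / 2, series1, width, label="2023")   # grouped: bars side by side
    axes[1].bar(x + width / 2, series2, width, label="2024")
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(categories)
    axes[1].legend()
    axes[2].bar(categories, series1, label="2023")                   # stacked: second series
    axes[2].bar(categories, series2, bottom=series1, label="2024")   # drawn on top of the first
    axes[2].legend()
    plt.show()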
• Histogram:
  • A histogram is a graphical display of data that organizes groups of data points into ranges. These ranges are represented by bars.
  • It resembles a bar chart, but it's not quite the same. The key difference is that you use a bar chart for categorical data representation, while a histogram displays only numerical data.
• Changing bins:
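A minimal sketch of a histogram with the default bins and with a custom bins value (the data is random, for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.randn(1000)
    plt.hist(data)            # default: 10 bins
    plt.show()
    plt.hist(data, bins=30)   # finer-grained ranges
    plt.show()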
• Scatter Plots:
  • A scatter plot is a visualization of how two variables relate to each other by using plotted points. It is widely used for its simplicity in building a chart.
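A minimal sketch of a scatter plot of two related variables (the data is random, for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.random.rand(50)
    y = 2 * x + np.random.normal(0, 0.1, 50)   # roughly linear relationship plus noise
    plt.scatter(x, y)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()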
• Box plot:
  • A box plot (also known as a box-and-whisker plot) is a convenient way to visualize the distributions of numerical data using quartiles. Box plots are widespread in descriptive statistics; they allow you to quickly explore one or more datasets.
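A minimal sketch of box plots for two illustrative samples:

    import numpy as np
    import matplotlib.pyplot as plt

    data = [np.random.normal(0, 1, 100), np.random.normal(1, 2, 100)]
    plt.boxplot(data, labels=["sample 1", "sample 2"])   # one box per dataset
    plt.show()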
• Stack Plots:
  • A stack plot is basically like a pie chart, only over time.
  • Let's consider a situation where we have 24 hours in a day, and we'd like to see how we're spending our time. We'll divide our activities into: sleeping, eating, working, and playing.
  • We're going to assume that we're tracking this over the course of 5 days, so our starting data will look like:
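The original numbers are not reproduced here; a sketch of the stack plot with illustrative hour counts per activity over the 5 days:

    import matplotlib.pyplot as plt

    days = [1, 2, 3, 4, 5]
    sleeping = [7, 8, 6, 11, 7]
    eating = [2, 3, 4, 3, 2]
    working = [7, 8, 7, 2, 2]
    playing = [8, 5, 7, 8, 13]
    plt.stackplot(days, sleeping, eating, working, playing,
                  labels=["Sleeping", "Eating", "Working", "Playing"])
    plt.legend(loc="upper left")
    plt.show()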
Pandas for Exploratory Data Analysis
• Exploratory data analysis (EDA) is a vital initial step of any data analysis or machine learning project. It is necessary for:
  • getting an overall understanding of the data, including first insights
  • identifying the size of the dataset, its structure, and the features that are crucial for the project goal
  • gathering fundamental statistics of the data
  • detecting potential issues to fix (such as missing values, duplicates, or outliers)
• head() and tail()
  • By default, the head() method returns the first five and tail() the last five rows of a dataframe or a series. To return a number of rows different from five, we need to pass in that number.
• sample()
  • By default, it returns a random row of a dataframe or a series. To return a certain number of random rows, we need to pass in that number.
• shape
  • For a dataframe, it returns a tuple with the number of rows and columns. For a series, it returns a one-element tuple with the number of rows.
• size
  • Returns the number of elements in a dataframe or a series. For a series object, it makes more sense to use size rather than shape. The obtained information is the same in both cases, but size returns it in a handier form – as an integer rather than a one-element tuple.
• info()
  • Returns overall information about a dataframe, including the index data type, the number of rows and columns, column names, indices, and data types, the number of non-null values by column, and memory usage.
• describe()
  • Returns the major statistics of a dataframe or a series, including the number of non-null values, the minimum, maximum, and mean values, and percentiles. For a dataframe, it returns the information by column, and by default, only for numeric columns. To include the statistics for the columns of an object type as well, we need to pass in include='all'. For object columns, the method returns the number of non-null values, the number of unique values, the most frequent value, and the number of times it is encountered in the corresponding column.
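A minimal sketch of these inspection methods on a small hypothetical DataFrame:

    import pandas as pd

    df = pd.DataFrame({"country": ["A", "B", "C", "D"],
                       "population": [5.3, 83.2, 38.0, 1.4],
                       "continent": ["Europe", "Europe", "Asia", "Africa"]})
    print(df.head(2))                    # first two rows
    print(df.sample())                   # one random row
    print(df.shape, df.size)             # (4, 3) and 12
    df.info()                            # dtypes and non-null counts
    print(df.describe(include="all"))    # statistics for numeric and object columns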
• dtypes
  • Returns the data types of a dataframe by column. If a column contains mixed data types or if all its values are None, the returned data type of that column will be object. This also includes a special case of a column containing booleans and null (NaN) or None values.
  • For a series, we can interchangeably use either dtypes or dtype for EDA in pandas.
• select_dtypes()
  • Returns a subset of the columns of a dataframe based on the provided column data type (or types). We have to specify the data type (or types) either to include into the subset (using the include parameter) or exclude from it (exclude).
  • Some typical data types are number, int, float, object, bool, category, and datetime. To specify several data types to include or exclude, we pass in a list of those data types.
• columns
  • Returns the column names of a dataframe.
• count()
  • Returns the count of non-null values in a dataframe or a series. For a dataframe, by default, it returns the results by column. Passing in axis=1 or axis='columns' will give the results by row.
• unique() and nunique()
  • The unique() method returns the unique values of a series, while nunique() returns the number of unique values in a dataframe or a series.
  • For a dataframe, nunique(), by default, returns the results by column. Otherwise, passing in axis=1 or axis='columns' will give the results by row.
• is_unique
  • Returns True if all the values in a series are unique.
• isnull() and isna()
  • Both methods return a boolean same-sized object showing which values are null (True) and which are not (False). These functions apply to both series and dataframes.
  • isnull() and isna() work best when chained with sum() (e.g., df.isnull().sum()), returning the number of null values by column for a dataframe or their total number for a series. For a dataframe, the method chaining df.isnull().sum().sum() gives the total number of null values.
• hasnans
  • Returns True if a series contains at least one null value.
• value_counts()
  • Returns the count of each unique value in a series. By default, the counts are not normalized, the results are sorted in descending order, and null values are not counted. To override the defaults, we can set the optional parameters normalize, ascending, and dropna accordingly.
• nsmallest() and nlargest()
  • By default, nsmallest() returns the five smallest and nlargest() the five largest values of a series together with their indices. To return a number of values different from five, we need to pass in that number.
• corr()
  • This method applies both to dataframes and series, but in a slightly different way. For a dataframe (df.corr()), it returns column pairwise correlation, excluding null values. For a series (Series1.corr(Series2)), this method returns the correlation of that series with another one, excluding null values.
• plot()
  • Allows the creation of simple plots of various kinds for a dataframe or a series. The main parameters are x, y, and kind. The popular types of supported plots are line, bar, barh, hist, box, area, density, pie, and scatter.
  • In general, pandas is not the best choice for creating compelling visualizations in Python. However, for the purposes of EDA, the plot() method works just fine.
• Sorting Values
  • Sorting your data according to a certain column can also be useful in EDA. For example, you might want to sort your data by a 'population' column to see which countries have the highest populations. In Pandas, you can use the sort_values() function to sort your DataFrame:
    sorted_df = df.sort_values(by='population', ascending=False)
  • This will return a new DataFrame sorted by the 'population' column in descending order. The ascending=False argument sorts the column in descending order. If you want to sort in ascending order, you can omit this argument, as True is the default value.
• Grouping Data
  • Grouping your data based on certain criteria can provide valuable insights. For example, you might want to group your data by 'continent' to analyze the data at the continent level. In Pandas, you can use the groupby() function to group your data:
    grouped_df = df.groupby('continent').mean()
  • This will return a new DataFrame where the data is grouped by the 'continent' column, and the values in each group are the mean values of the original data in that group.
• Filtering Data Based on Data Types
  • Sometimes, you might want to perform operations only on columns of a certain data type. For example, you might want to calculate statistical measures like mean, median, etc., only on numerical columns. In such cases, you can filter the columns based on their data types.
  • In Pandas, you can use the select_dtypes() function to select columns of a specific data type:
    numeric_df = df.select_dtypes(include='number')
  • This will return a new DataFrame containing only the columns with numerical data. Similarly, you can select columns with object (string) data type as follows:
    object_df = df.select_dtypes(include='object')
• Applying Functions to Cells, Columns and Rows
  • To apply functions to each column, use apply():
    df.apply(np.max)
  • The apply method can also be used to apply a function to each row. To do this, specify axis=1. Lambda functions are very convenient in such scenarios. For example, if we need to select all states starting with 'W', we can do it like this:
    df[df["State"].apply(lambda state: state[0] == "W")].head()
  • The map method can be used to replace values in a column by passing a dictionary of the form {old_value: new_value} as its argument:
    d = {"No": False, "Yes": True}
    df["International plan"] = df["International plan"].map(d)
    df.head()
NumPy for Statistical Analysis
• numpy.amin() and numpy.amax()
  • These functions return the minimum and the maximum from the elements in the given array along the specified axis.
• numpy.ptp() • The numpy.ptp() function returns the range (maximum-minimum) of values along an axis.
• numpy.percentile()
  • A percentile (or centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. The function numpy.percentile() takes the following arguments:
    numpy.percentile(a, q, axis)

    Sr. No.  Argument  Description
    1        a         Input array
    2        q         The percentile to compute; must be between 0 and 100
    3        axis      The axis along which the percentile is to be calculated
• numpy.median()
  • The median is defined as the value separating the higher half of a data sample from the lower half; the numpy.median() function computes it.
• numpy.mean() • Arithmetic mean is the sum of elements along an axis divided by the number of elements. The numpy.mean() function returns the arithmetic mean of elements in the array. If the axis is mentioned, it is calculated along it.
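A minimal sketch pulling these NumPy statistics together on a small illustrative array:

    import numpy as np

    a = np.array([[30, 65, 70], [80, 95, 10], [50, 90, 60]])
    print(np.amin(a), np.amax(a))    # overall minimum and maximum
    print(np.amin(a, axis=0))        # column-wise minimum
    print(np.ptp(a, axis=1))         # range (max - min) per row
    print(np.percentile(a, 50))      # 50th percentile of the flattened array
    print(np.median(a))              # median of all elements
    print(np.mean(a))                # arithmetic mean of all elements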
• Standard Deviation
    import numpy as np
    print(np.std([1, 2, 3, 4]))
• Variance
    import numpy as np
    print(np.var([1, 2, 3, 4]))