
Exp 12

Ml lab exp 12

Uploaded by

g.monikadevi
Experiment-12: Exploratory Data Analysis for Classification using Pandas or Matplotlib

Introduction to EDA

[Figure: overview of EDA, linking data preparation, data structure, data exploration, insight, reports, and visual graphs]

Exploratory Data Analysis (EDA) is a critical step in the data analysis process. It involves examining and visualizing data to gain insights and uncover patterns, anomalies, and relationships within the dataset. EDA helps data analysts and scientists understand the data they are working with before proceeding to more advanced analytics or modeling. Below is a detailed explanation of the key steps involved in EDA:

1. Data Collection:
- Gather the dataset from various sources, such as databases, CSV files, APIs, or web scraping.
- Ensure that the data is structured and organized for analysis.

2. Data Loading:
- Import the dataset into your preferred data analysis environment, such as Python using libraries like Pandas.

3. Initial Data Inspection:
- Examine the first few rows of the dataset to get a sense of its structure and content.
- Check the data types, column names, and missing values.

4. Data Cleaning:
- Handle missing values by either imputing them or removing rows/columns with missing data.
- Correct data inconsistencies and errors, such as typos and outliers.
- Ensure that data types are appropriate for each column (e.g., numeric, categorical).

5. Descriptive Statistics:
- Calculate basic statistics for numerical variables, including mean, median, standard deviation, and quartiles.
- Understand the central tendencies and spread of the data.

6. Univariate Analysis:
- Visualize the distribution of individual variables through histograms, density plots, box plots, or bar charts.
- Identify outliers and anomalies.

7. Bivariate and Multivariate Analysis:
- Explore relationships between pairs of variables through scatter plots, heatmaps, or correlation matrices.
- Investigate how variables interact with each other.
- Identify potential predictors for the target variable in a classification or regression task.

8. Data Visualization:
- Create meaningful visualizations such as line plots, bar charts, pie charts, and box plots to represent data patterns.
- Use color and labels to make visualizations more interpretable.

9. Feature Engineering:
- Create new features based on domain knowledge or insights from the EDA.
- Transform variables to better suit the modeling algorithms.

10. Outlier Detection:
- Identify and handle outliers that may affect the quality of the analysis or model.
- Consider whether outliers should be removed or transformed.

11. Categorical Variable Analysis:
- Analyze categorical variables using frequency tables, bar plots, or stacked bar charts.
- Understand the distribution of categories within each variable.

12. Time Series Analysis (if applicable):
- For time series data, examine trends, seasonality, and autocorrelation.
- Decompose time series data to better understand its components.

13. Hypothesis Testing (if applicable):
- Perform statistical tests to validate or reject hypotheses about the data.
- Common tests include t-tests, chi-squared tests, and ANOVA.

14. Summary and Insights:
- Summarize the key findings from the EDA process.
- Document interesting patterns, relationships, and potential insights.

15. Data Visualization and Reporting:
- Create clear and informative data visualizations for reporting and presentation.
- Communicate the results and insights effectively to stakeholders.
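Steps 4 (data cleaning) and 10 (outlier detection) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy DataFrame, its column names, median imputation, and the 1.5 × IQR rule are one common choice, not the only valid one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the Iris file): one missing value and
# one obviously extreme measurement, so both steps have something to act on.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 5.0, 5.4, 15.0, 4.6, 5.2],
    "species": ["setosa"] * 8,
})

# Step 4: impute the missing numeric value with the column median.
median = df["sepal_length"].median()
df["sepal_length"] = df["sepal_length"].fillna(median)

# Step 10: flag values lying more than 1.5 * IQR outside the quartiles.
q1 = df["sepal_length"].quantile(0.25)
q3 = df["sepal_length"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["sepal_length"] < lower) | (df["sepal_length"] > upper)

# Inspect the flagged rows before deciding whether to drop or transform them.
print(df[is_outlier])
cleaned = df[~is_outlier].reset_index(drop=True)
```

Whether a flagged row should be removed depends on the domain; here it is dropped only to show the mechanics.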
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load your dataset (replace '/content/Iris.csv' with your dataset's file path)
data = pd.read_csv('/content/Iris.csv')

# Display the first few rows of the dataset to get an overview
print(data.head())

# Summary statistics for numeric columns
print(data.describe())

# Missing value analysis
print("\nMissing Values:")
print(data.isnull().sum())

# Explore missing data
plt.figure(figsize=(8, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data')
plt.show()

# Class distribution for classification
class_counts = data['Species'].value_counts()
print(class_counts)

# Visualization of class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Species', data=data)
plt.title('Class Distribution')
plt.xlabel('Target Class')
plt.ylabel('Count')
plt.show()

# Import label encoder
from sklearn import preprocessing

# The label_encoder object maps each class name to an integer code.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'Species'.
data['Species'] = label_encoder.fit_transform(data['Species'])
print(data['Species'].unique())

# Correlation matrix for numeric features
# (every column is numeric once 'Species' has been encoded)
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Pairplot to visualize relationships between numerical features
sns.pairplot(data, hue='Species')
plt.show()

# Box plots for numerical features vs. target variable
plt.figure(figsize=(12, 8))
for i, feature in enumerate(data.columns[:-1]):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(data=data, x='Species', y=feature)
    plt.title(f'{feature} vs. Target')
plt.tight_layout()
plt.show()

# Box plot for a single numeric feature by class
plt.figure(figsize=(12, 8))
sns.boxplot(x='Species', y='SepalLengthCm', data=data)
plt.title('Box Plot of SepalLengthCm by Class')
plt.xlabel('Target Class')
plt.ylabel('Feature1')
plt.show()

# Histograms for numeric features
data.hist(bins=20, figsize=(12, 8))
plt.suptitle('Histograms of Numeric Features', y=1.02)
plt.show()

# Distribution plots for numerical features
numerical_features = data.select_dtypes(include=['int64', 'float64']).columns
plt.figure(figsize=(12, 8))
for i, feature in enumerate(numerical_features):
    plt.subplot(2, 3, i + 1)
    sns.histplot(data=data, x=feature, kde=True)
    plt.title(f'{feature} Distribution')
plt.tight_layout()
plt.show()

# Scatter plot for feature relationships
plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='SepalLengthCm', y='SepalWidthCm', hue='Species')
plt.title("Scatter Plot between Feature1 and Feature2")
plt.show()

# Pairwise feature correlation with the target variable
correlation_with_target = data.corr()['Species'].abs().sort_values(ascending=False)
print("\nFeature Correlation with Target:")
print(correlation_with_target)
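The program above does not touch step 13 (hypothesis testing). As a minimal sketch of that step, the one-way ANOVA F statistic can be computed directly with Pandas. The toy values, the column names `cls` and `value`, and the helper `one_way_anova_f` are hypothetical and exist only for this illustration; in practice a library routine such as `scipy.stats.f_oneway` would also report the p-value.

```python
import pandas as pd

# Hypothetical toy measurements for three classes (not real Iris values).
toy = pd.DataFrame({
    "cls":   ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "value": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0],
})


def one_way_anova_f(frame, group_col, value_col):
    """Return the one-way ANOVA F statistic of value_col across group_col."""
    grand_mean = frame[value_col].mean()
    groups = frame.groupby(group_col)[value_col]
    k = groups.ngroups          # number of classes
    n = len(frame)              # total number of observations

    # Between-group sum of squares: class size times squared distance of
    # each class mean from the grand mean.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()

    # Within-group sum of squares: squared distance of each observation
    # from its own class mean.
    ss_within = ((frame[value_col] - groups.transform("mean")) ** 2).sum()

    return (ss_between / (k - 1)) / (ss_within / (n - k))


f_stat = one_way_anova_f(toy, "cls", "value")
print(f"F statistic: {f_stat:.2f}")
```

A large F value suggests that the feature's mean differs between classes, which makes it a candidate predictor. Applied to the lab dataset, the call would take the form `one_way_anova_f(data, 'Species', 'SepalLengthCm')`, assuming those column names.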
