Data Analysis With Python

The document provides a comprehensive guide on data analysis using Python, covering topics such as datasets, data preprocessing, exploratory data analysis (EDA), model development, and model evaluation. It includes practical examples using Jupyter Notebook and details on handling missing data, data formatting, and various regression techniques. Additionally, it emphasizes the importance of understanding datasets and provides methods for exporting data in different formats.

Data Analysis with Python

• Datasets

o Exporting to different formats in Python

o Jupyter Notebook: Import data

• Preprocessing Data in Python

o How to deal with missing data

o Data Formatting in Python

o Data Normalization in Python

o Binning

o Turning categorical variables into quantitative variables in Python

o Jupyter Notebook: Preprocessing data

• Exploratory Data Analysis (EDA)

o Descriptive Statistics - Describe()

o Grouping data

▪ groupby

▪ pivot

▪ Heatmap

o Correlation

o Correlation - Statistics

▪ Pearson Correlation

▪ Correlation Heatmap

o Association between two categorical variables: Chi-Square

o Jupyter Notebook: Exploratory Data Analysis (EDA)

• Model Development

o Linear Regression and Multiple Linear Regression

o Model Evaluation using Visualization

▪ Regression Plot

▪ Residual Plot

▪ Distribution Plots

o Polynomial Regression and Pipelines

o Measures for In-Sample Evaluation

▪ Mean Squared Error (MSE)

▪ R-squared

o Jupyter Notebook: Model Development

• Model Evaluation and Refinement

o Function cross_val_score()

o Function cross_val_predict()

o Overfitting, Underfitting and Model Selection

o Ridge Regression

o Grid Search

o Jupyter Notebook: Model Evaluation and Refinement

o Jupyter Notebook: House Sales in King County, USA

Datasets

Understanding Datasets

Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/


Exporting to different formats in Python

Data Format    Read               Save
csv            pd.read_csv()      df.to_csv()
json           pd.read_json()     df.to_json()
Excel          pd.read_excel()    df.to_excel()
sql            pd.read_sql()      df.to_sql()
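
The read/save pairs in the table can be tried out with a short round trip. This is a minimal sketch, not the notebook's own code: the two-row DataFrame, the in-memory buffers, and the table name "autos" are assumptions made for illustration. Excel is left out because pd.read_excel() and df.to_excel() need an extra engine such as openpyxl.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical two-row sample standing in for the autos data.
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, 16430.0]})

# CSV: to_csv() without a path returns the text; read_csv() parses it back.
csv_text = df.to_csv(index=False)
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: the same pattern with to_json() / read_json().
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text), orient="records")

# SQL: an in-memory SQLite database; "autos" is a made-up table name.
conn = sqlite3.connect(":memory:")
df.to_sql("autos", conn, index=False)
df_from_sql = pd.read_sql("SELECT * FROM autos", conn)
conn.close()

print(df_from_csv)
print(df_from_json)
print(df_from_sql)
```

In practice a file path is passed instead of a buffer, for example df.to_csv("autos.csv", index=False).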

Basic insights from the data

• Understand your data before you begin any analysis

• Should check:

o data types

▪ df.dtypes

o data distribution

▪ df.describe()

▪ df.describe(include="all") summarizes every column, not only the numeric ones, and adds these statistics for non-numeric columns:

▪ unique

▪ top

▪ freq

• Locate potential issues with the data

o potential information and type mismatches

o compatibility with Python methods
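
The checks listed above can be sketched on a small sample. The column names and values below are assumptions chosen for illustration; the "?" entry imitates a missing-value placeholder that forces a numeric column to load as text, which is the kind of type mismatch to look for.

```python
import pandas as pd

# Hypothetical sample; "horsepower" holds a "?" placeholder, so it loads as text.
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "toyota"],
    "horsepower": ["102", "?", "115", "62"],
    "price": [13950.0, 16430.0, 17450.0, 5348.0],
})

# Data types: a numeric quantity stored as text signals a type mismatch.
print(df.dtypes)
horsepower_is_numeric = pd.api.types.is_numeric_dtype(df["horsepower"])

# Data distribution: numeric columns only by default.
numeric_summary = df.describe()

# Full summary: adds unique, top and freq for the non-numeric columns.
full_summary = df.describe(include="all")
print(full_summary)
```

Here numeric_summary covers only "price", while full_summary also reports, for "make", the number of unique values, the most frequent value (top), and how often it occurs (freq).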

Jupyter Notebook: Import data


Preprocessing Data in Python

• Identify and handle missing values

• Data formatting

• Data normalization (centering / scaling)

• Data binning

• Turning categorical values to numeric variables
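
As a rough preview of the last three steps, the sketch below normalizes one column, bins another, and turns a categorical column into indicator columns. The sample values, the three equal-width bins, and the bin labels are assumptions made for illustration, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "length": [168.8, 171.2, 176.6, 192.7],
    "horsepower": [111, 154, 102, 115],
    "fuel": ["gas", "gas", "diesel", "gas"],
})

# Normalization, three common variants:
# simple feature scaling, min-max scaling, and z-score (centering and scaling).
df["length_scaled"] = df["length"] / df["length"].max()
df["length_minmax"] = (df["length"] - df["length"].min()) / (
    df["length"].max() - df["length"].min()
)
df["length_z"] = (df["length"] - df["length"].mean()) / df["length"].std()

# Binning: 4 equally spaced edges give 3 equal-width bins.
bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)
labels = ["low", "medium", "high"]
df["horsepower_binned"] = pd.cut(
    df["horsepower"], bins=bins, labels=labels, include_lowest=True
)

# Categorical to numeric: one indicator column per category.
dummies = pd.get_dummies(df["fuel"])

print(df)
print(dummies)
```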

How to deal with missing data

• Check with the data collection source

• Drop the missing values

o drop the variable

o drop the data entry

• Replace the missing values

o replace it with an average (of similar data points)

o replace it with the most frequent value

o replace it based on other functions

• Leave it as missing data

df.dropna(subset=["price"], axis=0, inplace=True)

is equivalent to

df = df.dropna(subset=["price"], axis=0)
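
The drop and replace options can be combined in one pass. The sketch below is an illustration under assumptions: the sample values are made up, and which strategy suits which column is a judgment call (here rows without a price are dropped, a numeric column is filled with its mean, and a categorical column is filled with its most frequent value).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in every column.
df = pd.DataFrame({
    "normalized-losses": [164.0, np.nan, 158.0, 150.0],
    "num-of-doors": ["four", np.nan, "two", "four"],
    "price": [13950.0, 17450.0, np.nan, 15250.0],
})

# Drop the data entry: remove rows where "price" is missing.
df = df.dropna(subset=["price"], axis=0)

# Replace with an average: fill numeric gaps with the column mean.
mean_losses = df["normalized-losses"].mean()
df["normalized-losses"] = df["normalized-losses"].fillna(mean_losses)

# Replace by frequency: fill categorical gaps with the most common value.
most_common_doors = df["num-of-doors"].mode()[0]
df["num-of-doors"] = df["num-of-doors"].fillna(most_common_doors)

# Dropping rows leaves gaps in the index, so renumber it.
df = df.reset_index(drop=True)
print(df)
```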

Data Formatting in Python

Non-formatted:

• confusing

• hard to aggregate

• hard to compare
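
A small sketch of what formatting can look like in pandas, assuming made-up data: the city spellings, the mapping to one label, and the derived column name "city-L/100km" are illustrative only. It standardizes inconsistent labels, corrects a data type, and converts a unit.

```python
import pandas as pd

# Hypothetical sample: one city written four ways, and numbers stored as text.
df = pd.DataFrame({
    "city": ["N.Y.", "Ny", "NY", "New York"],
    "city-mpg": ["21", "19", "24", "30"],
})

# Bring the different spellings to one label so rows can be aggregated.
df["city"] = df["city"].replace(
    {"N.Y.": "New York", "Ny": "New York", "NY": "New York"}
)

# Correct the data type so arithmetic and comparisons work.
df["city-mpg"] = df["city-mpg"].astype("int64")

# Convert units: miles per gallon to litres per 100 km (235 / mpg).
df["city-L/100km"] = 235 / df["city-mpg"]

print(df)
```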
