Unit-2 BDA
Python offers libraries that give users the functionality they need when crunching data.
Below are the major Python libraries that are used for working with data. You should
take some time to familiarize yourself with the basic purposes of these packages.
NumPy
NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities, and tools for integration with low-level languages such as Fortran, C, and C++.
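As a quick illustration (a minimal sketch; the array values here are arbitrary), the following shows the n-dimensional array, a linear algebra routine, and NumPy's random number tools:
python
import numpy as np

# The core NumPy object: an n-dimensional array
a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Basic linear algebra: matrix inverse
a_inv = np.linalg.inv(a)

# Random number generation
samples = np.random.default_rng(42).normal(size=3)
print(a_inv, samples)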
Pandas
Pandas is used for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to the Python ecosystem and has been instrumental in boosting Python's usage in the data science community.
Loading and manipulating data with Pandas DataFrames is a crucial step in data analysis with
Python. Here are some basic steps to load and manipulate data with Pandas DataFrames.
1. Loading data: You can load data into a Pandas DataFrame from various sources such as CSV
files, Excel files, SQL databases, and APIs. You can use
the read_csv(), read_excel(), read_sql(), and read_json() functions in
Pandas to read data from different sources.
2. Exploring data: Once you load data into a DataFrame, you can explore it using various
functions such as head(), tail(), describe(), info(), shape, columns,
and dtypes. These functions provide basic information about the DataFrame, such as the
column names, data types, and summary statistics.
3. Cleaning data: Data cleaning is an essential step in data analysis to ensure data quality. You
can clean data using various functions such as dropna(), fillna(), replace(),
and drop_duplicates(). These functions help you handle missing values, duplicate
rows, and inconsistent data.
4. Manipulating data: You can manipulate data in a DataFrame using functions such
as groupby(), pivot_table(), merge(), and concat(). These functions allow you
to group data, create pivot tables, and combine data from multiple sources.
5. Visualizing data: You can use Pandas’ built-in visualization tools to create various plots such
as bar plots, line plots, scatter plots, and histograms. These plots help you visualize the data
and gain insights into data trends (a minimal sketch follows this list).
6. Exporting data: Once you analyze and manipulate data, you may need to export the results to
various file formats such as CSV, Excel, or SQL databases. You can use
the to_csv(), to_excel(), to_sql(), and to_json() functions in Pandas to export
data.
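As a minimal sketch of step 5 above (the monthly sales figures are made up for illustration), Pandas' built-in plotting, a thin wrapper around matplotlib, works directly on a DataFrame:
python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly sales data
df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'sales': [100, 140, 120]})

# Built-in plotting: a bar plot of sales per month
df.plot.bar(x='month', y='sales')
plt.show()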
Loading and manipulating data using Pandas DataFrames in Python is a common task in data
analysis. Here's a brief overview of how to accomplish this:
Step 1: Install and Import Pandas
First, ensure you have Pandas installed. If not, you can install it using pip:
sh
pip install pandas
Then import Pandas:
python
import pandas as pd
Step 2: Load Data
Pandas can read data from various file formats such as CSV, Excel, SQL, and more. The
most common method is reading from a CSV file:
python
df = pd.read_csv('path_to_your_file.csv')
You can also load data from Excel files or SQL databases:
python
# Excel
df = pd.read_excel('path_to_your_file.xlsx')
# SQL
from sqlalchemy import create_engine
engine = create_engine('your_database_connection_string')
df = pd.read_sql('your_table_name', engine)
Step 3: Basic DataFrame Operations
Once the data is loaded, you can perform various operations on the DataFrame.
Viewing Data
python
# Display the first few rows
print(df.head())
Selecting Data
python
# Select a single column
col = df['column_name']
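# Select multiple columns (these column names are placeholders)
subset = df[['column_a', 'column_b']]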
Adding/Modifying Columns
python
# Add a new column
df['new_column'] = value  # 'value' stands in for any scalar, list, or Series
Handling Missing Values
python
# Check for missing values
print(df.isnull().sum())
Grouping Data
python
# Group by a column and calculate the mean of each numeric column per group
grouped_df = df.groupby('column_name').mean(numeric_only=True)
Step 4: Save Data
After manipulation, you might want to save the DataFrame back to a file:
python
# Save to CSV
df.to_csv('path_to_save_file.csv', index=False)
# Save to Excel
df.to_excel('path_to_save_file.xlsx', index=False)
# Save to SQL
df.to_sql('your_table_name', engine, if_exists='replace', index=False)
Example: Complete Workflow
Here is a complete example that loads data, manipulates it, and saves the result:
python
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
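# Manipulate: drop rows with missing values (one simple cleaning step)
df_clean = df.dropna()
# Save the result back to a file
df_clean.to_csv('cleaned_data.csv', index=False)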
Data Filtering:
Filtering is crucial for manipulating data and extracting pertinent insights from raw datasets. It
streamlines analysis by selectively isolating specific data points or patterns, which enhances
efficiency and ensures that only relevant information contributes to informed decision-making and
insightful discoveries.
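For instance (a minimal sketch with made-up sales records):
python
import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({'region': ['N', 'S', 'N', 'E'], 'sales': [120, 80, 200, 150]})

# Filtering: keep only the rows relevant to the question at hand
high_sales = df[df['sales'] > 100]
print(high_sales)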
Data Sorting:
Sorting data or structuring it into columns and rows enhances readability and comprehension.
Analysts can quickly identify patterns, outliers, and trends by organising data logically, streamlining
the analysis process. This structured presentation aids in extracting meaningful insights and making
informed decisions based on a clear understanding of the data.
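For example (a minimal sketch with made-up data):
python
import pandas as pd

df = pd.DataFrame({'region': ['N', 'S', 'E'], 'sales': [120, 80, 150]})

# Sorting: order rows so large values and potential outliers surface first
df_sorted = df.sort_values(by='sales', ascending=False)
print(df_sorted)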
Data Aggregation:
Aggregation, a vital data manipulation feature, condenses multiple records into a concise summary.
It encompasses computing averages, sums, counts, and maximum or minimum values, streamlining
analysis and yielding actionable insights from complex datasets.
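A minimal sketch with made-up data:
python
import pandas as pd

df = pd.DataFrame({'region': ['N', 'S', 'N', 'S'], 'sales': [120, 80, 200, 60]})

# Aggregation: condense many records into one summary row per group
summary = df.groupby('region')['sales'].agg(['mean', 'sum', 'max', 'min'])
print(summary)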
Reading data from various sources such as CSV files, Excel files, and databases is a common
task in data analysis. Here is how you can accomplish this using Pandas in Python.
Step 1: Set Up the Environment
Make sure you have Pandas installed. If not, install it using pip:
sh
pip install pandas
Step 2: Import Pandas
python
import pandas as pd
Reading Data from CSV
python
# Reading data from a CSV file
csv_file_path = 'path_to_your_file.csv'
df_csv = pd.read_csv(csv_file_path)
print(df_csv.head())
Reading Data from Excel
python
# Reading data from an Excel file
excel_file_path = 'path_to_your_file.xlsx'
df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')  # Specify sheet name if necessary
print(df_excel.head())
Reading Data from Databases
To read data from a database, you will need to use SQLAlchemy to establish a connection.
Install SQLAlchemy if you haven't:
sh
pip install sqlalchemy
Then, use the create_engine function to establish a connection and pd.read_sql to read
the data:
python
from sqlalchemy import create_engine
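# Establish a connection (the connection string below is a placeholder)
engine = create_engine('your_database_connection_string')

# Read an entire table into a DataFrame (pandas was imported in Step 2)
df_sql = pd.read_sql('your_table_name', engine)
print(df_sql.head())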
Here's an example of reading data from CSV, Excel, and SQL sources in one script:
python
import pandas as pd
from sqlalchemy import create_engine
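# (The paths, connection string, and table name below are placeholders)

# CSV
df_csv = pd.read_csv('path_to_your_file.csv')

# Excel
df_excel = pd.read_excel('path_to_your_file.xlsx', sheet_name='Sheet1')

# SQL
engine = create_engine('your_database_connection_string')
df_sql = pd.read_sql('your_table_name', engine)

print(df_csv.head(), df_excel.head(), df_sql.head())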
HANDLING MISSING VALUES:
3. Algorithms That Tolerate Missing Values
Some machine learning algorithms can handle missing values natively, such as
decision trees, Random Forests, and XGBoost.
4. Missing Indicator
Create a new binary column to indicate whether the value was missing, and then fill
the missing value using one of the imputation methods. This preserves information
about missingness, which can be useful for some models.
5. Prediction Models
Use machine learning models to predict and impute missing values based on other
features.
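A minimal sketch of the missing-indicator approach combined with median imputation (the 'age' column and its values are hypothetical):
python
import pandas as pd
import numpy as np

# Hypothetical column with missing values
df = pd.DataFrame({'age': [25, np.nan, 40, np.nan, 31]})

# Indicator column: 1 where the value was missing, 0 otherwise
df['age_missing'] = df['age'].isnull().astype(int)

# Fill the missing values with the median (one common imputation choice)
df['age'] = df['age'].fillna(df['age'].median())
print(df)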
HANDLING DUPLICATES:
Dealing with duplicates is a crucial aspect of data cleaning to ensure the quality and
reliability of the dataset. Here are some techniques to handle duplicates effectively:
1. Identifying Duplicates
Exact Duplicates: Rows that are completely identical across all columns.
Partial Duplicates: Rows that are identical in specific key columns but may have different
values in other columns.
2. Removing Duplicates
Remove Exact Duplicates: Use functions to identify and remove rows that are exact
duplicates.
Prioritize Based on a Column: Keep the duplicate row based on the value of a particular
column (e.g., latest timestamp).
Aggregation: Aggregate duplicate rows by summarizing or averaging numerical values.
Manual Review: Sometimes manual review is necessary for critical data.
Merge Information: Combine information from duplicate rows, especially if different rows
have different useful information.
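The following minimal sketch (with made-up records) illustrates removing exact duplicates, prioritizing by a timestamp column, and aggregating duplicates:
python
import pandas as pd

# Hypothetical records containing duplicates
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'value': [10, 10, 20, 25, 30],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-01',
                                 '2024-01-02', '2024-01-03', '2024-01-04'])
})

# Remove exact duplicates (identical across all columns)
df_exact = df.drop_duplicates()

# Prioritize based on a column: keep the latest timestamp per id
df_latest = df.sort_values('timestamp').drop_duplicates('id', keep='last')

# Aggregation: average numerical values across duplicate ids
df_agg = df.groupby('id', as_index=False)['value'].mean()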
HANDLING OUTLIERS:
Handling outliers is an essential part of data preprocessing, as outliers can significantly
impact the performance of statistical analyses and machine learning models. Here are several
techniques to identify and handle outliers:
Identifying Outliers
1. Statistical Methods
o Z-Score: Measures the number of standard deviations a data point is from the
mean.
python
from scipy.stats import zscore
import numpy as np

# Standardize the column, then flag points more than 3 standard deviations from the mean
df['z_score'] = zscore(df['column_name'])
outliers = df[np.abs(df['z_score']) > 3]
o IQR (Interquartile Range): Flags points that fall more than 1.5 × IQR below the
first quartile or above the third quartile.
python
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
2. Visual Methods
o Box Plot: Displays the distribution of data and highlights outliers.
python
import matplotlib.pyplot as plt
df['column_name'].plot.box()
plt.show()
o Scatter Plot: Reveals points that fall far from the overall pattern between two
variables.
python
df.plot.scatter(x='column_x', y='column_y')
plt.show()
3. Model-Based Methods
o Isolation Forest: Identifies anomalies by isolating observations.
python
from sklearn.ensemble import IsolationForest

# contamination = expected proportion of outliers in the data
iso_forest = IsolationForest(contamination=0.1)
df['anomaly_score'] = iso_forest.fit_predict(df[['column_name']])
outliers = df[df['anomaly_score'] == -1]
Handling Outliers
1. Removing Outliers
o Simply remove the outliers identified by statistical or visual methods.
python
df_cleaned = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]
2. Transforming Data
o Log Transformation: Reduces the impact of outliers.
python
df['log_column'] = np.log(df['column_name'] + 1)  # Adding 1 to avoid log(0)
o Square Root Transformation: A milder transformation that also compresses large
values.
python
df['sqrt_column'] = np.sqrt(df['column_name'])
3. Capping/Flooring
o Limit the values of outliers to the nearest threshold (e.g., the upper or lower bound
of IQR).
python
df['column_name'] = np.where(df['column_name'] > upper_bound, upper_bound,
                             np.where(df['column_name'] < lower_bound, lower_bound, df['column_name']))
4. Imputation
o Replace outliers with statistical measures like mean or median.
python
median = df['column_name'].median()
df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound),
                             median, df['column_name'])
5. Model-Based Methods
o Use robust algorithms that are less sensitive to outliers, such as decision trees or
robust regression models.
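As a minimal sketch of a robust model (Huber regression from scikit-learn; the data below are synthetic, with a few outliers injected):
python
import numpy as np
from sklearn.linear_model import HuberRegressor

# Synthetic linear data with a handful of extreme outliers
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=100)
y[:5] += 20  # inject outliers

# The Huber loss down-weights large residuals, so the fitted slope
# stays close to 3 despite the injected outliers
model = HuberRegressor().fit(X, y)
print(model.coef_, model.intercept_)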