Unit-2 BDA
Python offers libraries that give users the functionality they need when crunching data.
Below are the major Python libraries that are used for working with data. You should
take some time to familiarize yourself with the basic purposes of these packages.
NumPy
NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities, and tools for integration with low-level languages such as Fortran, C, and C++.
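As a quick illustration (a minimal sketch; the array values here are arbitrary), the following shows the n-dimensional array, a linear algebra routine, and NumPy's random number tools:
python
import numpy as np

# The core NumPy object: an n-dimensional array
a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Basic linear algebra: matrix inverse
a_inv = np.linalg.inv(a)

# Random number generation
samples = np.random.default_rng(42).normal(size=3)
print(a_inv, samples)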
Pandas
Pandas is used for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to the Python ecosystem and has been instrumental in boosting Python's usage in the data science community.
Loading and manipulating data with Pandas DataFrames is a crucial step in data analysis with
Python. Here are some basic steps to load and manipulate data with Pandas DataFrames.
1. Loading data: You can load data into a Pandas DataFrame from various sources such as CSV
files, Excel files, SQL databases, and APIs. You can use
the read_csv(), read_excel(), read_sql(), and read_json() functions in
Pandas to read data from different sources.
2. Exploring data: Once you load data into a DataFrame, you can explore it using various
functions such as head(), tail(), describe(), info(), shape, columns,
and dtypes. These functions provide basic information about the DataFrame, such as the
column names, data types, and summary statistics.
3. Cleaning data: Data cleaning is an essential step in data analysis to ensure data quality. You
can clean data using various functions such as dropna(), fillna(), replace(),
and drop_duplicates(). These functions help you handle missing values, duplicate
rows, and inconsistent data.
4. Manipulating data: You can manipulate data in a DataFrame using functions such
as groupby(), pivot_table(), merge(), and concat(). These functions allow you
to group data, create pivot tables, and combine data from multiple sources.
5. Visualizing data: You can use Pandas’ built-in visualization tools to create various plots such
as bar plots, line plots, scatter plots, and histograms. These plots help you visualize the data
and gain insights into data trends (a minimal sketch follows this list).
6. Exporting data: Once you analyze and manipulate data, you may need to export the results to
various file formats such as CSV, Excel, or SQL databases. You can use
the to_csv(), to_excel(), to_sql(), and to_json() functions in Pandas to export
data.
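As a minimal sketch of step 5 above (the monthly sales figures are made up for illustration), Pandas' built-in plotting, a thin wrapper around matplotlib, works directly on a DataFrame:
python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly sales data
df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'sales': [100, 140, 120]})

# Built-in plotting: a bar plot of sales per month
df.plot.bar(x='month', y='sales')
plt.show()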
Loading and manipulating data using Pandas DataFrames in Python is a common task in data
analysis. Here's a brief overview of how to accomplish this:
Step 1: Install and Import Pandas
First, ensure you have Pandas installed. If not, you can install it using pip:
sh
pip install pandas
Then import Pandas:
python
import pandas as pd
Step 2: Load Data
Pandas can read data from various file formats such as CSV, Excel, SQL, and more. The
most common method is reading from a CSV file:
python
df = pd.read_csv('path_to_your_file.csv')
You can also load data from Excel files or SQL databases:
python
# Excel
df = pd.read_excel('path_to_your_file.xlsx')
# SQL
from sqlalchemy import create_engine
engine = create_engine('your_database_connection_string')
df = pd.read_sql('your_table_name', engine)
Step 3: Basic DataFrame Operations
Once the data is loaded, you can perform various operations on the DataFrame.
Viewing Data
python
# Display the first few rows
print(df.head())
Selecting Data
python
# Select a single column
col = df['column_name']
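# Select multiple columns (these column names are placeholders)
subset = df[['column_a', 'column_b']]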
Adding/Modifying Columns
python
# Add a new column
df['new_column'] = value  # 'value' stands in for any scalar, list, or Series
Handling Missing Values
python
# Check for missing values
print(df.isnull().sum())
Grouping Data
python
# Group by a column and calculate the mean of each numeric column per group
grouped_df = df.groupby('column_name').mean(numeric_only=True)
Step 4: Save Data
After manipulation, you might want to save the DataFrame back to a file:
python
# Save to CSV
df.to_csv('path_to_save_file.csv', index=False)
# Save to Excel
df.to_excel('path_to_save_file.xlsx', index=False)
# Save to SQL
df.to_sql('your_table_name', engine, if_exists='replace', index=False)
Example: Complete Workflow
Here is a complete example that loads data, manipulates it, and saves the result:
python
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
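# Manipulate: drop rows with missing values (one simple cleaning step)
df_clean = df.dropna()
# Save the result back to a file
df_clean.to_csv('cleaned_data.csv', index=False)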
Data Filtering:
Filtering is crucial for manipulating data and extracting pertinent insights from raw datasets. It
streamlines analysis by selectively isolating specific data points or patterns, which enhances
efficiency and ensures that only relevant information contributes to informed decision-making and
insightful discoveries.
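For instance (a minimal sketch with made-up sales records):
python
import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({'region': ['N', 'S', 'N', 'E'], 'sales': [120, 80, 200, 150]})

# Filtering: keep only the rows relevant to the question at hand
high_sales = df[df['sales'] > 100]
print(high_sales)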
Data Sorting:
Sorting data or structuring it into columns and rows enhances readability and comprehension.
Analysts can quickly identify patterns, outliers, and trends by organising data logically, streamlining
the analysis process. This structured presentation aids in extracting meaningful insights and making
informed decisions based on a clear understanding of the data.
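For example (a minimal sketch with made-up data):
python
import pandas as pd

df = pd.DataFrame({'region': ['N', 'S', 'E'], 'sales': [120, 80, 150]})

# Sorting: order rows so large values and potential outliers surface first
df_sorted = df.sort_values(by='sales', ascending=False)
print(df_sorted)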
Data Aggregation:
Aggregation, a vital data manipulation feature, condenses multiple records into a concise summary.
It encompasses computing averages, sums, counts, and maximum or minimum values, streamlining
analysis and yielding actionable insights from complex datasets.
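A minimal sketch with made-up data:
python
import pandas as pd

df = pd.DataFrame({'region': ['N', 'S', 'N', 'S'], 'sales': [120, 80, 200, 60]})

# Aggregation: condense many records into one summary row per group
summary = df.groupby('region')['sales'].agg(['mean', 'sum', 'max', 'min'])
print(summary)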
Reading data from various sources such as CSV files, Excel files, and databases is a common
task in data analysis. Here is how you can accomplish this using Pandas in Python.
Step 1: Set Up the Environment
Make sure you have Pandas installed. If not, install it using pip:
sh
pip install pandas
Step 2: Import Pandas
python
import pandas as pd
Reading Data from CSV
python
# Reading data from a CSV file
csv_file_path = 'path_to_your_file.csv'
df_csv = pd.read_csv(csv_file_path)
print(df_csv.head())
Reading Data from Excel
python
# Reading data from an Excel file
excel_file_path = 'path_to_your_file.xlsx'
df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')  # Specify sheet name if necessary
print(df_excel.head())
Reading Data from Databases
To read data from a database, you will need to use SQLAlchemy to establish a connection.
Install SQLAlchemy if you haven't:
sh
pip install sqlalchemy
Then, use the create_engine function to establish a connection and pd.read_sql to read
the data:
python
from sqlalchemy import create_engine
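# Establish a connection (the connection string below is a placeholder)
engine = create_engine('your_database_connection_string')

# Read an entire table into a DataFrame (pandas was imported in Step 2)
df_sql = pd.read_sql('your_table_name', engine)
print(df_sql.head())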
Here's an example of reading data from CSV, Excel, and SQL sources in one script:
python
import pandas as pd
from sqlalchemy import create_engine
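# (The paths, connection string, and table name below are placeholders)

# CSV
df_csv = pd.read_csv('path_to_your_file.csv')

# Excel
df_excel = pd.read_excel('path_to_your_file.xlsx', sheet_name='Sheet1')

# SQL
engine = create_engine('your_database_connection_string')
df_sql = pd.read_sql('your_table_name', engine)

print(df_csv.head(), df_excel.head(), df_sql.head())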
HANDLING MISSING VALUES:
3. Algorithms That Tolerate Missing Values
Some machine learning algorithms can handle missing values natively, such as
decision trees, Random Forests, and XGBoost.
4. Missing Indicator
Create a new binary column to indicate whether the value was missing, and then fill
the missing value using one of the imputation methods. This preserves information
about missingness, which can be useful for some models.
5. Prediction Models
Use machine learning models to predict and impute missing values based on other
features.
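A minimal sketch of the missing-indicator approach combined with median imputation (the 'age' column and its values are hypothetical):
python
import pandas as pd
import numpy as np

# Hypothetical column with missing values
df = pd.DataFrame({'age': [25, np.nan, 40, np.nan, 31]})

# Indicator column: 1 where the value was missing, 0 otherwise
df['age_missing'] = df['age'].isnull().astype(int)

# Fill the missing values with the median (one common imputation choice)
df['age'] = df['age'].fillna(df['age'].median())
print(df)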
HANDLING DUPLICATES:
Dealing with duplicates is a crucial aspect of data cleaning to ensure the quality and
reliability of the dataset. Here are some techniques to handle duplicates effectively:
1. Identifying Duplicates
Exact Duplicates: Rows that are completely identical across all columns.
Partial Duplicates: Rows that are identical in specific key columns but may have different
values in other columns.
2. Removing Duplicates
Remove Exact Duplicates: Use functions to identify and remove rows that are exact
duplicates.
Prioritize Based on a Column: Keep the duplicate row based on the value of a particular
column (e.g., latest timestamp).
Aggregation: Aggregate duplicate rows by summarizing or averaging numerical values.
Manual Review: Sometimes manual review is necessary for critical data.
Merge Information: Combine information from duplicate rows, especially if different rows
have different useful information.
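The following minimal sketch (with made-up records) illustrates removing exact duplicates, prioritizing by a timestamp column, and aggregating duplicates:
python
import pandas as pd

# Hypothetical records containing duplicates
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'value': [10, 10, 20, 25, 30],
    'timestamp': pd.to_datetime(['2024-01-01', '2024-01-01',
                                 '2024-01-02', '2024-01-03', '2024-01-04'])
})

# Remove exact duplicates (identical across all columns)
df_exact = df.drop_duplicates()

# Prioritize based on a column: keep the latest timestamp per id
df_latest = df.sort_values('timestamp').drop_duplicates('id', keep='last')

# Aggregation: average numerical values across duplicate ids
df_agg = df.groupby('id', as_index=False)['value'].mean()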
HANDLING OUTLIERS:
Handling outliers is an essential part of data preprocessing, as outliers can significantly
impact the performance of statistical analyses and machine learning models. Here are several
techniques to identify and handle outliers:
Identifying Outliers
1. Statistical Methods
o Z-Score: Measures the number of standard deviations a data point is from the
mean.
python
from scipy.stats import zscore
import numpy as np

# Standardize the column, then flag points more than 3 standard deviations from the mean
df['z_score'] = zscore(df['column_name'])
outliers = df[np.abs(df['z_score']) > 3]
o IQR (Interquartile Range): Flags points that fall more than 1.5 × IQR below the
first quartile or above the third quartile.
python
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
2. Visual Methods
o Box Plot: Displays the distribution of data and highlights outliers.
python
import matplotlib.pyplot as plt
df['column_name'].plot.box()
plt.show()
o Scatter Plot: Reveals points that fall far from the overall pattern between two
variables.
python
df.plot.scatter(x='column_x', y='column_y')
plt.show()
3. Model-Based Methods
o Isolation Forest: Identifies anomalies by isolating observations.
python
from sklearn.ensemble import IsolationForest

# contamination = expected proportion of outliers in the data
iso_forest = IsolationForest(contamination=0.1)
df['anomaly_score'] = iso_forest.fit_predict(df[['column_name']])
outliers = df[df['anomaly_score'] == -1]
Handling Outliers
1. Removing Outliers
o Simply remove the outliers identified by statistical or visual methods.
python
df_cleaned = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]
2. Transforming Data
o Log Transformation: Reduces the impact of outliers.
python
df['log_column'] = np.log(df['column_name'] + 1)  # Adding 1 to avoid log(0)
o Square Root Transformation: A milder transformation that also compresses large
values.
python
df['sqrt_column'] = np.sqrt(df['column_name'])
3. Capping/Flooring
o Limit the values of outliers to the nearest threshold (e.g., the upper or lower bound
of IQR).
python
df['column_name'] = np.where(df['column_name'] > upper_bound, upper_bound,
                             np.where(df['column_name'] < lower_bound, lower_bound, df['column_name']))
4. Imputation
o Replace outliers with statistical measures like mean or median.
python
median = df['column_name'].median()
df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound),
                             median, df['column_name'])
5. Model-Based Methods
o Use robust algorithms that are less sensitive to outliers, such as decision trees or
robust regression models.
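As a minimal sketch of a robust model (Huber regression from scikit-learn; the data below are synthetic, with a few outliers injected):
python
import numpy as np
from sklearn.linear_model import HuberRegressor

# Synthetic linear data with a handful of extreme outliers
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X.ravel() + rng.normal(scale=0.5, size=100)
y[:5] += 20  # inject outliers

# The Huber loss down-weights large residuals, so the fitted slope
# stays close to 3 despite the injected outliers
model = HuberRegressor().fit(X, y)
print(model.coef_, model.intercept_)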