
UNIT – 2

PYTHON FOR DATA ANALYTICS


Python is a general-purpose language and is often used for things other than data
analysis and data science. What makes Python extremely useful for working with
data?

There are libraries that give users the necessary functionality when crunching data.
Below are the major Python libraries that are used for working with data. You should
take some time to familiarize yourself with the basic purposes of these packages.

NumPy – Numerical Computing
NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional
array (ndarray). The library also contains basic linear algebra functions, Fourier
transforms, advanced random-number capabilities, and tools for integration with
lower-level languages such as Fortran, C, and C++.
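
A minimal sketch of these capabilities; the small matrix, signal, and sample sizes below are purely illustrative:

import numpy as np

# Create a 2-dimensional array (ndarray)
a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Basic linear algebra: matrix inverse and determinant
inv_a = np.linalg.inv(a)
det_a = np.linalg.det(a)

# Fourier transform of a short signal
signal = np.array([0.0, 1.0, 0.0, -1.0])
spectrum = np.fft.fft(signal)

# Random number generation with a seeded generator
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)

print(inv_a, det_a, spectrum, samples, sep="\n")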

Pandas – Data Manipulation and Analysis

Pandas is used for structured data operations and manipulation. It is extensively used for
data munging and preparation. Pandas was added relatively recently to the Python
ecosystem and has been instrumental in boosting Python's usage in the data science community.

Loading and Manipulating Data with Pandas DataFrames

Loading and manipulating data with Pandas DataFrames is a crucial step in data analysis with
Python. Here are some basic steps to load and manipulate data with Pandas DataFrames.

1. Loading data: You can load data into a Pandas DataFrame from various sources such as CSV
files, Excel files, SQL databases, and APIs. You can use
the read_csv(), read_excel(), read_sql(), and read_json() functions in
Pandas to read data from different sources.
2. Exploring data: Once you load data into a DataFrame, you can explore it using various
functions such as head(), tail(), describe(), info(), shape, columns,
and dtypes. These functions provide basic information about the DataFrame, such as the
column names, data types, and summary statistics.
3. Cleaning data: Data cleaning is an essential step in data analysis to ensure data quality. You
can clean data using various functions such as dropna(), fillna(), replace(),
and drop_duplicates(). These functions help you handle missing values, duplicate
rows, and inconsistent data.
4. Manipulating data: You can manipulate data in a DataFrame using functions such
as groupby(), pivot_table(), merge(), and concat(). These functions allow you
to group data, create pivot tables, and combine data from multiple sources.
5. Visualizing data: You can use Pandas’ built-in visualization tools to create various plots such
as bar plots, line plots, scatter plots, and histograms. These plots help you visualize the data
and gain insights into data trends (a short sketch follows this list).
6. Exporting data: Once you analyze and manipulate data, you may need to export the results to
various file formats such as CSV, Excel, or SQL databases. You can use
the to_csv(), to_excel(), to_sql(), and to_json() functions in Pandas to export
data.
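
As a brief illustration of point 5 above, Pandas wraps Matplotlib for quick plots. This is only a sketch; the DataFrame and its column names are made up for the example:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'sales': [120, 150, 90]})

# Bar plot of sales per month
df.plot.bar(x='month', y='sales')
plt.show()

# Histogram of the 'sales' column
df['sales'].plot.hist(bins=5)
plt.show()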

Loading and manipulating data using Pandas DataFrame in Python is a common task for data
analysis. Here's a brief overview of how to accomplish this:

Step 1: Install and Import Pandas

First, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas

Then, import Pandas in your Python script or notebook:

import pandas as pd
Step 2: Load Data

Pandas can read data from various file formats such as CSV, Excel, SQL, and more. The
most common method is reading from a CSV file:

df = pd.read_csv('path_to_your_file.csv')

For other formats:

# Excel
df = pd.read_excel('path_to_your_file.xlsx')

# SQL
from sqlalchemy import create_engine
engine = create_engine('your_database_connection_string')
df = pd.read_sql('your_table_name', engine)
Step 3: Basic DataFrame Operations

Once the data is loaded, you can perform various operations on the DataFrame.

Viewing Data

# Display the first few rows
print(df.head())

# Display summary information
print(df.info())

# Display basic statistics
print(df.describe())

Selecting Data

# Select a single column
col = df['column_name']

# Select multiple columns
cols = df[['column_name1', 'column_name2']]

# Select rows by index
row = df.iloc[0]      # First row
rows = df.iloc[0:5]   # First 5 rows

# Select rows based on a condition
filtered_df = df[df['column_name'] > value]

Adding/Modifying Columns

# Add a new column
df['new_column'] = value

# Modify an existing column
df['existing_column'] = df['existing_column'].apply(lambda x: x + 1)

Handling Missing Data

# Check for missing values
print(df.isnull().sum())

# Drop rows with missing values
df = df.dropna()

# Fill missing values
df = df.fillna(value)

Grouping and Aggregation

# Group by a column and calculate the mean of each group
grouped_df = df.groupby('column_name').mean()

# Group by multiple columns and perform multiple aggregations
agg_df = df.groupby(['column1', 'column2']).agg({'column3': 'mean', 'column4': 'sum'})
Step 4: Save Data

After manipulation, you might want to save the DataFrame back to a file:

# Save to CSV
df.to_csv('path_to_save_file.csv', index=False)

# Save to Excel
df.to_excel('path_to_save_file.xlsx', index=False)

# Save to SQL
df.to_sql('your_table_name', engine, if_exists='replace', index=False)
Example: Complete Workflow

Here is a complete example that loads data, manipulates it, and saves the result:

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Display first 5 rows
print(df.head())

# Filter rows where 'column1' > 10 (copy to avoid SettingWithCopyWarning)
filtered_df = df[df['column1'] > 10].copy()

# Add a new column 'new_column' which is 'column2' squared
filtered_df['new_column'] = filtered_df['column2'] ** 2

# Group by 'column1' and calculate the mean of 'new_column'
grouped_df = filtered_df.groupby('column1')['new_column'].mean().reset_index()

# Save the result to a new CSV file
grouped_df.to_csv('processed_data.csv', index=False)

What is Data Manipulation?


Data manipulation in data science is a crucial step that helps uncover patterns which, in turn, support
informed decisions. Simply put, it refers to modifying, transforming, or reorganising data to extract
meaningful insights, prepare it for analysis, or meet specific requirements.

Key Features of Data Manipulation


Data Filtering:

It is crucial for manipulating data and extracting pertinent insights from raw datasets. Filtering
streamlines analysis processes by selectively isolating specific data points or patterns, which enhances
efficiency. It ensures that only relevant information contributes to informed decision-making and
insightful discoveries.

Data Sorting:
Sorting data or structuring it into columns and rows enhances readability and comprehension.
Analysts can quickly identify patterns, outliers, and trends by organising data logically, streamlining
the analysis process. This structured presentation aids in extracting meaningful insights and making
informed decisions based on a clear understanding of the data.

Data Aggregation:
Aggregation, a vital data manipulation feature, condenses multiple records into a concise summary.
It encompasses computing averages, sums, counts, and totals, and identifying maximum or minimum
values. It streamlines the analysis process and yields actionable insights from complex datasets.

Example 1: Filtering and Sorting:


One fundamental data manipulation task is filtering and sorting. It involves selecting specific rows or
columns based on certain criteria and arranging the data in a particular order. For instance, in a customer
database, you might filter the records to include only customers who made a purchase in the last month
and then sort them by their total spending.
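
A small sketch of that customer scenario; the DataFrame and the column names ('purchase_date', 'total_spending') are assumptions made for illustration:

import pandas as pd

customers = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meena'],
    'purchase_date': pd.to_datetime(['2024-05-03', '2024-03-28', '2024-05-15']),
    'total_spending': [250.0, 900.0, 480.0],
})

# Filter: keep only customers who purchased on or after 1 May 2024
recent = customers[customers['purchase_date'] >= '2024-05-01']

# Sort the filtered customers by total spending, highest first
recent_sorted = recent.sort_values('total_spending', ascending=False)
print(recent_sorted)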

Example 2: Aggregation and Summarisation:


Another essential aspect of data manipulation is aggregating and summarising data. It involves
calculating summary statistics, such as the average, sum, minimum, or maximum values of a specific
variable. For instance, in sales data, you might aggregate the total revenue generated per product
category or calculate the average monthly sales.
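
A short sketch of that sales scenario; the columns ('category', 'month', 'revenue') are hypothetical:

import pandas as pd

sales = pd.DataFrame({
    'category': ['Books', 'Books', 'Toys', 'Toys'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'revenue': [1200, 1500, 800, 950],
})

# Total revenue per product category
revenue_per_category = sales.groupby('category')['revenue'].sum()

# Average revenue per month across all categories
monthly_average = sales.groupby('month')['revenue'].mean()

print(revenue_per_category)
print(monthly_average)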

DATA CLEANING AND IMPORTING

Reading data from various sources such as CSV files, Excel files, and databases is a common
task in data analysis. Here is how you can accomplish this using Pandas in Python.
Step 1: Set Up the Environment

Make sure you have Pandas installed. If not, install it using pip:

pip install pandas
Step 2: Import Pandas

Import Pandas in your Python script or Jupyter Notebook:

import pandas as pd
Reading Data from CSV

To read data from a CSV file, use the pd.read_csv function:

# Reading data from a CSV file
csv_file_path = 'path_to_your_file.csv'
df_csv = pd.read_csv(csv_file_path)
print(df_csv.head())
Reading Data from Excel

To read data from an Excel file, use the pd.read_excel function:

# Reading data from an Excel file
excel_file_path = 'path_to_your_file.xlsx'
df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')  # Specify sheet name if necessary
print(df_excel.head())
Reading Data from Databases

To read data from a database, you will need to use SQLAlchemy to establish a connection.
Install SQLAlchemy if you haven't:

pip install sqlalchemy

Then, use the create_engine function to establish a connection and pd.read_sql to read
the data:

from sqlalchemy import create_engine

# Replace with your actual database connection string
database_connection_string = 'dialect+driver://username:password@host:port/database'

# Create an engine
engine = create_engine(database_connection_string)

# Reading data from a SQL table
table_name = 'your_table_name'
df_sql = pd.read_sql(table_name, con=engine)
print(df_sql.head())

# Reading data using a SQL query
sql_query = 'SELECT * FROM your_table_name WHERE some_column > some_value'
df_sql_query = pd.read_sql(sql_query, con=engine)
print(df_sql_query.head())
Example: Combined Data Reading

Here's an example of reading data from CSV, Excel, and SQL sources in one script:

import pandas as pd
from sqlalchemy import create_engine

# Paths to your files
csv_file_path = 'path_to_your_file.csv'
excel_file_path = 'path_to_your_file.xlsx'

# Database connection string
database_connection_string = 'dialect+driver://username:password@host:port/database'
engine = create_engine(database_connection_string)

# Reading data from CSV
df_csv = pd.read_csv(csv_file_path)
print('CSV Data:')
print(df_csv.head())

# Reading data from Excel
df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')
print('Excel Data:')
print(df_excel.head())

# Reading data from SQL
table_name = 'your_table_name'
df_sql = pd.read_sql(table_name, con=engine)
print('SQL Data:')
print(df_sql.head())

DATA CLEANING TECHNIQUES

Handling Missing Values

Handling missing values is a critical step in data preprocessing and can significantly impact the
performance of machine learning models. There are several strategies for dealing with missing data
(a short sketch in code follows the list):
1. Remove Missing Values

 Row Deletion: Remove rows with missing values.
o Useful when the dataset is large and missing values are few.
 Column Deletion: Remove columns with missing values.
o Appropriate when a column has a high percentage of missing values and is less important.

2. Impute Missing Values

 Mean/Median/Mode Imputation: Replace missing values with the mean (for numerical data), median, or mode (for categorical data) of the column.
o Simple and fast, but can distort the data distribution.
 Forward/Backward Fill: Replace missing values with the previous/next value in the column.
o Suitable for time series data.
 Interpolation: Use linear or polynomial interpolation for numerical data.
 K-Nearest Neighbors (KNN) Imputation: Replace missing values with the average value of the nearest neighbors.
o Takes into account the similarity between data points but is computationally expensive.
 Multivariate Imputation by Chained Equations (MICE): Uses the relationships between features to predict missing values.
o More sophisticated and accurate but also computationally intensive.

3. Use Algorithms that Support Missing Values

 Some machine learning algorithms can handle missing values natively, such as certain
decision-tree, Random Forest, and gradient-boosting implementations (e.g., XGBoost).

4. Flag and Fill

 Create a new binary column to indicate whether the value was missing and then fill
the missing value using one of the imputation methods.
 Preserves information about missingness, which can be useful for some models.

5. Prediction Models

 Use machine learning models to predict and impute missing values based on other
features.
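
A minimal sketch of a few of these strategies using pandas and scikit-learn; the tiny DataFrame and its columns ('age', 'price') are invented for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({'age': [25.0, np.nan, 40.0, 35.0],
                   'price': [10.0, np.nan, 12.5, np.nan]})

# Flag and fill: record missingness before any imputation
df['age_was_missing'] = df['age'].isna().astype(int)

# Row deletion: drop rows containing any missing value
df_dropped = df.dropna()

# Mean imputation of a numeric column
df['age_mean'] = SimpleImputer(strategy='mean').fit_transform(df[['age']]).ravel()

# Forward fill, suitable for time-ordered data
df['price_ffill'] = df['price'].ffill()

# KNN imputation, using the similarity between rows
df[['age', 'price']] = KNNImputer(n_neighbors=2).fit_transform(df[['age', 'price']])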

HANDLING DUPLICATES:

Dealing with duplicates is a crucial aspect of data cleaning to ensure the quality and
reliability of the dataset. Here are some techniques to handle duplicates effectively (a brief sketch in code follows the list):

1. Identifying Duplicates

 Exact Duplicates: Rows that are completely identical across all columns.
 Partial Duplicates: Rows that are identical in specific key columns but may have different
values in other columns.

2. Removing Duplicates

 Remove Exact Duplicates: Use functions to identify and remove rows that are exact
duplicates.

3. Handling Partial Duplicates

 Prioritize Based on a Column: Keep the duplicate row based on the value of a particular
column (e.g., latest timestamp).
 Aggregation: Aggregate duplicate rows by summarizing or averaging numerical values.
 Manual Review: Sometimes manual review is necessary for critical data.

4. Combining Information from Duplicates

 Merge Information: Combine information from duplicate rows, especially if different rows
have different useful information.
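
A brief sketch of these techniques in pandas; the columns ('customer_id', 'amount', 'updated_at') are assumed for illustration:

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 2, 2],
    'amount': [100, 100, 50, 75],
    'updated_at': pd.to_datetime(['2024-01-01', '2024-01-01',
                                  '2024-02-01', '2024-03-01']),
})

# Identify exact duplicates (rows identical across all columns)
exact_dupes = df[df.duplicated()]

# Remove exact duplicates
df_unique = df.drop_duplicates()

# Partial duplicates: keep the latest record per customer_id
latest = df.sort_values('updated_at').drop_duplicates('customer_id', keep='last')

# Aggregation: average the amounts of duplicate customers instead
aggregated = df.groupby('customer_id', as_index=False)['amount'].mean()

Sorting by 'updated_at' before drop_duplicates(keep='last') is what implements the "keep the latest record" rule.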

HANDLING OUTLIERS:
Handling outliers is an essential part of data preprocessing, as outliers can significantly
impact the performance of statistical analyses and machine learning models. Here are several
techniques to identify and handle outliers:

Identifying Outliers

1. Statistical Methods
o Z-Score: Measures the number of standard deviations a data point is from the
mean.

from scipy.stats import zscore
import numpy as np

df['z_score'] = zscore(df['column_name'])
outliers = df[np.abs(df['z_score']) > 3]

o IQR (Interquartile Range): Identifies outliers based on the range between the first quartile (Q1) and the third quartile (Q3).

Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
2. Visual Methods
o Box Plot: Displays the distribution of data and highlights outliers.

import matplotlib.pyplot as plt

df['column_name'].plot.box()
plt.show()

o Scatter Plot: Useful for identifying outliers in a two-dimensional dataset.

df.plot.scatter(x='column_x', y='column_y')
plt.show()

3. Model-Based Methods
o Isolation Forest: Identifies anomalies by isolating observations.

from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.1)
df['anomaly_score'] = iso_forest.fit_predict(df[['column_name']])
outliers = df[df['anomaly_score'] == -1]
Handling Outliers

1. Removing Outliers
o Simply remove the outliers identified by statistical or visual methods.

df_cleaned = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]

2. Transforming Data
o Log Transformation: Reduces the impact of outliers.

df['log_column'] = np.log(df['column_name'] + 1)  # Adding 1 to avoid log(0)

o Square Root Transformation: Similar to log transformation, but less aggressive.

df['sqrt_column'] = np.sqrt(df['column_name'])

3. Capping/Flooring
o Limit the values of outliers to the nearest threshold (e.g., the upper or lower bound
of IQR).

df['column_name'] = np.where(df['column_name'] > upper_bound, upper_bound,
                             np.where(df['column_name'] < lower_bound, lower_bound,
                                      df['column_name']))

4. Imputation
o Replace outliers with statistical measures like mean or median.

median = df['column_name'].median()
df['column_name'] = np.where((df['column_name'] < lower_bound) | (df['column_name'] > upper_bound),
                             median, df['column_name'])

5. Model-Based Methods
o Use robust algorithms that are less sensitive to outliers, such as decision trees or
robust regression models.
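
As one possible illustration of the robust-regression idea, the sketch below compares ordinary least squares with scikit-learn's HuberRegressor on synthetic data containing a few injected outliers; the data and model choice are assumptions for this example:

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=100)
y[:5] += 50  # inject a few extreme outliers

# Ordinary least squares is pulled toward the outliers
ols = LinearRegression().fit(X, y)

# HuberRegressor down-weights the outlying observations
huber = HuberRegressor().fit(X, y)

print('OLS slope:  ', ols.coef_[0])
print('Huber slope:', huber.coef_[0])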
