CODE EXPLANATION FOR DATA MANIPULATION
Data manipulation refers to modifying, organizing, or analyzing data using programming
languages like Python, SQL, or R. Below are some key concepts with code explanations in
Python using pandas, a popular data manipulation library.
1. Importing Data
Before manipulating data, you need to import it.
import pandas as pd
# Load a CSV file
# Load a CSV file (replace 'data.csv' with your file path)
df = pd.read_csv('data.csv')
# Display the first 5 rows
print(df.head())
2. Selecting Specific Columns
You can select a single or multiple columns from a DataFrame.
# Select a single column
df['column_name']
# Select multiple columns
df[['column1', 'column2']]
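As a quick sketch (with a small made-up DataFrame), note that single-bracket selection returns a Series while double brackets return a DataFrame:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30], 'city': ['NY', 'LA']})

single = df['age']            # a Series
subset = df[['name', 'age']]  # a DataFrame with two columns

print(type(single).__name__)  # Series
print(list(subset.columns))   # ['name', 'age']
```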
3. Filtering Data
Filtering allows you to extract rows that meet specific conditions.
# Select rows where the value in 'age' column is greater than 30
df_filtered = df[df['age'] > 30]
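Conditions can also be combined with `&` (and) and `|` (or), each wrapped in parentheses. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical data, assumed for this sketch
df = pd.DataFrame({'age': [25, 35, 45], 'salary': [40000, 60000, 80000]})

over_30 = df[df['age'] > 30]
# Combine conditions with & and parentheses
well_paid = df[(df['age'] > 30) & (df['salary'] >= 70000)]

print(len(over_30))    # 2
print(len(well_paid))  # 1
```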
4. Sorting Data
Sorting helps in organizing the data based on one or more columns.
# Sort by a single column
df_sorted = df.sort_values(by='age', ascending=True)
# Sort by multiple columns
df_sorted = df.sort_values(by=['age', 'salary'], ascending=[True, False])
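A runnable sketch of multi-column sorting (sample data invented here): rows are ordered by age ascending, and ties on age are broken by salary descending.

```python
import pandas as pd

# Hypothetical data: two rows share age 30
df = pd.DataFrame({'age': [30, 25, 30], 'salary': [50000, 40000, 70000]})

# Age ascending, then salary descending within equal ages
df_sorted = df.sort_values(by=['age', 'salary'], ascending=[True, False])

print(df_sorted['salary'].tolist())  # [40000, 70000, 50000]
```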
5. Handling Missing Data
Missing data can be handled by filling or dropping missing values.
# Drop rows with missing values
df_cleaned = df.dropna()
# Fill missing values with a specific value
df_filled = df.fillna(0)
# Fill missing values with column mean
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
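The three approaches can be compared side by side on a tiny invented Series with one missing value:

```python
import pandas as pd
import numpy as np

# Hypothetical data with one missing value
df = pd.DataFrame({'age': [25, np.nan, 35]})

dropped = df.dropna()                              # remove the NaN row
filled = df.fillna(0)                              # replace NaN with 0
mean_filled = df['age'].fillna(df['age'].mean())   # replace NaN with the column mean

print(len(dropped))            # 2
print(filled['age'].tolist())  # [25.0, 0.0, 35.0]
print(mean_filled.tolist())    # [25.0, 30.0, 35.0]
```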
6. Creating New Columns
New columns can be derived from existing ones.
# Create a new column based on existing columns
df['total_salary'] = df['base_salary'] + df['bonus']
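A self-contained version of the same idea, with made-up salary figures:

```python
import pandas as pd

# Hypothetical salary data
df = pd.DataFrame({'base_salary': [50000, 60000], 'bonus': [5000, 3000]})

# Element-wise addition of two columns produces the new column
df['total_salary'] = df['base_salary'] + df['bonus']

print(df['total_salary'].tolist())  # [55000, 63000]
```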
7. Grouping Data
Grouping helps in aggregating data based on categorical values.
# Group by a column and calculate mean
df_grouped = df.groupby('department')['salary'].mean()
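A runnable groupby sketch (department and salary values are invented): each department's rows are collected and their salaries averaged.

```python
import pandas as pd

# Hypothetical employee data
df = pd.DataFrame({'department': ['IT', 'HR', 'IT', 'HR'],
                   'salary': [60000, 40000, 80000, 50000]})

mean_salary = df.groupby('department')['salary'].mean()

print(mean_salary['IT'])  # 70000.0
print(mean_salary['HR'])  # 45000.0
```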
8. Merging & Joining DataFrames
Combining data from multiple sources.
# Merge two DataFrames on a common column
df_merged = pd.merge(df1, df2, on='employee_id', how='inner')
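A small end-to-end merge example (both DataFrames are hypothetical): with `how='inner'`, only employee_ids present in both frames survive.

```python
import pandas as pd

# Hypothetical source tables
employees = pd.DataFrame({'employee_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
salaries = pd.DataFrame({'employee_id': [2, 3, 4], 'salary': [60000, 70000, 80000]})

# Inner join keeps only employee_ids present in both frames
merged = pd.merge(employees, salaries, on='employee_id', how='inner')

print(merged['employee_id'].tolist())  # [2, 3]
```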
9. Pivot Tables
Summarizing data in a tabular format.
df_pivot = df.pivot_table(values='sales', index='region', columns='month',
aggfunc='sum')
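A self-contained pivot-table sketch (regions, months, and sales figures invented): rows become regions, columns become months, and cells hold summed sales.

```python
import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({'region': ['North', 'North', 'South', 'South'],
                   'month': ['Jan', 'Feb', 'Jan', 'Feb'],
                   'sales': [100, 150, 200, 250]})

pivot = df.pivot_table(values='sales', index='region', columns='month', aggfunc='sum')

print(pivot.loc['North', 'Jan'])  # 100
print(pivot.loc['South', 'Feb'])  # 250
```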
10. Applying Functions
Using apply() to apply custom functions to rows or columns.
# Define a function
def convert_to_upper(text):
return text.upper()
# Apply function to a column
df['name'] = df['name'].apply(convert_to_upper)
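The same transformation can be written inline with a lambda; a minimal runnable sketch (names invented):

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'name': ['alice', 'bob']})

# Equivalent to applying a named function; for strings,
# the vectorized alternative is df['name'].str.upper()
df['name'] = df['name'].apply(lambda text: text.upper())

print(df['name'].tolist())  # ['ALICE', 'BOB']
```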
CODE EXPLANATIONS FOR FINDING MISSING VALUES
Handling Missing Data in Pandas
Missing values can cause issues in data analysis. In pandas, missing values are usually
represented as NaN (Not a Number). Below are different ways to find missing values in a dataset.
1. Checking for Missing Values:
To check if a dataset has missing values, use isnull() or notnull().
import pandas as pd
# Sample data with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
        'Age': [25, 30, None, 35, 40],
        'Salary': [50000, 60000, 55000, None, 70000]}
df = pd.DataFrame(data)
# Check for missing values in the DataFrame
print(df.isnull())
# Summary count of missing values in each column
print(df.isnull().sum())
# Check for non-missing values
print(df.notnull())
Explanation:
df.isnull() returns a Boolean DataFrame, showing True where values are missing.
df.isnull().sum() gives the count of missing values per column.
df.notnull() is the inverse, showing True for non-missing values.
2. Finding Rows with Missing Values
To identify rows that contain at least one missing value:
# Filter rows where at least one column has a missing value
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows)
Explanation:
df.isnull().any(axis=1) checks if any column in a row has NaN.
df[condition] selects only those rows.
3. Finding the Percentage of Missing Values
To get the percentage of missing values per column:
# Calculate percentage of missing values
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(missing_percentage)
Explanation:
df.isnull().sum() gives the number of missing values per column.
Dividing by len(df) and multiplying by 100 gives the percentage.
4. Finding Total Missing Values in the DataFrame
To get the total number of missing values in the entire dataset:
# Total missing values in the DataFrame
total_missing = df.isnull().sum().sum()
print("Total missing values:", total_missing)
Explanation:
The first .sum() calculates missing values per column.
The second .sum() gives the total across the entire DataFrame.
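As a quick end-to-end check, the snippets above can be combined into one self-contained sketch (re-creating the sample data so it runs on its own):

```python
import pandas as pd

# Same sample data as in the section above
data = {'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
        'Age': [25, 30, None, 35, 40],
        'Salary': [50000, 60000, 55000, None, 70000]}
df = pd.DataFrame(data)

per_column = df.isnull().sum()                         # missing count per column
missing_rows = df[df.isnull().any(axis=1)]             # rows with at least one NaN
missing_percentage = (df.isnull().sum() / len(df)) * 100
total_missing = df.isnull().sum().sum()

print(per_column.tolist())        # [1, 1, 1]
print(len(missing_rows))          # 2
print(missing_percentage['Age'])  # 20.0
print(total_missing)              # 3
```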
CODE EXPLANATION FOR FINDING UNIQUE VALUES
Finding Unique Values in a Dataset (Pandas)
Uniqueness in data helps identify distinct values in a column, which is useful for tasks like data
cleaning, categorization, and analysis.
1. Finding Unique Values in a Column
You can use .unique() to get distinct values in a specific column.
import pandas as pd
# Sample data
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
        'Values': [10, 20, 10, 30, 20, 30, 40]}
df = pd.DataFrame(data)
# Get unique values in the 'Category' column
unique_categories = df['Category'].unique()
print(unique_categories)
Explanation:
df['Category'].unique() returns a NumPy array of unique values.
Output:
['A' 'B' 'C']
2. Counting Unique Values in a Column
To count how many unique values exist in a column, use .nunique().
# Count unique values in the 'Category' column
unique_count = df['Category'].nunique()
print(unique_count)
Output:
3
3. Counting Frequency of Unique Values
To get the count of each unique value, use .value_counts().
# Count occurrences of each unique value
value_counts = df['Category'].value_counts()
print(value_counts)
Output:
A 3
B 2
C 2
Name: Category, dtype: int64
Explanation:
df['Category'].value_counts() returns a Series with counts of each unique value.
4. Finding Unique Pairs in Multiple Columns
If you want to find unique combinations across multiple columns:
# Get unique rows based on 'Category' and 'Values'
unique_pairs = df[['Category', 'Values']].drop_duplicates()
print(unique_pairs)
Explanation:
.drop_duplicates() removes duplicate rows, keeping only unique ones.
5. Checking If All Values in a Column Are Unique
To check whether all values in a column are unique:
is_unique = df['Values'].is_unique
print(is_unique)
Output:
False
Explanation:
.is_unique returns True if all values in the column are distinct, otherwise False.
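The snippets above can be combined into one runnable sketch (re-creating the section's sample data):

```python
import pandas as pd

# Same sample data as in the section above
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
        'Values': [10, 20, 10, 30, 20, 30, 40]}
df = pd.DataFrame(data)

unique_categories = df['Category'].unique()
unique_count = df['Category'].nunique()
value_counts = df['Category'].value_counts()
unique_pairs = df[['Category', 'Values']].drop_duplicates()

print(sorted(unique_categories))  # ['A', 'B', 'C']
print(unique_count)               # 3
print(value_counts['A'])          # 3
print(len(unique_pairs))          # 4 distinct (Category, Values) pairs
print(df['Values'].is_unique)     # False
```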
CODE EXPLANATIONS FOR FINDING NaN VALUES
Finding NaN (Missing) Values in Pandas
In pandas, missing values are represented as NaN (Not a Number). Below are different ways to
find and analyze NaN values in a DataFrame.
1. Checking for NaN Values
To check if a dataset contains NaN values, use .isnull() or .isna().
import pandas as pd
import numpy as np
# Sample data with NaN values
data = {'Name': ['Alice', 'Bob', np.nan, 'David', 'Eve'],
        'Age': [25, np.nan, 30, 35, 40],
        'Salary': [50000, 60000, np.nan, 70000, np.nan]}
df = pd.DataFrame(data)
# Check for NaN values in the entire DataFrame
print(df.isnull())
# Equivalent to isnull()
print(df.isna())
Explanation:
df.isnull() returns a Boolean DataFrame, where True means the value is NaN.
df.isna() does the same as .isnull(); the two are interchangeable.
2. Counting NaN Values Per Column
To find the number of missing values in each column:
# Count NaN values per column
print(df.isnull().sum())
Output:
Name 1
Age 1
Salary 2
dtype: int64
Explanation:
df.isnull().sum() counts NaN values for each column.
3. Counting Total NaN Values in the DataFrame
To count all missing values in the entire dataset:
# Total number of NaN values
print(df.isnull().sum().sum())
Output:
4
Explanation:
The first .sum() counts NaNs per column.
The second .sum() gives the total NaNs across all columns.
4. Finding Rows with NaN Values
To get only the rows containing at least one NaN value:
# Get rows where at least one column has NaN
print(df[df.isnull().any(axis=1)])
Explanation:
df.isnull().any(axis=1) checks if any column in a row has NaN.
df[condition] selects those rows.
5. Finding Rows Where All Values Are NaN
To check for rows where all columns are NaN:
# Get rows where all values are NaN
print(df[df.isnull().all(axis=1)])
Explanation:
df.isnull().all(axis=1) checks if all columns in a row are NaN.
6. Finding Columns That Contain NaN
To list columns that have missing values:
# List columns with NaN values
columns_with_nan = df.columns[df.isnull().any()].tolist()
print(columns_with_nan)
Output:
['Name', 'Age', 'Salary']
Explanation:
df.isnull().any() checks for NaNs in each column.
df.columns[...] extracts column names where True.
7. Checking If a DataFrame Has Any NaN Values
To quickly check if there are any NaN values in the DataFrame:
# Check if any NaN exists in DataFrame
print(df.isnull().values.any())
Output:
True
Explanation:
df.isnull().values converts the result to a NumPy array of True/False.
.any() returns True if at least one NaN exists.
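As a final check, all of the NaN-finding techniques above can be run together on the section's sample data in one self-contained sketch:

```python
import pandas as pd
import numpy as np

# Same sample data as in the section above
data = {'Name': ['Alice', 'Bob', np.nan, 'David', 'Eve'],
        'Age': [25, np.nan, 30, 35, 40],
        'Salary': [50000, 60000, np.nan, 70000, np.nan]}
df = pd.DataFrame(data)

print(df.isnull().sum().tolist())              # [1, 1, 2]  NaNs per column
print(df.isnull().sum().sum())                 # 4          total NaNs
print(len(df[df.isnull().any(axis=1)]))        # 3          rows with any NaN
print(df.columns[df.isnull().any()].tolist())  # ['Name', 'Age', 'Salary']
print(df.isnull().values.any())                # True
```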