
PERSONALIZED CONTENT

RECOMMENDATION IN BOOK
PHASE 2 SUBMISSION
College Code: 8100
College Name: University College of Engineering, BIT Campus, Anna University, Tiruchirappalli-620 024.
Technology: AI
Total number of students in the group: 5
Students' details within the group:
1. Viththagi K - 810022205057
2. Sibani Selvi P - 810022205056
3. Arun J - 8100222053301
4. Ranjith M C - 810022205304
5. Gautham R A - 810022205303

Submitted by,
GAUTHAM R A, au810022205303
PHASE 2 DOCUMENT: DATA WRANGLING AND ANALYSIS
Introduction:
Phase 2 of our project is dedicated to data wrangling and analysis, critical steps in preparing the raw dataset for building an AI tool that detects fraudulent online transactions. This phase employs various data manipulation techniques in Python to clean, transform, and explore the dataset. Additionally, we assume a scenario in which the tool warns users about potentially fraudulent transactions before they initiate them.
Objectives:
1. Cleanse the dataset by addressing inconsistencies, errors,
and missing values to ensure data integrity.
2. Explore the dataset's characteristics through exploratory
data analysis (EDA) to understand distributions and
correlations.
3. Engineer relevant features to enhance model performance for accurate detection of fraudulent transactions.

Dataset Description:
A dataset for building an AI tool for detecting fraudulent online transactions typically includes information about both the transactions themselves and the users' account details. The fraud1.csv file contains the following feature variables:
1. step
2. type
3. amount
4. nameOrig
5. oldbalanceOrg
6. newbalanceOrig
7. nameDest
8. oldbalanceDest
9. newbalanceDest
10. isFraud
11. isFlaggedFraud
Data Wrangling Techniques:
1.Data Description:
➢ Head: The head() function displays the top rows of a dataset.
➢ Tail: The tail() function displays the bottom rows of a dataset.
➢ Info: The info() method prints information about the dataset, including data types, memory usage, and column labels.
➢ Describe: The describe() method calculates summary statistics such as percentiles, mean, and standard deviation for the numerical columns.

Code:
#Data Description
import pandas as pd
import numpy as np

data = pd.read_csv("/content/fraud1.csv")
print(data.head())      # first five rows
print(data.tail())      # last five rows
data.info()             # column names, dtypes, non-null counts, memory usage
print(data.describe())  # summary statistics for the numerical columns

Output:
#head:

#tail:
#info:

#describe:

2.Null Data Handling:
➢ Null data identification: Finding missing or empty values within the dataset.
➢ Null data imputation: Filling in missing values within the dataset (a column-wise sketch follows the outputs below).
➢ Null data removal: Eliminating rows or columns containing missing values from the dataset.
Code:
#Null Data Handling
data.isnull()        # boolean mask: True where a value is missing
data.notnull()       # boolean mask: True where a value is present
data.isnull().sum()  # number of missing values in each column
data.dropna()        # drop rows with missing values (returns a copy)
data.fillna(0)       # replace missing values with 0 (returns a copy)
Output:
#isnull():

#notnull():

#isnull().sum():

#dropna():
#fillna(0):
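Filling every column with 0 is rarely the best choice. A common refinement, sketched below, imputes numeric columns with their median and categorical columns with their mode; the column selection logic here is an illustrative assumption, not part of the original code.
Code:
#Null data imputation by column type (illustrative sketch)
num_cols = data.select_dtypes(include="number").columns
data[num_cols] = data[num_cols].fillna(data[num_cols].median())
cat_cols = data.select_dtypes(include="object").columns
for col in cat_cols:
    data[col] = data[col].fillna(data[col].mode().iloc[0])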

3.Data Validation:
➢ Data integrity check: Verifying data consistency and integrity to eliminate errors (a fuller check is sketched after the outputs below).
➢ Data consistency verification: Ensuring data is consistent across the different columns of a dataset.
Code:
#Data Validation
data["type"].unique()           # should contain only the expected transaction types
data["oldbalanceOrg"].unique()  # scan for anomalous balance values
data["isFraud"].unique()        # the label should contain only 0 and 1

Output:
#type:

#oldBalanceOrg:
#isFraud:
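The unique() checks above look at one column at a time. An integrity check can also cover duplicate rows and cross-column consistency, for example whether the origin account's balances line up with the transaction amount. A minimal sketch; the bookkeeping rule and tolerance below are assumptions about how the data was generated:
Code:
#Data integrity check (illustrative sketch)
print("Duplicate rows:", data.duplicated().sum())
print("Negative amounts:", (data["amount"] < 0).sum())
# assumed rule: newbalanceOrig = oldbalanceOrg - amount for the origin account
violations = (data["oldbalanceOrg"] - data["amount"] - data["newbalanceOrig"]).abs() > 0.01
print("Balance rule violations:", violations.sum())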

4.Data Reshaping:
➢ Reshaping rows and columns: Restructuring the data to better suit analysis or visualization needs.
➢ Transposing data: Converting rows into columns and vice versa as needed.
Code:
#Data Reshaping
df_stacked = data.stack()            # pivot columns into a row MultiIndex
print(df_stacked.head(10))
df_unstacked = df_stacked.unstack()  # pivot the inner index level back into columns
print(df_unstacked.head(5))
df_melt = data.melt(id_vars=['type', 'isFraud'])  # wide to long format
print(df_melt.head(10))
transposed_data = data.T             # swap rows and columns
print(transposed_data)

Output:
#stacked():
#unstacked():

#melt():

#transpose():
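Another common reshaping operation, not used above, is pivot_table, which reshapes and aggregates in one step. A minimal sketch comparing mean transaction amounts across transaction types and fraud labels:
Code:
#Pivot table (illustrative sketch)
pivot = data.pivot_table(index="type", columns="isFraud", values="amount", aggfunc="mean")
print(pivot)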

5.Data Merging:
➢ Combining datasets: Merging multiple datasets or data sources to
enrich the information available for analysis.
➢ Joining data: Joining datasets based on common columns or keys.

Code:
#Data Merging
data1 = pd.read_csv("/content/crd.csv")
# inner join: keep only rows whose "type" value appears in both files
merged_data = pd.merge(data, data1, on="type", how="inner")
print(merged_data)

Output:
6.Data Aggregation:
➢ Grouping data: Grouping dataset rows based on specific criteria.
➢ Aggregating data: Computing summary statistics for grouped data.
Code:
#Data Aggregation
# mean and total transaction amount for each transaction type
aggregated_df = data.groupby('type').agg({'amount': ['mean', 'sum']})
print(aggregated_df)
#Data Groupby
mean_value = data.groupby('type')['amount'].mean()
sum_value = data.groupby('type')['amount'].sum()

print("Mean:", mean_value)
print("Sum:", sum_value)

Output:
#data aggregation:

#data groupby:
Data Analysis Techniques:
7.Exploratory Data Analysis (EDA):
➢ Univariate Analysis: Analysing individual variables to understand their distributions and characteristics.
➢ Bivariate Analysis: Investigating relationships between pairs of variables to identify correlations and dependencies.
➢ Multivariate Analysis: Exploring interactions among multiple
variables to uncover complex patterns and trends.
Code:
#Data Analysis Techniques
#Univariate Analysis
import matplotlib.pyplot as plt
import seaborn as sns
# distribution of the last 15 transaction amounts
sns.histplot(data['amount'].tail(15), bins=20)
plt.title("Univariate analysis")
plt.show()
#Bivariate Analysis
# scatter plot of amount against the origin account's old balance
x = data["amount"].head(10)
y = data["oldbalanceOrg"].head(10)
plt.scatter(x, y)
plt.title("Bivariate analysis")
plt.show()
#Multivariate Analysis
# pairwise scatter plots across the numeric columns of a small sample
sns.pairplot(data.head(10))
plt.suptitle("Multivariate analysis", y=1.02)
plt.show()

Output:
#univariate analysis:

#bivariate analysis:

#multivariate analysis:
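A correlation heatmap is another common multivariate view: it condenses the pairwise linear relationships among all numeric columns into a single figure. A minimal sketch, reusing the seaborn and matplotlib imports above:
Code:
#Correlation heatmap (illustrative sketch)
corr = data.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()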
8.Feature Engineering:
➢ Creating user profiles: Aggregating user interaction data to construct comprehensive user profiles capturing preferences and behaviors.
➢ Temporal analysis: Incorporating temporal features such as time of day or day of week to capture temporal trends in user behavior.
➢ Content embeddings: Generating embeddings for content items to represent their characteristics and relationships (a minimal sketch follows the outputs below).
Code:
#Feature Engineering
# Creating user profiles: mean transaction amount per transaction type
user_profiles = data.groupby('type').agg({'amount': 'mean'})
print("User Profiles:")
print(user_profiles)
# Temporal analysis: per the dataset description, 'step' counts elapsed
# hours of the simulation, so step % 24 approximates the hour of day
data['hour'] = data['step'] % 24
print("\nTemporal Analysis (hour of day):")
print(data[['step', 'hour']].head())

Output:

#user profiles:

#temporal analysis:
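The content-embeddings idea from the bullet list above is not implemented in the code. One way to realize it, assuming the gensim library is installed, is to treat each origin account's sequence of transaction types as a "sentence" and train Word2Vec on those sequences; the parameters and the 'TRANSFER' lookup below are illustrative assumptions:
Code:
#Content embeddings (illustrative sketch; assumes gensim is installed)
from gensim.models import Word2Vec
# one "sentence" per origin account: its sequence of transaction types
sentences = data.groupby('nameOrig')['type'].apply(list).tolist()
model = Word2Vec(sentences, vector_size=16, window=3, min_count=1, seed=42)
print(model.wv['TRANSFER'])  # 16-dimensional vector for the TRANSFER type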

Assumed Scenario:
➢ Scenario: The project aims to build an AI tool that makes users aware of fraudulent online transactions.
➢ Objective: Enhance user engagement and satisfaction by detecting and flagging fraudulent transactions so that only legitimate ones proceed.
➢ Target Audience: Users of digital platforms who make online transactions.
Conclusion:
Phase 2 of the project focuses on data wrangling and analysis to prepare the dataset for building an AI tool that detects fraudulent online transactions. By employing Python-based data manipulation techniques and assuming a scenario focused on online fraud detection, we aim to transform raw data into actionable insights for enhancing user experience and engagement on digital platforms.
Dataset link: https://www.kaggle.com/datasets/jainilcoder/online-payment-fraud-detection
