
P.S.R.R COLLEGE OF ENGINEERING
Approved by AICTE & Affiliated to Anna University, Chennai.
Sevalpatti, Sivakasi-626140

Department of Computer Science and Engineering

Course Name: Artificial Intelligence


Project Title: Phase 2 (Data Wrangling and Analysis)

Our Team Members:


1) Madesh S
2) Gopinath M
3) Kavin kumar M
4) Jerry Aathavan
5) Mareeswaran S

Done by: M.Kavin Kumar


Introduction:
In the realm of supply chain optimization, data wrangling and analysis are
indispensable processes driving efficiency and informed decision-making. Data
wrangling involves collecting, organizing, and preparing raw data from various
supply chain sources for analysis. Subsequently, data analysis employs statistical
techniques, machine learning algorithms, and optimization models to extract
valuable insights from the structured data. By leveraging data wrangling and
analysis, companies can make informed decisions that enhance operational
performance and competitiveness. This paper examines the critical roles of data
wrangling and analysis in supply chain optimization, emphasizing their
transformative potential in driving efficiency and strategic decision-making.
Through real-world examples and practical applications, we illustrate how data-
driven approaches revolutionize supply chain management, enabling businesses to
adapt to dynamic market conditions and achieve sustainable growth.

Objectives:
1. Data Cleansing: Ensure the dataset's integrity by addressing inconsistencies,
errors, and missing values, enhancing its reliability for subsequent analysis.

2. Exploratory Data Analysis (EDA): Explore the dataset's characteristics
through EDA techniques, unraveling distributions, correlations, and patterns
within the data.

3. Feature Engineering: Enhance model performance by engineering relevant
features, enabling accurate content recommendations by the chatbot.

4. Documentation: Comprehensively document the data wrangling process to
ensure transparency and reproducibility of the analysis, detailing steps taken to
cleanse the data, conduct EDA, and engineer features, alongside explanations for
any decisions made during the process.
Dataset Description
The dataset consists of movie reviews sourced from IMDb, obtained from the Hugging
Face repository. It contains user-provided ratings indicating whether the review is
positive or negative. Each entry in the dataset represents a movie review and its
associated sentiment label, reflecting user opinions on the movies' quality.
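Since the reviews are hosted on the Hugging Face Hub, the dataset can also be pulled directly with the datasets library before being converted to a pandas DataFrame. The sketch below is a minimal illustration assuming the public "imdb" dataset identifier; the project code later in this report loads the same data from a local CSV file instead.

# Illustrative sketch: pulling the IMDb reviews from the Hugging Face Hub
# (assumes the public "imdb" dataset identifier; the project code below
# reads a local "IMDB Dataset.csv" instead).
from datasets import load_dataset

imdb = load_dataset("imdb")            # splits: train, test, unsupervised
df_hf = imdb["train"].to_pandas()      # columns: 'text' and 'label' (0 = negative, 1 = positive)
print(df_hf.head())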

1. Data Description
print("Data Description:")

Head: Displaying the first few rows of the dataset to get an initial overview.
print("\nHead:")
print(df.head())

Tail: Examining the last few rows of the dataset to ensure completeness.
print("\nTail:")
print(df.tail())

Info: Obtaining information about the dataset's structure, data types, and memory
usage.
print("\nInfo:")
print(df.info())
Describe: Generating descriptive statistics for numerical features to understand
their distributions and central tendencies.
print("\nDescribe:")
print(df.describe())

2. Null Data Handling


Null Data Identification
To identify missing values in the dataset, we'll use the isnull() method followed
by sum() to count the missing values in each column.
print(data.isnull().sum())
Output Screenshot:

Null Data Imputation


We'll fill missing values with appropriate strategies using methods like fillna().
data.fillna(data.mean(), inplace=True)
Null Data Removal
To eliminate rows with excessive missing values, we'll use dropna().
data.dropna(inplace=True)  # Drop rows with missing values
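Note that mean-based imputation only applies to numerical columns; for a review dataset that is mostly text, a more typical approach is to drop or flag rows with missing review text or labels. A minimal sketch, assuming the df, review, and sentiment names used in the full code later in this report:

# Minimal sketch for text data: drop rows missing either the review text or the
# sentiment label (df/'review'/'sentiment' assumed from the full code below).
import pandas as pd

df = df.dropna(subset=['review', 'sentiment'])

# Treat empty review strings as missing as well, then drop them.
df['review'] = df['review'].replace('', pd.NA)
df = df.dropna(subset=['review'])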

3. Data Validation
Data Integrity Check
To verify data consistency and integrity, we can check for unique values in a column.
print(data['column_name'].unique())

Output Screenshot:

Data Consistency Verification


We can ensure data consistency across different columns or datasets by
comparing them.

if data['column1'].equals(data['column2']):
    print("Data in column1 is consistent with column2")
else:
    print("Data in column1 is not consistent with column2")
4. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in understanding the characteristics
of the dataset. It involves analyzing individual variables, investigating relationships
between variables, and exploring patterns and trends in the data.

Univariate Analysis
Univariate analysis involves analyzing individual variables to understand their
distributions and characteristics.

# Univariate analysis - Histogram
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data['numerical_column'], bins=20)
plt.title('Histogram of Numerical Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
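Because this review dataset has no ready-made numerical column, a natural dataset-specific univariate view is the distribution of review lengths. A small sketch, assuming the df and review names used in the full code later in this report; review_length is a derived helper column:

# Sketch: distribution of review lengths in words (df/'review' assumed from
# the full code below; 'review_length' is a derived helper column).
df['review_length'] = df['review'].str.split().str.len()

sns.histplot(df['review_length'], bins=50)
plt.title('Distribution of Review Lengths (words)')
plt.xlabel('Words per review')
plt.ylabel('Frequency')
plt.show()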

Bivariate Analysis
Bivariate analysis investigates relationships between pairs of variables to identify
correlations and dependencies.

# Bivariate analysis - Scatter plot
sns.scatterplot(x=data['feature1'], y=data['feature2'])
plt.title('Scatter Plot of Feature1 vs Feature2')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.show()
Multivariate Analysis
Multivariate analysis explores interactions among multiple variables to uncover
complex patterns and trends.

# Multivariate analysis - Pair plot
sns.pairplot(data)
plt.suptitle('Pair Plot of the Data', y=1.02)
plt.show()

Output Screenshot:
5. Preprocessed Data
Preprocessing the data is a critical step in data analysis, especially when dealing with
textual data. It involves cleaning and transforming the data to make it suitable for
analysis. In this case, we'll preprocess textual data by converting it to lowercase,
removing punctuation, and eliminating stopwords.
# Sample code for preprocessing text data
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Join the tokens back into text
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

# Apply preprocessing to the 'review' column
df['clean_review'] = df['review'].apply(preprocess_text)

# Display preprocessed data
print("\nPreprocessed Data:")
print(df[['review', 'clean_review']].head())
Output Screenshot:

6. Model Evaluation
Model evaluation is crucial to assess the performance of a machine learning
model. We typically evaluate models using metrics such as accuracy, precision,
recall, F1-score, and confusion matrix.

# Import necessary libraries
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Make predictions on the test data
predictions = model.predict(X_test_tfidf)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
classification_rep = classification_report(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

# Print evaluation results
print("\nModel Evaluation:")
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_rep)
print("\nConfusion Matrix:")
print(conf_matrix)
Output Screenshot:
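To complement the printed metrics, the confusion matrix can also be visualized as a heatmap, which makes class-level errors easier to read. A brief sketch building on the conf_matrix and model objects from the code above (the label order follows model.classes_):

# Sketch: plot the confusion matrix computed above as an annotated heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()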

7. Assumed Scenario
Scenario: Leveraging natural language processing techniques, the project aims to
develop a chatbot capable of recommending personalized content to users based on
their historical interactions and preferences extracted from textual data.
Objective: Enhance user engagement and satisfaction by providing relevant and
tailored content recommendations through the chatbot, thereby enriching the user
experience.
Target Audience: Individuals interacting with the chatbot seeking personalized
content recommendations tailored to their interests and preferences across various
domains.
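As a rough illustration of how the trained sentiment model could feed this scenario, the sketch below classifies a user's message with the preprocess_text function, fitted TF-IDF vectorizer, and logistic regression model defined in the code section that follows. The recommend_content helper and its reply texts are hypothetical placeholders for the actual recommendation logic, which is outside the scope of this phase.

# Hypothetical sketch: recommend_content and its reply texts are illustrative
# placeholders; preprocess_text, tfidf_vectorizer, and model come from the
# code section below, after they have been defined and fitted.
def recommend_content(user_message, vectorizer, sentiment_model):
    features = vectorizer.transform([preprocess_text(user_message)])
    sentiment = sentiment_model.predict(features)[0]
    if sentiment == 'positive':
        return "Glad you enjoyed it! Here are some similar titles you might like."
    return "Sorry that one missed the mark. Here are some differently styled picks."

# Example usage once the pipeline below has been run:
# print(recommend_content("I loved the pacing and the acting", tfidf_vectorizer, model))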

Code:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

nltk.download('punkt')
nltk.download('stopwords')

df = pd.read_csv("IMDB Dataset.csv")

print("Data Description:")
print("\nHead:")
print(df.head())
print("\nTail:")
print(df.tail())
print("\nInfo:")
print(df.info())
print("\nDescribe:")
print(df.describe())

print("\nNull Data Handling:")


print("\nNull Data Identification:")
print(df.isnull().sum())

print("\nData Validation:")
print("\nData Integrity Check - Sentiment Column:")
print(df['sentiment'].value_counts())

print("\nExploratory Data Analysis (EDA):")


print("\nUnivariate Analysis - Distribution of Sentiment Labels:")
print(df['sentiment'].value_counts())

print("\nBivariate Analysis:")
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='sentiment', hue='sentiment', palette='Set2', legend=False)
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    text = ' '.join(tokens)
    return text

df['clean_review'] = df['review'].apply(preprocess_text)

print("\nPreprocessed Data:")
print(df[['review', 'clean_review']].head())

# Splitting the data
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

X_train = train_data['clean_review'].values
y_train = train_data['sentiment'].values

X_test = test_data['clean_review'].values
y_test = test_data['sentiment'].values

# Vectorizing the text data
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Training the model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Evaluating the model
from sklearn.metrics import classification_report, accuracy_score

predictions = model.predict(X_test_tfidf)

print("\nModel Evaluation:")
print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))
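Beyond the pipeline above, the fitted vectorizer and model would need to be persisted so the chatbot scenario can reuse them without retraining. The following is an assumed addition using joblib, not part of the project code or the output shown next:

# Assumed addition: persist the fitted vectorizer and model for reuse by the chatbot.
import joblib

joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.joblib')
joblib.dump(model, 'sentiment_model.joblib')

# Later, e.g. inside the chatbot service:
# tfidf_vectorizer = joblib.load('tfidf_vectorizer.joblib')
# model = joblib.load('sentiment_model.joblib')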

Output:
Conclusion:
In summary, data wrangling and analysis are indispensable components of supply
chain optimization, enabling businesses to extract actionable insights from complex
datasets. Through statistical analysis, machine learning, and optimization techniques,
companies can streamline operations, reduce costs, and enhance customer satisfaction.
The transformative potential of data-driven approaches is evident in their ability to
drive informed decision-making and improve overall supply chain efficiency. Moving
forward, continued investment in data analytics capabilities will be essential for
businesses to remain competitive in today's fast-paced market environment. By
embracing data-driven strategies, organizations can unlock new opportunities for
growth and innovation, ensuring long-term success in an ever-evolving marketplace.
With a proactive approach to data utilization, businesses can stay agile and responsive
to emerging trends, solidifying their position as industry leaders.
