
P.S.R.R COLLEGE OF ENGINEERING
Approved by AICTE & Affiliated to Anna University, Chennai.
Sevalpatti, Sivakasi-626140

Department of Computer Science and Engineering

Course Name: Artificial Intelligence


Project Title: Phase 2 (Data Wrangling and Analysis)

Our Team Members:


1) Madesh S
2) Gopinath M
3) Kavin kumar M
4) Jerry Aathavan
5) Mareeswaran S

Done by: M.Kavin Kumar


Introduction:
In the realm of supply chain optimization, data wrangling and analysis are
indispensable processes driving efficiency and informed decision-making. Data
wrangling involves collecting, organizing, and preparing raw data from various
supply chain sources for analysis. Subsequently, data analysis employs statistical
techniques, machine learning algorithms, and optimization models to extract
valuable insights from the structured data. By leveraging data wrangling and
analysis, companies can make informed decisions that enhance operational
performance and competitiveness. This paper examines the critical roles of data
wrangling and analysis in supply chain optimization, emphasizing their
transformative potential in driving efficiency and strategic decision-making.
Through real-world examples and practical applications, we illustrate how data-
driven approaches revolutionize supply chain management, enabling businesses to
adapt to dynamic market conditions and achieve sustainable growth.

Objectives:
1. Data Cleansing: Ensure the dataset's integrity by addressing inconsistencies,
errors, and missing values, enhancing its reliability for subsequent analysis.

2. Exploratory Data Analysis (EDA): Explore the dataset's characteristics
through EDA techniques, unraveling distributions, correlations, and patterns
within the data.

3. Feature Engineering: Enhance model performance by engineering relevant
features, enabling accurate content recommendations by the chatbot.

4. Documentation: Comprehensively document the data wrangling process to
ensure transparency and reproducibility of the analysis, detailing steps taken to
cleanse the data, conduct EDA, and engineer features, alongside explanations for
any decisions made during the process.
Dataset Description
The dataset consists of movie reviews sourced from IMDb, obtained from the Hugging
Face repository. It contains user-provided ratings indicating whether the review is
positive or negative. Each entry in the dataset represents a movie review and its
associated sentiment label, reflecting user opinions on the movies' quality.
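Since the reviews are hosted on the Hugging Face Hub, the dataset can also be pulled directly with the datasets library before being converted to a pandas DataFrame. The sketch below is a minimal illustration assuming the public "imdb" dataset identifier; the project code later in this report loads the same data from a local CSV file instead.

# Illustrative sketch: pulling the IMDb reviews from the Hugging Face Hub
# (assumes the public "imdb" dataset identifier; the project code below
# reads a local "IMDB Dataset.csv" instead).
from datasets import load_dataset

imdb = load_dataset("imdb")            # splits: train, test, unsupervised
df_hf = imdb["train"].to_pandas()      # columns: 'text' and 'label' (0 = negative, 1 = positive)
print(df_hf.head())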

1. Data Description
print("Data Description:")

Head: Displaying the first few rows of the dataset to get an initial overview.
print("\nHead:")
print(df.head())

Tail: Examining the last few rows of the dataset to ensure completeness.
print("\nTail:")
print(df.tail())

Info: Obtaining information about the dataset's structure, data types, and memory
usage.
print("\nInfo:")
print(df.info())
Describe: Generating descriptive statistics for numerical features to understand
their distributions and central tendencies.
print("\nDescribe:")
print(df.describe())

2. Null Data Handling


Null Data Identification
To identify missing values in the dataset, we'll use the isnull() method followed
by sum() to count the missing values in each column.
print(data.isnull().sum())
Output Screenshot:

Null Data Imputation


We'll fill missing values with appropriate strategies using methods like fillna().
data.fillna(data.mean(), inplace=True)
Null Data Removal
To eliminate rows with excessive missing values, we'll use dropna().
data.dropna(inplace=True)  # Drop rows with missing values
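Note that mean-based imputation only applies to numerical columns; for a review dataset that is mostly text, a more typical approach is to drop or flag rows with missing review text or labels. A minimal sketch, assuming the df, review, and sentiment names used in the full code later in this report:

# Minimal sketch for text data: drop rows missing either the review text or the
# sentiment label (df/'review'/'sentiment' assumed from the full code below).
import pandas as pd

df = df.dropna(subset=['review', 'sentiment'])

# Treat empty review strings as missing as well, then drop them.
df['review'] = df['review'].replace('', pd.NA)
df = df.dropna(subset=['review'])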

3. Data Validation
Data Integrity Check
To verify data consistency and integrity, we can check for unique values in a column.
print(data['column_name'].unique())

Output Screenshot:

Data Consistency Verification


We can ensure data consistency across different columns or datasets by
comparing them.

if data['column1'].equals(data['column2']):
    print("Data in column1 is consistent with column2")
else:
    print("Data in column1 is not consistent with column2")
4. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in understanding the characteristics
of the dataset. It involves analyzing individual variables, investigating relationships
between variables, and exploring patterns and trends in the data.

Univariate Analysis
Univariate analysis involves analyzing individual variables to understand their
distributions and characteristics.

# Univariate analysis - Histogram
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data['numerical_column'], bins=20)
plt.title('Histogram of Numerical Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
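Because this review dataset has no ready-made numerical column, a natural dataset-specific univariate view is the distribution of review lengths. A small sketch, assuming the df and review names used in the full code later in this report; review_length is a derived helper column:

# Sketch: distribution of review lengths in words (df/'review' assumed from
# the full code below; 'review_length' is a derived helper column).
df['review_length'] = df['review'].str.split().str.len()

sns.histplot(df['review_length'], bins=50)
plt.title('Distribution of Review Lengths (words)')
plt.xlabel('Words per review')
plt.ylabel('Frequency')
plt.show()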

Bivariate Analysis
Bivariate analysis investigates relationships between pairs of variables to identify
correlations and dependencies.

# Bivariate analysis - Scatter plot
sns.scatterplot(x=data['feature1'], y=data['feature2'])
plt.title('Scatter Plot of Feature1 vs Feature2')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.show()
Multivariate Analysis
Multivariate analysis explores interactions among multiple variables to uncover
complex patterns and trends.

# Multivariate analysis - Pair plot
sns.pairplot(data)
plt.suptitle('Pair Plot of the Data', y=1.02)
plt.show()

Output Screenshot:
5. Preprocessed Data
Preprocessing the data is a critical step in data analysis, especially when dealing with
textual data. It involves cleaning and transforming the data to make it suitable for
analysis. In this case, we'll preprocess textual data by converting it to lowercase,
removing punctuation, and eliminating stopwords.
# Sample code for preprocessing text data
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Join the tokens back into text
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

# Apply preprocessing to the 'review' column
df['clean_review'] = df['review'].apply(preprocess_text)

# Display preprocessed data
print("\nPreprocessed Data:")
print(df[['review', 'clean_review']].head())
Output Screenshot:

6. Model Evaluation
Model evaluation is crucial to assess the performance of a machine learning
model. We typically evaluate models using metrics such as accuracy, precision,
recall, F1-score, and confusion matrix.

# Import necessary libraries
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Make predictions on the test data
predictions = model.predict(X_test_tfidf)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
classification_rep = classification_report(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

# Print evaluation results
print("\nModel Evaluation:")
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_rep)
print("\nConfusion Matrix:")
print(conf_matrix)
Output Screenshot:
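To complement the printed metrics, the confusion matrix can also be visualized as a heatmap, which makes class-level errors easier to read. A brief sketch building on the conf_matrix and model objects from the code above (the label order follows model.classes_):

# Sketch: plot the confusion matrix computed above as an annotated heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()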

7. Assumed Scenario
Scenario: Leveraging natural language processing techniques, the project aims to
develop a chatbot capable of recommending personalized content to users based on
their historical interactions and preferences extracted from textual data.
Objective: Enhance user engagement and satisfaction by providing relevant and
tailored content recommendations through the chatbot, thereby enriching the user
experience.
Target Audience: Individuals interacting with the chatbot seeking personalized
content recommendations tailored to their interests and preferences across various
domains.
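As a rough illustration of how the trained sentiment model could feed this scenario, the sketch below classifies a user's message with the preprocess_text function, fitted TF-IDF vectorizer, and logistic regression model defined in the code section that follows. The recommend_content helper and its reply texts are hypothetical placeholders for the actual recommendation logic, which is outside the scope of this phase.

# Hypothetical sketch: recommend_content and its reply texts are illustrative
# placeholders; preprocess_text, tfidf_vectorizer, and model come from the
# code section below, after they have been defined and fitted.
def recommend_content(user_message, vectorizer, sentiment_model):
    features = vectorizer.transform([preprocess_text(user_message)])
    sentiment = sentiment_model.predict(features)[0]
    if sentiment == 'positive':
        return "Glad you enjoyed it! Here are some similar titles you might like."
    return "Sorry that one missed the mark. Here are some differently styled picks."

# Example usage once the pipeline below has been run:
# print(recommend_content("I loved the pacing and the acting", tfidf_vectorizer, model))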

Code:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

nltk.download('punkt')
nltk.download('stopwords')

df = pd.read_csv("IMDB Dataset.csv")

print("Data Description:")
print("\nHead:")
print(df.head())
print("\nTail:")
print(df.tail())
print("\nInfo:")
print(df.info())
print("\nDescribe:")
print(df.describe())

print("\nNull Data Handling:")


print("\nNull Data Identification:")
print(df.isnull().sum())

print("\nData Validation:")
print("\nData Integrity Check - Sentiment Column:")
print(df['sentiment'].value_counts())

print("\nExploratory Data Analysis (EDA):")


print("\nUnivariate Analysis - Distribution of Sentiment Labels:")
print(df['sentiment'].value_counts())

print("\nBivariate Analysis:")
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='sentiment', hue='sentiment', palette='Set2', legend=False)
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    text = ' '.join(tokens)
    return text

df['clean_review'] = df['review'].apply(preprocess_text)

print("\nPreprocessed Data:")
print(df[['review', 'clean_review']].head())

# Splitting the data
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

X_train = train_data['clean_review'].values
y_train = train_data['sentiment'].values

X_test = test_data['clean_review'].values
y_test = test_data['sentiment'].values

# Vectorizing the text data
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Training the model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Evaluating the model
from sklearn.metrics import classification_report, accuracy_score

predictions = model.predict(X_test_tfidf)

print("\nModel Evaluation:")
print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))
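Beyond the pipeline above, the fitted vectorizer and model would need to be persisted so the chatbot scenario can reuse them without retraining. The following is an assumed addition using joblib, not part of the project code or the output shown next:

# Assumed addition: persist the fitted vectorizer and model for reuse by the chatbot.
import joblib

joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.joblib')
joblib.dump(model, 'sentiment_model.joblib')

# Later, e.g. inside the chatbot service:
# tfidf_vectorizer = joblib.load('tfidf_vectorizer.joblib')
# model = joblib.load('sentiment_model.joblib')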

Output:
Conclusion:
In summary, data wrangling and analysis are indispensable components of supply
chain optimization, enabling businesses to extract actionable insights from complex
datasets. Through statistical analysis, machine learning, and optimization techniques,
companies can streamline operations, reduce costs, and enhance customer satisfaction.
The transformative potential of data-driven approaches is evident in their ability to
drive informed decision-making and improve overall supply chain efficiency. Moving
forward, continued investment in data analytics capabilities will be essential for
businesses to remain competitive in today's fast-paced market environment. By
embracing data-driven strategies, organizations can unlock new opportunities for
growth and innovation, ensuring long-term success in an ever-evolving marketplace.
With a proactive approach to data utilization, businesses can stay agile and responsive
to emerging trends, solidifying their position as industry leaders.
