KAVIN R COLLEGE OF ENGINEERING
Approved by AICTE & Affiliated to Anna University, Chennai.
Sevalpatti, Sivakasi-626140
Objectives:
1. Data Cleansing: Ensure the dataset's integrity by addressing inconsistencies,
errors, and missing values, enhancing its reliability for subsequent analysis.
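As a brief illustration of this objective, a minimal cleansing sketch in pandas might look as follows (dropping problem rows is an assumed strategy chosen only for illustration, not a prescribed method):
# Minimal data-cleansing sketch; dropping rows is an assumption for illustration.
import pandas as pd
df = pd.read_csv("IMDB Dataset.csv")   # the dataset used later in this report
print(df.isnull().sum())               # count missing values per column
df = df.drop_duplicates()              # remove exact duplicate rows
df = df.dropna()                       # drop rows with remaining missing values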
1. Data Description
print("Data Description:")
Head: Displaying the first few rows of the dataset to get an initial overview.
print("\nHead:")
print(df.head())
Tail: Examining the last few rows of the dataset to ensure completeness.
print("\nTail:")
print(df.tail())
Info: Obtaining information about the dataset structure, data types, and memory
usage.
print("\nInfo:")
df.info()
Describe: Generating descriptive statistics for numerical features to understand
their distributions and central tendencies.
print("\nDescribe:")
print(df.describe())
3. Data Validation
Data Integrity Check
To verify data consistency and integrity, we can check for unique values in a column.
print(data['column_name'].unique())
Output Screenshot:
if data['column1'].equals(data['column2']):
    print("Data in column1 is consistent with column2")
else:
    print("Data in column1 is not consistent with column2")
4. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in understanding the characteristics
of the dataset. It involves analyzing individual variables, investigating relationships
between variables, and exploring patterns and trends in the data.
Univariate Analysis
Univariate analysis involves analyzing individual variables to understand their
distributions and characteristics.
# Univariate analysis - Histogram
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(data['numerical_column'], bins=20)
plt.title('Histogram of Numerical Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Bivariate Analysis
Bivariate analysis investigates relationships between pairs of variables to identify
correlations and dependencies.
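As a hedged sketch (the column names below are placeholders rather than fields of the actual dataset), a scatter plot is one common way to examine such a relationship:
# Bivariate analysis - Scatter plot (placeholder column names)
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(data=data, x='numerical_column1', y='numerical_column2')
plt.title('Scatter Plot of Two Numerical Columns')
plt.xlabel('Numerical Column 1')
plt.ylabel('Numerical Column 2')
plt.show()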
Output Screenshot:
5. Preprocessed Data
Preprocessing the data is a critical step in data analysis, especially when dealing with
textual data. It involves cleaning and transforming the data to make it suitable for
analysis. In this case, we'll preprocess textual data by converting it to lowercase,
removing punctuation, and eliminating stopwords.
# Sample code for preprocessing text data
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Join the tokens back into text
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
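For example, assuming the NLTK 'punkt' and 'stopwords' resources have been downloaded, the function behaves roughly as follows:
# Example usage (requires nltk.download('punkt') and nltk.download('stopwords'))
print(preprocess_text("This movie was AMAZING, but the ending felt rushed!"))
# Expected output along the lines of: movie amazing ending felt rushed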
6. Model Evaluation
Model evaluation is crucial to assess the performance of a machine learning
model. We typically evaluate models using metrics such as accuracy, precision,
recall, F1-score, and confusion matrix.
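As a minimal sketch with scikit-learn (the labels below are placeholders, not results from this project):
# Hedged sketch of common evaluation metrics; y_true and y_pred are
# placeholder labels, not actual project results.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
y_true = ['positive', 'negative', 'positive', 'negative']
y_pred = ['positive', 'positive', 'positive', 'negative']
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label='positive'))
print("Recall:", recall_score(y_true, y_pred, pos_label='positive'))
print("F1-score:", f1_score(y_true, y_pred, pos_label='positive'))
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))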
7. Assumed Scenario
Scenario: Leveraging natural language processing techniques, the project aims to
develop a chatbot capable of recommending personalized content to users based on
their historical interactions and preferences extracted from textual data.
Objective: Enhance user engagement and satisfaction by providing relevant and
tailored content recommendations through the chatbot, thereby enriching the user
experience.
Target Audience: Individuals interacting with the chatbot seeking personalized
content recommendations tailored to their interests and preferences across various
domains.
Code:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
nltk.download('punkt')
nltk.download('stopwords')
df = pd.read_csv("IMDB Dataset.csv")
print("Data Description:")
print("\nHead:")
print(df.head())
print("\nTail:")
print(df.tail())
print("\nInfo:")
df.info()
print("\nDescribe:")
print(df.describe())
print("\nData Validation:")
print("\nData Integrity Check - Sentiment Column:")
print(df['sentiment'].value_counts())
print("\nBivariate Analysis:")
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='sentiment', hue='sentiment', palette='Set2', legend=False)
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    text = ' '.join(tokens)
    return text
df['clean_review'] = df['review'].apply(preprocess_text)
print("\nPreprocessed Data:")
print(df[['review', 'clean_review']].head())
# Split the data into training and test sets (an 80/20 split is assumed here)
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
X_train = train_data['clean_review'].values
y_train = train_data['sentiment'].values
X_test = test_data['clean_review'].values
y_test = test_data['sentiment'].values
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
predictions = model.predict(X_test_tfidf)
print("\nModel Evaluation:")
print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))
Output:
Conclusion:
In summary, data wrangling and analysis are indispensable components of building a
reliable sentiment-analysis pipeline, enabling actionable insights to be extracted
from large collections of unstructured text. Through systematic data description,
validation, exploratory analysis, and preprocessing, the IMDB review dataset was
transformed into a form suitable for modelling, and a TF-IDF based logistic
regression classifier was trained and evaluated for sentiment prediction. The
transformative potential of these data-driven techniques is evident in their ability
to support informed decision-making, such as the personalized content
recommendations envisaged in the assumed chatbot scenario. Moving forward,
continued refinement of the preprocessing pipeline and evaluation of alternative
models will be essential to improve accuracy and user engagement. By embracing such
data-driven strategies, the system can stay responsive to evolving user preferences
and deliver an increasingly tailored experience.