
ML101 Graded Assignment 2.ipynb - Colab

The document presents a sentiment classification analysis of COVID-19 tweets using SVM, including exploratory data analysis, preprocessing, baseline model evaluation, and hyperparameter tuning. It also discusses a decision tree model for sales segmentation, detailing dataset preparation, model implementation, and feature importance interpretation. Key findings indicate that product placement, pricing, and advertising significantly influence sales outcomes.


setty bhavana

IITMCS_240681

Part 1: Sentiment Classification of COVID-19 Tweets


# 1.1 EDA
import matplotlib.pyplot as plt
import seaborn as sns

# `tweets` is assumed to have been loaded into a DataFrame in an earlier cell.
tweets.columns = [col.strip() for col in tweets.columns]
print(tweets['Sentiment'].value_counts())

plt.figure(figsize=(6, 4))
sns.countplot(data=tweets, x="Sentiment", order=tweets['Sentiment'].value_counts().index)
plt.title("Tweet Sentiment Distribution")
plt.xticks(rotation=45)
plt.show()

Sentiment
Negative 1041
Positive 947
Neutral 619
Extremely Positive 599
Extremely Negative 592
Name: count, dtype: int64

# 1.2 Preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = tweets['OriginalTweet'].astype(str)
y = tweets['Sentiment']

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_tfidf = vectorizer.fit_transform(X)

# 70/15/15 stratified split into train / validation / test sets
X_train, X_temp, y_train, y_temp = train_test_split(X_tfidf, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# 1.3 Baseline SVM
from sklearn.svm import SVC
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

baseline = SVC(kernel="linear", probability=True, random_state=42)
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)
print("Baseline Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
ConfusionMatrixDisplay.from_estimator(baseline, X_test, y_test, cmap="Blues")
plt.show()

Baseline Classification Report:
                    precision    recall  f1-score   support

Extremely Negative       0.54      0.36      0.43        89
Extremely Positive       0.63      0.32      0.43        90
          Negative       0.37      0.54      0.44       156
           Neutral       0.57      0.41      0.47        93
          Positive       0.34      0.41      0.37       142

          accuracy                           0.42       570
         macro avg       0.49      0.41      0.43       570
      weighted avg       0.46      0.42      0.43       570

# 1.4 Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", "auto"]
}
grid = GridSearchCV(SVC(probability=True, random_state=42), param_grid, cv=3, scoring="f1_macro")
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
best_svm = grid.best_estimator_
y_pred_best = best_svm.predict(X_test)
print("Tuned Classification Report:")
print(classification_report(y_test, y_pred_best, zero_division=0))

Best parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}

Tuned Classification Report:
                    precision    recall  f1-score   support

Extremely Negative       0.53      0.54      0.53        89
Extremely Positive       0.49      0.43      0.46        90
          Negative       0.41      0.49      0.45       156
           Neutral       0.47      0.38      0.42        93
          Positive       0.37      0.36      0.36       142

          accuracy                           0.44       570
         macro avg       0.45      0.44      0.44       570
      weighted avg       0.44      0.44      0.44       570
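A quick side-by-side check of baseline vs. tuned macro-F1 (a small sketch, not part of the original notebook; it assumes the cells above have already been run):

from sklearn.metrics import f1_score

# Macro-averaged F1 on the held-out test split for both models
print("Baseline macro-F1:", round(f1_score(y_test, y_pred, average="macro"), 3))
print("Tuned macro-F1:", round(f1_score(y_test, y_pred_best, average="macro"), 3))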

# 1.5 ROC and Precision-Recall Curves
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

classes = sorted(y.unique())
y_test_bin = label_binarize(y_test, classes=classes)

# Wrap the estimator in OneVsRestClassifier for multi-class ROC
classifier = OneVsRestClassifier(best_svm)
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

plt.figure(figsize=(8, 6))
# Plot a ROC curve per class
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"ROC curve of class {cls} (area = {roc_auc:0.2f})")

plt.plot([0, 1], [0, 1], "k--")


plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves")
plt.legend(loc="lower right")
plt.show()

plt.figure(figsize=(8, 6))
# Plot a Precision-Recall curve per class
for i, cls in enumerate(classes):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_score[:, i])
    average_precision = average_precision_score(y_test_bin[:, i], y_score[:, i])
    plt.plot(recall, precision, label=f"PR curve of class {cls} (AP = {average_precision:0.2f})")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curves")
plt.legend(loc="lower left")
plt.show()

Part 2: Decision Tree for Sales Segmentation

# Step 1: Load the Dataset
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("/content/Company_Data.csv")
df.head()

Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
0 9.50 138 73 11 276 120 Bad 42 17 Yes Yes

1 11.22 111 48 16 260 83 Good 65 10 Yes Yes

2 10.06 113 35 10 269 80 Medium 59 12 Yes Yes

3 7.40 117 100 4 466 97 Medium 55 14 Yes Yes

4 4.15 141 64 3 340 128 Bad 38 13 Yes No

# Prepare the Dataset

# Convert Sales into a binary 'High' sales label (1 = above the median)
df['High'] = df['Sales'].apply(lambda x: 1 if x > df['Sales'].median() else 0)
df.drop(columns=['Sales'], inplace=True)

# Encode categorical features
cat_cols = ['ShelveLoc', 'Urban', 'US']
le = LabelEncoder()
for col in cat_cols:
    df[col] = le.fit_transform(df[col])

# Define features and target
X = df.drop(columns='High')
y = df['High']

X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=42)

df.head()

CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US High
0 138 73 11 276 120 0 42 17 1 1 1

1 111 48 16 260 83 1 65 10 1 1 1

2 113 35 10 269 80 2 59 12 1 1 1

3 117 100 4 466 97 2 55 14 1 1 0

4 141 64 3 340 128 0 38 13 1 0 0

# Step 2: Decision Tree Components from Scratch
from collections import Counter
import numpy as np

def gini(y):
    counts = Counter(y)
    impurity = 1.0
    for lbl in counts:
        prob_of_lbl = counts[lbl] / len(y)
        impurity -= prob_of_lbl**2
    return impurity

def entropy(y):
    counts = Counter(y)
    impurity = 0.0
    for lbl in counts:
        prob_of_lbl = counts[lbl] / len(y)
        impurity -= prob_of_lbl * np.log2(prob_of_lbl + 1e-9)
    return impurity
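A quick sanity check of the two impurity functions on a toy label vector (a sketch, not part of the original notebook):

# A perfectly mixed binary vector should have maximal impurity
toy_labels = np.array([0, 0, 1, 1])
print(gini(toy_labels))     # 0.5
print(entropy(toy_labels))  # ~1.0 bit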

def split(X_col, threshold):
    left_idx = np.where(X_col <= threshold)[0]
    right_idx = np.where(X_col > threshold)[0]
    return left_idx, right_idx

def best_split(X, y, impurity_fn=gini):
    best_gain = -1
    best_col, best_thresh = None, None
    base_impurity = impurity_fn(y)

    for col in range(X.shape[1]):
        thresholds = np.unique(X[:, col])
        for t in thresholds:
            left_idx, right_idx = split(X[:, col], t)
            if len(left_idx) == 0 or len(right_idx) == 0:
                continue
            y_left, y_right = y[left_idx], y[right_idx]
            gain = base_impurity - (
                len(y_left)/len(y) * impurity_fn(y_left)
                + len(y_right)/len(y) * impurity_fn(y_right)
            )
            if gain > best_gain:
                best_gain = gain
                best_col = col
                best_thresh = t
    return best_col, best_thresh
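Likewise, best_split can be exercised on a tiny hypothetical two-feature matrix to confirm it picks the separating feature (again a sketch, not from the original notebook):

# Only column 1 separates the two classes here
X_toy = np.array([[5, 10], [3, 10], [5, 20], [3, 20]])
y_toy = np.array([0, 0, 1, 1])
print(best_split(X_toy, y_toy))  # expected: column index 1, threshold 10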

class TreeNode:
    def __init__(self, depth=0, max_depth=3):
        self.depth = depth
        self.max_depth = max_depth
        self.left = None
        self.right = None
        self.col = None
        self.thresh = None
        self.pred = None

    def fit(self, X, y):
        # Stop when the maximum depth is reached or the node is pure
        if self.depth == self.max_depth or len(set(y)) == 1:
            self.pred = Counter(y).most_common(1)[0][0]
            return

        col, thresh = best_split(X, y)
        if col is None:
            self.pred = Counter(y).most_common(1)[0][0]
            return

        self.col = col
        self.thresh = thresh
        left_idx, right_idx = split(X[:, col], thresh)

        self.left = TreeNode(depth=self.depth+1, max_depth=self.max_depth)
        self.left.fit(X[left_idx], y[left_idx])

        self.right = TreeNode(depth=self.depth+1, max_depth=self.max_depth)
        self.right.fit(X[right_idx], y[right_idx])

    def predict_one(self, x):
        # A leaf stores a class prediction; otherwise descend by the split rule
        if self.pred is not None:
            return self.pred
        if x[self.col] <= self.thresh:
            return self.left.predict_one(x)
        else:
            return self.right.predict_one(x)

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])
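A minimal usage sketch for the scratch tree (not part of the original notebook; it assumes the train/test split prepared in Step 1 above):

from sklearn.metrics import accuracy_score

# Fit the hand-rolled tree on the training split and score it on the test split
scratch_tree = TreeNode(max_depth=3)
scratch_tree.fit(X_train, y_train)
y_pred_scratch = scratch_tree.predict(X_test)
print("Scratch tree test accuracy:", round(accuracy_score(y_test, y_pred_scratch), 3))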

# Step 4: Interpretation with Feature Importance
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

importances = pd.Series(clf.feature_importances_, index=df.columns.drop("High"))
top_3 = importances.sort_values(ascending=False).head(3)

print("\nTop 3 Predictors for High Sales:")
print(top_3)

Top 3 Predictors for High Sales:
Price          0.626186
ShelveLoc      0.292196
Advertising    0.063104
dtype: float64
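A horizontal bar chart of all the importances (a quick sketch, not part of the original output) makes the full ranking easy to inspect:

# Plot every feature importance from the fitted sklearn tree
importances.sort_values().plot(kind="barh", figsize=(6, 4), title="Decision Tree Feature Importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()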

# Step 5: Interpretation

print("""
Interpretation:
The top three predictors for high sales in this dataset are:
1. Price: competitive pricing has the strongest influence on whether sales are high.
2. ShelveLoc: in-store shelf placement of the product is the second most important driver.
3. Advertising: marketing investment is a key driver of customer interest and conversions.
This implies that competitive pricing, strategic in-store positioning, and advertising are vital to achieving high sales.
""")

Interpretation:
The top three predictors for high sales in this dataset are:
1. Price: competitive pricing has the strongest influence on whether sales are high.
2. ShelveLoc: in-store shelf placement of the product is the second most important driver.
3. Advertising: marketing investment is a key driver of customer interest and conversions.
This implies that competitive pricing, strategic in-store positioning, and advertising are vital to achieving high sales.
