
INTEL AI PROJECT

Tejeshwini R
22BTDS94
“B” Sec
Data-Centric Approach

Logical planning is about the sequence of operations and steps needed to achieve a goal.
Steps:
1. Data Loading
The movie dataset was loaded from a CSV file.
2. Data Cleaning
We handled missing and overly short descriptions in the description column of the dataset.
3. Text Preprocessing
We converted the text to lowercase, removed URLs and special characters, tokenized it, removed stopwords, and lemmatized the remaining words.
4. Genre Processing
The genre strings were split and standardized into lists.
5. Feature Extraction
We used CountVectorizer for the text and added the description length as an extra feature.
6. Label Binarization
We converted the genre lists into a binary indicator format using MultiLabelBinarizer.
7. Train-Test Split
The data was divided into training and testing sets.
8. Model Training
We trained a RandomForestClassifier on the training data.
9. Evaluation
We generated predictions on the test set and calculated the accuracy.
10. Prediction Function
Given a new description provided by the user, the model can predict its genre.
Algorithm used:
We are using a classification algorithm (multi-label, since a movie can belong to several genres).
Model used:
We have used a Random Forest Classifier. A minimal code sketch of this data-centric pipeline is given below.
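
For reference, here is a minimal, self-contained sketch of the data-centric pipeline described in the steps above. The column names "description" and "genre", the comma-separated genre format, and the simplified preprocessing (no URL removal, stopword filtering via CountVectorizer only, no lemmatization) are assumptions for illustration, not the exact project code.

import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Data loading and basic cleaning
# (column names 'description' and 'genre' are assumed for illustration)
df = pd.read_csv("movie_genre_classifier_dataset.csv")
df = df.dropna(subset=["description", "genre"])

# Genre processing: split comma-separated genre strings into standardized lists
df["genre_list"] = df["genre"].str.split(",").apply(
    lambda gs: [g.strip().lower() for g in gs])

# Feature extraction: bag-of-words counts plus description length as an extra feature
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X_text = vectorizer.fit_transform(df["description"])
desc_len = csr_matrix(df["description"].str.len().values.reshape(-1, 1))
X = hstack([X_text, desc_len]).tocsr()

# Label binarization: genre lists -> binary indicator matrix
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["genre_list"])

# Train-test split, model training, and evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

RandomForestClassifier handles the multi-label indicator matrix directly, which is why no extra wrapper is needed around the model.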
Model-Centric Approach

Python Code (Model-Centric Movie Genre Classifier)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset (ensure the CSV is in your working directory)
df = pd.read_csv("movie_genre_classifier_dataset.csv")

# Combine title and plot into one feature
df['text'] = df['Title'] + " " + df['Plot']

# Features and labels
X = df['text']
y = df['Genre']

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(X)

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2,
                                                    random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)

Train the Model & Enable Prediction from User Input


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
df = pd.read_csv("movie_genre_classifier_dataset.csv")

# Combine title and plot into one feature
df['text'] = df['Title'] + " " + df['Plot']

# Features and labels
X = df['text']
y = df['Genre']

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2,
                                                    random_state=42)

# Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Data-Centric Enhancements
import re  # needed for the regex used in clean_text below

# Note: these enhancement steps assume the dataframe has 'movie_name',
# 'description', and 'genre' columns.

# 4a. Remove duplicates
df.drop_duplicates(subset=["description"], inplace=True)

# 4b. Drop rows with missing description or genre
df.dropna(subset=["description", "genre"], inplace=True)

# 4c. Normalize genre labels
df['genre'] = df['genre'].str.strip().str.lower()

# 4d. Combine movie name and description into one text field
df['text'] = df['movie_name'].astype(str) + " " + df['description'].astype(str)

# 4e. Preprocess text: lowercase and remove non-alphanumeric characters
def clean_text(text):
    return re.sub(r"[^a-zA-Z0-9\s]", "", text.lower())

df['text'] = df['text'].apply(clean_text)
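
Because these enhancements rewrite df['text'] after the earlier model was already fitted, the vectorizer and model should be re-fitted on the cleaned text before predicting from user input. A minimal sketch follows, assuming 'genre' is the label column as in the enhancement steps above; it reuses the imports loaded at the top of the script.

# Re-fit the TF-IDF vectorizer and Logistic Regression model on the cleaned text
# ('genre' is assumed to be the label column, matching the enhancement steps above)
vectorizer = TfidfVectorizer()
X_clean = vectorizer.fit_transform(df['text'])
y_clean = df['genre']

X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean,
                                                    test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy after cleaning:", accuracy_score(y_test, model.predict(X_test)))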

Predict Genre from User Input

# User input: movie description
user_input = input("Enter movie plot or description: ")

# Vectorize the input
user_vector = vectorizer.transform([user_input])

# Predict genre
predicted_genre = model.predict(user_vector)
print("Predicted Genre:", predicted_genre[0])

RESULT:
Enter movie plot or description: A spaceship crew lands on an alien planet and discovers a hidden danger.
Predicted Genre: Sci-Fi

COLAB LINK: https://colab.research.google.com/drive/1mbl6sXsu6pGQ_LUq0QGJDFzXA7W1AXrR?usp=sharing

DATASET:

Aspect       | Model-Centric AI                          | Data-Centric AI
Focus        | Improving the model                       | Improving the quality of the data
Techniques   | Try different ML models, hyperparameters  | Clean descriptions, correct genre labels
Your Example | Logistic Regression + TF-IDF              | User input genre prediction

Algorithm: Logistic Regression
Model: Machine Learning (Text Classification)
Planning Type: Logical Planning
Steps in Planning: Preprocessing, Vectorization, Training, Evaluation, Prediction
