Project
Internship Report on
“FULL STACK DEVELOPMENT”
By:
CERTIFICATE
Certified that the internship was carried out by LOVELESH DEBNATH, bearing USN 1BI21IS049, a
bona fide student of VI Semester B.E., BANGALORE INSTITUTE OF TECHNOLOGY, in
partial fulfillment of the requirements for the degree of Bachelor of Engineering in INFORMATION SCIENCE AND
ENGINEERING of VISVESVARAYA TECHNOLOGICAL UNIVERSITY, Belagavi,
during the year 2023-2024. It is certified that all corrections / suggestions indicated for Internal
Assessment have been incorporated in the report. The internship report has been approved as it satisfies
the academic requirements in respect of internship work prescribed for the said Degree.
USN: 1BI21IS049
Place: Bengaluru
Date:
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task would
be incomplete without acknowledging those who made it possible and whose guidance and
encouragement made my efforts successful. So, my sincere thanks go to all those who have
supported me in completing this Technical Seminar successfully.
My sincere thanks to Dr. M. U. Ashwath, Principal, BIT and Dr. Asha T, HOD, Department
of IS&E, BIT for their encouragement, support and guidance to the student community in
all fields of education. I am grateful to our institution for providing us with a congenial
atmosphere to carry out the Technical Seminar successfully.
I would also like to thank the team at Prinston Smart Engineers for their
encouragement and, moreover, for their timely support and guidance until the completion of the
Internship.
I extend my sincere thanks to all the department faculty members and non-teaching staff
for supporting me directly or indirectly in the completion of this Internship.
LOVELESH DEBNATH
1BI21IS049
LAPTOP SELECTION & PRICE PREDICTION INTERNSHIP PROJECT
CHAPTER 1
INTRODUCTION
Gathering past data in any form suitable for processing. The better the
quality of the data, the more suitable it will be for modeling.
Data Processing – Sometimes the collected data is in raw form and needs to
be pre-processed. Example: some tuples may have missing values for certain
attributes; in this case, they have to be filled with suitable values in order to
perform machine learning or any form of data mining. Missing values for
numerical attributes such as the price of the house may be replaced with the
mean value of the attribute, whereas missing values for categorical attributes
Dept.IS&E,BIT 2022-23 1
may be replaced with the mode of the attribute (its most frequent value). The choice invariably
depends on the types of filters we use. If the data is in the form of text or images,
it must be converted to numerical form, be it a list, an array, or a
matrix.
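The mean/mode filling described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name impute, the column names, and the sample records are hypothetical and are not taken from the project code.

```python
from statistics import mean, mode

def impute(rows, numeric_cols, categorical_cols):
    """Fill None entries: mean for numeric columns, mode for categorical ones."""
    filled = [dict(row) for row in rows]  # copy so the input is left untouched
    for col in numeric_cols:
        col_mean = mean(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = col_mean
    for col in categorical_cols:
        col_mode = mode(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = col_mode
    return filled

# Hypothetical records; None marks a missing value.
houses = [
    {"price": 100.0, "city": "Pune"},
    {"price": None, "city": "Pune"},
    {"price": 300.0, "city": None},
]
cleaned = impute(houses, ["price"], ["city"])
```

Here the missing price becomes the mean of the known prices and the missing city becomes the most frequent city.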
Divide the input data into training, cross-validation, and test sets. A
commonly used ratio between the respective sets is 6:2:2.
Building models with suitable algorithms and techniques on the training set.
Testing our conceptualized model with data that was not fed to the
model at the time of training, and evaluating its performance using metrics
such as F1 score, precision, and recall.
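A 6:2:2 division of this kind might look like the sketch below. The helper name split_dataset, the fixed seed, and the use of plain lists are assumptions made for illustration; any remainder left over by rounding goes to the test set.

```python
import random

def split_dataset(rows, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into training, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(100))
```

Shuffling before cutting keeps each subset representative when the original rows are ordered, for example by price or brand.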
o Linear Algebra
o Calculus
o Graph theory
Based on the methods and way of learning, machine learning is divided mainly into
four types, which are:
o Supervised Learning
o Unsupervised Learning
o Semi-Supervised Learning
o Reinforcement Learning
The main goal of the supervised learning technique is to map the input variable (x)
to the output variable (y). Some real-world applications of supervised learning are risk
assessment, fraud detection, spam filtering, etc.
Supervised machine learning can be classified into two types of problems, which are
given below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems in which the
output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The
classification algorithms predict the categories present in the dataset. Some real-world
applications of classification algorithms are spam detection, email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is
continuous and is modeled as a function of the input variables; the relationship may be linear
or non-linear. They are used to predict continuous output
variables, such as market trends, weather prediction, etc.
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea
about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
In unsupervised learning, the models are trained with data that is neither classified
nor labeled, and the model acts on that data without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the
unsorted dataset according to its similarities, patterns, and differences. Machines are
instructed to find the hidden patterns in the input dataset.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups in the
data. It is a way to group objects into clusters such that the objects with the most
similarities remain in one group and have few or no similarities with the objects of other
groups. An example of clustering is grouping customers by their purchasing
behavior.
2) Association
Some popular algorithms for association rule learning are the Apriori algorithm, Eclat, and the
FP-growth algorithm.
Advantages:
o These algorithms can be used for more complicated tasks than the supervised ones
because they work on unlabeled datasets.
o Unsupervised algorithms are preferable for various tasks, as getting an unlabeled
dataset is easier than getting a labelled one.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not
labelled and the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with an unlabelled
dataset that does not map to an output.
Semi-Supervised Learning
Advantages:
o It is highly efficient.
Disadvantages:
o Accuracy is low.
Reinforcement Learning
In reinforcement learning, unlike supervised learning, there is no labeled data, and agents
learn from their experiences only.
The reinforcement learning process is similar to how a human being learns; for example, a child
learns various things through experiences in day-to-day life. An example of reinforcement
learning is playing a game, where the game is the environment, the moves of an agent at each
step define states, and the goal of the agent is to get a high score. The agent receives feedback in
terms of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such
as game theory, operations research, information theory, and multi-agent systems.
o Robotics: RL is widely used in robotics applications. Robots are used in the
industrial and manufacturing area, and these robots are made more powerful with
reinforcement learning. Different industries have a vision of building
intelligent robots using AI and machine learning technology.
o Text Mining: Text mining, one of the great applications of NLP, is now being
implemented with the help of reinforcement learning by Salesforce.
Advantages
o The learning model of RL is similar to the learning of human beings; hence highly
accurate results can often be achieved.
Disadvantage
o Too much reinforcement learning can lead to an overload of states, which can weaken
the results.
o The curse of dimensionality limits reinforcement learning for real physical systems.
Mean Absolute Error: It’s the mean of the absolute differences between the estimated
values and the true values. It gives less weight to outlier errors than
the mean squared error does.
Smooth Absolute Error: It’s the square of the difference between the estimated
value and the true value for predictions lying close to the real value, and it’s
the absolute difference between the estimated and the true values for the
outliers (or points far off from the true values). Essentially, it’s a
combination of MSE and MAE.
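The three error measures can be compared on a small example. The sketch below assumes the common Huber-style convention for the smooth absolute error (quadratic when the error is at most delta, linear beyond it); the threshold delta = 1.0 and the sample values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def smooth_absolute_error(y_true, y_pred, delta=1.0):
    """Huber-style loss: quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)

y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [10.5, 12.0, 13.0, 26.0]  # the last prediction is an outlier

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
smooth = smooth_absolute_error(y_true, y_pred)
```

On this data the single outlier dominates the MSE, while the MAE and the smooth error stay much closer to the typical error size.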
Accuracy
Accuracy is the simplest metric and can be defined as the number of test cases
correctly classified divided by the total number of test cases.
Accuracy = (TP + TN)/(TP + FP + TN + FN)
It can be applied to most generic problems but is not very useful when it comes to
unbalanced datasets. For instance, if we’re detecting fraud in bank data, the ratio of fraud to
non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be
99% accurate by predicting all test cases as non-fraud.
This is why accuracy can be a misleading indicator of model health, and for such a case a metric
is required that can focus on the fraud data points.
Precision
Precision is the metric used to identify the correctness of classification.
Precision = TP / (TP + FP)
Intuitively, this equation is the ratio of correct positive classifications to the total
number of predicted positive classifications. The greater the fraction, the higher the precision,
which means better ability of the model to correctly classify the positive class.
Recall
Recall tells us the number of positive cases correctly identified out of the total number
of positive cases.
Recall = TP / (TP + FN)
F1 Score
F1 score is the harmonic mean of Recall and Precision, therefore it balances out the
strengths of each. It’s useful in cases where both recall and precision can be valuable – like in
the identification of plane parts that might require repairing. Here, precision will be required
to save on the company’s cost (because plane parts are extremely expensive) and recall will
be required to ensure that the machinery is stable and not a threat to human lives.
F1 Score = 2 * ((Precision * Recall) / (Precision + Recall))
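The four formulas above can be checked on the 1:99 fraud example. In this sketch the function name classification_metrics and the second "detector" model are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than part of the definitions.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# 1 fraud case in 100; a model that always predicts "non-fraud".
always_negative = classification_metrics(tp=0, fp=0, tn=99, fn=1)

# A hypothetical model that flags 2 cases, one of them correctly.
detector = classification_metrics(tp=1, fp=1, tn=98, fn=0)
```

Both models reach 99% accuracy, but the first has zero recall and zero F1, which is exactly the weakness of accuracy on unbalanced data.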
Preprocessing
When we talk about data, we usually think of large datasets with a huge number
of rows and columns. While that is a likely scenario, it is not always the case: data could be
in many different forms, such as structured tables, images, audio files, videos, etc.
Machines don’t understand free text, image, or video data as it is; they understand 1s
and 0s. So it probably won’t be good enough if we put on a slideshow of all our images and
expect our machine learning model to get trained just by that.
A dataset can be viewed as a collection of data objects, which are often also called
records, points, vectors, patterns, events, cases, samples, observations, or entities.
Data objects are described by a number of features that capture the basic
characteristics of an object, such as the mass of a physical object or the time at which an event
occurred. Features are often called variables, characteristics, fields, attributes, or
dimensions.
For instance, color, mileage and power can be considered as features of a car. There are
different types of features that we can come across when we deal with data.
Categorical: Features whose values are taken from a defined set of values. For
instance, the day of the week takes its value from {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday,
Sunday}; it is categorical because its value is always taken from this set. Another example
could be the Boolean set: {True, False}.
Numerical: Features whose values are continuous or integer-valued. They are
represented by numbers and possess most of the properties of numbers. For instance,
the number of steps you walk in a day, or the speed at which you are driving your car.
The steps of data preprocessing: not all the steps are applicable to every problem; this depends
heavily on the data we are working with, so only a few steps might be required for a given
dataset. Generally, they are:
Because data is often taken from multiple sources, which are normally not very
reliable and use different formats, more than half our time is consumed in dealing with
data quality issues when working on a machine learning problem. It is simply unrealistic to
expect that the data will be perfect. There may be problems due to human error, limitations of
measuring devices, or flaws in the data collection process. The common problems and the ways to deal with
them are:
1. Missing values
It is very common to have missing values in a dataset. They may have arisen
during data collection, or due to some data validation rule, but regardless, missing
values must be taken into consideration.
2. Inconsistent values
The data can contain inconsistent values. For instance, the ‘Address’ field contains
the ‘Phone number’. It may be due to human error or maybe the information was misread
while being scanned from a handwritten form.
It is therefore always advised to perform data assessment like knowing what the data
type of the features should be and whether it is the same for all the data objects.
3. Duplicate values
A dataset may include data objects which are duplicates of one another. This may
happen when, say, the same person submits a form more than once. The term deduplication is
often used to refer to the process of dealing with duplicates.
In most cases, the duplicates are removed so as not to give that particular data object
an advantage or bias when running machine learning algorithms.
Feature Aggregation
Feature Sampling
Sampling is a very common method for selecting a subset of the dataset that we are
analyzing. In most cases, working with the complete dataset can turn out to be too expensive
considering the memory and time constraints. Using a sampling algorithm can help us reduce
the size of the dataset to a point where we can use a better, but more expensive, machine
learning algorithm.
The key principle here is that the sampling should be done in such a manner that the
sample generated should have approximately the same properties as the original dataset,
meaning that the sample is representative. This involves choosing the correct sample size and
sampling strategy.
Simple Random Sampling dictates that there is an equal probability of selecting any
particular entity. It has two main variations:
Sampling without Replacement: As each item is selected, it is removed from the set
of all the objects that form the total dataset.
Sampling with Replacement: Items are not removed from the total dataset after
getting selected. This means they can get selected more than once.
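The two variations can be illustrated with the standard library's random module. The population of ten item ids and the seed are hypothetical; random.sample draws without replacement and random.choices draws with replacement.

```python
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
population = list(range(1, 11))  # hypothetical dataset of 10 item ids

# Without replacement: each item can be picked at most once.
without_replacement = rng.sample(population, k=5)

# With replacement: items stay in the pool and may repeat.
with_replacement = rng.choices(population, k=5)
```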
Dimensionality Reduction
Most real-world datasets have a large number of features. For example, in an
image processing problem we might have to deal with thousands of features, also called
dimensions. As the name suggests, dimensionality reduction aims to reduce the number of
features, but not simply by selecting a subset of features from the feature set, which is
something else: Feature Subset Selection, or simply Feature Selection.
Conceptually, dimension refers to the number of axes needed to describe each data point,
which can be so high that the dataset cannot be visualized with pen and paper. The greater the
number of dimensions, the greater the complexity of the dataset.
The curse of dimensionality refers to the phenomenon that data analysis tasks generally become significantly
harder as the dimensionality of the data increases. As the dimensionality increases, the
data spreads over a larger space and becomes increasingly sparse,
which makes it difficult to model and visualize.
What dimensionality reduction essentially does is map the dataset to a lower-dimensional
space, which may well have few enough dimensions to be
visualized, say 2D. The basic objective of the techniques used for this purpose is to
reduce the dimensionality of a dataset by creating new features which are combinations of
the old features. In other words, the higher-dimensional feature space is mapped to a lower-dimensional
feature space. Principal Component Analysis and Singular Value Decomposition
are two widely accepted techniques.
Data Analysis algorithms work better if the dimensionality of the dataset is lower.
This is mainly because irrelevant features and noise have now been eliminated.
The models which are built on top of lower-dimensional data are more understandable
and explainable.
The data may now also get easier to visualize.
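As a small illustration of creating a new feature from a combination of old ones, the sketch below applies Principal Component Analysis to 2-D points and keeps only the first component. It is restricted to two input features so that the leading eigenvector of the covariance matrix can be written in closed form; the function name and the sample points are hypothetical.

```python
import math

def project_to_first_component(points):
    """Project 2-D points onto the direction of largest variance (PCA, 2D -> 1D)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    return [x * direction[0] + y * direction[1] for x, y in centered]

# Hypothetical points lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
reduced = project_to_first_component(points)
```

Because these points lie on a single line, the one new feature retains all of the variance of the two original features.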
Feature Encoding
The whole purpose of data preprocessing is to encode the data in order to bring it to
such a state that the machine now understands it.
Feature encoding is basically performing transformations on the data such that it can
be easily accepted as input for machine learning algorithms while still retaining its original
meaning.
There are some general norms or rules which are followed when performing feature
encoding.
Nominal: Any one-to-one mapping can be done which retains the meaning. For
instance, a permutation of values like in One-Hot Encoding.
Ordinal: An order-preserving change of values. The notion of small, medium and
large can be represented equally well with the help of a new function, that is,
<new_value = f(old_value)> - For example, {0, 1, 2} or maybe {1, 2, 3}.
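The two rules can be sketched in plain Python. The function names and sample values are hypothetical; the one-hot columns follow the alphabetical order of the categories, and the ordinal mapping uses the {0, 1, 2} choice mentioned above.

```python
def one_hot_encode(values):
    """Map each nominal value to a 0/1 vector with a single 1."""
    categories = sorted(set(values))
    return [[1 if value == c else 0 for c in categories] for value in values]

def ordinal_encode(values, order):
    """Map each ordinal value to its rank in the given order."""
    rank = {label: index for index, label in enumerate(order)}
    return [rank[value] for value in values]

colors = one_hot_encode(["red", "blue", "red"])
sizes = ordinal_encode(["small", "large", "medium"], ["small", "medium", "large"])
```

One-hot encoding avoids implying an order between nominal values, while the ordinal mapping keeps small < medium < large.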
After feature encoding is done, our dataset is ready for the machine learning
algorithms. But before we start deciding which algorithm should be used, it is always
advised to split the dataset into 2 or sometimes 3 parts. Machine learning algorithms, or any
algorithms for that matter, have to be first trained on the data distribution available and then
validated and tested, before they can be deployed to deal with real-world data.
Training data: This is the part on which your machine learning algorithms are
actually trained to build a model. The model tries to learn the dataset and its various
characteristics and intricacies, which also raises the issue of overfitting vs.
underfitting.
Validation data: This is the part of the dataset which is used to validate our various model
fits. In simpler words, we use validation data to choose and improve our model
hyperparameters. The model does not learn the validation set but uses it to get to a better state
of hyperparameters.
Test data: This part of the dataset is used to test our model hypothesis. It is left untouched
and unseen until the model and hyperparameters are decided, and only after that the model is
applied on the test data to get an accurate measure of how it would perform when deployed
on real-world data.
Split Ratio: Data is split as per a split ratio which is highly dependent on the type of model
we are building and the dataset itself. If our dataset and model are such that a lot of training is
required, then we use a larger chunk of the data just for training purposes (this is usually the case).
For instance, training on textual data, image data, or video data usually involves thousands of
features.
If the model has a lot of hyperparameters that can be tuned, then keeping a higher
percentage of data for the validation set is advisable. Models with fewer
hyperparameters are easy to tune and update, and so we can keep a smaller validation set.
Like many other things in Machine Learning, the split ratio is highly dependent on
the problem we are trying to solve and must be decided after taking into account all the
various details about the model and the dataset in hand.
Exploratory Data Analysis (EDA) is essential because it is good practice to first understand the problem
statement and the various relationships between the data features before building
models.
In today's digital age, laptops have become an essential tool for work, education, and
entertainment. With so many options available on the market, it can be challenging to choose
the right laptop for your needs. The models predict the price of a laptop based on the
requirements, and the best model is selected based on the root mean squared error and the R2 score.
1.2 Objectives
To predict the price of the laptop.
To find an approach that achieves higher accuracy.
To build a machine learning model for the given problem statement.
CHAPTER 2
REQUIREMENTS SPECIFICATION
CHAPTER 3
SYSTEM DEFINITION
PROJECT DESCRIPTION
Supervised machine learning algorithms can be broadly classified into regression and
classification algorithms. Regression algorithms are used to predict
continuous values, whereas classification algorithms are used to predict discrete values. In
this project, regression algorithms are used to predict the price of the laptops.
Regression analysis is often used in finance, investing, and other fields, and finds the
relationship between a single dependent variable (the target variable) and several
independent ones. For example, predicting house prices, the stock market, or the salary of an
employee are among the most common regression problems. Here, the target variable is the price
of the laptops.
Regression Algorithms
1. Linear Regression
2. Decision Tree
3. Support Vector Regression
4. Random Forest
Linear regression
Linear Regression is an ML algorithm used for supervised learning. It
predicts a dependent variable (target) based on the given independent
variable(s). To do so, this regression technique finds a linear relationship between the dependent
variable and the given independent variables. Hence, the name of this algorithm is
Linear Regression.
In the figure above, on X-axis is the independent variable and on Y-axis is the output. The
regression line is the best fit line for a model. And our main objective in this algorithm is to
find this best fit line.
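For a single independent variable, the best fit line can be found in closed form with ordinary least squares. The sketch below is a minimal illustration; the function name fit_line and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The slope and intercept returned here minimise the sum of squared vertical distances between the points and the line.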
Pros:
Cons:
Decision Tree
Decision tree models can be applied to data which contains both numerical
and categorical features. Decision trees are good at capturing non-linear interactions
between the features and the target variable. Decision trees roughly match human-level
thinking, so they make the data very intuitive to understand.
For example, if we are predicting how many hours a kid plays in particular weather,
then the decision tree looks somewhat like the one in the image above.
So, in short, a decision tree is a tree where each node represents a feature, each branch
represents a decision, and each leaf represents an outcome (a numerical value for regression).
Step 1. Begin the tree with the root node, say S, which contains the complete dataset.
Step 2. Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step 3. Divide S into subsets that contain the possible values of the best attribute.
Step 4. Generate the decision tree node, which contains the best attribute.
Step 5. Recursively make new decision trees using the subsets of the
dataset created in step 3.
Continue this process until a stage is reached where the nodes cannot be split
any further; such a final node is called a leaf node.
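The steps above can be sketched for regression as follows. This is a simplified illustration, not the project's implementation: it assumes numerical features only, uses the sum of squared errors as the attribute selection measure, and stops at a hypothetical max_depth. All function names and the sample data are made up for the example.

```python
def sse(values):
    """Sum of squared deviations from the mean (the impurity measure here)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return (feature index, threshold) minimising the SSE of the two subsets."""
    best = None
    best_score = sse(targets)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, targets, depth=0, max_depth=3):
    split = best_split(rows, targets) if depth < max_depth else None
    if split is None:                      # leaf: predict the mean target
        return sum(targets) / len(targets)
    f, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [t for _, t in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [t for _, t in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    while isinstance(tree, dict):          # walk down until a leaf value is reached
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical data: [ram_gb, ssd_gb] -> price
rows = [[4, 256], [8, 256], [8, 512], [16, 512]]
prices = [30.0, 45.0, 50.0, 80.0]
tree = build_tree(rows, prices)
```

A new laptop is priced by following the thresholds from the root down to a leaf, for example predict(tree, [8, 512]).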
Pros:
Cons:
It tends to overfit.
A small change in the data tends to cause a big difference in the tree structure, which
causes instability.
Support Vector Regression
In the figure above, the blue line is the hyperplane and the red line is the boundary line.
All the data points are within the boundary line (red line). The main objective of
SVR is to consider the points that lie within the boundary line.
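One common way to express this idea is the epsilon-insensitive loss used in standard SVR formulations: predictions that fall within a margin of width epsilon around the true value incur no loss, and errors beyond the margin are penalised linearly. The sketch below assumes that formulation; the value epsilon = 0.5 and the sample numbers are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Errors inside the +/- epsilon tube cost nothing; outside, cost grows linearly."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Two predictions inside the tube and one outside it.
losses = epsilon_insensitive_loss([10.0, 10.0, 10.0], [10.2, 10.5, 12.0])
```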
Pros:
Robust to outliers.
Excellent generalization capability
High prediction accuracy.
Cons:
Random Forest
Random Forest works in two phases: the first is to create the random forest by combining
N decision trees, and the second is to make predictions with each tree created in the first
phase.
Step 2. Build the decision trees associated with the selected data points.
Step 3. Choose the number N for decision trees that you want to build.
Step 5. For new data points, find the predictions of each decision tree, and assign the
new data points to the category that wins the majority votes.
Pros:
Cons:
WORKING DESCRIPTION
LAPTOP SELECTION
In today's digital age, laptops have become an essential tool for work, education,
and entertainment. With so many options available on the market, it can be challenging to
choose the right laptop for one's needs. To address this challenge, we design a model which
predicts the price of a laptop based on its specifications and brand.
Context of Dataset
The dataset provides detailed information on 1000 laptops available on Flipkart,
including technical specifications, customer reviews and ratings, and prices. Researchers,
analysts, and consumers can use this dataset to gain insights into the Indian laptop market,
compare different models, and make informed purchase decisions. With this comprehensive
dataset, anyone can find the perfect laptop for their needs, whether for work, gaming, or
personal use.
Data Preprocessing
Data preprocessing is an important step before the data is used. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of
data preprocessing is to improve the quality of the data and to make it more suitable for
training the specific model.
In the laptop selection dataset, there are both numerical and categorical features.
The categorical features need to be converted to numerical ones, as the models take only
numerical values.
The categorical features present are image link, name, processor, ram, os, and storage.
The numerical features are price (in Rs.), display (in inch), rating, number of ratings, and number
of reviews.
Also, there are many null values in the dataset; they are filled before the
dataset is used. Null values of numerical features are filled with the mean value of the feature, whereas
null values of categorical features are filled with the mode.
Before splitting the data for training and testing, we assign the
response variable and the predictor variables to Y and X respectively. We then split
the data in an 80:20 ratio: 80% of the data is used for training the models and 20%
of the data is used for testing.
Performing Regression
Prepare the model using the X_train and y_train (training data) using the following
algorithms:
● Decision Tree
● Random Forest
● Support Vector Regression
With the prepared model, predict on the 20% testing data (X_test) and assign
the result to the y_pred variable. Then evaluate the performance of the model using the root
mean squared error and the R2 score.
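The two evaluation measures can be written out directly from their definitions. The sketch below uses plain Python rather than a library call, and the y_test and y_pred values are hypothetical prices, not results from this project.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2_score(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_test = [40000.0, 55000.0, 70000.0]   # hypothetical prices in Rs.
y_pred = [42000.0, 54000.0, 67000.0]

error = rmse(y_test, y_pred)
score = r2_score(y_test, y_pred)
```

A lower RMSE and an R2 score closer to 1 indicate a better fit, which is how the best model is chosen among the candidates.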
Libraries Used:
NumPy
Numpy is a general-purpose array-processing package. It provides a
high-performance multidimensional array object, and tools for working with
these arrays. It is the fundamental package for scientific computing with Python.
Besides its obvious scientific uses, Numpy can also be used as an efficient multi-
dimensional container of generic data.
Pandas
Pandas is an open-source library that is built on top of the NumPy
library. It is a Python package that offers various data structures and
operations for manipulating numerical data and time series. It is popular
mainly because it makes importing and analyzing data much easier. Pandas is fast and
offers high performance and productivity for users.
Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots
of arrays. Matplotlib is a multi-platform data visualization library built on
NumPy arrays and designed to work with the broader SciPy stack. It was
introduced by John Hunter in the year 2002. One of the greatest benefits of
visualization is that it allows us visual access to huge amounts of data in
easily digestible visuals. Matplotlib consists of several plots like line, bar,
scatter, histogram etc.
Seaborn
Seaborn is a library mostly used for statistical plotting in Python. It is built
on top of Matplotlib and provides beautiful default styles and color palettes to
make statistical plots more attractive.
TECHNOLOGY USED
Machine Learning
LANGUAGE USED
Python
DATASET
For this project, I have used a dataset obtained from Kaggle. The dataset
provided by the source is fairly accurate and was collected from Flipkart, covering about 1000
laptops. The dataset has 984 rows and 12 columns. A snapshot of part of the dataset
is given below.
CHAPTER 4
IMPLEMENTATION
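# Imports needed by the plotting code below; `data` is the laptop DataFrame loaded earlier
import matplotlib.pyplot as plt
import seaborn as sns

# Price range
plt.figure(figsize=(12,6))
sns.histplot(data=data, x='price(in Rs.)', kde=True, bins=20)
plt.title('Price Range of the Laptops')
plt.show()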
plt.figure(figsize=(12,6))
data.groupby('brand').size().sort_values(ascending=False).head(5).plot(kind = 'bar',color =
sns.color_palette('Paired'))
plt.xlabel('Laptop Brand')
plt.ylabel('Number of Laptops')
plt.title('Top 5 most popular laptops brand')
plt.show()
# RAM: take the leading integer of each 'ram' string as the size in GB
ram = []
for i in data['ram']:
    ram.append(int(i.split(' ')[0]))
data['ramSize'] = ram
plt.figure(figsize=(12,6))
data.groupby('ramSize').size().sort_values(ascending=False).plot(kind = 'bar',color =
sns.color_palette('Blues'))
plt.xlabel('Ram Size in GB')
plt.ylabel('Number of Laptops')
plt.show()
#processor
plt.figure(figsize=(12,6))
df.groupby('processor').size().sort_values(ascending = False).head(6).plot(kind = 'bar', color
= sns.color_palette("coolwarm"))
plt.xlabel("Processor")
plt.ylabel("Frequency")
plt.title("Top 6 Most Frequent Processor")
plt.show()
# OS: keep the text after 'bit' in each 'os' string as the operating system name
os_names = []
for i in data['os']:
    os_names.append(i.split('bit')[-1].strip())
data['operating_system'] = os_names
data.drop(['name','os'],axis=1,inplace=True)
plt.figure(figsize=(12,6))
data.groupby('operating_system').size().sort_values(ascending = False).plot(kind = 'bar',
color = sns.color_palette("hsv_r"))
plt.xlabel("Operating System")
plt.ylabel("Frequency")
plt.title("Most Frequent OS")
plt.show()
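To make the effect of the `split('bit')` step concrete, here is a small self-contained sketch. The helper name `extract_os` and the two sample strings are hypothetical; the actual values in the `os` column may look different.

```python
def extract_os(raw: str) -> str:
    """Return the text after the last 'bit' marker, without surrounding spaces."""
    return raw.split("bit")[-1].strip()


# Hypothetical raw values, one with a "64 bit" prefix and one without.
samples = ["64 bit Windows 11 Operating System", "Mac OS Operating System"]
cleaned = [extract_os(s) for s in samples]
print(cleaned)
```

A string without the word "bit" is returned unchanged apart from the stripped spaces, so the step is safe for entries that lack the prefix.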
plt.figure(figsize=(12,6))
sns.scatterplot(data = new_df,x = 'price(in Rs.)', y = 'no_of_ratings')
plt.xlabel('Price in INR')
plt.ylabel("Number of Ratings")
plt.title("Price VS Number of Ratings")
plt.show()
# Price Dependency
# Processor
plt.figure(figsize=(15,5))
sns.scatterplot(x=categorical_df['processor'], y=numerical_df['price(in Rs.)'])
plt.xticks(rotation=70, horizontalalignment='right',
fontsize=9)
print()
# ram
plt.figure(figsize=(10,5))
sns.scatterplot(x=categorical_df['ram'], y=numerical_df['price(in Rs.)'])
plt.xticks(rotation=40, horizontalalignment='right',
fontsize=9)
print()
plt.figure(figsize=(10,4))
sns.barplot(x=laptops_sorted_by_rating['brand'][:20],
y=laptops_sorted_by_rating['no_of_ratings'][:20])
plt.xticks(rotation=40, horizontalalignment='right',
fontsize=10)
print()
plt.figure(figsize=(10,4))
sns.barplot(x=laptops_sorted_by_rating['brand'][:20],
y=laptops_sorted_by_rating['no_of_reviews'][:20])
plt.xticks(rotation=40, horizontalalignment='right',
fontsize=10)
# Numerical features
sns.pairplot(numerical_df)
#Preprocessing
#ram
data['ramType'] = data.ram.str.split().apply(lambda x : ' '.join(x[2:-1]))
data.drop('ram',axis=1,inplace=True)
#storage
# Drop rows whose storage field holds a feature list instead of a capacity.
data = data[data["storage"].str.contains(
    "PCI-e SSD (NVMe) ready,Silver-Lining Print Keyboard,Matrix Display (Extend),"
    "Cooler Boost 5,Hi-Res Audio,Nahimic 3,144Hz Panel,Thin Bezel,"
    "RGB Gaming Keyboard,Speaker Tuning Engine,MSI Center",
    regex=False) == False]
data = data[data["storage"].str.contains(
    "PCI-e Gen4 SSD?SHIFT?Matrix Display (Extend)?Cooler Boost 3?Thunderbolt 4?"
    "Finger Print Security?True Color 2.0?Hi-Res Audio?Nahimic 3? 4-Sided Thin bezel?"
    "MSI Center?Silky Smooth Touchpad?Military-Grade Durability",
    regex=False) == False]
data["first"]= new[0]
data["first"]=data["first"].str.strip()
data["second"]= new[1]
data["first"] = data["first"].astype(int)
data["second"] = data["second"].astype(int)
data["HDD"]=(data["first"]*data["Layer1HDD"]+data["second"]*data["Layer2HDD"])
data["SSD"]=(data["first"]*data["Layer1SSD"]+data["second"]*data["Layer2SSD"])
data.drop(columns=['first', 'second', 'Layer1HDD', 'Layer1SSD','Layer2HDD',
'Layer2SSD'],inplace=True)
data.drop('storage',axis=1,inplace=True)
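The storage listing relies on `new` and on the `Layer1HDD`, `Layer1SSD`, `Layer2HDD` and `Layer2SSD` columns, whose definitions are not shown in this report. The sketch below is therefore not the project's code but one possible way to reach the same HDD and SSD totals. It assumes storage strings such as "512 GB SSD" or "1 TB HDD|256 GB SSD", treats 1 TB as 1024 GB, and uses a hypothetical helper named `parse_storage`.

```python
import re

# Matches parts such as "512 GB SSD" or "1 TB HDD"; this string layout is an assumption.
_PART = re.compile(r"(\d+)\s*(GB|TB)\s*(HDD|SSD)", re.IGNORECASE)


def parse_storage(text: str) -> dict:
    """Return the total HDD and SSD capacity, in GB, found in a storage string."""
    totals = {"HDD": 0, "SSD": 0}
    for size, unit, kind in _PART.findall(text):
        # Convert TB to GB (1 TB taken as 1024 GB in this sketch).
        gigabytes = int(size) * (1024 if unit.upper() == "TB" else 1)
        totals[kind.upper()] += gigabytes
    return totals


print(parse_storage("1 TB HDD|256 GB SSD"))
```

Strings that contain no recognizable capacity yield zero for both drive types, so unexpected entries can be found by filtering on rows where both totals are zero.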
data.brand.value_counts()
dummies = pd.get_dummies(data.operating_system)
dummies
data = pd.concat([data,dummies],axis=1)
data.drop('operating_system',axis=1,inplace=True)
dummies = pd.get_dummies(data.brand)
data = pd.concat([data,dummies],axis=1)
data.drop('brand',axis=1,inplace=True)
data.head()
dummies = pd.get_dummies(data.ramType)
data = pd.concat([data,dummies],axis=1)
data.drop('ramType',axis=1,inplace=True)
data.head()
data.info()
data.processor.value_counts()
data['cpu'] = data['processor'].apply(lambda x:" ".join(x.split()[0:3]))
data.cpu.value_counts()
dummies = pd.get_dummies(data.cpu)
data = pd.concat([data,dummies],axis=1)
data.drop('processor',axis=1,inplace=True)
data.head()
data.drop('cpu',axis=1,inplace=True)
data.info()
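The repeated pattern of `get_dummies`, `concat` and `drop` can be condensed by passing the `columns` argument to `pd.get_dummies`. This is a sketch of that alternative, assuming pandas, with made-up rows; note that it prefixes each new column with the source column name, unlike the listing above, which keeps the bare category names.

```python
import pandas as pd

# Illustrative rows only; not taken from the dataset.
data = pd.DataFrame(
    {
        "brand": ["ASUS", "HP", "ASUS"],
        "operating_system": ["Windows 11", "Windows 10", "Windows 11"],
        "price(in Rs.)": [72990, 52990, 45990],
    }
)

# Encode both categorical columns in one call; the originals are dropped automatically.
encoded = pd.get_dummies(data, columns=["brand", "operating_system"], dtype=int)
print(encoded.columns.tolist())
```

The prefixes avoid name clashes when two categorical columns happen to share a category label.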
len(x_train),len(y_train),len(x_test),len(y_test)
#standardizing data
from sklearn.preprocessing import StandardScaler
x_sc = StandardScaler()
y_sc = StandardScaler()
y__train = y_train.reshape(len(y_train),1)
y__test = y_test.reshape(len(y_test), 1)
x__train = x_sc.fit_transform(x_train)
y__train = y_sc.fit_transform(y__train)
x__test = x_sc.transform(x_test)
y_test = y_sc.transform(y__test)
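The scaling step fits on the training data and only transforms the test data. The sketch below reproduces that idea in plain Python for a single column, using the population standard deviation as `StandardScaler` does. The function names and the prices are made up for illustration.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Map standardized values back to the original unit."""
    return [s * std + mean for s in scaled]


# Made-up prices in rupees.
train_prices = [30000.0, 50000.0, 70000.0]
test_prices = [60000.0]

mean, std = fit_scaler(train_prices)  # statistics come from the training data only
scaled_test = transform(test_prices, mean, std)
restored = inverse_transform(scaled_test, mean, std)
print(scaled_test, restored)
```

Because the target is standardized as well, model predictions are in scaled units; applying the inverse transform is what converts them back to rupees.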
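#models
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR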
#decision tree
dreg = DecisionTreeRegressor()
dreg.fit(x__train,y__train.reshape(len(y__train)))
dtpred = dreg.predict(x__test)
#random forest
rreg = RandomForestRegressor(n_estimators=10)
rreg.fit(x__train,y__train.reshape(len(y__train)))
rfpred = rreg.predict(x__test)
#svr
sreg = SVR(kernel='linear')
sreg.fit(x__train,y__train.reshape(len(y__train)))
srpred = sreg.predict(x__test)
from sklearn.metrics import mean_squared_error, r2_score
#decision tree
rsme_dt = np.sqrt(mean_squared_error(y_test,dtpred))
rsdt = r2_score(y_test,dtpred)
rsme_dt,rsdt
#random forest
rsme_rf = np.sqrt(mean_squared_error(y_test,rfpred))
rsrf = r2_score(y_test,rfpred)
rsme_rf,rsrf
#svr
rsme_sr = np.sqrt(mean_squared_error(y_test,srpred))
rssr = r2_score(y_test,srpred)
rsme_sr,rssr
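For reference, the two metrics used above can be computed by hand. This plain-Python sketch follows the standard definitions of root mean squared error and the coefficient of determination; the function names and the four sample values are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Tiny made-up example: only the last prediction is wrong.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

Since the targets in this chapter are standardized before training, the RMSE values are expressed in standard deviations of the price rather than in rupees.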
SNAPSHOTS
Before Preprocessing
After Preprocessing
EDA
Price Range
The prices of the laptops range from Rs 15,990 to Rs 4,19,990, with most laptops
priced between Rs 20,000 and Rs 75,000.
Popular Brands
Asus, Lenovo, Dell, HP, and Acer are the top five most popular brands.
RAM
RAM size ranges from 4 GB to 32 GB. Since most of the laptops cost between Rs 45,000
and Rs 75,000, the RAM is typically either 8 GB or 16 GB.
Operating System
Windows 11 is the most frequently used operating system among the laptops in the dataset.
Processor
When the price is between Rs 25,000 and Rs 1,00,000, the number of ratings is higher,
which suggests that this is the price range most buyers are interested in.
Price dependency
Processor
I think the price is influenced most by the processor installed in the laptop. The above
graph suggests that the price does depend, to some degree, on the type of processor. In
addition, some processors are rarely found in laptops.
RAM
The price depends on the amount of memory: the more memory a laptop has, the more expensive it tends to be.
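One way to check this relationship numerically, rather than only from the scatter plot, is to compare the median price per RAM size. The sketch below assumes pandas and uses made-up rows, so the numbers it prints say nothing about the real dataset; only the column names follow the listings in Chapter 4.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
data = pd.DataFrame(
    {
        "ramSize": [4, 8, 8, 16, 16, 32],
        "price(in Rs.)": [25990, 45990, 55990, 75990, 95990, 189990],
    }
)

# Median price for each RAM size; groupby sorts the sizes in ascending order.
median_price = data.groupby("ramSize")["price(in Rs.)"].median()
# True when the median never decreases as RAM size grows.
is_increasing = median_price.is_monotonic_increasing

print(median_price)
print(is_increasing)
```

The median is used here instead of the mean so that a few very expensive models do not dominate the comparison.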
Best Rating
Most Rated
Most Reviews
From the comparison, we can see that the random forest algorithm has the highest R² score
and the lowest root mean squared error (RMSE). So, of the three models, the random forest
algorithm is the best at predicting the price of the laptops.
CHAPTER 5
DECLARATION
I also declare that, to the best of my knowledge and belief, the work reported
here does not form part of any dissertation on the basis of which a degree or award
was conferred on an earlier occasion on this or any other student.
Place: Bengaluru
Amrutha R.
[1BI19CS020]
CHAPTER 6
CONCLUSION/FUTURE ENHANCEMENT
As laptops become an essential tool for work, education, and entertainment, it can be
challenging for consumers to choose the right laptop for their needs. With the proposed
machine learning model, one can predict the price of a laptop based on its specifications.
The R² scores of the decision tree, random forest, and support vector regression models are
0.7824, 0.8543, and 0.8131 respectively. From this we can conclude that the random forest
algorithm predicts prices more accurately than the other two algorithms.
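The ranking stated above follows directly from the reported scores. As a small illustration, the snippet below sorts the three R² values quoted in this chapter and reports the margin between the best and the second-best model; the variable names are arbitrary.

```python
# R² values as reported in this chapter.
r2_scores = {
    "decision tree": 0.7824,
    "random forest": 0.8543,
    "support vector regression": 0.8131,
}

# Sort models from highest to lowest R².
ranking = sorted(r2_scores.items(), key=lambda item: item[1], reverse=True)
best_model, best_score = ranking[0]
margin_over_second = best_score - ranking[1][1]

print(best_model, round(margin_over_second, 4))
```

The margin over support vector regression is about 0.04, so a comparison over several train/test splits would be a reasonable way to confirm that the ordering is stable.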
I have identified the features that play a vital role in laptop selection, as well
as the non-influential ones. I studied the prediction results carefully. The existing
system can be extended with additional analysis methods and with implementations
based on neural networks and deep learning.
CHAPTER 7
REFERENCES
https://www.kaggle.com/datasets/rajugc/laptop-selection-dataset
https://pandas.pydata.org/docs/
https://matplotlib.org/stable/index.html
https://www.academia.edu/69591584/Laptop_Price_Prediction_using_Machine_Learning