Capstone Project
on
Title
SUBMITTED TO
FOUNDATION FOR INNOVATION
AND TECHNOLOGY TRANSFER
Submitted By
PRASHANT DWIVEDI
[email protected]
2.3 TASK
3. LITERATURE SURVEY
4. DATASET
7.3 CONCLUSION
8. CONCLUSION
2. INTRODUCTION
2.1 Problem Formulation
Car price prediction is an important part of the automobile industry that affects customers,
sellers, and manufacturers. A reliable car price forecasting system can help dealers
price their vehicles sensibly, help customers make better-informed decisions,
and allow manufacturers to adjust prices reasonably. Manual comparison is the
conventional technique for car pricing, but it can be subjective and inaccurate.
Data-driven approaches can automate car price estimation and improve its
precision.
• Data Exploration: Getting to know the organization of the dataset and the most
critical features.
• Data Preprocessing: Handling missing values, encoding categorical variables, and
normalization.
• Exploratory Data Analysis (EDA): Identification of the most significant trends and
correlations within the dataset.
• Model Selection and Training: Choosing and training the most suitable machine
learning algorithms.
• Model Evaluation: Comparison of model performance using standard evaluation
metrics.
• Conclusion and Future Work: Summary of results and proposing improvements.
3. LITERATURE SURVEY
3.1 Current Models for Price Prediction
Several machine learning models are commonly used for predicting car prices.
This study uses Multiple Linear Regression and Random Forest Regression.
• Linear Regression: A simple but concise model that postulates a linear relationship
between attributes and car price.
• Random Forest: An ensemble algorithm that enhances accuracy by aggregating
several decision trees, hence being less susceptible to overfitting.
• Linear Regression: The model is trained to estimate the function mapping input
features to car prices as a weighted sum of predictors.
• Random Forest: Decision trees are trained independently, and their predictions are
averaged to provide more accurate outcomes.
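The two descriptions above can be checked with a minimal sketch. It assumes scikit-learn and NumPy are available and uses synthetic stand-in data, not the CarPrice dataset; the feature count, coefficients, and number of trees are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the CarPrice dataset): 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Linear Regression: the prediction is the intercept plus a weighted sum of the features.
lin = LinearRegression().fit(X, y)
manual_lin = lin.intercept_ + X @ lin.coef_

# Random Forest: the prediction is the mean of the individual trees' predictions.
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
manual_forest = np.mean([tree.predict(X) for tree in forest.estimators_], axis=0)

# Both manual computations should agree with the library's own predict().
print(np.allclose(manual_lin, lin.predict(X)))
print(np.allclose(manual_forest, forest.predict(X)))
```

Reproducing the predictions by hand makes the difference concrete: the linear model is fully described by its coefficients, while the forest's output depends on every tree it contains.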
4. DATASET
4.1 Data Collection
The CarPrice dataset has 205 rows and 26 columns, with prominent car features such as:
• Car details: Name of the car, type of fuel, type of body, type of drive.
• Engine specification: Size of the engine, horsepower, fuel system.
• Performance indicators: City/highway MPG mileage, curb weight, compression
ratio.
• Price: The target variable to be predicted.
Exploratory Data Analysis helps uncover hidden patterns, detect outliers, and identify
feature importance before building predictive models.
• Correlation matrix
• Feature importance scores from Random Forest
• Splitting the dataset into training (80%) and testing (20%) sets.
• Training Linear Regression and Random Forest models.
• Using hyperparameter tuning to optimize model performance.
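The split, tuning, and feature-importance steps listed above could look roughly like the following sketch. It assumes scikit-learn and uses synthetic stand-in data with the same number of rows as the dataset; the search grid, fold count, and random seeds are hypothetical choices, not values taken from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the CarPrice dataset): 205 rows, 4 features,
# where only the first two features influence the target by construction.
rng = np.random.default_rng(42)
X = rng.normal(size=(205, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=205)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search grid; the values are illustrative assumptions.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)

# The best model is refit on the full training set by GridSearchCV.
best_forest = search.best_estimator_
importances = best_forest.feature_importances_
test_r2 = best_forest.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Feature importances:", np.round(importances, 3))
print("Test R² score:", round(test_r2, 3))
```

Tuning only on the training portion keeps the 20% test set untouched until the final score is computed.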
6.4 Results Summary
• Linear Regression: Provides interpretability but may not capture complex patterns.
• Random Forest: Offers better accuracy and robustness.
7. MODEL EVALUATION
7.1 Evaluation Metrics
• Mean Squared Error (MSE): Measures the average squared difference between
predicted and actual values.
• Mean Absolute Error (MAE): Calculates the average absolute differences.
• R² Score: Indicates how well independent variables explain price variance.
• Random Forest performs best, achieving the lowest RMSE and highest R² score.
• Linear Regression is useful for interpretability, though less accurate.
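The metrics named above can be computed by hand and compared with scikit-learn's implementations, as in this small sketch. The actual and predicted prices are hypothetical numbers chosen for illustration, not results from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices, for illustration only.
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_pred = np.array([11000.0, 14000.0, 21000.0, 23000.0])

errors = y_true - y_pred

# MSE: average squared difference between predicted and actual values.
mse = np.mean(errors ** 2)
# RMSE: square root of MSE, expressed in the same unit as the price.
rmse = np.sqrt(mse)
# MAE: average absolute difference.
mae = np.mean(np.abs(errors))
# R²: 1 minus the ratio of residual variation to total variation in the target.
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)
print(mean_squared_error(y_true, y_pred),
      mean_absolute_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```

RMSE is included because it is on the same scale as the price, which makes it easier to interpret than MSE.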
8. CONCLUSION
8.1 Key Achievements
This study demonstrates the power of machine learning in price prediction. By leveraging
data-driven techniques, we can achieve accurate and reliable estimates, benefiting both
consumers and businesses in the automotive industry.
ABSTRACT
This project applies machine learning to car price prediction. Car prices vary with
many parameters, such as brand, engine size, horsepower, body style, and fuel type.
Human estimates of price are neither exact nor consistent. Hence, this project
develops a model that estimates car prices from structured data on these
parameters.
We used a dataset that contains data about 205 cars with 26 features. The data were
preprocessed and cleaned by converting text data to numeric values and scaling the
features. We used the data to determine which features are most influential in car
prices. Engine size, horsepower, and curb weight were the most influential features.
We employed two machine learning models: Linear Regression and Random Forest
Regression. Linear Regression helped us develop a simple, interpretable model, while
Random Forest helped with better performance since it could handle complex patterns
in the data. We compared the models using the R² score, Mean
Squared Error, and Mean Absolute Error, and found that Random Forest performed
much better, with more than 92% accuracy.
This project illustrates how machine learning can be employed to accurately predict car
prices and allow buyers, sellers, and car dealers to make better-informed decisions.
This can be enhanced in the future with increased data and the development of a web
application for ease of use.
CHAPTER 1
INTRODUCTION
Forecasting the price of a car is a complex task that depends on many factors, such as
technical specifications, brand reputation, fuel type, body style, and powertrain. Price
forecasting in the conventional car market usually relies on experience or limited
market information.
However, with the growing ability of data analytics and machine learning, we are now in
a position to build intelligent systems that can learn from past data and calculate car
prices with high accuracy.
The fundamental objective of this project is to develop a machine learning model that
can predict the price of a car based on its features. It will be helpful to several
stakeholders across the automotive community:
• Consumers can know how much a car is worth before they purchase it.
• Sellers and dealers can price competitively with the help of forecasting methods.
• Manufacturers can see how various features affect cost and make informed design
decisions.
The data set utilized in this project has 205 records of different car models, each with 26
attributes like car brand, fuel type, engine size, horsepower, body style, and the price of
the car. The variability in this data set enables us to analyze how various features affect
the price and create a sound model.
The project follows these steps:
1. Data Preprocessing: Cleaning and preprocessing the dataset into a suitable format.
2. Exploratory Data Analysis (EDA): Identifying patterns, trends, and relationships within
the data.
3. Model Training and Selection: Selecting appropriate machine learning models and
training them on the available data.
4. Model Evaluation: Evaluating the performance of each model using the appropriate
metrics.
5. Conclusion and Future Scope: Condensing results and establishing how the model
can be extended or enhanced.
At the completion of this project, we want to show that machine learning can be a
useful tool in the car business for smart price estimation. Through good analysis and
modeling, we can develop a system that provides accuracy and understanding of how
various features affect automobile prices.
CHAPTER 2
LITERATURE SURVEY
Existing Practices
• Linear Regression: A good default model, although it assumes a linear
relationship.
• Decision Trees: Can handle non-linear relationships but tend to overfit.
• Random Forest: An ensemble of trees that reduces variance
and provides a more precise output.
• Support Vector Regression (SVR): Works well with small or
high-dimensional data.
• Neural Networks: Represent complex patterns at the cost of larger
datasets and more computation.
Models of Interest
• Linear Regression: For understanding and baseline.
• Random Forest Regression: For fitting complex interactions among
features.
CHAPTER 3
DATA COLLECTION AND PREPROCESSING
Dataset Overview
• Total Records: 205
• Columns: 26
• Target Variable: price
• Data Types: 10 categorical, 8 integer, 8 float
Preprocessing Steps
• No missing values were found.
• Brand extraction: Extracted the brand from CarName by splitting on the first space.
• Label Encoding: Used for categorical features such as fueltype, aspiration,
etc.
• Feature scaling: StandardScaler applied to numeric features for
normalization.
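The preprocessing steps above can be sketched as follows. This assumes pandas and scikit-learn; the rows are a tiny hypothetical sample that borrows column names from the dataset, and the new "brand" column name is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny hypothetical sample using column names from the CarPrice dataset.
df = pd.DataFrame({
    "CarName": ["audi 100 ls", "bmw 320i", "toyota corolla", "audi 5000"],
    "fueltype": ["gas", "gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "std", "turbo"],
    "horsepower": [102, 101, 56, 110],
    "enginesize": [109, 108, 110, 131],
})

# Brand extraction: keep the text before the first space in CarName.
# The column name "brand" is an assumption for this sketch.
df["brand"] = df["CarName"].str.split(" ").str[0]

# Label encoding: map each category to an integer code, one encoder per column.
for col in ["fueltype", "aspiration", "brand"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Feature scaling: standardize numeric features to zero mean and unit variance.
numeric_cols = ["horsepower", "enginesize"]
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df.drop(columns="CarName"))
```

Label encoding assigns codes in sorted order of the category names, so the integers carry no meaning beyond identifying the category.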
CHAPTER 4
Exploratory Data Analysis (EDA)
EDA uncovered strong correlations between car price and several features, notably
engine size, horsepower, and curb weight.
Visualizations such as scatter plots, box plots, and heatmaps were employed to
detect trends and correlations.
Key Observations:
• Sedan and hatchback vehicles are lower-priced
• Luxury brands exhibit considerably higher prices
• High multicollinearity detected between some numeric features
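A correlation matrix is one way to surface both the price correlations and the multicollinearity noted above. The sketch below assumes pandas and NumPy and uses synthetic stand-in data whose relationships are built in by construction; the 0.9 cutoff for flagging a feature pair is a hypothetical threshold.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the CarPrice dataset); column names are borrowed
# from the dataset, but every value and relationship here is made up.
rng = np.random.default_rng(1)
enginesize = rng.normal(130, 40, size=200)
df = pd.DataFrame({
    "enginesize": enginesize,
    # Built to track enginesize closely, which creates multicollinearity.
    "horsepower": 0.8 * enginesize + rng.normal(0, 5, size=200),
    # Unrelated to the other columns by construction.
    "carheight": rng.normal(54, 2, size=200),
})
df["price"] = 100 * df["enginesize"] + rng.normal(0, 1500, size=200)

# Pearson correlation matrix of all numeric columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)

# Flag feature pairs whose absolute correlation exceeds the hypothetical cutoff.
features = corr.drop(index="price", columns="price")
high_pairs = [
    (a, b, round(features.loc[a, b], 3))
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > 0.9
]

print(price_corr)
print(high_pairs)
```

The same `corr` matrix is what a heatmap would visualize; the pair list is just a textual summary of its largest off-diagonal entries.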
CHAPTER 5
MODEL SELECTION AND TRAINING
We chose Linear Regression as the starting model because it is interpretable and
well suited to predicting a continuous target variable.
Training Steps
• Data split: 80% for training, 20% for testing
• Used LinearRegression() from scikit-learn
• Trained on several predictors, both numerical and encoded categorical
variables
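The training steps above might be sketched as follows, assuming scikit-learn and pandas. The data are synthetic stand-ins with numeric features and one already-encoded categorical feature; the coefficients used to generate the price and the random seed are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the CarPrice dataset).
rng = np.random.default_rng(7)
n = 205
df = pd.DataFrame({
    "enginesize": rng.normal(130, 40, size=n),
    "curbweight": rng.normal(2500, 500, size=n),
    "fueltype": rng.integers(0, 2, size=n),  # already label-encoded: 0 or 1
})
# Hypothetical price rule used only to generate a target for this sketch.
df["price"] = (
    90 * df["enginesize"]
    + 4 * df["curbweight"]
    + 1000 * df["fueltype"]
    + rng.normal(0, 1000, size=n)
)

X = df.drop(columns="price")
y = df["price"]

# Data split: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Fit ordinary least squares on the training portion only.
model = LinearRegression()
model.fit(X_train, y_train)

# One coefficient per predictor: the estimated price change per unit of that feature.
coefficients = dict(zip(X.columns, model.coef_))
test_r2 = model.score(X_test, y_test)

print(coefficients)
print("Test R² score:", round(test_r2, 3))
```

Inspecting `coefficients` is what gives Linear Regression its interpretability: each value is read directly as an effect on price.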
CHAPTER 6
MODEL EVALUATION
The Linear Regression model was tested with:
• R² Score: ~0.85 (meaning that the model explains 85% of price variation)
• Mean Squared Error (MSE): Low, affirming good prediction quality
The model predicts car prices well for a baseline model. However, performance
may be improved through regularization or more advanced algorithms.
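As one illustration of the regularization mentioned above, the sketch below compares plain Linear Regression with Ridge regression under cross-validation. Ridge is a swapped-in example technique, not a method the project reports using; the data are synthetic, and alpha=1.0 and the 5 folds are arbitrary assumptions. Whether regularization helps on the real dataset would have to be measured.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the CarPrice dataset): two nearly collinear
# features plus one independent feature.
rng = np.random.default_rng(3)
n = 205
base = rng.normal(size=n)
X = np.column_stack([
    base,
    base + rng.normal(scale=0.05, size=n),
    rng.normal(size=n),
])
y = 4.0 * base + 1.5 * X[:, 2] + rng.normal(scale=1.0, size=n)

# Both pipelines standardize the features before fitting.
plain = make_pipeline(StandardScaler(), LinearRegression())
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # alpha is an arbitrary choice

# Mean R² over 5 cross-validation folds for each model.
plain_r2 = cross_val_score(plain, X, y, cv=5, scoring="r2").mean()
ridge_r2 = cross_val_score(ridge, X, y, cv=5, scoring="r2").mean()

# Ridge shrinks the coefficients, which stabilizes them when features are collinear.
plain_norm = np.linalg.norm(plain.fit(X, y)[-1].coef_)
ridge_norm = np.linalg.norm(ridge.fit(X, y)[-1].coef_)

print(f"Linear Regression mean R²: {plain_r2:.3f}, coefficient norm: {plain_norm:.3f}")
print(f"Ridge mean R²: {ridge_r2:.3f}, coefficient norm: {ridge_norm:.3f}")
```

On data like this the two R² scores are usually close; the main visible effect of Ridge is the smaller coefficient norm.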
CHAPTER 7
CONCLUSION
The purpose of this project was to create a model that predicts car prices using
a Linear Regression algorithm. The model was built on the CarPrice dataset,
which contains technical, numerical, and categorical
characteristics of several car models. After data preprocessing, exploratory
analysis, and model training, the Linear Regression model performed well,
with an R² score of about 0.85, indicating strong predictive power.
REFERENCES
• Kaggle: Car Price Prediction datasets and notebooks (community-contributed
data and notebooks exploring various car price prediction models).
• W3Schools.com
• Google Colab notebook:
https://colab.research.google.com/drive/1LG3g0_eGLnFy3zJF9x--5lv12iE7-9Dn?usp=sharing