Final PHPU
Bachelor of Technology
in
SANDIP UNIVERSITY
NeelamVidyaVihar, Sijoul, Mailam, Madhubani, BIHAR – 847235
Website: http://www.sandipuniversity.edu.in Email: info@sandipuniversity.edu.in
Session: 2021-2025
==================================================
CERTIFICATE
This is to certify that the Project entitled “Predicting House Prices Using Decision Tree”
submitted by
Date:
Place: Sijoul
Acknowledgements
We feel very grateful to present this project report on “Predicting House Prices Using
Decision Tree”. Working on this project was a very rewarding experience.
First of all, we would like to thank Dr. Shambhu Kumar Singh, Dean of SOCSE and our project
coordinator, for his invaluable guidance, support, and feedback throughout this project. His
expertise was instrumental in improving the quality of our research and report.
We would also like to extend our gratitude to the respected Head of the Department, Computer
Science and Engineering, Prof. Aishwarya Shekhar; Dean Academics, SOCSE,
Dr. Shambhu Kumar Singh; and Hon’ble Vice Chancellor, Dr. Samir Kumar Varma, for
providing us with all the facilities that were required.
Lastly, we would also like to thank all the faculty members of the CSE department, our friends,
and all those people who have helped us throughout this project.
Date:
Place:
INDEX
1 List of Figures
3 List of Acronyms
4 Abstract
5 Chapter-1: Introduction
9 Chapter-5: Design
11 References
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
Abstract
The Indian real estate market is characterized by its dynamic nature, influenced by factors such as
geographic location, infrastructure development, population growth, and proximity to prestigious
educational institutions like IITs and NITs. Real estate investments in India often fluctuate due to
variations in property features such as square footage, age, lot size, and market demand. In recent
years, technological advancements and data-driven techniques have played a vital role in
predicting property prices, thereby empowering stakeholders to make informed decisions.
This study focuses on predicting house prices using a Decision Tree model, a widely adopted
machine learning technique that excels in handling large datasets and identifying non-linear
relationships among variables. By analyzing historical housing data, including sale price, square
footage, property age, and proximity to major institutions, this research provides actionable
insights into the Indian real estate market. The integration of localized data, such as properties
near IITs and NITs, brings relevance to the Indian context, allowing for tailored strategies and
market-specific analysis.
The main objectives of this study are:
1. To understand the key factors influencing property prices in various regions across
India.
2. To develop a robust predictive model that estimates property prices with high accuracy.
3. To provide valuable insights for buyers, sellers, investors, and developers aiming to
optimize decision-making in real estate transactions.
The data used in this study comprises property records across multiple regions, with key features
such as number of bedrooms, number of bathrooms, square footage of homes and lots, property
age, and location attributes. A special focus is placed on properties located near educational hubs
like IIT Bombay, IIT Kanpur, NIT Trichy, and others, as these institutions often serve as drivers
for regional real estate growth. This localized approach ensures that the predictive model reflects
the real-world dynamics of Indian housing markets.
A comprehensive Exploratory Data Analysis (EDA) was conducted to evaluate the distribution
and relationships among key variables. Univariate analysis revealed the concentration of
properties with specific features, such as houses with 3-4 bedrooms and 2-3 bathrooms, which are
most prevalent in the dataset. The median house prices varied significantly across regions, with
properties near prestigious institutions commanding higher valuations due to increased demand
and perceived value. Multivariate analysis demonstrated strong correlations between house prices
and features like home square footage, lot size, and the number of amenities.
To ensure accuracy and efficiency, the Decision Tree model was chosen for its interpretability
and ability to identify hierarchical relationships within the data. The model was developed
iteratively using training and testing datasets, with appropriate preprocessing techniques like
feature normalization and categorical encoding applied. The performance of the model was
evaluated using metrics such as R-squared (R²) scores and feature importance rankings. The final
model achieved a balanced accuracy of 70% on the testing dataset, demonstrating its reliability
in predicting property prices.
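The preprocessing and evaluation steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's actual pipeline: the column names (`sqft_home`, `sqft_lot`, `age_years`, `city`), the price formula, and the tree depth are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; every column name and coefficient here is hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "sqft_home": rng.uniform(500, 4000, n),
    "sqft_lot": rng.uniform(800, 10000, n),
    "age_years": rng.integers(0, 50, n),
    "city": rng.choice(["Mumbai", "Kanpur", "Tiruchirappalli"], n),
})
city_premium = df["city"].map({"Mumbai": 60.0, "Kanpur": 20.0, "Tiruchirappalli": 10.0})
df["price_lakh"] = (0.05 * df["sqft_home"] + 0.002 * df["sqft_lot"]
                    - 0.3 * df["age_years"] + city_premium + rng.normal(0, 5, n))

numeric = ["sqft_home", "sqft_lot", "age_years"]
categorical = ["city"]

# Normalize numeric features and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(max_depth=6, random_state=0)),
])

# Hold out 20% of the records for testing, then score with R^2.
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["price_lakh"], test_size=0.2, random_state=0)
model.fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"Test R^2: {r2:.2f}")
```

Decision trees split on thresholds, so scaling does not change their predictions; the scaler is included only to mirror the normalization step mentioned in the text.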
The key findings of the analysis are:
1. Home Size Matters: Larger homes (higher square footage) consistently drive higher sale
prices, reflecting buyer preferences for spacious living areas.
2. Location Drives Value: Proximity to IITs, NITs, and other key infrastructure
significantly increases property prices, highlighting the importance of location-based
investments.
3. Property Age: While newer properties are preferred, well-maintained older properties in
prime locations still command competitive prices.
4. Lot Size and Amenities: Buyers value outdoor spaces and additional amenities,
influencing overall property valuations.
The Decision Tree model provides a transparent and actionable framework for understanding
real estate trends in India. For buyers, this research helps identify regions and property features
that offer the best value. For sellers, it highlights the importance of optimizing property features
to maximize sale prices. Developers and investors can leverage these insights to focus on high-
demand regions, particularly near educational and infrastructural hubs, to achieve higher returns
on investment.
In conclusion, this study bridges the gap between data science and the Indian real estate market by
developing a robust predictive model that aligns with the local context. The integration of
machine learning techniques with region-specific data ensures accurate price predictions, enabling
stakeholders to navigate the complex real estate landscape effectively. Future research can
enhance this model by incorporating additional dynamic factors, such as economic conditions,
interest rates, and supply-demand trends, to further improve prediction accuracy and market
insights.
Chapter-1:
Introduction
The Indian real estate market plays a pivotal role in the economy and societal development. With
urbanization and infrastructure growth, the need for accurate real estate valuation has never been
more critical. Predicting house prices with high precision enables stakeholders, including buyers,
sellers, developers, and policymakers, to make well-informed decisions. This study focuses on
predicting house prices using Decision Tree models, emphasizing regional differences and the
influence of proximity to premier institutions like IITs and NITs. Key factors such as property
size, age, and location are meticulously analyzed to decode their impact on pricing trends.
This project explores the application of decision tree models for predicting house prices in Indian
cities. With a dataset of 10,659 property records containing detailed attributes like price, size, and
property characteristics, the goal is to understand and analyze the relationship between these
variables and the resulting property prices. Decision tree models were employed due to their
interpretability and ability to handle non-linear relationships effectively.
The primary objective of this project is to build a predictive model to estimate house prices based
on critical property attributes, such as home and lot square footage, number of bedrooms and
bathrooms, property age, and location.
The decision tree model provides an accurate and interpretable prediction tool, enabling buyers,
sellers, and developers to make informed decisions. This project addresses the limitations of
manual and simplistic approaches currently used in real estate analysis.
Furthermore, this project assists various stakeholders, such as buyers, sellers, investors, and
developers.
In the existing systems for real estate pricing and analysis, the following shortcomings are
evident:
Lack of Automation: Many real estate reports are generated manually or using
spreadsheets.
Limited Features: Existing methods do not incorporate multiple variables such as lot
size, property age, and location simultaneously.
Inconsistent Accuracy: Simple linear models fail to capture non-linear relationships
between variables and house prices.
Real estate professionals currently rely on manual data entry and interpretation, making the
process time-consuming and prone to human errors.
The need for a predictive system arises from the limitations of existing methods, which include:
Data Management Challenges: Current systems use spreadsheets that lack centralized
storage and robust data management capabilities.
Manual Intervention: Data manipulation in spreadsheets can lead to errors and
inconsistencies.
No Cross-Verification: Reports generated manually cannot be validated easily, leading to
unreliable insights.
To overcome these challenges, a predictive system based on decision tree models is proposed.
The new system offers the following advantages:
Real-Time Analysis: Provides dynamic and real-time predictions for improved decision-
making.
This project integrates a robust and scalable solution to address the needs of India’s growing real
estate market.
In software development, selecting an appropriate process model is crucial for project success.
For this project, the Spiral Model has been chosen due to its iterative nature and flexibility.
The spiral model is an evolutionary software process model that combines the iterative approach
of prototyping with the systematic methodology of the sequential linear model. It enables
incremental releases of the software, ensuring continuous improvement and evaluation.
The iterative nature of the spiral model ensures that the predictive model is developed, tested, and
validated at each step, leading to improved performance and accuracy.
The Spiral Model’s structured yet flexible approach makes it ideal for developing the predictive
system for house prices, ensuring robustness, accuracy, and user satisfaction.
Chapter-2: Project Management
Efficient project management was crucial in ensuring the success of this research. The study
involved systematic data collection and analysis. Key steps included:
1. “Identifying Reliable Data Sources”: Data was sourced from Indian real estate
platforms and government housing records to ensure reliability.
2. “Data Cleaning and Preprocessing”: Missing and redundant information was handled
carefully to maintain data integrity.
3. “Algorithm Implementation”: The Decision Tree algorithm was chosen for its
interpretability and robustness in capturing non-linear relationships.
4. “Model Optimization”: The model parameters were fine-tuned using grid search to
achieve balanced performance.
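The grid-search step in item 4 could look roughly like the sketch below. The parameter grid, the synthetic data, and the choice of 5-fold cross-validation are assumptions for illustration; the report does not state which values were actually searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the housing records (hypothetical).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * X[:, 0] + 30 * np.sin(6 * X[:, 1]) + rng.normal(0, 5, 300)

# Candidate hyperparameters; these particular values are illustrative only.
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 20],
}

# Every combination is scored by 5-fold cross-validated R^2.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=5,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated R^2:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refit on all the data with the selected parameters.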
Effective project management is crucial to ensure that the development and implementation of the
software achieve its objectives within the given constraints of time, cost, and resources. This
chapter presents the project process, resource planning, and the approach adopted to manage the
project efficiently. Project management primarily focuses on the three P's—People, Problem,
and Process—and also incorporates cost estimation for the overall development.
2.1 People
The success of any project depends on the people involved. Proper management of human
resources ensures task delegation, motivation, and collaboration among team members to achieve
project goals. In this project, a team of four members collaborated to develop the software with
well-defined roles:
Md Hifzullah: Conducted the analysis phase and handled the coding aspects of the project.
Kundan Kumar: Took charge of the design phase, focusing on the architecture and interface, and
also contributed to coding.
Sandhya Kumari & Manisha Kumari: Managed the testing phase to ensure the software was bug-
free and functionally sound.
Additionally, the people factor extends to other key stakeholders, such as users (buyers, sellers,
and investors) who interact with the system to enter property details and view price predictions
and reports.
Teamwork, effective communication, and task division enabled smooth progress, ensuring all
project deliverables were met on schedule.
2.2 Problem
The primary focus of this project is to address inefficiencies in the existing manual real estate
valuation methods and develop an automated, data-driven prediction system. The major objectives
and scope of the project include:
Objective: To automate house price estimation from property attributes, generate prediction
reports, and reduce the manual workload of real estate professionals.
Scope: The software aims to simplify housing data storage, retrieval, and reporting processes. It
also improves decision-making efficiency by centralizing all operations into a single, user-friendly
platform.
1. Confidential Data Handling: The system is designed for the exclusive use of the organization,
ensuring data privacy and security.
2. User Adaptability: While initial user training may be required, the system's intuitive design
ensures ease of use over time.
3. Report Management: The software generates accurate and timely reports needed for decision-
making and analysis.
The technical foundation of the system uses Python with Scikit-learn for model development and
Oracle for the database, enabling robust, secure, and efficient handling of data.
2.3 Process
The Spiral Model was particularly useful for this project due to its flexibility in accommodating
changing requirements and its emphasis on regular evaluation and feedback. The iterative nature
of the model allowed incremental development, ensuring the project remained on track.
Cost estimation is an integral part of project management as it helps allocate resources efficiently
and plan project expenditures. For this project, the COCOMO (COnstructive COst MOdel) was
used to calculate the development cost.
COCOMO Calculation:
Cost Breakdown:
Final Cost:
Thus, the total estimated cost for developing this web-based application is Rs. 1,12,396.03.
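For reference, the Basic COCOMO formulas can be sketched as below. The coefficients are the standard Basic COCOMO values; the project size (`kloc`) and the monthly salary are hypothetical placeholders, so this sketch does not reproduce the figure reported above.

```python
from typing import Tuple

# Basic COCOMO coefficients (a, b, c, d) per project mode:
#   effort (person-months) = a * KLOC ** b
#   duration (months)      = c * effort ** d
COEFFICIENTS = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


def basic_cocomo(kloc: float, mode: str = "organic") -> Tuple[float, float]:
    """Return (effort in person-months, duration in months)."""
    a, b, c, d = COEFFICIENTS[mode]
    effort = a * kloc ** b
    duration = c * effort ** d
    return effort, duration


kloc = 2.0                  # hypothetical size: 2,000 lines of code
monthly_salary_rs = 20000   # hypothetical cost per person-month

effort, duration = basic_cocomo(kloc, "organic")
cost_rs = effort * monthly_salary_rs

print(f"Effort:   {effort:.2f} person-months")
print(f"Duration: {duration:.2f} months")
print(f"Cost:     Rs. {cost_rs:,.2f}")
```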
Conclusion
The project management process ensures effective resource allocation, problem identification,
and systematic development of the software. By adopting the Spiral Model and estimating costs
using the COCOMO approach, this project addresses stakeholder needs and provides a scalable
solution for estimating house prices efficiently.
Chapter-3:
Project Planning
The scope of the House Price Prediction System revolves around streamlining the process of
property price estimation by leveraging a machine learning model. The software provides a
dynamic platform for analyzing housing attributes and predicting prices based on significant
influencing factors.
Housing Attributes Database: Collects and stores details such as home size, lot size, property
age, number of bedrooms, bathrooms, and amenities.
Regional Analysis: City-wise analysis of property prices with emphasis on regions near IITs and
NITs.
Interactive Reports: Provides consolidated price prediction reports for sellers, buyers, and
investors.
Visualization Tools: Generates graphs and charts for easier understanding of market trends and
property evaluations.
User-Friendly Interface: Displays location-based property summaries with details such as average
price, size, and features.
The system ensures accurate price predictions, reduces manual estimation errors, and allows
stakeholders to make well-informed decisions based on concrete data-driven insights.
Before initiating the development process, a feasibility analysis was performed to evaluate the
project’s practicality. This analysis determines if the proposed system can meet the objectives
within defined constraints. The feasibility study includes:
1. Operational Feasibility: The system eliminates the manual effort of real estate analysis,
providing fast and precise house price predictions. Data collection, preprocessing, and
prediction tasks are automated, enhancing accuracy and efficiency.
2. Technical Feasibility: The system uses the proven combination of Python (for model
development), libraries such as Scikit-learn, and tools like Oracle for database storage.
The Decision Tree model's recursive splitting of the data supports robust analysis with
good prediction accuracy.
3. Economic Feasibility: The system reduces reliance on expensive manual appraisal
processes. Automated price predictions save time, reduce human error, and improve the
quality of decision-making, resulting in a cost-effective solution for stakeholders.
4. Financial Feasibility: Development costs include personnel, software tools, and
infrastructure. The benefit-to-cost ratio demonstrates that the investment in the project will
yield significant returns by improving decision-making efficiency.
5. Resource Feasibility: Existing computational tools and hardware resources are sufficient
to implement the model. No additional resources are required for the system’s
deployment.
Risk analysis is crucial for anticipating challenges and formulating mitigation strategies.
Identified risks include:
1. Data Overload: Larger datasets may increase prediction processing times. Optimized algorithms
and hardware upgrades mitigate this issue.
2. Inconsistent Data Quality: Missing or incorrect data can impact model accuracy. Data cleaning
and preprocessing techniques ensure high-quality inputs.
3. Model Overfitting: Decision Tree models risk overfitting. Cross-validation and hyperparameter
tuning reduce this risk.
4. Technological Challenges: Compatibility issues between tools are managed by using widely
compatible platforms like Python and Oracle.
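The overfitting risk in item 3 can be checked with cross-validation, as in the sketch below. It compares training and validation R² for an unrestricted tree and a depth-limited one. The data is synthetic and the restriction values (`max_depth=4`, `min_samples_leaf=10`) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (hypothetical); the noise is what an overfit tree memorizes.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(400, 4))
y = 80 * X[:, 0] + 40 * (X[:, 1] > 0.5) + rng.normal(0, 10, 400)

configs = {
    "unrestricted": {},
    "restricted": {"max_depth": 4, "min_samples_leaf": 10},
}

results = {}
for name, params in configs.items():
    cv = cross_validate(
        DecisionTreeRegressor(random_state=0, **params),
        X, y, cv=5, scoring="r2", return_train_score=True,
    )
    train_r2 = cv["train_score"].mean()
    valid_r2 = cv["test_score"].mean()
    results[name] = (train_r2, valid_r2)
    # A large train/validation gap is the usual symptom of overfitting.
    print(f"{name:>12}: train R^2={train_r2:.3f}  validation R^2={valid_r2:.3f}  "
          f"gap={train_r2 - valid_r2:.3f}")
```

An unrestricted tree fits the training folds almost perfectly, so its gap to the validation score is typically much wider than that of the restricted tree.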
Project scheduling involves breaking down tasks into manageable components and assigning
timelines. The project follows a macroscopic and detailed schedule that includes key phases:
Quality assurance ensures the accuracy and reliability of the developed system:
Validation: Regular testing of the Decision Tree model using training and testing datasets.
Verification: Code reviews and performance validation to ensure model accuracy.
User Satisfaction: The system is user-friendly, accurate, and efficient in predicting house prices.
Dataset Overview
Data Preprocessing
Requirement Analysis Specification involves understanding and defining the project's scope,
functionality, and requirements to meet its objectives effectively. A comprehensive analysis
ensures that all critical components of the House Price Prediction System are identified,
evaluated, and modeled systematically.
Requirement Analysis bridges the gap between real estate requirements and technical
implementation. It enables the system engineer to specify the software’s functionality,
performance, and constraints to meet project goals.
Defining core functionalities (house price predictions, user interactions, and reporting).
Representing system behavior under different scenarios.
Developing models that progressively uncover system details.
Moving from high-level specifications to implementation-ready details.
The primary challenge lies in the lack of automated and accurate house price prediction tools
tailored to the Indian market. Existing manual processes are time-consuming and error-prone.
Problems include:
By implementing this predictive system, stakeholders will gain access to precise pricing data and
trend reports, eliminating manual inefficiencies and guesswork.
The evaluation process involves analyzing data flow, identifying functional components, and
synthesizing the system's solution. Key tasks include:
Identifying housing attributes (e.g., square footage, lot size, and age).
Defining system behavior under changing market conditions.
Establishing user-friendly interfaces for stakeholders.
Creating reports for trend analysis, predictions, and insights.
These efforts collectively ensure a clear understanding of the problem and an effective solution
design.
During requirement modeling, system functionalities and behaviors are represented using models
to understand data flow, functional processes, and operational behavior.
The system processes housing data to generate predictions. Core functions include:
Input: Users input housing attributes (e.g., square footage, location, and amenities).
Processing: The Decision Tree model processes the data and analyzes relevant patterns.
Output: The system outputs price predictions, trend analyses, and graphical reports.
Over iterations, detailed models for each function will refine the system's functionality.
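The input, processing, and output flow above can be sketched as a single function. The feature names, the tiny training set, and the `predict_price` helper are all hypothetical and exist only to illustrate the flow.

```python
from sklearn.tree import DecisionTreeRegressor

FEATURES = ["sqft_home", "bedrooms", "age_years"]  # hypothetical attribute names

# Tiny made-up training set: [sqft_home, bedrooms, age_years] -> price in lakh.
train_rows = [
    [800, 2, 20], [1000, 2, 10], [1200, 3, 15], [1500, 3, 5],
    [1800, 3, 8], [2200, 4, 2], [2600, 4, 12], [3000, 5, 1],
]
train_prices_lakh = [40, 55, 62, 85, 95, 130, 140, 180]

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(train_rows, train_prices_lakh)


def predict_price(attributes: dict) -> dict:
    """Input: housing attributes. Processing: tree prediction. Output: report."""
    missing = [name for name in FEATURES if name not in attributes]
    if missing:
        raise ValueError(f"Missing attributes: {missing}")
    row = [[float(attributes[name]) for name in FEATURES]]
    price = float(model.predict(row)[0])
    return {"input": dict(attributes), "predicted_price_lakh": round(price, 2)}


report = predict_price({"sqft_home": 1600, "bedrooms": 3, "age_years": 6})
print(report)
```

A regression tree predicts the mean of the training targets in the matching leaf, so the output always lies within the range of prices seen during training.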
Behavioral models define the system’s responses to external triggers. In this project:
The Requirement Specification outlines the system's hardware, software, and functional needs:
4.3.4 Advantages
4.3.5 Disadvantages
Chapter-5: Design
- Most properties were priced between Rs. 50 lakh and Rs. 2 crore, with higher
prices observed near IIT Kanpur and IIT Bombay.
- Square footage emerged as a critical determinant, with larger homes commanding
significantly higher prices.
- Proximity to renowned educational institutions like IITs and NITs influenced
property prices, reflecting their demand-driven markets.
Model Development
The Decision Tree model was developed and refined through iterative testing:
Feature Importance
The following features emerged as the most critical predictors:
1. “Home Square Footage”: Larger homes were strongly associated with higher
prices.
2. “Proximity to IIT/NIT Campuses”: Locations near prestigious institutions
consistently fetched premium prices.
3. “Lot Size”: Properties with larger outdoor spaces attracted higher valuations.
4. “Age of Property”: While older properties had mixed effects, their condition and
location often compensated for their age.
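A ranking like the one above can be read from a fitted tree's `feature_importances_` attribute. The sketch below uses synthetic data in which home size is deliberately the dominant driver; the feature names and coefficients are assumptions, not the study's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical feature names mirroring the predictors listed above.
feature_names = ["sqft_home", "near_iit_nit", "sqft_lot", "age_years"]

rng = np.random.default_rng(3)
n = 600
X = np.column_stack([
    rng.uniform(500, 4000, n),    # home square footage
    rng.integers(0, 2, n),        # 1 if near an IIT/NIT campus, else 0
    rng.uniform(800, 10000, n),   # lot size
    rng.uniform(0, 50, n),        # property age in years
])
# Synthetic price (lakh); coefficients are made up for the illustration.
y = (0.05 * X[:, 0] + 30 * X[:, 1] + 0.003 * X[:, 2]
     - 0.2 * X[:, 3] + rng.normal(0, 5, n))

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, y)

# Importances are impurity reductions per feature, normalized to sum to 1.
ranking = sorted(zip(feature_names, tree.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:>14}: {score:.3f}")
```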
Data design is the foundational activity in software engineering that focuses on defining the
structure of data used in the system. In the House Price Prediction System, proper organization
of housing-related data is crucial for ensuring high-quality predictions and system performance.
The data design emphasizes efficient data storage, retrieval, and integration with the Decision
Tree model.
Property Details Table: Stores information such as property ID, square footage, lot size,
number of bedrooms, bathrooms, property age, and location.
User Table: Maintains user credentials, including user ID, login credentials, and roles
(buyer, seller, investor).
Prediction Results Table: Stores historical and real-time house price predictions
generated by the Decision Tree model.
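The three tables described above could be laid out roughly as follows. The report names Oracle as the database; this sketch swaps in Python's built-in `sqlite3` module so that it runs without a server. All table names, column names, and types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for illustration only
conn.executescript("""
CREATE TABLE property_details (
    property_id INTEGER PRIMARY KEY,
    sqft_home   REAL NOT NULL,
    sqft_lot    REAL,
    bedrooms    INTEGER,
    bathrooms   INTEGER,
    age_years   INTEGER,
    location    TEXT
);
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    login         TEXT UNIQUE NOT NULL,
    password_hash TEXT NOT NULL,
    role          TEXT CHECK (role IN ('buyer', 'seller', 'investor'))
);
CREATE TABLE prediction_results (
    prediction_id        INTEGER PRIMARY KEY,
    property_id          INTEGER NOT NULL REFERENCES property_details(property_id),
    user_id              INTEGER REFERENCES users(user_id),
    predicted_price_lakh REAL NOT NULL,
    predicted_at         TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# Made-up sample rows.
conn.execute("INSERT INTO property_details VALUES (1, 1500, 2400, 3, 2, 5, 'Kanpur')")
conn.execute("INSERT INTO users VALUES (1, 'demo_buyer', 'not-a-real-hash', 'buyer')")
conn.execute(
    "INSERT INTO prediction_results (property_id, user_id, predicted_price_lakh) "
    "VALUES (1, 1, 85.0)"
)

# Join a stored prediction back to its property and requesting user.
rows = conn.execute("""
    SELECT p.location, u.role, r.predicted_price_lakh
    FROM prediction_results AS r
    JOIN property_details AS p ON p.property_id = r.property_id
    JOIN users AS u ON u.user_id = r.user_id
""").fetchall()
print(rows)
```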
The Class Diagram represents the logical structure of the system by detailing its classes,
attributes, and methods. Key classes include:
Property Class: Handles property attributes such as square footage, lot size, and age.
User Class: Manages user interactions with the system.
Prediction Class: Implements the Decision Tree model and returns price predictions.
The Architectural Design defines the system’s overall modular structure, representing control
flow and data interactions.
The ERD showcases relationships among tables in the database, ensuring effective data
management.
The Data Flow Diagram represents the flow of information within the system.
DFD Level 0: Depicts the entire system as a single process that takes user input and
produces price predictions.
DFD Level 1: Breaks down processes into sub-processes, such as data input, prediction
analysis, and report generation.
Figure 5.8: DFD Level 1
The Interface Design ensures seamless interaction between the system and its users. Key
interfaces include:
Use Case Diagram: Illustrates user interactions with the system (e.g., input property
details, view predictions).
State Diagram: Represents system behavior based on events such as data input,
prediction processing, and result display.
This research highlights the factors influencing house prices in the Indian real
estate market. Proximity to educational hubs like IITs and NITs, combined with
property size and amenities, significantly affects pricing dynamics. The Decision
Tree model proved effective in identifying these relationships, with its insights
applicable for real-world decision-making.
1. “Expand the Dataset”: Incorporate data from more cities and rural areas to cover
diverse market segments.
2. “Dynamic Market Factors”: Account for variables like demand-supply
dynamics, interest rates, and regional economic policies.
3. “Enhanced Algorithms”: Experiment with ensemble models like Random Forest
and Gradient Boosting for improved prediction accuracy.
4. “Integrate Real-Time Data”: Leverage APIs for real-time housing market trends
to make the model adaptable to market fluctuations.
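The ensemble comparison suggested in item 3 could start from a sketch like the one below, which scores a single tree, a Random Forest, and Gradient Boosting with the same cross-validation. The data is synthetic and the models use mostly default settings, so the scores say nothing about the study's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with an interaction term (hypothetical stand-in for housing data).
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(400, 4))
y = 100 * X[:, 0] + 50 * X[:, 1] * X[:, 2] + rng.normal(0, 8, 400)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:>18}: mean CV R^2 = {score:.3f}")
```

Ensembles average or sequentially correct many trees, which usually lowers variance compared with a single unpruned tree, at the cost of some interpretability.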
The findings of this study provide a robust foundation for stakeholders to navigate
the Indian real estate sector strategically.
References
- https://www.kaggle.com/code/subhradeep88/house-price-predict-decision-tree-random-forest
- https://github.com/srimallipudi/House-Price-Prediction-Using-Decision-Tree-ML-Algorithm
- https://ieeexplore.ieee.org/document/9731083
- https://www.researchgate.net/publication/238398459_Determinants_of_House_Price_A_Decision_Tree_Approach
- https://www.researchgate.net/publication/350430324_House_Price_Prediction_Using_Machine_Learning_Algorithm
- https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3565512
- https://chatgpt.com/share/67611640-0528-800f-867e-6db0d1c3262b
- https://nevonprojects.com/predicting-house-price-using-decision-tree/
- https://chatgpt.com/share/67619c73-9180-800a-b492-e78b701ec15c
- https://emhaihsan.medium.com/house-price-prediction-with-decision-tree-regressor-9728064de7da
- https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data