Ravi Internship Report
INTERNSHIP REPORT
“MACHINE LEARNING MODEL FOR PREDICTING
CHRONIC DISEASE”
Conducted at
COMPANY NAME: VARCONS TECHNOLOGIES PVT LTD.
RAJARAJESWARI COLLEGE OF
ENGINEERING
MYSORE ROAD, BANGALORE-560074
(An ISO 9001:2008 Certified Institute)
(2023-24)
CERTIFICATE
This is to certify that the Internship titled “Machine Learning Model for predicting the
risks of chronic diseases” was carried out by Mr. Ravi Kumar U, a bonafide student of
Rajarajeswari College of Engineering, in partial fulfillment for the award of Bachelor of
Engineering in Artificial Intelligence and Machine Learning under Visvesvaraya
Technological University, Belagavi, during the year 2023-2024. It is certified that all
corrections/suggestions indicated have been incorporated in the report.
The project report has been approved as it satisfies the academic requirements in respect
of Internship prescribed for the course Internship / Professional Practice (18AII85).
External Viva:
1)
2)
Date : 21/09/2023
Place :Bengaluru
USN : 1RR20AI022
NAME : Ravi Kumar U
Dear Student,
We would like to congratulate you on being selected for the Machine Learning With Python
(Research Based) Internship position with Varcons Technologies, effective from the start date of
11th August, 2023. All of us are excited about this opportunity provided to you!
This internship is viewed as being an educational opportunity for you, rather than a part-time job. As such,
your internship will include training/orientation and focus primarily on learning and developing new skills
and gaining a deeper understanding of concepts of Machine Learning With Python (Research
Based) through hands-on application of the knowledge you learn while you train with the senior
developers. You will be bound to follow the rules and regulations of the company during your internship
duration.
Sincerely,
Spoorthi H C
Director
Varcons Technologies
213, 2nd Floor,
18 M G Road, Ulsoor,
Bangalore-560001
We express our sincere thanks to our Principal, Dr. R Balakrishna for providing us adequate
facilities to undertake this Internship.
We would like to thank our Head of Dept – AIML, Dr. Rajesh K S for providing us an
opportunity to carry out Internship and for his valuable guidance and support.
We would like to thank our Professor’s Software Services for guiding us during the period of
internship.
We express our deep and profound gratitude to our guide, Guide name, Assistant/Associate
Prof, for her keen interest and encouragement at every step in completing the Internship.
We would like to thank all the faculty members of our department for the support extended
during the course of Internship.
We would like to thank the non-teaching members of our department for helping us during the
Internship.
Last but not least, we would like to thank our parents and friends, without whose constant
help the completion of the Internship would not have been possible.
NAME:Ravi Kumar U
USN:1RR20AI022
Healthcare is one of the most essential areas of research in the
modern period, thanks to rapid changes in technology and data. It is difficult
to keep track of a large number of patient records.
Big Data Analytics allows us to manage this content. Various
ailments can be treated in a variety of ways throughout the world.
Machine Learning is one of the approaches for disease prediction and
diagnosis. This study shows how symptoms can be used as input
parameters in machine learning to produce disease predictions.
Sl no   Description
1       Company Profile
3       Task Performed
4       Introduction
5       System Analysis
6       Requirement Analysis
7       Design Analysis
8       Implementation
9       Snapshots
10      Conclusion
11      References
They understand that the best output can be achieved only by
understanding the client's demands better. They work with their
clients and help them define their exact solution requirements. Sometimes
clients find that they have completely redefined their solution or new
application requirement during the brainstorming session, and here the company
positions itself as an IT solutions consulting group comprising high-caliber
consultants.
They believe that technology, when used properly, can help any business to
scale and achieve new heights of success. It helps improve efficiency,
profitability, and reliability; to put it in one sentence, “Technology helps you to
delight your customers”, and that is what they want to achieve.
We are a technology organization providing solutions for web design and development,
researching and publishing papers to ensure the quality of the most widely used ML models, MySQL,
Python programming, HTML, CSS, ASP.NET, and LINQ. To meet ever-increasing
automation requirements, Compsoft Technologies specializes in ERP, connectivity, SEO
services, conference management, effective web promotion, and tailor-made software
products, designing solutions that best suit clients' requirements. The organization has
the right mix of professionals as stakeholders to help us serve our clients to the best of
our capability and on par with industry standards. It has young, enthusiastic, passionate,
and creative professionals who develop technological innovations in the fields of mobile
technologies, web applications, and business and enterprise solutions. The motto of our
organization is to “Collaborate with our clients to provide them with the best technological
solution, hence creating a good present and a better future for our clients, which will bring a
cascading positive effect in their business shape as well”. Providing a complete suite of
technical solutions is not just our tagline; it is our vision for our clients and for us, and we
strive hard to achieve it.
• Python
• Selenium Testing
• Software Training
TASK PERFORMED
Date      Day         Task
30/8/23   Wednesday   Seminar on array and files, file operations topic
31/8/23   Thursday    Function, passing parameters
1/9/23    Friday      Installation of anaconda and importing some files for frontend
2/9/23    Saturday    Completion of installation assignments
6/9/23    Wednesday   Modes of files, assignments for the above
7/9/23    Thursday    Various operations and methods of file operations
8/9/23    Friday      Continuation
9/9/23    Saturday    Assignments for the above
13/9/23   Wednesday   Project allocation
14/9/23   Thursday    Project explanation
15/9/23   Friday      Working of project
INTRODUCTION
Introduction to ML
3. Features: Features are the variables or attributes in your dataset that the ML
model uses to make predictions. Effective feature selection and engineering
are crucial for model performance.
9. Hyperparameters: These are settings that are not learned from data but are
set before training. Tuning hyperparameters is essential to optimize model
performance.
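The distinction between features, learned parameters, and hyperparameters can be illustrated with a small sketch. The data values, the two feature names (age, BMI), and the candidate learning rates below are hypothetical and chosen only for illustration. The model is a plain logistic regression written with the standard library, not the model used in this project.

```python
import math

# Hypothetical, already-scaled toy data: each row holds two features
# (age, BMI); the label is 1 when the patient has the disease.
train_X = [(0.2, 0.3), (0.4, 0.5), (0.6, 0.8), (0.9, 0.7), (0.3, 0.2), (0.8, 0.9)]
train_y = [0, 0, 1, 1, 0, 1]
val_X = [(0.35, 0.4), (0.85, 0.8)]
val_y = [0, 1]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, row):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

def log_loss(weights, bias, X, y):
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for row, label in zip(X, y):
        p = min(max(predict_proba(weights, bias, row), eps), 1 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(y)

def train(X, y, learning_rate, epochs):
    # weights and bias are parameters: they are learned from the data.
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(X, y):
            error = predict_proba(weights, bias, row) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / len(y)
        weights = [w - learning_rate * g / len(y) for w, g in zip(weights, grad_w)]
    return weights, bias

# learning_rate and epochs are hyperparameters: they are fixed before
# training and tuned by comparing the loss on held-out validation data.
val_loss = {}
for lr in (0.01, 0.1, 1.0):
    weights, bias = train(train_X, train_y, learning_rate=lr, epochs=500)
    val_loss[lr] = log_loss(weights, bias, val_X, val_y)

best_lr = min(val_loss, key=val_loss.get)
print(best_lr, val_loss)
```

The learning rate with the lowest validation loss is kept. In practice the same idea is usually applied with a library routine such as scikit-learn's GridSearchCV.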
Problem Statement
SYSTEM ANALYSIS
1. Existing System
Steps Involved:
Define the Problem: Clearly define the problem you want to solve.
Specify the type of chronic disease you want to predict and the objectives
of the model.
Collect and Prepare Data: Gather relevant data sources. This data may
include medical records, patient histories, lab results, lifestyle factors,
genetic information, and more.
Preprocess the data by cleaning, handling missing values, and performing
feature engineering to extract relevant features.
Split the dataset into training, validation, and test sets to evaluate the
model's performance.
Model Training: Train the selected ML model using the training dataset.
Fine-tune hyperparameters to optimize model performance, possibly
using techniques like cross-validation.
Model Evaluation: Evaluate the model using the validation dataset and
appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score,
ROC-AUC). Adjust the model and hyperparameters based on validation
results.
Model Testing: Assess the final model's performance using the test
dataset to estimate its real-world predictive capabilities.
Interpretability and Explainability: Ensure that the model's predictions
are interpretable and explainable, especially in healthcare applications
where transparency is crucial. Use techniques such as feature importance
analysis and SHAP (SHapley Additive exPlanations) values.
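The splitting step described above can be sketched as follows. The 60/20/20 proportions, the seed, and the record layout are assumptions made for illustration, since the steps above do not fix them.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains is the test set
    return train, val, test

# Hypothetical records; a real data set would carry clinical features.
records = [{"patient": i, "label": i % 2} for i in range(50)]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

The training set is used to fit the model, the validation set to adjust hyperparameters, and the test set only once at the end, so that the final estimate is not biased by tuning.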
REQUIREMENT ANALYSIS
Designing and analyzing a project for chronic disease prediction involves several
key components, including data collection, preprocessing, model development, and
evaluation. Below is a high-level overview of the design and analysis process:
Model Selection:
• Choose appropriate machine learning algorithms or models for chronic
disease prediction. Common choices include logistic regression, decision
trees, random forests, gradient boosting, and deep learning models.
• Consider the nature of the disease prediction task (e.g., binary classification
for disease presence or regression for risk assessment).
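One common way to choose between candidate models is k-fold cross-validation. The sketch below compares two deliberately simple, hypothetical candidates (a majority-class baseline and a single-feature threshold rule) on made-up glucose readings. It illustrates the selection procedure only and does not stand in for the algorithms listed above.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

# Hypothetical data: one feature (fasting glucose, mg/dL) and a 0/1 label.
glucose = [85, 150, 92, 160, 99, 145, 88, 170, 95, 155]
label = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

def fit_majority(xs, ys):
    # Baseline: always predict the most common training label.
    majority = 1 if sum(ys) * 2 >= len(ys) else 0
    return lambda x: majority

def fit_threshold(xs, ys):
    # Predict disease when the feature exceeds the midpoint of the class means.
    mean_pos = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, sum(ys))
    mean_neg = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, len(ys) - sum(ys))
    cut = (mean_pos + mean_neg) / 2
    return lambda x: 1 if x > cut else 0

def cross_val_accuracy(fit, xs, ys, k=5):
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(model(xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

scores = {
    "majority baseline": cross_val_accuracy(fit_majority, glucose, label),
    "threshold rule": cross_val_accuracy(fit_threshold, glucose, label),
}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

Each candidate is trained and scored on the same folds, so the comparison is like for like. A trivial baseline is included because a model that cannot beat it is not worth selecting.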
Ethical Considerations:
• Ensure that the project adheres to ethical guidelines, patient privacy
regulations (e.g., HIPAA), and data protection laws.
• Address potential bias and fairness concerns in model predictions.
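One simple way to look for bias is to compute the same metric separately for each demographic group and compare the results. The sketch below does this for recall. The group names, labels, and predictions are hypothetical, and recall is only one of several fairness measures that could be used.

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Return recall, TP / (TP + FN), computed separately for each group."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:                 # only actual patients count towards recall
            if pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}

# Hypothetical labels, predictions, and a demographic attribute.
y_true = [1, 1, 1, 0, 1, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

recalls = recall_by_group(y_true, y_pred, groups)
gap = max(recalls.values()) - min(recalls.values())
print(recalls, gap)
```

A large gap means the model misses patients in one group more often than in another, which would need to be investigated before deployment.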
IMPLEMENTATION
The system can be implemented only after thorough testing is done and it is found to work
according to the specification. Implementation involves careful planning, investigation of the current
system and its constraints on implementation, design of methods to achieve the changeover,
and an evaluation of changeover methods.
Two major tasks in preparing for implementation are education and training of the users and
testing of the system. The more complex the system being implemented, the more involved
will be the system analysis and design effort required just for implementation.
The implementation phase comprises several activities. The required hardware and
software acquisition is carried out. The system may require some software to be developed.
For this, programs are written and tested. The user then changes over to the new, fully tested
system and the old system is discontinued.
1. Preliminaries
According to the US National Center for Health Statistics, chronic diseases are diseases that last for a
long period of time, that is, more than three months. These diseases can generally neither be cured by
medicines nor prevented by vaccines. The major causes of chronic diseases are the use of tobacco,
unhealthy food habits, and lack of physical activity. Ageing is also a common cause.
Chronic diseases include cardiovascular disease, cancer, arthritis, diabetes, obesity,
epilepsy and seizures, and problems in oral health [35].
Cardiovascular disease includes heart disease and stroke, which are leading causes of death. It
is caused by the use of tobacco, intake of food with poor nutritional value, and lack of physical activity.
Patients who change these habits may be able to control or prevent
cardiovascular disease.
Arthritis is a chronic disease that causes inflammation in the joints, with pain and stiffness
that increase with age. Cost-effective methods for reducing the
effects of arthritis are available but are not widely used. The effects of arthritis can be reduced by
regular moderate exercise.
Since 1980, obesity has become more common among adults of all age groups. A person who is overweight or
obese is at higher risk of high blood pressure (BP), heart disease, diabetes, and
arthritis. Obesity can also cause some types of cancer.
Oral health problems are a crucial issue that deserves special attention in the health of older people.
They are serious because they affect normal day-to-day actions such as speaking,
chewing, swallowing, and maintaining a nutritious diet.
The ConvNet or CNN is a deep learning algorithm that takes an input, assigns learnable weights
and biases to its various aspects, and distinguishes one from another [55], as shown
in Algorithm 1. The major reason for using a CNN is that it requires comparatively little effort in
preprocessing the data when compared with other algorithms, since the CNN can learn to
optimize its filters automatically [56]. The output layer of the CNN can be calculated
using the following expression:
KNN is a supervised machine learning algorithm that measures the similarity between
new data and the existing data and assigns the new data to the category it most
closely resembles. KNN can be used in classification as well as
regression tasks, but it is most commonly used in classification. The calculation of Euclidean
distance is expressed mathematically as follows:
$$x^2 = (c - a)^2 + (d - b)^2.$$
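The distance formula above, read as the squared distance x between two points (a, b) and (c, d), leads directly to a small KNN classifier. The following is a minimal sketch using only the standard library. The patient measurements, the labels, and the choice of k = 3 are hypothetical, and a library implementation such as scikit-learn's KNeighborsClassifier would normally be used instead.

```python
import math
from collections import Counter

def euclidean(p, q):
    # Generalizes x^2 = (c - a)^2 + (d - b)^2 to any number of features.
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_points, train_labels, query, k=3):
    # Sort training points by distance to the query, then vote among the k nearest.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical patients described by (age, systolic blood pressure).
points = [(35, 118), (42, 122), (38, 120), (61, 150), (66, 158), (58, 145)]
labels = ["healthy", "healthy", "healthy", "at risk", "at risk", "at risk"]

print(euclidean((0, 0), (3, 4)))               # 5.0
print(knn_predict(points, labels, (60, 148)))  # at risk
```

Because the vote depends on raw distances, features measured on very different scales would normally be standardized first, so that no single feature dominates the result.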
In this section, a detailed description of the data set creation, model preparation, and
disease prediction is given. The first action is data collection. Our proposed system
collects structured and unstructured data from various sources. After
collection, the data are preprocessed and split into training and test data sets.
The machine learning algorithms, CNN and KNN, are then trained on the training data set
for a number of epochs to improve the accuracy of the prediction results. After
multiple epochs, once the desired target is achieved, the developed model is ready for
testing.
At this step, the model is tested with the test data set to verify its performance on
brand-new data that were not used for training. If the model attains the desired accuracy on
the test data, then the proposed model is ready for deployment, as shown in Figure 1.
The real-life data includes structured data, such as basic patient information including
demographics, living habitat, and lab test results, and unstructured data, such as the
symptoms of the disease faced by the patient and their consultation with the doctor. The
data set excludes the patient's personal details such as name, ID, and location so as to
preserve their privacy.
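Excluding personal details can be sketched as a filter applied to every record before it enters the data set. The field names and the sample record below are hypothetical, and real de-identification usually also has to consider indirect identifiers.

```python
# Field names are hypothetical; adapt them to the actual record schema.
IDENTIFYING_FIELDS = {"name", "patient_id", "location"}

def deidentify(record):
    """Return a copy of the record without directly identifying fields."""
    return {key: value for key, value in record.items() if key not in IDENTIFYING_FIELDS}

record = {
    "name": "Jane Doe",
    "patient_id": "P-001",
    "location": "Bengaluru",
    "age": 54,
    "symptoms": "fatigue, frequent thirst",
}
clean = deidentify(record)
print(clean)
```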
Most of the structured data contain missing values, so the collected data are preprocessed.
It is essential to fill in the missing data, or to remove or modify the affected records, to
enhance the quality of the data set. The preprocessing step also eliminates commas,
other punctuation, and white space. Once the preprocessing of data has been completed, the data are
subjected to feature extraction followed by disease prediction.
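The two preprocessing operations described above can be sketched with the standard library. Filling missing numbers with the mean of the observed values is an assumption, since the report does not say which fill strategy was used, and the sample values are hypothetical.

```python
import string
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def clean_text(text):
    """Strip punctuation and collapse runs of white space into single spaces."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

print(fill_missing([120, None, 140, 130]))
print(clean_text("  Chest pain,  shortness of breath;   fatigue. "))
```

Here `fill_missing` handles a structured column such as a lab result, while `clean_text` handles the unstructured symptom notes. A column with no observed values at all would need separate handling, because the mean is then undefined.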
As discussed above, the data set consists of both structured and unstructured data. The
structured data comprises, in tabular format, patient demographics and data related to the
cause of the disease such as age, gender, height, and weight, the patient's living habitat,
laboratory test results, and the disease affecting the patient. The unstructured data
comprises, in text format, the patient's disease symptoms and information from consultations
with doctors. The unstructured data is an added advantage for the prediction task,
as it helps to obtain more accurate results. The data set is split into 80% for training and 20% for
testing.
The proposed system uses the CNN algorithm in the prediction of chronic disease. At first,
the data set is converted into vector form, followed by word embedding, with zero
values used to fill the data. It is then given to the convolution layer.
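One plausible reading of using zero values to fill the data is zero-padding: each text is turned into a list of word ids, and shorter lists are filled with zeros so that every input has the same length. The sketch below follows that reading. The sample notes, the vocabulary scheme, and the length of 5 are assumptions, and the embedding lookup that would map each id to a vector is not shown.

```python
def build_vocab(texts):
    """Map each distinct word to a positive integer id; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(text, vocab, max_len):
    """Convert text to ids, truncate to max_len, and right-pad with zeros."""
    # Words missing from the vocabulary are skipped in this sketch.
    ids = [vocab[word] for word in text.split() if word in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))

# Hypothetical symptom notes, already cleaned of punctuation.
notes = ["chest pain fatigue", "fatigue frequent thirst blurred vision"]
vocab = build_vocab(notes)
encoded = [encode(note, vocab, max_len=5) for note in notes]
print(encoded)
```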
The pooling layer takes its input from the convolution layer and applies the max pooling
operation. The output of max pooling is given to the fully connected layer, and finally
the output layer provides the classification results. Figure 2 shows the block diagram of the
convolutional neural network.
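The flow from convolution to max pooling to the fully connected layer can be traced on a tiny one-dimensional example. The input vector, kernel, and weights below are hypothetical and fixed by hand. A real network learns them during training and is normally built with a deep learning library, so this is only a sketch of the data flow.

```python
import math

def conv1d(signal, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation) followed by ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]
        activation = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, activation))   # ReLU keeps only positive responses
    return out

def max_pool(features, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + size]) for i in range(0, len(features) - size + 1, size)]

def dense_sigmoid(features, weights, bias=0.0):
    """Fully connected layer with a single sigmoid output unit."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

signal = [1.0, 3.0, 2.0, 5.0, 4.0]              # hypothetical input vector
conv_out = conv1d(signal, kernel=[1.0, -1.0])   # [0.0, 1.0, 0.0, 1.0]
pooled = max_pool(conv_out, size=2)             # [1.0, 1.0]
probability = dense_sigmoid(pooled, weights=[0.5, -0.5])
print(conv_out, pooled, probability)            # probability is 0.5
```

The output can be read as the predicted probability of disease, which is turned into a class label by comparing it with a threshold.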
For evaluating the proposed disease prediction model, four performance evaluation metrics
are used. They are computed from the confusion matrix, which consists of the true positives (TP),
which are the correct predictions of patients with chronic disease; the true negatives (TN), which are
the correct predictions of persons without disease; the false positives (FP), which are the
incorrect predictions of healthy persons as diseased; and the false negatives (FN),
which are the incorrect predictions of diseased persons as healthy. The following is the
description of the four performance evaluation parameters.
3.1. Accuracy
The classification accuracy is the ratio of correctly predicted values to the total number of
predicted values, expressed as a percentage, and is depicted mathematically as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100.$$
3.2. Precision
The precision or positive predictive value (PPV) is the ratio of correctly predicted
positives to all positive predictions, both true and false, and is depicted
mathematically as follows:
$$\text{Precision} = \frac{TP}{TP + FP}.$$
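The accuracy and precision formulas above can be checked on a small example. The labels and predictions below are hypothetical. Returning 0.0 when there are no positive predictions is a convention assumed here, since the ratio is undefined in that case.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = chronic disease, 0 = healthy)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    # Expressed as a percentage, matching the formula in Section 3.1.
    return (tp + tn) / (tp + tn + fp + fn) * 100

def precision(tp, fp):
    # Assumed convention: 0.0 when the model made no positive predictions.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical ground truth and model predictions for ten patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)               # 4 4 1 1
print(accuracy(tp, tn, fp, fn))     # 80% of predictions are correct
print(precision(tp, fp))            # 4 of 5 positive predictions are correct
```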
CODE SNIPPET:
User Interface
Our system also has an easy-to-use interface. It provides various visual
representations of the data collected and the results achieved.