Sample INTERNSHIP Report


KGiSL Institute of Technology

(An Autonomous Institution)


Affiliated to Anna University, Approved by AICTE, Recognized by UGC,
Accredited by NAAC & NBA (B.E-CSE,B.E-ECE, B.Tech-IT),
365, KGiSL Campus, Thudiyalur Road, Saravanampatti, Coimbatore – 641035.

DIABETES PREDICTION USING
DATA SCIENCE AND MACHINE
LEARNING
A SUMMER INTERNSHIP REPORT

Submitted by

JASFER I (711721243038)

in partial fulfilment for the award of the degree

of

BACHELOR OF ENGINEERING
IN
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

KGiSL INSTITUTE OF TECHNOLOGY

ANNA UNIVERSITY: CHENNAI 600 025

NOV 2024

BONAFIDE CERTIFICATE

Certified that this Internship report on “Diabetes Prediction using Data
Science and Machine Learning” at Exposys Data Labs is the bona fide work
of JASFER I, who belongs to IV Year Computer Science and Engineering “A”,
during VII Semester of Academic Year 2024-2025.

FACULTY INCHARGE HEAD OF THE DEPARTMENT

Certified that the candidate was examined by us for the Summer Internship Viva
held on ____ at KGiSL Institute of Technology, Saravanampatti,
Coimbatore 641035.

INTERNAL EXAMINER EXTERNAL EXAMINER


ACKNOWLEDGEMENT

We express our deepest gratitude to our Chairman and Managing
Trustee Dr. Ashok Bakthavathsalam for providing us with an environment to
complete our Internship project successfully.

We are grateful to our CEO of Academic Initiatives Mr. Aravind
Kumar Rajendran and our beloved Secretary Dr. Rajkumar N. Our sincere
thanks to our honourable Principal Dr. Suresh Kumar S for his support, guidance,
and blessings.

We would like to thank Dr. Thenmozhi T, Head of the Department,
and Internship Coordinator Mr. Vivekanandan V, Department of Computer
Science and Engineering, for their firm support during the entire course of this
Internship and for moulding us both technically and morally to achieve
greater success in this project work.

We also thank all the faculty members of our department for their help in
making this Internship project a successful one. Finally, we take this
opportunity to extend our deep appreciation to our Family and Friends, for all
they meant to us during the crucial times of the completion of our project.
INTERNSHIP OFFER LETTER
INTERNSHIP CERTIFICATE
ABSTRACT

During the internship at Exposys Data Labs, machine learning methods for
diabetes prediction were explored. The analysis used a dataset of 768 patient
profiles and evaluated three algorithms: Logistic Regression, Random Forest,
and a Bagging ensemble.

Sourced from a reputable medical research database, the dataset was
preprocessed to ensure data integrity and relevance. The subsequent
feature selection step retained the most impactful variables, including
age, BMI, blood pressure, and glucose level.

The comparative analysis showed that the Bagging ensemble, using 10 base
estimators, achieved an accuracy of 79%. Having surpassed Logistic Regression
and Random Forest, Bagging emerged as a strong candidate for diabetes
prediction.

Feature importance analysis within the Bagging ensemble identified glucose
level, BMI, and age as key predictors of diabetes outcomes. These findings
contribute to work on diabetes prediction and point to the broader value of
ensemble methods for improving predictive accuracy on complex medical
conditions.

This internship project offers a practical perspective on applying machine
learning to diabetes prediction. The methodology and results contribute to the
ongoing discussion on using data science to improve healthcare outcomes.
INTRODUCTION
Domain:

Healthcare Focus:

• The project centers on healthcare, specifically the prediction of
diabetes. This focus addresses the need for better tools for disease
prediction and prevention within the healthcare sector.
• By concentrating on diabetes, a prevalent chronic condition, the
project aims to support better patient outcomes and reduce the burden
of the disease on healthcare systems.

Global Health Impact:

• Because diabetes is a chronic condition with global prevalence, the
project's scope extends beyond regional boundaries. Its findings and
methods have the potential to inform healthcare practice on a global
scale.
• Diabetes prediction addresses a crucial aspect of public health:
early detection and intervention, which help mitigate the impact of
diabetes worldwide.
Technology:

Machine Learning Algorithms:

• The project employs three machine learning algorithms: Logistic
Regression, Random Forest, and a Bagging ensemble. This choice allows
a comparison of diverse modeling techniques for diabetes prediction.
• Using several algorithms makes it possible to compare the strengths
of each method and to understand how they perform in healthcare
applications.

Programming Language:

• Python serves as the core programming language, providing a versatile
and widely adopted platform for machine learning implementations.
Its readability and extensive libraries support an efficient and
standardized development process.
• The use of Python keeps the project in line with common practice in
data science and machine learning.

Libraries:

• The project uses popular Python libraries, notably Scikit-Learn, to
implement the machine learning algorithms. These libraries provide
functionality for model development, evaluation, and optimization.
• Relying on well-established libraries supports the reliability and
efficiency of the project's technology stack.
Advanced Predictive Modeling:

• Logistic Regression, Random Forest, and the Bagging ensemble are
established predictive modeling techniques, selected to apply current
machine learning methods to healthcare data.
• By combining these modeling approaches, the project aims to uncover
patterns within healthcare data and to improve predictive analytics
for disease diagnosis and prognosis.

Technological Integration:

• The project's technology stack combines Python with the relevant
machine learning libraries. This integration supports the scalability
and efficiency of the predictive modeling process.
• Keeping to current machine learning tooling helps the project stay up
to date and contributes to the development of robust healthcare
predictive models.

Methods:

Systematic Data Collection:

• A dataset of 768 patient instances is collected from a reputable
medical research database. It provides the sample used for training
and evaluating the diabetes prediction models.
• This data forms the foundation for studying the relationships
between patient characteristics and diabetes outcomes.

Feature Selection and Engineering:

• Preprocessing techniques are applied to improve data relevance, with
a particular focus on feature selection and engineering. This step
ensures that the most impactful variables, such as age, BMI, blood
pressure, and glucose level, are included in the analysis.
• Feature engineering techniques, including scaling and normalization,
put the features on comparable ranges so that the same data can be
used across different algorithms.
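
The scaling and normalization step can be sketched as follows. This is a minimal illustration, not the code used in the internship: the feature values are made up, scale_features is a hypothetical helper, and the choice of Scikit-Learn's StandardScaler and MinMaxScaler is an assumption, since the report does not name the scalers used.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up rows; the columns stand for glucose, BMI and age.
X_train = np.array([[148.0, 33.6, 50.0],
                    [85.0, 26.6, 31.0],
                    [183.0, 23.3, 32.0],
                    [89.0, 28.1, 21.0]])
X_test = np.array([[137.0, 43.1, 33.0]])


def scale_features(train, test, scaler):
    """Fit the scaler on the training rows only, then transform both sets."""
    scaler.fit(train)
    return scaler.transform(train), scaler.transform(test)


# Standardization: each training column gets zero mean and unit variance.
std_train, std_test = scale_features(X_train, X_test, StandardScaler())

# Min-max normalization: each training column is mapped onto [0, 1].
mm_train, mm_test = scale_features(X_train, X_test, MinMaxScaler())

print(std_train.round(2))
print(mm_train.round(2))
```

Fitting the scaler on the training rows only keeps information from the test set out of the preprocessing step; test values may therefore fall slightly outside the training range.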

Algorithmic Evaluation:

• Logistic Regression, Random Forest, and the Bagging ensemble are
trained and evaluated on a split dataset, which allows each
algorithm's performance in predicting diabetes to be compared.
• Each algorithm is first trained with default parameters, which serve
as a baseline for subsequent fine-tuning.

Performance Metrics:

• Algorithm performance is assessed with a suite of metrics: accuracy,
precision, recall, F1-score, and AUC-ROC. These metrics show each
algorithm's strengths and weaknesses in diabetes prediction.
• Using multiple metrics captures different aspects of predictive
performance and allows a thorough comparison between algorithms.
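
The five metrics listed above can be computed with Scikit-Learn as sketched below. The labels and probabilities are made-up illustration values, not results from the internship, and evaluate is a hypothetical helper name.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up true labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.6]
# Threshold the probabilities at 0.5 to obtain class predictions.
y_pred = [int(p >= 0.5) for p in y_prob]


def evaluate(y_true, y_pred, y_prob):
    """Return the five metrics named in the report as a dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # AUC-ROC is computed from the probabilities, not the hard labels.
        "auc_roc": roc_auc_score(y_true, y_prob),
    }


scores = evaluate(y_true, y_pred, y_prob)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

All Scikit-Learn metric functions take the true labels as the first argument and the predictions as the second; swapping them transposes the confusion matrix and exchanges precision and recall.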

Bagging Ensemble Focus:

• Special attention is given to the Bagging ensemble method, which
aggregates the predictions of multiple base estimators, each trained
on a bootstrap sample of the training data.
• The focus on Bagging reflects the potential improvements in
predictive accuracy that ensemble methods can offer, particularly in
healthcare applications.
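
A minimal sketch of such an ensemble, assuming decision trees as base estimators and the 10 estimators mentioned in the abstract. The data is synthetic (generated with make_classification as a stand-in for the 768 patient records), so the printed accuracy says nothing about the internship results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data: 768 rows, 8 numeric features.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Ten decision trees, each fitted on a bootstrap sample of the training rows.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            random_state=0)
bagging.fit(X_train, y_train)

# Predictions of the individual trees for the first test row, next to the
# aggregated prediction of the whole ensemble.
first_row = X_test[:1]
tree_votes = np.array([int(tree.predict(first_row)[0])
                       for tree in bagging.estimators_])
ensemble_prediction = int(bagging.predict(first_row)[0])
test_accuracy = bagging.score(X_test, y_test)

print("tree votes:", tree_votes)
print("ensemble prediction:", ensemble_prediction)
print("test accuracy:", round(test_accuracy, 3))
```

The base estimator is passed positionally because its keyword name differs between Scikit-Learn releases.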

Feature Importance Analysis:

• Within the Bagging ensemble, feature importance is analyzed using
Python-based tools. The analysis aims to identify the predictors that
most influence diabetes outcomes.
• Feature importance analysis adds interpretability to the predictive
models by showing which variables contribute most to the prediction.
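
Scikit-Learn's BaggingClassifier does not expose a feature_importances_ attribute of its own, so one possible approach, sketched here as an assumption rather than the method used in the internship, is to average the importances of the fitted base trees. The data is synthetic, the feature names are hypothetical labels, and bagged_feature_importance is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labels; with shuffle=False the first three synthetic
# columns are informative and the last one is pure noise.
feature_names = ["glucose", "bmi", "age", "noise"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                            n_estimators=10, random_state=0).fit(X, y)


def bagged_feature_importance(model, n_features):
    """Average the impurity-based importances of the fitted base trees.

    Assumes tree base estimators and no feature bootstrapping, so each
    entry of estimators_features_ lists distinct column indices.
    """
    totals = np.zeros(n_features)
    for tree, columns in zip(model.estimators_, model.estimators_features_):
        totals[columns] += tree.feature_importances_
    return totals / len(model.estimators_)


importances = bagged_feature_importance(bagging, X.shape[1])
for name, value in sorted(zip(feature_names, importances),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {value:.3f}")
```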
Contemporary Practices:

• The project's technological framework, including Python and the
relevant libraries, and its methodology follow current best practices
in data science and healthcare analytics.
• By following contemporary practices, the project contributes to the
ongoing discussion on the application of data science in healthcare.

Model Training and Evaluation:

Dataset Splitting:

• The dataset is randomly split into training and testing sets using
Python's Scikit-Learn library. The split is stratified on the
outcome, so both subsets keep the class proportions of the full
dataset.
• Using Scikit-Learn for dataset splitting keeps the workflow on
standardized and well-established tools in machine learning.
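
A small sketch of a stratified split with train_test_split. The labels are made up (64 negative and 36 positive cases); the 25% test size and random_state=0 mirror the values in the code listing later in the report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 36% positive outcomes.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 64 + [1] * 36)

# stratify=y keeps the share of positive cases the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print("train size:", len(y_train), "positive share:", y_train.mean())
print("test size:", len(y_test), "positive share:", y_test.mean())
```

Stratification preserves the original class proportions; it does not balance the classes. If the outcome is imbalanced in the full dataset, it stays imbalanced in both subsets.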

Training Procedure:

• Each algorithm, including Logistic Regression, Random Forest,
and Bagging ensemble, undergoes systematic training

Software Requirements:
1. Python Version:
• Python 3.11

2. Integrated Development Environment (IDE):
• Jupyter Notebook or Anaconda Navigator

3. Machine Learning Libraries:
• Scikit-Learn
• Pandas
• NumPy
• Matplotlib
• Seaborn
• TensorFlow and/or PyTorch (optional, based on specific
algorithm requirements)

4. Database Management System (Optional):
• SQLite or any preferred relational database

5. Documentation and Reporting: Jupyter Notebook, Microsoft Word

Hardware Requirements:

1. Processor:
• Quad-core processor or higher

2. RAM:
• 16 GB or higher
3. Storage:
• 512 GB SSD or higher

4. Graphics Processing Unit (GPU) (optional):
• NVIDIA GeForce (CUDA-capable) or AMD Radeon series

5. Display:
• Resolution: Full HD (1920 x 1080) or higher

6. Operating System:
• Windows 10, macOS, or Linux (Ubuntu recommended for machine
learning tasks)

7. Internet Connectivity:
• Required for accessing external datasets, libraries, and updates.

8. Peripheral Devices:
• Mouse and keyboard for input
• Webcam and microphone for virtual collaborations and presentations
MODULE DESCRIPTION
Data Collection Module:

• Retrieves a dataset of 768 patient instances from a reputable medical
research database, covering a range of health profiles.
• Provides the foundation for model training and evaluation by
supplying the patient information needed to study diabetes
predictors.

Data Preprocessing Module:

• Applies preprocessing techniques, including feature selection and
engineering, to improve the relevance and informativeness of the
dataset.
• Applies scaling and normalization so that the input features are
compatible across the different machine learning algorithms.

Machine Learning Algorithms Module:

• Implements three machine learning algorithms: Logistic Regression,
Random Forest, and a Bagging ensemble.
• Uses the Scikit-Learn library, and optionally integrates TensorFlow
and PyTorch, for predictive modeling.

Model Training and Evaluation Module:

• Trains each algorithm on the split dataset, using default parameters
as a baseline for subsequent optimization.
• Evaluates model performance on the testing set with a suite of
metrics: accuracy, precision, recall, F1-score, and AUC-ROC.

Bagging Ensemble Focus Module:

• Examines the Bagging ensemble method in depth, focusing on how
predictions from multiple base estimators are aggregated.
• Aims for higher predictive accuracy in diabetes prediction by
combining the outputs of multiple models.

Feature Importance Analysis Module:

• Analyzes feature importance within the Bagging ensemble using
Python-based tools.
• Identifies the predictors that most influence diabetes outcomes,
improving model interpretability and supporting informed
decision-making in healthcare contexts.

Documentation and Reporting Module:

• Uses Jupyter Notebook, Microsoft Word, or other documentation tools
to prepare reports and presentations.
• Presents the methodology, results, and implications of the diabetes
prediction models so that stakeholders can follow them easily.
Contemporary Practices Module:

• Follows current best practices in data science and healthcare
analytics in its choice of technologies and methodologies.
• Keeps the work relevant by adopting, and contributing to, evolving
practices in the field.

System Specifications Module:

• Defines the software and hardware prerequisites, with attention to
compatibility, efficiency, and performance throughout the project
lifecycle.
• Provides clear guidelines for the technological infrastructure so
that development and execution run smoothly.

User Interface Module:

• Provides an intuitive user interface for interacting with the machine
learning models.
• Supports data input, initiation of predictions, and visualization of
results, with priority on accessibility and usability.
EXPERIMENTAL RESULTS
Libraries and Data Visualization:
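import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the FacetGrid plots further down

df = pd.read_csv("/content/Diabetes_Dataset.csv")
df.head()
df.tail()
df.info()
df.isnull().sum()

# count the zero entries per column; zeros are treated as missing values below
x = df[["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]] == 0
x = x.sum()
print(x)

# mark zeros as missing, then fill them with the column mean
df[["BloodPressure", "Glucose", "BMI"]] = df[["BloodPressure", "Glucose", "BMI"]].replace(0, np.nan)
df.fillna(df.mean(), inplace=True)
df[["SkinThickness", "Insulin"]] = df[["SkinThickness", "Insulin"]].replace(0, np.nan)
# assumption: these NaNs are mean-filled as well, so that the models below can be fitted
df.fillna(df.mean(), inplace=True)
# assumption: df1 is used from here on but never defined; taken to be the cleaned frame
df1 = df.copy()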
#histograms, bar plots and box plots for data preprocessing
fig = df.hist(figsize=(20, 15))
df.plot(kind='bar', subplots=True, layout=(3, 3), sharex=False, sharey=False, figsize=(50, 50))
plt.show()
df1.plot(kind='box', subplots=True, layout=(3, 3), sharex=False, sharey=False, figsize=(15, 15))
plt.show()

#finding relations: density of each feature, split by Outcome
# fill=True replaces the shade=True argument, which newer seaborn releases no longer accept
for column, aspect in [('Age', 5), ('Insulin', 4), ('BMI', 4), ('BloodPressure', 4), ('Glucose', 4)]:
    fig = sns.FacetGrid(df1, hue="Outcome", aspect=aspect)
    fig.map(sns.kdeplot, column, fill=True)
    fig.set(xlim=(0, df1[column].max()))  # x-axis from 0 to the largest observed value
    fig.add_legend()
#traintest split
from sklearn.model_selection import train_test_split
train, test = train_test_split(df1, test_size=0.25, random_state=0, stratify=df1['Outcome'])  # stratify the outcome
X = df.drop('Outcome', axis=1)
train_X = train[train.columns[:8]]
test_X = test[test.columns[:8]]
train_Y = train['Outcome']
test_Y = test['Outcome']

Algorithms:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

#logistic regression:
lr = LogisticRegression()
lr.fit(train_X, train_Y)
p = lr.predict(test_X)
# scikit-learn metrics take the true labels first, then the predictions
print('The accuracy Score for logistic regression is:\n', metrics.accuracy_score(test_Y, p))
print('\n \n The confusion matrix: \n', metrics.confusion_matrix(test_Y, p))
print('\n\n The metrics classification report:\n ', metrics.classification_report(test_Y, p))

prob = lr.predict_proba(test_X)
prob = prob[:, 1]  # probability of the positive class

def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve for logistic regression')
    plt.legend()
    plt.show()

# plot_roc_curve was defined but never called in the listing; the call is added here
fpr, tpr, thresholds = metrics.roc_curve(test_Y, prob)
plot_roc_curve(fpr, tpr)
#Random forest:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(train_X, train_Y)
rfm = rf.predict(test_X)

# Results
# true labels first, then predictions
print('The accuracy Score for random forest algorithm is:\n', metrics.accuracy_score(test_Y, rfm))
print('\n \n The confusion matrix: \n', metrics.confusion_matrix(test_Y, rfm))
print('\n\n The metrics classification report:\n ', metrics.classification_report(test_Y, rfm))

from sklearn.metrics import roc_curve, auc
y_scores = rf.predict_proba(test_X)[:, 1]  # probability of the positive class
fpr, tpr, thresholds = roc_curve(test_Y, y_scores)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='yellow', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic for random forest algorithm')
plt.legend(loc="lower right")
plt.show()


#bagging classifier:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

base_classifiers = [
    DecisionTreeClassifier(),
    LogisticRegression(),
    RandomForestClassifier(),
    SVC()
]
# BaggingClassifier takes a single base estimator, not a list; leaving it unset
# uses the default decision tree, so base_classifiers only determines the count
bagging_classifier = BaggingClassifier(
    n_estimators=len(base_classifiers),  # number of base estimators (4 here)
    random_state=42
)
# the listing used X_train, y_train, X_test and y_test, which were never defined;
# the variables from the train/test split above are used instead
bagging_classifier.fit(train_X, train_Y)
y_pred = bagging_classifier.predict(test_X)
accuracy = accuracy_score(test_Y, y_pred)
print("Accuracy for bagging classifier:", accuracy)
CONCLUSION
Embarking on this internship journey at Exposys Data Labs has been a
profound and enriching experience. The focus on diabetes prediction through
advanced data science and machine learning techniques has not only deepened
my understanding of these technologies but has also provided practical insights
into their real-world applications, particularly in healthcare analytics.

The internship has been structured around a meticulous series of modules,
each contributing to a holistic approach in developing and evaluating diabetes
prediction models. From the systematic data collection to the nuanced feature
engineering and the implementation of cutting-edge machine learning
algorithms, every step has been a learning opportunity. The incorporation of
advanced methodologies, such as Bagging ensemble, showcases the
commitment of Exposys Data Labs to staying at the forefront of the field.

Throughout the internship, the emphasis on contemporary practices has been
evident. The integration of state-of-the-art technologies like TensorFlow,
PyTorch, and Scikit-Learn aligns with industry best practices and demonstrates
the organization's dedication to staying current with technological
advancements. This exposure has not only enhanced my technical skills but has
also broadened my perspective on the dynamic landscape of data science.

In conclusion, this internship has been an invaluable chapter in my
professional development. The hands-on experience, exposure to advanced
technologies, and the collaborative work environment have not only enhanced
my technical skills but have also instilled in me a deeper appreciation for the
transformative potential of data science in addressing real-world challenges. I
am grateful for the opportunities and mentorship provided during this
internship, laying a strong foundation for my future endeavors in the dynamic
field of data science.
