Sampath Report

Uploaded by

uzmakin198
INTERNSHIP REPORT ON

A report submitted in partial fulfilment of the requirements for the Award of Degree of

BACHELOR OF TECHNOLOGY
In
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
BY
A. SAMPATH
Regd. No.: 22671A7362

Under Supervision of
Ms. Maryam Fatima Farooqui
(Duration: 05th August 2023 to 15th September 2023)

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND
MACHINE LEARNING
J.B. INSTITUTE OF ENGINEERING AND TECHNOLOGY
(UGC Autonomous)
Approved by AICTE, Accredited by NBA & NAAC, Permanently
Affiliated to JNTUH, Hyderabad, Telangana
2022 – 2026

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND
MACHINE LEARNING
J.B. INSTITUTE OF ENGINEERING AND TECHNOLOGY
(UGC Autonomous)

CERTIFICATE

This is to certify that the “Internship report” submitted by A. SAMPATH (Regd.
No. 22671A7362) is work done by him and submitted during academic year 2023 –
2024, in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY in ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, for the
internship on DATA SCIENCE WITH PYTHON at COINCENT COMPANY.

Ms. Maryam Fatima Farooqui
Assistant Professor & Internship Coordinator
Department of AI&ML

Dr. G. Arun Sampaul Thomas
Associate Professor & HOD
Department of AI&ML

ACKNOWLEDGEMENT

First, I would like to thank COINCENT COMPANY for giving me the opportunity to
do an internship within the organization.

I would also like to thank all the people who worked along with me at COINCENT
COMPANY; with their patience and openness they created an enjoyable working
environment.

It is indeed with a great sense of pleasure and immense sense of gratitude that I
acknowledge the help of these individuals.

I would like to thank Ms. Maryam Fatima Farooqui, Internship Coordinator,
Department of ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, for
her support and advice in obtaining and completing the internship in the above-said
organization. I am extremely grateful to my department, staff members and friends
who helped me in the successful completion of this internship.

I would like to thank my Head of the Department, Dr. G. Arun Sampaul Thomas, for
his constructive criticism throughout my internship. I am highly indebted to our
Principal, Dr. P.C. Krishnamachary, for the facilities provided to accomplish this
internship.

A. SAMPATH

22671A7362

TABLE OF CONTENTS
1. ABSTRACT

2. ORGANISATION INFORMATION

3. INTERNSHIP OBJECTIVES

4. WEEKLY REPORT OF INTERNSHIP ACTIVITIES

5. INTRODUCTION

6. MODULES

7. SYSTEM SPECIFICATION

8. HARDWARE AND SOFTWARE REQUIREMENTS

9. SOFTWARE ENVIRONMENT

10. SYSTEM DESIGN

11. CODING

12. SCREENSHOTS

13. CONCLUSION

ABSTRACT
The provided code implements a Random Forest Classifier to predict diabetes based
on a dataset. The dataset, loaded using Pandas, consists of diabetes-related
features and a target variable indicating the presence or absence of diabetes.
The data is split into training and testing sets using scikit-learn's
train_test_split function.

A Random Forest Classifier is trained on the training set with a specified maximum
depth and random state. The model's predictions on the test set are then evaluated
using confusion matrix and accuracy score metrics. The confusion matrix provides
insights into the classifier's performance, while the accuracy score quantifies its
overall accuracy.

The code concludes with a demonstration of making a prediction for a new data
point, and the result is displayed, indicating whether the person is diabetic or not
based on the model's prediction.

This code serves as a practical example of employing machine learning techniques,
specifically a Random Forest Classifier, for diabetes prediction. The metrics and
prediction results enhance the understanding of the model's effectiveness,
contributing to informed decision-making in diabetic status assessment.

In the pursuit of precision healthcare, this project presents a streamlined Python
implementation for diabetes prediction using a carefully curated dataset.
Leveraging the Pandas, NumPy, and Matplotlib libraries, the study commences with
data exploration and manipulation, offering a glimpse into the structure and
characteristics of the diabetes prediction dataset.

Three columns, 'gender', 'smoking_history', and 'HbA1c_level', are identified and
excluded from the analysis, reducing the dataset to five features: age,
hypertension, heart disease, BMI, and blood glucose level. Using the Random Forest
Classifier from scikit-learn, a supervised machine learning model is trained on the
refined dataset to predict diabetes outcomes.

The dataset is partitioned into training and testing sets using the widely adopted
train-test split methodology. The Random Forest Classifier, characterized by a
defined maximum depth and random state, is trained on the training set and then
evaluated on the testing set.

To demonstrate the practicality of the model, a novel entry is introduced for
prediction, simulating a real-world scenario. The model predicts the diabetes status
of the given individual, and the results are presented with clear interpretability.
Notably, the project emphasizes the importance of proper implementation practices,
ensuring the reproducibility and reliability of the results.

This Python-based approach not only provides a foundation for diabetes prediction
but also serves as a template for the development of machine learning solutions in
healthcare. By offering a transparent and accessible codebase, this project strives
to empower practitioners and researchers alike in harnessing the potential of
machine learning for predictive analytics in the realm of medical diagnoses.

ORGANISATION INFORMATION

Coincent is a managed marketplace that offers Live Industrial Training, Projects &
Internships to students. We are a professional community of Industry Experts and academia,
who have come together to help learners become employable. We use Design Thinking and
a learner-centred approach to problem-solving, focusing on application-based learning.

Our Industrial Trainers, mentors, and counsellors are passionate tutors and student
career-driven specialists in their fields with years of Industrial Expertise in eminent
companies like Google, IBM, Microsoft, and more.

Besides, we use our leading-edge and comprehensive internally developed AI tools to make
certain that our learners experience customized and personalized learning to achieve
exponential success.

We are a bunch of individuals always working to make a difference in students’ careers and
to meet the needs of the industries by bridging the skill gap between colleges and
industries.

Personalized Learning

Coincent Partnered Companies provide live interactive classes and amiable mentors to
make the sessions more engaging and informative.

Anywhere Anytime

Our time schedule is very flexible, and our live sessions are held in the evening to
avoid any clashes with the college schedule.

Lifetime Access

Students can access the dashboard at any time to see their progress, and a customized
resume builder is accessible after completion of the Internship.

METHODOLOGIES
o Data Collection
o Data Preprocessing
o Data Splitting
o Model Architecture
o Model Compilation
o Model Training
o Hyperparameter Tuning
o Model Evaluation
o Visualization
o Deployment

INTERNSHIP OBJECTIVES
One of the main objectives of an internship is to expose you to a
particular job and a profession or industry. While you might have an idea
about what a job is like, you won’t know until you actually perform it if it’s
what you thought it was, if you have the training and skills to do it and if it’s
something you like. For example, you might think that advertising is a
creative process that involves coming up with slogans and fun campaigns.
Taking an internship at an advertising agency would help you find that
advertising includes consumer demographic research, focus groups,
knowledge of a client’s pricing and distribution strategies, and media
research and buying. When you apply for jobs, the more experience and
accomplishments you have, the more attractive you’ll look to a potential
employer. Just because you have an internship with a specific title or well-
known company doesn’t mean your internship will help you land a nice gig.
Make an impact where you work by asking for responsibility and looking
for ways to achieve accomplishments. Be willing to work more hours than
you’re required and ask to work in different departments to expand your
skill set. Don’t just fetch coffee, make copies and sit in on meetings, even if
that’s all it will take to finish your internship.
Another benefit of an internship is developing business contacts. These
people can help you find a job later, act as references or help you with
projects after you’re hired somewhere else. Meet the people who have jobs
you would like some day and ask them if you can take them to lunch. Ask
them how they started their careers, how they got to where they are now and
if they have any suggestions for you to improve your skills.

WEEKLY REPORT OF INTERNSHIP ACTIVITIES
WEEK    PROGRESS
WEEK-1  o Introduction to Python.
        o Installation.
        o Basics, Numbers, Strings…
        o Basic Python and Datatypes.
        o Control flow Conditions.
WEEK-2  o Exception Handling.
        o Functions.
        o Object-Oriented Programming (OOP).
WEEK-3  o Introduction to Deep Learning.
        o Logistic Regression.
        o LR vs DL.
        o TensorFlow.
        o Keras.
WEEK-4  o Library Introduction.
        o Matplotlib.
        o NumPy.
        o Pandas.
        o Linear Algebra.
        o Probability.
WEEK-5  o Data Visualization with Tableau
        o LSTM
        o Machine Learning Models-Clustering
        o Evaluation Metrics
        o Logistic Regression
        o Simple Linear Regression
        o Multiple Linear Regression
WEEK-6  o Hierarchical Clustering
        o HC Code
        o Chatbot with Dialogflow-Theory
        o Chatbot with Dialogflow-Code
        o Chatbot-Code
        o Project-Introduction
        o Venv and Data Import
        o Data Cleaning
        o EDA
        o Data Preparation
        o Building ML Model

PROGRAMS AND OPPORTUNITIES:
1. Machine Learning with Python
2. Full Stack Web development
3. Cloud computing
4. Data Science
5. Cyber Security
6. Artificial Intelligence
7. Microsoft Azure Cloud Computing
8. Augmented & Virtual Reality
9. App Development Combined Course
10. Graphic Design, etc.

INTRODUCTION
Diabetes is a chronic health condition characterized by elevated blood sugar levels,
primarily resulting from the body's inability to produce or effectively use insulin. It is
a significant global health concern, with millions of people affected worldwide. Early
detection and management of diabetes are crucial for preventing complications and
improving overall health outcomes.

In the realm of healthcare, machine learning techniques have emerged as valuable
tools for predicting and diagnosing diseases, including diabetes. These techniques
leverage patterns and relationships within large datasets to make predictions or
classifications. Python, with its extensive libraries and frameworks, is a popular
language for implementing machine learning models.

Diabetes prediction using machine learning involves training models on historical
data containing relevant features associated with the condition. Features such as
Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI (Body
Mass Index), Diabetes Pedigree Function, and Age are common indicators used in
predictive modelling for diabetes.

By utilizing machine learning algorithms, such as Random Forest, Support Vector
Machines, or Neural Networks, these features can be employed to create predictive
models that discern patterns indicative of diabetes. These models can then be
applied to new data to make predictions about an individual's likelihood of having
diabetes.

The code in this report implements a Random Forest Classifier for diabetes
prediction, using the features retained in the CODING section (age, hypertension,
heart disease, BMI, and blood glucose level). This approach
showcases the practical application of machine learning in healthcare,
demonstrating how technology can aid in the early identification of diabetes,
thereby facilitating timely interventions and improving patient outcomes.

In the dynamic landscape of healthcare, the combination of advanced data analytics and
machine learning has emerged as a catalyst for significant improvements. This
project builds a diabetes prediction system as a Python implementation. By
leveraging the capabilities of Pandas, NumPy, and the scikit-learn library, the study
works with a curated diabetes prediction dataset of patient data.

Key to the project is the identification and exclusion of selected features,
resulting in a refined dataset that serves as the basis for predictive
modelling. At the heart of this endeavour lies the Random Forest Classifier, an
ensemble learning algorithm valued for its versatility and predictive power.
Tailored to discern patterns within the data, this classifier is trained on
the training set and evaluated on the testing set, ultimately offering a model
capable of predicting diabetes outcomes with a blend of accuracy and transparency.

Beyond its immediate application, this project carries profound implications for the
intersection of technology and healthcare. As the Python-based methodology
unfolds, it not only augurs

MODULES

● Users
● Data Collection
● Attribute Selection
● Preprocessing of data
● Admin

Users: Users add data to the database, view the stored data, and predict
diabetes using machine learning.

Data Collection: The first step for a prediction system is collecting data and
deciding on the training and testing datasets. In this project, 80% of the data is
used as the training dataset and 20% as the testing dataset.

Attribute Selection: Attributes are the properties of the dataset that are
used by the system. For diabetes, typical attributes include Pregnancies,
Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Age, etc.

Preprocessing of data: Preprocessing is needed to achieve good results from
machine learning algorithms. For example, many Random Forest implementations
do not accept null values, so null values in the original raw data have to be
handled first. For our project, categorical values must also be converted into
dummy values, in the form of “0” and “1”.
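
The two preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the report's own code: the sample rows are made up, and the column names are borrowed from the dataset shown in the CODING section. Note that the code in this report drops the categorical columns rather than encoding them, so the sketch shows the alternative that the paragraph describes.

```python
import pandas as pd

# Hypothetical sample rows (values invented for illustration only).
sample = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "smoking_history": ["never", "current", "No Info"],
    "bmi": [25.19, None, 23.45],
})

# Handle null values: fill the missing BMI with the column median.
sample["bmi"] = sample["bmi"].fillna(sample["bmi"].median())

# Convert each categorical column into 0/1 dummy (indicator) columns.
encoded = pd.get_dummies(
    sample, columns=["gender", "smoking_history"], dtype=int
)

print(encoded)
```

Filling with the median is only one possible choice; dropping incomplete rows with dropna() is another common option.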

Admin: The admin authorizes and activates users. The admin can also predict
diabetes.

SYSTEM SPECIFICATIONS

HARDWARE REQUIREMENTS:

● System : Intel Core i5.
● Hard Disk : 512 GB SSD.
● Monitor : 14" Colour Monitor.
● Mouse : Optical Mouse.
● RAM : 16 GB.

SOFTWARE REQUIREMENTS:

● Operating system : Windows 11.
● Coding Language : Python (Data Science with Python).
● Front-End : HTML, CSS.
● Designing : HTML, CSS.
● Database : SQLite.

HARDWARE AND SOFTWARE SPECIFICATIONS

REQUIREMENT ANALYSIS:

The project involved analyzing the design of a few applications so as to make the application more
user-friendly. To do so, it was important to keep the navigation from one screen to the other
well-ordered while reducing the amount of typing the user needs to do. In order to
make the application more accessible, the browser version had to be chosen so that it is compatible
with most browsers.

REQUIREMENT SPECIFICATION

Functional Requirements:
● Graphical user interface for the user.

Software Requirements:
For developing the application, the following are the Software Requirements:

1. Python for Model Preparation.

2. Flask or Django for Backend.

3. HTML, CSS, React for Frontend.

4. SQLite for Database Management.

Operating Systems supported:

1. Windows 10 64-bit OS

Technologies and Languages used to Develop:

1. Python and its frameworks.

2. Debugger and Emulator:

● Any Browser (particularly Chrome)

Hardware Requirements
For developing the application, the following are the Hardware Requirements:
● Processor: Intel i5
● RAM: 16 GB
SOFTWARE ENVIRONMENT

Data science with Python:

1. Python: The programming language itself, usually in its latest stable
version.
2. Integrated Development Environments (IDEs): Popular choices
include Jupyter Notebook, JupyterLab, Spyder, or Visual Studio Code
with Python extensions. These provide an interface for coding,
visualization, and documentation within a single environment.
3. Data Manipulation Libraries: Such as Pandas for data manipulation and
analysis.
4. Visualization Libraries: Matplotlib, Seaborn, Plotly, or Bokeh for
creating various types of visualizations.
5. Machine Learning Libraries: Scikit-learn, TensorFlow, PyTorch, or
Keras for implementing machine learning models and algorithms.
6. Statistical Analysis Libraries: SciPy, statsmodels, or NumPy for
statistical analysis and computations.
7. Data Collection and Web Scraping Libraries: Requests,
BeautifulSoup, or Scrapy for obtaining data from various sources.
8. Database Access Libraries: SQLAlchemy, Psycopg2, or a MySQL connector for
interfacing with databases.
9. Big Data Processing Libraries: PySpark or Dask for handling large-scale
data processing.
10. Notebook Extensions and Plugins: These can enhance functionality,
such as Jupyter Notebook extensions for additional features like
spell-checking, code formatting, or auto-completion.

Interactive Mode Programming:

Python offers various libraries and tools for data science in its
interactive mode, especially when combined with Jupyter notebooks or
the interactive Python shell. The basic example in this section uses Python's
built-in capabilities and some popular data science libraries in an
interactive mode:

1. NumPy: For numerical computing.
2. Pandas: For data manipulation and analysis.
3. Matplotlib: For data visualization.
Using Python Interactive Mode
Open your Python interpreter in the terminal or command
prompt by typing python and pressing Enter.

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create sample data using NumPy (cumulative sum of 100 random values)
data = np.random.randn(100).cumsum()

# Create a Pandas Series from the NumPy array
series = pd.Series(data)

# Display the first few rows of the Series using Pandas
print(series.head())

# Plot the data using Matplotlib
plt.plot(series)
plt.title('Random Data')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

This code demonstrates a basic example:

1. It generates random data using NumPy.
2. Converts it into a Pandas Series.
3. Prints the first few rows of the Series.
4. Plots the data using Matplotlib.

You can run these lines one by one in your Python interpreter to see the
intermediate results and visualize the data as you go along. For a more
comprehensive and interactive experience, consider using Jupyter
notebooks, which allow mixing code, visualizations, and text explanations
in a more organized way.

Script Mode Programming

Script mode programming in Python allows you to write a sequence of
Python commands in a script file and execute it as a whole. Here's an
example of how you might structure a Python script for data science
using libraries like NumPy, Pandas, and Matplotlib:

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def generate_and_plot_data():
    # Create sample data using NumPy
    data = np.random.randn(100).cumsum()

    # Create a Pandas Series from the NumPy array
    series = pd.Series(data)

    # Display the first few rows of the Series using Pandas
    print(series.head())

    # Plot the data using Matplotlib
    plt.plot(series)
    plt.title('Random Data')
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.show()


if __name__ == "__main__":
    generate_and_plot_data()

This script does the following:

1. Imports the necessary libraries: NumPy, Pandas, and Matplotlib.
2. Defines a function generate_and_plot_data() to generate
random data, convert it into a Pandas Series, print the first few
rows, and plot the data using Matplotlib.
3. Checks if the script is executed as the main program (if __name__
== "__main__":) and then calls the generate_and_plot_data()
function.

RANDOM FOREST CLASSIFIER:
Random Forest is a powerful and popular machine learning algorithm used
for both classification and regression tasks; the Random Forest Classifier is its
classification variant. It belongs to the ensemble learning
methods, which combine multiple individual models to produce a more robust and
accurate prediction. Here's an explanation of the Random Forest Classifier in the
context of data science using Python.
How does Random Forest Classifier work?
1. Decision Trees: Random Forest is constructed from multiple decision trees.
Each decision tree is built independently and applies a set of learned rules
to make decisions. It splits the dataset into progressively smaller subsets
until it can make a prediction.
2. Ensemble Learning: Random Forest uses the concept of ensemble learning
by creating a multitude of decision trees. Each tree is trained on a random
subset of the data and uses a random subset of features.
3. Voting Mechanism: When making predictions, Random Forest collects
predictions from each individual decision tree and performs a majority vote
(for classification) or averaging (for regression) to determine the final
prediction.
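
The three steps above can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions, not scikit-learn's internal implementation: the number of trees (25) is an arbitrary choice, the random feature subset is drawn at each split through max_features="sqrt", and hard majority voting is used, whereas scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative choice
trees = []
for i in range(n_trees):
    # Random subset of the data: bootstrap sample drawn with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Random subset of features: considered at every split of the tree.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[rows], y_train[rows])
    trees.append(tree)

# One row of predictions per tree: shape (n_trees, n_test_samples).
all_votes = np.stack([tree.predict(X_test) for tree in trees])

# Majority vote: the most frequent label in each column wins.
majority = np.array([np.bincount(col).argmax() for col in all_votes.T])

print("Hand-rolled forest accuracy:", (majority == y_test).mean())
```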

Implementation in Python using Scikit-Learn:

You can use the RandomForestClassifier class from the sklearn.ensemble
module in Python to implement a Random Forest Classifier. Here's a simple
example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset (or any other dataset)
data = load_iris()

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Create a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
predictions = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

print(f"Accuracy: {accuracy}")

This code snippet demonstrates how to use RandomForestClassifier from the
sklearn (Scikit-Learn) library in Python to train a Random Forest model on the Iris
dataset, make predictions, and evaluate its accuracy.

Adjust the parameters like n_estimators (the number of trees in the forest) and
others based on your specific dataset and requirements.
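
One way to see the effect of n_estimators is to compare a few values with cross-validation. The sketch below is only an illustration: the candidate values (10, 50, 100) and the use of 5 folds are assumptions, not settings taken from this report.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

scores_by_size = {}
for n_estimators in (10, 50, 100):  # illustrative candidate values
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    # Mean accuracy over 5 cross-validation folds.
    scores = cross_val_score(model, X, y, cv=5)
    scores_by_size[n_estimators] = scores.mean()
    print(f"n_estimators={n_estimators}: mean accuracy {scores.mean():.3f}")
```

More trees usually make the result more stable at the cost of longer training, so the smallest value that gives a steady score is often a reasonable pick.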

Random Forest is a versatile and powerful algorithm widely used in
various domains because it is often accurate, robust, and easy to use for both
classification and regression problems in data science.

SYSTEM DESIGN

SYSTEM ARCHITECTURE:

DATA FLOW DIAGRAM:


1. The DFD is also called a bubble chart. It is a simple graphical formalism that
can be used to represent a system in terms of the input data to the system, the
various processing carried out on this data, and the output data generated by the
system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It
is used to model the system components. These components are the system
process, the data used by the process, an external entity that interacts with the
system and the information flows in the system.
3. The DFD shows how the information moves through the system and how it is
modified by a series of transformations. It is a graphical technique that depicts
information flow and the transformations that are applied as data moves from
input to output.

Figure: Data flow diagram
UML DIAGRAMS

UML stands for Unified Modeling Language. UML is a standardized
general-purpose modeling language in the field of object-oriented
software engineering. The standard is managed, and was created by, the
Object Management Group.
The goal is for UML to become a common language for creating models of
object-oriented computer software. In its current form UML comprises
two major components: a meta-model and a notation. In the future, some
form of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying,
visualizing, constructing and documenting the artifacts of software
systems, as well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have
proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software
and the software development process. The UML uses mostly graphical
notations to express the design of software projects.

CODING: -
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('/content/diabetes_prediction_dataset.csv')
dataset.head()

   gender   age  hypertension  heart_disease smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
0  Female  80.0             0              1           never  25.19          6.6                  140         0
1  Female  54.0             0              0         No Info  27.32          6.6                   80         0
2    Male  28.0             0              0           never  27.32          5.7                  158         0
3  Female  36.0             0              0         current  23.45          5.0                  155         0
4    Male  76.0             1              1         current  20.14          4.8                  155         0

dataset.tail()

       gender   age  hypertension  heart_disease smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
99995  Female  80.0             0              0         No Info  27.32          6.2                   90         0
99996  Female   2.0             0              0         No Info  17.37          6.5                  100         0
99997    Male  66.0             0              0          former  27.83          5.7                  155         0
99998  Female  24.0             0              0           never  35.42          4.0                  100         0
99999  Female  57.0             0              0         current  22.43          6.6                   90         0

# Store the names of the columns to drop in a list
columns = ['gender', 'smoking_history', 'HbA1c_level']

# Drop the listed columns (the columns= keyword already selects columns, so axis is not needed)
dataset = dataset.drop(columns=columns)

    age  hypertension  heart_disease    bmi  blood_glucose_level  diabetes
0  80.0             0              1  25.19                  140         0
1  54.0             0              0  27.32                   80         0
2  28.0             0              0  27.32                  158         0
3  36.0             0              0  23.45                  155         0
4  76.0             1              1  20.14                  155         0

<bound method DataFrame.info of
        age  hypertension  heart_disease    bmi  blood_glucose_level  diabetes
0      80.0             0              1  25.19                  140         0
1      54.0             0              0  27.32                   80         0
2      28.0             0              0  27.32                  158         0
3      36.0             0              0  23.45                  155         0
4      76.0             1              1  20.14                  155         0
...     ...           ...            ...    ...                  ...       ...
99995  80.0             0              0  27.32                   90         0
99996   2.0             0              0  17.37                  100         0
99997  66.0             0              0  27.83                  155         0
99998  24.0             0              0  35.42                  100         0
99999  57.0             0              0  22.43                   90         0

# Features: every column except the last; target: the last column (diabetes)
x = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

print(x)
print(y)

        age  hypertension  heart_disease    bmi  blood_glucose_level
0      80.0             0              1  25.19                  140
1      54.0             0              0  27.32                   80
2      28.0             0              0  27.32                  158
3      36.0             0              0  23.45                  155
4      76.0             1              1  20.14                  155
...     ...           ...            ...    ...                  ...
99995  80.0             0              0  27.32                   90
99996   2.0             0              0  17.37                  100
99997  66.0             0              0  27.83                  155
99998  24.0             0              0  35.42                  100
99999  57.0             0              0  22.43                   90

[100000 rows x 5 columns]
0        0
1        0
2        0
3        0
4        0
        ..
99995    0
99996    0
99997    0
99998    0
99999    0
Name: diabetes, Length: 100000, dtype: int64
dataset.isnull().sum()

age                    0
hypertension           0
heart_disease          0
bmi                    0
blood_glucose_level    0
diabetes               0
dtype: int64

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train)
print(x_test)
print(y_train)
print(y_test)

        age  hypertension  heart_disease    bmi  blood_glucose_level
75220  73.0             0              0  24.77                   80
48955  80.0             0              0  24.60                  145
44966  38.0             0              0  24.33                  158
13568  26.0             0              0  18.87                  100
92727  61.0             1              0  22.11                   85
...     ...           ...            ...    ...                  ...
6265   49.0             0              0  32.98                   80
54886  15.0             0              0  28.10                  159
76820  42.0             0              0  26.14                   85
860    37.0             0              0  24.96                  158
15795  23.0             0              0  27.99                  159

[80000 rows x 5 columns]
        age  hypertension  heart_disease    bmi  blood_glucose_level
75721  13.0             0              0  20.82                  126
80184   3.0             0              0  21.00                  145
19864  63.0             0              0  25.32                  200
76699   2.0             0              0  17.43                  126
92991  33.0             0              0  40.08                  200
...     ...           ...            ...    ...                  ...
32595  44.0             0              0  21.95                  159
29313  61.0             1              0  41.98                   90
37862  49.0             0              0  26.51                  100
53421  73.0             0              1  27.32                  100
42410  43.0             0              0  23.86                  145

[20000 rows x 5 columns]
75220    0
48955    1
44966    0
13568    0
92727    0
        ..
6265     0
54886    0
76820    0
860      0
15795    0

SCREENSHOTS

CONCLUSION
The provided code is a Python script for predicting diabetes using a Random Forest
Classifier. The script uses the pandas library to handle a diabetes dataset and the
scikit-learn library for machine learning tasks. Here's a summary of the key steps in the code:

1. Data Loading: The script begins by loading a diabetes dataset from a CSV file
using the pandas library.
2. Data Preparation: The features (x) and labels (y) are extracted from the dataset.
The script then prints the features and labels for inspection.
3. Data Splitting: The dataset is split into training and testing sets using the
train_test_split function from scikit-learn.
4. Model Training: A Random Forest Classifier is initialized and trained on the
training data.
5. Prediction: The trained model is used to predict labels for the test set, and the
results are printed alongside the actual labels.
6. Model Evaluation: The script calculates a confusion matrix and accuracy score to
evaluate the performance of the model on the test set.
7. Prediction on New Data: The script performs a prediction on a new data point
representing a person's health metrics. It prints whether the person is predicted to be
diabetic or not based on the trained model.
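
Steps 4 to 7 are described above but their code does not appear in the CODING section, so the following is a hedged sketch of what they might look like rather than the report's actual script. Several things are assumptions: the data is a small synthetic stand-in for the CSV file (with the same five feature columns), the labelling rule is invented so the model has something to learn, max_depth=5 is a guessed value, and the new entry is a made-up person.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for diabetes_prediction_dataset.csv (values are made up).
rng = np.random.default_rng(42)
n = 1000
dataset = pd.DataFrame({
    "age": rng.integers(1, 81, size=n).astype(float),
    "hypertension": rng.integers(0, 2, size=n),
    "heart_disease": rng.integers(0, 2, size=n),
    "bmi": rng.normal(27.0, 5.0, size=n).round(2),
    "blood_glucose_level": rng.integers(80, 301, size=n),
})
# Invented labelling rule, used only to give the classifier a learnable signal.
dataset["diabetes"] = (dataset["blood_glucose_level"] > 200).astype(int)

x = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Step 4: train the model (max_depth=5 is an assumed value).
classifier = RandomForestClassifier(max_depth=5, random_state=42)
classifier.fit(x_train, y_train)

# Step 5: predict labels for the test set.
y_pred = classifier.predict(x_test)

# Step 6: rows of the confusion matrix are actual classes, columns are predicted.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print("Accuracy:", acc)

# Step 7: predict for a new (hypothetical) person with the same five features.
new_entry = pd.DataFrame([[45.0, 0, 0, 28.5, 240]], columns=x.columns)
prediction = classifier.predict(new_entry)[0]
print("Diabetic" if prediction == 1 else "Not diabetic")
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test samples, which is why the two metrics are usually reported together.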

The provided code serves as a concise example of using a Random Forest Classifier for
diabetes prediction. However, there are a few points to consider for improvement:

● The script lacks proper comments and documentation, making it less readable and
harder to understand for someone unfamiliar with the code.
● It would be beneficial to include explanations of the features in the dataset for better
understanding.
● The dataset source and context are not provided, making it challenging to interpret
the significance of the features and the reliability of the model.
● Further analysis, such as hyperparameter tuning, cross-validation, or feature
importance exploration, could enhance the model's performance and interpretability.
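
The last improvement point could be explored along the following lines. This is a sketch under assumptions, not part of the report's code: the data is synthetic (same feature names, invented values and labelling rule), and the max_depth candidates and the 5-fold setting are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (values and labelling rule are made up).
rng = np.random.default_rng(0)
n = 600
features = pd.DataFrame({
    "age": rng.integers(1, 81, size=n).astype(float),
    "hypertension": rng.integers(0, 2, size=n),
    "heart_disease": rng.integers(0, 2, size=n),
    "bmi": rng.normal(27.0, 5.0, size=n).round(2),
    "blood_glucose_level": rng.integers(80, 301, size=n),
})
target = (features["blood_glucose_level"] > 200).astype(int)

# Hyperparameter tuning with 5-fold cross-validation over max_depth.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
    scoring="accuracy",
)
search.fit(features, target)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)

# Feature importances of the best model, largest first (they sum to 1).
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances)
```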

In conclusion, while the code successfully demonstrates the application of a Random
Forest Classifier for diabetes prediction, adding comments, documentation, and
considering additional analysis would make it more robust and user-friendly.
