Aih Lab1

The document outlines an experiment on regression analysis using healthcare datasets, focusing on linear and logistic regression techniques. It details the objectives, outcomes, system requirements, theoretical background, types of regression, and algorithms used for analysis. The conclusion emphasizes the effectiveness of these models in predicting medical outcomes and enhancing healthcare delivery.

Sardar Patel Institute of Technology, Mumbai

Department of Electronics and Telecommunication Engineering


B.E. Sem-VII- PE-IV (2024-2025)
IT 24 - AI in Healthcare

Experiment 1: Regression in Healthcare Dataset

Name: Sanika Tiwarekar Date: 12/08/2024

Objective:

● Write a program for regression analysis of a healthcare dataset.

● Demonstrate the working principle of regression techniques on a medical dataset by
building a model that classifies or predicts the outcome for a new sample.

Outcomes:
● Explore a medical dataset suitable for a linear or logistic regression problem

● Explore the patterns in the dataset and apply a suitable algorithm

System Requirements:
Linux with Python and its libraries or R, or Windows with MATLAB
Theory:
What is regression with a mathematical approach?

Regression is a statistical method used in finance, investing, and other disciplines that attempts to
determine the strength and character of the relationship between a dependent variable and one or
more independent variables. Linear regression is the most common form of this technique. In its
simplest form, usually fitted by ordinary least squares (OLS), it establishes the linear
relationship between two variables. Graphically, it is a straight line of best fit whose slope
defines how a change in one variable impacts the other. The y-intercept represents the value of the
dependent variable when the value of the independent variable is zero. Nonlinear regression models
also exist, but they are generally more complex.

Consider a model where y depends linearly on x. We can then write a hypothesis that resembles the
equation of a straight line (y = mx + c): h(x) = θ₀ + θ₁x. Here θ₀ (the intercept, c) and θ₁ (the
slope, m) are called regression coefficients.
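
As a minimal sketch of how the two regression coefficients can be estimated, the snippet below fits a line by ordinary least squares, assuming the hypothesis has the usual form y ≈ θ₀ + θ₁x. The function name fit_simple_linear and the data points are made up for illustration and do not come from the experiment's datasets.

```python
from statistics import mean

def fit_simple_linear(xs, ys):
    """Return (theta0, theta1) minimising the squared error of theta0 + theta1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# Made-up points lying exactly on y = 2x + 1 (illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_simple_linear(xs, ys)
print(theta0, theta1)  # intercept 1.0, slope 2.0
```

For data with several independent variables, a library routine such as scikit-learn's LinearRegression solves the same least-squares problem.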
The two basic types of regression are simple linear regression and multiple linear regression,
although there are nonlinear regression methods for more complicated data and analysis. Simple
linear regression uses one independent variable to explain or predict the outcome of the
dependent variable Y, while multiple linear regression uses two or more independent variables to
predict the outcome. Analysts can use stepwise regression to examine each independent variable
contained in the linear regression model.

What are the types of regression and their significance?


There are several types of regression techniques, each suited for different types of data and
different types of relationships. The main types of regression techniques are:
1. Linear Regression - Linear regression is a linear approach to modeling the relationship
between a scalar response (the criterion) and one or more predictors (explanatory
variables). It focuses on the conditional probability distribution of the
response given the values of the predictors. When there are many predictors relative to the
number of observations, linear regression risks overfitting.
2. Polynomial Regression - This is used to model a non-linear relationship between the
dependent variable and independent variables. Here the input variables include some
polynomial or higher degree terms of some already existing features as well.
3. Stepwise Regression - It fits a regression model by selecting the predictive variables
through an automatic procedure. At each step, a variable is added to or removed from the set
of explanatory variables. The approaches for stepwise regression are forward selection,
backward elimination, and bidirectional elimination.
4. Decision Tree Regression - A decision tree is a flowchart-like tree structure, where each
internal node denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (terminal node) holds a prediction: a class label for classification or a
numeric value for regression. It is a non-parametric method that can be used to predict a
continuous outcome.
5. Random Forest Regression - The basic idea is to combine multiple decision trees to
determine the final output rather than relying on an individual tree; for regression, the
trees' predictions are averaged. Random Forest uses multiple decision trees as base learners.
Rows and features are sampled at random from the dataset to form a separate sample dataset
for each tree. The row sampling, done with replacement, is called bootstrapping.
6. Support Vector Regression - SVR can use both linear and non-linear kernels. A linear
kernel is a simple dot product between two input vectors, while a non-linear kernel is a
more complex function that can capture more intricate patterns in the data. The choice of
kernel depends on the data’s characteristics and the task’s complexity.
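
To make the kernel idea in item 6 concrete, here is a small sketch of a linear kernel next to one common non-linear kernel. The choice of the Gaussian (RBF) kernel and the value gamma=0.5 are assumptions made for illustration; the text above does not name a specific non-linear kernel.

```python
import math

def linear_kernel(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Made-up input vectors (illustration only).
u = [1.0, 2.0]
v = [3.0, 4.0]
print(linear_kernel(u, v))  # 1*3 + 2*4 = 11.0
print(rbf_kernel(u, v))     # exp(-0.5 * 8) = exp(-4)
print(rbf_kernel(u, u))     # identical vectors give 1.0
```

The RBF value shrinks toward 0 as two samples move apart, which lets a model built on it respond to local structure that a plain dot product cannot express.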

Dataset:
1. Linear Regression - Medical Insurance Dataset
(https://www.kaggle.com/datasets/mirichoi0218/insurance)
2. Logistic Regression - Heart Disease Dataset
(https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)

Algorithm:
Step 1: Create a sample dataset with multiple independent variables and one dependent
variable (Y).
Step 2: The data is split into training and testing sets using the train_test_split function.
Step 3: Different regression models are created and fitted to the training data.
Step 4: Predictions are made on the test set.
Step 5: The model is evaluated using metrics like Mean Absolute Error, Mean Squared Error,
and Root Mean Squared Error.
Step 6: Finally, the coefficients and intercept of the regression equation are printed.
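
The three metrics named in Step 5 can be sketched in a few lines of plain Python. The function names and the sample values below are made up for illustration; scikit-learn's metrics module offers mean_absolute_error and mean_squared_error for the same purpose.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences; penalises large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Made-up actual and predicted values (illustration only).
y_true = [3.0, 5.0, 8.0, 10.0]
y_pred = [2.0, 7.0, 8.0, 13.0]
print(mean_absolute_error(y_true, y_pred))      # (1 + 2 + 0 + 3) / 4 = 1.5
print(mean_squared_error(y_true, y_pred))       # (1 + 4 + 0 + 9) / 4 = 3.5
print(root_mean_squared_error(y_true, y_pred))  # sqrt(3.5)
```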

Code:
1. Linear Regression
import opendatasets as od
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split as holdout
from sklearn.preprocessing import LabelEncoder

# Download the insurance dataset from Kaggle
od.download('https://www.kaggle.com/datasets/mirichoi0218/insurance')

df = pd.read_csv('/content/insurance/insurance.csv')
df.head()
df.describe()

# Mark the text columns as categorical
df[['sex', 'smoker', 'region']] = df[['sex', 'smoker', 'region']].astype('category')
df.dtypes

# Encode each categorical column as integer labels
label = LabelEncoder()
label.fit(df.sex.drop_duplicates())
df.sex = label.transform(df.sex)
label.fit(df.smoker.drop_duplicates())
df.smoker = label.transform(df.smoker)
label.fit(df.region.drop_duplicates())
df.region = label.transform(df.region)
df.dtypes

# Separate the features from the target (charges) and hold out 20% for testing
x = df.drop(['charges'], axis=1)
y = df['charges']
x_train, x_test, y_train, y_test = holdout(x, y, test_size=0.2, random_state=0)

# Fit the model and report the R^2 score on the test set
Lin_reg = LinearRegression()
Lin_reg.fit(x_train, y_train)
print(Lin_reg.score(x_test, y_test))

# Print the coefficients and intercept of the regression equation
print("Coefficients:", Lin_reg.coef_)
print("Intercept:", Lin_reg.intercept_)
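
The printed coefficients and intercept define the regression equation, so a prediction for a new sample is the intercept plus the weighted sum of its feature values. The sketch below shows that arithmetic with made-up numbers; the helper predict_linear and all values in it are hypothetical and are not the fitted values from the insurance data.

```python
def predict_linear(intercept, coefs, sample):
    """Evaluate intercept + sum(coef_i * x_i) for one sample."""
    if len(coefs) != len(sample):
        raise ValueError("sample must have one value per coefficient")
    return intercept + sum(c * x for c, x in zip(coefs, sample))

# Made-up numbers for illustration only; not fitted to the insurance data.
intercept = 100.0
coefs = [2.0, -1.0, 0.5]
new_sample = [10.0, 4.0, 6.0]
print(predict_linear(intercept, coefs, new_sample))  # 100 + 20 - 4 + 3 = 119.0
```

With the fitted model, calling Lin_reg.predict on rows that have the same columns, in the same order and encoding as x_train, performs the same calculation.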

2. Logistic Regression
from google.colab import drive
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Mount Google Drive to read the dataset
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/MyDrive/AIH C4/heart.csv')

# Separate features and target variable
X = df.drop(columns='HeartDisease')
y = df['HeartDisease']

# Identify categorical and numerical columns
categorical_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
numerical_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']

# Preprocessing pipeline for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create a pipeline with preprocessing and logistic regression model
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('classifier', LogisticRegression(max_iter=10000))])

# Train the logistic regression model
model_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model_pipeline.predict(X_test)
y_pred_proba = model_pipeline.predict_proba(X_test)[:, 1]

# Evaluate the model
classification_rep = classification_report(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print("Classification Report:\n", classification_rep)
print("ROC-AUC Score:", roc_auc)
print("Confusion Matrix:\n", conf_matrix)
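
Logistic regression turns a linear score into a probability with the sigmoid function and then assigns a class by thresholding that probability. The sketch below illustrates the idea for a single sample; the helper names, the coefficients, and the 0.5 threshold are assumptions chosen for illustration, not values taken from the fitted pipeline.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(intercept, coefs, sample, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = intercept + sum(c * x for c, x in zip(coefs, sample))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Made-up coefficients and sample (illustration only).
p, label = predict_class(-1.0, [2.0, 0.5], [1.0, 2.0])
# z = -1 + 2*1 + 0.5*2 = 2, and sigmoid(2) is about 0.881
print(round(p, 3), label)
```

In the pipeline above, predict_proba plays the role of the probability and predict the role of the thresholded label.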

Output:
1. Linear Regression
2. Logistic Regression
Conclusion:
● The exploration of the healthcare dataset, including visualizing the relationships between
features, allowed us to identify key patterns that influence medical conditions.
● Linear regression was applied to predict a continuous outcome (medical insurance charges), and
logistic regression was used for a classification task (presence of heart disease). Logistic
regression performed well for binary classification of disease presence, while linear regression
provided reasonably accurate predictions for the continuous variable.
● Such models can be integrated into healthcare systems for early prediction, personalized
treatment plans, and enhanced patient monitoring, which can help improve the overall
quality of healthcare delivery.
