Aih Lab1
Objective:
● Explore the patterns in the dataset and apply a suitable algorithm
System Requirements:
Linux with Python and its libraries, or R; or Windows with MATLAB
Theory:
What is regression with a mathematical approach?
Regression is a statistical method used in finance, investing, and other disciplines to
determine the strength and character of the relationship between a dependent variable and one or
more independent variables. Linear regression is the most common form of this technique. In its
simplest form (simple linear regression, usually fitted by ordinary least squares, OLS), it
establishes the linear relationship between two variables. Linear regression is depicted
graphically as a straight line of best fit, whose slope describes how a change in the independent
variable changes the dependent variable. The y-intercept is the value of the dependent variable
when the independent variable is zero. Nonlinear regression models also exist, but they
are far more complex.
Let’s consider a model where y depends linearly on x. We can then write a hypothesis in the
form of the equation of a straight line, y = θ₀ + θ₁x (equivalently y = mx + c, with m = θ₁
and c = θ₀). Here θ₀ (the intercept) and θ₁ (the slope) are called the regression coefficients.
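As a small worked illustration of the straight-line hypothesis above, the following sketch estimates the intercept θ₀ and slope θ₁ by ordinary least squares, using the standard closed-form formulas θ₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and θ₀ = ȳ − θ₁x̄. The function name fit_simple_ols and the data points are made up for illustration and are not part of the lab datasets.

```python
def fit_simple_ols(xs, ys):
    """Return (theta0, theta1) that minimise squared error for y = theta0 + theta1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / sxx                  # slope
    theta0 = y_mean - theta1 * x_mean   # intercept
    return theta0, theta1


# Illustrative points lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = fit_simple_ols(xs, ys)
print(f"intercept = {theta0:.2f}, slope = {theta1:.2f}")
# prints: intercept = 1.00, slope = 2.00
```

Because the points lie exactly on a line, the fit recovers that line. With noisy data the same formulas give the line of best fit.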
The two basic types of regression are simple linear regression and multiple linear regression,
although there are nonlinear regression methods for more complicated data and analysis. Simple
linear regression uses one independent variable to explain or predict the outcome of the
dependent variable Y, while multiple linear regression uses two or more independent variables to
predict the outcome. Analysts can use stepwise regression to examine each independent variable
contained in the linear regression model.
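With two or more independent variables, the same least-squares idea is applied to a matrix of inputs. The sketch below assumes NumPy is available and uses made-up data generated from y = 1 + 2·x1 + 3·x2. It is only an illustration of multiple linear regression, not the method used later in this lab.

```python
import numpy as np

# Illustrative data: two independent variables, y = 1 + 2*x1 + 3*x2 exactly
X = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 1.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem min ||X_design @ theta - y||^2
theta, _, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept:", float(theta[0]))
print("coefficients:", theta[1:])
```

Here theta[0] plays the role of the intercept and theta[1:] holds one coefficient per independent variable.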
Dataset:
1. Linear Regression - Medical Insurance Dataset
(https://www.kaggle.com/datasets/mirichoi0218/insurance)
2. Logistic Regression - Heart Disease Dataset
(https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)
Algorithm:
Step 1: Create a sample dataset with multiple independent variables and one dependent
variable (Y).
Step 2: The data is split into training and testing sets using the train_test_split function.
Step 3: Different regression models are created and fitted to the training data.
Step 4: Predictions are made on the test set.
Step 5: The model is evaluated using metrics like Mean Absolute Error, Mean Squared Error,
and Root Mean Squared Error.
Step 6: Finally, the coefficients and intercept of the regression equation are printed.
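The steps above can be sketched end to end as follows. This is a minimal illustration that assumes scikit-learn and NumPy are installed. The synthetic data, the true coefficients, test_size=0.2 and random_state=42 are all assumed values chosen for the example, not taken from the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: sample dataset with three independent variables and one target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Step 2: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 3: create the model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Step 4: predict on the test set
y_pred = model.predict(X_test)

# Step 5: evaluate with MAE, MSE and RMSE
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)

# Step 6: print the fitted regression equation parameters
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```

RMSE is computed as the square root of MSE so that the error is reported in the same units as the target.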
Code:
1. Linear Regression
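import opendatasets as od
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Download the dataset from Kaggle (needs Kaggle API credentials)
od.download('https://www.kaggle.com/datasets/mirichoi0218/insurance')
df = pd.read_csv('/content/insurance/insurance.csv')
df.head()
df.describe()
df.dtypes

# Encode the categorical columns as integers
label = LabelEncoder()
for col in ['sex', 'smoker', 'region']:
    df[col] = label.fit_transform(df[col])
df.dtypes

# Separate the features from the target
x = df.drop(['charges'], axis=1)
y = df['charges']

# Split into training and test sets. The split was missing from the
# original listing; test_size and random_state are assumed values.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Fit the model and report its parameters and test-set R^2
Lin_reg = LinearRegression()
Lin_reg.fit(x_train, y_train)
print("Intercept:", Lin_reg.intercept_)
print("Coefficients:", Lin_reg.coef_)
print("R^2 on test set:", Lin_reg.score(x_test, y_test))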
2. Logistic Regression
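from google.colab import drive
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/AIH C4/heart.csv')
X = df.drop(columns='HeartDisease')
y = df['HeartDisease']

# The transformer list was lost from the original listing. Assumed here:
# scale numeric columns and one-hot encode the rest, chosen by dtype.
numeric_cols = X.select_dtypes(include='number').columns
categorical_cols = X.select_dtypes(exclude='number').columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ])

# Split the dataset into training and testing sets
# (test_size and random_state are assumed values)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=10000))])
model_pipeline.fit(X_train, y_train)
y_pred = model_pipeline.predict(X_test)
y_pred_proba = model_pipeline.predict_proba(X_test)[:, 1]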
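The listing above stops at y_pred and y_pred_proba. One common way to evaluate such predictions is with accuracy, a confusion matrix and ROC AUC, as in the sketch below. It is self-contained and uses synthetic data rather than heart.csv, so the dataset, the split parameters and the variable names acc, cm and auc are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem: label is 1 when a noisy linear score is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

# Hard class labels and the predicted probability of the positive class
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)[:, 1]

acc = accuracy_score(y_test, y_pred)                  # share of correct labels
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true, cols: predicted
auc = roc_auc_score(y_test, y_pred_proba)             # ranking quality of probabilities

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print("ROC AUC:", auc)
```

Accuracy and the confusion matrix use the hard labels in y_pred, while ROC AUC uses the probabilities in y_pred_proba, which is why both are computed.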
Output:
1. Linear Regression
2. Logistic Regression
Conclusion:
● Exploring the healthcare datasets, including visualizing the relationships between
features, allowed us to identify key patterns that influence medical conditions.
● Linear regression was applied to predict a continuous outcome (medical insurance charges),
and logistic regression was used for a classification task (heart disease). Logistic regression
performed well for binary classification of disease presence, while linear regression
provided reasonably accurate predictions of the continuous target.
● Such models can be integrated into healthcare systems for early prediction, personalized
treatment plans, and enhanced patient monitoring, which could improve the overall
quality of healthcare delivery.