AI Project Medicine Recommending System

Uploaded by

kashaf.zahra04

Semester Project

“Medicine Prediction Model”

Name: Syeda Kashaf Naqvi

Roll#: BSSEM-F22-125

Section: 5C

Department: Software Engineering

Submission Date: 27-11-2024

Course: Artificial Intelligence

Course Instructor: Ms. Asma Abubakar

1. Title and Overview

Title:
Medical Dataset for Predicting Medicines Based on Symptoms

Overview:
This dataset is designed to support the prediction of appropriate medicines based
on a patient’s symptoms. It contains medical records including patient
demographics, symptoms, diagnosed causes, and prescribed treatments. The
dataset was likely collected for research or to develop a recommendation system
for healthcare. It represents a small sample of real-world medical cases.

2. Source

Origin:
The dataset was obtained from Kaggle (listed under References). Its original
provenance is unspecified; it may have come from simulated or anonymized
medical records.

Collection Method:
Data appears to be collected via case summaries, combining observational and
synthetic inputs. The dataset includes common symptoms, probable causes, and
associated treatments.

Why use a Random Forest Classifier?

Three classifiers were compared, and Random Forest achieved the highest accuracy:

Logistic Regression Accuracy: 0.5344827586206896

Decision Tree Accuracy: 0.8448275862068966

Random Forest Accuracy: 0.896551724137931
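
A comparison like the one above could be run along the following lines. This is only a sketch: the helper name compare_models, the synthetic stand-in data from make_classification, and the max_iter setting are assumptions for illustration, not the author's actual experiment, so the printed numbers will differ from the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, random_state=42):
    """Fit each candidate on the same 80/20 split and return {name: test accuracy}."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data (assumption): 287 rows to mirror the dataset size.
X, y = make_classification(
    n_samples=287, n_features=10, n_informative=5, n_classes=3, random_state=0
)
scores = compare_models(X, y)
for name, accuracy in scores.items():
    print(f"{name} Accuracy: {accuracy}")
```

With 287 rows and test_size=0.2, scikit-learn holds out 58 rows. The reported accuracies are consistent with 31, 49, and 52 correct predictions out of 58, assuming they were measured on that split.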

3. Content Description

Variables:

Variable       Type         Description                                      Units
Name           Categorical  Patient name (potentially anonymized).           N/A
Date of Birth  Date         Date of birth of the patient.                    DD-MM-YYYY
Gender         Categorical  Gender of the patient (e.g., Male, Female).      N/A
Symptoms       Categorical  List of reported symptoms.                       N/A
Causes         Categorical  Possible causes of symptoms.                     N/A
Disease        Categorical  Diagnosed disease based on symptoms and causes.  N/A
Medicine       Categorical  Prescribed medication or treatment.              N/A

Sample Size:
The dataset includes 287 records, though some entries have missing values.

4. Data Structure
Format:
CSV file.

Dimensions:
287 rows × 7 columns.
5. Summary Statistics

Descriptive Statistics:

● Name: 241 non-null, with 87 unique entries; most frequent value is "Sophia
Koh".
● Gender: 242 non-null; predominantly "Male" (116 occurrences).
● Symptoms: 247 non-null, with 53 unique combinations; the most common
combination is "Fatigue, Weakness".
● Disease: 249 non-null, 68 unique; "Gastroenteritis" appears most frequently
(20 times).
● Medicine: 242 non-null, 65 unique; "Rest, Lifestyle" is the most common
treatment (16 occurrences).
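
Figures of this kind (non-null count, number of unique values, most frequent value and its count) can be computed with pandas. The sketch below uses a few toy rows as a stand-in, and the helper name summarize is hypothetical; the real file would be loaded with pd.read_csv instead.

```python
import pandas as pd

# Toy stand-in rows (assumption); not taken from the actual dataset.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None],
    "Symptoms": ["Fatigue, Weakness", "Fever, Cough", "Fatigue, Weakness", None],
    "Disease": ["Anemia", "Flu", "Anemia", "Anemia"],
})


def summarize(df):
    """Return non-null count, unique count, top value and its frequency per column."""
    rows = {}
    for column in df.columns:
        counts = df[column].value_counts()  # missing values are excluded by default
        rows[column] = {
            "non_null": int(df[column].notna().sum()),
            "unique": int(df[column].nunique()),
            "top": counts.index[0],
            "freq": int(counts.iloc[0]),
        }
    return pd.DataFrame(rows).T  # one row per column of the input


summary = summarize(df)
print(summary)
```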

Distributions:
Variables such as Symptoms, Causes, and Disease are categorical and multi-modal,
so numeric descriptive statistics do not apply. The missing data and frequently
repeated values suggest that the dataset is synthetic or anonymized.

6. Data Quality

Missing Values:

● Entries are missing in columns such as Name, Gender, Symptoms, Causes, and
Medicine.
● Possible strategies include filling each column with its mode (the mean does
not apply to categorical data) or excluding incomplete rows.
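
The two strategies can be sketched as follows. The toy rows and the helper name fill_with_mode are assumptions for illustration. Dropping rows with a missing Medicine value, rather than imputing the target, is one possible design choice and not necessarily what the project code does.

```python
import pandas as pd

# Toy stand-in rows (assumption); not taken from the actual dataset.
df = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "Causes": ["Stress", "Obesity", None, "Stress"],
    "Medicine": ["Rest, Lifestyle", None, "Rest, Lifestyle", "Antibiotics"],
})


def fill_with_mode(df, columns):
    """Return a copy in which each listed column has NaN replaced by its mode."""
    out = df.copy()
    for column in columns:
        out[column] = out[column].fillna(out[column].mode()[0])
    return out


# Strategy 1: impute feature columns with their most frequent value.
filled = fill_with_mode(df, ["Gender", "Causes"])

# Strategy 2: exclude rows where the target is missing instead of guessing it.
complete = filled.dropna(subset=["Medicine"])
print(complete)
```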

Outliers:

● Some rare values may be edge cases or errors and should be validated.

Validation:

● Cross-verification against known medical knowledge could help improve
data reliability.

7. References

● Dataset Source: Kaggle, Medicine Recommendation System Dataset
● Additional Resources:
  o DrugBank Dataset (for drug information)
  o Symptom-Disease Mapping from Mayo Clinic

“CODE”

The following code has been implemented for the model:


Library Imports
# =========== IMPORT REQUIRED LIBRARIES =====================

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

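Preprocessing

# NOTE: the function header and the splitting step are missing from the
# original listing. They are reconstructed here from the calls in main()
# and test(); the exact splitting logic is an assumption.
def preprocess(df, categorical_columns, mlb, test=False):
    # Assumed step: turn each comma-separated symptom string into a list
    df['Symptoms'] = df['Symptoms'].fillna('').apply(
        lambda s: [part.strip() for part in s.split(',') if part.strip()])

    unique_symptoms = sorted({symptom for symptoms in df['Symptoms']
                              for symptom in symptoms})
    print(unique_symptoms)

    # Encode Symptoms column using MultiLabelBinarizer
    if not test:
        # Fit and transform for training data
        symptoms_encoded = mlb.fit_transform(df['Symptoms'])
    else:
        # Only transform for test data
        symptoms_encoded = mlb.transform(df['Symptoms'])

    # Create a new DataFrame for the encoded symptoms
    symptoms_df = pd.DataFrame(symptoms_encoded, columns=mlb.classes_,
                               index=df.index)

    # Drop the original Symptoms column and concatenate the new binary features
    df = pd.concat([df.drop('Symptoms', axis=1), symptoms_df], axis=1)

    # Fill missing values for other categorical columns with the column mode
    for column in categorical_columns:
        if column in df.columns:
            df[column] = df[column].fillna(df[column].mode()[0])
        else:
            print(f"Warning: Column '{column}' is missing in the DataFrame.")

    return df, mlb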
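
To make the symptom encoding concrete, the sketch below runs MultiLabelBinarizer on a few toy symptom strings. The helper split_symptoms and the sample strings are assumptions for illustration. It also shows that symptoms not seen during fitting are ignored at transform time (scikit-learn issues a warning), so an entirely unseen combination becomes an all-zero row.

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer

train_symptoms = ["Fatigue, Weakness", "Anxiety, Numbness", "Fatigue"]
test_symptoms = ["Anxiety, Fatigue", "Abdominal Pain, Bloating"]


def split_symptoms(text):
    """Split a comma-separated symptom string into a list of clean labels."""
    return [part.strip() for part in text.split(",") if part.strip()]


mlb = MultiLabelBinarizer()
# Fit on training data only: one binary column per distinct symptom.
train_encoded = mlb.fit_transform([split_symptoms(s) for s in train_symptoms])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # unseen symptoms trigger a warning
    # Reuse the fitted binarizer so test columns match the training columns.
    test_encoded = mlb.transform([split_symptoms(s) for s in test_symptoms])

print(list(mlb.classes_))
print(train_encoded)
print(test_encoded)
```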

Encoding features
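# ============= Encoding of Features ==================

def label_encode(df: pd.DataFrame):
    # Replace every text column with integer codes (in place) and keep the
    # fitted encoder per column so the codes can be reused or reversed later
    label_encoders = {}
    for column in df.columns:
        if df[column].dtype == 'object':
            le = LabelEncoder()
            df[column] = le.fit_transform(df[column])
            label_encoders[column] = le
    return label_encoders

def checking_missing(df: pd.DataFrame):
    # Return the missing-value count for each column that has any
    missing_values = df.isnull().sum()
    return missing_values[missing_values > 0]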

Testing on unseen test data


# ================== UNSEEN TESTING DATA ========================

def test(C_COLS, LEs, MODEL, BINARIZER):
    data = pd.DataFrame({
        'Name': ['Zaid', 'Jawad'],
        'DateOfBirth': ['1990-01-01', '1992-02-02'],
        'Gender': ['Male', 'Female'],
        'Symptoms': ['Anxiety, Numbness', 'Abdominal Pain, Bloating'],
        'Causes': ['Stress', 'Obesity'],
        'Disease': ['Anxiety Disorder', 'Sleep Apnea'],
    })

    # Preprocess test data using the existing BINARIZER
    data, _ = preprocess(data, C_COLS, BINARIZER, test=True)

    # Apply label encoding using the trained encoders.
    # Note: LabelEncoder.transform raises ValueError for labels that were
    # not seen during fitting.
    for column in data.columns:
        if column in LEs:
            encoder = LEs[column]
            data[column] = encoder.transform(data[column])

    # Make predictions and map the encoded labels back to medicine names
    predictions = MODEL.predict(data)
    print(LEs['Medicine'].inverse_transform(predictions))
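
A fitted LabelEncoder rejects labels it did not see during training, which matters for new rows such as unfamiliar patient names. The sketch below shows that behaviour and one possible workaround. The helper safe_transform and the placeholder code -1 are assumptions for illustration, not part of scikit-learn or of the project code.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def safe_transform(encoder, values, unknown=-1):
    """Encode values with a fitted LabelEncoder, mapping unseen labels to `unknown`."""
    known = {label: code for code, label in enumerate(encoder.classes_)}
    return np.array([known.get(value, unknown) for value in values])


encoder = LabelEncoder().fit(["Male", "Female", "Male"])

# The plain transform raises ValueError on a label missing from the fit data.
try:
    encoder.transform(["Other"])
    raised = False
except ValueError:
    raised = True

codes = safe_transform(encoder, ["Female", "Other", "Male"])
print(raised, codes)
```

Another option, also an assumption, is to drop identifier-like columns such as Name before training, since they are unlikely to carry a signal that generalizes to new patients.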

Main Function
This function prompts for the training dataset CSV and then trains the model on it.

# ================== MAIN FUNCTION ===================

def main():
    # For Colab: Upload file and load dataset
    from google.colab import files
    print("Upload your 'medical data.csv' file.")
    uploaded = files.upload()  # Prompt for file upload in Colab
    df = pd.read_csv(list(uploaded.keys())[0])

    CATEGORICAL_COLUMNS_TRAIN = ['Gender', 'Causes', 'Disease', 'Medicine']
    # Test rows have no target column, so drop 'Medicine' from the list
    CATEGORICAL_COLUMNS_TEST = CATEGORICAL_COLUMNS_TRAIN[:-1]
    print(list(df['Disease'].unique()))

    # Preprocess training data
    mlb = MultiLabelBinarizer()
    df, mlb = preprocess(df, CATEGORICAL_COLUMNS_TRAIN, mlb)

    # Label encode categorical columns
    label_encoders = label_encode(df)

    # Prepare features and target variable
    X = df.drop('Medicine', axis=1)
    y = df['Medicine']

    # Train-test split (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Train a RandomForest model
    rf_classifier = RandomForestClassifier(random_state=42)
    rf_classifier.fit(X_train, y_train)

    # Evaluate the model
    predictions = rf_classifier.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print(f'Accuracy: {accuracy}')

    # Test on new data
    # test(CATEGORICAL_COLUMNS_TEST, label_encoders, rf_classifier, mlb)
    print("ALL GOOD")

    # Save model (optional)
    # joblib.dump(rf_classifier, 'rf_classifier.joblib')

# Call the main function
main()
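
If the model is saved, the fitted binarizer and target encoder are also needed to turn new symptoms into features and predictions back into medicine names. The sketch below stores all three in one joblib file. The bundle layout, the file name medicine_model.joblib, and the toy training rows are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Toy stand-in training data (assumption); not taken from the actual dataset.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(
    [["Fatigue", "Weakness"], ["Anxiety", "Numbness"], ["Fatigue"], ["Anxiety"]]
)
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(
    ["Rest, Lifestyle", "Therapy", "Rest, Lifestyle", "Therapy"]
)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Persist the model together with the preprocessing objects it depends on.
bundle = {"model": model, "binarizer": mlb, "target_encoder": target_encoder}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "medicine_model.joblib")
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# Use only the restored objects to predict for a new symptom list.
features = restored["binarizer"].transform([["Fatigue", "Weakness"]])
prediction = restored["target_encoder"].inverse_transform(
    restored["model"].predict(features)
)
print(prediction)
```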
