AI Project Medicine Recommending System
Roll#: BSSEM-F22-125
Section: 5C
Title:
Medical Dataset for Predicting Medicines Based on Symptoms
Overview:
This dataset is designed to support the prediction of appropriate medicines based
on a patient’s symptoms. It contains medical records including patient
demographics, symptoms, diagnosed causes, and prescribed treatments. The
dataset was likely collected for research or to develop a recommendation system
for healthcare. It is a small sample of medical cases whose real-world or
simulated origin is not documented.
2. Source
Origin:
The origin of the dataset is unspecified; it may have been derived from simulated
or anonymized medical records.
Collection Method:
The data appear to have been collected from case summaries, combining
observational and synthetic inputs. The dataset includes common symptoms,
probable causes, and associated treatments.
3. Content Description
Variables:
Sample Size:
The dataset includes 287 records, though some entries have missing values.
4. Data Structure
Format:
CSV file.
Dimensions:
287 rows × 7 columns.
5. Summary Statistics
Descriptive Statistics:
● Name: 241 non-null, with 87 unique entries; most frequent value is "Sophia
Koh".
● Gender: 242 non-null; the most frequent value is "Male" (116 occurrences).
● Symptoms: 247 non-null, with 53 unique combinations; the most common
combination is "Fatigue, Weakness".
● Disease: 249 non-null, 68 unique; "Gastroenteritis" appears most frequently
(20 times).
● Medicine: 242 non-null, 65 unique; "Rest, Lifestyle" is the most common
treatment (16 occurrences).
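Counts of this kind (non-null entries, unique values, most frequent value and its
frequency) can be reproduced with pandas. The following is a minimal sketch, not
the code used for this report: `summarize_categorical` is a helper name introduced
here, and the toy rows are made up for illustration rather than taken from the
dataset.

```python
import pandas as pd

def summarize_categorical(df: pd.DataFrame) -> pd.DataFrame:
    """Return non-null count, unique count, top value and its frequency per column."""
    rows = {}
    for column in df.columns:
        counts = df[column].value_counts()  # missing values are excluded by default
        rows[column] = {
            "non_null": int(df[column].notna().sum()),
            "unique": int(df[column].nunique()),
            "top": counts.index[0] if not counts.empty else None,
            "freq": int(counts.iloc[0]) if not counts.empty else 0,
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None],
    "Symptoms": ["Fatigue, Weakness", "Fever", "Fatigue, Weakness", "Cough"],
})
summary = summarize_categorical(toy)
print(summary)
```

On the real CSV, the same function would be called on the loaded DataFrame instead
of `toy`.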
Distributions:
Variables such as Symptoms, Causes, and Disease are categorical and multi-modal,
so numeric descriptive statistics do not apply and frequency counts are used
instead. The missing values and frequently repeated entries suggest that the data
are synthetic or anonymized.
6. Data Quality
Missing Values:
Outliers:
● Some rare values may indicate edge cases or errors and should be validated.
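Both checks in this section can be sketched with pandas. The example below is an
illustration under assumptions: `missing_report`, `rare_values` and the `min_count`
threshold are names and values chosen here, and the toy rows are hypothetical.

```python
import pandas as pd

def missing_report(df):
    """Missing-value count and share of rows per column."""
    missing = df.isnull().sum()
    return pd.DataFrame({"missing": missing, "share": missing / len(df)})

def rare_values(series, min_count=2):
    """Values occurring fewer than min_count times (candidates for manual review)."""
    counts = series.value_counts()
    return counts[counts < min_count].index.tolist()

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Disease": ["Gastroenteritis", "Gastroenteritis", "Migraine", None],
    "Medicine": ["Rest, Lifestyle", None, None, "Antacids"],
})
report = missing_report(toy)
rare = rare_values(toy["Disease"])
print(report)
print(rare)
```

A rare value is not necessarily an error, so the list is meant as input for manual
validation rather than automatic removal.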
Validation:
“CODE”
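import numpy as np
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer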
Preprocessing
print(unique_symptoms)
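if not test:
    symptoms_encoded = mlb.fit_transform(df['Symptoms'])
else:
    # Assumed completion: reuse the already fitted binarizer on test data
    symptoms_encoded = mlb.transform(df['Symptoms'])
# Drop the original Symptoms column and concatenate the new binary features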
if column in df.columns:
    # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
    df[column] = df[column].fillna(df[column].mode()[0])
else:
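The two preprocessing steps above (filling missing values with the column mode
and turning Symptoms into binary features) can be sketched as follows. This is a
minimal illustration, not the original functions: the helper names
`fill_with_mode` and `encode_symptoms` are introduced here, the toy rows are
hypothetical, and the sketch assumes that Symptoms holds comma-separated text that
must be split into a list before `MultiLabelBinarizer` sees it.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def fill_with_mode(df, columns):
    """Replace missing values in each listed column with its most frequent value."""
    df = df.copy()
    for column in columns:
        if column in df.columns:
            df[column] = df[column].fillna(df[column].mode()[0])
    return df

def encode_symptoms(df, mlb, test=False):
    """Expand comma-separated Symptoms into one binary column per symptom."""
    # Assumes missing Symptoms were filled beforehand
    symptom_lists = df["Symptoms"].apply(lambda s: [p.strip() for p in s.split(",")])
    if not test:
        encoded = mlb.fit_transform(symptom_lists)  # learn the symptom vocabulary
    else:
        encoded = mlb.transform(symptom_lists)      # reuse the learned vocabulary
    binary = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
    # Drop the original Symptoms column and concatenate the new binary features
    return pd.concat([df.drop(columns="Symptoms"), binary], axis=1)

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", None, "Male"],
    "Symptoms": ["Fatigue, Weakness", "Fever", None],
})
mlb = MultiLabelBinarizer()
train = encode_symptoms(fill_with_mode(toy, ["Gender", "Symptoms"]), mlb)
print(train)
```

Passing the same `mlb` with `test=True` keeps the binary columns of test data
aligned with the columns learned during training.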
Encoding features
# ============= Encoding of Features ==================
if df[column].dtype == 'object':
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le
return label_encoders
missing_values = df.isnull().sum()
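The label-encoding step can be sketched as a complete function. This is an
assumption-based reconstruction around the fragment above, not the original code:
the loop over all columns and the numeric-dtype check are choices made here (the
check avoids depending on whether pandas reports text columns as 'object'), and
the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode(df):
    """Integer-encode every non-numeric column in place; return encoders by column."""
    label_encoders = {}
    for column in df.columns:
        if not pd.api.types.is_numeric_dtype(df[column]):
            le = LabelEncoder()
            df[column] = le.fit_transform(df[column])
            label_encoders[column] = le  # kept so predictions can be decoded later
    return label_encoders

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [30, 41, 25],
    "Medicine": ["Antacids", "Rest", "Antacids"],
})
encoders = label_encode(toy)
print(toy)
print(encoders["Medicine"].inverse_transform([0, 1]))
```

Keeping the fitted encoders matters because the model predicts integer codes, and
only the encoder for Medicine can map them back to names.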
data = pd.DataFrame({
})
if column in LEs:
    encoder = LEs[column]
    data[column] = encoder.transform(data[column])
# Make predictions
predictions = MODEL.predict(data)
print(LEs['Medicine'].inverse_transform(predictions))
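The prediction step above can be illustrated end to end on a toy example. This is
a sketch under assumptions: the training rows, the column set and the helper name
`predict_medicine` are hypothetical, and `LEs` and `MODEL` are built inline here
instead of being loaded from saved files.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical training rows, for illustration only
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Disease": ["Migraine", "Migraine", "Gastroenteritis", "Gastroenteritis"],
    "Medicine": ["Pain reliever", "Pain reliever", "Rest, Lifestyle", "Rest, Lifestyle"],
})

# Fit one encoder per column and keep it for prediction time
LEs = {}
for column in train.columns:
    LEs[column] = LabelEncoder()
    train[column] = LEs[column].fit_transform(train[column])

MODEL = RandomForestClassifier(random_state=42)
MODEL.fit(train.drop("Medicine", axis=1), train["Medicine"])

def predict_medicine(data):
    """Encode new patient rows with the stored encoders and return medicine names."""
    data = data.copy()
    for column in data.columns:
        if column in LEs:
            data[column] = LEs[column].transform(data[column])
    # Make predictions and map the integer codes back to medicine names
    predictions = MODEL.predict(data)
    return LEs["Medicine"].inverse_transform(predictions)

new_patient = pd.DataFrame({"Gender": ["Female"], "Disease": ["Migraine"]})
result = predict_medicine(new_patient)
print(result)
```

`LabelEncoder.transform` raises a `ValueError` for a label it did not see during
fitting, so new patient data can only use values that appeared in training.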
Main Function
This function asks for the training dataset CSV and then trains the model on it.
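def main():
    df = pd.read_csv(list(uploaded.keys())[0])
    CATEGORICAL_COLUMNS_TEST = CATEGORICAL_COLUMNS_TRAIN[:-1]
    mlb = MultiLabelBinarizer()
    label_encoders = label_encode(df)
    X = df.drop('Medicine', axis=1)
    y = df['Medicine']
    # Train-test split (the original call is missing; default split size assumed)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    rf_classifier = RandomForestClassifier(random_state=42)
    rf_classifier.fit(X_train, y_train)
    predictions = rf_classifier.predict(X_test)
    # Accuracy on the held-out rows (assumed metric: accuracy_score)
    accuracy = accuracy_score(y_test, predictions)
    print(f'Accuracy: {accuracy}')
    print("ALL GOOD")
    # joblib.dump(rf_classifier, 'rf_classifier.joblib')

main()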
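The training flow in the main function (split, fit a random forest, measure
accuracy) can be tried without the uploaded CSV. The sketch below is a minimal,
self-contained version under assumptions: `train_and_evaluate` is a name
introduced here, `test_size=0.25` is an illustrative choice because the original
split parameters are not shown, and the toy rows are hypothetical and already
integer-encoded.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_and_evaluate(df, target="Medicine", test_size=0.25, random_state=42):
    """Split an already encoded frame, fit a random forest, return (model, accuracy)."""
    X = df.drop(target, axis=1)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = RandomForestClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy

# Hypothetical, already integer-encoded rows (not taken from the dataset)
disease = [i % 4 for i in range(20)]
toy = pd.DataFrame({
    "Gender": [(i // 4) % 2 for i in range(20)],
    "Disease": disease,
    "Medicine": disease,  # in this toy data each disease maps to exactly one medicine
})
model, accuracy = train_and_evaluate(toy)
print(f"Accuracy: {accuracy}")
```

Accuracy measured on a few held-out rows of a small dataset is a rough estimate
only. To reuse the model later, it could be saved with
`joblib.dump(model, 'rf_classifier.joblib')`, as the commented line in the main
function suggests.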