Aids

- The document analyzes data from the AIDS Clinical Trials Group Study 175, which tested the effectiveness of antiretroviral treatment regimens for HIV/AIDS.
- The data is read into a pandas DataFrame with 2,139 rows and 25 columns containing patient information such as treatment assignment, demographics, and CD4 cell counts over time.
- The DataFrame is explored with .info() and .describe(), and its features are classified as categorical, discrete, or continuous. Feature values are also printed.

AIDS_Clinical_Trials_Group_Study_175 Analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("aids.csv")

df.head()

   Unnamed: 0  time  trt  age     wtkg  hemo  homo  drugs  karnof  oprior  \
0           0   948    2   48  89.8128     0     0      0     100       0
1           1  1002    3   61  49.4424     0     0      0      90       0
2           2   961    3   45  88.4520     0     1      1      90       0
3           3  1166    3   47  85.2768     0     1      0     100       0
4           4  1090    0   43  66.6792     0     1      0     100       0

   ...  str2  strat  symptom  treat  offtrt  cd40  cd420  cd80  cd820  cid
0  ...     0      1        0      1       0   422    477   566    324    0
1  ...     1      3        0      1       0   162    218   392    564    1
2  ...     1      3        0      1       1   326    274  2063   1893    0
3  ...     1      3        0      1       0   287    394  1590    966    0
4  ...     1      3        0      0       0   504    353   870    782    0

[5 rows x 25 columns]

df.tail()

      Unnamed: 0  time  trt  age      wtkg  hemo  homo  drugs  karnof  oprior  \
2134        2134  1091    3   21   53.2980     1     0      0     100       0
2135        2135   395    0   17  102.9672     1     0      0     100       0
2136        2136  1104    2   53   69.8544     1     1      0      90       0
2137        2137   465    0   14   60.0000     1     0      0     100       0
2138        2138  1045    3   45   77.3000     1     0      0     100       0

      ...  str2  strat  symptom  treat  offtrt  cd40  cd420  cd80  cd820  cid
2134  ...     1      3        0      1       1   152    109   561    720    0
2135  ...     1      3        0      0       1   373    218  1759   1030    0
2136  ...     1      3        0      1       0   419    364  1391   1041    0
2137  ...     0      1        0      0       0   166    169   999   1838    1
2138  ...     0      1        0      1       0   911    930   885    526    0

[5 rows x 25 columns]

df.shape

(2139, 25)

df.columns

Index(['Unnamed: 0', 'time', 'trt', 'age', 'wtkg', 'hemo', 'homo', 'drugs',
       'karnof', 'oprior', 'z30', 'zprior', 'preanti', 'race', 'gender',
       'str2', 'strat', 'symptom', 'treat', 'offtrt', 'cd40', 'cd420', 'cd80',
       'cd820', 'cid'],
      dtype='object')

df.duplicated().sum()

df.isnull().sum()

Unnamed: 0 0
time 0
trt 0
age 0
wtkg 0
hemo 0
homo 0
drugs 0
karnof 0
oprior 0
z30 0
zprior 0
preanti 0
race 0
gender 0
str2 0
strat 0
symptom 0
treat 0
offtrt 0
cd40 0
cd420 0
cd80 0
cd820 0
cid 0
dtype: int64

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2139 entries, 0 to 2138
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 2139 non-null int64
1 time 2139 non-null int64
2 trt 2139 non-null int64
3 age 2139 non-null int64
4 wtkg 2139 non-null float64
5 hemo 2139 non-null int64
6 homo 2139 non-null int64
7 drugs 2139 non-null int64
8 karnof 2139 non-null int64
9 oprior 2139 non-null int64
10 z30 2139 non-null int64
11 zprior 2139 non-null int64
12 preanti 2139 non-null int64
13 race 2139 non-null int64
14 gender 2139 non-null int64
15 str2 2139 non-null int64
16 strat 2139 non-null int64
17 symptom 2139 non-null int64
18 treat 2139 non-null int64
19 offtrt 2139 non-null int64
20 cd40 2139 non-null int64
21 cd420 2139 non-null int64
22 cd80 2139 non-null int64
23 cd820 2139 non-null int64
24 cid 2139 non-null int64
dtypes: float64(1), int64(24)
memory usage: 417.9 KB

df.describe()

        Unnamed: 0         time          trt          age         wtkg  \
count  2139.000000  2139.000000  2139.000000  2139.000000  2139.000000
mean   1069.000000   879.098177     1.520804    35.248247    75.125311
std     617.620434   292.274324     1.127890     8.709026    13.263164
min       0.000000    14.000000     0.000000    12.000000    31.000000
25%     534.500000   727.000000     1.000000    29.000000    66.679200
50%    1069.000000   997.000000     2.000000    34.000000    74.390400
75%    1603.500000  1091.000000     3.000000    40.000000    82.555200
max    2138.000000  1231.000000     3.000000    70.000000   159.939360

              hemo         homo        drugs       karnof       oprior  ...  \
count  2139.000000  2139.000000  2139.000000  2139.000000  2139.000000  ...
mean      0.084151     0.661057     0.131370    95.446470     0.021973  ...
std       0.277680     0.473461     0.337883     5.900985     0.146629  ...
min       0.000000     0.000000     0.000000    70.000000     0.000000  ...
25%       0.000000     0.000000     0.000000    90.000000     0.000000  ...
50%       0.000000     1.000000     0.000000   100.000000     0.000000  ...
75%       0.000000     1.000000     0.000000   100.000000     0.000000  ...
max       1.000000     1.000000     1.000000   100.000000     1.000000  ...

              str2        strat      symptom        treat       offtrt  \
count  2139.000000  2139.000000  2139.000000  2139.000000  2139.000000
mean      0.585788     1.979897     0.172978     0.751286     0.362786
std       0.492701     0.899053     0.378317     0.432369     0.480916
min       0.000000     1.000000     0.000000     0.000000     0.000000
25%       0.000000     1.000000     0.000000     1.000000     0.000000
50%       1.000000     2.000000     0.000000     1.000000     0.000000
75%       1.000000     3.000000     0.000000     1.000000     1.000000
max       1.000000     3.000000     1.000000     1.000000     1.000000

              cd40        cd420         cd80        cd820          cid
count  2139.000000  2139.000000  2139.000000  2139.000000  2139.000000
mean    350.501169   371.307153   986.627396   935.369799     0.243572
std     118.573863   144.634909   480.197750   444.976051     0.429338
min       0.000000    49.000000    40.000000   124.000000     0.000000
25%     263.500000   269.000000   654.000000   631.500000     0.000000
50%     340.000000   353.000000   893.000000   865.000000     0.000000
75%     423.000000   460.000000  1207.000000  1146.500000     0.000000
max    1199.000000  1119.000000  5011.000000  6035.000000     1.000000

[8 rows x 25 columns]

df = df.drop(['Unnamed: 0', 'cid'], axis = 1)

object_columns = df.select_dtypes(include=['object', 'bool']).columns


print("Object type columns:")
print(object_columns)

numerical_columns = df.select_dtypes(include=['int64',
'float64']).columns
print("\nNumerical type columns:")
print(numerical_columns)

Object type columns:
Index([], dtype='object')

Numerical type columns:
Index(['time', 'trt', 'age', 'wtkg', 'hemo', 'homo', 'drugs', 'karnof',
       'oprior', 'z30', 'zprior', 'preanti', 'race', 'gender', 'str2', 'strat',
       'symptom', 'treat', 'offtrt', 'cd40', 'cd420', 'cd80', 'cd820'],
      dtype='object')

def classify_features(df):
    """Bucket columns by dtype and by their number of distinct values."""
    categorical_features = []
    non_categorical_features = []
    discrete_features = []
    continuous_features = []

    for column in df.columns:
        if df[column].dtype in ['object', 'bool']:
            # Text/boolean columns: fewer than 15 distinct values counts as categorical.
            if df[column].nunique() < 15:
                categorical_features.append(column)
            else:
                non_categorical_features.append(column)
        elif df[column].dtype in ['int64', 'float64']:
            # Numeric columns: fewer than 10 distinct values counts as discrete.
            if df[column].nunique() < 10:
                discrete_features.append(column)
            else:
                continuous_features.append(column)

    # Parentheses keep the multi-line return a single valid statement.
    return (categorical_features, non_categorical_features,
            discrete_features, continuous_features)

categorical, non_categorical, discrete, continuous = classify_features(df)
print("Categorical Features:", categorical)
print("Non-Categorical Features:", non_categorical)
print("Discrete Features:", discrete)
print("Continuous Features:", continuous)

Categorical Features: []
Non-Categorical Features: []
Discrete Features: ['trt', 'hemo', 'homo', 'drugs', 'karnof',
'oprior', 'z30', 'zprior', 'race', 'gender', 'str2', 'strat',
'symptom', 'treat', 'offtrt']
Continuous Features: ['time', 'age', 'wtkg', 'preanti', 'cd40',
'cd420', 'cd80', 'cd820']
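
The split above depends only on how many distinct values a numeric column takes. A minimal sketch of the same idea on a toy frame is given here; the helper name `split_by_cardinality`, its `max_levels` parameter, and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def split_by_cardinality(frame, max_levels=10):
    """Split numeric columns into low- and high-cardinality lists (hypothetical helper)."""
    discrete, continuous = [], []
    for name in frame.select_dtypes(include="number").columns:
        if frame[name].nunique() < max_levels:
            discrete.append(name)
        else:
            continuous.append(name)
    return discrete, continuous

# Toy data: 'trt' has 4 distinct codes, the other two columns have 20 each.
toy = pd.DataFrame({
    "trt": [0, 1, 2, 3] * 5,
    "age": list(range(20, 40)),
    "wtkg": [60.0 + 0.5 * k for k in range(20)],
})

toy_discrete, toy_continuous = split_by_cardinality(toy)
print("Discrete:", toy_discrete)
print("Continuous:", toy_continuous)
```

Note that a low distinct-value count does not distinguish true counts from integer-coded categories such as `trt`, so the "discrete" bucket may still need manual review.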

for i in discrete:
    print(i, ':')
    print(df[i].unique())
    print()

trt :
[2 3 0 1]

hemo :
[0 1]

homo :
[0 1]

drugs :
[0 1]

karnof :
[100 90 80 70]

oprior :
[0 1]

z30 :
[0 1]

zprior :
[1]

race :
[0 1]

gender :
[0 1]

str2 :
[0 1]

strat :
[1 3 2]

symptom :
[0 1]

treat :
[1 0]

offtrt :
[0 1]

for i in discrete:
    print(i, ':')
    print(df[i].value_counts())
    print()

trt :
3 561
0 532
2 524
1 522
Name: trt, dtype: int64

hemo :
0 1959
1 180
Name: hemo, dtype: int64

homo :
1 1414
0 725
Name: homo, dtype: int64

drugs :
0 1858
1 281
Name: drugs, dtype: int64

karnof :
100 1263
90 787
80 80
70 9
Name: karnof, dtype: int64

oprior :
0 2092
1 47
Name: oprior, dtype: int64

z30 :
1 1177
0 962
Name: z30, dtype: int64

zprior :
1 2139
Name: zprior, dtype: int64

race :
0 1522
1 617
Name: race, dtype: int64

gender :
1 1771
0 368
Name: gender, dtype: int64

str2 :
1 1253
0 886
Name: str2, dtype: int64

strat :
1 886
3 843
2 410
Name: strat, dtype: int64

symptom :
0 1769
1 370
Name: symptom, dtype: int64

treat :
1 1607
0 532
Name: treat, dtype: int64

offtrt :
0 1363
1 776
Name: offtrt, dtype: int64
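
Raw counts are easier to compare as shares. The sketch below converts the `trt` counts printed above into percentages; the Series is typed in by hand from that output, and the variable names are illustrative. On the real frame, `df['trt'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output for 'trt' above.
trt_counts = pd.Series({3: 561, 0: 532, 2: 524, 1: 522}, name="trt")

# Divide by the total to turn counts into shares of the 2139 patients.
trt_shares = trt_counts / trt_counts.sum()

print((100 * trt_shares).round(1))
```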

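# Count plot for each discrete feature.
for i in discrete:
    plt.figure(figsize=(15,6))
    # Pass the column name as x=; a positional Series together with data= fails in recent seaborn.
    sns.countplot(x = i, data = df, palette='hls')
    plt.show()

# Pie chart of the value shares of each discrete feature.
for i in discrete:
    plt.figure(figsize=(20,10))
    plt.pie(df[i].value_counts(), labels=df[i].value_counts().index,
            autopct='%1.1f%%', textprops={'fontsize': 15,
                                          'color': 'black',
                                          'weight': 'bold',
                                          'family': 'serif'})
    hfont = {'fontname':'serif', 'weight': 'bold'}
    plt.title(i, size=20, **hfont)
    plt.show()

# Histogram with a KDE overlay (palette has no effect without hue, so it is omitted).
for i in continuous:
    plt.figure(figsize=(15,6))
    sns.histplot(df[i], bins = 20, kde = True)
    plt.xticks(rotation = 90)
    plt.show()

# Density-scaled histogram; sns.distplot is deprecated and histplot with stat='density' replaces it.
for i in continuous:
    plt.figure(figsize=(15,6))
    sns.histplot(df[i], bins = 20, kde = True, stat = 'density')
    plt.xticks(rotation = 90)
    plt.show()

# Box plot of each continuous feature.
for i in continuous:
    plt.figure(figsize=(15,6))
    sns.boxplot(x = i, data = df, palette='hls')
    plt.xticks(rotation = 90)
    plt.show()

# Violin plot of each continuous feature.
for i in continuous:
    plt.figure(figsize=(15,6))
    sns.violinplot(x = i, data = df, palette='hls')
    plt.xticks(rotation = 90)
    plt.show()

# Scatter plot for every ordered pair of continuous features.
# ci and palette are dropped: scatterplot draws no confidence interval, and palette needs hue.
for i in continuous:
    for j in continuous:
        if i != j:
            plt.figure(figsize=(15,6))
            sns.scatterplot(x = i, y = j, data = df)
            plt.xticks(rotation = 90)
            plt.show()

# Line plot for every ordered pair of continuous features, without error bands.
for i in continuous:
    for j in continuous:
        if i != j:
            plt.figure(figsize=(15,6))
            sns.lineplot(x = i, y = j, data = df, ci = None)
            plt.xticks(rotation = 90)
            plt.show()
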
correlation_matrix = df.corr()

correlation_matrix

              time        trt        age       wtkg       hemo       homo      drugs  \
time      1.000000   0.101482   0.026544   0.009225  -0.017501   0.043430  -0.021856
trt       0.101482   1.000000  -0.001931  -0.031685   0.012329   0.025035   0.005712
age       0.026544  -0.001931   1.000000   0.132858  -0.231257   0.158917   0.077446
wtkg      0.009225  -0.031685   0.132858   1.000000  -0.075791   0.155909   0.002343
hemo     -0.017501   0.012329  -0.231257  -0.075791   1.000000  -0.391307  -0.092957
homo      0.043430   0.025035   0.158917   0.155909  -0.391307   1.000000  -0.206876
drugs    -0.021856   0.005712   0.077446   0.002343  -0.092957  -0.206876   1.000000
karnof    0.094417  -0.014573  -0.100041   0.034271   0.068403  -0.042072  -0.084558
oprior   -0.016116  -0.026805   0.056161   0.009607   0.034978   0.019743  -0.029968
z30       0.012898  -0.001656   0.061178  -0.073841   0.111554  -0.049760   0.014961
zprior         NaN        NaN        NaN        NaN        NaN        NaN        NaN
preanti   0.007249   0.006710   0.113220  -0.079292   0.113892   0.014132  -0.029981
race     -0.051276   0.017080  -0.097678  -0.081452  -0.070333  -0.307108   0.082311
gender    0.020810   0.022691   0.048705   0.240013   0.115867   0.607820  -0.141748
str2      0.010098  -0.003003   0.068230  -0.078885   0.124983  -0.036700   0.001106
strat     0.022033  -0.003508   0.089884  -0.080458   0.141674  -0.022608  -0.011319
symptom  -0.104611  -0.000765   0.032814   0.003942  -0.076296   0.118575   0.027052
treat     0.153314   0.775990   0.001499  -0.040638   0.010786   0.024407   0.022055
offtrt   -0.475795  -0.043239  -0.057695  -0.003159   0.005949  -0.045151   0.098031
cd40      0.191436  -0.012770  -0.040302   0.036401  -0.022533   0.000511  -0.003360
cd420     0.350611   0.064448  -0.044294   0.020980  -0.065838   0.019915   0.013109
cd80     -0.017425  -0.015665   0.046874   0.090075  -0.037273   0.086028   0.014900
cd820     0.032480  -0.004595   0.037458   0.085447  -0.058392   0.082284   0.025728

            karnof     oprior        z30        ...     gender       str2      strat  \
time      0.094417  -0.016116   0.012898        ...   0.020810   0.010098   0.022033
trt      -0.014573  -0.026805  -0.001656        ...   0.022691  -0.003003  -0.003508
age      -0.100041   0.056161   0.061178        ...   0.048705   0.068230   0.089884
wtkg      0.034271   0.009607  -0.073841        ...   0.240013  -0.078885  -0.080458
hemo      0.068403   0.034978   0.111554        ...   0.115867   0.124983   0.141674
homo     -0.042072   0.019743  -0.049760        ...   0.607820  -0.036700  -0.022608
drugs    -0.084558  -0.029968   0.014961        ...  -0.141748   0.001106  -0.011319
karnof    1.000000  -0.057291  -0.074947        ...  -0.011695  -0.085975  -0.055172
oprior   -0.057291   1.000000  -0.037580        ...   0.042976   0.126040   0.134629
z30      -0.074947  -0.037580   1.000000        ...  -0.036119   0.903417   0.848624
zprior         NaN        NaN        NaN        ...        NaN        NaN        NaN
preanti  -0.023189   0.067082   0.655054        ...   0.032099   0.680354   0.833213
race      0.026155  -0.003923  -0.073658        ...  -0.292146  -0.080510  -0.106307
gender   -0.011695   0.042976  -0.036119        ...   1.000000  -0.031258   0.003586
str2     -0.085975   0.126040   0.903417        ...  -0.031258   1.000000   0.916723
strat    -0.055172   0.134629   0.848624        ...   0.003586   0.916723   1.000000
symptom  -0.107940   0.024199   0.020883        ...   0.064373   0.030760   0.041857
treat     0.001379  -0.031801   0.003776        ...   0.024280   0.005794  -0.000836
offtrt   -0.103251   0.019561  -0.029318        ...  -0.019309  -0.026789  -0.051276
cd40      0.077730  -0.059199  -0.121282        ...  -0.030423  -0.124566  -0.121317
cd420     0.098463  -0.109643  -0.200149        ...  -0.023369  -0.216457  -0.206306
cd80     -0.008567  -0.019247   0.029346        ...   0.087233   0.009576   0.032360
cd820    -0.003981  -0.036577   0.018454        ...   0.087572   0.012055   0.021257

           symptom      treat     offtrt       cd40      cd420       cd80      cd820
time     -0.104611   0.153314  -0.475795   0.191436   0.350611  -0.017425   0.032480
trt      -0.000765   0.775990  -0.043239  -0.012770   0.064448  -0.015665  -0.004595
age       0.032814   0.001499  -0.057695  -0.040302  -0.044294   0.046874   0.037458
wtkg      0.003942  -0.040638  -0.003159   0.036401   0.020980   0.090075   0.085447
hemo     -0.076296   0.010786   0.005949  -0.022533  -0.065838  -0.037273  -0.058392
homo      0.118575   0.024407  -0.045151   0.000511   0.019915   0.086028   0.082284
drugs     0.027052   0.022055   0.098031  -0.003360   0.013109   0.014900   0.025728
karnof   -0.107940   0.001379  -0.103251   0.077730   0.098463  -0.008567  -0.003981
oprior    0.024199  -0.031801   0.019561  -0.059199  -0.109643  -0.019247  -0.036577
z30       0.020883   0.003776  -0.029318  -0.121282  -0.200149   0.029346   0.018454
zprior         NaN        NaN        NaN        NaN        NaN        NaN        NaN
preanti   0.012304   0.005682  -0.042379  -0.067495  -0.132213   0.037500   0.023221
race     -0.078378  -0.006071   0.004638  -0.001290  -0.035935   0.006930   0.009981
gender    0.064373   0.024280  -0.019309  -0.030423  -0.023369   0.087233   0.087572
str2      0.030760   0.005794  -0.026789  -0.124566  -0.216457   0.009576   0.012055
strat     0.041857  -0.000836  -0.051276  -0.121317  -0.206306   0.032360   0.021257
symptom   1.000000   0.008648   0.071388  -0.131006  -0.124883   0.035311   0.049254
treat     0.008648   1.000000  -0.051731  -0.013123   0.139934  -0.000746   0.009255
offtrt    0.071388  -0.051731   1.000000  -0.145311  -0.196474  -0.033651  -0.024180
cd40     -0.131006  -0.013123  -0.145311   1.000000   0.583578   0.214274   0.073039
cd420    -0.124883   0.139934  -0.196474   0.583578   1.000000   0.054165   0.216472
cd80      0.035311  -0.000746  -0.033651   0.214274   0.054165   1.000000   0.756218
cd820     0.049254   0.009255  -0.024180   0.073039   0.216472   0.756218   1.000000

[23 rows x 23 columns]
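
The all-NaN `zprior` row follows from `zprior` taking a single value (1) for every patient: Pearson correlation divides by the standard deviation, which is zero for a constant column. The sketch below reproduces the effect on a toy frame and shows one possible way to exclude such columns before correlating; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "zprior": [1, 1, 1, 1],   # constant, like zprior in the notebook
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],        # exactly 2 * a
})

# The constant column yields NaN against every column, itself included.
corr_with_constant = toy.corr()

# Keep only columns with more than one distinct value, then correlate.
varying = toy.loc[:, toy.nunique() > 1]
corr_cleaned = varying.corr()

print(corr_with_constant)
print(corr_cleaned)
```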

plt.figure(figsize=(20,10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
threshold = 0.75
correlation_pairs = set()

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname_i = correlation_matrix.columns[i]
            colname_j = correlation_matrix.columns[j]
            correlation_pairs.add((colname_i, colname_j))

for pair in correlation_pairs:
    feature1, feature2 = pair
    if feature1 in df.columns and feature2 in df.columns:
        if df[feature1].var() > df[feature2].var():
            df.drop(feature2, axis=1, inplace=True)
        else:
            df.drop(feature1, axis=1, inplace=True)

correlation_matrix_after_drop = df.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix_after_drop, annot=True,
            cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation Matrix After Dropping Highly Correlated Features')
plt.show()
print("Remaining Features:")
print(df.columns)

Remaining Features:
Index(['time', 'trt', 'age', 'wtkg', 'hemo', 'homo', 'drugs', 'karnof',
       'oprior', 'zprior', 'preanti', 'race', 'gender', 'symptom', 'offtrt',
       'cd40', 'cd420', 'cd80'],
      dtype='object')
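
Because the pairs are stored in a set, the order in which they are processed is not fixed, and with chained correlations (for example among `z30`, `str2`, and `strat`) the order can influence which column of a pair is still present when it is examined. A hedged sketch of an order-stable variant follows; `drop_correlated` and the toy columns are hypothetical and are not the notebook's code, though the rule (drop the lower-variance member of each pair above the threshold) is the same.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.75):
    """Return a copy without the lower-variance member of each highly correlated pair."""
    corr = frame.corr().abs()
    variances = frame.var()
    kept = list(frame.columns)
    # Walk pairs in column order so the outcome does not depend on set ordering.
    for i, first in enumerate(frame.columns):
        for second in frame.columns[i + 1:]:
            if first in kept and second in kept and corr.loc[first, second] > threshold:
                loser = second if variances[first] > variances[second] else first
                kept.remove(loser)
    return frame[kept]

# Toy data: 'cd820' closely tracks 'cd80' with smaller variance; 'other' is unrelated.
x = np.arange(20.0)
toy = pd.DataFrame({
    "cd80": 10 * x,
    "cd820": 9 * x + np.tile([1.0, -1.0], 10),
    "other": np.tile([1.0, -1.0, -1.0, 1.0], 5),
})

pruned = drop_correlated(toy)
print(list(pruned.columns))
```

Returning a new frame instead of calling `drop(..., inplace=True)` leaves the original data available for comparison.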

features = ['age', 'hemo', 'cd40']


target = 'time'

fig = plt.figure(figsize=(10, 8))


ax = fig.add_subplot(111, projection='3d')
ax.scatter(df[features[0]], df[features[1]], df[features[2]],
c=df[target], cmap='viridis', marker='o')
ax.set_xlabel(features[0])
ax.set_ylabel(features[1])
ax.set_zlabel(features[2])
ax.set_title(f'3D Scatter Plot: {features[0]}, {features[1]}, {features[2]} vs. {target}')

plt.show()
scatter_features = ['age', 'cd40', 'wtkg']

sns.pairplot(df, x_vars=scatter_features, y_vars='time', height=4)


plt.suptitle('Scatter Plots of Features vs. Time', y=1.02)
plt.show()

surface_features = ['age', 'hemo', 'cd40']

fig = plt.figure(figsize=(10, 8))


ax = fig.add_subplot(111, projection='3d')
ax.plot_trisurf(df[surface_features[0]], df[surface_features[1]],
df[surface_features[2]], cmap='viridis', linewidth=0.2)
ax.set_xlabel(surface_features[0])
ax.set_ylabel(surface_features[1])
ax.set_zlabel(surface_features[2])
# time is not plotted on this surface, so the title names only the three axes.
ax.set_title(f'3D Surface Plot: {surface_features[0]}, {surface_features[1]}, {surface_features[2]}')

plt.show()
Thanks !!!
