Data Mining

The document describes performing multi-class classification on emotion recognition data using logistic regression models. The main steps are: 1. The CSV data is loaded and preprocessed, including removing rows with 'unclassified' emotion and one-hot encoding the emotion variable. 2. Logistic regression models are created for each emotion class. 3. The data is split into train and test sets. Models are trained on the train set. 4. Predictions are made on the test set using a prediction function that calculates probabilities from each model and returns the class with highest probability.

Uploaded by

21800768

Week 13

Task 0
JSON file preprocessing
import glob
import json
import os
import statistics

import pandas as pd


def mean_and_std(values):
    """Return (mean, sample std) of a list; the std is 0 for a single value."""
    avg = sum(values) / len(values)
    std = statistics.stdev(values) if len(values) > 1 else 0
    return avg, std


# Search every JSON file under the Data folder, including subfolders
base_path = './Data'
json_files = glob.glob(os.path.join(base_path, '**/*.json'), recursive=True)
all_combined_df = pd.DataFrame()

for file_path in json_files:
    print(file_path)
    with open(file_path, 'r', encoding='UTF-8') as file:
        json_data = json.load(file)

    # Scene data: one row per (image, occupant) pair
    rows = []
    for scene in json_data['scene']['data']:
        img_name = scene['img_name']
        for occupant in scene['occupant']:
            rows.append({
                'img_name': img_name,
                'occupant_id': occupant.get('occupant_id', None),
                'action': occupant.get('action', None),
                'emotion': occupant.get('emotion', None),
            })
    combined_df = pd.DataFrame(rows)

    # Occupant data
    occupant_df = pd.DataFrame(json_data['occupant_info'])

    # Sensor data: each sensor is recorded as a time series with many readings,
    # so every series is summarized by its mean and standard deviation
    sensor_rows = []
    for sensor_data in json_data['scene']['sensor']:
        avg_ecg, std_ecg = mean_and_std(sensor_data['ECG'])
        avg_ppg, std_ppg = mean_and_std(sensor_data['PPG'])
        avg_spo2, std_spo2 = mean_and_std(sensor_data['SPO2'])
        # EEG is a list of lists: average each inner list first
        eeg_means = [sum(eeg_list) / len(eeg_list) for eeg_list in sensor_data['EEG']]
        avg_eeg, std_eeg = mean_and_std(eeg_means)
        sensor_rows.append({
            'occupant_id': sensor_data['occupant_id'],
            'ECG_avg': avg_ecg, 'ECG_std': std_ecg,
            'PPG_avg': avg_ppg, 'PPG_std': std_ppg,
            'SPO2_avg': avg_spo2, 'SPO2_std': std_spo2,
            'EEG_avg': avg_eeg, 'EEG_std': std_eeg,
        })
    sensor_df = pd.DataFrame(sensor_rows)

    # Combine the scene, sensor and occupant frames
    combined_df = pd.merge(combined_df, sensor_df, on='occupant_id')
    combined_df = pd.merge(combined_df, occupant_df, on='occupant_id')

    # Drop duplicate rows, comparing every column except 'img_name' and 'action'.
    # Reason: each video is split into many images, so the same occupant
    # appears in several otherwise identical rows within one scene.
    columns_to_check = [col for col in combined_df.columns
                        if col not in ['img_name', 'action']]
    combined_df = combined_df.drop_duplicates(subset=columns_to_check).reset_index(drop=True)

    all_combined_df = pd.concat([all_combined_df, combined_df], ignore_index=True)
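The code above reads the JSON files in a given folder and processes their contents. The steps are as follows.

1. Finding and reading the files

- glob is used to find every .json file under the 'Data' folder. The glob.glob function returns the list of file paths that match the given pattern.

- Each JSON file is opened (open(file_path, 'r', encoding='UTF-8')) and json.load converts the JSON data into Python objects.

2. Processing the JSON data

- The loop visits one file at a time and processes the data according to the given JSON structure.

- From the scene data, the 'img_name' and 'occupant' information of each scene is extracted and turned into rows of the DataFrame combined_df.

- occupant_info is used to create the DataFrame occupant_df, which holds the occupant information.

- In the sensor part of scene, the mean and standard deviation of each sensor series are computed and added to sensor_df.

3. Merging and removing duplicates

- pd.merge joins combined_df, sensor_df and occupant_df on 'occupant_id', so that the action, emotion, sensor data and personal information of each occupant end up in a single row.

- Duplicate rows are removed. Rows are compared on every column except 'img_name' and 'action', which reduces the repeated information about the same occupant.

4. Final concatenation

- The combined_df built from each JSON file is appended to all_combined_df, so that the data from all files ends up in one large DataFrame.

In this way the data scattered over many JSON files is combined into a single DataFrame with duplicates removed, which is later used for prediction with multi-class logistic regression. The merge and de-duplication steps matter because they keep the data consistent: one row per distinct occupant state rather than one row per video frame.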
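The de-duplication rule described above (compare rows on every column except a few ignored ones) can be sketched without pandas. This is a minimal illustration, not the report's code: the helper name `drop_duplicates_ignoring` and the toy rows are made up for the example.

```python
def drop_duplicates_ignoring(rows, ignored):
    """Keep the first row for each distinct combination of the non-ignored fields."""
    seen = set()
    kept = []
    for row in rows:
        # Build a hashable key from every field except the ignored ones.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignored))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Toy rows (hypothetical): three frames of the same occupant.
rows = [
    {"img_name": "f001.jpg", "occupant_id": "A", "action": "driving", "emotion": "neutral"},
    {"img_name": "f002.jpg", "occupant_id": "A", "action": "drinking", "emotion": "neutral"},
    {"img_name": "f003.jpg", "occupant_id": "A", "action": "driving", "emotion": "happy"},
]

unique_rows = drop_duplicates_ignoring(rows, ignored={"img_name", "action"})
print([r["img_name"] for r in unique_rows])  # ['f001.jpg', 'f003.jpg']
```

The second frame is dropped even though its action differs, because 'action' is excluded from the comparison. That is the same trade-off the pandas `drop_duplicates(subset=...)` call makes.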
import numpy as np
import pandas as pd

# all_df is assumed to be the combined frame built in the previous step
# (the original snippet does not show this assignment)
all_df = all_combined_df.copy()

# Remove rows that contain NA values
all_df.dropna(inplace=True)


# Convert occupant_age to numeric data
def age_group_to_numeric(age_group):
    if age_group == '20대':
        return 20
    elif age_group == '30대':
        return 30
    elif age_group == '40대':
        return 40
    elif age_group == '60대_이상':
        return 60
    else:
        # '기타' (other) becomes NA; there are very few such rows, so they are removed
        return np.nan


# Apply the mapping function to every value of the 'occupant_age' column
all_df['occupant_age'] = all_df['occupant_age'].apply(age_group_to_numeric)

# Drop the few rows whose age group could not be mapped, as the note above intends
all_df = all_df.dropna(subset=['occupant_age'])

# Fix the typo in the column name (the key is copied verbatim from the source)
all_df = all_df.rename(columns={'occupant_posoition ': 'occupant_position'})

# One-hot encode the 'occupant_sex' and 'occupant_position' columns
sex_dummies = pd.get_dummies(all_df['occupant_sex'], prefix='sex')
position_dummies = pd.get_dummies(all_df['occupant_position'], prefix='position')

# Attach the one-hot encoded columns to the original DataFrame
all_df = pd.concat([all_df, sex_dummies, position_dummies], axis=1)

# SPO2_std and EEG_std each have a single unique value, so they carry
# no information and are dropped
all_df = all_df.drop(columns=['SPO2_std', 'EEG_std'])

# Normalize (z-score) the sensor and age columns
columns_to_normalize = ['ECG_avg', 'ECG_std', 'PPG_avg', 'PPG_std',
                        'SPO2_avg', 'EEG_avg', 'occupant_age']
all_df[columns_to_normalize] = (
    all_df[columns_to_normalize] - all_df[columns_to_normalize].mean()
) / all_df[columns_to_normalize].std()

all_df.to_csv('all_data_norm.csv', index=False, encoding='CP949')
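In addition, a preprocessing pass converts the data extracted from the JSON files into a form suitable for analysis. The main steps are:

1. Removing missing values

- dropna() removes every row of the all_df DataFrame that contains a missing value (NA). This is a common step that keeps incomplete records out of the later analysis.

2. Converting the age data

- The age_group_to_numeric function converts the string values of the 'occupant_age' column to numbers. For example, '20대' (twenties) becomes 20 and '30대' (thirties) becomes 30.

- Unclassified age groups such as '기타' (other) are set to np.nan so that they can be removed later.

3. Cleaning the data

- rename is used to fix the misspelled name of the 'occupant_position' column.

4. One-hot encoding

- The sex ('occupant_sex') and position ('occupant_position') columns are one-hot encoded, which turns categorical data into a form the model can handle.

- pd.get_dummies creates a new column for each category, and these columns are attached to the original DataFrame.

5. Removing unnecessary columns

- The 'SPO2_std' and 'EEG_std' columns are removed. Every value in these columns is identical, so they show no variation and are judged not useful for the analysis.

6. Normalizing the data

- The sensor and age columns are normalized. Each column is rescaled to mean 0 and standard deviation 1, which puts the variables on a comparable scale.

7. Saving the data

- The fully processed DataFrame is saved to 'all_data_norm.csv'. The 'CP949' encoding is used to stay compatible with Korean text.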
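The normalization in step 6 can be illustrated with a short, dependency-free sketch. It assumes the same convention as the pandas expression in the code (sample standard deviation, n - 1 in the denominator); the function name `standardize` and the sample values are made up for the example.

```python
import statistics


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1), the pandas .std() default
    if std == 0:
        # A constant column cannot be rescaled; dividing by zero is undefined.
        raise ValueError("constant column: standard deviation is zero")
    return [(x - mean) / std for x in values]


ages = [20, 30, 40, 60, 30]  # made-up values
z = standardize(ages)
print([round(v, 3) for v in z])
```

The zero-variance check also shows why dropping constant columns such as 'SPO2_std' and 'EEG_std' before normalizing is sensible: they would otherwise produce undefined values.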

The final data set is shown below.

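Task 1
The week 13 exercise uses the "운전자 및 탑승자 상태 및 이상행동 모니터링" (driver and passenger state and abnormal behavior monitoring) data provided by aihub.or.kr to perform multi-class classification that recognizes the driver's emotional state.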
# Load necessary libraries
library(readr)
library(dplyr)
library(tidyr)

################################################################################
### Task 1.
################################################################################

# Read the CSV file
data <- read.csv("csv_Data/all_data_norm.csv")

# Remove rows labelled '분류 없음' (unclassified)
data <- data %>% filter(emotion != '분류 없음')

head(data)

# Convert the label to a factor
data$emotion <- as.factor(data$emotion)

# Handle other preprocessing tasks (e.g., normalization, missing value imputation)

# label
labelDF <- data.frame(y = data$emotion, val = 1, i = 1:nrow(data))

# One-hot encoding for the emotion variable
labelDF.new <- (labelDF %>%
  spread(key = y, value = val, fill = 0))[, -1]

labelDF.new %>% colSums()

# Name the columns
emotion_levels <- levels(data$emotion)
names(labelDF.new) <- emotion_levels

# Remove unused columns
dataX <- data[, -which(names(data) %in% c("img_name", "emotion", "action",
                                          "occupant_id", "occupant_sex",
                                          "occupant_position"))]

# Build a list of glm models, one per class (5 classes)
modelList <- lapply(emotion_levels, function(x) {
  glm(data = cbind(labelDF.new %>% select(all_of(x)), dataX),
      formula = paste0(x, ' ~ .'), family = binomial(link = 'logit'))
})

# Prediction function
predict_emotion <- function(inputMat) {
  nData <- data.frame(inputMat)
  result <- sapply(modelList, function(x) {
    predict(x, newdata = nData, type = 'response')
  })
  # Return the class with the highest probability
  return(apply(result, 1, function(x) { emotion_levels[which.max(x)] }))
}

# Split data into training and testing sets
# Note: modelList above was fit on all rows of dataX, so the test rows
# below were also seen during fitting
set.seed(1234) # for reproducibility
train_indices <- sample(1:nrow(data), size = 0.8 * nrow(data))
train_data <- dataX[train_indices, ]
test_data <- dataX[-train_indices, ]

train_emotions <- data$emotion[train_indices]
test_emotions <- data$emotion[-train_indices]

# Make predictions on the training and test sets
train_pred_emotions <- predict_emotion(train_data)
test_pred_emotions <- predict_emotion(test_data)

# Calculate accuracy
train_accuracy <- mean(train_pred_emotions == train_emotions)
test_accuracy <- mean(test_pred_emotions == test_emotions)

# Compare
print(train_accuracy)
print(test_accuracy)
# under fitting
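The steps above implement multi-class logistic regression by hand, by creating one binary logistic regression model per class.

1. Loading libraries and reading the data

- The readr, dplyr and tidyr libraries are loaded. These are standard R packages for data processing and manipulation.

- read.csv reads the 'all_data_norm.csv' file and stores it in the variable data.

2. Preprocessing

- filter removes the rows whose 'emotion' value is '분류 없음' (unclassified).

- as.factor converts the 'emotion' column to a factor (categorical variable).

3. One-hot encoding and label preparation

- The emotion variable is one-hot encoded, so that each emotion becomes a separate column that a binary model can use as its target.

- spread creates a column for each emotion, and the unneeded index column is removed.

4. Cleaning the data

- The unnecessary columns (img_name, emotion, action, occupant_id, occupant_sex, occupant_position) are removed. Columns whose value is unique in every row, and columns that have already been turned into dummy variables, must not be passed to the model.

5. Building the list of logistic regression models

- A separate logistic regression model is created for each emotion category. lapply stores the model for each emotion level in a list.

6. Defining the prediction function

- The predict_emotion function runs every model on the input data and returns the emotion with the highest probability.

7. Splitting the data and evaluating the model

- The data is split into a training set and a test set.

- Emotions are predicted for the training data and the test data, and the accuracy is calculated.

Comparing model performance

- The accuracy on the training set and the test set is printed. The training accuracy is 0.445 and the test accuracy is 0.457, so the model can be seen as somewhat underfitting the training data. A direct look at the predictions showed that most of them were '중립' (neutral).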
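The prediction rule used above (score each class with its own binary model, then take the class with the highest probability) can be sketched in a few lines of Python. The intercepts, weights and class names below are hypothetical values chosen for illustration, not coefficients from the report's fitted models.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical one-vs-rest models: label -> (intercept, weights per feature).
models = {
    "neutral": (0.5, [0.1, -0.2]),
    "happy": (-1.0, [1.5, 0.0]),
    "angry": (-1.0, [0.0, 1.5]),
}


def predict_one_vs_rest(x):
    """Score x with every binary model and return (best label, all scores)."""
    scores = {
        label: sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
        for label, (b, ws) in models.items()
    }
    return max(scores, key=scores.get), scores


label, scores = predict_one_vs_rest([2.0, 0.0])
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Because each binary model is fit independently, the per-class probabilities need not sum to 1; only their ranking is used. A class with a large intercept (for example a very frequent class) can therefore win for most inputs, which is one possible reading of the many 'neutral' predictions noted above.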

Task 2
Carry out feature engineering and try to optimize the model by adding new variables or removing unnecessary ones. If overfitting appears while adding variables, apply regularization to reduce it and check how performance changes.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the data
data = pd.read_csv('all_data.csv', encoding='cp949')

# Remove rows labelled '분류없음' (unclassified); reset the index so that the
# column-wise concat below lines up row by row
data = data[data['emotion'] != '분류없음'].reset_index(drop=True)

# One-hot encode the target
emotion_one_hot = pd.get_dummies(data['emotion'], dtype=int)
data = pd.concat([data, emotion_one_hot], axis=1)

# Create interaction and polynomial features
polynomial_features = PolynomialFeatures(degree=2, include_bias=False,
                                         interaction_only=False)
sensor_data = data[['ECG_avg', 'ECG_std', 'PPG_avg', 'PPG_std', 'SPO2_avg', 'EEG_avg']]
sensor_poly = polynomial_features.fit_transform(sensor_data)
sensor_poly_df = pd.DataFrame(
    sensor_poly,
    columns=polynomial_features.get_feature_names_out(sensor_data.columns),
    index=data.index)
# The expanded frame already contains the degree-1 sensor columns, so the
# originals are dropped first to avoid duplicate column names
data = pd.concat([data.drop(columns=sensor_data.columns), sensor_poly_df], axis=1)

# Convert categorical variables (can be skipped if already one-hot encoded)
# data = pd.get_dummies(data, columns=['occupant_sex', 'occupant_position'])

# Select the features
features = list(sensor_poly_df.columns) + ['occupant_age', 'position_back',
                                           'position_front']
X = data[features]

# Select the target
y = data[emotion_one_hot.columns]

# Split the data with a fixed random seed
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=123)

# Standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build one binary model per emotion class
models = {}
emotions = y.columns
for emotion in emotions:
    model = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=123)
    model.fit(X_train_scaled, y_train[emotion])
    models[emotion] = model


# Multi-class prediction: pick the class whose model gives the highest probability
def predict_multiclass(X_scaled):
    probabilities = np.array([model.predict_proba(X_scaled)[:, 1]
                              for emotion, model in models.items()]).T
    predicted_class_indices = np.argmax(probabilities, axis=1)
    predicted_classes = [emotions[idx] for idx in predicted_class_indices]
    return predicted_classes


# Predict on the training and test data
y_train_pred = predict_multiclass(X_train_scaled)
y_test_pred = predict_multiclass(X_test_scaled)

# Map the one-hot targets back to the original 'emotion' labels for comparison
y_train_original = y_train.idxmax(axis=1)
y_test_original = y_test.idxmax(axis=1)

# Evaluate
train_accuracy = accuracy_score(y_train_original, y_train_pred)
test_accuracy = accuracy_score(y_test_original, y_test_pred)

print("Train Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)
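The code above applies feature engineering to the earlier logistic-regression emotion classifier in an attempt to improve its performance. The additions are as follows.

1. Generating polynomial features:

- PolynomialFeatures generates interaction and polynomial features from the given sensor data (ECG_avg, ECG_std, PPG_avg, PPG_std, SPO2_avg, EEG_avg).

- The degree=2 setting means that the square of each feature and every pairwise product of features are included. This lets the model learn relationships that the original features alone do not capture.

2. Converting to a DataFrame:

- The generated sensor_poly array is converted to a DataFrame, and the get_feature_names_out method supplies the column names.

- The new DataFrame sensor_poly_df is added to the original dataset data.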
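To make the effect of degree=2 concrete, the following sketch enumerates the columns such an expansion produces, without using scikit-learn. It only lists names; the helper `degree2_feature_names` is hypothetical, and its naming style loosely mirrors what get_feature_names_out returns.

```python
from itertools import combinations_with_replacement


def degree2_feature_names(names):
    """List the original columns plus every square and pairwise product."""
    pairs = combinations_with_replacement(names, 2)
    return list(names) + [a + "^2" if a == b else a + " " + b for a, b in pairs]


sensors = ["ECG_avg", "ECG_std", "PPG_avg", "PPG_std", "SPO2_avg", "EEG_avg"]
expanded = degree2_feature_names(sensors)
print(len(expanded))  # 27: 6 linear terms + 21 degree-2 terms
```

Six sensor columns therefore grow to 27 features. The expansion also repeats the six original columns, which is worth keeping in mind when the result is attached to a frame that already contains them.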

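Feature selection

1. Building the feature list:

- A list is built that contains the generated polynomial features (sensor_poly_df.columns) and the additional features (occupant_age, position_back, position_front).

- This list determines the final set of features used to train the model.

2. Building the final feature dataset:

- X = data[features] builds the final feature dataset X used to train the model.

Performance evaluation result 1

Performance turned out even lower, so further feature engineering appears to be needed.
The additional processing is shown below.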
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

# Load the data
data = pd.read_csv('all_data.csv', encoding='cp949')

# Remove rows labelled '분류없음' (unclassified)
data = data[data['emotion'] != '분류없음']

# Handle outliers (example: Z-score); reset the index so that the
# column-wise concat below lines up row by row
outlier_columns = ['ECG_avg', 'PPG_avg', 'SPO2_avg', 'EEG_avg']
data = data[(np.abs(stats.zscore(data[outlier_columns])) < 3).all(axis=1)]
data = data.reset_index(drop=True)

# One-hot encode the target
emotion_one_hot = pd.get_dummies(data['emotion'], dtype=int)
data = pd.concat([data, emotion_one_hot], axis=1)

# Create interaction and polynomial features
polynomial_features = PolynomialFeatures(degree=2, include_bias=False,
                                         interaction_only=False)
sensor_data = data[['ECG_avg', 'PPG_avg', 'EEG_avg']]
sensor_poly = polynomial_features.fit_transform(sensor_data)
sensor_poly_df = pd.DataFrame(
    sensor_poly,
    columns=polynomial_features.get_feature_names_out(sensor_data.columns),
    index=data.index)
# Drop the original sensor columns first to avoid duplicate column names
data = pd.concat([data.drop(columns=sensor_data.columns), sensor_poly_df], axis=1)

# Select the features ('action' is also dropped if present, since it is a
# non-numeric label column)
X = data.drop(['emotion', 'img_name', 'action', 'occupant_id', 'occupant_sex',
               'occupant_position'] + list(emotion_one_hot.columns),
              axis=1, errors='ignore')

# Select the target
y = data[emotion_one_hot.columns]

# Split the data with a fixed random seed
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=123)

# Standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature selection: keep the 20 best features (or all, if there are fewer)
selector = SelectKBest(f_classif, k=min(20, X_train_scaled.shape[1]))
X_train_scaled = selector.fit_transform(X_train_scaled, y_train.idxmax(axis=1))
X_test_scaled = selector.transform(X_test_scaled)

# Build one binary model per emotion class
models = {}
emotions = y.columns
for emotion in emotions:
    model = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=123)
    model.fit(X_train_scaled, y_train[emotion])
    models[emotion] = model


# Multi-class prediction function
def predict_multiclass(X_scaled):
    probabilities = np.array([model.predict_proba(X_scaled)[:, 1]
                              for emotion, model in models.items()]).T
    predicted_class_indices = np.argmax(probabilities, axis=1)
    predicted_classes = [emotions[idx] for idx in predicted_class_indices]
    return predicted_classes


# Predict on the training and test data
y_train_pred = predict_multiclass(X_train_scaled)
y_test_pred = predict_multiclass(X_test_scaled)

# Compare with the original 'emotion' target
y_train_original = y_train.idxmax(axis=1)
y_test_original = y_test.idxmax(axis=1)

# Evaluate
train_accuracy = accuracy_score(y_train_original, y_train_pred)
test_accuracy = accuracy_score(y_test_original, y_test_pred)
print("Train Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)
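What was added:

1. Outlier handling:

- Z-scores are used to remove outliers in the ECG_avg, PPG_avg, SPO2_avg and EEG_avg data. A Z-score expresses how many standard deviations a value lies from the mean; here only rows whose absolute Z-score is below 3 in all four columns are kept. The step is meant to keep extreme readings from distorting the fit.

2. Interaction and polynomial features:

- PolynomialFeatures generates interaction and polynomial features from the selected sensor data. This lets the model learn non-linear relationships in the data, which may improve prediction accuracy.

3. Feature selection:

- SelectKBest with f_classif selects the 20 features with the strongest univariate relationship to the target. This keeps the model's complexity in check and focuses it on the most informative inputs, which can improve performance.

These additional steps are intended to improve the quality of the data relative to the earlier model and to help the model learn meaningful patterns. However, outlier removal and feature selection can also discard important information, so they have to be applied with care.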
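The Z-score filter described above can be sketched without SciPy. This is a minimal illustration on made-up readings; the helper name `zscore_filter` is hypothetical, and it uses the population standard deviation, which is the default convention of scipy.stats.zscore.

```python
import statistics


def zscore_filter(rows, columns, threshold=3.0):
    """Keep rows whose |z-score| is below threshold in every listed column."""
    stats_by_col = {}
    for col in columns:
        values = [row[col] for row in rows]
        # Population std (ddof=0), matching the scipy.stats.zscore default.
        stats_by_col[col] = (statistics.mean(values), statistics.pstdev(values))
    kept = []
    for row in rows:
        within = all(
            abs((row[col] - mean) / std) < threshold
            for col, (mean, std) in stats_by_col.items()
        )
        if within:
            kept.append(row)
    return kept


# Made-up data: 19 ordinary readings and one extreme value.
rows = [{"ECG_avg": 1.0 + 0.01 * i} for i in range(19)] + [{"ECG_avg": 50.0}]
kept = zscore_filter(rows, ["ECG_avg"])
print(len(rows), "->", len(kept))  # 20 -> 19
```

Note that the mean and standard deviation are themselves computed from data that includes the outliers, so in small samples an extreme value can inflate the standard deviation enough to escape the threshold.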

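Performance evaluation result 2

1. Insufficient data processing

- Outliers and noise: if outliers or noise were not handled well enough, they may have hurt model training.

- Data distribution: the new features may have distorted the underlying distribution of the data. For example, some features may be emphasized far more than others.

2. Hyperparameter settings

- Model tuning: the hyperparameters used (max_iter, solver, etc.) may not suit the new feature set. Without proper tuning, the model's performance may not be optimal.

3. Class imbalance

- Class ratio: if the emotion classes are unevenly distributed in the dataset, the model may become biased toward the majority class.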
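One quick way to check the class-imbalance hypothesis is to compare the model's accuracy with the accuracy of always predicting the most frequent class, and to look at how much re-weighting each class would need. The sketch below uses a made-up label distribution; the helper names are hypothetical, and the weight formula is the common "balanced" heuristic n_samples / (n_classes * class_count).

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def balanced_class_weights(labels):
    """weight = n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


labels = ["neutral"] * 6 + ["happy"] * 3 + ["angry"] * 1  # made-up distribution
baseline = majority_baseline_accuracy(labels)
weights = balanced_class_weights(labels)
print(baseline)
print({k: round(v, 3) for k, v in weights.items()})
```

If a classifier's accuracy is close to this baseline and its predictions are dominated by one label, the imbalance explanation becomes more plausible; rare classes receive larger weights under this heuristic.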

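Conclusion
Performance dropped even though polynomial and interaction features were added through feature engineering, and several factors may be responsible. Insufficient data processing, unsuitable hyperparameter settings and class imbalance are the main suspects. The following approaches could be tried to improve the model.

- Revisit the preprocessing: use broad and rigorous EDA to re-examine outlier removal, normalization and related steps, and improve the quality of the data.

- Optimize the hyperparameters: use grid search, random search or similar methods to find the best hyperparameters.

- Address the class imbalance: use techniques such as SMOTE or downsampling to deal with the imbalance.

In the end, the complexity of the model has to be balanced against the characteristics of the data, and more careful model tuning is needed.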
