Midterm Text

Uploaded by

hamburgerhenry13

Python資料分析與機器學習應用

Data Analysis and Machine Learning with Python

Midterm Test

The questions on this test add up to 150 points, but the maximum
score awarded is 120 points.

Please write your answers in a new document (e.g., doc, pdf, ppt)
in the order of the questions. You can answer in either
English or Chinese. Once completed, please upload it to NTU
Cool before the end of the test (before the system closes).

Note 1: Avoid uploading your answers at the last minute, since
network congestion or unforeseen issues (for example, a
discrepancy between the NTU Cool system time and the school
clock time) may cause the upload to fail.
Late submissions and make-up submissions will NOT be accepted.

Note 2: You can open books and use AI tools for assistance,
but you CANNOT directly copy the answers they provide. In
other words, please try to describe the answers in your own
words after understanding them.

I. (10%) True or False Questions (Please answer T for True
or F for False):

1. In data analysis, using the Python Pandas package
makes it convenient to handle structured data.

2. In data analysis, data cleaning only refers to removing
missing values from raw data.

3. In machine learning, we typically divide data into
training and testing data sets to evaluate the performance
of the model.

4. Supervised learning is a machine learning approach
where the model is trained based on labeled training data.

5. Decision tree is a supervised learning method used for
both classification and regression tasks.

II. (10%) Multiple Choice Questions (Please answer a, b, c,
d):

1. Which method can be used for statistical analysis in
Python?

a) describe()

b) read_csv()

c) plot()

d) fit()

2. Which of the following is a common algorithm used for
classification tasks in machine learning?

a) Linear Regression

b) K Nearest Neighbors

c) K-means

d) PCA

3. Which function is used for handling missing values in
data analysis?

a) dropna()

b) fillna()

c) isnull()

d) all of the above

4. Which method is used for grouping data in data
analysis?

a) groupby()

b) split()

c) filter()

d) map()

5. Which chart can be used to visualize the distribution of
data?

a) histogram

b) boxplot

c) scatter plot

d) all of the above
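
The pandas methods named in the options above can be tried on a small table. The following is only a minimal sketch: the DataFrame, its column names ("team", "score") and its values are hypothetical and are not part of the test data.

```python
import pandas as pd

# Hypothetical toy data, used only to show what each method returns.
df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "score": [1.0, None, 3.0, 5.0],
})

missing = int(df["score"].isnull().sum())        # number of missing scores
dropped = df.dropna()                            # rows with a missing value removed
filled = df.fillna({"score": 0.0})               # missing scores replaced by 0.0
summary = df["score"].describe()                 # count, mean, std, min, quartiles, max
team_means = df.groupby("team")["score"].mean()  # per-team mean, missing values skipped

print(missing)
print(summary)
print(team_means)
```

Note that isnull() only detects missing values, while dropna() and fillna() change the data that is returned.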

III. (80%) Programming Questions:

1. (20%) Please examine the code to answer the
questions below:
a. (10%) What is the purpose of the following code?
(Please provide a detailed description beyond simply
stating "data analysis")
b. (10%) Is there a simpler way to achieve the same
result? (Please provide the code in either an ipynb or
py file)

import pandas as pd

df = pd.read_csv("question.csv")

data_dict = {}

for i in range(df.shape[0]):
    category = df.loc[i, "Category"]
    if df.loc[i, "Group"] not in data_dict:
        data_dict[df.loc[i, "Group"]] = {}
    data_dict[df.loc[i, "Group"]][category] = df.loc[i, "Value"]

group_list = []
categoryA_list = []
categoryB_list = []
categoryC_list = []
categoryD_list = []
categoryE_list = []

for k, v in data_dict.items():
    group_list.append(k)
    if "A" in v:
        categoryA_list.append(1)
    else:
        categoryA_list.append(0)

    if "B" in v:
        categoryB_list.append(1)
    else:
        categoryB_list.append(0)

    if "C" in v:
        categoryC_list.append(1)
    else:
        categoryC_list.append(0)

    if "D" in v:
        categoryD_list.append(1)
    else:
        categoryD_list.append(0)

    if "E" in v:
        categoryE_list.append(1)
    else:
        categoryE_list.append(0)

data_new = {"Group": group_list, "A": categoryA_list,
            "B": categoryB_list, "C": categoryC_list,
            "D": categoryD_list, "E": categoryE_list}
df_new = pd.DataFrame.from_dict(data_new)

df_new["A_Count"] = df_new["A"]
df_new["B_Count"] = df_new["B"]
df_new["C_Count"] = df_new["C"]
df_new["D_Count"] = df_new["D"]
df_new["E_Count"] = df_new["E"]

result = df_new[df_new["D_Count"] == 0].shape[0]

print(result)
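
One possible shorter formulation is sketched below. It assumes the loop above amounts to counting the groups that have no row whose Category is "D", and that the Group column has no missing values. Because question.csv is not reproduced here, the DataFrame is a hypothetical stand-in with the same three columns.

```python
import pandas as pd

# Hypothetical stand-in for question.csv; the real file is not shown.
df = pd.DataFrame({
    "Group": ["g1", "g1", "g2", "g3", "g3"],
    "Category": ["A", "D", "B", "C", "D"],
    "Value": [1, 2, 3, 4, 5],
})

# Number of distinct groups overall.
all_groups = df["Group"].nunique()

# Number of distinct groups that have at least one row in category "D".
groups_with_d = df.loc[df["Category"] == "D", "Group"].nunique()

# Groups without any "D" row, i.e. the rows where D_Count == 0 above.
result = all_groups - groups_with_d

print(result)
```

In this toy table only g2 lacks category "D", so the count is 1.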

2. (35%) In class, we utilized three features from the
Titanic dataset: "Pclass," "Sex," and "Age," as features /
predictors (X), with the "Survived" column serving as the
target label / variable (y). With data from nearly 900
passengers, we trained eight supervised classification
models (Logistic Regression, K-NN, SVC,
Gaussian Naive Bayes, Multinomial Naive Bayes,
Decision Tree, Random Forest, and XGBoost) to predict
the survival of Jack and Rose from the movie Titanic.

Now, we have obtained an updated passenger list
(comprising over 1,300 passengers) in titanic.csv, together
with the column meanings given in the following data
dictionary and variable notes.

Data Dictionary

Variable   Definition                                   Key
Survived   Survival                                     0 = No, 1 = Yes
pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated,
it is in the form of xx.5

sibsp: The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were
ignored)

parch: The dataset defines family relations in this way...

Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children traveled only with a nanny, therefore
parch=0 for them.

If we add two new features, SibSp and Parch, to the
features / predictors (X), resulting in a total of five features,
and redefine the data for Jack and Rose as follows:
- Jack, a 20-year-old male, a poor child who won a
third-class ticket at a casino and boarded the ship with a
friend;
- Rose, a 17-year-old female, a daughter of nobility,
boarded the ship with her mother and fiancé.
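
The two descriptions above can be turned into feature rows as in the sketch below. The values are one reading of the text, not a given answer: Rose is assumed to travel first class, her fiancé is not counted in SibSp (the variable notes say fiancés were ignored), her mother gives Parch = 1, and Jack's friend counts in neither column. The 0/1 encoding of Sex and the name X_new are illustrative choices and must match whatever encoding the training data uses.

```python
import pandas as pd

# The five predictors named in the question.
FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch"]

# Assumed values derived from the descriptions of Jack and Rose.
passengers = pd.DataFrame(
    [
        {"Name": "Jack", "Pclass": 3, "Sex": "male", "Age": 20, "SibSp": 0, "Parch": 0},
        {"Name": "Rose", "Pclass": 1, "Sex": "female", "Age": 17, "SibSp": 0, "Parch": 1},
    ]
).set_index("Name")

# Assumed encoding: male -> 0, female -> 1.
passengers["Sex"] = passengers["Sex"].map({"male": 0, "female": 1})

# Rows in the column order a fitted classifier would expect.
X_new = passengers[FEATURES]
print(X_new)
```

A table like X_new can then be passed to the predict method of each fitted model.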

Please answer the following questions:
a. (24%) What are the survival results for Jack and Rose
using the eight classification models? (Please provide the
code in either an ipynb or py file)
b. (3%) Are there any models with significantly different
results?
c. (8%) Given the above, what do you think could be
potential reasons for the differences in results?

3. (25%) The motor products manufactured by a certain
company have a certain life cycle. Please analyze the
provided data on motor life cycles to answer the following
questions: (Please provide the code in either an ipynb or
py file)

a. (5%) What is the average lifespan of the batch of
motors? Please calculate the mean and explain its
significance.
b. (5%) Please plot a histogram of motor lifespans to
show the distribution across different lifespan
intervals.
c. (5%) How many products in the batch have
reached or exceeded the expected lifespan? The
expected lifespan is 1000 hours.
d. (5%) Please calculate the standard deviation of the
motor lifespans and explain its significance.
e. (5%) Does the lifespan of the batch of motors
follow a normal distribution? Please briefly explain
and provide the results of relevant statistical tests.
For example, you can use the Shapiro–Wilk test in SciPy
(https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html).

Data for 10 motors:

Motor ID   Lifespan (hours)
1          980
2          1050
3          990
4          1100
5          1000
6          980
7          1020
8          990
9          950
10         1020

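
The descriptive statistics asked for in parts a, c and d can be sketched with the Python standard library alone. The variable names are illustrative, and both the sample (n - 1) and population (n) standard deviations are shown because the question does not say which one is intended.

```python
import statistics

# Lifespans in hours, copied from the table of 10 motors.
lifespans = [980, 1050, 990, 1100, 1000, 980, 1020, 990, 950, 1020]
EXPECTED_LIFESPAN = 1000  # hours, as stated in part c

mean_life = statistics.mean(lifespans)        # arithmetic mean
sample_sd = statistics.stdev(lifespans)       # sample standard deviation (n - 1)
population_sd = statistics.pstdev(lifespans)  # population standard deviation (n)

# Motors that reached or exceeded the expected lifespan.
meets_expected = sum(1 for hours in lifespans if hours >= EXPECTED_LIFESPAN)

print(f"mean = {mean_life} hours")
print(f"sample sd = {sample_sd:.2f} hours, population sd = {population_sd:.2f} hours")
print(f"motors at or above {EXPECTED_LIFESPAN} hours: {meets_expected}")
```

For this batch the mean is 1008 hours and 5 of the 10 motors reach the expected lifespan.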
IV. (50%) Essay Questions:

1. (10%) Please explain the importance of data cleaning in
the process of data analysis, and list three common data
cleaning techniques, providing examples of their
application scenarios.

2. (10%) Please explain the significance of overfitting in
machine learning and provide methods to avoid overfitting.

3. (10%) Please explain the working principles of the
decision tree and the random forest models, compare their
advantages and disadvantages, and provide examples of
suitable application scenarios for each.

4. (10%) Please discuss common data visualization tools
in data analysis, such as Matplotlib and Seaborn,
explaining their pros and cons and suitable usage
methods.

5. (10%) Is it appropriate to use random forest regression
and decision tree regression to detect anomalies? Please
elaborate carefully.
