

Linear Regression Exercises
In this exercise we use a bike-sharing dataset provided in a Kaggle competition: Bike Sharing Demand. The dataset contains the hourly bike rental counts recorded by the Capital Bikeshare system in 2011 and 2012, together with the corresponding season and weather information.

Data columns:

datetime - hourly date + timestamp


season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy;
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist;
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds;
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals

Step 1: Read in the data
In [3]: # read the data and set the datetime as the index
import pandas as pd

bikes = pd.read_csv('bikeshare.csv', index_col='datetime', parse_dates=True)

In [4]: bikes.head()

Out[4]: season holiday workingday weather temp atemp humidity windspeed casual registered count

datetime

2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16

2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40

2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32

2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13

2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1
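
To confirm that the columns and types match the description above, a quick inspection can be run (a minimal sketch; bikes is the DataFrame loaded in Step 1):

In [ ]: # quick sanity check of the loaded data
print(bikes.shape)    # number of rows and columns
print(bikes.dtypes)   # data type of each column
bikes.describe()      # summary statistics for the numeric columns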

Step 2: Visualize the data
Use matplotlib to draw a scatter plot of temperature 'temp' against the bike rental count 'count';
use seaborn to draw the same scatter plot with a fitted linear relationship (hint: use seaborn's lmplot).

In [6]: # matplotlib
import matplotlib.pyplot as plt
# draw the scatter plot of temp vs. count
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['count'], alpha=0.5)
plt.grid(True)
plt.show()

In [8]: # seaborn
import seaborn as sns

# draw the scatter plot with a fitted linear relationship
# (lmplot creates its own figure, so the size is set via height/aspect
# rather than a separate plt.figure call, which would only leave an empty figure)
sns.lmplot(x='temp', y='count', data=bikes, height=6, aspect=1.5)
plt.grid(True)
plt.show()

Step 3: Simple linear regression
Predict the bike rental count from temperature.

In [9]: # create X and y


X = bikes[['temp']]
y = bikes['count']

In [10]: # import, instantiate, fit


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np
# instantiate the linear regression model
linear_model = LinearRegression()

# fit the model
linear_model.fit(X, y)

Out[10]: LinearRegression()

In [11]: # print the coefficients

print(f'linear regression coefficient (slope): {linear_model.coef_[0]:.4f}')
print(f'linear regression intercept: {linear_model.intercept_:.4f}')

linear regression coefficient (slope): 9.1705
linear regression intercept: 6.0462
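
With these values the fitted line is count ≈ 6.0462 + 9.1705 × temp, i.e. each additional degree Celsius is associated with roughly nine extra rentals per hour. A minimal sketch of using the fit for a single prediction (the 25 °C value is chosen purely for illustration):

In [ ]: # predict the rental count at 25 °C, by hand and via the model
temp_value = 25
manual_pred = linear_model.intercept_ + linear_model.coef_[0] * temp_value
model_pred = linear_model.predict(pd.DataFrame({'temp': [temp_value]}))[0]
print(manual_pred, model_pred)   # both should be about 235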

Step 4: Explore multiple features
In [12]: # explore more features
feature_cols = ['temp', 'season', 'weather', 'humidity']

In [13]: # using seaborn, draw multiple scatter plots between each feature in feature_cols and 'count'
columns_to_plot = feature_cols + ['count']
# draw pairwise scatter plots for the selected columns
sns.pairplot(bikes[columns_to_plot], diag_kind='kde')
plt.show()

In [14]: # correlation matrix (values range from -1 to 1)


bikes.corr()

Out[14]:
              season   holiday  workingday   weather      temp     atemp  humidity  windspeed    casual  registered     count
season      1.000000  0.029368   -0.008126  0.008879  0.258689  0.264744  0.190610  -0.147121  0.096758    0.164011  0.163439
holiday     0.029368  1.000000   -0.250491 -0.007074  0.000295 -0.005215  0.001929   0.008409  0.043799   -0.020956 -0.005393
workingday -0.008126 -0.250491    1.000000  0.033772  0.029966  0.024660 -0.010880   0.013373 -0.319111    0.119460  0.011594
weather     0.008879 -0.007074    0.033772  1.000000 -0.055035 -0.055376  0.406244   0.007261 -0.135918   -0.109340 -0.128655
temp        0.258689  0.000295    0.029966 -0.055035  1.000000  0.984948 -0.064949  -0.017852  0.467097    0.318571  0.394454
atemp       0.264744 -0.005215    0.024660 -0.055376  0.984948  1.000000 -0.043536  -0.057473  0.462067    0.314635  0.389784
humidity    0.190610  0.001929   -0.010880  0.406244 -0.064949 -0.043536  1.000000  -0.318607 -0.348187   -0.265458 -0.317371
windspeed  -0.147121  0.008409    0.013373  0.007261 -0.017852 -0.057473 -0.318607   1.000000  0.092276    0.091052  0.101369
casual      0.096758  0.043799   -0.319111 -0.135918  0.467097  0.462067 -0.348187   0.092276  1.000000    0.497250  0.690414
registered  0.164011 -0.020956    0.119460 -0.109340  0.318571  0.314635 -0.265458   0.091052  0.497250    1.000000  0.970948
count       0.163439 -0.005393    0.011594 -0.128655  0.394454  0.389784 -0.317371   0.101369  0.690414    0.970948  1.000000

In [15]: sns.heatmap(bikes.corr())

Out[15]: <Axes: >
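
A compact way to read off how strongly each column relates to the target is to sort the 'count' column of the correlation matrix (a small optional sketch):

In [ ]: # correlations with 'count', strongest first
bikes.corr()['count'].sort_values(ascending=False)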

Predict the bike rental count 'count' from the four features 'temp', 'season', 'weather', and 'humidity'.


In [23]: # create X and y
# create the feature matrix X and the target variable y
feature_cols = ['temp', 'season', 'weather', 'humidity']  # feature columns

X = bikes[feature_cols]  # features
y = bikes['count']       # target variable

In [17]: # import, instantiate, fit


# import the required libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# instantiate the linear regression model
linear_model = LinearRegression()

# fit the model
linear_model.fit(X, y)

Out[17]: LinearRegression()

In [18]: # print the coefficients

print('regression coefficients:', linear_model.coef_)
print('regression intercept:', linear_model.intercept_)

regression coefficients: [9.17054048]
regression intercept: 6.046212959616469

Use a train/test split and RMSE to compare several different models.

In [19]: # compare different sets of features


feature_cols1 = ['temp', 'season', 'weather', 'humidity']
feature_cols2 = ['temp', 'season', 'weather']
feature_cols3 = ['temp', 'season', 'humidity']
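
The cell above only defines the candidate feature sets. One possible way to complete the comparison, sketched under the assumption that the bikes DataFrame, numpy (as np), LinearRegression and train_test_split imported earlier are all available, is a small helper that fits a model on a train/test split and reports the test-set RMSE for each set:

In [ ]: # one possible comparison: test-set RMSE for each candidate feature set
from sklearn.metrics import mean_squared_error

def train_test_rmse(feature_cols):
    X = bikes[feature_cols]
    y = bikes['count']
    # fixed random_state only so the split is reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(mean_squared_error(y_test, y_pred))

print(train_test_rmse(feature_cols1))
print(train_test_rmse(feature_cols2))
print(train_test_rmse(feature_cols3))

The lowest RMSE on the same split points to the stronger feature set.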

Supplement: handling categorical features
There are two kinds of categorical features:

Ordered categories: convert them to the corresponding numeric values (e.g. small=1, medium=2, large=3); a short sketch follows this list

Unordered categories: use dummy encoding (0/1 encoding)

The categorical features in this dataset are:

Ordered: weather (already encoded as the numeric values 1, 2, 3, 4)

Unordered: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)
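
As a tiny illustration of the first case, an ordered category stored as strings can be mapped to numbers with pandas; the 'size' values below are hypothetical and only for illustration, since in this dataset weather already comes encoded as 1-4 (the unordered season column is dummy-encoded in the next cell):

In [ ]: # hypothetical example: map an ordered string category to numbers
sizes = pd.Series(['small', 'large', 'medium', 'small'])
size_mapping = {'small': 1, 'medium': 2, 'large': 3}
print(sizes.map(size_mapping))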

In [21]: # create dummy variables


season_dummies = pd.get_dummies(bikes.season, prefix='season')

# print 5 random rows


season_dummies.sample(n=5, random_state=1)

Out[21]: season_1 season_2 season_3 season_4

datetime

2011-09-05 11:00:00 False False True False

2012-03-18 04:00:00 True False False False

2012-10-14 17:00:00 False False False True

2011-04-04 15:00:00 False True False False

2012-12-11 02:00:00 False False False True

We only need three dummy variables (not four) (why?), so we can drop the first dummy variable.

In [22]: # drop the first column


season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)

# print 5 random rows


season_dummies.sample(n=5, random_state=1)

Out[22]: season_2 season_3 season_4

datetime

2011-09-05 11:00:00 False True False

2012-03-18 04:00:00 False False False

2012-10-14 17:00:00 False False True

2011-04-04 15:00:00 True False False

2012-12-11 02:00:00 False False True

In [23]: # concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns)
bikes = pd.concat([bikes, season_dummies], axis=1)

# print 5 random rows


bikes.sample(n=5, random_state=1)

Out[23]:
datetime               season  holiday  workingday  weather   temp   atemp  humidity  windspeed  casual  registered  count  season_2  season_3  season_4
2011-09-05 11:00:00         3        1           0        2  28.70  33.335        74    11.0014     101         207    308     False      True     False
2012-03-18 04:00:00         1        0           0        2  17.22  21.210        94    11.0014       6           8     14     False     False     False
2012-10-14 17:00:00         4        0           0        1  26.24  31.060        44    12.9980     193         346    539     False     False      True
2011-04-04 15:00:00         2        0           1        1  31.16  33.335        23    36.9974      47          96    143      True     False     False
2012-12-11 02:00:00         4        0           1        2  16.40  20.455        66    22.0028       0           1      1     False     False      True

Add the dummy-encoded variables to the regression model's features and predict the bike rental count.
In [30]: # include dummy variables for season in the model
feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity']
X = bikes[feature_cols]  # features
y = bikes['count']       # target variable

# split into training and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# instantiate the linear regression model
linear_model = LinearRegression()

# fit the model
linear_model.fit(X_train, y_train)

# print the model coefficients and intercept
print('regression coefficients:', linear_model.coef_)
print('regression intercept:', linear_model.intercept_)

# predict on the test set
y_pred = linear_model.predict(X_test)

# compute the model's R² score
r2 = r2_score(y_test, y_pred)
print(f'R² score of the new model: {r2:.4f}')

regression coefficients: [ 11.25840607  -5.55599206 -46.71747057  67.71409193  -2.80624827]
regression intercept: 133.0147787618583
R² score of the new model: 0.2706

Compare with the previous models.

In [ ]:
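
One possible way to fill in the empty cell above, sketched under the assumption that the same 70/30 split (random_state=42) is reused so the R² scores are directly comparable: refit the single-feature temperature model and the four-feature model from Step 4 on the same training set and score them on the same test set.

In [ ]: # refit the earlier models on the same train/test split for a fair comparison
for cols in [['temp'], ['temp', 'season', 'weather', 'humidity']]:
    X_cmp = bikes[cols]
    y_cmp = bikes['count']
    X_tr, X_te, y_tr, y_te = train_test_split(X_cmp, y_cmp, test_size=0.3, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    print(cols, f'R² = {r2_score(y_te, model.predict(X_te)):.4f}')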
