Linearreg - Exercise Finish
Linear Regression Exercise
In this exercise we use a bike-sharing dataset provided by a Kaggle competition: Bike Sharing Demand. The dataset contains hourly bike rental counts recorded by the Capital Bikeshare system from 2011 to 2012, together with the corresponding season, weather, and related information.
Data columns: season, holiday, workingday, weather, temp, atemp, humidity, windspeed, casual, registered, count (indexed by datetime)
Step 1: Read in the data
In [3]: # read the data and set the datetime as the index
import pandas as pd
# the file path is assumed; point it at your local copy of the Kaggle CSV
bikes = pd.read_csv('bikeshare.csv', index_col='datetime', parse_dates=True)
In [4]: bikes.head()
Out[4]: the first five rows, with columns season, holiday, workingday, weather, temp, atemp, humidity, windspeed, casual, registered, count, indexed by datetime
Step 2: Visualize the data
Use matplotlib to draw a scatter plot of temperature "temp" against the rental count "count";
Use seaborn to draw a scatter plot of "temp" against "count" with a fitted regression line (hint: use seaborn's lmplot)
In [6]: # matplotlib
import matplotlib.pyplot as plt
# scatter plot of temperature vs. rental count
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['count'], alpha=0.5)
plt.xlabel('temp')
plt.ylabel('count')
plt.grid(True)
plt.show()
In [8]: # seaborn
import seaborn as sns
# scatter plot with a fitted regression line; lmplot creates its own figure,
# so a preceding plt.figure() call would only open an empty one
sns.lmplot(x='temp', y='count', data=bikes, aspect=1.5)
plt.grid(True)
plt.show()
Step 3: Simple linear regression
Predict the rental count from temperature
In [10]: from sklearn.linear_model import LinearRegression

X = bikes[['temp']]  # feature (2-D, as scikit-learn expects)
y = bikes['count']   # target

# fit the model
linear_model = LinearRegression()
linear_model.fit(X, y)
Out[10]: LinearRegression()
print(f'slope of the linear model: {linear_model.coef_[0]:.4f}')
print(f'intercept of the linear model: {linear_model.intercept_:.4f}')
slope of the linear model: 9.1705
intercept of the linear model: 6.0462
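With a single feature, a prediction is just intercept + slope × temp. A minimal sketch on synthetic data (the real bikes DataFrame is not reproduced here, so the data-generating numbers below are assumptions) verifying that model.predict agrees with the manual computation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the temp/count relationship
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=500)
count = 6.05 + 9.17 * temp + rng.normal(0, 5, size=500)

model = LinearRegression().fit(temp.reshape(-1, 1), count)

# A prediction is intercept + coef * x:
manual = model.intercept_ + model.coef_[0] * 25.0
via_predict = model.predict([[25.0]])[0]
print(abs(manual - via_predict) < 1e-9)  # True: the two agree
```

Reading the fitted values the same way as in the exercise, each additional degree of temperature adds roughly one slope-unit of rentals on top of the intercept.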
Step 4: Explore multiple features
In [12]: # explore more features
feature_cols = ['temp', 'season', 'weather', 'humidity']
In [13]: # using seaborn, draw multiple scatter plots between each feature in feature_cols and 'count'
columns_to_plot = feature_cols + ['count']
# pairwise scatter plots among the selected columns (KDE on the diagonal)
sns.pairplot(bikes[columns_to_plot], diag_kind='kde')
plt.show()
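If you only want each feature plotted against 'count' rather than every pairwise combination, pairplot's x_vars/y_vars arguments restrict the grid to one row of panels. A sketch on a synthetic stand-in for the bikes columns (column names taken from the exercise, values invented):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the relevant bikes columns
rng = np.random.default_rng(1)
bikes = pd.DataFrame({
    "temp": rng.uniform(0, 40, 200),
    "season": rng.integers(1, 5, 200),
    "weather": rng.integers(1, 4, 200),
    "humidity": rng.uniform(0, 100, 200),
})
bikes["count"] = 6 + 9 * bikes["temp"] + rng.normal(0, 20, 200)

feature_cols = ["temp", "season", "weather", "humidity"]
# One row of panels: each feature on the x-axis, 'count' on the y-axis,
# with a fitted regression line in every panel (kind='reg')
grid = sns.pairplot(bikes, x_vars=feature_cols, y_vars=["count"], kind="reg", height=3)
print(grid.axes.shape)  # (1, 4): one panel per feature
```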
In [14]: bikes.corr()
Out[14]:
                season   holiday  workingday   weather      temp     atemp  humidity  windspeed    casual  registered     count
season        1.000000  0.029368   -0.008126  0.008879  0.258689  0.264744  0.190610  -0.147121  0.096758    0.164011  0.163439
holiday       0.029368  1.000000   -0.250491 -0.007074  0.000295 -0.005215  0.001929   0.008409  0.043799   -0.020956 -0.005393
workingday   -0.008126 -0.250491    1.000000  0.033772  0.029966  0.024660 -0.010880   0.013373 -0.319111    0.119460  0.011594
weather       0.008879 -0.007074    0.033772  1.000000 -0.055035 -0.055376  0.406244   0.007261 -0.135918   -0.109340 -0.128655
temp          0.258689  0.000295    0.029966 -0.055035  1.000000  0.984948 -0.064949  -0.017852  0.467097    0.318571  0.394454
atemp         0.264744 -0.005215    0.024660 -0.055376  0.984948  1.000000 -0.043536  -0.057473  0.462067    0.314635  0.389784
humidity      0.190610  0.001929   -0.010880  0.406244 -0.064949 -0.043536  1.000000  -0.318607 -0.348187   -0.265458 -0.317371
windspeed    -0.147121  0.008409    0.013373  0.007261 -0.017852 -0.057473 -0.318607   1.000000  0.092276    0.091052  0.101369
casual        0.096758  0.043799   -0.319111 -0.135918  0.467097  0.462067 -0.348187   0.092276  1.000000    0.497250  0.690414
registered    0.164011 -0.020956    0.119460 -0.109340  0.318571  0.314635 -0.265458   0.091052  0.497250    1.000000  0.970948
count         0.163439 -0.005393    0.011594 -0.128655  0.394454  0.389784 -0.317371   0.101369  0.690414    0.970948  1.000000
In [15]: sns.heatmap(bikes.corr())
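To read feature relevance straight off the correlation matrix, sort each column's correlation with 'count' by absolute value. A small sketch with synthetic columns (with the real data this would simply be bikes.corr()['count']):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two of the bikes columns
rng = np.random.default_rng(2)
df = pd.DataFrame({"temp": rng.uniform(0, 40, 300),
                   "humidity": rng.uniform(0, 100, 300)})
df["count"] = 6 + 9 * df["temp"] - 0.5 * df["humidity"] + rng.normal(0, 10, 300)

# Correlation of every column with 'count', strongest first
corr_with_count = df.corr()["count"].drop("count").sort_values(key=abs, ascending=False)
print(corr_with_count)
```

On the real matrix above, this ranking puts registered and casual first (they sum to count), then temp/atemp; note also that temp and atemp correlate at 0.98 with each other, so including both adds little.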
X = bikes[feature_cols]  # features
y = bikes['count']       # target

# fit a fresh model on all the features (coef_ will hold one value per feature)
linear_model = LinearRegression()
linear_model.fit(X, y)
Out[17]: LinearRegression()
print('coefficients of the regression model:', linear_model.coef_)
print('intercept of the regression model:', linear_model.intercept_)
Use a train/test split and RMSE to compare several different models
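One way to set up that comparison is a small helper that takes a list of feature columns and returns the test-set RMSE, so candidate feature sets can be ranked side by side. A sketch (the function name and its defaults are assumptions, not part of the exercise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse_for_features(df, feature_cols, target="count", test_size=0.3, seed=42):
    """Split df, fit a linear regression on feature_cols, return test-set RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[target], test_size=test_size, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # np.sqrt over MSE works across scikit-learn versions
    return np.sqrt(mean_squared_error(y_test, y_pred))
```

With the real data, comparing rmse_for_features(bikes, ['temp']) against rmse_for_features(bikes, ['temp', 'humidity']) and so on ranks the feature sets; lower RMSE on the held-out split is better.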
Supplement: handling categorical features
There are two kinds of categorical features:
ordered (ordinal) categories, whose integer codes carry a meaningful order and can be used directly;
unordered (nominal) categories, whose codes are only labels and should be dummy-encoded.
The categorical features in this dataset include season, holiday, workingday, and weather; the datetime index can also be decomposed into categorical features such as hour and month.
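The reason nominal codes need dummy encoding: a linear model fed season as the integers 1–4 would be forced to treat season 4 as "four times" season 1. A tiny sketch of pd.get_dummies with the first level dropped, matching the season_2/season_3/season_4 columns used later (the example series is invented):

```python
import pandas as pd

# season codes are labels, not magnitudes
season = pd.Series([1, 2, 3, 4, 2], name="season")

# One dummy column per level; drop the first so season 1 becomes the
# baseline absorbed by the intercept (avoids perfect collinearity)
season_dummies = pd.get_dummies(season, prefix="season").iloc[:, 1:]
print(list(season_dummies.columns))  # ['season_2', 'season_3', 'season_4']
```

A row with season 1 then has all three dummies off, and each fitted dummy coefficient reads as the shift relative to season 1.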
In [23]: # create dummy variables for season and drop the first level, so
# season 1 is the baseline (this step matches the season_2/3/4 columns below)
season_dummies = pd.get_dummies(bikes['season'], prefix='season').iloc[:, 1:]
# concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns)
bikes = pd.concat([bikes, season_dummies], axis=1)
Out[23]:
                     season  holiday  workingday  weather   temp   atemp  humidity  windspeed  casual  registered  count  season_2  season_3  season_4
datetime
2011-09-05 11:00:00       3        1           0        2  28.70  33.335        74    11.0014     101         207    308     False      True     False
2012-03-18 04:00:00       1        0           0        2  17.22  21.210        94    11.0014       6           8     14     False     False     False
2012-10-14 17:00:00       4        0           0        1  26.24  31.060        44    12.9980     193         346    539     False     False      True
2011-04-04 15:00:00       2        0           1        1  31.16  33.335        23    36.9974      47          96    143      True     False     False
2012-12-11 02:00:00       4        0           1        2  16.40  20.455        66    22.0028       0           1      1     False     False      True
Add the encoded dummy variables to the regression model's features and predict the rental count
In [30]: # include dummy variables for season in the model
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity']
X = bikes[feature_cols]  # features
y = bikes['count']       # target

# split into training and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# instantiate and fit the linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# print the model's coefficients and intercept
print('coefficients of the regression model:', linear_model.coef_)
print('intercept of the regression model:', linear_model.intercept_)

# predict on the test set and evaluate with RMSE and R²
y_pred = linear_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f'RMSE of the new model: {rmse:.4f}')
print(f'R² score of the new model: {r2:.4f}')
Compare with the previous models