
# [ Linear Regression ] ( CheatSheet )

1. Basic Linear Regression with statsmodels

● Simple Linear Regression: statsmodels.api.OLS(y, X).fit() (use sm.add_constant(X) to include an intercept; example below)
● Summary of Regression Results: results.summary()
● Predictions: results.predict(X_new)
● Residuals: results.resid
● Regression Plot: seaborn.regplot(x=x, y=y)
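
A minimal end-to-end sketch of the statsmodels workflow above, on synthetic data (the variable names, sample size, and noise level are illustrative, not from the original sheet):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=100)

X = sm.add_constant(x)           # add the intercept column
results = sm.OLS(y, X).fit()     # ordinary least squares fit

print(results.summary())         # coefficients, R-squared, diagnostics
X_new = sm.add_constant(np.array([2.0, 5.0]))
print(results.predict(X_new))    # predictions for new x values
print(results.resid[:5])         # first few residuals
```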

2. Linear Regression with scikit-learn

● Fit Linear Model: sklearn.linear_model.LinearRegression().fit(X, y) (example below)
● Coefficients and Intercept: model.coef_, model.intercept_
● Predictions: model.predict(X_new)
● R-squared Score: model.score(X, y)
● Mean Squared Error: sklearn.metrics.mean_squared_error(y_true, y_pred)
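
A matching scikit-learn sketch, again on synthetic data (feature count and noise level are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative data with two features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)             # fitted slopes and intercept
print(model.score(X, y))                         # R-squared on the training data
print(mean_squared_error(y, model.predict(X)))   # training MSE
```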

3. Data Preprocessing for Linear Regression

● Standard Scaling: sklearn.preprocessing.StandardScaler().fit_transform(X) (combined example below)
● Polynomial Features: sklearn.preprocessing.PolynomialFeatures(degree).fit_transform(X)
● Train-Test Split: sklearn.model_selection.train_test_split(X, y)
● Handling Missing Values: pandas.DataFrame.fillna()
● One-Hot Encoding for Categorical Variables: pandas.get_dummies()
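
One way these preprocessing steps can fit together, sketched on a tiny invented housing frame (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Invented frame with a missing value and a categorical column
df = pd.DataFrame({
    "size": [50.0, 80.0, np.nan, 120.0, 95.0, 60.0],
    "city": ["A", "B", "A", "B", "A", "B"],
    "price": [100.0, 160.0, 140.0, 250.0, 180.0, 120.0],
})
df["size"] = df["size"].fillna(df["size"].median())   # impute missing values
df = pd.get_dummies(df, columns=["city"])             # one-hot encode the category

X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)                # fit the scaler on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_poly = PolynomialFeatures(degree=2).fit_transform(X_train)  # squared and interaction terms
```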

4. Diagnostics and Model Checking

● Plotting Residuals: seaborn.residplot(x=x, y=y)
● Checking for Homoscedasticity: statsmodels.stats.diagnostic.het_breuschpagan(residuals, model.model.exog)
● Normality Test of Residuals: scipy.stats.shapiro(residuals)
● Outliers Detection (e.g., Cook's distance): statsmodels.stats.outliers_influence.OLSInfluence(model).cooks_distance (example below)
● Cross-Validation Scores: sklearn.model_selection.cross_val_score(model, X, y)
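
A sketch tying the diagnostics above together on a synthetic OLS fit (data and interpretation thresholds are illustrative, not prescriptive):

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import OLSInfluence

# Illustrative OLS fit
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=100)
results = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)        # small p-value suggests heteroscedasticity

shapiro_stat, shapiro_p = stats.shapiro(results.resid)
print("Shapiro-Wilk p-value:", shapiro_p)         # small p-value suggests non-normal residuals

cooks_d, cooks_p = OLSInfluence(results).cooks_distance
print("Largest Cook's distance:", cooks_d.max())  # flag potentially influential observations
```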

5. Regularization Techniques

● Ridge Regression: sklearn.linear_model.Ridge(alpha=alpha).fit(X, y)
● Lasso Regression: sklearn.linear_model.Lasso(alpha=alpha).fit(X, y)
● Elastic Net: sklearn.linear_model.ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X, y)
● Grid Search for Hyperparameter Tuning: sklearn.model_selection.GridSearchCV() (example below)
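
A compact sketch of the regularized fits and the grid search; the alpha grid below is an arbitrary example, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)                      # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                      # L1 penalty drives some coefficients to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

# Search over the regularization strength with cross-validation
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```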

6. Multivariate Linear Regression

● Multiple Linear Regression: statsmodels.api.OLS(y, sm.add_constant(X)).fit()
● Partial Regression Plots: statsmodels.graphics.regressionplots.plot_partregress_grid(results)

7. Advanced Linear Models

● Generalized Linear Models (GLM): statsmodels.api.GLM(y, X, family).fit()
● Quantile Regression: statsmodels.regression.quantile_regression.QuantReg(y, X).fit(q)
● Robust Regression: statsmodels.robust.robust_linear_model.RLM(y, X).fit()
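
An illustrative sketch of the three model families above on synthetic heavy-tailed data; the count response for the GLM is manufactured here only to make the call runnable:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(0)
X = sm.add_constant(rng.uniform(0, 10, size=200))
y = X @ np.array([1.0, 0.8]) + rng.standard_t(df=3, size=200)   # heavy-tailed noise

median_fit = QuantReg(y, X).fit(q=0.5)                           # median (0.5 quantile) regression
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()      # robust regression with Huber loss
counts = rng.poisson(lam=np.exp(0.1 * X[:, 1]))                  # synthetic count response for the GLM
glm_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()  # Poisson GLM with log link

print(median_fit.params, robust_fit.params, glm_fit.params)
```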

8. Interaction Effects and Nonlinearity

● Interaction Terms: X['interaction'] = X['feature1'] * X['feature2']
● Non-linear Transformations of Predictors: numpy.log(X), numpy.sqrt(X)

9. Model Interpretation

● Feature Importance: abs(model.coef_)
● Coefficients Interpretation: beta coefficients in results.summary()
● Effects of Categorical Variables: one-hot encoded coefficients

10. Model Selection and Evaluation

● AIC and BIC: results.aic, results.bic
● Adjusted R-squared: 1 - (1 - model.score(X, y)) * ((len(y) - 1) / (len(y) - X.shape[1] - 1)) (worked example below)
● F-Test for Model Significance: results.f_pvalue
● Stepwise Regression (Forward, Backward): stepwise_selection(X, y) # Custom function
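
A worked example of the adjusted R-squared formula from the bullet above, on synthetic data (n is the sample count, p the number of predictors):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes R-squared for extra predictors
print(r2, adj_r2)
```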

11. Prediction and Confidence Intervals

● Confidence Interval of Predictions: results.get_prediction(X_new).conf_int()
● Prediction Interval: prediction_interval(model, X_new, alpha) # Custom function
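
For statsmodels OLS, get_prediction also exposes prediction intervals directly via summary_frame, so the custom helper in the bullet above is optional; a minimal sketch on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.uniform(0, 10, size=100))
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=1.0, size=100)
results = sm.OLS(y, X).fit()

X_new = sm.add_constant(np.array([3.0, 7.0]))
pred = results.get_prediction(X_new)
print(pred.conf_int(alpha=0.05))                  # confidence interval for the mean response
frame = pred.summary_frame(alpha=0.05)            # includes obs_ci_lower / obs_ci_upper
print(frame[["obs_ci_lower", "obs_ci_upper"]])    # prediction interval for new observations
```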

12. Visualization of Linear Models

● Coefficient Plot: plot_coefficients(model, feature_names) # Custom function
● Scatter Plot with Regression Line: seaborn.lmplot(x='feature', y='target', data=df)
● Partial Dependence Plot: sklearn.inspection.PartialDependenceDisplay.from_estimator(model, X, features)

13. Handling Large Datasets

● Stochastic Gradient Descent for Linear Regression: sklearn.linear_model.SGDRegressor().fit(X, y)
● Mini-Batch Gradient Descent: call SGDRegressor().partial_fit(X_batch, y_batch) over successive batches (sketch below)
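
A sketch of mini-batch training with SGDRegressor.partial_fit; the batch size, learning-rate settings, and epoch count are arbitrary illustrations:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=10_000)

model = SGDRegressor(learning_rate="invscaling", eta0=0.01)
batch_size = 256
for epoch in range(5):
    order = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        model.partial_fit(X[idx], y[idx])            # update weights on one mini-batch
print(model.coef_)
```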

14. Working with Time Series

● Linear Regression with Time Series Data: Handle time-based features and trends in data
● Lag Features and Autoregression: df['lag_feature'] = df['feature'].shift(periods) (example below)
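
A small sketch of lag and trend features for a time-indexed series; the series itself and the chosen lags are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented daily series with a smooth signal plus noise
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=200, freq="D")
df = pd.DataFrame({"y": np.sin(np.arange(200) / 10) + rng.normal(scale=0.1, size=200)}, index=idx)

df["lag_1"] = df["y"].shift(1)        # yesterday's value
df["lag_7"] = df["y"].shift(7)        # value one week ago
df["trend"] = np.arange(len(df))      # simple linear trend feature
df = df.dropna()                      # drop rows made incomplete by shifting

model = LinearRegression().fit(df[["lag_1", "lag_7", "trend"]], df["y"])
print(model.coef_)
```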

15. Practical Challenges and Solutions

● Handling Multicollinearity: Variance Inflation Factor (VIF) calculation
● Dealing with Non-Stationarity in Time Series: Differencing or transformation

16. Integrating with Machine Learning Pipelines

● Using Linear Regression in Pipelines: sklearn.pipeline.Pipeline(steps=[('scaler', StandardScaler()), ('regressor', LinearRegression())]) (example below)
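
A runnable sketch of the pipeline pattern above; make_regression stands in for real data, and cross-validation re-fits the scaler inside each fold so there is no leakage:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),        # scaling is re-fit inside each CV fold
    ("regressor", LinearRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(scores.mean())
```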

17. Cross-Validation and Model Selection

● K-Fold Cross-Validation: sklearn.model_selection.cross_val_score(model, X, y, cv=5)
● Leave-One-Out Cross-Validation: sklearn.model_selection.LeaveOneOut()
● Hyperparameter Tuning with GridSearchCV: sklearn.model_selection.GridSearchCV(estimator, param_grid)

18. Diagnostic Plots

● Residual Plot: seaborn.residplot(x=x, y=y, lowess=True)
● Q-Q Plot for Residuals: scipy.stats.probplot(residuals, plot=plt)
● Leverage Plot: statsmodels.graphics.regressionplots.influence_plot(model, criterion="cooks")

19. Advanced Feature Engineering

● Feature Interaction and Polynomial Terms: sklearn.preprocessing.PolynomialFeatures(include_bias=False).fit_transform(X)
● Automatic Feature Selection: sklearn.feature_selection.RFE(estimator, n_features_to_select)

20. Preprocessing and Feature Scaling

● Normalization (MinMax Scaling): sklearn.preprocessing.MinMaxScaler().fit_transform(X)
● Robust Scaling (handling outliers): sklearn.preprocessing.RobustScaler().fit_transform(X)

21. Regularization and Penalization Techniques

● LassoCV for Optimal Alpha: sklearn.linear_model.LassoCV(alphas=alphas).fit(X, y)
● RidgeCV for Optimal Alpha: sklearn.linear_model.RidgeCV(alphas=alphas).fit(X, y)
● ElasticNetCV for Optimal Alpha and L1 Ratio: sklearn.linear_model.ElasticNetCV(alphas=alphas, l1_ratio=l1_ratio).fit(X, y) (example below)
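
A sketch of the three cross-validated regularizers; the alpha grid and l1_ratio candidates are arbitrary examples:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0)
alphas = np.logspace(-3, 1, 30)

lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)
ridge = RidgeCV(alphas=alphas).fit(X, y)          # RidgeCV uses efficient leave-one-out-style CV by default
enet = ElasticNetCV(alphas=alphas, l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)

print(lasso.alpha_, ridge.alpha_, enet.alpha_, enet.l1_ratio_)
```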

22. Assumptions of Linear Regression

● Linearity Test: Plotting observed vs. predicted values
● Independence Test: Durbin-Watson test
● Homoscedasticity Test: Breusch-Pagan test
● Normality Test for Residuals: Kolmogorov-Smirnov test

23. Working with Non-linear Data

● Transformation of Target Variable: numpy.log(y) or numpy.sqrt(y)
● Generalized Additive Models (GAMs): pygam.LinearGAM().fit(X, y)

24. Model Interpretability

● Feature Importance in Linear Models: np.abs(model.coef_)
● SHAP Values for Linear Regression: shap.LinearExplainer(model, X).shap_values(X_new)

25. Ensemble Methods

● Averaging Multiple Linear Models: Averaging predictions from different models
● Stacking Linear Models: sklearn.ensemble.StackingRegressor(estimators)

26. Error Metrics and Model Evaluation

● Mean Absolute Error (MAE): sklearn.metrics.mean_absolute_error(y_true, y_pred)
● Root Mean Squared Error (RMSE): numpy.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
● Mean Squared Logarithmic Error (MSLE): sklearn.metrics.mean_squared_log_error(y_true, y_pred)

27. Time Series Regression

● Lag Features for Time Series: df['lag_feature'] = df['feature'].shift(1)
● Rolling Window Features: df['rolling_mean'] = df['feature'].rolling(window=5).mean()

28. Handling Sparse Data

● Sparse Matrix Handling: scipy.sparse.csr_matrix(X)
● Linear Regression with Sparse Data: sklearn.linear_model.LinearRegression().fit(X_sparse, y)

29. Deployment and Persistence of Model

● Model Serialization with joblib: joblib.dump(model, 'model.pkl')
● Model Deserialization: model = joblib.load('model.pkl')

30. Performance Improvement

● Parallel Computing for Large Datasets: LinearRegression(n_jobs=-1)
● Batch Gradient Descent for Large Datasets: Implementing batch or mini-batch gradient descent

31. Reporting and Visualization

● Coefficient Path Plot: Plotting coefficient magnitude vs. regularization strength
● Prediction Error Plot: Yellowbrick's PredictionError(model)

32. Extensions and Related Models

● Partial Least Squares Regression: sklearn.cross_decomposition.PLSRegression()
● Ridge Regression with Polynomial Features: Pipeline with PolynomialFeatures and Ridge

33. Advanced Statistical Techniques

● Quantile Regression: statsmodels.regression.quantile_regression.QuantReg(y, X).fit(q=0.5)
● Instrumental Variable Regression: linearmodels.iv.IV2SLS(dependent, exog, endog, instruments)

34. Working with Categorical Variables

● Encoding and Including Categorical Variables: pandas.get_dummies()
● ANOVA for Categorical Features Impact: statsmodels.formula.api.ols('y ~ C(categorical_feature)', data).fit() (example below)
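
A sketch of the formula-API ANOVA pattern on invented data with a single categorical predictor:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented data with one categorical predictor
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=120),
    "y": rng.normal(size=120),
})
data.loc[data["group"] == "B", "y"] += 1.0       # group B has a higher mean

fit = smf.ols("y ~ C(group)", data=data).fit()   # the formula API handles the dummy coding
print(sm.stats.anova_lm(fit, typ=2))             # ANOVA table: is the group effect significant?
```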

35. Model Diagnostics and Validation

● Cross-Validation for Linear Regression: sklearn.model_selection.cross_val_score(model, X, y, cv=5)
● Learning Curve to Diagnose Model Performance: sklearn.model_selection.learning_curve(model, X, y)

36. Multicollinearity Handling

● Variance Inflation Factor (VIF) Calculation: statsmodels.stats.outliers_influence.variance_inflation_factor(X, i) (example below)
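
A sketch of a per-column VIF table; the design matrix is invented, with two deliberately collinear columns, and the 5-10 rule of thumb mentioned in the comment is a common heuristic rather than a hard cutoff:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Invented design matrix with two nearly collinear columns
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),   # nearly collinear with x1
    "x3": rng.normal(size=200),
})

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # values well above ~5-10 flag problematic multicollinearity
```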

37. Interaction with Domain Knowledge

● Incorporating Domain Insights into Model: Modifying features or model based on domain expertise

By: Waleed Mousa
