Generative AI For Models Development


23/10/2024, 13:59 about:blank

Lab: Generative AI for Models Development


Estimated time: 30 minutes

In this lab, we will use generative AI to create Python scripts to develop and evaluate different predictive models for a given data set.

Learning objectives
In this lab, you will learn how to use generative AI to create Python code that can:

Use linear regression in one variable to fit the parameters to a model
Use linear regression in multiple variables to fit the parameters to a model
Use polynomial regression in a single variable to fit the parameters to a model
Create a pipeline that scales multiple features, generates polynomial features from them, and performs linear regression
Use grid search with cross-validation and ridge regression to create a model with optimum hyperparameters

Code execution environment


To test the prompt-generated code, keep the Jupyter Notebook (in the link below) open in a separate tab in your web browser. The notebook has some setup
instructions that you should complete now.

Jupyter-Lite Test Environment

The data set for this lab is available at the following URL.

URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing

The dataset is a filtered and modified version of the Laptop Price Prediction using specifications dataset, available under the Database Contents License (DbCL) v1.0
on the Kaggle website.

Complete the setup in the Jupyter Notebook and then proceed further.

Building the prompt: Importing data set


Now you will start building your prompt. The first step is to ask the generative AI model for code that imports the provided data set into a pandas data frame. In the prompt, specify that the first row of the CSV file contains the data set headers.

You can structure the prompt to create the code as follows:


Write a Python code that can perform the following tasks.
Read the CSV file, located on a given file path, into a pandas data frame, assuming that the first row of the file can be used as th

Running this prompt using the GPT 3.5 engine in the GenAI classroom should result in a response that contains the code, as shown below.

import pandas as pd

# Define the file path
file_path = "path/to/your/csv/file.csv"

# Read the CSV file into a pandas data frame,
# using the first row of the file as the headers for the data
data_frame = pd.read_csv(file_path, header=0)

# Additional details:
# - The 'pd.read_csv()' function is used to read a CSV file into a pandas data frame.
# - The 'header' parameter in the 'pd.read_csv()' function specifies which row to use as the headers.
#   By default, it is set to 'infer', which treats the first row of the file as the headers.
#   If the file doesn't have headers, set the 'header' parameter to None.

You may copy the code from the response and try it in the Jupyter Notebook, changing the file path to point to the data set.
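To see what the `header` argument changes, the following minimal sketch reads a small in-memory CSV twice. The column names and values are made up for illustration and are not taken from the lab's data set.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the names and values are illustrative only.
csv_text = "Manufacturer,Price\nAcer,978\nDell,634\n"

# header=0: the first row supplies the column names.
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None: the first row is treated as data and the columns are numbered 0, 1, ...
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(with_header.columns), with_header.shape)
print(list(without_header.columns), without_header.shape)
```

With `header=None` the frame has one extra row, because the header line is counted as data.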

Linear regression in one variable


You can now ask the generative AI model to generate a script to create a linear regression model using a single attribute as the training feature to predict a target
attribute.

You can write the prompt as follows:


Write a Python code that performs the following tasks.
1. Develops and trains a linear regression model that uses one attribute of a data frame as the source variable and another as a tar
2. Calculate and display the MSE and R^2 values for the trained model

It will generate a response as shown below:


import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Assume you have a pandas data frame called 'data_frame' with two columns: 'source_variable' and 'target_variable'

# Extract the source variable and target variable from the data frame
X = data_frame[['source_variable']]
y = data_frame['target_variable']

# Initialize a linear regression model
model = LinearRegression()

# Train the model using the source and target variables
model.fit(X, y)

# Make predictions using the trained model
y_pred = model.predict(X)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y, y_pred)

# Calculate the coefficient of determination (R^2)
r2 = r2_score(y, y_pred)

# Display the MSE and R^2 values
print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)

# Additional details:
# - The 'LinearRegression' class from the 'sklearn.linear_model' module is used to create a linear regression model.
# - The 'fit()' method is used to train the model using the source and target variables.
# - The 'predict()' method is used to make predictions using the trained model.
# - The 'mean_squared_error()' function from the 'sklearn.metrics' module is used to calculate the MSE.
# - The 'r2_score()' function from the 'sklearn.metrics' module is used to calculate the R^2 value.

You can use this code to develop a linear regression model with the target variable as Price and the source variable as CPU_frequency. Try this out in the Test
environment.
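The two metrics reported by the generated script can also be computed by hand, which may make them easier to interpret. The sketch below fits a one-variable least-squares line in plain Python on five made-up points (not the laptop data) and evaluates MSE and R^2 from their definitions.

```python
# Made-up sample points; they stand in for one feature column and one target column.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least-squares estimates for y = intercept + slope * x
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

y_pred = [intercept + slope * xi for xi in x]

# MSE: mean of the squared residuals
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
mse = ss_res / n

# R^2: 1 minus the ratio of the residual sum of squares to the total sum of squares
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, MSE={mse:.2f}, R^2={r2:.2f}")
# prints: slope=0.60, intercept=2.20, MSE=0.48, R^2=0.60
```

An R^2 of 0.60 here means the fitted line accounts for 60% of the variance of the target around its mean.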

Linear regression in multiple variables


You can modify the prompt to ask for code that uses multiple attributes as input for training a model to predict the value of a target variable.

For that, you can use the prompt as shown below:

Write a Python code that performs the following tasks.
1. Develops and trains a linear regression model that uses some attributes of a data frame as the source variables and one of the at
2. Calculate and display the MSE and R^2 values for the trained model.

You should see a response similar to the following:



import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Assume you have a pandas data frame called 'data_frame' with multiple columns: 'source_variable_1', 'source_variable_2', ..., 'tar

# Extract the source variables and target variable from the data frame
# (list every source column you want to use; the two names below are placeholders)
source_columns = ['source_variable_1', 'source_variable_2']
X = data_frame[source_columns]
y = data_frame['target_variable']

# Initialize a linear regression model
model = LinearRegression()

# Train the model using the source and target variables
model.fit(X, y)

# Make predictions using the trained model
y_pred = model.predict(X)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y, y_pred)

# Calculate the coefficient of determination (R^2)
r2 = r2_score(y, y_pred)

# Display the MSE and R^2 values
print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)

# Additional details:
# - The 'LinearRegression' class from the 'sklearn.linear_model' module is used to create a linear regression model.
# - The 'fit()' method is used to train the model using the source and target variables.
# - The 'predict()' method is used to make predictions using the trained model.
# - The 'mean_squared_error()' function from the 'sklearn.metrics' module is used to calculate the MSE.
# - The 'r2_score()' function from the 'sklearn.metrics' module is used to calculate the R^2 value.

You can use the generated code to build a linear regression model for the data set that uses CPU_frequency, RAM_GB, Storage_GB_SSD, CPU_core, OS, GPU and Category as source variables and Price as the target variable. You can check the generated code in the test environment in your web browser.

You may compare the performance of the two models by comparing their MSE and R^2 values.
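When both models are scored on the same rows they were trained on, an ordinary least-squares fit with extra features cannot score a lower R^2 than a fit on a subset of those features. A higher training R^2 alone therefore does not show that the larger model predicts new data better. The sketch below illustrates this with NumPy on synthetic data; the feature values and the helper `training_r2` are hypothetical and not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the target depends only on the first feature plus noise.
n_samples = 50
x1 = rng.normal(size=n_samples)
x2 = rng.normal(size=n_samples)  # generated independently of the target
y = 3.0 * x1 + rng.normal(scale=0.5, size=n_samples)


def training_r2(features, target):
    """Fit least squares with an intercept and return R^2 on the training rows."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


r2_single = training_r2(x1.reshape(-1, 1), y)
r2_multi = training_r2(np.column_stack([x1, x2]), y)

print("R^2 with one feature: ", r2_single)
print("R^2 with two features:", r2_multi)
```

Scoring the models on a held-out split or with cross-validation gives a fairer comparison than training scores.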


Polynomial regression
You can explore creating a model that uses higher-order features derived from the original attributes. Higher-order terms of a feature let the model capture nonlinear relationships with the target variable, although a very high order can overfit the training data. This is called polynomial regression, and you can use generative AI to create the code for it.

Assume you are given a single attribute as the source variable and one as a target variable. You must create a model using polynomial regression for a given order.
You can also make the model for different order values and compare their performance based on MSE and R^2 scores.
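As a minimal sketch of what higher-order features look like in scikit-learn, `PolynomialFeatures` expands a single column x into the columns 1, x, x^2, and so on up to the chosen degree. A linear regression on those columns is a polynomial regression in x. The input values below are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two arbitrary samples of a single feature
X = np.array([[2.0], [3.0]])

# degree=3 produces the columns [1, x, x^2, x^3]
expanded = PolynomialFeatures(degree=3).fit_transform(X)
print(expanded)
```

Each row of the output holds the powers of one input value, so the row for x = 2 contains 1, 2, 4 and 8.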

You can write a prompt similar to the following:


Write a Python code that performs the following tasks.
1. Develops and trains multiple polynomial regression models, with orders 2, 3, and 5, that use one attribute of a data frame as the
2. Calculate and display the MSE and R^2 values for the trained models.
3. Compare the performance of the models.

The expected output should look as shown below:



import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

# Assume you have a pandas data frame called 'data_frame' with two columns: 'source_variable' and 'target_variable'

# Extract the source variable and target variable from the data frame
X = data_frame[['source_variable']]
y = data_frame['target_variable']

# Polynomial orders to compare
orders = [2, 3, 5]

# Initialize lists to store the MSE and R^2 values for each model
mse_values = []
r2_values = []

# Loop through the polynomial orders
for order in orders:
    # Create polynomial features
    polynomial_features = PolynomialFeatures(degree=order)
    X_poly = polynomial_features.fit_transform(X)

    # Initialize a linear regression model
    model = LinearRegression()

    # Train the model using the polynomial features and target variable
    model.fit(X_poly, y)

    # Make predictions using the trained model
    y_pred = model.predict(X_poly)

    # Calculate the mean squared error (MSE)
    mse = mean_squared_error(y, y_pred)

    # Calculate the coefficient of determination (R^2)
    r2 = r2_score(y, y_pred)

    # Append the MSE and R^2 values to the lists
    mse_values.append(mse)
    r2_values.append(r2)

    # Display the MSE and R^2 values for the current model
    print(f"Polynomial Order {order}:")
    print("Mean Squared Error (MSE):", mse)
    print("Coefficient of Determination (R^2):", r2)
    print()

# Compare the performance of the models:
# the best model has the lowest MSE and the worst model has the highest MSE
best_order = orders[np.argmin(mse_values)]
worst_order = orders[np.argmax(mse_values)]

print("Model Comparison:")
print(f"Best Polynomial Order: {best_order}")
print(f"Worst Polynomial Order: {worst_order}")

# Additional details:
# - The 'PolynomialFeatures' class from the 'sklearn.preprocessing' module is used to create polynomial features.
# - The 'fit_transform()' method is used to transform the source variable into polynomial features.
# - The 'degree' parameter in the 'PolynomialFeatures' class specifies the maximum degree of the polynomial features.
# - The 'argmin()' and 'argmax()' functions from the 'numpy' module are used to find the index of the minimum and maximum values in

You can use the relevant part of the code in your script.

You can see that the model can generate sophisticated code using functions to create and train models with different orders and evaluate their performance for each of
them.

Try to run the generated code in the testing interface with CPU_frequency as the source variable and Price as the target variable.

Creating a Pipeline
Pipelines are processes containing a sequence of steps that lead to creating a trained model.

You will now use the Generative AI model to create a pipeline for performing feature scaling, creating polynomial features for multiple attributes, and performing
linear regression using these variables.

You can build a prompt similar to the following:


Write a Python code that performs the following tasks.
1. Create a pipeline that performs parameter scaling, Polynomial Feature generation, and Linear regression. Use the set of multiple
2. Calculate and display the MSE and R^2 values for the trained model.

The expected response is as shown below:


import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Assume you have a pandas data frame called 'data_frame' with multiple columns: 'source_variable_1', 'source_variable_2', ..., 'tar

# Extract the source variables and target variable from the data frame
# (list every source column you want to use; the two names below are placeholders)
source_columns = ['source_variable_1', 'source_variable_2']
X = data_frame[source_columns]
y = data_frame['target_variable']

# Create a pipeline that performs parameter scaling, polynomial feature generation, and linear regression
pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2),
    LinearRegression()
)

# Train the model using the source and target variables
pipeline.fit(X, y)

# Make predictions using the trained model
y_pred = pipeline.predict(X)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y, y_pred)

# Calculate the coefficient of determination (R^2)
r2 = r2_score(y, y_pred)

# Display the MSE and R^2 values
print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)

# Additional details:
# - The 'make_pipeline()' function from the 'sklearn.pipeline' module is used to create a pipeline.
# - The 'StandardScaler' class from the 'sklearn.preprocessing' module is used to perform parameter scaling.
# - The 'PolynomialFeatures' class from the 'sklearn.preprocessing' module is used to create polynomial features.
# - The 'LinearRegression' class from the 'sklearn.linear_model' module is used for linear regression.
# - The pipeline automatically applies the transformations in the specified order.

You can use the relevant part in your code.

Make appropriate changes to the attributes you will use in the code. Consider the same set of attributes as used for multiple feature linear regression, which is
CPU_frequency, RAM_GB, Storage_GB_SSD, CPU_core, OS, GPU and Category as source variables, and Price as the target variable.
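To make the pipeline's behavior concrete, the sketch below fits the same three steps on synthetic numeric data (a stand-in for the lab's attributes, not the real data set) and then reproduces `pipeline.predict` by calling the fitted steps one after another. The step names are the ones `make_pipeline` assigns automatically from the lowercased class names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for a numeric feature table: 40 rows, 3 features.
X = rng.normal(size=(40, 3))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

pipeline = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())
pipeline.fit(X, y)

# make_pipeline names each step after its lowercased class name.
step_names = list(pipeline.named_steps)
print(step_names)

# Reproduce pipeline.predict by calling the fitted steps in order.
scaled = pipeline.named_steps["standardscaler"].transform(X)
expanded = pipeline.named_steps["polynomialfeatures"].transform(scaled)
manual_pred = pipeline.named_steps["linearregression"].predict(expanded)

print(np.allclose(manual_pred, pipeline.predict(X)))
```

The final check shows that the pipeline is only a convenience wrapper: it applies scaling, feature expansion, and regression in sequence.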

Grid search and Ridge regression


An improved way to train your model is to use ridge regression instead of linear regression. You can use the polynomial features of multiple attributes. Ridge regression adds a penalty on the size of the model coefficients, and the strength of that penalty is set by the hyperparameter alpha. Using grid search, you can determine the optimum value of alpha for the given set of features from a list of candidate values. Grid search uses cross-validation to score each candidate value and then trains the optimum model.
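To illustrate the role of alpha, the sketch below solves ridge regression in closed form with NumPy, w = (X^T X + alpha I)^(-1) X^T y, without an intercept term and on synthetic data. It is meant to show the effect of the penalty rather than to describe scikit-learn's internals; the helper `ridge_coefficients` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: 30 rows, 4 features, a known linear relationship plus noise.
X = rng.normal(size=(30, 4))
y = X @ np.array([4.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.2, size=30)


def ridge_coefficients(features, target, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept)."""
    n_features = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ target)


alphas = [0.0001, 0.01, 1, 100]
norms = [np.linalg.norm(ridge_coefficients(X, y, a)) for a in alphas]

for a, norm in zip(alphas, norms):
    print(f"alpha={a}: coefficient norm = {norm:.4f}")
```

The coefficient norm shrinks as alpha grows: a small alpha behaves almost like ordinary least squares, while a large alpha pulls the coefficients toward zero.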

You can use generative AI to create the Python code to perform a grid search for the optimum ridge regression model, which uses polynomial features generated from
multiple parameters.

You may use the following prompt to create this:


Write a Python code that performs the following tasks.
1. Use polynomial features for some of the attributes of a data frame.
2. Perform Grid search on a ridge regression model for a set of values of hyperparameter alpha and polynomial features as input.
3. Use cross-validation in the Grid search.
4. Evaluate the resulting model's MSE and R^2 values.

The resulting code will look as shown below:



import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# Assume you have a pandas data frame called 'data_frame' with multiple columns: 'source_variable_1', 'source_variable_2', ..., 'tar

# Extract the source variables and target variable from the data frame
# (list every source column you want to use; the two names below are placeholders)
source_columns = ['source_variable_1', 'source_variable_2']
X = data_frame[source_columns]
y = data_frame['target_variable']

# Create polynomial features of order 2
polynomial_features = PolynomialFeatures(degree=2)

# Transform the source variables into polynomial features
X_poly = polynomial_features.fit_transform(X)

# Define the hyperparameter values for the grid search.
# Only 'alpha' is searched: 'degree' is a parameter of PolynomialFeatures, not of Ridge,
# so including it in a grid applied to Ridge alone raises an error.
param_grid = {'alpha': [0.1, 1.0, 10.0]}

# Initialize a ridge regression model
model = Ridge()

# Perform grid search with cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5)

# Train the model using the polynomial features and target variable
grid_search.fit(X_poly, y)

# Make predictions using the trained model
y_pred = grid_search.predict(X_poly)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y, y_pred)

# Calculate the coefficient of determination (R^2)
r2 = r2_score(y, y_pred)

# Display the MSE and R^2 values
print("Mean Squared Error (MSE):", mse)
print("Coefficient of Determination (R^2):", r2)

# Additional details:
# - The 'PolynomialFeatures' class from the 'sklearn.preprocessing' module is used to create polynomial features.
# - The 'GridSearchCV' class from the 'sklearn.model_selection' module is used to perform grid search with cross-validation.
# - The 'Ridge' class from the 'sklearn.linear_model' module is used for ridge regression.
# - The 'fit_transform()' method is used to transform the source variables into polynomial features.
# - The 'param_grid' parameter in the 'GridSearchCV' class specifies the hyperparameter values to search over.
# - The 'cv' parameter in the 'GridSearchCV' class specifies the number of folds for cross-validation.
# - The best model found by grid search can be accessed using the 'best_estimator_' attribute of the grid search object.

You can test this code on the data set in the testing environment.

You can use the following parameter values for this purpose.

Source Variables: CPU_frequency, RAM_GB, Storage_GB_SSD, CPU_core, OS, GPU and Category
Target Variable: Price
Set of values for alpha: 0.0001, 0.001, 0.01, 0.1, 1, 10
Cross Validation: 4-fold
Polynomial Feature order: 2
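One possible way to apply these settings is sketched below: a pipeline of order-2 polynomial features and ridge regression, searched over the listed alpha values with 4-fold cross-validation. The data here is synthetic (seven random numeric columns standing in for the source variables), so the printed scores say nothing about the laptop data set; replace `X` and `y` with the real columns in the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Synthetic stand-in for the seven source variables and the target.
X = rng.normal(size=(120, 7))
y = X[:, 0] * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.3, size=120)

# Polynomial order 2 followed by ridge regression
model = make_pipeline(PolynomialFeatures(degree=2), Ridge())

# Parameters of a pipeline step are addressed as <step name>__<parameter>
alphas = [0.0001, 0.001, 0.01, 0.1, 1, 10]
param_grid = {"ridge__alpha": alphas}

# 4-fold cross-validation over the candidate alpha values
grid_search = GridSearchCV(model, param_grid, cv=4)
grid_search.fit(X, y)

y_pred = grid_search.predict(X)
print("Best alpha:", grid_search.best_params_["ridge__alpha"])
print("MSE:", mean_squared_error(y, y_pred))
print("R^2:", r2_score(y, y_pred))
```

Putting the polynomial expansion inside the pipeline means that it is re-applied within each cross-validation fold, and it would also allow the polynomial degree to be added to the grid as `polynomialfeatures__degree`.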

Conclusion
Congratulations! You have completed the lab on model development.

With this, you have learned how to use generative AI to create Python code that can:

Implement linear regression in a single variable
Implement linear regression in multiple variables
Implement polynomial regression for different orders of a single variable
Create a pipeline that scales multiple variables, generates polynomial features from them, and performs linear regression
Apply a grid search to create an optimum ridge regression model for multiple features

Author(s)
Abhishek Gagneja

© IBM Corporation. All rights reserved.
