Machine Learning Life Cycle Report

Uploaded by Lamia Altayeb

1. Data Acquisition:
The California housing prices dataset was obtained for analysis and model
development. The dataset contains features such as the number of rooms, median
income, median house values, and other district-level variables.
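A minimal sketch of this step: in the actual project the dataset would be read from a local CSV file (the file name "housing.csv" and the sample rows below are assumptions, not details from this report); here a tiny in-memory sample stands in for the real file so the snippet is self-contained.

```python
import io

import pandas as pd

# Two illustrative district rows standing in for the real housing.csv file.
csv_sample = io.StringIO(
    "longitude,latitude,median_income,median_house_value,ocean_proximity\n"
    "-122.23,37.88,8.3252,452600.0,NEAR BAY\n"
    "-122.22,37.86,8.3014,358500.0,NEAR BAY\n"
)

# In the real project this would be pd.read_csv("housing.csv").
housing = pd.read_csv(csv_sample)
print(housing.shape)  # (2, 5)
```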
2. Data Exploration and Visualization:
a) Top Five Rows: The head() method was used to examine the first five rows of the
dataset, providing an initial understanding of the data structure and variables.
b) Data Description: The info() method was employed to obtain a quick description of
the data, including the number of instances, attribute types, and any missing values.
c) Analysis of "ocean_proximity": The value_counts() method was used to determine the
number of districts belonging to each category in the "ocean_proximity" variable.
d) Summary of Numerical Attributes: The describe() method was utilized to generate a
statistical summary of the numerical attributes, including count, mean, standard
deviation, minimum, quartiles, and maximum values.
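The four inspection calls above can be sketched on a hypothetical miniature frame (the rows below are illustrative, not taken from the real dataset):

```python
import pandas as pd

# Miniature stand-in for the housing dataframe.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 3.1, 2.5],
    "median_house_value": [452600.0, 358500.0, 178100.0, 120400.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

print(housing.head())                              # first five rows
housing.info()                                     # dtypes, non-null counts
print(housing["ocean_proximity"].value_counts())   # districts per category
print(housing.describe())                          # count, mean, std, min, quartiles, max
```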
e) Data Visualization: Various visualizations were created to gain insights into the
dataset, including:

 Histogram Plot: Histograms were generated to visualize the distributions of
the numerical attributes in the housing dataframe.
 Scatter Plot: A scatter plot was created between the "longitude" and "latitude"
variables, with the alpha parameter set to 0.1. The size of each circle represented
the district's population, and the color represented the price.
 Correlation Analysis: The correlation matrix was computed using the .corr()
method to explore the relationships between all continuous numeric variables. A
heatmap plot was generated using the seaborn library to visualize the
correlations.
 Scatter Matrix: The pandas scatter_matrix() function was used to examine
pairwise correlations between attributes. A seaborn pairplot, color-coded by
the "ocean_proximity" category, was also produced.
 Scatter Plot: A scatter plot was created between the "median_income" and
"median_house_value" variables to explore their relationship.
 Box Plot: A box plot was generated to show the relationship between the
"median_house_value" and the categorical feature "ocean_proximity".
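Two of the plots above, the geographic scatter plot and the correlation heatmap, can be sketched as follows. The data is a tiny made-up frame, and the headless matplotlib backend is used only so the sketch runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative values, not from the real dataset.
housing = pd.DataFrame({
    "longitude": [-122.2, -122.1, -118.3, -117.9],
    "latitude": [37.9, 37.8, 34.1, 33.9],
    "population": [322.0, 2401.0, 496.0, 558.0],
    "median_income": [8.3, 7.2, 3.1, 2.5],
    "median_house_value": [452600.0, 358500.0, 178100.0, 120400.0],
})

# Geographic scatter: circle size ~ population, color ~ price, alpha=0.1.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1,
             s=housing["population"] / 10, c="median_house_value",
             cmap="jet", colorbar=True)

# Correlation matrix over the continuous numeric variables, as a heatmap.
corr = housing.corr()
sns.heatmap(corr, annot=True)
plt.close("all")
```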
3. Data Preprocessing:
a) Data Cleaning: The dataset was examined for missing values, and it was found that
the "total_bedrooms" attribute had some missing values. The missing values were filled
with the median value using the fillna() method.
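The median imputation described above can be sketched as follows (the bedroom counts are hypothetical):

```python
import numpy as np
import pandas as pd

# Stand-in column with two missing entries.
housing = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, np.nan]})

# Fill missing values with the column median.
median = housing["total_bedrooms"].median()  # 159.5 here
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
print(housing["total_bedrooms"].isna().sum())  # 0
```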
b) Handling Zeros: It was verified that the dataset contained no zero values
that might represent disguised missing data.
c) Attribute Combinations: New attributes were created by combining existing ones,
namely "rooms_per_household" (derived from "total_rooms" and "households"),
"bedrooms_per_room" (derived from "total_bedrooms" and "total_rooms"), and
"population_per_household" (derived from "population" and "households").
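The three derived attributes are simple ratios of existing columns; a sketch on two hypothetical districts:

```python
import pandas as pd

# Two illustrative districts.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

# Combined attributes as described in the report.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
```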
d) Handling Text and Categorical Attributes: The categorical feature "ocean_proximity"
was handled by creating a separate variable called "housing_cat" and using the
OneHotEncoder from sklearn to encode the categorical values.
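One-hot encoding of the category column can be sketched as follows (the three sample categories are illustrative; the encoder returns a sparse matrix by default, converted to a dense array here for inspection):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the "housing_cat" variable.
housing_cat = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"]})

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat)  # sparse matrix
dense = housing_cat_1hot.toarray()                     # one column per category
print(dense.shape)  # (3, 2)
```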
e) Feature Scaling: Numerical values were scaled using the StandardScaler from sklearn.
The numerical attributes were stored in a variable called "housing_num".
f) Custom Transformers: A custom transformer class called "CombinedAttributesAdder"
was created to add the combined attributes discussed earlier. The transformer was
instantiated as "attr_reader", and the housing values were transformed and saved in a
variable called "housing_extra_attribs".
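The custom transformer can be sketched as below. The column indices are assumptions about the attribute order in the feature array, and the add_bedrooms_per_room switch follows the common pattern for such transformers rather than a detail stated in this report:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed positions of the source columns in the feature array.
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Append the combined ratio attributes to a NumPy feature array."""

    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

# One illustrative district row (values are made up).
X = np.array([[-122.23, 37.88, 8.3252, 880.0, 129.0, 322.0, 126.0]])
housing_extra_attribs = CombinedAttributesAdder().fit_transform(X)
print(housing_extra_attribs.shape)  # (1, 10)
```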
g) Pipeline Creation: Two pipelines were created - "num_pipeline" for numerical
attributes and "full_pipeline" for both numerical and categorical attributes. The
"num_pipeline" included the SimpleImputer, CombinedAttributesAdder, and
StandardScaler transformers.
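The two pipelines can be sketched as follows. The custom CombinedAttributesAdder step is omitted here for brevity, and the use of a ColumnTransformer to join the numerical and categorical branches is an assumption about how "full_pipeline" was built:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Miniature stand-in frame with one missing numeric value.
housing = pd.DataFrame({
    "median_income": [8.3, np.nan, 3.1],
    "total_rooms": [880.0, 7099.0, 1467.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})
num_attribs = ["median_income", "total_rooms"]
cat_attribs = ["ocean_proximity"]

# Numerical pipeline: impute medians, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Full pipeline: numerical branch plus one-hot-encoded categorical branch.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)  # 2 scaled numeric columns + 2 one-hot columns
```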
4. Train-Test Split:
The data was split into training and testing sets using the train_test_split function from
sklearn.model_selection. The random_state parameter was set to 42 to ensure
reproducibility. The predictors and labels were separated into "housing" and
"housing_labels" variables, respectively.
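A sketch of the split on a small hypothetical frame, using the same random_state=42 noted above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten illustrative districts.
housing_full = pd.DataFrame({
    "median_income": np.arange(10, dtype=float),
    "median_house_value": np.arange(10, dtype=float) * 50000,
})

# 80/20 split, reproducible via random_state=42.
train_set, test_set = train_test_split(housing_full, test_size=0.2,
                                       random_state=42)

# Separate predictors from labels on the training set.
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()
print(len(train_set), len(test_set))  # 8 2
```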

The machine learning life cycle involves several additional steps beyond the scope of
this report, such as model selection, training, evaluation, optimization, deployment, and
maintenance. These steps would typically be followed to develop and deploy a machine
learning model based on the given dataset.
