Finalised FBA CIA 3


Fundamentals of Business Analytics CIA 3

Topic: Multiple Linear Regression, Classification and Clustering on Cars

About the dataset:

VARIABLES:

1. Manufacturer: Brand of the car.

2. Model: Specific car model.

3. Vehicle Category: Classification (e.g., Small, Midsize, Large, Van, Sporty).

4. Price:

➢ Minimum Price ($1000): Lowest price for the model.

➢ Midrange Price ($1000): Average price.

➢ Maximum Price ($1000): Highest price.

5. City Mileage (MPG): Fuel efficiency in miles per gallon in the city.

6. Highway Mileage (MPG): Fuel efficiency on the highway.

7. Air Bags Standard: Whether airbags are included (coded numerically).

8. Drive Train Type: Configuration (e.g., Front-Wheel Drive, Rear-Wheel Drive, or
All-Wheel Drive).

9. Number of Cylinders: Number of cylinders in the engine.

10. Engine Size (litres): Engine displacement.

POTENTIAL KEY INSIGHTS:

1. Price Analysis:

➢ Compare prices across manufacturers to identify premium and budget brands.

➢ Analyse the relationship between price range and features like engine size, mileage, and
vehicle category.
2. Fuel Efficiency:

➢ Determine which vehicle categories (e.g., Small, Midsize, Sporty) offer the best city and
highway mileage.

➢ Check if there is a trade-off between engine size and fuel efficiency.

3. Safety Features:

➢ Explore the distribution of airbags across vehicle categories and manufacturers.

4. Performance:

➢ Investigate the relationship between engine size, number of cylinders, and car performance
(e.g., sporty vs. economical).

5. Vehicle Category Trends:

➢ Which categories dominate the market in terms of variety and price points?

➢ How do the prices and features differ between compact, midsize, and large vehicles?

PROBLEM STATEMENT:

Problem Statement for the Cars Dataset

"Optimizing Car Features and Pricing for Competitive Market Positioning"

The automotive industry is highly competitive, with a wide range of vehicles catering to diverse
customer preferences. The dataset provides details on car specifications, pricing, fuel
efficiency, and safety features. The challenge lies in identifying trends and relationships among
these variables to support manufacturers in developing and positioning cars effectively.

Situation 1: Multiple Linear Regression


With horsepower as the dependent variable and the remaining variables as independent
variables, most p-values are not statistically significant; only engine size can be considered
a significant predictor here.

VIF: The Variance Inflation Factor (VIF) is a statistical tool used in regression analysis to
measure the severity of multicollinearity.

Analysis: The VIF is very high in this case, with values exceeding 13 and 14. A VIF greater
than 5 corresponds to an R-squared of 80% or more when a predictor is regressed on the other
predictors, and a VIF greater than 10 corresponds to an R-squared of 90% or more, so the
predictors here are severely multicollinear.
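
The thresholds quoted above follow from the standard definition VIF = 1 / (1 - R-squared), where R-squared comes from regressing one predictor on all the others. The sketch below simply inverts that formula; the function names are illustrative and are not taken from the software used in the report.

```python
def vif_from_r2(r_squared: float) -> float:
    """VIF of a predictor whose regression on the other predictors has this R-squared."""
    return 1.0 / (1.0 - r_squared)


def r2_from_vif(vif: float) -> float:
    """Invert the definition: the R-squared implied by a given VIF."""
    return 1.0 - 1.0 / vif


# Rule-of-thumb thresholds (5 and 10) and the values reported in this analysis (13 and 14).
for vif in (5, 10, 13, 14):
    print(f"VIF = {vif:>2} -> R-squared = {r2_from_vif(vif):.3f}")
```

A VIF of 13 or 14 therefore implies that more than 92% of the variation in that predictor is explained by the other predictors.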

Creation of a correlation matrix:

A correlation matrix is a statistical technique used to evaluate the relationships between
pairs of variables in a data set. The matrix is a table in which every cell contains a
correlation coefficient, where 1 is a perfect positive relationship between variables, 0 is no
linear relationship and -1 is a perfect negative relationship. It is commonly used when
building regression models.
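
As a minimal sketch of how such a matrix is built, the snippet below computes Pearson correlations for every pair of columns using only the standard library. The column names and numbers are made-up illustrative values, not rows from the cars dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict {name: {name: r}} covering every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up illustrative values (hypothetical), not taken from the dataset.
toy = {
    "price": [10.0, 15.0, 20.0, 30.0, 40.0],
    "engine_size": [1.5, 1.8, 2.2, 3.0, 4.5],
    "city_mpg": [33.0, 29.0, 25.0, 20.0, 16.0],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The diagonal is always 1, and the matrix is symmetric, so only the cells above or below the diagonal need to be read.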

Analysis and Interpretation:

It shows the relationship between each pair of variables, including the dependent variable,
price.

Stepwise Regression: First, select Fit Model, place Price in the dependent variable option and
the remaining variables as predictors, and choose the Stepwise option. This gives the
following output.
Adding or removing variables one at a time gives a fair idea of which predictors are more
strongly related to price than the others.

As we see here, the p-values are small, which indicates that the retained predictors are
statistically significant.

The predictor estimates also show significant values.
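
The idea of adding variables one at a time can be sketched as a simple forward selection. This is a simplified stand-in, not the procedure the statistics package runs: it adds the predictor that raises R-squared the most and stops when the gain falls below a threshold, whereas stepwise tools typically use p-value rules. The data, column names and the `min_gain` parameter are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R-squared of a least-squares fit of y on the given columns plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_select(predictors, y, min_gain=0.01):
    """Greedily add the predictor with the largest R-squared gain; stop below min_gain."""
    chosen, best_r2 = [], 0.0
    remaining = set(predictors)
    while remaining:
        scores = {name: r_squared([predictors[c] for c in chosen + [name]], y)
                  for name in sorted(remaining)}
        name = max(scores, key=scores.get)
        if scores[name] - best_r2 < min_gain:
            break
        chosen.append(name)
        best_r2 = scores[name]
        remaining.remove(name)
    return chosen, best_r2


# Made-up illustrative values (hypothetical), not taken from the dataset.
toy_predictors = {
    "engine_size": [1.5, 1.8, 2.2, 3.0, 4.5, 2.0, 3.5, 1.6],
    "cylinders": [4.0, 4.0, 4.0, 6.0, 8.0, 4.0, 6.0, 4.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}
toy_price = [14.2, 16.1, 19.9, 26.3, 37.8, 18.0, 30.1, 14.6]

selected, r2 = forward_select(toy_predictors, toy_price)
print(selected, round(r2, 3))
```

In this toy data the price is almost a linear function of engine size, so the procedure keeps that one predictor and rejects the rest because they add almost nothing once engine size is in the model.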


Situation 2: Classification
1. Bars:

o Each bar represents the proportion of vehicles with a specific number of
cylinders for a particular model.

o The height of a section in the stacked bar corresponds to the relative frequency
of that vehicle model within the specific cylinder category.

2. Color Coding:

o Each color represents a unique vehicle model. These are listed on the right as a
legend, where the names like "Metro," "Accord," "Capri," etc., correspond to
the respective color segments in the bars.
Analysis:

• Cylinder Distribution:

o Most vehicles seem to fall within the 4-cylinder or 6-cylinder categories,
suggesting these are popular engine configurations.

o A smaller proportion exists in the 3-cylinder and 8-cylinder categories, likely
due to fewer models in those segments (e.g., compact cars or high-performance
vehicles).

Key Elements:

1. Partition Tree Structure:

➢ At the top of the tree is "All Rows," which represents the entire dataset (93 rows in total).

➢ The tree splits into two branches based on a set of models grouped by names (e.g., Model
(190E, 300E, Aerostar, etc.) and Model (Town Car, Swift, Sunbird, etc.)).

➢ This first split separates models into two clusters based on their association with certain
manufacturers.

2. Metrics:

➢ RSquare: The value is 0.079, which measures the proportion of variance explained by the
model. This low value suggests that the model captures only a small share of the variance in
the data. Higher values indicate better model performance.

➢ G^2 (Chi-Square Statistic): A value of 606.01588, indicating the degree to which the
observed partition deviates from expectations if there were no relationship between the
predictors and the manufacturer.

➢ Logworth: This is 1.32, derived from the p-value of the Chi-Square test, and it quantifies
the strength of the split.
3. Split Results:

➢ Each split groups rows by model names, indicating how specific vehicle models correlate
with certain manufacturers. These clusters could reflect either brand-specific model naming
conventions or differences in vehicle classifications.

4. Visualization:

➢ The scatter plot at the top shows data points distributed along the manufacturer axis, with
clusters visualized as distinct bands. These bands highlight how well the split separates the
manufacturers.
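
The LogWorth listed under Metrics can be converted back to a p-value, assuming the usual definition LogWorth = -log10(p-value). The sketch below applies that conversion to the reported value of 1.32; the function names are illustrative.

```python
from math import log10


def logworth(p_value: float) -> float:
    """LogWorth under the assumed definition -log10(p-value)."""
    return -log10(p_value)


def p_from_logworth(value: float) -> float:
    """Invert the definition to recover the p-value."""
    return 10 ** (-value)


# LogWorth reported for the first split in this analysis.
p = p_from_logworth(1.32)
print(f"LogWorth 1.32 -> p-value of about {p:.4f}")
```

Under that definition a LogWorth of 2 corresponds to p = 0.01, so 1.32 sits just below the conventional 0.05 level, which is consistent with a split that is only weakly supported.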

Analysis:

➢ Classification Purpose: The decision tree attempts to classify "Manufacturer" based on
the "Model" attribute. Each branch represents a subset of models corresponding to specific
manufacturers.

➢ Model Performance: The RSquare of 0.209 suggests moderate explanatory power,
meaning additional predictors or splits could improve classification accuracy.

➢ Insights:

o Group 1: Manufacturers like Volkswagen, Suzuki, and Mercedes-Benz seem
associated with models in the left cluster.

o Group 2: Manufacturers like Acura, Cadillac, and BMW are tied to models in the
right cluster.

Recommendations for Improvement:

➢ Add more predictive variables (e.g., price, engine type, or vehicle size) to improve the
model's RSquare.

➢ Consider using deeper splits or alternative classification algorithms (e.g., Random Forest)
for better performance.

Step 3: Clustering

The task of grouping data points based on their similarity with each other is called Clustering
or Cluster Analysis. This method is defined under the branch of Unsupervised Learning, which
aims at gaining insights from unlabelled data points, that is, unlike supervised learning we
don’t have a target variable.

Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset.
It evaluates similarity using a metric such as Euclidean distance, cosine similarity or
Manhattan distance, and then groups the most similar points together.
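
The three metrics named above can be written in a few lines each. This is a minimal sketch with made-up two-dimensional points; note that the two distances shrink as points become more similar, while cosine similarity grows towards 1.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up illustrative points (hypothetical), not taken from the dataset.
p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q), manhattan(p, q), round(cosine_similarity(p, q), 4))
```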

Hierarchical Clustering:

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster
is distinct from each other cluster, and the objects within each cluster are broadly similar to
each other.
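
A minimal sketch of the agglomerative (bottom-up) form of hierarchical clustering is given below. It assumes single linkage, where the distance between two clusters is the distance between their closest members; the report does not state which linkage was used, and statistical packages often default to other linkages such as Ward's. The points and labels are made up.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def agglomerate(points, labels):
    """Single-linkage agglomerative clustering.

    Returns the merge history as (members_a, members_b, distance) tuples,
    which is the information a dendrogram draws.
    """
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(([labels[a] for a in clusters[i]],
                       [labels[b] for b in clusters[j]], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


# Made-up illustrative points (hypothetical), not taken from the dataset.
toy_points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.1, 5.3)]
toy_labels = ["A", "B", "C", "D"]

merges = agglomerate(toy_points, toy_labels)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 3))
```

The merge distances are what a dendrogram plots: the closest pairs join first at small distances, and the final merge joins the two remaining groups at the largest distance.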

Analysis and Interpretation:

Key Elements of the Dendrogram:

1. Structure:
➢ The vertical lines represent the joining of clusters, while their height indicates the distance
(or dissimilarity) at which clusters merge.

➢ At the bottom, each leaf corresponds to an individual data point (vehicle model).

➢ The horizontal axis represents the similarity or dissimilarity among clusters.

2. Factors (Manufacturer, Model, and Category):

➢ The clustering likely reflects similarities in vehicle features such as the manufacturer,
model specifications, or vehicle category (e.g., sedan, SUV, etc.).

➢ Similar vehicles are grouped closely together, indicating shared features (e.g., Lexus
models being closer to each other than to other brands).

3. Clusters:

➢ Vehicles are grouped based on their shared attributes. For example:

o Cluster Example: Buick models (e.g., Buick Century, Roadmaster) are grouped
closely, which suggests they share significant similarities (possibly in vehicle type,
design, or manufacturing origin).

o Distinct Clusters: Luxury brands (e.g., Lexus and Cadillac) form separate clusters
from non-luxury or economy brands, such as Ford or Chevrolet.

4. Interpretation:

➢ Larger Clusters: At higher levels (nodes closer to the top of the dendrogram), clusters
combine groups of vehicles based on broader similarities, such as overall vehicle category
(e.g., SUVs vs. sedans).

➢ Smaller Clusters: At lower levels, individual manufacturers or closely related vehicle
models merge first due to their specific similarities.

➢ Example: Acura Integra and Acura Legend are in the same branch, reflecting their likely
shared manufacturer origin and vehicle class.

Observations:

➢ Standardization: Standardizing the columns ensures that attributes measured on different
scales (e.g., vehicle weight vs. engine size) contribute equally to the clustering.
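
Standardization is usually done with z-scores: subtract the column mean and divide by the column standard deviation. The sketch below assumes the sample standard deviation and uses made-up values for the two example attributes.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale a column to mean 0 and sample standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up illustrative values (hypothetical), not taken from the dataset.
weight = [2500.0, 3000.0, 3500.0, 4000.0]   # pounds
engine_size = [1.5, 2.0, 3.0, 4.5]          # litres

z_weight = standardize(weight)
z_engine = standardize(engine_size)
print([round(z, 2) for z in z_weight])
print([round(z, 2) for z in z_engine])
```

Without this step, distances would be dominated by weight simply because its raw numbers are thousands of times larger than engine size.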
Insights:

1. Manufacturer-Based Clustering:

➢ Many clusters group vehicles by manufacturers (e.g., Acura, Chevrolet, Buick), likely due
to brand-specific design and performance traits.

➢ Brands with multiple models are represented as subclusters within larger manufacturer-
based groups.

2. Vehicle Category:

➢ Vehicles with similar categories, such as luxury, compact cars, or SUVs, tend to cluster
together. For example, Cadillac and Lexus, often categorized as luxury vehicles, might
merge at higher levels.

3. Use Cases:

➢ This dendrogram can be used to identify market segments or product similarities.

➢ Automotive manufacturers can analyze how their models compare with competitors or
evaluate potential opportunities for diversification.

Step 4: K Means clustering

K-means clustering assigns each data point to one of K clusters depending on its distance
from the cluster centers. It starts by randomly placing the cluster centroids in the space.
Each data point is then assigned to a cluster based on its distance from that cluster's
centroid. After every point has been assigned, new cluster centroids are computed. This
process runs iteratively until the clusters stabilize. In this analysis we assume that the
number of clusters is given in advance, and each point has to be placed in one of the groups.
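
The loop described above can be sketched in plain Python as follows. This is a minimal illustration, not the routine the statistics package runs: it picks the initial centroids at random from the data points (one common choice), and the `seed` and `max_iter` parameters and the toy points are assumptions added for the example.

```python
import random
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, k, max_iter=100, seed=0):
    """Basic k-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids drawn from the data
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # nothing moved, so the clusters are stable
            break
        centroids = new_centroids
    return labels, centroids


# Made-up (fuel tank capacity, passenger capacity) pairs, not taken from the dataset.
toy_points = [(12.0, 4.0), (13.0, 4.0), (14.0, 5.0),
              (19.0, 6.0), (20.0, 7.0), (21.0, 7.0)]

labels, centroids = k_means(toy_points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, real analyses usually run the algorithm several times and keep the best result; the toy data here is separated clearly enough that any start recovers the same two groups.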
Key Elements in the Analysis

1. Cluster Summary:

➢ Number of Clusters: 3

➢ Cluster Sizes:

o Cluster 1: 44 vehicles

o Cluster 2: 22 vehicles

o Cluster 3: 27 vehicles

2. Cluster Means:

➢ Cluster 1: Lowest fuel tank capacity (13.95) and low passenger capacity (4.59).

➢ Cluster 2: Medium fuel tank capacity (18.96) and the lowest passenger capacity (4.5).

➢ Cluster 3: Highest fuel tank capacity (19.22) and highest passenger capacity (6.37).

3. Cluster Standard Deviations:

➢ This metric (not fully visible) measures the variability of features within each cluster.

4. Biplot Visualization:

➢ Principal Components:

o The biplot reduces the data's dimensionality into two principal components (PC1 and
PC2) for visualization.
➢ Clusters Representation:

o Clusters are shown as ellipses with their centroids marked.

o Cluster 1 (red) appears distinct, with smaller values for both variables.

o Cluster 2 (green) overlaps slightly with Cluster 1 but tends toward higher fuel tank
capacities.

o Cluster 3 (blue) is the most distinct with significantly higher values for both
attributes.

Interpretation of Clusters

1. Cluster 1:

➢ Likely represents smaller, compact vehicles with smaller fuel tanks and fewer passengers
(e.g., sedans or small cars).

➢ Suitable for urban use where efficiency and compactness are priorities.

2. Cluster 2:

➢ May represent mid-sized vehicles, such as crossovers or smaller SUVs, with moderate fuel
capacity and passenger capacity.

➢ Vehicles in this cluster may balance efficiency with functionality.

3. Cluster 3:

➢ Likely includes larger vehicles like full-sized SUVs or vans, designed for larger passenger
loads and higher fuel consumption.

➢ These vehicles may cater to family use or commercial purposes.

Insights and Use Cases

1. Customer Segmentation:

➢ Automakers can use these clusters to identify target markets for different vehicle categories.
➢ E.g., Cluster 1 appeals to urban commuters, while Cluster 3 caters to larger families or
commercial fleets.

2. Optimization:

➢ Manufacturers can analyze fuel efficiency within each cluster to optimize vehicle designs.

➢ Opportunities to develop hybrid or electric alternatives in Cluster 3 could address
environmental concerns.

3. Market Trends:

➢ Comparing the cluster sizes provides insight into market preferences:

o A larger Cluster 1 suggests higher demand for smaller, efficient vehicles.

Recommendations:

➢ Further Analysis: Add additional features (e.g., vehicle weight, price) to improve
clustering accuracy and relevance.

➢ Cluster Refinement: Evaluate whether additional clusters or other clustering algorithms
(e.g., DBSCAN or hierarchical) improve separation.

CONCLUSION:

Conclusion for the Cars Dataset Analysis

The analysis of the cars dataset offers valuable insights into the automotive industry, helping
manufacturers optimize their product offerings and pricing strategies. Key findings include:

1. Pricing Optimization:

➢ Vehicle price is influenced by several factors, including engine size, number of cylinders,
vehicle category, and safety features. Understanding these relationships helps
manufacturers target different market segments effectively.

2. Fuel Efficiency Insights:

➢ Smaller cars or vehicles with fewer cylinders tend to have better fuel efficiency. This
information can guide manufacturers in designing fuel-efficient models for
environmentally conscious consumers.
3. Safety Features as a Differentiator:

➢ The inclusion of safety features like airbags varies across vehicle categories. Highlighting
these features in marketing can appeal to safety-conscious buyers, especially in premium
segments.

4. Vehicle Category Trends:

➢ Categories like "Compact" and "Midsize" may dominate in terms of affordability and
demand. Larger vehicles, such as vans and SUVs, cater to families or utility-focused
buyers, commanding higher prices.

5. Market Trends and Niche Opportunities:

➢ Unique trends or outliers in the dataset, such as high-performance cars with strong pricing
and mileage balance, may indicate niche market opportunities for luxury or sporty vehicles.
