Finalised FBA CIA 3
VARIABLES:
4. Price:
5. City Mileage (MPG): Fuel efficiency in miles per gallon in the city.
8. Drive Train Type: Configuration (e.g., Front-Wheel Drive, Rear-Wheel Drive, or All-
Wheel Drive).
1. Price Analysis:
➢ Analyse the relationship between price range and features like engine size, mileage, and
vehicle category.
2. Fuel Efficiency:
➢ Determine which vehicle categories (e.g., Small, Midsize, Sporty) offer the best city and
highway mileage.
3. Safety Features:
4. Performance:
➢ Investigate the relationship between engine size, number of cylinders, and car performance
(e.g., sporty vs. economical).
➢ Which categories dominate the market in terms of variety and price points?
➢ How do the prices and features differ between compact, midsize, and large vehicles?
PROBLEM STATEMENT:
The automotive industry is highly competitive, with a wide range of vehicles catering to diverse
customer preferences. The dataset provides details on car specifications, pricing, fuel
efficiency, and safety features. The challenge lies in identifying trends and relationships among
these variables to support manufacturers in developing and positioning cars effectively.
VIF: The Variance Inflation Factor (VIF) is a statistical tool used in regression analysis to
measure the severity of multicollinearity among the predictors.
Analysis: The VIF is very high in this case, with values exceeding 13 and 14. A VIF greater
than 5 means that 80% or more of a predictor's variance is explained by the other predictors
(R-squared of at least 0.80), and a VIF greater than 10 means 90% or more (R-squared of at
least 0.90), so the predictors here are strongly collinear.
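The thresholds above follow from the definition VIF = 1 / (1 - R-squared), where R-squared comes
from regressing one predictor on all the others. A minimal sketch of that conversion; the
function names are illustrative and not part of the original analysis:

```python
def vif_from_r2(r2):
    """VIF of a predictor whose regression on the other predictors has this R-squared."""
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Inverse relation: the R-squared implied by a given VIF."""
    return 1.0 - 1.0 / vif


# Rule-of-thumb thresholds mentioned in the text.
print(round(vif_from_r2(0.80), 2))   # VIF at R-squared = 0.80
print(round(vif_from_r2(0.90), 2))   # VIF at R-squared = 0.90

# R-squared implied by the VIF values reported in this analysis (13 and 14).
for vif in (13, 14):
    print(vif, round(r2_from_vif(vif), 3))
```

Read this way, VIF values of 13 and 14 imply that more than 92% of the variance of those
predictors is already explained by the remaining predictors.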
A correlation matrix is a table used to evaluate the relationships between pairs of variables
in a data set. Every cell contains a correlation coefficient, where 1 indicates a perfect
positive linear relationship, 0 indicates no linear relationship, and -1 indicates a perfect
negative linear relationship. It is most commonly used when building regression models.
Here it shows the relationship between each pair of variables, including the dependent
variable, Price.
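As an illustration of how such a matrix is built, the sketch below computes Pearson correlation
coefficients for every pair of columns. The column names and values are hypothetical
placeholders, not figures taken from the dataset used in this report:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical example values, for illustration only.
cars = {
    "price": [15.9, 33.9, 29.1, 37.7, 30.0, 15.7],
    "engine_size": [1.8, 3.2, 2.8, 2.8, 3.5, 2.2],
    "city_mpg": [25, 18, 20, 19, 22, 22],
}

for name, row in correlation_matrix(cars).items():
    print(name, [round(value, 2) for value in row.values()])
```

The diagonal is always 1 because every variable is perfectly correlated with itself, and the
matrix is symmetric, so only one triangle needs to be read.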
Stepwise Regression: First, select Fit Model, place Price in the dependent variable role, and
add the remaining variables as predictors. Choosing the Stepwise option then gives the
following output.
Adding or removing variables one at a time gives a fair idea of which predictors contribute
more to explaining Price than the others.
As we see here, the p-values are small, which indicates that the retained predictors are
statistically significant.
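The idea of adding variables one at a time can be sketched as a simple forward selection. This
is not the procedure of the statistical package used above: as a stated simplification it adds
the predictor that raises R-squared the most and stops when the gain falls below a hypothetical
threshold (min_gain), instead of testing p-values. The data values are illustrative only:

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(columns, y):
    """R-squared of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(rows[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / n
    rss = sum((y[i] - sum(b * v for b, v in zip(beta, rows[i]))) ** 2 for i in range(n))
    tss = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - rss / tss


def forward_select(data, target, min_gain=0.01):
    """Greedily add the predictor with the largest R-squared gain until the gain is small."""
    y = data[target]
    remaining = [name for name in data if name != target]
    selected, best_r2 = [], 0.0
    while remaining:
        scores = {
            name: r_squared([data[s] for s in selected + [name]], y) for name in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_r2 < min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_r2 = scores[candidate]
    return selected, best_r2


# Hypothetical example values, for illustration only.
cars = {
    "price": [15.9, 33.9, 29.1, 37.7, 30.0, 15.7, 20.8, 23.7],
    "engine_size": [1.8, 3.2, 2.8, 2.8, 3.5, 2.2, 3.8, 5.7],
    "horsepower": [140, 200, 172, 172, 208, 110, 170, 180],
    "city_mpg": [25, 18, 20, 19, 22, 22, 19, 16],
}
print(forward_select(cars, "price"))
```

A full stepwise procedure also considers removing variables that were added earlier; the sketch
covers only the forward direction.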
o The height of a section in the stacked bar corresponds to the relative frequency
of that vehicle model within the specific cylinder category.
2. Color Coding:
o Each color represents a unique vehicle model. These are listed on the right as a
legend, where the names like "Metro," "Accord," "Capri," etc., correspond to
the respective color segments in the bars.
Analysis:
• Cylinder Distribution:
Key Elements:
➢ At the top of the tree is "All Rows," which represents the entire dataset (93 rows in total).
➢ The tree splits into two branches based on a set of models grouped by names (e.g., Model
(190E, 300E, Aerostar, etc.) and Model (Town Car, Swift, Sunbird, etc.)).
➢ This first split separates models into two clusters based on their association with certain
manufacturers.
2. Metrics:
➢ RSquare: The value is 0.079, which measures the proportion of variance explained by the
model. This value means the model captures only about 8% of the variance in the data, so
the fit is weak. Higher values indicate better model performance.
➢ G^2 (Chi-Square Statistic): A value of 606.01588, indicating the degree to which the
observed partition deviates from expectations if there were no relationship between the
predictors and the manufacturer.
➢ Logworth: This is 1.32, the negative base-10 logarithm of the p-value of the Chi-Square
test, and it quantifies the strength of the split (larger values mean a stronger split).
3. Split Results:
➢ Each split groups rows by model names, indicating how specific vehicle models correlate
with certain manufacturers. These clusters could reflect either brand-specific model naming
conventions or differences in vehicle classifications.
4. Visualization:
➢ The scatter plot at the top shows data points distributed along the manufacturer axis, with
clusters visualized as distinct bands. These bands highlight how well the split separates the
manufacturers.
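The Logworth reported under Metrics can be converted back to a p-value, which is easier to
compare with the usual significance levels. A minimal sketch, assuming the standard definition
Logworth = -log10(p-value); the function names are illustrative:

```python
import math


def logworth(p_value):
    """Logworth of a p-value: -log10(p)."""
    return -math.log10(p_value)


def p_value_from_logworth(value):
    """Inverse relation: the p-value implied by a given Logworth."""
    return 10.0 ** (-value)


# The split in this analysis has a Logworth of 1.32.
print(round(p_value_from_logworth(1.32), 4))

# For reference, the Logworth of the common 0.05 and 0.01 significance levels.
print(round(logworth(0.05), 2), round(logworth(0.01), 2))
```

A Logworth of 1.32 corresponds to a p-value just below 0.05, so the split is only marginally
significant, which is consistent with the low RSquare.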
Analysis:
➢ Insights:
o Group 2: Manufacturers like Acura, Cadillac, and BMW are tied to models in the
right cluster.
➢ Add more predictive variables (e.g., price, engine type, or vehicle size) to improve the
model's RSquare.
➢ Consider using deeper splits or alternative classification algorithms (e.g., Random Forest)
for better performance.
Step 3: Clustering
The task of grouping data points based on their similarity to each other is called Clustering
or Cluster Analysis. This method belongs to Unsupervised Learning, which aims at gaining
insights from unlabelled data points; that is, unlike supervised learning, there is no target
variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset.
It evaluates similarity using a metric such as Euclidean distance, cosine similarity, or
Manhattan distance, and then groups the most similar points together.
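The three metrics named above can be written in a few lines each. The sketch below is
illustrative; the two example vectors are hypothetical (fuel tank capacity, passenger capacity)
pairs rather than rows from the dataset:

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vehicles described by (fuel tank capacity, passenger capacity).
car_a = (13.2, 5)
car_b = (20.0, 7)
print(euclidean(car_a, car_b), manhattan(car_a, car_b), cosine_similarity(car_a, car_b))
```

For the two distances, smaller values mean more similar points; for cosine similarity, larger
values do. Because the distances depend on the units of each feature, features on very different
scales are usually standardised before clustering.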
Hierarchical Clustering:
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster
is distinct from every other cluster, and the objects within each cluster are broadly similar to
each other.
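One common way to build such a hierarchy is agglomerative clustering: start with every object
in its own cluster and repeatedly merge the two closest clusters. The sketch below assumes
single linkage (cluster distance is the distance between the closest pair of members) and
Euclidean distance; the report does not state which linkage was used, and the data points are
hypothetical:

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points):
    """Agglomerative clustering; returns the merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    euclidean(points[p], points[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical vehicles described by (fuel tank capacity, passenger capacity).
vehicles = [(13.2, 5), (13.9, 5), (18.0, 5), (20.0, 7), (21.0, 7)]
for members_a, members_b, distance in single_linkage(vehicles):
    print(members_a, members_b, round(distance, 2))
```

Each printed merge corresponds to one join in a dendrogram, and the merge distance is the
height at which that join is drawn.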
1. Structure:
➢ The vertical lines represent the joining of clusters, while their height indicates the distance
(or dissimilarity) at which clusters merge.
➢ At the bottom, each leaf corresponds to an individual data point (vehicle model).
➢ The clustering likely reflects similarities in vehicle features such as the manufacturer,
model specifications, or vehicle category (e.g., sedan, SUV, etc.).
➢ Similar vehicles are grouped closely together, indicating shared features (e.g., Lexus
models being closer to each other than to other brands).
3. Clusters:
o Cluster Example: Buick models (e.g., Buick Century, Roadmaster) are grouped
closely, which suggests they share significant similarities (possibly in vehicle type,
design, or manufacturing origin).
o Distinct Clusters: Luxury brands (e.g., Lexus and Cadillac) form separate clusters
from non-luxury or economy brands, such as Ford or Chevrolet.
4. Interpretation:
➢ Larger Clusters: At higher levels (nodes closer to the top of the dendrogram), clusters
combine groups of vehicles based on broader similarities, such as overall vehicle category
(e.g., SUVs vs. sedans).
➢ Example: Acura Integra and Acura Legend are in the same branch, reflecting their likely
shared manufacturer origin and vehicle class.
Observations:
1. Manufacturer-Based Clustering:
➢ Many clusters group vehicles by manufacturers (e.g., Acura, Chevrolet, Buick), likely due
to brand-specific design and performance traits.
➢ Brands with multiple models are represented as subclusters within larger manufacturer-
based groups.
2. Vehicle Category:
➢ Vehicles with similar categories, such as luxury, compact cars, or SUVs, tend to cluster
together. For example, Cadillac and Lexus, often categorized as luxury vehicles, might
merge at higher levels.
3. Use Cases:
➢ Automotive manufacturers can analyze how their models compare with competitors or
evaluate potential opportunities for diversification.
K-means clustering assigns each data point to one of K clusters depending on its distance
from the cluster centres. It starts by randomly placing the cluster centroids in the space.
Each data point is then assigned to the cluster whose centroid is nearest. After every point
has been assigned, each centroid is recomputed as the mean of the points in its cluster. This
process runs iteratively until the assignments stop changing. In the analysis we assume that
the number of clusters is given in advance, and each point has to be placed in one of the groups.
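The iteration described above can be sketched as follows. This is an illustrative
implementation, not the routine of the statistical package used for the analysis; the seed,
the iteration cap, and the data points are assumptions made for the example:

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance (enough for finding the nearest centroid)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids chosen from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical vehicles described by (fuel tank capacity, passenger capacity).
vehicles = [(13.2, 5), (18.0, 5), (20.0, 7), (14.5, 4),
            (19.5, 6), (12.4, 4), (18.5, 4), (21.0, 7)]
labels, centroids = kmeans(vehicles, k=3)
print(labels)
print(centroids)
```

The returned centroids are the per-cluster means of each feature. Because the starting
centroids are random, different seeds can give different clusterings, so results are usually
compared across several starts.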
Key Elements in the Analysis
1. Cluster Summary:
➢ Number of Clusters: 3
➢ Cluster Sizes:
o Cluster 1: 44 vehicles
o Cluster 2: 22 vehicles
o Cluster 3: 27 vehicles
2. Cluster Means:
➢ Cluster 1: Lowest fuel tank capacity (13.95) and a low passenger capacity (4.59).
➢ Cluster 2: Medium fuel tank capacity (18.96) and the lowest passenger capacity (4.5).
➢ Cluster 3: Highest fuel tank capacity (19.22) and highest passenger capacity (6.37).
➢ This metric (not fully visible) measures the variability of features within each cluster.
4. Biplot Visualization:
➢ Principal Components:
o The biplot reduces the data's dimensionality into two principal components (PC1 and
PC2) for visualization.
➢ Clusters Representation:
o Cluster 1 (red) appears distinct, with smaller values for both variables.
o Cluster 2 (green) overlaps slightly with Cluster 1 but tends toward higher fuel tank
capacities.
o Cluster 3 (blue) is the most distinct with significantly higher values for both
attributes.
Interpretation of Clusters
1. Cluster 1:
➢ Likely represents smaller, compact vehicles with smaller fuel tanks and fewer passengers
(e.g., sedans or small cars).
➢ Suitable for urban use where efficiency and compactness are priorities.
2. Cluster 2:
➢ May represent mid-sized vehicles, such as crossovers or smaller SUVs: fuel capacity (18.96)
is close to that of Cluster 3, while passenger capacity (4.5) is the lowest of the three clusters.
3. Cluster 3:
➢ Likely includes larger vehicles like full-sized SUVs or vans, designed for larger passenger
loads and higher fuel consumption.
1. Customer Segmentation:
➢ Automakers can use these clusters to identify target markets for different vehicle categories.
➢ E.g., Cluster 1 appeals to urban commuters, while Cluster 3 caters to larger families or
commercial fleets.
2. Optimization:
➢ Manufacturers can analyze fuel efficiency within each cluster to optimize vehicle designs.
3. Market Trends:
Recommendations:
➢ Further Analysis: Add further features (e.g., vehicle weight, price) to improve
clustering accuracy and relevance.
CONCLUSION:
The analysis of the cars dataset offers valuable insights into the automotive industry, helping
manufacturers optimize their product offerings and pricing strategies. Key findings include:
1. Pricing Optimization:
➢ Vehicle price is influenced by several factors, including engine size, number of cylinders,
vehicle category, and safety features. Understanding these relationships helps
manufacturers target different market segments effectively.
➢ Smaller cars or vehicles with fewer cylinders tend to have better fuel efficiency. This
information can guide manufacturers in designing fuel-efficient models for
environmentally conscious consumers.
3. Safety Features as a Differentiator:
➢ The inclusion of safety features like airbags varies across vehicle categories. Highlighting
these features in marketing can appeal to safety-conscious buyers, especially in premium
segments.
➢ Categories like "Compact" and "Midsize" may dominate in terms of affordability and
demand. Larger vehicles, such as vans and SUVs, cater to families or utility-focused
buyers, commanding higher prices.
➢ Unique trends or outliers in the dataset, such as high-performance cars with strong pricing
and mileage balance, may indicate niche market opportunities for luxury or sporty vehicles.