Note
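import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import r2_score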
df = df.drop_duplicates()
def convert_new_price(price):
else:
return float(price_str)
current_year = pd.Timestamp.now().year
plt.title()
plt.xlabel()
plt.ylabel()
fuel_counts = df['Fuel_Type'].value_counts()
plt.xlabel('Fuel Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
summary_table = df.groupby('Fuel_Type').agg({
}).reset_index()
km_driven_by_location = df.groupby('Location')['Kilometers_Driven'].sum().reset_index()
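# Strip the unit suffixes, then convert to numbers (unparseable values become NaN)
df['Mileage'] = pd.to_numeric(
    df['Mileage'].astype(str)
    .str.replace(' km/kg', '', regex=False)
    .str.replace(' kmpl', '', regex=False),
    errors='coerce')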
df['Mileage'] = pd.to_numeric(df['Mileage'], errors='coerce')
# Heat map of correlations between numeric variables
correlation = df[['Engine', 'Power', 'Mileage', 'Price']].corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap (Numeric Variables)')
plt.show()
X = df[features]
y = df['Price']
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
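The regression fragment above uses `features`, `X_train`, and `X_test` without showing where they come from. The following is a minimal sketch of the missing steps, not the original code: the feature list, the 80/20 split, the `random_state`, and the synthetic `demo_df` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned used-car data (values are made up).
rng = np.random.default_rng(0)
n = 400
demo_df = pd.DataFrame({
    "Engine": rng.uniform(800, 3000, n),
    "Power": rng.uniform(40, 250, n),
    "Mileage": rng.uniform(10, 28, n),
})
demo_df["Price"] = (0.002 * demo_df["Engine"] + 0.05 * demo_df["Power"]
                    - 0.1 * demo_df["Mileage"] + rng.normal(0, 0.5, n))

features = ["Engine", "Power", "Mileage"]  # assumed feature list
X = demo_df[features]
y = demo_df["Price"]

# Hold out 20% of the rows so R^2 is measured on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R^2 on the held-out set: {r2:.3f}")
```

Rows with NaN in any feature would need to be dropped or imputed before fitting, since `LinearRegression` does not accept missing values.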
# Normalize features
scaler = StandardScaler()
X_kmeans_scaled = scaler.fit_transform(X_kmeans)
inertia = []
kmeans.fit(X_kmeans_scaled)
inertia.append(kmeans.inertia_)
plt.figure(figsize=(8, 5))
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
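The elbow-method fragment above omits the loop that fits one model per candidate cluster count and the call that draws the curve. A minimal sketch of how those pieces could fit together is shown here; the search range `range(1, 8)`, `n_init=10`, `random_state=42`, and the synthetic `X_kmeans` are assumptions, not values taken from the original.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_kmeans: three well-separated blobs (made-up data).
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X_kmeans = np.vstack([c + rng.normal(0, 1, size=(50, 2)) for c in centers])

# Normalize features so no single column dominates the distance metric.
X_kmeans_scaled = StandardScaler().fit_transform(X_kmeans)

inertia = []
k_values = range(1, 8)  # assumed search range
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_kmeans_scaled)
    inertia.append(kmeans.inertia_)  # within-cluster sum of squared distances

plt.figure(figsize=(8, 5))
plt.plot(list(k_values), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
```

The "elbow" is the point after which adding clusters reduces inertia only slightly. Call `plt.show()` to display the figure when running outside a notebook.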
optimal_clusters = 3
df['Cluster'] = kmeans.fit_predict(X_kmeans_scaled)
# Visualizing clusters (using first two features for simplicity)
plt.figure(figsize=(8, 5))
plt.xlabel(features_kmeans[0]) # Kilometers_Driven
plt.ylabel(features_kmeans[1]) # Car_Age
plt.show()
# K-Means Clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
df['Cluster'] = kmeans.fit_predict(X_scaled)
plt.figure(figsize=(10, 5))
plt.show()
# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)
# Logistic Regression
df['Price_Category'] = (df['Price'] > df['Price'].median()).astype(int)
y = df['Price_Category']
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
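The logistic regression fragment above turns price into a binary label by splitting at the median, but the train/test split and the evaluation are not shown. The sketch below fills in one plausible version; the features (`Power`, `Car_Age`), the scaling step, the split, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the used-car data (values are made up).
rng = np.random.default_rng(0)
n = 400
demo_df = pd.DataFrame({
    "Power": rng.uniform(40, 250, n),
    "Car_Age": rng.uniform(0, 15, n),
})
demo_df["Price"] = (0.05 * demo_df["Power"] - 0.4 * demo_df["Car_Age"]
                    + rng.normal(0, 0.5, n))

# 1 = priced above the median, 0 = at or below it (balanced classes by construction).
demo_df["Price_Category"] = (demo_df["Price"] > demo_df["Price"].median()).astype(int)

X = demo_df[["Power", "Car_Age"]]  # assumed feature list
y = demo_df["Price_Category"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling inside a pipeline keeps test-set statistics out of the fit.
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the held-out set: {accuracy:.3f}")
```

Splitting at the median gives two classes of roughly equal size, so plain accuracy is a reasonable first metric here.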
1. Pair Plot
Visualization: A pair plot allows you to visualize relationships between multiple numerical
features at once, showing scatter plots for each pair of variables.
Explanation: This plot helps you identify correlations and relationships between features,
such as whether higher engine power is associated with higher prices or if older cars tend to
have lower prices.
2. Violin Plot
Visualization: A violin plot can show the distribution of prices across different categories
(like Fuel_Type).
plt.figure(figsize=(10, 6))
sns.violinplot(x='Fuel_Type', y='Price', data=df, inner='quartile')
plt.title('Price Distribution by Fuel Type')
plt.show()
Explanation: This plot provides insight into how prices are distributed for different fuel
types, revealing the range, the quartiles (drawn inside each violin by inner='quartile'), and
the shape of each distribution, including long tails that suggest outliers.
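The quartiles that a violin plot draws can also be read off numerically. The sketch below computes them per fuel type with `groupby(...).quantile(...)`; the small `demo_df` is made-up data, and the column labels `Q1`, `Median`, `Q3` are names chosen here for readability.

```python
import pandas as pd

# Tiny made-up sample: four prices per fuel type.
demo_df = pd.DataFrame({
    "Fuel_Type": ["Petrol"] * 4 + ["Diesel"] * 4,
    "Price": [2.0, 3.0, 4.0, 5.0, 4.0, 6.0, 8.0, 10.0],
})

# One row per fuel type, one column per quantile.
quartiles = (demo_df.groupby("Fuel_Type")["Price"]
             .quantile([0.25, 0.5, 0.75])
             .unstack())
quartiles.columns = ["Q1", "Median", "Q3"]
print(quartiles)
```

Comparing the `Median` column across rows gives the same ordering that the middle quartile line shows in each violin.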
Visualization: A box plot can show how prices are distributed across owner types
(Owner_Type).
plt.figure(figsize=(10, 6))
sns.boxplot(x='Owner_Type', y='Price', data=df)
plt.title('Price Distribution by Owner Type')
plt.show()
Explanation: This visualization helps understand how the price varies with the number of
previous owners. It can show which owner types tend to have higher or lower prices,
revealing patterns in the used car market.
Visualization: A heatmap showing the average price for combinations of Fuel_Type and
Transmission can illustrate how these features interact.
price_heatmap = df.groupby(['Fuel_Type', 'Transmission'])['Price'].mean().unstack()
plt.figure(figsize=(10, 6))
sns.heatmap(price_heatmap, annot=True, cmap='YlGnBu')
plt.title('Average Price by Fuel Type and Transmission')
plt.ylabel('Fuel Type')
plt.xlabel('Transmission')
plt.show()
Explanation: This heatmap highlights how average prices vary with different combinations
of fuel types and transmission types, allowing for quick comparisons between categories.
Visualization: A scatter plot of Price vs. Engine with a regression line can visualize
relationships.
plt.figure(figsize=(10, 6))
sns.regplot(x='Engine', y='Price', data=df)
plt.title('Price vs. Engine Size with Regression Line')
plt.show()
Explanation: This plot visualizes the relationship between engine size and price, showing
trends and potential outliers. The slope of the fitted line indicates whether larger engines
are associated with higher prices; it does not by itself show that engine size causes the
price difference.
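The trend that the regression line draws can be quantified with a least-squares fit. This is a minimal sketch using `np.polyfit` and the Pearson correlation; the synthetic `demo_df` and its built-in slope of 0.005 are made up for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: price rises linearly with engine size plus noise (made-up values).
rng = np.random.default_rng(1)
engine = rng.uniform(800, 3000, 300)
price = 0.005 * engine + rng.normal(0, 1.0, 300)
demo_df = pd.DataFrame({"Engine": engine, "Price": price})

# Drop missing values first; polyfit cannot handle NaN.
clean = demo_df[["Engine", "Price"]].dropna()

# Degree-1 fit returns (slope, intercept) of the least-squares line.
slope, intercept = np.polyfit(clean["Engine"], clean["Price"], deg=1)
r = clean["Engine"].corr(clean["Price"])
print(f"slope = {slope:.4f} price units per unit of Engine, r = {r:.2f}")
```

A positive slope with a correlation close to 1 corresponds to a tight upward band of points around the line in the plot.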
Visualization: A bar chart showing average mileage by fuel type can highlight efficiency
differences.
average_mileage = df.groupby('Fuel_Type')['Mileage'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(x='Fuel_Type', y='Mileage', data=average_mileage)
plt.title('Average Mileage by Fuel Type')
plt.show()
Explanation: This visualization can provide insights into which fuel types tend to be more
fuel-efficient, potentially influencing purchasing decisions.
Visualization: An empirical cumulative distribution function (ECDF) shows the proportion of
cars in the dataset whose price is less than or equal to a given value.
plt.figure(figsize=(10, 6))
sns.ecdfplot(df['Price'])
plt.title('Cumulative Distribution Function of Car Prices')
plt.xlabel('Price')
plt.ylabel('Cumulative Probability')
plt.show()
Explanation: This plot helps to understand the distribution of car prices across the dataset,
indicating what percentage of cars are priced below a certain threshold.
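The value that the curve shows at any point can be computed directly: the height at a threshold is the fraction of prices at or below it. A small sketch follows, with made-up prices and an arbitrary threshold chosen for illustration.

```python
import pandas as pd

# Made-up prices; with the real data this would be df['Price'].dropna().
prices = pd.Series([2.5, 3.0, 4.5, 6.0, 7.5, 9.0, 12.0, 15.0, 20.0, 45.0],
                   name="Price")

threshold = 10.0  # arbitrary example threshold

# The mean of a boolean mask is the share of True values, i.e. ECDF(threshold).
share_below = (prices <= threshold).mean()

# Reading the curve the other way: the price at cumulative probability 0.5.
median_price = prices.quantile(0.5)

print(f"{share_below:.0%} of cars cost at most {threshold}")
print(f"Median price: {median_price}")
```

Here 6 of the 10 prices are at or below the threshold, so the ECDF at 10.0 is 0.6.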