Rithika Content
INTRODUCTION
At its core, the dataset likely includes both numerical and categorical variables, such as
product prices, discounts, transaction dates, and store locations. These variables are essential
for understanding how different factors, such as pricing strategies or seasonal promotions,
impact customer purchasing decisions. For instance, analyzing temporal data can help
retailers identify peak shopping times, while store-level data can reveal regional sales
patterns and product preferences. By drilling down into the data, retailers can better
understand which products perform best under certain conditions and adjust their inventory or
marketing strategies accordingly.
The dataset can also serve as a foundation for advanced analytics and predictive modeling.
Retailers can apply machine learning techniques to forecast future sales, predict customer
churn, and optimize pricing strategies. Additionally, customer segmentation models can be
built to target specific customer groups more effectively, allowing for more personalized
marketing and improved customer retention. Overall, this dataset offers retailers a wealth of
actionable insights to enhance their operations, boost sales, and improve customer
satisfaction.
1.1 SYNOPSIS
With a variety of numerical and categorical variables, such as product prices, transaction
dates, store locations, and payment methods, this dataset enables in-depth analysis of sales
trends across different dimensions. For example, by analysing the transaction timestamps,
retailers can identify high-traffic periods and optimize staffing and inventory levels.
Similarly, examining product categories and price data can help assess the effectiveness of
pricing strategies and promotions, guiding future discount campaigns and pricing models.
Geographic data, if included, can reveal regional preferences, allowing for tailored marketing
strategies based on local demand.
In addition to descriptive analysis, this dataset can be used for predictive modeling and
advanced analytics. Machine learning algorithms can forecast future sales trends, identify
potential customer churn, and optimize inventory levels based on past sales data. Customer
segmentation models can help businesses better understand their audience, leading to more
personalized marketing efforts that enhance customer engagement and loyalty. By leveraging
these insights, retailers can make data-driven decisions that improve operational efficiency
and drive profitability.
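As an illustration of the segmentation idea above, here is a minimal sketch of clustering customers with k-means; the column names, toy values, and cluster count are assumptions for demonstration, not part of the original dataset:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy customer summary (columns are hypothetical stand-ins)
customers = pd.DataFrame({
    'total_spend': [50, 55, 60, 500, 520, 510],
    'n_transactions': [2, 3, 2, 20, 22, 21],
})

# Two segments: low-spend and high-spend customers
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
customers['segment'] = kmeans.fit_predict(customers)

print(customers)
```

Each customer receives a segment label, which a marketing team could then map to targeted campaigns.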
1.2 DATA OVERVIEW
This dataset captures key details of customer purchases, including transaction IDs, product lists, total costs, payment methods, and store locations. It provides insights into customer behavior, sales trends, and the impact of discounts and promotions across various stores and regions.
df.drop_duplicates(inplace=True)
print(df.isnull().sum())
# Remove outliers in total sales using the IQR rule
Q1 = df['total_sales'].quantile(0.25)
Q3 = df['total_sales'].quantile(0.75)
IQR = Q3 - Q1
df_cleaned = df[(df['total_sales'] >= Q1 - 1.5 * IQR) & (df['total_sales'] <= Q3 + 1.5 * IQR)]
print(df_cleaned.info())
In data conversion, the goal is to ensure that all columns in the dataset have the correct data types and are formatted properly. The most common conversions include converting date columns to datetime format, converting categorical columns to categorical types (or encoding them if needed), and ensuring numerical columns are properly typed as int or float.
df['Date'] = pd.to_datetime(df['Date'])
df.dtypes
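The remaining two conversions mentioned above can be sketched as follows; the column names (Payment_Method, Total_Cost) are assumed from the data overview:

```python
import pandas as pd

# Toy frame standing in for the retail data
df = pd.DataFrame({
    'Payment_Method': ['Cash', 'Card', 'Cash'],
    'Total_Cost': ['10.5', '20.0', '7.25'],
})

# Convert a text column to the memory-efficient categorical dtype
df['Payment_Method'] = df['Payment_Method'].astype('category')

# Ensure a numeric column is a proper float (invalid entries become NaN)
df['Total_Cost'] = pd.to_numeric(df['Total_Cost'], errors='coerce')

print(df.dtypes)
```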
Text data preprocessing is a crucial step when preparing data for analysis. In this dataset, several categorical columns contain textual data that must be encoded into numerical form before most models can use them.
from sklearn.preprocessing import LabelEncoder

# text_columns: list of the categorical column names defined earlier
df[text_columns] = df[text_columns].fillna('unknown')
label_encoders = {}
for column in text_columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le
Data aggregation involves grouping the data based on specific columns and performing aggregations such as sum, mean, and count on other columns. In the context of a retail transactions dataset, data can be aggregated to calculate total sales per day, per product, or per customer.
daily_sales = df.groupby('Date')['Total_Cost'].sum().reset_index()
average_sales = df.groupby('Date')['Total_Cost'].mean().reset_index()
print(average_sales.head())
print(daily_sales.head())
Data splitting is a crucial step in machine learning and data analysis. It involves
dividing your dataset into subsets for training and testing purposes, ensuring that you can
evaluate how well your model generalizes to unseen data.
# Adjust the feature columns and target column based on the actual dataset.
# Example: Assuming 'Total_Cost' is the target, and 'Quantity' and 'Item_Price' are features.
X = df[['Quantity', 'Item_Price']]
y = df['Total_Cost']
# Step 2: Handle missing values (here filled with 0 rather than dropped)
X = X.fillna(0)
y = y.fillna(0)
# Step 3: Split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Normalization and scaling are common steps in data preprocessing, especially when
you are working with numerical data that has varying ranges. Scaling ensures that numerical
features are on a similar scale, which is particularly important for algorithms that rely on
distance measurements (like k-NN, SVMs, or gradient-based algorithms).
CODING:
from sklearn.preprocessing import MinMaxScaler

# Step 2: Handle missing values before scaling (scalers cannot handle NaN values)
df[numerical_cols] = df[numerical_cols].fillna(0)

# Initialize MinMaxScaler and scale the numerical columns to [0, 1]
min_max_scaler = MinMaxScaler()
df[numerical_cols] = min_max_scaler.fit_transform(df[numerical_cols])
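For the distance-based algorithms mentioned above, standardization (zero mean, unit variance) is a common alternative to min-max scaling; a minimal sketch, with toy values and assumed column names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with assumed numerical columns
df = pd.DataFrame({'Quantity': [1, 2, 3, 4], 'Item_Price': [10.0, 20.0, 30.0, 40.0]})
numerical_cols = ['Quantity', 'Item_Price']

# Standardize: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Each scaled column now has mean ~0 and (population) standard deviation ~1
print(df[numerical_cols].mean().round(6))
```

MinMaxScaler is preferable when a bounded [0, 1] range is required; StandardScaler is more robust when features follow roughly bell-shaped distributions.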
3. DATA ANALYSIS
TRANSACTION ANALYSIS:
Analyse transaction patterns, like the distribution of total cost and number of items per
transaction.
CODING:
import matplotlib.pyplot as plt

# Distribution of total cost per transaction
plt.figure(figsize=(10, 6))
plt.hist(df['Total_Cost'], bins=30, edgecolor='black')
plt.xlabel('Total Cost')
plt.ylabel('Frequency')
plt.show()

# Distribution of items per transaction ('Total_Items' column name assumed)
plt.figure(figsize=(10, 6))
plt.hist(df['Total_Items'], bins=30, edgecolor='black')
plt.xlabel('Total Items')
plt.ylabel('Frequency')
plt.show()
OUTPUT:
EXPLORATORY DATA ANALYSIS (EDA)
It is used to understand distributions, relationships, and trends in the data. This includes
visualizing the data and analyzing summary statistics.
CODING:
# Total cost by payment method (a boxplot is assumed from the axis labels)
df.boxplot(column='Total_Cost', by='Payment_Method', figsize=(10, 6))
plt.xlabel('Payment Method')
plt.ylabel('Total Cost')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
OUTPUT:
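The summary statistics mentioned above can be produced with describe(); a minimal sketch on a toy frame with assumed column names:

```python
import pandas as pd

# Toy stand-in for the retail transactions frame
df = pd.DataFrame({
    'Total_Cost': [12.5, 40.0, 7.25, 19.9],
    'Total_Items': [2, 5, 1, 3],
})

# count, mean, std, min, quartiles, and max for every numerical column
summary = df.describe()
print(summary)
```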
SALES TREND OVER TIME :
CODING :
df['Date'] = pd.to_datetime(df['Date'])
monthly_sales = df.groupby(df['Date'].dt.to_period('M')).agg({'Total_Cost': 'sum'})
plt.figure(figsize=(12, 6))
monthly_sales.plot(kind='line', legend=False)
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.show()
OUTPUT:
CODING
# Share of each payment method (a pie chart is assumed from the blanked y-label)
payment_method_counts = df['Payment_Method'].value_counts()
plt.figure(figsize=(8, 6))
payment_method_counts.plot(kind='pie', autopct='%1.1f%%')
plt.ylabel('')
plt.show()
OUTPUT:
4. DATA VISUALIZATION
CODING:
plt.figure(figsize=(10, 6))
plt.hist(df['Total_Items'], bins=30, edgecolor='black')  # column name assumed
plt.xlabel('Total Items')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
OUTPUT:
SALES BY CITY:
city_sales = df.groupby('City')['Total_Cost'].sum().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
city_sales.plot(kind='bar', color='purple')
plt.xlabel('City')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()
OUTPUT:
CODING:
top_products = df['Product'].value_counts().head(10)
plt.figure(figsize=(12, 6))
top_products.plot(kind='bar', color='orange')
plt.xlabel('Product')
plt.ylabel('Purchase Frequency')
plt.xticks(rotation=45)
plt.show()
OUTPUT:
PAYMENT METHOD DISTRIBUTION :
CODING:
# Count payment method occurrences
payment_method_counts = df['Payment_Method'].value_counts()
plt.figure(figsize=(8, 6))
payment_method_counts.plot(kind='pie', autopct='%1.1f%%')  # pie chart assumed
plt.ylabel('')
plt.show()
OUTPUT :
CORRELATION HEATMAP :
CODING:
import seaborn as sns  # seaborn assumed for the heatmap

corr_matrix = df[numerical_cols].corr()  # numerical_cols is the list of column names
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
OUTPUT :
CONCLUSION
The analysis revealed several important trends. Sales data over time showcased
seasonal fluctuations, indicating times of increased customer activity and
helping businesses plan inventory and promotions more effectively.
Furthermore, product purchase patterns highlighted which items are most
popular among customers, guiding inventory management and marketing
efforts. Customer demographics and categories provided additional insights into
which segments contribute most to overall sales, enabling businesses to tailor
their marketing strategies to specific customer groups.
FUTURE OUTCOME