
Data Collection and Data Cleaning: Next Connect To The Drive


1. Data collection and data cleaning

Next, connect to the drive:

Next, load the dataset into a pandas DataFrame for easy use and analysis.
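A minimal sketch of the loading step. The text does not show the file path or the drive-mounting call, so the CSV content below is a made-up three-row stand-in that borrows a few of the column names used later in this report; in practice pd.read_csv would be given the path of the file on the mounted drive.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; all values are invented.
csv_text = """Invoice ID,City,Product line,Unit price,Quantity,Total,Date,Rating
A-001,Yangon,Health and beauty,10.0,2,21.0,1/5/2019,9.1
A-002,Naypyitaw,Electronic accessories,20.0,1,21.0,3/8/2019,7.4
A-003,Mandalay,Home and lifestyle,5.0,4,21.0,2/1/2019,6.0
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # 'Date' stays plain text unless it is parsed explicitly
print(df.head())
```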
Results obtained from the data:

Clean up data:
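The cleaning code itself is not reproduced in the text, so the following is only a sketch of typical steps (counting missing values, dropping duplicate rows, parsing dates). The small DataFrame, the choice of steps, and the month/day/year date format are all assumptions.

```python
import pandas as pd

# Made-up rows: one exact duplicate and one missing rating.
df = pd.DataFrame({
    "Invoice ID": ["A-001", "A-002", "A-002", "A-003"],
    "Date": ["1/5/2019", "3/8/2019", "3/8/2019", "2/1/2019"],
    "Total": [21.0, 42.0, 42.0, 10.5],
    "Rating": [9.1, 7.4, 7.4, None],
})

# Count missing values per column before changing anything.
missing_per_column = df.isnull().sum()

# Remove exact duplicate rows, then rows that have no rating.
df_clean = df.drop_duplicates().dropna(subset=["Rating"]).copy()

# Parse the dates (assumed month/day/year) so they sort chronologically.
df_clean["Date"] = pd.to_datetime(df_clean["Date"], format="%m/%d/%Y")

print(missing_per_column)
print(df_clean)
```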
Results received:
2. Visualisation
Here's a breakdown of what the code does:

 import pandas as pd: This line imports the Pandas library, which is commonly used for
data manipulation and analysis.
 import matplotlib.pyplot as plt: This line imports the Pyplot module from the Matplotlib
library, which is used for creating data visualizations.
 plt.figure(figsize=(12,6)): This line sets the size of the figure to be 12 inches wide and 6
inches tall.
 city_sales = df.groupby('Product line')['Total'].sum().sort_values(): This line groups the
data by 'Product line', sums the 'Total' column, and sorts the resulting totals in ascending
order (the default). Despite its name, city_sales holds totals per product line here.
 city_sales.plot(kind='bar', color='grey'): This line creates a bar plot of the 'city_sales'
data, with the bars colored in grey.
 plt.title('Total Sales by Product line'): This line sets the title of the plot to "Total Sales by
Product line".
 plt.xlabel('Product line'): This line sets the label for the x-axis to "Product line".
 plt.ylabel('Total Sales'): This line sets the label for the y-axis to "Total Sales".
 plt.xticks(rotation=40): This line rotates the x-axis labels by 40 degrees to make them
more readable.
 plt.show(): This line displays the plot.

Overall, this code is designed to analyze and visualize sales data by product line, with the goal of
identifying the best-selling product lines.
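Assembled from the line-by-line breakdown above, the plotting code looks roughly like this. The tiny DataFrame is a made-up stand-in for the real data, the Agg backend line is only there so the sketch runs without a display, and the variable is called product_sales here where the original uses city_sales.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "Product line": ["Food and beverages", "Health and beauty",
                     "Food and beverages", "Sports and travel"],
    "Total": [100.0, 40.0, 60.0, 90.0],
})

plt.figure(figsize=(12, 6))
# Sum 'Total' per product line; sort_values() is ascending by default.
product_sales = df.groupby("Product line")["Total"].sum().sort_values()
product_sales.plot(kind="bar", color="grey")
plt.title("Total Sales by Product line")
plt.xlabel("Product line")
plt.ylabel("Total Sales")
plt.xticks(rotation=40)
plt.show()
```

Because the sort is ascending, the best-selling product line ends up as the rightmost bar.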
The key insights from the chart are:

 The product line with the highest total sales is "Food and beverages", followed by "Sports
and travel".
 The product line with the lowest total sales is "Health and beauty".
 The sales figures for the other product lines, in descending order, are: "Home and
lifestyle", "Fashion accessories", and "Electronic accessories".
 The sales figures for the different product lines vary significantly, with the top two
product lines having much higher total sales compared to the others.
 The chart provides a clear visualization of the relative performance of the different
product lines in terms of total sales, which can help the company identify its best-
performing and underperforming product lines.

Overall, the chart provides a useful summary of the sales performance by product line, which can
inform business decisions and strategies for the company.
Here's a breakdown of what the code does:

 import pandas as pd: This line imports the Pandas library, which is commonly used for
data manipulation and analysis.
 import matplotlib.pyplot as plt: This line imports the Pyplot module from the Matplotlib
library, which is used for creating data visualizations.
 plt.figure(figsize=(12,6)): This line sets the size of the figure to be 12 inches wide and 6
inches tall.
 rating = df.groupby('Product line')['Rating'].mean().sort_values(): This line groups the
data by 'Product line', calculates the mean of the 'Rating' column, and then sorts the
resulting values.
 rating.plot(kind='bar', color='red'): This line creates a bar plot of the 'rating' data, with the
bars colored in red.
 plt.title('Rating by Product line'): This line sets the title of the plot to "Rating by Product
line".
 plt.xlabel('Product line'): This line sets the label for the x-axis to "Product line".
 plt.ylabel('Rating'): This line sets the label for the y-axis to "Rating".
 plt.xticks(rotation=40): This line rotates the x-axis labels by 40 degrees to make them
more readable.
 plt.show(): This line displays the plot.
Overall, this code is designed to analyze and visualize the average rating for each product line,
which can be useful for identifying the most well-received products and potentially informing
product development and marketing strategies.
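Averages that lie close together are easier to judge next to the spread and the number of ratings behind them. A small sketch with made-up ratings; the agg call is an addition for illustration and is not part of the original code.

```python
import pandas as pd

# Invented ratings, two per product line.
df = pd.DataFrame({
    "Product line": ["Food and beverages", "Food and beverages",
                     "Home and lifestyle", "Home and lifestyle"],
    "Rating": [8.0, 6.0, 7.0, 6.0],
})

# Mean rating per product line, plus sample size and standard deviation.
rating_summary = (
    df.groupby("Product line")["Rating"]
      .agg(["mean", "count", "std"])
      .sort_values("mean")
)
print(rating_summary)
```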

The key insights from the chart are:

 The product line with the highest average rating is "Food and beverages", followed by
"Fashion accessories".
 The product line with the lowest average rating is "Home and lifestyle".
 The ratings for the other product lines, in descending order, are: "Sports and travel",
"Electronic accessories", and "Health and beauty".
 The average ratings for all product lines are close to one another, with most falling
between 6 and 7.
 The chart provides a clear visualization of the relative performance of the different
product lines in terms of customer ratings, which can help the company identify their
strongest and weakest product categories.
Overall, the chart suggests that customers are generally satisfied with the company's products,
with the food and beverage and fashion accessory lines being the most well-received. This
information could be useful for the company in making strategic decisions about product
development, marketing, and resource allocation.

Here's a breakdown of what the code does:

 import pandas as pd: This line imports the Pandas library, which is commonly used for
data manipulation and analysis.
 import matplotlib.pyplot as plt: This line imports the Pyplot module from the Matplotlib
library, which is used for creating data visualizations.
 plt.figure(figsize=(12,6)): This line sets the size of the figure to be 12 inches wide and 6
inches tall.
 city_sales = df.groupby('City')['Total'].sum().sort_values(): This line groups the data by
'City', sums the 'Total' column, and then sorts the resulting values.
 city_sales.plot(kind='bar', color='black'): This line creates a bar plot of the 'city_sales'
data, with the bars colored in black.
 plt.title('Total Sales by City'): This line sets the title of the plot to "Total Sales by City".
 plt.xlabel('City'): This line sets the label for the x-axis to "City".
 plt.ylabel('Total Sales'): This line sets the label for the y-axis to "Total Sales".
 plt.xticks(rotation=40): This line rotates the x-axis labels by 40 degrees to make them
more readable.
 plt.show(): This line displays the plot.
Overall, this code is designed to analyze and visualize the total sales by city, which can be useful
for identifying the best-performing and underperforming sales regions. The bar plot provides a
clear comparison of the total sales for each city, allowing the company to focus its efforts on the
most lucrative markets.

The chart shows the total sales by city, with three cities represented: Mandalay, Yangon, and
Naypyitaw.

Key insights from the chart:

 Yangon has the highest total sales of the three cities, significantly higher than the other
two.
 Naypyitaw has the second highest total sales, but much lower than Yangon.
 Mandalay has the lowest total sales of the three cities.
 The difference in total sales between the cities is quite dramatic, with Yangon's total sales
being much larger than the other two.
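Phrases such as "significantly higher" are easier to check when the totals are expressed as shares of overall sales. A sketch with made-up numbers (these are not the report's results):

```python
import pandas as pd

# Invented totals, purely for illustration.
df = pd.DataFrame({
    "City": ["Yangon", "Mandalay", "Naypyitaw", "Yangon"],
    "Total": [50.0, 30.0, 20.0, 100.0],
})

city_sales = df.groupby("City")["Total"].sum().sort_values()

# Share of overall sales contributed by each city, in percent.
city_share = (city_sales / city_sales.sum() * 100).round(1)
print(city_share)
```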

This chart provides a clear visual comparison of the sales performance across the three cities.
The large disparity in total sales suggests that the company may want to further investigate the
factors contributing to Yangon's stronger sales performance compared to the other locations. This
information could help inform strategic decisions around resource allocation, marketing, and
operations to optimize the company's overall sales.

Here's a breakdown of what the code does:

 import pandas as pd: This line imports the Pandas library, which is commonly used for
data manipulation and analysis.
 import matplotlib.pyplot as plt: This line imports the Pyplot module from the Matplotlib
library, which is used for creating data visualizations.
 plt.figure(figsize=(10,5)): This line sets the size of the figure to be 10 inches wide and 5
inches tall.
 date_sale = df.groupby('Date')['Total'].sum(): This line groups the data by 'Date', sums the
'Total' column, and assigns the result to the date_sale variable.
 date_sale.plot(kind='line', marker='o', linestyle='-', color='blue'): This line creates a line
plot of the date_sale data, with the line style set to a solid line, the marker set to a circle,
and the color set to blue.
 plt.title('Total sale in recent times'): This line sets the title of the plot to "Total sale in
recent times".
 plt.xlabel('Date'): This line sets the label for the x-axis to "Date".
 plt.ylabel('Total Sales'): This line sets the label for the y-axis to "Total Sales".
 plt.xticks(rotation=40): This line rotates the x-axis labels by 40 degrees to make them
more readable.
 plt.grid(True): This line adds a grid to the plot.
 plt.show(): This line displays the plot.
Overall, this code is designed to analyze and visualize the total sales over time, which can be
useful for identifying trends and patterns in the company's sales performance. The line plot
provides a clear representation of how the total sales have changed over the recent time period,
allowing the company to identify any significant changes or fluctuations that may require further
investigation or action.
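One detail matters for this plot: if 'Date' is still stored as text, groupby orders the dates as strings (for example "1/15/2019" before "1/5/2019") rather than chronologically. A sketch that parses the column first, assuming month/day/year strings and using made-up rows; the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows; two sales share the same date.
df = pd.DataFrame({
    "Date": ["1/5/2019", "1/15/2019", "1/5/2019", "2/1/2019"],
    "Total": [10.0, 20.0, 5.0, 30.0],
})

# Assumed date format: month/day/year.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

plt.figure(figsize=(10, 5))
date_sale = df.groupby("Date")["Total"].sum()  # index is now chronological
date_sale.plot(kind="line", marker="o", linestyle="-", color="blue")
plt.title("Total sale in recent times")
plt.xlabel("Date")
plt.ylabel("Total Sales")
plt.xticks(rotation=40)
plt.grid(True)
plt.show()
```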

The chart shows the total sales over time, with the x-axis representing the date and the y-axis
representing the total sales. Here are the key insights from the chart:

 The sales data is plotted as a line chart, which effectively shows the fluctuations in total
sales over the recent time period.
 The total sales exhibit significant volatility, with large spikes and dips throughout the
time frame.
 There appear to be some recurring patterns, with certain dates consistently showing
higher or lower sales compared to the surrounding dates.
 The overall trend seems to be one of increasing total sales over the time period, as the
chart shows higher highs and higher lows as time progresses.
 The largest spike in total sales occurs around the middle of the time frame. A single
spike is not enough to establish a seasonal or cyclical pattern, but it marks a date worth
investigating.
 The chart provides a clear visual representation of the company's sales performance over
time, which can help identify opportunities for optimization and potential areas of
concern.

This information can be valuable for the company to analyze and understand the factors driving
the sales trends, such as seasonality, promotions, or other external market influences. By
identifying these patterns, the company can make more informed decisions about inventory
management, marketing strategies, and overall business planning.

Here's a breakdown of what the code does:

 import pandas as pd: This line imports the Pandas library, which is commonly used for
data manipulation and analysis.
 import matplotlib.pyplot as plt: This line imports the Pyplot module from the Matplotlib
library, which is used for creating data visualizations.
 import seaborn as sns: This line imports the Seaborn library, which is used for creating
more advanced data visualizations, including the correlation heatmap.
 plt.figure(figsize=(12, 8)): This line sets the size of the figure to be 12 inches wide and 8
inches tall.
 df_corr = df.drop(columns=['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
'Product line', 'Date', 'Time', 'Payment']): This line creates a new DataFrame df_corr by
dropping the specified columns from the original DataFrame df.
 corr = df_corr.corr(): This line calculates the correlation matrix for the columns in
df_corr.
 sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1): This line creates
the correlation heatmap using Seaborn's heatmap() function. The annot=True parameter
writes the correlation values into the heatmap cells, the cmap='coolwarm' parameter sets
the colormap to a diverging palette that ranges from blue (negative correlation) to red
(positive correlation), and vmin=-1 and vmax=1 pin the color scale to the full range of
possible correlation values.
 plt.title('Correlation Heatmap'): This line sets the title of the plot to "Correlation
Heatmap".
 plt.show(): This line displays the correlation heatmap.

The purpose of this code is to analyze the relationships between the numeric variables that
remain after the identifier, categorical, and date/time columns ("Invoice ID", "Branch", "City",
"Customer type", "Gender", "Product line", "Date", "Time", and "Payment") have been dropped:
Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage, gross income, and Rating.
The correlation heatmap provides a visual representation of the strength and direction of the
relationships between these variables, allowing the user to quickly identify which variables are
strongly correlated with each other.

This information can be valuable for the company in understanding the underlying patterns and
relationships in their data, which can inform business decisions, marketing strategies, and
product development.
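Rather than listing every text column to drop, the numeric columns can also be selected by dtype; recent pandas versions raise an error if corr() is called on a DataFrame that still contains text columns. The sketch below uses made-up rows and stops at the correlation matrix, which would then be passed to sns.heatmap as described above. select_dtypes is a substitute for the original drop call, not what the author used.

```python
import pandas as pd

# Invented rows with one text column and three numeric ones.
df = pd.DataFrame({
    "City": ["Yangon", "Mandalay", "Naypyitaw", "Yangon"],
    "Unit price": [10.0, 20.0, 5.0, 8.0],
    "Quantity": [2, 1, 4, 3],
    "Rating": [9.1, 7.4, 6.0, 8.2],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
df_corr = df.select_dtypes(include="number")
corr = df_corr.corr()
print(corr.round(2))
```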
The chart provided is a correlation heatmap that visualizes the relationships between various
variables in the dataset.

Key insights from the correlation heatmap:

The variables with the strongest positive correlation (depicted in dark red) are:

 Tax 5%, Total, cogs, gross margin percentage, and gross income (all with a correlation of
1 with one another)

The variables with a moderate positive correlation are:

 Unit price and Tax 5%, Total, cogs, gross margin percentage, gross income (all with a
correlation of 0.63)
 Quantity and Tax 5%, Total, cogs, gross margin percentage, gross income (all with a
correlation of 0.71)

The variables with essentially no correlation (values close to zero, all slightly negative) are:

 Rating and Unit price (-0.0088)
 Rating and Quantity (-0.016)
 Rating and the remaining variables (-0.036)

 The diagonal elements in the heatmap represent the correlation of a variable with itself,
which is always 1. This is the only place where Unit price and Quantity show a value of 1:
their correlations with Total differ (0.63 and 0.71), so the two cannot be perfectly
correlated with each other.
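A correlation of exactly 1 between Tax 5%, Total, cogs, and gross income is what one would expect if these columns are fixed multiples of each other. Assuming, as the column name "Tax 5%" suggests, that tax is 5% of cogs and Total is cogs plus tax (the text does not state this), a quick check on made-up numbers:

```python
import pandas as pd

cogs = pd.Series([20.0, 35.5, 12.0, 80.0])  # invented values
tax = 0.05 * cogs      # assumed definition of 'Tax 5%'
total = cogs + tax     # assumed definition of 'Total'

# Pearson correlation between exact positive multiples is 1.
print(round(cogs.corr(total), 6))
print(round(tax.corr(total), 6))

# A column with the same value in every row has zero variance,
# so its correlation with anything is undefined (NaN).
constant = pd.Series([5.0] * 4)
print(constant.corr(total))
```

The last check is worth running on any percentage column that might hold the same value in every row, since such a column cannot meaningfully be described as correlated with the others.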

This correlation heatmap provides valuable insights into the relationships between the different
variables in the dataset. It can help the company identify key drivers of their business
performance and understand which factors are strongly correlated with each other. This
information can be used to inform decision-making, optimize operations, and identify areas for
potential improvement.
