Smart Data Discovery

The document is a coursework submission for a Smart Data Discovery module. It analyzes sales data from ABC company over 12 months using Python. The analysis includes data understanding, preparation, exploration and conclusion. Key steps taken are merging data files, handling missing values, adding columns, summary statistics, correlations, and data visualizations including bar graphs and histograms. The aim is to gain insights from the data such as highest selling months, cities, and products.

CC5067NI-Smart Data Discovery

60 % Individual Coursework

2022-23 Spring

Student Name: AAROHAN SUBEDI


London Met ID: 22015633
College ID: np01cp4s220149
Assignment Due Date: Monday, May 1, 2023
Assignment Submission Date: Monday, May 1, 2023
Word Count: 3098

I confirm that I understand my coursework needs to be submitted online via MySecondTeacher under the
relevant module page before the deadline in order for my assignment to be accepted and marked. I am
fully aware that late submissions will be treated as non-submission and a mark of zero will be awarded.
CC5067NI Smart Data Discovery

Table of Contents
Acknowledgement
ABSTRACT
INTRODUCTION
1. Data Understanding
2. Data Preparation
2.1 Write a python program to merge data from each month into one CSV and read in updated dataframe.
2.2 Write a python program to remove the NaN missing values from updated dataframe.
2.3 Write a python program to convert Quantity Ordered and Price Each to numeric.
2.4 Create a new column named Month from Ordered Date of updated dataframe and convert it to integer as data type.
2.5 Create a new column named City from Purchase Address based on the value in updated dataframe.
3. Data analysis
3.1 Write a Python program to show summary statistics of sum, mean, standard deviation, skewness, and kurtosis of any chosen variable.
3.2 Write a Python program to calculate and show correlation of all variables.
4. Data Exploration
4.1 Which Month has the best sales? and how much was the earning in that month? Make a bar graph of sales as well.
4.2 Which city has sold the highest product?
4.3 Which product was sold the most in overall? Illustrate it through bar graph.
4.4 Write a Python program to show histogram plot of any chosen variables. Use proper labels in the graph.
5. Conclusion
Bibliography

22015633 Aarohan Subedi

Table of Figures.

Figure 1 Summary of Data Frame
Figure 2 Showing the Data information.
Figure 3 Importing the Libraries and reading CSV files.
Figure 4 Merging and saving new merged data frame.
Figure 5 Removing Null values.
Figure 6 Conversion of datatypes
Figure 7 Adding new column Month and changing its datatype to integer.
Figure 8 Adding a new column City in the data frame.
Figure 9 Showing the outputs of the Statistics.
Figure 10 Showing correlation.
Figure 11 Heatmap to show correlation.
Figure 12 Finding month with highest sale and highest earning of a month.
Figure 13 Bar graph of month with highest sales
Figure 14 Finding the city with highest sale.
Figure 15 Finding the most sold product.
Figure 16 Plotting product sale of the year in bar graph.
Figure 17 Histogram of Price Each Variable.


List of Tables

Table 1 Data Understanding


Acknowledgement

I want to thank my college, Islington, for giving me this opportunity to demonstrate the
skills and concepts I have learnt in this module. I also express my sincere appreciation to
everybody who helped me to complete this task and supported me while I worked through its
problems. I would especially like to thank Mr. Sarun Dahal, mentor for this module, for his
continuous guidance and for sharing his skills and experience. Once again, I thank you all
for the advice, help and encouragement that made it possible to complete the coursework
on time.


ABSTRACT

This is the main individual coursework of the Smart Data Discovery module and carries
60% of the marks for the module. The major objective of this coursework is to demonstrate
the knowledge and skills I have learnt in this module, such as data analysis, problem
solving and critical thinking. In this task I have been given the 2019 sales data of ABC
company, and I must write Python programs to answer the questions given and prepare a
technical report on the results, covering data understanding, data preparation, data
analysis and initial data exploration. To answer those questions, I have used Jupyter
Notebook and various Python libraries such as NumPy, Pandas, Matplotlib and SciPy.
Matplotlib was used to create visualizations such as bar charts, histograms and heat maps.


INTRODUCTION

This coursework is based on the analysis of 12 months of data from ABC company. The main
goal of the project is to investigate the company's sales data. The Python programming
language is used to manipulate the data with the help of Pandas. I analyse the data of
ABC company, write Python code and prepare a report on the findings. The CSV files help to
list and manage all the data in tabular form. Analysis of all the variables is done to gain
a better understanding of the data. Here I have calculated the sum of all values, the
average value, the degree to which the values in the dataset deviate from the mean
(standard deviation), the asymmetry of the distribution of values (skewness) and the
peakedness of the distribution of values (kurtosis). I also calculated and presented the
correlation between the variables. In addition, I plotted bar charts and a histogram
covering the month with the best sales, the earnings of that month, the city with the
highest product sales, the most sold product, and so on.


1. Data Understanding

The given data set is the 2019 sales data of ABC company. The columns in the CSV files
describe the unique code of each order (Order ID), the product sold (Product), the quantity
purchased (Quantity Ordered), the price of each unit (Price Each), the order date and time
(Order Date), and the complete delivery address (Purchase Address). From my analysis, it
can be assumed that ABC company is an online hardware store. The records are provided as
monthly files (January to December). After the files were combined, the total number of
rows was 186850, of which 900 rows contained null values. After merging the data, we are
ready to perform the sales analysis and data mining.

| S.N | Column Name | Description | Data Type |
|-----|-------------|-------------|-----------|
| 1 | Order ID | A unique code given to every order. There was a total of 186,850 orders placed in 2019. | Integer |
| 2 | Product | The unique name of the 19 products available in the store, such as iPhone, Google Phone, and MacBook Pro. | String |
| 3 | Quantity Ordered | The quantity of the product ordered by the customer. The lowest ordered quantity is 1, and the highest is 9. | Integer |
| 4 | Price Each | The price of each unit of the product. The cheapest price is 2.99, and the most expensive one is 1700. | Float |
| 5 | Order Date | The detailed date and time of the order, from January to December 2019. | String |
| 6 | Purchase Address | The delivery address of the product. | String |

Table 1 Data Understanding

This table lists the column names, descriptions and data types for a dataset of 186,850
orders placed in 2019. The dataset includes columns for the unique order ID, the product
name, the quantity ordered, the price of each unit, the order date and time, and the
delivery address. The Product column contains the unique names of the 19 products
available in the store, and the Quantity Ordered column ranges from 1 to 9. The Price Each
column contains prices ranging from 2.99 to 1700. The Order Date column gives the full
date and time of each order, and the Purchase Address column contains the delivery address
of each order.


Figure 1 Summary of Data Frame

This method gives a brief summary of the data frame, including the number of non-null
values and the data type of each column.

The data frame that was created after merging all the provided CSV files contains a total
of 186850 rows and 6 columns.
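A summary of this kind can be sketched as follows. This is only an illustration: it assumes the summary comes from the pandas info() method, and the three sample rows and the name sample_df are made up rather than taken from the coursework files.

```python
import pandas as pd

# Hypothetical three-row stand-in for the merged sales data (not the real file).
sample_df = pd.DataFrame({
    "Order ID": ["141234", "141235", None],
    "Product": ["iPhone", "Lightning Charging Cable", None],
    "Quantity Ordered": ["1", "1", None],
    "Price Each": ["700", "14.95", None],
    "Order Date": ["01/22/19 21:25", "01/28/19 14:15", None],
    "Purchase Address": ["944 Walnut St, Boston, MA 02215",
                         "185 Maple St, Portland, OR 97035", None],
})

sample_df.info()                      # prints non-null counts and dtypes per column
n_rows, n_cols = sample_df.shape      # (rows, columns) of the data frame
non_null = sample_df.notna().sum()    # non-null count per column as a Series
```

On the real merged data frame, the shape would be expected to match the 186850 rows and 6 columns reported above.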


Figure 2 Showing the Data information.

The columns represent Order ID, Product, Quantity Ordered, Price Each, Order Date and
Purchase Address. The 900 null rows, which resulted from required fields not being filled
in, were removed, and the remaining rows were examined further. The data will also help
the company to target active customers using their purchase history and other recurring
factors. The remaining data is used throughout the analysis to study the internal working
of the company. Overall, the analysis can help to improve the marketing and sales
strategies for the coming year.


2. Data Preparation
Data preparation is the process of preparing raw data so that it is suitable for further
processing and analysis. Key steps include collecting, cleaning, and labelling raw data
into a form suitable for machine learning (ML) algorithms and then exploring and
visualizing the data. Data preparation can take up to 80% of the time spent on an ML
project. Using specialized data preparation tools is important to optimize this process
(Aws, 2023).

Here are some common steps for data preparation:

o Collection of Data
o Cleaning of Data
o Integration of Data
o Transformation of Data
o Data Reduction

2.1 Write a python program to merge data from each month into one CSV
and read in updated dataframe.

Figure 3 Importing the Libraries and reading CSV files.


This code is a Python script that imports a few libraries: NumPy, glob, and
matplotlib.pyplot. It then defines a path variable pointing to a folder on the user's
desktop that contains the CSV files. The script uses the glob function to get a list of
all CSV files in the folder that match the defined path.
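The file-listing step can be sketched as below. The folder is a temporary directory and the file names are hypothetical, used only so that the example runs on its own; the actual desktop path from the coursework is not reproduced here.

```python
import glob
import os
import tempfile

# Hypothetical folder standing in for the desktop folder of monthly files.
folder = tempfile.mkdtemp()
for name in ["Sales_January_2019.csv", "Sales_February_2019.csv", "notes.txt"]:
    with open(os.path.join(folder, name), "w") as fh:
        fh.write("Order ID,Product\n")

# glob returns every path that matches the wildcard pattern; sort for a stable order.
csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
csv_names = [os.path.basename(p) for p in csv_files]
```

Only the two files ending in .csv are picked up; the text file is ignored by the pattern.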

Figure 4 Merging and saving new merged data frame.

After importing the CSV files for all 12 months, I merged them using the code shown in
Figure 3. The result of the merge is shown in the figure above: a new data frame
(my_merged_df) is created.
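One common way to merge monthly files is sketched below. It assumes pd.concat is used for the merge, which the report does not state explicitly, and the two in-memory "files" and their rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical monthly CSV contents; real files would be read from disk paths.
january = io.StringIO(
    "Order ID,Product,Quantity Ordered\n1,iPhone,1\n2,Wired Headphones,2\n"
)
february = io.StringIO(
    "Order ID,Product,Quantity Ordered\n3,AAA Batteries (4-pack),3\n"
)

# Read each month into its own data frame, then stack them row-wise.
monthly_frames = [pd.read_csv(f) for f in (january, february)]
my_merged_df = pd.concat(monthly_frames, ignore_index=True)

# Save the merged result; a file path can be passed instead of the buffer.
out = io.StringIO()
my_merged_df.to_csv(out, index=False)
```

Passing ignore_index=True gives the merged frame a fresh 0..n-1 index instead of repeating each month's row numbers.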


2.2 Write a python program to remove the NaN missing values from updated
dataframe.

Figure 5 Removing Null values.

Here, all the NaN values in the data frame are removed using the dropna function. Before
removing the null values there were 186850 rows, as seen in Figure 4. The 900 rows
containing NaN values were removed permanently, leaving a total of 185950 rows, as seen
in Figure 5.
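A minimal sketch of the same step on made-up data is shown below; the column values are hypothetical and only the dropna call itself comes from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one fully empty row, as in the merged sales data.
df = pd.DataFrame({
    "Order ID": ["1", np.nan, "3"],
    "Product": ["iPhone", np.nan, "USB-C Charging Cable"],
})

rows_before = len(df)
cleaned = df.dropna(how="any")   # drop every row that has at least one NaN
rows_after = len(cleaned)
removed = rows_before - rows_after
```

The same bookkeeping applied to the figures in the report gives 186850 - 900 = 185950 rows.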


2.3 Write a python program to convert Quantity Ordered and Price Each to
numeric.

Figure 6 Conversion of datatypes

Here you can see that the initial data type of Quantity Ordered and Price Each is Float64,
and that it is converted to Int32, as shown in Figure 6 above.
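A hedged sketch of a numeric conversion follows. It assumes the columns start out as text, as they usually do straight after reading a CSV, and it uses pd.to_numeric, which may differ from the exact call in Figure 6. Unlike the report, the sketch leaves Price Each as a float, because casting prices such as 11.95 to an integer would drop the cents.

```python
import pandas as pd

# Hypothetical rows; values arrive as strings when read from CSV with mixed content.
df = pd.DataFrame({
    "Quantity Ordered": ["1", "2"],
    "Price Each": ["11.95", "700"],
})

# Whole-number quantities can safely be stored as 32-bit integers.
df["Quantity Ordered"] = pd.to_numeric(df["Quantity Ordered"]).astype("int32")

# Prices keep their decimal part when left as float64 (an assumption of this sketch).
df["Price Each"] = pd.to_numeric(df["Price Each"])
```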


2.4 Create a new column named Month from Ordered Date of updated
dataframe and convert it to integer as data type.

Figure 7 Adding new column Month and changing its datatype to integer.

Here, I have created a new column named Month from Order Date to make it easier to work
with the date and time values. I have then converted the data type of Month from object
to integer, as can be seen in Figure 7 above.
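One way to derive the month is sketched below. It assumes the Order Date strings follow a month/day/two-digit-year layout with a 24-hour time, and it parses them with pd.to_datetime; the report may instead have sliced the first two characters of the string. The sample dates are made up.

```python
import pandas as pd

# Hypothetical order dates in the assumed "MM/DD/YY HH:MM" layout.
df = pd.DataFrame({"Order Date": ["01/22/19 21:25", "12/30/19 00:01"]})

# Parse the text into datetimes, then take the month number as an integer.
parsed = pd.to_datetime(df["Order Date"], format="%m/%d/%y %H:%M")
df["Month"] = parsed.dt.month.astype("int32")
```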


2.5 Create a new column named City from Purchase Address based on the
value in updated dataframe.

Figure 8 Adding a new column City in the data frame.

Here, I have added a new column named ‘City’, taken from the Purchase Address column.
I have used the split function to split each address into parts and then used index 1 to
select the second element, which holds the city, as shown in the figure above.
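The split-and-index idea can be sketched as follows. The addresses are hypothetical, and the sketch assumes the address parts are separated by commas in the order street, city, state and ZIP code.

```python
import pandas as pd

# Hypothetical addresses in the assumed "street, city, state zip" layout.
df = pd.DataFrame({
    "Purchase Address": [
        "944 Walnut St, Boston, MA 02215",
        "682 Chestnut St, San Francisco, CA 94016",
    ]
})

# Split on commas, take the element at index 1 (the city), and trim the spaces.
df["City"] = df["Purchase Address"].str.split(",").str[1].str.strip()
```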


3. Data analysis

Data analysis is the process of collecting, modelling, and analysing data using various
statistical and logical methods and techniques. Businesses rely on analytics processes
and tools to extract insights that support strategic and operational decision-making
(Martin Blumenau, 2022).

There are six common steps for data analysis:

Step 1: Specifying the data requirement.

Step 2: Collecting the data.

Step 3: Cleaning and processing the data.

Step 4: Analysing the data.

Step 5: Sharing the data.

Step 6: Act or report.


3.1 Write a Python program to show summary statistics of sum, mean,
standard deviation, skewness, and kurtosis of any chosen variable.

Figure 9 Showing the outputs of the Statistics.

Here I have chosen the Quantity Ordered column and calculated the required statistics:
the sum, the mean (average), the standard deviation, the skewness and the kurtosis. The
name of the chosen variable is stored in column_name, and the result of each calculation
is stored in its own variable.
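These statistics can be sketched with pandas Series methods as below. The five quantities are invented, the variable names other than column_name are hypothetical, and the report may have used SciPy functions instead of the pandas methods shown here.

```python
import pandas as pd

# Hypothetical quantities; the real column has one value per order.
df = pd.DataFrame({"Quantity Ordered": [1, 1, 1, 2, 3]})

column_name = "Quantity Ordered"
col = df[column_name]

total = col.sum()        # sum of all values
mean = col.mean()        # arithmetic average
std = col.std()          # sample standard deviation (ddof=1)
skewness = col.skew()    # asymmetry of the distribution
kurt = col.kurt()        # excess kurtosis (0 for a normal distribution)
```

A positive skewness, as in this sample, means most values sit at the low end with a tail of larger orders.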

3.2 Write a Python program to calculate and show correlation of all variables.

Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It’s a common tool
for describing simple relationships without making a statement about cause and effect
(jmp, 2023).

Figure 10 Showing correlation.

From the above figure, we can see that there is a strong correlation between Order ID
and Month: as the Order ID rises, the Month rises as well. There is only a weak
correlation between Price Each and Month. Here the correlation is calculated with the
corr() function.
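A small sketch of a correlation matrix follows, using the pandas corr() method on invented numbers. The order IDs here are constructed to increase with the month, which mirrors the relationship described above but is not the real data.

```python
import pandas as pd

# Hypothetical numeric columns; Order ID grows as the months go by.
df = pd.DataFrame({
    "Order ID": [100, 200, 300, 400],
    "Month": [1, 2, 3, 4],
    "Price Each": [700.0, 11.95, 150.0, 3.84],
})

# Pairwise Pearson correlation (the default method) between all numeric columns.
corr_matrix = df.corr()
id_month = corr_matrix.loc["Order ID", "Month"]
```

Because order IDs are presumably issued in sequence through the year, a high value for this pair reflects numbering order rather than a business effect.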

I have also created a heatmap to give a more visual view of the correlation between the
variables.


Figure 11 Heatmap to show correlation.

4. Data Exploration

Data exploration is the first step of data analysis used to explore and visualize data to
uncover insights from the start or identify areas or patterns to dig into more. Using
interactive dashboards and point-and-click data exploration, users can better understand
the bigger picture and get to insights faster (group, 2022).

4.1 Which Month has the best sales? and how much was the earning in that
month? Make a bar graph of sales as well.

Figure 12 Finding month with highest sale and highest earning of a month.

First, I have used the groupby function to group the sales data by month and the sum
function to add up the sales of each month. I have stored the output in the
Max_sales_month variable.


Lastly, I have found the month with the highest earnings using the max function.
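The grouping and maximum steps can be sketched as below. The Sales column, computed as quantity times unit price, is an assumption of this sketch, since the report does not show how the sales value is derived; the rows and the names best_month and best_earning are hypothetical.

```python
import pandas as pd

# Hypothetical orders for two months.
df = pd.DataFrame({
    "Month": [1, 1, 12, 12],
    "Quantity Ordered": [1, 2, 1, 3],
    "Price Each": [10.0, 5.0, 100.0, 20.0],
})

# Assumed definition of sales: quantity multiplied by unit price.
df["Sales"] = df["Quantity Ordered"] * df["Price Each"]

# Total sales per month, then the best month and its earnings.
Max_sales_month = df.groupby("Month")["Sales"].sum()
best_month = Max_sales_month.idxmax()     # month label with the largest total
best_earning = Max_sales_month.max()      # the largest total itself
```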

A bar graph can be defined as a graphical representation of data, quantities, or numbers
using bars or strips. They are used to compare and contrast different types of data,
frequencies, or other measures of distinct categories of data (splashlearn, 2023).

Here is a bar graph representation of the above output.

Figure 13 Bar graph of month with highest sales

In the bar graph, the x-axis represents the 12 months and the y-axis represents the total
sales. We can see from the graph that December, the last month, has the highest total
sales and January has the lowest sales of all the months.
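A bar graph of this kind can be sketched with Matplotlib as follows. The monthly totals are made up, the name monthly_sales is hypothetical, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen; not needed inside Jupyter Notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical total sales for three months.
monthly_sales = pd.Series([20.0, 35.0, 160.0], index=[1, 2, 12], name="Sales")

fig, ax = plt.subplots()
month_labels = [str(m) for m in monthly_sales.index]
bars = ax.bar(month_labels, monthly_sales.values)   # one bar per month
ax.set_xlabel("Month")
ax.set_ylabel("Total sales")
ax.set_title("Total sales per month")
```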


4.2 Which city has sold the highest product?

Figure 14 Finding the city with highest sale.

The result of the above calculation is stored in the city variable, and the city with the
highest sales is stored in the highest variable. Here we can see that San Francisco has
sold the highest number of products.
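A sketch of this calculation is given below, reusing the variable names city and highest from the text. The rows are invented, and summing Quantity Ordered per city is an assumption about how "highest number of products" was measured.

```python
import pandas as pd

# Hypothetical orders from three cities.
df = pd.DataFrame({
    "City": ["Boston", "San Francisco", "San Francisco", "Austin"],
    "Quantity Ordered": [1, 2, 3, 1],
})

# Total quantity sold per city, then the city with the largest total.
city = df.groupby("City")["Quantity Ordered"].sum()
highest = city.idxmax()
```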


4.3 Which product was sold the most in overall? Illustrate it through bar
graph.

Figure 15 Finding the most sold product.

Here I have grouped the data by product using the groupby function and added up the total
sales of each product using the sum function. The output is stored in the best_selling
variable. The values are then compared to find the most sold, or best-selling, product
using the idxmax function.
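The same pattern applied to products can be sketched as follows. The rows and the name top_product are hypothetical; best_selling is the variable name used in the text.

```python
import pandas as pd

# Hypothetical orders for three products.
df = pd.DataFrame({
    "Product": ["AAA Batteries (4-pack)", "iPhone", "AAA Batteries (4-pack)", "LG Dryer"],
    "Quantity Ordered": [3, 1, 2, 1],
})

# Total quantity per product; idxmax returns the index label (the product name)
# of the largest total, whereas max would return the total itself.
best_selling = df.groupby("Product")["Quantity Ordered"].sum()
top_product = best_selling.idxmax()
top_quantity = best_selling.max()
```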


Figure 16 Plotting product sale of the year in bar graph.

In the above bar graph, the x-axis shows the names of the products available in the store
and the y-axis represents the quantity ordered. From the graph we can see that the
highest-selling product is AAA batteries and the lowest-selling products are the LG Dryer
and the LG Washing Machine.


4.4 Write a Python program to show histogram plot of any chosen variables.
Use proper labels in the graph.

A frequency distribution shows how often each different value in a set of data occurs. A
histogram is the most used graph to show frequency distributions. It looks very much like
a bar chart, but there are important differences between them. This helpful data collection
and analysis tool is considered one of the seven basic quality tools (ASQ, 2023).

Here is a histogram representation of Price Each variable.

Figure 17 Histogram of Price Each Variable.

From the histogram we can see how the prices are distributed: each bar shows how many
orders fall within a given price range, from the cheapest items to the most expensive.
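A labelled histogram can be sketched with Matplotlib as below. The six prices are made up, the number of bins is an arbitrary choice for this sketch, and the Agg backend is used only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen; not needed inside Jupyter Notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical unit prices spanning the 2.99 to 1700 range from Table 1.
prices = pd.Series([2.99, 2.99, 11.95, 150.0, 700.0, 1700.0], name="Price Each")

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(prices, bins=5)   # 5 equal-width price ranges
ax.set_xlabel("Price Each")
ax.set_ylabel("Number of orders")
ax.set_title("Histogram of Price Each")
```

The counts array holds the height of each bar, so it can be inspected directly when the exact frequencies matter.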


5. Conclusion

In this data analysis project, I learnt various data analysis concepts by working with
the 2019 sales records of ABC company. First, in the data understanding section, I gave a
brief explanation of the properties of the 12 monthly CSV sales files. Then, in the data
preparation stage, I prepared a single CSV file by merging all the data, removed all the
null values, converted Quantity Ordered and Price Each to numeric (int), and added new
columns named Month and City from Order Date and Purchase Address respectively.

In the analysis stage, I analysed all the variables to gain a better understanding of the
data. I calculated the sum of all values, the average value, the degree to which the
values in the dataset deviate from the mean (standard deviation), the asymmetry of the
distribution of values (skewness) and the peakedness of the distribution of values
(kurtosis). I also calculated and presented the correlation between the variables. In
addition, I plotted bar charts and a histogram covering the month with the best sales, the
earnings of that month, the city with the highest product sales, the most sold product,
and so on.

Completing this coursework has given me valuable experience in developing my skills and
understanding of Python data frames and data analysis. Reflecting on experiences and
projects like this is always useful, and I am excited to apply these skills in my future
projects.


Bibliography
ASQ, 2023. American Society for Quality. [Online]
Available at: https://asq.org/quality-resources/histogram
Aws, A., 2023. [Online]
Available at: https://aws.amazon.com/what-is/data-preparation/
D.Miller, J., 2017. Statistics for Data Science. Sebastopol, CA: O'Reilly Media.
Dale, K., 2016. Data Visualization with Python and JavaScript. Sebastopol, CA: O'Reilly Media.
Davis, D., 2014. Data Analysis for Business Decisions. Boston, MA: Cengage Learning.
group, C. s., 2022. Tibco. [Online]
Available at: https://www.tibco.com/reference-center/what-is-data-exploration#:~:text=Data%20exploration%20is%20the%20first,and%20get%20to%20insights%20faster.
jmp, 2023. jmpstaticaldiscoveries. [Online]
Available at: https://www.jmp.com/en_ca/statistics-knowledge-portal/what-is-correlation.html
Martin Blumenau, R. P. W., 2022. datapine. [Online]
Available at: https://www.datapine.com/blog/data-analysis-methods-and-techniques/
Sisense, 2023. What is Data Exploration? | Sisense. [Online]
Available at: https://www.sisense.com/glossary/data-exploration/
[Accessed 1 April 2023].
splashlearn, 2023. Splashlearn. [Online]
Available at: https://www.splashlearn.com/math-vocabulary/geometry/bar-graph#:~:text=A%20bar%20graph%20is%20the,relates%20directly%20to%20its%20value.
Vallat, R., 2019. Correlation(s) in Python. [Online]
Available at: https://raphaelvallat.com/correlation.html
[Accessed 1 May 2023].
