Data Visualisation
ON
“DATA VISUALISATION”
SUBMITTED TO:
ASSISTANT PROFESSOR
UNISON UNIVERSITY
DEHRADUN
SUBMITTED BY:
I have pleasure in certifying that Ms. is a bonafide student of the 5th Semester of the B.Com
(H) Degree (Batch 2021-24) of IMS Unison University, Dehradun, Roll No. IUU21BCO052.
He/She has completed his/her project work entitled “DATA VISUALISATION” under my
guidance.
I certify that this is his/her original effort and has not been copied from any other source. This
project has also not been submitted to any other Institute/University for the purpose of
award of any degree.
This project fulfils the requirement of the curriculum prescribed by this Institute for the said
course. I recommend this project work for evaluation and consideration for the award of a
degree to the student.
Signature : ……………………………………
Name of the Guide : DR SAURABH SINGH
Designation : Assistant Professor, School of Management
Date : ……………………………………
DECLARATION
I humbly declare that this report entitled “DATA VISUALISATION”,
submitted in partial fulfilment of the requirement for the degree of
B.Com (H) at IMS Unison University, Dehradun, is based on original
work carried out by me, and no part of it has been presented or published
previously for any higher degree/diploma.
It is also declared that this report has been prepared for academic purposes
alone and has not been/will not be submitted elsewhere for any other
purpose.
Date:
Signature
Name:
ACKNOWLEDGEMENT
Lastly, I would like to extend my deep gratitude to my family and all of my
friends for their cooperation, inspiration, guidance and support during all stages of
the preparation of this report, and for helping me overcome the difficulties I faced
during my project work.
1. TITLE PAGE
4. DECLARATION
5. ACKNOWLEDGEMENT
6. EXECUTIVE SUMMARY
8. INTRODUCTION
9. RESEARCH METHODOLOGY
11. FINDINGS
12. CONCLUSION
13. SUGGESTIONS
1. Estimating parameters:
This means taking a statistic from the sample data (for example, the
sample mean) and using it to infer a population parameter
(i.e. the population mean). There may be sampling variations
because of chance fluctuations, variations in sampling techniques,
and other sampling errors, and estimates of population
characteristics may be influenced by such factors. Therefore, the
important point in estimation is the extent to which our estimate is
close to the true value.
Characteristics of Good Estimator: A good statistical estimator
should have the following characteristics: (i) Unbiased (ii)
Consistent (iii) Accuracy
i) Unbiased
An unbiased estimator is one for which, if we were to obtain an
infinite number of random samples of a certain size, the mean of the
statistic would be equal to the parameter. The sample mean (x̄) is
an unbiased estimate of the population mean (μ) because, across all
possible random samples of size N from a population, the mean of
the sample means equals μ.
ii) Consistent
A consistent estimator is one for which, as the sample size increases, the
probability that the estimate has a value close to the parameter also
increases. Because the sample mean is a consistent estimator, a sample mean
based on 20 scores has a greater probability of being close to μ than
does a sample mean based upon only 5 scores.
iii) Accuracy
The sample mean is an unbiased and consistent estimator of the
population mean (μ). But we should not overlook the fact that an
estimate is just a rough or approximate calculation. In any estimate,
it is unlikely that x̄ will be exactly equal to the population mean (μ).
Whether or not x̄ is a good estimate of μ depends upon the
representativeness of the sample, the sample size, and the variability of
scores in the population.
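Both unbiasedness and consistency can be illustrated with a small simulation. The sketch below uses the standard library only; the population mean and standard deviation are invented for illustration:

```python
import random
from statistics import mean

random.seed(42)
mu, sigma = 50.0, 10.0   # assumed population mean and standard deviation

def sample_mean(n):
    """Draw one random sample of size n and return its mean."""
    return mean(random.gauss(mu, sigma) for _ in range(n))

# Unbiasedness: the average of many sample means is close to mu
means_20 = [sample_mean(20) for _ in range(5_000)]
print(round(mean(means_20)))   # close to 50

# Consistency: larger samples give estimates closer to mu on average
err_5 = mean(abs(sample_mean(5) - mu) for _ in range(5_000))
err_20 = mean(abs(sample_mean(20) - mu) for _ in range(5_000))
print(err_20 < err_5)          # True: n=20 means are typically closer to mu
```
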
Confidence level:
The confidence level refers to the probability that a parameter lies
within a specified range of values; it is denoted by c. Moreover,
the confidence level is connected with the level of significance: the
relationship between the level of significance and the confidence level
is c = 1 − α. The common levels of significance and the corresponding
confidence levels are given below:
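For illustration, the relationship c = 1 − α and a typical 95% confidence interval for a sample mean can be computed directly. The sample statistics below are invented, and the z value of 1.96 is the standard value for 95% confidence:

```python
import math

alpha = 0.05
c = 1 - alpha            # confidence level
print(c)                 # 0.95

# Illustrative 95% confidence interval for a sample mean (known sigma)
x_bar, sigma, n = 50.0, 10.0, 25   # assumed sample statistics
z = 1.96                           # z value for 95% confidence
margin = z * sigma / math.sqrt(n)
print((round(x_bar - margin, 2), round(x_bar + margin, 2)))  # (46.08, 53.92)
```
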
Rejection region:
The rejection region is the values of test statistic for which the null hypothesis is
rejected.
There are many tests in this field, of which some of the most important are
mentioned below.
1. Linear Regression Analysis
In this test, a linear model is used to understand the relationship
between variables in the data set. One of those variables is
the dependent variable, while there can be one or more independent
variables. In simpler terms, we try to predict the value of the
dependent variable based on the available values of the independent
variables. This is usually represented using a scatter plot,
although other types of graphs can also be used.
2. Analysis of Variance
This is another statistical method which is extremely popular in
data science. It is used to test and analyse the differences between
two or more means from a data set; the significant differences
between the means are obtained using this test.
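The F statistic behind a one-way analysis of variance can be computed by hand. The sketch below is a pure-Python implementation with invented group scores:

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Compute the F statistic for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)                      # number of groups
    n = len(all_values)                  # total observations
    # Between-group sum of squares (weighted by group size)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical test scores for three teaching methods
f = one_way_anova_f([85, 86, 88], [91, 92, 93], [79, 78, 80])
print(round(f, 1))   # 88.2: a large F suggests the group means differ
```
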
3. Analysis of Co-variance
This is an extension of the Analysis of Variance method that
involves the inclusion of a continuous covariate in the
calculations. A covariate is a continuous independent variable
used as a regression variable. This method is used
extensively in statistical modelling to study the differences
between the average values of dependent variables.
5. Correlation Analysis
Another extremely useful test, this is used to understand the extent
to which two variables are dependent on each other. The strength of
any relationship between the two variables, if one exists, can be
obtained from this test: you will be able to tell whether the
variables have a strong correlation or a weak one. The correlation
can also be negative or positive, depending upon the variables. A
negative correlation means that the value of one variable decreases
while the value of the other increases; a positive correlation
means that the values of both variables decrease or increase
together.
Random Variables
Sampling Distribution
A sampling distribution is a probability distribution of a statistic. It
is obtained through a large number of samples drawn from a
specific population. It is the distribution of all possible values taken
by the statistic when all possible samples of a fixed size n are taken
from the population.
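A sampling distribution can be simulated directly: draw many samples of a fixed size n from a population and collect the sample means. This is a standard-library sketch with an invented population:

```python
import random
from statistics import mean, stdev

random.seed(1)
# An assumed population of 10,000 values
population = [random.gauss(100, 15) for _ in range(10_000)]

# Draw many samples of fixed size n and record each sample mean
n = 30
sample_means = [mean(random.sample(population, n)) for _ in range(2_000)]

# The sampling distribution of the mean centres on the population mean,
# and its spread (the standard error) is smaller than the population spread
print(round(mean(sample_means)))                 # close to 100
print(stdev(sample_means) < stdev(population))   # True
```
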
Click Create. You can start adding resources if your project is empty or begin working with
the resources you imported.
From your project’s Assets page, click Add to project > Data, or click the Find and add data
icon. You can also click the Find and add data icon from within a notebook or canvas.
In the Load pane that opens, browse for the files or drag them onto the pane. You must stay
on the page until the load is complete. You can cancel an ongoing load process if you want to
stop loading a file.
Case Study:
Let us take the Iris Data set to see how we can visualize the data in Watson studio.
Adding Data to Data Refinery
Visualizing information in graphical ways can give you insights into your data. By enabling
you to look at and explore data from different perspectives, visualizations can help you
identify patterns, connections, and relationships within that data as well as understand large
amounts of information very quickly. You can also visualize your data with these same charts
in an SPSS Modeler flow. Right-click a node and select Profile.
1. Click any of the available charts. Then add columns in the DETAILS panel that opens on the
left side of the page.
2. Select the columns that you want to work with. Suggested charts will be indicated with a
dot next to the chart name. Click a chart to visualize your data.
Click Refine.
Adyay Technology pvt. Ltd
ADHYAY EQUI PREF PVT. LTD.
As on: 2024-06-12
Adhyay Equi Pref Pvt. Ltd. is a private company incorporated on 06 December 1994. It is
classified as a non-government company and is registered at the Registrar of Companies,
Kolkata. Its authorized share capital is Rs. 71,700,000 and its paid-up capital is Rs.
71,637,620. Its NIC code is 741 (which is part of its CIN). As per the NIC code, it is involved
in legal, accounting, book-keeping and auditing activities; tax consultancy; market research
and public opinion polling; business and management consultancy.
Adhyay Equi Pref's Annual General Meeting (AGM) was last held on N/A and, as per
records from the Ministry of Corporate Affairs (MCA), its balance sheet was last filed on 31
March 2015.
The directors of Adhyay Equi Pref are MADHAB CHANDRA DAW and JAYATI
MAJUMDER.
Basic Information
CIN: U74140WB1994PTC066348
Name: ADHYAY EQUI PREF PVT. LTD.
Company Status: Strike Off
Activity: NIC Code 741; NIC Description: Legal, accounting, book-keeping and auditing activities
Number of Members: 0
Key Numbers
UNIT – III
Introduction to Anaconda -
Anaconda Installation -
Choose the Python 3.x graphical installer (A) or a Python 2.x graphical installer.
After that click on next.
Click Finish.
Installing Anaconda Distribution will also include Jupyter Notebook.
To access Jupyter Notebook, open the Anaconda Prompt and run the command below.
A statement or expression is an instruction the computer will run or
execute. Perhaps the simplest program you can write is a print
statement. When you run the print statement, Python will simply
display the value in the parentheses. The value in the parentheses is
called the argument.
If you are using a Jupyter notebook, you will see a small rectangle
containing the statement. This is called a cell. If you select this cell with
your mouse and then click the run cell button, the statement will
execute and the result will be displayed beneath the cell.
It is customary to comment your code; this tells other people what
your code does. You simply put a hash symbol before your
comment. When you run the code, Python will ignore the comment.
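Putting these ideas together in a minimal example:

```python
# This is a comment: Python ignores everything after the hash symbol
message = "Hello, Python 101!"
print(message)        # displays the value of the argument

total = 3 + 2         # an expression can be assigned to a variable
print(total)          # displays 5
```
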
Data Types
The following chart summarizes three data types from the last
examples. The first column indicates the expression; the second
column indicates the data type. We can see the actual data type in
Python by using the type command. We can have int, which stands
for an integer, and float, which essentially represents a real
number. The type string is a sequence of characters.
When casting, be careful. For example, if you cast the float 1.1 to the
integer 1, you will lose some information. If a string contains an integer
value, you can convert it to int. If we convert a string that contains a
non-integer value, we get an error. You can convert an int to a string or
a float to a string.
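A short session illustrating these types and casts:

```python
# type() reports the data type of an expression
print(type(11))       # <class 'int'>
print(type(2.14))     # <class 'float'>
print(type("hello"))  # <class 'str'>

# Casting changes the type, but can lose information
print(int(1.1))       # 1: the fractional part is discarded
print(int("25"))      # 25: a string containing an integer converts cleanly
print(str(3.14))      # the string '3.14': numbers convert without loss
# int("1.5") would raise a ValueError: the string is not an integer
```
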
Basics of Data Visualization
Before jumping into the term “Data Visualization”, let’s have a brief discussion
on the term “Data Science” because these two terms are interrelated.
In simple terms, “Data Science is the science of analyzing raw data using
statistics and machine learning techniques with the purpose of drawing
conclusions about that information”.
In simple words, a pipeline in data science is “a set of actions which changes
the raw (and confusing) data from various sources (surveys, feedback, list of
purchases, votes, etc.), to an understandable format so that we can store it and
use it for analysis.”
The raw data undergoes different stages within a pipeline, which are:
1. Fetching/Obtaining the Data
2. Scrubbing/Cleaning the Data
3. Data Visualization
4. Modeling the Data
5. Interpreting the Data
6. Revision
Data visualization is critical to market research, where both numerical and
categorical data can be visualized; this increases the impact of
insights and reduces the risk of analysis paralysis. Data
visualization is categorized into the following categories:
Data visualization and big data
The increased popularity of big data and data analysis projects has made
visualization more important than ever. Companies are increasingly using
machine learning to gather massive amounts of data that can be difficult and slow
to sort through, comprehend, and explain. Visualization offers a means to speed
this up and present information to business owners and stakeholders in ways they
can understand.
Big data visualization often goes beyond the typical techniques used in
normal visualization, such as pie charts, histograms and corporate graphs. It
instead uses more complex representations, such as heat maps and fever
charts. Big data visualization requires powerful computer systems to collect
raw data, process it, and turn it into graphical representations that humans can
use to quickly draw insights.
While big data visualization can be beneficial, it can pose several
disadvantages to organizations. They are as follows:
· To get the most out of big data visualization tools, a
visualization specialist must be hired. This specialist must be able to
identify the best data sets and visualization styles to guarantee
organizations are optimizing the use of their data.
· Big data visualization projects often require involvement
from IT, as well as management, since the visualization of big data
requires powerful computer hardware, efficient storage systems and
even a move to the cloud.
· The insights provided by big data visualization will only be as
accurate as the information being visualized. Therefore, it is essential to
have people and processes in place to govern and control the quality of
corporate data, metadata, and data sources.
In the early days of visualization, the most common visualization technique was
using a Microsoft Excel spreadsheet to transform the information into a table, bar
graph or pie chart. While these visualization methods are still commonly used,
more intricate techniques are now available, including the following:
· infographics
· bubble clouds
· bullet graphs
· heat maps
· fever charts
· time series charts
Some other popular techniques are as follows:
Line charts. This is one of the most basic and common techniques used. Line
charts display how variables can change over time.
Area charts. This visualization method is a variation of a line chart; it displays
multiple values in a time series -- or a sequence of data collected at consecutive,
equally spaced points in time.
Scatter plots. This technique displays the relationship between two variables. A
scatter plot takes the form of an x- and y-axis with dots to represent data points.
Treemaps. This method shows hierarchical data in a nested format. The size of the
rectangles used for each category is proportional to its percentage of the
whole. Treemaps are best used when multiple categories are present, and the
goal is to compare different parts of a whole.
Population pyramids. This technique uses a stacked bar graph to display the
complex social narrative of a population. It is best used when trying to display
the distribution of a population.
Sales and marketing. Research from market and consumer data provider Statista
estimated that $566 billion was spent on digital advertising in 2022 and that the
number will cross the $700 billion mark by 2025. Marketing teams must pay close
attention to their sources of web traffic and how their web properties generate
revenue. Data visualization makes it easy to see how marketing efforts affect
traffic trends over time.
Politics. A common use of data visualization in politics is a geographic map
that displays the party each state or district voted for.
Logistics. Shipping companies can use visualization tools to determine the best
global shipping routes.
Data scientists and researchers. Visualizations built by data scientists are typically
for the scientist's own use, or for presenting the information to a select
audience. The visual representations are built using visualization libraries of
the chosen programming languages and tools. Data scientists and researchers
frequently use open source programming languages -- such as Python -- or
proprietary tools designed for complex data analysis. The data visualization
performed by these data scientists and researchers helps them understand data
sets and identify patterns and trends that would have otherwise gone
unnoticed.
Data visualization tools can be used in a variety of ways. The most common
use today is as a business intelligence (BI) reporting tool. Users can set up
visualization
tools to generate automatic dashboards that track company performance
across key performance indicators (KPIs) and visually interpret the results.
The generated images may also include interactive capabilities, enabling users
to manipulate them or look more closely into the data for questioning and
analysis. Indicators designed to alert users when data has been updated or
when predefined conditions occur can also be integrated.
As data visualization vendors extend the functionality of these tools, they are
increasingly being used as front ends for more sophisticated big data
environments. In this setting, data visualization software helps data engineers and
scientists keep track of data sources and do basic exploratory analysis of data
sets prior to or after more detailed advanced analyses.
The biggest names in the big data tools marketplace include Microsoft, IBM, SAP
and SAS. Some other vendors offer specialized big data visualization software;
popular names in this market include Tableau, Qlik and Tibco.
Other popular data visualization tools include:
· D3.js
· Jupyter
· MicroStrategy
· Google Charts
While there are many advantages, some of the disadvantages may seem less
obvious. For example, when viewing a visualization with many different data
points, it is easy to make an inaccurate assumption. Or sometimes the
visualization is simply designed badly, making it biased or confusing.
Data Dimension:
Modality:
Once you have sentiment analysis results, you can create various visualizations to
convey insights:
● Pie Chart: Create a pie chart to show the distribution of
sentiments (positive, negative, neutral) in your textual data.
● Bar Chart: Use a bar chart to display the frequency of each
sentiment category. This can provide a quick overview of
sentiment distribution.
● Line Chart or Time Series Plot: If you have sentiment data over
time (e.g., daily sentiment trends), use a line chart to visualize
sentiment fluctuations.
● Word Clouds: Generate word clouds for positive and negative
sentiments to highlight frequently occurring terms in each
category.
● Heatmap: Create a heatmap that shows sentiment scores for
different topics or entities. Rows represent topics, and columns
represent sentiment scores.
● Scatter Plot: If you want to explore the relationship between
sentiment and other variables (e.g., sentiment vs. product
ratings), use a scatter plot.
● Stacked Area Chart: Visualize sentiment changes over time
using a stacked area chart, where each sentiment category is
represented by a different color.
● Geospatial Visualization: If your data includes geographic
information, use maps to visualize sentiment variations across
different regions.
● Sentiment Flow Diagram: Display sentiment transitions within a
text (e.g., positive to negative) using a flow diagram.
● Interactive Dashboards: Create interactive dashboards that allow
users to explore sentiment analysis results dynamically, filtering
by various attributes like time, source, or sentiment category.
● Comparison Plots: Compare sentiment across different sources,
products, or categories using side-by-side visualizations.
● Emotion Analysis: If you perform emotion analysis as part of
sentiment analysis, visualize emotional tones (e.g., joy, anger,
sadness) using color- coded charts or radial diagrams.
Visualization of sentiment analysis results not only aids in understanding the
overall sentiment but also helps identify trends, anomalies, and actionable insights
in large volumes of textual data. Interactive and dynamic visualizations allow
users to drill down into specific aspects of the data, making it easier to make data-
driven decisions based on sentiment.
1. Data Collection:
2. Data Preprocessing:
Clean and preprocess the textual data to prepare it for sentiment analysis:
● Remove special characters, punctuation, and irrelevant symbols.
● Tokenize the text into words or phrases.
● Convert the text to lowercase for consistency.
● Remove stopwords (common words like "the," "and," etc.).
● Address issues like misspellings and abbreviations.
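The preprocessing steps above can be sketched with Python's standard library. The stopword list here is a tiny illustrative subset, not a complete one:

```python
import re

STOPWORDS = {"the", "and", "a", "is", "to", "of"}  # tiny illustrative list

def preprocess(text):
    text = text.lower()                               # lowercase for consistency
    text = re.sub(r"[^a-z0-9\s]", " ", text)          # drop punctuation/symbols
    tokens = text.split()                             # tokenize into words
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords

print(preprocess("The product is GREAT, and the price is fair!!"))
# ['product', 'great', 'price', 'fair']
```
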
3. Sentiment Analysis:
4. Visualization:
5. Extracting Insights:
6. Continuous Monitoring:
The study revealed significant differences between the two treatment groups, with
interferential therapy (IFT) showing a marked improvement in reducing pain and increasing
the range of motion (ROM) compared to ultrasound therapy (UST). Data analysis indicated
that patients in the IFT group experienced a 20% greater reduction in pain levels, as measured
by the Numerical Pain Rating Scale, and a 15% higher improvement in ROM, assessed with a
digital inclinometer. Statistical analysis revealed a p-value of 0.03, confirming the results
were statistically significant. Additionally, the IFT group reported a quicker recovery time
and fewer instances of muscle stiffness, suggesting the therapy may be more effective in
managing myofascial pain syndrome of the upper trapezius. These findings highlight the
potential benefits of incorporating interferential therapy over ultrasound therapy in treating
myofascial pain.
Conclusion:
QUESTIONNAIRE
1. Bar Chart (Multiple bars for each question, each representing an option):
o Description: Each question will have a cluster of bars (one for each option A,
B, C, D). This is useful for comparing the number of respondents for each
option for every question.
o X-Axis: Questions (Q1 to Q10)
o Y-Axis: Number of respondents
o Legend: Each option (A, B, C, D) in a different color
2. Stacked Bar Chart (Each bar represents a question, and the sections represent
options A, B, C, D):
o Description: For each question, the total bar represents the number of
respondents, stacked according to the percentage who selected each option.
o X-Axis: Questions (Q1 to Q10)
o Y-Axis: Number of respondents
o Legend: Options A, B, C, D, represented by different colors
3. Pie Chart (Individual) for each question:
o Description: A separate pie chart for each question showing the proportion of
responses for each option (A, B, C, D).
o Legend: Options A, B, C, D
4. Heat Map (Color-coded representation of the data):
o Description: Each cell in the heat map will represent the number of
respondents for a particular option (A, B, C, D) for a question. The darker the
color, the higher the number of responses.
o Rows: Questions (Q1 to Q10)
o Columns: Options A, B, C, D
o Color Scale: From light (low count) to dark (high count)
How the Visualization Looks:
Bar Chart: Each question (Q1-Q10) will have four bars next to each other for options
A, B, C, and D, showing how many people chose each option.
Stacked Bar Chart: The bar for Q1 might be divided into sections based on how
many people chose Option A, B, C, or D, stacked one on top of the other.
Pie Chart: Each question will have a circular pie showing the breakdown of A, B, C,
D.
Heat Map: The values are color-coded in a table where darker shades show higher
responses.