
OVERVIEW OF BIG DATA

WHAT IS BIG DATA?


No doubt you’ve heard the terms ‘big data’ and ‘analytics’ being thrown around in the
media. Let’s look at what these concepts really mean.
Watch this short video (hosted on YouTube) for a quick introduction to what big data is and the possibilities that it holds:
https://www.youtube.com/watch?v=TzxmjbL-i4Y
Defining big data
‘Big data’ refers to datasets whose size is beyond the ability of typical database software
tools to capture, store, manage, and analyze.1
If you look closely at this definition, you can see that it is framed in terms of time. It uses the word 'typical', and thus refers to the current state of the art. So, what we called big data 10 years ago may not be big data now, because the 'typical' tools and technologies have changed. And what we call big data now may not be big data in 5 years.2 In the future, we may still use traditional data collection, storage, and processing systems, but most likely in conjunction with newer systems.
The V’s that characterise big data
To determine whether data is big data, we can also consider the V’s that characterise big
data. The four most commonly defined V dimensions are volume, variety, velocity, and
veracity.3
Volume
Volume refers to the quantity of data to be stored. For example, Walmart deals with big
data. They handle more than 1 million customer transactions every hour, importing more than
2.5 petabytes of data into their database. This is about 167 times the amount of information
contained in all the books in the US Library of Congress.
The following table lists the different storage capacity units. To put these in context, there
are 8,000,000,000,000,000,000,000,000 bits (that’s an eight followed by 24 zeros) in one
yottabyte.
Term        Capacity       Abbreviation
Bit         0 or 1 value   b
Byte        8 bits         B
Kilobyte    1024* bytes    KB
Megabyte    1024 KB        MB
Gigabyte    1024 MB        GB
Terabyte    1024 GB        TB
Petabyte    1024 TB        PB
Exabyte     1024 PB        EB
Zettabyte   1024 EB        ZB
Yottabyte   1024 ZB        YB
* Note that because bits are binary in nature and are the basis on which all other storage values are built, all data storage units are defined in terms of powers of 2. For example, the prefix kilo typically means 1000; however, in data storage, a kilobyte = 2^10 = 1024 bytes.2 (Table 14.1, Storage Capacity Units; p. 651)
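As a rough illustration of these powers of 2, here is a minimal Python sketch (not part of the original article) that converts a raw byte count into the largest whole unit from the table above; the example figure simply reuses the 2.5 petabytes mentioned in the Walmart example.

# A minimal sketch: express a byte count using the binary units from the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Convert a byte count using powers of 2 (1 KB = 1024 bytes, and so on)."""
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} {UNITS[-1]}"

# Walmart's hourly import in the example above is roughly 2.5 petabytes:
print(human_readable(2.5 * 1024**5))  # prints "2.50 PB"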
To manage big volumes of data, we have two options for handling additional load.2
• Scale up, meaning we keep the same number of systems to store and process data, but migrate each system to a larger system.
• Scale out, meaning we increase the number of systems, but do not migrate to larger systems.
Velocity
Velocity refers to the speed at which data is entered into a system and must be processed.
For example, Amazon captures every click of the mouse while shoppers are browsing on its
website.2 This happens rapidly.
Velocity is important in stream processing. Think of all the data from radio-frequency identification (RFID), global positioning system (GPS), near-field communication (NFC), and Bluetooth sensors flooding into a system. Stream processing aims to aggregate individual data points from high-velocity data in order to trigger a high-level event when a certain pattern is detected. It also focuses on deciding which data to keep from a stream, since it is infeasible to retain all the data that is rushing in.
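To make the stream-processing idea more concrete, here is a minimal Python sketch. It is an illustration only: the sensor readings, the 30-degree threshold, the window size, and the trigger_alert function are hypothetical, not taken from the article.

from collections import deque

# Hypothetical high-velocity stream of (sensor_id, temperature) readings.
def sensor_stream():
    readings = [("rfid-1", 21.5), ("rfid-1", 31.2), ("rfid-1", 32.8),
                ("rfid-1", 33.1), ("rfid-1", 22.0)]
    for reading in readings:
        yield reading

def trigger_alert(sensor_id, window):
    # Stand-in for a "high-level event"; in practice this might publish a message.
    avg = sum(window) / len(window)
    print(f"ALERT: {sensor_id} averaged {avg:.1f} over the last {len(window)} readings")

WINDOW_SIZE = 3      # keep only a small sliding window, not the whole stream
THRESHOLD = 30.0     # hypothetical pattern: sustained high temperature

window = deque(maxlen=WINDOW_SIZE)
for sensor_id, temperature in sensor_stream():
    window.append(temperature)
    # Aggregate individual data points and trigger an event when the pattern appears.
    if len(window) == WINDOW_SIZE and min(window) > THRESHOLD:
        trigger_alert(sensor_id, window)

Note that only the small sliding window is kept in memory, which reflects the point above about deciding which data to retain from a stream.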
Variety
Variety refers to the complexity of data formats. Big data consists of different forms of data.
For example, when a telecommunications company like Telstra records data on calls to its call
centre, this data includes both:
• structured data, which conforms to a predefined data model (e.g., your customer ID, the timestamp of your call, your service type), and
• unstructured data (e.g., the recording of the call, notes that the call centre operator makes during the call, the problem history related to your call).
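As a rough sketch of this mix, the following Python dictionary shows what a single, entirely made-up call-centre record might look like; the field names are illustrative and not Telstra's actual schema.

from datetime import datetime

call_record = {
    # Structured data: conforms to a predefined data model.
    "customer_id": 10482,
    "call_timestamp": datetime(2018, 10, 24, 9, 15),
    "service_type": "mobile",
    # Unstructured data: free-form content with no fixed schema.
    "call_recording_uri": "recordings/2018-10-24/10482.wav",
    "operator_notes": "Customer reports intermittent dropouts; follow up next week.",
    "problem_history": [
        "2018-09-30: slow data speeds reported",
        "2018-10-10: modem replaced",
    ],
}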
Veracity
Veracity refers to the trustworthiness of data. The more data is collected and analysed automatically, rather than captured and checked in its entirety (due to the high volume and velocity), the higher the uncertainty about its accuracy. For example, it is particularly challenging to verify
the truthfulness of posts on social media platforms, as we do not always know the posters’
backgrounds and their intentions. In fact, detecting fake reviews, fake news, and fake friends is
currently an active research area.
The four V’s as an infographic
The IBM Big Data & Analytics Hub provides an infographic which explains and gives
examples of each of the four V’s.

Other V’s
Further V's that are often mentioned as key characteristics of big data are:
• value: how meaningful the data is
• visualisation: graphical representations to assist humans in understanding big data.
Hopefully, you now have an idea of what big data is. In the next step we will discuss where
all the data is coming from.
Your task
How would you define big data?
Share your thoughts in the comments.
References
1. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. Big data: The next frontier for innovation, competition, and productivity [Internet]. McKinsey Global Institute; 2011 [cited 2018 Oct 24]. 143 p. Available from: https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
2. Coronel C, Morris S. Database systems: Design, implementation, and management. 12th ed. Boston (MA): Cengage Learning; 2016.
3. Elmasri R, Navathe SB. Fundamentals of database systems. 7th ed. Pearson; 2017.

WHERE DOES THE DATA COME FROM?


Data is being generated everywhere. Businesses, health care providers, governments, and educational institutions (just to name a few) are all collecting huge amounts of data. Nonetheless, the majority of big data comes from the world wide web.
This timeline gives you an overview of how the web has developed and what kinds of data
have become available on the web.
WHY IS IT CHALLENGING TO ANALYSE?
With your knowledge of the different characteristics of big data and of the many
possible sources where data can come from, you can probably imagine that it is not
straightforward to analyse big data.
Here are some of the key challenges to analysing big data:1,2
Data access
The majority of big data is used for commercial purposes to increase profits, provide better
services, or gain competitive advantage. Thus, organisations are hesitant to share their data with
outsiders. Even when organisations allow access to their data, they usually restrict access to
certain portions of the data or impose rate limits on the amount of data that can be accessed per
day or per user. This makes it difficult not only for researchers and non-profit organisations to obtain data, but also for organisations to integrate their own data with other organisations' data. However,
many countries nowadays promote ‘open data’ portals, where datasets are made available to the
public.
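As an illustration of working within such access restrictions, here is a minimal Python sketch that politely pages through a hypothetical open data API; the URL, the query parameters, and the one-request-per-second limit are all made up for the example.

import time
import requests  # widely used third-party HTTP library

BASE_URL = "https://example-open-data-portal.gov/api/records"  # hypothetical endpoint
REQUEST_INTERVAL = 1.0  # seconds between requests, to respect an assumed rate limit

def fetch_all_pages(max_pages=5):
    records = []
    for page in range(1, max_pages + 1):
        response = requests.get(BASE_URL, params={"page": page, "per_page": 100})
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break  # no more data available from the portal
        records.extend(batch)
        time.sleep(REQUEST_INTERVAL)  # stay under the portal's rate limit
    return records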
Inconsistent and incomplete data
Even though we are collecting more data than ever before, the overall quality of our data has
not increased. The percentage of incorrect or incomplete data points remains roughly the same. For example, take an electronic sensor that records an incorrect reading once in every 1,000 readings. If the number of readings grows tenfold, from 1,000 to 10,000, the expected number of incorrect readings also grows tenfold, from one to about ten. Written text on the web will also always include spelling mistakes, and as the amount of text posted increases, it may even contain a higher percentage of mistakes. Therefore,
data cleaning becomes an important task for big data analytics.
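Here is a minimal data-cleaning sketch using pandas (a common Python library). It assumes a made-up temperature sensor whose plausible range is 0-50 degrees, and simply drops missing values and out-of-range readings such as the occasional faulty one described above.

import pandas as pd

# Made-up sensor readings: one missing value and one implausible reading.
readings = pd.DataFrame({
    "timestamp": pd.date_range("2018-10-24 09:00", periods=6, freq="min"),
    "temperature": [21.3, 21.4, None, 21.6, 999.0, 21.5],
})

# Drop incomplete rows, then keep only physically plausible values (assumed 0-50 degrees).
clean = readings.dropna(subset=["temperature"])
clean = clean[clean["temperature"].between(0, 50)]

print(f"Kept {len(clean)} of {len(readings)} readings")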
Heterogeneity of data
Heterogeneity of data refers to how much the data differs across the dataset we are looking
at. This can include differences in data format, number of missing values, level of detail, or
length of time period for which data is available.
Heterogeneity is a particular issue when we bring together data from unconnected sources.
For example, it may be useful to connect population data from government sources with data from environmental sensors to inform a drinking water management plan for a city. The data from these different sources will need to be carefully matched to ensure valid
analysis results.
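To illustrate this matching problem, here is a small pandas sketch that joins two made-up datasets, one with population figures per suburb and one with water-quality sensor readings. The suburb names, figures, and column names are invented; the point is that the heterogeneous join keys have to be reconciled before the sources can be combined.

import pandas as pd

# Made-up population data from a "government" source.
population = pd.DataFrame({
    "suburb": ["Carlton", "Fitzroy", "Brunswick"],
    "residents": [18000, 10500, 24000],
})

# Made-up sensor data that spells the join key differently.
sensors = pd.DataFrame({
    "SUBURB_NAME": ["CARLTON", "FITZROY", "BRUNSWICK"],
    "avg_turbidity_ntu": [0.4, 0.7, 0.5],
})

# Reconcile the heterogeneous keys before joining the two sources.
sensors["suburb"] = sensors["SUBURB_NAME"].str.title()
combined = population.merge(sensors[["suburb", "avg_turbidity_ntu"]], on="suburb", how="inner")
print(combined)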
Data privacy and protection
More and more data is stored about personal interests, behaviours, and attitudes. While
consumers often trade their personal data for a product customised to their liking, their privacy
needs to be protected by clear policies. In addition, the results of analysing personal data,
perhaps from multiple sources, may be more sensitive than all the individual parts. As Aristotle said: 'The whole is greater than the sum of its parts'.3
Data privacy and protection are not just important for individuals. Organisations also need to have their data and intellectual property protected by policies and laws.
In the next step we will look at an overview of the data analytics cycle used to solve a
problem or get new insights based on data.
Your task
Can you actually find data from any of the big data sources you (or other learners) identified
in the previous step?
Share your thoughts in the comments.
References

platforms. AI & Society 2015; 30(1): 89-116. ↩


1. Batrinca B, Treleaven PC. Social media analytics: a survey of techniques, tools and

2. EMC Education Services. Data Science and Big Data Analytics. Wiley; 2015. ↩
Goodreads. Aristotle > quotes > quotable quote [Internet]. Available
from: https://www.goodreads.com/quotes/20103-the-whole-is-greater-than-the-sum-of-its-parts
