
BIG DATA

DR. MOHD KHAIRUL BAZLI BIN MOHD AZIZ


CENTRE FOR MATHEMATICAL SCIENCES, UNIVERSITI MALAYSIA PAHANG
CHAPTER 2: BIG
DATA
2.1 Introduction to Big Data
2.2 Sources of Big Data
2.3 The V's of Big Data
2.4 Big Data Process - Data
Management and Analytics
2.1 INTRODUCTION TO BIG DATA

Big Data emerged as the next big thing in the IT world during the
first decade of the 21st century.
Google, eBay, LinkedIn and Facebook were among the first online-based
organizations built around Big Data.
New information technology such as Big Data brings benefits such as
dramatic cost reduction, innovation of new products and services, and
substantial improvement in the performance of computing tasks (e.g.
supercomputers).
DEFINITION OF BIG DATA

Big Data is
a huge volume of data that can no longer be handled by the
existing machines or methods.
a massive volume of both structured and unstructured data
that is difficult/complex to process using traditional
databases and techniques.
extremely large data sets that need to be analysed
computationally to reveal patterns, trends and relationships,
especially relating to human behaviour and interactions.
a term used to describe a large volume of structured, semi-
structured and unstructured data that needs to be mined for
information and used in advanced analytics applications.
BIG DATA IS EVERYWHERE

Massive amounts of data are being collected, generated and
warehoused across various domains, such as
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Telecommunications
Social networks, etc.
INTERNET LIVE WEBSITE

https://www.internetlivestats.com/
https://www.webfx.com/internet-real-time/
https://www.betfy.co.uk/internet-real-time/
https://visual.ly/community/infographic/how/internet-real-time
DATA GROWTH

Due to advances in digital technology, an abundance of data is
generated endlessly every second, and the growth rate is
exponential: an explosion of the digital footprint.
DATA GROWTH

Data grew from 0.7 zettabytes in 2009 to an estimated 35 zettabytes
in 2020, roughly a 500% increase over 2015. Note that
1 zettabyte = 10^21 bytes and 1 yottabyte = 10^24 bytes.
 Sensors used in shopping malls to gather shoppers’
information
 Posts on social media platforms
 Digital pictures and videos captured in our phones
 Purchase transactions made through e-commerce
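The unit conversions and growth figures above can be checked with a little arithmetic. A minimal sketch (the 0.7 ZB and 35 ZB figures are the estimates quoted in these slides):

```python
# Unit definitions from the slide: 1 ZB = 10**21 bytes, 1 YB = 10**24 bytes
ZETTABYTE = 10 ** 21
YOTTABYTE = 10 ** 24

data_2009_zb = 0.7   # estimated global data volume in 2009, in zettabytes
data_2020_zb = 35.0  # estimated global data volume in 2020, in zettabytes

# Growth factor between 2009 and 2020
growth_factor = data_2020_zb / data_2009_zb
print(round(growth_factor))    # 50 -> the "50 times" quoted later in this chapter
print(YOTTABYTE // ZETTABYTE)  # 1000 -> 1 yottabyte = 1000 zettabytes
```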
2.2 SOURCES OF BIG DATA

Big Data may come from

Human Generated
emails/blogs/reviews/pictures/scientific
research/medical records etc.
social media: Facebook, LinkedIn, Contacts, Twitter
etc.

Machine Generated
Application server logs (e.g. websites, games,
internet).
Sensor data (e.g. weather, atmospheric science,
astronomy, smart grids).
Images/videos (e.g. traffic, security cameras, military
surveillance).
MACHINE GENERATED - IOT

 Wikipedia:
The Internet of Things (IoT) is a system of interrelated computing devices, mechanical and digital machines, objects,
animals or people that are provided with unique identifiers (UIDs) and the ability to transfer data over a network
without requiring human-to-human or human-to-computer interaction.

https://www.techopedia.com:
The internet of things (IoT) is a computing concept that describes the idea of everyday physical objects being
connected to the internet and being able to identify themselves to other devices. The term is closely identified with
RFID as the method of communication, although it also may include other sensor technologies, wireless technologies
or QR codes.
MACHINE
GENERATED - IOT
MACHINE
GENERATED
Benefits of IoT
A SINGLE VIEW TO
THE CUSTOMER
A SINGLE VIEW TO
THE CUSTOMER
2.3 THE V'S OF BIG DATA
The characteristics of Big Data can be
defined in terms of
Volume
Velocity
Variety
Variability
Veracity
Validity
Volatility
Vulnerability
Visualization
Value
Virality
Viscosity
Venue
Vocabulary
Vagueness
Verbosity
Voluntariness
Versatility
VOLUME OF BIG DATA

Volume of Big Data is how much data we have. It was once
measured in gigabytes and is now measured in zettabytes (ZB) or
even yottabytes (YB).
Data volume was 0.7 zettabytes in 2009 and was expected to reach
35 zettabytes in 2020 (a 50-fold increase), and it will keep
increasing exponentially.
90% of all data today was generated in the last few years,
especially with IoT technology.
Forbes (2019) reports that every minute, users watch 4.15
million YouTube videos, send 456,000 tweets on Twitter, post
46,740 photos on Instagram, and post 510,000 comments and update
293,000 statuses on Facebook.

Note: Forbes is a global media company, focusing on business,


investing, technology, entrepreneurship, leadership, and lifestyle.
VELOCITY OF BIG DATA

Velocity is the speed at which data is
accumulated/accessed/generated.
Clickstreams capture user behaviour at millions
of events per second.
High-frequency stock trading algorithms reflect
market changes within microseconds.
Machine-to-machine processes exchange data
between billions of devices.
Infrastructure and sensors generate massive log data
in real time.
Online gaming systems support millions of concurrent
users, each producing multiple inputs per second.
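One way to make velocity concrete is to measure an event rate over a sliding time window. Below is a minimal sketch; the class name, the one-second window and the simulated click timestamps are all invented for illustration:

```python
from collections import deque

class EventRateMeter:
    """Counts events inside a sliding time window to estimate velocity."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self, t):
        """Record an event that occurred at time t (in seconds)."""
        self.timestamps.append(t)
        # Drop events that have fallen out of the window
        while self.timestamps and self.timestamps[0] <= t - self.window:
            self.timestamps.popleft()

    def rate(self):
        """Events per second over the current window."""
        return len(self.timestamps) / self.window

meter = EventRateMeter(window_seconds=1.0)
for t in [0.0, 0.2, 0.3, 0.9, 1.05]:  # five simulated click events
    meter.record(t)
print(meter.rate())  # the event at t=0.0 has left the window -> 4.0
```

A real clickstream pipeline would do this at far larger scale, but the sliding-window idea is the same.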
VARIETY OF BIG DATA
Variety of Big Data refers to data of different formats which
come from different sources. Big Data has a variety of
data formats, as follows:
Structured: organized data format with a fixed schema.
E.g. a relational database (RDB) is a collective set of
multiple data sets organized into tables, records and columns.
RDBs establish well-defined relationships.
Semi-structured: partially organized data without a fixed
format. E.g. Extensible Markup Language (XML) is used to
describe data. The XML standard is a flexible way to create
information formats and electronically share structured
data via the public Internet, as well as via corporate
networks.
Unstructured: unorganized data with an unknown format.
E.g. audio, video.
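The three formats can be illustrated with Python's standard library; a sketch in which the record values and tag names are invented:

```python
import xml.etree.ElementTree as ET

# Structured: fixed schema, like a row in a relational table
row = {"id": 1, "name": "Alice", "amount": 42.50}

# Semi-structured: XML carries its own (flexible) structure in its tags
xml_doc = "<order><id>1</id><customer>Alice</customer></order>"
root = ET.fromstring(xml_doc)
print(root.find("customer").text)  # Alice

# Unstructured: raw bytes with no schema a traditional database understands
audio_clip = b"\x52\x49\x46\x46..."  # e.g. the first bytes of a WAV file
```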
VARIABILITY OF BIG DATA

Variability in big data refers to inconsistencies in
the data, which impact data homogenization.
Variability also arises from multiple sources and the
inconsistent speed at which data is loaded into the database.
E.g. a coffee shop may offer 6 different blends of
coffee, but if you get the same blend every day
and it tastes different every day, that is variability.
VERACITY OF BIG DATA

Veracity is the confidence or trust in the data.
It concerns the reliability of the data source,
its context and how meaningful the data is based
on the analysis.
E.g. a researcher collected data a few years ago; you
might ask what methodology was used at the time and
whether the results are still applicable for current use.
VALIDITY OF BIG DATA

Validity is the data accuracy, which requires processes
to stop bad data from accumulating in the
systems.
According to Forbes, approximately 60% of a data
scientist's time is spent on data cleaning before doing
any analysis.
Hence, it is vital to adopt good data governance
practices to ensure data quality, common definitions
and metadata (data that serves to provide context or
additional information about other data).
Validity of data will enhance the accuracy of the
analysis and results, which supports correct decisions.
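A validity check can be as simple as filtering out records that fail basic rules before they enter the system. A minimal sketch; the records and the rules themselves are invented for illustration:

```python
def is_valid(record):
    """Reject records with missing fields or impossible values."""
    return (
        record.get("age") is not None and 0 <= record["age"] <= 120
        and record.get("email", "").count("@") == 1
    )

raw = [
    {"age": 34, "email": "a@example.com"},  # passes both rules
    {"age": -5, "email": "b@example.com"},  # impossible age -> discarded
    {"age": 29, "email": "not-an-email"},   # malformed email -> discarded
]

# Discard the bad data before it accumulates in the system
clean = [r for r in raw if is_valid(r)]
print(len(clean))  # 1
```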
VOLATILITY OF BIG DATA

Volatility is the tendency of data to
change over time.
Due to the volume and velocity of data, volatility
needs to be considered to ensure
availability and rapid retrieval of information
when required.
VULNERABILITY OF BIG DATA

Vulnerability refers to the susceptibility of data
to breaches or attacks.
Big Data brings new security concerns. Policies
related to data security, data ethics and data
governance need to be carefully developed and
enforced to safeguard the service provider and
customers.
Example: the Ashley Madison hack in 2015.
Example: in 2016, a hacker called Peace sold 167
million LinkedIn accounts on the dark web.
VISUALIZATION OF BIG DATA

Visualization is using charts and graphs to visualize the
data.
Visualizing Big Data poses technical challenges
due to limitations of in-memory technology, poor
scalability, functionality and response time.
Different types of graphs are needed to represent big
data. For example, for data clustering, tree maps,
sunbursts and circular networks are used.
Using effective and suitable visualization tools will
simplify the complex relationships between the
variables in a more meaningful way, and interesting data
insights can be derived.
VALUE OF BIG DATA

Value from Big Data is the most important characteristic of all.
The other characteristics of big data are meaningless
if business value is not derived from the data.
Good business value will enhance business
performance significantly, such as profit, marketability
and branding, and enables the innovation of new products.
Substantial value from Big Data also includes
understanding customer needs and targeting them
accordingly, optimizing processes and improving
business performance.
VIRALITY OF BIG DATA

Spreading speed.
Virality is defined as the rate at which data
is broadcast/spread by a user and received
by different users for their use.
VENUE OF BIG DATA

Different platforms.
Various types of data arrive from different sources via
different platforms, such as personal systems and private &
public clouds.
VOCABULARY OF BIG DATA

Data terminology.
Data terminology such as data models and data structures.
VAGUENESS OF BIG DATA

Indistinctness of meaning in the data.
Vagueness concerns confusion over what the information
actually conveys, when little or no thought has been given to
what each item might mean.
VERBOSITY OF BIG DATA
Bad data refers to information which is wrong, out of
date or incomplete.
The consequences of storing these types of information
may sometimes be dangerous.
So, it is recommended to check that the stored data is
secure, relevant, complete and trustworthy.
If a suitable technique is applied at the initial stage to decide
whether the information is useful or not, then storage
space as well as processing time can be saved.
Keeping in mind the verbose nature of big data,
'verbosity' has been identified as one of the characteristics
of big data, defined as "the redundancy of the information
available at different sources."
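Since verbosity is defined as redundancy across sources, a common first step when merging sources is de-duplication. A minimal sketch with invented records:

```python
# Records collected from two hypothetical sources, with overlap
source_a = [("u1", "Alice"), ("u2", "Bob")]
source_b = [("u2", "Bob"), ("u3", "Carol")]

# Merge and drop exact duplicates while preserving first-seen order
seen = set()
merged = []
for record in source_a + source_b:
    if record not in seen:
        seen.add(record)
        merged.append(record)

print(merged)  # [('u1', 'Alice'), ('u2', 'Bob'), ('u3', 'Carol')]
```

Dropping the duplicate ("u2", "Bob") record saves both storage space and processing time, as the slide suggests.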
VOLUNTARINESS OF BIG DATA
'Voluntariness' has been defined as one of the characteristics of
big data: "the wilful availability of big data to be used
according to the context."
Big data is a huge set of data which can be used
voluntarily by different organisations without any interference.
Big data voluntarily helps numerous enterprises.
It assists retailers by giving them knowledge of customer
preferences; urban planning through visualization of environment
modelling and traffic patterns; manufacturers by predicting
product issues to optimize productivity and improve equipment and
customer performance; energy companies in meeting energy demands
during peak times, increasing production and improving efficiency
by reducing losses; and healthcare professionals in preventing
diseases and improving patient health.
VERSATILITY OF BIG DATA

'Versatility' is one of the characteristics of big data,
defined as "the ability of big data to be
flexible enough to be used differently in different
contexts."
Big data is evolving to satisfy the needs of many
organisations, researchers and governments. It
facilitates the urban planning, environment modelling,
visualization, analysis, quality classification, environment
securing, computational analysis, biological
understanding, and designing and manufacturing processes
required by organisations, as well as cost-effective models and
elegant exploration of results.
2.4 BIG DATA PROCESS
- DATA MANAGEMENT
& ANALYTICS

Big Data is complex in nature,
hence it requires different
techniques, tools and
architectures to manage it.
The objective is to solve
problems in a better way and
mine data insights.
The growth of data indeed
requires an increase in storage
capacity, processing power and
availability of complex data.
BIG DATA
TOOLS
In today's reality, data gathered by a
company is a fundamental source of
information for any business.
Unfortunately, it is not that easy to
derive valuable insights from it.

The problems all data scientists
deal with are the amount of data and
its structure. Data has no value unless
we process it. To do so, we need big
data software that will help us in
transforming and analyzing data.
BEST BIG DATA
TOOLS IN 2020

 Apache Hadoop
 Apache Storm
 RapidMiner (open-source)
 Qubole
 Tableau
 Cassandra (open-source)
 Apache Spark (open-
source)
 Flink (open-source)
DATA MANAGEMENT

Data Management is more focused on the acquisition
and preparation of data for other purposes in the
organization.
Data management includes
building and uploading data to databases.
creating backup and historical copies of files.
providing "permissions" for others in the organization
to access certain data files.
"data cleaning", to prepare data for inclusion in
databases.
ETL (extract/transform/load) tasks that "surface" data
to other users in the organization.
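The ETL tasks listed above can be sketched in miniature with Python's built-in sqlite3 module. The source records, table name and column names are invented for illustration:

```python
import sqlite3

# Extract: raw records as they might arrive from an application log
raw_rows = [("alice", "42.50"), ("bob", "17.00"), ("alice", "5.25")]

# Transform: parse amounts into numbers and normalise customer names
rows = [(name.title(), float(amount)) for name, amount in raw_rows]

# Load: upload the cleaned data into a database table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# "Surface" the data to other users with a summary query
total = conn.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
print(total)  # 64.75
```

Production ETL runs on far larger systems, but the extract/transform/load split is the same.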
DATA ANALYTICS

Data Analytics asks "what value/information can be derived
from the data?"
Analytics involves the application of
"mathematical/statistical operations" to data, which includes
basic counting tasks, such as generating summary counts
of "events,"
creating reports with summary descriptive statistics,
applying "data visualization" tools to create graphics
that convey the "information" in your data,
predicting future events based on historical data,
"clustering" or "grouping" similar units of analysis into
distinct groups, and
assessing the effectiveness of marketing campaigns and
similar "interventions" on customer behavior.
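The counting, descriptive-statistics and grouping operations listed above can be sketched with the standard library; the event log below is invented:

```python
from collections import Counter
from statistics import mean, median

# Event log: (event_type, value) pairs, e.g. purchase amounts
events = [("view", 0), ("buy", 20), ("view", 0), ("buy", 30), ("buy", 10)]

# Basic counting: summary counts of "events"
counts = Counter(kind for kind, _ in events)
print(counts["buy"])                   # 3

# Summary descriptive statistics over the purchase amounts
amounts = [v for kind, v in events if kind == "buy"]
print(mean(amounts), median(amounts))  # 20 20

# "Grouping" similar units: bucket purchases as small vs large
groups = {"small": [v for v in amounts if v < 25],
          "large": [v for v in amounts if v >= 25]}
print(groups["large"])                 # [30]
```

Real clustering would use an algorithm such as k-means rather than a fixed threshold, but the idea of partitioning similar units is the same.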
VALUE OF BIG DATA ANALYTICS

Cost reduction: Big data technologies such as Hadoop and
cloud-based analytics are useful since they can store large
amounts of data at very low cost and identify more
efficient ways of doing business.

Faster, better decision making: With the speed of Hadoop
and in-memory analytics, combined with the ability to
analyze new sources of data, businesses are able to analyze
information immediately, and make decisions based on
what they've learned.

New products and services: With the ability to gauge
customer needs and satisfaction through analytics comes
the power to give customers what they want. Davenport
points out that with big data analytics, more companies are
creating new products to meet customers' needs.
BIG DATA MARKET
SIZE FORECAST

Forecast of Big Data market


size, based on revenue, from
2011 to 2027 (in billion U.S.
dollars).
THANK
YOU
