Unit 1
1. Unstructured data: This is data that does not conform to a data model or is
not in a form that can be used easily by a computer program. About 80% of an
organization's data is in this format; for example, memos, chat transcripts, PowerPoint
presentations, images, videos, letters, research reports, white papers, the body of an
email, etc.
CHARACTERISTICS OF DATA
1. Composition: The composition of data deals with the structure of data, that is, the
sources of data, the granularity, the types, and the nature of data as to whether it is
static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use
this data as is for analysis?" or "Does it require cleansing for further enhancement
and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?",
"Why was this data generated?", "How sensitive is this data?", "What are the events
associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty: it
comes from known data sources, with no major changes to the composition or
context of the data.
What is Big Data?
Big Data refers to extremely large datasets that are complex and difficult to process
using traditional data processing applications.
Data at the petabyte scale (10^15 bytes) and beyond is commonly called Big Data.
It is often stated that roughly 90% of the world's data has been generated in just the
past two to three years, with total data volume doubling about every two years.
More than 5 billion people are calling, texting, tweeting and browsing on mobile
phones worldwide.
Decoding the human genome originally took 10 years to process; now it can be
achieved in one week.
The largest AT&T database is claimed to hold both the largest volume of data
in one unique database (312 terabytes) and the second largest number of
rows in a unique database (1.9 trillion), comprising AT&T's extensive
calling records.
• YouTube users upload 48 hours of new video every minute of the day.
• In late 2011, IDC's Digital Universe study estimated that some 1.8
zettabytes of data would be created that year.
In other words, the amount of data in the world today is equal to:
Every person in the US tweeting three tweets per minute for 26,976 years.
Every person in the world having more than 215 million high-resolution MRI scans
a day.
• Big data is a top business priority and drives enormous opportunity for
business improvement.
• Wikibon’s own study projects that big data will be a $50 billion business by
2017.
• As recently as 2009 there were only a handful of big data projects and total
industry revenues were under $100 million. By the end of 2012 more than 90
percent of the Fortune 500 will likely have at least some big data initiatives
under way.
• Market research firm IDC has released a new forecast that shows the big
data market is expected to grow from $3.2 billion in 2010 to $16.9 billion in
2015.
• Poor data across businesses and the government costs the U.S. economy
$3.1 trillion a year.
• 140,000 to 190,000: the projected shortfall of people with deep analytical skills
to fill the demand for Big Data jobs in the U.S. by 2018.
• 14.9 percent of marketers polled in Crain’s BtoB Magazine are still wondering
“What is Big Data?”.
• 39 percent of marketers say that their data is collected “too infrequently or not
real-time enough.”
• Big data is high-volume, high-velocity, and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity of,
and too complex for, traditional database software tools.
• Big data is anything that exceeds the human and technical infrastructure needed to
support its storage, processing, and analysis.
• It is data that is big in volume, velocity, and variety.
Irrespective of the size of the enterprise whether it is big or small, data continues to
be a precious and irreplaceable asset. Data is present in homogeneous sources as
well as in heterogeneous sources.
Around 2005, people began to realize just how much data users generated through
Facebook, YouTube, and other online services.
Hadoop (an open-source framework created specifically to store and analyze big
data sets) was developed that same year. NoSQL also began to gain popularity
during this time.
Data generates information and from information we can draw valuable insight.
(i) Volume – The name Big Data itself relates to an enormous size. The size of
data plays a crucial role in determining its value. Whether particular data
can actually be considered Big Data also depends on its volume. Hence,
'Volume' is one characteristic which needs to be considered while dealing
with Big Data solutions.
Byte:
Kilobyte (KB): 10^3 bytes
Megabyte (MB): 10^6 bytes
Gigabyte (GB): 10^9 bytes
Terabyte (TB): 10^12 bytes; roughly the scale of the information flowing over the
internet
Petabyte (PB): 10^15 bytes
Exabyte (EB): 10^18 bytes
Zettabyte (ZB): 10^21 bytes; often illustrated as enough data to fill the Pacific
Ocean, and cited as the future volume of big data
Yottabyte (YB): 10^24 bytes; illustrated as an earth-sized rice bowl
Brontobyte (BB): 10^27 bytes; an astronomical amount of data
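The powers-of-ten hierarchy above can be sketched in a few lines of Python. This is a minimal illustration using the SI decimal convention from the list; the function name `to_bytes` is just an illustrative choice, not a standard library call:

```python
# Decimal (SI) byte units and their powers of ten, as listed above.
UNITS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
    "yottabyte": 10**24,
    "brontobyte": 10**27,
}

def to_bytes(value, unit):
    """Convert a quantity in the given unit to a raw byte count."""
    return value * UNITS[unit.lower()]

# Example: AT&T's 312-terabyte calling-record database, in bytes.
print(to_bytes(312, "terabyte"))  # 312000000000000
```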
It can be said that the Big Data environment has to have these four basic
characteristics:
Volume
You may have heard on more than one occasion that Big Data is nothing more
than business intelligence in a very large format. More data, however, is not a
synonym for Big Data.
Obviously, Big Data requires a certain amount of data, but having a huge
amount of data does not necessarily mean that you are working in this field.
It would also be a mistake to think that all Big Data work is business
intelligence, since Big Data is not limited or defined by the objectives of an
initiative, but by the characteristics of the data itself.
Variety
Today, we can base our decisions on the data obtained through this
approach. Thanks to this technology, every action of customers, competitors,
suppliers, etc., generates information that ranges from structured,
easily managed data to unstructured information that is difficult to use for
decision making.
Each piece of data, or core information, will require specific treatment. In addition,
each type of data will require specific storage (storing an e-mail takes much
less space than storing a video).
Velocity
It is very possible that Variety and Veracity would not be so relevant, and would not
be under so much pressure in a Big Data initiative, if it were not for
the high volume of information that has to be handled and, above all, the
velocity at which the information has to be generated and managed.
The data will be an input for the technology area (it will be essential to be able to
store and digest large amounts of information), while the output will be the
decisions and reactions that subsequently involve the corresponding departments.
Veracity
This V refers to both data quality and data availability. In traditional
business analytics, the source data is much smaller in both quantity and
variety; however, the organization has more control over it, so its veracity
is greater.
When we talk about Big Data, variety means greater uncertainty about the
quality of the data and its availability. It also has implications for the
data sources we may use.
Characteristics of Big Data (3 Vs of Big Data)
3Vs of Big Data = Volume, Velocity and Variety.
1. Volume:
Volume refers to the sheer size of the ever-exploding data of the computing world. It
raises the question of the quantity of data collected from different sources over
the Internet.
2. Velocity:
Velocity refers to the processing speed. It raises the question of how fast data is
processed, measured by the use of the data within a specific time period. In Big
Data, data flows in at high velocity from sources like machines, networks, social
media, mobile phones, etc. There is a massive and continuous flow of data. Velocity
determines the potential of data: how fast the data is generated and processed to
meet the demands.
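The idea of processing a continuous flow rather than storing everything first can be sketched with a Python generator. This is a minimal, framework-free illustration; the function name `rolling_average` and the sensor readings are hypothetical:

```python
from collections import deque

def rolling_average(stream, window=3):
    """Process records as they arrive, keeping only a small sliding
    window in memory -- the essence of handling velocity: compute on
    the flow instead of storing everything first."""
    recent = deque(maxlen=window)
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)

# Hypothetical sensor readings arriving one by one.
readings = [10, 12, 11, 15, 14]
averages = list(rolling_average(readings))
print(averages)  # [10.0, 11.0, 11.0, 12.666..., 13.333...]
```

Because the generator holds at most `window` values, memory use stays constant no matter how long the stream runs, which is exactly what a high-velocity pipeline needs.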
3. Variety:
Variety: Variety refers to the types of data. In Big Data, raw data is always collected
in a variety of forms: structured, unstructured, and semi-structured. This is
because the data is collected from various, heterogeneous sources.
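The three forms of data can be illustrated with Python's standard library. This is a small sketch; the sample CSV, JSON record, and e-mail text are invented for the example:

```python
import csv
import io
import json

# Structured: CSV with a fixed schema -- every row has the same fields.
csv_text = "id,amount\n1,250\n2,400\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON, where fields can vary from record to record.
json_text = '{"id": 3, "amount": 120, "tags": ["priority"]}'
record = json.loads(json_text)

# Unstructured: free text; extracting meaning needs extra processing.
email_body = "Please ship order 3 by Friday."
contains_order = "order" in email_body.lower()

print(rows[0]["amount"], record["tags"], contains_order)
```

Note how each form needs different handling: the CSV parses directly into uniform rows, the JSON may carry optional fields like `tags`, and the e-mail yields only whatever a text search (or more advanced analysis) can pull out.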
Fig. 1.3 Four Vs of Big Data
4Vs of Big Data = Volume, Velocity, Variety, Veracity
4. Veracity:
Veracity is all about the trustworthiness of the data. Data collected only from
trusted, reliable sources largely avoids this concern. Veracity refers to
inconsistencies and uncertainty in data: available data can be messy, and its
quality and accuracy are difficult to control. Big Data is also variable because of the
multitude of data dimensions resulting from multiple disparate data types and
sources. Example: data in bulk can create confusion, whereas too little data can
convey only partial or incomplete information.
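A basic cleansing pass addressing veracity problems can be sketched as follows. The records, the `cleanse` function, and the `max_age` plausibility threshold are all hypothetical, chosen only to show duplicate, missing-value, and outlier handling:

```python
# Hypothetical raw records: a duplicate, a missing value, an outlier.
raw = [
    {"user": "a", "age": 34},
    {"user": "a", "age": 34},    # exact duplicate
    {"user": "b", "age": None},  # missing value
    {"user": "c", "age": 540},   # implausible -> low veracity
    {"user": "d", "age": 28},
]

def cleanse(records, max_age=120):
    """Drop duplicates, missing values, and implausible outliers."""
    seen, clean = set(), []
    for r in records:
        key = (r["user"], r["age"])
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        if r["age"] is None or r["age"] > max_age:
            continue  # drop missing / implausible values
        clean.append(r)
    return clean

print(cleanse(raw))  # only the two trustworthy records remain
```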
5Vs of Big Data = Volume, Velocity, Variety, Veracity, Value
5. Value:
Value refers to the purpose, scenario, or business outcome that the analytical solution
has to address. Does the data have value? If not, is it worth being stored or
collected? The analysis also needs to respect ethical considerations.
6Vs of Big Data = Volume, Velocity, Variety, Veracity, Value, Variability
6. Variability
This refers to establishing whether the contextualizing structure of the data stream is
regular and dependable, even in conditions of extreme unpredictability. It defines the
need to get meaningful data considering all possible circumstances.
7Vs of Big Data = Volume, Velocity, Variety, Veracity, Value, Variability, Visualization
7. Visualization:
Visualization is critical in today’s world. Using charts and graphs to visualize large
amounts of complex data is much more effective in conveying meaning than
spreadsheets and reports chock-full of numbers and formulas.
There are four main types of big data analytics—descriptive, diagnostic, predictive,
and prescriptive. Each serves a different purpose and offers varying levels of insight.
Collectively, they enable businesses to comprehensively understand their big data
and make decisions to drive improved performance.
Descriptive analytics
This type focuses on summarizing historical data to tell you what's happened in the
past. It uses aggregation, data mining, and visualization techniques to understand
trends, patterns, and key performance indicators (KPIs).
Descriptive analytics helps you understand your current situation and make informed
decisions based on historical information.
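The aggregation step of descriptive analytics can be sketched with Python's built-in `statistics` module. The monthly sales figures below are invented for illustration:

```python
import statistics

# Hypothetical monthly sales history (in thousands).
monthly_sales = [120, 135, 128, 150, 160, 155]

# Descriptive analytics: summarize what has already happened.
summary = {
    "total": sum(monthly_sales),
    "mean": statistics.mean(monthly_sales),
    "best_month": max(monthly_sales),
    "worst_month": min(monthly_sales),
}
print(summary)
```

In practice these aggregates would feed dashboards and KPI reports rather than a print statement, but the principle is the same: reduce history to a handful of descriptive figures.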
Diagnostic analytics
Diagnostic analytics goes beyond describing past events and aims to understand
why they occurred. It drills down into data to identify the root causes of specific
outcomes or issues.
By analyzing relationships and correlations within the data, diagnostic analytics helps
you gain insights into factors influencing your results.
Predictive analytics
This type of analytics uses historical data and statistical algorithms to predict future
events. It spots patterns and trends and forecasts what might happen next.
You can use predictive analytics to anticipate customer behavior, product demand,
market trends, and more to plan and make strategic decisions proactively.
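As a minimal sketch of the "historical data plus a statistical algorithm" idea, the example below fits an ordinary least-squares trend line to past demand and extrapolates one step ahead. The quarterly figures and the function name `fit_trend` are hypothetical; real predictive analytics would use far richer models and data:

```python
def fit_trend(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical quarterly demand; forecast the next quarter.
demand = [100, 110, 120, 130]
slope, intercept = fit_trend(demand)
forecast = slope * len(demand) + intercept
print(round(forecast, 1))  # 140.0
```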
Prescriptive analytics
Prescriptive analytics builds on predictive analytics by recommending actions to
optimize future outcomes. It considers various possible actions and their potential
impact on the predicted event or outcome.
Prescriptive analytics helps you make data-driven decisions by suggesting the best
course of action based on your desired goals and any constraints.
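The "actions, predicted impact, constraints" structure can be sketched in a few lines. The candidate actions, their predicted lifts, and the budget constraint are all invented; in practice the lift figures would come from a predictive model:

```python
# Hypothetical candidate actions with predicted revenue lift and cost.
actions = [
    {"name": "discount",    "predicted_lift": 50, "cost": 30},
    {"name": "ad_campaign", "predicted_lift": 80, "cost": 45},
    {"name": "do_nothing",  "predicted_lift": 0,  "cost": 0},
]

def recommend(actions, budget):
    """Prescriptive step: among actions satisfying the constraint,
    pick the one with the best predicted net outcome."""
    affordable = [a for a in actions if a["cost"] <= budget]
    return max(affordable, key=lambda a: a["predicted_lift"] - a["cost"])

print(recommend(actions, budget=40)["name"])  # discount
```

Note how loosening the constraint changes the recommendation: with a budget of 100, the ad campaign's larger net lift (80 − 45 = 35) wins instead.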
Big data analytics has the potential to transform the way you operate, make
decisions, and innovate. It’s an ideal solution if you’re dealing with massive datasets
and are having difficulty choosing a suitable analytical approach.
By tapping into the finer details of your information, using techniques and specific
tools, you can use your data as a strategic asset.
Figure 1.5 Challenges with big data (Big Data and Analytics)
Data volume: Data today is growing at an exponential rate, and this tide will
continue to rise. The key questions are: "Will all this data be useful for
analysis?", "Do we work with all of this data or a subset of it?", "How will we
separate the knowledge from the noise?", etc.
Storage: Cloud computing is the answer to managing infrastructure for big data as
far as cost-efficiency, elasticity and easy upgrading / downgrading are concerned.
However, concerns such as data security and loss of control complicate the decision
to host big data solutions outside the enterprise.
Data retention: How long should one retain this data? Some data may be required
for long-term decisions, but other data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that
generate insights, organizations need professionals who possess a high-level
proficiency in data sciences.
Other challenges: Other challenges of big data are with respect to capture, storage,
search, analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage
capacity of traditional database software tools; there is no explicit definition of how
big a dataset should be for it to be considered big data. Data
visualization (computer graphics) is becoming popular as a discipline in its own
right, but there are still very few data visualization experts.
Drivers of Big Data:
Big data is driven by several key factors that enable its growth, adoption, and utility
across various industries. Here are some of the primary drivers for big data:
1. Increased Data Volume: The sheer amount of data being generated every
day from various sources like social media, IoT devices, transactions, and
digital interactions is growing exponentially. This creates a need for systems
and technologies that can handle and process large volumes of data.
2. Advanced Analytics and Machine Learning: The development of
sophisticated analytics tools and machine learning algorithms has enabled
organizations to extract valuable insights from vast datasets. These
technologies help in making informed decisions, predicting trends, and
optimizing operations.
3. Cost-Effective Storage Solutions: The cost of data storage has significantly
decreased over the years, making it economically feasible to store large
amounts of data. Cloud storage solutions and distributed file systems like
Hadoop have played a crucial role in this.
4. Improved Processing Power: Advances in computing power, including the
development of parallel processing and distributed computing, have made it
possible to process and analyze big data more efficiently and quickly.
5. Data-Driven Decision Making: Organizations are increasingly relying on
data-driven decision-making processes to gain a competitive edge. Big data
analytics provides insights that help in strategic planning, marketing, customer
service, and operational efficiency.
6. Internet of Things (IoT): The proliferation of IoT devices has led to the
generation of massive amounts of data from sensors, machines, and other
connected devices. This data is crucial for real-time analytics, monitoring, and
automation.
7. Consumer Demand for Personalization: There is a growing demand for
personalized products and services. Big data allows companies to analyze
consumer behavior, preferences, and feedback to tailor their offerings to
individual needs.
8. Regulatory and Compliance Requirements: Regulations and compliance
requirements in various industries necessitate the collection, storage, and
analysis of large amounts of data. Big data helps organizations meet these
requirements more efficiently.
9. Social Media and Digital Marketing: The rise of social media platforms and
digital marketing strategies generates vast amounts of data on consumer
interactions, preferences, and trends. Analyzing this data helps in targeted
marketing and improving customer engagement.
10. Technological Innovations: Continuous innovations in big data
technologies, such as real-time data processing, advanced data visualization
tools, and enhanced data integration techniques, drive the adoption and
evolution of big data solutions.
These drivers collectively contribute to the growing importance and utilization of big
data across various sectors, leading to more informed decision-making and strategic
advantages.