Big Data
>>We live in a digital world where data is growing at a very high rate because of the
ever-increasing use of the internet, sensors, and heavy machines.
>>The sheer volume, variety, velocity, and veracity of such data is
signified by the term ‘Big Data’.
>>Big data is structured, unstructured, semi-structured, or
heterogeneous (varied, mixed) in nature.
>>It becomes difficult for computing systems to manage ‘Big Data’ because
of the immense (huge) speed and volume at which it is generated.
>>Traditional data management, warehousing, and analysis systems fail to
analyze this type of data.
>>Due to its complexity, big data is stored in a distributed file system
architecture.
What is Big Data? (contd.)
>>Hadoop by Apache is widely used for storing and managing Big
data.
>>Analyzing Big data is a challenging task as it involves large
distributed file systems, which should be fault tolerant, flexible, and
scalable. (Fault tolerance is the property of a system that keeps its
service running even during faults, or a process that enables an
operating system to respond to a failure in hardware or software.)
>>According to IBM, "Every day, we create 2.5 quintillion bytes of data
– so much that 90% of the data in the world today has been created in
the last two years alone.
>>This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals, to name a few.
This data is big data."
>>The process of capturing or collecting Big data is known as
‘datafication’.
>>Big data is ‘datafied’ so that it can be used productively.
>>Big data cannot be made useful by simply organizing it; rather, the
data’s usefulness lies in determining what we can do with it.
Note: By large or huge datasets or big data, we mean anything from a
petabyte (1 PB = 1000 TB) to an exabyte (1 EB = 1000 PB) of data.
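The scale in the note above can be checked with a few lines of Python. This is only a sketch of the decimal unit arithmetic; the constant and function names are illustrative assumptions, not part of any library.

```python
# Decimal storage units as used in the note: 1 PB = 1000 TB, 1 EB = 1000 PB.
TB = 10**12          # bytes in one terabyte
PB = 1000 * TB       # bytes in one petabyte
EB = 1000 * PB       # bytes in one exabyte


def to_terabytes(num_bytes: int) -> float:
    """Express a byte count in terabytes."""
    return num_bytes / TB


print(to_terabytes(PB))  # terabytes in one petabyte
print(to_terabytes(EB))  # terabytes in one exabyte
```

In other words, an exabyte-scale dataset is about a million times larger than a one-terabyte disk.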
Some real world examples of Big data include:
Types and sources of Data:
Overview of Big Data, Techniques:
Structuring Big Data:
>>Structuring data means arranging the available data in such a manner
that it becomes easy to study, analyze, and derive conclusions
from it. But why is structuring required?
>>In daily life, you may have come across questions like:
1. How do I use the vast amount of data and information I come across
to my advantage?
2. Which news articles should I read out of the thousands I come across?
3. How do I choose a book from the millions available on my favorite sites or
stores?
4. How do I keep myself updated about new events, sports, inventions, and
discoveries taking place across the globe?
>>Solutions to such questions can be found by information
processing systems (IPS).
>>These systems can analyze and structure a large amount of data
specifically for you on the basis of what you searched for, what you looked
at, and how long you remained on a particular page or website, thus
scanning and presenting you with customized information as per
your behavior and habits.
>>In other words, structuring data helps in understanding user
behaviors, requirements, and preferences to make personalized
recommendations for every individual.
>>When a user regularly visits or purchases from an online shopping
site, say eBay, each time he/she logs in, the system can present a
list of recommended products that may interest the user on the basis of
his/her earlier purchases or searches, giving every user a specially
customized recommendation set.
>>This is the power of Big data analytics.
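The idea of recommending products from earlier purchases can be sketched in a few lines. This is a toy illustration only, not how any real shopping site works; the users, categories, products, and the recommend function are all made-up assumptions.

```python
from collections import Counter

# Hypothetical purchase history: user -> product categories bought so far.
purchases = {
    "alice": ["books", "books", "electronics"],
    "bob": ["garden", "books"],
}

# Hypothetical catalogue: category -> products currently on offer.
catalogue = {
    "books": ["novel", "atlas"],
    "electronics": ["headphones"],
    "garden": ["spade"],
}


def recommend(user, limit=3):
    """Suggest products from the categories the user buys most often."""
    counts = Counter(purchases.get(user, []))
    suggestions = []
    # most_common() orders categories by how often the user bought from them.
    for category, _ in counts.most_common():
        suggestions.extend(catalogue.get(category, []))
    return suggestions[:limit]


print(recommend("alice"))
```

A real system would work on far larger and messier data, but the principle is the same: structured records of past behavior drive the personalized list.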
>>Today, various sources generate a variety of data, such as images,
text, and audio.
>>All such different types of data can be structured only if they are sorted
and organized in some logical pattern.
>>Thus, the process of structuring data requires one to first
understand the various types of data available today.
Types of data
1. Internal [provides structured or organized data that originates within
the enterprise and helps run the business]
2. External [provides unstructured or unorganized data that originates
from the external environment of an organization]
On the basis of the data received from these sources, Big Data
comprises:
• Structured data
• Unstructured data
• Semi-structured data
Structured data
>>Structured data can be defined as data that has a defined repeating
pattern.
>>This pattern makes it easier for any program to sort, read, and process the
data.
>>Processing structured data is much easier and faster than processing data
without any specific repeating patterns.
Structured data:
..Is organized data in a predefined format.
..Is stored in tabular form.
..Is the data that resides in fixed fields within a record or file.
..Is formatted data that has entities and their attributes mapped.
..Is used to query and report against predetermined data types.
Some sources of structured data include:
>>Relational databases (in the form of tables)
>>Flat files in the form of records (like comma-separated values (CSV) and
tab-separated files)
>>Multidimensional databases (mainly used in data warehouse technology)
>>Legacy databases
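As a small illustration of why fixed fields make structured data easy to process, the sketch below reads a CSV flat file with Python's standard csv module. The column names and rows are invented for the example.

```python
import csv
import io

# A small in-memory CSV file; the columns and rows are made up for illustration.
raw = io.StringIO(
    "id,name,city\n"
    "1,Asha,Pune\n"
    "2,Ravi,Delhi\n"
)

# Every record has the same fixed fields, so it can be read by column name.
rows = list(csv.DictReader(raw))

# Querying against known fields is a simple filter.
in_pune = [row["name"] for row in rows if row["city"] == "Pune"]
print(in_pune)
```

Because every record follows the same pattern, the program needs no special handling for individual rows.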
Unstructured data
>>Unstructured data is a set of data that might or might not
have any logical or repeating patterns.
Unstructured data:
..Typically consists of metadata, i.e., additional information
related to data.
..Comprises inconsistent data, such as data obtained from files,
social media websites, satellites, etc.
..Consists of data in different formats such as e-mails, text,
audio, video, or images.
Some sources of unstructured data include:
>>Text both internal and external to an organization:
documents, logs, survey results, feedback, and e-mails from
both within and across the organization.
>>Social media: data obtained from social networking
platforms, including YouTube, Facebook, Twitter, LinkedIn,
and Flickr.
>>Mobile data: data such as text messages and location
information.
About 80 percent of enterprise data consists of unstructured
content.
>>Examples of unstructured data: a wide array of forms
make up unstructured data, such as email, text files, social
media posts, video, images, audio, and sensor data.
>>The travel agency Facebook post: an example of
unstructured data.
Semi-structured data:
>>Semi-structured data, also described as schema-less
or self-describing, refers to a form of structured data
that contains tags or markup elements in order to separate
elements and generate hierarchies of records and fields in the
given data.
>>Such data does not follow the strict structure of
data models used in relational databases.
>>In other words, the data is not stored consistently in the rows and
columns of a database.
>>Some sources of semi-structured data include:
..File systems, such as Web data in the form of cookies.
..Data exchange formats such as JavaScript Object
Notation (JSON) data.
..As another example, an XML document might contain tags
that indicate the structure of the document, but may also
contain additional tags that provide metadata about the content,
such as author, date, or keywords.
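The self-describing nature of JSON and XML can be seen in a short sketch using Python's standard json and xml.etree.ElementTree modules. The records, tag names, and values below are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Two JSON records with different fields: there is no fixed schema,
# but the keys describe the data they carry.
records = json.loads(
    '[{"name": "Asha", "email": "asha@example.com"},'
    ' {"name": "Ravi", "phones": ["555-0100", "555-0101"]}]'
)
# Fields may be absent, so read them defensively.
emails = [record.get("email") for record in records]

# An XML document whose tags carry both structure and metadata.
doc = ET.fromstring(
    "<article><author>Asha</author><date>2020-01-15</date>"
    "<body>Big data notes</body></article>"
)
author = doc.findtext("author")

print(emails, author)
```

Unlike rows in a relational table, the two JSON records do not share the same fields, yet a program can still navigate them through their keys and tags.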
Elements of Big Data
>>According to Gartner, data is growing at the rate of
59% every year.
>>This growth can be depicted in terms of the following
four Vs:
* Volume
* Velocity
* Variety
* Veracity
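To get a feel for the 59% yearly growth rate quoted above, the compounding can be worked out as a rough sketch. It assumes the rate stays constant from year to year, which is a simplification; the function and variable names are illustrative.

```python
import math

ANNUAL_GROWTH = 0.59  # 59% per year, the rate quoted above


def projected_volume(initial: float, years: int) -> float:
    """Volume after compounding the annual growth rate for the given years."""
    return initial * (1 + ANNUAL_GROWTH) ** years


# Years needed for the volume to double at this rate.
doubling_time = math.log(2) / math.log(1 + ANNUAL_GROWTH)

print(round(projected_volume(1.0, 5), 2))
print(round(doubling_time, 2))
```

Under this assumption, the data volume doubles roughly every year and a half.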
1. Volume
>>Volume is the amount of data generated by organizations or
individuals.
>>Today, the volume of data in most organizations is approaching
exabytes (1 exabyte = 1000 petabytes).
>>Some experts predict the volume of data to reach zettabytes in the
coming years.
>>Organizations are doing their best to handle this ever-increasing
volume of data.
>>For example, according to IBM, over 2.7 zettabytes of data is
present in the digital universe today.
>>Every minute, over 571 new websites are created.
>>IDC (International Data Corporation) estimates that by
2020, online business transactions will reach up to 450 billion per day.
2. Velocity
>>Velocity describes the rate at which data is generated,
captured, and shared.
>>Enterprises can capitalize on data only if it is captured
and shared in real time.
>>Information processing systems such as CRM and ERP
face problems with data that keeps adding up
but cannot be processed quickly.
The sources of high-velocity data include the following:
>>IT devices, including routers, switches, and firewalls, which constantly
generate valuable data.
>>Social media, including Facebook posts, tweets, and other social
media activities.
>>Portable devices, including mobile phones and PDAs, which also generate data at a
high speed.
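Velocity can be measured as the number of events arriving per second. The sketch below keeps a sliding time window over event timestamps; the RateMeter class and the sample timestamps are hypothetical and only meant to illustrate the idea.

```python
from collections import deque


class RateMeter:
    """Counts events seen in the most recent `window` seconds."""

    def __init__(self, window: float = 60.0):
        self.window = window
        self.timestamps = deque()

    def record(self, timestamp: float) -> None:
        """Register one event; timestamps must arrive in increasing order."""
        self.timestamps.append(timestamp)
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()

    def events_per_second(self, now: float) -> float:
        """Average arrival rate over the current window."""
        self._expire(now)
        return len(self.timestamps) / self.window


meter = RateMeter(window=10.0)
for t in [0.0, 1.0, 2.0, 9.0, 11.0]:
    meter.record(t)
print(meter.events_per_second(now=11.0))
```

Only recent events are kept in memory, which is the kind of trade-off a system has to make when data keeps arriving faster than it can be stored in full.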
3. Variety
>>We all know that data is being generated at a very fast
pace.
>>This data is generated from different types of sources,
such as internal, external, social, and behavioral, and comes in
different formats such as images, text, and videos.
>>Even a single source can generate data in varied formats;
for example, GPS and social networking sites, such as
Facebook, produce data of all types, including text, images,
and videos.
>>The various types of data are shown in the following figure:
4. Veracity
>>Veracity generally refers to the uncertainty of data, i.e.,
whether the obtained data is correct or consistent.
>>Out of the huge amount of data generated in almost
every process, only the data that is correct and consistent can be
used for further analysis.
>>Data becomes information when processed; however, a lot of effort
goes into processing the data.
>>Big data, especially in the unstructured and semi-structured forms,
is messy in nature, and it takes a good amount of time and expertise to
clean that data and make it suitable for analysis.
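A minimal sketch of such cleaning is shown below: readings that are missing, malformed, or implausible are filtered out before analysis. The sensor records and the plausible temperature range are assumptions chosen for the example.

```python
# Hypothetical sensor readings; some are missing, malformed, or out of range.
readings = [
    {"sensor": "t1", "celsius": "21.5"},
    {"sensor": "t2", "celsius": ""},       # missing value
    {"sensor": "t3", "celsius": "abc"},    # not a number
    {"sensor": "t4", "celsius": "999"},    # physically implausible
    {"sensor": "t5", "celsius": "19.0"},
]


def is_valid(reading: dict, low: float = -90.0, high: float = 60.0) -> bool:
    """Keep a reading only if its value parses and lies in a plausible range."""
    try:
        value = float(reading["celsius"])
    except (KeyError, ValueError):
        return False
    return low <= value <= high


clean = [r for r in readings if is_valid(r)]
print([r["sensor"] for r in clean])
```

Real cleaning pipelines also handle duplicates, conflicting sources, and inconsistent units, which is where much of the time and expertise goes.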
Big data Analytics
>>Big data analytics has changed the way business is conducted in
many ways; for example, it improves decision making, business
process management, etc.
>>Business analytics combines data with other
techniques such as information technology, statistics,
quantitative methods, and predictive and
prescriptive analytics.
>>There are three main types of business analytics:
descriptive analytics, predictive analytics, and prescriptive
analytics.
>>The conventional database systems are not in a position to process
Big data defined by the four Vs: volume, variety, velocity, and veracity.
>>Big data also affects the analytical process and technologies used for
analytics.
>>There are mainly three types of analytics:
1. Descriptive Analytics: Descriptive analytics (DA) is the most prevalent form of
analytics, and it serves as a base for advanced analytics.
>>It answers the question ‘What happened in the business?’
>>DA analyzes databases to provide information on the trends of past or
current business events that can help managers, planners, leaders, etc.
to develop a road map for future actions.
>>DA performs an in-depth analysis of data to reveal details such as
frequency of events, operation costs, and the underlying reasons for
failures.
>>It helps in identifying the root cause of a problem.
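Descriptive analytics of this kind often starts with simple summaries. The sketch below counts how often each event occurred and what it cost on average; the event log is invented for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical log of past business events: (event type, cost).
events = [
    ("delivery", 120.0),
    ("failure", 300.0),
    ("delivery", 80.0),
    ("failure", 500.0),
    ("delivery", 100.0),
]

# Frequency of each event type.
frequency = Counter(kind for kind, _ in events)

# Average cost per event type.
average_cost = {
    kind: mean(cost for k, cost in events if k == kind)
    for kind in frequency
}

print(frequency)
print(average_cost)
```

Summaries like these describe what has already happened; they do not by themselves predict or recommend anything.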
2. Predictive Analytics –
>>Predictive analytics (PA) is about understanding and predicting the future and
answers the question ‘What could happen?’ by using statistical
models and different forecasting techniques.
>>It predicts near-future probabilities and trends and helps
in what-if analysis.
>>In PA, we use statistics, data mining techniques, and
machine learning to analyze the future.
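One of the simplest statistical models used for forecasting is a straight-line trend fitted by least squares. The sketch below is only one possible technique, chosen here for illustration; the sales figures and the fit_line helper are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales figures for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(forecast)
```

Changing the input month in the last step is a small example of what-if analysis: the same fitted model answers "what could happen" for different scenarios.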
>>The figure below shows the steps involved in predictive analytics:
3. Prescriptive Analytics –
>>Prescriptive analytics answers ‘What should we do?’ on the
basis of complex data obtained from descriptive and predictive
analyses.
>>By using optimization techniques, prescriptive analytics
determines the best alternative to minimize or maximize some
objective in finance, marketing, and many other areas.
>>For example, if we have to find the best way of shipping goods
from a factory to a destination so as to minimize costs, we will use
prescriptive analytics.
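The shipping example can be sketched as a tiny optimization problem: among the options that meet a delivery deadline, pick the one with the lowest cost. The routes, costs, and transit times below are invented, and real problems usually need a proper optimization solver rather than a simple search.

```python
# Hypothetical shipping options from the factory to the destination.
routes = [
    {"name": "road", "cost": 500.0, "days": 5},
    {"name": "rail", "cost": 350.0, "days": 7},
    {"name": "air", "cost": 1200.0, "days": 1},
    {"name": "sea", "cost": 200.0, "days": 20},
]


def cheapest_route(options, max_days):
    """Return the lowest-cost option that meets the deadline, or None."""
    feasible = [r for r in options if r["days"] <= max_days]
    if not feasible:
        return None
    return min(feasible, key=lambda r: r["cost"])


best = cheapest_route(routes, max_days=7)
print(best["name"], best["cost"])
```

The output is a recommended action rather than a description or a forecast, which is what distinguishes prescriptive analytics from the other two types.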
>>The below figure shows a diagrammatic representation of the
stages involved in the prescriptive analytics:
>>Data, which is available in abundance, can be streamlined for
growth and expansion in technology as well as business.
Capacity Scheduler
>>The Capacity Scheduler is the default scheduler used in Hadoop 2.
>>Its purpose is to allow multi-tenancy and share resources
between multiple organizations and applications on the same
cluster.
>>It supports the following features:
1. Hierarchical queues
2. Capacity guarantees
3. Security
4. Elasticity
5. Multi-tenancy
6. Resource-based scheduling
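Hierarchical queues and capacity guarantees can be illustrated with a simplified model in which each queue is given a percentage of its parent's capacity. The sketch below is not the YARN API or configuration format; the queue names, percentages, and the absolute_capacity function are assumptions made for the example.

```python
# Hypothetical queue tree: each queue's capacity is a percentage of its parent.
queues = {
    "root": {"parent": None, "capacity": 100.0},
    "root.engineering": {"parent": "root", "capacity": 60.0},
    "root.marketing": {"parent": "root", "capacity": 40.0},
    "root.engineering.dev": {"parent": "root.engineering", "capacity": 50.0},
    "root.engineering.test": {"parent": "root.engineering", "capacity": 50.0},
}


def absolute_capacity(name: str) -> float:
    """Share of the whole cluster guaranteed to a queue, in percent."""
    queue = queues[name]
    if queue["parent"] is None:
        return queue["capacity"]
    return absolute_capacity(queue["parent"]) * queue["capacity"] / 100.0


for name in queues:
    print(name, absolute_capacity(name))
```

In this model, each organization gets a guaranteed slice of the cluster and can divide that slice further among its own teams.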
Fair Scheduler
>>Is a method of assigning resources to applications via
Application Manger such that all applications get an
equal share of resources during their course of running.
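The idea of equal sharing can be sketched with a max-min fair allocation: resources are split equally, except that an application never receives more than it asks for, and any leftover is shared among the rest. This is a simplified illustration and not the actual YARN implementation; the fair_shares function, application names, and demands are assumptions.

```python
def fair_shares(total, demands):
    """Max-min fair split of `total` resource units among applications.

    `demands` maps an application name to the amount it asks for.
    """
    shares = {}
    pending = dict(demands)
    remaining = float(total)
    while pending:
        equal_share = remaining / len(pending)
        # Applications asking for no more than an equal share are fully served.
        small = [app for app, need in pending.items() if need <= equal_share]
        if not small:
            # Everyone left wants more than an equal share: split evenly.
            for app in pending:
                shares[app] = equal_share
            break
        for app in small:
            shares[app] = float(pending.pop(app))
            remaining -= shares[app]
    return shares


print(fair_shares(100, {"app1": 20, "app2": 50, "app3": 70}))
```

When every application wants more than is available, the result is a plain equal split, which matches the description above.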
YARN Commands
Administration commands: These commands are used by the cluster
administrator (for example, yarn rmadmin).
User commands: These commands are used by
the cluster user (for example, yarn application and yarn logs).