Emerging CH2
• Input
• It is the task in which verified data is coded or converted into machine-readable
form so that it can be processed by a computer.
• Data entry is done through a keyboard, digitizer, scanner, or by transfer
from an existing source.
• Processing
• Once the input is provided, the raw data is processed by a suitable or selected
processing method.
• This is the most important step, as it produces the processed data as
output, which will be used further.
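The input and processing steps above can be sketched in miniature; the record format and the summing rule here are purely illustrative assumptions:

```python
# Minimal sketch of input -> processing -> output (hypothetical records).
raw_lines = ["23,OK", "41,OK", "17,FAIL"]  # input: machine-readable data

def process(line):
    # Convert one raw record into a typed (value, status) pair.
    value, status = line.split(",")
    return int(value), status

records = [process(line) for line in raw_lines]      # processing step
total_ok = sum(v for v, s in records if s == "OK")   # output used further
print(total_ok)  # 64
```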
Cont…
▪ It is concerned with making the acquired raw data amenable to use in
decision-making as well as in domain-specific applications.
Cont…
3. Data curation:
• It is the active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities such as
content creation, selection, classification, transformation, validation, and
preservation.
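A curation pass of the kind listed above (selection, transformation, validation) might look like the following sketch; the schema and validation rules are illustrative assumptions, not a standard:

```python
# Sketch of data curation: transform, validate, and select records.
raw = [{"name": " Alice ", "age": "30"}, {"name": "", "age": "x"}]

def curate(rec):
    name = rec["name"].strip()                 # transformation: normalize
    if not name or not rec["age"].isdigit():
        return None                            # validation: reject bad records
    return {"name": name, "age": int(rec["age"])}

# selection: keep only records that passed validation
curated = [c for c in map(curate, raw) if c is not None]
print(curated)  # [{'name': 'Alice', 'age': 30}]
```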
4. Data storage:
• It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
e.g., RDBMS
• RDBMSs guarantee the ACID (Atomicity, Consistency, Isolation, and
Durability) properties for transactions, but they lack flexibility with
regard to schema changes, and their performance degrades as data volumes
and complexity grow, making them unsuitable for big data scenarios.
• NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
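The ACID guarantees mentioned above can be seen with the standard-library sqlite3 module; this sketch shows atomicity, where either both updates commit or neither does (table and values are made up):

```python
# Sketch of an ACID transaction with the stdlib sqlite3 module.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")

with con:  # opens a transaction; rolls back automatically on exception
    con.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'a'")
    con.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'b'")

balances = dict(con.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'a': 60, 'b': 40}
```

NoSQL stores relax some of these guarantees to gain horizontal scalability and schema flexibility.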
Cont…
5. Data usage:
• It covers the data-driven business activities that need access to data,
its analysis, and the tools needed to integrate the data analysis
within the business activity.
• Data usage in business decision making can enhance
competitiveness through the reduction of costs, increased added
value, or any other parameter that can be measured against existing
performance criteria.
• Data usage also refers to the amount of data (images, movies, photos,
videos, and other files) that you send, receive, download, and/or upload.
Cont…
• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
• Today, it may consist of petabytes (1,024 terabytes) or exabytes
(1,024 petabytes) of information, including billions or even trillions of
records from millions of people.
• But big data is not just about the amount of data; what matters is
what organizations do with that data.
• Big Data is analyzed for insights that lead to better decisions.
Cont…
• Big Data is associated with the concept of the 3 Vs: volume,
velocity, and variety. Big data is characterized by these and
more:
• Volume: large amounts of data (zettabytes/massive datasets)
• Velocity: Data is live streaming or in motion
• Variety: data comes in many different forms from diverse
sources
• Veracity: can we trust the data? How accurate is it? etc.
• Value: refers to economically useful benefits that an
organization obtained from Big Data
Cont…
Clustered Computing
• Because of the qualities and quantities of big data,
individual computers are often inadequate for handling the
data at most stages.
• To better address the high storage and computational needs of
big data, computer clusters are a better fit.
• “Computer cluster” basically refers to a set of connected
computers working together.
• The cluster represents one system and the objective is to
improve performance.
Cont…
• High Availability
• Clusters can provide varying levels of fault tolerance and
availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• This becomes increasingly important as we continue to
emphasize the importance of real-time analytics.
Cont…
• Easy Scalability
• Clusters make it easy to scale horizontally by adding additional
machines to the group.
• This means the system can react to changes in resource
requirements without expanding the physical resources on a
machine.
Cont…
○ MapReduce: Programming-based data processing
○ Mahout, Spark MLlib: Machine learning algorithm libraries
○ Spark: In-memory data processing
○ Solr, Lucene: Searching and indexing
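The MapReduce programming model listed above can be sketched with the classic word-count example; this single-process version only illustrates the map, shuffle, and reduce phases, not a distributed implementation:

```python
# Sketch of the MapReduce model: word count on two tiny documents.
from collections import defaultdict

docs = ["big data", "big cluster"]

# Map phase: emit a (word, 1) pair for every word.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

In a real cluster, the map and reduce phases run in parallel on many machines, and the framework performs the shuffle over the network.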