Understanding Big Data and NoSQL
M.A.MANALANG
Although interest in data analysis has increased, existing system architectures and DBMSs face limitations in
processing speed and performance when they must handle the tens of petabytes (PB) of unstructured data that
multimedia, SNS, sensors, IoT, etc. now generate as IT technologies advance.
As a result, solutions are being developed that are suited to analyzing large volumes of unstructured data
(variety) generated at high speed (velocity). Big data technologies, which were still in their infancy only
a few years ago, have developed rapidly and are now used directly in everyday life.
There are many examples of this. For instance, the US movie Minority Report, which was released in 2002,
portrayed futuristic crime prediction by “precogs” in the year 2054, yet something akin to this technology was
actually realized as a crime prevention system in San Francisco as early as 2009. (However, the concept of
“big data” had only been introduced at that time for the study of genomes, and prediction based on big data
analytics had not yet been thought of.) Other commonly known examples include Google's flu map, the US
presidential election case, ZARA, the DHL case, and distribution demand forecasting. Furthermore, until
relatively recently, big data technology was so unfamiliar to IT workers that they only needed to know the
definition of big data. However, with the rapid development of big data technology, terms such as crawler,
Hadoop, MapReduce, R, and NoSQL, as well as the 3V characteristics of big data, have become familiar
technical terms, and IT workers now experience big data systems more frequently in the workplace. Therefore,
IT workers need to understand at least the concepts of the technical terms for each phase of big data
analytics, if not the detailed technical principles, in order to quickly adapt to the changed workplace.
BIG DATA OVERVIEW
A. Definition and characteristics of big data
1. Definition of big data
• Big data generally refers either to data that exceed the ability of database management tools to capture, store,
and analyze them (McKinsey, 2011), or to next-generation technologies and architectures designed to extract value
from large-scale data at low cost and to support the rapid collection, discovery, and analysis of data (IDC, 2011).
2. Characteristics of big data (3V)
• The characteristics of big data can be explained by the three elements (3V) of big data, namely, volume, velocity, and
variety. Each element has the following characteristics.
3. Structured data vs. unstructured data
Data type | Description
Structured | Data stored in fixed fields.
Semi-structured | Data not stored in fixed fields, but which contain metadata or a schema, such as XML or HTML. Examples: XML, CSV, XLS, RDF, etc.
Unstructured | Data not stored in fixed fields. Examples: document, picture, video, and audio data, etc.
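The three data types can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `users` table, the sample XML, and the review sentence are made-up data, and the standard-library `sqlite3` and `xml.etree.ElementTree` modules stand in for whatever store or parser a real system would use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: every record fits the same fixed fields (columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "Kim", 31), (2, "Lee", 27)])
structured_rows = conn.execute(
    "SELECT id, name, age FROM users ORDER BY id").fetchall()
conn.close()

# Semi-structured: no fixed fields, but the tags describe the data,
# so each record may carry a different set of elements.
SAMPLE_XML = (
    "<users>"
    "<user id='1'><name>Kim</name></user>"
    "<user id='2'><name>Lee</name><email>lee@example.com</email></user>"
    "</users>"
)
root = ET.fromstring(SAMPLE_XML)
semi_structured = [
    {"id": user.get("id"), **{child.tag: child.text for child in user}}
    for user in root
]

# Unstructured: free text with no fields or tags; structure has to be
# derived by analysis (here, a trivial word count).
review = "Kim wrote a long review about the new phone and its camera."
word_total = len(review.split())

print(structured_rows)
print(semi_structured)
print(word_total)
```

Note that the second XML record has an `email` element the first one lacks; a fixed-field table would need a schema change to hold it, whereas the tagged format describes itself.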
B. Detailed technology by big data life cycle
ITEM | DESCRIPTION | DETAILED TECHNOLOGY
Collection | A technology that can collect data from all devices and systems. | Crawling (web robot), ETL, CEP (Complex Event Processing), etc.
Storage/processing | A technology that can store and process collected large-scale data using a distributed processing system. | Distributed file system, NoSQL, MapReduce processing
Analysis | A method of analysis that can assist companies and the public with using big data in business and daily life. | Natural language processing, machine learning, data mining algorithms, etc.
Visualization | A technology that can visualize analyzed results effectively. | Visualization such as R, graphs, drawing, etc.
Various technologies can be used for data collection, such as ETL, web crawling, RSS feeds, Open API, and CEP
(Complex Event Processing). Among them, web crawling automatically collects the various
documents and data generated on the web, and is used to collect data from sources such as SNS, blogs, and news.
A web crawler first gathers the URLs to visit, and then either copies each web page in its entirety or parses the
HTML code and collects only the data inside specific tags.
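The two crawling steps described above (gathering URLs, then collecting only the data in a specific tag) can be sketched with Python's standard `html.parser` module. This is a minimal sketch, not a full crawler: it parses an inline sample page instead of downloading one, and the names `TagCollector`, `collect`, and `SAMPLE_PAGE`, as well as the example.com URLs, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects link URLs and the text inside one chosen tag."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.links = []    # URLs from <a href="..."> (pages to visit next)
        self.texts = []    # text found inside target_tag elements
        self._depth = 0    # greater than 0 while inside a target_tag element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag == self.target_tag:
            self._depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.target_tag and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.texts[-1] += data


def collect(html, target_tag):
    """Return (links, texts) extracted from one HTML document."""
    parser = TagCollector(target_tag)
    parser.feed(html)
    parser.close()
    return parser.links, [text.strip() for text in parser.texts]


# Hypothetical page content; a real crawler would fetch this over HTTP.
SAMPLE_PAGE = """
<html><body>
  <h2>Big data news</h2>
  <p>Intro text <a href="https://example.com/post/1">first post</a></p>
  <h2>Blog updates</h2>
  <a href="https://example.com/post/2">second post</a>
</body></html>
"""

links, headlines = collect(SAMPLE_PAGE, "h2")
print(links)
print(headlines)
```

In a real crawler the collected `links` would be queued and fetched in turn, subject to the target site's robots.txt and rate limits.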
Distributed File System (DFS) | A file system that allows access to files on multiple host computers that are shared over a computer network. | GFS (Google File System), HDFS (Hadoop Distributed File System), etc.
NoSQL (Not Only SQL) | A new type of data storage/retrieval system that uses a less restrictive consistency model (BASE characteristics) than the traditional relational database. | HBase, Cassandra, MongoDB, Couchbase, Redis, Neo4j, etc.
Distributed parallel processing | A technology that processes a large amount of data in a distributed parallel computing environment. | MapReduce
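To make the MapReduce entry more concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases applied to word counting. It only illustrates the programming model: a real framework such as Hadoop runs these phases in parallel across many machines, and the function names and sample input here are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for an input split handled by one mapper.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


splits = ["big data needs NoSQL", "NoSQL stores big data", "data data data"]
counts = word_count(splits)
print(counts)
```

Because each call to `map_phase` depends only on its own split and each call to `reduce_phase` only on its own key, both phases can be distributed across nodes without coordination, which is what makes the model suitable for batch processing of large-scale data.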
Scale-out: Total available capacity and performance increase almost linearly each time equipment
is added.
High availability: Even if some servers fail, the usability of the entire system is barely affected.
Optimized for throughput: It is suitable for the batch processing of large-scale data.
Persistence - Data should be kept on disk, not only in volatile memory.
Deployment - When a node is added or removed, data should be loaded automatically without the need for manual data distribution or intervention. There should be no constraints, such as a required distributed file system or shared storage, and no need for special hardware. The system should be operable on heterogeneous hardware.
Modeling flexibility - Data of various types, such as key-value pairs, hierarchical data, and graphs, should be easy to model.
Query flexibility - Multi-GET queries that obtain a set of values for the provided keys, and queries that obtain data based on a specific range of keys, are needed.
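The two query types named under query flexibility can be sketched on a toy in-memory key-value store. This is an illustration under stated assumptions, not the API of any particular NoSQL product: the class `SortedKeyValueStore`, its method names, and the `user:NNN` keys are all hypothetical, and a real store would also persist data to disk and distribute it across nodes.

```python
import bisect


class SortedKeyValueStore:
    """Toy key-value store that keeps its keys sorted for range queries."""

    def __init__(self):
        self._keys = []   # all keys, kept in sorted order
        self._data = {}   # key -> value

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def multi_get(self, keys):
        """Multi-GET: return the values for all provided keys that exist."""
        return {key: self._data[key] for key in keys if key in self._data}

    def range_query(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        low = bisect.bisect_left(self._keys, start)
        high = bisect.bisect_left(self._keys, end)
        return [(key, self._data[key]) for key in self._keys[low:high]]


store = SortedKeyValueStore()
store.put("user:003", "carol")
store.put("user:001", "alice")
store.put("user:002", "bob")
store.put("user:010", "dave")

found = store.multi_get(["user:001", "user:010", "user:999"])
in_range = store.range_query("user:002", "user:010")
print(found)
print(in_range)
```

Keeping the keys sorted is the design choice that makes the range query cheap; a store that only hashes keys can answer the multi-GET but has to scan every key to answer a range query.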
Big Data In 5 Minutes | What Is Big Data? | Introduction To Big Data | Big Data Explained |
Simplilearn https://www.youtube.com/watch?v=bAyrObl7TYE
Hadoop In 5 Minutes | What Is Hadoop? | Introduction To Hadoop | Hadoop Explained |
Simplilearn https://www.youtube.com/watch?v=aReuLtY0YMI
SQL vs NoSQL | Difference Between SQL And NoSQL | SQL And NoSQL Tutorial | SQL Training |
Simplilearn https://www.youtube.com/watch?v=jh14LlMHyds&t=3s