BDA 1
(22CA3121)
TEXT BOOKS:
1. Min Chen, Shiwen Mao, Yin Zhang, Victor C. M. Leung, Big Data: Related Technologies, Challenges and Future Prospects, Springer, 2014.
2. Tom White, Hadoop: The Definitive Guide, 4th Edition, O'Reilly, 2015.
UNIT- 1
INTRODUCTION: Dawn of the Big Data Era, Definition and Features of Big
Data, Big Data Value, The Development of Big Data, Challenges of Big Data.
RELATED TECHNOLOGIES: Cloud Computing - Cloud Computing
Preliminaries, Relationship Between Cloud Computing and Big Data, IoT - IoT
Preliminaries, Relationship Between IoT and Big Data, Data Center, Hadoop -
Hadoop Preliminaries, Relationship between Hadoop and Big Data
Dawn of the Big Data Era
• Over the past 20 years, data has increased on a large scale in various fields.
• According to a report from the International Data Corporation (IDC), the overall created and copied data volume in the world was 1.8 ZB in 2011, an increase of nearly nine times within five years.
• This figure is expected to double at least every two years in the near future.
• The term "big data" was coined against the backdrop of this explosive increase in global data and is mainly used to describe such enormous datasets.
• Compared with traditional datasets, big data generally includes masses of unstructured data that need more real-time analysis.
• In addition, big data also brings new opportunities for discovering new values,
helps us to gain an in-depth understanding of the hidden values, and incurs new
challenges, e.g., on how to effectively organize and manage such data.
• At present, big data has attracted considerable interest from industry, academia,
and government agencies.
• Recently, the rapid growth of big data has mainly come from people’s daily life, especially the services of Internet companies. For example, Google processes hundreds of PB of data, and Facebook generates over 10 petabytes (PB) of log data per month.
• While the volume of these large datasets is rising drastically, this growth also brings about many challenging problems that demand prompt solutions.
• First, the latest advances in information technology (IT) make it much easier to generate data. For example, on average, 72 hours of video are uploaded to YouTube every minute.
• Second, the volume of collected data keeps growing, which raises the problem of how to store and manage such huge, heterogeneous datasets with moderate requirements on hardware and software infrastructure.
• Third, in consideration of the heterogeneity, scalability, real-time nature, complexity, and privacy of big data, we must effectively “mine” the datasets at different levels with analysis, modeling, visualization, forecasting, and optimization techniques, so as to reveal their intrinsic properties and improve decision making.
• The rapid growth of cloud computing and the Internet of Things (IoT) further promotes the sharp growth of data.
• Cloud computing provides safeguarding, access sites, and channels for data assets.
• In the paradigm of IoT, sensors all over the world are collecting and
transmitting data which will be stored and processed in the cloud.
• Such data, in both quantity and mutual relations, will far surpass the capacities of the IT architectures and infrastructures of existing enterprises, and its real-time requirements will greatly stress the available computing capacity.
Definition and Features of Big Data
• Big data is an abstract concept. Apart from masses of data, it also has some other features,
which determine the difference between itself and “massive data” or “very big data”.
• In general, big data refers to the datasets that could not be perceived, acquired, managed,
and processed by traditional IT and software/hardware tools within a tolerable time.
• Because of different concerns, scientific and technological enterprises, research scholars,
data analysts, and technical practitioners have different definitions of big data.
• In 2010, Apache Hadoop defined big data as “datasets which could not be captured,
managed, and processed by general computers within an acceptable scope.”
• On the basis of this definition, in May 2011, McKinsey & Company, a global consulting agency, announced big data as “the Next Frontier for Innovation, Competition, and Productivity.”
• The US National Institute of Standards and Technology (NIST) defines big data as “Big data
shall mean the data of which the data volume, acquisition speed, or data representation
limits the capacity of using traditional relational methods to conduct effective analysis or
the data which may be effectively processed with important horizontal zoom technologies,”
which focuses on the technological aspect of big data. It indicates that efficient methods or
technologies need to be developed and used to analyze and process big data.
The Development of Big Data
• In the late 1970s, the concept of the “database machine” emerged, a technology specially used for storing and analyzing data.
• With the increase of data volume, the storage and processing capacity of a single
mainframe computer system has become inadequate.
• In the 1980s, people proposed “shared nothing” parallel database systems to meet the demand of increasing data volumes.
• The shared-nothing architecture is based on the use of clusters, in which every machine has its own processor, memory, and disk.
• The Teradata system was the first successful commercial parallel database system, and such databases later became very popular.
• On June 2, 1986, a milestone event occurred when Teradata delivered the first parallel database system with a storage capacity of 1 TB to Kmart, to help the large-scale North American retail company expand its data warehouse.
• In the late 1990s, the advantages of parallel databases were widely recognized in the database field.
• With the development of Internet services, indexes and queried contents grew rapidly, so search engine companies had to face the challenge of handling such rapidly growing data.
• Google created GFS and MapReduce programming models to cope with the challenges brought about
by data management and analysis at the Internet scale.
• In addition, contents generated by users, sensors, and other ubiquitous data sources also drive overwhelming data flows, which require a fundamental change in computing architecture and large-scale data processing mechanisms.
• Over the past few years, nearly all major companies, including Oracle, IBM, Microsoft, Google, Amazon,
and Facebook, etc., have started their big data projects. Taking IBM as an example, since 2005, IBM has
invested USD 16 billion on 30 acquisitions related to big data.
• In academia, big data was also under the spotlight.
• In 2008, Nature published the big data special issue.
• In 2011, Science also launched a special issue on the key technologies of “data processing” in big data.
• In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a
special issue on big data.
• In the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold.
• Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis among the 48 emerging technologies that deserve the most attention.
Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Characteristics (V’s):
1. Volume: It refers to the amount of data.
The size of data has grown from bits to yottabytes:
Bit -> a single binary digit; Byte -> 8 bits
KB -> 2^10 bytes, MB -> 2^20 bytes, GB -> 2^30 bytes,
TB -> 2^40 bytes, PB -> 2^50 bytes, EB -> 2^60 bytes (Exabytes),
ZB -> 2^70 bytes (Zettabytes), YB -> 2^80 bytes (Yottabytes)
(a worked sketch of these unit sizes is given after this list of characteristics)
• There are different sources of data, such as documents, PDFs, YouTube videos, chat conversations on Internet messengers, customer feedback forms on online retail websites, CCTV footage, and weather forecasts.
• The sources of big data:
1. Typical internal data sources: data present within an organization’s firewall. Data storage: file systems, SQL RDBMSs (Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL stores (MongoDB, Cassandra, etc.), and so on. Archives: archives of scanned documents, paper archives, customer correspondence records, patients’ health records, students’ admission records, students’ assessment records, and so on.
2. External data sources: data residing outside an organization’s firewall. Public web: Wikipedia, regulatory, compliance, weather, census data, etc.
3. Both (internal + external sources): sensor data, machine log data, social media, business apps, media, and documents.
• 2. Variety: Variety deals with the wide range of data types and sources of data: structured, semi-structured, and unstructured. Structured data: from traditional transaction processing systems, RDBMSs, etc. Semi-structured data: for example, Hypertext Markup Language (HTML) and eXtensible Markup Language (XML). Unstructured data: for example, unstructured text documents, audio, video, emails, photos, PDFs, social media, etc. (a sketch contrasting these three forms is given after this list of characteristics).
• 3. Velocity: It refers to the speed of data processing. We have moved from the days of batch processing to real-time processing.
• 4. Veracity: Veracity refers to biases, noise, and abnormality in data. The key question is: “Is all the data that is being stored, mined, and analysed meaningful and pertinent to the problem under consideration?”
• 5. Value: This refers to the value that big data can provide, and it relates directly to what organizations can do with the collected data. It is often quantified as the potential social or economic value that the data might create.
• 6. Volatility: It deals with the question “How long is the data valid?”
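As a small worked illustration of the Volume units above, the following sketch (illustrative only, not from the textbook) prints each binary unit from KB (2^10 bytes) up to YB (2^80 bytes).

import java.math.BigInteger;

public class DataSizeUnits {
    public static void main(String[] args) {
        // Binary data-size units listed under "Volume": KB = 2^10 bytes ... YB = 2^80 bytes.
        String[] units = {"KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"};
        BigInteger two = BigInteger.valueOf(2);
        for (int i = 0; i < units.length; i++) {
            int exponent = 10 * (i + 1); // 10, 20, ..., 80
            System.out.printf("1 %s = 2^%d bytes = %s bytes%n",
                    units[i], exponent, two.pow(exponent));
        }
    }
}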
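To make the Variety categories concrete, the sketch below contrasts the same (hypothetical) customer-feedback record in structured, semi-structured, and unstructured form; the field names and values are invented purely for illustration.

public class VarietyExample {
    public static void main(String[] args) {
        // Structured: fixed schema, as in an RDBMS row (column -> value).
        String[] columns = {"customer_id", "rating", "feedback_date"};
        Object[] row = {1042, 4, "2024-01-15"};

        // Semi-structured: self-describing tags, but no rigid schema (e.g., XML).
        String xml = "<feedback customer=\"1042\"><rating>4</rating>"
                + "<comment>Fast delivery</comment></feedback>";

        // Unstructured: free text with no predefined data model.
        String text = "The delivery was fast but the packaging could be better.";

        System.out.println("Structured     : " + columns.length + " columns, e.g. "
                + columns[0] + " = " + row[0]);
        System.out.println("Semi-structured: " + xml);
        System.out.println("Unstructured   : " + text);
    }
}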
The challenges with big data:
1. Data today is growing at an exponential rate: 90% of the world’s data was created in the last two years. How do we store and process this exponentially increasing data? The key question is also: will all this data be useful for analysis, and how will we separate knowledge from noise?
2. How to host big data solutions outside an organization’s own walls (e.g., in the cloud).
3. The period of retention of big data.
4. Lack of skilled professionals who possess a high level of proficiency
in data science that is vital in implementing Big data solutions.
5. Challenges with respect to capture, curation, storage, search,
sharing, transfer, analysis, privacy violations and visualization.
6. Shortage of data visualization experts
7. Scale : The storage of data is becoming a challenge for everyone.
8. Security: The production of more and more data increases security
and privacy concerns.
10. Schema: there is no place for a rigid schema; a dynamic schema is needed.
11. Continuous availability: how to provide 24x7 support.
12. Consistency: should one opt for strict consistency or eventual consistency?
13. Partition tolerance: how to build partition-tolerant systems that can take care of both hardware and software failures.
14. Data quality: inconsistent data, duplicates, logic conflicts, and missing data all result in data quality challenges.
RELATED TECHNOLOGIES
• Several fundamental technologies that are closely related to big data,
including
• cloud computing,
• Internet of Things (IoT),
• data center, and
• Hadoop
Cloud Computing Preliminaries
• In the big data paradigm, reliable hardware infrastructure is critical to provide reliable storage.
• The hardware infrastructure includes masses of elastic shared Information and
Communications Technology (ICT) resources.
• Such ICT resources shall be capable of horizontal and vertical expansion and
contraction, and dynamic reconfiguration for different applications.
• Over the years, the advances of cloud computing have been changing the way people
acquire and use hardware infrastructure and software service.
• Cloud computing refers to the delivery and use mode of IT infrastructure, i.e., acquiring necessary resources through the Internet on demand or in an expandable way.
• It also refers to the delivery and use mode of services, i.e., acquiring necessary services through the Internet on demand or in an expandable way. Such services may be related to software, the Internet, or others. In other words, users access a server at a remote location through the network and then use the services provided by that server.
Relationship Between Cloud Computing and Big Data
• Cloud computing and big data are closely related: the development of cloud computing provides elastic, on-demand storage and computing resources that serve as solutions for the storage and processing of big data.
• In turn, the emergence of big data also accelerates the development of cloud computing platforms and technologies.
IoT Preliminaries
• The basic idea of IoT is to connect different objects in the real world, such as
RFID, bar code readers, sensors, and mobile phones, etc., to realize
information exchange and to make them cooperate with each other to
complete a common task.
• IoT is deemed as the extension of the Internet and is an important part of
the future Internet.
• IoT is mainly characterized by the fact that it connects every object in the physical world such that the objects can be addressed, controlled, and communicated with.
• Compared with the Internet, IoT has the following main features:
• Various terminal equipment
• Automatic data acquisition
• Intelligent terminals
(Figure: Illustration of the IoT architecture)
Relationship Between IoT and Big Data
• In the IoT paradigm, an enormous number of networked sensors are embedded into devices in the real world.
• Such sensors deployed in different fields may collect various kinds of data, such as
environmental data, geographical data, astronomical data, and logistic data.
• Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT.
• The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected; the most classical characteristics include heterogeneity, variety, unstructured features, noise, and rapid growth.
• Although the current IoT data is not the dominant part of big data, according to HP’s forecast the quantity of sensors will reach one trillion by 2030, and the IoT data could then become the most important part of big data.
• A report from Intel pointed out that big data in IoT has three features that conform to the big
data paradigm:
(a) abundant terminals generating masses of data;
(b) data generated by IoT is usually semi-structured or unstructured;
(c) data of IoT is useful only when it is analyzed
Data Center
• In the big data paradigm, a data center is not only an organization for
concentrated storage of data, but also undertakes more responsibilities,
such as acquiring data, managing data, organizing data, and leveraging the
data values and functions.
• A data center has masses of data and organizes and manages data
according to its core objective and development path, which is more
valuable than owning a good site and resource.
• Big data requires data centers to provide powerful back-end support.
• The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, dissipate heat effectively, and back up data effectively.
• Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.
• The growth of big data applications accelerates the revolution and
innovation of data centers.
• Many big data applications have developed their unique architectures
and directly promote the development of storage, network, and
computing technologies related to data center.
• As the scale of data centers keeps expanding, how to reduce the operational cost of data centers becomes an important issue for their development.
• Data center shall not only be concerned with hardware facilities but also
strengthen soft capacities, i.e., the capacities of acquisition, processing,
organization, analysis, and application of big data.
• The data center may help business personnel analyze the existing data,
discover problems in business operation, and develop solutions from big
data
• Big data is an emerging paradigm, which will promote the explosive
growth of the infrastructure and related software of data center.
Hadoop Preliminaries
• Hadoop is a technology closely related to big data, which forms a powerful big data
systematic solution through data storage, data processing, system management, and
integration of other modules.
• Hadoop is a set of large-scale software infrastructures for Internet applications similar to
Google’s File System and MapReduce.
• Hadoop grew out of Nutch, an open-source project of Apache, with the initial design completed by Doug Cutting and Mike Cafarella.
• In 2006, Hadoop became an independent open-source project of Apache, which is widely
deployed by Yahoo, Facebook, and other Internet enterprises.
• At present, the biggest Hadoop cluster, operated by Yahoo, has 4,000 nodes and is used for data processing and analysis, including Yahoo’s advertisements, financial data, and user logs.
• Hadoop consists of two parts: HDFS (Hadoop Distributed File System) and the MapReduce (MR) framework.
• HDFS is the data storage layer for MapReduce; it is a distributed file system running on commodity hardware and designed with reference to Google’s GFS.
• HDFS is the basis for the main data storage of Hadoop applications: it splits files into data blocks of 64 MB and stores these blocks on different nodes of a cluster, so as to enable parallel processing with MapReduce and provide fault tolerance through replication (a minimal word-count sketch is given after the list of Hadoop’s advantages below).
• Hadoop has many advantages, but the following aspects are especially
relevant to the management and analysis of big data:
• Expandability: Hadoop allows the expansion or shrinkage of hardware
infrastructure without changing data format. The system will automatically
re-distribute data and computing tasks will be adapted to hardware
changes.
• High Cost Efficiency: Hadoop applies large-scale parallel computing to
commercial servers, which greatly reduces the cost per TB required for
storage capacity. The large-scale computing also enables it to accommodate
the continually growing data volume.
• Strong Flexibility: Hadoop may handle many kinds of data from various
sources. In addition, data from many sources can be synthesized in Hadoop
for further analysis. Therefore, it can cope with many kinds of challenges
brought by big data.
• High Fault-Tolerance: It is common for data loss and miscalculation to occur during the analysis of big data, but Hadoop can recover data and correct computing errors caused by node failures or network congestion.
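To make the division of labour between HDFS and the MapReduce framework concrete, the following is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API; the class name WordCount and the input/output HDFS paths are illustrative assumptions, not taken from the text above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: for every input line (read from HDFS blocks), emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  // Job driver: wires the mapper/reducer to input and output HDFS paths (illustrative command-line args).
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation of map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would typically be packaged as a JAR and submitted with the hadoop jar command after the input files have been copied into HDFS (for example with hadoop fs -put); the framework then runs map tasks on the nodes holding the input blocks and merges the per-word counts in the reduce phase.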
Relationship between Hadoop and Big Data
• Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network
searching, clickstream analysis, and social recommendation.
• As declared in June 2012, Yahoo runs Hadoop in 42,000 servers at four data centers to support
its products and services, e.g., searching and spam filtering, etc.
• At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012.
• Hadoop plays a crucial role in enabling the processing and analysis of Big
Data.
• It provides a scalable and distributed environment that can handle the
enormous volumes, high velocities, and diverse varieties of data that
constitute Big Data.
• By leveraging Hadoop's distributed computing capabilities, organizations can
store, process, and gain insights from large datasets that were previously too
challenging to manage using traditional methods.
• Hadoop is a technology framework that helps manage and process Big Data, making it a fundamental tool in the field of data analytics and allowing organizations to turn large, diverse datasets into actionable insight.