Unit 1 - BD - Introduction To Big Data

Big Data (CS-3032)

Kalinga Institute of Industrial Technology

Deemed to be University
Bhubaneswar-751024

School of Computer Engineering

Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission

3 Credit Lecture Note

Motivating Quotes

• “The world is one big data platform.” - Andrew McAfee, co-director of the
MIT Initiative on the Digital Economy, and the associate director of the
Center for Digital Business at the MIT Sloan School of Management.
• “Errors using inadequate data are much less than those using no data at
all.” - Charles Babbage, inventor and mathematician.
• “The most valuable commodity I know of is information.” - Gordon
Gekko, fictional character in the 1987 film Wall Street and its 2010
sequel Wall Street: Money Never Sleeps, played by Michael Douglas.
• “Big data will replace the need for 80% of all doctors” - Vinod Khosla,
Indian-born American engineer and businessman.
• “Thanks to big data, machines can now be programmed to do the next thing
right. But only humans can do the next right thing.” - Dov Seidman,
American author, attorney, columnist and businessman.

Motivating Quotes cont’d

• “With data collection, ‘the sooner the better’ is always the best answer.” -
Marissa Mayer, former president and CEO of Yahoo!
• “Data is a precious thing and will last longer than the systems
themselves.” - Tim Berners-Lee, inventor of the World Wide Web.
• “Numbers have an important story to tell. They rely on you to give them
a voice.” - Stephen Few, Information Technology innovator, teacher, and
consultant.
• “When we have all data online it will be great for humanity. It is a
prerequisite to solving many problems that humankind faces” - Vinod
Khosla, Indian-born American engineer and businessman.
Importance of the Course

• Big data is indeed a revolution in the field of information technology.
• Companies' use of big data increases every year, and their primary focus is on
customers. The field is flourishing specifically in Business to Consumer (B2C)
applications.
• Many organizations are actively looking for the right talent to analyze vast amounts
of data.
• The following four perspectives underline the importance of big data analytics:
  • Business
  • Data Science
  • Real-time Usability
  • Job Market

Further study: https://www.whizlabs.com/blog/big-data-analytics-importance/

Why Learn Big Data?

Why should you learn Big Data? Let’s start with what industry leaders say
about it:
• Gartner – Big Data is the new Oil.
• IDC – The Big Data market will grow 7 times faster than the overall IT market.
• IBM – It is not just a technology – it’s a business strategy for capitalizing on
information resources.
• IBM – Big Data is the biggest buzzword because technology makes it
possible to analyze all the available data.
• McKinsey – There will be a shortage of 1,500,000 Big Data professionals by
mid-2020.
Industries today are searching for new and better ways to maintain their
position and be prepared for the future. According to experts, Big Data
analytics gives leaders a path to capture insights and ideas and stay ahead
of tough competition.

Course Objective

• To understand the concepts and principles of big data.
• To explore the big data stacks and the technologies associated
with them.
• To evaluate the different NoSQL databases and frameworks
required to handle big data.
• To formulate the concepts, principles and techniques focusing
on applications to industry and real-world experience.
• To contextually integrate and correlate large amounts of
information to gain faster insights for real-time scenarios.

Course Outcome

1. Understand the concept of data management, and the evolution and
building blocks of big data
2. Analyse various big data technology foundations
3. Apply the MapReduce paradigm to solve data-intensive problems
4. Analyse big data frameworks such as Hadoop and NoSQL to
efficiently store and process big data to generate analytics
5. Present appropriate solutions to big data analytics problems
6. Interpret data findings effectively and visually for any audience

Course Contents

Sr #  Major and Detailed Coverage Area        Hrs

1     Overview of Big Data                    6
      Importance of Data, Characteristics of Data, Analysis of unstructured data,
      Introduction to Big Data, Challenges of conventional systems, Data analytics,
      Evolution of analytic scalability, Big Data Analytics, Key Big Data terminologies,
      Big Data analytics lifecycle, Cloud Computing and Big Data.

2     Big Data Technology Foundations         10
      Exploring the Big Data Stack, Data Sources Layer, Ingestion Layer, Storage Layer,
      Physical Infrastructure Layer, Platform Management Layer, Security Layer,
      Monitoring Layer, Analytics Engine, Visualization Layer, Big Data Applications,
      Virtualization. Introduction to Streams Concepts, Stream data model and
      architecture, Stream Computing, Sampling data in a stream, Filtering streams,
      Counting distinct elements in a stream.

3     Hadoop Ecosystem                        10
      Introduction to Hadoop, Hadoop Ecosystem, Hadoop Distributed File System,
      MapReduce, YARN, Hive, Pig and PigLatin, Jaql, Zookeeper, HBase, Cassandra,
      Oozie, Lucene, Avro, Mahout.

4     Storing Data in Big Data context        6
      Data Models, RDBMS and Hadoop, Non-Relational Database, Introduction to NoSQL,
      Types of NoSQL, Polyglot Persistence, Sharding

5     Frameworks and Visualization            6
      Distributed and Parallel Computing for Big Data, Big Data Visualizations: Visual
      data analysis techniques, interaction techniques, applications

Books

Textbook
• Big Data, Black Book, DT Editorial Services, Dreamtech Press, 2016.
Reference Books
• Seema Acharya, Subhashini Chellappan, Big Data and Analytics, Infosys Limited,
Publication: Wiley India Private Limited, 1st Edition, 2015.
• EMC Education Services (Editor), Discovering, Analyzing, Visualizing and
Presenting Data, Wiley, 2014.
• Stephan Kudyba, Thomas H. Davenport, Big Data, Mining, and Analytics:
Components of Strategic Decision Making, CRC Press, Taylor & Francis Group, 2014.
• Norman Matloff, The Art of R Programming, No Starch Press, Inc., 2011.
• Judith Hurwitz et al., Big Data For Dummies, Wiley, 2013.
• Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
• Pete Warden, Big Data Glossary, O’Reilly, 2011.

Evaluation

Grading:
• Internal assessment – 30 marks
  • 2 individual class tests = 2.5 x 2 = 5 marks
  • 2 group assignments = 2.5 x 2 = 5 marks
  • 1 group research paper presentation = 10 marks
  • 1 group mini project = 10 marks
• Mid-Term exam – 20 marks
• End-Term exam – 50 marks

Data

• A representation of information, knowledge, facts, concepts or instructions
which are being prepared or have been prepared in a formalized manner.
• Data is either intended to be processed, is being processed, or has been
processed.
• It can be in any form, stored internally in a computer system or computer
network, or in a person’s mind.
• Since the mid-1900s, people have used the word data to mean computer
information that is transmitted or stored.
• Data is the plural of datum (a Latin word meaning something given), a single
piece of information. In practice, however, people use data as both the
singular and plural form of the word.
• It must be interpreted, by a human or a machine, to derive meaning.
• It is present in homogeneous sources as well as heterogeneous sources.
• The need of the hour is to understand, manage, process, and analyze the data
to draw valuable insights.
Data → Information → Knowledge → Actionable Insights

Importance of Data

• The ability to analyze and act on data is increasingly important to businesses.
Data might be part of a study helping to cure a disease, boost a company’s
revenue, understand and interpret market trends, study customer behavior
or take financial decisions.
• The pace of change requires companies to be able to react quickly to
changing demands from customers and environmental conditions. Although
prompt action may be required, decisions are increasingly complex as
companies compete in a global marketplace.
• Managers may need to understand high volumes of data before they can
make the necessary decisions.
• Relevant data creates strong strategies - Opinions can turn into great
hypotheses, and those hypotheses are just the first step in creating a strong
strategy. It can look something like this: “Based on X, I believe Y, which will
result in Z”.
• Relevant data strengthens internal teams.
• Relevant data quantifies the purpose of the work.

Characteristics of Data

Data has three characteristics:
• Composition: Deals with the structure of the data, i.e. the source, the
granularity, the type, and the nature (whether static or real-time streaming).
• Condition: Deals with the state of the data, i.e. its usability for analysis.
Does it require cleaning for further enhancement and enrichment?
• Context: Deals with “where it has been generated”, “why was this
generated”, “how sensitive is this”, “what are the associated events”
and so on.

Human vs. Machine Readable data

• Human-readable refers to information that only humans can interpret and study,
such as an image or the meaning of a block of text. If it requires a person to
interpret it, that information is human-readable.
• Machine-readable refers to information that computer programs can process. A
program is a set of instructions for manipulating data. Such data can be
automatically read and processed by a computer, for example CSV, JSON, or XML.
Non-digital material (for example printed or hand-written documents) is by its
non-digital nature not machine-readable. But even digital material need not be
machine-readable. For example, consider a PDF document containing tables of data.
These are definitely digital but are not machine-readable, because a computer would
struggle to access the tabular information, even though they are very human-readable.
The equivalent tables in a format such as a spreadsheet would be machine-readable.
As another example, scans (photographs) of text are not machine-readable (but are
human-readable!), whereas the equivalent text in a format such as a simple ASCII
text file can be machine-readable and processable.
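The machine-readable formats mentioned above can be illustrated with a short sketch. It assumes Python and its standard csv and json modules; the city names and population figures are hypothetical values invented for the illustration, not data from these notes.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = "city,population\nBhubaneswar,837000\nCuttack,606000\n"
JSON_TEXT = '[{"city": "Bhubaneswar", "population": 837000}]'


def total_population_csv(text):
    """Sum the population column of a CSV document."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(int(row["population"]) for row in reader)


def cities_json(text):
    """Return the city names found in a JSON array of records."""
    return [record["city"] for record in json.loads(text)]


# A program can read both formats without human interpretation.
print(total_population_csv(CSV_TEXT))
print(cities_json(JSON_TEXT))
```

The same table embedded in a scanned image or a PDF would need OCR or layout analysis before a program could compute the sum, which is the distinction the slide draws.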

Classification of Digital Data

Digital data is classified into the following categories:

• Structured data
• Semi-structured data
• Unstructured data

Approximate percentage distribution of digital data

Structured Data

• It is defined as data that has a defined repeating pattern, and this pattern
makes it easier for any program to sort, read, and process the data.
• This data is in an organized form (e.g., in rows and columns) and can be easily
used by a computer program.
• Relationships exist between entities of data.
• Structured data:
  • Is organized in a pre-defined format
  • Is stored in a tabular form
  • Is the data that resides in fixed fields within a record or file
  • Is formatted data that has entities and their attributes mapped
  • Is used to query and report against predetermined data types
• Sources of structured data:
  • Relational databases
  • Multidimensional databases
  • Legacy databases
  • Flat files

Ease with Structured Data

• Insert/Update/Delete: DML operations provide the required ease with data
input, storage, access, processing, analysis, etc.
• Security: Encryption and tokenization solutions warrant the security of
information throughout its life cycle. The organization is able to retain control
and maintain compliance by ensuring that only authorized users are able to
decrypt and view sensitive information.
• Indexing: Indexing speeds up data retrieval at the cost of additional writes
and storage space, but the benefits that ensue in search operations are worth
the additional writes and storage space.
• Scalability: The storage and processing capabilities of a traditional DBMS can
easily be scaled up by increasing the horsepower of the database server.
• Transaction Processing: An RDBMS supports the ACID properties of transactions
to ensure accuracy, completeness and data integrity.
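The DML, indexing, and transaction points above can be tried out in a minimal sketch. It uses Python's built-in sqlite3 module as a small stand-in for a full RDBMS; the student table, its columns, and its rows are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE student (roll INTEGER PRIMARY KEY, name TEXT, marks INTEGER)"
)
# Indexing: faster lookups by name, paid for with extra writes and storage.
conn.execute("CREATE INDEX idx_student_name ON student(name)")

# DML: insert, update, delete.
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", 72), (2, "Ravi", 65), (3, "Meera", 88)],
)
conn.execute("UPDATE student SET marks = 70 WHERE roll = 2")
conn.execute("DELETE FROM student WHERE roll = 3")
conn.commit()

# Transaction: both statements succeed together or neither is applied.
try:
    with conn:  # commits on success, rolls back on an exception
        conn.execute("UPDATE student SET marks = marks + 5 WHERE roll = 1")
        # Violates the primary key, so the UPDATE above is rolled back too.
        conn.execute("INSERT INTO student VALUES (1, 'Duplicate', 0)")
except sqlite3.IntegrityError:
    pass

rows = conn.execute("SELECT roll, name, marks FROM student ORDER BY roll").fetchall()
print(rows)
```

The failed batch leaves the marks for roll 1 unchanged, which is the atomicity part of ACID in miniature.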

Semi-structured Data

• Semi-structured data, also described as having a schema-less or self-describing
structure, refers to a form which does not conform to a data model as in a
relational database but has some structure.
• In other words, data is stored inconsistently in rows and columns of a database.
• However, it is not in a form which can be used easily by a computer program.
• Examples: emails, XML, and markup languages such as HTML. Metadata for this data
is available but is not sufficient.
• Sources of semi-structured data:
  • Web data in the form of cookies
  • XML
  • Other markup languages
  • JSON

XML, JSON, BSON format

Source (XML & JSON): http://sqllearnergroups.blogspot.com/2014/03/how-to-get-json-format-through-sql.html

Source (JSON & BSON): http://www.expert-php.fr/mongodb-bson/
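As a rough illustration of the formats named on this slide, the sketch below parses the same record from XML and from JSON using only the Python standard library. The student record is a hypothetical example, not taken from the cited sources. BSON is a binary encoding of JSON-like documents and needs a third-party library, so it is left out here.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two semi-structured formats.
XML_TEXT = "<student><roll>1</roll><name>Asha</name></student>"
JSON_TEXT = '{"roll": 1, "name": "Asha"}'


def from_xml(text):
    """Turn the child elements of the root into a tag -> text dictionary."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


def from_json(text):
    """Parse a JSON object into a dictionary."""
    return json.loads(text)


xml_record = from_xml(XML_TEXT)
json_record = from_json(JSON_TEXT)
print(xml_record)
print(json_record)
```

One visible difference: XML element content arrives as text, so the roll number is the string "1", while JSON keeps it as a number.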

Characteristics of Semi-structured Data

Semi-structured data has the following characteristics:
• Inconsistent structure
• Self-describing (label/value pairs)
• Often schema information is blended with data values
• Data objects may have different attributes not known beforehand
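The last characteristic, objects whose attributes are not known beforehand, can be seen in a small sketch. It assumes Python's json module; the two contact records and their field names are hypothetical.

```python
import json

# Hypothetical records: each object carries its own labels, and the two
# objects do not share the same set of attributes.
RECORDS = """[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "tags": ["alumni"]}
]"""


def attribute_counts(text):
    """Count how many records carry each attribute."""
    counts = {}
    for record in json.loads(text):
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts


counts = attribute_counts(RECORDS)
print(counts)
```

Because the labels travel with the values, the program discovers the attributes while reading the data rather than from a schema defined in advance.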

Unstructured Data

• Unstructured data is a set of data that might or might not have any logical or
repeating patterns and is not recognized in a pre-defined manner.
• About 80 percent of enterprise data consists of unstructured content.
• Unstructured data:
  • Typically consists of metadata, i.e. additional information related to data.
  • Comprises inconsistent data, such as data obtained from files, social
  media websites, satellites, etc.
  • Consists of data in different formats such as e-mails, text, audio, video, or
  images.
• Sources of unstructured data:
  • Body of email
  • Chats, text messages
  • Text both internal and external to the organization
  • Mobile data
  • Social media data
  • Images, audio, videos

Challenges associated with Unstructured data

Working with unstructured data poses certain challenges, which are as follows:
• Identifying the unstructured data that can be processed
• Sorting, organizing, and arranging unstructured data in different sets and
formats
• Combining and linking unstructured data in a more structured format to derive
any logical conclusions out of the available information
• Cost in terms of the storage space and human resources needed to deal with the
exponential growth of unstructured data
Data Analysis of Unstructured Data
The complexity of unstructured data lies within the language that created it. Human
language is quite different from the language used by machines, which prefer
structured information. Unstructured data analysis refers to the process of
analyzing data objects that do not follow a predefined data model and/or are
unorganized. It is the analysis of any data that is stored over time within an
organizational data repository without any intent for its orchestration, pattern or
categorization.

Dealing with Unstructured data

Techniques for dealing with unstructured data:
• Data Mining (DM)
• Natural Language Processing (NLP)
• Text Analytics (TA)
• Noisy Text Analytics

Note: Refer to Appendix for further details.
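To give a feel for text analytics, here is a deliberately small sketch that turns free text into term counts. It is an assumption-laden toy, not a method from these notes: the stop-word list, the sample review, and the function name top_terms are all invented for the illustration, and real text analytics involves far more than counting words.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "and", "to", "of", "it"}


def top_terms(text, n=3):
    """Lower-case the text, split it into words, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    terms = [token for token in tokens if token not in STOP_WORDS]
    return Counter(terms).most_common(n)


# Hypothetical customer review used as unstructured input.
review = "The delivery is late. Late delivery and the support is slow to respond."
print(top_terms(review, 2))
```

The output is a structured summary (term, count) derived from unstructured text, which is the basic move behind the techniques listed above.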

Definition of Big Data

Big Data is high-volume, high-velocity, and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.
Source: Gartner IT Glossary

The definition has three parts:
• High-volume, high-velocity, high-variety information assets
• Cost-effective, innovative forms of information processing
• Enhanced insight and decision making

What is Big Data?

Think of the following:
• Every second, there are around 822 tweets on Twitter.
• Every minute, nearly 510 comments are posted, 293K statuses are updated,
and 136K photos are uploaded on Facebook.
• Every hour, Walmart, a global discount department store chain, handles more
than 1 million customer transactions.
• Every day, consumers make around 11.5 million payments using PayPal.
In the digital world, data is increasing rapidly because of the ever-increasing use of
the internet, sensors, and heavy machines. The sheer volume, variety, velocity, and
veracity of such data is signified by the term ‘Big Data’.

Structured Data + Semi-structured Data + Unstructured Data = Big Data
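To compare the figures above on a common scale, the following sketch converts each quoted rate to a per-day count. It only multiplies the numbers given on the slide; the dictionary keys are names chosen for this illustration, and the Walmart figure is a lower bound because the slide says "more than 1 million".

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60
HOURS_PER_DAY = 24

# Rates as quoted on the slide, scaled to one day.
per_day = {
    "tweets": 822 * SECONDS_PER_DAY,
    "facebook_comments": 510 * MINUTES_PER_DAY,
    "facebook_status_updates": 293_000 * MINUTES_PER_DAY,
    "facebook_photos": 136_000 * MINUTES_PER_DAY,
    "walmart_transactions": 1_000_000 * HOURS_PER_DAY,  # lower bound
    "paypal_payments": 11_500_000,  # already a daily figure
}

for name, count in per_day.items():
    print(f"{name}: {count:,} per day")
```

For example, 822 tweets per second works out to 822 x 86,400 = 71,020,800 tweets per day.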

Challenges of Conventional Systems

The main challenge for traditional computing systems is managing ‘Big Data’,
because of the immense speed and volume at which it is generated. Some of
the challenges are:
• The traditional approach cannot work on unstructured data efficiently.
• The traditional approach is built on top of the relational data model: relationships
between the subjects of interest are created inside the system and the
analysis is done based on them. This approach is not adequate for big data.
• The traditional approach is batch oriented and needs to wait for nightly ETL
(extract, transform and load) and transformation jobs to complete before
the required insight is obtained.
• Traditional data management, warehousing, and analysis systems fail to
analyze this type of data. Due to its complexity, big data is processed with
parallelism. Parallelism in a traditional system is achieved through costly
hardware such as MPP (Massively Parallel Processing) systems.
• Inadequate support for aggregated summaries of data.

Challenges of Conventional Systems cont’d

Other challenges can be categorized as:

• Data challenges:
  • Volume, velocity, veracity, variety
  • Data discovery and comprehensiveness
  • Scalability
• Process challenges:
  • Capturing data
  • Aligning data from different sources
  • Transforming data into a suitable form for data analysis
  • Modeling data (mathematically, by simulation)
• Management challenges:
  • Security
  • Privacy
  • Governance
  • Ethical issues

Elements of Big Data

In most big data circles, these are called the four V’s: volume, variety, velocity, and veracity.
(One might consider a fifth V, value.)
Volume - refers to the incredible amounts of data generated each second from social media,
cell phones, cars, credit cards, M2M sensors, photographs, video, etc. The amounts of
data have become so large that they can no longer be stored and analyzed using
traditional database technology. Instead, distributed systems are used, where parts of the
data are stored in different locations and brought together by software.
Variety - defined as the different types of data that digital systems now use. Data today looks
very different from data of the past. New and innovative big data technology now
allows structured and unstructured data to be harvested, stored, and used simultaneously.
Velocity - refers to the speed at which vast amounts of data are being generated, collected
and analyzed. Every second of every day, data is increasing. Not only must it be analyzed,
but the speed of transmission and access to the data must also remain instantaneous to
allow for real-time access. Big data technology allows the data to be analyzed while it is being
generated, without ever putting it into databases.
Veracity - is the quality or trustworthiness of the data. Just how accurate is all this data?
For example, think about all the Twitter posts with hashtags, abbreviations, typos, etc., and
the reliability and accuracy of all that content.

Elements of Big Data cont’d

Value - refers to the ability to transform a tsunami of data into business. Having endless
amounts of data is one thing, but unless it can be turned into value it is useless.

Refer to Appendix for data volumes.

Why Big Data?

More data for analysis will result in greater analytical accuracy and greater
confidence in the decisions based on the analytical findings. This would entail a greater
positive impact in terms of enhancing operational efficiencies, reducing cost and time,
innovating on new products and services, and optimizing existing services.

More data
→ More accurate analysis
→ Greater confidence in decision making
→ Greater operational efficiencies, cost reduction, time reduction, new
product development, optimized offerings, etc.

Data Analytics

Data analytics is the process of extracting useful information by analysing
different types of data sets. It is used to discover hidden patterns, outliers,
trends, unknown correlations and other useful information for the
benefit of faster decision making.
There are 4 types of analytics: descriptive, diagnostic, predictive, and prescriptive.

Analytics Approach – What is the data telling?

Approach: Explanation
Descriptive: What’s happening in my business?
• Comprehensive, accurate and historical data
• Effective visualisation
Diagnostic: Why is it happening?
• Ability to drill down to the root cause
• Ability to isolate all confounding information
Predictive: What’s likely to happen?
• Decisions are automated using algorithms and technology
• Historical patterns are used to predict specific outcomes using
algorithms
Prescriptive: What do I need to do?
• Recommended actions and strategies based on champion/challenger
strategy outcomes
• Applying advanced analytical algorithms to make specific
recommendations

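The four approaches can be contrasted on one toy data set. The sketch below is a loose illustration under stated assumptions, not a recommended analytics method: the monthly sales, the ad spend, and the inventory figure are hypothetical, the "diagnostic" step is only a crude ratio, and the "predictive" step is an ordinary least-squares trend line.

```python
from statistics import mean

# Hypothetical figures for months 1 to 4.
monthly_sales = [100, 110, 120, 130]
monthly_ad_spend = [10, 11, 12, 13]
inventory_on_hand = 90

# Descriptive: what is happening? Summarize the history.
average_sales = mean(monthly_sales)

# Diagnostic: why is it happening? A crude look at how sales moved with ad spend.
sales_per_ad_unit = (monthly_sales[-1] - monthly_sales[0]) / (
    monthly_ad_spend[-1] - monthly_ad_spend[0]
)


def fit_line(ys):
    """Least-squares line through the points (1, ys[0]), (2, ys[1]), ..."""
    xs = range(1, len(ys) + 1)
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


# Predictive: what is likely to happen? Extend the trend to month 5.
slope, intercept = fit_line(monthly_sales)
forecast_month_5 = slope * 5 + intercept

# Prescriptive: what should be done? Order enough stock to cover the forecast.
recommended_order = max(0, forecast_month_5 - inventory_on_hand)

print(average_sales, sales_per_ad_unit, forecast_month_5, recommended_order)
```

Each step builds on the previous one: the summary describes the past, the ratio hints at a cause, the trend projects forward, and the order quantity turns the projection into an action.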
Mapping of Big Data’s Vs to Analytics Focus

Historical data can be quite large. There might be a need to process huge amounts of data many times a
day as it gets updated continuously. Therefore volume is mapped to history. Variety is pervasive.
Input data, insights, and decisions can span a variety of forms, hence it is mapped to all three.
High-velocity data might have to be processed to help real-time decision making, and it plays across
descriptive, predictive, and prescriptive analytics when they deal with present data. Predictive and
prescriptive analytics create data about the future. That data is uncertain by nature and its veracity
is in doubt. Therefore veracity is mapped to prescriptive and predictive analytics when they deal with
the future.

Analysis vs. Reporting

Reporting - The process of organizing data into informational summaries
in order to monitor how different areas of a business are performing.
Analysis - The process of exploring data and reports in order to extract
meaningful insights, which can be used to better understand and improve
business performance.
Difference between Reporting and Analysis:
• Reporting translates raw data into information. Analysis transforms
data and information into insights.
• Reporting helps companies to monitor their online business and be alerted
when data falls outside of expected ranges. Good reporting should raise
questions about the business from its end users. The goal of analysis is to
answer questions by interpreting the data at a deeper level and providing
actionable recommendations.
• In summary, reporting shows you what is happening, while analysis focuses
on explaining why it is happening and what you can do about it.

Evolution of Analytics Scalability

It goes without saying that the world of big data requires new levels of scalability. As the
amount of data organizations process continues to increase, the same old methods for
handling data just won’t work anymore. Organizations that don’t update their
technologies to provide a higher level of scalability will quite simply choke on big data.
Luckily, there are multiple technologies available that address different aspects of the
process of taming big data and making use of it in analytic processes.

Traditional Analytics Architecture

Database 1, Database 2, Database 3, ..., Database n → Extract → Analytic Server

The heavy processing occurs in the analytic environment. This may even be a PC.

Evolution of Analytics Scalability cont’d

Modern In-Database Analytics Architecture

Database 1, Database 2, Database 3, ..., Database n → Consolidate → Enterprise Data
Warehouse (EDW) ← Submit Request ← Analytic Server

Refer to Appendix for further details on EDW.

In an in-database environment, the processing stays in the database where the data
has been consolidated. The user’s machine just submits the request; it doesn’t do
heavy lifting.
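The contrast between the traditional and in-database architectures can be sketched with Python's built-in sqlite3 module standing in for the warehouse. This is an assumption made for illustration only: the sales table and its rows are hypothetical, and a real EDW would sit on a separate server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the consolidated warehouse
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("west", 250), ("east", 50), ("west", 25)],
)

# Traditional style: extract every row, then aggregate on the analytic machine.
totals_extracted = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals_extracted[region] = totals_extracted.get(region, 0) + amount

# In-database style: submit the request; only the small summary comes back.
totals_in_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

print(totals_extracted)
print(totals_in_db)
```

Both approaches give the same totals, but the first moves four rows to the client and the second moves two summary rows. With billions of rows, that difference is the point of in-database processing.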

Evolution of Analytics Scalability cont’d

MPP Database Analytics Architecture

Massively parallel processing (MPP) database systems are the most mature, proven, and
widely deployed mechanism for storing and analyzing large amounts of data. An MPP
database spreads data out into independent pieces managed by independent
storage and central processing unit (CPU) resources. Conceptually, it is like
having pieces of data loaded onto multiple network-connected personal computers
around a house. The data in an MPP system gets split across a variety of disks managed
by a variety of CPUs spread across a number of servers.

Single overloaded server → Multiple lightly loaded servers

Instead of a single overloaded database, an MPP database breaks the data into
independent chunks with independent disk and CPU.

MPP Database Example

A one-terabyte table is split into ten 100-gigabyte chunks.
• A traditional database will query a one-terabyte table one row at a time.
• An MPP database runs 10 simultaneous 100-gigabyte queries.

An MPP database is based on the principle of SHARE THE WORK!

An MPP database spreads data out across multiple sets of CPU and disk space. Think
logically about dozens or hundreds of personal computers each holding a small piece of a
large set of data. This allows much faster query execution, since many independent
smaller queries are running simultaneously instead of just one big query.
If more processing power and more speed are required, just bolt on additional
capacity in the form of additional processing units.
MPP systems build in redundancy to make recovery easy and have resource
management tools to manage the CPU and disk space.

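The "share the work" principle can be mimicked in a few lines. The sketch below is only an analogy under stated assumptions: a Python list of 1,000 integers stands in for the one-terabyte table, ten slices stand in for the 100-gigabyte chunks, and a thread pool stands in for independent CPU and disk units (Python threads show the structure of the idea, not real MPP speed-ups). The function names are invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into n_chunks slices of near-equal size."""
    size = -(-len(rows) // n_chunks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def query_chunk(chunk):
    """The 'small query': sum the amounts above 500 within one chunk."""
    return sum(amount for amount in chunk if amount > 500)


table = list(range(1, 1001))           # stand-in for the large table
chunks = split_into_chunks(table, 10)  # ten independent pieces

# Each worker answers the query for its own chunk at the same time.
with ThreadPoolExecutor(max_workers=10) as pool:
    partial_results = list(pool.map(query_chunk, chunks))

total = sum(partial_results)  # combine the partial answers
print(len(chunks), total)
```

The final answer is the same as one big query over the whole table; the work has simply been divided into independent pieces whose results are combined at the end.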
MPP Database Example cont’d

An MPP system allows the different sets of CPU and disk to run the process concurrently.

An MPP system breaks the job into pieces.

Single-Threaded Process vs. Parallel Process

Big Data Analytics

Big data analytics is the process of extracting useful information by analysing different
types of big data sets. It is used to discover hidden patterns, outliers, trends,
unknown correlations and other useful information for the benefit of faster decision making.
Big Data Applications in Different Industries

What is Big Data Analytics?

Big Data Analytics is:
• Moving code to data for greater speed and efficiency
• Richer, deeper insights into customers, partners and the business
• Better, faster decisions in real time
• Working with datasets whose volume and variety are beyond the storage and
capacity of a typical DB
• Competitive advantage
• IT’s collaboration with business users and data scientists
• Technology-enabled analytics
• Time-sensitive decisions made in near real time by processing real-time data

What Big Data Analytics isn’t

Big Data Analytics isn’t:
• Only about volume
• Just about technology
• Meant to replace RDBMS
• A “one-size-fits-all” traditional RDBMS built on shared disk and memory
• Only used by huge online companies
• Meant to replace the data warehouse

Challenges that prevent business from capitalizing on Big Data

1. Obtaining executive sponsorship for investments in big data and its related
activities, such as training.
2. Getting the business units to share information across organizational silos.
3. Finding the right skills to manage large amounts of structured,
semi-structured, and unstructured data and create insights from it.
4. Determining the approach to scale rapidly and elastically. In other words,
the need to address the storage and processing of the large volume, velocity and
variety of big data.
5. Deciding whether to use structured or unstructured, internal or external
data to make business decisions.
6. Determining what to do with the insights created from big data.
7. Choosing the optimal way to report findings and analysis of big data so that
the presentations make the most sense.

Top challenges facing Big Data

1. Scale: Storage is one major concern that needs to be addressed to handle
the need for scaling rapidly and elastically. The need of the hour is storage
that can best withstand the onslaught of the large volume, velocity, and variety
of big data. Should it scale vertically or horizontally?
2. Security: Most of the NoSQL (Not only SQL) big data platforms have poor
security mechanisms (lack of proper authentication and authorization
mechanisms) when it comes to safeguarding big data.
3. Schema: Rigid schemas have no place. The need of the hour is dynamic
schemas; static (pre-defined) schemas are passé.
4. Data Quality: How to maintain data quality (data accuracy, completeness,
timeliness, etc.)? Is the appropriate metadata in place?
5. Partition Tolerance: How to build partition-tolerant systems that can take
care of both hardware and software failures?
6. Continuous availability: The question is how to provide 24/7 support,
because almost all RDBMS and NoSQL big data platforms have a certain
amount of downtime built in.

Technologies to help meet the challenges posed by Big Data

1. Cheap and abundant storage
4. Parallel processing, clustering, visualisation, large
grid environments, high connectivity, and high
throughputs rather than low latency.
5. Cloud computing and other flexible resource
allocation agreements

Key terminologies used in Big Data

In-Memory Analytics: Data access from non-volatile storage such as a hard disk
is a slow process. The more data has to be fetched from the hard disk or
secondary storage, the slower the process gets. The problem can be addressed
using in-memory analytics. All the relevant data is stored in RAM or primary
storage, thus eliminating the need to access the data from the hard disk. The
advantages are faster access, rapid deployment, better insights and minimal IT
involvement. In-memory analytics makes everything instantly available thanks to
the lower cost of RAM or flash memory, and data can be stored and processed at
lightning speed.
In-Database Processing: Also called in-database analytics. It works by
fusing data warehouses with analytical systems. Typically, the data from various
enterprise Online Transaction Processing (OLTP) systems is cleaned up
(de-duplication, scrubbing etc.) through the process of ETL and stored in the
Enterprise Data Warehouse or in data marts. Traditionally, the huge datasets
are then exported to analytical programs for complex and extensive
computations. With in-database processing, the database itself runs the
computations, which removes the export step and saves time.
Note: Refer to Appendix for further details on OLTP and ETL.
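
The difference between exporting data and computing inside the database can be sketched in a few lines. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for a data warehouse, and the `sales` table and its rows are made up for the example.

```python
import sqlite3

# Hypothetical "warehouse" table; the name and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0), ("west", 25.0)],
)

# Export-then-compute: every row leaves the database before any arithmetic happens.
exported_rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals_exported = {}
for region, amount in exported_rows:
    totals_exported[region] = totals_exported.get(region, 0.0) + amount

# In-database: the engine aggregates and returns only one row per region.
totals_in_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

print(totals_exported)
print(totals_in_db)
```

Both approaches give the same totals. The in-database version moves far fewer rows out of the database, which is the point of the technique when the datasets are huge.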

Key terminologies used in Big Data cont’d

Symmetric Multiprocessor System (SMP): In SMP, a single common main memory
is shared by two or more identical processors. The processors have full
access to all I/O devices and are controlled by a single operating system
instance. Each processor has its own high-speed memory, called cache memory,
and the processors are connected using a system bus.

Source: https://en.wikipedia.org/wiki/Symmetric_multiprocessing

Key terminologies used in Big Data cont’d

Parallel Systems: A parallel database system is a tightly coupled system.
The processors co-operate for query processing. The user is unaware of the
parallelism since he/she has no access to a specific processor of the system.

[Figure: users, a front end computer, and processors P1, P2 and P3 forming the back end parallel system]

Key terminologies used in Big Data cont’d

Distributed Systems: Distributed systems are loosely coupled and are composed
of individual machines. Each machine can run its own applications and serve
its own users. The data is usually distributed across several machines, so a
number of machines may have to be accessed to answer a single user query.

[Figure: machines P1, P2 and P3, each with its own users, connected through a network]

Distributed vs. Parallel Computing

Parallel Computing | Distributed Computing
Shared memory system | Distributed memory system
Multiple processors share a single bus and memory unit | Autonomous computer nodes connected via a network
Communication between processors is of the order of Tbps | Communication between processors is of the order of Gbps
Limited scalability | Better scalability and cheaper

Distributed computing in a local network is called cluster computing.
Distributed computing in a wide-area network is called grid computing.

Key terminologies used in Big Data cont’d

In a shared memory (SM) architecture, a common central memory is shared by
multiple processors. In a shared disk (SD) architecture, multiple processors
share a common collection of disks while having their own private memory.

[Figure: shared memory (SM) and shared disk (SD) architectures]

Key terminologies used in Big Data cont’d

In a shared nothing (SN) architecture, neither memory nor disk is shared among
multiple processors.
Advantages:
q Fault Isolation: A fault in a single machine or node is contained and
confined to that node exclusively, and is exposed only through messages.
q Scalability: If the disk is a shared resource, synchronization is needed to
maintain a consistent shared state, which means that different nodes have to
take turns to access the critical data. This limits how many nodes can be
added to a distributed shared disk system and thus compromises scalability.
A shared nothing system has no such shared resource, so it does not hit this
limit.
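
A minimal sketch of the shared nothing idea, assuming hash partitioning (one common way to decide which node owns a key; the slides do not prescribe it). The `Node` class and the `put`/`get` helpers are hypothetical names introduced for the example.

```python
import zlib


class Node:
    """A shared-nothing node: its dictionary stands in for private memory and disk."""

    def __init__(self, name):
        self.name = name
        self.store = {}


def owner(key, nodes):
    # A stable hash picks exactly one node for each key (assumed partitioning rule).
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def put(key, value, nodes):
    owner(key, nodes).store[key] = value


def get(key, nodes):
    return owner(key, nodes).store.get(key)


nodes = [Node("n0"), Node("n1"), Node("n2")]
for i in range(9):
    put(f"user{i}", i, nodes)

print(get("user4", nodes))
print({node.name: sorted(node.store) for node in nodes})
```

Each key lives on exactly one node, so no node ever waits for another to release a shared disk or memory region. If one node fails, only the keys it owns are affected, which is the fault isolation described above.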

Key terminologies used in Big Data cont’d

CAP Theorem: In the past, when we wanted to store more data or increase
our processing power, the common option was to scale vertically (get more
powerful machines) or to further optimize the existing code base. With the
advances in parallel processing and distributed systems, it is now more
common to scale horizontally, that is, to have more machines do the same task
in parallel. To pick effectively among Apache projects such as Spark, Hadoop,
Kafka, ZooKeeper and Storm, a basic idea of the CAP theorem is necessary. The
CAP theorem is also called Brewer’s theorem. It states that a distributed
computing environment can have only 2 of the following 3 properties:
Consistency, Availability and Partition Tolerance. One must be sacrificed.
q Consistency implies that every read fetches the last write.
q Availability implies that reads and writes always succeed. In other words,
each non-failing node will return a response in a reasonable amount of time.
q Partition Tolerance implies that the system will continue to function when
a network partition occurs.

CAP Theorem cont’d

The CAP theorem categorizes systems into three categories:
CP (Consistent and Partition Tolerant) - At first glance this reads as a
system that is consistent and partition tolerant but never available. In fact,
CP refers to a category of systems where availability is sacrificed only in
the case of a network partition.
CA (Consistent and Available) - CA systems are consistent and available
in the absence of any network partition. Single-node DB servers are often
categorized as CA systems, because they do not need to deal with partition
tolerance.
AP (Available and Partition Tolerant) - These are systems that are available
and partition tolerant but cannot guarantee consistency.
Source: Towards Data Science

CAP Theorem Proof

Let's consider a very simple distributed system. Our system is composed of
two servers, S1 and S2. Both of these servers are keeping track of the same
variable, v, whose value is initially v0. S1 and S2 can communicate with each
other and can also communicate with an external client.
[Figure: S1 and S2, each holding v0, connected to each other and to the client]
Assume for contradiction that the system is consistent, available, and
partition tolerant.
The first thing we do is partition our system, so that S1 and S2 can no
longer communicate with each other.
[Figure: S1 and S2, each holding v0, with the link between them cut]
Next, the client requests that v1 be written to S1. Since the system is
available, S1 must respond. Since the network is partitioned, however, S1
cannot replicate its data to S2. This phase of execution is called α1.
[Figure: the client sends "Write v1" to S1; S1 then holds v1 while S2 still holds v0; S1 replies "done"]

CAP Theorem Proof cont’d

Next, the client issues a read request to S2. Again, since the system is
available, S2 must respond, and since the network is partitioned, S2 cannot
update its value from S1. It returns v0. This phase of execution is called α2.
[Figure: the client sends "read" to S2; S2, still holding v0, returns v0]
S2 returns v0 to the client after the client has already written v1 to S1. This
is inconsistent.
We assumed that a consistent, available, partition tolerant system existed, but
we have just shown that for any such system there exists an execution in which
the system acts inconsistently. Thus, no such system exists.
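
The execution used in the proof can be replayed as a toy simulation. This is a sketch under simplifying assumptions: the `Server` class and the `run` function are hypothetical, replication is modelled as a direct assignment, and a partition is modelled as a flag that blocks replication.

```python
class Server:
    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.peers = []

    def write(self, value, partitioned):
        # Availability: the server must accept the write and respond.
        self.value = value
        if not partitioned:
            # Replication to peers is only possible when the network is intact.
            for peer in self.peers:
                peer.value = value
        return "done"

    def read(self):
        # Availability: the server must answer with whatever value it holds.
        return self.value


def run(partitioned):
    s1, s2 = Server("S1", "v0"), Server("S2", "v0")
    s1.peers, s2.peers = [s2], [s1]
    s1.write("v1", partitioned)  # phase alpha_1: client writes v1 to S1
    return s2.read()             # phase alpha_2: client reads from S2


print(run(partitioned=False))  # v1: the read sees the last write
print(run(partitioned=True))   # v0: stale read, consistency is violated
```

With the partition in place, the only ways to avoid the stale read are to let S2 refuse to answer (give up availability) or to never allow the partition (give up partition tolerance), which is exactly the trade-off the theorem states.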

Big Data Analytics Lifecycle

q Big Data analysis differs from traditional data analysis primarily due to the
volume, velocity and variety characteristics of the data being processed.
q To address the distinct requirements for performing analysis on Big Data,
a step-by-step methodology is needed to organize the activities and tasks
involved with acquiring, processing, analyzing and repurposing data.
q From a Big Data adoption and planning perspective, it is important that, in
addition to the lifecycle, consideration be given to issues of training,
education, tooling and staffing of a data analytics team.
q The Big Data analytics lifecycle can be divided into the following nine
stages:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results

Big Data Analytics Lifecycle cont’d

[Figure: the nine stages in sequence, from Stage 1 (Business Case Evaluation) to Stage 9 (Utilization of Analysis Results)]

1. Business Case Evaluation

q Before any Big Data project can be started, it needs to be clear what the
business objectives and results of the data analysis should be.
q This initial phase focuses on understanding the project objectives and
requirements from a business perspective, and then converting this knowledge
into a data mining problem definition.
q A preliminary plan is designed to achieve the objectives. A decision model,
especially one built using the Decision Model and Notation standard, can be
used.
q Once an overall business problem is defined, the problem is converted into
an analytical problem.

2. Data Identification

q The Data Identification stage determines the origin of data. Before data
can be analysed, it is important to know what the sources of the data will be.
q Especially if data is procured from external suppliers, it is
necessary to clearly identify what the original source of the
data is and how reliable (frequently referred to as the
veracity of the data) the dataset is.
q The second stage of the Big Data Lifecycle is very important,
because if the input data is unreliable, the output data will
also definitely be unreliable.
q Identifying a wider variety of data sources may increase the
probability of finding hidden patterns and correlations.

3. Data Acquisition and Filtering

q The Data Acquisition and Filtering phase builds upon the previous stage of
the Big Data Lifecycle.
q In this stage, the data is gathered from different sources, both
from within the company and outside of the company.
q After the acquisition, a first step of filtering is conducted to
filter out corrupt data.
q Additionally, data that is not necessary for the analysis will be
filtered out as well.
q The filtering step is applied to each data source individually, that is,
before the data is aggregated into the data warehouse.
q In many cases, especially where external, unstructured data is
concerned, some or most of the acquired data may be irrelevant
(noise) and can be discarded as part of the filtering process.

3. Data Acquisition and Filtering cont’d

q Data classified as “corrupt” can include records with missing or nonsensical
values or invalid data types. Data that is filtered out for one analysis may
possibly be valuable for a different type of analysis.
q Metadata can be added via automation to data from both internal and
external data sources to improve classification and querying.
q Examples of appended metadata include dataset size and structure, source
information, date and time of creation or collection, and language-specific
information.
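
The filtering and metadata steps can be sketched as follows. The record layout, the corruption rule in `is_corrupt` and the source name `example_survey` are assumptions made for the illustration; a real project would define its own rules.

```python
from datetime import datetime, timezone


def is_corrupt(record):
    # Assumed rule: "user" must be a string and "rating" an integer from 1 to 5.
    user, rating = record.get("user"), record.get("rating")
    if not isinstance(user, str) or not isinstance(rating, int):
        return True
    return not 1 <= rating <= 5


def acquire(records, source):
    kept = [r for r in records if not is_corrupt(r)]
    # Filtered-out records are set aside rather than destroyed,
    # because they may be valuable for a different analysis.
    set_aside = [r for r in records if is_corrupt(r)]
    metadata = {
        "source": source,
        "record_count": len(kept),
        "fields": sorted({key for r in kept for key in r}),
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
    return kept, set_aside, metadata


raw = [
    {"user": "u1", "rating": 4},
    {"user": "u2", "rating": None},  # missing value
    {"user": "u3", "rating": 99},    # nonsensical value
    {"user": "u4", "rating": "5"},   # invalid data type
]
kept, set_aside, metadata = acquire(raw, source="example_survey")
print(kept)
print(metadata)
```

The metadata dictionary covers the examples listed above: size and structure (`record_count`, `fields`), source information and collection time.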

4. Data Extraction

q Some of the data identified in the two previous stages may be incompatible
with the Big Data tool that will perform the actual analysis.
q To deal with this problem, the Data Extraction stage is dedicated to
extracting different data formats from the data sources and transforming them
into a format that the Big Data tool is able to process and analyse.
q The complexity of the transformation, and the extent to which it is
necessary to transform the data, depends greatly on the Big Data tool that has
been selected.
q The Data Extraction lifecycle stage is dedicated to extracting disparate
data and transforming it into a format that the underlying Big Data solution
can use for the purpose of the data analysis.

4. Data Extraction cont’d

q (A) illustrates the extraction of comments and a user ID embedded within an
XML document without the need for further transformation.
q (B) demonstrates the extraction of the latitude and longitude coordinates of
a user from a single JSON field.
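
Both cases can be sketched with the Python standard library. The XML and JSON documents below are invented for the example (element names, the `location` field and its comma-separated layout are assumptions), since the original figures are not reproduced here.

```python
import json
import xml.etree.ElementTree as ET

# Case (A): values are read straight out of the XML, no further transformation.
xml_doc = """
<post>
  <user id="u42"/>
  <comment>Great service</comment>
  <comment>Delivery was late</comment>
</post>
""".strip()
root = ET.fromstring(xml_doc)
user_id = root.find("user").get("id")
comments = [c.text for c in root.findall("comment")]

# Case (B): one JSON field holds both coordinates, so it has to be split
# and converted before the two values can be used separately.
json_doc = '{"user": "u42", "location": "20.2961, 85.8245"}'
record = json.loads(json_doc)
latitude, longitude = (float(part) for part in record["location"].split(","))

print(user_id, comments)
print(latitude, longitude)
```

Case (B) shows why extraction sometimes needs a transformation step: the analysis tool expects two numeric columns, but the source delivers one text field.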

5. Data Validation and Cleansing

q Data that is invalid leads to invalid results. In order to ensure that only
the appropriate data is analysed, the Data Validation and Cleansing stage of
the Big Data Lifecycle is required.
q During this stage, data is validated against a set of predetermined
conditions and rules in order to ensure the data is not corrupt.
q An example of a validation rule would be to exclude all persons who are more
than 100 years old, since, given human lifespans, it is very unlikely that
data about these persons is correct.
q The Data Validation and Cleansing stage is dedicated to establishing often
complex validation rules and removing any known invalid data.
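
The age rule above can be written as a small validation function. The record layout is hypothetical, and the extra checks (age must be an integer and not negative) are assumptions added to make the rule complete.

```python
def is_valid(person):
    """Return True when the record passes the validation rules."""
    age = person.get("age")
    if not isinstance(age, int):
        return False  # assumed rule: a missing or non-integer age is invalid
    return 0 <= age <= 100  # rule from the text: nobody older than 100


people = [
    {"name": "A", "age": 34},
    {"name": "B", "age": 140},
    {"name": "C", "age": None},
    {"name": "D", "age": 100},
]
valid = [p for p in people if is_valid(p)]
rejected = [p for p in people if not is_valid(p)]

print([p["name"] for p in valid])
print([p["name"] for p in rejected])
```

Real validation rules are often more complex than a single range check, but they follow the same pattern: a predicate per rule, applied to every record before analysis.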

5. Data Validation and Cleansing cont’d

q For example, as illustrated in the figure below, the first value in Dataset B
is validated against its corresponding value in Dataset A.
q The second value in Dataset B is not validated against its corresponding
value in Dataset A. If a value is missing, it is inserted from Dataset A.
q Data validation can be used to examine interconnected datasets in order to
fill in missing valid data.
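
A minimal sketch of validating one dataset against another and filling in missing values. The two dictionaries and the `reconcile` helper are hypothetical; Dataset A is treated as the trusted reference, which is an assumption.

```python
dataset_a = {"id1": "Bhubaneswar", "id2": "Cuttack", "id3": "Puri"}
dataset_b = {"id1": "Bhubaneswar", "id2": None, "id3": "Pury"}


def reconcile(reference, target):
    filled, mismatched = {}, []
    for key, value in target.items():
        if value is None and key in reference:
            # Missing value: insert it from the reference dataset.
            filled[key] = reference[key]
        else:
            filled[key] = value
            if key in reference and value != reference[key]:
                # Value present but fails validation against the reference.
                mismatched.append(key)
    return filled, mismatched


filled, mismatched = reconcile(dataset_a, dataset_b)
print(filled)
print(mismatched)
```

Here `id2` is filled in from Dataset A, while `id3` is only flagged, because deciding which of two conflicting values is right usually needs a separate rule.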

6. Data Aggregation and Representation

q Data may be spread across multiple datasets, requiring that the datasets be
joined together to conduct the actual analysis.
q In order to ensure that only the correct data is analysed in the next stage,
it might be necessary to integrate multiple datasets.
q The Data Aggregation and Representation stage is dedicated to integrating
multiple datasets to arrive at a unified view.
q Additionally, data aggregation greatly speeds up the analysis process,
because the Big Data tool is not required to join different tables from
different datasets.
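
Joining two datasets on a shared key to get a unified view can be sketched as below. The `customers` and `orders` records and their field names are invented for the example.

```python
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 250.0},
    {"order_id": 11, "customer_id": 2, "amount": 90.0},
    {"order_id": 12, "customer_id": 1, "amount": 40.0},
]

# Build a lookup on the shared key once, then attach the name to every order.
names = {c["customer_id"]: c["name"] for c in customers}
unified = [
    {**order, "name": names[order["customer_id"]]}
    for order in orders
    if order["customer_id"] in names
]

for row in unified:
    print(row)
```

Once the unified view exists, the analysis stage can read one dataset instead of repeating the join for every query.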

7. Data Analysis

q The Data Analysis stage of the Big Data Lifecycle is dedicated to
carrying out the actual analysis task.
q It runs the code or algorithm that makes the calculations that will lead to
the actual result.
q Data Analysis can be simple or really complex, depending on the required
analysis type.
q In this stage the ‘actual value’ of the Big Data project will be generated.
If all previous stages have been executed carefully, the results will be factual
and correct.
q Depending on the type of analytic result required, this stage can be as
simple as querying a dataset to compute an aggregation for comparison.
q On the other hand, it can be as challenging as combining data mining and
complex statistical analysis techniques to discover patterns and
anomalies or to generate a statistical or mathematical model to depict
relationships between variables.
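
The simple end of this range, computing an aggregation for comparison, might look like the sketch below. The sensor names and readings are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

readings = [
    ("sensor_a", 20.0),
    ("sensor_a", 22.0),
    ("sensor_b", 30.0),
    ("sensor_b", 34.0),
    ("sensor_b", 32.0),
]

# Group the values by sensor, then compute one average per group.
grouped = defaultdict(list)
for sensor, value in readings:
    grouped[sensor].append(value)
averages = {sensor: mean(values) for sensor, values in grouped.items()}

print(averages)
```

The complex end of the range (data mining, statistical modelling) replaces the `mean` step with a far more involved computation, but the input is still the cleaned, aggregated data produced by the earlier stages.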

7. Data Analysis cont’d

q Data analysis can be classified as confirmatory analysis or exploratory
analysis, the latter of which is linked to data mining, as shown below.
q Confirmatory data analysis is a deductive approach where the cause of the
phenomenon being investigated is proposed beforehand. The proposed cause or
assumption is called a hypothesis.
q Exploratory data analysis is an inductive approach that is closely
associated with data mining. No hypothesis or predetermined assumptions are
generated. Instead, the data is explored through analysis to develop an
understanding of the cause of the phenomenon.

8. Data Visualization

q The Data Visualization stage presents the analysis results graphically, for
example as charts, graphs or dashboards, so that business users can interpret
them.
q Putting the results to use in practice is the subject of the final stage,
Utilization of Analysis Results.

9. Utilization of Analysis Results

q After the data analysis has been performed and the results have been
presented, the final step of the Big Data Lifecycle is to use the results in
practice.
q The Utilization of Analysis Results stage is dedicated to determining how
and where the processed data can be further utilized to leverage the results
of the Big Data project.
q Depending on the nature of the analysis problems being addressed, the
analysis results may produce “models” that encapsulate new insights and
understandings about the nature of the patterns and relationships that exist
within the data that was analyzed.
q A model may look like a mathematical equation or a set of rules. Models can
be used to improve business process logic and application system logic, and
they can form the basis of a new system or software program.

Big Data And Cloud Computing

q Cloud computing is the use of computing resources (hardware and software)
that are delivered as a service over a network (typically the Internet). It is
a virtualization framework.
q It provides resources on demand, whether storage, computing or something
else. The cloud follows a pay-per-usage model: one pays only for the resources
actually used.
q Cloud plays an important role in the big data world by providing
horizontally expandable and optimized infrastructure that supports the
practical implementation of big data.
q In cloud computing, data of every variety and volume is gathered in data
centers and then distributed to the end-users. Automatic backup and recovery
of data is also ensured for business continuity. All such resources are
available in the cloud.

Cloud Services

Cloud services are categorized as below:

q Infrastructure as a service (IaaS): The complete infrastructure is provided
to the consumer. Maintenance-related tasks are done by the cloud provider, and
the consumer can use the infrastructure as required. It can be used in both
public and private clouds. Examples are virtual machines, load balancers, and
network attached storage.
q Platform as a service (PaaS): Here the cloud offers object storage, queuing,
databases, runtimes etc., all available directly from the cloud provider. It
is the consumer's responsibility to configure and use them: the provider
supplies the resources, but connectivity to the database and other similar
activities are the consumer’s responsibility. Examples are Windows Azure and
Google App Engine.
q Software as a service (SaaS): The consumer uses an application that is
running on the cloud. All infrastructure setup is the responsibility of the
service provider. Examples are Dropbox, Google Drive etc.

Cloud for Big Data - IaaS in cloud

q Using a cloud provider’s infrastructure for big data services gives access
to almost limitless storage and compute power.
q IaaS can be utilized by enterprise customers to create cost-
effective and easily scalable IT solutions where cloud
providers bear the complexities and expenses of managing
the underlying hardware.
q If the scale of a business customer’s operations fluctuates,
or they are looking to expand, they can tap into the cloud
resource as and when they need it rather than purchase,
install and integrate hardware themselves.

Cloud for Big Data – PaaS in cloud

q PaaS vendors incorporate big data technologies such as Hadoop and MapReduce
into their PaaS offerings, which eliminates the need to deal with the
complexities of managing individual software and hardware elements.
q For example, web developers can use individual PaaS environments at every
stage of developing, testing and ultimately hosting their websites.
q However, businesses that are developing their own internal software can also
utilize PaaS, particularly to create distinct ring-fenced development and
testing environments.

Cloud for Big Data – SaaS in cloud

q Many organizations feel the need to analyze the customer’s voice, especially
on social media. SaaS vendors provide the platform for the analysis as well as
the social media data.
q Office software is the best example of businesses utilizing SaaS. Tasks
related to accounting, sales, invoicing, and planning can all be performed
through SaaS. Businesses may wish to use one piece of software that performs
all of these tasks or several that each perform different tasks.
q The software can be subscribed to through the Internet and then accessed
online via any computer in the office using a username and password. If
needed, businesses can switch to software that fulfills their requirements in
a better manner.

Appendix

q Data Mining: Data mining is the process of looking for hidden, valid, and
potentially useful patterns in huge data sets. Data Mining is all about
discovering unsuspected/previously unknown relationships amongst the
data. It is a multi-disciplinary skill that uses machine learning, statistics,
AI and database technology.
q Natural Language Processing (NLP): NLP gives the machines the ability
to read, understand and derive meaning from human languages.
q Text Analytics (TA): TA is the process of extracting meaning out of text.
For example, this can be analyzing text written by customers in a
customer survey, with the focus on finding common themes and trends.
The idea is to be able to examine the customer feedback to inform the
business on taking strategic action, in order to improve customer
experience.
q Noisy text analytics: It is a process of information extraction whose goal
is to automatically extract structured or semi-structured information from
noisy unstructured text data.

Appendix cont…

Example of Data Volumes

Unit | Value | Example
Kilobytes (KB) | 1,000 bytes | a paragraph of a text document
Megabytes (MB) | 1,000 Kilobytes | a small novel
Gigabytes (GB) | 1,000 Megabytes | Beethoven’s 5th Symphony
Terabytes (TB) | 1,000 Gigabytes | all the X-rays in a large hospital
Petabytes (PB) | 1,000 Terabytes | half the contents of all US academic research libraries
Exabytes (EB) | 1,000 Petabytes | about one fifth of the words people have ever spoken
Zettabytes (ZB) | 1,000 Exabytes | as much information as there are grains of sand on all the world’s beaches
Yottabytes (YB) | 1,000 Zettabytes | as much information as there are atoms in 7,000 human bodies
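
Since every unit in the table is 1,000 times the previous one, converting a raw byte count into a readable unit is a short loop. The `humanize` function below is a hypothetical helper written for this illustration.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000  # each unit is 1,000 times the previous one


print(humanize(1500))
print(humanize(10**15))
```

The sketch uses the decimal convention from the table (steps of 1,000), not the binary convention (steps of 1,024) that some tools use.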

Appendix cont…

q Enterprise Data Warehouse: An enterprise data warehouse (EDW) is a
database, or collection of databases, that centralizes a business's
information from multiple sources and applications, and makes it
available for analytics and use across the organization. EDWs can be
housed on an on-premises server or in the cloud. The data stored in this
type of digital warehouse can be one of a business’s most valuable assets,
as it represents much of what is known about the business, its employees,
its customers, and more.
q Online Transactional Processing (OLTP): It is a category of data
processing that is focused on transaction-oriented tasks. OLTP typically
involves inserting, updating, and/or deleting small amounts of data in a
database. OLTP mainly deals with large numbers of transactions by a large
number of users.

Appendix cont…

ETL: ETL is short for extract, transform, load, three database functions that are
combined into one tool to pull data out of one database and place it into another
database.
q Extract is the process of reading data from a database. In this stage, the data is
collected, often from multiple and different types of sources.
q Transform is the process of converting the extracted data from its previous form into
the form it needs to be in so that it can be placed into another database.
Transformation occurs by using rules or lookup tables or by combining the data
with other data.
q Load is the process of writing the data into the target database.
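
The three steps can be sketched end to end with two in-memory SQLite databases standing in for the source and target systems. Table names, columns and the `country_lookup` table are assumptions made for the example.

```python
import sqlite3

# Hypothetical source database with raw order data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "in", 12550), (2, "us", 9900)],
)

# Hypothetical target database with the cleaned layout.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_clean (id INTEGER, country TEXT, amount REAL)")

country_lookup = {"in": "India", "us": "United States"}

# Extract: read the rows from the source database.
rows = source.execute("SELECT id, country, amount_cents FROM orders").fetchall()

# Transform: apply a lookup table and a unit conversion rule.
clean = [(oid, country_lookup[code], cents / 100) for oid, code, cents in rows]

# Load: write the transformed rows into the target database.
target.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", clean)
target.commit()

loaded = target.execute(
    "SELECT id, country, amount FROM orders_clean ORDER BY id"
).fetchall()
print(loaded)
```

The transform step here uses both mechanisms mentioned above: a lookup table (country codes) and a rule (cents to currency units).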
