Introduction
ANANYA CHAKRABORTY
BIG DATA ANALYTICS (CSPE-432)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT JALANDHAR
BIG DATA
Big data describes a situation in which data sets have grown to such enormous
sizes that traditional data management tools can no longer effectively handle
either the size of the data set or its scale and rate of growth.
Big data has an intrinsic value that can be extracted using analytics,
algorithms, and other techniques.
Insights include, for example, drug-testing outcomes and understanding customer behaviour.
Big data needs to be handled and stored appropriately whether it is structured or
unstructured.
VALUES
Characteristics of big data: Volume, Velocity, Variety, Veracity, and Value.
Variety: Structured data, semi structured data and unstructured data
Structured data: This is data which is in an organized form (e.g., in rows
and columns) and can be easily used by a computer program. Relationships
exist between data entities, such as between classes and their objects. Data stored in
traditional databases is an example of structured data; it can be organized
into a table.
Semi-structured data: This is data which does not conform to a data model but
has some structure. Metadata for this data is available but not sufficient, e.g.,
XML. It cannot be stored in tables, but it has tags and other markers to
separate the elements.
Unstructured data: This data does not conform to any data model. Unstructured data
is stored in non-relational databases, e.g., email, tweets. It cannot be stored in
tables. (A short sketch after this list illustrates all three forms.)
Veracity: It refers to the quality or trustworthiness of the data, so that it does not
lead to errors or misinterpretation of the big data.
Value: The usefulness that big data ultimately provides to the organization or ecosystem; extracting this value is the underlying requirement.
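To make the Variety categories concrete, here is a minimal Python sketch (not part of the original slides; the field names and sample values are hypothetical) showing how structured, semi-structured, and unstructured data differ when a program consumes them:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns, directly usable by a program (hypothetical fields).
structured = "id,name,amount\n1,Asha,250\n2,Ravi,480\n"
rows = list(csv.DictReader(io.StringIO(structured)))
print(rows[0]["name"])          # columns are addressable by name, like a table

# Semi-structured: no fixed schema, but tags and markers separate the elements (XML).
semi = "<order id='1'><item>book</item><qty>2</qty></order>"
order = ET.fromstring(semi)
print(order.find("qty").text)   # structure must be navigated rather than queried as a table

# Unstructured: free text with no data model; it needs parsing or NLP to extract meaning.
unstructured = "Loved the quick delivery, but the packaging was damaged."
print("delivery" in unstructured.lower())
```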
BIG DATA ANALYSIS
IT's collaboration with business users enables better, faster decisions in real time.
CHALLENGES IN BIG DATA ANALYTICS
Capturing and storing data: velocity and volume; computational limitations.
Data quality: inaccurate, incomplete, and unstructured data.
Security and privacy
Knowledge gaps
CASE FOR BIG DATA
How to tie big data analytics to a business process.
•Background of the project
•Options
•Scope and costs
•Risk analysis
TEAM CHALLENGES
Step 1: Bringing talented workers together.
Step 2: Organizing the team (IT and BI groups)
Step 3: Placing BDA teams in the departments whose aims align with theirs.
BIG DATA SOURCES
1. Transportation, retail, logistics and telecommunications
2. Healthcare
3. Government
4. Entertainment media
5. Life Sciences
6. Video surveillance
7. Social Media Data
8. Transactional Data
ACQUISITION
Businesses move to include big data analytics teams when they realise the size of
the data they collect.
To start with, the IT team identifies the problems that align with the business goals.
It then identifies the tools that will be useful to gather the data and carry out the analysis, e.g.,
Hadoop.
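As a loose illustration of the kind of analysis such tools support, here is a minimal map/reduce word count written in plain Python (a sketch of the processing model that Hadoop popularized, not Hadoop's own API; the input chunks are hypothetical):

```python
from collections import Counter
from itertools import chain

# Hypothetical input: each string stands in for one chunk of a much larger dataset.
chunks = [
    "big data needs new tools",
    "hadoop distributes big data processing",
]

def map_phase(chunk):
    """Emit (word, 1) pairs, in the style of a Hadoop mapper."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Sum the counts per word, in the style of a Hadoop reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

mapped = chain.from_iterable(map_phase(c) for c in chunks)
print(reduce_phase(mapped)["data"])  # -> 2
```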
BIG DATA EVOLUTION
1940s-1989: Data warehousing and personal computer (mainframe)
1989-1999: World wide web (HTML, data explosion because of the internet, RDBMS).
Structured and semi-structured data.
2000-2010s: Cloud computing and social media data (launch of social media
platforms, entertainment sites hosted on the cloud). Unstructured data. Led to the creation
of Hadoop, an open-source framework created specifically to manage big data sets,
and the adoption of NoSQL databases, which made it possible to manage
unstructured data.
2010s: Internet of things, fog computing, edge computing and mobile devices. New
types of data (sensor, social data, transactional data, health related data)
BEST PRACTICES FOR BIG DATA ANALYSIS
1. Establish Big Data business objective
2. Start with small data
3. Data Governance
4. Infrastructure around goals
5. Maintenance plan
6. The value of anomalies
7. In-memory processing
SECURITY, COMPLIANCE, AUDITING AND PROTECTING
Problems of using a big data repository:
1. Access: Allowing access to everyone reduces security, but restricting access to everyone is
impractical. Access should therefore be granted only to selected users.
2. Availability: Controlling where the data is stored and how it is distributed among the
various departments, e.g., sensitive data should be available for processing only where it is
required.
3. Performance: Stronger encryption and additional security layers improve security
but affect performance.
4. Liability: Accessible data carries liability with it, e.g., the sensitivity of the data, privacy
issues, etc.
The aim is to balance all of these.
Pragmatic steps to securing Big Data:
Get rid of data that is no longer required; otherwise it is a risk to store.
If legally required, data can be archived and stored, but not in a system connected to
the internet.
Classifying Data:
Data is easier to protect and manage if it is classified or categorized,
e.g., financial data, HR data, sales, inventory, etc. Each category may have a different
sensitivity and require different security protocols.
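A minimal sketch of category-based classification, assuming hypothetical categories, roles, and protocols (none of these names come from the slides):

```python
from dataclasses import dataclass

@dataclass
class Category:
    name: str
    sensitivity: str         # e.g. "high", "medium", "low"
    protocols: list          # controls required for this category
    allowed_roles: set       # who may access it

# Hypothetical catalogue of data categories and their security protocols.
CATALOG = {
    "financial": Category("financial", "high", ["encrypt-at-rest", "audit-log"], {"finance", "auditor"}),
    "hr":        Category("hr",        "high", ["encrypt-at-rest"],              {"hr"}),
    "inventory": Category("inventory", "low",  ["audit-log"],                    {"sales", "ops"}),
}

def can_access(role, category):
    """Grant access only to the selected roles defined for each category."""
    return role in CATALOG[category].allowed_roles

print(can_access("sales", "inventory"))   # True
print(can_access("sales", "financial"))   # False
```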
Protecting the big data:
1. Some data is unique to the moment (e.g., traffic, movement, weather); once lost,
it cannot be recreated.
If the data is redundant (duplicate copies of no additional value), its removal is called deduplication. This
is good for storage but can conflict with encrypted data (a hash-based sketch follows below).
How to back up data files of different sizes and types (Oracle, NoSQL, Hadoop)
Big data and compliance
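The hash-based deduplication sketch mentioned above, in Python; the file names and contents are hypothetical:

```python
import hashlib

# Hypothetical "files": in practice these would be blocks or whole files on disk.
files = {
    "report_v1.txt": b"quarterly numbers ...",
    "report_copy.txt": b"quarterly numbers ...",   # an exact duplicate
    "notes.txt": b"meeting notes ...",
}

def deduplicate(blobs):
    """Keep one copy per unique content hash (content-addressed storage)."""
    kept = {}
    for name, data in blobs.items():
        digest = hashlib.sha256(data).hexdigest()
        kept.setdefault(digest, name)   # the first name wins; later duplicates are dropped
    return kept

unique = deduplicate(files)
print(len(files), "->", len(unique))    # 3 -> 2

# If each copy were encrypted with its own key or IV, identical plaintexts would
# produce different ciphertexts and hashing could no longer detect the duplicates,
# which is why deduplication and encryption pull in opposite directions.
```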
BIG DATA ARCHITECTURE
A big data architecture is designed to handle the ingestion, processing, and analysis
of data that is too large or complex for traditional database systems.
Big data solutions typically involve one or more of the following types of workload:
•Batch processing of big data sources at rest.
•Real-time processing of big data in motion.
•Interactive exploration of big data.
•Predictive analytics and machine learning.
BIG DATA ARCHITECTURE
Types of big data architecture:
Lambda architecture and Kappa architecture.
Feature              | Lambda                      | Kappa
Processing pipeline  | Separate layers             | Single stream
Data storage         | Batch store and speed store | Append-only log
Consistency          | Potential inconsistencies   | Consistent view
Complexity           | More                        | Less
Cost                 | High                        | Low
Historical advantage | Strong                      | Limited
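A minimal Python sketch contrasting the two (the event stream and views are hypothetical; real systems would use a message log and a stream processor): Lambda merges a periodically recomputed batch view with an incrementally updated speed view, while Kappa recomputes everything from a single append-only log.

```python
# Hypothetical click events; in a real system these would arrive on a message log.
events = [("user1", 1), ("user2", 1), ("user1", 1)]

# Lambda: batch layer (all but the most recent data) + speed layer (recent data),
# merged at query time by the serving layer.
batch_view = {}
for user, n in events[:-1]:                 # periodic batch recomputation
    batch_view[user] = batch_view.get(user, 0) + n

speed_view = {}
for user, n in events[-1:]:                 # incremental, low-latency updates
    speed_view[user] = speed_view.get(user, 0) + n

lambda_counts = {u: batch_view.get(u, 0) + speed_view.get(u, 0)
                 for u in set(batch_view) | set(speed_view)}

# Kappa: one streaming pipeline over the append-only log; reprocessing means
# replaying the log through the same code.
kappa_counts = {}
for user, n in events:
    kappa_counts[user] = kappa_counts.get(user, 0) + n

print(lambda_counts == kappa_counts)        # True: same answer, different pipelines
```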
DATA ANALYTICS
Computer cluster: a collection of resources from multiple machines that work together.
Batch processing
Real-time processing
Distributed computing: increased speed, power, and efficiency by spreading work across machines.
Parallel computing: multiple processors working against shared memory within one machine.
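A minimal sketch of single-machine parallelism in Python, assuming a hypothetical workload (multiprocessing uses separate worker processes rather than literal shared memory; the point is only to contrast one-machine parallel computing with cluster-based distributed computing):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    data = list(range(1_000))
    # Four worker processes on the same machine split the batch between them.
    with Pool(processes=4) as pool:
        results = pool.map(square, data)
    print(sum(results))
    # Distributed computing applies the same idea across many machines in a
    # cluster (e.g., Hadoop), trading network coordination for far more capacity.
```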