Course Overview
Course Objectives and Outcomes
Quick Review of Complete Handout
Practical Aspects
Different Types of Data and Storage for Data
Big Data Characteristics, sources
Big Data Systems Perspective – in-memory vs storage vs network
Big Data challenges, applications/case studies
Locality of Reference – principle examples
Big Data Systems
BIG DATA ANALYTICS – emphasis / important ones
EVOLUTION OF INDUSTRIES
INDUSTRY 5.0
NEAR FUTURE
BUILDING BLOCKS OF INDUSTRY 4.0 & 5.0
These building blocks create a huge opportunity for universities and engineering institutes to prepare or update their curricula and
teaching methodologies so that engineers are trained to be Industry 4.0 and 5.0 ready.
What Is Business Intelligence & How Does It Work?
[Figure labels: ERP, CRM, SCM, Data Warehouse]
The Scope of Business Intelligence
WHO IS A DATA SCIENTIST?
• There are two basic types of Predictive Analytics / Data Science problems:
• 1. Internal Predictive Analytics / Data Science problems, such as bad data, reckless analytics, or using
inappropriate techniques.
• Internal problems are not business problems; they are internal to the Predictive Analytics / Data
Science community.
• Therefore, the fix consists in training data scientists to do better work and follow best practices.
• 2. Applied business problems are real-world problems for which solutions are sought, such as fraud
detection or identifying if a factor is a cause or a consequence.
• These are the modern trends in Predictive Analytics / Data Science that you should be aware of:
• In-memory analytics
• MapReduce and Hadoop
• NoSQL, NewSQL, and graph databases
• Python and R
• Data integration: blending unstructured and structured data, along with related data issues
(such as data storage and security, privacy issues when collecting data, and data compliance)
• Visualization
• Analytics as a Service, abbreviated as AaaS
• Text categorization/tagging and taxonomies to facilitate extraction of insights from raw text
and to put some structure on unstructured data
• What knowledge does a data scientist need?
• Data scientists also need to be good communicators to understand, and many
times guess, what problems their client, boss, or executive management is trying to
solve.
• Some () means that only part of the field belongs to Predictive Analytics / Data Science.
• New () means that new material from the field in question is needed.
• Horizontal Versus Vertical Data Scientist
• Vertical data scientists have deep technical knowledge in some narrow field.
• Statisticians who know everything about eigenvalues, singular value decomposition and its
numerical stability, and asymptotic convergence of maximum pseudo-likelihood estimators
• Software engineers with years of experience writing Python code (including graphic libraries)
applied to API development and web crawling technology
• Database specialists with strong data modeling, data warehousing, graph database,
Hadoop, and NoSQL expertise
• The key here is that by “vertical data scientist” we mean those with a narrower
range of technical skills, such as expertise in all sorts of Lasso-related regressions
but with limited knowledge of time series, much less of any computer science.
• Horizontal data scientists can design robust, efficient, simple, replicable, and scalable code and algorithms.
Data Science
• Statistical and operations research techniques
• Machine Learning
• Deep Learning
Nearly everyone across the
organization engages with software
BI Questions
Developing a business intelligence strategy is an important first step in
implementing a BI solution.
✔ Who are the key stakeholders? Who will be using this system?
✔ What departments need business intelligence and what will be
measured?
✔ What support do content authors and information consumers need?
✔ Focus on Questions That Are Aligned With Your Business Strategy
✔ Ask BI Questions That Give You Actionable Insights
✔ Questions that Identify Opportunities for ROI
✔ Questions That Identify Opportunities For New Sources of Revenue
✔ Questions Identifying Cost-cutting Opportunities
✔ Questions That Begin With “Why”
✔ How Can You Understand Your Customers Better?
As technology companies like Amazon, Meta, and Google continue to grow and integrate with our
lives, they are leveraging big data technologies to monitor sales, improve supply chain
efficiency and customer satisfaction, and predict future business outcomes.
Currently, there is so much big data that International Data Corporation (IDC) predicts the
“Global Datasphere” will grow from 33 Zettabytes (ZB) in 2018 to 175 ZB in 2025. One zettabyte
is equal to a trillion gigabytes.
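To put the IDC figures in perspective, the short sketch below converts zettabytes to gigabytes and derives the implied compound annual growth rate. It assumes decimal (SI) units, so 1 ZB = 10^21 bytes and 1 GB = 10^9 bytes; the variable names are illustrative only.

```python
# Assumption: decimal (SI) units, i.e. 1 GB = 10**9 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21

gb_per_zb = BYTES_PER_ZB // BYTES_PER_GB   # 10**12, i.e. one trillion GB per ZB

start_zb, end_zb = 33, 175                  # IDC figures quoted above
years = 2025 - 2018

growth_factor = end_zb / start_zb           # total growth over the period
annual_rate = growth_factor ** (1 / years) - 1

print(f"1 ZB = {gb_per_zb:,} GB")
print(f"175 ZB = {end_zb * gb_per_zb:,} GB")
print(f"Implied annual growth: {annual_rate:.1%}")
```

Under these assumptions, 175 ZB is 175 trillion gigabytes, and the forecast implies growth of roughly 27% per year.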
What if you could empower everyone with analytics anywhere decisions are made?
Today, BI extends to everyone
[Figure labels: Everyone; 1st wave; Technical BI]
Two Data Factors
The Good
Experiments, observations, and numerical simulations in many
areas of science and business are currently generating terabytes of
data, and in some cases are on the verge of generating petabytes
and beyond. Analyses of the information contained in these data
sets have already led to major breakthroughs in fields ranging from
genomics to astronomy and high-energy physics and to the
development of new information-based industries.
- Frontiers in Massive Data Analysis, National Research
Council of the National Academies
The Bad
Given a large mass of data, we can by judicious selection construct perfectly
plausible unassailable theories—all of which, some of which, or none of which
may be right. - Paul Arnold Srere
DATA ANALYTICS vs BUSINESS ANALYTICS
• Data analytics is a broad umbrella for finding insights in data
• Data analytics can refer to any form of analysis of data—whether in a
spreadsheet, database, or app—where the intent is to uncover trends,
identify anomalies, or measure performance.
• Additional mathematics or IT skills can help data analysts do everything from
managing a database of subscribers to calculating yields for a potential
investment.
• Data analytics (DA) is the technical process of mining data, cleaning data,
transforming data, and building the systems to manage data. Data analytics
takes large quantities of data to find trends and solve problems. Data
analytics is not just confined to business applications—it’s used across
disciplines, from the government to science.
DIFFERENCE BETWEEN BUSINESS ANALYTICS AND DATA ANALYSIS
DIFFERENCE BETWEEN DATA ANALYTICS AND BIG DATA ANALYTICS
Evolution of BI
Characteristics of Data for Good Decision Making
CRISP-DM: Cross-Industry Standard Process for Data Mining
BIG DATA SYSTEMS PERSPECTIVE…
Advantages (processing from storage):
• Can handle massive datasets that don’t fit in memory.
• Suitable for batch processing (e.g., Hadoop MapReduce).
Challenges:
• Slower compared to in-memory processing.
• Requires optimized data locality and access patterns.
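As a small illustration of processing data that does not fit in memory, the sketch below streams a file from disk in fixed-size batches so that only one batch is resident at a time. The function names (`chunked_sum`, `write_demo_file`) and the one-integer-per-line file format are assumptions made for this example, not part of any specific big data framework.

```python
import os
import tempfile

def chunked_sum(path, chunk_lines=1000):
    """Sum one integer per line while holding at most `chunk_lines` values in memory."""
    total = 0
    chunk = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:                # the file object streams lines sequentially from disk
            chunk.append(int(line))
            if len(chunk) == chunk_lines:
                total += sum(chunk)   # process one batch, then discard it
                chunk.clear()
    total += sum(chunk)               # leftover partial batch
    return total

def write_demo_file(n):
    """Write the integers 0..n-1, one per line, to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for i in range(n):
            f.write(f"{i}\n")
    return path

demo_path = write_demo_file(10_000)
result = chunked_sum(demo_path, chunk_lines=256)
os.remove(demo_path)
print(result)  # 49995000
```

Reading sequentially is the access pattern that disks handle best; the trade-off is the one listed above, since every batch still has to come from storage rather than RAM.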
Systems Perspective: Processing over the network
Characteristics:
• Distributed processing across multiple nodes in a network.
• Data often stored in HDFS, S3, or similar distributed systems.
Advantages:
• Scalability: Can process petabytes of data by leveraging many nodes.
• Redundancy: Fault-tolerant systems with data replication.
Challenges:
• Network latency can become a bottleneck.
• Requires efficient task scheduling and data shuffling (e.g., Apache
Hadoop, Spark).
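The map, shuffle, and reduce steps can be sketched on a single machine as follows. This is a toy word count in plain Python, not the Hadoop or Spark API: the partitions stand in for data blocks on different nodes, and the function names and the `num_reducers` parameter are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(partition):
    """Emit (word, 1) pairs for one partition of lines (runs where the data lives)."""
    return [(word.lower(), 1) for line in partition for word in line.split()]

def shuffle_phase(mapped_outputs, num_reducers):
    """Route every key to one reducer bucket by hash.
    On a real cluster this is the step where data crosses the network."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_outputs:
        for key, value in pairs:
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_phase(bucket):
    """Combine all values that share a key."""
    return {key: sum(values) for key, values in bucket.items()}

# Two partitions stand in for blocks stored on two different nodes.
partitions = [["big data systems", "data locality"], ["big data analytics"]]
mapped = [map_phase(p) for p in partitions]
buckets = shuffle_phase(mapped, num_reducers=2)

word_counts = {}
for bucket in buckets:
    word_counts.update(reduce_phase(bucket))
print(word_counts)
```

Because each key lands in exactly one bucket, the reducers never need to coordinate; the cost is concentrated in the shuffle, which is why network latency and data shuffling appear among the challenges above.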
LOCALITY OF REFERENCE…
• Reduced Latency:
  - Data in the CPU cache is faster to access than data in RAM or on disk.
  - Better locality = fewer cache misses = reduced latency.
• Optimized Resource Usage:
  - CPU pipelines stay efficient.
  - Reduced memory bandwidth contention.
• Examples:
  - Loop unrolling and blocking in matrix multiplication.
  - Optimized database query plans.
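Blocking (tiling) in matrix multiplication can be sketched as below. The loop structure is what matters: each small tile of the operands is reused many times before the code moves on. This is a minimal illustration in pure Python, where interpreter overhead and non-contiguous list storage hide most of the cache benefit; the same structure pays off in compiled languages. The `block` parameter and function names are assumptions for this example.

```python
def matmul_naive(a, b):
    """Textbook triple loop: walks down a column of b in the innermost loop."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_blocked(a, b, block=2):
    """Same arithmetic, but performed tile by tile so each tile is reused while 'hot'."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Work on one block x block tile of a, b, and c at a time.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(matmul_blocked(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

In practice the block size is chosen so that the tiles being worked on fit in a cache level together.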
Algorithms & Data Structures Leveraging Locality
• Sorting Algorithms:
  - Merge Sort benefits from spatial locality during merging phases.
• Search Structures:
  - B-Trees/B+ Trees: Designed for efficient disk access.
• Dynamic Programming:
  - Uses temporal locality by storing reusable subproblem solutions.
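The dynamic programming point can be illustrated with a longest-common-subsequence length computation that keeps only the previous row of the table. Each stored subproblem result is reused almost immediately after it is written, which is the temporal locality referred to above. The choice of LCS as the example problem and the function name `lcs_length` are assumptions for this sketch.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y.

    Only two rows of the DP table are kept: `prev` (row i-1) and `curr` (row i).
    Every value read was written very recently, so the working set stays small.
    """
    prev = [0] * (len(y) + 1)
    for ch_x in x:
        curr = [0]
        for j, ch_y in enumerate(y, start=1):
            if ch_x == ch_y:
                curr.append(prev[j - 1] + 1)               # extend a match
            else:
                curr.append(max(prev[j], curr[j - 1]))     # reuse stored subproblems
        prev = curr
    return prev[-1]

print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```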
Data Organization for Better Locality
• On-Disk Data Layout:
  - Contiguous allocation for files (e.g., ext4, NTFS).
  - Index structures like clustered B-Trees for databases.
• In-Memory Data Structures:
  - Arrays vs. Linked Lists: Arrays have better spatial locality.
  - Cache-Aware Algorithms: Tailored for specific cache sizes.
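The array versus linked list contrast can be sketched as follows. Python's standard `array` module stores its elements back to back in one buffer, while each node of the hand-written linked list is a separate object reached by following a reference. The `Node` class and helper names are assumptions for this example, and no timing is claimed here because interpreter overhead dominates in pure Python.

```python
from array import array

class Node:
    """One linked-list cell; each node is a separate heap object."""
    __slots__ = ("value", "next")

    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def build_linked_list(values):
    """Build a singly linked list holding `values` in order and return its head."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def sum_linked_list(head):
    """Traverse by pointer chasing: the next address is known only after loading the node."""
    total = 0
    node = head
    while node is not None:
        total += node.value
        node = node.next
    return total

values = list(range(1000))
contiguous = array("q", values)      # 8-byte signed integers in one contiguous buffer
linked = build_linked_list(values)

array_total = sum(contiguous)        # sequential scan over adjacent memory
linked_total = sum_linked_list(linked)
print(array_total, linked_total)     # 499500 499500
```

Both traversals compute the same result; the difference is that the array scan touches adjacent addresses, so one cache line fetch serves several elements.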
Mitigating Latency Through Locality Optimization
• Software Optimizations:
  - Code restructuring to improve data locality.
  - Using cache-friendly algorithms (e.g., blocking in matrix operations).
• Hardware Optimizations:
  - Multi-level caches (L1, L2, L3).
  - Prefetching mechanisms.
• Real-World Applications:
  - High-performance computing.
  - Database systems optimized for query efficiency.
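One common form of code restructuring is loop interchange. The sketch below assumes a 2-D grid stored row by row in a flat buffer (row-major layout) and records the order in which each loop nest touches the flat indices. The function names are hypothetical and serve only to show the access pattern.

```python
def visit_column_first(rows, cols):
    """Outer loop over columns: consecutive accesses are `cols` elements apart."""
    order = []
    for c in range(cols):
        for r in range(rows):
            order.append(r * cols + c)   # flat index of element (r, c)
    return order

def visit_row_first(rows, cols):
    """Loops interchanged: consecutive accesses hit adjacent flat indices."""
    order = []
    for r in range(rows):
        for c in range(cols):
            order.append(r * cols + c)
    return order

print(visit_column_first(2, 3))  # [0, 3, 1, 4, 2, 5]
print(visit_row_first(2, 3))     # [0, 1, 2, 3, 4, 5]
```

Both loop nests visit every element exactly once; only the second walks memory in address order, which is the pattern that caches and hardware prefetchers reward.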