
Aall

The document outlines a comprehensive agenda for a course on Big Data Systems, covering topics such as data types, storage, analytics, and industry applications. It highlights the roles of data analysts, scientists, and specialists, along with the skills required for each position. Additionally, it discusses the evolution of industries towards Industry 4.0 and 5.0, emphasizing the importance of business intelligence and predictive analytics in leveraging big data for organizational success.

Uploaded by

billy973171
Copyright
© © All Rights Reserved

Agenda

Course Overview
Course Objectives and Outcomes
Quick Review of Complete Handout
Practical Aspects
Different Types of Data and Storage for Data
Big Data Characteristics, sources
Big Data Systems Perspective – in-memory vs storage vs network
Big Data challenges, applications/case studies
Locality of Reference – principle examples

Big Data Systems
BIG DATA ANALYTICS – emphasis / important ones

Introduction to Industry 4.0/5.0, BI & Analytics
– Big Data Definition & Characteristics
– Sources of Big Data
– Challenges & Benefits of Big Data Systems
– Case Studies / Applications

Hadoop Ecosystem & MapReduce
– Hadoop Intro
– HDFS architecture
– MapReduce architecture
– Spark usage

Hive Database & Others
– Hive architecture and uses
– Pig, Sqoop, Flume, Oozie
– Other popular tools (open source & commercial)
– MongoDB architecture & commands

Data Visualization; Applications/projects – PowerBI/Tableau & Excel

PRACTICAL / LAB EXAMPLES USING LINUX, SHELL, DB, SQL, JAVA, PYTHON, POWERBI.
MCQ – PUZZLES – CASE STUDIES/DISCUSSIONS

Focus Areas/Practicals
✔ VIRTUAL MACHINE
✔ UBUNTU LINUX COMMANDS
✔ EDITING FILES AND ENVIRONMENT
✔ SETTING UP HADOOP CLUSTER
✔ STARTING CLUSTER
✔ PRACTICING HADOOP AND HDFS COMMANDS
✔ SETTING UP AND RUNNING MAP REDUCE
✔ SETTING UP HIVE DB
✔ RUNNING QUERIES AGAINST HIVE DB
✔ EXPORT/IMPORT TO HIVE
✔ WORK WITH MONGODB
✔ QUICK DEMO… OF OTHER TOOLS…
A QUICK REVIEW OF DATA ANALYST & SCIENTIST
Data Analysts and Scientists
Job Profile: Data analysts help businesses develop well-informed
strategies by creating charts and preparing visual presentations.
They are also expected to examine large data sets and identify
trends. Data scientists design and develop new processes for data
modelling, primarily using prototypes, algorithms, predictive
models, and custom analysis.
Chief skills: Python coding, Hadoop platform, R programming,
SQL database/coding, Apache Spark, machine learning and AI,
data visualisation
Big Data Specialists
Job Profile: Use data analysis to evaluate the technical
performance of an organisation. They also provide recommendations
on system enhancements.
Job skills: Apache Hadoop, Apache Spark, NoSQL, Machine
learning and Data Mining, Statistical and Quantitative Analysis,
SQL, Data Visualization, General Purpose Programming language
Digital Transformation Specialists
Job Profile: Work on enhancing a company’s technical
performance. They analyse the company’s infrastructure and the
gaps in service.
Job skills: Technical aptitude, critical thinking abilities, excellent
communication skills, adaptability, SQL, C++, HTML, CSS

A FEW million $
QUESTIONS
Data is woven into the everyday fabric of our lives. With
the rise of mobile, social media, and smart technologies
associated with the Internet of Things (IoT), we now
transmit more data than ever before, and at a dizzying
speed.
Thanks to big data analytics, organizations can now use
that information to rapidly improve the way they work,
think, and provide value to their customers.
With the assistance of tools and applications, big data
can help you gain insights, optimize operations, and
predict future outcomes.

Information: Internet of Everything/Things
Science: Basics/Formulas/neurons..
Engineering: Structural Approaches/principles/methods
Technology: Enablers… Wifi.. 5S.. etc
Intelligence: AI, ML, DL, ANN
Automation: Robots, Chatbots, Programs, Daemons, etc
Machine Intelligence: Making machines think and communicate with other machines as well as humans
B-B C-B..!? Management: Data Analytics, Visualization, Statistics
Societal Values & Cost Savings & New Way of Lazy Life??

EVOLUTION OF INDUSTRIES

INDUSTRY 5.0

“Society 5.0” (Super Smart Society)


Artificial Intelligence
Augmented Reality
Beyond Industries….
People’s common life

NEAR FUTURE

BUILDING BLOCKS OF INDUSTRY 4.0 & 5.0

Together, these building blocks create huge opportunities for universities and engineering institutes to prepare or update their
curricula and teaching methodologies so that engineers are trained to be Industry 4.0 and 5.0 ready.
What Is Business Intelligence & How Does It Work?

[Figure: ERP, CRM, and SCM systems feeding a Data Warehouse; labels “Getting data in” and “Getting data out”]

The Scope of Business Intelligence

Smaller organizations: Excel spreadsheets
Larger organizations: Data mining, predictive and prescriptive analytics, dashboards

WHO IS A DATA SCIENTIST?

• There are two basic types of Predictive Analytics / Data Science problems:

• 1. Internal Predictive Analytics / Data Science problems, such as bad data, reckless analytics, or using
inappropriate techniques.

• Internal problems are not business problems; they are internal to the Predictive Analytics / Data
Science community.

• Therefore, the fix consists in training data scientists to do better work and follow best practices.

• 2. Applied business problems are real-world problems for which solutions are sought, such as fraud
detection or identifying if a factor is a cause or a consequence.

• These may involve internal or external (third-party) data.

• These are the modern trends in Predictive Analytics / Data Science that you should be
aware of:

• In-memory analytics
• MapReduce and Hadoop
• NoSQL, NewSQL, and graph databases
• Python and R
• Data integration: blending unstructured and structured data (such as data storage and
security, privacy issues when collecting data, and data compliance)
• Visualization
• Analytics as a Service, abbreviated as AaaS
• Text categorization/tagging and taxonomies to facilitate extraction of insights from raw text
and to put some structure on unstructured data

• What Knowledge Does a Data Scientist Need?

• Data scientists need to be good communicators to understand, and many
times guess, what problems their client, boss, or executive management is trying to
solve.

• Translating high-level English into simple, efficient, scalable, replicable, robust,
flexible, platform-independent solutions is critical.

• Predictive Analytics / Data Science = Some (computer science) + Some (statistical
science) + Some (business management) + Some (software engineering) + Domain
expertise + New (statistical science), where

• Some () means that only part of the field, not the entire field, belongs to Predictive Analytics / Data Science.
• New () means new stuff from the field in question is needed.
• Horizontal Versus Vertical Data Scientist

• Vertical data scientists have deep technical knowledge in some narrow field.

• For instance, they might be any of the following:

• Computer scientists familiar with computational complexity of all sorting algorithms

• Statisticians who know everything about eigenvalues, singular value decomposition and its
numerical stability, and asymptotic convergence of maximum pseudo-likelihood estimators

• Software engineers with years of experience writing Python code (including graphic libraries)
applied to API development and web crawling technology


• Database specialists with strong data modeling, data warehousing, graph database,
Hadoop, and NoSQL expertise

• Predictive modelers with expertise in Bayesian networks, SAS, and SVM

• The key here is that by “vertical data scientist” we mean those with a narrower
range of technical skills, such as expertise in all sorts of Lasso-related regressions
but with limited knowledge of time series, much less of any computer science.


• Horizontal data scientists are a blend of business analysts, statisticians, computer
scientists, and domain experts.

• They combine vision with technical knowledge.

• They might not be experts in eigenvalues, generalized linear models, and other
semi-obsolete statistical techniques, but they know about more modern data-driven
techniques applicable to unstructured, streaming, and big data.

• They can design robust, efficient, simple, replicable, and scalable code and algorithms.

Data Science
• Statistical and Operations research
techniques
• Machine Learning
• Deep Learning

Nearly everyone across the
organization engages with software

Yet, fewer than 25% of workers have access to analytical insights

BI Questions
Developing a business intelligence strategy is an important first step in
implementing a BI solution.
✔ Who are the key stakeholders? Who will be using this system?
✔ What departments need business intelligence and what will be
measured?
✔ What support do content authors and information consumers need?
✔ Focus on Questions That Are Aligned With Your Business Strategy
✔ Ask BI Questions That Give You Actionable Insights
✔ Questions that Identify Opportunities for ROI
✔ Questions That Identify Opportunities For New Sources of Revenue
✔ Questions Identifying Cost-cutting Opportunities
✔ Questions That Begin With “Why”
✔ How Can You Understand Your Customers Better?

As technology companies like Amazon, Meta, and Google continue to grow and integrate with our
lives, they are leveraging big data technologies to monitor sales, improve supply chain
efficiency and customer satisfaction, and predict future business outcomes.

Currently, there is so much big data that International Data Corporation (IDC) predicts the
“Global Datasphere” will grow from 33 zettabytes (ZB) in 2018 to 175 ZB in 2025. One zettabyte
is equal to a trillion gigabytes.
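
The scale of these figures is easier to grasp as arithmetic. The following is a minimal sketch, assuming decimal (SI) units, where 1 GB = 10^9 bytes and 1 ZB = 10^21 bytes; the function name `zb_to_gb` is only illustrative.

```python
# Assumption: decimal (SI) units, i.e. 1 GB = 10**9 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21


def zb_to_gb(zettabytes: int) -> int:
    """Convert zettabytes to gigabytes using SI prefixes."""
    return zettabytes * BYTES_PER_ZB // BYTES_PER_GB


gb_2018 = zb_to_gb(33)     # IDC figure for 2018
gb_2025 = zb_to_gb(175)    # IDC forecast for 2025
growth_factor = 175 / 33   # how many times larger the 2025 figure is

print(f"1 ZB   = {zb_to_gb(1):,} GB")
print(f"33 ZB  = {gb_2018:,} GB")
print(f"175 ZB = {gb_2025:,} GB")
print(f"Growth 2018 -> 2025: about {growth_factor:.1f}x")
```

Under these assumptions a single zettabyte already corresponds to a trillion gigabytes, so the forecast describes roughly a fivefold increase over seven years.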

What if you could
empower everyone
with analytics
anywhere decisions
are made?

Today, BI extends to everyone

Everyone

1st wave
Technical BI

Two Data
Factors

The Good
Experiments, observations, and numerical simulations in many
areas of science and business are currently generating terabytes of
data, and in some cases are on the verge of generating petabytes
and beyond. Analyses of the information contained in these data
sets have already led to major breakthroughs in fields ranging from
genomics to astronomy and high-energy physics and to the
development of new information-based industries.
- Frontiers in Massive Data Analysis, National Research
Council of the National Academies

The Bad
Given a large mass of data, we can by judicious selection construct perfectly
plausible unassailable theories—all of which, some of which, or none of which
may be right. - Paul Arnold Srere

DATA ANALYTICS vs BUSINESS ANALYTICS
• Data analytics is a broad umbrella for finding insights in data
• Data analytics can refer to any form of analysis of data—whether in a
spreadsheet, database, or app—where the intent is to uncover trends,
identify anomalies, or measure performance.
• Additional mathematics or IT skills can help data analysts do everything from
managing a database of subscribers to calculating yields for a potential
investment.
• Data analytics (DA) is the technical process of mining data, cleaning data,
transforming data, and building the systems to manage data. Data analytics
takes large quantities of data to find trends and solve problems. Data
analytics is not just confined to business applications—it’s used across
disciplines, from the government to science.

• Business analytics focuses on identifying operational insights.


• Business analytics focuses on the overall function and day-to-day operation
of the business.
• A business analyst would deal less with the technical aspects of analysis and
more with the practical applications of data insights.
• Some job responsibilities might include creating a streamlined workflow or
choosing the best vendors.
• Business analytics (BA) refers to the process of taking your company’s raw
data and turning it into useful information, including identifying trends,
predicting outcomes, and more.

DIFFERENCE BETWEEN BUSINESS ANALYTICS AND DATA
ANALYSIS

DIFFERENCE BETWEEN DATA ANALYTICS AND BIG
DATA ANALYTICS

Evolution of BI

Characteristics of Data for Good Decision Making

CRISP – DM
Cross-Industry Standard Process for Data
Mining

BIG DATA SYSTEMS PERSPECTIVE…

Systems Perspective: Processing Data
In-Memory Processing
Characteristics:
– Data processed directly in RAM, avoiding disk I/O.
– Extremely fast (low latency).
Advantages:
– Ideal for real-time analytics and low-latency applications
(e.g., Spark, Apache Flink).
– Supports iterative algorithms and machine learning.
Challenges:
– Limited by available RAM.
– More expensive compared to disk-based solutions.
Systems Perspective: Processing from Secondary Storage
Characteristics:
– Data processed from hard drives or SSDs.
– Disk I/O introduces latency.

Advantages:
– Can handle massive datasets that don’t fit in memory.
– Suitable for batch processing (e.g., Hadoop MapReduce).

Challenges:
– Slower compared to in-memory processing.
– Requires optimized data locality and access patterns.
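
The trade-off above can be illustrated with a small sketch of batch-style processing from disk: the file is read sequentially and only a bounded chunk is ever held in RAM. This is an illustration under stated assumptions, not how Hadoop itself is implemented; the function name `running_sum_from_disk`, the chunk size, and the temporary demo file are all hypothetical.

```python
import os
import tempfile


def running_sum_from_disk(path: str, chunk_lines: int = 1000) -> int:
    """Sum one integer per line, holding at most `chunk_lines` values in RAM."""
    total = 0
    chunk = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:              # sequential read: a disk-friendly access pattern
            chunk.append(int(line))
            if len(chunk) == chunk_lines:
                total += sum(chunk)      # process the batch, then release it
                chunk.clear()
    total += sum(chunk)                  # leftover partial batch
    return total


# Hypothetical demo file standing in for a dataset too large for memory.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    for value in range(1, 10_001):
        tmp.write(f"{value}\n")
    demo_path = tmp.name

disk_total = running_sum_from_disk(demo_path, chunk_lines=256)
os.remove(demo_path)
print(disk_total)
```

Memory use stays proportional to the chunk size rather than the file size, at the cost of paying disk I/O latency for every chunk.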
Systems Perspective: Processing over the Network

Characteristics:
– Distributed processing across multiple nodes in a network.
– Data often stored in HDFS, S3, or similar distributed systems.

Advantages:
– Scalability: Can process petabytes of data by leveraging many nodes.
– Redundancy: Fault-tolerant systems with data replication.

Challenges:
– Network latency can become a bottleneck.
– Requires efficient task scheduling and data shuffling (e.g., Apache Hadoop, Spark).
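
To make the role of data shuffling concrete, here is a single-process sketch of the map, shuffle, and reduce phases of a word count. It is not the Hadoop or Spark API; the input `blocks` and the function names are hypothetical, and each string merely stands in for a data block held on a different node.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for a data block stored on a different node.
blocks = [
    "big data systems",
    "big data analytics",
    "data locality matters",
]


def map_phase(block: str) -> list[tuple[str, int]]:
    """Runs where the block lives: emit one (word, 1) pair per word."""
    return [(word, 1) for word in block.split()]


def shuffle_phase(mapped: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Group values by key; on a real cluster this is the step that crosses the network."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Combine all values that share a key into a single count."""
    return {word: sum(counts) for word, counts in grouped.items()}


word_counts = reduce_phase(shuffle_phase([map_phase(b) for b in blocks]))
print(word_counts)
```

Only the shuffle step needs data from every block at once, which is why it tends to be the part most sensitive to network latency.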
LOCALITY OF REFERENCE…

Locality of Reference: Principle & Examples
• Definition: Locality of reference is a principle in computing that
describes how programs tend to access a relatively small portion of
their address space at any given time.
• Types of Locality:
– Temporal Locality: Recently accessed data is likely to be
accessed again soon.
– Spatial Locality: Data near a recently accessed location is likely
to be accessed soon.
• Examples:
– Code: Sequential instruction execution (loops).
– Data: Consecutive array accesses in loops.
– Memory Allocation: Reuse of stack/heap data.
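
Spatial locality can be made visible by looking at the distance between consecutive accesses. The sketch below assumes a small grid stored row by row in one contiguous buffer (the sizes and helper names are hypothetical) and compares the access strides of row-major and column-major traversal. Python's interpreter overhead hides most of the actual cache effect, so the example shows the access pattern rather than timings.

```python
from array import array

ROWS, COLS = 4, 5
# One contiguous buffer in row-major order: element (r, c) lives at index r * COLS + c.
grid = array("d", [float(i) for i in range(ROWS * COLS)])


def row_major_indices() -> list[int]:
    """Visit each row left to right: neighbours in memory are visited back to back."""
    return [r * COLS + c for r in range(ROWS) for c in range(COLS)]


def column_major_indices() -> list[int]:
    """Visit each column top to bottom: every step jumps a whole row ahead."""
    return [r * COLS + c for c in range(COLS) for r in range(ROWS)]


def strides(indices: list[int]) -> list[int]:
    """Distance, in elements, between consecutive accesses."""
    return [b - a for a, b in zip(indices, indices[1:])]


row_strides = strides(row_major_indices())
col_strides = strides(column_major_indices())

# Both orders touch the same elements, so the results agree; only the pattern differs.
total_row = sum(grid[i] for i in row_major_indices())
total_col = sum(grid[i] for i in column_major_indices())

print("row-major strides:   ", row_strides)
print("column-major strides:", col_strides)
```

The row-major walk always moves to the adjacent element, while the column-major walk jumps by a full row on each step, which is the pattern that tends to cause more cache misses on large arrays.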
Impact of Locality on Performance

• Reduced Latency:
– Data in the CPU cache is faster to access than RAM or disk.
– Better locality = fewer cache misses = reduced latency.
• Optimized Resource Usage:
– CPU pipelines stay efficient.
– Reduced memory bandwidth contention.
• Examples:
– Loop unrolling and blocking in matrix multiplication.
– Optimized database query plans.
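
Blocking (tiling) in matrix multiplication can be sketched as follows. The blocked version computes the same product as the naive triple loop but works on small tiles, so that in a compiled language each tile would stay cache-resident while it is reused. The function names and the block size are illustrative assumptions, and pure Python will not show the speedup itself.

```python
def matmul_naive(a, b):
    """Textbook triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same result, computed tile by tile so each tile is reused while it is 'hot'."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Work only inside the current tiles; min() handles ragged edges.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]  # reused across the inner loop (temporal locality)
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


left = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
right = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
product_naive = matmul_naive(left, right)
product_blocked = matmul_blocked(left, right, block=2)
print(product_blocked)
```

In practice the block size is chosen so that the tiles being combined fit in the target cache level.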
Algorithms & Data Structures Leveraging Locality

• Sorting Algorithms:
– Merge Sort benefits from spatial locality during
merging phases.
• Search Structures:
– B-Trees/B+ Trees: Designed for efficient disk access.
• Dynamic Programming:
– Uses temporal locality by storing reusable
subproblem solutions.
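
As one illustration of dynamic programming reusing stored subproblem solutions, the sketch below computes the length of the longest common subsequence of two strings. The choice of problem and the name `lcs_length` are assumptions made for the example; the point is that each new row is built from the row computed just before it, which is read again immediately (temporal locality) and scanned left to right (spatial locality).

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, keeping only two table rows."""
    prev = [0] * (len(y) + 1)          # solutions for the previous prefix of x
    for ch_x in x:
        curr = [0]                     # column 0: empty prefix of y
        for j, ch_y in enumerate(y, start=1):
            if ch_x == ch_y:
                curr.append(prev[j - 1] + 1)            # extend a stored solution
            else:
                curr.append(max(prev[j], curr[j - 1]))  # reuse the neighbouring cells
        prev = curr                    # the row just written is the next one read
    return prev[-1]


print(lcs_length("ABCBDAB", "BDCABA"))
```

Keeping only two rows instead of the full table also shrinks the working set, which makes it more likely to stay in cache.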
Data Organization for Better Locality
• On-Disk Data Layout:
– Contiguous allocation for files (e.g., ext4, NTFS).
– Index structures like clustered B-Trees for databases.
• In-Memory Data Structures:
– Arrays vs. Linked Lists: Arrays have better spatial locality.
– Cache-Aware Algorithms: Tailored for specific cache sizes.
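
The array versus linked list contrast can be sketched in a few lines. The example stores the same values once in a contiguous buffer (Python's standard `array` module) and once as separately allocated nodes joined by references; the `Node` class and helper names are hypothetical. It shows the structural difference only, since Python objects add their own indirection and timings here would not be representative.

```python
from array import array
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One separately allocated list cell; `next` may point anywhere in memory."""
    value: int
    next: Optional["Node"] = None


def build_linked_list(values: list[int]) -> Optional[Node]:
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def sum_linked(head: Optional[Node]) -> int:
    total = 0
    node = head
    while node is not None:
        total += node.value
        node = node.next     # pointer chase: the next address is only known now
    return total


values = list(range(1, 101))
contiguous = array("q", values)        # 64-bit integers packed side by side
linked = build_linked_list(values)

array_total = sum(contiguous)          # walks one contiguous block of memory
linked_total = sum_linked(linked)      # follows one reference per element

print(contiguous.itemsize * len(contiguous), "bytes in one contiguous buffer")
print(array_total, linked_total)
```

Both traversals produce the same sum; the difference is that the array's next element is always adjacent in memory, whereas each linked list step depends on following a reference first.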
Mitigating Latency Through Locality Optimization
• Software Optimizations:
– Code restructuring to improve data locality.
– Using cache-friendly algorithms (e.g., blocking in matrix operations).
• Hardware Optimizations:
– Multi-level caches (L1, L2, L3).
– Prefetching mechanisms.
• Real-World Applications:
– High-performance computing.
– Database systems optimized for query efficiency.
