
Introduction To Big Data Analytics: Welcome Intro To BDA !

The document provides an introduction to a course on big data analytics (BDA). It includes: 1. Contact information for the instructor and an overview of the basic course content which covers introductions, emerging technologies, data preprocessing, big data technologies and applications, and data visualization. 2. Definitions and explanations of terms related to data science, data mining, and big data analytics. It also describes the characteristics of big data and issues organizations face in effectively managing large amounts of data. 3. Examples of the vast amount of data generated every minute and characteristics of big data including volume, velocity, variety and complexity from different data sources and formats.

Uploaded by

galge turo

11/24/2021

Introduction to Big Data Analytics

Welcome to Intro to BDA!


• Instructor: Kuulaa Qaqqabaa (PhD in CSE)
• Assistant Professor, Department of Software Engineering (AASTU, Finfinne)
• Director, Center of Excellence for HPC and Big Data Analytics
• Office: College of Electrical and Mechanical Eng., Block 64 (Room-206).
• Email1: [email protected]
• Email2: [email protected]


Basic Course Content

1. Introduction

2. Overview of BDA and Emerging Technologies

3. Data Pre-Processing

4. Overview of Big Data Technologies and Applications

5. Data Visualization

Terminology: What is DSc, DM and BDA?


• Data: facts, measurements or text collected for reference or analysis (Oxford dictionary).
  • Unstructured data: data that does not fit a certain data structure (text, a list of numeric measurements)
  • Structured data: data that fits a certain data structure (table, tree, graph/network, etc.)

• “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.” (The Gartner Group, www.gartner.com)

• “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.” (David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.)

• Process mining is the task of converting event data into process models.
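
As a rough illustration of converting event data into a process model, the sketch below builds a directly-follows graph, one of the simplest process-model representations, from a tiny event log. The log contents, the (case id, activity, timestamp) layout, and the function name are assumptions made up for this example; they are not part of the course material.

```python
from collections import Counter, defaultdict


def directly_follows(event_log):
    """Count how often activity a is directly followed by activity b in a case.

    event_log: iterable of (case_id, activity, timestamp) tuples
    (a hypothetical layout chosen for this sketch).
    """
    traces = defaultdict(list)
    for case_id, activity, timestamp in event_log:
        traces[case_id].append((timestamp, activity))

    edges = Counter()
    for events in traces.values():
        # Order each case by timestamp, then count adjacent activity pairs.
        ordered = [activity for _, activity in sorted(events)]
        for a, b in zip(ordered, ordered[1:]):
            edges[(a, b)] += 1
    return edges


# Made-up event log: two orders passing through a small process.
log = [
    ("order-1", "register", 1), ("order-1", "check", 2), ("order-1", "ship", 3),
    ("order-2", "register", 1), ("order-2", "check", 2), ("order-2", "reject", 3),
]
dfg = directly_follows(log)
for (a, b), count in sorted(dfg.items()):
    print(f"{a} -> {b}: {count}")
```

Each edge of the resulting graph says "activity a was directly followed by activity b this many times"; real process-mining tools derive richer models from the same kind of counts.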


Terminology: What is DSc, DM and BDA?


• Data science is the application of computational and statistical techniques to address or gain insight into some problem in the real world.

• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data.
• Data science (DS) is a multidisciplinary field of study with the goal of addressing the challenges in big data.
• Data science principles apply to all data – big and small.

• Knowledge Discovery in Data is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.

The Data Analytics Process


1. Understand the domain
2. Create a dataset:
   • Select the interesting attributes
   • Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to step 2

• Must address:
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data

How much data do we generate?


https://www.allaccess.com/merge/archive/31294/infographic-what-happens-in-an-internet-minute/


Characteristics of Big Data


Big Data is any data that is expensive to manage and hard to extract value from.
• Volume
  • The size of the data
• Velocity
  • The latency of data processing relative to the growing demand for interactivity
• Variety and Complexity
  • The diversity of sources, formats, quality, structures.

Kinds of data:
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
  • Social Network, Semantic Web (RDF), …
• Streaming Data
  • You can afford to scan the data once
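
The "scan the data once" constraint on streaming data can be illustrated with a single-pass statistic. The sketch below uses Welford's method to keep a running mean and variance without storing the stream; the class name, the generator, and the sample values are assumptions invented for this example.

```python
class RunningStats:
    """Single-pass mean and population variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each value is seen exactly once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one value has been seen.
        return self._m2 / self.n if self.n else 0.0


def readings():
    # Stand-in for an unbounded stream; the values are made up.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for value in readings():
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the point of one-pass algorithms.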


Business Management Issues


• “We have mountains of data in this company, but we can’t access it.”
• “We need to slice and dice the data every which way.”
• “You’ve got to make it easy for business people to get at the data directly.”
• “Just show me what is important.”
• “It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers.”
• “We want people to use information to support more fact-based decision making.”


Why BDA?


Managing Organizations
Informed decision making as a prerequisite for success

• Vision
• Mission
• Strategic Givens: Values, Purpose, Structure, Politics, Environment, etc.
• Direction: Policies, Goals, and Objectives
• Decision Making (What should be done?): Analytics, Decision Making
• Implementation (When and how?): Project Management
• Action

Managerial Decision Making


Information Technology Solutions for Improving Effectiveness

[Diagram: the decision phases INTELLIGENCE, DESIGN and CHOICE, linked to DATA and MODELS]
• Data: variables (measures and estimates); probabilities and estimates
• Models: structuring relationships; problem representation; generation of alternatives
• Decision analysis and influence diagrams for visualizing models and choices
• Spreadsheet models for managing complex relationships and detail


Components of a DSS
Creating Information Under Conditions of Uncertainty and Complexity

Information Technology for Enterprise Strategic Systems

• Data base: enterprise data, managed by a DBMS
• Model base: application models, managed by an MBMS
• Data warehousing
• Online analytical processing (OLAP)
• Business reporting

Enterprise Wide Decisions


Goals/Strategy

• Marketing (pricing, promotion, loyalty): Demand, Consumers
• Production (capacity, labor, materials): Quantity, Suppliers
• Finance (cash flow, debt/equity, investments): Revenues, Investors


Why DM?
• Data explosion
• “We are drowning in data, but starving for knowledge!”
• Machine Learning
• Data Mining
  • Descriptive data mining: clustering, pattern mining, etc.
  • Predictive data mining: classification, prediction, etc.
• Big Data Analytics or Data Science

• Data → Information → Knowledge
  • Knowledge Discovery
  • Interpretation
  • Understanding
  • Learning
  • Acting
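
To make the descriptive versus predictive distinction concrete, the sketch below pairs a tiny one-dimensional k-means clustering (descriptive: no labels are given) with a nearest-neighbor classifier (predictive: labels are learned from examples). Both functions, the starting centers, and the spending figures are illustrative assumptions, not algorithms prescribed by the course.

```python
def kmeans_1d(values, centers, iterations=10):
    """Descriptive: group unlabeled values around the given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


def nearest_neighbor(labeled, x):
    """Predictive: label x with the class of its closest labeled example."""
    _, label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return label


# Made-up monthly spending figures.
spend = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(spend, [1.0, 12.0])
print(centers, clusters)

labeled = [(1.0, "low"), (2.0, "low"), (11.0, "high"), (12.0, "high")]
print(nearest_neighbor(labeled, 9.5))
```

Clustering summarizes structure already present in the data, while classification uses known labels to predict the label of a new observation.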


What is a Data Warehouse?

• Data warehouse: a copy of transaction data specially structured for query and analysis (R. Kimball)
• Data warehouse: a system used for reporting and data analysis (Wikipedia)
• Data warehouse: a subject-oriented, integrated, nonvolatile, timestamped collection of data designed to support management’s decision support needs (B. Inmon)


Data warehouse (DW): Definition


• A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence.

• A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context.

• A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

• A data warehouse integrates data originating from multiple sources and various timeframes.

Basic Elements of the Data Warehouse


Data Warehouse

• The data warehouse:
  • must make an organization’s information easily accessible
  • must present the organization’s information consistently
  • must be adaptive and resilient to change
  • must be a secure bastion that protects our information assets
  • must serve as the foundation for improved decision making
• The business community must accept the data warehouse if it is to be deemed successful


Benefits of a Data Mart (contd…)

Operational Source Systems and Data Staging Area

• Operational Source Systems
  • capture the transactions of the business
  • queries against source systems are narrow
  • stovepipe application

• Data Staging Area: a storage area and a set of ETL (extract-transform-load) processes
  • it is off-limits to business users and does not provide query and presentation services.

• Data Staging Area - ETL
  • EXTRACTION
    • reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation.
  • TRANSFORMATION
    • cleansing, combining data from multiple sources, deduplicating data, and assigning warehouse keys
  • LOADING
    • loading the data into the data warehouse presentation area
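
The three ETL steps above can be sketched in a few lines. This is a minimal illustration only: the two in-memory "source systems", the customer rows, the table name dim_customer, and the use of an in-memory SQLite database as the presentation area are all assumptions made for the example.

```python
import sqlite3

# Hypothetical rows from two operational source systems (made-up data).
SOURCE_A = [{"customer": " Abebe ", "city": "addis ababa"},
            {"customer": "Sara", "city": "Adama"}]
SOURCE_B = [{"customer": "abebe", "city": "Addis Ababa"},
            {"customer": "Lensa", "city": "Jimma"}]


def extract():
    """EXTRACTION: copy the needed source rows into a staging list."""
    return [dict(row) for row in SOURCE_A + SOURCE_B]


def transform(staged):
    """TRANSFORMATION: cleanse, combine, deduplicate, assign warehouse keys."""
    seen = {}
    for row in staged:
        name = row["customer"].strip().title()
        city = row["city"].strip().title()
        seen.setdefault((name, city), None)  # first occurrence wins
    # Surrogate warehouse keys are assigned in order of first appearance.
    return [(key, name, city) for key, (name, city) in enumerate(seen, start=1)]


def load(conn, rows):
    """LOADING: write the transformed rows into the presentation-area table."""
    conn.execute(
        "CREATE TABLE dim_customer "
        "(customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
rows = conn.execute("SELECT * FROM dim_customer ORDER BY customer_key").fetchall()
print(rows)
```

The duplicate customer that appears in both sources with different spacing and capitalization collapses into one warehouse row after cleansing.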


Data Presentation Area


• where data is organized, stored and made available for direct querying by users, report writers, and other analytical applications

• it is all the business community sees and touches via data access tools

• dimensional data modeling
  • user understandability
  • query performance
  • resilience to change

• detailed, atomic data
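
One common way to realize a dimensional model is a star schema: a fact table of atomic measurements surrounded by descriptive dimension tables. The sketch below is an assumed example using Python's built-in sqlite3 module; the table names, columns, and sales figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Fact table: one row per sale line (atomic grain), keyed by the dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    amount REAL
);
INSERT INTO dim_product VALUES
    (1, 'Coffee', 'Beverage'), (2, 'Tea', 'Beverage'), (3, 'Bread', 'Bakery');
INSERT INTO dim_date VALUES (20211101, 2021, 11), (20211201, 2021, 12);
INSERT INTO fact_sales VALUES
    (1, 20211101, 2, 10.0), (2, 20211101, 1, 3.0),
    (3, 20211101, 4, 8.0), (1, 20211201, 1, 5.0);
""")


def sales_by_category(conn, year, month):
    """Total sales amount per product category for one month."""
    query = """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        WHERE d.year = ? AND d.month = ?
        GROUP BY p.category
        ORDER BY p.category
    """
    return conn.execute(query, (year, month)).fetchall()


print(sales_by_category(conn, 2021, 11))
```

Because the facts are stored at the atomic level, the same tables can answer questions at any coarser level (by category, by month, by year) simply by changing the GROUP BY.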



Data Access Tools

• tools that query the data in the data warehouse’s presentation area.
• the variety of capabilities that can be provided to business users to leverage the presentation area for analytic decision making:
  • prebuilt parameter-driven analytic applications.
  • ad hoc query tools.
  • data mining, modeling, forecasting

Microsoft SQL Server

• SQL Server Integration Services (SSIS)
  • tool for the ETL process
• SQL Server Analysis Services (SSAS)
  • tool for multidimensional modeling
• SQL Server Reporting Services (SSRS)
  • tool for reporting


The KDD Process (Contd.)


Steps of the KDD Process (Contd.)


• Data cleaning to remove noise and inconsistent data.

• Data integration, where multiple data sources may be combined.

• Data selection, where data relevant to the analysis task are retrieved from the database.

• Data transformation, where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.

• Data mining, which is an essential process where intelligent methods are applied to extract data patterns.

• Pattern evaluation to identify the truly interesting patterns representing knowledge based on interestingness measures.

• Knowledge presentation, where visualization and knowledge representation techniques are used to present mined knowledge to users.
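
The seven steps can be walked through end to end on a toy problem. The sketch below is one possible reading, assuming a market-basket task in which "mining" means counting item pairs and "interestingness" means a minimum support threshold; the purchase records, function names, and threshold are all made up for this example.

```python
from collections import Counter
from itertools import combinations

# Two made-up sources of purchase records: (basket_id, item).
STORE = [(1, "bread"), (1, "milk"), (2, "bread"), (2, "milk"), (2, None), (3, "milk")]
ONLINE = [(4, "Bread "), (4, "milk"), (4, "eggs"), (5, "eggs")]


def clean(records):
    """Data cleaning: drop missing items and normalize spelling."""
    return [(b, item.strip().lower()) for b, item in records if item]


def integrate(*sources):
    """Data integration: combine several cleaned sources."""
    return [rec for source in sources for rec in source]


def select(records, items_of_interest):
    """Data selection: keep only records relevant to the analysis."""
    return [(b, i) for b, i in records if i in items_of_interest]


def transform(records):
    """Data transformation: aggregate records into one item set per basket."""
    baskets = {}
    for basket_id, item in records:
        baskets.setdefault(basket_id, set()).add(item)
    return list(baskets.values())


def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    pairs = Counter()
    for basket in baskets:
        pairs.update(combinations(sorted(basket), 2))
    return pairs


def evaluate(pairs, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {p: c / n_baskets for p, c in pairs.items() if c / n_baskets >= min_support}


records = integrate(clean(STORE), clean(ONLINE))
records = select(records, {"bread", "milk", "eggs"})
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
# Knowledge presentation: report the surviving patterns.
for pair, support in sorted(patterns.items()):
    print(pair, f"support={support:.2f}")
```

In practice each step is far more involved, and the process is iterative: weak patterns at the evaluation step usually send the analyst back to selection or transformation.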

KDD
• KDD: Knowledge Discovery in Databases.

• Data archeology.

• Information harvesting

• Knowledge extraction

• Machine learning

• Big data techniques?

• Data science?

• Business intelligence?


CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING

CRISP-DM
An industry- and tool-neutral data mining process model.

• Business understanding phase

• Data understanding phase

• Data preparation phase

• Modeling phase

• Evaluation phase

• Deployment phase


DM in Businesses

• Process management
• Market basket analysis
• Marketing
• Customer loyalty
• Fraud detection
• Trend analysis

DM in practice

1. Learn about the problem domain
2. Data selection
3. Data cleaning, preprocessing and reduction
4. Data mining
5. Interpretation of information
6. Apply knowledge in domain

• Data preprocessing: sampling; normalization; missing data; data conflicts; duplicate data; ambiguity in data
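
Three of the preprocessing concerns listed above (duplicate data, missing data, normalization) can be sketched in plain Python. The choices here, mean imputation for missing values and min-max scaling for normalization, are only one common option each, and the sample rows are made up.

```python
def drop_duplicates(rows):
    """Remove exact duplicate rows while keeping the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def fill_missing(values):
    """Replace None with the mean of the observed values.

    Assumes at least one value is present.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up (id, measurement) rows with one duplicate and one missing value.
rows = [("a", 20.0), ("b", None), ("a", 20.0), ("c", 40.0), ("d", 30.0)]
rows = drop_duplicates(rows)
scaled = min_max(fill_missing([value for _, value in rows]))
print(scaled)
```

The order matters: removing duplicates first keeps the repeated row from skewing the mean used for imputation.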

Guidelines for Successful Data mining


• The data must be available, relevant, adequate and clean;
• There must be a well-defined problem;
• The problem should not be solvable by means of ordinary query or OLAP tools;
• The results must be actionable.

• Successful data mining in businesses involves:
  • Using a small team with strong internal integration and a loose management style;
  • Carrying out a small pilot project before a major data mining project;
  • Identifying a clear problem owner responsible for the project, e.g., from sales or marketing;
  • Trying to realize a positive return on investment within 6 to 12 months;
  • Having top management back the project.

Proposed Assessment and Grading Scheme

Homework (Lit. Review)    15%
Quizzes                   10%
Project (Indv + Group)    30%
Class Participation       5-10%
Final Exam                30-50%

• Each class should join Google Classroom
• Zero tolerance on plagiarism
