IT0009 Reviewer

The document provides an overview of SQL statements in PL/SQL, including SELECT, DML, and transaction control statements, while explaining the use of implicit and explicit cursors. It also covers control structures, PL/SQL records, procedures, functions, packages, triggers, and data warehousing concepts. Additionally, it highlights the differences between anonymous blocks and subprograms, as well as the architecture of data warehouses compared to traditional databases.


SQL Statements in PL/SQL

You can use the following kinds of SQL statements in PL/SQL:


• SELECT statements to retrieve data from a database.
• DML statements, such as INSERT, UPDATE, and DELETE, to
make changes to the database.
• Transaction control statements, such as COMMIT, ROLLBACK,
or SAVEPOINT, to make changes to the database permanent or to
discard them.

You cannot directly execute DDL and DCL statements in PL/SQL
because they are constructed and executed at run time; that is,
they are dynamic.
• There are times when you may need to run DDL or DCL
within PL/SQL.
• The recommended way of working with DDL and DCL
within PL/SQL is to use Dynamic SQL with the EXECUTE
IMMEDIATE statement.
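A minimal sketch of running a DDL statement with Dynamic SQL (the table name temp_log is only an illustration):

BEGIN
  -- DDL cannot appear directly in PL/SQL, so build it as a string
  -- and execute it dynamically at run time.
  EXECUTE IMMEDIATE 'CREATE TABLE temp_log (
                       log_id  NUMBER,
                       message VARCHAR2(100))';
END;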

Retrieve data from a database into a PL/SQL variable with a
SELECT statement so you can work with the data within PL/SQL.

The INTO clause is mandatory and occurs between the
SELECT and FROM clauses.
• It is used to specify the names of the PL/SQL variables that
hold the values that SQL returns from the SELECT clause.
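For example, a small sketch (assuming an employees table with employee_id and salary columns):

DECLARE
  v_salary  employees.salary%TYPE;
BEGIN
  -- The INTO clause names the PL/SQL variable that receives the value
  SELECT salary
    INTO v_salary
    FROM employees
   WHERE employee_id = 100;
  DBMS_OUTPUT.PUT_LINE('Salary: ' || v_salary);
END;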

Make changes to data by using DML commands within your
PL/SQL block:
• INSERT
• UPDATE
• DELETE
• MERGE
– The INSERT statement adds new rows to the table.
– The UPDATE statement modifies existing rows in the table.
– The DELETE statement removes rows from the table.
- The MERGE statement selects rows from one table to
update and/or insert into another table
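A small sketch of DML inside a block (the departments table and the ID 280 are assumed for illustration):

BEGIN
  -- INSERT adds a new row
  INSERT INTO departments (department_id, department_name)
  VALUES (280, 'Data Analytics');

  -- UPDATE modifies the row just added
  UPDATE departments
     SET department_name = 'Analytics'
   WHERE department_id = 280;

  -- DELETE removes it again
  DELETE FROM departments
   WHERE department_id = 280;
END;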

What is a Cursor?

• Every time an SQL statement is about to be executed, the Oracle server
allocates a private memory area to store the SQL statement and the data that
it uses.
• This memory area is called an implicit cursor.
• Because this memory area is automatically managed by the
Oracle server, you have no direct control over it.
• However, you can use predefined PL/SQL variables, called implicit cursor
attributes, to find out how many rows were processed by the SQL statement.

There are two types of cursors:


• Implicit cursors: Defined automatically by Oracle for all SQL
data manipulation statements, and for queries that return
only one row.
– An implicit cursor is always automatically named “SQL.”
• Explicit cursors: Defined by the PL/SQL
programmer for queries that return more
than one row.
Cursor attributes are automatically declared variables that
allow you to evaluate what happened when a cursor was last
used.
• Attributes for implicit cursors are prefaced with “SQL.”
• Use these attributes in PL/SQL statements, but not in SQL
statements.
• Using cursor attributes, you can test the
outcome of your SQL statements.

Cursor Attributes for Implicit Cursors

Attribute       Description
SQL%FOUND       Boolean attribute that evaluates to TRUE if the most
                recent SQL statement returned at least one row.
SQL%NOTFOUND    Boolean attribute that evaluates to TRUE if the most
                recent SQL statement did not return even one row.
SQL%ROWCOUNT    An integer value that represents the number of rows
                affected by the most recent SQL statement.
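A minimal sketch of testing the outcome of a DML statement with these attributes (assuming an employees table):

BEGIN
  UPDATE employees
     SET salary = salary * 1.05
   WHERE department_id = 50;

  -- SQL%ROWCOUNT reports how many rows the UPDATE just changed
  DBMS_OUTPUT.PUT_LINE(SQL%ROWCOUNT || ' rows updated.');

  IF SQL%NOTFOUND THEN
    DBMS_OUTPUT.PUT_LINE('No rows matched the WHERE clause.');
  END IF;
END;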

Controlling the Flow of Execution


• You can change the logical flow of statements within the PL/SQL block with a number of
control structures.
•Three types of PL/SQL control structures:
– Conditional constructs with the IF statement
– CASE expressions
– LOOP control structures

You can use CASE statements to test for non-equality
conditions such as <, >, >=, etc.
• These are called searched CASE statements.
• The syntax is virtually identical to an equivalent IF
statement.
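A short sketch of a searched CASE statement (the grade cut-offs are made up for illustration):

DECLARE
  v_grade  NUMBER := 88;
BEGIN
  -- Each WHEN tests its own condition, so non-equality tests are allowed
  CASE
    WHEN v_grade >= 90 THEN DBMS_OUTPUT.PUT_LINE('Excellent');
    WHEN v_grade >= 75 THEN DBMS_OUTPUT.PUT_LINE('Passed');
    ELSE DBMS_OUTPUT.PUT_LINE('Failed');
  END CASE;
END;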

Using a CASE Expression

• You want to assign a value to one variable that depends on
the value in another variable.
• An equivalent IF statement requires very repetitive coding; a
CASE expression is more concise, as in the sketch below.
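A sketch of such a CASE expression (the department numbers and names are illustrative only):

DECLARE
  v_deptid    NUMBER := 10;
  v_deptname  VARCHAR2(20);
BEGIN
  -- The CASE expression returns one value, assigned to v_deptname,
  -- replacing a repetitive IF/ELSIF chain
  v_deptname := CASE v_deptid
                  WHEN 10 THEN 'Accounting'
                  WHEN 20 THEN 'Research'
                  WHEN 30 THEN 'Sales'
                  ELSE 'Unknown'
                END;
  DBMS_OUTPUT.PUT_LINE(v_deptname);
END;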

Iterative Control: LOOP Statements


• Loops repeat a statement or a sequence of
statements multiple times.
• PL/SQL provides the following types of loops:
– Basic loops that perform repetitive actions
without overall conditions. The simplest form.
– FOR loops that perform iterative actions based
on a counter
– WHILE loops that perform repetitive actions
based on a condition
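A minimal sketch showing the three loop types side by side:

DECLARE
  v_counter  NUMBER := 1;
BEGIN
  -- Basic loop: repeats until EXIT WHEN fires
  LOOP
    DBMS_OUTPUT.PUT_LINE('Basic loop pass ' || v_counter);
    v_counter := v_counter + 1;
    EXIT WHEN v_counter > 3;
  END LOOP;

  -- FOR loop: the counter i is declared implicitly
  FOR i IN 1..3 LOOP
    DBMS_OUTPUT.PUT_LINE('FOR loop pass ' || i);
  END LOOP;

  -- WHILE loop: repeats while the condition is TRUE
  v_counter := 1;
  WHILE v_counter <= 3 LOOP
    DBMS_OUTPUT.PUT_LINE('WHILE loop pass ' || v_counter);
    v_counter := v_counter + 1;
  END LOOP;
END;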
PL/SQL Records
• A PL/SQL record is a composite data type consisting of a group of related data items
stored as fields, each with its own name and data type.
• You can refer to the whole record by its name and/or to individual fields by their names.
• Typical syntax for defining a record, shown below, is similar to what we used for cursors;
we just replace the cursor name with the table name.
record_name table_name%ROWTYPE;
Use %ROWTYPE to declare a variable as a record based on the structure of the employees
table.

You can use %ROWTYPE to declare a record based on another record

Syntax for User-Defined Records

• Start with the TYPE keyword to define your record structure.


• It must include at least one field and the fields may be defined using scalar data types
such as DATE, VARCHAR2, or NUMBER, or using attributes such as %TYPE
and %ROWTYPE.
• After declaring the type, use the type_name to declare a variable of that type
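A sketch combining a %ROWTYPE record and a user-defined record type (the employees table and the field names are assumptions for illustration):

DECLARE
  -- Record based on the structure of a whole employees row
  v_emp_rec  employees%ROWTYPE;

  -- User-defined record type with explicitly listed fields
  TYPE t_person IS RECORD (
    first_name  VARCHAR2(20),
    last_name   employees.last_name%TYPE,
    hire_date   DATE);
  v_person  t_person;
BEGIN
  SELECT * INTO v_emp_rec
    FROM employees
   WHERE employee_id = 100;

  -- Refer to individual fields as record_name.field_name
  v_person.first_name := v_emp_rec.first_name;
  v_person.last_name  := v_emp_rec.last_name;
  v_person.hire_date  := v_emp_rec.hire_date;
END;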

Implicit and Explicit Cursors

There are two types of cursors:


• Implicit cursors: Defined automatically by Oracle for all SQL DML statements
(INSERT, UPDATE, DELETE, and MERGE), and for SELECT statements that return only one
row.
• Explicit cursors: Declared by the programmer for queries that return more than
one row.
– You can use explicit cursors to name a context area and
access its stored data.

Steps for Using Explicit Cursors

Now that you have a conceptual understanding of cursors, review the steps to use them:
• DECLARE the cursor in the declarative section by naming it and defining the SQL SELECT
statement to be associated with it.
• OPEN the cursor.
– This will populate the cursor's active set with the results of the SELECT statement in the
cursor's definition.
– The OPEN statement also positions the cursor pointer at the
first row.

• FETCH each row from the active set and load the data into
variables.
– After each FETCH, the EXIT WHEN clause checks whether the FETCH
reached the end of the active set, resulting in a %NOTFOUND
condition.
– If the end of the active set was reached, the LOOP is exited.
• CLOSE the cursor.
– The CLOSE statement releases the active set of rows.
– It is now possible to reopen the cursor to establish a fresh
active set using a new OPEN statement.
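A sketch that walks through all four steps (DECLARE, OPEN, FETCH, CLOSE), assuming an employees table:

DECLARE
  CURSOR c_emps IS
    SELECT employee_id, last_name
      FROM employees
     WHERE department_id = 50;
  v_id    employees.employee_id%TYPE;
  v_name  employees.last_name%TYPE;
BEGIN
  OPEN c_emps;                        -- populate the active set
  LOOP
    FETCH c_emps INTO v_id, v_name;   -- load one row into the variables
    EXIT WHEN c_emps%NOTFOUND;        -- leave the loop at the end of the set
    DBMS_OUTPUT.PUT_LINE(v_id || ' ' || v_name);
  END LOOP;
  CLOSE c_emps;                       -- release the active set
END;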

Declaring the Cursor

When declaring the cursor:


• Do not include the INTO clause in the cursor declaration
because it appears later in the FETCH statement.
• If processing rows in a specific sequence is required, then
use the ORDER BY clause in the query.
• The cursor can be any valid SELECT statement, including
joins, subqueries, and so on.
• If a cursor declaration references any PL/SQL variables,
these variables must be declared before declaring the
cursor.

Syntax for Declaring the Cursor


• The active set of a cursor is determined by the SELECT
statement in the cursor declaration
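The general form is sketched below (select_statement stands for any valid query, written without an INTO clause):

CURSOR cursor_name IS
  select_statement;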

Explicit Cursor Attributes

• As with implicit cursors, there are several attributes for
obtaining status information about an explicit cursor.
• When appended to the cursor variable name, these
attributes return useful information about the execution of
a cursor manipulation statement.
Attribute Type Description
%ISOPEN Boolean Evaluates to TRUE if the cursor is open.
%NOTFOUND Boolean Evaluates to TRUE if the most recent fetch did not
return a row.
%FOUND Boolean Evaluates to TRUE if the most recent fetch returned a
row; opposite of %NOTFOUND.
%ROWCOUNT Number Evaluates to the total number of rows FETCHed so far.

Cursors with Parameters


• A parameter is a variable whose name is used in a cursor
declaration.
• When the cursor is opened, the parameter value is passed
to the Oracle server, which uses it to decide which rows to
retrieve into the active set of the cursor.
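A sketch of a cursor with a parameter (the employees table and department 50 are assumed for illustration):

DECLARE
  CURSOR c_emps (p_dept_id NUMBER) IS
    SELECT last_name, salary
      FROM employees
     WHERE department_id = p_dept_id;
  v_name    employees.last_name%TYPE;
  v_salary  employees.salary%TYPE;
BEGIN
  OPEN c_emps(50);   -- the value passed here decides which rows are retrieved
  LOOP
    FETCH c_emps INTO v_name, v_salary;
    EXIT WHEN c_emps%NOTFOUND;
    DBMS_OUTPUT.PUT_LINE(v_name || ': ' || v_salary);
  END LOOP;
  CLOSE c_emps;
END;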

Two Types of Oracle Server Errors


• Predefined Oracle server errors:
• Each of these errors has a predefined name, in
addition to a standard Oracle error number (ORA-
#####) and message.
• For example, if the error ORA-01403 occurs when no
rows are retrieved from the database in a SELECT
statement, then PL/SQL raises the predefined
exception NO_DATA_FOUND.

Non-predefined Oracle server errors:


• Each of these errors has a standard Oracle error
number (ORA-#####) and error message, but not a
predefined name.
• You declare your own names for these so that you can
reference these names in the exception section.
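A sketch handling both kinds of errors in one exception section (the employees table and the ORA-02292 example are illustrative):

DECLARE
  -- Give the non-predefined error ORA-02292 (child record found) a name
  e_child_exists  EXCEPTION;
  PRAGMA EXCEPTION_INIT(e_child_exists, -2292);
  v_name  employees.last_name%TYPE;
BEGIN
  SELECT last_name INTO v_name
    FROM employees
   WHERE employee_id = -1;            -- no such row, raises NO_DATA_FOUND
EXCEPTION
  WHEN NO_DATA_FOUND THEN             -- predefined name for ORA-01403
    DBMS_OUTPUT.PUT_LINE('No employee found.');
  WHEN e_child_exists THEN            -- user-declared name for ORA-02292
    DBMS_OUTPUT.PUT_LINE('Child records exist.');
END;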
Differences Between Anonymous Blocks
and Subprograms
• As the word “anonymous” indicates, anonymous blocks
are unnamed executable PL/SQL blocks.
• Because they are unnamed, they can neither be reused
nor stored in the database for later use.
• While you can store anonymous blocks on your PC, the
database is not aware of them, so no one else can
share them.
• Procedures and functions are PL/SQL blocks that are
named, and they are also known as subprograms.
• These subprograms are compiled and stored in the
database.
• The block structure of the subprograms is similar to
the structure of anonymous blocks.
• While subprograms can be explicitly shared, the
default is to make them private to the owner’s
schema.
• Later, subprograms become the building blocks of
packages and triggers.

The alternative to an anonymous block is a named block.
How the block is named depends on what you are creating.
• You can create:
– a named procedure (does not return values except as out
parameters)
– a function (must return a single value not including out
parameters)
– a package (groups functions and procedures together)
– a trigger

What Is a Procedure?

• A procedure is a named PL/SQL block that can accept
parameters.
• Generally, you use a procedure to perform an action
(sometimes called a “side-effect”).
• A procedure is compiled and stored in the database as
a schema object.
– Shows up in USER_OBJECTS as an object type of PROCEDURE
– More details in USER_PROCEDURES
– Detailed PL/SQL code in USER_SOURCE

Syntax for Creating Procedures


• Parameters are optional
• Mode defaults to IN
• Datatype can be either explicit (for example,
VARCHAR2) or implicit with %TYPE
• Body is the same as an anonymous block

• Use CREATE PROCEDURE followed by the name,
optional parameters, and the keyword IS or AS.
• Add the OR REPLACE option to overwrite an
existing procedure.
• Write a PL/SQL block containing local variables, a
BEGIN, and an END (or END procedure_name).
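A sketch of a procedure and its invocation (the employees table and a 10 percent raise are assumptions for illustration):

CREATE OR REPLACE PROCEDURE raise_salary
  (p_emp_id   IN employees.employee_id%TYPE,
   p_percent  IN NUMBER)
IS
BEGIN
  UPDATE employees
     SET salary = salary * (1 + p_percent / 100)
   WHERE employee_id = p_emp_id;
END raise_salary;
/

-- Invoke the stored procedure from an anonymous block
BEGIN
  raise_salary(176, 10);
END;
/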

What Is a Stored Function?

A function is a named PL/SQL block (subprogram) that can
accept optional IN parameters and must return exactly
one value.
• Functions must be called as part of a SQL or PL/SQL
expression.
• In SQL expressions, a function must obey specific rules to
control side effects.
• Avoid the following within functions:
– Any kind of DML or DDL
– COMMIT or ROLLBACK
– Altering global variables
• Certain return types (Boolean, for example) prevent a
function from being called as part of a SELECT.
• In PL/SQL expressions, the function identifier acts like a
variable whose value depends on the parameters passed
to it.
• A function must have a RETURN clause in the header and
at least one RETURN statement in the executable section.
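A sketch of a function and a call from a SQL expression (the employees table is assumed):

CREATE OR REPLACE FUNCTION get_annual_salary
  (p_emp_id IN employees.employee_id%TYPE)
  RETURN NUMBER
IS
  v_monthly  employees.salary%TYPE;
BEGIN
  SELECT salary INTO v_monthly
    FROM employees
   WHERE employee_id = p_emp_id;
  RETURN v_monthly * 12;     -- exactly one value is returned
END get_annual_salary;
/

-- Used as part of a SQL expression
SELECT last_name, get_annual_salary(employee_id) AS annual_salary
  FROM employees;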

Differences Between
Procedures and Functions

Procedures
• You create a procedure to store a series of actions for later
execution.
• A procedure does not have to return a value.
• A procedure can call a function to assist with its actions.
• Note: A procedure containing a single OUT parameter
might be better rewritten as a function returning the
value.

Functions
• You create a function when you want to compute a value
that must be returned to the calling environment.
• Functions return only a single value, and the value is
returned through a RETURN statement.
• The functions used in SQL statements cannot use OUT or
IN OUT modes.
• Although a function using OUT can be invoked from a
PL/SQL procedure or anonymous block, it cannot be used
in SQL statements.

What Are PL/SQL Packages?

• PL/SQL packages are containers that enable you to group
together related PL/SQL subprograms, variables, cursors,
and exceptions.

A package consists of two parts stored separately
in the database:
• Package specification: The interface to your
applications.
– It must be created first.
– It declares the constructs (procedures, functions,
variables, and so on) that are visible to the calling
environment.
• Package body: This contains the executable
code of the subprograms that were declared in
the package specification.
– It can also contain its own variable declarations.

Creating the Package Body

When creating a package body, do the following:


• Specify the OR REPLACE option to overwrite an existing
package body.
• Define the subprograms in an appropriate order.
• The basic principle is that you must declare a variable or
subprogram before it can be referenced by other
components in the same package body.
• Every subprogram declared in the package specification
must also be included in the package body.
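A minimal sketch of a specification and its body (the package name, variable, and message are purely illustrative):

CREATE OR REPLACE PACKAGE emp_pkg IS
  -- Specification: only what callers are allowed to see
  g_min_salary  NUMBER := 1000;
  PROCEDURE hire_emp (p_name IN VARCHAR2);
END emp_pkg;
/

CREATE OR REPLACE PACKAGE BODY emp_pkg IS
  -- Body: executable code for the subprograms declared in the specification
  PROCEDURE hire_emp (p_name IN VARCHAR2) IS
  BEGIN
    DBMS_OUTPUT.PUT_LINE('Hiring ' || p_name ||
                         ' at minimum salary ' || g_min_salary);
  END hire_emp;
END emp_pkg;
/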

Removing Packages

• To remove the entire package, specification and body, use
the following syntax:

DROP PACKAGE package_name;

• To remove only the package body, use the following syntax:

DROP PACKAGE BODY package_name;


• You cannot remove the package specification
on its own.

What Is a Trigger?
• A database trigger:
• Is a PL/SQL block associated with a specific action (an event)
such as a successful logon by a user, or an action taken on a
database object such as a table or view
• Executes automatically whenever the
associated action occurs
• Is stored in the database
• For example, a trigger can be associated with this
action: UPDATE OF salary ON employees

Types of Triggers
Triggers can be either row-level or statement-level.
• A row-level trigger fires once for each row affected by the
triggering statement
• A statement-level trigger fires once for the whole statement.

What Is a DML Trigger?


• A DML trigger is a trigger that is automatically fired (executed) whenever an
SQL DML statement (INSERT, UPDATE, or DELETE) is executed.
• You classify DML triggers in two ways:
– By when they execute: BEFORE, AFTER, or INSTEAD OF the triggering DML
statement.
– By how many times they execute: Once for the whole DML statement (a
statement trigger), or once for each row affected by the DML statement
(a row trigger).

Statement Trigger Timing


When should the trigger fire?
•BEFORE: Execute the trigger body before the triggering DML
event on a table.
•AFTER: Execute the trigger body after the triggering DML event
on a table.
•INSTEAD OF: Execute the trigger body instead of the triggering
DML event on a view.
•Programming requirements will dictate which one will be used.
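A sketch of a row-level BEFORE trigger on the UPDATE OF salary ON employees event (the business rule is made up for illustration):

CREATE OR REPLACE TRIGGER check_salary_trg
  BEFORE UPDATE OF salary ON employees   -- the triggering DML event
  FOR EACH ROW                           -- row trigger: fires once per affected row
BEGIN
  IF :NEW.salary < :OLD.salary THEN
    RAISE_APPLICATION_ERROR(-20001, 'Salary decreases are not allowed.');
  END IF;
END;
/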

What are DDL and Database Event Triggers?


• DDL triggers are fired by DDL statements: CREATE, ALTER, or
DROP.
• Database Event triggers are fired by non-SQL events in the
database, for example:
– A user connects to, or disconnects from, the database.
– The DBA starts up, or shuts down, the database.
– A specific exception is raised in a user session.

Creating Triggers on DDL Statements Syntax


• ON DATABASE fires the trigger for DDL on all schemas in the
database
• ON SCHEMA fires the trigger only for DDL on objects in your
own schema
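A sketch of a DDL trigger on your own schema (the trigger name is an assumption; ORA_SYSEVENT and the other event attribute functions are built in):

CREATE OR REPLACE TRIGGER log_ddl_trg
  AFTER CREATE OR ALTER OR DROP ON SCHEMA   -- fires for DDL on your own objects
BEGIN
  DBMS_OUTPUT.PUT_LINE(ORA_SYSEVENT || ' issued on ' ||
                       ORA_DICT_OBJ_TYPE || ' ' || ORA_DICT_OBJ_NAME);
END;
/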

Creating Triggers on Database Events Syntax


• ON DATABASE fires the trigger for events on all sessions in the
database.
• ON SCHEMA fires the trigger only for your own sessions.
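A sketch of a Database Event trigger (the logon_audit table is a hypothetical audit table, not part of the source):

CREATE OR REPLACE TRIGGER logon_audit_trg
  AFTER LOGON ON DATABASE     -- fires for every session in the database
BEGIN
  INSERT INTO logon_audit (user_name, logon_time)
  VALUES (USER, SYSDATE);
END;
/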

Creating Triggers on Database Events Guidelines

• Remember, you cannot use INSTEAD OF with Database Event
triggers.
• You can define triggers to respond to such system events as
LOGON, SHUTDOWN, and even SERVERERROR.
• Database Event triggers can be created ON DATABASE or ON
SCHEMA, except that ON SCHEMA cannot be used with
SHUTDOWN and STARTUP events

What is Data Warehousing?


• An electronic storage of a large amount of
information by a business or organization.
• A type of data management system that is
designed to enable and support business
intelligence (BI) activities, especially analytics.
Data warehouses are solely intended to
perform queries and analysis and often contain
large amounts of historical data. The data
within a data warehouse is usually derived from
a wide range of sources such as application log
files and transaction applications.

Data Warehousing vs. Databases


• A database is a transactional system that is set to monitor and update real-time
data in order to have only the most recent data available.

• A data warehouse is programmed to aggregate structured data over a period of
time.

Data Warehouse Architecture

• Simple. All data warehouses share a basic design in which metadata, summary data,
and raw data are stored within the central repository of the warehouse. The
repository is fed by data sources on one end and accessed by end users for analysis,
reporting, and mining on the other end.
• Simple with a staging area. Operational data must be cleaned and processed before
being put in the warehouse. Although this can be done programmatically, many data
warehouses add a staging area for data before it enters the warehouse, to simplify
data preparation.

• Hub and spoke. Adding data marts between the central repository and end users
allows an organization to customize its data warehouse to serve various lines of
business. When the data is ready for use, it is moved to the appropriate data mart.
• Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly
and informally explore new datasets or ways of analyzing data without having to
conform to or comply with the formal rules and protocol of the data warehouse.

Data Warehouse Architecture


Top Tier
The Top Tier consists of the Client-side front end of the architecture.
The Transformed and Logic applied information stored in the Data Warehouse will be
used and acquired for Business purposes in this Tier.

Middle Tier
The Middle Tier consists of the OLAP Servers
OLAP is Online Analytical Processing Server
OLAP is used to provide information to business analysts and managers

Bottom Tier
The Bottom Tier mainly consists of the Data Sources, ETL Tool, and Data Warehouse.

Components of Data Warehouse


1. Load Manager
• Load manager is also called the front component.
• It performs with all the operations associated with the extraction and
load of data into the warehouse.
• These operations include transformations to prepare the data for
entering the Data warehouse.

2. Warehouse Manager
• Warehouse manager performs operations associated with the
management of the data in the warehouse.
• It performs operations like analysis of data to ensure consistency,
creation of indexes and views, generation of denormalizations and
aggregations, transformation and merging of source data, and
archiving and backing up data.

3. Query Manager
• Query manager is also known as the backend component.
• It performs all the operations related to the management of
user queries.
• This component directs queries to the appropriate tables and
schedules the execution of queries.
4. End-user Access Tools
• These tools are categorized into five groups:
o Data Reporting
o Query Tools
o Application development tools
o EIS tools
o OLAP tools and data mining tools.

Three types of data warehouse:


1. Enterprise Data Warehouse
2. Operational Data Store
3. Data Mart

Enterprise Data Warehouse (EDW)


• Enterprise Data Warehouse is a centralized warehouse.
• It provides decision support service across the enterprise.
• It offers a unified approach for organizing and representing data.
• It also provides the ability to classify data according to the subject and gives
access according to those divisions.

Enterprise Data Warehouse Architecture


One-tier architecture for EDW means that you have a database directly connected
with the analytical interfaces where the end user can make queries.
Two-tier architecture. A data mart level is added between the user interface and
EDW. A data mart is a low-level repository that contains domain-specific
information. Simply put, it’s another, smaller-sized database that extends EDW
with dedicated information for your sales/operational departments, marketing, etc.
Three-tier architecture. On top of the data mart layer, enterprises also use online
analytical processing (OLAP) cubes. An OLAP cube is a specific type of database that
represents data from multiple dimensions.

Data Mart
• A data mart is a subset of the data warehouse. It is specially designed
for a particular line of business, such as sales or finance.
In an independent data mart, data can be collected directly from sources.
• A data mart is a scaled-down version of a data warehouse aimed at
meeting the information needs of a homogeneous small group of end
users such as a department or business unit (marketing, finance,
logistics, or human resources). It typically contains some form of
aggregated data and is used as the primary source for report
generation and analysis by this end user group.

Types of Data Mart


1. Dependent: A dependent data mart
sources an organization's data from a
single central Data Warehouse.

2. Independent: An independent data mart is
created without the use of a central data
warehouse.

3. Hybrid: This type of data mart can take
data from data warehouses or operational
systems.

Types of Data Warehouse Schema


• Star Schema
• Snowflake Schema
• Galaxy Schema

Schema
• A schema is a logical description that describes the entire database.
• In the data warehouse, the schema includes the names and descriptions of records.
• It includes all data items and also the different aggregates associated with the data.

Star Schema
• It is known as star schema as its structure resembles a star.
• The star schema is the simplest type of Data Warehouse schema.
• It is also known as Star Join Schema and is optimized for querying large data sets.
• The center of the star can have one fact table and a number of associated
dimension tables.

Fact Table
• A Fact table in a Data Warehouse system is nothing but the table that contains
all the facts or the business information, which can be subjected to analysis
and reporting activities when required.
• These tables hold fields that represent the direct facts, as well as the foreign
fields that are used to connect the fact table with other dimension tables in
the Data Warehouse system.
• A Data Warehouse system can have one or more fact tables, depending on the
model type used to design the Data Warehouse.

Dimension Table
• A dimension is a collection of reference information about a measurable fact in the
fact table.
• The primary key column of the dimension table uniquely identifies each
dimension record or row.
• The dimension tables contain descriptive attributes. For example, a
customer dimension's attributes could include first and last name, birth date,
gender, qualification, address, etc.
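A sketch of a star schema in SQL, with one fact table referencing two dimension tables (all table and column names are illustrative):

-- Dimension table: descriptive attributes, one row per customer
CREATE TABLE dim_customer (
  customer_key  NUMBER PRIMARY KEY,
  first_name    VARCHAR2(30),
  last_name     VARCHAR2(30),
  gender        CHAR(1),
  birth_date    DATE);

-- Dimension table: descriptive attributes, one row per product
CREATE TABLE dim_product (
  product_key   NUMBER PRIMARY KEY,
  product_name  VARCHAR2(50),
  category      VARCHAR2(30));

-- Fact table: measures plus foreign keys that connect to the dimensions
CREATE TABLE fact_sales (
  sale_id       NUMBER PRIMARY KEY,
  customer_key  NUMBER REFERENCES dim_customer(customer_key),
  product_key   NUMBER REFERENCES dim_product(product_key),
  sale_date     DATE,
  quantity      NUMBER,
  amount        NUMBER(10,2));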

Snowflake Schema
• A snowflake schema is an extension of the star schema in which the dimension
tables are connected to one or more additional dimension tables.
• The dimension tables are normalized, which splits data into additional tables.
• A snowflake schema keeps the same fact table structure as the star schema.
• Each dimension can have multiple levels with multiple hierarchies, and any one
level of a hierarchy can be attached to the fact table.

Galaxy Schema

• Multiple fact tables share dimension tables.


• This schema is viewed as collection of stars hence called galaxy schema.
• It is also called Fact Constellation Schema

Online Analytical Processing (OLAP)


• OLAP (Online Analytical Processing) is a powerful technology behind many Business
Intelligence (BI) applications that provides data discovery, report viewing capabilities,
complex analytical calculations, and predictive "what if" scenarios for budget and
forecast planning.
• OLAP is intended for data analysis, so it enables us to analyze information from
multiple database systems at the same time.
• It is a computing method that allows users to easily extract required data and query data
in order to analyze it from different points of view.
• It is based on the large volume of data held in the data warehouse; it collects the
required data from the data warehouse and performs the analysis the business requires
to make decisions that improve profit, sales, branding, marketing, and so on.

An OLAP Cube is a data structure that allows fast analysis of data according to the multiple
Dimensions that define a business problem.

Basic Analytical Operations of OLAP

• Roll-up – Also known as drill-up or consolidation; used to summarize data along a
dimension.
• Drill-down – Perform the analysis in deeper detail among the dimensions of data. For
example, drill down from "time period" to "years," "months," and "days" to plot sales
growth for a product.
• Slice – Take one level of information for display, such as "sales in 2019."
• Dice – Select data from multiple dimensions to analyze, such as "sales of Laptop in
Region 4 in 2019."
• Pivot – Gain a new view of the data by rotating the data axes of the cube.
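These operations are normally performed through an OLAP server or tool, but plain SQL against an assumed denormalized sales_summary table (columns region, sale_year, category, amount) can sketch the idea:

-- Roll-up: summarize along the region > year hierarchy,
-- producing subtotals per region and a grand total
SELECT region, sale_year, SUM(amount) AS total_sales
  FROM sales_summary
 GROUP BY ROLLUP (region, sale_year);

-- Slice: fix one dimension to a single value ("sales in 2019")
SELECT region, SUM(amount) AS total_sales
  FROM sales_summary
 WHERE sale_year = 2019
 GROUP BY region;

-- Dice: restrict several dimensions ("sales of Laptop in Region 4 in 2019")
SELECT SUM(amount) AS total_sales
  FROM sales_summary
 WHERE sale_year = 2019
   AND region    = 'Region 4'
   AND category  = 'Laptop';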
Types of OLAP System

TYPE OF OLAP  EXPLANATION

ROLAP  Utilizes a relational database management system to store
and manage the data. These are servers that exist
between the database and the user. ROLAP systems work
on the information that resides in a relational database.
MOLAP  This server utilizes a multi-dimensional database (MDDB)
for storing and analyzing information. An MDDB can
efficiently store summaries, providing a method for quickly
querying and retrieving information from the database
for processing.

HOLAP  A blend of MOLAP and ROLAP. By utilizing both ROLAP
and MOLAP information stores, Hybrid OLAP offers the
qualities of both techniques. HOLAP stores data summaries
in binary files or in pre-calculated cubes and leaves the
detailed fact and dimension information in the
relational database.
DOLAP Desktop On-Line Analytical Processing (DOLAP) is a single-
tier, desktop-based OLAP technology.

WOLAP Web OLAP which is OLAP system accessible via the web
browser. WOLAP is a three-tiered architecture. It consists of
three components: client, middleware, and a database
server.
Mobile OLAP Mobile OLAP helps users to access and analyze OLAP data
using their mobile devices
SOLAP SOLAP is created to facilitate management of both spatial
and non-spatial data in a Geographic Information system
(GIS)

What is ETL?
• ETL stands for Extract, Transform and Load.
• It is a process in data warehousing used to extract data from the
source systems or databases and, after transforming it, place the
data into the data warehouse. It is a combination of three database
functions: Extract, Transform and Load.
ETL Process
• Step 1 - Extraction
All the preferred data from various source systems such as databases,
applications, and flat files is identified and extracted. Data extraction can be
completed by running jobs during non-business hours.

Data Extraction Strategies

Full Extraction: Followed when the whole data set from the sources is loaded into the data
warehouse, which indicates either that the data warehouse is being populated for the first
time or that no strategy has been made for incremental data extraction.
Partial Extraction (with update notification): Also known as delta extraction; only the
data that has changed is extracted and used to update the data warehouse.
Partial Extraction (without update notification): Extract only the specific required
data from the sources to load into the data warehouse, instead of extracting the whole
data set.

Step 2 - Transformation
Most of the extracted data can’t be directly loaded into the target system.
Based on the business rules, some transformations can be done before
loading the data.
The transformation process also corrects the data, removes any incorrect
data and fixes any errors in the data before loading it.

Step 3 - Loading
All the gathered information is loaded into the target Data Warehouse
tables.
Types of Loading:
Initial Load — populating all the Data Warehouse tables
Incremental Load — applying ongoing changes as needed, periodically.
Full Refresh —erasing the contents of one or more tables and reloading with
fresh data.
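An incremental load is often written as a MERGE against the target table; a sketch assuming a staging table stg_customer and a dimension table dim_customer (both names illustrative):

MERGE INTO dim_customer d
USING stg_customer s
   ON (d.customer_key = s.customer_key)
 WHEN MATCHED THEN
   UPDATE SET d.first_name = s.first_name,        -- apply changes to existing rows
              d.last_name  = s.last_name
 WHEN NOT MATCHED THEN
   INSERT (customer_key, first_name, last_name)   -- add rows that are new
   VALUES (s.customer_key, s.first_name, s.last_name);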
What is Data Mining?
It is basically the extraction of vital information/knowledge from a large set of data.
Fundamentally, data mining is about processing data and identifying patterns and trends in
that information so that you can make decisions or judgments.
Data mining is also called knowledge discovery, knowledge extraction, data/pattern
analysis, information harvesting, etc.

Its foundation comprises three scientific disciplines:


statistics (the numeric study of data relationships)
artificial intelligence (human-like intelligence displayed by software and/or machines)
machine learning (algorithms that can learn from data to make predictions).

Data Mining Techniques


Cluster Analysis. Enables you to identify a given user group according to common features in
a database. These features could include age, geographic location, education level, and so
on.

Anomaly Detection.
It is used to determine when something is noticeably different from the regular pattern. It is
used to eliminate any database inconsistencies or anomalies at the source.

Regression Analysis.
This technique is used to make predictions based on relationships within the data set.

Classification.
This deals with things that have labels on them. Note that in cluster detection the things did
not have labels; by using data mining we had to label them and form clusters. In
classification, labeled information already exists, so new items can be classified using an
algorithm.

Associative Learning.
It is used to analyze which things tend to occur together either in pairs or larger groups.

Predictive Data Mining vs. Descriptive Data Mining

Descriptive analysis is used to mine data and provide the latest information on past or
recent events.
Predictive analysis provides answers to queries about the future, using
historical data as the chief basis for decisions.

Key Data Mining Algorithms

• Supervised learning requires a known output, sometimes called a label or target. These
algorithms include Naïve Bayes, Decision Tree, Neural Networks, SVMs, Logistic
Regression, etc.
• Unsupervised learning algorithms do not require a predefined set of outputs but rather
look for patterns or trends without any label or target. They include k-Means Clustering,
Anomaly Detection, and Association Mining.

Data Mining Algorithms

C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is
given a set of data representing things that are already classified.

k-means creates k groups from a set of objects so that the members of a group are
more similar to each other than to objects in other groups. It's a popular cluster analysis
technique for exploring a dataset.

Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At
a high level, SVM performs a task similar to C4.5, except that SVM doesn't use decision
trees at all.

The Apriori algorithm learns association rules and is applied to a database containing a
large number of transactions.

In data mining, expectation-maximization (EM) is generally used as a clustering algorithm


(like k-means) for knowledge discovery.

PageRank is a link analysis algorithm designed to determine the relative importance of


some object linked within a network of objects

AdaBoost is a boosting algorithm which constructs a classifier. As you probably


remember, a classifier takes a bunch of data and attempts to predict or classify which
class a new data element belongs to.

kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the


classifiers previously described because it’s a lazy learner.
Naive Bayes is not a single algorithm, but a family of classification algorithms that share
one common assumption: Every feature of the data being classified is independent of all
other features given the class.

CART stands for classification and regression trees. It is a decision tree learning technique
that outputs either classification or regression trees. Like C4.5, CART is a classifier.

CRISP-DM
• CRISP-DM stands for Cross Industry Standard Process for Data Mining.
• It is a methodology created in 1996 to shape Data Mining projects. It consists of 6 steps to
conceive a Data Mining project, and the steps can be iterated in cycles according to
developers' needs.

Phases of CRISP-DM
1. Business Understanding
Focuses on understanding the project objectives and requirements from a
business perspective, and then converting this knowledge into a data mining
problem definition and a preliminary plan.
2. Data Understanding
Starts with an initial data collection and proceeds with activities in order to get
familiar with the data, to identify data quality problems, to discover first
insights into the data, or to detect interesting subsets to form hypotheses for
hidden information.
3. Data Preparation
The data preparation phase covers all activities to construct the final dataset
from the initial raw data.
4. Modeling
Modeling techniques are selected and applied. Since some techniques like
neural nets have specific requirements regarding the form of the data, there
can be a loop back here to data prep.

5. Evaluation
Once one or more models have been built that appear to have high quality based on
whichever loss functions have been selected, these need to be tested to ensure they
generalize against unseen data and that all key business issues have been sufficiently
considered. The end result is the selection of the champion model(s).
6. Deployment
Consists of presenting the results in a useful and understandable manner, and by
achieving this, the project should achieve its goals. It is the only step not belonging to
a cycle.
