Lecture Notes: Database Modelling and Design
Toby J. Teorey
University of Michigan
Contents
I. Database Systems and the Life Cycle (Chapter 1)
   Introductory concepts; objectives of database management
   Relational database life cycle
   Characteristics of a good database design process
II. Requirements Analysis (Chapter 3)
database administrator (DBA) -- person or group responsible for the effective use of
database technology in an organization or enterprise. Motivation: control over all phases of
the lifecycle.
*Y2K (year 2000) problem -- many systems store 2-digit years (e.g. ‘02-OCT-98’) in their
programs and databases; these years give incorrect results when used in date arithmetic (especially
subtraction), since ‘00’ is interpreted as 1900 rather than 2000. Fixing this problem
requires many hours of reprogramming and database alterations for many companies and
government agencies.
1. Organizational objectives
- sell more cars this year
- move into the recreational vehicle market
3. Organizational structure/chart
Middle management - functions in operational areas, technical areas, job-titles, job functions
2. Agree with the interviewee on format for documentation (ERD, DFD, etc.)
Function: Take customer orders and either fill them or make adjustments.
Frequency: daily
Entity - a class of real world objects having common characteristics and properties about
which we wish to record information.
* degree - number of entities associated in the relationship (binary, ternary, other n-ary)
* Surrogate - system created and controlled unique key (e.g. Oracle’s “create sequence”)
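For example, in Oracle a surrogate key can be generated with a sequence; the table and column names below are illustrative assumptions, not taken from these notes:

    -- Hypothetical example: an Oracle sequence as a surrogate key generator
    CREATE SEQUENCE customer_seq START WITH 1 INCREMENT BY 1;

    CREATE TABLE customer (
      cust_id   NUMBER PRIMARY KEY,   -- surrogate key, system created and controlled
      cust_name VARCHAR2(40)
    );

    INSERT INTO customer (cust_id, cust_name)
    VALUES (customer_seq.NEXTVAL, 'Smith');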
* special property: inheritance - subclass inherits the primary key of the super-class, super-class has
common nonkey attributes, each subclass has specialized non-key attributes
Aggregation
* “part-of” relationship among entities to a higher type aggregate entity (“contains” is the inverse
relationship)
Constraints
Constraints in ER modeling
* role - the function an entity plays in a relationship
- mandatory/optional
- specifies the lower bound of connectivity of entity instances participating in a relationship: 0 (optional) or 1 (mandatory)
- naming conflicts
homonyms - same name for different concepts
synonyms - different names for the same concept
- structural conflicts
type conflicts - different modeling construct for the same concept (e.g. “order” as an entity, attribute, or relationship)
- dependency conflicts - connectivity is different for different views (e.g. job-title vs. job-title-history)
- key conflicts - same concept but different keys are assigned (e.g. ID-no vs. SSN)
- behavioral conflicts - different integrity constraints (e.g. null rules for optional/mandatory:
insert/delete rules)
2. Conforming of schemas
* superimpose entities
Entity-Relationship Clustering
Motivation
* conceptual (ER) models are difficult to read and understand for large and complex databases, e.g.
10,000 or more data elements
* there is a need for a tool to abstract the conceptual database schema (e.g. clustering of the ER diagram)
* potential applications
- documentation of the database conceptual schema (in coordination with the data dictionary)
Clustering Methodology
Given an extended ER diagram for a database.....
* Ternary relationship – directly to a SQL table, taking the 3 primary keys of the 3
entities associated with this relationship as foreign keys in the new table
* Attribute of an entity – directly to be an attribute of the table transformed from this
entity
* Generalization subclass (subtype) entity – directly to a SQL table, but with the
primary key of its super-class (super-type) propagated down as a foreign key into its table
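A sketch of these rules in SQL (the entity names supplier/part/project and employee/engineer are assumptions used only for illustration):

    -- Ternary relationship: one table taking the 3 primary keys as foreign keys
    CREATE TABLE supplier (supplier_no INTEGER PRIMARY KEY, sname VARCHAR(30));
    CREATE TABLE part     (part_no     INTEGER PRIMARY KEY, pname VARCHAR(30));
    CREATE TABLE project  (project_no  INTEGER PRIMARY KEY, jname VARCHAR(30));

    CREATE TABLE supplies (
      supplier_no INTEGER REFERENCES supplier,
      part_no     INTEGER REFERENCES part,
      project_no  INTEGER REFERENCES project,
      quantity    INTEGER,
      -- composite primary key assumed here; the actual key depends on the
      -- functionality of the ternary relationship
      PRIMARY KEY (supplier_no, part_no, project_no)
    );

    -- Generalization: the subclass inherits the super-class primary key
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, emp_name VARCHAR(30));
    CREATE TABLE engineer (
      emp_id INTEGER PRIMARY KEY REFERENCES employee,  -- propagated down as PK/FK
      degree VARCHAR(20)                               -- specialized nonkey attribute
    );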
First normal form (1NF) to third normal form (3NF) and BCNF
Goals of normalization
1. Integrity
2. Maintainability
super-key -- a set of one or more attributes which, when taken collectively, allows us to uniquely identify an entity instance (row) in a table.
candidate key -- any subset of the attributes of a super-key that is also a super-key and is not reducible.
primary key -- arbitrarily selected from the set of candidate keys, as needed for indexing.
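A small sketch of these key notions (the table and its attributes are assumptions, not from the notes):

    -- {empno} and {ssn} are each candidate keys; {empno, ssn} is a super-key
    -- but is reducible; empno is arbitrarily chosen as the primary key.
    CREATE TABLE emp (
      empno INTEGER PRIMARY KEY,
      ssn   CHAR(9) NOT NULL UNIQUE,
      ename VARCHAR(30)
    );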
S1 P3 6 11-5-89
S1 P4 2 6-30-90
S1 P5 1 8-12-91
S1 P6 5 4-21-91
S2 P1 3 5-3-90
S2 P2 4 12-31-90
S3 P3 4 3-25-91
S3 P5 2 3-27-91
S4 P2 2 10-31-89
S4 P4 3 7-14-90
S4 P5 7 8-20-90
S5 P5 5 8-11-91
NOT Third Normal Form
This table has a composite key; functional dependencies (FDs) that involve any individual component of this key (e.g. empno) on the left side must be separated out into other tables.
Table 2
Let us start with the following set of FDs and then refine them, eliminating transitive dependencies
within the same table.
We need to eliminate the redundant right sides of the transitive dependencies (office_no) and put them
into other tables. Thus we get:
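A sketch of the kind of result (only empno and office_no come from the notes; the intermediate attribute dept_no and the table names are assumptions): if empno -> dept_no and dept_no -> office_no, then office_no is removed from the employee table and stored where it depends directly on its own key:

    -- office_no was transitively dependent on empno (via dept_no),
    -- so it is moved out of the employee table
    CREATE TABLE department (
      dept_no   INTEGER PRIMARY KEY,
      office_no INTEGER                       -- dept_no -> office_no
    );

    CREATE TABLE employee (
      empno   INTEGER PRIMARY KEY,
      dept_no INTEGER REFERENCES department   -- empno -> dept_no
    );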
1. Reflexivity
If Y is a subset of the attributes of X, then X->Y.
X = ABCD, Y = ABC => X->Y
X->X trivial case
2. Augmentation
If X->Y and Z is a subset of table R (i.e. Z is any set of attributes in R), then XZ -> YZ .
3. Transitivity
If X->Y and Y->Z then X->Z.
4. Pseudo-transitivity
If X->Y and YW->Z then XW->Z.
(transitivity is a special case of pseudo-transitivity when W is null)
5. Union
If X->Y and X->Z then X->YZ.
6. Decomposition
If X->YZ then X->Y and X->Z.
Super-key Rule 1. Any FD involving all the attributes of a table defines a super-key on the left side of the FD.
Given: any FD containing all attributes in the table R(W,X,Y,Z), i.e. XY -> WZ.
Proof:
(1) XY -> WZ given
(2) XY -> XY by the reflexivity axiom
(3) XY -> XYWZ by the union axiom
(4) XY uniquely determines every attribute in table R, as shown in (3)
(5) XY uniquely defines table R, by the definition of a table as having no duplicate rows
(6) XY is therefore a super-key, by the definition of a super-key.
Super-key Rule 2. Any attribute that functionally determines a super-key of a table is also a super-key for that table.
H+ (closure of H) - the set of all FDs derivable from H using all the FD inference rules
Step 3: Partition H’ into tables such that all FDs with the
same left side are in one table, thus eliminating any non-fully functional FDs. (Note: creating tables
at this point would be a feasible solution for 3NF, but not necessarily minimal.)
R1: AB->C
R2: A->EF
R3: E->G
R4: G->DI
R5: F->DJ
R6: D->KLMNP
R7: L->D
R8: PQR->T
R9: PR->S
Step 4: Merge equivalent keys, i.e. merge tables where all FDs satisfy 3NF.
4.1 Write out the closure of all LHS attributes resulting from Step 3, based on transitivities.
4.2 Using the closures, find tables that are subsets of other groups and try to merge them. Use Rule
1 and Rule 2 to establish if the merge will result in FDs with super-keys on the LHS. If not, try
using the axioms to modify the FDs to fit the definition of super-keys.
4.3 After the subsets are exhausted, look for any overlaps among tables and apply Rules 1 and 2
(and the axioms) again.
In this example, note that R7 (L->D) has a subset of the attributes of R6 (D->KLMNP). Therefore
we merge to a single table with FDs D->KLMNP, L->D because it satisfies 3NF: D is a super-key
by Rule 1 and L is a super-key by Rule 2.
Final 3NF (and BCNF) table attributes, FDs, and candidate keys:
R1: ABC (AB->C with key AB)
R2: AEF (A->EF with key A)
R3: EG (E->G with key E)
R4: DGI (G->DI with key G)
R5: DFJ (F->DJ with key F)
R6: DKLMNP (D->KLMNP, L->D, with keys D, L)
R7: PQRT (PQR->T with key PQR)
R8: PRS (PR->S with key PR)
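As a sketch, two of these could be declared as SQL tables (data types are assumptions), with the second candidate key of R6 declared UNIQUE:

    CREATE TABLE R1 (
      A CHAR(10),
      B CHAR(10),
      C CHAR(10),
      PRIMARY KEY (A, B)           -- AB -> C
    );

    CREATE TABLE R6 (
      D CHAR(10) PRIMARY KEY,      -- D -> KLMNP
      K CHAR(10),
      L CHAR(10) NOT NULL UNIQUE,  -- L -> D, so L is also a candidate key
      M CHAR(10),
      N CHAR(10),
      P CHAR(10)
    );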
Step 4a. Check to see whether all tables are also BCNF. For any table that is not BCNF,
add the appropriate partially redundant table to eliminate the delete anomaly.
Delete anomaly: If Sutton drops Journalism, then we have no record of Murrow teaching Journalism.
How can we decompose this table into BCNF?
The new row is allowed in SI using unique(student,instructor) in the create table command, and the
join of SI and IC is lossless. However, a join of SI and IC now produces the following two rows:
student   course   instructor
Sutton    Math      Von Neumann
Sutton    Math      Dantzig
which violates the FD SC -> I.
Oracle, for instance, has no way to automatically check SC->I, although you could write a procedure to
do this at the expense of a lot of overhead.
Decomposition 3 (tradeoff between integrity and performance)
SC -> I and I -> C (two tables with redundant data)
Problems - extra updates and storage cost
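A sketch of Decomposition 3 as SQL tables (the table names are assumptions):

    -- SC -> I: enforced by the composite primary key (student, course)
    CREATE TABLE enrollment (
      student    VARCHAR(30),
      course     VARCHAR(30),
      instructor VARCHAR(30),
      PRIMARY KEY (student, course)
    );

    -- I -> C: each instructor teaches a single course
    -- (course is now stored redundantly in both tables)
    CREATE TABLE instructor_course (
      instructor VARCHAR(30) PRIMARY KEY,
      course     VARCHAR(30)
    );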
1. delete anomaly - two independent facts get tied together unnaturally, so there may be bad side
effects of certain deletes; e.g. in “skills required” the last record of a skill may be lost if an employee
is temporarily not working on any projects.
2. update inefficiency - adding a new project in “skills required” requires insertions for many
records (rows) to include all required skills for that new project. Likewise, loss of a project
requires many deletes.
A 2-way lossless join occurs when skill_required is projected over {empno, project} and {project, skill}.
Projections over {empno, project} and {empno, skill}, and over {empno, skill} and {project, skill},
however, are not lossless. A 3-way lossless join occurs when skill_required is projected over {empno,
project}, {empno, skill}, and {project, skill}.
skill_in_common—an employee must apply the intersection of available skills to the skills needed for
different projects. In other words if an employee has a certain skill and he or she works on a given
project that requires that skill, then he or she must provide that skill for that project (this is less restrictive
than skill_required because the employee need not supply all the required skills, but only those in
common).
V. Access Methods
Types of Queries
Query type 1: access all records of a given type
“Increase everyone’s salary by 10%”
access method: sequential processing
where Tsba is the average disk i/o service time for a sequential block and Trba is the average disk i/o service time for a random block access
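A sketch of the scan cost, consistent with the block counts used later in these notes (e.g. 50,000 rows at 100 rows per block = 500 sequential block accesses):

    sequential scan time ≈ ceil(n/bfac) * Tsba
    where n = number of records and bfac = records per block; an update such as the
    salary increase must also rewrite each block, roughly doubling this figure.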
Extendible Hashing
* number of buckets grows or contracts
* bucket splits when it becomes full
* collisions are resolved immediately, no long overflow chains
* primary key transformed to an entry in the Bucket Address Table (BAT),
typically in RAM
* BAT has pointers to disk buckets that hold the actual data
* Retrieve a single record = 1 rba (access the bucket in one step)
* Cost (service time) of I/O for updates, inserts, and deletes is the same as for B+-trees
Example: B+-tree
To determine the order of a B+-tree, let us assume that the database has 500,000
records of 200 bytes each, the search key is 15 bytes, the tree and data pointers are 5
bytes, and the index node (and data block size) is 1024 bytes. For this configuration
we have non-leaf index node size = 1024 bytes = p*5 + (p-1)*15 bytes
p = floor((1024+15)/20) = floor(51.95) = 51
number of search key values in the leaf nodes = floor ((1024-5)/(15+5))=50
h = height of the B+-tree (number of index levels, including the leaf index nodes)
n = number of records in the database (or file); all must be pointed at from the next to last level, h-1
p^(h-1) * (p-1) > n
(h-1) log p + log(p-1) > log n
(h-1) log p > log n - log(p-1)
h > 1 + (log n - log(p-1)) / log p
h > 1 + (log 500,000 - log 50) / log 51 = 3.34, so h = 4 (nearest higher integer)
A good approximation can be made by assuming that the leaf index nodes are
implemented with p pointers and p key values:
p^h > n
h log p > log n
h > log n/log p
In this case, the result above becomes h > 3.35 or h = 4.
B+-tree performance
read a single record (B+-tree) = h+1 rba
As an example, consider the insertion of a node (with key value 77) to the B+-tree
shown in Fig. 6.6. This insertion requires a search (query) phase and an insertion
phase with one split node. The total insertion cost for height 3 is
Secondary Indexes
Basic characteristics of secondary indexes
* based on Boolean search criteria (AND, OR, NOT) of attributes that are not
the primary key
* one accession list per attribute value; pointers typically contain a block address and a record offset
bfac = block_size/pointer_size
* assume all accesses to the accession list are random due to dynamic re-allocation of
disk blocks
where h is the height of the B+tree index, bfac is the blocking factor for the accession
list (i.e. the number of pointer/key pairs in the leaf nodes in the B+tree), and t is the
number of target records in the table that satisfy all the conditions in the query.
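A sketch of the resulting cost for a single-attribute condition, using the same accession-list expression that appears later in the join-strategy section of these notes:

    query cost ≈ [ h + ceil(t/bfac) - 1 + t ] rba
    i.e. traverse the index (h random accesses), scan the remaining accession-list
    blocks (ceil(t/bfac) - 1), then fetch each of the t target records with one
    random block access.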
Denormalization
* motivation – poor performance by normalized databases
To illustrate the effect of denormalization, let us assume that the table review is
associated with the tables employee and manager as the table that follows shows.
The extension of the review table, review-ext, is shown as a means of reducing
the number of joins required in the query shown below. This extension results in a
real denormalization, that is,
review_no -> emp_id -> emp_name, emp_address
with the side effects of add and update anomalies. However, the delete anomaly
cannot occur because the original data is redundant in the extended schema.
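A sketch of the original and extended tables in SQL (column names and data types are assumptions; the manager table is omitted for brevity):

    -- Original (normalized) tables
    CREATE TABLE employee (
      emp_id      INTEGER PRIMARY KEY,
      emp_name    VARCHAR(30),
      emp_address VARCHAR(60)
    );
    CREATE TABLE review (
      review_no INTEGER PRIMARY KEY,
      emp_id    INTEGER REFERENCES employee
    );

    -- Denormalized extension: employee data repeated in review_ext
    -- (review_no -> emp_id -> emp_name, emp_address)
    CREATE TABLE review_ext (
      review_no   INTEGER PRIMARY KEY,
      emp_id      INTEGER,
      emp_name    VARCHAR(30),
      emp_address VARCHAR(60)
    );

    -- emp_name and emp_address can now be retrieved without a join:
    SELECT review_no, emp_name, emp_address FROM review_ext;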
3. Evaluate total cost for storage, query, and update for the database schema, with
and without the extended table, and determine which configuration minimizes total
cost.
4. Consider also the possibility of denormalization due to a join table and its side
effects. If a join table schema appears to have lower storage and processing cost and
insignificant side effects, then consider using that schema for physical design in
addition to the original candidate table schema. Otherwise use only the original
schema.
Join Strategies
1. nested loop: complexity O(mn)
2. merge-join: complexity O(n log2 n)
3. indexed join: complexity O(n)
4. hash-join: complexity O(n)
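The cost comparisons that follow concern an equijoin of assigned_to and project along these lines (a sketch; the join column project_no is an assumption):

    SELECT *
    FROM assigned_to a, project p
    WHERE a.project_no = p.project_no;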
If a sequential block access requires an average of 10 ms, the total time required is 2505 seconds.
Nested Loop Case 2: project is the outer loop table.
Note that this strategy does not take advantage of row order for these tables.
join cost = sort time for assigned_to + merge time (to scan both sorted tables)
= (50,000*log2 50,000)/100 + 50,000/100 + 250/50
= (50,000*16)/100 + 500 + 5
= 8505 sequential block accesses (or 85.05 seconds)
join cost = sort time for both tables + merge time for both tables
= (50,000*log2 50,000)/100 + (250*log2 250)/50 + 50,000/100
+ 250/50
= 8000 + 40 + 500 + 5
= 8545 sequential block accesses (or 85.45 seconds)
We see that the sort phase of the merge-join strategy is the costliest component, but it still
significantly improves performance compared to the nested loop strategy.
Low Selectivity Joins
Let ntr=100 qualifying rows for the foreign key table (assigned_to) and let ntr=1 row for the
primary key table (project) in the example below. Assume h=2 for the unique index to project, Tsba
= 10 ms, and Trba = 40 ms.
Indexed join Case 1: Scan foreign key table once and index to the primary key
join cost = scan the entire foreign key table (assigned_to)
+ index to the primary key table (project) qualifying row
= 50,000/100 sba + (h+1) rba
= 500 sba + 3 rba (or 5.12 seconds)
For the next case, assume the nonunique index height, hn = 3, index blocking factor
bfac = 500, with ntr = 100 target foreign key rows as given above.
Indexed join Case 2: Index to both the primary key table and the foreign key
Join cost = index to the primary key table + index to the foreign key table
= (h+1) rba + [hn + ceil(ntr/bfac) – 1 + ntr] rba
= 3 rba + [3 + 0 + 100] rba
= 106 rba (or 4.24 seconds)
Indexed join Case 3: Nonunique indexes required for both tables due to join on
two nonkeys.
Join cost = index to the first table + index to the second table
= [h1 + ceil(ntr1/bfac1) –1 + ntr1] rba
+ [h2 + ceil(ntr2/bfac2) –1 + ntr2] rba
In the hash join strategy, the table scans may only have to be done infrequently as
long as the hash file in RAM remains intact for a series of queries, so in Case 1
above, the incremental cost for the given query requires only 101 rba or 4.04
seconds.
Distributed Database Management System (DDBMS) - a software system that permits the
management of a distributed database and makes the distribution transparent to the users. If
heterogeneous, it may allow transparent simultaneous access to data on multiple dissimilar systems.
Advantages
1. Improves performance, e.g. it saves communication costs and reduces query delays by providing data
at the sites where it is most frequently accessed.
2. Improves the reliability and availability of a system by providing alternate sites from where the
information can be accessed.
3. Increases the capacity of a system by increasing the number of sites where the data can be located.
4. Allows users to exercise control over their own data while allowing others to share some of the data
from other sites.
Disadvantages
2. Makes central control more difficult and raises several security issues because a data item stored at a
remote site can always be accessed by the users at the remote site.
3. Makes performance evaluation difficult because a process running at one node may impact the entire
network.
Rule 2. No Central Site. There must be no central point of failure or bottleneck. Therefore the
following must be distributed: dictionary management, query processing, concurrency control, and
recovery control.
Rule 3. Continuous Operation. The system should not require a shutdown to add or remove a
node from the network. User applications should not have to change when a new node is added,
provided they do not need information from the added node.
Rule 4. Location Independence (or Transparency). A common global user view of the
database should be supported so that users need not know where the data is located. This allows data to
be moved for performance considerations or in response to storage constraints without affecting the user
applications.
Rule 6. Replication Independence (or Transparency). This allows several copies of a table
(or portions thereof) to reside at different nodes. Query performance can be improved since applications
can work with a local copy instead of a remote one. Update performance, however, may be degraded due
to the additional copies. Availability can improve.
Rule 7. Distributed Query Processing. No central site should perform optimization; but the
submitting site, which receives the query from the user, should decide the overall strategy. Other
participants perform optimization at their own levels.
Rule 8. Distributed Transaction Processing. The system should process a transaction across
multiple databases exactly as if all of the data were local. Each node should be capable of acting as a
coordinator for distributed updates, and as a participant in other transactions. Concurrency control must
occur at the local level (Rule 2), but there must also be cooperation between individual systems to ensure
that a “global deadlock” does not occur.
Rule 9. Hardware Independence. The concept of a single database system must be presented
regardless of the underlying hardware used to implement the individual systems.
Rule 10. Operating System Independence. The concept of a single database system must be
presented regardless of the underlying operating systems used.
Rule 11. Network Independence. The distributed system must be capable of communicating over
a wide variety of networks, often different ones in the same configuration. Standard network protocols
must be adhered to.
Rule 12. DBMS Independence (Heterogeneity). The distributed system should be able to be
made up of individual sites running different database management systems.
* Distribution Transparency
- location, fragmentation, replication, update
* Integrity
- Transaction management
- Concurrency control
- Recovery and availability
- Integrity constraint checking
IV. Data distribution (allocation). Create a data allocation schema that indicates
where each copy of each table is to be stored. The allocation schema defines at which site(s) a table is
located. A one-to-one mapping in the allocation schema results in non-redundancy, while a one-to-many mapping defines a redundant distributed database.
Fragmentation.
Fragmentation is the process of taking subsets of rows and/or columns of tables as the smallest unit of
data to be sent across the network. Unfortunately, very few commercial systems have implemented this
feature, but we include a brief discussion for historical reasons. We could define a fragmentation
schema of the database based on dominant applications’ “select” predicates (set of conditions for
retrieval specified in a select statement).
Horizontal fragmentation partitions the rows of a global fragment into subsets. A fragment r1 is a
selection on the global fragment r using a predicate Pi, its qualification. The reconstruction of r is
obtained by taking the union of all fragments.
Vertical fragmentation subdivides the attributes of the global fragment into groups. The simplest
form of vertical fragmentation is decomposition. A unique row-id may be included in each fragment to
guarantee that the reconstruction through a join operation is possible.
Mixed fragmentation is the result of the successive application of both fragmentation techniques.
2. Fragments must be disjoint and their union must become the whole
fragment. Overlapping fragments are too difficult to analyze and
implement.
3. The largest fragment is the whole table; the smallest fragment is a
single record. Fragments should be designed to maintain a
balance between these extremes.
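As a rough sketch (table, predicate, and column names are assumptions; since few systems expose fragmentation directly, the fragments are shown simply as derived tables):

    -- Horizontal fragment: a selection on a predicate; the union of all such
    -- fragments reconstructs the global table
    CREATE TABLE customer_east AS
      SELECT * FROM customer WHERE region = 'EAST';

    -- Vertical fragment: a projection onto a group of attributes that keeps the
    -- key (or row-id) so a join can reconstruct the original table
    CREATE TABLE customer_address AS
      SELECT cust_id, street, city FROM customer;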
Data Distribution
Data distribution defines the constraints under which data allocation strategies may operate. They are
determined by the system architecture and the available network database management software. The
four basic data distribution approaches are:
* Centralized
In the centralized database approach, all the data are located at a single site. The implementation of
this approach is simple. However, the size of the database is limited by the availability of the secondary
storage at the central site. Furthermore, the database may become unavailable from any of the remote
sites when communication failures occur, and the database system fails totally when the central site fails.
* Partitioned
In this approach, the database is partitioned by tables, and each table is assigned to a particular site.
This strategy is particularly appropriate where either local secondary storage is limited compared to the
database size, the reliability of the centralized database is not sufficient, or operating efficiencies can be
gained through the exploitation of the locality of references in database accesses.
* Replicated
The replicated data distribution strategy allocates a complete copy of the database to each site in the
network. This completely redundant distributed data strategy is particularly appropriate when reliability is
critical, the database is small, and update inefficiency can be tolerated.
* Hybrid
The hybrid data distribution strategy partitions the database into critical and non-critical tables.
Non-critical tables need only be stored once, while critical tables are duplicated as desired to meet the
required level of reliability.
Configuration Information
Constraints
The objective is to find the allocation of programs and database tables to sites which minimizes C, the total cost:
C = Ccomm + Cproc + Cstor
where:
Ccomm = communications cost for message and data.
Cproc = site processing cost (CPU and I/O).
Cstor = storage cost for data and programs at sites.
* Transaction response time, which is the sum of communication delays, local processing,
and all resource queuing delays.
* Transaction availability, which is the percentage of time the transaction executes with all
components available.
The non-redundant “best fit” method determines the single most likely site to allocate a table based on
maximum benefit, where benefit is interpreted to mean total query and update references. In particular,
place table Ri at the site s* where the number of local query and update references by all the user
transactions is maximized.
Example
System Parameters
Table   Size          Avg. Service Time        Avg. Service Time
                      Local Query (Update)     Remote Query (Update)
R1      300 KBytes    100 ms (150 ms)          500 ms (600 ms)
R2      500 KBytes    150 ms (200 ms)          650 ms (700 ms)
R3      1.0 Mbytes    200 ms (250 ms)          1000 ms (1100 ms)
User transactions are described in terms of their frequency of occurrence, which tables they access, and
whether the accesses are reads or writes.
Our goal is to compute the number of local references to each table residing at each site, one by one.
The site that maximizes the local references to a given table is chosen as the site where that table should
reside.
Table Site Trans. T1(freq) T2(freq) T3(freq) Total local refs
R1 S1 3 read,1 write(1) 0 0 4
S2 0 2 read(2) 0 4
S3 0 0 0 0
S4 3 read,1 write(1) 2 read(2) 0 8 (max.)
S5 3 read,1 write(1) 0 0 4
R2 S1 2 read(1) 0 0 2
S2 0 0 0 0
S3 0 0 3 read,1 write(3) 12
S4 2 read(1) 0 0 2
S5 2 read(1) 0 3 read,1 write(3) 14 (max.)
R3 S1 0 0 0 0
S2 0 3 read,1 write(2) 0 8 (max.)
S3 0 0 2 read(3) 6
S4 0 3 read,1 write(2) 0 8 (max.)
S5 0 0 2 read(3) 6
Advantages
- simple algorithm
Disadvantages
- number of local references may not accurately characterize time or cost (reads and writes
given equal weights)
- no insights regarding replication
The benefit for table R at site S is measured by the difference in elapsed time to do a remote query to
table R from site S (i.e. no replicated copy available locally) and a local query to table R at site S (i.e.
replicated copy available locally). Total benefit for table R at site S is the weighted sum of benefit for
each query times the frequency of queries.
The cost for table R at site S is the total elapsed time for all the local updates of table R, plus the total
elapsed time for all the remote updates for the given table at that site. Total cost for table R at site S is the
weighted sum of the cost for each update transaction times the frequency of update transactions.
Example
Cost/Benefit Computations for “All Beneficial Sites”
R3 S1 None 0 0
S2 T2 at S2 3*2*(1000 - 200) 4800 ms**
S3 T3 at S3 2*3*(1000 - 200) 4800 ms**
S4 T2 at S4 3*2*(1000 - 200) 4800 ms**
S5 T3 at S5 2*3*(1000 - 200) 4800 ms**
Advantages
- simple algorithm
- can be applied to either redundant or non-redundant case
- reads and writes given appropriate weights
Disadvantages
- global averages of query and update time may not be realistic
- network topology and protocols not taken into account
Data warehouse – a large repository of historical data that can be integrated for decision support
6. The DW should have a capability for rewriting history, that is, allowing “what-if” analysis.
1.2 Define the DW architecture and do some initial capacity planning for servers and tools. Integrate
the servers, storage elements, and client tools.
Star schema is the most often used format -- good performance, ease of use
Fact table (one) – very large table containing numeric and/or non-numeric attributes, including
the primary keys from the dimension tables; similar to intersection tables between entities with
many-to-many relationships
Dimension tables (several) - smaller tables containing mostly non-numeric attributes; similar to
relational tables based on entities
Snowflake schema – similar to star schema, except dimension tables are normalized
Fact table family (constellation) – multiple fact tables interact with dimension tables
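A minimal star-schema sketch (all table and column names are assumptions, loosely following the product/store/time examples used later in these notes):

    -- Dimension tables: small, mostly descriptive attributes
    CREATE TABLE product_dim (
      product_no   INTEGER PRIMARY KEY,
      product_name VARCHAR(30),
      product_type VARCHAR(20)
    );
    CREATE TABLE store_dim (
      store_no INTEGER PRIMARY KEY,
      city     VARCHAR(30),
      state    CHAR(2)
    );
    CREATE TABLE time_dim (
      day_no   INTEGER PRIMARY KEY,
      cal_date DATE,
      month_no INTEGER,
      year_no  INTEGER
    );

    -- Fact table: very large, keyed by the dimension keys
    -- (like an intersection table for many-to-many relationships)
    CREATE TABLE sales_fact (
      product_no INTEGER REFERENCES product_dim,
      store_no   INTEGER REFERENCES store_dim,
      day_no     INTEGER REFERENCES time_dim,
      units_sold INTEGER,
      dollars    DECIMAL(10,2),
      PRIMARY KEY (product_no, store_no, day_no)
    );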
join indexes – used to map dimension tables to the fact table efficiently
3.2 View materialization – associated with aggregation of data by one or more dimensions
such as time or location
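For example (Oracle-style syntax, reusing the hypothetical star-schema names sketched above), a monthly per-store aggregate could be materialized so it need not be recomputed at query time:

    CREATE MATERIALIZED VIEW monthly_store_sales AS
      SELECT t.year_no, t.month_no, s.store_no, SUM(f.dollars) AS total_dollars
      FROM sales_fact f, time_dim t, store_dim s
      WHERE f.day_no = t.day_no
        AND f.store_no = s.store_no
      GROUP BY t.year_no, t.month_no, s.store_no;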
5.2 Design and implement scripts for data extraction, cleaning, transformation, load, and refresh.
5.3 Populate the repository with the schema and view definitions, scripts, and other metadata.
5.4 Design and implement end-user applications. Roll out the DW and applications.
4. Formulas – derived data values can be defined by formulas (sum, average, etc.)
5. Links – links are needed to connect hypercubes and their data sources
Aggregation Issues
• Pre-aggregate nothing
No. of aggregate levels    1    1    1
Subset configuration
One-way aggregates
Product-type totals by store by day
State totals by product-name by day
Monthly totals by product-name by store
Two-way aggregates
Product-type totals by state totals by day
Product-type totals by month totals by store
State totals by monthly totals by product-name
Three-way aggregates
Product-type totals by monthly totals by state totals
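In SQL terms (again using the hypothetical star-schema names sketched earlier), a one-way and a two-way aggregate might look like:

    -- One-way aggregate: product-type totals by store by day
    SELECT p.product_type, f.store_no, f.day_no, SUM(f.units_sold) AS total_units
    FROM sales_fact f, product_dim p
    WHERE f.product_no = p.product_no
    GROUP BY p.product_type, f.store_no, f.day_no;

    -- Two-way aggregate: product-type totals by state totals by day
    SELECT p.product_type, s.state, f.day_no, SUM(f.units_sold) AS total_units
    FROM sales_fact f, product_dim p, store_dim s
    WHERE f.product_no = p.product_no
      AND f.store_no = s.store_no
    GROUP BY p.product_type, s.state, f.day_no;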
Largest 3-way aggregate fact table:
210*4500*120 = 113,400,000 records
210*15000*520 = 1,638,000,000 records
Data Mining
Definition – data mining is the activity of sifting through large files and databases to discover useful,
nonobvious, and often unexpected trends and relationships
3. Data mining
2. Database segmentation
6. Optimization searching