
Ques1. List and explain various Normal Forms.

How does BCNF differ from the Third and Fourth Normal Forms?

Ans. The various normal forms are:
First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Fourth Normal Form
Domain/Key Normal Form

First Normal Form


A table is in first normal form if it satisfies the basic criteria of a relation: each cell of the table must contain only a single value, and repeating groups or arrays are not allowed as values; all entries in a column must be of the same kind; each column must have a unique name; and each row must be unique. Databases that are only in first normal form are the weakest and suffer from all modification anomalies.

Second Normal Form


A relation is in second normal form if it is in first normal form and all of its non-key attributes depend on all of the key. This normal form removes partial dependencies, but it is only a concern for relations with composite keys.
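As an illustration (not taken from the text), consider a hypothetical relation OrderLine(OrderID, ProductID, Quantity, ProductName) with the composite key (OrderID, ProductID). ProductName depends only on ProductID, a partial dependency, so the relation is not in 2NF. A minimal sketch of the decomposition, using Python's built-in sqlite3 module (all table and column names are assumptions made for this example):

```python
# A minimal sketch (hypothetical schema) of removing a partial dependency
# to reach second normal form, using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")

# Not in 2NF: ProductName depends only on ProductID, part of the composite key.
conn.execute("""CREATE TABLE OrderLine_1NF (
    OrderID INTEGER, ProductID INTEGER, Quantity INTEGER, ProductName TEXT,
    PRIMARY KEY (OrderID, ProductID))""")

# 2NF decomposition: the partially dependent attribute moves to its own table.
conn.execute("""CREATE TABLE Product (
    ProductID INTEGER PRIMARY KEY, ProductName TEXT)""")
conn.execute("""CREATE TABLE OrderLine (
    OrderID INTEGER, ProductID INTEGER REFERENCES Product(ProductID),
    Quantity INTEGER,
    PRIMARY KEY (OrderID, ProductID))""")
conn.commit()
```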

Third Normal Form


A relation is in third normal form if it meets the criteria for second normal form and has no transitive dependencies, i.e. no non-key attribute depends on another non-key attribute.

Boyce-Codd Normal Form


A relation that meets the third normal form criteria and in which every determinant is a candidate key is said to be in Boyce-Codd Normal Form. BCNF therefore removes the remaining anomalies caused by functional dependencies whose determinants are not candidate keys.
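Because BCNF hinges on whether every determinant is a candidate key, it can help to test functional dependencies directly against sample data. Below is a small, purely illustrative helper (the relation, attribute names and data are hypothetical, not from the text) that checks whether a dependency X -> Y holds:

```python
# A minimal sketch: test whether the functional dependency lhs -> rhs holds
# in a sample relation represented as a list of dictionaries.
def holds(rows, lhs, rhs):
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False          # same determinant value, different dependents
        seen[key] = val
    return True

# Hypothetical relation: Student, Course, Teacher (each teacher teaches one course).
rows = [
    {"Student": "Ann", "Course": "DB", "Teacher": "Smith"},
    {"Student": "Bob", "Course": "DB", "Teacher": "Jones"},
    {"Student": "Ann", "Course": "OS", "Teacher": "Khan"},
]
print(holds(rows, ["Teacher"], ["Course"]))             # True
print(holds(rows, ["Student", "Course"], ["Teacher"]))  # True: the candidate key
```

In this sample, Teacher -> Course holds although Teacher is not a candidate key: exactly the situation that 3NF tolerates but BCNF forbids.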

Fourth Normal Form


Fourth Normal Form (4NF) is an extension of BCNF that covers multivalued as well as functional dependencies. A schema is in 4NF if the left-hand side of every nontrivial functional or multivalued dependency is a superkey.

Domain/Key Normal Form


The domain/key normal form is the Holy Grail of relational database design, achieved when every constraint on the relation is a logical consequence of the definition of keys and domains, so that enforcing key and domain constraints causes all other constraints to be met. It therefore avoids all non-temporal anomalies. It is much easier to build a database in domain/key normal form from the start than to convert lesser databases, which may contain numerous anomalies. However, successfully building a domain/key normal form database remains a difficult task, even for experienced database designers. Thus, while the domain/key normal form eliminates the problems found in most databases, it tends to be the most costly normal form to achieve. Failing to achieve it, though, may carry long-term hidden costs due to anomalies that appear over time in databases adhering only to lower normal forms.

Ques2. What are the differences between Centralized and Distributed Database Systems? List the relative advantages of data distribution.

Ans. A distributed database is a database that is under the control of a central database management system (DBMS) in which the storage devices are not all attached to a common CPU. It may be stored on multiple computers located in the same physical location, or dispersed over a network of interconnected computers. Collections of data (e.g. in a database) can thus be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites.

To ensure that distributed databases are up to date and current, two processes are used: replication and duplication. Replication involves specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. The replication process can be very complex and time consuming depending on the size and number of the distributed databases, and it can require a lot of time and computer resources. Duplication, on the other hand, is not as complicated. It identifies one database as the master and then duplicates that database, normally at a set time after hours, so that each distributed location has the same data. In the duplication process, changes are allowed only to the master database, which ensures that local data will not be overwritten. Both processes can keep the data current in all distributed locations.

Besides replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. Their implementation depends on the needs of the business and the sensitivity/confidentiality of the data to be stored, and hence on the price the business is willing to pay to ensure data security, consistency and integrity.
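To make the two refresh strategies concrete, here is a small, purely illustrative sketch in which in-memory dictionaries stand in for databases (none of the names or data come from the text): replication applies each detected change to every copy, while duplication simply recopies the whole master on a schedule.

```python
# Illustrative sketch: replication applies detected changes to every replica,
# while duplication periodically recopies the whole master database.
master = {"acct_1": 500, "acct_2": 120}
replicas = [dict(master), dict(master)]

def replicate(changes):
    """Apply a set of detected changes to the master and every replica."""
    for key, value in changes.items():
        master[key] = value
        for replica in replicas:
            replica[key] = value

def duplicate():
    """Overwrite every replica with a full copy of the master (run after hours)."""
    for i in range(len(replicas)):
        replicas[i] = dict(master)

replicate({"acct_1": 450})    # change identified and propagated everywhere
duplicate()                   # replicas reset to match the master exactly
print(replicas[0]["acct_1"])  # 450
```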

Basic architecture
A user accesses the distributed database through:
Local applications: applications which do not require data from other sites.
Global applications: applications which do require data from other sites.

A distributed database does not share main memory or disks. A centralized database, by contrast, keeps all its data in one place, whereas a distributed database spreads its data over different locations. Because all the data in a centralized database resides in one place, it can become a bottleneck, and data availability is not as good as in a distributed database. The advantages of data distribution listed below make the difference between centralized and distributed databases clear.

Advantages of Data Distribution


The primary advantage of distributed database systems is the ability to share and access data in a reliable and efficient manner.

Data Sharing and Distributed Control
If a number of different sites are connected to each other, then a user at one site may be able to access data that is available at another site. For example, in a distributed banking system, it is possible for a user in one branch to access data in another branch. Without this capability, a user wishing to transfer funds from one branch to another would have to resort to some external mechanism for such a transfer; this external mechanism would, in effect, be a single centralized database. The primary advantage of accomplishing data sharing by means of data distribution is that each site is able to retain a degree of control over the data stored locally. In a centralized system, the database administrator of the central site controls the database. In a distributed system, there is a global database administrator responsible for the entire system, and a part of these responsibilities is delegated to the local database administrator of each site. Depending upon the design of the distributed database system, each local administrator may have a different degree of autonomy, which is often a major advantage of distributed databases.

Reliability and Availability
If one site fails in a distributed system, the remaining sites may be able to continue operating. In particular, if data are replicated at several sites, a transaction needing a particular data item may find it at any of them, so the failure of a site does not necessarily imply the shutdown of the system. The failure of a site must be detected by the system, and appropriate action may be needed to recover from the failure; the system must then stop using the services of the failed site. Finally, when the failed site recovers or is repaired, mechanisms must be available to integrate it smoothly back into the system. Although recovery from failure is more complex in a distributed system than in a centralized one, the ability of most of the system to continue operating despite the failure of one site results in increased availability. Availability is crucial for database systems used for real-time applications; loss of access to data in an airline, for example, may result in the loss of potential ticket buyers to competitors.

Speedup of Query Processing
If a query involves data at several sites, it may be possible to split the query into subqueries that can be executed in parallel by several sites. Such parallel computation allows faster processing of a user's query. In cases where data is replicated, queries may be directed by the system to the least heavily loaded sites.
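The speed-up from splitting a query into per-site subqueries can be sketched with Python's standard concurrent.futures module. The "sites" and data below are hypothetical stand-ins, not part of the original text; each site aggregates its own partition and a coordinator combines the partial results.

```python
# Illustrative sketch: a query over partitioned data is split into subqueries
# that run against each site in parallel, then combined by the coordinator.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-site partitions of an accounts table: (branch, balance) rows.
sites = {
    "site_A": [("Bergen", 900), ("Bergen", 150)],
    "site_B": [("Oslo", 400), ("Oslo", 250)],
}

def subquery(rows):
    # Local aggregation executed at one site: total balance per branch.
    totals = {}
    for branch, balance in rows:
        totals[branch] = totals.get(branch, 0) + balance
    return totals

with ThreadPoolExecutor() as pool:
    partials = pool.map(subquery, sites.values())

# Combine the partial results at the coordinating site.
result = {}
for partial in partials:
    for branch, total in partial.items():
        result[branch] = result.get(branch, 0) + total
print(result)   # {'Bergen': 1050, 'Oslo': 650}
```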

Ques3. Describe the concepts of Structural Semantic Data Model (SSM).

Ans. The Structural Semantic Model, SSM, first described in Nordbotten (1993a & b), is an extension and graphic simplification of the EER modelling tool first presented in the 1989 edition of Elmasri & Navathe (2003). SSM was developed as a teaching tool and has been, and can continue to be, modified to include new modelling concepts. A particular requirement today is the inclusion of concepts and syntax symbols for modelling multimedia objects.

SSM Concepts
The current version of SSM belongs to the class of semantic data model types extended with concepts for the specification of user-defined data types and functions, UDT and UDF. It supports the modelling concepts defined in Table 1 and compared in Table 2. Figure 1 shows the concepts and graphic syntax of SSM.

Table 1: Data Modeling Concepts

Entity types:
Entity (object): something of interest to the information system about which data is collected. Examples: a person, student, customer, employee, department, product, exam, order.
Entity type: a set of entities sharing common attributes. Examples: citizens of Norway; PERSON {Name, Address, ...}.
Subclass : Superclass: a subclass entity type is a specialization of, alternatively a role played by, a superclass entity type. Examples: Student IS_A Person; Teacher IS_A Person.
Shared subclass entity type: a subclass entity type with characteristics of 2 or more parent entity types. Example: a student-assistant IS_BOTH a student and an employee.
Category entity type: a subclass entity type of 2 or more distinct/independent superclass entity types. Example: an owner IS_EITHER a Person or an Organization.
Weak entity type: an entity type dependent on another for its identification and existence. Example: Education is (can be) a weak entity type dependent on Person.

Attribute types:
Attribute (property): the name given to a characteristic (property) of an entity or relationship type. Examples: Person.name = Joan; Person {ID, Name, Address, Telephone, Age, Position, ...}.
Atomic: an attribute having a single value. Example: Person.Id.
Multivalued: an attribute with multiple values. Example: Telephone# {home, office, mobile, fax}.
Composite (compound): an attribute composed of several sub-attributes. Examples: Address {Street, Nr, City, State, Post#}; Name {First, Middle, Last}.
Derived: an attribute whose value depends on other values in the DB and/or environment. Examples: Person.age, computed as current_date - birth_date; Person.salary, calculated in relation to current salary levels.

Relationship types:
Relationship: a relationship between 2 or more entities. Examples: Joan married_to Svein; Joan works_for IFI; Course_Grade {Joan, I33, UiB-DB, 19nn, 1.5, ...}.
Associative relationship: a set of relationships between 2 or more entity types. Examples: Employee works_for Department; Course_grade:: Student, Course, Project.
Hierarchic relationship: a super-/subclass structure; a strict hierarchy has one path to each subclass entity type, while a lattice structure allows multiple paths. Examples: Person => Student => Graduate-student; Person => (Teacher, Student) => Assistant.

Constraints:
Domain: the set of valid values for an attribute. Example: Person.age:: [0-125].
Primary key (PK, identifier, OID): the set of attributes whose values uniquely identify an entity. Example: Person.Id.
Foreign key (reference key): an attribute containing the PK of an entity to which this entity is related. Examples: Person.Id, Manager, Department.
Cardinality: a (min,max) association between an entity type and a relationship type. Example: a Student may have many Course_grades, (min,max) = (0,n).
Structure classification: [partial p | total t], [disjoint d | overlapping o] participation. Example: Person (p,o) => (Teacher, Student).

"(Data) behavior" (DBMS action by event):
User-defined functions (UDF): a function triggered by data use (storage, update, retrieval) of an attribute. Example: calculation of a current value, such as age from birth-date.

Table 2: Data Model Type - Concept Comparison. The table compares support for the above concepts (entity types, attribute types, relationship types, constraints, and user-defined data types and functions) across the model types RM (Codd 70), RM/T (Codd 79), ER (Chen 76), EER (Elmasri & Navathe 94, 00, 03), SSM (Nordbotten 93, 03), OOM (Cattell 94) and UML (Booch 99).

Figure 1: Extended ER data model - example

The concepts and graphic syntax of SSM include:
1. Three types of entity specifications: base (root), subclass, and weak entity types;
2. Four types of inter-entity relationships: n-ary associative relationships and 3 types of classification hierarchies;
3. Four attribute types: atomic, multi-valued, composite, and derived;
4. Domain type specifications in the graphic model, including standard data types, binary large objects (blob, text, image, ...), and user-defined types (UDT) and functions (UDF);
5. Cardinality specifications for entity to relationship-type connections and for multi-valued attribute types; and
6. Data value constraints.

Figure 2.1: SSM Entity Relationships - hierarchical and associative (illustrating base and weak entity types, subclass entity types, hierarchic relationships, and associative relationships with (min,max) cardinalities)

Figure 2.2: SSM Attribute and Data Types (illustrating primary keys, atomic attributes, composite attributes, multivalued attributes, multivalued composite attributes with UDTs and spatial data types, derived attributes, and image and text data types)
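The attribute types shown in Figure 2.2 can be mirrored loosely in code. The sketch below is a hypothetical illustration (the classes and field names are not part of SSM itself) of atomic, composite, multivalued and derived attributes for a Person entity type, plus a Student subclass.

```python
# Loose illustration of SSM attribute types using Python dataclasses:
# atomic, composite, multivalued and derived attributes, plus a subclass.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Address:                     # composite attribute {Street, Nr, City, Post#}
    street: str
    nr: str
    city: str
    post: str

@dataclass
class Person:                      # base entity type
    person_id: int                 # atomic attribute, primary key
    name: str                      # atomic attribute
    birth_date: date
    address: Address               # composite attribute
    telephones: list[str] = field(default_factory=list)   # multivalued attribute

    @property
    def age(self) -> int:          # derived attribute: current_date - birth_date
        today = date.today()
        return today.year - self.birth_date.year - (
            (today.month, today.day) < (self.birth_date.month, self.birth_date.day))

@dataclass
class Student(Person):             # subclass entity type: Student IS_A Person
    study_program: str = ""
```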

Ques4. Describe the following with respect to Object-Oriented Databases:
a. Query Processing in Object-Oriented Database Systems
b. Query Processing Architecture

Ans.

Query Processing in Object-Oriented Database Systems


One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities. This led some researchers to brand first-generation (network and hierarchical) DBMSs as object-oriented. It was commonly believed that the application domains targeted by OODBMS technology did not need querying capabilities. This belief no longer holds, and declarative query capability is accepted as one of the fundamental features of OODBMSs. Indeed, most current prototype systems experiment with powerful query languages and investigate their optimization, and commercial products such as O2 and ObjectStore have started to include such languages as well.

In this section we discuss the issues related to the optimization and execution of OODBMS query languages, which we collectively call query processing. Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization, which is quite different from the algebraic, cost-based optimization techniques employed in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model, since the latter defines the access primitives used by the query model; these primitives, at least partially, determine the power of the query model. Despite this close relationship, in this unit we do not consider issues related to the design of object models, query models, or query languages in any detail.

Type System
Relational query languages operate on a simple type system consisting of a single aggregate type: the relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object language is closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators. This requires the development of elaborate type inference schemes to determine which methods can be applied to all the objects in such a set. Furthermore, object algebras often operate on semantically different collection types (e.g., set, bag, list), which imposes additional requirements on the type inference schemes to determine the type of the results of operations on collections of different types.
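A rough feel for the type-inference problem can be given in Python: when an operator yields a heterogeneous collection, a later operator may only apply a method that every member's type supports. The sketch below uses hypothetical classes and method names (not from the text) and simply refuses the operation when the check fails.

```python
# Illustrative sketch: applying a method over a heterogeneous collection is only
# well-typed if every member's type actually provides that method.
class Book:
    def __init__(self, title): self.title = title
    def display_title(self): return self.title.upper()

class Journal:
    def __init__(self, title): self.title = title
    def display_title(self): return f"J: {self.title}"

class CityMap:
    def __init__(self, region): self.region = region   # no display_title method

def apply_to_all(collection, method_name):
    # Poor man's type inference: refuse the operation unless all members support it.
    if not all(hasattr(obj, method_name) for obj in collection):
        raise TypeError(f"{method_name} is not defined for every member type")
    return [getattr(obj, method_name)() for obj in collection]

mixed = [Book("Distributed DBs"), Journal("VLDB"), CityMap("Bergen")]
print(apply_to_all(mixed[:2], "display_title"))  # works: both types define it
# apply_to_all(mixed, "display_title")           # would raise TypeError
```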

Encapsulation
Relational query optimization depends on knowledge of the physical storage of data (access paths), which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues. First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. In fact, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written using a general-purpose programming language. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly; others propose a mechanism whereby objects reveal their costs as part of their interface.

Complex Objects and Inheritance
Objects usually have complex structures in which the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages, and we discuss it in some detail in this unit. Furthermore, objects belong to types related through inheritance hierarchies. Efficient access to objects through their inheritance hierarchies is another problem that distinguishes object-oriented from relational query processing.

Object Models
OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems. As a result, the numerous projects that experiment with object query processing follow quite different paths and are, to a certain degree, incompatible, making it difficult to build on the experiences of others. This diversity of approaches is likely to prevail for some time; therefore, it is important to develop extensible approaches to query processing that allow experimentation with new ideas as they evolve. We provide an overview of various extensible object query processing approaches.
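Path expressions such as employee.department.manager.name traverse chains of object references, and an optimizer must decide how to evaluate them. A tiny illustrative sketch of naive pointer-chasing evaluation follows; the classes and attribute names are hypothetical.

```python
# Illustrative sketch: evaluating a path expression by following object references.
class Person:
    def __init__(self, name): self.name = name

class Department:
    def __init__(self, name, manager): self.name, self.manager = name, manager

class Employee:
    def __init__(self, name, department): self.name, self.department = name, department

def eval_path(obj, path):
    """Evaluate a dotted path expression, e.g. 'department.manager.name'."""
    for step in path.split("."):
        obj = getattr(obj, step)   # one reference traversal per step
    return obj

ifi = Department("IFI", manager=Person("Joan"))
emp = Employee("Svein", department=ifi)
print(eval_path(emp, "department.manager.name"))   # Joan
```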

Query Processing Architecture


In this section we focus on two architectural issues: the query processing methodology and the query optimizer architecture.

Query Processing Methodology
A query processing methodology similar to that of relational DBMSs, but modified to deal with the difficulties discussed in the previous section, can be followed in OODBMSs. The steps of the methodology are as follows:
1. Queries are expressed in a declarative (calculus) language, which requires no user knowledge of object implementations, access paths or processing strategies.
2. The calculus expression is first put through calculus optimization.
3. Calculus-to-algebra transformation.
4. Type checking.
5. Algebra optimization.
6. Execution plan generation.
7. Execution.

Ques5. Describe the Differences between Distributed and Centralized Databases.

Ans. A centralized database is a database where data is stored and maintained in a single place; this is the traditional approach to storing data in large companies. A distributed database is a database where data is stored on storage devices that are not all in the same physical location, but the database is controlled by a central database management system (DBMS).

Centralized Database
In a centralized database, all data of an organization is kept on a single computer, such as a central processor or server. Users in remote locations access the data over a WAN using application software provided for that purpose. The centralized database (the central processor or server) must be able to satisfy all requests coming into the system, which is why access can become restricted. But since all data resides in a single location, it is easier to maintain and support, and it is easier to maintain the integrity of the data, because, with the data stored only in the centralized database, out-of-date copies are not available in other places.

Distributed Database
In a distributed database, data is stored on storage devices that are situated in different physical locations. They are not attached to a common central unit, but the database is controlled by a central DBMS. Users access the distributed database over the WAN. The processes of replication and duplication are used to keep the database up to date. After identifying the changes in the distributed database, the replication process applies them to ensure that all the distributed copies look the same; depending on the number of distributed databases, this process can be time consuming and complex. Duplication identifies one database as the master and creates a duplicate copy of it. This process is not as complicated as replication, but it also ensures that all distributed databases have the same data.

Difference between the Centralized Database and the Distributed Database
A centralized database stores data on storage devices located in one place and connected to a single CPU, while a distributed database system keeps its data on storage devices that may be situated in different geographical locations and administered by a central DBMS. A centralized database is easier to maintain and keep updated, as all data is stored in a single place; it is also easier to maintain the integrity of the data and to avoid keeping multiple copies. However, all requests for data are processed by one entity, such as a single mainframe, and this is why it can easily become a bottleneck. With distributed databases this bottleneck can be avoided, since the databases are parallelized, balancing the load between a number of servers. However, maintaining data in a distributed database needs additional work, which increases the cost and complexity of maintenance, and it also requires additional software. In addition, designing databases for a distributed system is more complex than for a centralized one.

Ques6. Describe the following:
a. Data Mining Functions
b. Data Mining Techniques

Ans.

Data Mining Functions


Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are described in this section.

Classification
Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class. When learning classification rules, the system has to find the rules that predict the class from the predicting attributes: first the user defines conditions for each class, and the data mining system then constructs descriptions for the classes. Basically, given a case or tuple with certain known attribute values, the system should be able to predict what class the case belongs to. Once classes are defined, the system should infer the rules that govern the classification; in other words, the system should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.

A rule is generally presented as: if the left-hand side (LHS) then the right-hand side (RHS), meaning that in all instances where the LHS is true, the RHS is also true, or very probable. The categories of rules are:
Exact rule: permits no exceptions, so each object matching the LHS must be an element of the RHS.
Strong rule: allows some exceptions, but the exceptions have a given limit.
Probabilistic rule: relates the conditional probability P(RHS|LHS) to the probability P(RHS).
Other types of rules are classification rules, where the LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS.

Associations
Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.

A typical application, identified by IBM, that can be built using an association function is Market Basket Analysis. Here a retailer runs an association operator over the point-of-sale transaction log, which contains, among other information, transaction identifiers and product identifiers. The set of product identifiers listed under the same transaction identifier constitutes a record. The output of the association function is, in this case, a list of product affinities. Thus, by invoking an association function, the market basket analysis application can determine affinities such as "20% of the time that a specific brand of toaster is sold, customers also buy a set of kitchen gloves and matching cover sets."

Another example of the use of associations is the analysis of claim forms submitted by patients to a medical insurance company. Every claim form contains a set of medical procedures that were performed on a given patient during one visit. By defining the set of items to be the collection of all medical procedures that can be performed on a patient, and the records to correspond to each claim form, the application can find, using the association function, relationships among medical procedures that are often performed together.
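The confidence factor of an association rule can be computed directly from the records. The small sketch below uses made-up market-basket data (an assumption for illustration, not from the text) and counts how often the right-hand side appears among records that contain the left-hand side.

```python
# Illustrative sketch: confidence of an association rule LHS -> RHS
# over a set of market-basket records.
baskets = [
    {"toaster", "gloves", "cover"},
    {"toaster", "gloves"},
    {"toaster", "bread"},
    {"gloves", "cover"},
    {"toaster", "gloves", "cover", "bread"},
]

def confidence(baskets, lhs, rhs):
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [b for b in baskets if lhs <= b]
    if not with_lhs:
        return 0.0
    with_both = [b for b in with_lhs if rhs <= b]
    return len(with_both) / len(with_lhs)

# "Of the baskets containing a toaster, how many also contain gloves and a cover?"
print(confidence(baskets, ["toaster"], ["gloves", "cover"]))   # 0.5
```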

Sequential/Temporal Patterns
Sequential/temporal pattern functions analyse a collection of records over a period of time, for example to identify trends. Where the identity of the customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who made the repeated purchases. Such a situation is typical of a direct mail application where, for example, a catalogue merchant has, for each customer, the information about the sets of products that the customer buys in every purchase order. A sequential pattern function will analyse such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven. Sequential pattern mining functions are quite powerful and can be used to detect the set of customers associated with some frequent buying patterns. Use of these functions on, for example, a set of insurance claims can lead to the identification of frequently occurring sequences of medical procedures applied to patients, which can help identify good medical practices as well as potentially detect medical insurance fraud.

Clustering/Segmentation
Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters. Clustering according to similarity is a very powerful technique, the key to it being to translate some intuitive measure of similarity into a quantitative measure. When learning is unsupervised, the system has to discover its own classes, i.e. the system clusters the data in the database. The system has to discover subsets of related objects in the training set and then find descriptions that describe each of these subsets. There are a number of approaches to forming clusters. One approach is to form rules which dictate membership in the same group based on the level of similarity between members. Another approach is to build set functions that measure some property of partitions as functions of some parameter of the partition.
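The similarity-based approach to forming clusters can be sketched very simply. Below, a deliberately simplified and hypothetical similarity measure (absolute difference of a single numeric attribute) is used, and each record joins the first group whose members are all close enough to it; none of the data is from the text.

```python
# Illustrative sketch: greedy clustering by a similarity threshold on one attribute.
def cluster(values, max_distance):
    clusters = []
    for v in sorted(values):
        # Join an existing cluster if v is close to all of its members ...
        for group in clusters:
            if all(abs(v - member) <= max_distance for member in group):
                group.append(v)
                break
        else:
            clusters.append([v])   # ... otherwise start a new cluster.
    return clusters

annual_premiums = [300, 320, 310, 900, 950, 2000]
print(cluster(annual_premiums, max_distance=100))
# [[300, 310, 320], [900, 950], [2000]]
```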

IBM Market Basket Analysis example
IBM has used segmentation techniques in its Market Basket Analysis on POS transactions, separating a set of untagged input records into reasonable groups according to product revenue by market basket, i.e. the market baskets were segmented based on the number and type of products in the individual baskets. Each segment reports total revenue and number of baskets, and using a neural network 275,000 transaction records were divided into 16 segments. The following types of analysis were also available:
1. Revenue by segment
2. Baskets by segment
3. Average revenue by segment, etc.

Data Mining Techniques


Cluster Analysis
In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database, as shown in the following diagram. The first step is to discover subsets of related objects and then find descriptions, e.g. D1, D2, D3 etc., which describe each of these subsets.

Figure 7.2: Discovering Clusters and Descriptions in a Database

Clustering and segmentation basically partition the database so that each partition or group is similar according to some criterion or metric. Clustering according to similarity is a concept which appears in many disciplines. If a measure of similarity is available, there are a number of techniques for forming clusters. Membership of groups can be based on the level of similarity between members, and from this the rules of membership can be defined. Another approach is to build set functions that measure some property of partitions (i.e. groups or subsets) as functions of some parameter of the partition. This latter approach achieves what is known as optimal partitioning.

Many data mining applications make use of clustering according to similarity, for example to segment a client/customer base. Clustering according to optimization of set functions is used in data analysis, e.g. when setting insurance tariffs the customers can be segmented according to a number of parameters and the optimal tariff segmentation achieved. Clustering/segmentation in databases is the process of separating a data set into components that reflect a consistent pattern of behaviour. Once the patterns have been established they can be used to "deconstruct" the data into more understandable subsets; they also provide sub-groups of a population for further analysis or action, which is important when dealing with very large databases. For example, a database could be used for profile generation for target marketing, where previous responses to mailing campaigns can be used to generate a profile of people who responded, and this profile can be used to predict response and filter mailing lists to achieve the best response.

Induction
A database is a store of information, but more important is the information which can be inferred from it. There are two main inference techniques available: deduction and induction. Deduction is a technique to infer information that is a logical consequence of the information in the database; e.g. the join operator applied to two relational tables, where the first concerns employees and departments and the second departments and managers, infers a relation between employees and managers. Induction, as described earlier, is the technique to infer information that is generalised from the database, as in the example mentioned above, to infer that each employee has a manager. This is higher-level information or knowledge, in that it is a general statement about objects in the database. The database is searched for patterns or regularities. Induction has been used in the following ways within data mining.

Decision Trees
Decision trees are a simple knowledge representation and they classify examples into a finite number of classes. The nodes are labelled with attribute names, the edges are labelled with possible values for the attribute, and the leaves are labelled with the different classes. Objects are classified by following a path down the tree, taking the edges corresponding to the values of the attributes in the object. The following is an example of objects that describe the weather at a given time. The objects contain information on the outlook, humidity etc. Some objects are positive examples, denoted by P, and others are negative, i.e. N. Classification is in this case the construction of a tree structure, illustrated in the following diagram, which can be used to classify all the objects correctly.
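A decision tree of this kind can be represented as nested branches and used to classify an object by following the edges that match its attribute values. The sketch below assumes the classic weather example (the specific attribute values and tree shape are assumptions, not taken from the figure).

```python
# Illustrative sketch: classifying objects with a hand-built decision tree
# for the classic weather example (P = positive, N = negative).
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain":     ("windy", {"true": "N", "false": "P"}),
})

def classify(obj, node):
    if isinstance(node, str):          # a leaf carries the class label
        return node
    attribute, branches = node
    return classify(obj, branches[obj[attribute]])

print(classify({"outlook": "sunny", "humidity": "high", "windy": "false"}, tree))   # N
print(classify({"outlook": "overcast", "humidity": "high", "windy": "true"}, tree)) # P
```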

Figure 7.3: Decision Tree Structure

Rule Induction
A data mining system has to infer a model from the database; that is, it may define classes such that the database contains one or more attributes that denote the class of a tuple (the predicted attributes), while the remaining attributes are the predicting attributes. A class can then be defined by a condition on the attributes. When the classes are defined, the system should be able to infer the rules that govern classification; in other words, the system should find the description of each class. Production rules have been widely used to represent knowledge in expert systems, and they have the advantage of being easily interpreted by human experts because of their modularity, i.e. a single rule can be understood in isolation and does not need reference to other rules. The propositional-like structure of such rules has been described earlier and can be summed up as if-then rules.

Neural Networks
Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations into modelling nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyse. This expert can then be used to provide projections in new situations of interest and answer "what if" questions.

Neural networks have broad applicability to real-world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, including:
Forecasting
Process Control
Research Validation
Risk Management
Marketing, etc.
Neural networks use a set of processing elements (or nodes) analogous to neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience just as people do. This distinguishes neural networks from traditional computing programs, which simply follow instructions in a fixed sequential order. The structure of a neural network looks something like the following:

Figure 7.4: Structure of a neural network

The bottom layer represents the input layer, in this case with 5 inputs labelled X1 through X5. In the middle is the hidden layer, with a variable number of nodes; it is the hidden layer that performs much of the work of the network. The output layer in this case has two nodes, Z1 and Z2, representing the output values we are trying to determine from the inputs, for example predicting sales (output) based on past sales, price and season (inputs). Each node in the hidden layer is fully connected to the inputs, which means that what is learned in a hidden node is based on all the inputs taken together. Statisticians maintain that the network can pick up the interdependencies in the model. The following diagram provides some detail of what goes on inside a hidden node.

Simply speaking, a weighted sum is performed: X1 times W1, plus X2 times W2, on through X5 times W5. This weighted sum is performed for each hidden node and each output node, and is how interactions are represented in the network. The question of where the network gets the weights from is important, but suffice to say that the network learns to reduce the error in its prediction of events already known (i.e. past history).

The problems of using neural networks have been summed up by Arun Swami of Silicon Graphics Computer Systems. Neural networks have been used successfully for classification, but they suffer somewhat in that the resulting network is viewed as a black box and no explanation of the results is given. This lack of explanation inhibits confidence, acceptance and application of results. He also notes as a problem the fact that neural networks suffer from long learning times, which become worse as the volume of data grows.

The Clementine User Guide has the following simple diagram (Figure 7.6) to summarize a neural net trained to identify the risk of cancer from a number of factors.

Figure 7.6: Example Neural network from Clementine User Guide
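The weighted sum described above for a single hidden node can be written out directly. This is an illustrative sketch with made-up inputs and weights, not the Clementine network of Figure 7.6; the sigmoid activation applied after the sum is an added assumption, since the text only describes the weighted sum itself.

```python
# Illustrative sketch: the weighted sum (and an assumed sigmoid activation)
# computed inside one hidden node of the network described above.
import math

def hidden_node(inputs, weights, bias=0.0):
    # X1*W1 + X2*W2 + ... + X5*W5, then a sigmoid "squashing" function.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

x = [0.2, 0.7, 0.1, 0.9, 0.4]       # five inputs X1..X5
w = [0.5, -0.3, 0.8, 0.1, -0.6]     # made-up weights W1..W5
print(hidden_node(x, w))            # the node's output, between 0 and 1
```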

On-line Analytical Processing
A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client/server architecture gives organizations the opportunity to deploy specialized servers which are optimized for handling specific data management problems. Until recently, organizations have tried to target relational database management systems (RDBMSs) at the complete spectrum of database applications. It is, however, apparent that there are major categories of database applications which are not suitably serviced by relational database systems. Oracle, for example, has built a totally new Media Server for handling multimedia applications. Sybase uses an object-oriented DBMS (OODBMS) in its Gain Momentum product, which is designed to handle complex data such as images and audio. Another category of applications is that of On-Line Analytical Processing (OLAP). OLAP was a term coined by E. F. Codd (1993) and was defined by him as the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data. Codd developed rules or requirements for an OLAP system:
Multidimensional Conceptual View
Transparency
Accessibility
Consistent Reporting Performance
Client/Server Architecture
Generic Dimensionality
Dynamic Sparse Matrix Handling
Multi-User Support
Unrestricted Cross-Dimensional Operations
Intuitive Data Manipulation
Flexible Reporting
Unlimited Dimensions and Aggregation Levels

An alternative definition of OLAP has been supplied by Nigel Pendse who, unlike Codd, does not mix technology prescriptions with application requirements. Pendse defines OLAP as Fast Analysis of Shared Multidimensional Information, which means:
Fast, in that users should get a response in seconds and so do not lose their chain of thought;
Analysis, in that the system can provide analysis functions in an intuitive manner and that the functions should supply business logic and statistical analysis relevant to the user's application;
Shared, from the point of view of supporting multiple users concurrently;
Multidimensional, as a main requirement, so that the system supplies a multidimensional conceptual view of the data, including support for multiple hierarchies;
Information, being the data and the derived information required by the user application.

One question is: what is multidimensional data, and when does it become OLAP? It is essentially a way to build associations between dissimilar pieces of information using predefined business rules about the information you are using. Kirk Cruikshank of Arbor Software has identified three components of OLAP, in an issue of UNIX News on data warehousing:
A multidimensional database must be able to express complex business calculations very easily. The data must be referenced and the mathematics defined. In a relational system there is no relation between line items, which makes it very difficult to express business mathematics.
Intuitive navigation in order to "roam around" the data, which requires mining hierarchies.
Instant response, i.e. the need to give the user the information as quickly as possible.

Dimensional databases are not without problems, as they are not suited to storing all types of data, such as lists, for example customer addresses and purchase orders. Relational systems are also superior in security, backup and replication services, as these tend not to be available at the same level in dimensional systems. The advantage of a dimensional system is the freedom it offers: the user is free to explore the data and receive the type of report they want without being restricted to a set format.

OLAP Example
An example OLAP database may comprise sales data which has been aggregated by region, product type, and sales channel. A typical OLAP query might access a multi-gigabyte, multi-year sales database in order to find all product sales in each region for each product type. After reviewing the results, an analyst might further refine the query to find sales volume for each sales channel within region/product classifications. As a last step the analyst might want to perform year-to-year or quarter-to-quarter comparisons for each sales channel. This whole process must be carried out on-line with rapid response time so that the analysis process is undisturbed. OLAP queries can be characterized as on-line transactions which:
Access very large amounts of data, e.g. several years of sales data.
Analyze the relationships between many types of business elements, e.g. sales, products, regions, channels.
Involve aggregated data, e.g. sales volumes, budgeted dollars and dollars spent.
Compare aggregated data over hierarchical time periods, e.g. monthly, quarterly, yearly.
Present data in different perspectives, e.g. sales by region vs. sales by channel by product within each region.
Involve complex calculations between data elements, e.g. expected profit calculated as a function of sales revenue for each type of sales channel in a particular region.
Are able to respond quickly to user requests so that users can pursue an analytical thought process without being stymied by the system.

Comparison of OLAP and OLTP
OLAP applications are quite different from On-line Transaction Processing (OLTP) applications, which consist of a large number of relatively simple transactions. The transactions usually retrieve and update a small number of records that are contained in several distinct tables, and the relationships between the tables are generally simple. A typical customer order entry OLTP transaction might retrieve all of the data relating to a specific customer and then insert a new order for the customer. Information is selected from the customer, customer order, and detail line tables. Each row in each table contains a customer identification number which is used to relate the rows from the different tables. The relationships between the records are simple, and only a few records are actually retrieved or updated by a single transaction.

The difference between OLAP and OLTP has been summarized as: OLTP servers handle mission-critical production data accessed through simple queries, while OLAP servers handle management-critical data accessed through iterative analytical investigation. Both OLAP and OLTP have specialized requirements and therefore require specially optimized servers for the two types of processing.

OLAP database servers use multidimensional structures to store data and the relationships between data. Multidimensional structures can best be visualized as cubes of data, and cubes within cubes of data. Each side of the cube is considered a dimension, and each dimension represents a different category such as product type, region, sales channel, and time. Each cell within the multidimensional structure contains aggregated data relating the elements along each of the dimensions. For example, a single cell may contain the total sales for a given product in a region for a specific sales channel in a single month.
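A multidimensional cell keyed by (product, region, channel, month) can be mimicked with a plain dictionary. The sketch below uses hypothetical data and shows a roll-up (consolidation) over chosen dimensions and a simple slice, in the spirit of the operations described next; it is an illustration, not how any particular OLAP server stores its cubes.

```python
# Illustrative sketch: a tiny multidimensional "cube" as a dict keyed by
# (product, region, channel, month), with a roll-up and a slice over it.
cells = {
    ("toaster", "west", "retail", "2024-01"): 120,
    ("toaster", "west", "web",    "2024-01"): 45,
    ("toaster", "east", "retail", "2024-01"): 80,
    ("kettle",  "west", "retail", "2024-01"): 60,
}

def rollup(cells, keep):
    """Consolidate: aggregate the values, keeping only the named dimensions."""
    dims = ("product", "region", "channel", "month")
    out = {}
    for key, value in cells.items():
        reduced = tuple(v for d, v in zip(dims, key) if d in keep)
        out[reduced] = out.get(reduced, 0) + value
    return out

print(rollup(cells, keep={"product", "region"}))
# {('toaster', 'west'): 165, ('toaster', 'east'): 80, ('kettle', 'west'): 60}

# Slice: fix one dimension (region = 'west') and look at the remaining cells.
west = {k: v for k, v in cells.items() if k[1] == "west"}
print(sum(west.values()))   # 225
```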

Multidimensional databases are a compact and easy-to-understand vehicle for visualizing and manipulating data elements that have many interrelationships. OLAP database servers support common analytical operations including consolidation, drill-down, and "slicing and dicing".

Consolidation involves the aggregation of data, such as simple roll-ups or complex expressions involving inter-related data. For example, sales offices can be rolled up to districts and districts rolled up to regions.

Drill-Down
OLAP data servers can also go in the reverse direction and automatically display the detail data from which consolidated data is derived. This is called drill-down. Consolidation and drill-down are inherent properties of OLAP servers.

"Slicing and Dicing"
Slicing and dicing refers to the ability to look at the database from different viewpoints. One slice of the sales database might show all sales of each product type within regions; another slice might show all sales by sales channel within each product type. Slicing and dicing is often performed along a time axis in order to analyse trends and find patterns.

OLAP servers have the means for storing multidimensional data in a compressed form. This is accomplished by dynamically selecting physical storage arrangements and compression techniques that maximize space utilization. Dense data (i.e., data that exists for a high percentage of dimension cells) is stored separately from sparse data (i.e., where a significant percentage of cells are empty). For example, a given sales channel may only sell a few products, so the cells that relate sales channels to products will be mostly empty and therefore sparse. By optimizing space utilization, OLAP servers can minimize physical storage requirements, thus making it possible to analyze exceptionally large amounts of data. It also makes it possible to load more data into computer memory, which helps to significantly improve performance by minimizing physical disk I/O.

In conclusion, OLAP servers logically organize data in multiple dimensions, which allows users to quickly and easily analyze complex data relationships. The database itself is physically organized in such a way that related data can be rapidly retrieved across multiple dimensions. OLAP servers are very efficient when storing and processing multidimensional data. RDBMSs, on the other hand, have been developed and optimized to handle OLTP applications; relational database designs concentrate on reliability and transaction processing speed rather than decision support needs. The different types of server can therefore benefit a broad range of data management applications.

Data Visualization
Data visualization makes it possible for the analyst to gain a deeper, more intuitive understanding of the data, and as such it can work well alongside data mining. Data mining allows the analyst to focus on certain patterns and trends and explore them in depth using visualization. On its own, data visualization can be overwhelmed by the volume of data in a database, but in conjunction with data mining it can help with exploration.
