Data Mining Assignment


Homework Title / No.: Homework 1            Course Code: CAP 624

Course Instructor: Sanjay Sood             Course Tutor (if applicable): ____________
Student's Roll No.: RDE624A20              Section No.: DE624

Declaration: I declare that this assignment is my individual work. I have not copied from any other student's work or from any other source except where due acknowledgement is made explicitly in the text, nor has any part been written for me by another person.

Student's Signature: Bineet Kumar Kalia

Evaluator's comments: _____________________________________________________________________ Marks obtained: ___________ out of ______________________

Part-A

Q1. Explain, what are the practical applications of data mining?

Answer: The following two applications are presented to illustrate the potential of data mining.

Healthcare Services
Data mining has been used intensively and extensively by many healthcare organizations and can greatly benefit all parties involved. For example, data mining can help healthcare insurers detect fraud and abuse, can help healthcare organizations make customer-relationship management decisions, can help physicians identify effective treatments and best practices, and can help patients receive better and more affordable healthcare services. Applications in healthcare include, but are not limited to, the following:
- Modeling health outcomes and predicting patient outcomes
- Modeling clinical knowledge for decision support systems
- Bioinformatics
- Pharmaceutical research
- Infection control
- Ranking hospitals
- Identifying high-risk patients
- Evaluating treatment effectiveness

Banking
In today's world, traditional banking has changed for many reasons. Gone are the days when conducting simple surveys would enable banks to make the necessary changes in their various marketing, business-process, and customer-relationship strategies. While the emergence of new banks has created strong competition among them, it has also made it unrealistic for them to rely only on their internal procedures to stay profitable in the market. Streamlining business procedures, improving customer relationships, detecting fraudulent activity, providing security at all levels of service, and taking other measures to improve business build trust not only among the major players of the market, but also among employees.

Q2. With a suitable diagram explain the architecture of a data warehouse.

Answer: Though it is easy to think of the data warehouse as just a big collection of data, delivering an effective data warehouse in fact requires a large set of related capabilities (see Figure 1). Certainly data is the fundamental component: cleaned, organized data, mostly extracted from the campus operational systems. Making that data useful to a variety of campus personnel, though, requires applications to deliver and explain it. These applications range from predefined reports through query tools to complex tools for analysis and modeling. Delivering data and applications, and securing the data as specified by campus data stewards, requires a set of technology, most of it centralized in secure computer locations. Equally important, transforming operational data into a shared resource useful across the boundaries of functional business domains requires a broad set of functional skills, organized appropriately and working through proven processes. The architecture of the data warehouse is therefore described in terms of four interrelated dimensions:
1. Applications (the business intelligence layer)
2. Data
3. Technology and security
4. Support processes and organization
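As a rough, hypothetical illustration of the data dimension above, the following Python sketch shows one way an extract-transform-load (ETL) step might pull rows out of an operational system, clean them, and load them into a warehouse table that reporting and query applications can use. The table and column names (enrollments, fact_enrollment, etc.) are invented for the example and do not refer to any particular campus system.

```python
import sqlite3

# A minimal ETL sketch: extract operational rows, clean them, and load them
# into a warehouse-style table. All table and column names are hypothetical.

source = sqlite3.connect(":memory:")      # stand-in for an operational system
warehouse = sqlite3.connect(":memory:")   # stand-in for the data warehouse

# -- extract source: a toy operational table --
source.execute("CREATE TABLE enrollments (student_id INT, dept TEXT, credits REAL)")
source.executemany(
    "INSERT INTO enrollments VALUES (?, ?, ?)",
    [(1, " Physics ", 4.0), (2, "physics", None), (3, "Math", 3.0)],
)

# -- transform: normalise text fields and drop rows with missing values --
rows = source.execute("SELECT student_id, dept, credits FROM enrollments").fetchall()
cleaned = [
    (sid, dept.strip().title(), credits)
    for sid, dept, credits in rows
    if credits is not None            # simple data-quality rule
]

# -- load into the warehouse table used by the application layer --
warehouse.execute("CREATE TABLE fact_enrollment (student_id INT, dept TEXT, credits REAL)")
warehouse.executemany("INSERT INTO fact_enrollment VALUES (?, ?, ?)", cleaned)
warehouse.commit()

print(warehouse.execute("SELECT * FROM fact_enrollment").fetchall())
# [(1, 'Physics', 4.0), (3, 'Math', 3.0)]
```

In practice this would run against real database connections rather than in-memory SQLite, but the extract-clean-load shape is the same.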

Q3. What are the challenges in creating and maintaining a data warehouse? Explain with a suitable example.

Answer: The major issue with data warehousing is that it represents a significant infrastructure investment that is time-consuming to produce. Designing a data warehouse is a complicated procedure that requires buy-in from many organizational stakeholders. It may take months or years to produce a data warehouse and may involve outside consultants and significant personnel time. The tools that perform ETL functions and data warehouse management are expensive. Once produced, a data warehouse must be maintained (personnel, licensing costs, maintenance, etc.). A data warehouse is therefore not an investment to be considered lightly.

The second major issue with data warehousing relates to its freshness: how up-to-date (or real-time) the information it contains is. A data warehouse may only have information that is 24 hours old (based on when it was extracted from the operational systems), which may not be sufficient for some business decisions. A larger issue is that adding new information to a data warehouse as the organization's needs evolve is time-consuming. For example, if an organization acquires another, smaller organization and wants to merge its customer information into its data warehouse, this may take several months to accomplish. Maintaining a data warehouse is costly.
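One common way to manage the freshness problem described above is an incremental refresh, where each load extracts only the records changed since the previous run rather than the whole table. The sketch below is only an illustration under that assumption; the customers table, the last_updated column, and the watermark value are all made up for the example.

```python
import sqlite3

# Hypothetical incremental refresh: pull only source rows modified since the
# last successful load, instead of re-extracting the whole operational table.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INT, name TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "2024-01-01"), (2, "Globex", "2024-01-15"), (3, "Initech", "2024-02-01")],
)

last_load = "2024-01-10"   # watermark recorded after the previous warehouse load

fresh_rows = conn.execute(
    "SELECT id, name, last_updated FROM customers WHERE last_updated > ?",
    (last_load,),
).fetchall()

print(fresh_rows)   # only rows changed after the last load are re-extracted
# [(2, 'Globex', '2024-01-15'), (3, 'Initech', '2024-02-01')]
```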

Part-B

Q1. Explain the various steps involved in the knowledge discovery process.

Answer: The knowledge discovery process (Figure 1.1) is iterative and interactive, consisting of nine steps. Note that the process is iterative at each step, meaning that moving back to previous steps may be required. The process has many artistic aspects, in the sense that one cannot give a single formula or a complete taxonomy of the right choices for each step and application type. It is therefore necessary to understand the process and the different needs and possibilities at each step. The process starts with determining the KDD goals and ends with the implementation of the discovered knowledge. Then the loop is closed and the Active Data Mining part starts (which is beyond the scope of the process defined here). As a result, changes would have to be made in the application domain (such as offering different features to mobile phone users in order to reduce churn). This closes the loop, the effects are then measured on the new data repositories, and the KDD process is launched again. Following is a brief description of the nine-step KDD process, starting with a managerial step:

1. Developing an understanding of the application domain. This is the initial preparatory step. It prepares the scene for understanding what should be done with the many decisions to follow (about transformation, algorithms, representation, etc.). The people in charge of a KDD project need to understand and define the goals of the end-user and the environment in which the knowledge discovery process will take place (including relevant prior knowledge). As the KDD process proceeds, this step may even be revised. Having understood the KDD goals, the preprocessing of the data starts, defined in the next three steps (note that some of the methods here are similar to data mining algorithms, but are used in a preprocessing context).

2. Selecting and creating a data set on which discovery will be performed. Having defined the goals, the data that will be used for the knowledge discovery should be determined. This includes finding out what data is available, obtaining additional necessary data, and then integrating all the data for the knowledge discovery into one data set, including the attributes that will be considered for the process. This step is very important because data mining learns and discovers from the available data: it is the evidence base for constructing the models. If some important attributes are missing, the entire study may fail. In this respect, the more attributes are considered, the better. On the other hand, collecting, organizing, and operating complex data repositories is expensive, so there is a trade-off with the opportunity to best understand the phenomena. This trade-off is one place where the interactive and iterative nature of KDD comes into play: one starts with the best available data set and later expands it, observing the effect in terms of knowledge discovery and modeling.

3. Preprocessing and cleansing. In this stage, data reliability is enhanced. It includes data cleaning, such as handling missing values and removing noise or outliers. There are many methods, ranging from doing nothing to techniques that become the major part (in terms of time consumed) of a KDD project. It may involve complex statistical methods, or using a data mining algorithm in this context. For example, if one suspects that a certain attribute is of insufficient reliability or has many missing values, then this attribute could become the target of a supervised data mining algorithm: a prediction model is built for the attribute, and the missing values can then be predicted (a small code sketch of this idea appears after this list). The extent to which one pays attention to this step depends on many factors.

4. Data transformation. In this stage, better data for the data mining is generated and prepared. Methods here include dimensionality reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformations). This step can be crucial for the success of the entire KDD project, and it is usually very project-specific. For example, in medical examinations, the quotient of attributes may often be the most important factor, rather than each attribute by itself. In marketing, we may need to consider effects beyond our control, as well as efforts and temporal issues (such as studying the effect of advertising accumulation). However, even if we do not use the right transformation at the beginning, we may obtain a surprising effect that hints at the transformation needed (in the next iteration). Thus the KDD process reflects upon itself and leads to an understanding of the transformation needed.

5. Choosing the appropriate data mining task. We are now ready to decide which type of data mining to use, for example classification, regression, or clustering. This mostly depends on the KDD goals, and also on the previous steps. There are two major goals in data mining: prediction and description. Prediction is often referred to as supervised data mining, while descriptive data mining includes the unsupervised and visualization aspects of data mining. Most data mining techniques are based on inductive learning, where a model is constructed, explicitly or implicitly, by generalizing from a sufficient number of training examples.

6. Choosing the data mining algorithm. Having the strategy, we now decide on the tactics. This stage includes selecting the specific method to be used for searching for patterns (including multiple inducers). For example, in considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning there are several possibilities of how it can be accomplished. Meta-learning focuses on explaining what causes a data mining algorithm to be successful or not on a particular problem; this approach thus attempts to understand the conditions under which a data mining algorithm is most appropriate. Each algorithm has parameters and tactics of learning (such as ten-fold cross-validation or another division into training and testing sets).

7. Employing the data mining algorithm. Finally, the data mining algorithm is applied. In this step we might need to run the algorithm several times until a satisfactory result is obtained, for instance by tuning the algorithm's control parameters (see the second sketch after this list).

8. Evaluation. In this stage we evaluate and interpret the mined patterns (rules, reliability, etc.) with respect to the goals defined in the first step. Here we consider the preprocessing steps with respect to their effect on the data mining results (for example, adding features in Step 4 and repeating from there). This step focuses on the comprehensibility and usefulness of the induced model. In this step the discovered knowledge is also documented for further use.

9. Using the discovered knowledge. We are now ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. In fact, the success of this step determines the effectiveness of the entire KDD process. There are many challenges in this step, such as losing the laboratory conditions under which we have operated. For instance, the knowledge was discovered from a certain static snapshot (usually a sample) of the data, but now the data becomes dynamic. Data structures may change (certain attributes become unavailable), and the data domain may be modified (for example, an attribute may take a value that was not assumed before).
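To make steps 3 and 4 more concrete, here is a small, hypothetical sketch (assuming NumPy and scikit-learn are available) of treating an attribute with missing values as the target of a supervised model, as described in step 3, followed by a simple attribute discretization as in step 4. The data is synthetic and the attribute names are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Synthetic data set: 'income' depends on 'age' and 'hours', with some values missing.
age = rng.uniform(20, 65, 200)
hours = rng.uniform(10, 60, 200)
income = 500 * age + 300 * hours + rng.normal(0, 2000, 200)
income[rng.choice(200, 20, replace=False)] = np.nan   # inject missing values

X = np.column_stack([age, hours])
missing = np.isnan(income)

# Step 3 (cleansing): treat the unreliable attribute as the target of a
# supervised model and predict its missing entries from the other attributes.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[~missing], income[~missing])
income[missing] = model.predict(X[missing])

# Step 4 (transformation): discretize the now-complete numeric attribute
# into five ordinal bins, a common attribute transformation.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
income_binned = binner.fit_transform(income.reshape(-1, 1))

print(income_binned[:10].ravel())
```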

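Likewise, steps 6 to 8 (choosing an algorithm, running it with tuned parameters, and evaluating the result) are often carried out with cross-validation. The sketch below is only an illustration using scikit-learn's bundled iris data set: it compares a decision tree with a k-nearest-neighbour classifier under ten-fold cross-validation, as mentioned in step 6, and then tunes the tree's depth.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 6: compare candidate algorithms with ten-fold cross-validation.
for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("k-NN", KNeighborsClassifier())]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# Step 7: employ the chosen algorithm repeatedly, tuning its control
# parameters until a satisfactory result is obtained.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 4, 5]}, cv=10)
search.fit(X, y)

# Step 8: evaluate the induced model with respect to the original goals.
print("best depth:", search.best_params_["max_depth"],
      "cv accuracy:", round(search.best_score_, 3))
```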
Q2. Write short notes on:
a) Data selection
b) Data cleaning

Answer:

Data selection: Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. Data selection precedes the actual practice of data collection. This definition distinguishes data selection from selective data reporting (selectively excluding data that does not support a research hypothesis) and from interactive/active data selection (using collected data to monitor activities/events, or conducting secondary data analyses). The process of selecting suitable data for a research project can affect data integrity. The primary objective of data selection is the determination of the appropriate data type, source, and instrument(s) that allow investigators to adequately answer the research questions. This determination is often discipline-specific and is driven primarily by the nature of the investigation, the existing literature, and the accessibility of the necessary data sources. Issues of data selection include:

- the appropriate type and sources of data which permit investigators to adequately answer the stated research questions;
- suitable procedures in order to obtain a representative sample;
- the proper instruments to collect data.

There should be compatibility between the type/source of data and the mechanisms used to collect it; it is difficult to extricate the selection of the type/source of data from the instruments used to collect the data.

Data cleaning: Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been caused originally by user entry errors, by corruption in transmission or storage, or by different data-dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry time, rather than being checked in batches after collection. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
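As a small illustration of the strict and fuzzy validation mentioned above, the following sketch uses only the Python standard library: a regular expression enforces a strict postal-code format, while difflib corrects city names that partially match a known reference list. The reference list and the records are made-up examples.

```python
import re
from difflib import get_close_matches

KNOWN_CITIES = ["Ludhiana", "Jalandhar", "Amritsar", "Chandigarh"]  # hypothetical reference list
POSTAL_CODE = re.compile(r"^\d{6}$")   # strict rule: a PIN code must be exactly six digits

records = [
    {"city": "Ludihana", "pin": "141001"},   # misspelled city, valid PIN
    {"city": "Amritsar", "pin": "14300"},    # valid city, malformed PIN
]

cleaned, rejected = [], []
for rec in records:
    # Strict validation: reject the record outright if the PIN is malformed.
    if not POSTAL_CODE.match(rec["pin"]):
        rejected.append(rec)
        continue
    # Fuzzy correction: replace the city with its closest known spelling, if any.
    match = get_close_matches(rec["city"], KNOWN_CITIES, n=1, cutoff=0.8)
    if match:
        rec["city"] = match[0]
    cleaned.append(rec)

print("cleaned:", cleaned)    # [{'city': 'Ludhiana', 'pin': '141001'}]
print("rejected:", rejected)  # [{'city': 'Amritsar', 'pin': '14300'}]
```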

Q3. Give an account of the need for decision-support systems for business and scientific applications.

Answer:

Scientific and business applications: Rapid advances in information and sensor technologies (IT and ST), along with the availability of large-scale scientific and business data repositories and database management technologies, combined with breakthroughs in computing technologies, computational methods, and processing speeds, have opened the floodgates to data-dictated models and pattern matching. The use of sophisticated and computationally intensive analytical methods is expected to become even more commonplace with recent research breakthroughs in computational methods and their commercialization by leading vendors. Scientists and engineers have developed innovative methodologies for extracting correlations and associations, dimensionality reduction, clustering and classification, regression and predictive modeling, tools based on expert systems and case-based reasoning, as well as decision support systems for batch or real-time analysis. They have utilized tools from areas like traditional statistics, signal processing, and artificial intelligence, as well as from emerging fields like data mining, machine learning, operations research, systems analysis, and nonlinear dynamics.

Innovative models and newly discovered patterns in complex, nonlinear, and stochastic systems, encompassing the natural and human environments, have demonstrated the effectiveness of these approaches. However, applications that can utilize these tools in the context of scientific databases in a scalable fashion have only begun to emerge. Business solution providers and IT vendors, on the other hand, have focused primarily on scalability, process automation and workflows, and the ability to combine results from relatively simple analytics with judgments from human experts. For example, e-business applications in the areas of supply chain planning, financial analysis, and business forecasting traditionally rely on decision support systems with embedded data mining, operations research, and OLAP technologies, business intelligence (BI) and reporting tools, as well as an easy-to-use GUI (graphical user interface) and extensible business workflows (e.g., see Geoffrion and Krishnan, 2003).
