Data Mining Assignment


Homework Title / No.: Homework 1            Course Code: CAP 624

Course Instructor: Sanjay Sood             Course Tutor (if applicable): ____________
Student's Roll No.: RDE624A20              Section No.: DE624

Declaration: I declare that this assignment is my individual work. I have not copied from any other student's work or from any other source except where due acknowledgement is made explicitly in the text, nor has any part been written for me by another person.

Student's Signature: Bineet Kumar Kalia

Evaluator's comments: _____________________________________________________________________ Marks obtained: ___________ out of ______________________

Part-A

Q1. Explain, what are the practical applications of data mining?

Answer: The following two applications are presented to illustrate the potential of data mining.

Healthcare Services
Data mining has been used intensively and extensively by many healthcare organizations and can greatly benefit all parties involved. For example, data mining can help healthcare insurers detect fraud and abuse, can help healthcare organizations make customer-relationship management decisions, can help physicians identify effective treatments and best practices, and can help patients receive better and more affordable healthcare services. Applications in healthcare include, but are not limited to, the following:
- Modeling health outcomes and predicting patient outcomes
- Modeling clinical knowledge for decision support systems
- Bioinformatics
- Pharmaceutical research
- Infection control
- Ranking hospitals
- Identifying high-risk patients
- Evaluating treatment effectiveness

Banking
In today's world, traditional banking has changed for many reasons. Gone are the days when conducting simple surveys would enable banks to make the necessary changes in their various marketing, business-process, and customer-relationship strategies. While the emergence of new banks has created strong competition among them, it has also made it unrealistic for them to rely only on their internal procedures to stay profitable in the market. Streamlining business procedures, improving customer relationships, detecting fraudulent activity, providing security at all levels of service, and taking other measures to improve business build trust not only among the major players of the market, but also among employees.

Q2. With a suitable diagram explain the architecture of a data warehouse.

Answer: Though it is easy to think of the data warehouse as just a big collection of data, delivering an effective data warehouse in fact requires a large set of related capabilities (see Figure 1). Certainly data is the fundamental component: cleaned, organized data, mostly extracted from the campus operational systems. Making that data useful to a variety of campus personnel, though, requires applications to deliver and explain it. These applications range from predefined reports through query tools to complex tools for analysis and modeling. Delivering data and applications, and securing the data as specified by campus data stewards, requires a set of technology, most of it centralized in secure computer locations. Equally important, transforming operational data into a shared resource useful across the boundaries of functional business domains requires a broad set of functional skills, organized appropriately and working through proven processes. The architecture of the data warehouse is therefore described in terms of four interrelated dimensions:
1. Applications (the business intelligence layer)
2. Data
3. Technology and security
4. Support processes and organization
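As a rough, hypothetical illustration of the data dimension above, the following Python sketch shows one way an extract-transform-load (ETL) step might pull rows out of an operational system, clean them, and load them into a warehouse table that reporting and query applications can use. The table and column names (enrollments, fact_enrollment, etc.) are invented for the example and do not refer to any particular campus system.

```python
import sqlite3

# A minimal ETL sketch: extract operational rows, clean them, and load them
# into a warehouse-style table. All table and column names are hypothetical.

source = sqlite3.connect(":memory:")      # stand-in for an operational system
warehouse = sqlite3.connect(":memory:")   # stand-in for the data warehouse

# -- extract source: a toy operational table --
source.execute("CREATE TABLE enrollments (student_id INT, dept TEXT, credits REAL)")
source.executemany(
    "INSERT INTO enrollments VALUES (?, ?, ?)",
    [(1, " Physics ", 4.0), (2, "physics", None), (3, "Math", 3.0)],
)

# -- transform: normalise text fields and drop rows with missing values --
rows = source.execute("SELECT student_id, dept, credits FROM enrollments").fetchall()
cleaned = [
    (sid, dept.strip().title(), credits)
    for sid, dept, credits in rows
    if credits is not None            # simple data-quality rule
]

# -- load into the warehouse table used by the application layer --
warehouse.execute("CREATE TABLE fact_enrollment (student_id INT, dept TEXT, credits REAL)")
warehouse.executemany("INSERT INTO fact_enrollment VALUES (?, ?, ?)", cleaned)
warehouse.commit()

print(warehouse.execute("SELECT * FROM fact_enrollment").fetchall())
# [(1, 'Physics', 4.0), (3, 'Math', 3.0)]
```

In practice this would run against real database connections rather than in-memory SQLite, but the extract-clean-load shape is the same.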

Q3. What are the challenges in creating and maintaining a data warehouse? Explain with a suitable example.

Answer: The major issue with data warehousing is that it represents a significant infrastructure investment that is time-consuming to produce. Designing a data warehouse is a complicated procedure that requires buy-in from many organizational stakeholders. It may take months or years to produce a data warehouse and may involve outside consultants and significant personnel time. The tools that perform ETL functions and data warehouse management are expensive. Once produced, a data warehouse must be maintained (personnel, licensing costs, maintenance, etc.). A data warehouse is therefore not an investment to be considered lightly.

The second major issue with data warehousing relates to its freshness: how up-to-date (or real-time) the information it contains is. A data warehouse may only have information that is 24 hours old (based on when it was extracted from the operational systems), which may not be sufficient for some business decisions. A larger issue is that adding new information to a data warehouse as the organization's needs evolve is time-consuming. For example, if an organization acquires another, smaller organization and wants to merge its customer information into its data warehouse, this may take several months to accomplish. Maintaining a data warehouse is costly.
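One common way to manage the freshness problem described above is an incremental refresh, where each load extracts only the records changed since the previous run rather than the whole table. The sketch below is only an illustration under that assumption; the customers table, the last_updated column, and the watermark value are all made up for the example.

```python
import sqlite3

# Hypothetical incremental refresh: pull only source rows modified since the
# last successful load, instead of re-extracting the whole operational table.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INT, name TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "2024-01-01"), (2, "Globex", "2024-01-15"), (3, "Initech", "2024-02-01")],
)

last_load = "2024-01-10"   # watermark recorded after the previous warehouse load

fresh_rows = conn.execute(
    "SELECT id, name, last_updated FROM customers WHERE last_updated > ?",
    (last_load,),
).fetchall()

print(fresh_rows)   # only rows changed after the last load are re-extracted
# [(2, 'Globex', '2024-01-15'), (3, 'Initech', '2024-02-01')]
```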

Part-B

Q1. Explain the various steps involved in the knowledge discovery process.

Answer: The knowledge discovery process (Figure 1.1) is iterative and interactive, consisting of nine steps. Note that the process is iterative at each step, meaning that moving back to previous steps may be required. The process has many artistic aspects, in the sense that one cannot give a single formula or a complete taxonomy of the right choices for each step and application type. It is therefore necessary to understand the process and the different needs and possibilities at each step. The process starts with determining the KDD goals and ends with the implementation of the discovered knowledge. Then the loop is closed and the Active Data Mining part starts (which is beyond the scope of the process defined here). As a result, changes would have to be made in the application domain (such as offering different features to mobile phone users in order to reduce churn). This closes the loop, the effects are then measured on the new data repositories, and the KDD process is launched again. Following is a brief description of the nine-step KDD process, starting with a managerial step:

1. Developing an understanding of the application domain. This is the initial preparatory step. It prepares the scene for understanding what should be done with the many decisions to follow (about transformation, algorithms, representation, etc.). The people in charge of a KDD project need to understand and define the goals of the end-user and the environment in which the knowledge discovery process will take place (including relevant prior knowledge). As the KDD process proceeds, this step may even be revised. Having understood the KDD goals, the preprocessing of the data starts, defined in the next three steps (note that some of the methods here are similar to data mining algorithms, but are used in a preprocessing context).

2. Selecting and creating a data set on which discovery will be performed. Having defined the goals, the data that will be used for the knowledge discovery should be determined. This includes finding out what data is available, obtaining additional necessary data, and then integrating all the data for the knowledge discovery into one data set, including the attributes that will be considered for the process. This step is very important because data mining learns and discovers from the available data: it is the evidence base for constructing the models. If some important attributes are missing, the entire study may fail. In this respect, the more attributes are considered, the better. On the other hand, collecting, organizing, and operating complex data repositories is expensive, so there is a trade-off with the opportunity to best understand the phenomena. This trade-off is one place where the interactive and iterative nature of KDD comes into play: one starts with the best available data set and later expands it, observing the effect in terms of knowledge discovery and modeling.

3. Preprocessing and cleansing. In this stage, data reliability is enhanced. It includes data cleaning, such as handling missing values and removing noise or outliers. There are many methods, ranging from doing nothing to techniques that become the major part (in terms of time consumed) of a KDD project. It may involve complex statistical methods, or using a data mining algorithm in this context. For example, if one suspects that a certain attribute is of insufficient reliability or has many missing values, then this attribute could become the target of a supervised data mining algorithm: a prediction model is built for the attribute, and the missing values can then be predicted (a small code sketch of this idea appears after this list). The extent to which one pays attention to this step depends on many factors.

4. Data transformation. In this stage, better data for the data mining is generated and prepared. Methods here include dimensionality reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformations). This step can be crucial for the success of the entire KDD project, and it is usually very project-specific. For example, in medical examinations, the quotient of attributes may often be the most important factor, rather than each attribute by itself. In marketing, we may need to consider effects beyond our control, as well as efforts and temporal issues (such as studying the effect of advertising accumulation). However, even if we do not use the right transformation at the beginning, we may obtain a surprising effect that hints at the transformation needed (in the next iteration). Thus the KDD process reflects upon itself and leads to an understanding of the transformation needed.

5. Choosing the appropriate data mining task. We are now ready to decide which type of data mining to use, for example classification, regression, or clustering. This mostly depends on the KDD goals, and also on the previous steps. There are two major goals in data mining: prediction and description. Prediction is often referred to as supervised data mining, while descriptive data mining includes the unsupervised and visualization aspects of data mining. Most data mining techniques are based on inductive learning, where a model is constructed, explicitly or implicitly, by generalizing from a sufficient number of training examples.

6. Choosing the data mining algorithm. Having the strategy, we now decide on the tactics. This stage includes selecting the specific method to be used for searching for patterns (including multiple inducers). For example, in considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning there are several possibilities of how it can be accomplished. Meta-learning focuses on explaining what causes a data mining algorithm to be successful or not on a particular problem; this approach thus attempts to understand the conditions under which a data mining algorithm is most appropriate. Each algorithm has parameters and tactics of learning (such as ten-fold cross-validation or another division into training and testing sets).

7. Employing the data mining algorithm. Finally, the data mining algorithm is applied. In this step we might need to run the algorithm several times until a satisfactory result is obtained, for instance by tuning the algorithm's control parameters (see the second sketch after this list).

8. Evaluation. In this stage we evaluate and interpret the mined patterns (rules, reliability, etc.) with respect to the goals defined in the first step. Here we consider the preprocessing steps with respect to their effect on the data mining results (for example, adding features in Step 4 and repeating from there). This step focuses on the comprehensibility and usefulness of the induced model. In this step the discovered knowledge is also documented for further use.

9. Using the discovered knowledge. We are now ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. In fact, the success of this step determines the effectiveness of the entire KDD process. There are many challenges in this step, such as losing the laboratory conditions under which we have operated. For instance, the knowledge was discovered from a certain static snapshot (usually a sample) of the data, but now the data becomes dynamic. Data structures may change (certain attributes become unavailable), and the data domain may be modified (for example, an attribute may take a value that was not assumed before).
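To make steps 3 and 4 more concrete, here is a small, hypothetical sketch (assuming NumPy and scikit-learn are available) of treating an attribute with missing values as the target of a supervised model, as described in step 3, followed by a simple attribute discretization as in step 4. The data is synthetic and the attribute names are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Synthetic data set: 'income' depends on 'age' and 'hours', with some values missing.
age = rng.uniform(20, 65, 200)
hours = rng.uniform(10, 60, 200)
income = 500 * age + 300 * hours + rng.normal(0, 2000, 200)
income[rng.choice(200, 20, replace=False)] = np.nan   # inject missing values

X = np.column_stack([age, hours])
missing = np.isnan(income)

# Step 3 (cleansing): treat the unreliable attribute as the target of a
# supervised model and predict its missing entries from the other attributes.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[~missing], income[~missing])
income[missing] = model.predict(X[missing])

# Step 4 (transformation): discretize the now-complete numeric attribute
# into five ordinal bins, a common attribute transformation.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
income_binned = binner.fit_transform(income.reshape(-1, 1))

print(income_binned[:10].ravel())
```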

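Likewise, steps 6 to 8 (choosing an algorithm, running it with tuned parameters, and evaluating the result) are often carried out with cross-validation. The sketch below is only an illustration using scikit-learn's bundled iris data set: it compares a decision tree with a k-nearest-neighbour classifier under ten-fold cross-validation, as mentioned in step 6, and then tunes the tree's depth.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 6: compare candidate algorithms with ten-fold cross-validation.
for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("k-NN", KNeighborsClassifier())]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# Step 7: employ the chosen algorithm repeatedly, tuning its control
# parameters until a satisfactory result is obtained.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 4, 5]}, cv=10)
search.fit(X, y)

# Step 8: evaluate the induced model with respect to the original goals.
print("best depth:", search.best_params_["max_depth"],
      "cv accuracy:", round(search.best_score_, 3))
```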
Q2. Write short notes on:
a) Data selection
b) Data cleaning

Answer:

Data selection: Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. Data selection precedes the actual practice of data collection. This definition distinguishes data selection from selective data reporting (selectively excluding data that does not support a research hypothesis) and from interactive/active data selection (using collected data to monitor activities/events, or conducting secondary data analyses). The process of selecting suitable data for a research project can affect data integrity. The primary objective of data selection is the determination of the appropriate data type, source, and instrument(s) that allow investigators to adequately answer the research questions. This determination is often discipline-specific and is driven primarily by the nature of the investigation, the existing literature, and the accessibility of the necessary data sources. Issues of data selection include:

- the appropriate type and sources of data which permit investigators to adequately answer the stated research questions;
- suitable procedures in order to obtain a representative sample;
- the proper instruments to collect data.

There should be compatibility between the type/source of data and the mechanisms used to collect it; it is difficult to extricate the selection of the type/source of data from the instruments used to collect the data.

Data cleaning: Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been caused originally by user entry errors, by corruption in transmission or storage, or by different data-dictionary definitions of similar entities in different stores. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry time, rather than being checked in batches after collection. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
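As a small illustration of the strict and fuzzy validation mentioned above, the following sketch uses only the Python standard library: a regular expression enforces a strict postal-code format, while difflib corrects city names that partially match a known reference list. The reference list and the records are made-up examples.

```python
import re
from difflib import get_close_matches

KNOWN_CITIES = ["Ludhiana", "Jalandhar", "Amritsar", "Chandigarh"]  # hypothetical reference list
POSTAL_CODE = re.compile(r"^\d{6}$")   # strict rule: a PIN code must be exactly six digits

records = [
    {"city": "Ludihana", "pin": "141001"},   # misspelled city, valid PIN
    {"city": "Amritsar", "pin": "14300"},    # valid city, malformed PIN
]

cleaned, rejected = [], []
for rec in records:
    # Strict validation: reject the record outright if the PIN is malformed.
    if not POSTAL_CODE.match(rec["pin"]):
        rejected.append(rec)
        continue
    # Fuzzy correction: replace the city with its closest known spelling, if any.
    match = get_close_matches(rec["city"], KNOWN_CITIES, n=1, cutoff=0.8)
    if match:
        rec["city"] = match[0]
    cleaned.append(rec)

print("cleaned:", cleaned)    # [{'city': 'Ludhiana', 'pin': '141001'}]
print("rejected:", rejected)  # [{'city': 'Amritsar', 'pin': '14300'}]
```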

Q3. Give an account of the need for decision-support systems for business and scientific applications.

Answer:

Scientific and business applications: Rapid advances in information and sensor technologies (IT and ST), along with the availability of large-scale scientific and business data repositories and database management technologies, combined with breakthroughs in computing technologies, computational methods, and processing speeds, have opened the floodgates to data-dictated models and pattern matching. The use of sophisticated and computationally intensive analytical methods is expected to become even more commonplace with recent research breakthroughs in computational methods and their commercialization by leading vendors. Scientists and engineers have developed innovative methodologies for extracting correlations and associations, dimensionality reduction, clustering and classification, regression and predictive modeling, tools based on expert systems and case-based reasoning, as well as decision support systems for batch or real-time analysis. They have utilized tools from areas like traditional statistics, signal processing, and artificial intelligence, as well as from emerging fields like data mining, machine learning, operations research, systems analysis, and nonlinear dynamics.

Innovative models and newly discovered patterns in complex, nonlinear, and stochastic systems, encompassing the natural and human environments, have demonstrated the effectiveness of these approaches. However, applications that can utilize these tools in the context of scientific databases in a scalable fashion have only begun to emerge. Business solution providers and IT vendors, on the other hand, have focused primarily on scalability, process automation and workflows, and the ability to combine results from relatively simple analytics with judgments from human experts. For example, e-business applications in the areas of supply chain planning, financial analysis, and business forecasting traditionally rely on decision support systems with embedded data mining, operations research, and OLAP technologies, business intelligence (BI) and reporting tools, as well as an easy-to-use GUI (graphical user interface) and extensible business workflows (e.g., see Geoffrion and Krishnan, 2003).
