This document contains the responses to 10 tasks related to data mining and data preprocessing. It defines key concepts like data mining, the steps in data mining, discrete vs continuous attributes, and applications of data mining. It also discusses issues in data mining techniques, numeric attributes, major tasks in data preprocessing, data similarity and dissimilarity, statistical descriptions of data, and defines data discretization.
1. Define Data mining. List out the steps in data mining.
Data mining, which is also referred to as Knowledge Discovery in Databases (KDD), is the extraction of patterns, trends, and insights from large datasets, usually stored in databases. The process entails analyzing and interpreting large volumes of data to uncover patterns and relationships that can inform business decisions, predict future outcomes, and provide a competitive edge.
STEPS IN DATA MINING
Data Cleaning – removing noisy and inconsistent data from the collection.
Data Integration – merging data gathered from multiple sources into a single repository.
Data Selection – determining and retrieving the data from the collection that is pertinent to the analysis.
Data Transformation – turning data into the form required by the mining process.
Data Mining – applying techniques to extract patterns that have the potential to be useful.
Pattern Evaluation – assessing the quality, relevance, and usefulness of the patterns using predefined criteria or measures.
Knowledge Representation – using visualization tools to present the outcomes of data mining.
2. Compare Discrete versus Continuous Attributes.
A Discrete Attribute has a set of values that is either finite or countably infinite. These values are often represented as integers or in categorical form. A Continuous Attribute, in contrast, can take any value within a continuous range, so it has infinitely many possible states and is typically represented as a floating-point number. It is frequently associated with measurements or quantities.
3. Give the applications of Data Mining.
Financial Data Analysis – Banking services include loans, investments, credits, debits, etc. Financial data is generally reliable and of high quality, making systematic data analysis and data mining possible.
Retail Industry – It gathers a large amount of data on sales, consumers, goods, consumption, and service. Data mining aids in understanding customer purchasing patterns and trends, which leads to improved customer service and satisfaction.
Telecommunication Industry – It is one of the most rapidly growing industries, offering a wide range of services. Data mining in this industry aids in identifying telecommunication patterns, detecting fraudulent actions, making better use of resources, and improving service quality.
Biological Data Analysis – It deals with Genomics (Gene Study), Proteomics (Protein Study), and Biomedical Research, as well as the comparison and identification of human genomes.
Other Scientific Applications – Data mining is applied in scientific domains such as Geosciences, Astronomy, Climate and Ecosystem Modelling, Chemical Engineering, and Fluid Dynamics.
Intrusion Detection – An intrusion is any set of actions that threatens the integrity, confidentiality, or availability of network resources.
4. Analyze the issues in Data Mining Techniques.
5. Generalize in detail about Numeric Attributes.
Numeric attributes are a fundamental type of data attribute used in various data analysis and machine learning applications. These attributes represent measurable quantities and can take on a range of numerical values. Numeric attributes come in two types: Interval-Scaled and Ratio-Scaled. Interval-scaled attributes are measured on a scale of equal-size units with no true zero point (for example, temperature in Celsius), while ratio-scaled attributes have a true zero point, so ratios between values are meaningful (for example, weight or a count).
6. Evaluate the major tasks of data preprocessing.
Data Cleaning – Filling in missing values, smoothing noisy data, and resolving inconsistencies in the data.
Data Integration – Data from several representations is combined, and conflicts within the data are addressed.
Data Transformation – Data is normalized, aggregated, and generalized.
Data Reduction – The goal of this procedure is to give a reduced representation of the data in a data warehouse.
Data Discretization – Dividing the range of a continuous attribute into intervals to reduce the number of values the attribute takes.
7. Define an efficient procedure for cleaning the noisy data.
8. Distinguish between data similarity and dissimilarity.
Data Similarity
- Numerical measure of how alike two data objects are.
- Value is higher when objects are more alike.
- Often falls in the range [0,1].
Data Dissimilarity
- Numerical measure of how different two data objects are.
- Value is lower when objects are more alike.
- Minimum dissimilarity is often 0.
- Upper limit varies.
9. Show the Displays of Basic Statistical Descriptions of Data.
Measures of Central Tendency – The mean, median, and mode are the primary measures of central tendency; each represents the central or typical value in a dataset.
Measures of Dispersion – These are the range, variance, and standard deviation. They describe how spread out the data is in an objectively quantifiable manner.
Frequency Distribution – It is a graphical or tabular representation that shows the number of observations within each specified interval.
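The statistical descriptions above can be illustrated with a minimal sketch using Python's standard library. The sample values and the interval width of 5 are assumptions made only for illustration, and the population forms of variance and standard deviation are used here; a sample-based analysis would use `statistics.variance` and `statistics.stdev` instead.

```python
import statistics
from collections import Counter

# Hypothetical sample data, chosen only for illustration.
values = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency.
mean = statistics.mean(values)      # sum 42 over 7 values -> 6
median = statistics.median(values)  # middle value of the sorted data -> 5
mode = statistics.mode(values)      # most frequent value -> 4

# Measures of dispersion (population versions).
value_range = max(values) - min(values)  # 11 - 2 -> 9
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

# Frequency distribution: count observations per interval of width 5.
# Each key is the lower bound of an interval [key, key + 5).
bin_width = 5  # assumed interval width
frequency = dict(sorted(Counter((v // bin_width) * bin_width for v in values).items()))

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "variance:", variance, "std dev:", std_dev)
for lower, count in frequency.items():
    print(f"[{lower}, {lower + bin_width}): {count}")
```

The tabular frequency distribution printed at the end is the same information a histogram would display graphically.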
10. Formulate what is data discretization.
Data discretization is the process of transforming continuous data into discrete or categorical values by dividing the data into intervals, making it easier to analyze and categorize information.
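As a minimal sketch of the idea, the example below uses equal-width binning, which is only one of several possible discretization methods. The function name `equal_width_bins`, the `ages` data, and the choice of three bins are all hypothetical and are not taken from the answers above.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in the range 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # every interval has the same width
    labels = []
    for v in values:
        # If all values are identical, width is 0 and everything goes in bin 0.
        idx = int((v - lo) / width) if width > 0 else 0
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        labels.append(min(idx, n_bins - 1))
    return labels


# Hypothetical continuous attribute: ages in years.
ages = [18, 22, 25, 31, 40, 47, 58, 63, 78]
bins = equal_width_bins(ages, 3)
print(bins)  # [0, 0, 0, 0, 1, 1, 2, 2, 2]
```

With a minimum of 18 and a maximum of 78, each of the three intervals is 20 units wide, so the bin indices could be read as categories such as "young", "middle-aged", and "senior".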