DATA MINING
DATA MINING ISSUES
1. Human interaction: When a data mining task is undertaken, the goal is often not clear at the outset. Neither the users nor the technical experts know the results in advance, so a proper interface between the domain expert and the users is needed. The experts formulate the queries based on the users' demands.
2. Overfitting: Overfitting is a statistical error. When a model is generated for a particular data set, it is expected that the same model should accommodate future data sets as well. Overfitting occurs when the generated model fits the training data set very well but does not fit the test data set or future data sets.
3. Outliers: When a model is derived, some data values do not fit the model. These values are significantly different from the normal values, or they do not fit in any cluster. Such values are called outliers (a small sketch appears after this list).
4. Interpretation of the results: Interpreting the results obtained by data mining is a crucial task that goes beyond merely explaining the results; it requires expert analysis and interpretation. Hence, interpretation of the results is an issue in data mining.
5. Visualization of the results: Visualization of the results is useful to understand and quickly view the output of the different data mining algorithms.
6. Large data sets: Data mining models are generally designed and tested on small data sets. When these models are applied to very large data sets, they either fail or become unstable. Many models that work very well on normal data sets are inefficient at handling large data sets. The large data set issue can be handled with sampling and parallelization.
7. Noisy data: Data which has no meaning is called noisy data. These values need to be corrected or replaced with meaningful data.
8. Multimedia data: Many users demand mining tasks for graphical, video or audio data. Multimedia data can be an issue in data mining because traditional data mining tasks are designed for numeric or alphanumeric data.
9. Missing data: Sometimes the data is incomplete or missing. During the KDD process, this data may be filled in with nearest estimates. These estimates may give false or invalid results, creating problems.
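To make the outlier issue in point 3 concrete, here is a minimal sketch, assuming NumPy is available; the sample values and the standard 1.5 x IQR fences are illustrative choices, not taken from the text.

```python
# A minimal sketch of spotting outliers (issue 3) with the IQR rule,
# using NumPy; the sample values are purely illustrative.
import numpy as np

values = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 54.0, 10.3, 9.7])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the "normal" range

outliers = values[(values < lower) | (values > upper)]
print("outliers:", outliers)   # 54.0 falls far outside the normal range
```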
**STAGES OF THE DATA MINING PROCESS (KDD)**
KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. KDD is an iterative process; it requires multiple iterations of the following steps to extract accurate knowledge from the data.
1. Selection: The data to be mined is not necessarily from a single source; it may have many heterogeneous origins. This data needs to be obtained from various data sources and files. Data selection is based on the mining goal: data relevant to the mining task is selected from the various sources.
2. Pre-processing: Pre-processing involves cleaning and integration of the data. The data selected for mining may contain incorrect or irrelevant values which lead to unwanted results, and some values may be missing or erroneous. Also, when data is collected from heterogeneous sources, it may involve varying data types and metrics. This data therefore needs to be cleaned and integrated to eliminate noise and inconsistency.
3. Transformation: Data transformation is the process of converting the data into a format suitable for processing. Here, the data is molded into the form required by the data mining process.
4. Mining: The mining step applies methods and techniques to extract the patterns present in the data. It involves transforming the relevant data records into patterns, for example using classification, by applying various data mining algorithms to the transformed data. The mining step generates the desired results for which the whole KDD process is undertaken.
APPLICATIONS OF DATA MINING
Data mining is used by many organizations to improve their customer base. They focus on customer behavioral patterns, market analysis, profit areas and product improvement.
(a) Education: Educational data mining deals with developing methods to discover knowledge from the education field. It is used to project students' areas of interest, future learning capacities and other aspects.
(b) Health and medicine: Data mining can be used effectively in health care systems. During the Covid-19 pandemic, predictions of the Covid-19 waves and the volume of patients were made using data mining.
(c) Market analysis: Market analysis is based on the patterns of purchases followed by customers. These patterns help a shop owner understand the buying behavior of customers, and accordingly useful decisions can be implemented to increase the profit of the store.
(d) Fraud detection: A fraud detection system helps in finding the pattern of a fraud, its potential attackers, and possible solutions using different data mining algorithms. These methods provide timely and efficient solutions for the detection and prevention of fraud.
!!DATA CLEANING!!
The first step in data preprocessing is data cleaning. Data cleaning includes handling missing data and noisy data.
(a) Missing data: Missing data is the case where some attributes or attribute values are missing or the data is not normalized. This situation can be handled by either ignoring the records or filling in the missing values.
(b) Noisy data: This is data with errors or data which has no meaning at all. It can lead to invalid results or create problems for the mining process itself. The problem of noisy data can be solved with the binning method, regression and clustering.
!!DATA TRANSFORMATION!!
(a) Smoothing: The process of removing unnecessary data and cleaning the data so as to improve its usefulness.
(b) Aggregation: The process of collecting data from heterogeneous platforms and converting it to a uniform format. This improves the quality of the data.
(c) Discretization: Large data sets are complex to handle. Discretization is the process of breaking up the data into small intervals. These chunks are continuous and are supported by all the existing frameworks.
(d) Attribute construction: To improve the efficiency of the mining process, new attributes are generated from the existing data sets.
(e) Generalization: The process of converting low-level attributes to high-level attributes using a hierarchy.
(f) Normalization: In normalization, attributes are scaled to lie within a specified range (the cleaning, binning and normalization steps are illustrated in the sketch after this list).
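The following is a minimal sketch of the cleaning and transformation steps described above, assuming pandas is available; the column names and values are hypothetical and only serve to illustrate missing-value filling, binning (smoothing/discretization) and min-max normalization.

```python
# A minimal cleaning and transformation sketch using pandas.
# The column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({"age": [23, 45, None, 31, 52, 29],
                   "income": [32000, 81000, 45000, None, 98000, 40000]})

# Data cleaning: fill missing values with a nearest estimate (here, the column mean).
df = df.fillna(df.mean(numeric_only=True))

# Smoothing / discretization: bin the continuous "age" values into intervals.
df["age_bin"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])

# Normalization: min-max scale "income" into the range [0, 1].
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

print(df)
```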
!!DATA REDUCTION!!
Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume.
Methods:
(a) Attribute selection: When data is collected from various sources, it may contain duplicate attributes, and some attributes are irrelevant. The attribute selection method removes such redundant and unnecessary attributes from the data set, resulting in an improved data set.
(b) Data cube aggregation: In this reduction method, aggregation is applied to selected data sets so as to obtain the data in a much simpler format.
(c) Numerosity reduction: In this reduction method, the actual data is substituted with a mathematical model of the data.
(d) Dimensionality reduction: In this reduction method, duplicate attributes are removed to reduce the data size.
!!DATA DISCRETIZATION!!
Large data sets are complex to handle. Discretization is the process of breaking up the data into small intervals, which reduces the data size. The data divided into intervals remains continuous, preserving its sequence. Every interval has its own name, and later these intervals can be replaced with actual data. These continuous chunks are supported by all the existing frameworks.
What is a data warehouse?
A data warehouse is an enterprise system used for the analysis and reporting of structured and semi-structured data from multiple sources, such as point-of-sale transactions, marketing automation, customer relationship management, and more.
Need for a Data Mart:
Since a data mart is related to a specific domain, information retrieval time is lower and efficiency is improved. It provides easy access to frequently requested data. Data marts are easy to implement, and the cost of implementation is less than that of a data warehouse. A data mart is agile: in case of a change in the model, a data mart can be rebuilt more quickly due to its smaller size. A data mart is defined by a single subject matter expert. Data can be segmented and stored on different hardware/software platforms.
Advantages and Disadvantages of a Data Mart:
Advantages: Data marts are domain specific, hence valuable to a specific group of users. They are cost effective and easy to implement. A data mart allows faster access to data. It is easy to use as it is specifically designed for the needs of its users. A data mart can accelerate business processes and is efficient to use. It contains historical data, which enables analysts to determine data trends.
Disadvantages: Many subsets of the corporate data warehouse may create an unnecessary burden. Data marts are very hard to maintain if they are created from unrelated data. A data mart cannot provide company-wide data analysis as its data set is limited.
!!Different OLAP Operations!!
OLAP operations are performed on multidimensional data. This data is organized in various dimensions, and every dimension includes multiple levels of abstraction. Various OLAP operations are available to present these views; they are based on a multidimensional view of data. Here is the list of OLAP operations (illustrated in the sketch after this list):
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
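As a rough, hedged illustration of roll-up, slice and pivot, the sketch below mimics these operations on a small in-memory cube using pandas group-by and pivot operations; the dimensions (year, quarter, city) and the sales figures are hypothetical, and a real OLAP server would of course work on a proper multidimensional cube.

```python
# A minimal sketch of OLAP-style roll-up, slice and pivot using pandas.
# The dimensions, measure and values are hypothetical.
import pandas as pd

cube = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q2"],
    "city":    ["Pune", "Pune", "Mumbai", "Pune", "Mumbai"],
    "sales":   [100, 150, 120, 170, 90],
})

# Roll-up: aggregate from (year, quarter, city) up to the year level.
rollup = cube.groupby("year")["sales"].sum()
print(rollup)

# Slice: fix one dimension (quarter == "Q2") to obtain a sub-cube.
slice_q2 = cube[cube["quarter"] == "Q2"]
print(slice_q2)

# Pivot (rotate): reorient the view with cities as columns.
pivoted = cube.pivot_table(index="year", columns="city", values="sales", aggfunc="sum")
print(pivoted)
```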
!!Snowflake Schema!!
A snowflake schema is a refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake. 'Snowflaking', i.e. the normalization of the dimension tables, can be done in many different ways.
•A snowflake schema is an arrangement of tables in a multidimensional database system.
Advantages of the Snowflake Schema: Data is structured. Data integrity is maintained. Less disk space is utilized.
Disadvantages of the Snowflake Schema: It requires more complex queries, and complex queries decrease performance.
!!Star Schema!!
•The star schema is the most common schema in data warehouses and is widely used to design them. The basic architecture of a star schema includes one fact table and many dimension tables. The advantage of the star schema is that it is very efficient in handling queries.
•A star schema contains one fact table associated with many dimension tables.
•The fact table contains the primary information of the data warehouse.
•The dimension tables hold the details of the surrounding entities.
•The primary key present in each dimension table is related to a foreign key present in the fact table.
Advantages of the Star Schema:
•Its performance is good because simple queries are used.
•It contains single (non-normalized) dimension tables.
•In a star schema, both dimension and fact tables are in de-normalized form.
•It has fewer foreign keys and hence shorter query execution time.
Disadvantages of the Star Schema:
•It has redundant data and is hence difficult to maintain or change.
•There are data integrity issues.
•Many-to-many relationships are not supported.
!!Definition of Classification!!
•As the name suggests, classification is the process of classifying data. It is a data mining technique used for analysis of the data. It is the process of finding the model that defines the classes and their concepts, and it identifies and categorizes sub-populations of the data.
•The main goal of a classification algorithm is to identify the category of a given data record; these algorithms are mainly used to predict the output for categorical data.
!!Applications of Classification!!
•Credit approval: classify an applicant as a good or poor credit risk.
•Target marketing: build the profile of a good customer.
•Medical diagnosis: develop a profile of stroke victims; cancer tumor cell identification.
•Fraud detection: determine whether a credit card purchase is fraudulent.
•Email spam classification: filter spam e-mail automatically.
•Banking: predict a bank customer's willingness to repay a loan.
!!Classifier!!
The algorithm which implements classification on a dataset is known as a classifier.
1. Binary classifier: if the classification problem has only two possible outcomes, it is called a binary classifier.
2. Multi-class classifier: if a classification problem has more than two outcomes, it is called a multi-class classifier.
!!Tree Pruning!!
•Tree pruning is a process in which branches induced by anomalies in the training data, caused by outliers or noise, are removed.
•Pruned trees are beneficial because they are smaller and less complex.
•Decision trees trained on any training data tend to overfit that data, so pruning also reduces the overfitting problem (a sketch follows the decision tree lists below).
•The pruning process cuts off the lower ends of the tree so as to make it simpler and less complex.
!!Advantages of Decision Trees!!
1. Decision trees (DTs) are easy to use and efficient.
2. Rules can be generated that are easy to interpret and understand.
3. They can handle nonlinear parameters easily.
4. The scalability of DTs is good, as the tree size is independent of the database size.
5. Trees can be constructed over many attributes.
!!Disadvantages of Decision Trees!!
1. They do not easily handle continuous data; such attribute domains must be divided into categories to be handled.
2. The mathematical computation of a decision tree usually requires more memory.
3. The computation of a decision tree is time consuming.
4. The space and time complexity of the decision tree model is relatively high.
5. Decision tree model training time is long, as the complexity is high.
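Below is a minimal sketch, assuming scikit-learn, of training a decision tree classifier and then "pruning" it by limiting depth and applying cost-complexity pruning, as discussed in the tree pruning section above; the dataset and the parameter values (max_depth=4, ccp_alpha=0.01) are illustrative choices only.

```python
# A minimal decision tree classification sketch with simple pruning,
# assuming scikit-learn; the data and parameter values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fully grown tree: tends to overfit the training data.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# "Pruned" tree: limit depth and apply cost-complexity pruning (ccp_alpha).
pruned = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01,
                                random_state=42).fit(X_train, y_train)

print("full   test accuracy:", full.score(X_test, y_test))
print("pruned test accuracy:", pruned.score(X_test, y_test))
print("node counts:", full.tree_.node_count, "vs", pruned.tree_.node_count)
```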
!!Classification Algorithms!!
The main aim of a classification algorithm is to identify the category of a given data record; these algorithms are mainly used to predict the output for categorical data.
!!Types!!
1. Statistical-based classification: Statistical classification is the broad supervised learning approach that trains a program to categorize new, unlabeled information based upon its relevance to known, labeled data.
2. Distance-based classification: Distance-based algorithms are nonparametric methods that can be used for classification. They classify objects by the difference between them as measured by distance functions.
3. Decision tree based classification: A decision tree builds the classification in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. It can handle both categorical and numerical data.
4. Neural network based classification: Artificial neural networks are relatively simple electronic networks of neurons based on the neural structure of the brain. They process records one at a time and learn by comparing their (initially largely arbitrary) classification of each record with its known actual classification.
!!BAYES CLASSIFICATION METHODS!!
Bayesian classification is a probabilistic approach to learning and inference based on a different view of what it means to learn from data, in which probability is used to represent uncertainty about the relationship being learnt. Bayesian classification is based on Bayes' theorem, and Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Naive Bayes is a classifier which uses Bayes' theorem: it predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class, and the class with the highest probability is considered the most likely class.
!!Naïve Bayesian Classification!!
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms which all share a common principle: every pair of features being classified is independent of each other. It is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
!!Applications of Naive Bayes Algorithms!!
•Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used for making predictions in real time.
•Multi-class prediction: This algorithm is also well known for its multi-class prediction capability; it can predict the probability of multiple classes of the target variable.
•Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in text classification (due to good results on multi-class problems and the independence assumption) and have a higher success rate than many other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments). A small sketch follows this list.
•Recommendation systems: A Naive Bayes classifier together with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
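Here is a minimal spam-filtering sketch with a multinomial Naive Bayes classifier, assuming scikit-learn; the toy messages and labels are hypothetical and only illustrate how class membership probabilities are produced.

```python
# A minimal Naive Bayes spam-filtering sketch, assuming scikit-learn;
# the toy messages and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "lowest price on meds",
            "meeting at 10 am tomorrow", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)      # bag-of-words features

model = MultinomialNB().fit(X, labels)      # class priors + word likelihoods

new = vectorizer.transform(["free prize meeting"])
print(model.predict(new))                   # most probable class
print(model.predict_proba(new))             # per-class membership probabilities
```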
!!K-NEAREST-NEIGHBOR CLASSIFIERS!!
K-Nearest-Neighbor (KNN) is a supervised learning algorithm used for classification and regression; its major application is in classification for predictive problems. KNN stores all available cases and classifies new cases based on the similarity between data items. KNN makes no assumptions about the underlying data (it is a non-parametric algorithm).
!!SVM CLASSIFIER!!
•Support Vector Machine (SVM) is one of the most popular supervised learning algorithms, used for classification as well as regression problems. SVM is a method for the classification of both linear and nonlinear data; it uses a nonlinear mapping to transform the original training data into a higher dimension.
•The main objective of the SVM algorithm is to create the best line or decision boundary that segregates the n-dimensional space into classes, so that new data points can easily be put in the correct category in the future. This best decision boundary is called a hyperplane.
!!Types of SVM!!
1. Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable and the classifier used is called a linear SVM classifier.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear and the classifier used is called a non-linear SVM classifier.
!!Linear Regression!!
•Linear regression models the relationship between two variables by fitting a linear equation to the observed data. It is the simplest type of regression.
•Linear regression attempts to find the mathematical relationship between variables.
•If the fitted relationship is a straight line it is considered a linear model; if it is a curved line, it is a non-linear model.
•The relationship between the dependent and independent variable is given by a straight line, and there is only one independent variable: Y = a + βx
!!Non-linear Regression!!
•Non-linear regression is a form of regression analysis in which data is fit to a model and then expressed as a mathematical function.
•Non-linear regression uses a curved function of an X variable (or variables) to predict a Y variable.
•Non-linear regression can, for example, model a prediction of population growth over time.
!!INTRODUCTION TO PREDICTION!!
Prediction is one of the data mining processes used to find a numerical output. Here, the training dataset contains the inputs and the corresponding numerical output values. From the training dataset, the algorithm derives a model, or predictor. When new data is given, the model should produce a numerical output. This method does not use a class label; the model predicts a continuous-valued function or ordered value.
•Prediction is the method of estimating missing or unavailable numerical data for a new observation.
•In prediction, the accuracy depends on how well a given predictor can guess the value of the predicted attribute for new data. A small regression-based prediction sketch follows.
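The sketch below, assuming scikit-learn and NumPy, fits the linear model Y = a + βx mentioned above and then predicts a continuous value for a new input, which is exactly the kind of numerical output the prediction section describes; the data points are hypothetical.

```python
# A minimal sketch of fitting Y = a + βx and predicting a continuous value,
# assuming scikit-learn; the data points are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # observed outputs

model = LinearRegression().fit(x, y)
print("a (intercept):", model.intercept_)
print("beta (slope):", model.coef_[0])

# Prediction: a continuous-valued output for new, unseen input.
print("predicted y at x=6:", model.predict(np.array([[6.0]]))[0])
```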
!!Applications of Clustering!!
•Data summarization and compression: Clustering is useful in fields like image processing and vector quantization which require data summarization, compression and reduction.
•Trend detection in dynamic data: Clustering can also be applied for trend detection in dynamic data sets, as clusters of similar trends can be created.
•Social network analysis: In social network analysis, for example on Facebook or Twitter, clustering is used for generating sequences in images, videos or audio.
•Biological data analysis: In biological data, for example for cancer detection, clustering can be used to group similar images and videos.
•Marketing: Finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records.
•Biology: Clustering helps classify animals and plants into groups using similar functions or genes.
•Libraries: Book ordering.
•Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.
•City planning: Identifying groups of houses according to their house type, value and geographical location.
•Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones.
•WWW: Document classification; clustering weblog data to discover groups of similar access patterns.
!!Requirements for Cluster Analysis!!
•Scalability: Highly scalable clustering algorithms are needed to deal with large databases such as big data.
•Ability to deal with different kinds of attributes: Algorithms should be applicable to any kind of data, such as numerical, categorical, and binary data.
•Discovery of clusters with arbitrary shape: Clustering algorithms should determine clusters of arbitrary shape; different types of distance measures are used, for example to find small circular clusters.
•High dimensionality: The clustering algorithm should be able to handle low- as well as high-dimensional data sets.
•Ability to deal with noisy data: Algorithms should be designed so that they can handle noisy data; algorithms that cannot handle it produce poor-quality clusters.
•Interpretability: Clustering algorithms should produce interpretable, comprehensible, and usable results.
!!Types of Clustering!!
1. Hierarchical clustering:
•In this type of clustering, a nested set of clusters is created.
•Every level in the hierarchy has a distinct set of clusters.
•At the lowest level, every item is its own cluster; these clusters are unique.
•As the level in the hierarchy increases, items are grouped together into larger clusters.
•These algorithms can be classified as agglomerative or divisive.
•Agglomerative: clusters are created in a bottom-up fashion (a sketch appears after this list).
•Divisive: clusters are created in a top-down fashion.
2. Partitional clustering:
•In this type of clustering, only one set of clusters is created.
•The desired number of clusters is defined in advance.
3. Categorical clustering:
•These algorithms work on categorical databases, in which values describe some characteristic or category, for example: what is your favourite colour?
4. Large-database clustering:
•These algorithms work on large databases, i.e. big data.
•They adapt to memory constraints by either sampling or compression techniques.
•In the case of sampling, data structures are used which can be compressed or pruned to fit into memory irrespective of the size of the database.
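As a minimal sketch of bottom-up (agglomerative) hierarchical clustering, assuming scikit-learn, the example below merges the closest clusters until two remain; the 2-D points, n_clusters=2 and the average linkage are illustrative choices.

```python
# A minimal bottom-up (agglomerative) hierarchical clustering sketch,
# assuming scikit-learn; the 2-D points and n_clusters=2 are illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one dense group
                   [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])   # another dense group

# Each point starts as its own cluster; the closest clusters are merged
# repeatedly until the requested number of clusters remains.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1]
```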
!!K-MEANS: A Centroid-Based Technique!!
K-means clustering is an iterative clustering algorithm in which items are moved among a set of clusters until the desired set is reached. It is a type of squared-error algorithm, and a convergence criterion must be defined to obtain the final result.
Input:
D = {t1, t2, ..., tn} // set of elements
k // number of desired clusters
Output:
K // set of k clusters
K-means algorithm:
1. assign initial values for the means m1, m2, ..., mk;
2. repeat
3. assign each item ti to the cluster whose mean is closest;
4. calculate the new mean for each cluster;
5. until the convergence criterion is met;
!!Market Basket Analysis!!
•Market basket analysis is a data mining technique used by retailers to increase sales by better understanding customer purchasing patterns.
•It analyses large data sets, for example purchase histories, to find product groupings and products that are likely to be purchased together.
•Combinations of items that occur frequently in the transactions are examined. Such examination of transactions allows retailers to identify relationships between the items that people frequently purchase together.
•For example, people who buy bread and peanut butter also buy jelly; people who buy sugar may buy milk.
!!APRIORI ALGORITHM!!
Apriori is one of the best-known algorithms for generating association rules. It is a powerful algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm reflects the property it uses: the Apriori property (a small Python sketch follows this list).
•The Apriori property is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
•The algorithm uses an iterative, level-wise search.
•At any level, frequent k-itemsets are used to explore (k + 1)-itemsets.
•In the first step, the whole database is scanned and the count of each individual item is found; a minimum support is assumed.
•Items that satisfy the minimum support are kept, giving the set of frequent itemsets denoted L1 (level 1).
•L1 is used to find L2 (the set of frequent 2-itemsets).
•L2 is used to find L3, and the process continues until no more frequent k-itemsets can be found.
•The database has to be scanned each time to find the frequent k-itemsets Lk.
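The following is a minimal level-wise sketch in the spirit of Apriori, written in plain Python with a simplified candidate-generation step (unions of frequent k-itemsets) rather than the full pruning used by the real algorithm; the transactions and min_support value are hypothetical.

```python
# A minimal level-wise frequent-itemset sketch in the spirit of Apriori,
# written in plain Python; the transactions and min_support are hypothetical.
from itertools import combinations

transactions = [
    {"bread", "peanut butter", "jelly"},
    {"bread", "peanut butter"},
    {"bread", "milk", "sugar"},
    {"sugar", "milk"},
    {"bread", "peanut butter", "milk"},
]
min_support = 2  # an itemset is "frequent" if it occurs in at least 2 transactions

def support(itemset):
    """Count how many transactions contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# L1: scan the database once and keep single items meeting min_support.
items = {i for t in transactions for i in t}
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

k = 1
while level:
    print(f"L{k}:", [(sorted(s), support(s)) for s in level])
    # Candidate (k+1)-itemsets are built from unions of frequent k-itemsets.
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
    level = [c for c in candidates if support(c) >= min_support]
    k += 1
```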
!!Components of Data Science!!
1. Statistics: Data science cannot be thought of without statistics; it is its basic and critical unit. A huge amount of numerical data is given to an algorithm, different statistical measures are applied to it, and useful output is generated.
2. Visualization: Huge amounts of data are represented so that they can be grasped at a glance and easily understood by business processes, their stakeholders and society.
3. Machine Learning: Different machine learning algorithms are used to make predictions about future or unseen data.
4. Deep Learning: A newer machine learning research technique in which the algorithm itself selects the analysis model to follow.
5. Advanced Computing: Data science is essentially data computing, so many advanced computing techniques and tools are needed to process data in data science.
6. Data Engineering: The preparation of systems for collecting, validating, and preparing high-quality data. Data engineers gather and prepare the data, and data scientists use the data to promote better business decisions.
7. Domain Expertise: Domain experts have knowledge about the data domain. The useful insights output by data science need to be clearly understood by the user; if the user is unable to interpret the results, domain experts are needed to interpret them.
!!DATA SCIENCE PROCESS!!
Step 1 - Discovery: This step involves obtaining data from all identified internal and external sources. The acquired data helps the user answer the business problem under study.
Step 2 - Data Preparation: The collected data has many problems such as missing values, blank columns and incorrect data formats, which need to be cleaned. Before the data is used for modelling it should be clean, transformed and in a single format, so data transformation is done at this step.
Step 3 - Model Planning: In this step of the data science process, the model for the training data set is decided. The user determines the method and technique to draw the relation between input variables and output variables; the model plan uses different statistical formulas and visualization tools.
Step 4 - Model Building: Models are built using the training data set and their performance is evaluated on the test data set. The actual model building happens here: the data scientist distributes the datasets for training and testing (a sketch appears after the data analytics points below).
Step 5 - Operationalize: In this stage, the user delivers the final baselined model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing. Visual insights are prepared at this step to discover useful patterns in the data.
Step 6 - Communicate Results: This stage is about presenting the results and automating the analysis. The key results are communicated to all stakeholders, which helps the user decide whether the project is a success or a failure based on the inputs from the model. Actionable insight is a key outcome that shows how data science can deliver predictive and, later, prescriptive analytics.
!!BASICS OF DATA ANALYTICS!!
Data Analytics (DA) is the process of examining data sets to draw inferences about the information they contain, increasingly with the aid of specialized systems and software. Data analytics is used in many industries to allow companies and organizations to make better business decisions, and in the sciences to verify or disprove existing models or theories. The following points clarify data analytics:
1. Define the question or goal behind the analysis: what is the user trying to discover?
2. Collect the right data to answer this question.
3. Perform data cleaning/scrubbing to improve data quality, remove unnecessary data and prepare it for the analysis process: the right data, in the right format.
4. Manipulate the data using suitable tools or techniques.
5. Analyse and interpret the data using statistical tools and techniques, finding correlations, patterns, trends and outliers in the data.
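As a minimal sketch of the model building step (Step 4 above), the example below splits a dataset into training and test portions, fits a model on the training set and evaluates it on the held-out test set; it assumes scikit-learn, and the dataset and model choice are illustrative.

```python
# A minimal sketch of Step 4 (model building): split the data,
# train on the training set, evaluate on the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Distribute the dataset into training and test portions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Performance is evaluated on data the model has never seen.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```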
!!CHALLENGES OF DATA SCIENCE TECHNOLOGY!!
Data science is broadening its branches all over the world, but it involves many challenges that slow data scientists down while dealing with data. Some of the major challenges are:
•A high variety of information and data is required for accurate analysis.
•An adequate data science talent pool is not available.
•Management does not provide financial support for a data science team.
•Unavailability of, or difficult access to, data.
•Data science results are not effectively used by business decision makers.
•Explaining data science to others is difficult.
•Privacy issues.
•Lack of a significant domain expert.
•If an organization is very small, it cannot have a data science team.
!!Difference between EDA and IDA!!
IDA means Initial Data Analysis. IDA focuses on: checking the assumptions required for model fitting; hypothesis testing; handling missing values; and making transformations of variables as needed. IDA is part of EDA. EDA is a method/philosophy for data analysis that employs a variety of (mostly graphical) techniques. The goals of EDA are to: maximise insight into a data set; uncover underlying structure; extract important variables; detect outliers and anomalies; test underlying assumptions; develop parsimonious models; and determine optimal factor settings. A small EDA sketch appears at the end of this section.
!!Principles of Data Visualization!!
1. Select the right graph/chart: Some charts are well intended but take more time to interpret, so choose the graph based on the kind of data and the message to be conveyed. Do not use a variety of graphs just for the sake of it; sometimes using numbers is better than graphs or charts. For example, display a pie chart for percentages instead of a bar graph.
2. Form must follow function: An intuitive design is more important than appealing charts, and graphs should convey the meaning of the data in an easy-to-understand manner.
3. Balance the design: Visual elements should be evenly distributed across plots, charts, colour, text, shape, and space. A symmetrical visual layout gives the best visualization of data.
4. Focus on the key areas: Ensure that key areas are highlighted so that the user notices them quickly.
5. Keep visualization simple: The visualizations displayed must be easy to understand. Remove unwanted information and avoid confusion; the goal of data visualization is simplicity.
6. Incorporate interactivity: Interactivity should be built into charts and graphs using data visualization tools.
7. Use patterns: Use visualization tools to display patterns in the data, marking similar patterns with similar colours.
8. Compare aspects: Display the same data in different charts, arranged horizontally, vertically or both, so that different aspects can be compared.
!!BENEFITS OF DATA VISUALIZATION!!
Data visualization helps business stakeholders analyze reports regarding sales, marketing strategies, and product interest. Based on the analysis, they can focus on the areas that require attention to increase profits, making the business more productive. Different visualization techniques are used to take quick action on the problem under study and to take the necessary actions for business growth. Visualization helps business users recognize new patterns and find errors in the data; these patterns help the users pay attention to areas that indicate progress, which in turn drives the business ahead to achieve its goals. Some visualization techniques are used to understand the story behind the data. As decision making becomes easier, visualization is also used for exploring different business insights and for grasping the latest trends in knowledge through data.
!!ADVANTAGES AND DISADVANTAGES OF EDA!!
Advantages of EDA: It gives valuable insights into the data. Visualization is an effective tool to detect outliers. It helps with feature selection.
Disadvantages of EDA: If not performed properly, EDA can be misleading. EDA is not effective when dealing with high-dimensional data.
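To close the section, here is a minimal EDA sketch, assuming pandas and matplotlib; the column name and values are hypothetical. It shows the kind of quick numerical summary and graphical outlier check described above.

```python
# A minimal EDA sketch, assuming pandas and matplotlib;
# the column name and data are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"income": [32, 35, 30, 38, 120, 33, 36, 31, 34, 37]})

# Maximise insight: quick numerical summary of the data set.
print(df.describe())

# Detect outliers and anomalies visually with a boxplot,
# and look at the underlying structure with a histogram.
fig, axes = plt.subplots(1, 2)
df.boxplot(column="income", ax=axes[0])
df["income"].hist(ax=axes[1])
plt.show()
```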