Introduction to Data Analytics

1. Introduction

Data has changed the face of our world over the last ten years. We live in a data-driven world, the "Information Age", today. Much of our daily interaction now occurs via computers and smartphones. While we use our devices to communicate with friends, update social media and surf the web, they are also quietly collecting data about our activity. Whatever we do generates some data. Whether we send an email, "like" a friend's post, reply to a message, post a photo on Instagram, search for something using a search engine like Google, shop for items online, use a map for navigation, carry out a transaction, or book a railway/flight/movie ticket, it creates some data. Thus, huge volumes of data are generated every second, not only by individuals but also by companies and organizations.

On the other hand, the last 20 years have seen steeply decreasing costs to gather, store, and process data, creating an even stronger motivation for analyzing this data in order to make sense of it. Businesses, organizations, governments and other bodies can benefit from studying and analyzing this data. It can help to paint a picture of users' interests, habits, connections and more, providing valuable insight that can be utilized in a variety of ways. This knowledge could help us understand our world better and help businesses and organizations make better decisions.

For example, your internet cookies are a digital representation of the places you've visited on the internet, when you visited them, what you did there and for how long. Those cookies can be scraped (or collected), then sorted and analyzed by a data analyst in order to understand your browsing habits. Those patterns can then be used to predict your future interactions, including products you might buy, websites you might visit or people you might be interested in. Businesses and organizations in all sectors are increasingly depending on data to make critical business decisions, like which products to make, which markets to enter, what investments to make, or which customers to target.

2. Concept of Data Analytics

The term "data analytics" can be described in many ways:

- Data Analytics is the process of exploring the data from the past to make appropriate decisions in the future by using valuable insights.
- Data Analytics refers to the process and practice of analyzing data to answer questions, extract insights, and identify trends.
- Data Analytics is the science of analyzing data to convert information to useful knowledge.
- Data Analytics is the process of exploring and analyzing large datasets to find hidden patterns, unseen trends, discover correlations, and derive valuable insights to make business predictions.

Data analyst tasks

The role of a data analyst can be defined as someone who has the knowledge and skills to turn raw data into information and insight, which can be used to make business decisions. The data analyst carries out the following tasks:

i. Identifying the informational needs.
ii. Acquiring data from primary and secondary sources.
iii. Developing and implementing databases and data collection systems.
iv. Filtering, cleaning and reorganizing the data for analysis.
v. Identifying, analyzing, and interpreting trends and patterns that can be translated into actionable insights.
vi. Presenting findings in an easy-to-understand way to support data-driven decisions.
Example: How Google uses data analytics

Google is a company that uses data analytics in a big way. Google collects massive amounts of user data through its Chrome browser and Gmail products, and it also receives billions of search requests every day on its search engine. The company uses that data to train its algorithms, getting better at fundamental search tasks such as parsing sentences, correcting misspellings and understanding what a user is trying to search for. Google uses historical and current search terms to recommend search suggestions to users before they finish typing, which provides a useful autocomplete service. Google uses tools and techniques to understand our requirements based on several parameters like search history, location, trends etc.

3. Data Analysis vs. Data Analytics

Data analysis and data analytics are terms that are confused by many people. They are often treated as interchangeable, but they hold different meanings. While data analysis and data analytics both work on data, the main difference lies in what is done with the data. Literally, "analysis" is the detailed examination of the elements or structure of something. On the other hand, "analytics" is the systematic computational analysis of data by applying various tools and techniques such as statistics, machine learning etc. The following table lists some of the important differences between the two terms.

| Data analysis | Data analytics |
|---|---|
| Data analysis refers to hands-on data exploration and evaluation. | Data analytics is a broader term and includes data analysis as a subcomponent. Analytics defines the science behind the analysis. |
| Data analysis refers to the process of examining, transforming and arranging data in order to study its individual parts and extract useful information. | Data analytics encompasses the complete management of data. This not only includes analysis, but also data collection, organization, storage, and all the tools and techniques used. |
| Data analysis is a process involving the collection, manipulation, and examination of data for getting a deep insight. | Data analytics is taking the analyzed data and working on it in a meaningful and useful way to make well-versed business decisions. |
| Analysis looks backwards, providing a historical view of what has happened. Data analysis helps in understanding the data and provides the required insights from the past on what has happened so far. | Data analytics models the future or predicts a result. Data analytics uses data and tools to make business decisions; it is the process of exploring the data from the past to make appropriate decisions in the future by using valuable insights. |
| Data analysis is a subset of data analytics and refers to specific actions. It is concerned with individual analysis or analysis steps, and is smaller in scale and scope. | Data analytics is not concerned with individual analysis or analysis steps, but with the entire methodology and its mechanisms. |
| The steps are: data gathering, data validation, data interpretation, data analysis, results. | Data analytics is broader or more extensive in scale and scope. The steps are: identifying the problem, finding the data, data filtering, data validation, data cleaning, data analysis, data visualization, inference, prediction etc. |
| Data analysis helps us to understand the data by questioning it and to collect useful insights from the information already available. | Data analytics is a wide area involving handling data with a lot of necessary tools to produce helpful decisions with useful predictions for a better output. |
| Tools used: Excel, OpenRefine, RapidMiner, KNIME, Google Fusion Tables, NodeXL etc. | Tools used: Tableau, R analytics, Python, Google Analytics etc. |

4. Types of Analytics

Analytics is not concerned with individual analysis or analysis steps, but with the entire methodology. Based on their application at various stages of data analysis, analysis techniques are divided into six classes of analysis and analytics: descriptive analysis, diagnostic analytics, predictive analytics, prescriptive analytics, exploratory analysis, and mechanistic analysis.

4.1 Descriptive Analysis

Descriptive analysis "describes things". Descriptive analysis, also known as descriptive analytics or descriptive statistics, is the process of using statistical techniques to describe or summarize a set of data. It is a commonly used form of data analysis whereby historical data is collected, organized and then presented in a way that is easily understood. Descriptive analysis is about "what is happening in the data". It is a method for quantitatively describing the main features of a collection of data. It is a glimpse into the past: it is descriptive because it describes the things that took place. It takes a glance at data and analyzes previous events and situations to gain insight on how to become more efficient. Typically, it is the first kind of data analysis performed on a dataset.

Descriptive analytics uses simple mathematics and statistical tools, such as averages, mean, variance etc., rather than the complex calculations necessary for predictive and prescriptive analytics. Visual tools such as line graphs, pie charts and bar charts are used to present the output of descriptive analysis.

Two important aspects of descriptive analysis are:

i. Data description: Data analysts are often faced with a large amount of raw data that needs to be organized and summarized before it can be analyzed. In order to make sense of the data and extract patterns, the data must be presented in an organized summary. This is where descriptive analysis comes into the picture: it facilitates analyzing and summarizing the data and is thus an important process in data science. For example, the weather department collects temperature, humidity, rainfall and other weather-related data at periodic intervals daily. If this data is presented to the user as a table of values, it will be very difficult to interpret. Instead, if the same data is presented in the form of minimum, maximum and average values, it becomes much easier to understand (see the sketch after these two points).

ii. Data interpretation: Data cannot be properly used if it is not correctly interpreted. This requires appropriate statistics. Several statistical measures can be applied to data. Each of these measures is a summary that emphasizes certain aspects of the data and overlooks others. Together, they provide the information we need to get a full picture of the world we are trying to understand. The data analyst must decide which aspects of the data to highlight, just as an artist painting a scene must first decide which features to highlight.
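To make the data description idea concrete, here is a minimal sketch using pandas; the temperature readings below are made up for illustration:

# Summarizing raw readings instead of presenting a table of values (illustrative data).
import pandas as pd

temps = pd.Series([31.2, 29.8, 33.5, 30.1, 32.4, 28.9, 34.0], name="temperature")

print("Minimum:", temps.min())
print("Maximum:", temps.max())
print("Average:", round(temps.mean(), 2))
print(temps.describe())  # count, mean, std, min, quartiles and max in one summary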
Descriptive analysis makes use of the following to describe the data.

i. Variables: Before we process or analyze any data, we have to capture and represent it. This is done with the help of variables. A variable is a label we give to our data, for example, the name and age of a person. Here, "age" is a variable and it is of numeric type. Variables can be of the following types:

a. Categorical variable: A categorical or discrete variable is one that has two or more categories (values). These variables cannot be quantified. In other words, you cannot perform arithmetic operations on them, like addition or subtraction, or logical operations like "equal to" or "greater than". A categorical variable can be either nominal or ordinal.

Nominal: A nominal scale is a naming scale, where variables are simply named or labeled, with no specific order.
Example: Hair color: Brown, Black, Blonde, Red. Course: Science, Arts, Commerce.

Ordinal: Ordinal variables have a specific order. The ordinal scale is a type of measurement scale that deals with ordered variables.
Example: Feedback: Poor, Satisfactory, Good, Excellent. Shirt size: Small, Medium, Large, Extra large.

b. Interval variable: An interval variable is a numeric variable. It is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. The scale provides a degree of difference along with the rank and order of the values. It does not have a true zero point, i.e., it doesn't have a meaningful zero point.
Example: Temperature. Here a temperature of 0 degrees does not mean there is no temperature at all.

c. Ratio variable: Like interval data, it is ordered/ranked and the numerical distance between points is consistent (and can be measured). Here, there is a true zero point which reflects an absolute zero.
Example: Weight, height, length etc.

Variables are also classified as dependent and independent variables. Often, we connect multiple variables or use one set of variables to make predictions about another set. A variable that is not controlled or affected by other variables is called an independent variable. A variable that depends on other variables is called a dependent variable. In the case of a prediction problem, an independent variable is also called a predictor variable and a dependent variable is called an outcome variable. Suppose we have employee experience and salary data in two columns. Here, experience is an independent variable and salary is a dependent variable.
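The nominal/ordinal distinction described above can be made explicit in code. A minimal sketch with pandas (the category values are illustrative):

# Representing nominal vs. ordinal categorical variables with pandas.
import pandas as pd

# Nominal: no inherent order among the categories
course = pd.Categorical(["Science", "Arts", "Commerce", "Arts"])

# Ordinal: categories carry a specific order, which enables comparisons
feedback = pd.Categorical(
    ["Good", "Poor", "Excellent", "Satisfactory"],
    categories=["Poor", "Satisfactory", "Good", "Excellent"],
    ordered=True,
)

print(course)                     # unordered categories
print(feedback.min())             # "Poor": min/max are meaningful only when ordered
print(feedback > "Satisfactory")  # element-wise comparison works only when ordered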
4, MISHRA: Histograms plot values of observations on the horizontal axis, with a bar showing how many times each value occurred in the dataset, b, SPIWRRAPATA pie chart is used for depicting categorical data 50 40 5 en CONE ee alias Cale 30 © 20 1.0 0 20 30 40 50 60 70 80 90 100 Age Figure 1.1: Histogram and Ple chart We often work with numeric data and we need to understand how those numbers are spread, A distribution indicates the spread of the data values. Normal Distribution: In ideal world, data would be distributed symmetrically around the center of all scores. Thus, if we drew a vertical line through the center of a distribution, both sides should look the same. This is normal distribution which is characterized by a bell-shaped curve. There are two ways in which a distribution can deviate from normal: * Lack of symmetry (called sK@W) © Pointiness (called Kui#t08i8)> 18 /wition Data Analytics Normal distribution 1900 1200 Frequency 400 600 200 200 0.2 0.4 06 08 1.0 X-value Figure 1.2: Normal Distribution iii. /Measures'of Centrality: If we have a large set of values, it is difficult to understand the set just by looking at it. One value.can tell us enough about a distribution. We can calculate where the “center” of a frequency distribution lies, which is also known as the central tendency. The central tendency is defined as the statistical measure that identifies 2 value that is capable of representing the entire distribution. The main goal of using measures of central tendency is to get an accurate description of the entire data. There are three measures commonly used: a. (MA: Mean is the most often used measure of central tendency of continuous dat as well as a discrete dataset. It is simply the average of all the values in the set. 6. Medi? The median is the middle score for a dataset that has been sorted according to the values of the data. With an even number of values, the median is calculated as the average of the middle two data points. ¢. — (Moidé? Mode is the most frequently occurring value in a dataset. iv. Wispersion ofa Distributions Distributions come in all shapes and sizes. Simply look"é at a central point (mean, median, or mode) may not help in understanding the actual a of a distribution. Therefore, we often look at the spread, or the dispersion, © distribution. The following are some of the most common measures of dispersion.Za Introduction to Data Analytics SION 1-9 a. IRWAGEThe simplest description of variation is a straightforward measure is called range which is the difference between the largest and the smallest value in the distribution, ‘The problem with this measure is that the presence of extreme values called outliers can affect the result, b. _tirenqumdril@RGAGE: One way to overcome the range’s disadvantage is to calculate it after removing extreme values. One method is to cut off the top and bottom one- quarter of the data and calculate the range of the remaining middle 50% of the scores. This is known as the interquartile range. ‘$VarianeeThe variance is a measure used to indicate how spread out the data points are, To measure the variance, the common method is to pick a center of the distribution, typically the mean, then measure how far each data point is from the center, If the individual observations vary greatly from the group mean, the variance is big and vice versa. 4. SranddavdlDD@viation-The standard deviation is also a measure of how spread out the “ numbers are. It is the square root of the variance. 
4.2 Diagnostic Analytics

Diagnostic analytics is a form of advanced analytics that examines data or content to answer the question, "Why did it happen?" Diagnostic analytics is used for discovery, or to determine why something happened. Sometimes this type of analytics is also known as causal analysis, since it involves at least one cause (usually more than one) and one effect. This allows a look at past performance to determine what happened and why. The result of the analysis is often referred to as an analytic dashboard. There are various techniques available for diagnostic or causal analytics, such as drill-down, data discovery, data mining and correlations. Among them, one of the most frequently used is correlation.

Correlation

Correlation is a statistical analysis that is used to measure and describe the strength and direction of the relationship between two variables. Strength indicates how closely two variables are related to each other, and direction indicates how one variable would change its value as the value of the other variable changes. Correlation is a simple statistical measure that examines how two variables change together over time.

Correlation can be negative or positive. A positive correlation means that the variables move in the same direction; a negative correlation means that the variables move in opposite directions. For example:

Positive correlation: As the temperature rises, the sale of fans also increases.
Negative correlation: The more days a student is absent, the lower the CGPA.

The most common correlations used in statistics are the Pearson r correlation, the Spearman correlation and the Kendall correlation.
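All three coefficients are available through the pandas corr method; a minimal sketch with made-up temperature and fan-sales figures:

# Measuring the strength and direction of a relationship (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "temperature": [28, 30, 33, 35, 38, 40],
    "fan_sales":   [110, 130, 150, 170, 210, 240],
})

# Pearson measures the linear relationship; Spearman and Kendall are rank-based.
print(df["temperature"].corr(df["fan_sales"], method="pearson"))
print(df["temperature"].corr(df["fan_sales"], method="spearman"))
print(df["temperature"].corr(df["fan_sales"], method="kendall"))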
4.3 Predictive Analytics

Predictive analytics aims to answer the question: "What will happen?" It is the use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data. It uses past data to create a model to find out what will happen in the future. These analytics are about understanding the future using the data and the trends we have seen in the past, as well as emerging new contexts and processes. Examples include predicting house prices, the number of Covid cases, the weather, and so on.

Predictive analytics provides companies with actionable insights based on data. Such information includes estimates about the likelihood of a future outcome. It is important to remember that no statistical algorithm can "predict" the future with 100% certainty, because the foundation of predictive analytics is based on probabilities.

Predictive analytics is done in stages:

i. Data collection and cleaning.
ii. Identify relationships between variables (hindsight). Various plots like a scatterplot give a good visualization of the relationship.
iii. Obtain insight from hindsight. In this step, we confirm the existence of such relationships in the data. This is where regression comes into play. From the regression equation, we can confirm the pattern of distribution inside the data.
iv. Finally, based on the identified patterns, or insight, we can predict the future, i.e., foresight.

4.4 Prescriptive Analytics

Prescriptive analytics is the area of business analytics dedicated to finding the best course of action for a given situation. This goes beyond predicting future outcomes by suggesting actions based on the predictions, along with the implications of each decision. The steps are:

i. Prescriptive analytics first starts by analyzing the situation (using descriptive analysis).
ii. Find connections among various parameters/variables, and their relation to each other, to address a specific problem, more likely that of prediction.
iii. Analyze potential decisions, the interactions between decisions, the influences upon these decisions, and as an outcome, prescribe an optimal course of action in real time.
iv. Suggest options for taking advantage of a future opportunity or mitigating a future risk, and illustrate the implications of each.

For example, in healthcare, we can manage the patient population in a better way by using prescriptive analytics to measure the number of patients who are clinically obese, then add filters for factors like diabetes and LDL cholesterol levels to determine where to focus treatment.

4.5 Exploratory Analysis

Exploratory Data Analysis (EDA) is an important aspect of any data analysis task. Often when working with data, we may not have a clear understanding of the problem or the situation, and yet we may have to derive some insights. The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Exploratory analysis is an approach to analyzing datasets to find previously unknown relationships. The goal of EDA is to find out the underlying structure of the data, summarize the main characteristics and visualize trends. Often such analysis involves using various data visualization approaches. Plotting the data in different forms could provide us with clues regarding what we may find or want to find in the data. Such insights can then be useful for defining future studies/questions, leading to other forms of analysis. Exploratory analysis consists of a range of techniques such as univariate and multivariate visualizations using box plots, scatter plots etc., dimensionality reduction, and so on.

4.6 Mechanistic Analysis

Among all the analysis techniques, mechanistic is the least common type. Mechanistic analysis involves understanding the exact changes in variables that lead to changes in other variables for individual objects, while excluding external influences. For example, user data at sites like Facebook and Instagram can be used by analysts for understanding user perception, like what users are doing and what motivates them. This information can benefit commercial ads, where a particular group of users is targeted to sell them things. It is also helpful for application developers to understand users' responses and habits and make changes in products accordingly. The relationship between two variables is often explored using regression.
Regression

In statistical modeling, regression analysis is a process for estimating the relationships among variables. Correlation by itself does not provide any indication of how one variable can be predicted from another; regression provides this crucial information. Regression analysis is a way of predicting an outcome variable from one predictor variable (simple linear regression) or from several predictor variables (multiple linear regression).

Regression analysis has a number of applications. It can be used in business to evaluate trends and make estimates or forecasts, which is helpful for things like budgeting and planning. Medical researchers often use linear regression to understand the relationship between drug dosage and patient parameters like blood pressure. Many businesses use linear regression models to predict how stocks will perform in the future, by analyzing past data on stock prices and trends to identify patterns. In the business domain, regression can be used to generate insights on consumer behavior, which is helpful for things like targeted marketing and product development. Linear regression can also be used to identify relationships between different variables; for example, you could use linear regression to find out how temperature affects the sales of fans.
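A minimal sketch of simple linear regression with scikit-learn, reusing the temperature/fan-sales idea (the numbers are made up for illustration):

# Simple linear regression: predict fan sales from temperature (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[28], [30], [33], [35], [38], [40]])  # predictor (temperature)
y = np.array([110, 130, 150, 170, 210, 240])        # outcome (fan sales)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # fitted slope and intercept
print(model.predict([[42]]))          # foresight: estimated sales at 42 degrees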
5. Mathematical Models - Concept

Computer professionals are familiar with the term "algorithm". We use an algorithm to solve a particular computational problem; for example, we know sorting and searching algorithms. An algorithm is implemented in code, run on data, and gives an output. Thus, in traditional programming, we are explicitly programming the computer to carry out some task. The main goal of machine learning is to give computers the ability to learn by themselves without being explicitly programmed. Machine learning is a method that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data (if trained), identify patterns and make decisions with minimal human intervention.

What is a machine learning model?

Modeling is the process of encapsulating information into a tool which can forecast and make predictions. A "model" in machine learning is the output of a machine learning algorithm that is trained using data. It is considered the mathematical representation of a real-world process. A model represents what was learned by a machine learning algorithm. Machine learning models are output by algorithms and are comprised of model data and a prediction algorithm. The best analogy is to think of the machine learning model as a "program": the machine learning model "program" is comprised of both data and a procedure for using the data.

Traditional programming: Data (input) + Program -> Output
Machine learning: Data (input) + Output -> Program
Figure 1.3: Traditional programming vs. Machine Learning

A model is generated in the following manner:

i. Data scientists train a model over a set of data, giving it the required algorithm to learn from the data.
ii. First, the training data must include the correct answer, also known as the "target attribute". In some cases, the target is not specified; in such cases the algorithm must find the target attributes on its own.
iii. Next, the learning algorithm seeks out patterns in the training data that map the relevant data attributes to the correct answer, also known as the target.
iv. The learning algorithm discovers patterns within the training data, and it outputs a machine learning model which captures these patterns.
v. Once you have trained the model, you can use it on data that it hasn't seen before, and make predictions about those data.

For example, suppose you want to build an application that can recognize a user's emotions based on their facial expression. You can train a model by providing it with images of faces that are each tagged with a certain emotion, and then you can use that model in an application that can recognize any user's emotion.

Example

i. Consider the linear regression algorithm and the resulting model (see the sketch after this example). The model is comprised of a vector of coefficients (data) that are multiplied and summed with a row of new data taken as input in order to make a prediction (prediction procedure).
   a. Algorithm: Find the set of coefficients that minimizes error on the training dataset.
   b. Model:
      - Model data: Vector of coefficients.
      - Prediction algorithm: Multiply and sum the coefficients with the input row.
ii. The decision tree algorithm results in a model comprised of a tree of if-then statements with specific values.
iii. The neural network algorithms result in a model comprised of a graph structure with vectors or matrices of weights.
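The "model data plus prediction procedure" view of linear regression can be written out directly; the coefficient values here are made up:

# A linear regression "model" reduced to its two parts (illustrative values).
import numpy as np

# Model data: the coefficient vector and intercept learned during training
coefficients = np.array([2.5, -1.0, 0.75])
intercept = 4.0

# Prediction algorithm: multiply and sum the coefficients with an input row
def predict(row):
    return np.dot(coefficients, row) + intercept

new_row = np.array([1.0, 2.0, 3.0])  # an unseen input
print(predict(new_row))              # 2.5*1 - 1.0*2 + 0.75*3 + 4 = 6.75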
Model evaluation is the Process through which Welquaitify the quality/ofjaisystemsipredictions) To do this, we measure the newly trained model performance on a new and independent dataset. Model evaluation performance metrics are used to find out: ~ i, How well the model is performing?20 1-16 / sion Data Analytics ii, Is the model accurate enough? iii, Will a larger training set improve the model's performance? iv, _ Is the model under-fitting or over-fitting? Evaluating model performance with the data used for training is not acceptable in data science. because it can easily generate overfitted models. To evaluate a model, several metrics have been defined. The following chart lists various metrics used for Classification and Regression models, Model Evaluation ics Classification Regression © Confusion Matrix © Mean Absolute Error (MAE) e@ Accuracy Mean Squared Error (MSE) © Precision and Recall © Root Mean Squared Error (RMSE) © =AUC-ROC Curve © Squared © F-score ® Adjusted A-Squared © Log Loss © Gini Coefficient 5.2 Metrics for Evaluating Classifiers The following metrics are used for evaluating classification models: Confusion Matrix (Classification is defined as the process of recognition, understanding, and grouping of oP For example, classifying emails as spam ° identifying user emotion. ett: performs classification" ‘0 relevant groups or categories called not spam, classifying feedback as positive, negative or neutral, There are four different outcomes that can occur when the model predictions: r ass i, TAIBOSENESETPYoccur when the model predicts that an observation belongs ° ac and in reality it belongs to that class.Introcuction to Data Analyth.s wision\ 1-17 ii, TRO NGRRVEMMMNG occur when the model predicts that an observation does not belong to a class and in reality it does not belong to that class. jij, SRASEPGSTVENENPYPOCCUr When the model predicts that an observation belongs to a class when in reality it does not, Also known as a type [ error, iv, HRRSEINEGREKENEADY occur when the model predicts an observation does not belong to a class when in reality it does, Also known as a type I error (A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target value) in the is NXN, where Nis the number of target values (classes). The following table displays a 2 x 2 confusion matrix for two cl a) The matrix 1e8 (Positive and Negative). Predicted class Positive Negative Positive True Positive | False Negative Actual P FN class False Positive | True Negative Negative FP 1N From the confusion matrix, several metrics can be calculated as shown in the table below. Predicted class Positive Negative eave P FN Sensitivity | TP/(TP+FN) fetes FP 1N Specificity | TN(TN+FP) Negative isi . /, Precis ales Accuracy te (TP+TN)/(TP+TN+FP+FN) > TPTP+FP) | TNATN+FN) The various terms are: 1 (AeeiirweyeyThe accuracy of the classifier is the ratio of correct predictions to the total number of predictior eke TP+7N Accuracy = TPyIN+ FP +Nn 1-18 /whion Data Analytics It can be applied to most problems but is not very useful when the dataset is unbalanceg, For example, if we are detecting frauds in bank data the ratio of fraud to non-fraud cay, can be 1:99. Suppose the model predicts all cases to be non-fraud. In such cases, accuracy is used, the model will turn out to be 99% accurate, The 99% accurate modg will be completely useless because it has missed the most important fraud case. 
Thi why accuracy is a false indicator of the model’s health. Ui, (PRRGISION or Positive\Predictive|Value? Precision measures the ratio of positive valu correctly predicted out ofall cases predicted as positive. is i ee TP recsion= Ps al Precision is the metric used to identify the correctness of classification. The greater the fraction, the higher is the precision, which means better is the ability of the model w correctly classify the positive class. ‘Sensitivity or Recall: This gives the ratio of positive cases correctly identified out of al cases that were actually positive. TP Recall = 7P+EN Going back to the fraud problem, the recall value will be very useful in fraud cass ‘because a high recall value will indicate that a lot of fraud cases were identified out of th total number of frauds. ‘Specificity: The proportion of actual negative cases which are correctly identified to a! cases that are actually negative. TN Specificity = TN+P iv. IN@Gative)PFEAiCtiveNWallie? The proportion of negative cases that were come! identified to all the cases predicted negative. 14) Negative Predicted Value Let us consider an example of a classifier predicting the patient's disease. ‘There are (wo oe predicted classes: "DY" (class 1) and "DN" (class 0). Here, "DY" would mean ak rs disease, and "DN" would mean they don't have the disease. 20 patients were i, on Presence of that disease, Out of those 20 patients, 7 actually have the disease an aN = IN+FNintroduction to Data Anaytcs wisn 1-19 (column 2). The predicted output of the model is given in column 3 of the table. Based on the actual class and predicted class, we mark each entry as TP/TN/FP/FN as follows: From the table above, TP = 5, TN = 9. FP True Positive (TP): When the actual class is positive (1) and your machine learning model also predicts that class as positive (1). ‘True Negative (TN): When the actual class is negative (0) and your machine learning model also predicts that class as negative (0). False Positive (FP): When the actual class is negative (0) but your machine learning model predicts that class as positive (1). False Negative (FN): When the actual class is positive (1) but your machine learning model predicts that class as negative (0). Patient No. | Actual class | Predicted class | Confusion Matrix 1 1 1 TP 2 0 0 TN 3 1 oO FN 4 oO oO ™N 5 oO 1 FP 6 oO 0 TN Te [ee fies) FN 8 0 0 TN 9 0 1 FP. [Oka |e 1 TP 1 0 0 TN io 0 0 TN 13 1 1 TP. esl eae [none a easter 0 1 FP. in 1 1 i 7 ous ia m ry 0 1 FP. ts : a 20 0 u » = 4, FN =2. The confusion matrix is as shown.De 1-20 I ‘wSion Data Analytics Sick people correctly Sick people incorrectly Predicted as sick by the a eee by mene Predicted class Rees Positive Negative Positive ere: Negative Healthy people a Healthy people correctly incorrectly predicted as predicted as not sick by sick by the model the model + Actual class Based on the confusion matrix, we calculate the model metrics. Predicted cl N=20 Positive Negative pease te ‘Actual | Positive | TP=5 FN=2 Sensitivity | 5/7 = 0.71 ©1888 | Negative | FP =4 TN=9 Specificity | 9/13 = 0.69 zi Negative hen eee Value | Accuracy = 14/20 = 0.7 "| 59-055] 9/11 =0.81 + F1 Score The Fl-score combines the precision and recall ofa classifier into a single metric by aking thet harmonic mean. It is primarily used to compare the performance of two classifiers. Suppose ta! classifier A has @ higher recall, and classifier B has higher precision. 
In this case, the F-scO®® for both the classifiers can be used to determine which one produces better results. The go“! the Fl score is to combine the precision and recall metrics into a single metric. At the same tim the F1 score has been designed to work well on imbalanced data,om Introduction to Data Analytics wsion\ 1-21 The Fl-score of a classification model is calculated as follows: Precision * Recall FI score = 2 * Precision + Recall A model will obtain a high FI score if both Precision and Recall are high A model will obtain a low FI score if both Precision and Recall are low. ‘A model will obtain a medium F1 score if one of Precision and Recall is low and the other is high. In the confusion matrix considered earlier, the precision was 0.71 and recall was 0.55. Thus, the FI score is calculated as: 2*(0.71*0.55)/(0.71+0.55) = 0.62 Python Implementation Scikit-leam (sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction. The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance. The sklearn.metrics.confusion_matrix module computes the confusion matrix to evaluate the accuracy of a classification, The sklearn.metrics .ConfusionMatrixDisplay function can be used to visually represent a confusion matrix. The following table lists the classification metrics along with the corresponding sklearn function. Metric Sklearn function ‘Accuracy. metrics .accuracy_score Precision metrics. precision_score Recall metrics. recall_score Fi score metrics-f1_score The following Python program creates and displays the confusion matrix for our patient disease Prediction example. It also calculates and displays various metrics. fImporting the libraries from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, c lassification_report Actual values : Yiact = [1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0) #Predicted values Ypred = [1,0,0,0,1,0,0,0,1,1,0,0,1,0,1,1,0/1,1,0]| { 1-22 /wsion Data Analytics #Generate the confusion matrix cm=confusion_matrix(y_act, y_pred) print ("Confusion matrix =") print (cm) #Calculate and print the metrics print ("Recall =", metrics.recall_score(y_act, y_pred)) print ("Precision =", metrics.precision_score(y_act, y_pred)) print ("Accuracy =", metrics ‘acy_score(y_act, y_pred)) print ("Fl score =", metrics.fl_score(y_: #Print the confusion matrix — #The columns will show the instances predicted for each class, #The rows will show the actual number of instances for each class disp = ConfusionMatrixDisplay (confusion_matrix=cm) disp.plot() | Precision = 0. SSSBSESSESESSSSE | | | Predicted label Class Imbalance Classification is a predictive modeling problem that involves assigning a class label to ett observation. When working on classification predictive modeling problems, we must a training dataset. A taining dataset is a number of examples (belonging to each oe include both the input data (¢.g., measurements) and the output data (e.g, class label). number of examples that belong to each class may be referred to as the class distribution.th Introduction to Data Analytics wiion \ 1-23 Most machine learning algorithms work best when the number of samples in each class are about equal. This is because most algorithms are designed to maximize accuracy and reduce errors. 
However, if the data set is imbalanced then in such cases, you get a pretty high accuracy just by predicting the majority class, but you fail to capture the minority class Imbalanced classification refers to a classification predictive modeling problem where the number of examples in the training dataset for each class label is not balanced. The distribution can vary from a slight bias to a severe imbalance where there is one example in the minority class for hundreds, thousands, or millions of examples in the majority class or classes. For example, in fraud detection, there may be one fraud case in thousands, or if the training dataset has measurements of 100 flowers of 2 species, out of which 90 examples are of one flower species and only 10 examples are of the second flower species. ~ Class imbalance appears in many domains, including: i, Fraud detection ii, Spam filtering iii, Disease screening There are several techniques for dealing with imbalanced classification such as oversampling, under-sampling, ensemble learning, boosting etc. 6. ROC (Receiver-Operating Characteristic) Curves In a classification problem, the model may predict the class values directly (either labels or 0/1 format), Alternately, it can be more flexible to predict the probabilities for each class instead. For example, the model may output a value 0.47 for some input. So in such a case, how do we interpret the class? Does it belong to positive class or negative class? To interpret the Probability, we must set a threshold. This gives us more control over the result. We can determine our own threshold to interpret the result of the classifier. ‘When making a prediction for a binary or two-class classification problem, there are two types of errors that we could make. i, False Positive: Predict a class when there was no class. fi, False Negative: Predict no class when in fact there was a class.> 1-24 /sion Date Anaytes By predicting probabilities and calibrating a threshold, a balance of these two values can chosen by the operator of the model, This threshold can be adjusted to tune the behavior of i, model for a specific problem. How do we know which threshold would give us more accurate logistic regression mode|? 5, for that we will be using the ROC curve and the Area Under ROC Curve (AUC). For differen threshold values we will get different TPR and FPR. So, in order to visualize which threshold best suited for the classifier we plot the ROC curve. ‘An ROC curve (or Receiver-Operating Characteristic curve) is a plot that summarizes the performance of a binary classification model. It is a metric used to measure the performance of classifier model. The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is on the x-axis, i., it depicts the rate of true positives with respect to the rate of false positives. Hence, it highlights the sensitivity of the classifier model. The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields. This curve plots two parameters: i ii, ‘True Positive Rate (TPR): is a synonym for recall and is therefore defined as follows: TPR / Recall / Sensitivity = wa False Positive Rate (FPR): is defined as follows: FPR = I - Specificity = Ty pp The following image shows an ROC curve TP vs. FP rate at one decision threshold TP vs. FP rate at another threshold 9 FP Rate 1 Figure 1.4: ROC curveon Introduction to Data Analytics wson\ 1-25 The shape of the curve contains a lot of information. 
The closer an ROC curve is to the diagonal line, the less accurate the model is. Smaller values on the x-axis of the plot indicate lower false positives and higher true negatives. Larger values on the y-axis of the plot indicate higher trie positives and lower false negatives. So, the choice of the threshold depends on the ability to balance between False positives and False negatives The following figure shows the ROC curves of two classification models M, and M; Thus, M; is more accurate here since M) is closer to the diagonal. 1.0 True positive rate ° ° ° = & & 2 rs 0.0 CO en Oe ass04 70.6. 2,08: 4 1.0 False positive rate Figure 1.5: ROC curve for models M1 and M2 ikit-learn We can plot a ROC curve for a model in Python using the roc_curve () function, Advantages i, It is a simple graphical representation of TPR vs. FPR. It lets you see the tradeoff between sensitivity and specificity for all possible thresholds Tather than just one that was chosen by the modeling technique. iii, It aids in finding a classification threshold that suits our specific problem. iv. Allows a simple method of determining the optimal cutoff values.128 /whitn Date Anaytes Disadvantages i, To compute the points in an ROC curve, we have to evaluate the model many times with different classification thresholds. ii, Calculation is time consuming without specialized software. AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two- dimensional area underneath the entire ROC curve from (0, 0) to (1, 1). It provides the ability for a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, it is assumed that the better the Performance of the model at distinguishing between the positive and negative classes. ‘TP Rate 0 FP Rate 1 Figure 1.6: Area under the ROC Curve AUC measures the area under the curve plotted with true positives on the y-axis and fat Positives on the x-axis. This metric is useful because it provides a single number that lets y°Y compare models of different types. AUC is one of the good ways to estimate the accuracy of the model, An excellent model pO an AUC near to 1 which tells that it has a good measure of separability. A poor model will his? an AUC near 0 which describes that it has the worst measure of separability. When an AUC ® 0.5. it means the model has no class separation capacity present whatsoever. The AUC for the ROC can be calculated using the scikit-learn roc_auc_score"” function.iz Introduction to Data Analytics wsion\ 1-27 7. Evaluating Value Prediction Models Predictive models can “predict the future”, A simple example of a prediction problem is prediction of the selling price of a real estate property based on its attributes (location, square meters available, condition, etc.), There are many different techniques available for carrying out predictive analytics (for example, Regression). However, the abundance of predictive modeling techniques means that there are multiple models that can provide a good predictive evaluation to a simple problem, and choosing the right one can be challenging. Performance evaluation plays a dominant role in the technique of predictive modelling. The performance of a predictive model is calculated and compared by choosing the right metrics. So, it is very crucial to choose the right metrics for a particular predictive model in order to get an accurate outcome. 
Measuring the performance of a value prediction system involves two decisions: i, Fixing the specific individual error function. ii. Selecting the statistic to best represent the full error distribution. For numerical values, error is a function of the difference between the predicted value y' and the actual result y. The primary choices for the individual error functions include: i, _Absolute x.The error value A = y' - y is the simplest metric for performance evaluation. Typically, the absolute value of the error is taken to obliterate the sign. ii, Relative error: Normalizing the error by the magnitude of the observation produces a unit-less quantity, which can be interpreted as a fraction or as a percentage. The relative error ¢ is calculated as: ¢ = (y'-y)/y fii, Squared error; The value A’ = (y' - y)" is always positive, and hence these values can be meaningfully summed. Itis a very good idea to plot a histogram of the error distribution for any value predictor. The distribution should be symmetric, and centered around zero. It should be bell-shaped, meaning small errors are more common than big errors. There are several summary statistics that reduce such error distributions to a single number, in order to compare the performance of different value prediction models. These are: 1. Medi Abso1ute HHH (MAB)! Méan absolute error, also known as L1 loss is one of the Simplest loss functions and an easy-to-understand evaluation metric. It is calculated byDe 4-28 / sion Data Analytics d taking the absolute difference between the predicted values and the actual values any averaging it across the dataset, Mathematically speaking, it is the arithmetic average « absolute errors, It is calculated as: n | mates Bio is a f ‘ where y, = actual value, y; = predicted value, n = sample size MAE measures only the magnitude of the errors and doesn’t concern itself with thei direction. The lower the MAE, the higher the accuracy of a model. ‘(Meait'S@uiareaEFFOF (MSE): MSE is one of the most common regression loss functions In Mean Squared Error also known as L2 loss, we calculate the error by squaring the difference between the predicted value and actual value and averaging it across the dataset. Squaring the error gives higher weight to the outliers, MSE will never be negative since the errors are squared. The value of the error ranges from zero to infinity, MSE increases exponentially with an increase in error. A good model will have an MSE value closer to zero. Squared Error = (y,- ,)° ‘(Root Mean Squared Error (RMSE)? RMSE is computed by taking the square r0ot 0! MSE. RMSE is also called the Root Mean Square Deviation. It measures the averse magnitude of the errors and is concerned with the deviations from the actual value. RMSE value with zero indicates that the model has a perfect fit. The lower the RMSE, the bette’ the mode] and its predictions. A higher RMSE indicates that there is a large deviatio" from the predicted to the actual, 1 n RMSE = Diy-$)" Va 2-9 The following functions from the sklearn library are used for the above statistics: i, Mean Absolute Error; Metrics mean absolute error Mean Squared Error: mMetrica.mean squared error 4 Root Mean Squared Enor:netrics.naan squared exrox with naib ii,A on \ Introduction to Data Analytics wsion\ 1-29 Exercises Multiple Choice Questions: is the science of analyzing data to convert information to useful knowledge a. Datascience b. Data analysis C Data analytics d. 
Data exploration is a method for quantitatively describing the main features of a collection of data. a. Diagnostic analytics b. Predictive analytics ¢. Descriptive analysis d. Prescriptive analytics Which of the following is not a measure of centrality? a. Mean b. Mode c. Range d. Median Which of the following is a measure of dispersion? a. Mean b. Mode c. Range d. Median is used to measure and describe the strength and direction of the relationship between two variables. a. Range b. Variance ¢. Correlation d. Standard Deviation indicates how spread out the data points are. a. Range b. Variance ¢. Correlation d. Standard Deviation deals with the question “Why did it happen”? a Diagnostic analytics b. Predictive analytics ¢. — Descriptive analysis d. Prescriptive analytics deals with the question “What is happening in the data."? a Diagnostic analytics b. Predictive analytics ¢. — Descriptive analysis d. Prescriptive analyticsa 1-30 /wsion Data Analytics ll. 12. 16. Which type of analytics uses past data to create a model to find out what will happen the future? a. Diagnostic analytics b. Predictive analytics c. Descriptive analysis d. Prescriptive analytics Predictive analytics aims to answer the question: 1 a. Whatis happening with the data? _b. Why did it happen? ¢. What will happen? d. None of the above The goal of is to find out the underlying structure of the data, summarize th. | main characteristics and visualize trends. a. Exploratory data analysis b. Diagnostic analytics c. Prescriptive analytics d. Predictive analytics deals with understanding the exact changes in variables that lead to change: 7 in other variables for individual objects while excluding external influences. a. Mechanistic analysis b. Diagnostic analytics c. Prescriptive analytics d. Predictive analytics 2 Which analytics is used for finding the best course of action for a given situation? a. Mechanistic analysis b. Diagnostic analytics c. Prescriptive analytics 4. Predictive analytics 4 . analyzes datasets to find previously unknown relationships and invol\ using various data visualization approaches. a. Mechanistic analysis - b. Diagnostic analytics 3 c. Prescriptive analytics d. Exploratory analysis ‘The variable type that is nominal when there is no natural order between the pos" 2 values that it stores is a. Ordinal b. Nominal c. Ratio * : de eierval 2 The ratio of correct predictions to the total number of predictions is called: a. Accuracy b. Sensitivity c, Recall d. Precision20. 22. ZA Inroduction to Date Anaiytics wistm\ 1-31 measures the ratio of positive values correctly predicted out of all cases predicted as positive. a. Accuracy b. Sensitivity a Recall d. Precision Sensitivity is also called a. Accuracy b. Specificity c. Recall d. Precision is the proportion of actual negative cases which are correctly identified to all cases that are actually negative. a. Accuracy b. Specificity cc. — Recall d. Precision An ROC curve plots: a. TP vs. FP b. TNvs. FN c. TP vs, FN 4. TNvs. FP A good model has AUC value: a Closer to 1 b. Closer too e..05 d. Any value The Fl-score combines and of a classifier. a. precision and recall b. sensitivity and accuracy ¢. accuracy and recall d, accuracy and sensitivity The term (y' - y)° gives the a. Absolute error c, Mean squared error d. Which of the following indicates the spread of the data values. b. Distribution b. Squared error Relative error a. Measure of centrality c. Symmetry d. Skew Which type of analytics/analysis is used for causal analysis? a Descriptive’ b. 
Diagnostic - c. Predictive a Prescriptivede 1-32 / ‘wsion Data Analytics B,__ State True or False: 1, Data analysis includes data analytics. 2. A categorical variable can be either nominal or ordinal. 3. Data analytics and data analysis are identical terms. 4. Nationality is an example of ordinal variable. 5. _ Regression analysis predicts a variable from one or several predictor variables, 6. A mathematical model is same as a machine learning algorithm. 7. Machine learning models are output by algorithms and are comprised of model data at prediction algorithm. 8. Errors of bias produce underfit models. 9. Errors of variance result in overfit models. 10. Models that do much better on testing data than training data are underfit. 11. The ratio of correct predictions to the total number of predictions is called accuracy 12. Sensitivity measures the ratio of positive values correctly predicted out of all ca predicted as positive. 13. The closer an ROC curve is to the diagonal line, the less accurate the model is. 14. tis desirable to have an AUC near 0. 15. The Fl-score combines the precision and recall of a classifier. 16. Descriptive analytics deals with learning from past data to make predictions of the furue 17. The ROC curve is used to visualize the performance of a prediction model. 18. | Models that perform better on testing data than training data are overfit. 19. The accuracy of a model is high if the MAE is low. 20. Data analysis models the future and predicts the result. 21. Accuracy is a good metric to use for imbalanced datasets, C.__ Short Answer Questions: 1. Define data analytics. 2. How does data analytics differ from data analysis? * 2. Which are the two important aspects of descriptive analysis? List the types of variables in descriptive analysis,ee I Dw m\ 1-33 Introduction to Data Analytics Wi What is a categorical variable? What is a nominal variable? What is an ordinal variable? What is a ratio variable? What is an interval variable? Classify the following data into the corresponding variable type. i. Nationality — Indian, American, Canadian etc. ii, _ Dress size — small, medium, large iii, Age What are dependent and independent variables? How is the range of a dispersion calculated? Which type of analytics uses past data to create a model to find out what will happen in the future? Which type of analysis is used to find out what is happening with the data? What is interquartile range? What is variance and standard deviation? What is the use of diagnostic analytics? What is the concept of correlation? What is the use of prescriptive analytics? What is the purpose of exploratory analysis? What is a machine learning model? What is the purpose of model evaluation? Define the terms; True Positive, False Positive, True Negative, False Negative How is the accuracy and precision of a classification model calculated? What is the F1 score? How is the sensitivity or recall calculated? What is plotted on the X-axis and Y-axis of the ROC curve? What is overfitting and underfitting? List the measures of centrality. Which type of data causes the model to overfit?o 1-34 /vibion Data Anaiytics D._Long Answer Questions: 1. Explain the concept of data analytics. Differentiate between data analytics and data analysis, 3. Write a note on descriptive analysis. 4, Explain the two important aspects of descriptive analysis. <5. How is data described in descriptive analysis? 6, Explain the types of variables with examples. 7. Write a note on frequency distribution. 8. 
8.	Which are the measures of centrality?
9.	Explain the quantities for measuring dispersion.
10.	Write a note on diagnostic analytics.
11.	Write a note on predictive analytics.
12.	Explain the concept of mathematical models.
13.	What is a machine learning model?
14.	Write a note on the bias-variance tradeoff.
15.	Explain the concept of the confusion matrix.
16.	Explain the concept of class imbalance.
17.	Write a note on the ROC curve.
18.	What is AUC?
19.	Explain various measures for evaluating prediction models.
20.	Explain the concept of overfitting and underfitting with an example.

E.	Problem solving exercises:
1.	Given the following values, evaluate the performance of the classification model:
	N = 165, TP = 100, FP = 10, TN = 50, FN = 5
	(A worked sketch for this and the next exercise appears after the answer key.)
2.	Consider a classification problem for classifying emails as spam or not spam. The following table gives the model output for 20 emails. Evaluate the model's performance.

	Email No. | Actual class | Predicted class
	1         | Spam         | Spam
	2         | Not Spam     | Not Spam
	3         | Spam         | Not Spam
	4         | Not Spam     | Not Spam
	5         | Not Spam     | Spam
	6         | Not Spam     | Not Spam
	7         | Spam         | Not Spam
	8         | Not Spam     | Spam
	9         | Not Spam     | Spam
	10        | Spam         | Spam
	11        | Spam         | Spam
	12        | Not Spam     | Not Spam
	13        | Spam         | Spam
	14        | Not Spam     | Not Spam
	15        | Not Spam     | Spam
	16        | Spam         | Spam
	17        | Not Spam     | Spam
	18        | Not Spam     | Spam
	19        | Spam         | Spam
	20        | Not Spam     | Not Spam

3.	Consider a classifier for classifying reviews as positive (1) or negative (0). The table shows the model output for 20 reviews (Sample review 1 through Sample review 20), with columns true_label and predicted_label. Evaluate the model.
4.	Consider the following table, which shows the output of a prediction model for scores. Evaluate the model. (A worked sketch appears after the answer key.)

	x | Actual score | Predicted score
	1 | 75           | 79.03
	2 | 85           | 82.11
	2 | 80           | 82.11
	2 | 78           | 82.11
	2 | 89           | 82.11
	2 | 93           | 82.11
	3 | 90           | 85.19
	3 | 91           | 85.19
	4 | 94           | 88.27
	5 | 88           | 91.35
	5 | 84           | 91.35
	5 | 90           | 91.35
	6 | 94           | 94.43

Answers

A.	2. c   3. c   4. c   5. c   6. b   7. a   8. c   9. b   10. c   11. a   12. a   13. c   14. d   15. b   16. a   17. d   18. c   19. b   20. a   21. a   22. a   23. b   24. b   25. b

B.	1. False   2. True   3. False   4. False   5. True   6. False   7. True   8. True   9. True   10. True   11. True   12. False   13. True   14. False   15. True   16. False   17. True   18. False   19. True   20. False   21. False
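The answer key above covers sections A and B only. For the problem-solving exercises, the required measures follow directly from the four confusion-matrix counts. Below is a minimal Python sketch of that calculation; the function name evaluate_classifier and the printed layout are illustrative, not from this chapter. It is shown with the counts of Exercise 1.

```python
def evaluate_classifier(tp, fp, tn, fn):
    """Compute the chapter's evaluation measures from the confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total        # correct predictions / all predictions
    precision = tp / (tp + fp)          # correct positives / all predicted positive
    recall = tp / (tp + fn)             # sensitivity: correct positives / actual positives
    specificity = tn / (tn + fp)        # correct negatives / actual negatives
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Exercise 1: N = 165, TP = 100, FP = 10, TN = 50, FN = 5
for name, value in evaluate_classifier(tp=100, fp=10, tn=50, fn=5).items():
    print(f"{name}: {value:.3f}")
# accuracy ≈ 0.909, precision ≈ 0.909, recall ≈ 0.952,
# specificity ≈ 0.833, f1 ≈ 0.930
```

Exercise 2 reduces to the same function once the counts are tallied from its table: taking Spam as the positive class, the rows above give tp=6, fn=2, fp=6, tn=6, so evaluate_classifier(tp=6, fp=6, tn=6, fn=2) yields the requested measures.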
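Exercise 4 calls instead for the error measures used for prediction models (MAE, MSE and RMSE). The following sketch computes them for the actual and predicted scores as printed in the Exercise 4 table; the helper name evaluate_regression is likewise illustrative.

```python
import math

def evaluate_regression(actual, predicted):
    """Mean absolute error, mean squared error and root mean squared error."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n    # average of |y' - y|
    mse = sum(e * e for e in errors) / n     # average of (y' - y)^2
    rmse = math.sqrt(mse)
    return mae, mse, rmse

# Actual and predicted scores from the Exercise 4 table
actual = [75, 85, 80, 78, 89, 93, 90, 91, 94, 88, 84, 90, 94]
predicted = [79.03, 82.11, 82.11, 82.11, 82.11, 82.11,
             85.19, 85.19, 88.27, 91.35, 91.35, 91.35, 94.43]
mae, mse, rmse = evaluate_regression(actual, predicted)
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```

The lower the MAE (and MSE/RMSE), the higher the accuracy of the prediction model, which is the criterion the exercise asks you to apply.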
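Long-answer questions 17-19 concern the ROC curve and AUC. The sketch below traces how the curve's points (FPR on the X-axis, TPR on the Y-axis) and the area under it are obtained by sweeping the decision threshold; the labels and scores are a small hypothetical set invented for illustration, not data from this chapter, and in practice a library routine such as sklearn.metrics.roc_auc_score would normally be used.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t)
        fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= t)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical true labels and model scores, for illustration only
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
print(f"AUC = {auc(roc_points(labels, scores)):.3f}")  # closer to 1 is better
```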