How-To Tutorials


08 Jan 2020

6 min read

3 different types of generative adversarial networks (GANs) and how they work


Generative adversarial networks (GANs) have been greeted with real excitement since their creation in 2014 by Ian Goodfellow and his research team. Yann LeCun, Facebook's Director of AI Research, went as far as describing GANs as "the most interesting idea in the last 10 years in ML." With all this excitement, however, it can be easy to miss the diversity of GANs: there are a number of different types of generative adversarial networks, each working in a slightly different way and helping engineers achieve slightly different results.

To give you a deeper insight into GANs, in this article we'll look at three different generative adversarial networks: SRGANs, CycleGANs, and InfoGANs. We'll explore how these different GANs work and how they can be used. This should give you a solid foundation to explore GANs in more depth and begin to apply them in your own experiments and projects.

This article is an excerpt from the book Deep Learning with TensorFlow 2 and Keras, Second Edition by Antonio Gulli, Amita Kapoor, and Sujit Pal.

SRGAN - Super Resolution GANs

Remember seeing a crime thriller where our hero asks the computer guy to magnify the faded image of the crime scene? With the zoom we are able to see the criminal's face in detail, including the weapon used and anything engraved upon it! Well, SRGAN can perform similar magic. Here a GAN is trained in such a way that it can generate a photorealistic high-resolution image when given a low-resolution image. The SRGAN architecture consists of three neural networks: a very deep generator network, a discriminator network, and a pretrained VGG-19 network.

How do SRGANs work?

SRGANs use the perceptual loss function (developed by Johnson et al., Perceptual Losses for Real-Time Style Transfer and Super-Resolution). The perceptual loss is built on the difference between the feature map activations, taken from high layers of a VGG network, of the network output and of the high-resolution reference.
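The feature-map comparison described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the small nested lists stand in for activations that a pretrained VGG network would produce, the values are made up, and the helper name feature_distance is hypothetical rather than taken from the book or the paper.

```python
# Sketch only: nested lists stand in for VGG feature maps (made-up values).

def feature_distance(feat_generated, feat_reference):
    """Mean squared (Euclidean) difference between two equally shaped feature maps."""
    total, count = 0.0, 0
    for row_g, row_r in zip(feat_generated, feat_reference):
        for a, b in zip(row_g, row_r):
            total += (a - b) ** 2
            count += 1
    return total / count

# Hypothetical feature maps: the reference, a close reconstruction, a poor one.
feat_hr = [[0.0, 1.0], [2.0, 3.0]]
feat_close = [[0.0, 1.0], [2.0, 2.0]]
feat_far = [[4.0, 1.0], [2.0, 7.0]]

loss_close = feature_distance(feat_close, feat_hr)
loss_far = feature_distance(feat_far, feat_hr)
print(loss_close, loss_far)
```

A reconstruction whose features sit closer to those of the reference receives a smaller value, which is the behaviour the generator is trained to achieve.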
The authors combine a content loss with an adversarial loss so that the generated images look more natural and the finer details more artistic. The perceptual loss is defined as the weighted sum of the content loss and the adversarial loss:

$$l^{SR} = l^{SR}_{X} + 10^{-3} \, l^{SR}_{Gen}$$

The first term on the right-hand side is the content loss, obtained using the feature maps generated by a pretrained VGG-19. Mathematically, it is the Euclidean distance between the feature map of the reconstructed image (that is, the one generated by the generator) and that of the original high-resolution reference image. The second term on the right-hand side is the adversarial loss. It is the standard generative loss term, designed to ensure that images generated by the generator are able to fool the discriminator. You can see in the following figure, taken from the original paper, that the image generated by SRGAN is much closer to the original high-resolution image:

[Figure: image via https://arxiv.org/pdf/1609.04802.pdf]

CycleGAN

Another noteworthy architecture is CycleGAN; proposed in 2017, it can perform the task of image translation. Once trained, it can translate an image from one domain to another. For example, when trained on a horse and zebra dataset, if you give it an image with horses in the foreground, the CycleGAN can convert the horses to zebras while keeping the same background.

How does CycleGAN work?

Have you ever imagined how a scene would look if Van Gogh or Manet had painted it? We have many photographs of scenery and many landscapes painted by Van Gogh and Manet, but we do not have any collection of input-output pairs. CycleGAN performs the image translation, that is, it transfers an image given in one domain (scenery, for example) to another domain (a Van Gogh painting of the same scene, for instance) in the absence of paired training examples.
CycleGAN’s ability to perform image translation in the absence of training pairs is what makes it unique. To achieve image translation, the authors of CycleGAN used a very simple and yet effective procedure. They made use of two GANs, the generator of each GAN performing the image translation from one domain to the other. To elaborate, let us say the input is X; then the generator of the first GAN performs a mapping G: X → Y, so its output is Y = G(X). The generator of the second GAN performs the inverse mapping F: Y → X, resulting in X = F(Y). Each discriminator is trained to distinguish between real images and synthesized images. The idea is shown as follows:

To train the combined GANs, the authors added, beside the conventional GAN adversarial loss, a forward cycle consistency loss (left figure) and a backward cycle consistency loss (right figure). The forward loss ensures that if an image X is given as input, then after the two translations the obtained image is approximately the original, F(G(X)) ≈ X (similarly, the backward cycle consistency loss ensures that G(F(Y)) ≈ Y).

Following are some of the successful image translations by CycleGAN:

Following are a few more examples; you can see the translation of seasons (summer → winter), photo → painting and vice versa, and horses → zebras:

InfoGAN

The GAN architectures that we have considered up to now provide us with little or no control over the generated images. InfoGAN changes this; it provides control over various attributes of the generated images. InfoGAN uses concepts from information theory such that the noise term is transformed into latent codes that provide predictable and systematic control over the output.

How does InfoGAN work?

The generator in InfoGAN takes two inputs: the latent space Z and a latent code c. The output of the generator is therefore G(Z,c). The GAN is trained such that it maximizes the mutual information between the latent code c and the generated image G(Z,c).
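To make the two generator inputs concrete, here is a minimal sketch in plain Python of how a noise vector Z and a latent code c might be sampled and joined into a single input. The dimensions (62 noise values, 10 categories), the one-hot encoding of c, and the helper name sample_generator_input are illustrative assumptions, not details given in this article.

```python
import random

def sample_generator_input(noise_dim, num_categories, rng):
    """Sample noise z and a one-hot latent code c, and return them with their concatenation."""
    # Unstructured noise: independent standard normal values.
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    # Structured code: one category chosen uniformly, encoded as a one-hot vector.
    category = rng.randrange(num_categories)
    c = [1.0 if i == category else 0.0 for i in range(num_categories)]
    # The generator would receive the two parts joined into one vector.
    return z, c, z + c

rng = random.Random(0)  # fixed seed so the sketch is repeatable
z, c, gen_input = sample_generator_input(noise_dim=62, num_categories=10, rng=rng)
print(len(z), len(c), len(gen_input))
```

Because c occupies a fixed, known slice of the input, changing only that slice while holding z constant is what lets the code act as a control over the generated output.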
The following figure shows the architecture of InfoGAN:

The concatenated vector (Z,c) is fed to the generator. Q(c|X) is also a neural network. Combined with the generator, it forms a mapping between the random noise Z and its latent code c_hat; it aims to estimate c given X. This is achieved by adding a regularization term to the objective function of the conventional GAN:

$$\min_{G} \max_{D} V_I(D,G) = V_G(D,G) - \lambda I(c; G(Z,c))$$

The term V_G(D,G) is the loss function of the conventional GAN, and the second term is the regularization term, where λ is a constant whose value was set to 1 in the paper. I(c; G(Z,c)) is the mutual information between the latent code c and the generated image G(Z,c).

Below are the results of InfoGAN on the MNIST dataset:

That concludes our brief look at three different types of generative adversarial networks. You can find the book from which this article was taken on the Packt store, or you can read the first chapter for free on the Packt subscription platform.



Keith Bourne

13 Dec 2024

15 min read

Building Trust in AI: The Role of RAG in Data Security and Transparency


This article is an excerpt from the book, "Unlocking Data with Generative AI and RAG", by Keith Bourne. Master Retrieval-Augmented Generation (RAG), the most popular generative AI tool, to unlock the full potential of your data. This book enables you to develop highly sought-after skills as corporate investment in generative AI soars.

Introduction

As the adoption of Retrieval-Augmented Generation (RAG) continues to grow, its potential to address key security challenges in AI-driven applications is becoming evident. Far from merely introducing risks, RAG offers a robust framework to enhance data protection, ensure accuracy, and maintain transparency in content generation. This article delves into the multifaceted security benefits of RAG, while also addressing the unique challenges it poses and strategies to mitigate them.

How RAG can be leveraged as a security solution

Let’s start with the most positive security aspect of RAG. RAG can actually be considered a solution to mitigate security concerns, rather than cause them. If done right, you can limit data access by user, ensure more reliable responses, and provide more transparency of sources.

Limiting data

RAG applications may be a relatively new concept, but you can still apply the same authentication and database-based access approaches you can with web and similar types of applications. This provides the same level of security you can apply in these other types of applications. By implementing user-based access controls, you can restrict the data that each user or user group can retrieve through the RAG system. This ensures that sensitive information is only accessible to authorized individuals. Additionally, by leveraging secure database connections and encryption techniques, you can safeguard the data at rest and in transit, preventing unauthorized access or data breaches.

Ensuring the reliability of generated content

One of the key benefits of RAG is its ability to mitigate inaccuracies in generated content.
By allowing applications to retrieve proprietary data at the point of generation, the risk of producing misleading or incorrect responses is substantially reduced. Feeding the most current data available through your RAG system helps to mitigate inaccuracies that might otherwise occur.

With RAG, you have control over the data sources used for retrieval. By carefully curating and maintaining high-quality, up-to-date datasets, you can ensure that the information used to generate responses is accurate and reliable. This is particularly important in domains where precision and correctness are critical, such as healthcare, finance, or legal applications.

Maintaining transparency

RAG makes it easier to provide transparency in the generated content. By incorporating data such as citations and references to the retrieved data sources, you can increase the credibility and trustworthiness of the generated responses.

When a RAG system generates a response, it can include links or references to the specific data points or documents used in the generation process. This allows users to verify the information and trace it back to its original sources. By providing this level of transparency, you can build trust with your users and demonstrate the reliability of the generated content.

Transparency in RAG can also help with accountability and auditing. If there are any concerns or disputes regarding the generated content, having clear citations and references makes it easier to investigate and resolve any issues. This transparency also facilitates compliance with regulatory requirements or industry standards that may require traceability of information.

That covers many of the security-related benefits you can achieve with RAG. However, there are some security challenges associated with RAG as well. Let’s discuss these challenges next.

RAG security challenges

RAG applications face unique security challenges due to their reliance on large language models (LLMs) and external data sources.
Let’s start with the black box challenge, highlighting the relative difficulty in understanding how an LLM determines its response.

LLMs as black boxes

When something is in a dark, black box with the lid closed, you cannot see what is going on in there! That is the idea behind the black box when discussing LLMs, meaning there is a lack of transparency and interpretability in how these complex AI models process input and generate output. The most popular LLMs are also some of the largest, meaning they can have more than 100 billion parameters. The intricate interconnections and weights of these parameters make it difficult to understand how the model arrives at a particular output.

While the black box aspects of LLMs do not directly create a security problem, they do make it more difficult to identify solutions to problems when they occur. This makes it difficult to trust LLM outputs, which is a critical factor in most of the applications for LLMs, including RAG applications. This lack of transparency makes it more difficult to debug issues you might have in building a RAG application, which increases the risk of having more security issues.

There is a lot of research and effort in the academic field to build models that are more transparent and interpretable, called explainable AI. Explainable AI aims at making the operations of AI systems transparent and understandable. It can involve tools, frameworks, and anything else that, when applied to RAG, helps us understand how the language models that we use produce the content they are generating. This is a big movement in the field, but this technology may not be immediately available as you read this. It will hopefully play a larger role in the future to help mitigate black box risk, but right now, none of the most popular LLMs are using explainable models.
So, in the meantime, we will talk about other ways to address this issue.

You can use human-in-the-loop, where you involve humans at different stages of the process to provide an added line of defense against unexpected outputs. This can often help to reduce the impact of the black box aspect of LLMs. If your response time is not as critical, you can also use an additional LLM to perform a review of the response before it is returned to the user, looking for issues. We will review how to add a second LLM call in code lab 5.3, but with a focus on preventing prompt attacks. But this concept is similar, in that you can add additional LLMs to do a number of extra tasks and improve the security of your application.

Black box isn’t the only security issue you face when using RAG applications though; another very important topic is privacy protection.

Privacy concerns and protecting user data

Personally identifiable information (PII) is a key topic in the generative AI space, with governments around the world trying to determine the best path to balance user privacy with the data-hungry needs of these LLMs. As this gets worked out, it is important to pay attention to the laws and regulations that are taking shape where your company is doing business and make sure all of the technologies you are integrating into your RAG applications adhere. Many companies, such as Google and Microsoft, are taking these efforts into their own hands, establishing their own standards of protection for their user data and emphasizing them in training literature for their platforms.

At the corporate level, there is another challenge related to PII and sensitive information. As we have said many times, the nature of the RAG application is to give it access to the company data and combine that with the power of the LLM.
For example, for financial institutions, RAG represents a way to give their customers unprecedented access to their own data in ways that allow them to speak naturally with technologies such as chatbots and get near-instant access to hard-to-find answers buried deep in their customer data.

In many ways, this can be a huge benefit if implemented properly. But given that this is a security discussion, you may already see where I am going with this. We are giving unprecedented access to customer data using a technology that has artificial intelligence, and as we said previously in the black box discussion, we don’t completely understand how it works! If not implemented properly, this could be a recipe for disaster with massive negative repercussions for companies that get it wrong. Of course, it could be argued that the databases that contain the data are also a potential security risk. Having the data anywhere is a risk! But without taking on this risk, we also cannot provide the significant benefits they represent.

As with other IT applications that contain sensitive data, you can forge forward, but you need to have a healthy fear of what can happen to data and proactively take measures to protect that data. The more you understand how RAG works, the better job you can do in preventing a potentially disastrous data leak. These steps can help you protect your company as well as the people who trusted your company with their data.

This section was about protecting data that exists. However, a new risk that has risen with LLMs has been the generation of data that isn’t real, called hallucinations. Let’s discuss how this presents a new risk not common in the IT world.

Hallucinations

We have discussed this in previous chapters, but LLMs can, at times, generate responses that sound coherent and factual but can be very wrong.
These are called hallucinations and there have been many shocking examples provided in the news, especially in late 2022 and 2023, when LLMs became everyday tools for many users.

Some are just funny with little consequence other than a good laugh, such as when ChatGPT was asked by a writer for The Economist, “When was the Golden Gate Bridge transported for the second time across Egypt?” ChatGPT responded, “The Golden Gate Bridge was transported for the second time across Egypt in October of 2016” (https://www.economist.com/by-invitation/2022/09/02/artificial-neural-networks-today-are-not-conscious-according-to-douglas-hofstadter).

Other hallucinations are more nefarious, such as when a New York lawyer used ChatGPT for legal research in a client’s personal injury case against Avianca Airlines, where he submitted six cases that had been completely made up by the chatbot, leading to court sanctions (https://www.courthousenews.com/sanctions-ordered-for-lawyers-who-relied-on-chatgpt-artificial-intelligence-to-prepare-court-brief/). Even worse, generative AI has been known to give biased, racist, and bigoted perspectives, particularly when prompted in a manipulative way.

When combined with the black box nature of these LLMs, where we are not always certain how and why a response is generated, this can be a genuine issue for companies wanting to use these LLMs in their RAG applications.

From what we know though, hallucinations are primarily a result of the probabilistic nature of LLMs. For all responses that an LLM generates, it typically uses a probability distribution to determine what token it is going to provide next. In situations where it has a strong knowledge base of a certain subject, these probabilities for the next word/token can be 99% or higher. But in situations where the knowledge base is not as strong, the highest probability could be low, such as 20% or even lower.
In these cases, it is still the highest probability and, therefore, that is the token that has the highest probability to be selected. The LLM has been trained on stringing tokens together in a very natural language way while using this probabilistic approach to select which tokens to display. As it strings together words with low probability, it forms sentences, and then paragraphs that sound natural and factual but are not based on high probability data. Ultimately, this results in a response that sounds very plausible but is, in fact, based on very loose facts that are incorrect.

For a company, this poses a risk that goes beyond the embarrassment of your chatbot saying something wrong. What is said wrong could ruin your relationship(s) with your customer(s), or it could lead to the LLM offering your customer something that you did not intend to offer, or worse, cannot afford to offer. For example, when Microsoft released a chatbot named Tay on Twitter in 2016 with the intention of learning from interactions with Twitter users, users manipulated this spongy personality trait to get it to say numerous racist and bigoted remarks. This reflected poorly on Microsoft, which was promoting its expertise in the AI area with Tay, causing significant damage to its reputation at the time (https://www.theguardian.com/technology/2016/mar/26/microsoft-deeply-sorry-for-offensive-tweets-by-ai-chatbot).

Hallucinations, threats related to black box aspects, and protecting user data can all be addressed through red teaming.

Conclusion

RAG represents a promising avenue for enhancing security in AI applications, offering tools to limit data access, ensure reliable outputs, and promote transparency. However, challenges such as the black box nature of LLMs, privacy concerns, and the risk of hallucinations demand proactive measures.
By employing strategies like user-based access controls, explainable AI, and red teaming, organizations can harness the advantages of RAG while mitigating risks. As the technology evolves, a thoughtful approach to its implementation will be crucial for maintaining trust, compliance, and the integrity of data-driven solutions.

Author Bio

Keith Bourne is a senior Generative AI data scientist at Johnson & Johnson. He has over a decade of experience in machine learning and AI working across diverse projects in companies that range in size from start-ups to Fortune 500 companies. With an MBA from Babson College and a master’s in applied data science from the University of Michigan, he has developed several sophisticated modular Generative AI platforms from the ground up, using numerous advanced techniques, including RAG, AI agents, and foundational model fine-tuning. Keith seeks to share his knowledge with a broader audience, aiming to demystify the complexities of RAG for organizations looking to leverage this promising technology.



Sugandha Lahoti

16 Feb 2018

13 min read

4 ways to implement feature selection in Python for machine learning


This article is an excerpt from Ensemble Machine Learning. This book serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.

In this article, we will look at different methods to select features from the dataset, and discuss types of feature selection algorithms with their implementation in Python using the scikit-learn (sklearn) library:

1. Univariate selection
2. Recursive Feature Elimination (RFE)
3. Principal Component Analysis (PCA)
4. Choosing important features (feature importance)

We explain the first three algorithms and their implementation in brief. We then discuss choosing important features (feature importance) in detail, as it is a widely used technique in the data science community.

Univariate selection

Statistical tests can be used to select those features that have the strongest relationships with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.
The following example uses the chi-squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset:

    # Feature extraction with univariate statistical tests (chi-squared for classification)
    # Import pandas to read the CSV
    import pandas
    # Import numpy for array-related operations
    import numpy
    # Import sklearn's feature selection algorithm
    from sklearn.feature_selection import SelectKBest
    # Import chi2 for performing the chi-squared test
    from sklearn.feature_selection import chi2

    # URL for loading the dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    # Define the attribute names
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    # Create a pandas data frame by loading the data from the URL
    dataframe = pandas.read_csv(url, names=names)
    # Create an array from the data values
    array = dataframe.values
    # Split the data into input and target
    X = array[:,0:8]
    Y = array[:,8]

    # Select the features using the chi-squared test
    test = SelectKBest(score_func=chi2, k=4)
    # Fit the selector to rank the features by score
    fit = test.fit(X, Y)
    # Summarize scores
    numpy.set_printoptions(precision=3)
    print(fit.scores_)
    # Apply the transformation to the dataset
    features = fit.transform(X)
    # Summarize selected features
    print(features[0:5,:])

You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age.

Scores for each feature:

    [111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]

Selected features:

    [[148.    0.   33.6  50. ]
     [ 85.    0.   26.6  31. ]
     [183.    0.   23.3  32. ]
     [ 89.   94.   28.1  21. ]
     [137.  168.   43.1  33. ]]

Recursive Feature Elimination

RFE works by recursively removing attributes and building a model on attributes that remain.
It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much as long as it is skillful and consistent:

    # Import pandas to read the CSV
    import pandas
    # Import numpy for array-related operations
    import numpy
    # Import sklearn's feature selection algorithm
    from sklearn.feature_selection import RFE
    # Import LogisticRegression to use as the estimator inside RFE
    from sklearn.linear_model import LogisticRegression

    # URL for loading the dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    # Define the attribute names
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    # Create a pandas data frame by loading the data from the URL
    dataframe = pandas.read_csv(url, names=names)
    # Create an array from the data values
    array = dataframe.values
    # Split the data into input and target
    X = array[:,0:8]
    Y = array[:,8]

    # Feature extraction (recent scikit-learn versions require the keyword argument)
    model = LogisticRegression()
    rfe = RFE(model, n_features_to_select=3)
    fit = rfe.fit(X, Y)
    print("Num Features: %d" % fit.n_features_)
    print("Selected Features: %s" % fit.support_)
    print("Feature Ranking: %s" % fit.ranking_)

After execution, we will get:

    Num Features: 3
    Selected Features: [ True False False False False True True False]
    Feature Ranking: [1 2 3 5 6 1 1 4]

You can see that RFE chose the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with a choice 1 in the ranking_ array.

Principal Component Analysis

PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique.
A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the following example, we use PCA and select three principal components:

    # Import pandas to read the CSV
    import pandas
    # Import numpy for array-related operations
    import numpy
    # Import sklearn's PCA algorithm
    from sklearn.decomposition import PCA

    # URL for loading the dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    # Define the attribute names
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    dataframe = pandas.read_csv(url, names=names)
    # Create an array from the data values
    array = dataframe.values
    # Split the data into input and target
    X = array[:,0:8]
    Y = array[:,8]

    # Feature extraction
    pca = PCA(n_components=3)
    fit = pca.fit(X)
    # Summarize components
    print("Explained Variance: %s" % fit.explained_variance_ratio_)
    print(fit.components_)

You can see that the transformed dataset (three principal components) bears little resemblance to the source data:

    Explained Variance: [ 0.88854663 0.06159078 0.02579012]
    [[ -2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
        9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
     [ -2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02
        9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
     [ -2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01
        2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]

Choosing important features (feature importance)

Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's understand it in detail.
Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged, and the features are ranked according to this measure.

Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset. This dataset is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into nine product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the nine categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).
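The impurity decrease described above can be worked through by hand on a toy split. The sketch below is plain Python with hypothetical helper names (gini, impurity_decrease) and made-up labels; it illustrates the idea for a single node and is not how scikit-learn computes the measure internally.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Made-up node holding two classes in equal proportion.
parent = ["a", "a", "b", "b"]
# A split that separates the classes completely.
perfect = impurity_decrease(parent, ["a", "a"], ["b", "b"])
# A split that leaves both children as mixed as the parent.
useless = impurity_decrease(parent, ["a", "b"], ["a", "b"])
print(perfect, useless)
```

A feature whose splits give large decreases, averaged over the nodes and trees where it is used, ends up ranked as more important than one whose splits leave the children as mixed as the parent.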
We will start with importing all of the libraries:

    # Import pandas to load the dataset from the CSV file
    from pandas import read_csv
    # Import numpy for array-based operations and calculations
    import numpy as np
    # Import the random forest classifier class from sklearn
    from sklearn.ensemble import RandomForestClassifier
    # Import sklearn's model-based feature selector class
    from sklearn.feature_selection import SelectFromModel
    np.random.seed(1)

Let's define a method to split our dataset into training and testing data; we will train our model on the training part, and the testing part will be used for evaluation of the trained model:

    # Function to create train and test sets from the original dataset
    def getTrainTestData(dataset, split):
        np.random.seed(0)
        training = []
        testing = []
        np.random.shuffle(dataset)
        shape = np.shape(dataset)
        # Use a plain int so that large datasets do not overflow a 16-bit count
        trainlength = int(np.floor(split * shape[0]))
        for i in range(trainlength):
            training.append(dataset[i])
        for i in range(trainlength, shape[0]):
            testing.append(dataset[i])
        training = np.array(training)
        testing = np.array(testing)
        return training, testing

We also need to add a function to evaluate the accuracy of the model; it takes the predicted and actual outputs and returns the fraction of predictions that are correct:

    # Function to evaluate model performance
    def getAccuracy(pre, ytest):
        count = 0
        for i in range(len(ytest)):
            if ytest[i] == pre[i]:
                count += 1
        acc = float(count) / len(ytest)
        return acc

Now it is time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances.
We will use 50,000 instances for our example: 35,000 instances to train the classifier and 15,000 instances to test its performance:

    # Load the dataset as a pandas data frame
    data = read_csv('train.csv')
    # Extract attribute names from the data frame
    feat = data.keys()
    # to_numpy() replaces get_values(), which recent pandas versions no longer provide
    feat_labels = feat.to_numpy()
    # Extract data values from the data frame
    dataset = data.values
    # Shuffle the dataset
    np.random.shuffle(dataset)
    # We will select 50000 instances to train the classifier
    inst = 50000
    # Extract 50000 instances from the dataset
    dataset = dataset[0:inst, :]
    # Create training and testing data for performance evaluation
    train, test = getTrainTestData(dataset, 0.7)
    # Split data into input and output variables
    Xtrain = train[:, 0:94]
    ytrain = train[:, 94]
    shape = np.shape(Xtrain)
    print("Shape of the dataset ", shape)
    # Print the size of the data in MB
    print("Size of Data set before feature selection: %.2f MB" % (Xtrain.nbytes / 1e6))

Let's take note of the data size here; as our dataset contains 35,000 training instances with 94 attributes, it is quite large. Let's see:

    Shape of the dataset (35000, 94)
    Size of Data set before feature selection: 26.32 MB

As you can see, we have 35,000 rows and 94 columns in our dataset, which is more than 26 MB of data. In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30 and the number of random features will be 7.
Other hyperparameters will be the default of sklearn:

    # Let's select the test data for model evaluation purposes
    Xtest = test[:, 0:94]
    ytest = test[:, 94]
    # Create a random forest classifier with the following parameters
    trees = 250
    max_feat = 7
    max_depth = 30
    min_sample = 2
    clf = RandomForestClassifier(n_estimators=trees,
                                 max_features=max_feat,
                                 max_depth=max_depth,
                                 min_samples_split=min_sample,
                                 random_state=0,
                                 n_jobs=-1)
    # Train the classifier and calculate the training time
    import time
    start = time.time()
    clf.fit(Xtrain, ytrain)
    end = time.time()
    # Let's note down the model training time
    print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
    pre = clf.predict(Xtest)

Let's see how much time is required to train the model on the training dataset:

    Execution time for building the Tree is: 2.913641

    # Evaluate the model performance for the test data
    acc = getAccuracy(pre, ytest)
    print("Accuracy of model before feature selection is %.2f" % (100 * acc))

The accuracy of our model is:

    Accuracy of model before feature selection is 98.82

As you can see, we are getting very good accuracy as we are classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 instances out of 15,000 in correct classes. So, now my question is: should we go for further improvement? Well, why not? We should definitely go for more improvements if we can; here, we will use feature importance to select features. As you know, in the tree building process, we use impurity measurement for node selection. The attribute value that has the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection. We can give more importance to features that have less impurity, and this can be done using the feature_importances_ attribute of the trained sklearn classifier.
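As a rough illustration of how per-tree impurity decreases could be combined into a forest-level ranking, the sketch below averages hypothetical per-tree scores and normalises them so that they sum to 1. This is a simplified sketch under stated assumptions, not a description of how sklearn computes feature_importances_ internally; `forest_importances`, the feature names, and all numbers are made up for illustration.

```python
def forest_importances(per_tree_decreases):
    """Average each feature's impurity decrease over the trees, then normalise to sum to 1."""
    n_trees = len(per_tree_decreases)
    features = list(per_tree_decreases[0].keys())
    averaged = {
        name: sum(tree[name] for tree in per_tree_decreases) / n_trees
        for name in features
    }
    total = sum(averaged.values())
    return {name: value / total for name, value in averaged.items()}


# Hypothetical impurity decreases credited to each feature in three trees.
per_tree = [
    {"feat_a": 0.30, "feat_b": 0.10, "feat_c": 0.00},
    {"feat_a": 0.20, "feat_b": 0.15, "feat_c": 0.05},
    {"feat_a": 0.40, "feat_b": 0.05, "feat_c": 0.10},
]

importances = forest_importances(per_tree)

# Rank the features from most to least important.
ranking = sorted(importances, key=importances.get, reverse=True)
print(ranking)
```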
Let's find out the importance of each feature:

    # Once we have trained the model, we will rank all the features
    for feature in zip(feat_labels, clf.feature_importances_):
        print(feature)

    ('id', 0.33346650420175183)
    ('feat_1', 0.0036186958628801214)
    ('feat_2', 0.0037243050888530957)
    ('feat_3', 0.011579217472062748)
    ('feat_4', 0.010297382675187445)
    ('feat_5', 0.0010359139416194116)
    ('feat_6', 0.00038171336038056165)
    ('feat_7', 0.0024867672489765021)
    ('feat_8', 0.0096689721610546085)
    ('feat_9', 0.007906150362995093)
    ('feat_10', 0.0022342480802130366)

As you can see here, each feature has a different importance based on its contribution to the final prediction. We will use these importance scores to rank our features; in the following part, we will select those features that have a feature importance of at least 0.01 for model training:

    # Select features which have a higher contribution to the final prediction
    sfm = SelectFromModel(clf, threshold=0.01)
    sfm.fit(Xtrain, ytrain)

Here, we will transform the input dataset according to the selected feature attributes. Then, we will check the size and shape of the new dataset:

    # Transform the input dataset
    Xtrain_1 = sfm.transform(Xtrain)
    Xtest_1 = sfm.transform(Xtest)
    # Let's see the size and shape of the new dataset
    print("Size of Data set after feature selection: %.2f MB" % (Xtrain_1.nbytes / 1e6))
    shape = np.shape(Xtrain_1)
    print("Shape of the dataset ", shape)

    Size of Data set after feature selection: 5.60 MB
    Shape of the dataset (35000, 20)

Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the dataset from 26 MB to 5.60 MB. That's about an 80% reduction from the original dataset. In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset.
Let's see what accuracy we get after modifying the training set:

    # Model training time
    start = time.time()
    clf.fit(Xtrain_1, ytrain)
    end = time.time()
    print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
    # Let's evaluate the model on test data
    pre = clf.predict(Xtest_1)
    acc2 = getAccuracy(pre, ytest)
    print("Accuracy after feature selection %.2f" % (100 * acc2))

    Execution time for building the Tree is: 1.711518
    Accuracy after feature selection 99.97

Can you see that? We have got 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances in correct classes, while previously we were classifying only 14,823 instances correctly. This is a huge improvement we have got with the feature selection process; we can summarize all the results in the following table:

    Evaluation criteria    Before feature selection    After feature selection
    Number of features     94                          20
    Size of dataset        26.32 MB                    5.60 MB
    Training time          2.91 seconds                1.71 seconds
    Accuracy               98.82 percent               99.97 percent

The preceding table shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which reduces the model complexity and the dimensions of the dataset. Training takes less time after the reduction in dimensions, and in the end we get higher accuracy on the test data than before, which suggests that the model overfits less.

To summarize the article, we explored 4 ways of feature selection in machine learning. If you found this post useful, do check out the book Ensemble Machine Learning to know more about stacking generalization among other techniques.
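The thresholding step used earlier (keep every feature whose importance is at least 0.01) can be sketched in plain Python. The scores below are the eleven values printed in the article's output; the full run kept 20 features, so this only illustrates the rule on the visible subset. The dictionary and the names `THRESHOLD` and `selected` are illustrative, not part of the original code.

```python
# Importance scores copied from the output printed earlier in the article.
importances = {
    "id": 0.33346650420175183,
    "feat_1": 0.0036186958628801214,
    "feat_2": 0.0037243050888530957,
    "feat_3": 0.011579217472062748,
    "feat_4": 0.010297382675187445,
    "feat_5": 0.0010359139416194116,
    "feat_6": 0.00038171336038056165,
    "feat_7": 0.0024867672489765021,
    "feat_8": 0.0096689721610546085,
    "feat_9": 0.007906150362995093,
    "feat_10": 0.0022342480802130366,
}

# Same cut-off as the threshold passed to SelectFromModel in the article.
THRESHOLD = 0.01

# Keep the features whose score reaches the threshold, in their original order.
selected = [name for name, score in importances.items() if score >= THRESHOLD]
print(selected)
```

Note that `id` is a row identifier rather than a measured product attribute, so its large score is worth examining before relying on the selected set.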


Packt

11 Aug 2015

17 min read

Divide and Conquer – Classification Using Decision Trees and Rules

In this article by Brett Lantz, author of the book Machine Learning with R, Second Edition, we will get a basic understanding of decision trees and rule learners, including the C5.0 decision tree algorithm. We will cover mechanisms such as choosing the best split and pruning the decision tree.

While deciding between several job offers with various levels of pay and benefits, many people begin by making lists of pros and cons, and eliminate options based on simple rules. For instance, "if I have to commute for more than an hour, I will be unhappy." Or, "if I make less than $50k, I won't be able to support my family." In this way, the complex and difficult decision of predicting one's future happiness can be reduced to a series of simple decisions.

This article covers decision trees and rule learners, two machine learning methods that also make complex decisions from sets of simple choices. These methods then present their knowledge in the form of logical structures that can be understood with no statistical knowledge. This aspect makes these models particularly useful for business strategy and process improvement.

By the end of this article, you will learn:
- How trees and rules "greedily" partition data into interesting segments
- The most common decision tree and classification rule learners, including the C5.0, 1R, and RIPPER algorithms

We will begin by examining decision trees, followed by a look at classification rules.

Understanding decision trees

Decision tree learners are powerful classifiers, which utilize a tree structure to model the relationships among the features and the potential outcomes. As illustrated in the following figure, this structure earned its name due to the fact that it mirrors how a literal tree begins at a wide trunk, which if followed upward, splits into narrower and narrower branches.
In much the same way, a decision tree classifier uses a structure of branching decisions, which channel examples into a final predicted class value. To better understand how this works in practice, let's consider the following tree, which predicts whether a job offer should be accepted. A job offer to be considered begins at the root node, where it is then passed through decision nodes that require choices to be made based on the attributes of the job. These choices split the data across branches that indicate potential outcomes of a decision, depicted here as yes or no outcomes, though in some cases there may be more than two possibilities. In the case a final decision can be made, the tree is terminated by leaf nodes (also known as terminal nodes) that denote the action to be taken as the result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree. A great benefit of decision tree algorithms is that the flowchart-like tree structure is not necessarily exclusively for the learner's internal use. After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works or doesn't work well for a particular task. This also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in case the results need to be shared with others in order to inform future business practices. 
With this in mind, some potential uses include:
- Credit scoring models in which the criteria that cause an applicant to be rejected need to be clearly documented and free from bias
- Marketing studies of customer behavior such as satisfaction or churn, which will be shared with management or advertising agencies
- Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression

Although the previous applications illustrate the value of trees in informing decision processes, this is not to suggest that their utility ends here. In fact, decision trees are perhaps the single most widely used machine learning technique, and can be applied to model almost any type of data, often with excellent out-of-the-box applications.

This said, in spite of their wide applicability, it is worth noting some scenarios where trees may not be an ideal fit. One such case might be a task where the data has a large number of nominal features with many levels or it has a large number of numeric features. These cases may result in a very large number of decisions and an overly complex tree. They may also contribute to the tendency of decision trees to overfit data, though as we will soon see, even this weakness can be overcome by adjusting some simple parameters.

Divide and conquer

Decision trees are built using a heuristic called recursive partitioning. This approach is also commonly known as divide and conquer because it splits the data into subsets, which are then split repeatedly into even smaller subsets, and so on and so forth until the process stops when the algorithm determines the data within the subsets are sufficiently homogeneous, or another stopping criterion has been met.

To see how splitting a dataset can create a decision tree, imagine a bare root node that will grow into a mature tree. At first, the root node represents the entire dataset, since no splitting has transpired.
Next, the decision tree algorithm must choose a feature to split upon; ideally, it chooses the feature most predictive of the target class. The examples are then partitioned into groups according to the distinct values of this feature, and the first set of tree branches are formed. Working down each branch, the algorithm continues to divide and conquer the data, choosing the best candidate feature each time to create another decision node, until a stopping criterion is reached. Divide and conquer might stop at a node in a case that:
- All (or nearly all) of the examples at the node have the same class
- There are no remaining features to distinguish among the examples
- The tree has grown to a predefined size limit

To illustrate the tree building process, let's consider a simple example. Imagine that you work for a Hollywood studio, where your role is to decide whether the studio should move forward with producing the screenplays pitched by promising new authors. After returning from a vacation, your desk is piled high with proposals. Without the time to read each proposal cover-to-cover, you decide to develop a decision tree algorithm to predict whether a potential movie would fall into one of three categories: Critical Success, Mainstream Hit, or Box Office Bust.

To build the decision tree, you turn to the studio archives to examine the factors leading to the success and failure of the company's 30 most recent releases. You quickly notice a relationship between the film's estimated shooting budget, the number of A-list celebrities lined up for starring roles, and the level of success. Excited about this finding, you produce a scatterplot to illustrate the pattern:

Using the divide and conquer strategy, we can build a simple decision tree from this data.
First, to create the tree's root node, we split the feature indicating the number of celebrities, partitioning the movies into groups with and without a significant number of A-list stars:

Next, among the group of movies with a larger number of celebrities, we can make another split between movies with and without a high budget:

At this point, we have partitioned the data into three groups. The group at the top-left corner of the diagram is composed entirely of critically acclaimed films. This group is distinguished by a high number of celebrities and a relatively low budget. At the top-right corner, the majority of the movies are box office hits with high budgets and a large number of celebrities. The final group, which has little star power but budgets ranging from small to large, contains the flops.

If we wanted, we could continue to divide and conquer the data by splitting it based on increasingly specific ranges of budget and celebrity count, until each of the currently misclassified values resides in its own tiny partition, and is correctly classified. However, it is not advisable to overfit a decision tree in this way. Though there is nothing to stop us from splitting the data indefinitely, overly specific decisions do not always generalize more broadly. We'll avoid the problem of overfitting by stopping the algorithm here, since more than 80 percent of the examples in each group are from a single class. This forms the basis of our stopping criterion.

You might have noticed that diagonal lines might have split the data even more cleanly. This is one limitation of the decision tree's knowledge representation, which uses axis-parallel splits. The fact that each split considers one feature at a time prevents the decision tree from forming more complex decision boundaries. For example, a diagonal line could be created by a decision that asks, "is the number of celebrities greater than the estimated budget?" If so, then "it will be a critical success."
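The two axis-parallel splits described above can be written as a pair of nested conditions, each testing a single feature. This is a minimal sketch: the article does not state the actual cut points, so `CELEBRITY_CUTOFF` and `BUDGET_CUTOFF` are hypothetical values chosen only for illustration, as is the function name `predict_movie`.

```python
# Hypothetical cut points; the article does not give the values used in its diagram.
CELEBRITY_CUTOFF = 3          # assumed meaning of "a significant number" of A-list stars
BUDGET_CUTOFF = 50_000_000    # assumed meaning of "a high budget", in dollars


def predict_movie(celebrity_count, budget):
    """Follow the two single-feature decisions described in the text."""
    # Root node: split on the number of celebrities.
    if celebrity_count < CELEBRITY_CUTOFF:
        return "Box Office Bust"
    # Second decision node: among star-heavy movies, split on budget.
    if budget >= BUDGET_CUTOFF:
        return "Mainstream Hit"
    return "Critical Success"


print(predict_movie(celebrity_count=1, budget=80_000_000))
print(predict_movie(celebrity_count=5, budget=80_000_000))
print(predict_movie(celebrity_count=5, budget=10_000_000))
```

Because every condition compares one feature against a constant, no combination of these tests can express a diagonal boundary such as "celebrities greater than budget".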
Our model for predicting the future success of movies can be represented in a simple tree, as shown in the following diagram. To evaluate a script, follow the branches through each decision until the script's success or failure has been predicted. In no time, you will be able to identify the most promising options among the backlog of scripts and get back to more important work, such as writing an Academy Awards acceptance speech.

Since real-world data contains more than two features, decision trees quickly become far more complex than this, with many more nodes, branches, and leaves. In the next section, you will learn about a popular algorithm to build decision tree models automatically.

The C5.0 decision tree algorithm

There are numerous implementations of decision trees, but one of the most well-known implementations is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an improvement over his Iterative Dichotomiser 3 (ID3) algorithm. Although Quinlan markets C5.0 to commercial clients (see http://www.rulequest.com/ for details), the source code for a single-threaded version of the algorithm was made publicly available, and it has therefore been incorporated into programs such as R.

To further confuse matters, a popular Java-based open source alternative to C4.5, titled J48, is included in R's RWeka package. Because the differences among C5.0, C4.5, and J48 are minor, the principles in this article will apply to any of these three methods, and the algorithms should be considered synonymous.

The C5.0 algorithm has become the industry standard to produce decision trees, because it does well for most types of problems directly out of the box. Compared to other advanced machine learning models, the decision trees built by C5.0 generally perform nearly as well, but are much easier to understand and deploy.
Additionally, as shown in the following table, the algorithm's weaknesses are relatively minor and can be largely avoided:

Strengths
- An all-purpose classifier that does well on most problems
- Highly automatic learning process, which can handle numeric or nominal features, as well as missing data
- Excludes unimportant features
- Can be used on both small and large datasets
- Results in a model that can be interpreted without a mathematical background (for relatively small trees)
- More efficient than other complex models

Weaknesses
- Decision tree models are often biased toward splits on features having a large number of levels
- It is easy to overfit or underfit the model
- Can have trouble modeling some relationships due to reliance on axis-parallel splits
- Small changes in the training data can result in large changes to decision logic
- Large trees can be difficult to interpret and the decisions they make may seem counterintuitive

To keep things simple, our earlier decision tree example ignored the mathematics involved in how a machine would employ a divide and conquer strategy. Let's explore this in more detail to examine how this heuristic works in practice.

Choosing the best split

The first challenge that a decision tree will face is to identify which feature to split upon. In the previous example, we looked for a way to split the data such that the resulting partitions contained examples primarily of a single class. The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure.

There are various measurements of purity that can be used to identify the best decision tree splitting candidate. C5.0 uses entropy, a concept borrowed from information theory that quantifies the randomness, or disorder, within a set of class values. Sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality.
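To give the notion of entropy a concrete shape, here is a small Python sketch (the article's own examples use R). It assumes the standard Shannon definition, the sum over classes of -p * log2(p), measured in bits; the function name `entropy` and the toy label lists are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a collection of class labels."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)


# A pure set has no disorder at all.
pure = entropy(["red"] * 10)

# Two equally likely classes give the two-class maximum.
balanced_two = entropy(["red"] * 5 + ["white"] * 5)

# Three equally likely classes give a higher maximum, log2(3).
balanced_three = entropy(["red", "white", "blue"] * 4)

print(pure, balanced_two, balanced_three)
```

The more evenly the classes are mixed, and the more classes there are, the higher the value becomes.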
The decision tree hopes to find splits that reduce entropy, ultimately increasing homogeneity within the groups.

Typically, entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). In each case, the minimum value indicates that the sample is completely homogeneous, while the maximum value indicates that the data are as diverse as possible, and no group has even a small plurality.

In mathematical notation, entropy is specified as follows:

$$\text{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$$

In this formula, for a given segment of data (S), the term c refers to the number of class levels and p_i refers to the proportion of values falling into class level i. For example, suppose we have a partition of data with two classes: red (60 percent) and white (40 percent). We can calculate the entropy as follows:

    > -0.60 * log2(0.60) - 0.40 * log2(0.40)
    [1] 0.9709506

We can examine the entropy for all the possible two-class arrangements. If we know that the proportion of examples in one class is x, then the proportion in the other class is (1 - x). Using the curve() function, we can then plot the entropy for all the possible values of x:

    > curve(-x * log2(x) - (1 - x) * log2(1 - x),
            col = "red", xlab = "x", ylab = "Entropy", lwd = 4)

This results in the following figure:

As illustrated by the peak in entropy at x = 0.50, a 50-50 split results in maximum entropy. As one class increasingly dominates the other, the entropy reduces to zero.

To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that would result from a split on each possible feature, which is a measure known as information gain. The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting from the split (S2):

$$\text{InfoGain}(F) = \text{Entropy}(S_1) - \text{Entropy}(S_2)$$

One complication is that after a split, the data is divided into more than one partition.
Therefore, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this by weighing each partition's entropy by the proportion of records falling into the partition. This can be stated in a formula as:

$$\text{Entropy}(S) = \sum_{i=1}^{n} w_i \, \text{Entropy}(P_i)$$

In simple terms, the total entropy resulting from a split is the sum of the entropy of each of the n partitions (P_i) weighted by the proportion of examples falling in the partition (w_i).

The higher the information gain, the better a feature is at creating homogeneous groups after a split on this feature. If the information gain is zero, there is no reduction in entropy for splitting on this feature. On the other hand, the maximum information gain is equal to the entropy prior to the split. This would imply that the entropy after the split is zero, which means that the split results in completely homogeneous groups.

The previous formulae assume nominal features, but decision trees use information gain for splitting on numeric features as well. To do so, a common practice is to test various splits that divide the values into groups greater than or less than a numeric threshold. This reduces the numeric feature into a two-level categorical feature that allows information gain to be calculated as usual. The numeric cut point yielding the largest information gain is chosen for the split.

Though it is used by C5.0, information gain is not the only splitting criterion that can be used to build decision trees. Other commonly used criteria are Gini index, Chi-Squared statistic, and gain ratio. For a review of these (and many more) criteria, refer to Mingers J. An Empirical Comparison of Selection Measures for Decision-Tree Induction. Machine Learning. 1989; 3:319-342.

Pruning the decision tree

A decision tree can continue to grow indefinitely, choosing splitting features and dividing the data into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on.
However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will be overfitted to the training data. The process of pruning a decision tree involves reducing its size such that it generalizes better to unseen data. One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions or when the decision nodes contain only a small number of examples. This is called early stopping or pre-pruning the decision tree. As the tree avoids doing needless work, this is an appealing strategy. However, one downside to this approach is that there is no way to know whether the tree will miss subtle, but important patterns that it would have learned had it grown to a larger size. An alternative, called post-pruning, involves growing a tree that is intentionally too large and pruning leaf nodes to reduce the size of the tree to a more appropriate level. This is often a more effective approach than pre-pruning, because it is quite difficult to determine the optimal depth of a decision tree without growing it first. Pruning the tree later on allows the algorithm to be certain that all the important data structures were discovered. The implementation details of pruning operations are very technical and beyond the scope of this article. For a comparison of some of the available methods, see Esposito F, Malerba D, Semeraro G. A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997;19: 476-491. One of the benefits of the C5.0 algorithm is that it is opinionated about pruning—it takes care of many decisions automatically using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the training data. Later, the nodes and branches that have little effect on the classification errors are removed. 
In some cases, entire branches are moved further up the tree or replaced by simpler decisions. These processes of grafting branches are known as subtree raising and subtree replacement, respectively.

Balancing overfitting and underfitting a decision tree is a bit of an art, but if model accuracy is vital, it may be worth investing some time with various pruning options to see if it improves the performance on test data. As you will soon see, one of the strengths of the C5.0 algorithm is that it is very easy to adjust the training options.

Summary

This article covered two classification methods that use so-called "greedy" algorithms to partition the data according to feature values. Decision trees use a divide and conquer strategy to create flowchart-like structures, while rule learners separate and conquer data to identify logical if-else rules. Both methods produce models that can be interpreted without a statistical background.

One popular and highly configurable decision tree algorithm is C5.0. We used the C5.0 algorithm to create a tree to predict whether a loan applicant will default.

This article merely scratched the surface of how trees and rules can be used.
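As a recap of the splitting criterion from "Choosing the best split", the following Python sketch computes information gain as the entropy before a split minus the weighted entropy of the resulting partitions, and then searches a numeric feature for the cut point with the largest gain. It is a minimal illustration under stated assumptions, not C5.0's implementation: the helper names, the choice of midpoints between adjacent values as candidate thresholds, and the budget data are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, partitions):
    """Entropy before the split minus the proportion-weighted entropy after it."""
    n = len(parent)
    after = sum((len(part) / n) * entropy(part) for part in partitions)
    return entropy(parent) - after


def best_numeric_cut(values, labels):
    """Try a threshold midway between adjacent distinct values; keep the best one."""
    best_gain, best_cut = 0.0, None
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= cut]
        right = [lab for v, lab in zip(values, labels) if v > cut]
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain


# Hypothetical data: budgets in millions of dollars and the observed outcomes.
budgets = [10, 20, 30, 60, 70, 80]
outcomes = ["hit", "hit", "hit", "bust", "bust", "bust"]

cut, gain = best_numeric_cut(budgets, outcomes)
print(cut, gain)
```

Here the chosen threshold separates the two classes completely, so the gain equals the entropy before the split, which is the maximum possible value.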


Pravin Dhandre

22 Mar 2018

10 min read

Creating 2D and 3D plots using Matplotlib


Packt Editorial Staff

06 May 2019

12 min read

Cross-Validation strategies for Time Series forecasting [Tutorial]


Time series modeling and forecasting are tricky and challenging. The i.i.d. (independent and identically distributed) assumption does not hold well for time series data. There is an implicit dependence on previous observations and, at the same time, data leakage from response variables to lag variables is more likely to occur, in addition to the inherent non-stationarity in the data space. By non-stationarity, we mean changes over time in observed statistics such as mean and variance. It gets even trickier when taking inherent nonlinearity into consideration.

Cross-validation is a well-established methodology for choosing the best model by tuning hyper-parameters or performing feature selection. There are a plethora of strategies for implementing cross-validation. K-fold cross-validation is a time-proven example of such techniques. However, it is not robust in handling time series forecasting issues due to the nature of the data as explained above.

In this tutorial, we shall explore two more techniques for performing cross-validation, time series split cross-validation and blocked cross-validation, which are carefully adapted to solve issues encountered in time series forecasting. We shall use Python 3.5, scikit-learn, Matplotlib, NumPy, and pandas.

By the end of this tutorial you will have explored the following topics:
- Time Series Split Cross-Validation
- Blocked Cross-Validation
- Grid Search Cross-Validation
- Loss Function
- Elastic Net Regression

Cross-Validation

Image Source: scikit-learn.org

First, the data set is split into a training and testing set. The testing set is preserved for evaluating the best model optimized by cross-validation. In k-fold cross-validation, the training set is further split into k folds, also known as partitions. During each iteration of the cross-validation, one fold is held as a validation set and the remaining k - 1 folds are used for training. This allows us to make the best use of the data available without discarding any of it.
It also allows us to avoid biasing the model towards patterns that may be overly represented in a given fold. Then the error obtained on all folds is averaged and the standard deviation is calculated. One usually performs cross-validation to find out which settings give the minimum error before training a final model using these selected settings on the complete training set. Flavors of k-fold cross-validation exist, for example, leave-one-out and nested cross-validation. However, these may be the topic of another tutorial.

Grid Search Cross-Validation

One idea to fine-tune the hyper-parameters is to randomly guess the values for model parameters and apply cross-validation to see if they work. This approach is called Random Search in the literature. Trying every possible guess is infeasible, as the number of combinations of such parameters may grow exponentially. Grid search instead works by exhaustively searching a predefined set of combinations of the model's parameters and using the loss function to pick the best one, which amounts to solving a minimization problem. In scikit-learn, it explicitly tries all the possible combinations in the grid, which makes it computationally expensive. When cross-validation is used in the inner loop of the grid search, it is called grid search cross-validation. Hence, the optimization objective becomes minimizing the average loss obtained on the k folds.

R2 Loss Function

Choosing the loss function has a very high impact on model performance and convergence. In this tutorial, I would like to introduce to you a loss function most commonly used in regression tasks. R2, the coefficient of determination, measures how closely the response output from the model matches the ground truth target values, relative to always predicting the mean of the targets. Its range is the interval (-∞, +1]. Hence, +1 indicates a perfect fit, and negative values indicate a fit worse than predicting the mean.
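One common definition of R2, the coefficient of determination, is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition and uses only the standard library; the function name `r2` and the sample values are illustrative, and this is a hand-written stand-in rather than scikit-learn's own scorer.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical target values.
actual = [1.0, 2.0, 3.0, 4.0]

# Predicting every value exactly gives the upper bound.
perfect = r2(actual, actual)

# Always predicting the mean of the targets gives zero.
mean_only = r2(actual, [2.5] * 4)

# Predictions that run in the opposite direction give a negative score.
reversed_fit = r2(actual, [4.0, 3.0, 2.0, 1.0])

print(perfect, mean_only, reversed_fit)
```

There is no lower bound: the worse the predictions, the more negative the score can become.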
Thus, all the errors obtained in this tutorial should be interpreted as desirable if their value is close to +1. It is worth mentioning that we could have chosen a different loss function such as L1-norm or L2-norm. I would encourage you to try the ideas discussed in this tutorial using other loss functions and observe the difference.

Elastic Net Regression

This also goes in the literature by the name elastic net regularization. Regularization is a very robust technique to avoid overfitting by penalizing large weights; in other words, it alters the objective function by emphasizing the errors caused by memorizing the training set. Vanilla linear regression can be tricked into learning parameters that perform very well on the training set, yet fail to generalize to unseen new samples. Both L1-regularization and L2-regularization were incorporated to resolve overfitting and are known in the literature as Lasso and Ridge regression respectively. Due to the critique of both Lasso and Ridge regression, Elastic Net regression was introduced to mix the two models. As a result, some variables' coefficients are set to zero as per the L1-norm and some others are penalized or shrunk as per the L2-norm. This model combines the best from both worlds and the result is a stable, robust, and sparse model. As a consequence, there are more parameters to be fine-tuned. That's why this is a good example to demonstrate the power of cross-validation.

Crypto Data Set

I have obtained Ethereum/USD exchange prices for the year 2019 from cryptodatadownload.com, which you can get for free from the website or by running the following command:

    $ wget http://www.cryptodatadownload.com/cdd/Gemini_ETHUSD_d.csv

Now that you have the CSV file you can import it to Python using pandas. The daily close price is used as both regressor and response variables. In this setup, I have used a lag of 64 days for regressors and a target of 8 days for responses.
That is, given the closing prices of the past 64 days, forecast the next 8 days. The resulting NaN rows at the tail are then dropped as a way to handle missing values.

    import pandas as pd

    # Assumes that df has been reduced to a single column 'd0' holding the
    # daily close price (that step is not shown in this excerpt), and that
    # STEPS, TRAIN_STEPS and SPLIT_IDX are defined beforehand. With a 64-day
    # lag and an 8-day target, STEPS would be 72 and TRAIN_STEPS 64.
    df = pd.read_csv('./Gemini_ETHUSD_d.csv', skiprows=1)

    # Add one shifted copy of the close price per additional time step
    for i in range(1, STEPS):
        col_name = 'd{}'.format(i)
        df[col_name] = df['d0'].shift(periods=-1 * i)

    # Drop the incomplete rows at the tail
    df = df.dropna()

Next, we split the data frame in two: one part for the regressors and the other for the responses. We then split both in two again: one part for training and the other for testing.

    X = df.iloc[:, :TRAIN_STEPS]
    y = df.iloc[:, TRAIN_STEPS:]

    X_train = X.iloc[:SPLIT_IDX, :]
    y_train = y.iloc[:SPLIT_IDX, :]
    X_test = X.iloc[SPLIT_IDX:, :]
    y_test = y.iloc[SPLIT_IDX:, :]

Model Design

Let's define a method that creates an elastic net model from scikit-learn. Since we are going to forecast more than one future time step, let's use a multi-output regressor wrapper that trains a separate model for each target time step. However, this increases the demand for computation resources.

    from sklearn.linear_model import ElasticNet
    from sklearn.multioutput import MultiOutputRegressor

    def build_model(_alpha, _l1_ratio):
        estimator = ElasticNet(
            alpha=_alpha,
            l1_ratio=_l1_ratio,
            fit_intercept=True,
            normalize=False,  # removed in scikit-learn 1.2; drop on newer versions
            precompute=False,
            max_iter=16,
            copy_X=True,
            tol=0.1,
            warm_start=False,
            positive=False,
            random_state=None,
            selection='random'
        )
        # One ElasticNet is fitted per target time step
        return MultiOutputRegressor(estimator, n_jobs=4)

Blocked and Time Series Splits Cross-Validation

The best way to grasp the intuition behind blocked and time series splits is by visualizing them. The three split methods are depicted in the above diagram. The horizontal axis is the training set size while the vertical axis represents the cross-validation iterations. The folds used for training are depicted in blue and the folds used for validation are depicted in orange. You can intuitively interpret the horizontal axis as a time progression line, since we haven't shuffled the dataset and have maintained the chronological order.
The idea for time series splits is to divide the training set into two folds at each iteration on condition that the validation set is always ahead of the training split. At the first iteration, one trains the candidate model on the closing prices from January to March and validates on April’s data, and for the next iteration, train on data from January to April, and validate on May’s data, and so on to the end of the training set. This way dependence is respected. However, this may introduce leakage from future data to the model. The model will observe future patterns to forecast and try to memorize them. That’s why blocked cross-validation was introduced. It works by adding margins at two positions. The first is between the training and validation folds in order to prevent the model from observing lag values which are used twice, once as a regressor and another as a response. The second is between the folds used at each iteration in order to prevent the model from memorizing patterns from an iteration to the next. Implementing k-fold cross-validation using sci-kit learn is pretty straightforward, but in the following lines of code, we pass the k-fold splitter explicitly as we will develop the idea further in order to implement other kinds of cross-validation. 
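Before running the splitters, it may help to see the index patterns that the two schemes produce. The following is a minimal sketch in plain Python; the function names, the toy sample counts, and the 80/20 ratio inside each block are illustrative assumptions, and the first generator only approximates what a time series splitter does rather than reproducing any library's exact behaviour.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train, validation) index lists where validation always
    comes after training and the training window grows each iteration."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - i + 1) * fold
        yield list(range(0, train_end)), list(range(train_end, train_end + fold))


def blocked_splits(n_samples, n_splits, margin=0, train_fraction=0.8):
    """Yield (train, validation) index lists from disjoint blocks, leaving
    `margin` unused samples between training and validation in each block."""
    fold = n_samples // n_splits
    for i in range(n_splits):
        start = i * fold
        stop = start + fold
        mid = start + int(train_fraction * fold)
        yield list(range(start, mid)), list(range(mid + margin, stop))


ts = list(expanding_window_splits(n_samples=12, n_splits=3))
bs = list(blocked_splits(n_samples=20, n_splits=2, margin=1))

for train, val in ts:
    print('time series split:', train, '->', val)
for train, val in bs:
    print('blocked split:    ', train, '->', val)
```

In the blocked output, index 8 and index 18 are skipped: they are the margin samples that keep the validation fold from touching the tail of its training fold.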
    from sklearn.model_selection import KFold, cross_val_score

    # r2 is assumed to be a scorer defined earlier, e.g. make_scorer(r2_score);
    # scikit-learn also accepts the string 'r2' for the scoring argument.
    model = build_model(_alpha=1.0, _l1_ratio=0.3)
    kfcv = KFold(n_splits=5)
    scores = cross_val_score(model, X_train, y_train, cv=kfcv, scoring=r2)
    print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std()))

This outputs:

    Loss: -103.076 (+/- 205.979)

The same applies to the time series splitter as follows:

    from sklearn.model_selection import TimeSeriesSplit

    model = build_model(_alpha=1.0, _l1_ratio=0.3)
    tscv = TimeSeriesSplit(n_splits=5)
    scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring=r2)
    print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std()))

This outputs:

    Loss: -9.799 (+/- 19.292)

Scikit-learn gives us the luxury of defining new types of splitters, as long as we abide by its splitter API, that is, we provide the split and get_n_splits methods.

    import numpy as np

    class BlockingTimeSeriesSplit():
        def __init__(self, n_splits):
            self.n_splits = n_splits

        def get_n_splits(self, X=None, y=None, groups=None):
            return self.n_splits

        def split(self, X, y=None, groups=None):
            n_samples = len(X)
            k_fold_size = n_samples // self.n_splits
            indices = np.arange(n_samples)

            # Number of samples skipped between training and validation
            margin = 0
            for i in range(self.n_splits):
                start = i * k_fold_size
                stop = start + k_fold_size
                # First 80% of each block trains, the rest validates
                mid = int(0.8 * (stop - start)) + start
                yield indices[start: mid], indices[mid + margin: stop]

Then we can use it in exactly the same way as before.

    model = build_model(_alpha=1.0, _l1_ratio=0.3)
    btscv = BlockingTimeSeriesSplit(n_splits=5)
    scores = cross_val_score(model, X_train, y_train, cv=btscv, scoring=r2)
    print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std()))

This outputs:

    Loss: -15.527 (+/- 27.488)

Please notice how the loss differs among the different types of splitters. In order to interpret the results correctly, let's put them to the test by using grid search cross-validation to find the optimal values for both the regularization parameter alpha and the L1 ratio (l1_ratio), which controls how much the L1-norm contributes to the regularization. It follows that the L2-norm contributes 1 - l1_ratio.
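To see how the two parameters just described, alpha and the L1 mixing ratio (l1_ratio in the code), trade off against each other, here is a small sketch of the penalty term. It follows the parameterisation documented for scikit-learn's ElasticNet, alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2; the helper name and the toy weights are assumptions made for illustration.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty added to the squared-error term of the objective
    (hypothetical helper; it only evaluates the formula)."""
    l1 = sum(abs(w) for w in weights)        # L1-norm of the weights
    l2_squared = sum(w * w for w in weights) # squared L2-norm of the weights
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


# Toy weight vector, made up for illustration only
weights = [0.5, -2.0, 0.0]

ridge_like = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.0)
lasso_like = elastic_net_penalty(weights, alpha=1.0, l1_ratio=1.0)
mixed = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5)

print(ridge_like, lasso_like, mixed)
```

With l1_ratio at 0 only the L2 term remains, with l1_ratio at 1 only the L1 term remains, and alpha scales the whole penalty.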
    from sklearn.model_selection import GridSearchCV

    params = {
        'estimator__alpha': (0.1, 0.3, 0.5, 0.7, 0.9),
        'estimator__l1_ratio': (0.1, 0.3, 0.5, 0.7, 0.9)
    }

    for i in range(100):
        model = build_model(_alpha=1.0, _l1_ratio=0.3)

        # fit_params and iid were removed in later scikit-learn releases;
        # drop those two arguments on current versions.
        finder = GridSearchCV(
            estimator=model,
            param_grid=params,
            scoring=r2,
            fit_params=None,
            n_jobs=None,
            iid=False,
            refit=False,
            cv=kfcv,  # change this to the splitter subject to test
            verbose=1,
            pre_dispatch=8,
            error_score=-999,
            return_train_score=True
        )

        finder.fit(X_train, y_train)

        best_params = finder.best_params_

Experimental Results

[Figure: K-Fold Cross-Validation Optimal Parameters]

Grid search cross-validation was run 100 times in order to objectively measure the consistency of the results obtained using each splitter. This way we can evaluate the effectiveness and robustness of the cross-validation method on time series forecasting. As for the k-fold cross-validation, the parameters suggested were almost uniform. That is, it did not really help us in discriminating the optimal parameters, since all were equally good or bad.

[Figure: Time Series Split Cross-Validation Optimal Parameters]

[Figure: Blocked Cross-Validation Optimal Parameters]

However, in the cases of both time series split cross-validation and blocked cross-validation, we have obtained a clear indication of the optimal values for both parameters. In the case of blocked cross-validation, the results were even more discriminative, as the blue bar indicates the dominance of the optimal l1_ratio value of 0.1.

Ground Truth vs Forecasting

After having obtained the optimal values for our model parameters, we can train the model and evaluate it on the testing set. The results, as depicted in the plot above, indicate a smooth capture of the trend and a minimal error rate.
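For a sense of what the grid search above costs, the sketch below enumerates the parameter grid with the standard library and counts the model fits. The variable names are illustrative; the fold count (5) and the number of targets (8) are taken from the setup described in this tutorial.

```python
from itertools import product

# Same grid values as in the params dictionary of the tutorial
alphas = (0.1, 0.3, 0.5, 0.7, 0.9)
l1_ratios = (0.1, 0.3, 0.5, 0.7, 0.9)

# Every combination that an exhaustive grid search has to evaluate
candidates = [
    {'estimator__alpha': a, 'estimator__l1_ratio': r}
    for a, r in product(alphas, l1_ratios)
]

n_folds = 5    # splits used by each cross-validation splitter
n_targets = 8  # one elastic net model per forecast day in the multi-output wrapper

fits_per_search = len(candidates) * n_folds
models_per_search = fits_per_search * n_targets

print(len(candidates), fits_per_search, models_per_search)
```

Repeating the search 100 times multiplies these counts by 100, which is why the small max_iter and loose tol in build_model matter for the running time.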
    from sklearn.metrics import r2_score

    # optimal model
    model = build_model(_alpha=0.1, _l1_ratio=0.1)

    # train model
    model.fit(X_train, y_train)

    # test score
    y_predicted = model.predict(X_test)
    score = r2_score(y_test, y_predicted, multioutput='uniform_average')
    print("Test Loss: {0:.3f}".format(score))

The output is:

    Test Loss: 0.925

Ideas for the Curious

In this tutorial, we have demonstrated the power of using the right cross-validation strategy for time-series forecasting. The beauty of machine learning is endless. Here are a few ideas to try out and experiment with on your own:

Try using a different, more volatile data set
Try using different lag and target lengths instead of 64 and 8 days
Try different regression models
Try different loss functions
Try RNN models using Keras
Try increasing or decreasing the blocked splits margins
Try a different value for k in cross-validation

References

Jeff Racine, Consistent cross-validatory model-selection for dependent data: hv-block cross-validation, Journal of Econometrics, Volume 99, Issue 1, 2000, Pages 39-61, ISSN 0304-4076.
Dabbs, Beau & Junker, Brian. (2016). Comparison of Cross-Validation Methods for Stochastic Block Models.
Marcos Lopez de Prado, 2018, Advances in Financial Machine Learning (1st ed.), Wiley Publishing.
Doctor, Grado DE et al. “New approaches in time series forecasting: methods, software, and evaluation procedures.” (2013).

Learn More

Seize the chance to learn more about time series forecasting techniques, machine learning, trading strategies, and algorithmic trading in my step-by-step online video course: Hands-on Machine Learning for Algorithmic Trading Bots with Python on PacktPub.

Author Bio

Mustafa Qamar-ud-Din is a machine learning engineer with over 10 years of experience in the software development industry, engaged with startups on solving problems in various domains: e-commerce applications, recommender systems, biometric identity control, and event management.

Sugandha Lahoti

20 Mar 2018

8 min read

25 Datasets for Deep Learning in IoT


Sugandha Lahoti

13 Dec 2017

6 min read

How to Customize lines and markers in Matplotlib 2.0


This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim, titled Matplotlib 2.x By Example. The book illustrates methods and applications of various plot types through real world examples.

In this post we demonstrate how you can manipulate lines and markers in Matplotlib 2.0. It covers steps to plot, customize, and adjust line graphs and markers.

What are Lines and Markers

Lines and markers are key components found among various plots. Many times, we may want to customize their appearance to better distinguish different datasets or for better or more consistent styling. Whereas markers are mainly used to show data, as in line plots and scatter plots, lines are involved in various components, such as grids, axes, and box outlines. Like text properties, we can easily apply similar settings to different line or marker objects with the same method.

Lines

Most lines in Matplotlib are drawn with the lines class, including the ones that display the data and those setting area boundaries. Their style can be adjusted by altering parameters in lines.Line2D. We usually set color, linestyle, and linewidth as keyword arguments. These can be written in shorthand as c, ls, and lw respectively. In the case of simple line graphs, these parameters can be passed to the plt.plot() function:

    import numpy as np
    import matplotlib.pyplot as plt

    # Prepare a curve of square numbers
    x = np.linspace(0,200,100)  # Prepare 100 evenly spaced numbers from 0 to 200
    y = x**2                    # Prepare an array of y equals to x squared

    # Plot a curve of square numbers
    plt.plot(x,y,label = '$x^2$',c='burlywood',ls=('dashed'),lw=2)
    plt.legend()
    plt.show()

With the preceding keyword arguments for line color, style, and weight, you get a woody dashed curve:

Choosing dash patterns

Whether a line will appear solid or with dashes is set by the keyword argument linestyle.
There are a few simple patterns that can be set by the linestyle name or the corresponding shorthand. We can also define our own dash pattern:

'solid' or '-': Simple solid line (default)
'dashed' or '--': Dash strokes with equal spacing
'dashdot' or '-.': Alternate dashes and dots
'dotted' or ':': Dotted line
'None', ' ', or '': No lines
(offset, on-off-dash-seq): Customized dashes; we will demonstrate in the following advanced example

Setting capstyle of dashes

The cap of dashes can be rounded by setting the dash_capstyle parameter if we want to create a softer image, such as in promotional material:

    import numpy as np
    import matplotlib.pyplot as plt

    # Prepare 6 lines
    x = np.linspace(0,200,100)
    y1 = x*0.5
    y2 = x
    y3 = x*2
    y4 = x*3
    y5 = x*4
    y6 = x*5

    # Plot lines with different dash cap styles
    plt.plot(x,y1,label = '0.5x', lw=5, ls=':',dash_capstyle='butt')
    plt.plot(x,y2,label = 'x', lw=5, ls='--',dash_capstyle='butt')
    plt.plot(x,y3,label = '2x', lw=5, ls=':',dash_capstyle='projecting')
    plt.plot(x,y4,label = '3x', lw=5, ls='--',dash_capstyle='projecting')
    plt.plot(x,y5,label = '4x', lw=5, ls=':',dash_capstyle='round')
    plt.plot(x,y6,label = '5x', lw=5, ls='--',dash_capstyle='round')
    plt.show()

Looking closely, you can see that the top two lines are made up of rounded dashes. The middle two lines, with the projecting capstyle, have more closely spaced dashes than the lower two with the butt capstyle, given the same default spacing:

Markers

A marker is another important type of component for illustrating data, for example, in scatter plots, swarm plots, and time series plots.

Choosing markers

There are two groups of markers: unfilled markers and filled markers. The full set of available markers can be found by calling Line2D.markers, which will output a dictionary of symbols and their corresponding marker style names. A subset of filled markers that gives more visual weight is under Line2D.filled_markers.
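As a quick way to explore the two attributes just mentioned, the short sketch below inspects them without drawing anything. The variable names are illustrative, and the exact set of markers printed may vary between Matplotlib versions.

```python
from matplotlib.lines import Line2D

# Dictionary mapping marker symbols to their style names
all_markers = Line2D.markers
# Tuple of the marker symbols that are drawn filled
filled = Line2D.filled_markers

# Look up the style name behind a symbol
print("'o' is a", all_markers['o'])

# Markers that are not in the filled group
unfilled_only = [m for m in all_markers if m not in filled]
print('filled markers:', filled)
print('number of other markers:', len(unfilled_only))
```

Note that marker symbols are case-sensitive, so a lookup only succeeds with the exact symbol listed in the dictionary.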
Here are some of the most typical markers:

'o' : Circle
'x' : Cross
'+' : Plus sign
'P' : Filled plus sign
'D' : Filled diamond
's' : Square
'^' : Triangle

Here is a scatter plot of random numbers to illustrate the various marker types:

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.lines import Line2D

    # Prepare 100 random numbers to plot
    x = np.random.rand(100)
    y = np.random.rand(100)

    # Prepare 100 random numbers within the range of the number of
    # available markers as index
    # Each random number will serve as the choice of marker of the
    # corresponding coordinates
    markerindex = np.random.randint(0, len(Line2D.markers), 100)

    # Plot all kinds of available markers at random coordinates
    # for each type of marker, plot a point at the above generated
    # random coordinates with the marker type
    for k, m in enumerate(Line2D.markers):
        i = (markerindex == k)
        plt.scatter(x[i], y[i], marker=m)

    plt.show()

The different markers suit different densities of data for better distinction of each point:

Adjusting marker sizes

We often want to change the marker sizes so as to make them clearer to read from a slideshow.
Sometimes we also need to give different markers different numerical values of marker size so that they look balanced:

    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker

    # Prepare 5 lines
    x = np.linspace(0,20,10)
    y1 = x
    y2 = x*2
    y3 = x*3
    y4 = x*4
    y5 = x*5

    # Plot lines with different marker sizes
    plt.plot(x,y1,label = 'x', lw=2, marker='s', ms=10)   # square size 10
    plt.plot(x,y2,label = '2x', lw=2, marker='^', ms=12)  # triangle size 12
    plt.plot(x,y3,label = '3x', lw=2, marker='o', ms=10)  # circle size 10
    plt.plot(x,y4,label = '4x', lw=2, marker='D', ms=8)   # diamond size 8
    plt.plot(x,y5,label = '5x', lw=2, marker='P', ms=12)  # filled plus sign size 12

    # get current axes and store it to ax
    ax = plt.gca()

    plt.show()

After tuning the marker sizes, the different series look quite balanced:

If all markers are set to have the same markersize value, the diamonds and squares may look heavier:

Thus, we learned how to customize lines and markers in a Matplotlib plot for better visualization and styling. To know more about how to create and customize plots in Matplotlib, check out the book Matplotlib 2.x By Example.


Bruno Pedro

06 Nov 2024

15 min read

Mastering the API Life Cycle: A Comprehensive Guide to Design, Implementation, Release, and Maintenance


Mostafa Yahia

11 Nov 2024

15 min read

Mastering Threat Detection with VirusTotal: A Guide for SOC Analysts


Huage Chen

02 Jul 2024

12 min read

Creating and Using Kibana Dashboards


This article is an excerpt from the book, Elastic Stack 8.x Cookbook, by Huage Chen and Yazid Akadiri. Unlock the full potential of Elastic Stack for search, analytics, security, and observability and manage substantial data workloads in both on-premise and cloud environments.

Introduction

In this guide, we will integrate all previously created visualizations into a comprehensive dashboard consisting of multiple panels. Additionally, we will explore how to enhance user interaction using control-based drilldowns.

Getting ready

Make sure to complete the following recipes from this chapter:

Creating visualizations with Kibana Lens
Creating visualizations from runtime fields
Creating Kibana maps

At the end of this recipe, you will have dashboards composed of the various visualizations and elements built in the aforementioned recipes.

How to do it...

Building dashboards is very straightforward in Kibana, especially if you've already created some visualizations. Follow these steps:

1. Go to Kibana | Analytics | Dashboard and click on Create dashboard. This will bring you to a blank canvas, where you can start adding some visualizations.

2. We will start by adding a nice image! You can be creative, but we provided a sample picture:
   A. Click on Add panel | Image.
   B. Select the Use link tab and set Link to image with the following URL: https://upload.wikimedia.org/wikipedia/commons/6/60/Ville_de_RENNES_Noir.svg. Then, click on Save:

Figure 6.54 – Adding an image for a logo

The logo will be added to the panel. Including a picture is a great way to add some personalization and branding to your dashboards. Let's add some proper visualizations from the ones we've built in the last three recipes.

3. Click on Add from library and select the [Rennes Traffic] Number of locations visualization. Make sure to align it to the right with the image panel.

4. Let's add another visualization; this time, we'll pick [Rennes Traffic] Average speed gauge.

At this stage, your dashboard should look like the one shown in Figure 6.55:

Figure 6.55 – Rennes traffic dashboard – first step

You can easily rearrange the position of the different panels by clicking on the title section and moving the panel with your mouse anywhere you want on the canvas. To adjust the size and fit of the panel, position your mouse on the small arrow at the bottom right of the panel. Let's keep adding more panels to our dashboard.

5. Click on Add from library and add the following visualizations in the respective order:
   I. [Rennes Traffic] Traffic status waffle
   II. [Rennes Traffic] Speed by road hierarchy
   III. [Rennes Traffic] Average speed & Traffic Status
   IV. [Rennes Traffic] Traffic status by hour

6. Finally, let's add a Map visualization for a real-time view of the traffic; select the one named [Rennes Traffic] Traffic fluidity.

By now, your dashboard should look like the one shown in Figure 6.56:

Figure 6.56 – Rennes traffic dashboard – more visualizations

You can start playing around with the dashboard to see the built-in interactivity of the panels. For example, clicking on a specific road hierarchy will automatically apply the filter to the entire dashboard.

You can also have dedicated panels to filter and display only the data you are interested in with Controls. Let's add some to our dashboard.

7. On the dashboard toolbar, click on Controls:

Figure 6.57 – Adding controls to the dashboard

8. From the drop-down list, select Add control; the Create control flyout will appear on the right of the screen.

9. Select the traffic_status field and click on Save and close.

10. Back in the dashboard, you now have a new panel on top of the visualization named traffic_status. By clicking on it, you will see a drop-down list where you can select the values associated with the status of the traffic you want to filter, as shown in Figure 6.58. Select congested as an example:

Figure 6.58 – Using controls in the dashboard

11. You can see on your dashboard that all the panels have been updated according to the value selected in the traffic_status control.

Imagine you want to filter your traffic data to analyze it within a specific time range, such as early in the morning or late in the afternoon, to better understand traffic patterns. This is where the time slider control proves to be incredibly useful.

12. Go to the Controls menu again in the dashboard toolbar and select Add time slider control. You'll see a new panel to the right of traffic_status:

Figure 6.59 – Time slider control

By clicking the play icon, you will see your dashboard animate and your data change over the defined time range. You can advance the time range forward as well as backward, which is especially useful when working with time series data.

Your dashboard should now look as shown in Figure 6.60, with our two controls:

Figure 6.60 – Rennes traffic dashboard with controls

13. Save the dashboard by clicking the Save button in the upper-right corner. Name it [Rennes Traffic] Overview.

To enhance our dashboard further, consider this: users frequently manage multiple dashboards, and the ability to navigate seamlessly from one to another is crucial, especially when aiming to refine analysis or focus on more detailed panels related to a specific dataset. Dashboard drilldowns are invaluable in this scenario as they allow you to transition between dashboards while maintaining the overall context. Let's explore how to implement and use this feature effectively!

For this exercise, we have already built a drilldown dashboard. Download and save the NDJSON file of the exported dashboard from the following location: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter6/kibana-objects/rennesdata-drilldown-dashboard.ndjson. Then, follow these steps:

1. To import the dashboard, go to Stack Management | Saved Objects.

2. Click on Import and select the NDJSON file you have previously downloaded from the GitHub repository. Upon completing the import process, you will notice a warning in the flyout about data view conflicts. The reason is straightforward: our saved objects rely on an existing data view. To resolve the conflict, simply click on the drop-down list under the New data view column and select metrics-rennes_traffic-raw, as shown in Figure 6.61, then click on Confirm all changes to finalize the import procedure:

Figure 6.61 – Importing saved objects and selecting the right data view

3. Once all the objects have been imported, you will get a recap as shown in the following screenshot:

Figure 6.62 – Saved objects successfully imported from the file

Return to the [Rennes Traffic] Overview dashboard. Then, open the menu for the [Rennes Traffic] Speed by road hierarchy panel and select Create drilldown:

Figure 6.63 – Creating drilldown from the panel

4. Navigate to the drilldowns page and select the Go to Dashboard option. Here, you will need to name your drilldown; consider View Details for Road Hierarchy as a suggestion. Then, from the Choose destination dashboard drop-down menu, select [Rennes Traffic] Detailed traffic drilldown dashboard, which you have recently imported. This process sets up a targeted navigation path within your dashboard environment, allowing for a seamless transition between your overview and detailed analysis dashboards:

Figure 6.64 – Configuring dashboard drilldown

5. Click on Create drilldown. Save the dashboard. To test our drilldown, click on one of the five charts in the [Rennes Traffic] Speed by road hierarchy panel. You will be redirected to the detailed dashboard filtered on the value you have selected.

Figure 6.65 – Dashboard view after drilldown

Et voilà! You have just built your first dashboard with a nice touch of interactivity thanks to controls and drilldowns.

How it works...

In Kibana, a dashboard is a collection of visualizations and saved searches that you can arrange and customize to display the data that is most important to you. You can create multiple dashboards for different use cases, and each dashboard can have its own set of visualizations and searches.

Dashboards are a powerful tool for data analysis because they allow you to see multiple visualizations side by side and quickly identify patterns and trends in your data. You can also use dashboards to monitor key metrics in real time, which is especially useful for operational use cases. Kibana provides a wide range of visualization types that you can use to create custom dashboards, including bar charts, line charts, pie charts, tables, and more.

The following table outlines a framework for choosing the right visualization:

Use case | Recommended type of visualization
Comparison and correlation | Many items: Horizontal bar; Few items: Vertical bar
Comparison over time | Few periods and categories: Stacked bar; Few time periods but many categories: Line graph
Distribution of values | Few numbers of points: Vertical bar histogram; Many points: Line histogram
Composition of a whole | Simple compositions with few items: Waffle or Treemap; Multiple grouping dimensions for a few bottom-level items: Mosaic; Multiple grouping dimensions for many bottom-level items: Treemap
Eye-catching summary | One value: Metric; Many values: Table with color styling
Visualizing goals or targets | Vertical bar or Line with reference lines; Metric

Table 6.2 – Choosing the right visualization

In addition to visualizations, Kibana dashboards also support saved searches, which allow you to quickly filter your data based on specific criteria. You can save searches that you use frequently and add them to your dashboard for easy access.

Overall, Kibana dashboards are a powerful tool for data analysis and monitoring. They allow you to quickly identify patterns and trends in your data, monitor key metrics in real time, and customize your view of the data to suit your needs.

There's more...

In our recipe, we have used dashboard drilldowns, but you can also create URL and Discover drilldowns. With the former, you can link to data outside of Kibana, and with the latter, you can open Discover from a Lens panel while keeping all the contextual information.

Dashboards are great when used in Kibana, but you can also share them with teams and colleagues outside of Kibana. You have many options that are easily accessible from the Share menu in the toolbar when it comes to sharing dashboards: you can interactively embed dashboards as an iFrame, export them as reports in various formats (PNG, CSV, PDF, etc.), and share them as direct links for easy access.

When building dashboards, design thinking is a good practice. Start by asking yourself the following questions:

What is the outcome or the goal of the dashboard? Is it about understanding high-level behaviors, visually correlating specific metrics at the same time, or finding the root cause of an issue?
Who is using this dashboard to do their job? If you are building it for a team or someone else, step into their shoes to visualize their perspective when they will need that data.

See also

Looking for more design tips to elevate your dashboards? Look no further and check out this blog: https://www.elastic.co/blog/designing-intuitive-kibanadashboards-as-a-non-designer

If you're interested in delving deeper into the topics of creating dashboards more efficiently, be sure to check out this technical blog: https://www.elastic.co/blog/buildingkibana-dashboards-more-efficiently

For developers interested in debugging their Kibana dashboard, the following article will be very useful: https://www.elastic.co/blog/debugging-kibana-dashboards

Conclusion

In this guide, we've explored the process of integrating various visualizations into a comprehensive Kibana dashboard, enhancing user interaction through control-based drilldowns. By following the steps outlined, you should now have a functional and interactive dashboard that can provide valuable insights into your data.

We began by preparing the necessary visualizations and then moved on to assembling the dashboard by adding images for personalization and aligning various traffic visualizations. We also incorporated control panels for dynamic filtering, allowing for more precise data analysis. The final touch was adding drilldowns to enable seamless navigation between detailed and overview dashboards.

Kibana dashboards offer powerful tools for data analysis and real-time monitoring. By displaying multiple visualizations side by side, you can quickly identify patterns and trends, making dashboards invaluable for operational and analytical use cases.

Remember, the key to a successful dashboard is thoughtful design: consider the goals, the audience, and the specific data insights needed. Utilize the wide range of visualization types that Kibana offers and don't hesitate to leverage the sharing options to collaborate with your team effectively.

For further reading and advanced tips on designing intuitive dashboards, building them efficiently, or debugging, check out the additional resources provided. Happy dashboarding!

Author Bio

Huage Chen is a member of Elastic's customer engineering team and has been with Elastic for over five years, helping users throughout Europe to innovate and implement cloud-based solutions for search, data analysis, observability, and security. Before joining Elastic, he worked for 10 years in web content management, web portals, and digital experience platforms.

Yazid Akadiri has been a solutions architect at Elastic for over four years, helping organizations and users solve their data and most critical business issues by harnessing the power of the Elastic Stack. At Elastic, he works with a broad range of customers, with a particular focus on Elastic observability and security solutions. He previously worked in web services-oriented architecture, focusing on API management and helping organizations build modern applications.


Savia Lobo

30 Jul 2018

12 min read

How does Elasticsearch work? [Tutorial]


Vincy Davis

07 May 2019

9 min read

Which Python framework is best for building RESTful APIs? Django or Flask?


Python is one of the top-rated programming languages. It is known for its relatively simple syntax and for being a high-level, object-oriented, robust, and general-purpose language. Python is a popular choice for first-time programmers. Since its release in 1991, Python has evolved and is now powered by several frameworks, from web application development, scientific and mathematical computing, and graphical user interfaces to the latest REST API frameworks. This article is an excerpt taken from the book, 'Hands-On RESTful API Design Patterns and Best Practices' written by Harihara Subramanian and Pethuru Raj. This book covers design strategy, essential and advanced RESTful API patterns, and legacy modernization to microservices-centric apps. In this article, we'll explore two comprehensive frameworks, Django and Flask, so that you can choose the best one for developing your RESTful API.

Django

Django is an open source web framework, available under the BSD license, designed to help developers create their web apps very quickly, as it takes care of additional web-development needs. It includes several packages (also known as applications) to handle typical web-development tasks, such as authentication, content administration, scaffolding, templates, caching, and syndication. Let's look at the Django REST Framework (DRF), built with Python, and use it for REST API development and deployment.

Django Rest Framework

DRF is an open source, well-matured Python and Django library intended to help app developers build sophisticated web APIs. DRF's modular, flexible, and customizable architecture makes the development of both simple, turnkey API endpoints and complicated REST constructs possible. The goal of DRF is to take a model, generalize its wire representation, such as JSON or XML, and customize a set of class-based views to satisfy a specific API endpoint, using a serializer that describes the mapping between views and API endpoints.
Core features

DRF has many distinct features, including the following.

Web-browsable API

This feature enhances the REST API developed with DRF. It has a rich interface, and the web-browsable API supports multiple media types too. The browsable API means that the APIs we build are self-describing, and the endpoints we create as part of the REST services can return JSON or HTML representations. The interesting fact about the web-browsable API is that we can interact with it fully through the browser: any endpoint that we interact with using a programmatic client can also respond with a browser-friendly view.

Authentication

One of the most attractive features of DRF is authentication. It supports a broad range of authentication schemes, from basic authentication, token authentication, session authentication, and remote user authentication to OAuth authentication. It also supports custom authentication schemes if we wish to implement one. DRF runs the authentication scheme at the start of the view, that is, before any other code is allowed to proceed. DRF determines the privileges of the incoming request from the permission and throttling policies and then decides whether the incoming request, with its matched credentials, is allowed or disallowed.

Serialization and deserialization

Serialization is the process of converting complex data, such as querysets and model instances, into native Python datatypes. These native datatypes can then be easily rendered into formats such as JSON or XML. DRF supports serialization through serializer classes, which are similar to Django's Form and ModelForm classes. It provides a Serializer class, which helps to control the output of responses, and ModelSerializer classes, which provide a simple mechanism for creating serializers that deal with model instances and querysets.
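To make the idea concrete, here is a minimal, framework-free sketch of a serialize and validate-then-deserialize round trip. It deliberately does not use DRF's API: the `Article` class and the `serialize` and `deserialize` helpers are hypothetical names introduced only for illustration, and a real DRF serializer would declare fields and validation rules instead of hand-written checks.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Article:
    """Hypothetical model object standing in for a Django model instance."""
    title: str
    word_count: int


def serialize(article: Article) -> str:
    """Complex object -> native Python dict -> JSON text."""
    return json.dumps(asdict(article))


def deserialize(payload: str) -> Article:
    """JSON text -> native data -> validated -> complex object."""
    data = json.loads(payload)
    # Validate the incoming data before building the object.
    if not isinstance(data.get("title"), str) or not data["title"]:
        raise ValueError("title must be a non-empty string")
    if not isinstance(data.get("word_count"), int) or data["word_count"] < 0:
        raise ValueError("word_count must be a non-negative integer")
    return Article(title=data["title"], word_count=data["word_count"])


wire = serialize(Article(title="Django or Flask?", word_count=1800))
restored = deserialize(wire)
```

The point of the sketch is the order of operations: objects are reduced to native types before rendering, and incoming data is validated before it is turned back into an object.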
Serializers also perform deserialization, that is, they allow parsed data to be converted back into complex types. Deserialization happens only after the incoming data has been validated.

Other noteworthy features

Here are some other noteworthy features of the DRF:

- Routers: The DRF supports automatic URL routing to Django and provides a consistent and straightforward way to wire the view logic to a set of URLs
- Class-based views: A dominant pattern that enables the reusability of common functionalities
- Hyperlinking APIs: The DRF supports various styles (using primary keys, hyperlinking between entities, and so on) to represent the relationship between entities
- Generic views: Allows us to build API views that map to the database models

DRF has many other features, such as caching, throttling, and testing.

Benefits of the DRF

Here are some of the benefits of the DRF:

- Web-browsable API
- Authentication policies
- Powerful serialization
- Extensive documentation and excellent community support
- Simple yet powerful
- Test coverage of source code
- Secure and scalable
- Customizable

Drawbacks of the DRF

Here are some facts that may disappoint some Python app developers who intend to use the DRF:

- Monolithic: components get deployed together
- Based on Django ORM
- Steep learning curve
- Slow response time

Flask

Flask is a microframework for Python developers based on Werkzeug (WSGI toolkit) and Jinja 2 (template engine). It comes under BSD licensing. Flask is very easy to set up and simple to use. Like other frameworks, it comes with several out-of-the-box capabilities, such as a built-in development server, debugger, unit test support, templating, secure cookies, and RESTful request dispatching. The Flask-RESTful framework is discussed below.

Flask-RESTful

Flask-RESTful is an extension for Flask that provides additional support for building REST APIs. You will never be disappointed with the time it takes to develop an API.
Flask-RESTful is a lightweight abstraction that works with your existing ORM and libraries. It encourages best practices with minimal setup.

Core features of Flask-RESTful

Flask-RESTful comes with several built-in features. The RESTful frameworks for Django and Flask have much in common, because they support almost the same core features. The features unique to Flask-RESTful are described below.

Resourceful routing

The design goal of Flask-RESTful is to provide resources built on top of Flask pluggable views. The pluggable views provide a simple way to access the HTTP methods. Consider the following example code:

```python
from flask_restful import Resource, reqparse

# request parser used by put(); its arguments are not shown in the original
parser = reqparse.RequestParser()


class Todo(Resource):
    def get(self, user_id):
        ...

    def delete(self, user_id):
        ...

    def put(self, user_id):
        args = parser.parse_args()
        ...
```

RESTful request parsing

Request parsing refers to an interface modeled after argparse, the Python parser for command-line arguments. The RESTful request parser is designed to provide uniform and straightforward access to any variable that comes within the request object (flask.request).

Output fields

In most cases, app developers prefer to control how response data is rendered, and Flask-RESTful provides a mechanism where you can use ORM models or even custom classes as the object to render. Another interesting fact about this framework is that app developers don't need to worry about exposing any internal data structures, as it lets you format and filter the response objects. So, when we look at the code, it is evident which data will be rendered and how it will be formatted.

Other noteworthy features

Here are some other noteworthy features of Flask-RESTful:

- API: This is the main entry point for the RESTful API, which we'll initialize with the Flask application.
- ReqParse: This enables us to add and parse multiple arguments in the context of a single request.
- Input: A useful piece of functionality that parses the input string and returns true or false depending on the input.
If the input is from the JSON body, the type is already a native Boolean and is passed through without further parsing.

Benefits of the Flask framework

Here are some of the benefits of the Flask framework:

- Built-in development server and debugger
- Out-of-the-box RESTful request dispatching
- Support for secure cookies
- Integrated unit-test support
- Lightweight
- Very minimal setup
- Faster (performance)
- Easy NoSQL integration
- Extensive documentation

Drawbacks of Flask

Here are some of Flask and Flask-RESTful's disadvantages:

- Version management (managed by developers)
- No brownie points, as it doesn't have browsable APIs
- May incur a steep learning curve

Frameworks: a table of reference

The following table provides a quick reference of a few other prominent micro-frameworks, their features, and supported programming languages:

| Language | Framework | Short description | Prominent features |
| Java | Blade | Fast and elegant MVC framework for Java 8 | Lightweight; high performance; based on the MVC pattern; RESTful-style router interface; built-in security |
| Java/Scala | Play Framework | High-velocity reactive web framework for Java and Scala | Lightweight, stateless, and web-friendly architecture; built on Akka; supports predictable and minimal resource consumption for highly scalable applications; developer-friendly |
| Java | Ninja Web Framework | Full-stack web framework | Fast; developer-friendly; rapid prototyping; plain vanilla Java, dependency injection, first-class IDE integration; simple and fast to test (mocked tests/integration tests); excellent build and CI support; clean codebase that is easy to extend |
| Java | RESTEASY | JBoss-based implementation that integrates several frameworks to help build RESTful web and Java applications | Fast and reliable; large community; enterprise-ready; security support |
| Java | RESTLET | A lightweight and comprehensive framework based on Java, suitable for both server and client applications | Lightweight; large community; native REST support; connectors set |
| JavaScript | Express.js | Minimal and flexible Node.js-based JavaScript framework for mobile and web applications | HTTP utility methods; security updates; templating engine |
| PHP | Laravel | An open source web-app builder based on PHP and the MVC architecture pattern | Intuitive interface; Blade template engine; Eloquent ORM as default |
| Elixir | Phoenix (Elixir) | A reliable and fast micro-framework powered by the Elixir functional language | MVC-based; high application performance; the Erlang virtual machine enables better use of resources |
| Python | Pyramid | Python-based micro-framework | Lightweight; function decorators; events and subscribers support; easy implementation and high productivity |

Summary

It's evident that Python has two excellent frameworks. Depending on the programming language you intend to use and the features you require, you can choose the type of framework to work with. If you are interested in learning more about the design strategy, guidelines, and best practices of RESTful API patterns, you can refer to our book 'Hands-On RESTful API Design Patterns and Best Practices'.


Vijin Boricha

12 Apr 2018

15 min read

Image filtering techniques in OpenCV


In the world of computer vision, image filtering is used to modify images. These modifications essentially allow you to clarify an image in order to get the information you want. This could involve anything from extracting edges from an image to blurring it or removing unwanted objects.

There are, of course, lots of reasons why you might want to use image filtering to modify an image. For example, taking a picture in sunlight or darkness will affect an image's clarity, and you can use image filters to modify the image to get what you want from it. Similarly, you might have a blurred or 'noisy' image that needs clarification and focus. Let's use an example to see how to do image filtering in OpenCV. This image filtering tutorial is an extract from Practical Computer Vision.

Here's an example with considerable salt and pepper noise. This occurs when there is a disturbance in the quality of the signal that's used to generate the image. The image above can be easily generated using OpenCV as follows:

```python
import cv2
import numpy as np

# initialize noise image with zeros
noise = np.zeros((400, 600))
# fill the image with random numbers in given range
cv2.randu(noise, 0, 256)
```

Let's add weighted noise to a grayscale image (on the left) so the resulting image will look like the one on the right. The code for this is as follows:

```python
# add noise to existing image
# (np.int was removed from recent NumPy releases; the built-in int is used instead)
noisy_gray = gray + np.array(0.2*noise, dtype=int)
```

Here, 0.2 is the noise weight; increase or decrease it to create noise of different intensity. In several applications, noise plays an important role in improving a system's capabilities. This is particularly true when you're using deep learning models. Noise becomes a way of testing the precision of the deep learning application, and it can be built into the computer vision algorithm itself.

Linear image filtering

The simplest filter is a point operator. Each pixel value is multiplied by a scalar value.
This operation can be written as follows:

$$g(i,j) = K \times f(i,j)$$

Here:

- The input image is F and the value of the pixel at (i,j) is denoted as f(i,j)
- The output image is G and the value of the pixel at (i,j) is denoted as g(i,j)
- K is a scalar constant

This type of operation on an image is what is known as a linear filter. In addition to multiplication by a scalar value, each pixel can also be increased or decreased by a constant value L. So the overall point operation can be written like this:

$$g(i,j) = K \times f(i,j) + L$$

This operation can be applied both to grayscale images and RGB images. For RGB images, each channel is modified with this operation separately. The following is the result of varying both K and L. The first image, on the left, is the input. In the second image, K=0.5 and L=0.0, while in the third image, K is set to 1.0 and L is 10. For the final image on the right, K=0.7 and L=25. As you can see, varying K (the gain) scales the pixel values and mainly changes the contrast of the image, while varying L (the offset) shifts all values and changes its brightness.

This image can be generated with the following code:

```python
import numpy as np
import matplotlib.pyplot as plt
import cv2


def point_operation(img, K, L):
    """
    Applies point operation to given grayscale image
    """
    # np.float and np.int were removed from recent NumPy releases
    img = np.asarray(img, dtype=np.float64)
    img = img*K + L
    # clip pixel values
    img[img > 255] = 255
    img[img < 0] = 0
    return np.asarray(img, dtype=int)


def main():
    # read an image
    img = cv2.imread('../figures/flower.png')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # k = 0.5, l = 0
    out1 = point_operation(gray, 0.5, 0)

    # k = 1., l = 10
    out2 = point_operation(gray, 1., 10)

    # k = 0.7, l = 25
    out3 = point_operation(gray, 0.7, 25)

    res = np.hstack([gray, out1, out2, out3])
    plt.imshow(res, cmap='gray')
    plt.axis('off')
    plt.show()


if __name__ == '__main__':
    main()
```

2D linear image filtering

While the preceding filter is a point-based filter, image pixels have information around the pixel as well. In the previous image of the flower, the pixel values in the petal are all yellow.
If we choose a pixel of the petal and move around it, the values will be quite close. This gives some more information about the image. To extract this information in filtering, there are several neighborhood filters.

In neighborhood filters, there is a kernel matrix which captures local region information around a pixel. To explain these filters, let's start with an input image, as follows:

This is a simple binary image of the number 2. To get certain information from this image, we can directly use all the pixel values. But instead, to simplify, we can apply filters to it. We define a matrix smaller than the given image which operates in the neighborhood of a target pixel. This matrix is termed a kernel; an example is given as follows:

The operation is defined first by superimposing the kernel matrix on the original image, then taking the product of the corresponding pixels and returning a summation of all the products. In the following figure, the lower 3 x 3 area in the original image is superimposed with the given kernel matrix, and the corresponding pixel values from the kernel and image are multiplied. The result is shown on the right and is the summation of all the previous pixel products:

This operation is repeated by sliding the kernel along the image rows and then the image columns. This can be implemented as in the following code. We will see the effects of applying this to an image in the coming sections.

```python
# design a kernel matrix, here is uniform 5x5
kernel = np.ones((5,5),np.float32)/25
# apply on the input image, here grayscale input
dst = cv2.filter2D(gray,-1,kernel)
```

However, as seen previously, pixels at the corners and edges are a problem: when the kernel is centered on them, part of it falls outside the image region. Skipping those positions results in a smaller image, and filling them naively causes a black region, or holes, along the boundary of the image.
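The sliding-window operation and the boundary problem can be sketched in plain NumPy. This is an illustrative implementation, not OpenCV's: the function name `correlate2d` and its `pad_mode` argument are assumptions made for this example, and the padding itself is delegated to `np.pad`.

```python
import numpy as np


def correlate2d(image, kernel, pad_mode=None):
    """Slide `kernel` over `image`; at each position multiply and sum.

    With pad_mode=None only positions where the kernel fits entirely inside
    the image are computed, so the output is smaller than the input.
    Otherwise the image is first padded with np.pad(mode=pad_mode) and the
    output keeps the input size (for odd kernel sizes).
    """
    image = np.asarray(image, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    kh, kw = kernel.shape
    if pad_mode is not None:
        image = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode=pad_mode)
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


image = np.arange(25, dtype=np.float64).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0  # uniform averaging kernel

valid = correlate2d(image, kernel)                    # shape (3, 3): image shrinks
same_zero = correlate2d(image, kernel, "constant")    # shape (5, 5): zero border
same_mirror = correlate2d(image, kernel, "reflect")   # shape (5, 5): mirrored border
```

Without padding the 5 x 5 input shrinks to 3 x 3. With zero padding the size is preserved, but the border values are pulled toward black, which is the darkened boundary described above.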
To rectify this, there are some common techniques:

- Padding the border with a constant value, such as 0 or 255 (cv2.BORDER_CONSTANT in OpenCV)
- Mirroring the pixels along the edge into the external area (a mirrored border, cv2.BORDER_REFLECT_101, is what cv2.filter2D uses by default)
- Creating a pattern of pixels around the image

The choice among these depends on the task at hand. In common cases, padding will generate satisfactory results.

The effect of the kernel is most crucial, as changing its values changes the output significantly. We will first look at simple kernel-based filters and see how changing the kernel size affects the output.

Box filtering

This filter averages the pixel values in the neighborhood, so every entry of the kernel matrix is the same. For example, a 3 x 3 box kernel is:

$$\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

Applying this filter results in blurring the image. The results are shown as follows:

In frequency domain analysis of the image, this filter is a low-pass filter. The frequency domain analysis is done using the Fourier transform of the image, which is beyond the scope of this introduction.

As we increase the size of the kernel, the resulting image gets more and more blurred. This is due to the averaging out of peak values in the small neighbourhood where the kernel is applied. The result of applying a kernel of size 20x20 can be seen in the following image. However, if we use a very small filter of size (3,3), there is a negligible effect on the output, because the kernel is quite small compared to the photo size.
In most applications, kernel size is heuristically set according to image size:

The complete code to generate box filtered photos is as follows:

```python
import cv2
import matplotlib.pyplot as plt


def plot_cv_img(input_image, output_image):
    """
    Converts an image from BGR to RGB and plots
    """
    fig, ax = plt.subplots(nrows=1, ncols=2)

    ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB))
    ax[0].set_title('Input Image')
    ax[0].axis('off')

    ax[1].imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB))
    ax[1].set_title('Box Filter (5,5)')
    ax[1].axis('off')

    plt.show()


def main():
    # read an image
    img = cv2.imread('../figures/flower.png')

    # To try different kernel, change size here.
    kernel_size = (5,5)

    # opencv has implementation for kernel based box blurring
    blur = cv2.blur(img,kernel_size)

    # Do plot
    plot_cv_img(img, blur)


if __name__ == '__main__':
    main()
```

Properties of linear filters

Several computer vision applications are composed of step-by-step transformations of an input photo into an output. This is easily done thanks to several properties of a common type of filter, the linear filter (here * denotes applying one filter to another, that is, convolution):

- Linear filters are commutative: two filters can be combined in either order and the result remains the same: a * b = b * a
- They are associative, which means that the way successive filters are grouped does not affect the outcome: (a * b) * c = a * (b * c)
- They are distributive over addition. When summing two filters, we can sum first and then apply the filter, or apply the filter to each individually and then sum the results. The overall outcome remains the same: a * (b + c) = (a * b) + (a * c)
- Applying a scaling factor to one filter and then combining it with another filter is equivalent to first combining both filters and then applying the scaling factor

These properties play a significant role in other computer vision tasks such as object detection and segmentation. A suitable combination of these filters enhances the quality of information extraction and, as a result, improves accuracy.
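These properties are easy to check numerically. The following sketch uses one-dimensional filters and `np.convolve` (full convolution) purely for brevity; the arrays `signal`, `a`, `b` and the factor `k` are arbitrary values chosen for illustration.

```python
import numpy as np

signal = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
a = np.array([1.0, 2.0, 1.0]) / 4.0   # small smoothing filter
b = np.array([-1.0, 0.0, 1.0])        # small derivative filter
k = 2.5                               # arbitrary scaling factor

# a * b == b * a
commutative = np.allclose(np.convolve(a, b), np.convolve(b, a))

# (signal * a) * b == signal * (a * b)
associative = np.allclose(
    np.convolve(np.convolve(signal, a), b),
    np.convolve(signal, np.convolve(a, b)),
)

# signal * (a + b) == signal * a + signal * b
distributive = np.allclose(
    np.convolve(signal, a + b),
    np.convolve(signal, a) + np.convolve(signal, b),
)

# (k a) * b == k (a * b)
scaling = np.allclose(np.convolve(k * a, b), k * np.convolve(a, b))

print(commutative, associative, distributive, scaling)
```

Associativity is the practically useful one: two small kernels can be merged into a single kernel once and then applied to every image, instead of filtering each image twice.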
Non-linear image filtering

While in many cases linear filters are sufficient to get the required results, in several other use cases performance can be significantly increased by using non-linear image filtering. Non-linear image filtering is more complex than linear filtering. This complexity can, however, give you more control and better results in your computer vision tasks. Let's take a look at how non-linear image filtering works when applied to different images.

Smoothing a photo

Applying a box filter with hard edges doesn't result in a smooth blur on the output photo. To improve this, the filter can be made smoother around the edges. One popular filter of this kind is the Gaussian filter. Strictly speaking, it is still a linear filter (a weighted average), but it emphasizes the center pixel and gradually reduces the weight of pixels as they get farther from the center. Mathematically, a Gaussian function is given as:

$$g(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where μ is the mean and σ is the standard deviation (σ² is the variance). An example kernel matrix for this kind of filter in the 2D discrete domain is given as follows:

This 2D array is used in normalized form. The effect of the filter also depends on its width: changing the kernel width changes the output, as discussed in a later section.
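As a rough sketch of how such a kernel can be built, the snippet below samples a zero-mean Gaussian on a small integer grid and normalizes it. The helper name `gaussian_kernel` is an assumption for this example, `sigma` is the width (standard deviation) of the Gaussian, and the values are not claimed to match the kernel OpenCV generates internally.

```python
import numpy as np


def gaussian_kernel(size, sigma):
    """Return a normalized size x size Gaussian kernel (size should be odd)."""
    half = size // 2
    x = np.arange(-half, half + 1, dtype=np.float64)
    g = np.exp(-(x ** 2) / (2.0 * sigma ** 2))   # 1D Gaussian centered at 0
    kernel = np.outer(g, g)                      # 2D kernel as an outer product
    return kernel / kernel.sum()                 # normalize: weights sum to 1


kernel = gaussian_kernel(5, sigma=1.0)
print(np.round(kernel, 4))
```

The constant factor in front of the Gaussian is dropped because the final normalization makes it irrelevant. The center weight is the largest and the weights fall off symmetrically, which is what gives a softer blur than the box filter.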
Applying a Gaussian kernel as a filter removes high-frequency components, which removes strong edges and hence gives a blurred photo:

While this filter performs better blurring than a box filter, the implementation is also quite simple with OpenCV:

```python
import cv2
import matplotlib.pyplot as plt


def plot_cv_img(input_image, output_image):
    """
    Converts an image from BGR to RGB and plots
    """
    fig, ax = plt.subplots(nrows=1, ncols=2)

    ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB))
    ax[0].set_title('Input Image')
    ax[0].axis('off')

    ax[1].imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB))
    ax[1].set_title('Gaussian Blurred')
    ax[1].axis('off')

    plt.show()


def main():
    # read an image
    img = cv2.imread('../figures/flower.png')

    # apply gaussian blur,
    # kernel of size 5x5,
    # change here for other sizes
    kernel_size = (5,5)
    # sigma values are same in both direction
    blur = cv2.GaussianBlur(img,kernel_size,0)

    plot_cv_img(img, blur)


if __name__ == '__main__':
    main()
```

The histogram equalization technique

The basic point operations that change brightness and contrast help to improve photo quality but require manual tuning. With the histogram equalization technique, suitable values can be found algorithmically to create a better-looking photo. Intuitively, this method tries to set the brightest pixels to white and the darkest pixels to black. The remaining pixel values are rescaled accordingly. This rescaling is performed by transforming the original intensity distribution so that it spans the full range of intensities. An example of this equalization is as follows:

The preceding image is an example of histogram equalization. On the right is the output and, as you can see, the contrast is increased significantly. The input histogram is shown in the bottom figure on the left, and it can be observed that not all intensity values are present in the image. After applying equalization, the resulting histogram plot is as shown in the bottom right figure.
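The rescaling can be sketched with NumPy using the textbook cumulative-histogram mapping. This is an illustration under assumptions: the helper name `equalize_hist` is hypothetical, the input is assumed to be an 8-bit grayscale array, and the output is not claimed to match `cv2.equalizeHist` value for value.

```python
import numpy as np


def equalize_hist(gray):
    """Histogram-equalize an 8-bit grayscale image via its cumulative histogram."""
    gray = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]      # CDF value at the darkest intensity present
    total = gray.size
    if total == cdf_min:           # constant image: nothing to stretch
        return gray.copy()
    # Look-up table that makes the output CDF approximately linear.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]


# Low-contrast test image: intensities squeezed into the range 100..131.
low_contrast = (100 + (np.arange(64 * 64) % 32)).reshape(64, 64).astype(np.uint8)
equalized = equalize_hist(low_contrast)

print(low_contrast.min(), low_contrast.max())   # narrow input range
print(equalized.min(), equalized.max())         # stretched output range
```

The darkest intensity present is mapped to 0 and the brightest to 255, with the values in between spread out according to how many pixels share them.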
To visualize the results of equalization in the image, the input and results are stacked together in the following figure.

Code for the preceding photos is as follows:

```python
import cv2
import matplotlib.pyplot as plt


def plot_gray(input_image, output_image):
    """
    Plots the grayscale input and output images side by side
    """
    fig, ax = plt.subplots(nrows=1, ncols=2)

    ax[0].imshow(input_image, cmap='gray')
    ax[0].set_title('Input Image')
    ax[0].axis('off')

    ax[1].imshow(output_image, cmap='gray')
    ax[1].set_title('Histogram Equalized ')
    ax[1].axis('off')

    plt.savefig('../figures/03_histogram_equalized.png')

    plt.show()


def main():
    # read an image
    img = cv2.imread('../figures/flower.png')

    # grayscale image is used for equalization
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # following function performs equalization on input image
    equ = cv2.equalizeHist(gray)

    # for visualizing input and output side by side
    plot_gray(gray, equ)


if __name__ == '__main__':
    main()
```

Median image filtering

Median image filtering is a similar technique to neighborhood filtering. The key idea here, of course, is the use of a median value. As such, the filter is non-linear. It is quite useful for removing sharp noise such as salt and pepper noise.

Instead of using a product or sum of neighborhood pixel values, this filter computes the median value of the region. This removes random peak values in the region, which can be due to noise such as salt and pepper noise. This is further shown in the following figure, where different kernel sizes are used to create the output.
In this image, channel-wise random noise is first added to the input:

```python
# read the image
flower = cv2.imread('../figures/flower.png')

# initialize noise image with zeros
noise = np.zeros(flower.shape[:2])
# fill the image with random numbers in given range
cv2.randu(noise, 0, 256)

# add noise to existing image, apply channel wise
noise_factor = 0.1
noisy_flower = np.zeros(flower.shape)
for i in range(flower.shape[2]):
    noisy_flower[:,:,i] = flower[:,:,i] + np.array(noise_factor*noise, dtype=int)

# clip to the valid range, then convert data type for use
# (without clipping, values above 255 would wrap around in uint8)
noisy_flower = np.asarray(np.clip(noisy_flower, 0, 255), dtype=np.uint8)
```

The created noisy image is then used for median image filtering:

```python
# apply median filter of kernel size 5
kernel_5 = 5
median_5 = cv2.medianBlur(noisy_flower,kernel_5)

# apply median filter of kernel size 3
kernel_3 = 3
median_3 = cv2.medianBlur(noisy_flower,kernel_3)
```

In the following photo, you can see the result of varying the kernel size (indicated in brackets). The rightmost photo is the smoothest of them all:

A common application of median blur is in smartphone apps that filter the input image and add artifacts for artistic effect.
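To see why the median is so effective against isolated outliers, here is a small NumPy sketch. The helper `median_filter` is a hypothetical, unoptimized stand-in for `cv2.medianBlur`, and replicating the border pixels with `mode="edge"` is an assumption made for this example.

```python
import numpy as np


def median_filter(gray, ksize=3):
    """Replace each pixel by the median of its ksize x ksize neighborhood."""
    gray = np.asarray(gray)
    half = ksize // 2
    padded = np.pad(gray, half, mode="edge")   # repeat border pixels
    out = np.empty_like(gray)
    for i in range(gray.shape[0]):
        for j in range(gray.shape[1]):
            out[i, j] = np.median(padded[i:i + ksize, j:j + ksize])
    return out


# Flat gray image with a few isolated salt (255) and pepper (0) pixels.
image = np.full((9, 9), 120, dtype=np.uint8)
image[2, 3] = 255
image[5, 5] = 0
image[7, 1] = 255

cleaned = median_filter(image, ksize=3)
print(np.unique(cleaned))
```

Every 3 x 3 window here contains at most one outlier among nine values, so the median is always the background value and the noise disappears completely. An averaging filter would instead smear each outlier over its neighbors.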
The code to generate the preceding photograph is as follows:

```python
import cv2
import matplotlib.pyplot as plt


def plot_cv_img(input_image, output_image1, output_image2, output_image3):
    """
    Converts an image from BGR to RGB and plots
    """
    fig, ax = plt.subplots(nrows=1, ncols=4)

    ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB))
    ax[0].set_title('Input Image')
    ax[0].axis('off')

    ax[1].imshow(cv2.cvtColor(output_image1, cv2.COLOR_BGR2RGB))
    ax[1].set_title('Median Filter (3,3)')
    ax[1].axis('off')

    ax[2].imshow(cv2.cvtColor(output_image2, cv2.COLOR_BGR2RGB))
    ax[2].set_title('Median Filter (5,5)')
    ax[2].axis('off')

    ax[3].imshow(cv2.cvtColor(output_image3, cv2.COLOR_BGR2RGB))
    ax[3].set_title('Median Filter (7,7)')
    ax[3].axis('off')

    plt.show()


def main():
    # read an image
    img = cv2.imread('../figures/flower.png')

    # compute median filtered image varying kernel size
    median1 = cv2.medianBlur(img,3)
    median2 = cv2.medianBlur(img,5)
    median3 = cv2.medianBlur(img,7)

    # Do plot
    plot_cv_img(img, median1, median2, median3)


if __name__ == '__main__':
    main()
```

Image filtering and image gradients

Image gradients act as detectors of edges, or sharp changes, in a photograph. They are widely used in object detection and segmentation tasks. In this section, we will look at how to compute image gradients. The image derivative is computed by applying a kernel matrix that measures the change in a given direction.

The Sobel filter is one such filter. Its kernel in the x-direction is given as follows:

$$\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$$

Here, in the y-direction:

$$\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

This is applied in a similar fashion to the linear box filter, by computing values on a kernel superimposed on the photo. The filter is then shifted along the image to compute all values. Following are some example results, where X and Y denote the direction of the Sobel kernel:

This is also termed an image derivative with respect to the given direction (here X or Y). In the resulting photographs (middle and right), lighter regions denote positive gradients, darker regions denote negative gradients, and gray is zero.
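A tiny synthetic example shows what a directional derivative kernel responds to. The sketch below applies the standard 3 x 3 Sobel kernels to a vertical step edge using a plain NumPy loop; the helper `apply_kernel` is a hypothetical name, and it computes only the region where the kernel fits fully inside the image.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def apply_kernel(image, kernel):
    """Correlate a 3x3 kernel with the image (valid region only, no padding)."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out


# Vertical step edge: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 100.0

grad_x = apply_kernel(image, SOBEL_X)   # responds to the dark-to-bright step
grad_y = apply_kernel(image, SOBEL_Y)   # no change along y, so all zeros

print(grad_x)
print(grad_y)
```

The x-gradient is positive only in the columns whose window straddles the step and zero in the flat regions, while the y-gradient is zero everywhere because the image does not change from row to row.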
While Sobel filters correspond to first-order derivatives of a photo, the Laplacian filter gives a second-order derivative of a photo. The Laplacian filter is applied in a similar way to Sobel:

The code to get Sobel and Laplacian filters is as follows:

```python
# sobel
x_sobel = cv2.Sobel(img,cv2.CV_64F,1,0,ksize=5)
y_sobel = cv2.Sobel(img,cv2.CV_64F,0,1,ksize=5)

# laplacian
lapl = cv2.Laplacian(img,cv2.CV_64F, ksize=5)

# gaussian blur
blur = cv2.GaussianBlur(img,(5,5),0)
# laplacian of gaussian
log = cv2.Laplacian(blur,cv2.CV_64F, ksize=5)
```

We learnt about the types of filters and how to perform image filtering in OpenCV. To know more about image transformation and 3D computer vision, check out the book Practical Computer Vision.


Packt Editorial Staff

07 May 2018

13 min read

Implementing 3 Naive Bayes classifiers in scikit-learn


Scikit-learn provides three naive Bayes implementations: Bernoulli, multinomial, and Gaussian. The only difference between them is the probability distribution adopted. The first one is a binary algorithm, particularly useful when a feature can be present or absent. Multinomial naive Bayes assumes a feature vector where each element represents the number of times a term appears (or, very often, its frequency). This technique is very efficient in natural language processing or whenever the samples are composed starting from a common dictionary. Gaussian naive Bayes, instead, is based on a continuous distribution and is suitable for more generic classification tasks.

Now that we have established that the naive Bayes variants are a handy set of algorithms to have in our machine learning arsenal, and that scikit-learn is a good tool to implement them, let's rewind a bit.

What is Naive Bayes?

Naive Bayes classifiers are a family of powerful and easy-to-train classifiers that determine the probability of an outcome, given a set of conditions, using Bayes' theorem. In other words, the conditional probabilities are inverted so that the query can be expressed as a function of measurable quantities. The approach is simple, and the adjective naive has been attributed not because these algorithms are limited or less efficient, but because of a fundamental assumption about the causal factors that we will discuss.

Naive Bayes classifiers are multi-purpose, and it's easy to find their application in many different contexts. However, their performance is particularly good in all those situations where the probability of a class is determined by the probabilities of some causal factors. A good example is natural language processing, where a text can be considered as a particular instance of a dictionary and the relative frequencies of all terms provide enough information to infer the class it belongs to.
Our examples are deliberately generic, to help you understand how naive Bayes applies in various contexts.

The Bayes' theorem

Let's consider two probabilistic events A and B. We can correlate the marginal probabilities P(A) and P(B) with the conditional probabilities P(A|B) and P(B|A) using the product rule:

$$P(A \cap B) = P(A|B)\,P(B)$$
$$P(B \cap A) = P(B|A)\,P(A)$$

Considering that the intersection is commutative, the first members are equal, so we can derive Bayes' theorem:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

This formula has very deep philosophical implications and is a fundamental element of statistical learning. First of all, let's consider the marginal probability P(A): this is normally a value that determines how probable a target event is, like P(Spam) or P(Rain). As there are no other elements, this kind of probability is called a priori, because it's often determined by mathematical considerations or simply by a frequency count. For example, imagine we want to implement a very simple spam filter and we've collected 100 emails. We know that 30 are spam and 70 are regular. So we can say that P(Spam) = 0.3.

However, we'd like to evaluate using some criteria (for simplicity, let's consider a single one), for example, that the email text is shorter than 50 characters. Therefore, our query becomes:

$$P(\text{Spam} \mid \text{Text} < 50\ \text{chars}) = \frac{P(\text{Text} < 50\ \text{chars} \mid \text{Spam})\,P(\text{Spam})}{P(\text{Text} < 50\ \text{chars})}$$

The first term is similar to P(Spam) because it's the probability of spam given a certain condition. For this reason, it's called a posteriori (in other words, it's a probability that we can estimate after knowing some additional elements). On the right side, we need to calculate the missing values, but it's simple. Let's suppose that 35 emails have a text shorter than 50 characters, so P(Text < 50 chars) = 0.35, and, looking only into our spam folder, we discover that 25 spam emails have a short text, so that P(Text < 50 chars|Spam) = 25/30 = 0.83. The result is:

$$P(\text{Spam} \mid \text{Text} < 50\ \text{chars}) = \frac{0.83 \cdot 0.3}{0.35} \approx 0.71$$

So, after receiving a very short email, there is a 71% probability that it's spam.
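The same arithmetic can be reproduced in a few lines of Python. This is only a worked check of the numbers quoted above; the variable names are chosen for this example.

```python
total_emails = 100
spam_emails = 30
short_emails = 35          # emails with text shorter than 50 characters
short_spam_emails = 25     # spam emails with text shorter than 50 characters

p_spam = spam_emails / total_emails                    # prior, P(Spam)
p_short = short_emails / total_emails                  # evidence, P(Text < 50 chars)
p_short_given_spam = short_spam_emails / spam_emails   # likelihood

# Bayes' theorem: posterior = likelihood * prior / evidence
p_spam_given_short = p_short_given_spam * p_spam / p_short

print(f"P(Spam | Text < 50 chars) = {p_spam_given_short:.3f}")
```

Note that the result equals 25/35: of the 35 short emails, 25 are spam. Bayes' theorem reaches the same number using only the quantities we could measure directly.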
Now we can understand the role of P(Text < 50 chars|Spam): as we have actual data, we can measure how probable our hypothesis is given the query. In other words, we have defined a likelihood (compare this with logistic regression), which is a weight between the a priori probability and the a posteriori one (the term in the denominator is less important because it works as a normalizing factor):

$$\underbrace{P(A|B)}_{\text{a posteriori}} = \frac{\overbrace{P(B|A)}^{\text{likelihood}}\ \overbrace{P(A)}^{\text{a priori}}}{P(B)}$$

The normalization factor is often represented by the Greek letter alpha, so the formula becomes:

$$P(A|B) = \alpha\,P(B|A)\,P(A)$$

The last step is considering the case when there are more concurrent conditions (which is more realistic in real-life problems):

$$P(A \mid C_1 \cap C_2 \cap \dots \cap C_n) = \alpha\,P(C_1 \cap C_2 \cap \dots \cap C_n \mid A)\,P(A)$$

A common assumption is called conditional independence (in other words, the effects produced by every cause are independent of each other), and it allows us to write a simplified expression:

$$P(A \mid C_1 \cap C_2 \cap \dots \cap C_n) = \alpha\,P(C_1|A)\,P(C_2|A) \cdots P(C_n|A)\,P(A)$$

Naive Bayes classifiers

A naive Bayes classifier is so called because it's based on a naive condition, which implies the conditional independence of causes. This can seem very difficult to accept in many contexts where the probability of a particular feature is strictly correlated with another one. For example, in spam filtering, a text shorter than 50 characters can increase the probability of the presence of an image, or, if the domain has already been blacklisted for sending the same spam emails to millions of users, it's likely that particular keywords will be found. In other words, the presence of a cause isn't normally independent of the presence of other ones. However, in Zhang H., The Optimality of Naive Bayes, AAAI 1, no. 2 (2004): 3, the author showed that under particular conditions (not so rare to happen), different dependencies cancel one another out, and a naive Bayes classifier succeeds in achieving very high performance even if its naiveness is violated.

Let's consider a dataset:

$$X = \{\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n\} \quad \text{where } \bar{x}_i \in \mathbb{R}^m$$

Every feature vector, for simplicity, will be represented as:

$$\bar{x} = [x_1, x_2, \dots, x_m]$$

We also need a target dataset:

$$Y = \{y_1, y_2, \dots, y_n\}$$

where each y can belong to one of P different classes.
Considering Bayes' theorem under conditional independence, we can write:

$$P(y \mid x_1, x_2, \dots, x_m) = \alpha\,P(y)\prod_{i=1}^{m} P(x_i|y)$$

The values of the marginal a priori probability P(y) and of the conditional probabilities P(xi|y) are obtained through a frequency count; therefore, given an input vector x, the predicted class is the one whose a posteriori probability is maximum.

Naive Bayes in scikit-learn

scikit-learn implements naive Bayes variants based on different probabilistic distributions; the three covered here are Bernoulli, multinomial, and Gaussian. The first one is a binary distribution, useful when a feature can be present or absent. The second one is a discrete distribution, used whenever a feature must be represented by a whole number (for example, in natural language processing, it can be the frequency of a term), while the third is a continuous distribution characterized by its mean and variance.

Bernoulli naive Bayes

If X is a Bernoulli-distributed random variable, it can assume only two values (for simplicity, let's call them 0 and 1) and their probability is:

$$P(X) = \begin{cases} p & \text{if } X = 1 \\ q & \text{if } X = 0 \end{cases} \quad \text{where } q = 1 - p \text{ and } 0 < p < 1$$

To try this algorithm with scikit-learn, we're going to generate a dummy dataset. Bernoulli naive Bayes expects binary feature vectors; however, the class BernoulliNB has a binarize parameter, which allows specifying a threshold that will be used internally to transform the features:

from sklearn.datasets import make_classification
>>> nb_samples = 300
>>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)

We have generated the bidimensional dataset shown in the following figure:

We have decided to use 0.0 as a binary threshold, so each point can be characterized by the quadrant where it's located. Of course, this is a rational choice for our dataset, but Bernoulli naive Bayes is designed for binary feature vectors or continuous values which can be precisely split with a predefined threshold.
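The effect of a 0.0 threshold can be illustrated with plain NumPy. This is only a sketch of the idea, not scikit-learn's internal code: the sample points are hypothetical (one per quadrant), and the strict "greater than" comparison is assumed to match how the binarize threshold is applied.

```python
import numpy as np

# Hypothetical sample points, one per quadrant (not taken from the generated dataset).
points = np.array([[-1.5, -0.3],
                   [-0.7, 2.1],
                   [0.4, -1.2],
                   [1.8, 0.9]])

threshold = 0.0

# Values strictly greater than the threshold map to 1, all others to 0.
binarized = (points > threshold).astype(int)
print(binarized)
```

Each row collapses to one of the four binary patterns [0, 0], [0, 1], [1, 0], [1, 1], which is why every point is characterized only by its quadrant.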
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
>>> bnb = BernoulliNB(binarize=0.0)
>>> bnb.fit(X_train, Y_train)
>>> bnb.score(X_test, Y_test)
0.85333333333333339

The score is rather good, but if we want to understand how the binary classifier worked, it's useful to see how the data have been internally binarized:

Now, checking the naive Bayes predictions, we obtain:

import numpy as np
>>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> bnb.predict(data)
array([0, 0, 1, 1])

This is exactly what we expected.

Multinomial naive Bayes

A multinomial distribution is useful to model feature vectors where each value represents, for example, the number of occurrences of a term or its relative frequency. If the feature vectors have n elements and each of them can assume k different values with probability pk, then:

$$P(X_1 = x_1 \cap X_2 = x_2 \cap \dots \cap X_k = x_k) = \frac{n!}{\prod_i x_i!} \prod_i p_i^{x_i}$$

The conditional probabilities P(xi|y) are computed with a frequency count (which corresponds to applying a maximum likelihood approach), but in this case, it's important to consider the alpha parameter (called the Laplace smoothing factor), whose default value is 1.0 and which prevents the model from setting null probabilities when the frequency is zero. It's possible to assign any non-negative value; however, larger values will assign higher probabilities to the missing features, and this choice could alter the stability of the model. In our example, we're going to consider the default value of 1.0.

For our purposes, we're going to use the DictVectorizer. There are automatic instruments to compute the frequencies of terms, but we're going to discuss them later. Let's consider only two records: the first one representing a city, and the second one the countryside.
Our dictionary contains hypothetical frequencies, as if the terms were extracted from a text description:

import numpy as np
from sklearn.feature_extraction import DictVectorizer
>>> data = [
    {'house': 100, 'street': 50, 'shop': 25, 'car': 100, 'tree': 20},
    {'house': 5, 'street': 5, 'shop': 0, 'car': 10, 'tree': 500, 'river': 1}
]
>>> dv = DictVectorizer(sparse=False)
>>> X = dv.fit_transform(data)
>>> Y = np.array([1, 0])
>>> X
array([[ 100., 100., 0., 25., 50., 20.],
       [ 10., 5., 1., 0., 5., 500.]])

Note that the term 'river' is missing from the first set, so it's useful to keep alpha equal to 1.0 to give it a small probability. The output classes are 1 for the city and 0 for the countryside. Now we can train a MultinomialNB instance:

from sklearn.naive_bayes import MultinomialNB
>>> mnb = MultinomialNB()
>>> mnb.fit(X, Y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

To test the model, we create a dummy city with a river and a dummy country place without any river. The already fitted vectorizer is reused through transform, so that the test features are mapped to the same columns as the training data:

>>> test_data = [
    {'house': 80, 'street': 20, 'shop': 15, 'car': 70, 'tree': 10, 'river': 1},
    {'house': 10, 'street': 5, 'shop': 1, 'car': 8, 'tree': 300, 'river': 0}
]
>>> mnb.predict(dv.transform(test_data))
array([1, 0])

As expected, the prediction is correct. Later on, when discussing some elements of natural language processing, we're going to use multinomial naive Bayes for text classification with larger corpora. Even if the multinomial distribution is based on the number of occurrences, it can be successfully used with frequencies or more complex functions.

Gaussian Naive Bayes

Gaussian Naive Bayes is useful when working with continuous values whose probabilities can be modeled using a Gaussian distribution:

$$P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

The conditional probabilities P(xi|y) are also Gaussian distributed and, therefore, it's necessary to estimate the mean and variance of each of them using the maximum likelihood approach.
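For a Gaussian, the maximum likelihood estimates are the sample mean and the (biased) sample variance, so the estimation step can be sketched with NumPy as follows. The toy dataset and variable names are hypothetical, and the sketch leaves out the small variance-smoothing term that scikit-learn's GaussianNB adds for numerical stability.

```python
import numpy as np

# Hypothetical toy dataset: 6 samples, 2 continuous features, 2 classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
Y = np.array([0, 0, 0, 1, 1, 1])

params = {}
for label in np.unique(Y):
    X_c = X[Y == label]  # samples belonging to the current class
    # Maximum likelihood estimates for each feature:
    # sample mean and biased variance (NumPy's default ddof=0).
    params[int(label)] = (X_c.mean(axis=0), X_c.var(axis=0))

for label, (mean, var) in params.items():
    print(label, mean, var)
```

Each class ends up with one (mean, variance) pair per feature, which is all a Gaussian naive Bayes model needs, together with the class priors, to evaluate P(xi|y).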
The estimation is quite easy; in fact, considering the property of a Gaussian, we get:

$$L(\mu, \sigma^2; x_i|y) = \log \prod_k P(x_i^{(k)}|y) = \sum_k \log P(x_i^{(k)}|y)$$

where the k index refers to the samples in our dataset and P(xi|y) is a Gaussian itself. By minimizing the negative of this expression, that is, by maximizing the log-likelihood (in Russell S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson, there's a complete analytical explanation), we get the mean and variance for each Gaussian associated with P(xi|y), and the model is hence trained.

As an example, we compare Gaussian Naive Bayes with logistic regression using the ROC curves. The dataset has 300 samples with two features. Each sample belongs to a single class:

from sklearn.datasets import make_classification
>>> nb_samples = 300
>>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)

A plot of the dataset is shown in the following figure:

Now we can train both models and generate the ROC curves (the Y scores for naive Bayes are obtained through the predict_proba method):

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
>>> gnb = GaussianNB()
>>> gnb.fit(X_train, Y_train)
>>> Y_gnb_score = gnb.predict_proba(X_test)
>>> lr = LogisticRegression()
>>> lr.fit(X_train, Y_train)
>>> Y_lr_score = lr.decision_function(X_test)
>>> fpr_gnb, tpr_gnb, thresholds_gnb = roc_curve(Y_test, Y_gnb_score[:, 1])
>>> fpr_lr, tpr_lr, thresholds_lr = roc_curve(Y_test, Y_lr_score)

The resulting ROC curves are shown in the following figure:

Naive Bayes performs slightly better than logistic regression; however, the two classifiers have similar accuracy and Area Under the Curve (AUC).

It's interesting to compare the performance of Gaussian and multinomial naive Bayes on the handwritten digits dataset bundled with scikit-learn (a small MNIST-like dataset, loaded with load_digits).
Each sample (belonging to one of 10 classes) is an 8x8 image whose pixels are encoded as non-negative integers (in the range 0-16 for the data returned by load_digits); therefore, even if each feature doesn't represent an actual count, it can be considered a sort of magnitude or frequency.

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB
>>> digits = load_digits()
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()
>>> cross_val_score(gnb, digits.data, digits.target, scoring='accuracy', cv=10).mean()
0.81035375835678214
>>> cross_val_score(mnb, digits.data, digits.target, scoring='accuracy', cv=10).mean()
0.88193962163008377

The multinomial naive Bayes performs better than the Gaussian variant, and the result is not really surprising. In fact, each sample can be thought of as a feature vector derived from a dictionary of 64 symbols. The value can be the count of each occurrence, so a multinomial distribution can better fit the data, while a Gaussian is slightly more limited by its mean and variance.

We've presented the generic naive Bayes approach, starting from Bayes' theorem and its intrinsic philosophy. The naiveness of this algorithm is due to the choice to assume that all the causes are conditionally independent. It means that each contribution is the same in every combination and the presence of a specific cause cannot alter the probability of the other ones. This is often unrealistic; however, under some assumptions, it's possible to show that internal dependencies cancel each other out, so that the resulting probability appears unaffected by their relations.

[box type="note" align="" class="" width=""]You read an excerpt from the book, Machine Learning Algorithms, written by Giuseppe Bonaccorso. This book will help you build a strong foundation to enter the world of machine learning and data science.
You will learn to build a data model and see how it behaves using different ML algorithms, explore support vector machines, recommendation systems, and even create a machine learning architecture from scratch. Grab your copy today![/box]
