Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7019 Articles
article-image-3-different-types-of-generative-adversarial-networks-gans-and-how-they-work
Packt Editorial Staff
08 Jan 2020
6 min read
Save for later

3 different types of generative adversarial networks (GANs) and how they work

Packt Editorial Staff
08 Jan 2020
6 min read
Generative adversarial networks (GANs) have been greeted with real excitement since their creation back in 2014 by Ian Goodfellow and his research team. Yann LeCun, Facebook's Director of AI Research went as far as describing GANs as "the most interesting idea in the last 10 years in ML." With all this excitement, however, it can be easy to miss the subtle diversity of GANs; there are a number of different types of generative adversarial networks, each one working in slightly different ways and helping engineers to achieve slightly different results. To give you a deeper insight on GANs, in this article we'll look at three different generative adversarial networks: SRGANs, CycleGANs, and InfoGANs. We'll explore how these different GANs work and how they can be used. This should give you a solid foundation to explore GANs in more depth and begin to apply them in your own experiments and projects. This article is an excerpt from the book, Deep Learning with TensorFlow 2 and Keras, Second Edition by Antonio Gulli, Amita Kapoor, and Sujit Pal.  SRGAN - Super Resolution GANs Remember seeing a crime-thriller where our hero asks the computer guy to magnify the faded image of the crime scene? With the zoom we are able to see the criminal’s face in detail, including the weapon used and anything engraved upon it! Well, SRGAN can perform similar magic. Here a GAN is trained in such a way that it can generate a photorealistic high-resolution image when given a low-resolution image. The SRGAN architecture consists of three neural networks: a very deep generator network, a discriminator network, and a pretrained VGG-16 network. How do SRGANs work? SRGANs use the perceptual loss function (developed by Johnson et al, Perceptual Losses for Real-Time Style Transfer and Super-Resolution). The difference in the feature map activations in high layers of a VGG network between the network output part and the high-resolution part comprises the perceptual loss function. Besides perceptual loss, the authors further added content loss and an adversarial loss so that images generated look more natural and the finer details more artistic. The perceptual loss is defined as the weighted sum of content loss and adversarial loss: lSR = lSR X+ 10−3×lSRGen The first term on the right-hand side is the content loss, obtained using the feature maps generated by pretrained VGG 19. Mathematically it is the Euclidean distance between the feature map of the reconstructed image (that is the one generated by the generator) and the original high-resolution reference image. The second term on the right-hand side is the adversarial loss. It is the standard generative loss term, designed to ensure that images generated by the generator are able to fool the discriminator. You can see in the following figure taken from the original paper that the image generated by SRGAN is much closer to the original high-resolution image: [caption id="attachment_31006" align="aligncenter" width="907"] image via https://arxiv.org/pdf/1609.04802.pdf[/caption] CycleGAN Another noteworthy architecture is CycleGAN; proposed in 2017, it can perform the task of image translation. Once trained you can translate an image from one domain to another domain. For example, when trained on horse and zebra data set, if you give it an image with horses in the ground, the CycleGAN can convert the horses to zebra with the same background. How does CycleGAN work? Have you ever imagined how a scenery would look if Van Gogh or Manet had painted it? We have many sceneries, and many landscapes painted by Gogh/Manet, but we do not have any collection of input-output pairs. CycleGAN performs the image translation, that is, transfers an image given in one domain (scenery for example) to another domain (Van Gogh painting of the same scene, for instance) in the absence of training examples. CycleGAN’s ability to perform image translation in the absence of training pairs is what makes it unique. To achieve image translation the authors of CycleGAN used a very simple and yet effective procedure. They made use of two GANs, the generator of each GAN performing the image translation from one domain to another. To elaborate, let us say the input is X, then the generator of the first GAN performs a mapping G: X → Y, thus its output would be Y = G(X). The generator of the second GAN performs an inverse mapping F: Y → X, resulting in X = F(Y). Each discriminator is trained to distinguish between real images and synthesized images. The idea is shown as follows: To train the combined GANs, the authors added beside the conventional GAN adversarial loss a forward cycle consistency loss (left figure) and a backward cycle consistency loss (right figure). This ensures that if an image X is given as input, then after the two translations F(G(X)) ~ X the obtained image is the same X (similarly the backward cycle consistency loss ensures the G(F(Y)) ~ Y). Following are some of the successful image translations by CycleGAN: Following are few more examples, you can see the translation of seasons (summer → winter), photo → painting and vice versa, horses → zebra: InfoGAN The GAN architectures that we have considered up to now provide us with little or no control over the generated images. InfoGAN changes this; it provides control over various attributes of the images generated. The InfoGAN uses concepts from information theory such that the noise term is transformed into latent codes which provide predictable and systematic control over the output. How does InfoGAN work? The generator in InfoGAN takes two inputs the latent space Z and a latent code c, thus the output of generator is G(Z,c). The GAN is trained such that it maximizes the mutual information between the latent code c and the generated image G(Z,c). The following figure shows the architecture of InfoGAN:   The concatenated vector (Z,c) is fed to the Generator. Q(c|X) is also a neural network, combined with the generator it works to form a mapping between random noise Z and its latent code c_hat, it aims to estimate c given X. This is achieved by adding a regularization term to the objective function of conventional GAN: minDmaxG VI(D,G) = VG(D,G) −λI(c;G(Z,c)) The term VG(D,G) is the loss function of conventional GAN, and the second term is the regularization term, where λ is a constant. Its value was set to 1 in the paper, and I(c;G(Z,c)) is the mutual information between the latent code c and the Generator generated image G(Z,c). Below is the results of InfoGAN on the MNIST dataset: That concludes our brief look at three different types of generative adversarial networks. You can find the book from which this article was taken on the Packt store or you can read the first chapter for free on the Packt subscription platform.
Read more
  • 0
  • 0
  • 100369

article-image-building-trust-in-ai-the-role-of-rag-in-data-security-and-transparency
Keith Bourne
13 Dec 2024
15 min read
Save for later

Building Trust in AI: The Role of RAG in Data Security and Transparency

Keith Bourne
13 Dec 2024
15 min read
This article is an excerpt from the book, "Unlocking Data with Generative AI and RAG", by Keith Bourne. Master Retrieval-Augmented Generation (RAG), the most popular generative AI tool, to unlock the full potential of your data. This book enables you to develop highly sought-after skills as corporate investment in generative AI soars.IntroductionAs the adoption of Retrieval-Augmented Generation (RAG) continues to grow, its potential to address key security challenges in AI-driven applications is becoming evident. Far from merely introducing risks, RAG offers a robust framework to enhance data protection, ensure accuracy, and maintain transparency in content generation. This article delves into the multifaceted security benefits of RAG, while also addressing the unique challenges it poses and strategies to mitigate them.How RAG can be leveraged as a security solutionLet’s start with the most positive security aspect of RAG. RAG can actually be considered a solution to mitigate security concerns, rather than cause them. If done right, you can limit data access via user, ensure more reliable responses, and provide more transparency of sources.Limiting dataRAG applications may be a relatively new concept, but you can still apply the same authentication and database-based access approaches you can with web and similar types of applications. This provides the same level of security you can apply in these other types of applications. By implementing userbased access controls, you can restrict the data that each user or user group can retrieve through the RAG system. This ensures that sensitive information is only accessible to authorized individuals. Additionally, by leveraging secure database connections and encryption techniques, you can safeguard the data at rest and in transit, preventing unauthorized access or data breaches.Ensuring the reliability of generated contentOne of the key benefits of RAG is its ability to mitigate inaccuracies in generated content. By allowing applications to retrieve proprietary data at the point of generation, the risk of producing misleading or incorrect responses is substantially reduced. Feeding the most current data available through your RAG system helps to mitigate inaccuracies that might otherwise occur.With RAG, you have control over the data sources used for retrieval. By carefully curating and maintaining high-quality, up-to-date datasets, you can ensure that the information used to generate responses is accurate and reliable. This is particularly important in domains where precision and correctness are critical, such as healthcare, finance, or legal applications.Maintaining transparencyRAG makes it easier to provide transparency in the generated content. By incorporating data such as citations and references to the retrieved data sources, you can increase the credibility and trustworthiness of the generated responses.When a RAG system generates a response, it can include links or references to the specific data points or documents used in the generation process. This allows users to verify the information and trace it back to its original sources. By providing this level of transparency, you can build trust with your users and demonstrate the reliability of the generated content.Transparency in RAG can also help with accountability and auditing. If there are any concerns or disputes regarding the generated content, having clear citations and references makes it easier to investigate and resolve any issues. This transparency also facilitates compliance with regulatory requirements or industry standards that may require traceability of information.That covers many of the security-related benefits you can achieve with RAG. However, there are some security challenges associated with RAG as well. Let’s discuss these challenges next.RAG security challengesRAG applications face unique security challenges due to their reliance on large language models (LLMs) and external data sources. Let’s start with the black box challenge, highlighting the relative difficulty in understanding how an LLM determines its response.LLMs as black boxesWhen something is in a dark, black box with the lid closed, you cannot see what is going on in there! That is the idea behind the black box when discussing LLMs, meaning there is a lack of transparency and interpretability in how these complex AI models process input and generate output. The most popular LLMs are also some of the largest, meaning they can have more than 100 billion parameters. The intricate interconnections and weights of these parameters make it difficult to understand how the model arrives at a particular output.While the black box aspects of LLMs do not directly create a security problem, it does make it more difficult to identify solutions to problems when they occur. This makes it difficult to trust LLM outputs, which is a critical factor in most of the applications for LLMs, including RAG applications. This lack of transparency makes it more difficult to debug issues you might have in building an RAG application, which increases the risk of having more security issues.There is a lot of research and effort in the academic field to build models that are more transparent and interpretable, called explainable AI. Explainable AI aims at making the operations of A I systems transparent and understandable. It can involve tools, frameworks, and anything else that, when applied to RAG, helps us understand how the language models that we use produce the content they are generating. This is a big movement in the field, but this technology may not be immediately available as you read this. It will hopefully play a larger role in the future to help mitigate black box risk, but right now, none of the most popular LLMs are using explainable models. So, in the meantime, we will talk about other ways to address this issue.You can use human-in-the-loop, where you involve humans at different stages of the process to provide an added line of defense against unexpected outputs. This can often help to reduce the impact of the black box aspect of LLMs. If your response time is not as critical, you can also use an additional LLM to perform a review of the response before it is returned to the user, looking for issues. We will review how to add a second LLM call in code lab 5.3, but with a focus on preventing prompt attacks. But this concept is similar, in that you can add additional LLMs to do a number of extra tasks and improve the security of your application.Black box isn’t the only security issue you face when using RAG applications though; another very important topic is privacy protection.Privacy concerns and protecting user dataPersonally identifiable information (PII) is a key topic in the generative AI space, with governments a round the world trying to determine the best path to balance user privacy with the data-hungry needs of these LLMs. As this gets worked out, it is important to pay attention to the laws and regulations that are taking shape where your company is doing business and make sure all of the technologies you are integrating into your RAG applications adhere. Many companies, such as Google and Microsoft , are taking these efforts into their own hands, establishing their own standards of protection for their user data and emphasizing them in training literature for their platforms.At the corporate level, there is another challenge related to PII and sensitive information. As we have said many times, the nature of the RAG application is to give it access to the company data and combine that with the power of the LLM. For example, for financial institutions, RAG represents a way to give their customers unprecedented access to their own data in ways that allow them to speak naturally with technologies such as chatbots and get near-instant access to hard-to-find answers buried deep in their customer data.In many ways, this can be a huge benefit if implemented properly. But given that this is a security discussion, you may already see where I am going with this. We are giving unprecedented access to customer data using a technology that has artificial intelligence, and as we said previously in the black box discussion, we don’t completely understand how it works! If not implemented properly, this could be a recipe for disaster with massive negative repercussions for companies that get it wrong. Of course, it could be argued that the databases that contain the data are also a potential security risk. Having the data anywhere is a risk! But without taking on this risk, we also cannot provide the significant benefits they represent.As with other IT applications that contain sensitive data, you can forge forward, but you need to have a healthy fear of what can happen to data and proactively take measures to protect that data. The more you understand how RAG works, the better job you can do in preventing a potentially disastrous data leak. These steps can help you protect your company as well as the people who trusted your company with their data.This section was about protecting data that exists. However, a new risk that has risen with LLMs has been the generation of data that isn’t real, called hallucinations. Let’s discuss how this presents a new risk not common in the IT world.HallucinationsWe have discussed this in previous chapters, but LLMs can, at times, generate responses that sound coherent and factual but can be very wrong. These are called hallucinations and there have been many shocking examples provided in the news, especially in late 2022 and 2023, when LLMs became everyday tools for many users.Some are just funny with little consequence other than a good laugh, such as when ChatGPT was asked by a writer for The Economist, “When was the Golden Gate Bridge transported for the second time across Egypt?” ChatGPT responded, “The Golden Gate Bridge was transported for the second time across Egypt in October of 2016” (https://www.economist.com/by-invitation/2022/09/02/artificialneural-networks-today-are-not-conscious-according-to-douglashofstadter).Other hallucinations are more nefarious, such as when a New York lawyer used ChatGPT for legal research in a client’s personal injury case against Avianca Airlines, where he submitted six cases that had been completely made up by the chatbot, leading to court sanctions (https://www. courthousenews.com/sanctions-ordered-for-lawyers-who-relied-onchatgpt-artificial-intelligence-to-prepare-court-brief/). Even worse, generative AI has been known to give biased, racist, and bigoted perspectives, particularly when prompted in a manipulative way.When combined with the black box nature of these LLMs, where we are not always certain how and why a response is generated, this can be a genuine issue for companies wanting to use these LLMs in their RAG applications.From what we know though, hallucinations are primarily a result of the probabilistic nature of LLMs. For all responses that an LLM generates, it typically uses a probability distribution to determine what token it is going to provide next. In situations where it has a strong knowledge base of a certain subject, these probabilities for the next word/token can be 99% or higher. But in situations where the knowledge base is not as strong, the highest probability could be low, such as 20% or even lower. In these cases, it is still the highest probability and, therefore, that is the token that has the highest probability to be selected. The LLM has been trained on stringing tokens together in a very natural language way while using this probabilistic approach to select which tokens to display. As it strings together words with low probability, it forms sentences, and then paragraphs that sound natural and factual but are not based on high probability data. Ultimately, this results in a response that sounds very plausible but is, in fact, based on very loose facts that are incorrect.For a company, this poses a risk that goes beyond the embarrassment of your chatbot saying something wrong. What is said wrong could ruin your relationship(s) with your customer(s), or it could lead to the LLM offering your customer something that you did not intend to offer, or worse, cannot afford to offer. For example, when Microsoft released a chatbot named Tay on Twitter in 2016 with the intention of learning from interactions with Twitter users, users manipulated this spongy personality trait to get it to say numerous racist and bigoted remarks. This reflected poorly on Microsoft, which was promoting its expertise in the AI area with Tay, causing significant damage to its reputation at the time (https://www.theguardian.com/technology/2016/mar/26/microsoftdeeply-sorry-for-offensive-tweets-by-ai-chatbot).Hallucinations, threats related to black box aspects, and protecting user data can all be addressed through red teaming.ConclusionRAG represents a promising avenue for enhancing security in AI applications, offering tools to limit data access, ensure reliable outputs, and promote transparency. However, challenges such as the black box nature of LLMs, privacy concerns, and the risk of hallucinations demand proactive measures. By employing strategies like user-based access controls, explainable AI, and red teaming, organizations can harness the advantages of RAG while mitigating risks. As the technology evolves, a thoughtful approach to its implementation will be crucial for maintaining trust, compliance, and the integrity of data-driven solutions.Author BioKeith Bourne is a senior Generative AI data scientist at Johnson & Johnson. He has over a decade of experience in machine learning and AI working across diverse projects in companies that range in size from start-ups to Fortune 500 companies. With an MBA from Babson College and a master’s in applied data science from the University of Michigan, he has developed several sophisticated modular Generative AI platforms from the ground up, using numerous advanced techniques, including RAG, AI agents, and foundational model fine-tuning. Keith seeks to share his knowledge with a broader audience, aiming to demystify the complexities of RAG for organizations looking to leverage this promising technology.
Read more
  • 0
  • 0
  • 100006

article-image-4-ways-implement-feature-selection-python-machine-learning
Sugandha Lahoti
16 Feb 2018
13 min read
Save for later

4 ways to implement feature selection in Python for machine learning

Sugandha Lahoti
16 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt from Ensemble Machine Learning. This book serves as a beginner's guide to combining powerful machine learning algorithms to build optimized models.[/box] In this article, we will look at different methods to select features from the dataset; and discuss types of feature selection algorithms with their implementation in Python using the Scikit-learn (sklearn) library: Univariate selection Recursive Feature Elimination (RFE) Principle Component Analysis (PCA) Choosing important features (feature importance) We have explained first three algorithms and their implementation in short. Further we will discuss Choosing important features (feature importance) part in detail as it is widely used technique in the data science community. Univariate selection Statistical tests can be used to select those features that have the strongest relationships with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. The following example uses the chi squared (chi^2) statistical test for non-negative features to select four of the best features from the Pima Indians onset of diabetes dataset: #Feature Extraction with Univariate Statistical Tests (Chi-squared for classification) #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import SelectKBest #Import chi2 for performing chi square test from sklearn.feature_selection import chi2 #URL for loading the dataset url ="https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #We will select the features using chi square test = SelectKBest(score_func=chi2, k=4) #Fit the function for ranking the features by score fit = test.fit(X, Y) #Summarize scores numpy.set_printoptions(precision=3) print(fit.scores_) #Apply the transformation on to dataset features = fit.transform(X) #Summarize selected features print(features[0:5,:]) You can see the scores for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age. Scores for each feature: [111.52   1411.887 17.605 53.108  2175.565   127.669 5.393 181.304] Selected Features: [[148. 0. 33.6 50. ] [85. 0. 26.6 31. ] [183. 0. 23.3 32. ] [89. 94. 28.1 21. ] [137. 168. 43.1 33. ]] Recursive Feature Elimination RFE works by recursively removing attributes and building a model on attributes that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation. The following example uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter too much as long as it is skillful and consistent: #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's feature selection algorithm from sklearn.feature_selection import RFE #Import LogisticRegression for performing chi square test from sklearn.linear_model import LogisticRegression #URL for loading the dataset url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-dia betes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] #Create pandas data frame by loading the data from URL dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #Feature extraction model = LogisticRegression() rfe = RFE(model, 3) fit = rfe.fit(X, Y) print("Num Features: %d"% fit.n_features_) print("Selected Features: %s"% fit.support_) print("Feature Ranking: %s"% fit.ranking_) After execution, we will get: Num Features: 3 Selected Features: [ True False False False False   True  True False] Feature Ranking: [1 2 3 5 6 1 1 4] You can see that RFE chose the the top three features as preg, mass, and pedi. These are marked True in the support_ array and marked with a choice 1 in the ranking_ array. Principle Component Analysis PCA uses linear algebra to transform the dataset into a compressed form. Generally, it is considered a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result. In the following example, we use PCA and select three principal components: #Import the required packages #Import pandas to read csv import pandas #Import numpy for array related operations import numpy #Import sklearn's PCA algorithm from sklearn.decomposition import PCA #URL for loading the dataset url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians diabetes/pima-indians-diabetes.data" #Define the attribute names names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] dataframe = pandas.read_csv(url, names=names) #Create array from data values array = dataframe.values #Split the data into input and target X = array[:,0:8] Y = array[:,8] #Feature extraction pca = PCA(n_components=3) fit = pca.fit(X) #Summarize components print("Explained Variance: %s") % fit.explained_variance_ratio_ print(fit.components_) You can see that the transformed dataset (three principal components) bears little resemblance to the source data: Explained Variance: [ 0.88854663   0.06159078  0.02579012] [[ -2.02176587e-03    9.78115765e-02 1.60930503e-02    6.07566861e-02 9.93110844e-01          1.40108085e-02 5.37167919e-04   -3.56474430e-03] [ -2.26488861e-02   -9.72210040e-01              -1.41909330e-01  5.78614699e-02 9.46266913e-02   -4.69729766e-02               -8.16804621e-04  -1.40168181e-01 [ -2.24649003e-02 1.43428710e-01                 -9.22467192e-01  -3.07013055e-01 2.09773019e-02   -1.32444542e-01                -6.39983017e-04  -1.25454310e-01]] Choosing important features (feature importance) Feature importance is the technique used to select features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's understand it in detail. Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection—mean decrease impurity and mean decrease accuracy. A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is known as impurity. For classification, it is typically either the Gini impurity or information gain/entropy, and for regression trees, it is the variance. Thus when training a tree, it can be computed by how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset. This dataset is available for free from kaggle (you will need to sign up to kaggle to be able to download this dataset). You can download training dataset, train.csv.zip, from the https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory. This dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities for each of the 10 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy). We will start with importing all of the libraries: #Import the supporting libraries #Import pandas to load the dataset from csv file from pandas import read_csv #Import numpy for array based operations and calculations import numpy as np #Import Random Forest classifier class from sklearn from sklearn.ensemble import RandomForestClassifier #Import feature selector class select model of sklearn         from sklearn.feature_selection         import SelectFromModel          np.random.seed(1) Let's define a method to split our dataset into training and testing data; we will train our dataset on the training part and the testing part will be used for evaluation of the trained model: #Function to create Train and Test set from the original dataset def getTrainTestData(dataset,split): np.random.seed(0) training = [] testing = [] np.random.shuffle(dataset) shape = np.shape(dataset) trainlength = np.uint16(np.floor(split*shape[0])) for i in range(trainlength): training.append(dataset[i]) for i in range(trainlength,shape[0]): testing.append(dataset[i]) training = np.array(training) testing = np.array(testing) return training,testing We also need to add a function to evaluate the accuracy of the model; it will take the predicted and actual output as input to calculate the percentage accuracy: #Function to evaluate model performance def getAccuracy(pre,ytest): count = 0 for i in range(len(ytest)): if ytest[i]==pre[i]: count+=1 acc = float(count)/len(ytest) return acc This is the time to load the dataset. We will load the train.csv file; this file contains more than 61,000 training instances. We will use 50000 instances for our example, in which we will use 35,000 instances to train the classifier and 15,000 instances to test the performance of the classifier: #Load dataset as pandas data frame data = read_csv('train.csv') #Extract attribute names from the data frame feat = data.keys() feat_labels = feat.get_values() #Extract data values from the data frame dataset = data.values #Shuffle the dataset np.random.shuffle(dataset) #We will select 50000 instances to train the classifier inst = 50000 #Extract 50000 instances from the dataset dataset = dataset[0:inst,:] #Create Training and Testing data for performance evaluation train,test = getTrainTestData(dataset, 0.7) #Split data into input and output variable with selected features Xtrain = train[:,0:94] ytrain = train[:,94] shape = np.shape(Xtrain) print("Shape of the dataset ",shape) #Print the size of Data in MBs print("Size of Data set before feature selection: %.2f MB"%(Xtrain.nbytes/1e6)) Let's take note of the data size here; as our dataset contains about 35000 training instances with 94 attributes; the size of our dataset is quite large. Let's see: Shape of the dataset (35000, 94) Size of Data set before feature selection: 26.32 MB As you can see, we are having 35000 rows and 94 columns in our dataset, which is more than 26 MB data. In the next code block, we will configure our random forest classifier; we will use 250 trees with a maximum depth of 30 and the number of random features will be 7. Other hyperparameters will be the default of sklearn: #Lets select the test data for model evaluation purpose Xtest = test[:,0:94] ytest = test[:,94] #Create a random forest classifier with the following Parameters trees            = 250 max_feat     = 7 max_depth = 30 min_sample = 2 clf = RandomForestClassifier(n_estimators=trees, max_features=max_feat, max_depth=max_depth, min_samples_split= min_sample, random_state=0, n_jobs=-1) #Train the classifier and calculate the training time import time start = time.time() clf.fit(Xtrain, ytrain) end = time.time() #Lets Note down the model training time print("Execution time for building the Tree is: %f"%(float(end)- float(start))) pre = clf.predict(Xtest) Let's see how much time is required to train the model on the training dataset: Execution time for building the Tree is: 2.913641 #Evaluate the model performance for the test data acc = getAccuracy(pre, ytest) print("Accuracy of model before feature selection is %.2f"%(100*acc)) The accuracy of our model is: Accuracy of model before feature selection is 98.82 As you can see, we are getting very good accuracy as we are classifying almost 99% of the test data into the correct categories. This means we are classifying about 14,823 instances out of 15,000 in correct classes. So, now my question is: should we go for further improvement? Well, why not? We should definitely go for more improvements if we can; here, we will use feature importance to select features. As you know, in the tree building process, we use impurity measurement for node selection. The attribute value that has the lowest impurity is chosen as the node in the tree. We can use similar criteria for feature selection. We can give more importance to features that have less impurity, and this can be done using the feature_importances_ function of the sklearn library. Let's find out the importance of each feature: #Once we have trained the model we will rank all the features for feature in zip(feat_labels, clf.feature_importances_): print(feature) ('id', 0.33346650420175183) ('feat_1', 0.0036186958628801214) ('feat_2', 0.0037243050888530957) ('feat_3', 0.011579217472062748) ('feat_4', 0.010297382675187445) ('feat_5', 0.0010359139416194116) ('feat_6', 0.00038171336038056165) ('feat_7', 0.0024867672489765021) ('feat_8', 0.0096689721610546085) ('feat_9', 0.007906150362995093) ('feat_10', 0.0022342480802130366) As you can see here, each feature has a different importance based on its contribution to the final prediction. We will use these importance scores to rank our features; in the following part, we will select those features that have feature importance more than 0.01 for model training: #Select features which have higher contribution in the final prediction sfm = SelectFromModel(clf, threshold=0.01) sfm.fit(Xtrain,ytrain) Here, we will transform the input dataset according to the selected feature attributes. In the next code block, we will transform the dataset. Then, we will check the size and shape of the new dataset: #Transform input dataset Xtrain_1 = sfm.transform(Xtrain) Xtest_1      = sfm.transform(Xtest) #Let's see the size and shape of new dataset print("Size of Data set before feature selection: %.2f MB"%(Xtrain_1.nbytes/1e6)) shape = np.shape(Xtrain_1) print("Shape of the dataset ",shape) Size of Data set before feature selection: 5.60 MB Shape of the dataset (35000, 20) Do you see the shape of the dataset? We are left with only 20 features after the feature selection process, which reduces the size of the database from 26 MB to 5.60 MB. That's about 80% reduction from the original dataset. In the next code block, we will train a new random forest classifier with the same hyperparameters as earlier and test it on the testing dataset. Let's see what accuracy we get after modifying the training set: #Model training time start = time.time() clf.fit(Xtrain_1, ytrain) end = time.time() print("Execution time for building the Tree is: %f"%(float(end)- float(start))) #Let's evaluate the model on test data pre = clf.predict(Xtest_1) count = 0 acc2 = getAccuracy(pre, ytest) print("Accuracy after feature selection %.2f"%(100*acc2)) Execution time for building the Tree is: 1.711518 Accuracy after feature selection 99.97 Can you see that!! We have got 99.97 percent accuracy with the modified dataset, which means we are classifying 14,996 instances in correct classes, while previously we were classifying only 14,823 instances correctly. This is a huge improvement we have got with the feature selection process; we can summarize all the results in the following table: Evaluation criteria Before feature selection After feature selection Number of features 94 20 Size of dataset 26.32 MB 5.60 MB Training time 2.91 seconds 1.71 seconds Accuracy 98.82 percent 99.97 percent The preceding table shows the practical advantages of feature selection. You can see that we have reduced the number of features significantly, which reduces the model complexity and dimensions of the dataset. We are getting less training time after the reduction in dimensions, and at the end, we have overcome the overfitting issue, getting higher accuracy than before. To summarize the article, we explored 4 ways of feature selection in machine learning. If you found this post is useful, do check out the book Ensemble Machine Learning to know more about stacking generalization among other techniques.
Read more
  • 0
  • 4
  • 99736

Packt
11 Aug 2015
17 min read
Save for later

Divide and Conquer – Classification Using Decision Trees and Rules

Packt
11 Aug 2015
17 min read
In this article by Brett Lantz, author of the book Machine Learning with R, Second Edition, we will get a basic understanding about decision trees and rule learners, including the C5.0 decision tree algorithm. This algorithm will cover mechanisms such as choosing the best split and pruning the decision tree. While deciding between several job offers with various levels of pay and benefits, many people begin by making lists of pros and cons, and eliminate options based on simple rules. For instance, ''if I have to commute for more than an hour, I will be unhappy.'' Or, ''if I make less than $50k, I won't be able to support my family.'' In this way, the complex and difficult decision of predicting one's future happiness can be reduced to a series of simple decisions. This article covers decision trees and rule learners—two machine learning methods that also make complex decisions from sets of simple choices. These methods then present their knowledge in the form of logical structures that can be understood with no statistical knowledge. This aspect makes these models particularly useful for business strategy and process improvement. By the end of this article, you will learn: How trees and rules "greedily" partition data into interesting segments The most common decision tree and classification rule learners, including the C5.0, 1R, and RIPPER algorithms We will begin by examining decision trees, followed by a look at classification rules. (For more resources related to this topic, see here.) Understanding decision trees Decision tree learners are powerful classifiers, which utilize a tree structure to model the relationships among the features and the potential outcomes. As illustrated in the following figure, this structure earned its name due to the fact that it mirrors how a literal tree begins at a wide trunk, which if followed upward, splits into narrower and narrower branches. In much the same way, a decision tree classifier uses a structure of branching decisions, which channel examples into a final predicted class value. To better understand how this works in practice, let's consider the following tree, which predicts whether a job offer should be accepted. A job offer to be considered begins at the root node, where it is then passed through decision nodes that require choices to be made based on the attributes of the job. These choices split the data across branches that indicate potential outcomes of a decision, depicted here as yes or no outcomes, though in some cases there may be more than two possibilities. In the case a final decision can be made, the tree is terminated by leaf nodes (also known as terminal nodes) that denote the action to be taken as the result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree. A great benefit of decision tree algorithms is that the flowchart-like tree structure is not necessarily exclusively for the learner's internal use. After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works or doesn't work well for a particular task. This also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in case the results need to be shared with others in order to inform future business practices. With this in mind, some potential uses include: Credit scoring models in which the criteria that causes an applicant to be rejected need to be clearly documented and free from bias Marketing studies of customer behavior such as satisfaction or churn, which will be shared with management or advertising agencies Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression Although the previous applications illustrate the value of trees in informing decision processes, this is not to suggest that their utility ends here. In fact, decision trees are perhaps the single most widely used machine learning technique, and can be applied to model almost any type of data—often with excellent out-of-the-box applications. This said, in spite of their wide applicability, it is worth noting some scenarios where trees may not be an ideal fit. One such case might be a task where the data has a large number of nominal features with many levels or it has a large number of numeric features. These cases may result in a very large number of decisions and an overly complex tree. They may also contribute to the tendency of decision trees to overfit data, though as we will soon see, even this weakness can be overcome by adjusting some simple parameters. Divide and conquer Decision trees are built using a heuristic called recursive partitioning. This approach is also commonly known as divide and conquer because it splits the data into subsets, which are then split repeatedly into even smaller subsets, and so on and so forth until the process stops when the algorithm determines the data within the subsets are sufficiently homogenous, or another stopping criterion has been met. To see how splitting a dataset can create a decision tree, imagine a bare root node that will grow into a mature tree. At first, the root node represents the entire dataset, since no splitting has transpired. Next, the decision tree algorithm must choose a feature to split upon; ideally, it chooses the feature most predictive of the target class. The examples are then partitioned into groups according to the distinct values of this feature, and the first set of tree branches are formed. Working down each branch, the algorithm continues to divide and conquer the data, choosing the best candidate feature each time to create another decision node, until a stopping criterion is reached. Divide and conquer might stop at a node in a case that: All (or nearly all) of the examples at the node have the same class There are no remaining features to distinguish among the examples The tree has grown to a predefined size limit To illustrate the tree building process, let's consider a simple example. Imagine that you work for a Hollywood studio, where your role is to decide whether the studio should move forward with producing the screenplays pitched by promising new authors. After returning from a vacation, your desk is piled high with proposals. Without the time to read each proposal cover-to-cover, you decide to develop a decision tree algorithm to predict whether a potential movie would fall into one of three categories: Critical Success, Mainstream Hit, or Box Office Bust. To build the decision tree, you turn to the studio archives to examine the factors leading to the success and failure of the company's 30 most recent releases. You quickly notice a relationship between the film's estimated shooting budget, the number of A-list celebrities lined up for starring roles, and the level of success. Excited about this finding, you produce a scatterplot to illustrate the pattern: Using the divide and conquer strategy, we can build a simple decision tree from this data. First, to create the tree's root node, we split the feature indicating the number of celebrities, partitioning the movies into groups with and without a significant number of A-list stars: Next, among the group of movies with a larger number of celebrities, we can make another split between movies with and without a high budget: At this point, we have partitioned the data into three groups. The group at the top-left corner of the diagram is composed entirely of critically acclaimed films. This group is distinguished by a high number of celebrities and a relatively low budget. At the top-right corner, majority of movies are box office hits with high budgets and a large number of celebrities. The final group, which has little star power but budgets ranging from small to large, contains the flops. If we wanted, we could continue to divide and conquer the data by splitting it based on the increasingly specific ranges of budget and celebrity count, until each of the currently misclassified values resides in its own tiny partition, and is correctly classified. However, it is not advisable to overfit a decision tree in this way. Though there is nothing to stop us from splitting the data indefinitely, overly specific decisions do not always generalize more broadly. We'll avoid the problem of overfitting by stopping the algorithm here, since more than 80 percent of the examples in each group are from a single class. This forms the basis of our stopping criterion. You might have noticed that diagonal lines might have split the data even more cleanly. This is one limitation of the decision tree's knowledge representation, which uses axis-parallel splits. The fact that each split considers one feature at a time prevents the decision tree from forming more complex decision boundaries. For example, a diagonal line could be created by a decision that asks, "is the number of celebrities is greater than the estimated budget?" If so, then "it will be a critical success." Our model for predicting the future success of movies can be represented in a simple tree, as shown in the following diagram. To evaluate a script, follow the branches through each decision until the script's success or failure has been predicted. In no time, you will be able to identify the most promising options among the backlog of scripts and get back to more important work, such as writing an Academy Awards acceptance speech. Since real-world data contains more than two features, decision trees quickly become far more complex than this, with many more nodes, branches, and leaves. In the next section, you will learn about a popular algorithm to build decision tree models automatically. The C5.0 decision tree algorithm There are numerous implementations of decision trees, but one of the most well-known implementations is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an improvement over his Iterative Dichotomiser 3 (ID3) algorithm. Although Quinlan markets C5.0 to commercial clients (see http://www.rulequest.com/ for details), the source code for a single-threaded version of the algorithm was made publically available, and it has therefore been incorporated into programs such as R. To further confuse matters, a popular Java-based open source alternative to C4.5, titled J48, is included in R's RWeka package. Because the differences among C5.0, C4.5, and J48 are minor, the principles in this article will apply to any of these three methods, and the algorithms should be considered synonymous. The C5.0 algorithm has become the industry standard to produce decision trees, because it does well for most types of problems directly out of the box. Compared to other advanced machine learning models, the decision trees built by C5.0 generally perform nearly as well, but are much easier to understand and deploy. Additionally, as shown in the following table, the algorithm's weaknesses are relatively minor and can be largely avoided: Strengths Weaknesses An all-purpose classifier that does well on most problems Highly automatic learning process, which can handle numeric or nominal features, as well as missing data Excludes unimportant features Can be used on both small and large datasets Results in a model that can be interpreted without a mathematical background (for relatively small trees) More efficient than other complex models Decision tree models are often biased toward splits on features having a large number of levels It is easy to overfit or underfit the model Can have trouble modeling some relationships due to reliance on axis-parallel splits Small changes in the training data can result in large changes to decision logic Large trees can be difficult to interpret and the decisions they make may seem counterintuitive To keep things simple, our earlier decision tree example ignored the mathematics involved in how a machine would employ a divide and conquer strategy. Let's explore this in more detail to examine how this heuristic works in practice. Choosing the best split The first challenge that a decision tree will face is to identify which feature to split upon. In the previous example, we looked for a way to split the data such that the resulting partitions contained examples primarily of a single class. The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure. There are various measurements of purity that can be used to identify the best decision tree splitting candidate. C5.0 uses entropy, a concept borrowed from information theory that quantifies the randomness, or disorder, within a set of class values. Sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality. The decision tree hopes to find splits that reduce entropy, ultimately increasing homogeneity within the groups. Typically, entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). In each case, the minimum value indicates that the sample is completely homogenous, while the maximum value indicates that the data are as diverse as possible, and no group has even a small plurality. In the mathematical notion, entropy is specified as follows: In this formula, for a given segment of data (S), the term c refers to the number of class levels and pi refers to the proportion of values falling into class level i. For example, suppose we have a partition of data with two classes: red (60 percent) and white (40 percent). We can calculate the entropy as follows: > -0.60 * log2(0.60) - 0.40 * log2(0.40) [1] 0.9709506 We can examine the entropy for all the possible two-class arrangements. If we know that the proportion of examples in one class is x, then the proportion in the other class is (1 – x). Using the curve() function, we can then plot the entropy for all the possible values of x: > curve(-x * log2(x) - (1 - x) * log2(1 - x),        col = "red", xlab = "x", ylab = "Entropy", lwd = 4) This results in the following figure: As illustrated by the peak in entropy at x = 0.50, a 50-50 split results in maximum entropy. As one class increasingly dominates the other, the entropy reduces to zero. To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that would result from a split on each possible feature, which is a measure known as information gain. The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting from the split (S2): One complication is that after a split, the data is divided into more than one partition. Therefore, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this by weighing each partition's entropy by the proportion of records falling into the partition. This can be stated in a formula as: In simple terms, the total entropy resulting from a split is the sum of the entropy of each of the n partitions weighted by the proportion of examples falling in the partition (wi). The higher the information gain, the better a feature is at creating homogeneous groups after a split on this feature. If the information gain is zero, there is no reduction in entropy for splitting on this feature. On the other hand, the maximum information gain is equal to the entropy prior to the split. This would imply that the entropy after the split is zero, which means that the split results in completely homogeneous groups. The previous formulae assume nominal features, but decision trees use information gain for splitting on numeric features as well. To do so, a common practice is to test various splits that divide the values into groups greater than or less than a numeric threshold. This reduces the numeric feature into a two-level categorical feature that allows information gain to be calculated as usual. The numeric cut point yielding the largest information gain is chosen for the split. Though it is used by C5.0, information gain is not the only splitting criterion that can be used to build decision trees. Other commonly used criteria are Gini index, Chi-Squared statistic, and gain ratio. For a review of these (and many more) criteria, refer to Mingers J. An Empirical Comparison of Selection Measures for Decision-Tree Induction. Machine Learning. 1989; 3:319-342. Pruning the decision tree A decision tree can continue to grow indefinitely, choosing splitting features and dividing the data into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will be overfitted to the training data. The process of pruning a decision tree involves reducing its size such that it generalizes better to unseen data. One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions or when the decision nodes contain only a small number of examples. This is called early stopping or pre-pruning the decision tree. As the tree avoids doing needless work, this is an appealing strategy. However, one downside to this approach is that there is no way to know whether the tree will miss subtle, but important patterns that it would have learned had it grown to a larger size. An alternative, called post-pruning, involves growing a tree that is intentionally too large and pruning leaf nodes to reduce the size of the tree to a more appropriate level. This is often a more effective approach than pre-pruning, because it is quite difficult to determine the optimal depth of a decision tree without growing it first. Pruning the tree later on allows the algorithm to be certain that all the important data structures were discovered. The implementation details of pruning operations are very technical and beyond the scope of this article. For a comparison of some of the available methods, see Esposito F, Malerba D, Semeraro G. A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997;19: 476-491. One of the benefits of the C5.0 algorithm is that it is opinionated about pruning—it takes care of many decisions automatically using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the training data. Later, the nodes and branches that have little effect on the classification errors are removed. In some cases, entire branches are moved further up the tree or replaced by simpler decisions. These processes of grafting branches are known as subtree raising and subtree replacement, respectively. Balancing overfitting and underfitting a decision tree is a bit of an art, but if model accuracy is vital, it may be worth investing some time with various pruning options to see if it improves the performance on test data. As you will soon see, one of the strengths of the C5.0 algorithm is that it is very easy to adjust the training options. Summary This article covered two classification methods that use so-called "greedy" algorithms to partition the data according to feature values. Decision trees use a divide and conquer strategy to create flowchart-like structures, while rule learners separate and conquer data to identify logical if-else rules. Both methods produce models that can be interpreted without a statistical background. One popular and highly configurable decision tree algorithm is C5.0. We used the C5.0 algorithm to create a tree to predict whether a loan applicant will default. This article merely scratched the surface of how trees and rules can be used. Resources for Article: Further resources on this subject: Introduction to S4 Classes [article] First steps with R [article] Supervised learning [article]
Read more
  • 0
  • 0
  • 99630

article-image-creating-2d-3d-plots-using-matplotlib
Pravin Dhandre
22 Mar 2018
10 min read
Save for later

Creating 2D and 3D plots using Matplotlib

Pravin Dhandre
22 Mar 2018
10 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides data science recipes for users to effectively process, manipulate, and visualize massive datasets using SciPy.[/box] In today’s tutorial, we will demonstrate how to create two-dimensional and three-dimensional plots for displaying graphical representation of data using a full-fledged scientific library -  Matplotlib. Creating two-dimensional plots of functions and data We will present the basic kind of plot generated by Matplotlib: a two-dimensional display, with axes, where datasets and functional relationships are represented by lines. Besides the data being displayed, a good graph will contain a title (caption), axes labels, and, perhaps, a legend identifying each line in the plot. Getting ready Start Jupyter and run the following commands in an execution cell: %matplotlib inline import numpy as np import matplotlib.pyplot as plt How to do it… Run the following code in a single Jupyter cell: xvalues = np.linspace(-np.pi, np.pi) yvalues1 = np.sin(xvalues) yvalues2 = np.cos(xvalues) plt.plot(xvalues, yvalues1, lw=2, color='red', label='sin(x)') plt.plot(xvalues, yvalues2, lw=2, color='blue', label='cos(x)') plt.title('Trigonometric Functions') plt.xlabel('x') plt.ylabel('sin(x), cos(x)') plt.axhline(0, lw=0.5, color='black') plt.axvline(0, lw=0.5, color='black') plt.legend() None This code will insert the plot shown in the following screenshot into the Jupyter Notebook: How it works… We start by generating the data to be plotted, with the three following statements: xvalues = np.linspace(-np.pi, np.pi, 300) yvalues1 = np.sin(xvalues) yvalues2 = np.cos(xvalues) We first create an xvalues array, containing 300 equally spaced values between -π and π. We then compute the sine and cosine functions of the values in xvalues, storing the results in the yvalues1 and yvalues2 arrays. Next, we generate the first line plot with the following statement: plt.plot(xvalues, yvalues1, lw=2, color='red', label='sin(x)') The arguments to the plot() function are described as follows: xvalues and yvalues1 are arrays containing, respectively, the x and y coordinates of the points to be plotted. These arrays must have the same length. The remaining arguments are formatting options. lw specifies the line width and color the line color. The label argument is used by the legend() function, discussed as follows. The next line of code generates the second line plot and is similar to the one explained previously. After the line plots are defined, we set the title for the plot and the legends for the axes with the following commands: plt.title('Trigonometric Functions') plt.xlabel('x') plt.ylabel('sin(x), cos(x)') We now generate axis lines with the following statements: plt.axhline(0, lw=0.5, color='black') plt.axvline(0, lw=0.5, color='black') The first arguments in axhline() and axvline() are the locations of the axis lines and the options specify the line width and color. We then add a legend for the plot with the following statement: plt.legend() Matplotlib tries to place the legend intelligently, so that it does not interfere with the plot. In the legend, one item is being generated by each call to the plot() function and the text for each legend is specified in the label option of the plot() function. Generating multiple plots in a single figure Wouldn't it be interesting to know how to generate multiple plots in a single figure? Well, let's get started with that. Getting ready Start Jupyter and run the following three commands in an execution cell: %matplotlib inline import numpy as np import matplotlib.pyplot as plt How to do it… Run the following commands in a Jupyter cell: plt.figure(figsize=(6,6)) xvalues = np.linspace(-2, 2, 100) plt.subplot(2, 2, 1) yvalues = xvalues plt.plot(xvalues, yvalues, color='blue') plt.xlabel('$x$') plt.ylabel('$x$') plt.subplot(2, 2, 2) yvalues = xvalues ** 2 plt.plot(xvalues, yvalues, color='green') plt.xlabel('$x$') plt.ylabel('$x^2$') plt.subplot(2, 2, 3) yvalues = xvalues ** 3 plt.plot(xvalues, yvalues, color='red') plt.xlabel('$x$') plt.ylabel('$x^3$') plt.subplot(2, 2, 4) yvalues = xvalues ** 4 plt.plot(xvalues, yvalues, color='black') plt.xlabel('$x$') plt.ylabel('$x^3$') plt.suptitle('Polynomial Functions') plt.tight_layout() plt.subplots_adjust(top=0.90) None Running this code will produce results like those in the following screenshot: How it works… To start the plotting constructions, we use the figure() function, as shown in the following line of code: plt.figure(figsize=(6,6)) The main purpose of this call is to set the figure size, which needs adjustment, since we plan to make several plots in the same figure. After creating the figure, we add four plots with code, as demonstrated in the following segment: plt.subplot(2, 2, 3) yvalues = xvalues ** 3 plt.plot(xvalues, yvalues, color='red') plt.xlabel('$x$') plt.ylabel('$x^3$') In the first line, the plt.subplot(2, 2, 3) call tells pyplot that we want to organize the plots in a two-by-two layout, that is, in two rows and two columns. The last argument specifies that all following plotting commands should apply to the third plot in the array. Individual plots are numbered, starting with the value 1 and counting across the rows and columns of the plot layout. We then generate the line plot with the following statements: yvalues = xvalues ** 3 plt.plot(xvalues, yvalues, color='red') The first line of the preceding code computes the yvalues array, and the second draws the corresponding graph. Notice that we must set options such as line color individually for each subplot. After the line is plotted, we use the xlabel() and ylabel() functions to create labels for the axes. Notice that these have to be set up for each individual subplot too. After creating the subplots, we explain the subplots: plt.suptitle('Polynomial Functions') sets a common title for all Subplots plt.tight_layout() adjusts the area taken by each subplot, so that axes' legends do not overlap plt.subplots_adjust(top=0.90) adjusts the overall area taken by the plots, so that the title displays correctly Creating three-dimensional plots Matplotlib offers several different ways to visualize three-dimensional data. In this recipe, we will demonstrate the following methods: Drawing surfaces plots Drawing two-dimensional contour plots Using color maps and color bars Getting ready Start Jupyter and run the following three commands in an execution cell: %matplotlib inline import numpy as np import matplotlib.pyplot as plt How to do it… Run the following code in a Jupyter code cell: from mpl_toolkits.mplot3d import Axes3D from matplotlib import cm f = lambda x,y: x**3 - 3*x*y**2 fig = plt.figure(figsize=(12,6)) ax = fig.add_subplot(1,2,1,projection='3d') xvalues = np.linspace(-2,2,100) yvalues = np.linspace(-2,2,100) xgrid, ygrid = np.meshgrid(xvalues, yvalues) zvalues = f(xgrid, ygrid) surf = ax.plot_surface(xgrid, ygrid, zvalues, rstride=5, cstride=5, linewidth=0, cmap=cm.plasma) ax = fig.add_subplot(1,2,2) plt.contourf(xgrid, ygrid, zvalues, 30, cmap=cm.plasma) fig.colorbar(surf, aspect=18) plt.tight_layout() None Running this code will produce a plot of the monkey saddle surface, which is a famous example of a surface with a non-standard critical point. The displayed graph is shown in the following screenshot: How it works… We start by importing the Axes3D class from the mpl_toolkits.mplot3d library, which is the Matplotlib object used for creating three-dimensional plots. We also import the cm class, which represents a color map. We then define a function to be plotted, with the following line of code: f = lambda x,y: x**3 - 3*x*y**2 The next step is to define the Figure object and an Axes object with a 3D projection, as done in the following lines of code: fig = plt.figure(figsize=(12,6)) ax = fig.add_subplot(1,2,1,projection='3d') Notice that the approach used here is somewhat different than the other recipes in this chapter. We are assigning the output of the figure() function call to the fig variable and then adding the subplot by calling the add_subplot() method from the fig object. This is the recommended method of creating a three-dimensional plot in the most recent version of Matplotlib. Even in the case of a single plot, the add_subplot() method should be used, in which case the command would be ax = fig.add_subplot(1,1,1,projection='3d'). The next few lines of code, shown as follows, compute the data for the plot: xvalues = np.linspace(-2,2,100) yvalues = np.linspace(-2,2,100) xgrid, ygrid = np.meshgrid(xvalues, yvalues) zvalues = f(xgrid, ygrid) The most important feature of this code is the call to meshgrid(). This is a NumPy convenience function that constructs grids suitable for three-dimensional surface plots. To understand how this function works, run the following code: xvec = np.arange(0, 4) yvec = np.arange(0, 3) xgrid, ygrid = np.meshgrid(xvec, yvec) After running this code, the xgrid array will contain the following values: array([[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]) The ygrid array will contain the following values: array([[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]) Notice that the two arrays have the same dimensions. Each grid point is represented by a pair of the (xgrid[i,j],ygrid[i,j]) type. This convention makes the computation of a vectorized function on a grid easy and efficient, with the f(xgrid, ygrid) expression. The next step is to generate the surface plot, which is done with the following function call: surf = ax.plot_surface(xgrid, ygrid, zvalues, rstride=5, cstride=5, linewidth=0, cmap=cm.plasma) The first three arguments, xgrid, ygrid, and zvalues, specify the data to be plotted. We then use the rstride and cstride options to select a subset of the grid points. Notice that the xvalues and yvalues arrays both have length 100, so that xgrid and ygrid will have 10,000 entries each. Using all grid points would be inefficient and produce a poor plot from the visualization point of view. Thus, we set rstride=5 and cstride=5, which results in a plot containing every fifth point across each row and column of the grid. The next option, linewidth=0, sets the line width of the plot to zero, preventing the display of a wireframe. The final argument, cmap=cm.plasma, specifies the color map for the plot. We use the cm.plasma color map, which has the effect of plotting higher functional values with a hotter color. Matplotlib offer as large number of built-in color maps, listed at https:/​/​matplotlib.​org/​examples/​color/​colormaps_​reference.​html.​ Next, we add the filled contour plot with the following code: ax = fig.add_subplot(1,2,2) ax.contourf(xgrid, ygrid, zvalues, 30, cmap=cm.plasma) Notice that, when selecting the subplot, we do not specify the projection option, which is not necessary for two-dimensional plots. The contour plot is generated with the contourf() method. The first three arguments, xgrid, ygrid, zvalues, specify the data points, and the fourth argument, 30, sets the number of contours. Finally, we set the color map to be the same one used for the surface plot. The final component of the plot is a color bar, which provides a visual representation of the value associated with each color in the plot, with the fig.colorbar(surf, aspect=18) method call. Notice that we have to specify in the first argument which plot the color bar is associated to. The aspect=18 option is used to adjust the aspect ratio of the bar. Larger values will result in a narrower bar. To finish the plot, we call the tight_layout() function. This adjusts the sizes of each plot, so that axis labels are displayed correctly.   We generated 2D and 3D plots using Matplotlib and represented the results of technical computation in graphical manner. If you want to explore other types of plots such as scatter plot or bar chart, you may read Visualizing 3D plots in Matplotlib 2.0. Do check out the book SciPy Recipes to take advantage of other libraries of the SciPy stack and perform matrices, data wrangling and advanced computations with ease.
Read more
  • 0
  • 0
  • 91754

article-image-cross-validation-strategies-for-time-series-forecasting-tutorial
Packt Editorial Staff
06 May 2019
12 min read
Save for later

Cross-Validation strategies for Time Series forecasting [Tutorial]

Packt Editorial Staff
06 May 2019
12 min read
Time series modeling and forecasting are tricky and challenging. The i.i.d (identically distributed independence) assumption does not hold well to time series data. There is an implicit dependence on previous observations and at the same time, a data leakage from response variables to lag variables is more likely to occur in addition to inherent non-stationarity in the data space. By non-stationarity, we mean flickering changes of observed statistics such as mean and variance. It even gets trickier when taking inherent nonlinearity into consideration. Cross-validation is a well-established methodology for choosing the best model by tuning hyper-parameters or performing feature selection. There are a plethora of strategies for implementing optimal cross-validation. K-fold cross-validation is a time-proven example of such techniques. However, it is not robust in handling time series forecasting issues due to the nature of the data as explained above. In this tutorial, we shall explore two more techniques for performing cross-validation; time series split cross-validation and blocked cross-validation, which is carefully adapted to solve issues encountered in time series forecasting. We shall use Python 3.5, SciKit Learn, Matplotlib, Numpy, and Pandas. By the end of this tutorial you will have explored the following topics: Time Series Split Cross-Validation Blocked Cross-Validation Grid Search Cross-Validation Loss Function Elastic Net Regression Cross-Validation Image Source: scikit-learn.org First, the data set is split into a training and testing set. The testing set is preserved for evaluating the best model optimized by cross-validation. In k-fold cross-validation, the training set is further split into k folds aka partitions. During each iteration of the cross-validation, one fold is held as a validation set and the remaining k - 1 folds are used for training. This allows us to make the best use of the data available without annihilation. It also allows us to avoid biasing the model towards patterns that may be overly represented in a given fold. Then the error obtained on all folds is averaged and the standard deviation is calculated. One usually performs cross-validation to find out which settings give the minimum error before training a final model using these elected settings on the complete training set. Flavors of k-fold cross-validations exist, for example, leave-one-out and nested cross-validation. However, these may be the topic of another tutorial. Grid Search Cross-Validation One idea to fine-tune the hyper-parameters is to randomly guess the values for model parameters and apply cross-validation to see if they work. This is infeasible as there may be exponential combinations of such parameters. This approach is also called Random Search in the literature. Grid search works by exhaustively searching the possible combinations of the model’s parameters, but it makes use of the loss function to guide the selection of the values to be tried at each iteration. That is solving a minimization optimization problem. However, in SciKit Learn it explicitly tries all the possible combination which makes it computationally expensive. When cross-validation is used in the inner loop of the grid search, it is called grid search cross-validation. Hence, the optimization objective becomes minimizing the average loss obtained on the k folds. R2 Loss Function Choosing the loss function has a very high impact on model performance and convergence. In this tutorial, I would like to introduce to you a loss function, most commonly used in regression tasks. R2 loss works by calculating correlation coefficients between the ground truth target values and the response output from the model. The formula is, however, slightly modified so that the range of the function is in the open interval [+1, -∞]. Hence, +1 indicates maximum positive correlation and negative values indicate the opposite. Thus, all the errors obtained in this tutorial should be interpreted as desirable if their value is close to +1. It is worth mentioning that we could have chosen a different loss function such as L1-norm or L2-norm. I would encourage you to try the ideas discussed in this tutorial using other loss functions and observe the difference. Elastic Net Regression This also goes in the literature by the name elastic net regularization. Regularization is a very robust technique to avoid overfitting by penalizing large weights or in other words it alters the objective function by emphasizing the errors caused by memorizing the training set. Vanilla linear regression can be tricked into learning the parameters that perform very well on the training set, but yet fail to generalize for unseen new samples. Both L1-regularization and L2-regularization were incorporated to resolve overfitting and are known in the literature as Lasso and Ridge regression respectively. Due to the critique of both Lasso and Ridge regression, Elastic Net regression was introduced to mix the two models. As a result, some variables’ coefficients are set to zero as per L1-norm and some others are penalized or shrank as per the L2-norm. This model combines the best from both worlds and the result is a stable, robust, and a sparse model. As a consequence, there are more parameters to be fine-tuned. That’s why this is a good example to demonstrate the power of cross-validation. Crypto Data Set I have obtained ETHereum/USD exchange prices for the year 2019 from cryptodatadownload.com which you can get for free from the website or by running the following command: $ wget http://www.cryptodatadownload.com/cdd/Gemini_ETHUSD_d.csv Now that you have the CSV file you can import it to Python using Pandas. The daily close price is used as both regressor and response variables. In this setup, I have used a lag of 64 days for regressors and a target of 8 days for responses. That is, given the past 64 days closing prices forecast the next 8 days. Then the resulting nan rows at the tail are dropped as a way to handle missing values. df = pd.read_csv('./Gemini_ETHUSD_d.csv', skiprows=1) for i in range(1, STEPS): col_name = 'd{}'.format(i) df[col_name] = df['d0'].shift(periods=-1 * i) df = df.dropna() Next, we split the data frame into two one for the regressors and the other for the responses. And then split both into two one for training and the other for testing. X = df.iloc[:, :TRAIN_STEPS] y = df.iloc[:, TRAIN_STEPS:] X_train = X.iloc[:SPLIT_IDX, :] y_train = y.iloc[:SPLIT_IDX, :] X_test = X.iloc[SPLIT_IDX:, :] y_test = y.iloc[SPLIT_IDX:, :] Model Design Let’s define a method that creates an elastic net model from sci-kit learn and since we are going to forecast more than one future time step, let’s use a multi-output regressor wrapper that trains a separate model for each target time step. However, this introduces more demand for computation resources. def build_model(_alpha, _l1_ratio): estimator = ElasticNet( alpha=_alpha, l1_ratio=_l1_ratio, fit_intercept=True, normalize=False, precompute=False, max_iter=16, copy_X=True, tol=0.1, warm_start=False, positive=False, random_state=None, selection='random' ) return MultiOutputRegressor(estimator, n_jobs=4) Blocked and Time Series Splits Cross-Validation The best way to grasp the intuition behind blocked and time series splits is by visualizing them. The three split methods are depicted in the above diagram. The horizontal axis is the training set size while the vertical axis represents the cross-validation iterations. The folds used for training are depicted in blue and the folds used for validation are depicted in orange. You can intuitively interpret the horizontal axis as time progression line since we haven’t shuffled the dataset and maintained the chronological order. The idea for time series splits is to divide the training set into two folds at each iteration on condition that the validation set is always ahead of the training split. At the first iteration, one trains the candidate model on the closing prices from January to March and validates on April’s data, and for the next iteration, train on data from January to April, and validate on May’s data, and so on to the end of the training set. This way dependence is respected. However, this may introduce leakage from future data to the model. The model will observe future patterns to forecast and try to memorize them. That’s why blocked cross-validation was introduced.  It works by adding margins at two positions. The first is between the training and validation folds in order to prevent the model from observing lag values which are used twice, once as a regressor and another as a response. The second is between the folds used at each iteration in order to prevent the model from memorizing patterns from an iteration to the next. Implementing k-fold cross-validation using sci-kit learn is pretty straightforward, but in the following lines of code, we pass the k-fold splitter explicitly as we will develop the idea further in order to implement other kinds of cross-validation. model = build_model(_alpha=1.0, _l1_ratio=0.3) kfcv = KFold(n_splits=5) scores = cross_val_score(model, X_train, y_train, cv=kfcv, scoring=r2) print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std())) This outputs: Loss: -103.076 (+/- 205.979) The same applies to time series splitter as follows: model = build_model(_alpha=1.0, _l1_ratio=0.3) tscv = TimeSeriesSplit(n_splits=5) scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring=r2) print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std())) This outputs: Loss: -9.799 (+/- 19.292) Sci-kit learn gives us the luxury to define any new types of splitters as long as we abide by its splitter API and inherit from the base splitter. class BlockingTimeSeriesSplit(): def __init__(self, n_splits): self.n_splits = n_splits def get_n_splits(self, X, y, groups): return self.n_splits def split(self, X, y=None, groups=None): n_samples = len(X) k_fold_size = n_samples // self.n_splits indices = np.arange(n_samples) margin = 0 for i in range(self.n_splits): start = i * k_fold_size stop = start + k_fold_size mid = int(0.8 * (stop - start)) + start yield indices[start: mid], indices[mid + margin: stop] Then we can use it exactly the same way like before. model = build_model(_alpha=1.0, _l1_ratio=0.3) btscv = BlockingTimeSeriesSplit(n_splits=5) scores = cross_val_score(model, X_train, y_train, cv=btscv, scoring=r2) print("Loss: {0:.3f} (+/- {1:.3f})".format(scores.mean(), scores.std())) This outputs: Loss: -15.527 (+/- 27.488) Please notice how the loss is different among the different types of splitters. In order to interpret the results correctly, let’s put it to test by using grid search cross-validation to find the optimal values for both regularization parameter alpha and -ratio that controls how much -norm contributes to the regularization. It follows that -norm contributes 1 - . params = { 'estimator__alpha':(0.1, 0.3, 0.5, 0.7, 0.9), 'estimator__l1_ratio':(0.1, 0.3, 0.5, 0.7, 0.9) } for i in range(100): model = build_model(_alpha=1.0, _l1_ratio=0.3) finder = GridSearchCV( estimator=model, param_grid=params, scoring=r2, fit_params=None, n_jobs=None, iid=False, refit=False, cv=kfcv, # change this to the splitter subject to test verbose=1, pre_dispatch=8, error_score=-999, return_train_score=True ) finder.fit(X_train, y_train) best_params = finder.best_params_ Experimental Results K-Fold Cross-Validation Optimal Parameters Grid-search cross-validation was run 100 times in order to objectively measure the consistency of the results obtained using each splitter. This way we can evaluate the effectiveness and robustness of the cross-validation method on time series forecasting. As for the k-fold cross-validation, the parameters suggested were almost uniform. That is, it did not really help us in discriminating the optimal parameters since all were equally good or bad. Time Series Split Cross-Validation Optimal Parameters Blocked Cross-Validation Optimal Parameters However, in both the cases of time series split cross-validation and blocked cross-validation, we have obtained a clear indication of the optimal values for both parameters. In case of blocked cross-validation, the results were even more discriminative as the blue bar indicates the dominance of -ratio optimal value of 0.1. Ground Truth vs Forecasting After having obtained the optimal values for our model parameters, we can train the model and evaluate it on the testing set. The results, as depicted in the plot above, indicate smooth capture of the trend and minimum error rate. # optimal model model = build_model(_alpha=0.1, _l1_ratio=0.1) # train model model.fit(X_train, y_train) # test score y_predicted = model.predict(X_test) score = r2_score(y_test, y_predicted, multioutput='uniform_average') print("Test Loss: {0:.3f}".format(score)) The output is: Test Loss: 0.925 Ideas for the Curious In this tutorial, we have demonstrated the power of using the right cross-validation strategy for time-series forecasting. The beauty of machine learning is endless. Here you’re a few ideas to try out and experiment on your own: Try using a different more volatile data set Try using different lag and target length instead of 64 and 8 days each. Try different regression models Try different loss functions Try RNN models using Keras Try increasing or decreasing the blocked splits margins Try a different value for k in cross-validation References Jeff Racine,Consistent cross-validatory model-selection for dependent data: hv-block cross-validation,Journal of Econometrics,Volume 99, Issue 1,2000,Pages 39-61,ISSN 0304-4076. Dabbs, Beau & Junker, Brian. (2016). Comparison of Cross-Validation Methods for Stochastic Block Models. Marcos Lopez de Prado, 2018, Advances in Financial Machine Learning (1st ed.), Wiley Publishing. Doctor, Grado DE et al. “New approaches in time series forecasting: methods, software, and evaluation procedures.” (2013). Learn More Seize the chance to learn more about time series forecasting techniques, machine learning, trading strategies, and algorithmic trading on my step by step online video course: Hands-on Machine Learning for Algorithmic Trading Bots with Python on PacktPub. Author Bio Mustafa Qamar-ud-Din is a machine learning engineer with over 10 years of experience in the software development industry engaged with startups on solving problems in various domains; e-commerce applications, recommender systems, biometric identity control, and event management. Time series modeling: What is it, Why it matters and How it’s used Implementing a simple Time Series Data Analysis in R Training RNNs for Time Series Forecasting
Read more
  • 0
  • 0
  • 90763
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-25-datasets-deep-learning-iot
Sugandha Lahoti
20 Mar 2018
8 min read
Save for later

25 Datasets for Deep Learning in IoT

Sugandha Lahoti
20 Mar 2018
8 min read
Deep Learning is one of the major players for facilitating the analytics and learning in the IoT domain. A really good roundup of the state of deep learning advances for big data and IoT is described in the paper Deep Learning for IoT Big Data and Streaming Analytics: A Survey by Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. In this article, we have attempted to draw inspiration from this research paper to establish the importance of IoT datasets for deep learning applications. The paper also provides a handy list of commonly used datasets suitable for building deep learning applications in IoT, which we have added at the end of the article. IoT and Big Data: The relationship IoT and Big data have a two-way relationship. IoT is the main producer of big data, and as such an important target for big data analytics to improve the processes and services of IoT. However, there is a difference between the two. Large-Scale Streaming data: IoT data is a large-scale streaming data. This is because a large number of IoT devices generate streams of data continuously. Big data, on the other hand, lack real-time processing. Heterogeneity: IoT data is heterogeneous as various IoT data acquisition devices gather different information. Big data devices are generally homogeneous in nature. Time and space correlation: IoT sensor devices are also attached to a specific location, and thus have a location and time-stamp for each of the data items. Big data sensors lack time-stamp resolution. High noise data: IoT data is highly noisy, owing to the tiny pieces of data in IoT applications, which are prone to errors and noise during acquisition and transmission. Big data, in contrast, is generally less noisy. Big data, on the other hand, is classified according to conventional 3V’s, Volume, Velocity, and Variety. As such techniques used for Big data analytics are not sufficient to analyze the kind of data, that is being generated by IoT devices. For instance, autonomous cars need to make fast decisions on driving actions such as lane or speed change. These decisions should be supported by fast analytics with data streaming from multiple sources (e.g., cameras, radars, left/right signals, traffic light etc.). This changes the definition of IoT big data classification to 6V’s. Volume: The quantity of generated data using IoT devices is much more than before and clearly fits this feature. Velocity: Advanced tools and technologies for analytics are needed to efficiently operate the high rate of data production. Variety: Big data may be structured, semi-structured, and unstructured data. The data types produced by IoT include text, audio, video, sensory data and so on. Veracity: Veracity refers to the quality, consistency, and trustworthiness of the data, which in turn leads to accurate analytics. Variability: This property refers to the different rates of data flow. Value: Value is the transformation of big data to useful information and insights that bring competitive advantage to organizations. Despite the recent advancement in DL for big data, there are still significant challenges that need to be addressed to mature this technology. Every 6 characteristics of IoT big data imposes a challenge for DL techniques. One common denominator for all is the lack of availability of IoT big data datasets.   IoT datasets and why are they needed Deep learning methods have been promising with state-of-the-art results in several areas, such as signal processing, natural language processing, and image recognition. The trend is going up in IoT verticals as well. IoT datasets play a major role in improving the IoT analytics. Real-world IoT datasets generate more data which in turn improve the accuracy of DL algorithms. However, the lack of availability of large real-world datasets for IoT applications is a major hurdle for incorporating DL models in IoT. The shortage of these datasets acts as a barrier to deployment and acceptance of IoT analytics based on DL since the empirical validation and evaluation of the system should be shown promising in the natural world. The lack of availability is mainly because: Most IoT datasets are available with large organizations who are unwilling to share it so easily. Access to the copyrighted datasets or privacy considerations. These are more common in domains with human data such as healthcare and education. While there is a lot of ground to be covered in terms of making datasets for IoT available, here is a list of commonly used datasets suitable for building deep learning applications in IoT. Dataset Name Domain Provider Notes Address/Link CGIAR dataset Agriculture, Climate CCAFS High-resolution climate datasets for a variety of fields including agricultural http://www.ccafs-climate.org/ Educational Process Mining Education University of Genova Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator http://archive.ics.uci.edu/ml/datasets/Educational+Process+Mining+%28EPM%29%3A+A+Learning+Analytics+Data+Set Commercial Building Energy Dataset Energy, Smart Building IIITD Energy related data set from a commercial building where data is sampled more than once a minute. http://combed.github.io/ Individual household electric power consumption Energy, Smart home EDF R&D, Clamart, France One-minute sampling rate over a period of almost 4 years http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption AMPds dataset Energy, Smart home S. Makonin AMPds contains electricity, water, and natural gas measurements at one minute intervals for 2 years of monitoring http://ampds.org/ UK Domestic Appliance-Level Electricity Energy, Smart Home Kelly and Knottenbelt Power demand from five houses. In each house both the whole-house mains power demand as well as power demand from individual appliances are recorded. http://www.doc.ic.ac.uk/∼dk3810/data/ PhysioBank databases Healthcare PhysioNet Archive of over 80 physiological datasets. https://physionet.org/physiobank/database/ Saarbruecken Voice Database Healthcare Universitat¨ des Saarlandes A collection of voice recordings from more than 2000 persons for pathological voice detection. http://www.stimmdatebank.coli.uni-saarland.de/help_en.php4   T-LESS   Industry CMP at Czech Technical University An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects http://cmp.felk.cvut.cz/t-less/ CityPulse Dataset Collection Smart City CityPulse EU FP7 project Road Traffic Data, Pollution Data, Weather, Parking http://iot.ee.surrey.ac.uk:8080/datasets.html Open Data Institute - node Trento Smart City Telecom Italia Weather, Air quality, Electricity, Telecommunication http://theodi.fbk.eu/openbigdata/ Malaga datasets Smart City City of Malaga A broad range of categories such as energy, ITS, weather, Industry, Sport, etc. http://datosabiertos.malaga.eu/dataset Gas sensors for home activity monitoring Smart home Univ. of California San Diego Recordings of 8 gas sensors under three conditions including background, wine and banana presentations. http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring CASAS datasets for activities of daily living Smart home Washington State University Several public datasets related to Activities of Daily Living (ADL) performance in a two story home, an apartment, and an office settings. http://ailab.wsu.edu/casas/datasets.html ARAS Human Activity Dataset Smart home Bogazici University Human activity recognition datasets collected from two real houses with multiple residents during two months. https://www.cmpe.boun.edu.tr/aras/ MERLSense Data Smart home, building Mitsubishi Electric Research Labs Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records. http://www.merl.com/wmd SportVU   Sport Stats LLC   Video of basketball and soccer games captured from 6 cameras. http://go.stats.com/sportvu RealDisp Sport O. Banos   Includes a wide range of physical activities (warm up, cool down and fitness exercises). http://orestibanos.com/datasets.htm   Taxi Service Trajectory Transportation Prediction Challenge, ECML PKDD 2015 Trajectories performed by all the 442 taxis running in the city of Porto, in Portugal. http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html GeoLife GPS Trajectories Transportation Microsoft A GPS trajectory by a sequence of time-stamped points https://www.microsoft.com/en-us/download/details.aspx?id=52367 T-Drive trajectory data Transportation Microsoft Contains a one-week trajectories of 10,357 taxis https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ Chicago Bus Traces data Transportation M. Doering   Bus traces from the Chicago Transport Authority for 18 days with a rate between 20 and 40 seconds. http://www.ibr.cs.tu-bs.de/users/mdoering/bustraces/   Uber trip data Transportation FiveThirtyEight About 20 million Uber pickups in New York City during 12 months. https://github.com/fivethirtyeight/uber-tlc-foil-response Traffic Sign Recognition Transportation K. Lim   Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules. https://figshare.com/articles/Traffic_Sign_Recognition_Testsets/4597795 DDD17   Transportation J. Binas End-To-End DAVIS Driving Dataset. http://sensors.ini.uzh.ch/databases.html      
Read more
  • 0
  • 2
  • 88157

article-image-how-to-customize-lines-and-markers-in-matplotlib-2-0
Sugandha Lahoti
13 Dec 2017
6 min read
Save for later

How to Customize lines and markers in Matplotlib 2.0

Sugandha Lahoti
13 Dec 2017
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim, titled Matplotlib 2.x By Example. The book illustrates methods and applications of various plot types through real world examples.[/box] In this post we demonstrate how you can manipulate Lines and Markers in Matplotlib 2.0. It covers steps to plot, customize, and adjust line graphs and markers. What are Lines and Markers Lines and markers are key components found among various plots. Many times, we may want to customize their appearance to better distinguish different datasets or for better or more consistent styling. Whereas markers are mainly used to show data, such as line plots and scatter plots, lines are involved in various components, such as grids, axes, and box outlines. Like text properties, we can easily apply similar settings for different line or marker objects with the same method. Lines Most lines in Matplotlib are drawn with the lines class, including the ones that display the data and those setting area boundaries. Their style can be adjusted by altering parameters in lines.Line2D. We usually set color, linestyle, and linewidth as keyword arguments. These can be written in shorthand as c, ls, and lw respectively. In the case of simple line graphs, these parameters can be parsed to the plt.plot() function: import numpy as np import matplotlib.pyplot as plt # Prepare a curve of square numbers x = np.linspace(0,200,100) # Prepare 100 evenly spaced numbers from # 0 to 200 y = x**2                                   # Prepare an array of y equals to x squared # Plot a curve of square numbers plt.plot(x,y,label = '$x^2$',c='burlywood',ls=('dashed'),lw=2) plt.legend() plt.show() With the preceding keyword arguments for line color, style, and weight, you get a woody dashed curve: Choosing dash patterns Whether a line will appear solid or with dashes is set by the keyword argument linestyle. There are a few simple patterns that can be set by the linestyle name or the corresponding shorthand. We can also define our own dash pattern: 'solid' or '-': Simple solid line (default) 'dashed' or '--': Dash strokes with equal spacing 'dashdot' or '-.': Alternate dashes and dots 'None', ' ', or '': No lines (offset, on-off-dash-seq): Customized dashes; we will demonstrate in the following advanced example Setting capstyle of dashes The cap of dashes can be rounded by setting the dash_capstyle parameter if we want to create a softer image such as in promotion: import numpy as np import matplotlib.pyplot as plt # Prepare 6 lines x = np.linspace(0,200,100) y1 = x*0.5 y2 = x y3 = x*2 y4 = x*3 y5 = x*4 y6 = x*5 # Plot lines with different dash cap styles plt.plot(x,y1,label = '0.5x', lw=5, ls=':',dash_capstyle='butt') plt.plot(x,y2,label = 'x', lw=5, ls='--',dash_capstyle='butt') plt.plot(x,y3,label = '2x', lw=5, ls=':',dash_capstyle='projecting') plt.plot(x,y4,label = '3x', lw=5, ls='--',dash_capstyle='projecting') plt.plot(x,y5,label = '4x', lw=5, ls=':',dash_capstyle='round') plt.plot(x,y6,label = '5x', lw=5, ls='--',dash_capstyle='round') plt.show() Looking closely, you can see that the top two lines are made up of rounded dashes. The middle two lines with projecting capstyle have closer spaced dashes than the lower two with butt one, given the same default spacing: Markers A marker is another type of important component for illustrating data, for example, in scatter plots, swarm plots, and time series plots. Choosing markers There are two groups of markers, unfilled markers and filled_markers. The full set of available markers can be found by calling Line2D.markers, which will output a dictionary of symbols and their corresponding marker style names. A subset of filled markers that gives more visual weight is under Line2D.filled_markers. Here are some of the most typical markers: 'o' : Circle 'x' : Cross  '+' : Plus sign 'P' : Filled plus sign 'D' : Filled diamond 'S' : Square '^' : Triangle Here is a scatter plot of random numbers to illustrate the various marker types: import numpy as np import matplotlib.pyplot as plt from matplotlib.lines import Line2D # Prepare 100 random numbers to plot x = np.random.rand(100) y = np.random.rand(100) # Prepare 100 random numbers within the range of the number of # available markers as index # Each random number will serve as the choice of marker of the # corresponding coordinates markerindex = np.random.randint(0, len(Line2D.markers), 100) # Plot all kinds of available markers at random coordinates # for each type of marker, plot a point at the above generated # random coordinates with the marker type for k, m in enumerate(Line2D.markers): i = (markerindex == k) plt.scatter(x[i], y[i], marker=m)         plt.show() The different markers suit different densities of data for better distinction of each point: Adjusting marker sizes We often want to change the marker sizes so as to make them clearer to read from a slideshow. Sometimes we need to adjust the markers to have a different numerical value of marker size to: import numpy as np import matplotlib.pyplot as plt import matplotlib.ticker as ticker # Prepare 5 lines x = np.linspace(0,20,10) y1 = x y2 = x*2 y3 = x*3 y4 = x*4 y5 = x*5 # Plot lines with different marker sizes plt.plot(x,y1,label = 'x', lw=2, marker='s', ms=10)     # square size 10 plt.plot(x,y2,label = '2x', lw=2, marker='^', ms=12) # triangle size 12 plt.plot(x,y3,label = '3x', lw=2, marker='o', ms=10) # circle size 10 plt.plot(x,y4,label = '4x', lw=2, marker='D', ms=8)     # diamond size 8 plt.plot(x,y5,label = '5x', lw=2, marker='P', ms=12) # filled plus sign # size 12 # get current axes and store it to ax ax = plt.gca() plt.show() After tuning the marker sizes, the different series look quite balanced: If all markers are set to have the same markersize value, the diamonds and squares may look heavier: Thus, we learned how to customize lines and markers in a Matplotlib plot for better visualization and styling. To know more about how to create and customize plots in Matplotlib, check out this book Matplotlib 2.x By Example.
Read more
  • 0
  • 0
  • 87978

article-image-mastering-the-api-life-cycle-a-comprehensive-guide-to-design-implementation-release-and-maintenance
Bruno Pedro
06 Nov 2024
15 min read
Save for later

Mastering the API Life Cycle: A Comprehensive Guide to Design, Implementation, Release, and Maintenance

Bruno Pedro
06 Nov 2024
15 min read
This article is an excerpt from the book, "Building an API Product", by Bruno Pedro. Build cutting-edge API products confidently, excelling in today's competitive market with this comprehensive guide on API fundamentals, inner workings, and steps for successful API product development.Introduction The life of an API product consists of a series of stages. Those stages form a cycle that starts with the initial conception of the API product and ends with the retirement of the API. The name of this sequence of stages is called a life cycle. This term started to gain popularity in software and product development in the 1980s. It’s used as a common framework to align the different participants during the life of a software application or product. Each stage of the API life cycle has specific goals, deliverables, and activities that must be completed before advancing to the next stage. There are many variations on the concept of API life cycles. I use my own version to simplify learning and focus on what is essential. Over the years, I have distilled the API life cycle into four easy-to-understand stages.  They are the design, implementation, release, and maintenance stages. Keep reading to gain an overview of what each of the stages looks like.  Figure 4.1 – The API life cycle The goal of this chapter is to provide you with a global overview of what an API life cycle is. You will see each one of the stages of the API life cycle as a transition and not simply an isolated step. You will first learn about the design stage and understand how it’s foundational to the success of an API product. Th en, you’ll continue o n to the implementation stage, where you’ll learn that a big part of an API server can be generated. After that, the chapter explores the release stage, where you’ll learn the importance of finding the right distribution model. Finally, you’ll understand the importance of versioning and sunsetting your API in the maintenance stage. After reading the chapter, you will understand and be able to recognize the API life cycle’s diff erent stages. You will understand how each API life cycle stage connects to the others. You will also know the participants and stakeholders of each stage of the API life cycle. Finally, you will know the most critical aspects of each stage of the API life cycle. In this article, you’ll learn about the four stages of the API life cycle: Design Implement Release Maintain  Design The first stage of the API life cycle is where you decide what you will build. You can view the design stage as a series of steps where your view of what your API will become gets more refined and validated. At the end of the design stage, you will be able to confidently implement your API, knowing that it’s aligned with the needs of your business and your customers. The steps I take in the design stage are as follows: Ideation Strategy Definition Validation Specification These steps help me advance in holistically designing the API, involving as many different stakeholders as possible so I get a complete alignment. I usually start with a rough idea of what the ideal API would look like. Then I start asking different stakeholders as many questions as possible to understand whether my initial assumptions were correct. Something I always ask is why an API should be built. Even though it looks like a simple question, its answer can reveal the real intentions behind building the API. Also, the answer is different depending on whom you ask the question. Your job is to synthesize the information you gather and document pieces of evidence that back up the decisions you make about the API design. You will, at this stage, interview as many stakeholders as possible. They can include potential API users, engineers who work with you, and your company’s leadership team. The goal is to find out why you’re building the API and to document it. Once you know why you’re building the API, you’ll learn what the API will look like to fit the needs of potential users. To learn what API users need, identify the personas you want to serve and then put yourself in their shoes. You’ve already seen a few proto-personas in Chapter 2. In this API life cycle stage, you draw from those generic personas and identify your API users. You then contact people representing your API user personas and interview them. During the interviews, you should understand their JTBDs, the challenges they face during their work, and the tools they use. From the information you obtain, you can infer the benefits they would get from the API you’re building and how they would use the API. This last piece of information is critical because it lets you define the architectural style of the API. By knowing what tools your user personas use daily, you can make an informed decision about the architectural style of your API. Architectural styles are how you identify the technology and type of communication that the API will use. For example, REST is one architectural style that lets API consumers interact with remote resources by executing one of the HTTP verbs. Among those verbs, there’s one that’s natively supported by web browsers—HTTP GET. So, if you identify that a user persona wants to use a web browser to consume your API, then you will want to follow the REST architectural style and limit it to HTTP GET. Otherwise, that user persona won’t be able to use your API directly from their tool of choice. Something else you’ll want to define is the capabilities your API will offer users. Defining capabilities is an exercise that combines the information you gathered from interviews. You translate JTBDs, benefits, and behaviors into a set of capabilities that your API will have. Ideally, those capabilities will cover all the needs of the users whom you interviewed. However, you might want to prioritize the capabilities according to their degree of urgency and the cost of implementation. In any case, you want to validate your assumptions before investing in actually implementing the API. Validation of your API design happens first at a high level, and after a positive review, you attempt a low-level validation. High-level validation involves sharing the definition of the architectural style and capabilities that you have created with the API stakeholders. You present your findings to the stakeholders, explain how you came up with the definitions, and then ask for their review. Sometimes the feedback will make you question your assumptions, and you must refine your definitions. Eventually, you will get to a point where the stakeholders are all aligned with what you think the API should be. At that point, you’re ready to attempt a low-level validation. The difference between a high-level and a low-level validation is the amount of detail you share with your stakeholders and how technical the feedback you expect needs to be. While in high-level validation, you mostly expect an opinion about the design of the API, in low-level validation, you actually want the stakeholders to test the API before you start building it. You do that by creating what is called an API mock server. It allows anyone to make real API requests to a server as if they were making requests to the real API. The mock server responds with data that is not real but has the same shape that the responses of the real API would have. Stakeholders can then test making requests to the mock server from their tools of choice to see how the API would work. You might need to make changes during this low-level validation process until the stakeholders are comfortable with how your API will work. After that, you’re ready to translate the API design into a machine-readable definition document that will be used during the implementation stage of the API life cycle. The type of machine-readable definition depends on the architectural style identified earlier. If, for example, the architectural style is REST, then you’ll create an OpenAPI document. Otherwise, you will work with the type of machine-readable definition most appropriate for the architectural style of the API. Once you have a machine-readable API definition, you’re ready to advance to the implementation stage of the API life cycle. Implementation Having a machine-readable API definition is halfway to getting an entire API server up and running. I won’t focus on any particular architectural style, so you can keep all options open at this point. The goal of the machine-readable definition is to make it easy to generate server code and configuration and give your API consumers a simple way to interact with your API. Some API server solutions require almost no coding as long as you have a machine-readable definition. One type of coding you’ll need to do—or ask an engineer to do—is the code responsible for the business logic behind each API capability. While the API itself can be almost entirely generated, the logic behind each capability must be programmed and linked to the API. Usually, you’ll start with a first version of your API server that can run locally and will be used to iteratively implement all the business logic behind each of the capabilities. Later, you’ll make your API server publicly available to your API consumers. When I say publicly available, I mean that your API consumers should be able to securely make requests. One of the elements of security that you should think about is authentication. Many APIs are fully open to the public without requiring any type of authentication. However, when building an API product, you want to identify who your users are. Monetization is only possible if you know who is making requests to your API. Other security factors to consider have already been covered in Chapter 3. They include things such as logging, monitoring, and rate limiting. In any case, you should always test your API thoroughly during the implementation stage to make sure that everything is working according to plan. One type of test that is particularly useful at this stage is contract testing. This type of test aims to verify whether the API responses include the expected information in the expected format. The word contract is used to describe the API definition as something that both you—the API producers—and your consumers agree to. By performing a contract test, you’ll verify whether the implementation of the API has been done according to what has been designed and defined in the machine-readable document. For example, you can verify whether a particular capability is responding with the type of data that you defined. Before deploying your API to production, though, you want to be more thorough with your testing. Other types of tests that are well suited to be performed at this stage are functional and performance testing. Functional tests, in particular, can help you identify areas of the API that are not behaving as functionally as intended. Testing different elements of your API helps you increase its quality. Nevertheless, there’s another activity that focuses on API quality and relies on tests to obtain insights. Quality assurance, or QA, is one type of activity where you test your API capabilities using different inputs and check whether the responses are the expected ones. QA can be performed manually or  automatically by following a programmable script. Performing API QA has the advantage of improving the quality of your API, its overall user experience, and even the security of the product. Since a QA process can identify defects early on during the implementation stage of an API product, it can reduce the cost of fi xing those defects if they’re found when consumers are already using the API. While contract and functional tests provide information on how an API works, QA off ers a broader perspective on how consumers experience the API. A QA process can be a part of the release process of your API and can determine whether the proposed changes have production quality. Release In soft ware development, you can say that a release happens whenever you make your soft ware available to users. Diff erent release environments target diff erent kinds of users. You can have a development environment that is mostly used to share your soft ware with other developers and to make testing easy. Th ere can also be a staging environment where the soft ware is available to a broader audience, and QA testing can happen. Finally, there is a production environment where the soft ware is made available generally to your customers. Releasing soft ware—and API products—can be done manually or automatically. While manual releases work well for small projects, things can get more complicated if you have a large code base and a growing team working on the project. In those situations, you want to automate the release as much as possible with something called a build process. During implementation, you focus on developing your API and ensuring you have all tests in place. If those tests are all fully automated, you can make them run every time you try to release your API. Each build process can automatically run a series of steps, including packaging the soft ware, making it available on a mock server, and running tests. If any of the build steps fail, you can consider that the whole build process failed, and the API isn’t released. If the build process succeeds, you have a packaged API ready to be deployed into your environment of choice. Deploying the API means it will become available to any users with access to the environment where you’re doing the release. You can either manage the deployment process yourself, including the servers where your API will run, or use one of the many available API gateway products. Either way, you’ll want to have a layer of control between your users and your API. If controlling how users interact with your API is important, knowing how your API is behaving is also fundamental. If you know how your API behaves, you can understand whether its behavior is aff ecting your users’ experience. By anticipating how users can be negatively aff ected, you can proactively take measures and improve the quality of your API. Using an API monitor lets you periodically receive information about the behavior and quality of your API. You can understand whether any part of your API is not working as expected by using a solution such as a Postman Monitor. Diff erent solutions let you gather information about API availability, response times, and error rates. If you want to go deeper and understand how the API server is performing, you can also use an Application Performance Monitor (APM). Services such as New Relic give you information about the performance and error rate of the server and the code that is running your API. Another area that you want to pay attention to during the release stage of the API life cycle is documentation. While you can have an API reference automatically built from your machine-readable defi nition, you’ll want to pay attention to other aspects of documentation. As you’ve seen in Chapter 2, good API documentation is fundamental to obtaining a good user experience. In Chapter 3, you learned how documentation can enhance support and help users get answers to their questions when interacting with your API. Documentation also involves tutorials covering the JTBDs of the API user personas and clearly showing how consumers can interact with each API feature. To promote the whole API and the features you’re releasing, you can make an announcement to your customers and the community. Announcing a release is a good idea because it raises the general public’s awareness and helps users understand what has changed since the last release. Depending on the size of your company, your available marketing budget, and the importance of the release, you choose the media where you make the announcement. You could simply share the news on your blog, or go all the way and promote the new version of your API with a marketing campaign. Your goal is always to reach the existing users of your API and to make the news available to other potential users. Sharing news about your release is a way to increase the reach of your API. Another way is to distribute your API reference in existing API marketplaces that already have their own audience. Online marketplaces let you list your API so potential users can fi nd it and start using it. Th ere are vertical marketplaces that focus on specifi c sectors, such as healthcare or education. Other marketplaces are more generic and let you list any API. Th e elements you make available are usually your API reference, documentation, and pointers on signing up and starting to use the API. You can pick as many marketplaces as you like. Keep in mind that some of the existing solutions charge you for listing your API, so measure each marketplace as a distribution channel. You can measure how many users sign up and use your API across the marketplaces where your API is listed. Over time, you’ll understand which marketplaces aren’t worth keeping, and you can remove your API from those. Th is measurement is part of API analytics, one of the activities of the maintenance stage of the API life cycle. Keep rea ding to learn more about it. Maintenance You’re now in the last stage of the API life cycle. This is the stage where you make sure that your API is continuously running without disturbances. Of all the activities at this stage, the one where you’ll spend the most time will be analyzing how users interact with your API. Analytics is where you understand who your users are, what they’re doing, whether they’re being successful, and if not, how you can help them succeed. The information you gather will help you identify features that you should keep, the ones that you should improve, and the ones that you should shut down. But analytics is not limited to usage. You can also obtain performance, security, and even business metrics. For example, with analytics, you can identify the customers who interact with the top features of your API and understand how much revenue is being generated. That information can tell you whether the investment in those top features is paying off. You can also understand what errors are the most common and which customers are having the most difficulties. Being able to do that allows you to proactively fix problems before users get in touch with your support team. Something to keep in mind is that there will be times when users will have difficulties working with your API. The issues can be related to your API server being slow or not working at all. There can be problems related to connectivity between some users and your API. Alternatively, individual users can have issues that only affect them. All these situations usually lead to customers contacting your support team. Having a support system in place is important because it increases the satisfaction of your users and their trust in your product. Without support, users will feel lost when they have difficulties. Worse, they’ll share their problems publicly without you having a chance to help. One situation where support is particularly requested is when you need to release a new version of your API. Versioning happens whenever you introduce new features, fix existing ones, or deprecate some part of your API. Having a version helps your users know what they should expect when interacting with your API. Versioning also enables you to communicate and identify those changes in different categories. You can have minor bug fixes, new features, or breaking changes. All those can affect how customers use your API, and communicating them is essential to maintaining a good experience. Another aspect of versioning is the ability to keep several versions running. As the API producer, running more than one version can be helpful but can increase your costs. The advantage of having at least two versions is that you can roll back to the previous version if the current one is having issues. This is often considered a good practice. Knowing when to end the life of your entire API or some of its features is a simple task, especially when there are customers using your API regularly. First of all, it’s essential that you have a communication plan so your customers know in advance when your API will stop working. Things to mention in the communication plan include a timeline of the shutdown and any alternative options, if available, even from a competitor of yours. A second aspect to account for is ensuring the API sunset is done according to existing laws and regulations. Other elements include handling the retention of data processed or generated by usage of the API and continuing to monitor accesses to the API even after you shut it down. ConclusionAt this point, you know how to identify the different stages of the API life cycle and how they’re all interconnected. You also understand which stakeholders participate at each stage of the API life cycle. You can describe the most important elements of each stage of the API life cycle and know why they must be considered to build a successful API product. You first learned about my simplified version of the API life cycle and its four stages. You then went into each of them, starting with the design stage. You learned how designing an API can affect its success. You understood the connection between user personas, their attributes, and the architectural type of the API that you’re building. After that, you got to know what high and low-level design validations are and how they can help you reach a product-market fit. You then learned that having a machine-readable definition enables you to document your API but is also a shortcut to implementing its server and infrastructure. Afterward, you learned about contract testing and QA and how they connect to the implementation and release stages. You acquired knowledge about the different release environments and learned how they’re used. You knew about distribution and API marketplaces and how to measure API usage and performance. Finally, you learned how to version and eventually shut down your API. Author BioBruno Pedro is a computer science professional with over 25 years of experience in the industry. Throughout his career, he has worked on a variety of projects, including Internet traffic analysis, API backends and integrations, and Web applications. He has also managed teams of developers and founded several companies, including tarpipe, an iPaaS, in 2008, and the API Changelog in 2015. In addition to his work experience, Bruno has also made contributions to the API industry through his written work, including two published books on API-related topics and numerous technical magazine and web articles. He has also been a speaker at numerous API industry conferences and events from 2013 to 2018.
Read more
  • 1
  • 0
  • 87683

article-image-mastering-threat-detection-with-virustotal-a-guide-for-soc-analysts
Mostafa Yahia
11 Nov 2024
15 min read
Save for later

Mastering Threat Detection with VirusTotal: A Guide for SOC Analysts

Mostafa Yahia
11 Nov 2024
15 min read
This article is an excerpt from the book, "Effective Threat Investigation for SOC Analysts", by Mostafa Yahia. This is a practical guide that enables SOC professionals to analyze the most common security appliance logs that exist in any environment.IntroductionIn today’s cybersecurity landscape, threat detection and investigation are essential for defending against sophisticated attacks. VirusTotal, a powerful Threat Intelligence Platform (TIP), provides security analysts with robust tools to analyze suspicious files, domains, URLs, and IP addresses. Leveraging VirusTotal’s extensive security database and community-driven insights, SOC analysts can efficiently detect potential malware and other cyber threats. This article delves into the ways VirusTotal empowers analysts to investigate suspicious digital artifacts and enhance their organization’s security posture, focusing on critical features such as file analysis, domain reputation checks, and URL scanning.Investigating threats using VirusTotalVirusTotal is a  Threat Intelligence Platform (TIP) that allows security analysts to analyze suspicious files, hashes, domains, IPs, and URLs to detect and investigate malware and other cyber threats. Moreover, VirusTotal is known for its robust automation capabilities, which allow for the automatic sharing of this intelligence with the broader security community. See Figure 14.1:Figure 14.1 – The VirusTotal platform main web pageThe  VirusTotal scans submitted artifacts, such as hashes, domains, URLs, and IPs, against more than 88 security solution signatures and intelligence databases. As a SOC analyst, you should use the VirusTotal platform to investigate the  following:Suspicious filesSuspicious domains and URLsSuspicious outbound IPsInvestigating suspicious filesVirusTotal allows cyber security analysts to analyze suspicious files either by uploading the file or searching for the file hash’s reputation. Either after uploading a fi le or submitting a file hash for analysis, VirusTotal scans it against multiple antivirus signature databases and predefined YARA rules and analyzes the file behavior by using different sandboxes.After the analysis of the submitted file is completed, VirusTotal provides analysts with general information about the analyzed file in five tabs; each tab contains a wealth of information. See Figure 14.2:Figure 14.2 – The details and tabs provided by analyzing a file on VirusTotalAs you see in the preceding figure, aft er submitting the file to the VirusTotal platform for analysis, the file was analyzed against multiple vendors’ antivirus signature databases, Sigma detection rules, IDS detection rules, and several sandboxes for dynamic analysis.The preceding figure is the first page provided by VirusTotal after submitting the file. As you can see, the first section refers to the most common name of the submitted file hash, the file hash, the number of antivirus vendors and sandboxes that flagged the submitted hash as malicious, and tags of the suspicious activities performed by the file when analyzed on the sandboxes, such as the persistence tag, which means that the executable file tried to maintain persistence. See Figure 14.3:Figure 14.3 – The first section of the first page from VirusTotal when analyzing a fileThe first tab of the five tabs provided by the VirusTotal platform that appear is the DETECTION tab. The first parts of the DETECTION tab include the matched Sigma rules, IDS rules, and dynamic analysis results from the sandboxes. See Figure 14.4:Figure 14.4 – The first parts of the DETECTION tabThe Sigma rules are threat detection rules designed to analyze system logs. Sigma was built to allow collaboration between the SOC teams as it allows them to share standardized detection rules regardless of the SIEM in place to detect the various threats by using the event logs. VirusTotal sandboxes store all event logs that are generated during the file detonation, which are later used to test against the list of the collected Sigma rules from different repositories. VirusTotal users will find the list of Sigma rules matching a submitted file in the DETECTION tab. As you can see in the preceding figure, it appears that the executed file has performed certain actions that have been identified by running the Sigma rules against the sandbox logs. Specifically, it disabled the Defender service, created an Auto-Start Extensibility Point (ASEP) entry to maintain persistence, and created another executable.Then as can be  observed, VirusTotal shows that the Intrusion Detection System (IDS) rules successfully detected the presence of Redline info-stealer malware's Command and Control (C&C) communication that matched four IDS rules.Important Note: It is noteworthy that both Sigma and IDS rules are assigned a severity level, and analysts can easily view the matched rule as well as the number of matches.Following the successful matching against IDS rules, you will find the dynamic sandboxes’ detections of the submitted file. In this case, the sandboxes categorized the submitted file/hash as info-stealer malware.Finally, the last part of the DETECTION tab is Security vendors’ analysis. See Figure 14.5:Figure 14.5 – The Security vendors’ analysis sectionAs you see in the preceding figure, the submitted fi le or hash is flagged as malicious by several security vendors and most of them label the given file as a Redline info-stealer malware.The second tab is the DETAILS tab, which includes the Basic properties section on the given file, which includes the file hashes, file type, and file size. That tab also includes times such as file creation, first submission on the platform, last submission on the platform, and last analysis times. Additionally, this tab provides analysts with all the filenames associated with previous submissions of the same file. See Figure 14.6:Figure 14.6 – The first three sections of the DETAILS tabMoreover, the DETAILS tab provides analysts with useful information such as signature verification, enabling identification of whether the file is digitally signed, a key indicator of its authenticity and trustworthiness. Additionally, the tab presents crucial insights into the imported Dynamic Link Libraries (DLLs) and called libraries, allowing analysts to understand the file intents.The third tab is the RELATIONS tab, which includes the IoCs of the analyzed file, such as the domains and IPs that the file is connected with, the files bundled with the executable, and the files dropped by the executable. See Figure 14.7:Figure 14.7 – The RELATIONS tabImportant noteWhen analyzing a malicious file, you can use the connected IPs and domains to scope the infection in your environment by using network security system logs such as the firewall and the proxy logs. However, not all the connected IPs and domains are necessarily malicious and may also be legitimate domains or IPs used by the malware for malicious intents.At the bottom of the RELATIONS tab, VirusTotal provides a great graph that binds the given file and all its relations into one graph, which should facilitate your investigations. To maximize the graph in a new tab, click on it. See Figure 14.8:Figure 14.8 – VT Relations graphThe fourth tab is the BEHAVIOR tab, which contains the detailed sandbox analysis of the submitted file. This report is presented in a structured format and includes the tags, MITRE ATT&CK Tactics and Techniques conducted by the executed file, matched IDS and Sigma rules, dropped files, network activities, and process tree information that was observed during the analysis of the given file. See Figure 14.9:Figure 14.9 – The BEHAVIOR tabRegardless of the matched signatures of security vendors, Sigma rules, and IDS rules, the BEHAVIOR tab allows analysts to examine the file’s actions and behavior to determine whether it is malicious or not. This feature is especially critical in the investigation of zero-day malware, where traditional signature-based detection methods may not be effective, and in-depth behavior analysis is required to identify and respond to potential threats.The fifth tab is the COMMUNITY tab, which allows analysts to contribute to the VirusTotal community with their thoughts and to read community members’ thoughts regarding the given file. See Figure 14.10:Figure 14.10 – The COMMUNITY tabAs you can see, we have two comments from two sandbox vendors indicating that the file is malicious and belongs to the Redline info-stealer family according to its behavior during the dynamic analysis of the file.Investigating suspicious domains and URLsA SOC analyst may depend on the VirusTotal platform to investigate suspicious domains and URLs. You can analyze the suspicious domain or URL on the VirusTotal platform either by entering it into the URL or Search form.During the Investigating suspicious files section, we noticed while navigating the RELATION tab that the file had established communication with the hueref[.]eu domain. In this section, we will investigate the hueref[.]eu domain by using the VirusTotal platform. See Figure 14.11:Figure 14.11 – The DETECTION tabUpon submitting the suspicious domain to the Search form in VirusTotal, it was discovered that the domain had several tags indicating potential security risks. These tags refer to the web domain category. As you can see in the preceding screenshot, there are two tags indicating that the domain is malicious.The first provided tab is the DETECTION tab, which include the Security vendors’ analysis. In this case, several security vendors labeled the domain as Malware or a Malicious domain.The second tab is the DETAILS tab, which includes information about the given domain such as the web domain categories from different sources, the last DNS records of the domain, and the domain Whois lookup results. See Figure 14.12:Figure 14.12 – The DETAILS tabThe third tab is the RELATIONS tab, which provides analysts with all domain relations, such as the DNS resolving the IP(s) of the given domain, along with their reputations, and the files that communicated with the given domain when previously analyzed in the VirusTotal sandboxes, along with their reputations. See Figure 14.13.Figure 14.13 – The RELATIONS tabThe RELATIONS tab is very useful, especially when investigating potential zero-day malicious domains that have not yet been detected and fl agged by security vendors. By analyzing the domain’s resolving IP(s) and their reputation, as well as any connections between the domain and previously analyzed malicious files on the VT platform, SOC analysts can quickly and accurately identify potential threats that potentially indicate a C&C server domain.At the bottom of the RELATIONS tab, you will find the same VirusTotal graph discussed in the previous section.The fourth tab is the COMMUNITY tab, which allows you to contribute to the VirusTotal community with your thoughts and read community members’ thoughts regarding the given domain.Investigating suspicious outbound IPsAs a security analyst, you may depend on the VirusTotal platform to investigate suspicious outbound IPs that your internal systems may have communicated with. By entering the IP into the search form, the VirusTotal platform will show you nearly the same tab details provided when analyzing domains in the last section.In this section, we will investigate the IP of the hueref[.]eu domain. As we mentioned, the tabs and details provided by VirusTotal when analyzing an IP are the same as those provided when analyzing a domain. Moreover, the RELATIONS tab in VirusTotal provides all domains hosted on this IP and their reputations. See Figure 14.14:Figure 14.14 – Domains hosted on the same IP and their reputationsImportant noteIt’s not preferred to depend on the VirusTotal platform to investigate suspicious inbound IPs such as port-scanning IPs and vulnerability-scanning IPs. This is due to the fact that VirusTotal relies on the reputation assessments provided by security vendors, which are particularly effective in detecting outbound IPs such as those associated with C&C servers or phishing activities.By the end of this section, you should have learned how to investigate suspicious files, domains, and outbound IPs by using the VirusTotal platform.ConclusionIn conclusion, VirusTotal is an invaluable resource for SOC analysts, enabling them to streamline threat investigations by analyzing artifacts through multiple detection engines and sandbox environments. From identifying malicious file behavior to assessing suspicious domains and URLs, VirusTotal’s capabilities offer comprehensive insights into potential threats. By integrating this tool into daily workflows, security professionals can make data-driven decisions that enhance response times and threat mitigation strategies. Ultimately, VirusTotal not only assists in pinpointing immediate risks but also contributes to a collaborative, community-driven approach to cybersecurity.Author BioMostafa Yahia is a passionate threat investigator and hunter who hunted and investigated several cyber incidents. His experience includes building and leading cyber security managed services such as SOC and threat hunting services. He earned a bachelor's degree in computer science in 2016. Additionally, Mostafa has the following certifications: GCFA, GCIH, CCNA, IBM Qradar, and FireEye System engineer. Mostafa also provides free courses and lessons through his Youtube channel. Currently, he is the cyber defense services senior leader for SOC, Threat hunting, DFIR, and Compromise assessment services in an MSSP company.
Read more
  • 0
  • 0
  • 87002
article-image-creating-and-using-kibana-dashboards
Huage Chen
02 Jul 2024
12 min read
Save for later

Creating and Using Kibana Dashboards

Huage Chen
02 Jul 2024
12 min read
This article is an excerpt from the book, Elastic Stack 8.x Cookbook, by Huage Chen and Yazid Akadiri. Unlock the full potential of Elastic Stack for search, analytics, security, and observability and manage substantial data workloads in both on-premise and cloud environmentsIntroductionIn this guide, we will integrate all previously created visualizations into a comprehensive dashboard consisting of multiple panels. Additionally, we will explore how to enhance user interaction using control-based drilldowns.Getting readyMake sure to complete the following recipes from this chapter:Creating visualizations with Kibana LensCreating visualizations from runtime fieldsCreating Kibana mapsAt the end of this recipe, you will have dashboards composed of the various visualizations and elements built into the aforementioned recipes.How to do it...Building dashboards is very straightforward in Kibana, especially if you’ve already created some visualizations. Follow these steps:1. Go to Kibana | Analytics | Dashboard and click on Create dashboard.This will bring you to a blank canvas, where you can start adding some visualizations.2. We will start by adding a nice image! You can be creative, but we provided a sample picture:A. Click on Add panel | Image. B. Select the Use link tab and set Link to image with the following URL: https://upload. wikimedia.org/wikipedia/commons/6/60/Ville_de_RENNES_Noir. svg. Then, click on Save:Figure 6.54 – Adding an image for a logoThe logo will be added to the panel. Including a picture is a great way to add some personalization and branding to your dashboards. Let’s add some proper visualizations from the ones we’ve built in the last three recipes.3. Click on Add from library and select the [Rennes Traffic] Number of locations visualization. Make sure to align it to the right with the image panel.4. Let’s add another visualization; this time, we’ll pick [Rennes Traffic] Average speed gauge.At this stage, your dashboard should look like the one shown in Figure 6.55:Figure 6.55 – Rennes traffic dashboard – first stepYou can easily rearrange the position of the different panels by clicking on the title section and moving the panel with your mouse anywhere you want on the canvas. To adjust the size and fit of the panel, position your mouse on the small arrow at the bottom right of the panel. Let’s keep adding more panels to our dashboard.5. Click on Add from library and add the following visualizations in the respective order:I. [Rennes Traffic] Traffic status waffleII. [Rennes Traffic] Speed by road hierarchyIII. [Rennes Traffic] Average speed & Traffic StatusIV. [Rennes Traffic] Traffic status by hour6. Finally, let’s add a Map visualization for a real-time view of the traffic; select the one named [Rennes Traffic] Traffic fluidity.By now, your dashboard should look like the one shown in Figure 6.56:Figure 6.56 – Rennes traffic dashboard – more visualizationsYou can start playing around with the dashboard to see the built-in interactivity of the panels. For example, clicking on a specific road hierarchy will automatically apply the filter to the entire dashboard.You can also have dedicated panels to filter and display only the data you are interested in with Controls. Let’s add some to our dashboard.7. On the dashboard toolbar, click on Controls:Figure 6.57 – Adding controls to the dashboard8. From the drop-down list, select Add control; the Create control flyout will appear on the right of the screen.9. Select the traffic_status field and click on Save and close.10. Back to the dashboard, you now have a new panel on top of the visualization named traffic_status. By clicking on it, you will see a drop-down list where you can select the values associated with the status of the traffic you want to filter, as shown in Figure 6.58. Select congested as an example:Figure 6.58 – Using controls in the dashboard11. You can see on your dashboard that all the panels have been updated according to the value selected in the traffic_status control.Imagine you want to filter your traffic data to analyze it within a specific time range, such as early in the morning or late in the afternoon, to better understand traffic patterns. This is where the time slider control proves to be incredibly useful.12. Go to the Controls menu again in the dashboard toolbar and select Add time slider control.You’ll see a new panel to the right of traffic_status:Figure 6.59 – Time slider controlBy clicking the play icon, you will see your dashboard animate and your data change over the defined time range. You can advance the time range forward as well as backward, which is especially useful when working with time series data.Your dashboard should now look as shown in Figure 6.60, with our two controls:                                                                                     Figure 6.60 – Rennes traffic dashboard with controls13. Save the dashboard by clicking the Save button in the upper-right corner. Name it [Rennes Traffic] Overview.To enhance our dashboard further, consider this: users frequently manage multiple dashboards, and the ability to navigate seamlessly from one to another is crucial, especially when aiming to refine analysis or focus on more detailed panels related to a specific dataset. Dashboard drilldowns are invaluable in this scenario as they allow you to transition between dashboards while maintaining the overall context. Let’s explore how to implement and use this feature effectively!For this exercise, we have already built a drilldown dashboard. Download and save the NDJSON file of the exported dashboard from the following location: https://github.com/PacktPublishing/ Elastic-Stack-8.x-Cookbook/blob/main/Chapter6/kibana-objects/rennesdata-drilldown-dashboard.ndjson. Then, follow these steps:1. To import the dashboard, go to Stack Management | Saved Objects.2. Click on Import and select the NDJSON file you have previously downloaded from the GitHub repository. Upon completing the import process, you will notice a warning in the flyout about data view conflicts. The reason is straightforward: our saved objects rely on an existing data view. To resolve the conflict, simply click on the drop-down list under the New data view column and select metrics-rennes_traffic-raw, as shown in Figure 6.61, then click on Confirm all changes to finalize the import procedure:Figure 6.61 – Importing saved objects and selecting the right data view3. Once all the objects have been imported, you will get a recap as shown in the following screenshot:                                                                                      Figure 6.62 – Saved objects successfully imported from the fileReturn to the [Rennes Traffic] Overview dashboard. Then, open the menu for the [Rennes Traffic] Speed by road hierarchie panel and select Create drilldown:Figure 6.63 – Creating drilldown from the panel4. Navigate to the drilldowns page and select the Go to Dashboard option. Here, you will need to name your drilldown—consider View Details for Road Hierarchy as a suggestion. Then, from the Choose destination dashboard drop-down menu, select [Rennes Traffic] Detailed traffic drilldown dashboard, which you have recently imported. This process sets up a targeted navigation path within your dashboard environment, allowing for a seamless transition between your overview and detailed analysis dashboards:Figure 6.64 – Configuring dashboard drilldown5. Click on Create drilldown. Save the dashboard to test our drilldown, click on one of the five charts in the [Rennes Traffic] Speed by road hierarchie panel. You will be redirected to the detailed dashboard filtered on the value you have selected.Figure 6.65 – Dashboard view after drilldownEt voilà! You have just built your first dashboard with a nice touch of interactivity thanks to controls and drilldowns.How it works...In Kibana, a dashboard is a collection of visualizations and saved searches that you can arrange and customize to display the data that is most important to you. You can create multiple dashboards for different use cases, and each dashboard can have its own set of visualizations and searches.Dashboards are a powerful tool for data analysis because they allow you to see multiple visualizations side by side and quickly identify patterns and trends in your data. You can also use dashboards to monitor key metrics in real time, which is especially useful for operational use cases. Kibana provides a wide range of visualization types that you can use to create custom dashboards, including bar charts, line charts, pie charts, tables, and more.The following table outlines a framework for choosing the right visualization:Use caseRecommended type of visualizationComparison and correlationMany items: Horizontal barFew items: Vertical barComparison over timeFew periods and categories: Stacked barFew time periods but many categories: Line graphDistribution of valuesFew numbers of points: Vertical bar histogramMany points: Line histogramComposition of a wholeSimple compositions with few items: Waffle or TreemapMultiple grouping dimensions for a few bottomlevel items: MosaicMultiple grouping dimensions for many bottomlevel items: TreemapEye-catching summaryOne value: MetricMany values: Table with color stylingVisualizing goals or targetsVertical bar or Line with reference linesMetricTable 6.2 – Choosing the right visualizationIn addition to visualizations, Kibana dashboards also support saved searches, which allow you to quickly filter your data based on specific criteria. You can save searches that you use frequently and add them to your dashboard for easy access.Overall, Kibana dashboards are a powerful tool for data analysis and monitoring. They allow you to quickly identify patterns and trends in your data, monitor key metrics in real time, and customize your view of the data to suit your needs.There’s more...In our recipe, we have used dashboard drilldowns, but you can also create URL and Discover drilldowns. With the former, you can link to data outside of Kibana, and with the latter, you can open Discover from a Lens panel while keeping all the contextual information.Dashboards are great when used in Kibana, but you can also share them with teams and colleagues outside of Kibana. You have many options that are easily accessible from the Share menu in the toolbar when it comes to sharing dashboards: you can interactively embed dashboards as an iFrame, export them as reports in various formats (PNG, CSV, PDF, etc.), and share them as direct links for easy access.When building dashboards, design thinking is a good practice. Start by asking yourself the following questions:What is the outcome or the goal of the dashboard? Is it about understanding high-level behaviors, visually correlating specific metrics at the same time, or finding the root cause of an issue?Who is using this dashboard to do their job? If you are building it for a team or someone else, step into their shoes to visualize their perspective when they will need that data.See alsoLooking for more design tips to elevate your dashboards? Look no further and check out this blog: https://www.elastic.co/blog/designing-intuitive-kibanadashboards-as-a-non-designerIf you’re interested in delving deeper into the topics of creating dashboards more efficiently, be sure to check out this technical blog: https://www.elastic.co/blog/buildingkibana-dashboards-more-efficientlyFor developers interested in debugging their Kibana dashboard, the following article will be very useful: https://www.elastic.co/blog/debugging-kibana-dashboardsConclusionIn this guide, we've explored the process of integrating various visualizations into a comprehensive Kibana dashboard, enhancing user interaction through control-based drilldowns. By following the steps outlined, you should now have a functional and interactive dashboard that can provide valuable insights into your data.We began by preparing the necessary visualizations and then moved on to assembling the dashboard by adding images for personalization and aligning various traffic visualizations. We also incorporated control panels for dynamic filtering, allowing for more precise data analysis. The final touch was adding drilldowns to enable seamless navigation between detailed and overview dashboards.Kibana dashboards offer powerful tools for data analysis and real-time monitoring. By displaying multiple visualizations side by side, you can quickly identify patterns and trends, making dashboards invaluable for operational and analytical use cases.Remember, the key to a successful dashboard is thoughtful design—consider the goals, the audience, and the specific data insights needed. Utilize the wide range of visualization types that Kibana offers and don't hesitate to leverage the sharing options to collaborate with your team effectively.For further reading and advanced tips on designing intuitive dashboards, building them efficiently, or debugging, check out the additional resources provided. Happy dashboarding!Author BioHuage Chen is a member of Elastic's customer engineering team and has been with Elastic for over five years, helping users throughout Europe to innovate and implement cloud-based solutions for search, data analysis, observability, and security. Before joining Elastic, he worked for 10 years in web content management, web portals, and digital experience platforms.Yazid Akadiri has been a solutions architect at Elastic for over four years, helping organizations and users solve their data and most critical business issues by harnessing the power of the Elastic Stack. At Elastic, he works with a broad range of customers, with a particular focus on Elastic observability and security solutions. He previously worked in web services-oriented architecture, focusing on API management and helping organizations build modern applications.
Read more
  • 0
  • 0
  • 86916

article-image-how-does-elasticsearch-work-tutorial
Savia Lobo
30 Jul 2018
12 min read
Save for later

How does Elasticsearch work? [Tutorial]

Savia Lobo
30 Jul 2018
12 min read
Elasticsearch is much more than just a search engine; it supports complex aggregations, geo filters, and the list goes on. Best of all, you can run all your queries at a speed you have never seen before.  Elasticsearch, like any other open source technology, is very rapidly evolving, but the core fundamentals that power Elasticsearch don't change. In this article, we will briefly discuss how Elasticsearch works internally and explain the basic query APIs.  All the data in Elasticsearch is internally stored in  Apache Lucene as an inverted index. Although data is stored in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs. This Elasticsearch tutorial is an excerpt taken from the book,'Learning Elasticsearch' written by Abhishek Andhavarapu. Inverted index in Elasticsearch Inverted index will help you understand the limitations and strengths of Elasticsearch compared with the traditional database systems out there. Inverted index at the core is how Elasticsearch is different from other NoSQL stores, such as MongoDB, Cassandra, and so on. We can compare an inverted index to an old library catalog card system. When you need some information/book in a library, you will use the card catalog, usually at the entrance of the library, to find the book. An inverted index is similar to the card catalog. Imagine that you were to build a system like Google to search for the web pages mentioning your search keywords. We have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents with the word fear. Document1: Fear leads to anger Document2: Anger leads to hate Document3: Hate leads to suffering In a library, without a card catalog to find the book you need, you would have to go to every shelf row by row, look at each book title, and see whether it's the book you need. Computer-based information retrieval systems do the same. Without the inverted index, the application has to go through each web page and check whether the word exists in the web page. An inverted index is similar to the following table. It is like a map with the term as a key and list of the documents the term appears in as value. Term Document Fear 1 Anger 1,2 Hate 2,3 Suffering 3 Leads 1,2,3 Once we construct an index, as shown in this table, to find all the documents with the term fear is now just a lookup. Just like when a library gets a new book, the book is added to the card catalog, we keep building an inverted index as we encounter a new web page. The preceding inverted index takes care of simple use cases, such as searching for the single term. But in reality, we query for much more complicated things, and we don't use the exact words. Now let's say we encountered a document containing the following: Yosemite national park may be closed for the weekend due to forecast of substantial rainfall We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we query for it in the human language, we might query something like weather in yosemite or rain in yosemite. With the current approach, we will not be able to answer this query as there are no common terms between the query and the document, as shown: Document Query rainfall rain To be able to answer queries like this and to improve the search quality, we employ various techniques such as stemming, synonyms discussed in the following sections. Stemming Stemming is the process of reducing a derived word into its root word. For example, rain, raining, rained, rainfall has the common root word "rain". When a document is indexed, the root word is stored in the index instead of the actual word. Without stemming, we end up storing rain, raining, rained in the index, and search relevance would be very low. The query terms also go through the stemming process, and the root words are looked up in the index. Stemming increases the likelihood of the user finding what he is looking for. When we query for rain in yosemite, even though the document originally had rainfall, the inverted index will contain term rain. We can configure stemming in Elasticsearch using Analyzers. Synonyms Similar to rain and raining, weekend and sunday mean the same thing. The document might not contain Sunday, but if the information retrieval system can also search for synonyms, it will significantly improve the search quality. Human language deals with a lot of things, such as tense, gender, numbers. Stemming and synonyms will not only improve the search quality but also reduce the index size by removing the differences between similar words. More examples: Pen, Pen[s] -> Pen Eat, Eating  -> Eat Phrase search As a user, we almost always search for phrases rather than single words. The inverted index in the previous section would work great for individual terms but not for phrases. Continuing the previous example, if we want to query all the documents with a phrase anger leads to in the inverted index, the previous index would not be sufficient. The inverted index for terms anger and leads is shown below: Term Document Anger 1,2 Leads 1,2,3 From the preceding table, the words anger and leads exist both in document1 and document2. To support phrase search along with the document, we also need to record the position of the word in the document. The inverted index with word position is shown here: Term Document Fear 1:1 Anger 1:3, 2:1 Hate 2:3, 3:1 Suffering 3:3 Leads 1:2, 2:2, 3:2 Now, since we have the information regarding the position of the word, we can search if a document has the terms in the same order as the query. Term Document anger 1:3, 2:1 leads 1:2, 2:2 Since document2 has anger as the first word and leads as the second word, the same order as the query, document2 would be a better match than document1. With the inverted index, any query on the documents is just a simple lookup. This is just an introduction to inverted index; in real life, it's much more complicated, but the fundamentals remain the same. When the documents are indexed into Elasticsearch, documents are processed into the inverted index. Scalability and availability in Elasticsearch Let's say you want to index a billion documents; having just a single machine might be very challenging. Partitioning data across multiple machines allows Elasticsearch to scale beyond what a single machine do and support high throughput operations. Your data is split into small parts called shards. When you create an index, you need to tell Elasticsearch the number of shards you want for the index and Elasticsearch handles the rest for you. As you have more data, you can scale horizontally by adding more machines. We will go in to more details in the sections below. There are type of shards in Elasticsearch - primary and replica. The data you index is written to both primary and replica shards. Replica is the exact copy of the primary. In case of the node containing the primary shard goes down, the replica takes over. This process is completely transparent and managed by Elasticsearch. We will discuss this in detail in the Failure Handling section below. Since primary and replicas are the exact copies, a search query can be answered by either the primary or the replica shard. This significantly increases the number of simultaneous requests Elasticsearch can handle at any point in time. As the index is distributed across multiple shards, a query against an index is executed in parallel across all the shards. The results from each shard are then gathered and sent back to the client. Executing the query in parallel greatly improves the search performance. Now, we will discuss the relation between node, index and shard. Relation between node, index, and shard Shard is often the most confusing topic when I talk about Elasticsearch at conferences or to someone who has never worked on Elasticsearch. In this section, I want to focus on the relation between node, index, and shard. We will use a cluster with three nodes and create the same index with multiple shard configuration, and we will talk through the differences. Three shards with zero replicas We will start with an index called esintroduction with three shards and zero replicas. The distribution of the shards in a three node cluster is as follows: In the above screenshot, shards are represented by the green squares. We will talk about replicas towards the end of this discussion. Since we have three nodes(servers) and three shards, the shards are evenly distributed across all three nodes. Each node will contain one shard. As you index your documents into the esintroduction index, data is spread across the three shards. Six shards with zero replicas Now, let's recreate the same esintroduction index with six shards and zero replicas. Since we have three nodes (servers) and six shards, each node will now contain two shards. The esintroduction index is split between six shards across three nodes. The distribution of shards for an index with six shards is as follows: The esintroduction index is spread across three nodes, meaning these three nodes will handle the index/query requests for the index. If these three nodes are not able to keep up with the indexing/search load, we can scale the esintroduction index by adding more nodes. Since the index has six shards, you could add three more nodes, and Elasticsearch automatically rearranges the shards across all six nodes. Now, index/query requests for the esintroduction index will be handled by six nodes instead of three nodes. If this is not clear, do not worry, we will discuss more about this as we progress in the book. Six shards with one replica Let's now recreate the same esintroduction index with six shards and one replica, meaning the index will have 6 primary shards and 6 replica shards, a total of 12 shards. Since we have three nodes (servers) and twelve shards, each node will now contain four shards. The esintroduction index is split between six shards across three nodes. The green squares represent shards in the following figure. The solid border represents primary shards, and replicas are the dotted squares: As we discussed before, the index is distributed into multiple shards across multiple nodes. In a distributed environment, a node/server can go down due to various reasons, such as disk failure, network issue, and so on. To ensure availability, each shard, by default, is replicated to a node other than where the primary shard exists. If the node containing the primary shard goes down, the shard replica is promoted to primary, and the data is not lost, and you can continue to operate on the index. In the preceding figure, the esintroduction index has six shards split across the three nodes. The primary of shard 2 belongs to node elasticsearch 1, and the replica of the shard 2 belongs to node elasticsearch 3. In the case of the elasticsearch 1 node going down, the replica in elasticsearch 3 is promoted to primary. This switch is completely transparent and handled by Elasticsearch. Distributed search One of the reasons queries executed on Elasticsearch are so fast is because they are distributed. Multiple shards act as one index. A search query on an index is executed in parallel across all the shards. Let's take an example: in the following figure, we have a cluster with two nodes: Node1, Node2 and an index named chapter1 with two shards: S0, S1 with one replica: Assuming the chapter1 index has 100 documents, S1 would have 50 documents, and S0 would have 50 documents. And you want to query for all the documents that contain the word Elasticsearch. The query is executed on S0 and S1 in parallel. The results are gathered back from both the shards and sent back to the client. Imagine, you have to query across million of documents, using Elasticsearch the search can be distributed. For the application I'm currently working on, a query on more than 100 million documents comes back within 50 milliseconds; which is simply not possible if the search is not distributed. Failure handling in Elasticsearch Elasticsearch handles failures automatically. This section describes how the failures are handled internally. Let's say we have an index with two shards and one replica. In the following diagram, the shards represented in solid line are primary shards, and the shards in the dotted line are replicas: As shown in preceding diagram, we initially have a cluster with two nodes. Since the index has two shards and one replica, shards are distributed across the two nodes. To ensure availability, primary and replica shards never exist in the same node. If the node containing both primary and replica shards goes down, the data cannot be recovered. In the preceding diagram, you can see that the primary shard S0 belongs to Node 1 and the replica shard S0 to the Node 2. Next, just like we discussed in the Relation between Node, Index and Shard section, we will add two new nodes to the existing cluster, as shown here: The cluster now contains four nodes, and the shards are automatically allocated to the new nodes. Each node in the cluster will now contain either a primary or replica shard. Now, let's say Node2, which contains the primary shard S1, goes down as shown here: Since the node that holds the primary shard went down, the replica of S1, which lives in Node3, is promoted to primary. To ensure the replication factor of 1, a copy of the shard S1 is made on Node1. This process is known as rebalancing of the cluster. Depending on the application, the number of shards can be configured while creating the index. The process of rebalancing the shards to other nodes is entirely transparent to the user and handled automatically by Elasticsearch. We discussed inverted indexes, relation between nodes, index and shard, distributed search and how failures are handled automatically in Elasticsearch. Check out this book, 'Learning Elasticsearch' to know about handling document relationships, working with geospatial data, and much more. How to install Elasticsearch in Ubuntu and Windows Working with Kibana in Elasticsearch 5.x CRUD (Create Read, Update and Delete) Operations with Elasticsearch
Read more
  • 0
  • 2
  • 86390

article-image-which-python-framework-is-best-for-building-restful-apis-django-or-flask
Vincy Davis
07 May 2019
9 min read
Save for later

Which Python framework is best for building RESTful APIs? Django or Flask?

Vincy Davis
07 May 2019
9 min read
Python is one of the top-rated programming languages. It's also known for its less-complex syntax, and its high-level, object-oriented, robust, and general-purpose programming. Python is the top choice for any first-time programmer. Since its release in 1991, Python has evolved and powered by several frameworks for web application development, scientific and mathematical computing, and graphical user interfaces to the latest REST API frameworks. This article is an excerpt taken from the book, 'Hands-On RESTful API Design Patterns and Best Practices' written by Harihara Subramanian and Pethura Raj. This book covers design strategy, essential and advanced Restful API Patterns, Legacy Modernization to Microservices centric apps. In this article, we'll explore two comprehensive frameworks, Django and Flask, so that you can choose the best one for developing your RESTful API. Django Django is a web framework also available as open source with the BSD license, designed to help developers create their web app very quickly as it takes care of additional web-development needs. It includes several packages (also known as applications) to handle typical web-development tasks, such as authentication, content administration, scaffolding, templates, caching, and syndication. Let's use the Django REST Framework (DRF) built with Python, and use it for REST API development and deployment. Django Rest Framework DRF is an open source, well-matured Python and Django library intended to help APP developers build sophisticated web APIs. DRF's modular, flexible, and customizable architecture makes the development of both simple, turnkey API endpoints and complicated REST constructs possible. The goal of DRF is to divide a model, generalize the wire representation, such as JSON or XML, and customize a set of class-based views to satisfy the specific API endpoint using a serializer that describes the mapping between views and API endpoints. Core features Django has many distinct features including: Web-browsable API This feature enhances the REST API developed with DRF. It has a rich interface, and the web-browsable API supports multiple media types too. The browsable API does mean that the APIs we build will be self-describing and the API endpoints that we create as part of the REST services and return JSON or HTML representations. The interesting fact about the web-browsable API is that we can interact with it fully through the browser, and any endpoint that we interact with using a programmatic client will also be capable of responding with a browser-friendly view onto the web-browsable API. Authentication One of the main attractive features of Django is authentication; it supports broad categories of authentication schemes, from basic authentication, token authentication, session authentication, remote user authentication, to OAuth Authentication. It also supports custom authentication schemes if we wish to implement one. DRF runs the authentication scheme at the start of the view, that is, before any other code is allowed to proceed. DRF determines the privileges of the incoming request from the permission and throttling policies and then decides whether the incoming request can be allowed or disallowed with the matched credentials. Serialization and deserialization Serialization is the process of converting complex data, such as querysets and model instances, into native Python datatypes. Converting facilitates the rendering of native data types, such as JSON or XML. DRF supports serialization through serializers classes. The serializers of DRF are similar to Django's Form and ModelForm classes. It provides a serializer class, which helps to control the output of responses. The DRF ModelSerializer classes provide a simple mechanism with which we can create serializers that deal with model instances and querysets. Serializers also do deserialization, that is, serializers allow parsed data that needs to be converted back into complex types. Also, deserialization happens only after validating the incoming data. Other noteworthy features Here are some other noteworthy features of the DRF: Routers: The DRF supports automatic URL routing to Django and provides a consistent and straightforward way to wire the view logic to a set of URLs Class-based views: A dominant pattern that enables the reusability of common functionalities Hyperlinking APIs: The DRF supports various styles (using primary keys, hyperlinking between entities, and so on) to represent the relationship between entities Generic views: Allows us to build API views that map to the database models DRF has many other features such as caching, throttling, testing, etc. Benefits of the DRF Here are some of the benefits of the DRF: Web-browsable API Authentication policies Powerful serialization Extensive documentation and excellent community support Simple yet powerful Test coverage of source code Secure and scalable Customizable Drawbacks of the DRF Here are some facts that may disappoint some Python app developers who intend to use the DRF: Monolithic and components get deployed together Based on Django ORM Steep learning curve Slow response time Flask Flask is a microframework for Python developers based on Werkzeug (WSGI toolkit) and Jinja 2 (template engine). It comes under BSD licensing. Flask is very easy to set up and simple to use. Like other frameworks, it comes with several out-of-the-box capabilities, such as a built-in development server, debugger, unit test support, templating, secure cookies, and RESTful request dispatching. The powerful Flask  RESTful API framework is discussed below. Flask-RESTful Flask-RESTful is an extension for Flask that provides additional support for building REST APIs. You will never be disappointed with the time it takes to develop an API. Flask-Restful is a lightweight abstraction that works with the existing ORM/libraries. Flask-RESTful encourages best practices with minimal setup. Core features of Flask-RESTful Flask-RESTful comes with several built-in features. Django and Flask have many common RESTful frameworks, because they have almost the same supporting core features. The unique RESTful features of Flask is mentioned below. Resourceful routing The design goal of Flask-RESTful is to provide resources built on top of Flask pluggable views. The pluggable views provide a simple way to access the HTTP methods. Consider the following example code: class Todo(Resource): def get(self, user_id): .... def delete(self, user_id): .... def put(self, user_id): args = parser.parse_args() .... Restful request parsing Request parsing refers to an interface, modeled after the Python parser interface for command-line arguments, called argparser. The RESTful request parser is designed to provide uniform and straightforward access to any variable that comes within the (flask.request) request object. Output fields In most cases, app developers prefer to control rendering response data, and Flask-RESTful provides a mechanism where you can use ORM models or even custom classes as an object to render. Another interesting fact about this framework is that app developers don't need to worry about exposing any internal data structures as its let one format and filter the response objects. So, when we look at the code, it'll be evident which data would go for rendering and how it'll be formatted. Other noteworthy features Here are some other noteworthy features of Flask-RESTful: API: This is the main entry point for the restful API, which we'll initialize with the Flask application. ReqParse: This enables us to add and parse multiple arguments in the context of the single request. Input: A useful functionality, it parses the input string and returns true or false depending on the Input. If the input is from the JSON body,  the type is already native Boolean and passed through without further parsing. Benefits of the Flask framework Here are some of the benefits of Flask framework: Built-in development server and debugger Out-of-the-box RESTful request dispatching Support for secure cookies Integrated unit-test support Lightweight Very minimal setup Faster (performance) Easy NoSQL integration Extensive documentation Drawbacks of Flask Here are some of Flask and Flask-RESTful's disadvantages: Version management (managed by developers) No brownie points as it doesn't have browsable APIs May incur a steep learning curve Frameworks – a table of reference The following table provides a quick reference of a few other prominent micro-frameworks, their features, and supported programming languages: Language Framework Short description Prominent features Java Blade Fast and elegant MVC framework for Java8 Lightweight High performance Based on the MVC pattern RESTful-style router interface Built-in security Java/Scala Play Framework High-velocity Reactive web framework for Java and Scala Lightweight, stateless, and web-friendly architecture Built on Akka Supports predictable and minimal resource-consumption for highly-scalable applications Developer-friendly Java Ninja Web Framework Full-stack web framework Fast Developer-friendly Rapid prototyping Plain vanilla Java, dependency injection, first-class IDE integration Simple and fast to test (mocked tests/integration tests) Excellent build and CI support Clean codebase – easy to extend Java RESTEASY JBoss-based implementation that integrates several frameworks to help to build RESTful Web and Java applications Fast and reliable Large community Enterprise-ready Security support Java RESTLET A lightweight and comprehensive framework based on Java, suitable for both server and client applications. Lightweight Large community Native REST support Connectors set JavaScript Express.js Minimal and flexible Node.js-based JavaScript framework for mobile and web applications HTTP utility methods Security updates Templating engine PHP Laravel An open source web-app builder based on PHP and the MVC architecture pattern Intuitive interface Blade template engine Eloquent ORM as default Elixir Phoenix (Elixir) Powered with the Elixir functional language, a reliable and faster micro-framework MVC-based High application performance Erlong virtual machine enables better use of resources Python Pyramid Python-based micro-framework Lightweight Function decorators Events and subscribers support Easy implementations and high productivity Summary It's evident that Python has two excellent frameworks. Depending on the choice of programming language you are intending to use and the required features, you can choose your type of framework to work on. If you are interested in learning more about the design strategy, guidelines and best practices of Restful API Patterns, you can refer to our book 'Hands-On RESTful API Design Patterns and Best Practices' here. Stack Overflow survey data further confirms Python’s popularity as it moves above Java in the most used programming language list. Svelte 3 releases with reactivity through language instead of an API Microsoft introduces Pyright, a static type checker for the Python language written in TypeScript
Read more
  • 0
  • 0
  • 85386
article-image-image-filtering-techniques-opencv
Vijin Boricha
12 Apr 2018
15 min read
Save for later

Image filtering techniques in OpenCV

Vijin Boricha
12 Apr 2018
15 min read
In the world of computer vision, image filtering is used to modify images. These modifications essentially allow you to clarify an image in order to get the information you want. This could involve anything from extracting edges from an image, blurring it, or removing unwanted objects.  There are, of course, lots of reasons why you might want to use image filtering to modify an image. For example, taking a picture in sunlight or darkness will impact an images clarity - you can use image filters to modify the image to get what you want from it. Similarly, you might have a blurred or 'noisy' image that needs clarification and focus. Let's use an example to see how to do image filtering in OpenCV. This image filtering tutorial is an extract from Practical Computer Vision. Here's an example with considerable salt and pepper noise. This occurs when there is a disturbance in the quality of the signal that's used to generate the image. The image above can be easily generated using OpenCV as follows: # initialize noise image with zeros noise = np.zeros((400, 600)) # fill the image with random numbers in given range cv2.randu(noise, 0, 256) Let's add weighted noise to a grayscale image (on the left) so the resulting image will look like the one on the right: The code for this is as follows: # add noise to existing image noisy_gray = gray + np.array(0.2*noise, dtype=np.int) Here, 0.2 is used as parameter, increase or decrease the value to create different intensity noise. In several applications, noise plays an important role in improving a system's capabilities. This is particularly true when you're using deep learning models. The noise becomes a way of testing the precision of the deep learning application, and building it into the computer vision algorithm. Linear image filtering The simplest filter is a point operator. Each pixel value is multiplied by a scalar value. This operation can be written as follows: Here: The input image is F and the value of pixel at (i,j) is denoted as f(i,j) The output image is G and the value of pixel at (i,j) is denoted as g(i,j) K is scalar constant This type of operation on an image is what is known as a linear filter. In addition to multiplication by a scalar value, each pixel can also be increased or decreased by a constant value. So overall point operation can be written like this: This operation can be applied both to grayscale images and RGB images. For RGB images, each channel will be modified with this operation separately. The following is the result of varying both K and L. The first image is input on the left. In the second image, K=0.5 and L=0.0, while in the third image, K is set to 1.0 and L is 10. For the final image on the right, K=0.7 and L=25. As you can see, varying K changes the brightness of the image and varying L changes the contrast of the image: This image can be generated with the following code: import numpy as np import matplotlib.pyplot as plt import cv2 def point_operation(img, K, L): """ Applies point operation to given grayscale image """ img = np.asarray(img, dtype=np.float) img = img*K + L # clip pixel values img[img > 255] = 255 img[img < 0] = 0 return np.asarray(img, dtype = np.int) def main(): # read an image img = cv2.imread('../figures/flower.png') gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # k = 0.5, l = 0 out1 = point_operation(gray, 0.5, 0) # k = 1., l = 10 out2 = point_operation(gray, 1., 10) # k = 0.8, l = 15 out3 = point_operation(gray, 0.7, 25) res = np.hstack([gray,out1, out2, out3]) plt.imshow(res, cmap='gray') plt.axis('off') plt.show() if __name__ == '__main__': main() 2D linear image filtering While the preceding filter is a point-based filter, image pixels have information around the pixel as well. In the previous image of the flower, the pixel values in the petal are all yellow. If we choose a pixel of the petal and move around, the values will be quite close. This gives some more information about the image. To extract this information in filtering, there are several neighborhood filters. In neighborhood filters, there is a kernel matrix which captures local region information around a pixel. To explain these filters, let's start with an input image, as follows: This is a simple binary image of the number 2. To get certain information from this image, we can directly use all the pixel values. But instead, to simplify, we can apply filters on this. We define a matrix smaller than the given image which operates in the neighborhood of a target pixel. This matrix is termed kernel; an example is given as follows: The operation is defined first by superimposing the kernel matrix on the original image, then taking the product of the corresponding pixels and returning a summation of all the products. In the following figure, the lower 3 x 3 area in the original image is superimposed with the given kernel matrix and the corresponding pixel values from the kernel and image are multiplied. The resulting image is shown on the right and is the summation of all the previous pixel products: This operation is repeated by sliding the kernel along image rows and then image columns. This can be implemented as in following code. We will see the effects of applying this on an image in coming sections. # design a kernel matrix, here is uniform 5x5 kernel = np.ones((5,5),np.float32)/25 # apply on the input image, here grayscale input dst = cv2.filter2D(gray,-1,kernel) However, as you can see previously, the corner pixel will have a drastic impact and results in a smaller image because the kernel, while overlapping, will be outside the image region. This causes a black region, or holes, along with the boundary of an image. To rectify this, there are some common techniques used: Padding the corners with constant values maybe 0 or 255, by default OpenCV will use this. Mirroring the pixel along the edge to the external area Creating a pattern of pixels around the image The choice of these will depend on the task at hand. In common cases, padding will be able to generate satisfactory results. The effect of the kernel is most crucial as changing these values changes the output significantly. We will first see simple kernel-based filters and also see their effects on the output when changing the size. Box filtering This filter averages out the pixel value as the kernel matrix is denoted as follows: Applying this filter results in blurring the image. The results are as shown as follows: In frequency domain analysis of the image, this filter is a low pass filter. The frequency domain analysis is done using Fourier transformation of the image, which is beyond the scope of this introduction. We can see on changing the kernel size, the image gets more and more blurred: As we increase the size of the kernel, you can see that the resulting image gets more blurred. This is due to averaging out of peak values in small neighbourhood where the kernel is applied. The result for applying kernel of size 20x20 can be seen in the following image. However, if we use a very small filter of size (3,3) there is negligible effect on the output, due to the fact that the kernel size is quite small compared to the photo size. In most applications, kernel size is heuristically set according to image size: The complete code to generate box filtered photos is as follows: def plot_cv_img(input_image, output_image): """ Converts an image from BGR to RGB and plots """ fig, ax = plt.subplots(nrows=1, ncols=2) ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)) ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB)) ax[1].set_title('Box Filter (5,5)') ax[1].axis('off') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # To try different kernel, change size here. kernel_size = (5,5) # opencv has implementation for kernel based box blurring blur = cv2.blur(img,kernel_size) # Do plot plot_cv_img(img, blur) if __name__ == '__main__': main() Properties of linear filters Several computer vision applications are composed of step by step transformations of an input photo to output. This is easily done due to several properties associated with a common type of filters, that is, linear filters: The linear filters are commutative such that we can perform multiplication operations on filters in any order and the result still remains the same: a * b = b * a They are associative in nature, which means the order of applying the filter does not affect the outcome: (a * b) * c = a * (b * c) Even in cases of summing two filters, we can perform the first summation and then apply the filter, or we can also individually apply the filter and then sum the results. The overall outcome still remains the same: Applying a scaling factor to one filter and multiplying to another filter is equivalent to first multiplying both filters and then applying scaling factor These properties play a significant role in other computer vision tasks such as object detection and segmentation. A suitable combination of these filters enhances the quality of information extraction and as a result, improves the accuracy. Non-linear image filtering While in many cases linear filters are sufficient to get the required results, in several other use cases performance can be significantly increased by using non-linear image filtering. Mon-linear image filtering is more complex, than linear filtering. This complexity can, however, give you more control and better results in your computer vision tasks. Let's take a look at how non-linear image filtering works when applied to different images. Smoothing a photo Applying a box filter with hard edges doesn't result in a smooth blur on the output photo. To improve this, the filter can be made smoother around the edges. One of the popular such filters is a Gaussian filter. This is a non-linear filter which enhances the effect of the center pixel and gradually reduces the effects as the pixel gets farther from the center. Mathematically, a Gaussian function is given as: where μ is mean and σ is variance. An example kernel matrix for this kind of filter in 2D discrete domain is given as follows: This 2D array is used in normalized form and effect of this filter also depends on its width by changing the kernel width has varying effects on the output as discussed in further section. Applying gaussian kernel as filter removes high-frequency components which results in removing strong edges and hence a blurred photo: While this filter performs better blurring than a box filter, the implementation is also quite simple with OpenCV: def plot_cv_img(input_image, output_image): """ Converts an image from BGR to RGB and plots """ fig, ax = plt.subplots(nrows=1, ncols=2) ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)) ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(cv2.cvtColor(output_image, cv2.COLOR_BGR2RGB)) ax[1].set_title('Gaussian Blurred') ax[1].axis('off') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # apply gaussian blur, # kernel of size 5x5, # change here for other sizes kernel_size = (5,5) # sigma values are same in both direction blur = cv2.GaussianBlur(img,(5,5),0) plot_cv_img(img, blur) if __name__ == '__main__': main() The histogram equalization technique The basic point operations, to change the brightness and contrast, help in improving photo quality but require manual tuning. Using histogram equalization technique, these can be found algorithmically and create a better-looking photo. Intuitively, this method tries to set the brightest pixels to white and the darker pixels to black. The remaining pixel values are similarly rescaled. This rescaling is performed by transforming original intensity distribution to capture all intensity distribution. An example of this equalization is as following: The preceding image is an example of histogram equalization. On the right is the output and, as you can see, the contrast is increased significantly. The input histogram is shown in the bottom figure on the left and it can be observed that not all the colors are observed in the image. After applying equalization, resulting histogram plot is as shown on the right bottom figure. To visualize the results of equalization in the image , the input and results are stacked together in following figure. Code for the preceding photos is as follows: def plot_gray(input_image, output_image): """ Converts an image from BGR to RGB and plots """ # change color channels order for matplotlib fig, ax = plt.subplots(nrows=1, ncols=2) ax[0].imshow(input_image, cmap='gray') ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(output_image, cmap='gray') ax[1].set_title('Histogram Equalized ') ax[1].axis('off') plt.savefig('../figures/03_histogram_equalized.png') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # grayscale image is used for equalization gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # following function performs equalization on input image equ = cv2.equalizeHist(gray) # for visualizing input and output side by side plot_gray(gray, equ) if __name__ == '__main__': main() Median image filtering Median image filtering a similar technique as neighborhood filtering. The key technique here, of course, is the use of a median value. As such, the filter is non-linear. It is quite useful in removing sharp noise such as salt and pepper. Instead of using a product or sum of neighborhood pixel values, this filter computes a median value of the region. This results in the removal of random peak values in the region, which can be due to noise like salt and pepper noise. This is further shown in the following figure with different kernel size used to create output. In this image first input is added with channel wise random noise as: # read the image flower = cv2.imread('../figures/flower.png') # initialize noise image with zeros noise = np.zeros(flower.shape[:2]) # fill the image with random numbers in given range cv2.randu(noise, 0, 256) # add noise to existing image, apply channel wise noise_factor = 0.1 noisy_flower = np.zeros(flower.shape) for i in range(flower.shape[2]): noisy_flower[:,:,i] = flower[:,:,i] + np.array(noise_factor*noise, dtype=np.int) # convert data type for use noisy_flower = np.asarray(noisy_flower, dtype=np.uint8) The created noisy image is used for median image filtering as: # apply median filter of kernel size 5 kernel_5 = 5 median_5 = cv2.medianBlur(noisy_flower,kernel_5) # apply median filter of kernel size 3 kernel_3 = 3 median_3 = cv2.medianBlur(noisy_flower,kernel_3) In the following photo, you can see the resulting photo after varying the kernel size (indicated in brackets). The rightmost photo is the smoothest of them all: The most common application for median blur is in smartphone application which filters input image and adds an additional artifacts to add artistic effects. The code to generate the preceding photograph is as follows: def plot_cv_img(input_image, output_image1, output_image2, output_image3): """ Converts an image from BGR to RGB and plots """ fig, ax = plt.subplots(nrows=1, ncols=4) ax[0].imshow(cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)) ax[0].set_title('Input Image') ax[0].axis('off') ax[1].imshow(cv2.cvtColor(output_image1, cv2.COLOR_BGR2RGB)) ax[1].set_title('Median Filter (3,3)') ax[1].axis('off') ax[2].imshow(cv2.cvtColor(output_image2, cv2.COLOR_BGR2RGB)) ax[2].set_title('Median Filter (5,5)') ax[2].axis('off') ax[3].imshow(cv2.cvtColor(output_image3, cv2.COLOR_BGR2RGB)) ax[3].set_title('Median Filter (7,7)') ax[3].axis('off') plt.show() def main(): # read an image img = cv2.imread('../figures/flower.png') # compute median filtered image varying kernel size median1 = cv2.medianBlur(img,3) median2 = cv2.medianBlur(img,5) median3 = cv2.medianBlur(img,7) # Do plot plot_cv_img(img, median1, median2, median3) if __name__ == '__main__': main() Image filtering and image gradients These are more edge detectors or sharp changes in a photograph. Image gradients widely used in object detection and segmentation tasks. In this section, we will look at how to compute image gradients. First, the image derivative is applying the kernel matrix which computes the change in a direction. The Sobel filter is one such filter and kernel in the x-direction is given as follows: Here, in the y-direction: This is applied in a similar fashion to the linear box filter by computing values on a superimposed kernel with the photo. The filter is then shifted along the image to compute all values. Following is some example results, where X and Y denote the direction of the Sobel kernel: This is also termed as an image derivative with respect to given direction(here X or Y). The lighter resulting photographs (middle and right) are positive gradients, while the darker regions denote negative and gray is zero. While Sobel filters correspond to first order derivatives of a photo, the Laplacian filter gives a second-order derivative of a photo. The Laplacian filter is also applied in a similar way to Sobel: The code to get Sobel and Laplacian filters is as follows: # sobel x_sobel = cv2.Sobel(img,cv2.CV_64F,1,0,ksize=5) y_sobel = cv2.Sobel(img,cv2.CV_64F,0,1,ksize=5) # laplacian lapl = cv2.Laplacian(img,cv2.CV_64F, ksize=5) # gaussian blur blur = cv2.GaussianBlur(img,(5,5),0) # laplacian of gaussian log = cv2.Laplacian(blur,cv2.CV_64F, ksize=5) We learnt about types of filters and how to perform image filtering in OpenCV. To know more about image transformation and 3D computer vision check out this book Practical Computer Vision. Check out for more: Fingerprint detection using OpenCV 3 3 ways to deploy a QT and OpenCV application OpenCV 4.0 is on schedule for July release  
Read more
  • 0
  • 1
  • 85106

article-image-implementing-3-naive-bayes-classifiers-in-scikit-learn
Packt Editorial Staff
07 May 2018
13 min read
Save for later

Implementing 3 Naive Bayes classifiers in scikit-learn

Packt Editorial Staff
07 May 2018
13 min read
Scikit-learn provide three naive Bayes implementations: Bernoulli, multinomial and Gaussian. The only difference is about the probability distribution adopted. The first one is a binary algorithm particularly useful when a feature can be present or not. Multinomial naive Bayes assumes to have feature vector where each element represents the number of times it appears (or, very often, its frequency). This technique is very efficient in natural language processing or whenever the samples are composed starting from a common dictionary. The Gaussian Naive Bayes, instead, is based on a continuous distribution and it's suitable for more generic classification tasks. Ok, now that we have established naive Bayes variants are a handy set of algorithms to have in our machine learning arsenal and that Scikit-learn is a good tool to implement them, let’s rewind a bit. What is Naive Bayes? Naive Bayes are a family of powerful and easy-to-train classifiers, which determine the probability of an outcome, given a set of conditions using the Bayes' theorem. In other words, the conditional probabilities are inverted so that the query can be expressed as a function of measurable quantities. The approach is simple and the adjective naive has been attributed not because these algorithms are limited or less efficient, but because of a fundamental assumption about the causal factors that we will discuss. Naive Bayes are multi-purpose classifiers and it's easy to find their application in many different contexts. However, the performance is particularly good in all those situations when the probability of a class is determined by the probabilities of some causal factors. A good example is given by natural language processing, where a text can be considered as a particular instance of a dictionary and the relative frequencies of all terms provide enough information to infer a belonging class. Our examples may be generic, so to let you understand the application of naive Bayes in various context. The Bayes' theorem Let's consider two probabilistic events A and B. We can correlate the marginal probabilities P(A) and P(B) with the conditional probabilities P(A|B) and P(B|A) using the product rule: Considering that the intersection is commutative, the first members are equal, so we can derive the Bayes' theorem: This formula has very deep philosophical implications and it's a fundamental element of statistical learning. First of all, let's consider the marginal probability P(A): this is normally a value that determines how probable a target event is, like P(Spam) or P(Rain). As there are no other elements, this kind of probability is called Apriori, because it's often determined by mathematical considerations or simply by a frequency count. For example, imagine we want to implement a very simple spam filter and we've collected 100 emails. We know that 30 are spam and 70 are regular. So we can say that P(Spam) = 0.3. However, we'd like to evaluate using some criteria (for simplicity, let's consider a single one), for example, e-mail text is shorter than 50 characters. Therefore, our query becomes: The first term is similar to P(Spam) because it's the probability of spam given a certain condition. For this reason, it's called a posteriori (in other words, it's a probability that can estimate after knowing some additional elements). On the right side, we need to calculate the missing values, but it's simple. Let's suppose that 35 emails have a text shorter than 50 characters, P(Text < 50 chars) = 0.35 and, looking only into our spam folder, we discover that only 25 spam emails have a short text, so that P(Text < 50 chars|Spam) = 25/30 = 0.83. The result is: So, after receiving a very short email, there is 71% probability that it's a spam. Now we can understand the role of P(Text < 50 chars|Spam): as we have actual data, we can measure how probable is our hypothesis given the query, in other words, we have defined a likelihood (compare this with logistic regression) which is a weight between the Apriori probability and the a posteriori one (the term on the denominator is less important because it works as normalizing factor): The normalization factor is often represented by the Greek letter alpha, so the formula becomes: The last step is considering the case when there are more concurrent conditions (that is more realistic in real-life problems): A common assumption is called conditional independence (in other words, the effects produced by every cause are independent among each other) and allows us to write a simplified expression: Naive Bayes classifiers A naive Bayes classifier is called in this way because it's based on a naive condition, which implies the conditional independence of causes. This can seem very difficult to accept in many contexts where the probability of a particular feature is strictly correlated to another one. For example, in spam filtering, a text shorter than 50 characters can increase the probability of the presence of an image, or if the domain has been already blacklisted for sending the same spam emails to million users, it's likely to find particular keywords. In other words, the presence of a cause isn't normally independent from the presence of other ones. However, in Zhang H., The Optimality of Naive Bayes, AAAI 1, no. 2 (2004): 3, the author showed that under particular conditions (not so rare to happen), different dependencies clears one another, and a naive Bayes classifier succeeds in achieving very high performances even if its naiveness is violated. Let's consider a dataset: Every feature vector, for simplicity, will be represented as: We need also a target dataset: where each y can belong to one of P different classes. Considering the Bayes' theorem under conditional independence, we can write: The values of the marginal Apriori probability P(y) and of the conditional probabilities P(xi|y) is obtained through a frequency count, therefore, given an input vector x, the predicted class is the one which a posteriori probability is maximum. Naive Bayes in scikit-learn scikit-learn implements three naive Bayes variants based on the same number of different probabilistic distributions: Bernoulli, multinomial, and Gaussian. The first one is a binary distribution useful when a feature can be present or absent. The second one is a discrete distribution used whenever a feature must be represented by a whole number (for example, in natural language processing, it can be the frequency of a term), while the latter is a continuous distribution characterized by its mean and variance. Bernoulli naive Bayes If X is random variable Bernoulli-distributed, it can assume only two values (for simplicity, let's call them 0 and 1) and their probability is: To try this algorithm with scikit-learn, we're going to generate a dummy dataset. Bernoulli naive Bayes expects binary feature vectors, however, the class BernoulliNB has a binarize parameter which allows specifying a threshold that will be used internally to transform the features: from sklearn.datasets import make_classification >>> nb_samples = 300 >>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0) We have a generated the bidimensional dataset shown in the following figure: We have decided to use 0.0 as a binary threshold, so each point can be characterized by the quadrant where it's located. Of course, this is a rational choice for our dataset, but Bernoulli naive Bayes is thought for binary feature vectors or continuous values which can be precisely split with a predefined threshold. from sklearn.naive_bayes import BernoulliNB from sklearn.model_selection import train_test_split >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25) >>> bnb = BernoulliNB(binarize=0.0) >>> bnb.fit(X_train, Y_train) >>> bnb.score(X_test, Y_test) 0.85333333333333339 The score in rather good, but if we want to understand how the binary classifier worked, it's useful to see how the data have been internally binarized: Now, checking the naive Bayes predictions we obtain: >>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) >>> bnb.predict(data) array([0, 0, 1, 1]) Which is exactly what we expected. Multinomial naive Bayes A multinomial distribution is useful to model feature vectors where each value represents, for example, the number of occurrences of a term or its relative frequency. If the feature vectors have n elements and each of them can assume k different values with probability pk, then: The conditional probabilities P(xi|y) are computed with a frequency count (which corresponds to applying a maximum likelihood approach), but in this case, it's important to consider the alpha parameter (called Laplace smoothing factor) which default value is 1.0 and prevents the model from setting null probabilities when the frequency is zero. It's possible to assign all non-negative values, however, larger values will assign higher probabilities to the missing features and this choice could alter the stability of the model. In our example, we're going to consider the default value of 1.0. For our purposes, we're going to use the DictVectorizer. There are automatic instruments to compute the frequencies of terms, but we're going to discuss them later. Let's consider only two records: the first one representing a city, while the second one countryside. Our dictionary contains hypothetical frequencies, like if the terms were extracted from a text description: from sklearn.feature_extraction import DictVectorizer >>> data = [ {'house': 100, 'street': 50, 'shop': 25, 'car': 100, 'tree': 20}, {'house': 5, 'street': 5, 'shop': 0, 'car': 10, 'tree': 500, 'river': 1} ] >>> dv = DictVectorizer(sparse=False) >>> X = dv.fit_transform(data) >>> Y = np.array([1, 0]) >>> X array([[ 100., 100., 0., 25., 50., 20.], [ 10., 5., 1., 0., 5., 500.]]) Note that the term 'river' is missing from the first set, so it's useful to keep alpha equal to 1.0 to give it a small probability. The output classes are 1 for city and 0 for the countryside. Now we can train a MultinomialNB instance: from sklearn.naive_bayes import MultinomialNB >>> mnb = MultinomialNB() >>> mnb.fit(X, Y) MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) To test the model, we create a dummy city with a river and a dummy country place without any river. >>> test_data = data = [ {'house': 80, 'street': 20, 'shop': 15, 'car': 70, 'tree': 10, 'river': 1}, ] {'house': 10, 'street': 5, 'shop': 1, 'car': 8, 'tree': 300, 'river': 0} >>> mnb.predict(dv.fit_transform(test_data)) array([1, 0]) As expected the prediction is correct. Later on, when discussing some elements of natural language processing, we're going to use multinomial naive Bayes for text classification with larger corpora. Even if the multinomial distribution is based on the number of occurrences, it can be successfully used with frequencies or more complex functions. Gaussian Naive Bayes Gaussian Naive Bayes is useful when working with continuous values which probabilities can be modeled using a Gaussian distribution: The conditional probabilities P(xi|y) are also Gaussian distributed and, therefore, it's necessary to estimate mean and variance of each of them using the maximum likelihood approach. This quite easy, in fact, considering the property of a Gaussian, we get: Where the k index refers to the samples in our dataset and P(xi|y) is a Gaussian itself. By minimizing the inverse of this expression (in Russel S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson there's a complete analytical explanation), we get mean and variance for each Gaussian associated to P(xi|y) and the model is hence trained. As an example, we compare Gaussian Naive Bayes with logistic regression using the ROC curves. The dataset has 300 samples with two features. Each sample belongs to a single class: from sklearn.datasets import make_classification >>> nb_samples = 300 >>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0) A plot of the dataset is shown in the following figure: Now we can train both models and generate the ROC curves (the Y scores for naive Bayes are obtained through the predict_proba method): from sklearn.naive_bayes import GaussianNB from sklearn.linear_model import LogisticRegression from sklearn.metrics import roc_curve, auc from sklearn.model_selection import train_test_split >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25) >>> gnb = GaussianNB() >>> gnb.fit(X_train, Y_train) >>> Y_gnb_score = gnb.predict_proba(X_test) >>> lr = LogisticRegression() >>> lr.fit(X_train, Y_train) >>> Y_lr_score = lr.decision_function(X_test) >>> fpr_gnb, tpr_gnb, thresholds_gnb = roc_curve(Y_test, Y_gnb_score[:, 1]) >>> fpr_lr, tpr_lr, thresholds_lr = roc_curve(Y_test, Y_lr_score) The resulting ROC curves are shown in the following figure: Naive Bayes performances are slightly better than logistic regression, however, the two classifiers have similar accuracy and Area Under the Curve (AUC). It's interesting to compare the performances of Gaussian and multinomial naive Bayes with the MNIST digit dataset. Each sample (belonging to 10 classes) is an 8x8 image encoded as an unsigned integer (0 - 255), therefore, even if each feature doesn't represent an actual count, it can be considered like a sort of magnitude or frequency. from sklearn.datasets import load_digits from sklearn.model_selection import cross_val_score >>> digits = load_digits() >>> gnb = GaussianNB() >>> mnb = MultinomialNB() >>> cross_val_score(gnb, digits.data, digits.target, scoring='accuracy', cv=10).mean() 0.81035375835678214 >>> cross_val_score(mnb, digits.data, digits.target, scoring='accuracy', cv=10).mean() 0.88193962163008377 The multinomial naive Bayes performs better than the Gaussian variant and the result is not really surprising. In fact, each sample can be thought as a feature vector derived from a dictionary of 64 symbols. The value can be the count of each occurrence, so a multinomial distribution can better fit the data, while a Gaussian is slightly more limited by its mean and variance. We've exposed the generic naive Bayes approach starting from the Bayes' theorem and its intrinsic philosophy. The naiveness of such algorithm is due to the choice to assume all the causes to be conditional independent. It means that each contribution is the same in every combination and the presence of a specific cause cannot alter the probability of the other ones. This is not so often realistic, however, under some assumptions; it's possible to show that internal dependencies clear each other so that the resulting probability appears unaffected by their relations. [box type="note" align="" class="" width=""]You read an excerpt from the book, Machine Learning Algorithms, written by Giuseppe Bonaccorso. This book will help you build strong foundation to enter the world of machine learning and data science. You will learn to build a data model and see how it behaves using different ML algorithms, explore support vector machines, recommendation systems, and even create a machine learning architecture from scratch. Grab your copy today![/box] What is Naïve Bayes classifier? Machine Learning Algorithms: Implementing Naive Bayes with Spark MLlib Implementing Apache Spark MLlib Naive Bayes to classify digital breath test data for drunk driving  
Read more
  • 0
  • 0
  • 85097