Machine Learning – Easy Reference

In this post, we have included the must-know things for when you deal with Machine Learning algorithms. Here is the list for your easy reference; bookmark this page!

Classification metrics

In the context of binary classification, here are the main metrics that are important to track in order to assess the performance of the model.

Confusion matrix: The confusion matrix is used to get a more complete picture when assessing the performance of a model. It tabulates predicted classes against actual classes, giving the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).
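
A minimal sketch in R, assuming the hypothetical vectors actual and predicted hold 0/1 labels; the base table() function builds the confusion matrix:

# hypothetical 0/1 vectors of true labels and model predictions
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

cm <- table(Predicted = predicted, Actual = actual)   # rows = predicted, columns = actual
print(cm)

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]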

Main metrics: Accuracy, precision, recall (sensitivity), specificity and the F1 score are the metrics most commonly used to assess the performance of classification models.
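
Continuing the sketch above, these metrics follow directly from the TP/TN/FP/FN counts:

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)          # also called sensitivity or TPR
f1        <- 2 * precision * recall / (precision + recall)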

ROC: The receiver operating characteristic curve, also noted ROC, is the plot of the true positive rate (TPR) versus the false positive rate (FPR) obtained by varying the decision threshold.

AUC: The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC curve; a perfect classifier has an AUC of 1, while random guessing gives 0.5.
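
As a quick sketch, the ROC curve and AUC can be computed with the pROC package; the vectors actual and scores below are hypothetical examples:

# install.packages("pROC")   # if not already installed
library(pROC)

actual <- c(1, 0, 1, 1, 0, 0, 1, 0)                    # true 0/1 labels
scores <- c(0.9, 0.3, 0.4, 0.8, 0.2, 0.6, 0.7, 0.1)    # predicted probabilities

roc_obj <- roc(actual, scores)   # TPR/FPR at every threshold
plot(roc_obj)                    # ROC curve
auc(roc_obj)                     # area under the curve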

Regression metrics

Basic metrics: Given a regression model f, errors such as the mean absolute error (MAE), the mean squared error (MSE) and the root mean squared error (RMSE) are commonly used to assess the performance of the model.
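
A minimal sketch in base R, with hypothetical vectors y (observed outcomes) and y_hat (predictions from f):

y     <- c(3.0, 4.5, 5.1, 6.2, 7.8)   # observed outcomes (hypothetical)
y_hat <- c(2.8, 4.7, 5.0, 6.5, 7.5)   # model predictions (hypothetical)

mae  <- mean(abs(y - y_hat))     # mean absolute error
mse  <- mean((y - y_hat)^2)      # mean squared error
rmse <- sqrt(mse)                # root mean squared error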

Coefficient of determination: The coefficient of determination, often noted R^2 or r^2, provides a measure of how well the observed outcomes are replicated by the model. It is defined as R^2 = 1 - SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares.
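
Continuing with the hypothetical y and y_hat from the sketch above:

ss_res <- sum((y - y_hat)^2)       # residual sum of squares
ss_tot <- sum((y - mean(y))^2)     # total sum of squares
r_squared <- 1 - ss_res / ss_tot
r_squared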

Main metrics: Metrics such as the adjusted R^2, AIC and BIC are commonly used to compare regression models while taking into account the number of variables n that they use.
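
For instance, with a linear model fitted on the built-in mtcars data (purely for illustration):

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$r.squared        # R^2
summary(fit)$adj.r.squared    # adjusted R^2, which penalises extra variables
AIC(fit); BIC(fit)            # information criteria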

Model selection

Vocabulary: When selecting a model, we distinguish three different parts of the data: the training set (used to fit the model), the validation set (used to compare models and tune hyperparameters) and the test set (used to report the final, unbiased performance).

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set.

Cross-validation: Cross-validation, also noted CV, is a method used to select a model that does not rely too much on the initial training set. Common variants include k-fold cross-validation and leave-one-out cross-validation.

The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.
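
A minimal sketch of k-fold cross-validation in base R, using the built-in mtcars data and a simple linear model as a stand-in for any model:

set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to a fold

cv_errors <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)             # train on the k-1 other folds
  mean((test$mpg - predict(fit, test))^2)              # MSE on the held-out fold
})

mean(cv_errors)   # cross-validation error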

Regularization: The regularization procedure aims at preventing the model from overfitting the data and thus deals with high-variance issues. Commonly used techniques include LASSO (L1 penalty), ridge (L2 penalty) and the elastic net (a combination of both).
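
As a sketch, ridge and lasso can be tried with the glmnet package, where alpha = 0 gives ridge and alpha = 1 gives lasso; the mtcars columns below are used purely for illustration:

# install.packages("glmnet")   # if not already installed
library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp")])   # predictor matrix
y <- mtcars$mpg                                   # response

ridge <- glmnet(x, y, alpha = 0)       # L2 penalty (ridge)
lasso <- glmnet(x, y, alpha = 1)       # L1 penalty (lasso)

cv_fit <- cv.glmnet(x, y, alpha = 1)   # choose lambda by cross-validation
coef(cv_fit, s = "lambda.min")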

Diagnostics

Bias: The bias of a model is the difference between the expected prediction and the true values we try to predict for given data points.

Variance: The variance of a model is the variability of the model prediction for given data points.

Bias/variance tradeoff: The simpler the model, the higher the bias, and the more complex the model, the higher the variance.
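
A tiny sketch of this tradeoff, fitting a very simple and a very flexible model to the same noisy data (the degree-15 polynomial is an arbitrary choice for illustration):

set.seed(1)
x <- runif(50, 0, 1)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)

simple  <- lm(y ~ x)             # high bias, low variance
complex <- lm(y ~ poly(x, 15))   # low bias, high variance

mean(residuals(simple)^2)        # training error of the simple model
mean(residuals(complex)^2)       # much lower training error, but likely to overfit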

Error analysis: Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.

Ablative analysis: Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.

I hope this helps you look up the important things easily.

Credits: Amidi brothers!

Data Engineer vs Data Scientist (Infographic)

This infographic will help us better understand the skills and responsibilities of a Data Engineer and a Data Scientist. It also lets us compare salaries and the popular software and tools used by each. Hope this helps!

[Infographic: Data Engineer vs Data Scientist]

Eight Steps to Become a Data Scientist! (The Sexiest and Hottest Job of the Decade)

Wondering how to become a Data Scientist? Here we go: the eight steps to become a Data Scientist (the sexiest and hottest job of the decade).

Well, these steps are not so easy, but they are possible if we try. Most of them come at no cost or very low cost.

https://i0.wp.com/blog.datacamp.com/wp-content/uploads/2014/08/How-to-become-a-data-scientist.jpg

Thanks to DataCamp for the nice infographic. Is this info useful? Then please share it with your circle.

Clash of the Titans! (R vs Python)

This is for everyone out there wondering which language is better to learn for data analysis and visualization, and whether to use R or Python for everyday data analysis tasks.

Both Python and R are among the most widely used languages for data analysis, and each has its supporters and opponents. While Python is much praised for being a general-purpose language with an easy-to-understand syntax, R's functionality was developed with statisticians in mind, giving it field-specific advantages such as extensive features for data visualization.

DataCamp has recently released a new infographic for everyone interested in how these two (statistical) programming languages relate to each other. This superb infographic explores the strengths of R over Python and vice versa, and aims to provide a basic comparison between these two programming languages from a data science and statistics perspective.

[Infographic: R vs Python for data science]

Note:

Let's not ignore the new entrant to the battlefield: the Julia language. It is a high-level dynamic programming language designed to address the requirements of high-performance numerical and scientific computing while also being effective for general-purpose programming. It is influenced by MATLAB, C, Python, Perl, R, Ruby and others.

Soon we expect Julia to join the clash!

Steps to Learn Data Science using R

One of the common difficulties individuals face in learning R is the lack of an organized path. They don't know where to start, how to proceed, or which way to choose. Although there is a surplus of good free resources available on the Internet, this can be overwhelming and puzzling at the same time.

After mining through countless resources and archives, here is a comprehensive learning path to learn R from the beginning. It will help you learn R rapidly and proficiently.

Step 1: Download and Install R

The easiest way to proceed is to download the base version of R, along with installation instructions, from the CRAN site. R is available for Windows, Mac and Linux. Windows and Mac users will most likely want one of these versions. R is also part of many Linux distributions, so you should check your Linux package management system in addition to the link above.

You can now install various packages. There are more than 9,000 packages in R for different purposes. A good way to discover them is through the CRAN Task Views, where you can select the subtype of packages that you want.

To install a package, you can just do this:

For example, if we want to install a package called "animation", then we use:

install.packages("animation")

Normally the package should just install, however:

  • if you are using Linux and don't have root access, this command won't install into the system library; you may be prompted to use a personal library instead.
  • you will be asked to select your local mirror, i.e. which server to use to download the package.

You should also install RStudio. It makes R coding much easier, since it allows you to write multiple lines of code, handle plots, install and maintain packages, and navigate your programming environment.

Step 2: Learn the basics

You need to start by learning the basics of the language, its libraries and its data structures. The R track from DataCamp is the best place to start your journey; see the free Introduction to R course at https://www.datacamp.com/courses/introduction-to-r. After doing this course, you should be comfortable writing basic R scripts and understand basic data analysis. Alternatively, you can also try Code School for R at http://tryr.codeschool.com/

If you want to learn R offline in your own time, you can use the interactive swirl package from http://swirlstats.com

Primarily, learn read.table, data frames, table, summary, describe, loading and installing packages, and data visualization using the plot command.
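
For example, a few of these basics in action using the built-in iris data set (describe() comes from add-on packages such as Hmisc or psych, which would need to be installed separately):

data(iris)                  # built-in data frame
str(iris)                   # structure of the data frame
summary(iris)               # basic summary statistics
table(iris$Species)         # frequency table of a factor
plot(iris$Sepal.Length, iris$Sepal.Width)        # simple scatter plot

# reading external data into a data frame
# df <- read.table("mydata.txt", header = TRUE)  # "mydata.txt" is a hypothetical file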

Step 3: Learn Data Management

You will use data management operations a lot for data cleaning, especially if you are going to work with text data. The best way to learn is to go through text-manipulation and numerical-manipulation assignments. You can learn about connecting to databases through the RODBC package, and about writing SQL queries against data frames through the sqldf package.
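
For instance, a small sketch of querying a data frame with sqldf; the RODBC lines are commented out and use a placeholder DSN and table name, not a real data source:

# install.packages("sqldf")   # if not already installed
library(sqldf)

# run an SQL query directly against a data frame
sqldf("SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")

# RODBC sketch (placeholder DSN and table name):
# library(RODBC)
# con <- odbcConnect("my_dsn")
# df  <- sqlQuery(con, "SELECT * FROM some_table")
# odbcClose(con)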

Step 4: Study specific packages in R – data.table and dplyr

Here we go! Here is a brief introduction to numerous libraries. We need to start practising some common operations (a minimal data.table/dplyr sketch follows the list below).

  • Practice the data.table tutorial thoroughly here. Print and study the cheat sheet for data.table.
  • Next, you can have a look at the dplyr tutorial here.
  • For text mining, start with creating a word cloud in R and then learn through this series of tutorials: Part 1 and Part 2.
  • For social network analysis read through these pages.
  • Do sentiment analysis using Twitter data – check out this and this analysis.
  • For optimization through R, read here and here.
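
As a minimal sketch of the kind of operations data.table and dplyr cover, using the built-in mtcars data purely as an example:

library(data.table)
library(dplyr)

dt <- as.data.table(mtcars)
dt[cyl == 6, .(avg_mpg = mean(mpg)), by = gear]   # filter, aggregate and group with data.table

mtcars %>%
  filter(cyl == 6) %>%
  group_by(gear) %>%
  summarise(avg_mpg = mean(mpg))                  # the same idea with dplyr verbs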

Step 5: Effective Data Visualization through ggplot2

  • Read Edward Tufte and his principles on how to make data visualizations here. In particular, read about data-ink, the lie factor and data density.
  • Read about the common pitfalls in dashboard design by Stephen Few.
  • To learn the grammar of graphics and a good way to apply it in R, go through this link from Dr Hadley Wickham, the creator of ggplot2 and one of the most brilliant R package creators in the world today. You can download the data and slides as well. (A minimal ggplot2 sketch follows this list.)
  • Are you interested in visualizing data for spatial analysis? Go through the amazing ggmap package.
  • Interested in making animations through R? Look through these examples; the animation package will help you here.
  • Slidify will help supercharge your graphics with HTML5.
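
A tiny ggplot2 sketch of the grammar-of-graphics idea (data, aesthetics, geoms), using the built-in mtcars data purely for illustration:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                             # map the data to points
  geom_smooth(method = "lm", se = FALSE) +   # add a linear trend per group
  labs(x = "Weight", y = "Miles per gallon", colour = "Cylinders")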

Step 6: Learn Data Mining and Machine Learning

Now we come to the most valuable skill for a data scientist: data mining and machine learning. You can find a very comprehensive set of resources on data mining in R at http://www.rdatamining.com/. The rattle package helps with an easy-to-use Graphical User Interface (GUI). You can also read a free, open-source, easy-to-understand book at http://togaware.com/datamining/survivor/index.html. You will go through an overview of algorithms such as regression, decision trees, ensemble modelling and clustering. You can also see the various machine learning options available in R in the relevant CRAN Task View. A minimal decision-tree sketch follows below.
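
As a minimal sketch of one of these algorithms, here is a classification tree with the rpart package, using the built-in iris data purely for illustration (packages such as rattle or rpart.plot can draw nicer trees):

library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")   # fit a classification tree
printcp(fit)             # complexity and cross-validation table
plot(fit); text(fit)     # quick look at the fitted tree

predict(fit, head(iris), type = "class")   # predicted classes for a few rows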

Step 7: Practice

Practice with the example data available to you and on the internet. Stay in touch with what your fellow R coders are doing by subscribing to http://www.r-bloggers.com/, http://stats.stackexchange.com and www.stackoverflow.com. Go through the questions and answers that users come up with. Start interacting by asking questions and providing answers where you can! Happy learning!!! 🙂