
Data Science Study Materials

Unit-1
Data Science has become one of the most in-demand jobs of the 21st century. Every organization is looking for candidates with knowledge of data science. In this tutorial, we give an introduction to data science, covering data science job roles, tools for data science, components of data science, applications, etc.

What is Data Science?

Data Science is a multidisciplinary field that involves the use of statistical and computational
methods to extract insights and knowledge from data. To analyze and comprehend large
data sets, it uses techniques from computer science, mathematics, and statistics.

Data mining, machine learning, and data visualization are just a few of the tools and methods data scientists frequently employ to draw meaning from data. They may deal with both structured and unstructured data, including text and images, databases, and spreadsheets.

A number of sectors, including healthcare, finance, marketing, and more, use the insights gained via data analysis to steer innovation, inform business decisions, and address challenging problems.

In short, we can say that data science is all about:

o Collecting data from a range of sources, including databases, sensors, websites, etc.

o Organizing and processing data to remove mistakes and inconsistencies, while making sure it is in a format that can be analyzed.
o Finding patterns and correlations in the data using statistical and machine learning
approaches.

o Developing visual representations of the data to aid in comprehension of the conclusions and insights.

o Creating mathematical models and computer programs that can classify and forecast
based on data.

o Conveying clear and understandable facts and insights to others.

Example:

Let's suppose we want to travel from station A to station B by car. Now, we need to make some decisions, such as which route will get us to the destination fastest, which route will have no traffic jam, and which will be cost-effective. All these decision factors act as input data, and we get an appropriate answer by analyzing them; this analysis of data is called data analysis, which is a part of data science.

Need for Data Science:

Some years ago, data was scarce and mostly available in a structured form, which could be easily stored in Excel sheets and processed using BI tools.

But in today's world, data has become so vast, with approximately 2.5 quintillion bytes of data generated every day, that it has led to a data explosion. As per various studies, it was estimated that by 2020, 1.7 MB of data would be created every single second by each person on earth. Every company requires data to work, grow, and improve its business.
Now, handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we require complex, powerful, and efficient algorithms and technology, and that technology came into existence as Data Science.
Following are some main reasons for using data science technology:

o Every day, the world produces enormous volumes of data, which must be processed and analyzed by data scientists in order to provide new information and understanding.

o To maintain their competitiveness in their respective industries, businesses and organizations must make data-driven choices. Data science offers the methods and tools needed to harvest valuable information from data in order to support decision-making.

o In many disciplines, including healthcare, economics, and climate research, data science is essential for finding solutions to complicated issues.

o Data science is now crucial for creating and training intelligent systems, as artificial intelligence and machine learning have grown in popularity.

o Data science increases productivity and lowers costs in a variety of industries, including manufacturing and logistics, by streamlining procedures and forecasting results.

Data science Jobs:

As per various surveys, the data scientist role is becoming the most in-demand job of the 21st century due to increasing demand for data science. Some people even call it "the hottest job title of the 21st century". Data scientists are the experts who can use various statistical tools and machine learning algorithms to understand and analyze data.

The average salary range for a data scientist is approximately $95,000 to $165,000 per annum, and as per various studies, about 11.5 million jobs will be created by the year 2026.

Types of Data Science Job

If you learn data science, you get the opportunity to pursue various exciting job roles in this domain. The main job roles are given below:

1. Data Scientist

2. Data Analyst

3. Machine learning expert

4. Data engineer
5. Data Architect

6. Data Administrator

7. Business Analyst

8. Business Intelligence Manager

1. Data Scientist: A data scientist is in charge of deciphering large, complicated data sets for
patterns and trends, as well as creating prediction models that may be applied to business
choices. They could also be in charge of creating data-driven solutions for certain business
issues.

Skill Required: To become a data scientist, one needs skills in mathematics, statistics, programming languages (such as Python, R, and Julia), machine learning, data visualization, big data technologies (such as Hadoop), domain expertise (so that the person is capable of understanding data related to the domain), and communication and presentation skills to efficiently convey the insights from the data.

2. Machine Learning Engineer: A machine learning engineer is in charge of creating, testing, and implementing machine learning algorithms and models that may be utilized to automate tasks and boost productivity.

Skill Required: Programming languages like Python and Java, statistics, machine learning
frameworks like TensorFlow and PyTorch, big data technologies like Hadoop and Spark,
software engineering, and problem-solving skills are all necessary for a machine learning
engineer.

3. Data Analyst: Data analysts are in charge of gathering and examining data in order to spot
patterns and trends and offer insights that may be applied to guide business choices.
Creating data visualizations and reports to present results to stakeholders may also fall
within the scope of their responsibility.

Skill Required: Data analysis and visualization, statistical analysis, database querying,
programming in languages like SQL or Python, critical thinking, and familiarity with tools and
technologies like Excel, Tableau, SQL Server, and Jupyter Notebook are all necessary for a
data analyst.

4. Business Intelligence Analyst: Data analysis for business development and improvement
is the responsibility of a business intelligence analyst. They could also be in charge of
developing and putting into use data warehouses and other types of data management
systems.

Skill Required: A business intelligence analyst has to be skilled in data analysis and
visualization, business knowledge, SQL and data warehousing, data modeling, and ETL
procedures, as well as programming languages like Python and knowledge of BI tools like
Tableau, Power BI, or QlikView.
5. Data Engineer: A data engineer is in charge of creating, constructing, and maintaining the
infrastructure and pipelines for collecting and storing data from diverse sources. In addition
to guaranteeing data security and quality, they could also be in charge of creating data
integration solutions.

Skill Required: To create, build, and maintain scalable and effective data pipelines and data
infrastructure for processing and storing large volumes of data, a data engineer needs
expertise in database architecture, ETL procedures, data modeling, programming languages
like Python and SQL, big data technologies like Hadoop and Spark, cloud computing
platforms like AWS or Azure, and tools like Apache Airflow or Talend.

6. Big Data Engineer: Big data engineers are in charge of planning and constructing systems
that can handle and analyze massive volumes of data. Additionally, they can be in charge of
putting scalable data storage options into place and creating distributed computing systems.

Skill Required: Big Data Engineers must be proficient in distributed systems, programming languages like Java or Scala, data modeling, database management, cloud computing platforms like AWS or Azure, and big data technologies like Apache Spark, Kafka, and Hive, and have experience with tools like Apache NiFi or Apache Beam, in order to design, build, and maintain large-scale distributed data processing systems.

7. Data Architect: Data models and database systems that can support data-intensive
applications must be designed and implemented by a data architect. They could also be in
charge of maintaining data security, privacy, and compliance.

Skill Required: A data architect needs knowledge of database design and modeling, data
warehousing, ETL procedures, programming languages like SQL or Python, proficiency with
data modeling tools like ER/Studio or ERwin, familiarity with cloud computing platforms like
AWS or Azure, and expertise in data governance and security.

8. Data Administrator: An organization's data assets must be managed and organized by a data administrator. They are in charge of guaranteeing the security, accuracy, and completeness of data, as well as making sure that those who require it can readily access it.

Skill Required: A data administrator needs expertise in database management, backup, and
recovery, data security, SQL programming, data modeling, familiarity with database
platforms like Oracle or SQL Server, proficiency with data management tools like SQL
Developer or Toad, and experience with cloud computing platforms like AWS or Azure.

9. Business Analyst: A business analyst is a professional who helps organizations identify business problems and opportunities and recommends solutions to those problems through the use of data and analysis.

Skill Required: A business analyst needs expertise in data analysis, business process
modeling, stakeholder management, requirements gathering and documentation,
proficiency in tools like Excel, Power BI, or Tableau, and experience with project
management.

Prerequisite for Data Science

Non-Technical Prerequisite:

While technical skills are essential for data science, there are also non-technical skills that
are important for success in this field. Here are some non-technical prerequisites for data
science:

1. Domain knowledge: To succeed in data science, it can be essential to have a thorough grasp of the sector or area you are working in. This knowledge will improve your understanding of the data and its importance to the business.

2. Problem-solving skills: Solving complicated issues is a common part of data science; thus, the capacity to do so methodically and systematically is crucial.

3. Communication skills: Data scientists need to be good communicators. You must be able to communicate the insights from the data to others.

4. Curiosity and creativity: Data science frequently entails venturing into unfamiliar
territory, so being able to think creatively and approach issues from several
perspectives may be a significant skill.

5. Business Acumen: For data scientists, it is crucial to comprehend how organizations function and create value. This aids in improving your comprehension of the context and applicability of your work, as well as pointing out potential uses of data to produce commercial results.

6. Critical thinking: In data science, it's critical to be able to assess information with
objectivity and reach logical conclusions. This involves the capacity to spot biases and
assumptions in data and analysis as well as the capacity to form reasonable
conclusions based on the facts at hand.

Technical Prerequisite:

Since data science includes dealing with enormous volumes of data and necessitates a
thorough understanding of statistical analysis, machine learning algorithms, and
programming languages, technical skills are crucial. Here are some technical prerequisites
for data science:

1. Mathematics and Statistics: Data science involves working with data and analyzing it using statistical methods. As a result, you should have a strong background in statistics and
mathematics. Calculus, linear algebra, probability theory, and statistical inference are
some of the important ideas you should be familiar with.
2. Programming: A fundamental skill for data scientists is programming. A solid
command of at least one programming language, such as Python, R, or SQL, is
required. Additionally, you must be knowledgeable about well-known data science
libraries like Pandas, NumPy, and Matplotlib.

3. Data Manipulation and Analysis: Working with data is an important component of data science. You should be skilled in methods for cleaning, transforming, and analyzing data, as well as in data visualization. Knowledge of programs like Tableau or Power BI might be helpful.

4. Machine Learning: A key component of data science is machine learning. Decision trees, random forests, and clustering are a few examples of supervised and unsupervised learning algorithms that you should be well-versed in. Additionally, you should be familiar with well-known machine learning frameworks like Scikit-learn and TensorFlow.

5. Deep Learning: Neural networks are used in deep learning, a kind of machine
learning. Deep learning frameworks like TensorFlow, PyTorch, or Keras should be
familiar to you.

6. Big Data Technologies: Data scientists commonly work with large and intricate datasets. Big data technologies like Hadoop, Spark, and Hive should be known to you.

7. Databases: A deep understanding of databases, such as SQL, is essential in data science for retrieving and working with data.

Difference between BI and Data Science

BI stands for business intelligence, which is also used for data analysis of business information. Below are some differences between BI and Data Science:

o Data Source: Business intelligence deals with structured data, e.g., a data warehouse, whereas data science deals with both structured and unstructured data, e.g., weblogs, feedback, etc.

o Method: Business intelligence is analytical (it works on historical data), whereas data science is scientific (it goes deeper to know the reason behind the data report).

o Skills: Statistics and visualization are the two skills required for business intelligence, whereas statistics, visualization, and machine learning are the skills required for data science.

o Focus: Business intelligence focuses on both past and present data, whereas data science focuses on past data, present data, and also future predictions.

Data Science Components:

Data science involves several components that work together to extract insights and value
from data. Here are some of the key components of data science:

1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large amounts and find meaningful insights from it.

2. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
3. Domain Expertise: In data science, domain expertise binds data science together.
Domain expertise means specialized knowledge or skills in a particular area. In data
science, there are various areas for which we need domain experts.

4. Data Collection: Data is gathered and acquired from a number of sources. This can be
unstructured data from social media, text, or photographs, as well as structured data
from databases.

5. Data Preprocessing: Raw data is frequently unreliable, erratic, or incomplete. In order to remove mistakes, handle missing data, and standardize the data, data cleaning and preprocessing are crucial steps.

6. Data Exploration and Visualization: This entails exploring the data and gaining
insights using methods like statistical analysis and data visualization. To aid in
understanding the data, this may entail developing graphs, charts, and dashboards.

7. Data Modeling: In order to analyze the data and derive insights, this component
entails creating models and algorithms. Regression, classification, and clustering are
a few examples of supervised and unsupervised learning techniques that may be
used in this.

8. Machine Learning: Building predictive models that can learn from data is required for
this. This might include the increasingly significant deep learning methods, such as
neural networks, in data science.

9. Communication: This entails informing stakeholders of the findings of the data analysis. Explaining the results might involve producing reports, visualizations, and presentations.

10. Deployment and Maintenance: The models and algorithms need to be deployed and
maintained when the data science project is over. This may entail keeping an eye on
the models' performance and upgrading them as necessary.
Tools for Data Science

Following are some tools required for data science:

o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.

o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift

o Data Visualization tools: R, Jupyter, Tableau, Cognos.

o Machine learning tools: Spark, Mahout, Azure ML studio.

Machine learning in Data Science

To become a data scientist, one should also be aware of machine learning and its algorithms, as various machine learning algorithms are broadly used in data science. Following are the names of some machine learning algorithms used in data science:

o Regression

o Decision tree

o Clustering

o Principal component analysis

o Support vector machines

o Naive Bayes

o Artificial neural network

o Apriori

We will now provide a brief introduction to a few of the important algorithms:

1. Linear Regression Algorithm: Linear regression is one of the most popular machine learning algorithms, based on supervised learning. This algorithm performs regression, which is a method of modeling a target value based on independent variables. It takes the form of a linear equation that relates the set of inputs to the predicted output. This algorithm is mostly used in forecasting and prediction. Since it shows a linear relationship between the input and output variables, it is called linear regression.
The below equation describes the relationship between the x and y variables:

y = mx + c

Where, y = dependent variable
x = independent variable
m = slope
c = intercept
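
For example, here is a minimal Python sketch, assuming scikit-learn and NumPy are installed; the data points below are made up for illustration and roughly follow y = 2x + 1:

# Fit y = mx + c with scikit-learn (illustrative data only)
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # independent variable
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])            # dependent variable

model = LinearRegression()
model.fit(x, y)

print("slope m:", model.coef_[0])        # estimated m
print("intercept c:", model.intercept_)  # estimated c
print("prediction for x = 6:", model.predict([[6.0]])[0])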

2. Decision Tree: Decision Tree algorithm is another machine learning algorithm, which
belongs to the supervised learning algorithm. This is one of the most popular machine
learning algorithms. It can be used for both classification and regression problems.

In the decision tree algorithm, we solve the problem using a tree representation in which each internal node represents a feature, each branch represents a decision, and each leaf represents an outcome.

Following is an example for a job offer problem:

[Figure: decision tree for the job offer problem]

In the decision tree, we start from the root of the tree and compare the value of the root attribute with the record's attribute. On the basis of this comparison, we follow the corresponding branch and move to the next node. We continue comparing these values until we reach a leaf node with the predicted class value.
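
A minimal sketch of this idea in Python, assuming scikit-learn is available; the tiny job-offer dataset below (salary in thousands, commute in minutes) is invented purely for illustration:

# Toy decision tree for an accept/decline decision (illustrative data only)
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[50, 60], [80, 20], [65, 45], [90, 50], [40, 15], [75, 30]]  # [salary, commute]
y = [0, 1, 0, 1, 0, 1]                                            # 1 = accept, 0 = decline

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each node tests a feature, each branch is a decision, each leaf is an outcome
print(export_text(tree, feature_names=["salary", "commute"]))
print("new offer:", tree.predict([[70, 25]])[0])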

3. K-Means Clustering: K-means clustering is one of the most popular algorithms of machine
learning, which belongs to the unsupervised learning algorithm. It solves the clustering
problem.

If we are given a data set of items with certain features and values, and we need to categorize those items into groups, then such problems can be solved using the k-means clustering algorithm.

The k-means clustering algorithm aims at minimizing an objective function, known as the squared error function, which is given as:

J(V) = Σ_{i=1}^{c} Σ_{j=1}^{c_i} (||x_i − v_j||)²

Where, J(V) = objective function
||x_i − v_j|| = Euclidean distance between x_i and v_j
c_i = number of data points in the i-th cluster
c = number of clusters
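
A minimal Python sketch, assuming scikit-learn is installed, that clusters a few made-up 2-D points into k = 2 groups; the inertia_ value reported by scikit-learn corresponds to the squared error objective J(V) described above:

# K-means on toy 2-D points (illustrative data only)
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [2, 3],     # one natural group
                   [8, 8], [9, 9], [8, 10]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("cluster labels:", labels)
print("cluster centers:", kmeans.cluster_centers_)
print("squared error J(V):", kmeans.inertia_)   # the objective being minimized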

4. SVM: The supervised learning technique known as SVM, or support vector machine, is
used for regression and classification. The fundamental principle of SVM is to identify the
hyperplane in a high-dimensional space that best discriminates between the various classes
of data.

SVM, to put it simply, seeks to identify a decision boundary that maximizes the margin
between the two classes of data. The margin is the separation of each class's nearest data
points, known as support vectors, from the hyperplane.

The use of various kernel types that translate the input data to a higher-dimensional space
where it may be linearly separated allows SVM to be used for both linearly separable and
non-linearly separable data.

Among the various uses for SVM are bioinformatics, text classification, and image classification. Due to its strong performance and theoretical guarantees, it has been widely employed in both industry and academic studies.
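
A minimal Python sketch, assuming scikit-learn is installed, that trains an SVM with an RBF kernel on a standard non-linearly separable toy dataset:

# SVM with an RBF kernel on non-linearly separable data
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # two interleaving half-circles
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)   # the kernel maps the data to a space where a hyperplane can separate it
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)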

5. KNN: The supervised learning technique known as KNN, or k-Nearest Neighbours, is used
for regression and classification. The fundamental goal of KNN is to categorize a data point
by selecting the class that appears most frequently among the "k" nearest labeled data
points in the feature space.

Simply said, KNN is a lazy learning method that saves all training data points in memory and uses them for classification or regression whenever a new data point is provided, rather than building an explicit model in advance.

The value of "k" indicates how many neighbors should be taken into account for
classification when using KNN, which may be utilized for both classification and regression
issues. A smoother choice boundary will be produced by a bigger value of "k," whereas a
more complicated decision boundary will be produced by a lower value of "k".
There are several uses for KNN, including recommendation systems, text classification, and image classification. Due to its efficacy and simplicity, it has been extensively employed in both academic and industrial research. However, it can be computationally costly when working with big datasets, and it requires careful selection of the value of "k" and of the distance metric used to measure the separation between data points.
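
A minimal Python sketch, assuming scikit-learn is installed, showing how the choice of "k" affects accuracy on a small bundled dataset:

# KNN: effect of the "k" parameter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 15):   # larger k gives a smoother decision boundary
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)   # "training" only stores the data (lazy learning)
    print("k =", k, "test accuracy =", round(knn.score(X_test, y_test), 3))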

6. Naive Bayes: Naive Bayes is a supervised learning method used mainly for classification. It is founded on the Bayes theorem, a result from probability theory that determines the likelihood of a hypothesis in light of the data currently available.

The term "naive" refers to the assumption made by Naive Bayes, which is that the existence
of one feature in a class is unrelated to the presence of any other features in that class. This
presumption makes conditional probability computation easier and increases the algorithm's
computing efficiency.

Naive Bayes utilizes the Bayes theorem to determine the likelihood of each class given a
collection of input characteristics for binary and multi-class classification problems. The
projected class for the input data is then determined by selecting the class with the highest
probability.

Naive Bayes has several uses, including document categorization, sentiment analysis, and
email spam screening. Due to its ease of use, effectiveness, and strong performance across a
wide range of activities, it has received extensive use in both academic research and
industry. However, it may not be effective for complicated problems in which the independence assumption is violated.
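
A minimal Python sketch of Naive Bayes for toy spam screening, assuming scikit-learn is installed; the messages and labels below are made up for illustration:

# Naive Bayes spam screening on toy messages (illustrative data only)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "free money click here",
            "meeting at noon tomorrow", "project report attached"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # word counts as features

nb = MultinomialNB()
nb.fit(X, labels)

test = vectorizer.transform(["claim your free prize"])
print("P(not spam), P(spam):", nb.predict_proba(test)[0])
print("predicted class:", nb.predict(test)[0])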

7. Random Forest: Random Forest is a supervised learning algorithm utilized for regression and classification. It is an ensemble learning technique that combines multiple decision trees to increase the model's robustness and accuracy.

Simply said, Random Forest builds a number of decision trees using randomly chosen
portions of the training data and features, combining the results to provide a final
prediction. The characteristics and data used to construct each decision tree in the Random
Forest are chosen at random, and each tree is trained independently of the others.
Both classification and regression issues may be solved with Random Forest, which is
renowned for its excellent accuracy, resilience, and resistance to overfitting. It may be used
for feature selection and ranking and can handle huge datasets with high dimensionality and
missing values.

There are several uses for Random Forest, including bioinformatics, text classification, and image classification. Due to its strong performance and capacity for handling complicated problems, it has been widely employed in both academic research and industry. However, it may be less effective for problems involving strongly correlated features or class imbalance.
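
A minimal Python sketch, assuming scikit-learn is installed, that trains a Random Forest on a bundled example dataset and reads back its feature importances:

# Random Forest: an ensemble of decision trees
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)   # each tree sees a random subset of rows and features

print("test accuracy:", forest.score(X_test, y_test))
print("largest feature importance:", forest.feature_importances_.max())  # usable for feature ranking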

8. Logistic Regression: For binary classification issues, where the objective is to predict the
likelihood of a binary result (such as Yes/No, True/False, or 1/0), logistic regression is a form
of supervised learning technique. It is a statistical model that converts the result of a linear
regression model into a probability value between 0 and 1. It does this by using the logistic
function.

Simply expressed, logistic functions are used in logistic regression to represent the
connection between the input characteristics and the output probability. Any input value is
converted by the logistic function to a probability value between 0 and 1. Given the input
attributes, this probability number indicates the possibility that the binary result will be 1.

Both basic and difficult issues may be solved using logistic regression, which can handle
input characteristics with both numerical and categorical data. It may be used for feature
selection and ranking since it is computationally efficient and simple to understand.
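
A minimal Python sketch, assuming scikit-learn is installed; the hours-studied data is made up for illustration, and predict_proba applies the logistic function to return a probability between 0 and 1:

# Logistic regression for a binary pass/fail outcome (illustrative data only)
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 1 = passed, 0 = failed

clf = LogisticRegression()
clf.fit(hours, passed)

print("P(fail), P(pass) after 2.2 hours:", clf.predict_proba([[2.2]])[0])
print("predicted class:", clf.predict([[2.2]])[0])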
How to solve a problem in Data Science using Machine learning algorithms?

Now, let's understand the most common types of problems that occur in data science and the approach to solving them. In data science, problems are solved using algorithms, and below is a mapping of possible questions to the algorithms that apply to them:

Is this A or B? :

This refers to problems that have only two fixed answers, such as Yes or No, 1 or 0, may or may not. This type of problem can be solved using classification algorithms.

Is this different? :

This refers to questions where the data follows certain patterns and we need to find the odd one out. Such problems can be solved using anomaly detection algorithms.

How much or how many?


The other type of problem asks for numerical values or figures, such as what the time is now or what the temperature will be today; these can be solved using regression algorithms.

How is this organized?

If you have a problem that deals with how data should be organized, it can be solved using clustering algorithms.

Clustering algorithm organizes and groups the data based on features, colors, or other
common characteristics.

Data Science Lifecycle


The life cycle of data science is explained in the diagram below.

The main phases of the data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first-hypothesis level.

2. Data preparation: Data preparation is also known as Data Munging. In this phase, we need
to perform the following tasks:

o Data cleaning
o Data Reduction

o Data integration

o Data transformation

After performing all the above tasks, we can easily use this data for our further processes.

3. Model Planning: In this phase, we determine the various methods and techniques to establish relationships between the input variables. We apply exploratory data analysis (EDA), using various statistical formulas and visualization tools, to understand the relationships between variables and to see what the data can tell us. Common tools used for model planning are:

o SQL Analysis Services

o R

o SAS

o Python

4. Model-building: In this phase, the process of model building starts. We create datasets for training and testing purposes. We apply techniques such as association, classification, and clustering to build the model.

Following are some common Model building tools:

o SAS Enterprise Miner

o WEKA

o SPSS Modeler

o MATLAB

5. Operationalize: In this phase, we deliver the final reports of the project, along with briefings, code, and technical documents. This phase provides a clear overview of the complete project performance and other components on a small scale before the full deployment.

6. Communicate results: In this phase, we check whether we have reached the goal we set in the initial phase. We communicate the findings and final results to the business team.

Applications of Data Science:


o Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you upload an image on Facebook, you start getting suggestions to tag your friends. This automatic tagging suggestion uses an image recognition algorithm, which is part of data science.
When you say something to "Ok Google", Siri, Cortana, etc., and these devices respond as per the voice command, this is made possible by speech recognition algorithms.

o Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo are widely using data science to enhance the user experience.

o Internet search:
When we want to search for something on the internet, we use different search engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use data science technology to make the search experience better, and you can get search results within a fraction of a second.

o Transport:
The transport industry is also using data science technology to create self-driving cars. With self-driving cars, it will be easier to reduce the number of road accidents.

o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science is
being used for tumor detection, drug discovery, medical image analysis, virtual
medical bots, etc.

o Recommendation systems:
Most companies, such as Amazon, Netflix, Google Play, etc., are using data science technology to provide a better user experience with personalized recommendations. For example, when you search for something on Amazon and start getting suggestions for similar products, this is because of data science technology.

o Risk detection:
Finance industries have always faced issues of fraud and risk of losses, but with the help of data science, these can be reduced.
Most finance companies are looking for data scientists to avoid risk and losses while increasing customer satisfaction.

Data Science Toolbox


Introduction:
In today’s data-driven world, data science has emerged as a critical field that enables
organizations to uncover valuable insights, make informed decisions, and drive innovation.
Data scientists drive this process, leveraging a wide range of tools and techniques to
extract knowledge from data. In this article, we will explore the essential tools and
techniques that every aspiring data scientist should know. By mastering these foundational
elements, you will be well-equipped to tackle real-world data challenges and deliver
impactful results.

1. Programming Languages:

A fundamental tool for every data scientist is a programming language. ‘Python’ and ‘R’ are
two of the most popular languages in the field. Python offers a versatile ecosystem with
libraries such as NumPy, Pandas, and Scikit-learn, making it ideal for data manipulation,
analysis, and machine learning. R, on the other hand, excels in statistical analysis and
visualization, with packages like dplyr and ggplot2. Familiarity with one or both of these
languages will enable you to efficiently handle and process data.

‘Python’, with its clean syntax and extensive community support, has become the go-to
choice for many data scientists. Its rich ecosystem of libraries provides powerful tools for
tasks ranging from data cleaning and preprocessing to model building and deployment. The
simplicity and readability of Python code make it accessible to both beginners and
experienced programmers alike.

‘R’, on the other hand, has a strong foundation in statistical analysis and is highly regarded in
academic and research settings. It provides a wide range of statistical techniques and
packages that are specifically designed for data analysis and visualization. R's syntax and
capabilities make it a powerful tool for exploratory data analysis and statistical modeling.

It's important to note that while Python and R are popular, there are other programming
languages like Julia and Scala that are gaining traction in the data science community. As a
data scientist, being adaptable and open to learning new languages will widen your range of
tools and increase your versatility.
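
As a small illustration of this ecosystem, here is a minimal sketch, assuming pandas and NumPy are installed; the sales records are invented for the example:

# Basic data handling with pandas and NumPy (illustrative data only)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "units":  [10, 7, 12, 5],
    "price":  [2.5, 3.0, 2.5, 4.0],
})

df["revenue"] = df["units"] * df["price"]        # vectorized column arithmetic
print(df.groupby("region")["revenue"].sum())     # aggregate revenue per region
print("mean revenue:", np.mean(df["revenue"]))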

2. Data Visualization:

Data visualization is a powerful technique for communicating insights effectively. Tools like
Matplotlib, Seaborn, Plotly, and Tableau allow data scientists to create visual representations
that aid in understanding complex patterns and trends. Visualizations can simplify complex
concepts, identify outliers, and present data-driven narratives that resonate with
stakeholders. Developing proficiency in data visualization empowers you to tell compelling
stories with data.

Matplotlib, a popular plotting library for Python, provides a flexible framework for creating
static, animated, and interactive visualizations. It offers a wide range of plot types,
customization options, and control over every aspect of the visualization. Seaborn, built on
top of Matplotlib, specializes in statistical graphics and provides a high-level interface for
creating aesthetically pleasing visualizations with minimal code.

Plotly, a powerful data visualization library, offers interactive and dynamic visualizations that
can be embedded in web applications. It allows you to create interactive plots, charts, and
dashboards that enhance the user experience and enable exploration of complex datasets.

Tableau, a widely used data visualization tool, provides a user-friendly interface for creating
visually appealing and interactive dashboards. It offers drag-and-drop functionality, intuitive
design features, and robust data connectivity options. Tableau’s strength lies in its ability to
quickly transform data into actionable insights, making it popular among data analysts and
business users.

By mastering these visualization tools, you can effectively communicate your findings,
engage stakeholders, and drive data-informed decision-making within your organization.
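
As a small illustration, here is a minimal sketch combining Matplotlib and Seaborn, assuming both are installed; it uses the "tips" example dataset that Seaborn fetches on first use:

# A histogram and a statistical scatter plot side by side
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")   # small example dataset bundled with Seaborn

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(tips["total_bill"], bins=20)    # plain Matplotlib histogram
axes[0].set_title("Distribution of total bill")

sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time", ax=axes[1])
axes[1].set_title("Tip vs. total bill")

plt.tight_layout()
plt.savefig("tips_overview.png")   # or plt.show() in an interactive session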

3. Machine Learning Algorithms:

Machine learning algorithms enable data scientists to extract valuable insights and make
predictions from data. Familiarity with a range of algorithms empowers you to select the
most appropriate approach for a given problem and optimize model performance.

a. Supervised Learning: Supervised learning algorithms learn patterns from labeled data to
make predictions or classify new instances. Linear regression, decision trees, random forests,
support vector machines (SVM), and neural networks are common examples of supervised
learning algorithms. Each algorithm has its strengths and is suitable for different types of
problems. For instance, linear regression is used for predicting continuous values, while
decision trees and random forests excel in handling categorical or binary outcomes.

b. Unsupervised Learning: Unsupervised learning algorithms are used when there is no labeled data available. These algorithms discover hidden patterns or groupings within the
data. Clustering algorithms, such as k-means and hierarchical clustering, help identify similar
groups of data points. Dimensionality reduction techniques like principal component
analysis (PCA) and t-SNE are valuable for visualizing high-dimensional data and extracting its
essential features.

c. Deep Learning: Deep learning, a subset of machine learning, focuses on neural networks
with multiple hidden layers. Deep learning algorithms have achieved remarkable success in
various domains, including computer vision, natural language processing, and speech
recognition. Convolutional neural networks (CNN) and recurrent neural networks (RNN) are
widely used architectures in deep learning. They have revolutionized image recognition,
language translation, and sentiment analysis, among other applications.

4. Data Wrangling:
Data rarely comes in a clean and ready-to-use format. Data wrangling involves cleaning,
transforming, and preparing raw data for analysis. Libraries like Pandas in Python and tidyr in
R provide powerful tools for data wrangling tasks, including handling missing values, merging
datasets, and reshaping data structures. Proficiency in data wrangling allows you to handle
messy data efficiently and extract meaningful insights.

Data wrangling is often an iterative and time-consuming process. It requires skills in data
cleaning, data integration, and data transformation. Cleaning involves removing duplicates,
dealing with missing values, and handling outliers. Integration combines data from different
sources or merges multiple datasets. Transformation includes reshaping data, creating new
variables, or aggregating data at different levels of granularity.

By mastering data wrangling techniques, you ensure data quality, enhance the reliability of
your analyses, and lay a solid foundation for further exploration and modeling.
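
A minimal pandas sketch of these wrangling steps, assuming pandas is installed; the orders and customers tables are invented for illustration:

# Cleaning, merging, and aggregating with pandas (illustrative data only)
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [100.0, None, 250.0, 80.0],   # one missing value
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "wholesale", "retail"],
})

orders["amount"] = orders["amount"].fillna(orders["amount"].median())   # handle missing values
merged = orders.merge(customers, on="customer_id", how="left")          # integrate two sources
print(merged.groupby("segment")["amount"].agg(["count", "sum"]))        # aggregate by segment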

5. SQL and Database Systems:

Data is often stored in databases, and SQL (Structured Query Language) is a powerful tool
for querying and manipulating structured data. Understanding SQL and working with
database systems like MySQL, PostgreSQL, or SQLite enables data scientists to extract
relevant information, perform aggregations, and join datasets efficiently. SQL skills are
essential for accessing and manipulating data stored in relational databases.

SQL allows you to perform operations such as selecting specific columns, filtering rows
based on conditions, sorting data, and joining tables to combine information from different
sources. It provides a standardized way to interact with databases and retrieve the data
needed for analysis or modeling tasks.

Furthermore, SQL is not limited to querying databases but also supports creating tables,
modifying data structures, and managing user permissions. This allows data scientists to
handle data engineering tasks, ensuring data is organized, updated, and readily available for
analysis.
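
A minimal sketch of such a query using Python's built-in sqlite3 module, with an invented sales table for illustration:

# Filtering, aggregation, and sorting with SQL via sqlite3
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 120.0), ("South", 80.0), ("North", 60.0)])

cur.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""")
print(cur.fetchall())   # [('North', 180.0), ('South', 80.0)]
conn.close()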

6. Big Data Processing:

In the era of big data, traditional data processing techniques may not suffice. Familiarity with
distributed computing frameworks like Apache Hadoop and Apache Spark is becoming
increasingly important. These tools enable efficient processing and analysis of large-scale
datasets across distributed clusters. Learning to leverage these frameworks equips data
scientists with the ability to handle big data challenges effectively.

Apache Hadoop is an open-source framework that allows distributed storage and processing
of large datasets across clusters of computers. It utilizes a distributed file system called
Hadoop Distributed File System (HDFS) and a processing framework called MapReduce. With
Hadoop, data scientists can parallelize computations and distribute data across multiple
nodes, enabling the processing of massive datasets in a scalable and fault-tolerant manner.
Apache Spark is a fast and general-purpose distributed computing system. It provides an in-
memory computing engine that allows data scientists to perform iterative computations and
interactive data analysis at a much faster pace compared to traditional disk-based systems.
Spark supports various programming languages, including Python and Scala, and offers high-
level APIs for data manipulation, machine learning, and graph processing.

After mastering these big data processing frameworks, data scientists can efficiently handle
large volumes of data, perform complex computations, and extract insights that were
previously unattainable with traditional tools.
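
A minimal PySpark sketch, assuming pyspark is installed and a Spark runtime is available; the tiny in-memory DataFrame stands in for what would normally be a large distributed dataset read with spark.read:

# Distributed-style aggregation with the Spark DataFrame API
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("toy-aggregation").getOrCreate()

df = spark.createDataFrame(
    [("North", 120.0), ("South", 80.0), ("North", 60.0)],
    ["region", "amount"],
)

df.groupBy("region").agg(F.sum("amount").alias("total")).show()
spark.stop()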

7. Version Control:

Version control systems like Git provide a structured and collaborative approach to managing
code and project files. Data scientists often work in teams and need to track changes,
collaborate seamlessly, and maintain a history of their work. By adopting version control
practices, you can effectively manage code, experiment with different approaches, and
ensure the reproducibility of your analyses.

Git allows you to track changes, create branches for experimentation, and merge different
versions of your code. It enables collaboration by allowing multiple contributors to work on
the same project simultaneously and provides mechanisms to resolve conflicts that may
arise during the development process. Moreover, Git integrates well with platforms like
GitHub and GitLab, providing additional features like issue tracking, code reviews, and
project management tools.

Version control not only facilitates collaboration but also ensures the integrity of your work.
By keeping a history of changes, you can revert to previous versions if needed, trace the
evolution of your analyses, and maintain a well-documented and organized workflow.

Conclusion:

Becoming a proficient data scientist requires mastering a diverse set of tools and techniques.
The data science toolbox encompasses programming languages, data visualization tools,
machine learning algorithms, data wrangling skills, SQL and database systems, big data
processing frameworks, and version control systems. By investing time and effort into
developing these foundational skills, you’ll be well-equipped to navigate the complex
landscape of data science and contribute valuable insights to organizations.

Turning Data Into Actionable Insights


Today, businesses are generating more data than ever before. By 2025, global data creation is projected to grow to more than 180 zettabytes.

With a plethora of data collection points at their disposal, it's easy for business leaders to get lost in the information glut.

However, every business and business leader should ponder for a moment: will collecting
such a huge volume of data make sense unless it provides meaningful information and
insights? Big data comes with bigger challenges.

In 2017, The Economist published a story titled, “The world’s most valuable resource is no
longer oil, but data.”

But unfortunately, this story is no longer relevant in today’s data-intensive world. Big data
doesn’t always equate to good data.

Instead, the world’s most valuable resource is the ability to use data to extract meaningful
data insights and leverage untapped potential. If appropriately utilized, those insights can
help an organization:

 Adjust the pricing of products or services

 Gain untapped customer insights

 Increase employees’ efficiency

 Improve strategic decision making

 Reduce costs and expenditure

 Find potential ways to achieve efficiencies

 Achieve a systematic digital transformation

 Manage compliance with laws and regulations

 Build and improve relationships with stakeholders

However, before discussing the process and ways of collecting valuable insights, it’s
important to understand the difference between data and insights.

What Is the Difference Between Data and Insights?

Many people use data, information, and insights interchangeably. However, there’s a vast
difference between these terms.

If you look at these terms from a pyramid point of view, data sits at the foundation,
information occupies the middle part, and insight is positioned at the pinnacle.
Data: Raw and unprocessed facts in the form of numbers, text, images, audio or video files, etc., which primarily exist in various formats and systems. On its own, data neither makes sense nor provides valuable inputs to a business.

Information: Information can also be called “data processed, aggregated, and organized into
a more human-friendly format.” It provides more context but is still not ready to inform
business decisions.

Insights: Insights are generated by analyzing information and drawing conclusions. This step
can make or break an organization’s ability to understand its data better and leverage it to
maximize profitability, reduce cost, and create value for shareholders.

If you look around, all successful companies like Coca-Cola, Netflix, Google, Spotify, etc.,
leverage insights to enhance the customer experience and increase their revenue.

In a nutshell, data is the input for extracting relevant information, and then information
becomes the input to obtain meaningful insights.

How Do I Use Data To Make Decisions?

As discussed above, data on its own can’t influence business decisions. It has to be first
processed and organized in a more human-friendly format and then converted into
actionable insights.

For instance, a company receives hundreds or even thousands of invoices from its vendors each month. Those invoices are recorded in the accounting system, which results in the generation of big data.

However, such data is of little use in decision making until it is processed further and
actionable insights are drawn from it. Unprocessed data may be limited by severe data
quality issues such as:

1. Duplicate data – Since most organizations collect data from all directions and
systems, it may result in duplication and overlap in these sources. A duplicate invoice
may lead to a duplicate payment.

2. Inaccurate data – Human errors, data drift, and data decay can lead to a loss of data
integrity. Inaccurate data recording may delay payments which can adversely affect
an organization’s relationship with its vendors.

3. Ambiguous data – Inconsistency in data formats can introduce multiple flaws in reporting and analysis. For instance, phone numbers may be stored in different
formats like 9999999999, +1 9999999999, 999-999-9999, or 99999 99999. Even the
address may be recorded without following the USPS norms, causing problems in
data processing.
Apart from these issues, organizations also face problems with unstructured data, invalid
data, data redundancy, and data transformation errors.

It’s almost impossible for business leaders to make decisions based on myriad data sources
until this procurement big data is converted into relevant information and then actionable
insights.

Only then can leaders uncover hidden patterns and trends and obtain necessary inputs to
make informed business decisions.

Organizations often overlook the accounts payable department while trying to make better
use of data.

AP manages critical financial data that can provide valuable insights for discovering potential
savings.

Organizations can use AP data to optimize their cash flow, create better and deeper
relationships with suppliers, and understand trends in the payment data.

Before we discuss further how an organization can make better sense of AP data, let’s
understand how raw data can be transformed into actionable insights.

How Do I Turn Data Into Actionable Insights?

Here is a 5-step process that can help you convert raw data into actionable insights:

1. Set Clear End Business Goals and Objectives

It’s critical for an organization to keep an eye on the prize — the end goals to be achieved
from data analytics should be clearly outlined.

The goals should align with the company’s strategic priorities. It’s easy to deviate towards
vanity metrics that sound impressive and look good on paper, but in reality, don’t add value
to a business.

A useful framework for setting goals and KPIs is to be SMART – Specific, Measurable,
Attainable, Realistic, and Timely.

2. Ask the Right Questions

Once end goals have been identified, the next step is to figure out key information needed
for informed business decisions.

You can ask yourself the following questions:

 What are the key drivers of revenue, expenses, and risks in the targeted business
area?

 Which channels drive the most conversions?

 How will specific insights impact the operations and add to the bottom line?
 Who will consume these insights? What actions do they want to take based on these
insights?

Every user will have different expectations from the data analytics activity. C-suite executives
may focus on the big financial picture, while managers may be more interested in collecting
insights that improve management practices.

Similarly, executives may want to collect operational insights. Make sure all users' requirements are considered ahead of time.

3. Transform the Data

This step is the most critical step when converting data points into insights. Mostly, data is
stored in disparate systems with varying degrees of accuracy across sources.

Hence, it becomes important for the organization to collect, combine, and collate this data
into a single data model.

Also, the organization has to handle and eliminate common data handling and transformation challenges such as missing values, different output formats, and varying levels of granularity.

Here, pattern recognition also plays a key role. Not all patterns will be relevant or crucial.
Each pattern should be reviewed and moved forward only if it answers necessary questions.

Segmentation is also necessary since it allows you to group data based on common
attributes and then process it further.

4. Apply Visual Analytics

Once data has been collected, collated, and cross-examined for accuracy and cleanliness, the
next step is to set up visual analytics.

It helps an organization go beyond traditional spreadsheets and uncover hidden patterns and trends.

Visual analytics present information in a highly graphical, interactive, and visual format
through interactive dashboards, reports, summaries, graphs, charts, and maps.

Result?

Critical data is displayed in meaningful, insightful ways to help business leaders make informed business decisions such as forecasting, planning, analysis, risk management, strategic sourcing, operational complexity reduction, and anti-fraud monitoring, to name a few.

5. Translate Information Into Insights

The final step is to derive the required information to make better strategic decisions and
generate more value from data.
The insights collected from the entire 5-step process can help the organization manage and enhance profitability, maximize prosperity, and transform risk into value across the board.

This 5-step process is not a law that has to be followed as it is. The main objective behind
converting data into insights is to present it in an easy-to-understand, simple, and visual
language.

By using accurate data, you can craft a meaningful narrative.

How Do I Present Data in an Actionable Way?

Over the last few years, businesses have understood the importance of being more agile and
proactive in getting access to real-time data and insights.

Instead of relying on manual systems where a finance team pulls data from multiple spreadsheets, crunches numbers, and sends reports to stakeholders and executives, organizations need to adopt an automated system with robust analytics capabilities.

Ideally, it will be a centralized system that captures data in a systematic, standardized format
every time. For example, a centralized Procure-to-Pay software like PLANERGY.

The automated system should have the ability to use various visual formats to provide
actionable insights that help business leaders make informed business decisions.

Data on its own can seem like an alien language to people outside of the analytics team. This
is where data visualization can take raw data and turn it into easily interpretable insights.

A few common data visualization techniques include pie charts, bar charts, histograms,
Gantt charts, heat maps, Waterfall charts, etc.

Which Tools Are Available To Help Convert Data Into Insights?

An organization has a variety of data visualization tools at its disposal – Power BI, Google
Charts, Tableau, etc.

A good data analytics and visualization software is fully customizable and can be embedded
right into the core product or ERP.

You can pull data from multiple sources into a standalone data visualization tool, but having the tools in the various areas of your business processes already equipped with data visualization can be even better.

You see the relevant data as you are making decisions in the application.

For example, PLANERGY Spend Analysis software offers a powerful real-time business
intelligence software for spend data, equipped with data visualization features such as
reports and customizable dashboards.

You can track every purchase in PLANERGY to power your reporting insights, drill down and
uncover hidden patterns to realize savings of up to 15%, and integrate with almost all ERPs.
Besides, a good visualization tool gives you an option to build custom business intelligence
reports according to your requirements.

All reports are fully filterable which allows you to drill down and see hidden details.

Bottom Line

By adopting automated systems that convert raw data into insights, organizations can manage their finances more effectively, make better decisions, and earn the best ROI on capital.

Combining data processing with machine learning makes the system more intelligent and
capable of handling complex data points.

Business leaders have steadily recognized the need to transform their data into actionable insights, and they are finding the right tools to capture data accurately, to provide information in a way that lets them do a deep dive when needed, and to deliver the right data at the right time and in the right format to aid decision making.

Introduction to the tools that will be used in building data analysis software:

Version Control Systems

What is a “version control system”?

Version control systems are a category of software tools that help in recording changes made to files by keeping track of modifications made to the code.

Why is a Version Control System so Important?

As we know, a software product is developed in collaboration by a group of developers. They might be located in different places, and each of them contributes some specific kind of functionality or feature. In order to contribute to the product, they make modifications to the source code (either by adding or removing code). A version control system is a kind of software that helps the developer team efficiently communicate and manage (track) all the changes that have been made to the source code, along with information such as who made which changes. A separate branch is created for every contributor who makes changes, and the changes aren't merged into the original source code until they have been analyzed; as soon as the changes are given the green signal, they are merged into the main source code. It not only keeps the source code organized but also improves productivity by making the development process smooth.
Basically Version control system keeps track on changes made on a particular software and
take a snapshot of every modification. Let’s suppose if a team of developer add some new
functionalities in an application and the updated version is not working properly so as the
version control system keeps track of our work so with the help of version control system we
can omit the new changes and continue with the previous version.

Benefits of the version control system:

 Enhances project development speed by enabling efficient collaboration.

 Improves productivity, expedites product delivery, and develops the skills of
employees through better communication and assistance.

 Reduces the possibility of errors and conflicts during project development through
traceability of every small change.

 Employees or contributors to the project can contribute from anywhere, irrespective
of their geographical location.

 For each contributor to the project, a separate working copy is maintained and is not
merged into the main file until the working copy is validated. The most popular
examples are Git, Helix Core, and Microsoft TFS.

 Helps in recovery in case of any disaster or contingent situation.

 Informs us about who, what, when, and why changes have been made.

Use of Version Control System:

 A repository: It can be thought of as a database of changes. It contains all the edits
and historical versions (snapshots) of the project.

 Working copy (sometimes called a checkout): It is your personal copy of all the files
in a project. You can edit this copy without affecting the work of others, and you can
commit your changes to the repository when you are done making them.

 Working in a group: Consider yourself working in a company where you are asked to
work on a live project. You can't change the main code directly because it is in
production and any change may cause inconvenience to users; you are also working
in a team, so you need to collaborate and adapt to their changes. Version control
helps with this by merging different requests into the main repository without
introducing any undesirable changes. You can test new functionality without putting
it live, and you don't need to download and set everything up each time: just pull the
changes, make your edits, test them, and merge them back. A minimal command-line
sketch of this cycle is shown below.
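For example, with Git (one widely used version control system), the repository / working copy / commit cycle looks roughly like this; the repository URL and file name below are placeholders used only for illustration:

# Get your own working copy of the project (the URL is a placeholder)
git clone https://example.com/team/project.git
cd project

# Edit files in your working copy, then record a snapshot of the change
git add report.R
git commit -m "Add monthly spend report script"

# Share the recorded change with the rest of the team
git push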
Types of Version Control Systems:

 Local Version Control Systems

 Centralized Version Control Systems

 Distributed Version Control Systems

Local Version Control Systems: It is one of the simplest forms and has a database
that keeps all the changes to files under revision control. RCS is one of the most
common VCS tools of this kind. It keeps patch sets (the differences between files) in a
special format on disk. By adding up all the patches, it can then re-create what any
file looked like at any point in time.

Centralized Version Control Systems: Centralized version control systems contain just
one repository globally, and every user needs to commit to reflect their changes in
the repository. It is possible for others to see your changes by updating.

Two things are required to make your changes visible to others which are:

 You commit

 They update
The benefit of a CVCS (Centralized Version Control System) is that it enables collaboration
amongst developers and provides insight, to a certain extent, into what everyone else is doing
on the project. It also allows administrators fine-grained control over who can do what.

It has some downsides as well, which led to the development of DVCS. The most obvious is
the single point of failure that the centralized repository represents: if it goes down,
collaboration and saving versioned changes are not possible during that period. And if the
hard disk of the central database becomes corrupted and proper backups haven't been kept,
you lose absolutely everything.

Distributed Version Control Systems: Distributed version control systems contain multiple
repositories. Each user has their own repository and working copy. Just committing your
changes will not give others access to your changes. This is because commit will reflect those
changes in your local repository and you need to push them in order to make them visible
on the central repository. Similarly, when you update, you do not get others’ changes unless
you have first pulled those changes into your repository.

To make your changes visible to others, 4 things are required:

 You commit

 You push

 They pull

 They update

The most popular distributed version control systems are Git and Mercurial. They help us
overcome the problem of a single point of failure.
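As a rough sketch, the four steps above map onto the following Git commands (origin and main are simply Git's usual default remote and branch names):

# You commit: record the change in your local repository
git commit -m "Describe the change"

# You push: publish it to the shared (central) repository
git push origin main

# They pull / they update: a collaborator fetches and integrates your change
git pull origin main

Note that git pull combines the fetching and updating steps into a single command.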
Purpose of Version Control:

 Multiple people can work simultaneously on a single project. Everyone works on and
edits their own copy of the files, and it is up to them when they wish to share their
changes with the rest of the team.

 It also enables one person to use multiple computers to work on a project, so it is
valuable even if you are working by yourself.

 It integrates the work that is done simultaneously by different members of the team.
In some rare cases, when conflicting edits are made by two people to the same line
of a file, the version control system asks for human assistance in deciding what
should be done.

 Version control provides access to the historical versions of a project. This is
insurance against computer crashes or data loss. If any mistake is made, you can
easily roll back to a previous version. It is also possible to undo specific edits without
losing the work done in the meantime. It can easily be known when, why, and by
whom any part of a file was edited.

Markdown
What is Markdown?
Markdown is a lightweight markup language that you can use to add formatting elements to
plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the
world’s most popular markup languages.

Using Markdown is different than using a WYSIWYG editor. In an application like Microsoft
Word, you click buttons to format words and phrases, and the changes are visible
immediately. Markdown isn’t like that. When you create a Markdown-formatted file, you
add Markdown syntax to the text to indicate which words and phrases should look different.

For example, to denote a heading, you add a number sign before it (e.g., # Heading One). Or
to make a phrase bold, you add two asterisks before and after it (e.g., **this text is bold**).
It may take a while to get used to seeing Markdown syntax in your text, especially if you’re
accustomed to WYSIWYG applications, but as the short example below shows, the syntax
stays readable even in a plain text editor such as Visual Studio Code.
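For instance, the following short piece of Markdown covers a heading, emphasis, a list, and a link; any Markdown application will render it to the equivalent HTML (the link target is just an example):

# Weekly Notes

Some text with **bold** and *italic* words.

- First item
- Second item

[Example link](https://example.com)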

You can add Markdown formatting elements to a plaintext file using a text editor application.
Or you can use one of the many Markdown applications for macOS, Windows, Linux, iOS,
and Android operating systems. There are also several web-based applications specifically
designed for writing in Markdown.

Depending on the application you use, you may not be able to preview the formatted
document in real time. But that’s okay. According to Gruber, Markdown syntax is designed to
be readable and unobtrusive, so the text in Markdown files can be read even if it isn’t
rendered.

The overriding design goal for Markdown’s formatting syntax is to make it as readable as
possible. The idea is that a Markdown-formatted document should be publishable as-is, as
plain text, without looking like it’s been marked up with tags or formatting instructions.

Why Use Markdown?


You might be wondering why people use Markdown instead of a WYSIWYG editor. Why write
with Markdown when you can press buttons in an interface to format your text? As it turns
out, there are several reasons why people use Markdown instead of WYSIWYG editors.

 Markdown can be used for everything. People use it to create websites, documents,
notes, books, presentations, email messages, and technical documentation.

 Markdown is portable. Files containing Markdown-formatted text can be opened
using virtually any application. If you decide you don’t like the Markdown application
you’re currently using, you can import your Markdown files into another Markdown
application. That’s in stark contrast to word processing applications like Microsoft
Word that lock your content into a proprietary file format.

 Markdown is platform independent. You can create Markdown-formatted text on any
device running any operating system.

 Markdown is future proof. Even if the application you’re using stops working at some
point in the future, you’ll still be able to read your Markdown-formatted text using a
text editing application. This is an important consideration when it comes to books,
university theses, and other milestone documents that need to be preserved
indefinitely.

 Markdown is everywhere. Websites like Reddit and GitHub support Markdown, and
lots of desktop and web-based applications support it.

Kicking the Tires

The best way to get started with Markdown is to use it. That’s easier than ever before thanks
to a variety of free tools.

You don’t even need to download anything. There are several online Markdown editors that
you can use to try writing in Markdown. Dillinger is one of the best online Markdown
editors. Just open the site and start typing in the left pane. A preview of the rendered
document appears in the right pane.
You’ll probably want to keep the Dillinger website open as you read through this guide. That
way you can try the syntax as you learn about it. After you’ve become familiar with
Markdown, you may want to use a Markdown application that can be installed on your
desktop computer or mobile device.

How Does it Work?

Dillinger makes writing in Markdown easy because it hides the stuff happening behind the
scenes, but it’s worth exploring how the process works in general.

When you write in Markdown, the text is stored in a plaintext file that has
an .md or .markdown extension. But then what? How is your Markdown-formatted file
converted into HTML or a print-ready document?

The short answer is that you need a Markdown application capable of processing the
Markdown file. There are lots of applications available — everything from simple scripts to
desktop applications that look like Microsoft Word. Despite their visual differences, all of the
applications do the same thing. Like Dillinger, they all convert Markdown-formatted text to
HTML so it can be displayed in web browsers.

Markdown applications use something called a Markdown processor (also commonly
referred to as a “parser” or an “implementation”) to take the Markdown-formatted text and
output it to HTML format. At that point, your document can be viewed in a web browser or
combined with a style sheet and printed.

Note: The Markdown application and processor are two separate components, even though
for the sake of brevity they are often lumped together and simply called a “Markdown app”.

To summarize, this is a four-part process:

1. Create a Markdown file using a text editor or a dedicated Markdown application. The
file should have an .md or .markdown extension.

2. Open the Markdown file in a Markdown application.

3. Use the Markdown application to convert the Markdown file to an HTML document.

4. View the HTML file in a web browser or use the Markdown application to convert it
to another file format, like PDF.
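As a concrete illustration of steps 3 and 4, a command-line converter such as Pandoc (one of many Markdown processors, assuming it is installed; notes.md is a placeholder file name) does the conversion in a single command:

pandoc notes.md -o notes.html   # Markdown to HTML
pandoc notes.md -o notes.pdf    # Markdown straight to PDF, if a PDF engine is installed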

From your perspective, the process will vary somewhat depending on the application you
use. For example, Dillinger essentially combines steps 1-3 into a single, seamless interface —
all you have to do is type in the left pane and the rendered output magically appears in the
right pane. But if you use other tools, like a text editor with a static website generator, you’ll
find that the process is much more visible.

What’s Markdown Good For?

Markdown is a fast and easy way to take notes, create content for a website, and produce
print-ready documents.

It doesn’t take long to learn the Markdown syntax, and once you know how to use it, you
can write using Markdown just about everywhere. Most people use Markdown to create
content for the web, but Markdown is good for formatting everything from email messages
to grocery lists.
Git
Branching and Merging
The Git feature that really makes it stand apart from nearly every other SCM out there is its
branching model.

Git allows and encourages you to have multiple local branches that can be entirely
independent of each other. The creation, merging, and deletion of those lines of
development takes seconds.

This means that you can do things like:

 Frictionless Context Switching. Create a branch to try out an idea, commit a few
times, switch back to where you branched from, apply a patch, switch back to where
you are experimenting, and merge it in.

 Role-Based Codelines. Have a branch that always contains only what goes to
production, another that you merge work into for testing, and several smaller ones
for day to day work.

 Feature Based Workflow. Create new branches for each new feature you're working
on so you can seamlessly switch back and forth between them, then delete each
branch when that feature gets merged into your main line.

 Disposable Experimentation. Create a branch to experiment in, realize it's not going
to work, and just delete it, abandoning the work with nobody else ever seeing it
(even if you've pushed other branches in the meantime).

Notably, when you push to a remote repository, you do not have to push all of your
branches. You can choose to share just one of your branches, a few of them, or all of them.
This tends to free people to try new ideas without worrying about having to plan how and
when they are going to merge it in or share it with others.

There are ways to accomplish some of this with other systems, but the work involved is
much more difficult and error-prone. Git makes this process incredibly easy and it changes
the way most developers work when they learn it.
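A minimal sketch of that branch-and-merge cycle in Git commands (the branch and file names are placeholders):

# Create and switch to a new branch for an experiment
git checkout -b idea-x

# ...edit files, then record the work on that branch
git add analysis.R
git commit -m "Try out idea X"

# Switch back to the main line and merge the branch in,
# or simply delete it if the experiment didn't work out
git checkout main
git merge idea-x
git branch -d idea-x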

Small and Fast

Git is fast. With Git, nearly all operations are performed locally, giving it a huge speed
advantage over centralized systems that constantly have to communicate with a server
somewhere.

Git was built to work on the Linux kernel, meaning that it has had to handle large
repositories effectively from day one. Git is written in C, reducing the overhead of runtimes
associated with higher-level languages. Speed and performance have been a primary design
goal of Git from the start.

Distributed

One of the nicest features of any Distributed SCM, Git included, is that it's distributed. This
means that instead of doing a "checkout" of the current tip of the source code, you do a
"clone" of the entire repository.

Multiple Backups

This means that even if you're using a centralized workflow, every user essentially has a full
backup of the main server. Each of these copies could be pushed up to replace the main
server in the event of a crash or corruption. In effect, there is no single point of failure with
Git unless there is only a single copy of the repository.

Any Workflow

Because of Git's distributed nature and superb branching system, an almost endless number
of workflows can be implemented with relative ease.

Subversion-Style Workflow

A centralized workflow is very common, especially among people transitioning from a
centralized system. Git will not allow you to push if someone else has pushed since the last
time you fetched, so a centralized model where all developers push to the same server works
just fine.
Integration Manager Workflow

Another common Git workflow involves an integration manager — a single person who
commits to the 'blessed' repository. A number of developers then clone from that
repository, push to their own independent repositories, and ask the integrator to pull in their
changes. This is the type of development model often seen with open source or GitHub
repositories.

Dictator and Lieutenants Workflow

For more massive projects, a development workflow like that of the Linux kernel is often
effective. In this model, some people ('lieutenants') are in charge of a specific subsystem of
the project and they merge in all changes related to that subsystem. Another integrator (the
'dictator') can pull changes from only his/her lieutenants and then push to the 'blessed'
repository that everyone then clones from again.
Data Assurance

The data model that Git uses ensures the cryptographic integrity of every bit of your project.
Every file and commit is checksummed and retrieved by its checksum when checked back
out. It's impossible to get anything out of Git other than the exact bits you put in.

It is also impossible to change any file, date, commit message, or any other data in a Git
repository without changing the IDs of everything after it. This means that if you have a
commit ID, you can be assured not only that your project is exactly the same as when it was
committed, but that nothing in its history was changed.

Most centralized version control systems provide no such integrity by default.
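You can see these checksums directly; for example, the following commands, run inside any Git repository, print the ID of the latest commit and verify the integrity of the object database:

# Show the checksum (commit ID) of the most recent commit
git log -1 --format=%H

# Check the connectivity and validity of the objects in the repository
git fsck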

Staging Area

Unlike the other systems, Git has something called the "staging area" or "index". This is an
intermediate area where commits can be formatted and reviewed before completing the
commit.

One thing that sets Git apart from other tools is that it's possible to quickly stage some of
your files and commit them without committing all of the other modified files in your
working directory or having to list them on the command line during the commit.
This allows you to stage only portions of a modified file. Gone are the days of making two
logically unrelated modifications to a file before you realized that you forgot to commit one
of them. Now you can just stage the change you need for the current commit and stage the
other change for the next commit. This feature scales up to as many different changes to
your file as needed.
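A short sketch of how partial staging looks in practice (the file names are placeholders):

# Stage one whole file, and only selected hunks of another
git add report.R
git add -p cleanup.R        # interactively choose which changes to stage

# Commit only what is staged; the remaining edits stay in the working directory
git commit -m "Fix report generation"

# Later, stage and commit the other change separately
git add cleanup.R
git commit -m "Tidy up data-cleaning helper"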

Free and Open Source

Git is released under the GNU General Public License version 2.0, which is an open source
license. The Git project chose to use GPLv2 to guarantee your freedom to share and change
free software, and to make sure the software remains free for all its users.

GitHub

What is GitHub?

GitHub is a web-based version control and collaboration platform for software developers.
Microsoft, the biggest single contributor to GitHub, acquired the platform for $7.5 billion in
2018. GitHub, which is delivered through a software as a service (SaaS) business model, was
started in 2008. It was founded on Git, the open source code management system created by
Linus Torvalds to support development of the Linux kernel.

Git is used to store the source code for a project and track the complete history of all
changes to that code. It lets developers collaborate on a project more effectively by
providing tools for managing possibly conflicting changes from multiple developers.

GitHub allows developers to change, adapt and improve software from its public repositories
for free, while private repositories are offered as part of various plans. Each public and
private repository contains all of a project's files, as well as each file's revision history.
Repositories can have multiple collaborators and owners.

How does GitHub work?

GitHub facilitates social coding by providing a hosting service and web interface for the Git
code repository, as well as management tools for collaboration. The developer platform can
be thought of as a social networking site for software developers. Members can follow each
other, rate each other's work, receive updates for specific open source projects, and
communicate publicly or privately.

The following are some important terms GitHub developers use:

 Fork. A fork is a copy of a repository made from one member's account into another
member's account. Forks, like branches, let a developer make modifications without
affecting the original code.

 Pull request. If a developer would like to share their modifications, they can send a
pull request to the owner of the original repository.

 Merge. If, after reviewing the modifications, the original owner would like to pull the
modifications into the repository, they can accept the modifications and merge them
with the original repository.

 Push. This is the reverse of a pull -- a programmer sends code from a local copy to
the online repository.

 Commit. A commit, or code revision, is an individual change to a file or set of files. By
default, commits are retained and interleaved onto the main project, or they can be
combined into a simpler merge via commit squashing. A unique ID is created when
each commit is saved that lets collaborators keep a record of their work. A commit
can be thought of as a snapshot of a repository.

 Clone. A clone is a local copy of a repository.
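A rough sketch of how these terms fit together on the command line; the user and repository names are placeholders, and the pull request itself is opened on the GitHub website or, optionally, with the GitHub CLI:

# Clone your fork to get a local copy
git clone https://github.com/your-user/project.git
cd project

# Commit a change locally, then push it to your fork on GitHub
git commit -am "Improve data cleaning step"
git push origin main

# Finally, open a pull request against the original repository,
# either on github.com or with the optional GitHub CLI:
gh pr create --title "Improve data cleaning step"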

Benefits and features of GitHub

GitHub facilitates collaboration among developers. It also provides distributed version
control. Teams of developers can work together in a centralized Git repository and track
changes as they go to stay organized.

GitHub offers an on-premises version in addition to the well-known SaaS product. GitHub
Enterprise supports integrated development environments and continuous integration tools,
as well as many third-party apps and services. It offers more security and auditability than
the SaaS version.

Other products and features of note include the following:

 GitHub Gist. Users share pieces of code or other notes.

 GitHub Flow. A lightweight, branch-based workflow for regularly updated
deployments.

 GitHub Pages. Static webpages to host a project, pulling information directly from an
individual's or organization's GitHub repository.
 GitHub Desktop. Users can access GitHub from Windows or Mac desktops, rather
than going to GitHub's website.

 GitHub Student Developer Pack. A free offering of developer tools for students. It
includes cloud resources, programming tools and support, and GitHub access.

 GitHub Campus Experts. A program students can use to become leaders at their
schools and develop technical communities there.

 GitHub CLI. A free, open source command-line tool that brings GitHub features, such
as pull requests, to a user's local terminal. This capability eliminates the need to
switch contexts when coding, streamlining workflows.

 GitHub Codespaces. A cloud-based development environment that gives users
access to common programming languages and tools. The coding environment runs
in a container and gives users a certain amount of free time before switching to a
paid pricing model.

GitHub use cases

GitHub is used to store, track and collaborate on software projects in a number of different
contexts:

 Businesses use GitHub as a version control system, letting development team
members track changes to source code as developers collaborate on it. This lets
different coders work on a project simultaneously and ensures everyone is working
on the latest version of the code, simplifying project management. It also allows for
previous versions to be called upon should developers need to reference them.
GitHub enables code sharing among developers because code is stored in a central
location. GitHub Enterprise also helps with regulatory compliance because it is a
standardized way to store code.

 Programming instructors and students make use of GitHub in several ways. The
Student Developer Pack gives teachers and students an array of low-cost resources.
Students use the platform to learn web development, work on creative development
projects and host virtual events.

 Open source software developers use GitHub to share projects with individuals who
want to use their software or collaborate on it. Developers network, collaborate and
pitch their work to other developers in real time, catching errors in proposed code
before changes are finalized. These collaboration and networking capabilities are why
GitHub is classified as a social media site; it often links to other community sites such
as Reddit in the repository notes. Users also can download applications from GitHub.

 Nonprogrammers also use GitHub to work on document-based and multimedia
projects. The platform is intuitive to use, and its version control tools are useful for
collaboration. For example, The Art of the Command Line is a comprehensive
guide to the command line. Samplebrain is an experimental music production tool by
electronic musician Aphex Twin. And the Open Source Cookbook is a collection
of food recipes.

Getting started on GitHub

To sign up for GitHub and create a repository, new users and beginners follow these steps:

 Learn about the command line. The command line is how users interact with
GitHub. The ability to use it is a prerequisite for working with GitHub; tutorials and
other tools are available to help with this process. An alternative is the GitHub
Desktop client.

 Install Git. Git can be installed for free using instructions on the Git website. Installing
GitHub Desktop will also install a command-line version of Git. Git comes installed by
default on many Mac and Linux machines.

 Create an account. Go to GitHub's website and create a GitHub account using an
email address.

 Create a new repository. Go to the GitHub homepage, click the + sign and then
click New repository. Name the repository and provide a brief description when
prompted. Add a README file, .gitignore template and project license. Then scroll to
the bottom of the page and click Create repository.
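Once the repository exists, a typical next step is to copy it to your computer and push a first change; the user, repository, and file names below are placeholders:

# Copy the new repository from GitHub to your computer
git clone https://github.com/your-user/your-repo.git
cd your-repo

# Make a change, record it, and push it back to GitHub
git add notes.md
git commit -m "Add first notes"
git push origin main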

R
What is R?
Introduction to R

R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R
can be considered as a different implementation of S. There are some important differences,
but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, …) and graphical techniques, and is
highly extensible. The S language is often the vehicle of choice for research in statistical
methodology, and R provides an Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has
been taken over the defaults for the minor design choices in graphics, but the user retains
full control.
R is available as Free Software under the terms of the Free Software Foundation’s GNU
General Public License in source code form. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes

 an effective data handling and storage facility,

 a suite of operators for calculations on arrays, in particular matrices,

 a large, coherent, integrated collection of intermediate tools for data analysis,

 graphical facilities for data analysis and display either on-screen or on hardcopy, and

 a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.

The term “environment” is intended to characterize it as a fully planned and coherent
system, rather than an incremental accretion of very specific and inflexible tools, as is
frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. Much of the system is itself written in the R dialect
of S, which makes it easy for users to follow the algorithmic choices made. For
computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run
time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an environment
within which statistical techniques are implemented. R can be extended (easily)
via packages. There are about eight packages supplied with the R distribution and many
more are available through the CRAN family of Internet sites covering a very wide range of
modern statistics.

R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.
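A minimal illustration of the kind of interactive analysis R is designed for; the numbers below are made up purely for the example:

# A small, made-up data set: advertising spend and resulting sales
spend <- c(10, 15, 20, 25, 30, 35)
sales <- c(25, 31, 41, 48, 55, 60)

# Descriptive statistics and correlation
summary(sales)
cor(spend, sales)

# Fit a simple linear model and inspect it
model <- lm(sales ~ spend)
summary(model)

# Publication-quality graphics with a couple of calls
plot(spend, sales, main = "Sales vs. advertising spend",
     xlab = "Spend", ylab = "Sales")
abline(model)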

RStudio IDE (or RStudio) is an integrated development environment for R, a programming
language for statistical computing and graphics. It is available in two formats: RStudio
Desktop is a regular desktop application while RStudio Server runs on a remote server and
allows accessing RStudio using a web browser. The RStudio IDE is a product of Posit
PBC (formerly RStudio PBC, formerly RStudio Inc.).

The RStudio IDE is developed by Posit, PBC, a public-benefit corporation founded by J. J.
Allaire, creator of the programming language ColdFusion. Posit has no formal connection
to the R Foundation, a not-for-profit organization located in Vienna, Austria, which is
responsible for overseeing development of the R environment for statistical computing. Posit
was formerly known as RStudio Inc. In July 2022, it announced that it changed its name to
Posit, to signify its broadening exploration towards other programming languages such
as Python.

What's the Difference Between R & RStudio?

R the application is installed on your computer and uses your personal computer's resources
to run code written in the R programming language. RStudio integrates with R as an IDE
(Integrated Development Environment) to provide further functionality. RStudio combines a
source code editor, build automation tools and a debugger.

We recommend you install both R and RStudio on your personal computer.
