Linear Algebra - A Powerful Tool For Data Science
Research Article
Analysis of data is an important task in data management systems, and many mathematical tools
are used in data analysis. Machine learning has emerged as a new division of data management,
and linear algebra is an optimal tool to analyse and manipulate its data. Data science is a
multidisciplinary subject that uses scientific methods, suitable algorithms, and systems to
extract knowledge from structured and unstructured data. The strength of linear algebra is often
overlooked by researchers because it is poorly understood, yet it powers major areas of data
science, including the hot fields of Natural Language Processing and Computer Vision. Data
science enthusiasts often find it easier to analyze big data with programming languages than
with mathematical tools like linear algebra. Even so, linear algebra is a must-know subject in
data science: it opens up possibilities for working with and manipulating data. In this paper,
some applications of linear algebra in data science are explained.
INTRODUCTION
Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful insights from data. Data science
practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more
to produce artificial intelligence systems that perform tasks that ordinarily require human
intelligence. In turn, these systems generate insights which analysts and business users can
translate into tangible business value (Ambrust et al., 2010). Machine learning is the branch of
data science used to design algorithms that automatically extract valuable information from
data. The focus here is on “automatic”, i.e., machine learning is concerned with general-purpose
methodologies that can be applied to many datasets while producing something that is meaningful
(Kakhani et al., 2015; Philip et al., 2014).

Linear algebra is the branch of mathematics concerning linear equations, linear functions and
their representations through matrices and vector spaces. It helps us to understand geometric
terms in higher dimensions and to perform mathematical operations on them. By definition,
algebra deals primarily with scalars (one-dimensional entities), but linear algebra has vectors
and matrices (entities which possess two or more dimensional components) to deal with linear
equations and functions (Will, 2014).

Linear algebra is at the heart of almost all areas of mathematics, such as geometry and
functional analysis (Hilbert and Lopez, 2011). Its concepts are a crucial prerequisite for
understanding the theory behind data science. A data scientist doesn’t need to understand linear
algebra before getting started in data science, but at some point it becomes necessary to
understand how the different algorithms really work. Linear algebra is used in data science as
follows.

Scalars, Vectors, Matrices and Tensors

• A scalar is a single number
• A vector is an array of numbers
• A matrix is a 2-D array
• A tensor is an n-dimensional array with n>2

Fig.1. Representation of data in data science using linear algebra
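As a rough illustration of the four kinds of object listed above (scalar, vector, matrix,
tensor), the following minimal sketch builds one of each with NumPy. The choice of NumPy and
all values and shapes are assumptions made for illustration; the paper does not prescribe a
library.

```python
import numpy as np

# All values and shapes below are arbitrary, chosen only for illustration.
scalar = np.float64(3.5)                    # a single number (0 dimensions)
vector = np.array([1.0, 2.0, 3.0])          # a 1-D array of numbers
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # a 2-D array: 2 rows, 3 columns
tensor = np.zeros((2, 3, 4))                # an n-D array with n = 3 > 2

for name, obj in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    # np.ndim gives the number of dimensions, np.shape the size along each one
    print(name, np.ndim(obj), np.shape(obj))
```

The number of dimensions reported by `np.ndim` is what separates the four cases: 0, 1, 2, and
more than 2.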
APPLICATIONS OF LINEAR ALGEBRA IN DATA SCIENCES

the origin to the vector if the only permitted directions are parallel to the axes of the space.
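A distance measured from the origin only along directions parallel to the axes is what is
usually called the L1 (Manhattan) norm, while the L2 (Euclidean) norm is the straight-line
distance. The minimal sketch below computes both with NumPy for an example vector; the vector
and the use of `np.linalg.norm` are assumptions chosen for illustration.

```python
import numpy as np

# Example vector, chosen only so the arithmetic is easy to check by hand.
v = np.array([3.0, 4.0])

# L1 norm: sum of absolute components (travel parallel to the axes only).
l1 = np.linalg.norm(v, ord=1)   # |3| + |4|

# L2 norm: straight-line distance from the origin.
l2 = np.linalg.norm(v)          # sqrt(3**2 + 4**2)

print(l1, l2)
```

For this vector the L1 norm is 7 and the L2 norm is 5, so the axis-parallel path is longer than
the direct one.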
Fig 4: Regularization

The L1 and L2 norms we discussed above are used in two types of regularization:
• L1 regularization, used with Lasso Regression
• L2 regularization, used with Ridge Regression

Covariance Matrix

Bivariate analysis is an important step in data exploration to study the relationship between
pairs of variables. Covariance and correlation are measures used to study relationships between
two continuous variables.

Covariance indicates the direction of the linear relationship between the variables. A positive
covariance indicates that an increase or decrease in one variable is accompanied by the same in
the other. A negative covariance indicates that an increase or decrease in one is accompanied by
the opposite in the other.

Support Vector Machine Classification

The support vector machine is one of the most common classification algorithms and regularly
produces remarkable results. It is an application of the concept of Vector Spaces in Linear
Algebra. A Support Vector Machine, or SVM, is a discriminative classifier that works by finding
a decision surface. It is a supervised machine learning algorithm. In this algorithm, we plot
each data item as a point in an n-dimensional space (where n is the number of features), with
the value of each feature being the value of a particular coordinate. Then, we perform
classification by finding the hyperplane that differentiates the two classes very well, i.e.
with the maximum margin, which is C in this case.
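The idea of a separating hyperplane with a wide margin, as described for support vector
machines, can be sketched in a few lines of NumPy. This is not the paper's implementation: it
is a simplified soft-margin linear SVM trained by subgradient descent on the regularized hinge
loss, and the function name `train_linear_svm`, the toy data, and the settings `lam`, `lr` and
`epochs` are all assumptions made for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=5000):
    """Fit a hyperplane w.x + b = 0 by subgradient descent on the hinge loss.

    X has shape (n_samples, n_features); y holds labels -1.0 or +1.0.
    lam, lr and epochs are illustrative, untuned settings (assumptions).
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        inside = margins < 1   # points inside the margin or misclassified
        # Subgradient of (lam/2)*||w||^2 + mean(max(0, 1 - margin))
        grad_w = lam * w - (y[inside, None] * X[inside]).sum(axis=0) / n_samples
        grad_b = -y[inside].sum() / n_samples
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy, linearly separable data: each row is a point in 2-D feature space.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)                # side of the hyperplane for each point
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines
print(pred, margin_width)
```

Each data item is a point whose coordinates are its feature values, and the sign of `X @ w + b`
tells which side of the hyperplane it falls on. A library implementation would normally be
preferred in practice; the loop here is only meant to make the linear algebra visible.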
Word2Vec and GloVe are two popular models to create Word Embeddings.

Latent Semantic Analysis

Latent Semantic Analysis (LSA), or Latent Semantic Indexing, is one of the techniques of Topic
Modeling. It is another application of Singular Value Decomposition. Latent means ‘hidden’. True
to its name, LSA attempts to capture the hidden themes or topics from the documents by
leveraging the context around the words.
• First, generate the Document-Term matrix for the data
• Use SVD to decompose the matrix into 3 matrices:
  • Document-Topic matrix
  • Topic Importance Diagonal Matrix
  • Topic-term matrix
• Truncate the matrices based on the importance of topics

Linear Algebra in Computer Vision

This grayscale image of the digit zero is made of 8 x 8 = 64 pixels. Each pixel has a value in
the range 0 to 255. A value of 0 represents a black pixel and 255 represents a white pixel.
Conveniently, an m x n grayscale image can be represented as a 2D matrix with m rows and n
columns, with the cells containing the respective pixel values:
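To make the matrix view of a grayscale image concrete, the minimal sketch below stores an 8 x 8
image as a NumPy array of 8-bit pixel values. The pixel pattern is hypothetical (a bright ring
loosely resembling a zero) and is not the data behind the paper's figure.

```python
import numpy as np

# Hypothetical 8 x 8 grayscale image: 0 is black, 255 is white.
image = np.zeros((8, 8), dtype=np.uint8)   # start with an all-black image
image[1:7, 2:6] = 255                      # bright rectangular block
image[2:6, 3:5] = 0                        # dark hole in the middle of the block

print(image.shape)   # rows and columns of the matrix
print(image.size)    # total number of pixels
print(image)         # each cell holds one pixel value
```

Because the image is an ordinary 2D matrix with m rows and n columns, any matrix operation
(slicing, scaling, multiplication by a kernel) can be applied to it directly.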
• Slide this kernel on the 2D input data, performing element-wise multiplication
• Add the obtained values and put the sum in a single output pixel

The function can seem a bit complex but it’s widely used for performing various image processing
operations like sharpening and blurring images and edge detection.

CONCLUSION

Linear algebra has vast uses in the real world. Linear algebra methods are applied in data
science to improve the efficiency of algorithms and to attain more accurate results. In this
paper, we have compiled the applications of linear algebra in data science and given an insight
into each method. Data scientists can use linear algebra as a tool to analyze data sets. Machine
learning approaches are of particular interest given steadily increasing search outputs, and the
accessibility of the existing evidence is a particular challenge for quality improvement in the
research field.

REFERENCES

Hilbert M, Lopez P (2011). The world’s technological capacity to store, communicate and compute
information. Science. 332: 60-65.
Jones BF, Wuchty S, Uzzi B (2008). Multi-University Research Teams: Shifting Impact Geography
and Stratification in Science. Science. 322: 1259-1262.
Kakhani MK, Kakhani S, Biradar SR (2015). Research issues in big data analytics. International
Journal of Application or Innovation in Engineering and Management. 2: 228-232.
Philip CL, Chen Q, Zhang CY (2014). Data-intensive applications, challenges, techniques and
technologies: A survey on big data. Information Sciences. 275: 314-347.
Slavkovic M, Jevtic D (2012). Face Recognition Using Eigenface Approach. Serbian Journal of
Electrical Engineering. 9: 121-130.
Sturm P, Ramalingam S, Tardif JP, Gasparini S, Barreto J (2010). Camera models and fundamental
concepts used in geometric computer vision. Foundations and Trends in Computer Graphics and
Vision. 6: 1–183.
Szeliski R (2010). Computer vision algorithms and applications. London: Springer.
Waldrop MM (1992). Complexity: The Emerging Science at the Edge of Order and Chaos. Simon &
Schuster. ISBN: 978-0671872342
Will H (2014). Linear Algebra for Computer Vision. Attribution-Share Alike 4.0 International
(CC BY-SA 4.0). 1-14.
Wuchty S, Jones BF, Uzzi B (2007). The increasing dominance of teams in production of knowledge.
Science. 316: 1038-1039.