Search Algorithms
Search Algorithms
Data is only as good as the tools used to process and work with it. When you have
mountains of data to wade through, you need the best, most efficient methods of finding
precisely what you want. The easiest way to look for a needle in a haystack is to use a
magnet. The correct search algorithm is that kind of magnet, helping you find that
needle of desired data in the (gigantic, towering) haystack of Big Data!
This article explores the idea of binary search algorithms, including what they are, how
they compare to the linear search approach, when to use binary searches, and how to
implement them.
A linear, or sequential search, is a way to find an element in a list by looking for the
element sequentially until the search succeeds. Of course, there are other, better
search algorithms available, but linear search algorithms are easy to set up and
conduct, so it’s a decent choice if the element list (or array) is small. In linear search,
you go from the start until the end (in the worst case) to find what you're looking for. If
you reach the end and you haven't found the value, then it's not there. Something like
this:
Here’s an example of a linear search. Say you have ten buckets, numbered 1-10, and
one of them has a tennis ball. You start by looking into Bucket One and see if the ball is
in there. If not, then move on to Bucket Two, and keep going in numerical sequence
until you finally find the ball. That’s a linear search approach.
So linear searches are straightforward, and you can launch into one with little to no
preparation. That's great, but it's slightly less great if you had 1,000 buckets! So that's
why we have binary searches
Would you be surprised to know that we perform binary searches every day of our
lives? Binary searches are highly intuitive and frequently pop up in real life. We'll
discuss some examples later.
Although binary search algorithms are typically used to find one element in a sorted
sequence, they have many other uses. You can apply a binary search to a result, for
example.
Say you wanted to determine the minimum square footage of office space needed to fit
all a company's employees easily. Then, you can conduct a binary search for that
suggested size rather than sequentially checking through all the possible dimensions.
Typically, you would estimate maximum and minimum sizes when conducting the binary
search, then check a middle value, so you can halve the interval repeatedly until you get
your answer. This process saves a lot of time, especially when considering the vast
number of possible iterations of office space square foot available!
There are many other valuable examples, such as code testing, exams, technical
recruiting interviews, code challenges, and library tasks.
Binary searches are efficient algorithms based on the concept of “divide and conquer”
that improves the search by recursively dividing the array in half until you either find the
element or the list gets narrowed down to one piece that doesn’t match the needed
element.
In a binary search, you "divide" the list in halves, then compare the middle value with
what you're looking for. If the value is greater, then you repeat the process in the left
half, otherwise you use the right half. You repeat this until you cannot divide your input
list anymore. Something like this:
Binary searches work under the principle of using the sorted information in the array to
reduce the time complexity to zero (Log n). Here are the binary search approach’s basic
steps:
There's a good reason why some folks refer to binary search algorithms as “the
algorithm of everyday life.” Even if you’re not working in an IT-related career, it’s a safe
bet that you have routinely performed binary searches. It’s practically automatic! Here
are some everyday binary search examples.
Dictionaries
So you somehow find yourself without Internet access, and you need to look up the
definition of the word “wombat.” That means behaving like our primitive ancestors would
and reaching for an actual physical dictionary! If you wanted to do a linear search, you
would start at the “A” words and work your way through the dictionary until you got to
“wombat.” Good luck with that!
However, most of us are cleverer than that, and we instinctively employ the binary
search method. We consult the “W” listings and go to the middle of that section. If
“wombat” is alphabetically smaller than the word on that middle page, we ignore the rest
of the pages on the right side. If “wombat” is larger, then we ignore the left-hand pages.
We then keep repeating the process until we find the word.
Going to the Library
Here's another use that somehow involves a lack of Internet access. You visit your local
library to find a book called "Soups I Have Known." You will be there forever if you enter
the library and search the shelves linearly. So, instead, you rely on alphabetization or a
code system like the Dewey Decimal System to narrow your search.
Page Numbers
So you've found "Soups I Have Known" and checked it out from the library. A friend told
you that there's a fantastic soup on page 200. So you don't open the book to the
Foreword and begin turning the pages, working your way up to 200! Instead, you open
the book to a random spot and check the page number (we guess this book doesn't
have a Table of Contents!). If the page number is greater than 200, then your soup is on
the left-hand side of the current page. However, if the page number is less than 200,
you turn to the pages on the right-hand side. You keep doing this until you find page
200.
This example is probably the most common, and most of us do it without even thinking
about it.
Graph Theory
In the early 18-th century, there was a recreational mathematical puzzle called the Königsberg
bridge problem. The solution of this problem, though simple, opened the world to a new field in
mathematics called graph theory. In today’s world, graph theory has expanded beyond
down what graph theory is and its main components. Finally, I will conclude with 5 applications
of graph theory that are used today in the world of data science.
along both sides of the Pregel river. The city had two islands that were connected to the
mainland through bridges. The smaller island was connected with two bridges to either side of
the river, while the bigger island was connected with only one. Additionally, there was one
bridge connecting both islands. You can see the layout of the bridges in the image below.
attraction of the city. However, you are a bit lazy and do not want to walk too much. So, you do
not want to cross the same bridge more than one time. Is there a path through the city that does
this? Just as a simple rule, you can only cross the river through bridges, so no swimming. How
Superficially, the problem sounds simple to solve. Just try some paths and you will arrive at the
solution. However, no matter how many paths you try, you will not find a solution. Leonhard
Euler, a famous mathematician, realized this, and explained why it was impossible to make this
Firstly, he realized that it does not matter how you travel inside the city. The only important part
to consider for the problem was the connections between the different landmasses. He drew
dots to represent the landmasses and lines connecting these dots to represent the bridges (as
seen in the picture below). The location of the dots and the shape of the lines are not relevant
for the problem, only their relation. In the end, he had an abstraction of the problem with only
and exit from a different bridge. This means that the dot representing that landmass needs two
lines connecting it to represent the enter and exit line. More generally, the dot can have any
even number of lines as connections. This does not necessarily apply for two dots, which are
the first and last landmass in the path. Those dots can have an odd number of lines connecting
them since you could only exit the first dot and only enter the last dot. From counting the lines
connecting each dot of the Königsberg bridge problem, one can see that all of the landmasses
have an odd number of connections. This proves that it is impossible to make a path that
Euler changed the way of solving problems. He recognized that the problem was not about
measuring and calculating the solution, but about finding the geometry and relations behind it.
By abstracting the problem, he started the field of graph theory and his solution became the first
theorem of this field. Since then, graph theory has developed not only from a mathematical
perspective, but into many other fields such as physics, biology, linguistics, social sciences,
as dots (like the landmasses above) and their relationships as lines (like the bridges). The dots
are called vertices or nodes, and the lines are called edges or links. The connection of all the
vertices and edges together is called a graph and can be represented as an image, like the
ones below:
One of the most important properties of graphs is that they are just abstractions of the real
world. This allows the representation of graphs in many different ways, all of which are correct.
For example, all the graphs above are visually very different from each other; however, they all
represent the same relations, thus all are the same graph. You can check it by counting the
Graphs can be divided into many different categories based on their properties. The most
common categories are directed and undirected graphs. Directed graphs have edges with
specific orientations, normally shown as an arrow. For example, in a graph representing a cake
recipe, each vertex is a different step in the recipe and the edges represent the relation between
these steps. You can put the cake in the oven only after mixing the ingredients; therefore, there
is a directed edge from the mixing to the baking step. Undirected graphs have symmetric edges,
just like the ones shown earlier. Another example is the graph of a social network, where
vertices represent people and edges connect people that have a relationship. These
Another important feature is that the vertices or the edges can have weights or labels. Let's
assume that you are building a graph from a series of warehouses and stores in a city to
optimize the supply of the stores. You can have two types of vertices, warehouses and stores,
and the edges that connect them can be weighted by either the physical distance between the
two locations or by the cost to move the product from one location to another. Weights and
labels are very important when using graph theory in real life applications, since it is a way to
For now, I have described what a graph is and its properties, but not how to use graph theory to
solve problems. Interestingly, once abstracted to a graph, the problem falls into a few
fundamental categories, such as path finding, graph coloring, flow calculation and more. In the
next section, I will address some of these categories, some real life problems that fall into them
calculation of their solution can be done with a variety of algorithms that I encourage the reader
to look up since they sometimes become highly complex for this introductory blog. Moreover,
the solutions of such problems may not be unique nor exact. Graph theory algorithms depend
on the size and complexity of the graph; this means that some solutions may just be a very
good approximation to the exact solution. Even more, some problems have not even been
which encompass real life scenarios like the scheduling of airlines. Airlines have flights all
around the world and each flight requires an operating crew. Personnel might be based on a
particular city, so not every flight has access to all personnel. In order to schedule the flight
For this problem, flights are taken as the input to create a directed graph. All serviced cities are
the vertices and there will be a directed edge that connects the departure to the arrival city of
the flight. The resulting graph can be seen as a network flow. The edges have weights, or flow
capacities, equivalent to the number of crew members the flight requires. To complete the flow
network a source and a sink vertex have to be added. The source is connected to the base city
of the airline that provides the personnel and the sink vertex is connected to all destination
cities.
Using graph theory, the airline can then calculate the minimum flow that covers all vertices, thus
the minimum number of crew members that need to operate all flights. Additionally, by giving
weights to the cities corresponding to its importance, the airline can calculate a schedule for a
reduced number of crew members that do not necessarily visit all the cities.
This flow problem can also be applied to many other instances. For example, when having to
supply stores from warehouses with a finite number of trucks, or when scheduling public
transport in specific routes considering the expected amount of people that will be using it.
helps me by giving me directions to cycle from my location to a restaurant or a bar. But how are
these directions calculated? Graph theory is the answer for this challenge, which falls in the
The first step is to transform a map into a graph. For this all street intersections are considered
as vertices and the streets that connect intersections as edges. The edges can have weights
that represent either the physical distance between vertices, or the time that takes to travel
between them. This graph can be directed showing also the one way streets in the city.
Now, to give the direction between two points in the map, an algorithm only needs to calculate
the path with the lowest sum of edge weights between the two corresponding vertices. This can
be trivial for small graphs; however, for graphs created from big cities, this is a hard problem.
Fortunately, there are many different algorithms that may not give the perfect solution, but will
give a very good approximation, such as the Dijkstra's algorithm or the A* search algorithm.
Finding the shortest or fastest route between two points in the map is definitely one of the most
used applications of graph theory. However, there are other applications of the shortest path
problem. For example, in social networks, it can be used to study the “six degrees of separation”
between people, or in telecommunication networks to obtain the minimum delay time in the
network.
few numbers are given as a clue and the remaining numbers needed to be filled follow a simple
rule: they cannot be repeated in the same row, column or region. This puzzle, despite using
numbers, is not a mathematical puzzle, but a combinational puzzle that can be solved with the
One can convert the puzzle to a graph. Here, each position on the grid is represented by a
vertex. The vertices are connected if they share the same row, column or region. This graph is
an undirected graph, since the relationship between vertices goes both ways. An important
feature of the graph is the assignment of a label to each vertex. The label corresponds to the
number used in that position. In graph theory, the labels of vertices are called colors.
To solve the puzzle, one needs to assign a color to all vertices. The main rule of Sudoku is that
each row, column or region cannot have two of the same numbers, thus two vertices that are
connected cannot have the same color. This problem is called graph coloring, and, as with other
graph theory problems, there are many different algorithms that can be used to solve this
problem (Greedy coloring or DSatur algorithm, for example), but their performance depends
real life problems that can be translated to a coloring problem, such as scheduling tasks. For
example, scheduling exams in rooms. Each exam is a vertex and there is an edge connecting
them if it takes place at the same time. The graph created is called an interval graph, and by
solving the minimum coloring problem of the graph, you obtain the minimum number of rooms
needed for all the exams. This can be generalized with tasks that use the same resources, such
Once a query is made to search a specific set of words, the engine looks for websites that
match the query. After finding millions of matches, how does the engine rank them to show the
The search engine solves this through graph theory by first creating a webgraph, a graph
where the vertices are the websites and the directed edges follow hyperlinks within those
websites. The result is a directed graph that shows all relations between websites. Additionally,
one can add weights to the vertices to give priority to more important or influential websites.
To classify the most popular websites, different algorithms can be used. One of the first ones
used by Google is called PageRank. Here, the engine assigns probabilities to click a hyperlink
and iteratively adds them up to form a probability distribution. This distribution represents the
likelihood of a person randomly arriving at a particular website. Then, the engine orders the list
This algorithm had many faults. One can exploit it by having for example blog websites with
many links to a particular website to increase the click probability, or by buying hyperlinks in
websites with higher weights. Nowadays, there are more complicated algorithms that also
consider sponsored advertisement, but the main core is still graph theory and the relations
between websites.
revenue comes from advertising. Having so many users, advertisers will find it very expensive to
place their advertising campaigns within the reach of everyone. However, one can also just
target the people that may be interested in your product. How can you define such a target
audience?
Using graph theory, you can create a social network graph by assigning a vertex to each
person. You connect vertices with edges if the persons have a relationship, such as friends in
Facebook. This leads to an undirected graph. This massive graph would appear at the
beginning very chaotic; however, one can always find patterns in it.
A way to find the ideal target audience is to decompose the graph into smaller sub-graphs.
There are different algorithms that can do this, such as hierarchical clustering algorithms or
minimum cut methods like the Karger's algorithm. The result is the division of the graph into
clusters of people that are highly connected to each other, but less connected to other groups of
people. These groups are called communities and they share common interests, like specific
artists, brands or even political parties. Identifying these communities is advantageous for
advertising since they are more likely to buy common products, follow similar artists or vote for
similar parties.
The detection of communities can be also used for other purposes than advertisement. After
identifying the communities, one can compare connections between groups or even within
groups. If a group or a vertex within the group does not behave as their peers, it can be a sign
of intrusion. This can be used as a security control. For example, if the vertices are computers
or programs in a network, strange behavior could be caused by attacks on it. Identifying strange
Conclusion
In this blog, we went over how graph theory came to live from a simple mathematical puzzle.
You now know the main characteristics of the field and the main problems that can be solved
using graph theory. However, as an introduction to the field, the main goal of this blog is to
encourage the reader to think about problems the way graph theory does: abstract the problem
and remove all non-important parts behind. This will make the search for a solution much
easier.