Lab Sheets
Multi-file Projects in C:
Introduction:
Most of the programs that you have seen so far involve a single file that contains the entire source
code of the program. But a project in C involves multiple “.h” and “.c” files that have to be
integrated so that they work together. A “.h” file is a header file that can contain global variables,
structure definitions, constants, function declarations and macros, all of which may be shared
across multiple C files and used by them. A “.c” file is a source code file that contains actual source
code and might be a program in and of itself.
Including a header file into a “.c” file enables us to use the variables declared and the libraries
included in the header file. It also enables us to implement or use the functions declared in the
header file. A common practice in C programming is to keep all the constants, macros, global
variables, and function prototypes (declarations) in the header files and include that header file
wherever it is required.
Let us now see how to include a header file into our program.
Include Syntax:
The preprocessing directive #include is what is used for including both system header files (such
as stdio.h, stdlib.h, string.h, etc.) as well as user-defined header files into a program. We use
angle brackets (< >) for system headers and double quotes (“ ”) for user headers. Let us
understand the two forms more closely:
#include <file.h>
This form is used for system header files that come with the compiler. It searches for a file named
'file.h' in a standard list of system directories. An example of this type of inclusion is #include
<stdio.h>. This file inclusion makes functions such as printf() and scanf() accessible to our
program. Without this inclusion, we will not be able to use the above functions. It is an exercise
to try it out!
#include "myfile.h"
This form is used for header files that we create for our programs. It searches for a file named
“myfile.h” in the directory containing the current “.c” file (the file in which you have written the
above statement to include “myfile.h”). All the function declarations, structure definitions, and
global variables that are present in myfile.h then become accessible in our “.c” file.
Example of a multi-file project:
Let us build a basic multi-file project step by step. (Create a separate directory for this project
and place the upcoming files in that directory.)
Consider a function that takes as input a string (character pointer) and returns the number of
vowels in the string. Such a function can be reused as many times as desired within the same
program, but our aim is to make use of it even across multiple programs. How might we do that?
Create a file named “count_vowels.c” and put the following code into it:
#include "count.h"
The above program defines a function count() that counts the number of vowels in a given
input string. Now, create another file named “count.h” and put the following code into it:
#include <ctype.h>
#include <string.h>
int count(char* string);
The file count.h that we have created includes the function declaration of the function count(),
along with its necessary header files. Whereas the file count_vowels.c contains the actual
function definition, and the header file count.h is also included in it. Notice how none of these
files has a main() function in them. And the function count is also not being called anywhere.
Now, let us create a third file that contains a main() function and calls the function count() that
we have just declared inside “count.h” and defined inside “count_vowels.c”. Create a file named
“master.c” and paste the following code onto it:
#include <stdio.h>
#include "count.h"
int main(void)
{
char s[100];
printf("Enter a string: ");
scanf("%99[^\n]", s); /* read a whole line (at most 99 characters) */
int n = count(s);
printf("Count = %d", n);
}
Notice that we have included “count.h” header file in “master.c”. Now, if we try to compile the
master.c using gcc master.c, you will encounter an error. This is because when we include the
“count.h” file at its top, we are merely including the function declaration in our program (take
another look at the contents of the “count.h” file), whereas the implementation of that function
is still missing. In order to get the implementation, we would have to link the implementation file
(which is actually “count_vowels.c”) with our master file. Follow these steps for the same:
1. First, we must compile each of the “.c” files with the “-c” option with the gcc command.
This option generates a corresponding “.o” or object file (an object file is a file containing
object code, that is, machine code output of an assembler or compiler; the object code is
usually relocatable, and not usually directly executable; it is an intermediate file between
the source code and the executable files), but not the actual executable. Once we
generate “.o” files for all the “.c” files, we can link them together to generate an
executable.
gcc -c count_vowels.c
gcc -c master.c
2. Step 1 has generated count_vowels.o and master.o files. You may verify that they have
indeed been generated with the help of the ls command. These are the object files of the
corresponding source files. Now, we link them together to create a single executable file
using the gcc -o option:
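Assuming the executable is to be named count_vowels_exe (matching the run command below), the linking step would look like:
gcc -o count_vowels_exe count_vowels.o master.o
The resulting executable can then be run: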
./count_vowels_exe
There are multiple advantages to this kind of programming. For example, we can see that
“count_vowels.c” and “count.h” can be reused in multiple projects without any need to copy or
change the code. In fact, there is no need to even compile “count_vowels.c” again, provided we
have made no changes to it. The previously compiled “count_vowels.o” can be directly used for
linking while creating the second executable.
Task 1: Create another file “count_consonants.c”, in which you include “count.h” and implement
the count() function, which this time counts the number of consonants in the input string. Now,
generate the object file for this program and then create a different executable called
“count_consonants_exe” that links the original “master.o” file and your newly created
“count_consonants.o” file. See that the two executable files (count_vowels_exe and
count_consonants_exe) may now be executed independent of each other. Note that you shall
not need to recompile “master.o” as no change has been made.
We can write the multiple stages of compilation and linking into a single shell script file. This
reduces recompiling after a change to any file in the project to running a single command.
● Create a file “myScript.sh” in the same directory where our “.c” and “.h” files are
present.
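A minimal sketch of the script contents, assuming the file names used earlier in this section, might be:
gcc -c count_vowels.c
gcc -c master.c
gcc -o count_vowels_exe count_vowels.o master.o
The script can then be run from the project directory with sh myScript.sh (or bash myScript.sh).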
You can also include the following lines in myScript.sh before the gcc statements:
rm *.o
rm *exe
These two lines will remove the previously compiled “.o” files and the previously created
executables (files that end with the string exe). Then we can execute the remaining lines
to freshly compile and create “.o” files and the exe.
Task 2: Create a shell script for the count_consonants variant of count() that you have created in
task 1 and then run it.
Home Exercise 1: Create a quiz bot that prompts a student for the answers (A, B, C, D for the respective
option; N for not attempted) to ten questions in sequence and stores that into a character array.
There are three sets of question papers, and hence there are three sequences of correct answers.
The array thus stored gets passed to a checker function present in the file containing the
appropriate answer key for the student. The answer checker uses the answer key
that is inbuilt to the function, and this answer key is different for each set. The student’s score is
then calculated (+4 for correct answer, -1 for incorrect answer, 0 for unattempted questions;
note that the minimum possible score is 0, a student cannot score negative overall) by the answer
checker, and the master program that is conducting the quiz then returns the score.
You should create the main() function in quiz.c file. This function takes the answers from the
users, stores them in a character array and invokes the answer_checker() function by passing the
above character array as an argument. The answer_checker() function is to be declared in set.h
header file, and it has to be implemented in setA.c, setB.c and setC.c, separately as per the key
for each of the sets. You must include header files appropriately in each of the files. Create three
separate shell scripts for each of the answer sets. [The idea behind this is that the quizzer knows
which set the particular student is getting and runs the respective shell script accordingly, and
once the student is done answering, it returns the score to them.]
Makefiles:
A makefile exists to aid in compilation and recompilation of a C project. Let us understand the
same with the help of an example.
Consider the vowel counting program that has been described earlier in this labsheet. We intend
to run the make command on this project. The make command searches for a file named
“makefile” in the current directory, and executes the shell commands written in it. So let us first
create a file named “makefile” in the directory that contains the program files of our desired
program. Now, write the following lines into the file:
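The makefile contents are a sketch here, reconstructed from the description that follows (the target and file names are taken from this section; the exact original rules may differ):

vowel: count_vowels_exe
	./count_vowels_exe

count_vowels_exe: count_vowels.o master.o
	gcc -o count_vowels_exe count_vowels.o master.o

count_vowels.o: count_vowels.c count.h
	gcc -c count_vowels.c

master.o: master.c count.h
	gcc -c master.c

clean:
	rm -f *.o count_vowels_exe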
The targets are file names, separated by spaces. Typically, there is only one target per rule. The
commands are a series of steps typically used to make the target(s). These need to start with a
tab character, not spaces (ensure that the default indentation in your code editor is a Tab and
not a bunch of whitespaces, otherwise the makefile would fail). The prerequisites are also file
names, separated by spaces. These files need to exist before the commands for the target are
run. These are also called dependencies.
The above makefile contains a few rules. Now, if we execute “make vowel” on the terminal, it
would execute all the commands under the target “vowel” (running just “make” would also, by
default, execute all commands under the first rule). Notice how the prerequisite for “make
vowel” is the executable file “count_vowels_exe”. If this file doesn’t exist in our program
directory, it goes to the target “count_vowels_exe” and executes it to create the
“count_vowels_exe” file. Note that the target “count_vowels_exe” is also dependent on
“count_vowels.o” and “master.o” files. If these files don’t exist in the program directory, the
targets “count_vowels.o” and “master.o” would be executed to have these object files created.
In this way, the makefile targets are recursively executed.
So, simply by calling the command “make vowel” on the shell, your entire C project gets
compiled and executed, provided the prerequisite files are present. This removes redundant
recompilation overhead and is a great convenience, as we no longer need to type out the
individual compile and link commands ourselves every time.
You can also execute the other targets independently. For instance, if we run “make master.o”, it
will first check that master.c exists, and then compile it up to the object file stage.
Lastly, notice the “clean” rule. Executing this (“make clean” command) will delete all the “.o” files
in this project and the linked executable as well.
Task 3: Experiment with the above makefile, and add more rules to perform the same operations
with the consonants variant of the count function.
Pointers and Dynamic Memory Allocation:
Pointers:
Let us revise the basics of pointers in this section. A pointer is a variable whose value is the
address of another variable, i.e., direct address of the memory location. Like any variable or
constant, you must declare a pointer before using it to store any variable address. The general
form of a pointer variable declaration is:
type* ptr_var_name;
Here, type is the pointer's base type; it must be a valid C data type, and ptr_var_name is the name of
the pointer variable. The asterisk * used to declare a pointer is the same asterisk used for
multiplication. However, in this statement the asterisk is being used to designate that a variable
is a pointer. Take a look at some of the valid pointer declarations:
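Typical pointer declarations (the variable ip below is the one referred to in the next paragraph) might look like:
int    *ip;    /* pointer to an integer */
double *dp;    /* pointer to a double */
float  *fp;    /* pointer to a float */
char   *ch;    /* pointer to a character */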
The actual data type of the value of all pointers, whether integer, float, character, or otherwise,
is the same, a long hexadecimal number that represents a memory address. The only difference
between the pointers of different data types, is the data type of the variable or constant that the
pointer points to. So, for instance, the variable ip in the above declaration is a pointer variable (a
long hexadecimal) which holds the address of a variable which is an integer.
#include <stdio.h>
int main(void)
{
int var = 20; /* actual variable declaration */
int *ip;      /* pointer variable declaration */
ip = &var; /* store address of var in pointer variable*/
printf("Address of var variable: %x\n", &var );
/* address stored in pointer variable */
printf("Address stored in ip variable: %x\n", ip );
/* access the value using the pointer */
printf("Value of *ip variable: %d\n", *ip );
return 0;
}
When the above code is compiled and executed, it produces the following result:
NULL Pointers: It is always a good practice to assign a NULL value to a pointer variable in case
you do not have an exact address to be assigned. This is done at the time of variable declaration.
A pointer that is assigned NULL is called a null pointer. The NULL pointer is a constant with a value
of zero defined in several standard libraries. Consider the following program:
#include <stdio.h>
int main (void)
{
int *ptr = NULL;
printf("The value of ptr is : %x\n", ptr );
return 0;
}
When the above code is compiled and executed, it produces the following result:
You can therefore use an if statement as shown to test whether or not a given pointer is a NULL
pointer:
if(ptr) /* succeeds if ptr is not null */
if(!ptr) /* succeeds if ptr is null */
Let us now use the concept of pointers to create dynamically allocated arrays and explore all their
usecases.
The major problem with your typical arrays (in C) defined on stack is that its size is fixed. Once
declared with a specific size, it can’t be changed. Oftentimes we require array-like storage whose
length can be altered (either increased or decreased) dynamically at runtime. This is the purpose
that a dynamic array serves. As such, dynamic arrays are very useful data structures. They can be
initialized with a variable size at runtime. This size can be modified later in the program so as to
expand or shrink the array. Unlike fixed-size arrays, dynamically sized arrays are allocated in the
heap area of memory. Despite being resizable, they still provide random access to their elements.
We use pointers and C functions such as malloc(), free(), calloc(), and realloc() to implement
dynamic-sized arrays. So, let us briefly revise what these functions do:
malloc: Calling this function is equivalent to requesting the OS to allocate some bytes of
memory to our program. If the memory allocation is successful, malloc returns the pointer to the
memory block. Else it returns NULL. malloc stands for "memory allocation". Its declaration looks
like the following:
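void *malloc(size_t size);   /* as declared in <stdlib.h> */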
Creation:
#include <stdio.h>
#include <stdlib.h>
int main()
{
int n = 10;
int* p = (int *) malloc(n);
if (p == NULL)
{
printf("Unable to allocate memory\n");
return -1;
}
printf("Allocated %d bytes of memory\n", n);
return 0;
}
In the above snippet, we use malloc() to create n bytes of memory and assign it to the integer
pointer p. However, it should be evident that something wrong has been done in the above
program. We did not deallocate the memory. We must use the free() function to deallocate all
our allocated dynamic memory, which we shall soon learn about.
Here is a simple example where we create a dynamically allocated array of 10 floats and store
some data into it:
float *p;
// Allocate 10 floats
p = (float*) malloc(10 * sizeof(float));
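// Store some data into the array (a sketch; the values here are illustrative)
for (int i = 0; i < 10; i++)
    p[i] = i * 1.5f;   // p[i] is the (i+1)'th element of the array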
Accessing Array Elements: There are two ways of accessing elements of an array. One way is
illustrated in the above example. p[0] refers to the first element, p[1] refers to the second
element, and so on until p[n-1] refers to the n’th element. Note that the array elements are
contiguously stored in the program memory as illustrated in Figure 1 below.
There is another way to access the array elements using pointer dereferencing. Note that the
name of the array (say p) actually stores the address of the first element of the array p. So, to
access the value of first element of the array we can access it using *p.
If we then increment the value of pointer p and dereference it, we can access the value of second
element of the array, i.e., p+1 gives us the address of the second element of the array and *(p+1)
gives us the value of the second element of the array by dereferencing the address.
In general, if we increment a pointer, then NewPtr := CurrentPtr + N bytes, where N is the size of
the datatype that CurrentPtr points to. In the above example, incrementing p by 1 advances p by
4 bytes (assuming float takes up 4 bytes). Similarly, if we add K to a pointer, then NewPtr :=
CurrentPtr + K * N bytes, where K is a constant integer and N is the number of bytes occupied by
the datatype CurrentPtr points to. So, for example, p + 3, refers to the address of the fourth
element of the array. And the difference in the addresses referred to by p + 3 and p is 3*4 = 12
bytes. And we can access the value of the fourth element of the array using *(p + 3).
This idea can be used to access the i’th element of any dynamically allocated array by simply
adding i to its base pointer. Therefore: *(ptr + i) is equivalent to a[i], if ptr is a pointer pointing
to the base address of the array a.
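As a small illustration (a sketch using the float array p allocated above), both forms below read the same elements:
for (int i = 0; i < 10; i++)
    printf("%f %f\n", *(p + i), p[i]);   /* *(p + i) is equivalent to p[i] */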
We are now able to access the elements stored in the array p (of course it is utter garbage unless
you have stored something into it first), because the pointer p points to the base of the array, to
which we are adding the index i to obtain the address of our element, and then we dereference
that address with the * operator to get the array element itself. As mentioned in the comment,
the array element can also be accessed directly as p[i].
calloc: This function can allocate multiple contiguous memory blocks of given size and
initializes each block to zero value, whereas malloc allocates a single memory block and the value
at the pointed location is random or garbage. calloc allocates memory and zeroes the allocated
blocks. calloc stands for "contiguous allocation".
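/* As declared in <stdlib.h>; the parameter names below follow the text
   (the standard names them nmemb and size). */
void *calloc(size_t no_of_members, size_t size);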
no_of_members represents number of memory blocks. size represents the size of each block.
This is more suitable for allocating memory for arrays. Note that “zero value” means every byte
of the block is set to zero, so if we are allocating an array of structs, pointer members read as
NULL and int/float members read as 0.
realloc: The realloc() function is used to resize a memory block pointed to by a pointer that
was previously allocated by the malloc() or calloc() function. realloc stands for
reallocation. Its function header is the following:
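void *realloc(void *ptr, size_t size);   /* as declared in <stdlib.h> */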
- ptr: The pointer to the memory block that was previously allocated by calloc(), malloc(), or
realloc() and is to be reallocated. If this pointer is NULL, then a new block is allocated and the
pointer to it is returned by the realloc() function.
- size: It is the new size of the memory block which is to be reallocated. It is passed in bytes. If
the size is 0, then the memory block pointed by ptr is deallocated and a NULL pointer is returned
by the realloc() function even if ptr points to an existing block of memory.
If the realloc() request is successful, then it will return a pointer to the block of newly allocated
memory. If the request fails, it will return a NULL pointer.
free: The free() function deallocates dynamic memory. Calling free(p) just before return in the
earlier malloc() example would have avoided the leak there. free() MUST be called explicitly once
dynamically allocated memory is no longer needed, irrespective of which function was used to
allocate it (malloc, calloc, etc.). Its function header looks as follows:
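void free(void *ptr);   /* as declared in <stdlib.h> */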
int* p;
p = (int*) malloc(sizeof(int));
printf("Address pointed by p = %p\n", p);
free(p);
Although we have deallocated the memory pointed to by p using the free() function, p still holds
the same address. Hence, p becomes a dangling pointer, i.e., it no longer refers to a valid
memory location of the program.
int* p; int* q;
p = (int*) malloc(1000*sizeof(int));
q = (int*) malloc(sizeof(int));
p = q;
Notice how p is made to point to another memory block without freeing the previous one. The
memory previously allocated to p now becomes inaccessible. This is known as a memory leak.
Now, let us take a look at a detailed example using the above functions. The program given below
creates a structure (struct name) for holding names (first name and last name) as character
arrays. It creates a dynamically allocated array of struct name of size n to hold some names which
it takes as input and then puts into the array. Then, it reallocates the array to have space for one
more element, which is also then taken as input and added to the array. The contents of the array
are displayed at both these steps so that we can verify the correctness of our program.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Structure for holding a first name and a last name.
   (The struct definition is missing from the surviving listing; the field names below
   follow their usage in main(), and the array sizes are assumed.) */
typedef struct name
{
char first[50];
char last[50];
} Name;
int main()
{
int n;
printf("Enter size of the array: ");
scanf("%d", &n);
Name* arr = calloc(n, sizeof(Name)); // Creating enough space for 'n' names.
if (arr == NULL)
{
printf("Unable to allocate memory\n");
return -1;
}
printf("Enter the names (space separated): ");
for (int i = 0; i < n; i++)
{
// Using . to access members of the struct
scanf("%s %s", arr[i].first, arr[i].last);
// Note that arr[i].first is equivalent to (arr+i)->first
}
printf("\nGiven array of names: ");
for (int i = 0; i < n; i++)
printf("%s %s\n", arr[i].first, arr[i].last);
printf("\n");
free(arr);
}
An example run of the above program is given below (user input is underlined):
Sonny Corleone
Fredo Corleone
Michael Corleone
The above example demonstrates the use of dynamic memory allocation to create a particular
dynamic array of structs and then resize it. It is illustrative of the powers of dynamic memory
allocation, and you are expected to explore these ends on your own beyond this example.
Task 4: Create an interactive interface for a user who wishes to perform some operations on a
dynamic array of strings. For this, you need to write a menu-driven program that asks the user
for the initial length of the array, and then stores those corresponding strings (also taken as input)
into a dynamic array initialised with that length. Then, begin a loop of the menu that prompts
the user to select one of five options:
The addition and deletion of strings must result in resizing of the array. The user should have a
sixth option that would enable them to close the menu-driven program (thereby terminating the
loop).
Perform the operation that the user selected. Write separate functions for each of the options.
Home Exercise 2: Mr David Hilbert is the manager of a hotel which he calls The Grand Hotel.
Create a structure to hold metadata about The Grand Hotel such as its name (a string), address
(also a string), number of rooms currently occupied (an integer), and a dynamic array of structs
(pointer) that is initially empty. These structs will hold the information about the occupants of
the hotel such as their names (string), their age (int), and permanent residential address (string).
After constructing both the structs, design a function that will allow Mr Hilbert to assign the first
unoccupied room to a new visitor. Then, design another function that enables Mr Hilbert to
assign Room #0 to a new visitor. This entails that he would request every occupant to shift to the
next room (Room #1 -> Room #2, Room #2 -> Room #3, etc). Also, write a function that empties
a particular room (when a visitor leaves). Mr Hilbert insists that all the occupants of
every succeeding room be shifted back by one room to fill the gap (obviously, each of the
above three functions would work by altering the dynamic array). Finally, create a function that
displays the information of a particular visitor given the room number that they occupy presently.
[You must use the realloc function whenever a room is emptied or newly occupied to ensure
that your dynamic array is always occupying only just as much space as it needs.]
Now, write a menu-driven main program that initialises The Grand Hotel and then prompts the
user (Mr Hilbert) to choose one of the above four options. The menu keeps reappearing after a
task is completed as long as Mr Hilbert wants to operate the hotel.
Brain teaser: Now, just imagine if Hilbert’s hotel were infinitely long (i.e., if he had infinite rooms
in his hotel!). This can obviously not be programmed due to memory restrictions, but understand
that, theoretically, given infinite memory, we could keep resizing the dynamic array holding the
information of the visitors, even if there came an infinite number of them. Just for fun, you can
look up “Hilbert’s infinite hotel paradox”, which is illustrative of an interesting quirk of infinite
sets in mathematics, which we have averted here since our set (array) is not infinite in the first
place.
File I/O Handling:
Introduction:
What we have done so far is take input from the user directly at runtime by means of the scanf()
function. But often we have data given to us in files that we may want to read directly from,
rather than input one by one at runtime. Moreover, frequently, we would want to have our
program outputs stored in a more permanent form on the disk (in files). In such cases, it becomes
extremely important to be well-acquainted with file-handling concepts in your language of
choice. All our labs shall be conducted in C, and will involve file I/O operations.
Conceptually, a text file is a sequence of characters that is accessed from the beginning.
Suppose, for instance, that a file contains:
QWERTY
Then, the first accessible element is Q, then W, then E, and so on – i.e. to access E one has to
access Q, then W, and then access E; and when say a value U is added to the above file, the
contents would be:
QWERTYU
i.e., when a new element is added, it is added to the end of the file. Linux supports
text and binary files; we will only deal with text files in this course. Text files can be accessed
character by character or word by word (i.e., strings separated by spaces).
C provides the I/O library stdio that contains procedures (or functions) for I/O access. The header
file “stdio.h” contains the headers (i.e. declarations) of these functions. We must include this
header file in our program to perform I/O Operations.
C libraries support procedures fscanf(), fgets(), fprintf(), and fputs() for reading and writing to a
file. Refer to the manual pages for information on how to use these procedures. They are similar
to scanf and printf but take an additional (first) argument which is the file pointer.
Since files are abstractions of persistent physical storage, typically initialization and finalization
are required, i.e. initialization must be done before any read/write operations, and finalization
must be done after all read/write operations (particularly before close of program execution). C
libraries provide procedures fopen() and fclose() for initialization and finalization of a file. Refer
to the man pages for information on how to use these procedures.
FILE* fp;
fp = fopen("filename", "mode");
fclose(fp);
Declaration: int fprintf(FILE *fp, const char *format, ...)
Description: The fprintf() function writes a string into the file pointed to by fp, as shown below.
The typical structure of a program fragment that reads from and/or writes to a file is as follows:
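A sketch of such a fragment (the file name is illustrative; <mode> is a placeholder explained below):
FILE *fp;
fp = fopen("data.txt", "<mode>");   /* initialization: open the file in the chosen mode */
if (fp == NULL)
{
    /* could not open the file; handle the error */
}
/* ... read from and/or write to the file using fscanf(), fprintf(), etc. ... */
fclose(fp);                          /* finalization: close the file */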
The <mode> in the above statement refers to the mode in which you want to open your file.
The table below describes the modes that can be used:
r Searches file. If the file is opened successfully fopen( ) loads it into memory and sets up
a pointer that points to the first character in it. If the file cannot be opened fopen( )
returns NULL.
rb Open for reading in binary mode. If the file does not exist, fopen( ) returns NULL.
w Searches file. If the file exists, its contents are overwritten. If the file doesn’t exist, a
new file is created. Returns NULL, if unable to open the file.
wb Open for writing in binary mode. If the file exists, its contents are overwritten. If the
file does not exist, it will be created.
a Searches file. If the file is opened successfully fopen( ) loads it into memory and sets up
a pointer that points to the last character in it. If the file doesn’t exist, a new file is
created. Returns NULL, if unable to open the file.
ab Open for append in binary mode. Data is added to the end of the file. If the file does
not exist, it will be created.
r+ Searches file. If the file is opened successfully, fopen( ) loads it into memory and sets up a
pointer that points to the first character in it. Returns NULL, if unable to open the file.
rb+ Open for both reading and writing in binary mode. If the file does not exist, fopen( )
returns NULL.
w+ Searches file. If the file exists, its contents are overwritten. If the file doesn’t exist a
new file is created. Returns NULL, if unable to open the file.
wb+ Open for both reading and writing in binary mode. If the file exists, its contents are
overwritten. If the file does not exist, it will be created.
a+ Searches file. If the file is opened successfully fopen( ) loads it into memory and sets up
a pointer that points to the last character in it. If the file doesn’t exist, a new file is
created. Returns NULL, if unable to open the file.
ab+ Open for both reading and appending in binary mode. If the file does not exist, it will
be created.
As described above, if you want to perform operations on a binary file, then you have to append
‘b’ at the end of the mode. For example, to perform write operations on a text file named
“example.txt”, the file pointer can be opened as:
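FILE *filePointer;
filePointer = fopen("example.txt", "w");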
filePointer now holds the file pointer for that file and can subsequently be passed to functions
like fprintf(). The second parameter can be changed to contain any of the attributes listed in the
above table.
We shall now take a look at some examples to illustrate the use of file pointers in reading from
and writing to files. Go through them and understand the exact workings of the functions used.
#include <stdio.h>
int main()
{
int num;
FILE *fptr;
fptr = fopen("program.txt", "w"); /* file name assumed; "w" creates the file if it does not exist */
if (fptr == NULL)
return -1;
printf("Enter a number: ");
scanf("%d", &num);
fprintf(fptr, "%d", num); /* write the number into the file */
fclose(fptr);
return 0;
}
Note that opening a file in write mode (“w”) automatically creates the file (empty file) in case it
does not exist already.
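A companion example that reads the number back might look like the following sketch (the file name is assumed to match the previous example):
#include <stdio.h>
int main()
{
int num;
FILE *fptr;
fptr = fopen("program.txt", "r"); /* open the same file for reading */
if (fptr == NULL)
return -1;
fscanf(fptr, "%d", &num); /* read the number back */
printf("Value read from file: %d\n", num);
fclose(fptr);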
return 0;
}
#include <stdio.h>
int main(void)
{
FILE *fptr;
char name1[20], name2[20];
int ID1, ID2;
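/* Continuation sketch (the exact original listing is assumed): read two
   comma-separated records of the form name,ID from a .csv file. */
char line[100];
fptr = fopen("records.csv", "r"); /* file name assumed */
if (fptr == NULL)
return -1;
fgets(line, 100, fptr);
sscanf(line, "%[^,],%d", name1, &ID1); /* tokenise the line at the comma */
fgets(line, 100, fptr);
sscanf(line, "%[^,],%d", name2, &ID2);
printf("%s %d\n%s %d\n", name1, ID1, name2, ID2);
fclose(fptr);
return 0;
}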
Note that if the .csv file has multiple records, we can read them all very conveniently with a loop
[prototype: while(fgets(line,100,fp)){...}], and we can keep storing the values
wherever we desire. The way shown above is one way to tokenise the read lines. There are other
ways to do the same, for instance, using the strtok() function. You may explore these techniques
on your own. [You may refer to the following article for the same:
https://www.geeksforgeeks.org/relational-database-from-csv-files-in-c/].
What you have seen above is one way of performing what is known as “inter-process
communication”. One program can “communicate” with another by writing what it wants to
communicate to a file, which can then be read by the other program. At a different level, even
users can communicate similarly. One user can place their message into some text file, which is
then read by the other user.
Task 5: Write a program that reads itself (the .c file containing its own source code) and displays
on the terminal each line present in the program. You may explore the use of the __FILE__
macro to solve the same without hardcoding the filename into the fopen() function (the
__FILE__ macro expands to a string whose contents are the filename of the current program;
for example, if we were to write printf("%s", __FILE__), then we would end up printing the
name of the file containing the source code of our current program).
Task 6: Write a program that cuts/moves the entire contents of one text file into another, leaving
behind the original file blank. Before writing the program, manually create two files “test1.txt”
and “test2.txt” and write some sample text into “test1.txt”, which you must then
programmatically move to “test2.txt”, thereby leaving behind “test1.txt” blank.
Home Exercise 3: You have been given a text file that contains the complete text of J. R. R.
Tolkien’s magnum opus, “The Lord of the Rings”. This text file is called “LOTR.txt”. Write a
program that reads it and counts the number of times the word “hobbit” appears in the text (case
insensitive) and then displays the count. You may need to use some functions supported by
string.h and ctype.h in addition to the file I/O functions.
[Refer to https://www.tutorialspoint.com/c_standard_library/ctype_h.htm and
https://www.tutorialspoint.com/c_standard_library/string_h.htm if required.]
Home Exercise 4: You are an engineer working at the Space Communications and Navigation
division of NASA. Earlier, NASA had sent two astronauts to the moon to inspect a recent crater
that has supposedly formed out of nowhere on its surface. Due to certain restrictions, the
astronauts can only telecommunicate with you in Morse code.
For some reason, there has been no information from the astronauts ever since they landed. This
is quite alarming because the supplies that they had taken with them could only last them a
month on the moon. After months of frightening silence, finally, it seems that a message has
been sent to your receptor. Your colleagues have already transcribed the Morse code into a text
file named “msg.txt”. Now, your job is to write a program that reads the text file and decodes the
message that has been sent from the astronauts’ transmitters.
The complete Morse code chart is given to you in the file “Morse.jpg”. Please refer to it while
constructing your program. (A forward slash ‘/’ has been used to denote space between words
in the message; blankspace ‘ ’ has been used to denote space between letters.)
Linked Lists:
Introduction:
Recall how the major disadvantage of using fixed-size arrays was the fact that their length could
not be altered at runtime. To avert this, we decided to make use of dynamic arrays whereby we
would realloc() every time we had exhausted the space allocated to us. There is actually another
way to go about this, and that is with the help of the data structure known as “linked list”.
Arrays are random access lists, whereas linked lists are sequential access lists. The key difference
here is that unlike arrays, linked list elements are not stored at a contiguous location in the
memory; the elements are linked using pointers. They include a series of connected nodes. Here,
each node stores the data and the address of the next node. There is a head pointer that points
to the first node in the linked list and from there on, each node contains its own data as well as
the pointer to the next node. The very last node in the linked list points to NULL, since it is the
last node in the list, thus marking the end of the linked list. Figure 2 below pictorially shows what
a linked list looks like, while Figure 3 illustrates how linked lists are stored in memory.
Figure 3: Illustrating linked lists in memory
Recall that array elements are stored contiguously, so inserting or deleting an element in the
middle requires shifting all the subsequent elements. This makes insertion and deletion very
expensive operations in dynamic arrays. There is also the additional disadvantage of performing
a realloc() whenever we want to resize the array, which is also potentially a very expensive process.
Linked lists avert the above problems by making insertion and deletion very simple tasks.
Insertion for a linked list simply entails that you create a node and then traverse to the position
where you want to perform the insertion, and then just adjust the next pointers of the nodes
there to fit this node into the position. This is a far simpler process than shifting every element to
the right. Deletion is also implemented very similarly to insertion (Hint: the free() function will be
used here).
The diagram below depicts how insertion is to be done into a linked list. We are inserting a new
element at the beginning of an existing list. Please note the changes in the pointers. The blue
lines indicate new connections and the crossed connections are removed. Also note the
increment in the value of the variable count, which stores the number of nodes in the list.
We have provided you with some sample code for the implementation of linked lists. Note that
we have only implemented the insertAfter() and printList() functions as demonstrative
examples. You may refer to the lecture slides for lectures 3-5 to get a better understanding of
these and also on how to implement the rest.
#include <stdio.h>
#include <stdlib.h>
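/* Assumed structure definitions (the original definitions are missing from the surviving
   listing); the member names follow their usage in the functions below. */
struct node
{
int data;
struct node* next;
};
struct linked_list
{
int count;          /* number of nodes in the list */
struct node* head;  /* pointer to the first node */
};
typedef struct linked_list* LIST;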
LIST createNewList()
{
LIST myList;
myList = (LIST) malloc(sizeof(struct linked_list));
myList->count=0;
myList->head=NULL;
return myList;
}
temp->next = n1;
n1->next = NULL;
l1->count++;
}
else
{
prev = temp;
temp = temp->next;
prev->next = n1;
n1->next = temp;
l1->count++;
}
return;
}
}
return;
}
It is left as an exercise for you to write a main() program for the above to test the functions.
Bear in mind that linked lists also have certain disadvantages. For instance, we no longer have
the ability to perform random access. This means, given a position, we would need to traverse
through all the elements prior to it to get to that position in a linked list. Whereas in a dynamic
array we can directly access it using arr[i-1] for ith position (element). Even in a sorted linked list,
we cannot perform binary search (since we cannot perform random access), whereas the same
can be done in a sorted dynamic array. In essence, arrays are faster to access and are more
suitable for searching. And linked lists are more suitable for faster inserts if they are random and
dynamic. Whether to choose linked lists or arrays for representing our data is a choice that we as
programmers make depending on our application and the problem that we are trying to solve.
There are also variations to linked lists. Some of them are briefly described below:
(1) Singly Linked List: The linked list that we have discussed above is a singly linked list. Each
node contains a single pointer, and that pointer points to the next node (which is NULL in
case of the last or right-most node). Here, we can traverse the list in one direction only.
(2) Doubly Linked List: Very similar to singly linked list, but now each node contains two
pointers. One points to the next node and the other points to the previous node. This
allows for two-way traversal in the list. Of course, here we will have two pointers pointing
to NULL (the previous of the left-most node and the next of the right-most node).
(3) Circular Linked List: This is a singly linked list with its last node’s next pointer pointing to
the first node in the list, instead of NULL. This means that if we were to traverse this list,
the traversal would never terminate as we would be encircling the same set of nodes
again and again.
Linked lists, in all their forms, are extremely useful data structures and are used in multiple
applications. It is left as an exercise for you to implement the above variations of linked lists and
find out their applications.
Task 8: Write a function rotate() that takes a linked list of integers as input and rotates the
elements left/anticlockwise/towards head by k steps. You will need to create your own linked list
of integers for this, before implementing the above function.
For example, if the linked list originally was [HEAD] -> 1 -> 2 -> 3 -> 4 -> 5 ->
[NULL], and k = 2, then the linked list should become [HEAD] -> 3 -> 4 -> 5 -> 1 ->
2 -> [NULL].
[Hint: For solving the problem, it may be helpful to make the linked list circular first.]
Home Exercise 5: Write a function that takes a linked list of integers as input and reverses all of
its elements.
For example, if the linked list originally was [HEAD] -> 1 -> 2 -> 3 -> 4 -> [NULL],
then after reversal, the linked list should be [HEAD] -> 4 -> 3 -> 2 -> 1 -> [NULL].
[Ideally, you should be able to perform this in one traversal of the linked list without using too
much extra space; in terms of complexity, your solution should run in O(N) time and take O(1)
auxiliary space]
Home Exercise 6: The largest numeric datatype in C, long long integers, has a maximum value of
9,223,372,036,854,775,807 (and 18,446,744,073,709,551,615 if unsigned). Oftentimes, this is
not enough, and we want to store more digits than these datatypes allow us to. One approach
towards solving this is to have a linked list of numbers, with each node of the linked list
representing one digit of the number. So, for example, the number 2023 could be represented in
the linked list as: [HEAD] -> 3 -> 2 -> 0 -> 2 -> [NULL]. It is clear that we store the
digits in a reversed order: LSB to MSB. Furthermore, let us restrict ourselves to positive integers
only for now (though this scheme can very easily be expanded to negative numbers as well).
Create structure definitions necessary for representing such large numbers as described above.
Then arbitrarily store two large numbers (each having more than 30 digits) in two such variables.
Create a function add() that takes two such large numbers as inputs, and returns another such
large number containing the sum of the two. Write a suitable main() function to test the same.
Cycle Detection:
In a linked list (i.e., a chain of nodes in which each node points to another), it is entirely possible that there is a
node (think of it as the last node) pointing to a node somewhere behind in the same linked list.
In other words, there is at least one node that can be reached more than once by following the
stream of next pointers. This is known as a cycle. If there is a cycle in a linked list, it means we
cannot traverse through it the way we are used to (because there is no node pointing to NULL,
which means the traversal will never terminate; we would keep looping in the cycle). Thus, there
must be a way for us to detect cycles in a linked list. This idea can then be extended later to other
complex data structures such as graphs (you will learn about graphs later in the course). Also,
notice that a circular linked list also has a cycle, in fact, the biggest cycle possible.
There are several ways of detecting cycles. The simplest approach is to include an integer
attribute called visited in the struct of the node, and then traverse the list whilst simultaneously
marking the nodes that we visit as “visited” by setting that attribute to 1 (by default the attribute
should be reset to 0). This way, when we traverse, if we happen to come upon a node that we
have already visited previously (its visited is set to 1), then we know that there is a cycle in the
linked list. A more space-efficient alternative, referred to in Task 9 below, is the Hare-Tortoise
(Floyd's cycle-detection) algorithm: two pointers traverse the list simultaneously, one moving one
node at a time (the tortoise) and the other moving two nodes at a time (the hare). If the list has a
cycle, the two pointers will eventually meet inside it; if the hare reaches NULL, the list is acyclic.
Task 9: Write a function hasCycle() that will perform the Hare-Tortoise algorithm on an input
linked list and determine whether or not there is a cycle in it. Write a main() function where you
test it on an acyclic linked list, a cyclic linked list, and a circular linked list. You will have to create
your own linked list before calling this function. Alternatively, you can append this function to
Task 7.
Efficiency Estimation:
As students of Data Structures and Algorithms, time and space are the most important assets to
us. They are the currency of our subject. In our field, we compare the efficiency of algorithms and
data structures for solving specific problems by which takes less time to execute and which
takes up less space in memory. Therefore, it becomes imperative
for us to be equipped with the tools that enable us to perform these measurements,
programmatically. In this section, we take a look at time measurements. You will learn about
space/memory measurements in the next labsheet (third week).
A standard template for doing the above measurement in C is provided below for your
convenience.
#include <sys/time.h>
struct timeval t1, t2;
gettimeofday(&t1, NULL);
// Perform the tasks whose execution time is to be measured
gettimeofday(&t2, NULL);
// Elapsed time in milliseconds (tv_sec holds seconds, tv_usec holds microseconds)
double elapsed_ms = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0;
Task 10: Create a structure for holding the following two attributes about a student: ID (string),
and CGPA (float). Observe that you are given a text file named “data.txt” that holds the above
information for 10,000 students in a comma-separated form (you may open the file and check
the data yourself).
10a: Read the file and store the data that you read into first a dynamic array of structs and then
a linked list of structs. Separately measure the times for populating the above two different data
structures. Compare the time efficiency for both data structures.
10b: Now, prompt the user to enter ten new entries to be entered into the records, but at specific
locations (also mentioned by the user). Then, insert those ten records into the two data structures
separately and measure the time taken by them for the same. Compare the time efficiency for
the above task as well for both data structures. [Note: Do not include user input time into the
time measurements.]
10c: Prompt the user to enter a position (within the valid range, ideally from somewhere in the
middle) from which to retrieve data from your data structure. Then, proceed to retrieve it and
display the retrieved data on the screen. Measure the time taken to do the above for both
structures and then compare their efficiency. [Note: Do not include user input time into the time
measurements; also do not include time taken to print the retrieved data.]
10d: Finally, delete every entry from both of the above data structures one by one (but separately
for both data structures), and measure the time taken to do the same for both. Compare this
time efficiency as well.
Note that you are expected to create helper functions to perform the above micro tasks (insertion
at the end, insertion at given position, retrieve, deletion from the end, etc). Also, make sure that
you are not interleaving the tasks between the two data structures, i.e., you must perform an
entire task for one data structure before attempting it for the other. Otherwise, you will not be
able to measure and compare the time taken by both. You should also create a suitable display
function to help you verify what you are doing at each stage is correct.
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
CS-F211: Data Structures and Algorithms
Lab 3: Heap Memory Estimation, Stacks and Queues
Introduction
Welcome to Week 3 of Data Structures and Algorithms! Over the last two weeks, we brushed up
on our C programming and are now (hopefully) comfortable with pointers, structs, file I/O, and
working on multi-file projects. We also looked at time measurement using the <sys/time.h>
library. This week, we will get started with data structures.
Objectives
- Efficiency Estimation (Heap Space Measurement)
- Abstract Data Types (ADTs)
- The Stack ADT
- The Queue ADT
For now, we would assume that the bulk of our memory usage would be on the heap. This is a
fair assumption to make when dealing with dynamic data structures. We can always convert
significant stack usage into heap usage (replacing static arrays with dynamically allocated arrays).
While there are specialised tools that can monitor space usage like valgrind, we would be using
a simple technique where we would add a wrapper around our memory management functions
(malloc() and free()). Whenever this new version of malloc() (say myalloc()) is called
requesting a certain number of bytes (say n) of memory, myalloc() would internally
request (sizeof(int) + n) bytes of memory (using malloc()). Subsequently,
myalloc() would store the integer n at the beginning of the block and return the pointer
starting from just after the integer holding the size. It would also update a global variable holding
the total size allocated.
Suppose we want to allocate 5 bytes to hold the character array {'h', 'e', 'l', 'l', 'o'}. When we use
malloc, our call would have been:
char *p = malloc(5);
This would have allocated 5 bytes and returned a pointer to the beginning of this block, which
can then be populated with the characters {‘h’, ‘e’,’l’,’l’,’o’}.
Post this, the resulting memory layout is illustrated in Fig 1. A block of 5 bytes contains the
required array. The arrow points to the location returned by malloc().
Fig 1: Memory layout when malloc(5) is used to store the character array
Now instead, if we had used myalloc(5), the total memory allocated would be sizeof(int)
(i.e., 4) + the requested size (i.e., 5) = 9 bytes of memory. The corresponding layout is illustrated in
Fig 2.
Fig 2: Memory layout when myalloc(5) is used to store the character array
As seen here, the memory requested is for 5+4 = 9 bytes. The first 4 bytes hold the integer 5. The
block of size 5 starting after the first four is returned by the function. In this way, the calling
function sees no change in the returned pointer, but in memory the size information has been
stored. So now, when this pointer is passed to the wrapper version of free() (say
myfree()), it would look for the integer stored just before the start of the block pointed to by
the pointer (by decrementing the void pointer by sizeof(int), i.e., 4 bytes, and dereferencing it as
an int called size), subtract this size from the global variable maintaining the total
allocated size, and then free the entire block including the integer.
Let us see both these new wrapper functions in more detail. As we saw last week, the
signature for malloc() is:
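void *malloc(size_t size);   /* as declared in <stdlib.h> */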
Here, size_t is an unsigned integer type made available through the <stdlib.h> header file. Thus,
the parameter for malloc() is the number of bytes that need to be allocated. It returns
a void pointer to the block of the required size. A void pointer is a pointer
that has no data type associated with it. A void pointer can hold an address of any type and
can be cast to any pointer type. However, void pointers cannot be dereferenced directly. Thus, we
must cast void pointers to pointers of some specific type before dereferencing them.
For example, if we need to access the integer stored starting from the location pointed to by the
void pointer ptr, we can either cast and dereference in a single expression, or first cast ptr into
an int pointer and then dereference that pointer.
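A minimal sketch of the two approaches (the variable names are illustrative):
int value = *(int *)ptr;    /* cast and dereference in one expression */

int *iptr = (int *)ptr;     /* or: cast into an int pointer first... */
int value2 = *iptr;         /* ...and then dereference it */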
In this way, void pointers allow us to have type independent memory management operations.
We would be ensuring that our wrappers (myalloc() and myfree()) have identical
parameters and return types as malloc() and free() so that the user can replace all
instances of malloc() and free() with their corresponding wrappers without worrying
about the data types.
#include <stdio.h>
#include <stdlib.h>
size_t heapMemoryAllocated = 0;
#define ADDITIONAL_MEMORY sizeof(int)
void *myalloc(size_t size)
{
void *ptr = malloc(size + ADDITIONAL_MEMORY);
if(ptr == NULL)
return NULL;
heapMemoryAllocated += size;
*((int *)ptr) = size;
return ptr + ADDITIONAL_MEMORY;
}
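/* Wrapper around free(), sketched from the description in the text (the original
   implementation referred to in the prose is assumed). */
void myfree(void *ptr)
{
void *block = (char *)ptr - ADDITIONAL_MEMORY; /* step back to where myalloc() stored the size */
int size = *((int *)block);
heapMemoryAllocated -= size; /* update the global counter */
free(block);                 /* free the entire block, including the size field */
}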
int main()
{
printf("Heap memory allocated: %zu\n", heapMemoryAllocated);
double *arr;
arr = myalloc(sizeof(double) * 40);
/* A second allocation (assumed here) so that the myfree(ptr2) call below has a target */
int *ptr2 = myalloc(sizeof(int) * 10);
printf("Heap memory allocated: %zu\n", heapMemoryAllocated);
myfree(arr);
myfree(ptr2);
printf("Heap memory allocated: %zu\n", heapMemoryAllocated);
return 0;
}
Here, if you observe myalloc(), its argument is the integer number of bytes requested. It
first requests malloc() to allocate ADDITIONAL_MEMORY (here sizeof(int)) + the
requested number of bytes. If the allocation was successful (ptr != NULL), the global variable
heapMemoryAllocated is updated. Now, we wish to store the number of bytes requested
for allocation, in the first ADDITIONAL_MEMORY bytes of the allocated block. For this, we simply
cast the pointer (which is currently of type void *) to int * and store the size there. Now,
the block starting at the end of this int value is returned to the user by the statement return ptr + ADDITIONAL_MEMORY;.
Since the signature is identical to that of malloc(), the user can replace all malloc() calls
with myalloc() calls without making any other changes in their code. myfree() works in a
similar manner. You can find an implementation of myfree()in the above code. Thus, all heap
memory management operations can be performed through this pair of functions.
Thus, using this method we can keep track of the heap space used by the program in a global
variable heapMemoryAllocated, with negligible overhead.
Abstract Data Types (ADTs)
An abstract data type is a theoretical specification of a data structure and its operations. It is a
powerful idea that allows us to separate how we use the data structure from the particular form
of the data structure. An ADT is defined by just its operations. These operations are made
available to the client, who can use them irrespective of how it is internally implemented.
Students familiar with Object Oriented Programming in Java might be reminded of interfaces,
and for a good reason, as they are one way of implementing ADTs. We would be implementing
ADTs in C using header files to specify the interface.
Formally, an ADT is a mathematically specified entity that defines a set of its instances, with:
1. A specific interface: This is a collection of signatures (or function declarations in C) of
operations that can be invoked on an instance. This might be provided as an interface in Java or
a header file in C.
2. A set of axioms (pre-conditions and post-conditions) that define the semantics of the
operations (i.e., what the operations do to instances of the ADT, but not how). These pre- and
post-conditions would be typically expressed in some form of predicate logic expressions (don’t
worry if you haven’t studied them).
This interface is provided to the client who can make use of it without worrying about the exact
implementation. Thus, the way the data is represented and operations are implemented does
not matter for an ADT. A single ADT (as we shall soon see) can have multiple implementations
and each implementation may differ in its performance.
ADTs allow us to separate the issues of correctness and efficiency. As long as the set of axioms is
satisfied by the implementation and the function calls are made as per the signatures provided
in the ADT, correctness is ensured. And performance can be optimized by modifying the
implementation of the ADT independently without worrying about the correctness being
compromised. As long as the pre- and post-conditions continue to be met, modifying the
implementation won't affect the correctness.
The Stack ADT
A stack is a Last In First Out (LIFO) structure. We insert and remove elements from the same end
(called the top) of a stack. It is analogous to a stack of cleaned dishes in a kitchen. As the new
dish is cleaned, it is added to the top of the stack. Whenever someone needs a clean dish, they
would pick it from the top.
In the stack data structure, the operation of inserting an element at the top is called pushing
the element to the stack, and the operation of removing the element from the top is called
popping the element off of the stack. These operations are illustrated in Fig 3.
One common application where one might choose to make use of a stack is when you might need
to provide the user of your application with an undo facility (like the one provided by your
favourite code editor). Here, you can store the history of changes made in a stack, pushing a new
change onto the stack, and whenever the user wants to undo it, you can pop the last change off
the stack and revert it.
Let us consider a stack that would contain members of a user-defined type 'Element'. This data
structure can now be represented as an ADT as follows:
Behaviour (Interface)
The following methods are included in the Stack ADT:
● Element top(Stack s): Get the last element
● Stack pop(Stack s): Remove the last element
● Stack push(Stack s, Element e): Add a new element
● Boolean isEmpty(Stack s): Find whether the list is empty
● Stack newStack(): Create a new stack
Implementation
Let us begin implementing this ADT now. We would create a multi-file project for implementing
our ADTs of this lab sheet. We have provided a few files to help you get started in the directory
"Stack". Let us first create the structure for our custom data-type Element that would populate
the stack in a new file named element.h. The file contents look like this:
#ifndef ELEMENT_H
#define ELEMENT_H
struct Element
{
int int_value;
float float_value;
};
typedef struct Element Element;
#endif
You would notice that our usual expected header file contents, namely the struct definition and
the typedef statement, are wrapped in an #ifndef - #endif block. This is called an "include-guard".
An include-guard has the syntax:
#ifndef <token>
#define <token>
header file contents…
#endif
This is used to prevent multiple inclusions of the file. Without include-guards, if the same header
file gets included twice, directly or indirectly, in a file, it would lead to a compilation error.
Include-guards prevent this multi-inclusion from happening. While this is not strictly needed in a
project with a small number of files, where we can manually keep track of which headers are
included in which file, it is good practice to use them when dealing with multiple files. You can
learn more about include-guards at https://en.wikipedia.org/wiki/Include_guard.
As we saw, some functions here require us to have a boolean datatype. While historically C did
not have an explicit inbuilt boolean datatype, one was included in the header file <stdbool.h> in
C99. Alternatively, we can also define it using an enum as follows in a file called bool.h.
#ifndef BOOL_H
#define BOOL_H
typedef enum { false, true } bool;   // false = 0, true = 1
#endif
Thus you now have a simple implementation of bool that can be used in any of our files by simply including this file. Since the members of the enum are in the order {false, true}, false is implicitly equal to the integer constant 0 and true to the integer constant 1, which is the convention followed in C.
Now, let us specify our method (or function) signatures of the stack ADT in a new file called
stack.h.
#ifndef STACK_H
#define STACK_H
#include "element.h"
#include "bool.h"
typedef struct Stack Stack;
Stack *newStack();
// Returns a pointer to a new stack. Returns NULL if memory allocation fails.
#endif
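The remaining declarations of stack.h are not reproduced above. Going by Task 1 below and the description of push(), they might look roughly like the following sketch, which would sit just before the #endif (the exact signatures, in particular whether the Stack is passed as a pointer, are assumptions — follow whatever your copy of stack.h actually declares):
bool push(Stack *s, Element e);   // returns false if the stack is full
bool pop(Stack *s);               // removes the top element; returns false if the stack is empty
Element top(Stack *s);            // returns (without removing) the top element
bool isEmpty(Stack *s);           // returns true if the stack has no elements
void freeStack(Stack *s);         // releases the memory held by the stack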
You would notice that the method signatures are slightly different from those mentioned in the
ADT above. This has been done for ease of implementation.
This file (stack.h) would be included and made available to the client. The client can look at the
method signatures and decide which calls to make.
Now let us implement the actual ADT. The stack ADT can be implemented in many different ways.
We would be looking at two ways here, namely, using an array and using a linked list.
Using Arrays:
Now we begin implementing the stack ADT using an array. As one would expect, an array-based implementation inherits the limitations of arrays: the stack cannot grow dynamically. Thus, we declare a maximum size for the stack, and our push function would return false when the stack is completely filled.
Let us start implementing the stack ADT using an array in a new file stack_array.c.
The file would start looking like this:
#include "element.h"
#include "stack.h"
#include <stdlib.h>
Stack *newStack()
{
Stack *s = (Stack *)malloc(sizeof(Stack));
if(s != NULL)
s->top = -1;
return s;
}
Here, the stack struct has first been defined (which was typedef'ed in the file stack.h). The array
implementation has an array member of size STACK_SIZE and an integer top representing the
current top index. In the function newStack(), we instantiate a stack dynamically, initialise the
top member to -1 signifying an empty stack and return a pointer to the instantiated stack. In
push(), an Element e is pushed onto a Stack s. The stack is passed by reference. The top variable
is incremented. In case the stack is full and we are unable to push the element, the function
returns false. Else the function returns true.
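The struct definition and the push() function referred to here are not shown in the listing above. Based on this description, they might look roughly like the following sketch (the value of STACK_SIZE, the field names and the exact signature are assumptions):
#define STACK_SIZE 100            // assumed maximum capacity

struct Stack
{
    Element elements[STACK_SIZE]; // the elements currently on the stack
    int top;                      // index of the top element, -1 when empty
};

bool push(Stack *s, Element e)
{
    if (s->top == STACK_SIZE - 1)
        return false;             // stack is full, cannot push
    s->top++;
    s->elements[s->top] = e;
    return true;
}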
Task 1: After pasting the above code in the file stack_array.c, complete the implementation by
adding definitions for the functions pop(), isEmpty(), top() and freeStack() in a similar fashion to
the same file. You are obviously required to follow the signatures provided in the file stack.h.
Now we are ready with the implementation of a stack using an array. We can test out the
implementation using a driver. A driver has been provided to you in a file called stackDriver.c.
Now, you can test out your implementation using the driver code. You can use the following
makefile for your convenience:
runStackWithArray: stackDriver.o stack_array.o
	gcc -o runStackWithArray stackDriver.o stack_array.o
	./runStackWithArray
clean:
	rm -f *.o runStackWithArray
In case some part of the above makefile does not make sense, do revisit the previous lab sheet to refresh your memory. You can now build and test your implementation with a single command: make runStackWithArray
Home Exercise 1: Modify the driver (stackDriver.c) to ensure that all properties (axioms) of the
stack ADT are satisfied by your implementation. Now, test it out and handle the corner cases in
your implementation if needed.
Using Linked Lists:
Now let us implement the stack ADT using a linked list. First, we need an implementation of a linked list whose elements are of type Element. The linked list must implement the signatures provided to you in the linked_list.h header file.
Task 2: Go through the linked_list.h header file and implement the functions in a new file called
linked_list.c. As you would notice in the file, the linked_list and node structures have been
defined.
#include "element.h"
struct node
{
Element data;
struct node *next;
};
typedef struct node node;
typedef node * NODE;
struct linked_list
{
int count;
NODE head;
// NODE tail; // Not required for stack. Required for Queue
};
typedef struct linked_list linked_list;
typedef linked_list * LIST;
We have 'typedef'ed the types linked_list, node, LIST and NODE to refer to the types struct linked_list, struct node, pointer to struct linked_list and pointer to struct node respectively. You need to implement the following functions:
1) LIST createNewList(): This function allocates memory for a new list and returns
a pointer to it. The list is empty and the count is set to 0.
2) NODE createNewNode(Element data): This function allocates memory for a new
node and returns a pointer to it. The next pointer is set to NULL and the data is set to the value
passed in.
3) void insertNodeIntoList(NODE node, LIST list): This function inserts
a node at the beginning of the list.
4) void removeFirstNode(LIST list): This function removes the first node from
the list.
5) void insertNodeAtEnd(NODE node, LIST list): This function inserts a
node at the end of the list. (Optional for stack, but would be required when we implement
queue).
So, now we have a linked list, but we are no closer to creating a stack … or are we? You might
have realised that we are almost done. We just need to now simulate the stack using the above-
defined linked list. Let us try to understand how we can go about this.
A stack, as we know, is characterised by its two characteristic operations: push and pop. How can
these be carried out on a linked list? The stack is a LIFO data structure. A push is an insert
operation. So, it would have to use one of the insert methods of the linked list. In our current
implementation of the linked list, we only have the head pointer. Thus, it is most efficient to use the head of the list as the top of the stack, since push and pop then correspond directly to insertion and deletion at the head and run in O(1) time.
Now, similarly, the other methods of the stack ADT can also be simulated by the methods of the
linked list.
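For instance, push() and pop() might reduce to little more than wrappers over the linked-list operations, as in the sketch below (it assumes that the Stack struct simply holds a LIST member named list, which is an assumption, and that stack.h uses the signatures discussed earlier):
bool push(Stack *s, Element e)
{
    NODE n = createNewNode(e);        // new node carrying the element
    if (n == NULL)
        return false;
    insertNodeIntoList(n, s->list);   // the head of the list acts as the top
    return true;
}

bool pop(Stack *s)
{
    if (isEmpty(s))
        return false;
    removeFirstNode(s->list);         // O(1) removal at the head
    return true;
}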
Task 3: Create a new file stack_ll.c. Include the files linked_list.h and stack.h in this file. Implement
the methods of stack.h using linked_list.h. Extend the Makefile provided in the section on the
implementation of a stack using arrays to include the implementation of the stack using a linked
list as well. You can use the same driver (stackDriver.c) to test this code.
Task 4: Modify both implementations to include performance analysis. Compare the time taken
and the heap space utilized by both implementations.
You have been provided with two files heap_usage.h and heap_usage.c. Go through them and
understand how they work. You can use these functions to implement heap space measurement.
Replace all calls to malloc(), calloc(), realloc() and free() in stack_array.c and stack_ll.c with the
corresponding calls to myalloc() or myfree()...
You have been provided with three input files, namely, small.csv, medium.csv and large.csv.
These files have different numbers of rows of data consisting of students' BITSAT score and CGPA
at the end of the first year. For each of these files, find the time taken and heap space used in
populating the stack and then emptying the stack. You need to push all elements from this file
and then pop all of them. Print the maximum space utilised and the total time taken in
populating and emptying the stack for each input file. Ensure that you do not include file or I/O
operations in the time measurement. That means you would have to keep adding the
measured time for the push and pop operations to a counter and output the total time at the
end.
Consider the file cgStackDriver.c. Here the code for reading the file has been provided to you. As
you would notice, the method described in lab sheets 1-2 of reading a CSV file using fgets() and
strtok() has been used here. fgets() reads a line from the file and stores it in the line variable.
fgets() returns NULL when it reaches the end of the file. This condition is used to terminate the
while loop. strtok() splits the line into tokens separated by the comma and returns the first token
when called for the first time with the string line as the first argument and the delimiter "," as
the second. When called successively with NULL as the first argument, strtok returns the
subsequent tokens.
Modify the file to achieve the functionality required in this task as described in the comment blocks. A helper function iftoe() has been provided to instantiate the Element struct, which you can use if needed. iftoe() takes two arguments, an integer and a float, and returns an Element whose integer value is set to the integer argument and whose float value is set to the float argument.
The input file is to be provided to this main as a command line argument. You can run it with
different files and thus compare the performance.
For example: if the executable is cgStack.exe, it can be executed with the input file small.csv by
running the following command:
./cgStack.exe small.csv
Also, find out what are the asymptotic complexities of the operations in your implementation
theoretically. Observe whether your empirical results match the theoretical time and space
complexities.
Home Exercise 2: Arithmetic expressions can have either infix, prefix or post-fix notation. We are
used to the infix notation in general where the binary operator is present between the operands.
For example, "3+4", "1-(3-4*6)+4/2", etc. The same expressions in postfix notation would look like "3 4 +" and "1 3 4 6 * - - 4 2 / +" respectively. This may not look very intuitive to us, but the postfix notation has certain advantages. One such advantage is that infix notation needs parentheses to override operator precedence, whereas the evaluation order is implicit in the postfix notation. The postfix notation is also known as Reverse Polish Notation (RPN).
Your task is to write a program that would accept a string containing an RPN arithmetic expression
as input, and print the result as the output. You can assume that the operators would be among
+, -, * and /, your operands would be integer values, these would be separated
by spaces (as given in the example below), and your result should be the floating-point result
obtained on evaluating the expression. (Hint: You can use the stack you have implemented.)
For example: If the input is "1 3 4 6 * - - 4 2 / +", the result would be 24.0.
If the input is "1 2 3 4 5 + * - -", the result would be 26.0.
Home Exercise 3: Given an array X, the span S[i] of element X[i] is the maximum number of consecutive elements, ending at and including X[i], that are all less than or equal to X[i]. In other words, S[i] is the maximum k such that the elements X[i], X[i-1], ..., X[i-k+1] are all less than or equal to X[i].
Fig 4. Example of spans. X is the input array and S is the spans array.
Consider the example in Fig 4. The span for element 5 is 3 since the elements 3, 4 and 5 are <=
5. You have been provided with the incomplete file computeSpan.c in the "HW Exercises"
directory. Complete the computeSpans() function using your implementation of stack such that
the function has a complexity of O(n). [Hint: Look at what happens at the (S[i] - 1)th element.]
The Queue ADT
Queues differ from stacks in that queues follow the first-in-first-out (FIFO) principle. You can
think of the queues of students during lunch hours at IC. As opposed to stacks, in a queue,
insertion and deletion occur at different ends. The insertion (called the enqueue operation) is
said to take place at the rear end of the queue, while the removal (called the dequeue operation)
takes place at the front end of the queue. These operations are illustrated with an example in Fig
5.
Many scheduling algorithms make use of queues and their variations to ensure fairness and
efficiency. For example, consider a workspace that has a shared printer between 10 users. It is
possible that multiple users may submit print jobs simultaneously to the printer. If the printer
services these jobs in an interleaved manner, it would result in garbage output. So to prevent
this, the printer maintains a queue of jobs and services the jobs that arrive in a FIFO manner.
Behaviour (Interface)
The following methods are included in the Queue ADT:
● Queue createQueue(): Create an empty queue
● Queue enqueue(Queue q, Element o): Insert object o at the rear of the
queue
● Queue dequeue(Queue q): Remove the object from the front end of the queue;
throw an error if queue is empty
● Element front(Queue q): Return (and not remove) the element at the front end
of the queue; throw an error if queue is empty
● int size(Queue q): Return the number of elements in the queue
● boolean empty(Queue q): Return TRUE if the queue is empty, FALSE otherwise
Properties (Axioms)
The following axioms must hold for the implementation to ensure correctness:
● Front(Enqueue(createQueue(), v)) == v
● Dequeue(Enqueue(createQueue(), v)) == createQueue()
● Front(Enqueue(Enqueue(Q, w), v)) == Front(Enqueue(Q, w))
● Dequeue(Enqueue(Enqueue(Q, w), v)) ==
Enqueue(Dequeue(Enqueue(Q, w)), v)
Implementation
As we did for the Stack ADT, we would begin implementing the queue ADT now. Again, we have
provided a few files to help you get started in the directory "Queue". We would first be specifying
the behaviour in a queue.h header file as follows:
#ifndef QUEUE_H
#define QUEUE_H
#include "element.h"
#include "bool.h"
typedef struct Queue Queue;
Queue *createQueue();
// createQueue() returns a pointer to a new Queue instance.
// enqueue() returns true if the element could be added to the queue, false otherwise.
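The rest of queue.h is not reproduced here. Based on the ADT above and the discussion that follows, the remaining declarations might look roughly like this sketch (the exact names and parameter types are assumptions — follow your copy of queue.h):
bool enqueue(Queue *q, Element e);   // returns true on success, false if the queue is full
bool dequeue(Queue *q);              // removes the front element; returns false if the queue is empty
Element front(Queue *q);             // returns (without removing) the element at the front
int size(Queue *q);                  // number of elements currently in the queue
bool empty(Queue *q);                // returns true if the queue is empty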
Again as we did in stacks, we have modified the method signatures here. As we pass the structure
by reference, we can mutate the same structure; hence, our mutators need not return the
modified structure. Instead, they now return a bool denoting whether the enqueue() or
dequeue() operation was successful or not. Note that this is a design choice that we must always
make – whether to have our functions return the modified structure, which then has to be
assigned to the earlier structure, or to have them passed by reference and let the original
structure itself be modified internally.
Using Arrays:
When we first think of using an array to simulate a queue, we might be tempted to simply start
adding elements onto consecutive locations starting from the start of the array and dequeuing
them from the front at the same time. We would require just an array and two integer variables
for this implementation. The two variables would represent the front and rear of the queue
respectively. As we enqueue() more elements onto the queue at the rear, the rear starts moving
rightward (ie its index starts increasing). When we dequeue() elements, the front also starts
moving rightward. However, this implementation has a drawback. The space freed up by a
dequeued element cannot be reused for some other element.
This drawback can be resolved by using the array circularly. Here, we allow both the front and the rear to drift rightward, with the contents of the queue "wrapping around" the end of the array as necessary. Assuming that the array has fixed length N, new elements are enqueued toward the "end" (rear pointer) of the current queue, progressing from the front to index N − 1 and continuing at index 0, then 1. The following figure represents such a queue with first
element F and last element R.
Use this logic now to implement the queue ADT. You can alternatively use front and size as the
two variables as front, rear and size are linked to each other through the equation:
rear = (front + size - 1) % ARRAY_SIZE
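With the front-and-size representation, the heart of enqueue() and dequeue() is just this index arithmetic. The sketch below illustrates it; the struct layout, the value of ARRAY_SIZE and the signatures are assumptions, so adapt them to your queue.h:
#define ARRAY_SIZE 100   // assumed fixed capacity N

struct Queue
{
    Element elements[ARRAY_SIZE];
    int front;   // index of the first element
    int size;    // number of elements currently stored
};

bool enqueue(Queue *q, Element e)
{
    if (q->size == ARRAY_SIZE)
        return false;                                 // queue is full
    int rear = (q->front + q->size) % ARRAY_SIZE;     // one position past the last element
    q->elements[rear] = e;
    q->size++;
    return true;
}

bool dequeue(Queue *q)
{
    if (q->size == 0)
        return false;                                 // queue is empty
    q->front = (q->front + 1) % ARRAY_SIZE;           // wrap the front around if needed
    q->size--;
    return true;
}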
Task 5: Implement the Queue ADT as specified by the signatures in the queue.h file in a new file
called queue_array.c. Include the heap_usage.h file and make all the memory management calls
through the wrapper functions. Test the correctness of your implementation using a driver
similar to stackDriver.c.
Task 6: Modify the linked list defined in the linked_list.h file to include the tail pointer. Now also modify the function definitions in the linked_list.c file so that the correct value is maintained at the tail pointer as you carry out different operations on the list.
You should have implementations (including the tail pointer) for all the functions defined in the
linked_list.h file. Importantly, we should have the functionality to insert and remove elements
from the head and insert elements from the tail. (Is it possible to have an O(1) function to remove
the tail element? If yes, how? If not, why not?)
Task 7: Now implement the queue ADT in a new file queue_ll.c using the modified linked list as described above. Check its correctness and compare the performance of both implementations of queues in a similar manner to Task 4. You can refer to the driver provided there (cgStackDriver.c) for the general flow and the file I/O template.
Home Exercise 4: The operating system is responsible for scheduling the different processes.
There are various scheduling algorithms that can be used. The simplest one is the First-Come-
First-Serve (FCFS) algorithm. In this algorithm, the processes are scheduled in the order in which
they arrive. The next process is scheduled only after the current process has finished executing.
This algorithm is very simple to implement and is used in batch systems.
You are given an input file fcfs_input.txt in the "HW Exercises" directory that contains the details
of the processes that arrive. The file contains the following information:
The first line contains the total number of processes n.
The next n lines contain the details of the processes in the following format:
pid arrival_time burst_time
You can assume that the processes arrive in the order in which they are given in the input file.
The pid is a unique identifier for the process. The arrival_time is the time at which the process
arrives. The burst_time is the time required by the process to finish executing.
You are required to modify your queue implementation to implement the FCFS algorithm. The
queue should contain the processes that are waiting to be executed. The queue should be
implemented using a linked list.
A sample fcfs_input.txt:
5
1 0 5
2 1 3
3 3 2
4 8 4
5 9 1
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
CS-F211: Data Structures and Algorithms
Lab 4: Insertion Sort & Merge Sort
Introduction
Welcome to week 4 of Data Structures and Algorithms! Today we are going to get started with
one of the most fundamental problems when dealing with collections of data — Sorting. Sorting
is often required in itself or as a pre-requisite for other algorithms. For example, Binary Search
requires the input array to be sorted. So, it becomes important to have efficient techniques to
sort data. We would be looking at and implementing two important sorting algorithms - Insertion
Sort and Merge Sort. We would also be comparing their performance and understanding the
different use cases where one of them might be better than the other. Towards the end, we shall
see how we can sort data stored in arbitrarily large files that don't even fit in memory.
The numbers to be sorted are also known as keys. Oftentimes, the members to be sorted would
be records (or objects) having multiple fields. In these cases, they are sorted based on some
particular field called the key. That is, the records are reordered such that the specific field of all
instances are in the sorted order. A practical example of this would be sorting “student” records
based on the CGPA field for PS allotment.
Insertion Sort
Insertion sort is a simple sorting algorithm. It is the algorithm that many people end up using
while sorting a hand of cards. You start with an empty hand and each time a card is dealt to you,
you place it in the correct position in your hand. To find the correct position for a card, we
compare it with each of the cards already in the hand, from right to left, as illustrated in
Figure 1. At all times, the cards held in the hand are sorted.
Thus, we first place the first card in the hand, then we place the second card in the correct
position relative to the first card, then we place the third card in the correct position relative to
the first two cards, and so on. The algorithm is called insertion sort because we insert each card
into its correct position in the hand.
At each step, we "insert in order" the new element into the sorted part of the array. Each
insertion assumes that the elements to the left of it are already sorted. Thus, insertion sort is an
application of the divide-and-conquer paradigm. In this paradigm, we divide the problem into
subproblems, solve the subproblems, and then combine the solutions to the subproblems to
solve the original problem. In insertion sort, the original problem is to sort an array of n elements.
The subproblems are to sort the first n-1 elements and to insert the nth element into the sorted
array of the first n-1 elements. The solution to the original problem is to solve the subproblems
and then insert the nth element into the sorted array of the first n-1 elements.
For this lab sheet, we would be implementing the iterative version of insertion sort as discussed
in class. The algorithm (for an array of integer elements) is as follows:
void insertionSort(int A[], int n)
{
for(int j = 1; j < n; j++)
{
insertInOrder(A[j], A, j);
}
}
This code calls the subroutine insertInOrder() which is defined in the following manner:
// Pre-condition: (length(A) - 1 > last) & forall j: 0 <= j < last - 1: A[j] <= A[j+1]
void insertInOrder(int v, int A[], int last)
{
int j = last - 1;
while(j >= 0 && v < A[j])
{
A[j+1] = A[j];
j--;
}
A[j+1] = v;
}
// Post-condition: forall j: 0 <= j < last: A[j] <= A[j+1]
Here, we start comparing the element to be inserted (int v) with the elements from index last - 1 (ie., A[j] where j = last - 1) down to index 0. Elements greater than v are shifted one position to the right until an element less than or equal to v is found, and v is placed just after it.
Here, a run is shown on the array {5, 2, 4, 6, 1, 3}. Array indices appear above the rectangles, and
values stored in the array positions appear within the rectangles. (a)–(e) show the iterations of
the for loop in insertionSort() function. In each iteration, the black rectangle holds the key taken
from A[j] (ie v), which is compared with the values in shaded rectangles to its left in the test of
the while loop in the function insertInOrder(). Shaded arrows show array values moved one
position to the right in the body of the while loop, and black arrows indicate where the key moves
to in the last line of the function insertInOrder(). (f) shows the final sorted array.
Task 1: Consider the following structure for storing the details of a person as defined below:
struct person
{
int id;
char *name;
int age;
int height;
int weight;
};
Write an implementation of insertion sort that would sort an array of persons by referring to
the description of the insertion sort algorithm above. The key field for this implementation is to
be considered as height.
For example:
Consider an array of persons containing the following data (Table 1):
id name age height weight
1 Sokka 15 150 45
3 Zuko 16 160 50
4 Katara 14 145 38
5 Toph 12 113 30
After sorting by height, the array should contain:
5 Toph 12 113 30
4 Katara 14 145 38
1 Sokka 15 150 45
3 Zuko 16 160 50
Also write an appropriate main() function that initialises an array with the values of Table 1,
passes it as the input to your implementation of insertion sort and prints the array post sorting.
Task 2: You have been provided with a set of files having file names datX.csv where X stands for
the input size. These files contain comma-separated entries for the details of the students. Write
a program to read the data from these files and store them in a dynamically allocated array of
struct person as defined in Task 1. You may take in the input file as a command line
argument. Now, run the insertion sort algorithm on this array of struct person to sort the
data in ascending order of the height of the students. Measure the time taken by the algorithm
for sorting each of the input-sized arrays individually. Plot a graph of the time taken by the
algorithm against the input size. (You can use any spreadsheet software like Microsoft Excel or
plotting software to plot the graph. Interested students may also try using the python library
matplotlib; you may refer to this tutorial:
https://matplotlib.org/stable/tutorials/introductory/pyplot.html)
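One simple way to measure the time taken is the clock() function from <time.h>. The sketch below is only illustrative: the helper name timeSort and the signature of the sorting routine are assumptions, and struct person is the one defined in Task 1. Call it around just the sorting step so that file reading is excluded from the measurement.
#include <time.h>

// Times a single run of a sorting routine over an array of n persons and
// returns the elapsed CPU time in seconds.
double timeSort(void (*sortFn)(struct person[], int), struct person A[], int n)
{
    clock_t start = clock();
    sortFn(A, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}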
Our plot should reflect an approximately quadratic relationship between the time taken and the
size of the array. We can also see that insertion-sort is in-place. In other words, it does not require
any significant additional space over the memory required to store the original array itself. The
sorted array ends up at the same locations as the original array and the elements are just moved
around.
Home Exercise 1: Write a program to sort elements of an array of struct person (as defined in
Task 1) in lexicographical order of the name of the students. Measure the time taken by the
algorithm for each of the input sizes. Plot a graph of the time taken by the algorithm against the
input size. (You can use any spreadsheet software or plotting software to plot the graph.) [Hint:
You can use the strcmp() function of the string.h library to compare two strings.]
Merge Sort
Merge sort also uses the divide-and-conquer paradigm. Here, the problem of sorting the
sequence is reduced to sorting sub-sequences followed by the merging of the sub-sequences.
To sort a sequence S with n elements using the three divide-and-conquer steps, the merge-sort
algorithm proceeds as follows:
1. Divide: If S has zero or one element, return S immediately; it is already sorted. Otherwise
(S has at least two elements), remove all the elements from S and put them into two
sequences, S1 and S2, each containing about half of the elements of S; that is, S1 contains
the first ⌊n/2⌋ elements of S, and S2 contains the remaining ⌈n/2⌉ elements.
(Recall that the notation ⌊x⌋ indicates the floor of x, that is, the largest integer k, such
that k ≤ x. Similarly, the notation ⌈x⌉ indicates the ceiling of x, that is, the smallest
integer m, such that x ≤ m.)
2. Conquer: Recursively sort sequences S1 and S2.
3. Combine: Put the elements back into S by merging the sorted sequences S1 and S2 into
a sorted sequence.
It is intuitive to think of recursion when specifying the implementation. The algorithm for the
mergeSort() function is given as follows:
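The listing is not reproduced here; a minimal sketch of the recursive mergeSort() (assuming a merge() routine with the signature merge(A, s, mid, e), which is an assumption) would be:
void mergeSort(int A[], int s, int e)
{
    if (s >= e)                    // zero or one element: already sorted
        return;
    int mid = (s + e) / 2;
    mergeSort(A, s, mid);          // recursively sort the left half
    mergeSort(A, mid + 1, e);      // recursively sort the right half
    merge(A, s, mid, e);           // merge the two sorted halves
}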
Here, mergeSort() recursively calls itself on the two halves of the array and the two sorted halves
are then merged to form one sorted array through the merge() routine. An illustration of merge
sort is provided in Fig 2.
Fig 2. Illustration of Merge Sort
Here, in Fig 2, the merge-sort tree T for an execution of the merge-sort algorithm on a sequence with 8 elements is shown. (a) shows the input sequences processed at each node of T, while (b) shows the output sequences generated at each node of T.
Merge
The merge function combines two sorted arrays into a single array. In this algorithm, it is used to
combine the two recursively sorted subarrays into a merged array. The merge routine is also
known as the “two-pointer technique”, because we are moving ahead two pointers and
“merging” from the appropriate one according to the comparison between the two elements.
In other words, the merge operation reads/deletes elements from the front of two sorted
sub-lists and adds them onto the rear of a new sorted list.
This is FIFO (First-In-First-Out) order, i.e., the behaviour of merge can be described as follows:
1. reading from two FIFO lists (where FIFO order matches sorted order)
2. writing into a new FIFO list (resulting in a sorted order).
In Fig 2 (b), you can observe the merge operation being performed at each step.
We can make use of an auxiliary function here that takes two sorted arrays L1 and L2, indexed from s1 to e1 and s2 to e2 respectively, and stores the sorted result in an array L3 indexed from s3 to e3:
void mergeAux(int L1[], int s1, int e1, int L2[], int s2, int e2, int L3[], int s3, int e3);
Given an implementation of mergeAux(), merge() can simply act as a wrapper in the following
manner:
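A sketch of such a wrapper is given below; the signature of merge() and the use of a dynamically allocated temporary array are assumptions (and <stdlib.h> is assumed to be included):
void merge(int A[], int s, int mid, int e)
{
    int n = e - s + 1;
    int *temp = malloc(n * sizeof(int));             // auxiliary array of size O(n)
    mergeAux(A, s, mid, A, mid + 1, e, temp, 0, n - 1);
    for (int i = 0; i < n; i++)                      // copy the merged result back into A
        A[s + i] = temp[i];
    free(temp);
}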
When L1 or L2 or both are empty, we need to handle the corner cases appropriately. Try to do
this yourself! :)
void mergeAux (int L1[], int s1, int e1, int L2[], int s2, int e2, int
L3[], int s3, int e3)
{
int i,j,k;
// Traverse both arrays
i=s1; j=s2; k=s3;
while (i <= e1 && j <= e2) {
// Check if current element of first array is smaller
// than current element of second array
// If yes, store first array element and increment first
// array index. Otherwise do same with second array
if (L1[i] < L2[j])
L3[k++] = L1[i++];
else
L3[k++] = L2[j++];
}
// Store remaining elements of first array
while (i <= e1)
L3[k++] = L1[i++];
// Store remaining elements of second array
while (j <= e2)
L3[k++] = L2[j++];
}
Read the above implementation and try to understand its working.
Merge by insert
Now, in both versions of mergeAux() that we saw, we had to allocate an additional array of size
O(n). Can this be avoided?
We can modify our merging logic by now inserting the elements of L1 into L2 (similar to insertion sort). This comes at a greater time complexity but requires less space. You will learn more about this space-time tradeoff in week 7's labsheet.
Task 3: Implement merge sort as described above for sorting an array of integers. Define the
main() and the mergeSort() functions in the file intMergeSort.c. Define the signatures for merge()
and mergeAux() in the files intMerge.h and intMergeAux.h respectively. In the file intMerge.c
write the implementation of merge() using mergeAux(). Write the iterative and recursive implementations of mergeAux() in the files intMergeAuxIter.c and intMergeAuxRec.c respectively. Also implement the merge by insert in the file intMergeByInsert.c. You have been provided a file
balances.txt that contains the bank account balances of the customers of a given branch.
Compare the time taken and peak heap space used in these three implementations for sorting
the input given in the file balances.txt.
Task 4: Similar to Task 1, implement merge sort (using the mergeAux() by iteration) for sorting
elements of struct person based on height as the key field. Now as described in Task 2,
measure the time taken in sorting the files of the form datX.csv that have been provided to you.
Plot a graph of the time taken by the algorithm against the input size. (You can use any
spreadsheet software or plotting software to plot the graph.) Compare this plot with that
obtained for insertion sort and try to justify the difference observed.
Home Exercise 2: Given an integer array A, find if an integer p exists in the array such that the
number of integers greater than p in the array equals p. Print the integer if it exists, else print
"No such integer found". While this problem can be solved directly, you are encouraged to use
your implementation of merge sort in your solution.
Input 2:
A = [1, 1, 3, 3]
Output:
No such integer found
Home Exercise 3: Given an integer array nums of size N, print all the triplets [nums[i],
nums[j], nums[k]] such that i != j, i != k, and j != k, and nums[i] +
nums[j] + nums[k] == 0. Your output should not contain duplicates. The order of printing
the triplets does not matter. Your solution should have a time complexity of O(N²) and a space
complexity of O(1).
Example 2:
Input: nums = [0,1,1]
Output: []
Explanation: The only possible triplet does not sum up to 0.
External Merge Sort
In all the sorting implementations we saw, we read the files into memory in a statically or
dynamically allocated array. However, this is not always possible. Many a time, the data to be
sorted is so large that it might not fit into memory. In this case, we use an approach known as
external merge sort.
We are going to be applying our merge-sort algorithm, but treat the input as a file rather than
an array. The external merge sort algorithm is described as follows:
1. Split into chunks small enough to sort in memory (“runs”).
2. Merge pairs (or groups) of runs using the external merge algorithm
3. Keep merging the resulting runs (each time = a “pass”) until left with one sorted file!
Let us say we have a huge file with, say, 10 M records that need to be sorted, while only up to 1 M entries can fit in memory at any given time.
Here we would start by populating an array of size 1 M with the first 1 M entries of the file as we
do with normal merge-sort. Now, we would run merge sort on this array and store the sorted
result in a new file (called sorted1.csv). In this way we would continue reading, sorting and storing
the 10 chunks of size 1 M each in 10 files (sorted1.csv, …, sorted10.csv).
Now, we can merge these files 2 at a time to form the final sorted output file. [Note that while
merging we would need to read only one record each from the two files and store the smaller
one in the final sorted file. Thus only two records need to be in memory at the same time.]
In this way, the approach of external merge sort can be used to sort files having an arbitrarily
large input size.
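As an illustration, the file-level merge of two sorted runs might look like the sketch below. It assumes, for simplicity, that each run is a text file with one integer key per line; for the actual task you would read and compare CSV records instead.
#include <stdio.h>

// Merge two sorted runs into a single sorted output file, keeping only one
// record from each input file in memory at any time.
void mergeRuns(const char *run1, const char *run2, const char *outName)
{
    FILE *f1 = fopen(run1, "r");
    FILE *f2 = fopen(run2, "r");
    FILE *out = fopen(outName, "w");
    int a, b;
    int haveA = (fscanf(f1, "%d", &a) == 1);
    int haveB = (fscanf(f2, "%d", &b) == 1);
    while (haveA && haveB)
    {
        if (a <= b) { fprintf(out, "%d\n", a); haveA = (fscanf(f1, "%d", &a) == 1); }
        else        { fprintf(out, "%d\n", b); haveB = (fscanf(f2, "%d", &b) == 1); }
    }
    while (haveA) { fprintf(out, "%d\n", a); haveA = (fscanf(f1, "%d", &a) == 1); }
    while (haveB) { fprintf(out, "%d\n", b); haveB = (fscanf(f2, "%d", &b) == 1); }
    fclose(f1);
    fclose(f2);
    fclose(out);
}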
Task 5: Implement external merge sort as described above and sort the file dat7578440.csv which
has approximately 7.5 M entries using chunks of size 1M. [You may use a smaller chunk size if 1
M does not fit in memory]
Home Exercise 4: You are the captain of the 99th Precinct of the NYPD. You have been given a
list of criminals and their crimes in two files:
- criminal_database.txt
This file contains the list of criminals. The first line contains the number of criminals, and each of the following lines contains a criminal's name, current age, and ID in the format name,age,ID.
- crimes.txt
This file contains the list of the crimes, the year it was committed, and the ID of the criminal
who committed them.
The format of the file is as follows:
The first line contains the number of crimes (m)
The next m lines contain the crime, year, and ID of the criminal in the following format:
crime,year,ID
Now the Police Department has decided to calculate a score for each criminal called criminality.
The criminality is calculated as follows:
The criminality of a criminal is the sum of the crime coefficients for each count of crime
committed by the criminal.
If the criminal is less than 18 years old while committing the crime, the crime coefficient for
that crime is multiplied by 0.5. The current year is 2023.
For example, if a criminal named Doug Judy is 25 years old and has committed 2 crimes:
- Arson in 2010
- Grand Theft Auto in 2020
His year of birth is 2023 - 25 = 1998.
He committed Arson in 2010. His age while committing the crime is 2010 - 1998 = 12.
He committed Grand Theft Auto in 2020. His age while committing the crime is 2020 - 1998 =
22.
So the criminality of Doug Judy is 10 * 0.5 + 10 = 15 since he was a minor while committing
Arson and he was an adult while committing Grand Theft Auto.
You have to create a function to calculate the criminality of each criminal. Create an array of
elements of the structure you defined and store the required details of the criminals and the
crimes committed by them in it. Sort the array of structures in the order of the criminality of
the criminals in descending order and store the sorted array in a file called sorted_criminals.txt.
Sample Input:
criminal_database.txt:
3
Doug Judy,52,200
Melanie Hawkins,61,3
James Dylan Borden,78,125
crimes.txt:
10
GRAND THEFT AUTO,1980,200
GRAND THEFT AUTO,1990,200
ARSON,2022,200
ARSON,2010,3
BREAKING AND ENTERING,2000,3
BREAKING AND ENTERING,1960,125
ROBBERY,1960,125
HOMICIDE,1975,3
HOMICIDE,1990,3
HOMICIDE,2000,125
OUTPUT:
sorted_criminals.txt:
Melanie Hawkins,61,3,45.000000
James Dylan Borden,78,125,27.500000
Doug Judy,52,200,25.000000
Explanation:
The crimes committed by Melanie Hawkins are:
ARSON,2010,3
BREAKING AND ENTERING,2000,3
HOMICIDE,1975,3
HOMICIDE,1990,3
Now, she was a minor when she committed the homicide of 1975, but was an adult during the
other three crimes.
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
CS-F211: Data Structures and Algorithms
Lab 9: Binary Search Trees
Introduction
Welcome to week 9! This week we shall familiarise ourselves with an important data structure
known as the Binary Search Tree, often abbreviated as BST. It is the first tree-based data structure
that you will learn in this course. It is built on top of the underlying concept governing the binary
search procedure, which you have learnt in your introductory computer programming course.
There are several useful operations that can be performed on a binary search tree, and we shall
learn them through examples and problems.
Note: All functions implemented in this labsheet are provided to you in a file named “bst.c”.
There is no need to copy-paste code from this document to your code editor.
Introduction to Binary Search Trees
The Binary Search Tree data structure is a tree-like organisation or encapsulation of the binary
search operation. It is a binary tree (meaning that each node can have at most two children) that
supports an easy searching mechanism. A binary search tree obeys the following property: “each
internal node y stores an element e such that the elements stored in the left subtree of y are less
than or equal to e, and the elements stored in the right subtree of y are greater than or equal to
e”. This property is fundamental in modelling its encapsulated binary search behaviour.
Evidently, a binary search tree can be stored as a linked data structure consisting of nodes and
pointers. A binary search tree containing integer keys can be implemented as follows:
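The structure definitions themselves are not reproduced here. Going by the field names used by the functions in the rest of this labsheet (value, left, right and root), they might look roughly like:
typedef struct Node Node;
typedef struct BST BST;

struct Node
{
    int value;     // the integer key stored at this node
    Node *left;    // pointer to the left child, NULL if absent
    Node *right;   // pointer to the right child, NULL if absent
};

struct BST
{
    Node *root;    // pointer to the root node, NULL for an empty tree
};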
The following functions can be used to create (and return) new binary search trees and nodes:
BST *new_bst()
{
BST *bst = malloc(sizeof(BST));
bst->root = NULL;
return bst;
}
Node *new_node(int value)
{
Node *node = malloc(sizeof(Node));
node->value = value;
node->left = NULL;
node->right = NULL;
return node;
}
Each node in our implementation contains a key and two pointers, one pointing to its left child
and the other pointing to its right child. The BST structure itself contains just a single pointer
pointing to the root of the tree. This is analogous to the linked list structure that contains just the
head pointer.
Figure 1: (a) is a BST with 6 nodes and height 2, (b) is a BST with the same 6 nodes but height 4
Observe that in Figure 1(a) the element at the root is 6, the elements 2, 5, and 5 in its left subtree
are no larger than 6, and likewise the elements 7 and 8 in its right subtree are no smaller than 6.
This property holds for every node in the tree. Similar observations can be made about the other
tree as well.
Bear in mind that a node in BST might contain an entire struct in itself, and need not just be
integer keys. However, no matter the struct that a BST’s nodes are made up of, the binary search
tree property will always have to be checked, which means that the struct must contain a key in
it that is a number (integer, float, or double) and that will be used for maintaining that property.
For example, if it is a BST containing “Student” nodes, the key might be their CGPA (which would
be a float attribute/member within the Student structure). Every function that we will be
discussing in this labsheet will be designed in accordance with an integer BST. But always
remember that this need not be the case, and in fact, for most problems, this will not be the case.
All the functions might have to be modified appropriately if the node structure is modified to fit
the needs of a different problem.
Traversals
The binary-search-tree property enables us to print out all the keys in a binary search tree in
sorted order by the simple “inorder tree walk” algorithm (which follows the sequence “go left,
print data, go right” at each node). Recursion is the cleanest way to implement this logic, though
it can be done iteratively as well. This recursive algorithm can be programmed as follows:
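A sketch of such a function is given below (the name inorder and the use of printf are assumptions; the version provided in bst.c may differ in its details):
void inorder(Node *node)
{
    if (node == NULL)             // base case: empty subtree
        return;
    inorder(node->left);          // go left
    printf("%d ", node->value);   // print data
    inorder(node->right);         // go right
}
It would be invoked on the whole tree as inorder(bst->root).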
As is clear from the above function, we first recursively call our traversal function on the current
node’s left child, then print the current node’s key value, and then call the traversal function on
this node’s right child. The base case of this function is simply the case when the current node is
NULL, and in that case, we return. Through the process of recursion, this simple piece of code
conducts a clean in-order traversal of the binary search tree.
An in-order traversal of a binary search tree gives us the elements of the BST in sorted order.
As you have seen in the lectures, one of the main indicators of performance for binary search
trees is their height (or depth), which is defined as the maximum depth of a leaf node from the
root of the tree (note that it is measured in terms of number of edges). The height of a search
tree limits the worst-case time complexity of insert and search operations (which we shall discuss
in the next section).
In-order traversal is not the only kind of traversal. There are also the appropriately named pre-
order traversal (which follows the sequence “print data, go left, go right”) and post-order
traversal (which follows the sequence “go left, go right, print data”).
Task 1: Implement the pre-order and post-order traversal functions in the bst.c source file. Now,
write alternate forms of all three of the traversal functions that print a string “null” for the
children in the trees that are NULL pointers, rather than leave them blank. [You can also try
converting them to iterative functions.]
Home Exercise 1: Write a function that performs a level-order traversal on a BST. A level-order
traversal is defined as the left-to-right breadth-first search on the BST. In other words, we
traverse all nodes at the first level first (left to right), followed by all nodes at the second level
(again, left to right), and so on till we traverse all the nodes at level d+1 (where d is the depth of
the BST). For instance, the level-order traversal of the tree in Figure 1(a) will be “6 5 7 2 5 8” and
that of the tree in Figure 1(b) will be “2 5 7 6 8 5”.
Home Exercise 2: Write a function to perform the reverse level-order traversal on a BST. A
reverse level-order traversal is a level-order traversal in which we start from the (d+1)th level left
to right, followed by the dth level (again, left to right), and so on upto the 1st level. For instance,
the reverse level-order traversal of the tree in Figure 1(a) will be “2 5 8 5 7 6” and that of the
tree in Figure 1(b) will be “5 6 8 7 5 2”.
Operations on Binary Search Trees
Insertion
The insertion operation on a binary search tree is very straightforward. We start scanning for the
correct position of the node to be inserted downward from the root node. This means that we
keep comparing whether our node to be inserted has a key greater than or less than the node
we are checking it with. If it is greater, then we move to our node’s right child. If less, we move
to the left child. We keep repeating this process till the child we are moving to becomes NULL. At
that point, we have found the position where our node is to be inserted. Simple pointer
manipulations at that location allow us to insert the node into its place.
The insert operation can be implemented as follows:
void insert(BST *bst, Node *node)
{
if (bst->root == NULL)
{
bst->root = node;
return;
}
Node *current = bst->root;
while (current != NULL)
{
if (node->value < current->value)
{
if (current->left == NULL)
{
current->left = node;
return;
}
current = current->left;
}
else
{
if (current->right == NULL)
{
current->right = node;
return;
}
current = current->right;
}
}
}
If the BST’s root is NULL, then the node we are trying to insert into the BST is going to become
the root. In every other case, we must perform the traversal downward and add the node to the
tree when we have found its correct position where it will satisfy the BST property.
Constructing a binary search tree involves performing multiple insert operations into it. But
note that the BST can end up looking very skewed or very balanced, depending on the order in
which we perform our insertions, and this affects the “height” of our BST, which in turn has a
bearing on the time complexity of all our BST operations. Finding clever ways for maintaining
height balance in a binary search tree is a big research topic in the field of data structures, and
we shall learn some such techniques in next week’s lab. For our purposes, we shall assume that,
in the worst case, our BST can indeed be heavily skewed and have an O(n) height, where n is the
number of nodes.
Task 2: Write a function constructBST() that takes as input an array of integers and creates a new
BST and then iteratively performs the insert operation on a BST with those integers. It returns a
pointer to the finally constructed BST (ie., BST*). [Bear in mind that this function will also
change if the node structure is changed, because in that case you would be requiring multiple
fields and the input to the function would be an array of structs.]
Querying
Searching
The searching operation in a binary search tree is logically very similar to the insertion operation
and also leverages the binary search tree property. Our aim here is to find whether an input key
exists in a given binary search tree. For this, we would perform the same traversal from the insert
operation and try to find out where this key “should be”, and then check whether it is actually
present there.
int search(BST *bst, int key)
{
Node *current = bst->root;
while (current != NULL)
{
if (key == current->value)
{
return 1;
}
else if (key < current->value)
{
current = current->left;
}
else
{
current = current->right;
}
}
return 0;
}
The search routine can easily be modified to return, instead of zero/one, a pointer to the node
where the key has been found.
The binary search tree property also gives us simple routines for finding the minimum and maximum keys in the tree, and the in-order predecessor (or successor) of a node. The minimum key, for instance, is found by repeatedly following left children from the root:
int minimum(BST *bst)
{
Node *current = bst->root;
while (current->left != NULL)
{
current = current->left;
}
return current->value;
}
Similarly, the in-order predecessor of a node that has a left child is the right-most node in its left subtree:
Node *predecessor(Node *node)
{
Node *current = node->left;
while (current->right != NULL)
{
current = current->right;
}
return current;
}
Task 3: Write a recursive function that takes a binary search tree as its input and finds out
whether it satisfies the binary search tree property. Now, convert this function to an iterative
one. You can test your functions by constructing an array of BSTs (which would be an array of
structs) using the arrays of numbers present in the n_integers.txt file that was provided to you in
lab 7 (it is provided along with this labsheet as well) and running your function on each element
of the array. Make use of your constructBST() function for the BST creation part. [Recall that the
n_integers.txt file contains arrays containing n integers (the arrays are on separate lines), each
array is also preceded by a number indicating the length of that array.]
Task 4: Write a function to determine the height of a binary search tree, starting from its root.
Test this function by constructing BSTs using the arrays in the n_integers.txt file.
Deletion
The deletion operation in a binary search tree is considerably more complex than the previously discussed operations. We need to be very careful that we do not
end up violating the BST property while performing the operation. Deleting a node x from a BST
involves three cases:
1. If x is a leaf node, ie., it has no children, then we simply remove the node by modifying its
parent to point to NULL instead.
2. If x has one child, then we can elevate that child to take up x’s position in the BST by
modifying x’s parent to point to x’s child followed by removing x.
3. If x has two children, we must find either x’s in-order successor or predecessor y and then
replace x’s contents with y’s. Once this is done, we can remove the successor (or
predecessor) node used by recursively calling this same function on that successor (or
predecessor).
The delete operation can be implemented as follows:
void delete(BST *bst, Node *node)
{
if (node->left != NULL && node->right != NULL)
{
// Node has two children: find the in-order successor (the left-most node of
// the right subtree), copy its value into this node, and then delete the
// successor node instead (it has at most one child)
Node *succ = node->right;
while (succ->left != NULL)
{
succ = succ->left;
}
node->value = succ->value;
delete(bst, succ);
return;
}
if (node->left == NULL && node->right == NULL)
{
// Node is a leaf
Node *current = bst->root;
if (current == node)
{
bst->root = NULL;
free(node);
return;
}
while (current != NULL)
{
if (current->left == node)
{
current->left = NULL;
break;
}
if (current->right == node)
{
current->right = NULL;
break;
}
if (node->value < current->value)
{
current = current->left;
}
else
{
current = current->right;
}
}
free(node);
return;
}
if (node->left == NULL)
{
// Node only has right child
Node* current = bst->root;
if (current == node)
{
bst->root = node->right;
free(node);
return;
}
while (current != NULL)
{
if (current->left == node)
{
current->left = node->right;
break;
}
if (current->right == node)
{
current->right = node->right;
break;
}
if (node->value < current->value)
{
current = current->left;
}
else
{
current = current->right;
}
}
free(node);
return;
}
if (node->right == NULL)
{
// Node only has left child
Node* current = bst->root;
if (current == node)
{
bst->root = node->left;
free(node);
return;
}
while (current != NULL)
{
if (current->left == node)
{
current->left = node->left;
break;
}
if (current->right == node)
{
current->right = node->left;
break;
}
if (node->value < current->value)
{
current = current->left;
}
else
{
current = current->right;
}
}
free(node);
return;
}
}
Task 5: Write a function named removeHalfNodes() that removes all nodes that have only one
child from an input BST. Do not invoke the delete() function implemented above to solve this.
You can test this function by creating some BSTs using some of the arrays of numbers given to
you in the n_integers.txt file, just like in Task 3.
Task 6: Create a new node structure for your BST, whereby a node contains struct person
instead of int value as its attribute. The struct is defined as follows:
struct person
{
int id;
char *name;
int age;
int height;
int weight;
};
Create a modified constructBST() function that creates a BST of these nodes by taking as input an
array of structs. Use the “height” field as the key of the BST. You may need to modify certain
other functions also to do this. Run this function with data from the datX.csv files (given).
Now, write a function LCA() that takes three inputs: a BST and the IDs of two nodes (ID is a field
in the struct that is a part of the node). This function finds out the “least common ancestor” of
the two corresponding nodes in the BST. The least common ancestor is defined between two
nodes p and q as the lowest node (ie., the node at maximum depth) in the tree that has both p
and q as its descendants (assume that a node can be a descendant of itself). Run the LCA()
function on two random IDs in the size 10 BST and manually verify its correctness.
Home Exercise 3: The downside of the deletion approach that we have followed has to do with
the case where the node has two children. In that case, with our approach (copying the successor's or predecessor's data into the node), the node actually deleted might not be the node passed to the delete procedure.
If other components of a program maintain pointers to nodes in the tree, they could mistakenly
end up with “stale” pointers to nodes that have been deleted.
The approach that we have taken was followed in the first two editions of your reference book
(Cormen, Leiserson, Rivest, Stein). But from their third edition onwards, they have updated it to
a slightly more complicated variant which resolves the downside that we just discussed. Refer to
pages 296-298 of the book’s third edition and construct a new, more powerful variant of the
delete() function on BSTs based on their approach.
When it comes to space complexity, if we implement recursive versions of the search, insert, and
delete operations, there are at most n stack frames in memory at a time. Equivalently, if we
implement the operations iteratively using some explicit stack mechanism, there could still be at
most n elements in the stack. This implies that the space complexity of these operations is O(n).
It is left as an exercise for you to find out the time and space complexities of the
minimum/maximum and predecessor/successor querying operations.
1
Interested students may refer to the proof of Theorem 12.4 on pages 300-303 of Cormen T.H., Leiserson, C.E.,
Rivest, R.L., and C. Stein. Introduction to Algorithms, MIT Press, 3rd Edition, 2009.
Task 7: Write a function that takes a binary search tree as an input and “flattens” it, ie., it
constructs and returns a linked list equivalent to the binary search tree. The constructed linked
list must be in the same order as that of the pre-order traversal of the BST. This process is
illustrated in Figure 2.
Now, try to perform the flattening “in-place” using O(1) auxiliary space only.
Home Exercise 4: Devise a function that takes a BST as an input and returns the kth smallest
element in the BST (where ‘k’ is also taken as an input).
Home Exercise 5: Recall that the worst-case search time in a BST is proportional to the depth of
the tree. Thus, to minimize the worst-case search time, the height of the tree should be made as
small as possible; by this metric, the ideal binary search tree is perfectly balanced.
However, in many applications of binary search trees, it is more important to minimise the total
overall cost of multiple searches rather than the worst-case cost of a single search. If a particular
node in a BST is searched more frequently than another, then it makes sense to put this node
closer to the root of the tree compared to the other node (even if that means increasing the
overall depth of the tree to something more than a perfectly balanced one). This way, we favour
the node that is searched more often and disfavour the one that isn’t. A perfectly balanced tree
need not be the best choice if we are aware that some items in the BST are significantly more
frequently looked up than others. In fact, an imbalanced tree may be a better option in that case.
In probabilistic terms, we are attempting to “optimise” (minimise) the “expected search cost” in
our binary search tree.
Suppose that you are given as inputs a sorted array key[0 … n-1] of search keys and an array
freq[0 ... n-1] of frequency counts, where freq[i] is the number of searches for keys[i]. Come up
with a function constructOBST() that takes the above arrays (along with their size n) as input and constructs a binary search tree that is optimal in view of the above discussion.
[Note: Your approach should be to come up with a seemingly straightforward solution to this that
apparently takes a lot of time, and then try to optimise your time complexity. A solution that runs
in O(n³) time exists for this problem. In fact, certain optimisations to that solution can also bring
down the time complexity to O(n²).]
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
CS-F211: Data Structures and Algorithms
Lab 10: AVL Trees and 2-4 Trees
Introduction
Welcome to week 10 of Data Structures and Algorithms! Last week, we studied a riveting data
structure - Binary Search Trees. We studied various operations that can be performed on a BST
such as insertion, querying, deletion, retrieving the kth smallest element, etc. For most of these
operations, for a tree having n nodes, the time complexity came out to be O(h), where h is the
height of the tree. While this height can be O(log n) when the tree is balanced, in the worst
case it can go to O(n) when the tree is extremely skewed. We would be studying how to ensure
O(log n) depth today using two different methods, leading to two data structures - AVL Trees
and 2-4 Trees.
Introduction to AVL Trees
An AVL tree (named after inventors Adelson-Velsky and Landis) is a self-balancing binary search
tree. It is a binary search tree that has to satisfy the height balancing property. The height
balancing property says that the heights of the two child subtrees of any node differ by at most
one.
Any binary search tree T that satisfies the height balancing property is said to be an AVL tree.
An example of an AVL tree is shown in Figure 1. The keys of the entries are shown inside the
nodes, and the heights of the nodes are shown above the nodes. You can appreciate the height
balancing property here, as for any node, the heights of the children differ by at most 1, which
leads to an almost balanced tree and ensures logarithmic height1.
Home Exercise 1: Write a function that takes a pointer to a BST (as defined in lab sheet 9) and
checks if it is an AVL tree.
The above function must return 0 when the tree is not an AVL tree and 1 when it is an AVL tree.
You might make use of a helper function that checks whether a node is height-balanced.
1
Refer to the proof of Theorem 3.2 as given in Michael T. Goodrich, Roberto Tamassia - Algorithm Design.
Foundations, Analysis, and Internet Examples-Wiley (2001)
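A sketch of what such a helper might look like is given below; it assumes the Node structure from lab sheet 9 and a height() function like the one you wrote for Task 4 of that lab (recomputing heights like this is not the most efficient approach, but it is the simplest to write):
int height(Node *node)
{
    if (node == NULL)
        return -1;               // an empty subtree has height -1
    int lh = height(node->left);
    int rh = height(node->right);
    return (lh > rh ? lh : rh) + 1;
}

int is_height_balanced(Node *node)
{
    if (node == NULL)            // an empty subtree is trivially balanced
        return 1;
    int diff = height(node->left) - height(node->right);
    if (diff < -1 || diff > 1)
        return 0;
    return is_height_balanced(node->left) && is_height_balanced(node->right);
}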
The is_height_balanced() function can be implemented in a recursive fashion and thus
is_avl() is now reduced to is_height_balanced(bst->root). Complete the
function definitions and test them out using an appropriate main().
Now, the difficult part is ensuring that this property is preserved while inserting and deleting
nodes into an AVL tree2. For this, we define a new useful operation on a tree called rotations.
2
You can visualize the various AVL tree operations on the following website - https://visualgo.net/en/bst?slide=14
Rotations
The primary operation to rebalance a binary search tree is known as a rotation. During a
rotation, we rotate a parent to be below its child, as illustrated in Figure 2.
The operation LEFT-ROTATE(T, x) transforms the configuration of the two nodes on the right into the configuration on the left by changing a constant number of pointers; the inverse operation is RIGHT-ROTATE(T, y). The letters α, β and γ represent arbitrary subtrees. The rotation operation preserves the binary-search-tree property: the keys in α precede x.key, which precedes the keys in β, which precede y.key, which precedes the keys in γ.
The mechanics of the rotate-left operation is illustrated in code snippet 1 given below:
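A minimal sketch of what such a routine might look like is given here (an assumption, not necessarily identical to the snippet): it takes the root x of a subtree and returns the new root of that subtree after the rotation.
Node *rotate_left(Node *x)
{
    Node *y = x->right;   /* y moves up to become the new root of this subtree */
    x->right = y->left;   /* y's left subtree becomes x's right subtree        */
    y->left = x;          /* x becomes y's left child                          */
    return y;             /* the caller re-attaches y in place of x            */
}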
You can write the rotate_right() function, which works in a similar (mirrored) manner, by yourself.
Building an AVL Tree
We will first define an insert() method that inserts a node into an AVL tree while ensuring that
the height balance property is maintained, and the resulting tree is also an AVL tree. Then we can
create an AVL tree by simply inserting the elements one at a time while ensuring that at each
step, the tree is an AVL tree.
First, we insert a node as we do for a usual binary search tree. Then we perform rotations on the
tree to ensure that it satisfies the height balance property. When a node is inserted, the height
balance property may be violated for multiple nodes. However, an interesting thing is that all of
these nodes where the property can be violated are ancestors of the newly inserted node. So we need to traverse from the newly inserted node up to (potentially) the root and perform rotations whenever we encounter a height imbalance.
Here, when node 54 is inserted into an AVL tree, there is an imbalance created at node 78, as the heights of the ancestors of 54 (62, 50, 78 and 44) are affected. So we perform a left rotation on 50 followed by a right rotation on 78 to transform tree (a) into tree (b), which is now an AVL tree.
There are different kinds of imbalances, classified by where the newly inserted node lies with respect to the child and grandchild of the first unbalanced node encountered while traversing upwards.
We can achieve this by making insertAVL() a recursive function: it starts at the root and recursively inserts into the appropriate subtree, and at every step, once control returns to the calling function after the recursive call, it can perform rotations to balance the tree.
There are four types of imbalances that can possibly be created after the insertion of a node:
1. LL Imbalance (or similarly RR Imbalance): Here, the new node is part of the left subtree
of the left child of the first unbalanced node encountered while traversing upwards from
the added node to the root.
Here, a single right rotation on the unbalanced node restores the balance at this level.
2. LR Imbalance (or similarly RL imbalance): Here, the new node is a part of the right subtree of the left child (say X) of the first imbalanced node (say Z) encountered while traversing towards the root. Two rotations are required to restore balance at this level: first, a left rotation on X converts the LR imbalance into an LL imbalance; then a right rotation on Z, as above, fixes the imbalance at this level, and we can continue moving upwards to fix any other unbalanced nodes.
Figure 5: LR Imbalance being solved by two rotations
Here, node Z is imbalanced and the newly added node is a part of the subtree rooted at
Y. So the way to rebalance the tree is to first left-rotate X followed by right-rotating Z.
So, after the node has been inserted into an AVL tree, we move upwards towards the root and, at every imbalanced node, apply the appropriate rotations to restore the AVL property to the tree.
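Before the rebalancing cases shown in the fragment below, insertAVL() first performs an ordinary recursive BST insertion and then checks the balance on the way back up. A rough sketch of that overall scaffolding is given here (an assumption, not the snippet itself); it uses the Node struct, an assumed new_node() constructor, and the is_height_balanced() helper from Home Exercise 1:
Node *insertAVL(Node *node, int value)
{
    if (node == NULL)
        return new_node(value);                 /* assumed constructor */
    if (value < node->value)
        node->left = insertAVL(node->left, value);
    else
        node->right = insertAVL(node->right, value);

    /* On the way back up, rebalance this node if necessary. */
    if (!is_height_balanced(node))
    {
        /* Decide between the LL, LR, RR and RL cases (for example, by
           comparing subtree heights and the inserted value) and apply
           the rotations shown in the fragment below. */
    }
    return node;
}
The fragment that follows (the visible part of code snippet 2) fills in the left-heavy cases; completing the RR and RL cases is part of Task 1.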
{
// LL imbalance
node = rotate_right(node);
}
else
{
// LR imbalance
node->left = rotate_left(node->left);
node = rotate_right(node);
}
}
else
{
/*
Complete the code for the following cases:
RR imbalance
RL imbalance
*/
}
}
return node;
}
Task 1: To the bst.c file, add the functions rotate_left() and rotate_right(). Now complete the insertAVL() function as per code snippet 2 given above, using appropriate calls to rotate_left() and rotate_right() for the different imbalance conditions.
Now, write a main() that adds nodes in the order: 1, 2, 3, 4, 5, 6, 7, 8, 9; using the insertAVL()
function as defined here. Check whether the resulting BST is an AVL tree or not. Also, observe
the structure of the tree by printing the nodes in breadth-first order.
You can use the function given in code snippet 3 below for the breadth-first traversal of the tree:
void traverse_bfs(Node *node)
{
    if (node == NULL)
    {
        return;
    }
    /* simple array-based queue holding the nodes yet to be printed */
    Node *queue[100];
    int front = 0;
    int back = 0;
    queue[back++] = node;
    while (front != back)
    {
        Node *current = queue[front++];
        printf("%d ", current->value);
        if (current->left != NULL)
        {
            queue[back++] = current->left;
        }
        if (current->right != NULL)
        {
            queue[back++] = current->right;
        }
    }
}
Note that the size of the queue (taken as 100 here) should in general be a function of the size of the tree. You can use a better queue implementation based on the techniques discussed in Week 3.
Task 2: Observe the insertAVL() function as implemented in the snippet. What is the time complexity of inserting one node into the AVL tree? It looks like O(height), right? However, it is not. If you go through the code carefully, you will observe that each recursive call of insertAVL() makes a call to is_height_balanced(), which itself has a complexity of O(height).
This results in a complexity of O(log² n), which is not optimal. You can reduce this complexity by storing some additional information about the height in each node.
Add a height parameter to your bst node structure and modify the insertAVL() function so that it now updates this parameter, reducing the running time from O(log² n) to O(log n). Consider the height of leaf nodes to be 1. (Yes, this is another example of the space-time complexity tradeoff.) You can also use a balance_factor() function as a sub-routine that calculates the balance factor of a node as left_height - right_height.
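A minimal sketch of these helpers, assuming the node struct now carries the height field described above (the function names match those used in the deletion code later in this labsheet):
/* Height stored in the node: an empty subtree has height 0, a leaf has height 1. */
int height(Node *node)
{
    return (node == NULL) ? 0 : node->height;
}

/* Balance factor = left height - right height; it should stay within [-1, 1]. */
int balance_factor(Node *node)
{
    return height(node->left) - height(node->right);
}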
Home Exercise 2: We saw the traverse_bfs() function in Task 1. This function is called breadth-first (or level-order) because it prints all the nodes of each level before proceeding to the next. Another way of traversing the tree is called depth-first traversal. In this, one path up to a leaf is completely traversed before you proceed along another path. All the walks we saw last week - in-order, pre-order and post-order - are versions of DFS.
You had implemented these walks last week. Now complete the program provided in the file
hw2.c that constructs a binary tree given the in-order and post-order traversals.
Sample Input 1
8
4 8 2 5 1 6 3 7
8 4 5 2 6 7 3 1
Sample Output 1
1 2 3 4 5 6 7 8
Sample Input 2
9
1 3 4 5 6 7 8 9 10
1 3 5 7 6 10 9 8 4
Sample Output 2
4 3 8 1 6 9 5 7 10
Deletion from an AVL Tree
Similar to insertion, in deletion as well we first delete the node from the BST as usual and then move upwards, fixing the tree where needed. In this section, we will consider that the tree has been
augmented with the height parameter as described in Task 2. After deleting a node, we move
upwards, checking the balance_factor at each level. Based on the balance_factor, we can
determine if the tree is still balanced or not. If the balance_factor is -2 or 2, then we need to
perform a rotation to restore the balance. There are four possible cases of imbalances: left-left,
left-right, right-left and right-right.
In order to find which case of rotation needs to be performed, we check which child of the current node is taller, and then which of that child's children is taller. Based on their positions (left or right), we can identify the kind of imbalance present.
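The fragment below shows the body of the deletion routine: the case analysis when the value is found, followed by the rebalancing rotations. A rough sketch of the surrounding scaffolding (an assumption, using the height-augmented node from Task 2 and the height() and balance_factor() helpers) might look like this; the two commented blocks are what the fragment fills in:
struct node *deleteAVL(struct node *node, int value)
{
    if (node == NULL)
        return NULL;
    if (value < node->value)
        node->left = deleteAVL(node->left, value);
    else if (value > node->value)
        node->right = deleteAVL(node->right, value);
    else
    {
        /* value is at this node: handle the leaf, single-child and
           two-children cases as in the fragment below */
    }
    if (node == NULL)       /* the node may have been freed above */
        return NULL;
    /* update the stored height of this node */
    node->height = 1 + (height(node->left) > height(node->right) ?
                        height(node->left) : height(node->right));
    int balance = balance_factor(node);
    if (balance > 1 || balance < -1)
    {
        /* apply the LL / LR / RR / RL rotations as in the fragment below */
    }
    return node;
}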
{
// value is at this node
if (node->left == NULL && node->right == NULL)
{
// node is a leaf
free(node);
node = NULL;
}
else if (node->left == NULL)
{
// node has only right child
struct node *temp = node;
node = node->right;
free(temp);
}
else if (node->right == NULL)
{
// node has only left child
struct node *temp = node;
node = node->left;
free(temp);
}
else
{
// node has both children
struct node *temp = predecessor(node);
node->value = temp->value;
node->left = deleteAVL(node->left, temp->value);
node->height = 1 + (height(node->left) > height(node->right) ?
height(node->left) : height(node->right));
}
}
if (balance > 1)
{
// left subtree is longer
if (balance_factor(node->left) >= 0)
{
// LL imbalance
node = rotate_right(node);
node->right->height = 1 + (height(node->right->left) >
height(node->right->right) ? height(node->right->left) :
height(node->right->right));
node->height = 1 + (height(node->left) > height(node->right) ?
height(node->left) : height(node->right));
}
else
{
// LR imbalance
node->left = rotate_left(node->left);
node->left->left->height = 1 + (height(node->left->left->left) >
height(node->left->left->right) ? height(node->left->left->left) :
height(node->left->left->right));
node->left->height = 1 + (height(node->left->left) >
height(node->left->right) ? height(node->left->left) :
height(node->left->right));
node = rotate_right(node);
node->right->height = 1 + (height(node->right->left) >
height(node->right->right) ? height(node->right->left) :
height(node->right->right));
node->height = 1 + (height(node->left) > height(node->right) ?
height(node->left) : height(node->right));
}
}
else if (balance < -1)
{
// right subtree is longer
if (balance_factor(node->right) <= 0)
{
// RR imbalance
node = rotate_left(node);
node->left->height = 1 + (height(node->left->left) >
height(node->left->right) ? height(node->left->left) :
height(node->left->right));
node->height = 1 + (height(node->left) > height(node->right) ?
height(node->left) : height(node->right));
}
else
{
// RL imbalance
node->right = rotate_right(node->right);
node->right->right->height = 1 + (height(node->right->right->left) >
height(node->right->right->right) ? height(node->right->right->left) :
height(node->right->right->right));
node->right->height = 1 + (height(node->right->left) >
height(node->right->right) ? height(node->right->left) :
height(node->right->right));
node = rotate_left(node);
node->left->height = 1 + (height(node->left->left) >
height(node->left->right) ? height(node->left->left) :
height(node->left->right));
node->height = 1 + (height(node->left) > height(node->right) ?
height(node->left) : height(node->right));
}
}
return node;
}
Home Exercise 3: Consider the insertAVL() and deleteAVL() functions above. Both functions are recursive. Write non-recursive versions of insertAVL() and deleteAVL() that use a stack to store the addresses of the nodes visited during the traversal. The stack should be implemented using a linked list. The non-recursive versions of insertAVL() and deleteAVL() should have the same functionality as the recursive versions.
Task 3: Add a parent attribute to the node structure. Now implement iterative versions of
insertAVL() and deleteAVL() without using the stack.
In comparison with the stack-based implementation in Home Exercise 3 above, which iterative version has a larger space overhead? Which is faster?
Introduction to (2, 4) Trees
Now we move from binary search trees to multiway search trees. In a multiway search tree, a node can have more than two children. Obviously, as we keep increasing the width of the tree, we get shorter trees, which might suggest a lower complexity for search, insert, delete and other operations. However, some of the time we save because of the smaller height is lost in the width, as the work needed to find the correct subtree at each level increases. Here in particular we consider (2, 4) trees. A (2, 4) tree has the following property: every internal node has between 2 and 4 children (ie., it stores between 1 and 3 keys), and all leaf nodes lie at the same depth.
If we compare (2,4) trees with AVL trees, while both of them ensure logarithmic depth, (2,4) trees
are in a sense less strict with respect to their balancing criteria than AVL trees.
Insertion
In insertion, first we find the appropriate leaf node; the insertion operation is always performed on this leaf node. Whenever we encounter a 4-node (a node having 4 children, ie., 3 keys) while traversing towards the leaf, we preemptively split it so that we can avoid traversing back up to the root after the insertion.
The file two_four.c, which contains the code for insertion, has been shared with you. Let us try to understand it in parts.
After defining the struct node and the tree, we begin the function insert_24()
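Based on the fields used in the fragments below (keys, children, num_keys, isLeaf and root), those definitions presumably look roughly like the following sketch; new_node() is assumed to allocate a node and zero-initialise its fields:
typedef struct node
{
    int keys[3];               /* a (2,4) node stores 1 to 3 keys */
    struct node *children[4];  /* and has up to 4 children        */
    int num_keys;              /* number of keys currently in use */
    int isLeaf;                /* 1 if this node is a leaf        */
} Node;

typedef struct tree
{
    Node *root;
} Tree;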
1) In case the tree is empty, create a root node and add the element there:
{
Node *temp = tree->root;
if (temp == NULL)
{
Node *myNode = new_node();
myNode->isLeaf = 1;
myNode->keys[0] = val;
myNode->num_keys = 1;
tree->root = myNode;
return;
}
if (parent == NULL)
{
parent = new_node();
parent->isLeaf = 0;
parent->children[0] = temp;
parent->children[1] = newNode;
parent->keys[0] = temp->keys[1];
parent->num_keys = 1;
tree->root = parent;
printf("Created new root node\n");
}
else
// The parent must have 1 or 2 keys since all 4-nodes are split
{
if (parent->num_keys == 1)
// {...}
// 2 keys in the parent
else
// {...}
// Find the correct parent and child for the next iteration
for (int i = 0; i <= parent->num_keys; i++)
{
if (val < parent->keys[i])
{
parent = parent->children[i];
break;
}
else if (i == parent->num_keys)
{
parent = parent->children[i];
break;
}
}
// Find the correct child for the next iteration
for (int i = 0; i <= parent->num_keys; i++)
// { ... update temp ... }
}
4) If not a 4-node, just traverse to the correct child and end while
else
{
parent = temp;
// Find the correct child for the next iteration
for (int i = 0; i < parent->num_keys; i++)
{
if (val < parent->keys[i])
{
temp = parent->children[i];
break;
}
}
if (parent == temp)
{
temp = parent->children[parent->num_keys];
}
}
}
5) After the while loop, just insert into the correct leaf node, now pointed to by parent
}
parent->num_keys++;
}
Thus, in this way, we can ensure that all leaf nodes are at the same level in a two-four tree, as well as ensuring that the depth of the tree is O(log N).
Task 4: Write a function to search for a given value in a (2, 4) tree. While searching in both (2,4) trees and AVL trees is O(log N), which is faster in practice? Construct an AVL tree and a (2,4) tree with integers from 1 to 100000. Now perform 5000 random searches for numbers in this range on each tree and compare the average time taken for each tree.
Home Exercise 4: You have a database of people called people.csv with the following fields:
- name
- user_id
- salary
- age
An index is a data structure that allows you to search for a specific value in a database in O(log n) time. For each key in the index, you store a pointer to the record in the database that has that key. Here, the pointer is the index of the record in the array.
You can use two (2,4) trees as indices for the user_id and salary fields. You can assume that both these fields have unique entries in the file. Thus you can search for people by their user_id or
salary in O(log n) time using these trees.
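As one possible sketch of such an index node (all names below are assumptions), a node of the user_id index tree could pair each key with the position of the matching record in the people array:
struct user_id_node
{
    int keys[3];                       /* user_id values stored in this (2,4) node          */
    int records[3];                    /* index of the matching record in the people array  */
    struct user_id_node *children[4];
    int num_keys;
    int isLeaf;
};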
Similarly, define a salary_node and construct (2,4) trees for both of them.
Task 5: We saw insertion into (2,4) trees above. Deletion from (2,4) trees follows a similar spirit of preemptively adjusting the upper nodes as we move downwards in order to maintain the property of the tree. While we were splitting nodes that were full during insertion, here we will be merging nodes that have spare room. Similar to insertion, you need to traverse down to the node containing the key to be deleted, performing the following steps:
1. If the element, k is in the node and the node is a leaf containing at least 2 keys, simply
remove k from the node.
2. If the element, k is in the node and the node is an internal node perform one of the
following:
a. If the element's left child has at least 2 keys, replace the element with its
predecessor, p, and then recursively delete p.
b. If the element's right child has at least 2 keys, replace the element with its
successor, s, and then recursively delete s.
c. If both children have only 1 key (the minimum), merge the right child into the left
child and include the element, k, in the left child. Free the right child and
recursively delete k from the left child.
3. If the element, k, is not in the internal node, follow the proper link to find k. To ensure
that all nodes we travel through will have at least 2 keys, you may need to perform one
of the following before descending into a node. Then, you will descend into the corresponding node. Eventually, case 1 or 2 will be reached (if k is in the tree).
a. If the child node (the one being descended into) has only 1 key and has an
immediate sibling with at least 2 keys, move an element down from the parent
into the child and move an element from the sibling into the parent.
b. If both the child node and its immediate siblings have only 1 key each, merge the
child node with one of the siblings and move an element down from the parent
into the merged node. This element will be the middle element in the node. Free
the node whose elements were merged into the other node.
3
Matt Klassen's lecture notes on 2-3-4 trees
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
CS-F211: Data Structures and Algorithms
Lab 11: Heaps and Heapsort
Introduction
Welcome to week 11 of Data Structures and Algorithms! This week, we shall learn yet another
important data structure known as the “heap”. Heaps are commonly used in algorithms that
involve sorting, searching, and prioritisation, such as Dijkstra's algorithm for finding the shortest
path between nodes in a graph. At the heart of all these is the most essential operation that can
be performed on a heap: “heapify”. We shall then also learn a new sorting algorithm that uses
heaps. Understanding how heaps and heapify work and how to implement them can significantly
improve one's ability to design efficient algorithms and solve complex problems.
1
https://xkcd.com/835/
Introduction to the Heap Data Structure
The heap data structure is defined as a nearly complete binary tree that satisfies the heap property². There are two kinds of heaps: max-heaps and min-heaps. In both kinds of heaps, a special property is satisfied by the underlying tree. This property is known as the heap property.
Heap Property
Max-heap property: Every child of a node has a key that is smaller than or equal to the key of the node itself. In other words, the parent of a node has a key greater than or equal to that of the node itself.
Min-heap property: Every child of a node has a key that is greater than or equal to the key of the node itself. In other words, the parent of a node has a key smaller than or equal to that of the node itself.
Therefore, a heap satisfying the max-heap property is known as a max-heap, and likewise for a
min-heap. We shall be restricting ourselves to max-heaps and max-heap property for the
purposes of our discussion in this labsheet, although the discussion would be equivalent to that
of a min-heap satisfying the min-heap property.
The above figure (Figure 1) displays the two kinds of heaps and how each satisfies their property.
2
Note that whenever we refer to “heaps” in this labsheet we are always referring to “binary heaps”.
Implementation of Heaps
One might think that since a heap is a tree, it would be best implemented as a graph-like data structure with pointers to children and so on. While that is one way to implement the
binary tree that makes up the heap ADT, it is not the only way. Often, when faced with design
choices during implementation, one goes with the simplest implementation wherever possible.
In this case, we shall see how a heap need not be implemented using a graph data structure with
nodes and edges. We shall learn how to implement the heap ADT using nothing but an array.
We can prepare our array such that each node in the binary tree corresponds to an element in
the array. However, we need to come up with a scheme such that we are able to check and
incorporate the heap property into the array implementation as well. This can be done by abiding
by the following scheme:
1. The root of the tree corresponds to the first element in the array (index = 0).
2. The children of the node at index i are found at indices 2*i+1 and 2*i+2. For the root, these are the second and third elements of the array (indices 1 and 2, which are equal to 0*2+1 and 0*2+2 respectively).
3. Inductively repeat the above for all nodes in the heap, ie., for all elements in the array,
till the nodes in the heap are exhausted.
Refer to Figure 2 for a visual representation of the above scheme for mapping a tree-heap to an
array-heap.
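A minimal sketch of the index arithmetic that this scheme implies (for a 0-indexed array; the function names here are assumptions):
/* Navigation within the array encoding of the tree. */
int parent_index(int i)      { return (i - 1) / 2; }
int left_child_index(int i)  { return 2 * i + 1; }
int right_child_index(int i) { return 2 * i + 2; }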
We shall see how easily functions can be implemented for navigating the binary tree, without having to store, allocate, deallocate or reallocate pointers. This adds to our convenience when dealing with heaps. While we are actually dealing with a normal dynamic array, we must always keep in mind that this array is just an “encoding” of the underlying tree that it refers to.
We have simply come up with a mechanism that stores it more conveniently for us. This approach
can be generalised for any binary tree, and in fact can even be applied to any tree in general.
However, our implementation is not over just yet. Since a binary heap can have its lowest level partially filled (anywhere from 1 to 2^d nodes, where 'd' is the depth of the tree), we must account for that in our array. We shall add space for 2^d more elements to our array whenever a new level needs to be added to the tree at depth d. This helps our cause since repeated realloc calls incur high time penalties (realloc may need to request more memory from the OS and copy the existing data, both of which are expensive); therefore we want to minimise our realloc calls while keeping the array size reasonable at any given instant. But what this means is that our array will have some garbage elements if the heap is not a complete binary tree.
To keep track of which elements are valid and which are not, we can simply maintain an integer
(we can call it “heap size”) that specifies the last index up to which the elements in our array are
valid heap elements, beyond which they are garbage.
Keeping the above points in mind, let us design the struct to implement the binary tree on which
our max-heap will be built. Even though it is just a binary tree (at this stage), we shall call it the
“heap” nonetheless.
struct heap {
    int *data;      /* array encoding the binary tree             */
    int size;       /* number of valid heap elements in data      */
    int capacity;   /* allocated length of the data array         */
    int depth;      /* depth of the (nearly complete) binary tree */
};
typedef struct heap* Heap;
It would help our goal of encapsulation if we create a function that creates such a struct for us
and returns a pointer to it. Let’s call this function heap_create().
Heap heap_create()
{
Heap h = malloc(sizeof(struct heap));
h->data = malloc(sizeof(int));
h->size = 0;
h->capacity = 1;
h->depth = 0;
return h;
}
Task 1: Write a function add_to_tree() that takes as input a binary tree that has been
implemented just like the heap struct above and an integer and places that element at the next
free slot in the binary tree keeping it nearly complete. What this means is that a new level cannot
be introduced into the tree unless the previous level has been completely filled with elements.
Now, implement a binary tree using nodes and edges just like you did in Lab 9 for BSTs. Write a
similar add_to_tree() function for this implementation. Instantiate two different binary trees,
one using each implementation, and insert ten elements into them (these ten elements can be
hardcoded into your main() function for the purposes of this exercise). Now, measure the total
memory occupied by each implementation (note that direct use of sizeof will not work for
attributes that are stored as pointers, and you will need to create your own functions to obtain
the correct sizes for the same). Compare and contrast the memory usage of both
implementations.
Task 2: The scheme we discussed for implementing the binary tree of the heap directly entails three constant-time operations that will help us navigate the “tree” when it has been implemented in the form of an array. They are: finding the index of a node's parent, of its left child, and of its right child.
Implement the above functions as described earlier. [Hint: They can each be implemented in effectively only one line of code, excluding validity checks.]
Heapify
What we have implemented so far is no more than a binary tree. The only thing separating a
binary tree from a heap is the heap property. To introduce the heap property into a binary tree
(that is not yet a heap), we shall make use of the famous heapify routine. This operation is
described as follows. It takes in as input a binary tree (say, the array implementation) and a node (an index) in the tree. Under the pre-condition that the left and right subtrees of the given node already satisfy the heap property, the routine makes the heap property hold at the given node too, by recursively making the value at that node “float down” till the heap property is satisfied by the subtree rooted at the given node's position.
This process is illustrated in Figure 3. Notice how the node with key 4 keeps floating downward
(through swaps) while the nodes with higher keys get floated up, till the max-heap property is
satisfied.
Figure 3: Illustrating the heapify procedure when called on node at index “1”
void max_heapify(Heap h, int index)
{
    int left = 2 * index + 1, right = 2 * index + 2;
    int largest = index;
    if (left < h->size && h->data[left] > h->data[largest])
        largest = left;
    if (right < h->size && h->data[right] > h->data[largest])
        largest = right;
    if (largest != index)
    {
        int temp = h->data[index];
        h->data[index] = h->data[largest];
        h->data[largest] = temp;
        max_heapify(h, largest);
    }
}
The worst-case time complexity of the heapify operation is O(log n), where n is the number of
nodes in the heap. Notice how log n is just the height of the tree.
The heapify operation does not convert the entire binary tree to a heap. It only prepares a
subtree that satisfies the heap property – that too, assuming that the left and right subtrees of
the given node are already satisfying the heap property. Therefore, this in and of itself does not
help us construct a heap.
To build an entire heap from a binary tree, we would need to make several calls to the heapify function in a bottom-up fashion. The leaf nodes already satisfy the heap property trivially, so we start calling heapify from the level just above the leaves, ie., the parents of the leaves, and work upwards level by level until we reach the root.
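A minimal sketch of this bottom-up construction, assuming the array-encoded Heap struct and the max_heapify() routine above (Task 3 below asks you to write an array-in, array-out variant of the same idea):
Heap build_max_heap(Heap h)
{
    /* Nodes at indices size/2 and beyond are leaves and already satisfy the
       heap property, so start from the last internal node and move upwards. */
    for (int i = h->size / 2 - 1; i >= 0; i--)
        max_heapify(h, i);
    return h;
}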
Task 3: Write a function build_max_heap() that takes as input an array of integers that represents
a binary tree and outputs another array that represents the heap that would be
formed from the binary tree. Your function obviously should call the max_heapify() function
within it.
Contrary to what one might expect, the build_max_heap() operation actually runs in O(n) time instead of the intuitively expected O(n log n). A rigorous proof of this can be found in your reference book.³
Task 4: Write a function that takes as input a heap (an array) and a particular level or depth, and
returns the total number of nodes present in the heap at that depth.
Home Exercise 1: Write functions min_heapify() and build_min_heap() that do the min-heap
equivalent of the aforementioned functions.
3
Refer to pages 157-159 of Cormen T.H., Leiserson, C.E., Rivest, R.L., and C. Stein. Introduction to Algorithms, MIT
Press, 3rd Edition, 2009.
Heapsort
One of the most important reasons to study the heap data structure is to gain an understanding
of the heapsort algorithm as well. Like merge sort, but unlike insertion sort, heapsort’s running
time is O(n log n). Like insertion sort, but unlike merge sort, heapsort sorts in-place: only a
constant number of array elements are stored outside the input array at any time. Thus, heapsort
combines the better attributes of the two sorting algorithms we have already seen.
The heapsort algorithm builds a max heap and then performs the heapify operation repeatedly. It starts out by calling the build_max_heap() function to create a max heap out of the given elements. At this stage (and at every iteration of this algorithm), the element at the top of the heap is the largest element in the heap. Therefore, we take this element and place it at its deserved location (at the end of the heap, which is also the end of the unsorted part of the array). We then decrease the size of the heap so that the element we just placed in its sorted location no longer interferes with the sorting process. We can then perform heapify on the root of the heap, and repeat this process until our heap is emptied.
void heap_sort(Heap h)
{
h = build_max_heap(h);
for (int i = h->size - 1; i >= 1; i--)
{
int temp = h->data[0];
h->data[0] = h->data[i];
h->data[i] = temp;
h->size = h->size - 1;
max_heapify(h, 0);
}
}
Refer to Figure 4 for a pictorial illustration of how exactly the heapsort algorithm works to sort
our array. Note that Figure 4 starts out with the max heap equivalent of a sample input array.
Figure 4: The working of heapsort. (a) The max-heap data structure just after build_max_heap.
(b)–(j) The max-heap just after each call of max_heapify, showing the value of i at that time.
Only lightly shaded nodes remain in the heap, the rest fall outside the “heapsize”. (k) The
resulting sorted array A.
Task 5: You have been provided with the set of files having file names datX.csv where X stands
for the input size (as in earlier labs). Recollect that struct person was defined as follows:
struct person
{
int id;
char *name;
int age;
int height;
int weight;
};
These files contain comma-separated entries for the details of the students. Write a program to
read the data from these files and store them in a dynamically allocated array of struct person.
In the previous labs, you had seen the performance of the other sorting algorithms on these files.
Now, modify the heap sort algorithm discussed in this labsheet to sort arrays of struct person
based on the height field. This would require you to modify every function discussed so far
because now you need to accommodate an array of structs within the heap. Plot the time taken
and maximum heap space (not to be confused with your heap data structure) utilised and
observe how they vary with the size of the input. Report the comparative performance of the
heap sort algorithm against the earlier algorithms.
Priority Queues
A heap can function as an efficient priority queue as well. This is another useful application of
heaps. As with heaps, priority queues come in two forms: max-priority queues and min-priority
queues. We will focus here on how to implement max-priority queues, which are in turn based
on max-heaps.
A priority queue is a data structure that maintains a set of elements ordered based on some
priority associated with a key field of the elements, to enable efficient retrieval of the maximum
(in case of a max-priority queue) or minimum (in case of a min-priority queue) element from the
queue. There are implementations of priority queues that do not use heaps, for instance, a simple array kept sorted using insertion sort. But the heap implementation naturally leads to the idea
of a priority queue, which also makes it more efficient compared to alternate implementations
(we shall discuss this soon).
The name “priority queue” is no coincidence. You can correlate it with the queue data structure
that you learnt earlier in the course. The only difference here is that a normal queue orders its
elements in the sequence that they were entered into the queue, whereas a priority queue uses
some characteristic property (known as the “key”) of the elements to order them differentially
regardless of the sequence in which they were enqueued. In other words, we can also claim
that a “normal” queue is just a priority queue where the priority key is the “time spent in the
queue” (the element that has spent the most time in the queue, ie., the element that was the
first to be enqueued among into the queue, would be dequeued before the other elements).
Our goal is to implement the necessary data structure for a priority queue and also implement
these operations, ideally, such that they run efficiently. Let us see how we might use a heap to
do the same.
Firstly, it should feel intuitive to you how a max-heap could be used to implement a max-priority
queue. A max-heap always holds the max element at the top of the heap, and that is the
element with the highest priority that we want to retrieve at any point from our max-priority
queue. If we remove that element from the top of the heap, we have seen that we can perform
heapify on the root to prepare a heap, yet again, from the remaining elements. In this manner,
we always have our max-priority element on the top of the heap, and therefore we can use this
implementation. Equivalently, a min-heap can be used to implement a min-priority queue.
The alternative to using a heap would be to use standard arrays and repetitive sorting (using a standard sorting algorithm like insertion sort, for instance), which is a rather naive solution. Re-sorting would cost us up to O(n log n) time after each retrieval or insertion in the worst case, whereas heapify allows us to restore the heap in only O(log n) time after each retrieval, in the worst case.
Therefore, we can say that a heap is a priority queue (we do not even need to create a separate
struct for the priority queue!). This is not an exaggeration; in fact, people actually end up using the two terms synonymously at times. So, now, we only need to write the functions that perform
the priority queue operations on a heap.
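For instance, the retrieval operation of a max-priority queue might look roughly like the following sketch (the function name is an assumption; the half-implemented versions in priorities.c may differ in their details):
/* Removes and returns the highest-priority element (the root of the max-heap).
   Assumes the queue is non-empty. */
int heap_extract_max(Heap h)
{
    int max = h->data[0];
    h->data[0] = h->data[h->size - 1];  /* move the last element to the root    */
    h->size--;
    max_heapify(h, 0);                  /* float the new root down to its place */
    return max;
}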
Task 6: You have been provided with a file priorities.c that contains some half-implemented
priority queue functions (on a heap). Complete the implementation of the priority queue in that file, taking help of the other functions that you have implemented so far in this labsheet.
Note: In general, whenever a priority queue is used, the elements of the queue are structs
containing a bunch of fields. This means that the underlying tree is also implemented as an array
of structs. One of the attributes of the struct will be used as the key. For the purposes of this
question, however, you can use simple integer elements with the integer itself as the key.
Home Exercise 2: Implement a min-priority queue just like the max-priority queue.
Home Exercise 3: Consider the following situation. You are given an integer array gifts that
denotes the number of gifts present in various piles. This means that gifts[i] denotes the number
of gifts present in the ith pile. Every second, you greedily find the pile that has the most number
of gifts and pick enough gifts from the pile to always leave behind the square root of the number
of gifts initially present in the pile (if the square root is fractional, you leave behind only the
integer part).
Write a function pick_gifts() that takes as input the array and an integer k and returns the total
number of gifts remaining in the piles after k seconds have passed.
Home Exercise 4: Write a function that takes as input an array containing k sorted linked lists
(each list contains at most n elements), merges them all into one single sorted linked list, and
returns that linked list. Use a priority queue to solve this problem. [This problem can be solved
with a time complexity of O(kn log k).]
Home Exercise 5: Suppose for a moment that you are an engineer working at Spotify. Your
colleagues in the data analytics team have done thorough research and devised an algorithm that
assigns a score to every song for every user (based on how much they are expected to like that
song), and these scores are updated regularly. Now, suppose that a candidate user opens up a
random playlist they find on the Spotify web and starts playing it on shuffle (this means that the
order of the songs in the playlist can be randomised). In this scenario, Spotify thinks that it might
be more profitable to play songs in decreasing order of the scores assigned to them by the
aforementioned algorithm. Arranging them in this order still makes it quite likely that the songs will be played in some seemingly random order; but if the analysis is correct, and the user listens to only a couple of songs from that playlist, they are more likely to enjoy them than with a purely random shuffle (because the first couple of songs would be more to their liking, as per our analysis). To make
the “random” feeling even better, their algorithm also perturbs the score a little bit in some
random direction every so often, so that the user does not get suspicious about this internal data
manipulation.⁴
For a sample user, you are given a file playlist.txt that contains comma-separated fields in the
following format: <artist; song; priority>, where the priority contains the priorities that have been
calculated for these songs for this user by the data analytics team. This file represents a typical
playlist on Spotify.
Assuming that the user puts this playlist on shuffle, your task is to read the .txt file and insert
each song one by one into a priority queue. Thereafter, your priority queue would be usable
when the user decides to, say, skip a song, or when one song gets over, etc. For demonstration
purposes, extract each song from the priority queue one by one in the order dictated by the
priority queue, and output that to the console.
4
This is a completely fictitious example made up for demonstrating the possible uses of the concepts learnt in this
labsheet. Priority queues are the go-to data structure wherever the concept of “scheduling” or “differential
selection” comes up, just like in this exercise.
Spotify has not revealed any of these actual technical details to us. One can feel free to interchange “Spotify” with
“Youtube Music”, “Wynk”, “Saavn”, or any of their favourite music streaming service apps, as far as we are
concerned.