CSC 204 Data Structure-1
Data
Data is a set of values of qualitative or quantitative variables. Data in computing (or data
processing) is represented in a structure that is often tabular (represented by rows and columns), a
tree (a set of nodes with parent-children relationship), or a graph (a set of connected nodes). Data
is typically the result of measurements and can be visualized using graphs or images. Data as an
abstract concept can be viewed as the lowest level of abstraction, from which information and then
knowledge are derived. Unprocessed data, also known as raw data, is a collection of numbers or
characters that has not yet been processed. "Raw" is a relative term: data processing commonly
occurs in stages, and the "processed data" from one stage may be considered the "raw data" of the
next. Field data is raw data that is collected in an uncontrolled environment. Experimental data is
data that is generated within the context of a scientific investigation by observation and recording.
Information
Information is that which informs us with some valid meaning, i.e. that from which data can be
derived. Information is conveyed either as the content of a message or through direct or indirect
observation of something. Information can be encoded into various forms for transmission and
interpretation. For example, information may be encoded into signs, and transmitted via signals.
Information resolves uncertainty. The uncertainty of an event is measured by its probability of
occurrence and is inversely proportional to that. The more uncertain an event, the more information
is required to resolve uncertainty of that event. In other words, information is the message having
different meanings in different contexts. Thus the concept of information becomes closely related
to notions of constraint, communication, control, data, instruction, knowledge, meaning,
understanding, perception & representation.
Data Type
Data types are used within type systems, which offer various ways of defining, implementing and
using the data. Different type systems ensure varying degrees of type safety. Almost all
programming languages explicitly include the notion of data type, though different languages may
use different terminology. Common data types include:
Integers,
Booleans,
Characters,
Floating-point numbers,
Alphanumeric strings.
For example, in the Java programming language, the "int" type represents the set of 32-bit
integers ranging in value from -2,147,483,648 to 2,147,483,647, as well as the operations that
can be performed on integers, such as addition, subtraction, and multiplication. Colors, on the other
hand, are represented by three bytes denoting the amounts each of red, green, and blue, and one
string representing that color's name; allowable operations include addition and subtraction, but
not multiplication. Most programming languages also allow the programmer to define additional
data types, usually by combining multiple elements of other types and defining the valid operations
of the new data type. For example, a programmer might create a new data type named "complex
number" that would include real and imaginary parts. A data type also represents a constraint
placed upon the interpretation of data in a type system, describing representation, interpretation
and structure of values or objects stored in computer memory. The type system uses data type
information to check correctness of computer programs that access or manipulate the data.
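The "complex number" type mentioned above can be sketched in Python as a small user-defined type. This is only an illustration: the class name `ComplexNumber` and its field names are assumptions, and Python already ships a built-in `complex` type that would normally be used instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplexNumber:
    """A user-defined type that combines two floating-point parts."""
    real: float
    imag: float

    def __add__(self, other: "ComplexNumber") -> "ComplexNumber":
        # Addition is defined part by part.
        return ComplexNumber(self.real + other.real, self.imag + other.imag)

    def __mul__(self, other: "ComplexNumber") -> "ComplexNumber":
        # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
        return ComplexNumber(
            self.real * other.real - self.imag * other.imag,
            self.real * other.imag + self.imag * other.real,
        )


a = ComplexNumber(1.0, 2.0)
b = ComplexNumber(3.0, -1.0)
total = a + b      # ComplexNumber(real=4.0, imag=1.0)
product = a * b    # ComplexNumber(real=5.0, imag=5.0)
```

The new type defines both its representation (two floats) and its valid operations (addition and multiplication), which is what the text means by defining a data type.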
Primitive data types
All data in computers based on digital electronics is represented as bits (alternatives 0 and 1) on
the lowest level. The smallest addressable unit of data is usually a group of bits called a byte
(usually an octet, which is 8 bits). The unit processed by machine code instructions is called a word
(as of 2011, typically 32 or 64 bits). Most instructions interpret the word as a binary number, such
that a 32-bit word can represent unsigned integer values from 0 to $2^{32} - 1$ or signed integer
values from $-2^{31}$ to $2^{31} - 1$. Because of two's complement, the machine language and the
machine do not need to distinguish between these unsigned and signed data types for the most part.
There is a specific set of arithmetic instructions that use a different interpretation of the bits
in a word, as a floating-point number.
Machine data types need to be exposed or made available in systems or low-level programming
languages, allowing fine-grained control over hardware. The C programming language, for
instance, supplies integer types of various widths, such as short and long. If a corresponding native
type does not exist on the target platform, the compiler will break them down into code using types
that do exist. For instance, if a 32-bit integer is requested on a 16-bit platform, the compiler will
tacitly treat it as an array of two 16-bit integers. Several languages allow binary and hexadecimal
literals, for convenient manipulation of machine data.
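As a rough illustration of the points above, the following Python sketch computes the value ranges of a 32-bit word, uses binary and hexadecimal literals, and splits a 32-bit value into two 16-bit halves. The helper names `split_32_into_16` and `join_16_into_32` are made up for this example; a real compiler does this work in generated machine code.

```python
WORD_BITS = 32

# Ranges representable in one 32-bit word.
unsigned_max = 2 ** WORD_BITS - 1          # 4294967295
signed_min = -(2 ** (WORD_BITS - 1))       # -2147483648
signed_max = 2 ** (WORD_BITS - 1) - 1      # 2147483647

# Binary and hexadecimal literals denote ordinary integers.
flags = 0b1010      # decimal 10
mask = 0xFFFF       # decimal 65535


def split_32_into_16(value):
    """Return the (high, low) 16-bit halves of a 32-bit value."""
    high = (value >> 16) & 0xFFFF
    low = value & 0xFFFF
    return high, low


def join_16_into_32(high, low):
    """Rebuild the 32-bit value from its two 16-bit halves."""
    return (high << 16) | low


halves = split_32_into_16(0x12345678)      # (0x1234, 0x5678)
```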
In higher level programming, machine data types are often hidden or abstracted as an
implementation detail that would render code less portable if exposed. For instance, a generic
numeric type might be supplied instead of integers of some specific bit-width.
Boolean type
The Boolean type represents the values true and false. Although only two values are possible, they
are rarely implemented as a single binary digit, for efficiency reasons. Many programming
languages do not have an explicit boolean type, instead interpreting (for instance) 0 as false and
other values as true.
Numeric types
Such as:
The integer data types, or "whole numbers". May be subtyped according to their ability to contain
negative values (e.g. unsigned in C and C++). May also have a small number of predefined
subtypes (such as short and long in C/C++); or allow users to freely define sub ranges such as 1..12
(e.g. Pascal/Ada).
Floating point data types, sometimes misleadingly called reals, contain fractional values. They
usually have predefined limits on both their maximum values and their precision. They are usually
stored in binary form but are often displayed as decimal numbers.
Fixed point data types are convenient for representing monetary values. They are often
implemented internally as integers, leading to predefined limits.
Bignum or arbitrary precision numeric types lack predefined limits. They are not primitive types,
and are used sparingly for efficiency reasons.
Composite/Derived data types
Composite types are derived from more than one primitive type. This can be done in a number of
ways. The ways they are combined are called data structures. Composing a primitive type into a
compound type generally results in a new type, e.g. array-of-integer is a different type to integer.
An array stores a number of elements of the same type in a specific order. They are accessed
using an integer to specify which element is required (although the elements may be of almost any
type). Arrays may be fixed-length or expandable.
Record (also called tuple or struct) Records are among the simplest data structures. A record is a
value that contains other values, typically in fixed number and sequence and typically indexed by
names. The elements of records are usually called fields or members.
Union. A union type definition will specify which of a number of permitted primitive types may
be stored in its instances, e.g. "float or long integer". Contrast with a record, which could be defined
to contain a float and an integer; whereas, in a union, there is only one value at a time.
A tagged union (also called a variant, variant record, discriminated union, or disjoint union)
contains an additional field indicating its current type, for enhanced type safety.
A set is an abstract data structure that can store certain values, without any particular order, and
no repeated values. Values themselves are not retrieved from sets, rather one tests a value for
membership to obtain a boolean "in" or "not in".
An object contains a number of data fields, like a record, and also a number of program code
fragments for accessing or modifying them. Data structures not containing code, like those above,
are called plain old data structures. Many others are possible, but they tend to be further
variations and compounds of the above.
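The composite types described above can be sketched in Python roughly as follows. The names `Point`, `Tagged`, and `describe` are hypothetical and chosen only for this illustration; Python lists are also more flexible than the fixed-type arrays the text describes.

```python
from dataclasses import dataclass
from typing import Union

# Array: elements in a fixed order, accessed by integer index.
scores = [90, 75, 88]


# Record: a fixed set of named fields.
@dataclass
class Point:
    x: float
    y: float


# Tagged union: one value at a time, plus a tag saying which kind it is.
@dataclass
class Tagged:
    tag: str                      # "float" or "int"
    value: Union[float, int]


def describe(item: Tagged) -> str:
    """Use the tag to decide how the stored value should be interpreted."""
    if item.tag == "float":
        return f"float {item.value}"
    if item.tag == "int":
        return f"int {item.value}"
    raise ValueError(f"unknown tag: {item.tag}")


# Set: no order, no repeated values, queried by membership test.
primes = {2, 3, 5, 7}
has_five = 5 in primes     # True
has_four = 4 in primes     # False
```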
Enumerated Type
An enumerated type has values which are different from each other, and which can be compared and
assigned, but which do not necessarily have any particular concrete representation in the
computer's memory; compilers and interpreters can represent them arbitrarily. For example, the four
suits in a deck of playing cards may be four enumerators named CLUB, DIAMOND, HEART, SPADE,
belonging to an enumerated type named suit. If a variable V is declared having suit as its data
type, one can
assign any of those four values to it. Some implementations allow programmers to assign integer
values to the enumeration values, or even treat them as type-equivalent to integers.
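The suit example can be written with Python's standard `enum` module. The integer values assigned here are arbitrary choices for the sketch; the text only requires that the four values be distinct.

```python
from enum import Enum


class Suit(Enum):
    CLUB = 1
    DIAMOND = 2
    HEART = 3
    SPADE = 4


v = Suit.HEART                     # a variable of type Suit
is_spade = (v == Suit.SPADE)       # False: enumerators can be compared
all_suits = list(Suit)             # the four enumerators, in definition order
```

With a plain `Enum`, `Suit.HEART` is not equal to the integer 3; Python's `IntEnum` is the variant that treats enumerators as type-equivalent to integers.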
String and text types
Such as:
Alphanumeric character. A letter of the alphabet, digit, blank space, punctuation mark, etc.
Alphanumeric strings, a sequence of characters. They are typically used to represent words and
text.
Character and string types can store sequences of characters from a character set such as ASCII.
Since most character sets include the digits, it is possible to have a numeric string, such as "1234".
However, many languages would still treat these as belonging to a different type to the numeric
value 1234.
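A short Python sketch of the difference between a numeric string and a numeric value (Python is one of the languages that keeps the two types separate):

```python
text = "1234"        # a string of four digit characters
number = 1234        # an integer value

same = (text == number)     # False: different types never compare equal here
converted = int(text)       # explicit conversion from string to integer
joined = text + "5"         # string concatenation gives "12345"
summed = number + 5         # integer addition gives 1239
```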
Character and string types can have different subtypes according to the required character "width".
The original 7-bit wide ASCII was found to be limited and superseded by 8 and 16-bit sets.
STRING PROCESSING
String
A finite sequence S of zero or more characters is called a string. The string with zero characters
is called the empty string or null string.
The concatenation of strings S1, S2 and S3 is written S1||S2||S3, and its length is equal to the
sum of the lengths of S1, S2 and S3. A string Y is called a substring of a string S if there exist
strings X and Z such that S = X || Y || Z. If X is an empty string, then Y is called an initial
substring of S, and if Z is an empty string, then Y is called a terminal substring of S.
CHARACTER DATA TYPE:-
Character data is of two kinds: (1) Constant (2) Variable
Constant String: A constant string is fixed and is written in either single quotes (' ') or
double quotation marks (" ").
Ex:- 'SONA'
"Sona"
Variable String:
String variables fall into 3 categories.
1. Static
2. Semi-Static
3. Dynamic
Static character variable: A variable whose length is defined before the program is executed and
cannot change throughout the program.
Semi-static variable: A variable whose length may vary as long as it does not exceed a
maximum value. The maximum value is determined by the program before the program is
executed.
Dynamic variable: A variable whose length can change during the execution of the program.
String Operation: There are four different operations.
1. Sub string
2. Indexing
3. Concatenation
4. Length
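Sub string:-
Accessing a substring of a given string requires three pieces of information:
a. The string itself (or its name).
b. The position of the first character of the substring in the given string.
c. The length of the substring, or the position of the last character of the substring.
We call this operation SUBSTRING and write
SUBSTRING (S, K, L)
for the substring of S that begins at position K and has length L.
For e.g.; SUBSTRING ('TO BE OR NOT TO BE', 4, 7) = 'BE OR N'
SUBSTRING ('THE END', 4, 4) = ' END'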
INDEXING:-
Indexing, also called pattern matching, refers to finding the position where a string pattern P
first appears in a given string text T. We call this operation INDEX and write it as INDEX
(text, pattern).
If the pattern P does not appear in the text T, then INDEX is assigned the value 0. The
arguments text and pattern can be either string constants or string variables. For e.g.; suppose
T contains the text
'HIS FATHER IS THE PROFESSOR'
Then INDEX (T, 'THE') = 7, because 'THE' first appears inside the word 'FATHER', and
INDEX (T, 'THEN') = 0
Concatenation:-
Let S1 and S2 be strings. The concatenation of S1 and S2, denoted by S1||S2, is the
string consisting of the characters of S1 followed by the characters of S2.
Ex:- S1 = 'Sonalisa', S2 = ' ', S3 = 'Behera'
S1||S2||S3 = 'Sonalisa Behera'
Length operation:- The number of characters in a string is called its length. We write
LENGTH (string) for the length of a given string. For example, LENGTH ("Computer") = 8.
BASIC language: LEN (string)
C language: strlen (string)
Some C libraries also provide the following string functions:
String upper: strupr ('computer') gives COMPUTER
String lower: strlwr ('COMPUTER') gives computer
String concatenation: strcat
String reverse: strrev
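The four string operations can be sketched in Python as below. The function names `substring`, `index`, and `length` are chosen here to mirror SUBSTRING, INDEX, and LENGTH; they are not library functions. Positions are 1-based, as in the notes, whereas Python itself counts from 0.

```python
def substring(s, k, l):
    """SUBSTRING(S, K, L): L characters of s starting at 1-based position k."""
    return s[k - 1:k - 1 + l]


def index(text, pattern):
    """INDEX(text, pattern): 1-based position of the first match, or 0 if none."""
    # str.find returns -1 when the pattern is absent, so adding 1 gives 0.
    return text.find(pattern) + 1


def length(s):
    """LENGTH(string): the number of characters in s."""
    return len(s)


t = "HIS FATHER IS THE PROFESSOR"
full_name = "Sonalisa" + " " + "Behera"    # concatenation, S1||S2||S3
```

For instance, `substring("TO BE OR NOT TO BE", 4, 7)` evaluates to `"BE OR N"`, `index(t, "THE")` to 7, and `index(t, "THEN")` to 0. The upper, lower, and reverse operations correspond to `"computer".upper()`, `"COMPUTER".lower()`, and `"computer"[::-1]`.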
Data Representation
All data in the computer memory can only consist of patterns of 1's and 0's. A memory cell
typically holds eight of these digits, called 8 bits or 1 byte. However, we know that the computer
can manipulate numbers (both integers and real numbers), characters or strings (sequences of
characters) and Boolean values. It can also store images and music. All of these things must be
stored as patterns of 1's and 0's.
Data refers to the symbols that represent people, events, things, and ideas. Data can be a name, a
number, the colors in a photograph, or the notes in a musical composition.
Data Representation refers to the form in which data is stored, processed, and transmitted.
Devices such as smartphones, iPods, and computers store data in digital formats that can be
handled by electronic circuitry.
Digitization is the process of converting information, such as text, numbers, photo, or music, into
digital data that can be manipulated by electronic devices.
The Digital Revolution has evolved through four phases, beginning with big, expensive,
standalone computers, and progressing to today’s digital world in which small, inexpensive digital
devices are everywhere.
The 0s and 1s used to represent digital data are referred to as binary digits — from this term we
get the word bit that stands for binary digit.
A bit is a 0 or 1 used in the digital representation of data. A digital file, usually referred to simply
as a file, is a named collection of data that exists on a storage medium, such as a hard disk, CD,
DVD, or flash drive.
Numeric data consists of numbers that can be used in arithmetic operations. Digital devices
represent numeric data using the binary number system, also called base 2.
The binary number system only has two digits: 0 and 1. No numeral like 2 exists in the system, so
the number “two” is represented in binary as 10 (pronounced “one zero”).
System        Radix   Allowable Digits
---------------------------------------------------------------------
Decimal       10      0, 1, 2, 3, 4, 5, 6, 7, 8, 9
Binary        2       0, 1
Octal         8       0, 1, 2, 3, 4, 5, 6, 7
Hexadecimal   16      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
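EXAMPLE 2.1 Three numbers represented as powers of a radix.
$243.51_{10} = 2 \times 10^{2} + 4 \times 10^{1} + 3 \times 10^{0} + 5 \times 10^{-1} + 1 \times 10^{-2}$
$212_{3} = 2 \times 3^{2} + 1 \times 3^{1} + 2 \times 3^{0} = 23_{10}$
$10110_{2} = 1 \times 2^{4} + 0 \times 2^{3} + 1 \times 2^{2} + 1 \times 2^{1} + 0 \times 2^{0} = 22_{10}$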
EXAMPLE 2.4 Convert $147_{10}$ to binary: $147_{10} = 10010011_{2}$
A binary number with N bits can represent unsigned integers from 0 to $2^{N} - 1$. Overflow: the
result of an arithmetic operation is outside the range of allowable precision for the given number
of bits.
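One common way to carry out such conversions is repeated division by the target radix, collecting the remainders. The notes do not show which method Example 2.4 used, so the Python sketch below is an assumption about the procedure, and the function names `to_base` and `from_base` are invented for it.

```python
DIGITS = "0123456789ABCDEF"


def to_base(n, radix):
    """Convert a non-negative integer to a digit string in the given radix."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, radix)   # the remainder is the next digit
        out.append(DIGITS[remainder])
    # Remainders come out least significant first, so reverse them.
    return "".join(reversed(out))


def from_base(digits, radix):
    """Evaluate a digit string as a sum of powers of the radix."""
    value = 0
    for ch in digits:
        value = value * radix + DIGITS.index(ch)
    return value


binary_147 = to_base(147, 2)        # "10010011"
largest_8_bit = 2 ** 8 - 1          # 255, the largest unsigned 8-bit value
```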
Converting Fractions
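EXAMPLE 2.7 Convert $0.34375_{10}$ to binary with 4 bits to the right of the binary point.
Multiply the fraction by 2 repeatedly, recording the integer part of each product:
0.34375 * 2 = 0.6875, integer part 0
0.6875 * 2 = 1.375, integer part 1
0.375 * 2 = 0.75, integer part 0
0.75 * 2 = 1.5, integer part 1
Reading the integer parts from top to bottom, $0.34375_{10} = 0.0101_{2}$ to four binary places.
We simply discard (or truncate) the rest of the answer when the desired accuracy has been achieved.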
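A minimal Python sketch of fraction conversion by repeated multiplication by 2, truncating after a fixed number of places. The function name `fraction_to_binary` is hypothetical, and `fractions.Fraction` is used here only to keep the arithmetic exact.

```python
from fractions import Fraction


def fraction_to_binary(value, places):
    """Convert a decimal fraction in [0, 1) to a truncated binary string."""
    frac = Fraction(value)        # exact arithmetic, e.g. Fraction("0.34375")
    bits = []
    for _ in range(places):
        frac *= 2
        bit = int(frac)           # the integer part is the next binary digit
        bits.append(str(bit))
        frac -= bit               # keep only the fractional part
    return "0." + "".join(bits)


four_places = fraction_to_binary("0.34375", 4)    # "0.0101"
```

Asking for five places gives `"0.01011"`, which shows the bit that truncation to four places discards.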
First, convert to decimal: $3121_{4} = 217_{10}$
Then convert to base 3: $217_{10} = 22001_{3}$
We have $3121_{4} = 22001_{3}$
A signed-magnitude number has a sign as its left-most bit (also referred to as the high-order bit or
the most significant bit) while the remaining bits represent the magnitude (or absolute value) of
the numeric value.
N bits can represent $-(2^{N-1} - 1)$ to $2^{N-1} - 1$.
EXAMPLE 2.10 Add $01001111_{2}$ to $00100011_{2}$ using signed-magnitude arithmetic.
$01001111_{2}$ (79) + $00100011_{2}$ (35) = $01110010_{2}$ (114). There is no overflow in this example.
EXAMPLE 2.11 Add $01001111_{2}$ to $01100011_{2}$ using signed-magnitude arithmetic.
An overflow condition occurs and the carry out of the magnitude bits is discarded, resulting in an
incorrect sum. We obtain the erroneous result of
$01001111_{2}$ (79) + $01100011_{2}$ (99) = $00110010_{2}$ (50)
EXAMPLE 2.12 Subtract $01001111_{2}$ from $01100011_{2}$ using signed-magnitude arithmetic.
We find $01100011_{2}$ (99) - $01001111_{2}$ (79) = $00010100_{2}$ (20) in signed-magnitude representation.
Signed magnitude has two representations for zero, 10000000 and 00000000 (and
mathematically speaking, this simply shouldn't happen!).
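The signed-magnitude additions above can be reproduced with a small Python sketch. It handles only the case where both operands have the same sign, and the names `sm_encode` and `sm_add_same_sign` are invented for this illustration.

```python
def sm_encode(value, bits=8):
    """Encode an integer as a signed-magnitude bit string."""
    magnitude = abs(value)
    if magnitude >= 2 ** (bits - 1):
        raise OverflowError("magnitude does not fit")
    sign = "1" if value < 0 else "0"
    return sign + format(magnitude, f"0{bits - 1}b")


def sm_add_same_sign(a, b):
    """Add two same-sign signed-magnitude strings; return (result, overflow)."""
    if a[0] != b[0] or len(a) != len(b):
        raise ValueError("this sketch only handles equal-width, same-sign operands")
    width = len(a) - 1                      # number of magnitude bits
    total = int(a[1:], 2) + int(b[1:], 2)
    overflow = total >= 2 ** width          # a carry left the magnitude field
    magnitude = total % (2 ** width)        # the carry is discarded
    return a[0] + format(magnitude, f"0{width}b"), overflow


ok_sum = sm_add_same_sign("01001111", "00100011")    # ("01110010", False): 79 + 35 = 114
bad_sum = sm_add_same_sign("01001111", "01100011")   # ("00110010", True): 79 + 99 wraps to 50
```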
Complement Systems
One’s Complement
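In one's complement, a negative number is formed by flipping every bit of the corresponding
positive number. This sort of bit-flipping is very simple to implement in computer hardware.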
EXAMPLE 2.16 Express $23_{10}$ and $-9_{10}$ in 8-bit binary one's complement form.
$23_{10} = +(00010111_{2}) = 00010111_{2}$
$-9_{10} = -(00001001_{2}) = 11110110_{2}$
The primary disadvantage of one's complement is that we still have two representations for zero:
00000000 and 11111111
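A minimal Python sketch of one's complement encoding by bit-flipping, assuming 8-bit values by default. The helper names `flip_bits` and `ones_complement` are made up for this example.

```python
def flip_bits(bit_string):
    """Replace every 0 with 1 and every 1 with 0."""
    return "".join("1" if b == "0" else "0" for b in bit_string)


def ones_complement(value, bits=8):
    """Encode an integer in one's complement form."""
    if abs(value) > 2 ** (bits - 1) - 1:
        raise OverflowError("value does not fit")
    magnitude = format(abs(value), f"0{bits}b")
    # Positive values are stored as-is; negative values flip every bit.
    return flip_bits(magnitude) if value < 0 else magnitude


plus_23 = ones_complement(23)         # "00010111"
minus_9 = ones_complement(-9)         # "11110110"
other_zero = flip_bits("00000000")    # "11111111", the second form of zero
```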
Two’s Complement
To express a negative number in two's complement, find the one's complement and add 1.
EXAMPLE 2.19 Express $23_{10}$, $-23_{10}$, and $-9_{10}$ in 8-bit binary two's complement form.
$23_{10} = +(00010111_{2}) = 00010111_{2}$
$-23_{10} = -(00010111_{2}) = 11101000_{2} + 1 = 11101001_{2}$
$-9_{10} = -(00001001_{2}) = 11110110_{2} + 1 = 11110111_{2}$
EXAMPLE 2.20 Add $9_{10}$ to $-23_{10}$ using two's complement arithmetic.
$00001001_{2}$ ($9_{10}$) + $11101001_{2}$ ($-23_{10}$) = $11110010_{2}$ ($-14_{10}$)
EXAMPLE 2.21 Find the sum of $23_{10}$ and $-9_{10}$ in binary using two's complement arithmetic.
$00010111_{2}$ ($23_{10}$) + $11110111_{2}$ ($-9_{10}$) = $00001110_{2}$ ($14_{10}$)
A Simple Rule for Detecting an Overflow Condition: If the carry into the sign bit equals the carry
out of the sign bit, no overflow has occurred. If the carry into the sign bit is different from the
carry out of the sign bit, overflow (and thus an error) has occurred.
EXAMPLE 2.22 Find the sum of $126_{10}$ and $8_{10}$ in binary using two's complement arithmetic.
$01111110_{2}$ ($126_{10}$) + $00001000_{2}$ ($8_{10}$) = $10000110_{2}$ ($-122_{10}$)
A one is carried into the leftmost bit, but a zero is carried out. Because these carries are not equal,
an overflow has occurred.
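The two's complement examples and the carry rule can be checked with a short Python sketch. The function names `twos_complement`, `decode`, and `add` are assumptions made for this illustration, and bit strings are used only to keep the output readable.

```python
def twos_complement(value, bits=8):
    """Encode an integer in two's complement form."""
    if not -(2 ** (bits - 1)) <= value < 2 ** (bits - 1):
        raise OverflowError("value does not fit")
    # Masking a negative Python integer yields its two's complement pattern.
    return format(value & (2 ** bits - 1), f"0{bits}b")


def decode(bit_string):
    """Interpret a bit string as a two's complement integer."""
    raw = int(bit_string, 2)
    return raw - 2 ** len(bit_string) if bit_string[0] == "1" else raw


def add(a, b):
    """Add two equal-width bit strings; return (sum, overflow)."""
    bits = len(a)
    x, y = int(a, 2), int(b, 2)
    low_mask = 2 ** (bits - 1) - 1
    # Carry into the sign bit: produced by adding all bits below it.
    carry_in = ((x & low_mask) + (y & low_mask)) >> (bits - 1)
    total = x + y
    # Carry out of the sign bit: whatever no longer fits in the word.
    carry_out = total >> bits
    result = format(total & (2 ** bits - 1), f"0{bits}b")
    return result, carry_in != carry_out


sum_a = add(twos_complement(9), twos_complement(-23))    # ("11110010", False)
sum_b = add(twos_complement(23), twos_complement(-9))    # ("00001110", False)
sum_c = add(twos_complement(126), twos_complement(8))    # ("10000110", True)
```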
Representing Text
Character data is composed of letters, symbols, and numerals that are not used in calculations.
Examples of character data include your name, address, and hair color.
Character data is commonly referred to as “text.”
Digital devices employ several types of codes to represent character data, including ASCII,
Unicode, and their variants.
ASCII (American Standard Code for Information Interchange, pronounced “ASK ee”) requires
seven bits for each character.
The ASCII code for an uppercase A is 1000001.
Extended ASCII is a superset of ASCII that uses eight bits for each character.
For example, Extended ASCII represents the uppercase letter A as 01000001.
Using eight bits instead of seven bits allows Extended ASCII to provide codes for 256 characters.
Unicode (pronounced “YOU ni code”) originally used sixteen bits and provided codes for about 65,000
characters; the current standard has room for more than a million.
This is a bonus for representing the alphabets of multiple languages.
UTF-8 is a variable-length coding scheme that uses one byte for common ASCII characters but
uses two to four bytes for other Unicode characters as necessary.
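Python's built-in `ord`, `format`, and `str.encode` can be used to inspect these codes. The sample characters below are arbitrary choices for the sketch; in UTF-8 as implemented by Python, an ASCII character occupies one byte and other characters occupy more.

```python
code = ord("A")                      # 65
seven_bit = format(code, "07b")      # "1000001", the 7-bit ASCII code
eight_bit = format(code, "08b")      # "01000001", the 8-bit Extended ASCII form

# Number of bytes each character needs when encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ["A", "é", "€"]}
# {"A": 1, "é": 2, "€": 3}
```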
ASCII codes are used for numerals, such as Social Security numbers and phone numbers.
Plain, unformatted text is sometimes called ASCII text and is stored in a so-called text file with a
name ending in .txt.
On Apple devices these files are labeled “Plain Text.” In Windows, these files are labeled “Text
Document”.
ASCII text files contain no formatting. To create documents with styles and formats, formatting
codes have to be embedded in the text.
Microsoft Word produces formatted text and creates documents in DOCX format.
Apple Pages produces documents in PAGES format.
Adobe Acrobat produces documents in PDF format.
The HTML markup language used for Web pages produces documents in HTML format.
Use bits for data rates, such as Internet connection speeds and movie download speeds.
Use bytes for file sizes and storage capacities.
104 KB: Kilobyte (KB or Kbyte) is often used when referring to the size of small computer files.
56 Kbps: Kilobit (Kb or Kbit) can be used for slow data rates, such as a 56 Kbps (kilobits per
second) dial-up connection.
50 Mbps: Megabit (Mb or Mbit) is used for faster data rates, such as a 50 Mbps (megabits per
second) Internet connection.
3.2 MB: Megabyte (MB or MByte) is typically used when referring to the size of files containing
photos and videos.
100 Gbit: Gigabit (Gb or Gbit) is used for really fast network speeds.
16 GB: Gigabyte (GB or GByte) is commonly used to refer to storage capacity.
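Because file sizes are quoted in bytes and data rates in bits, estimating a transfer time requires a factor of 8. The Python sketch below assumes decimal prefixes (1 MB = 1,000,000 bytes, 1 Mbps = 1,000,000 bits per second) and ignores protocol overhead; the function name `transfer_seconds` is hypothetical.

```python
BITS_PER_BYTE = 8


def transfer_seconds(size_bytes, rate_bits_per_second):
    """Ideal time to move size_bytes over a link of the given bit rate."""
    return size_bytes * BITS_PER_BYTE / rate_bits_per_second


photo_bytes = 3_200_000          # a 3.2 MB photo
fast_link = 50_000_000           # 50 Mbps
dial_up = 56_000                 # 56 Kbps

fast_time = transfer_seconds(photo_bytes, fast_link)   # about 0.5 seconds
slow_time = transfer_seconds(photo_bytes, dial_up)     # several minutes
```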
Data Compression
To reduce file size and transmission times, digital data can be compressed. Data compression refers
to any technique that recodes the data in a file so that it contains fewer bits. Compression is
commonly referred to as “zipping”.
Compression techniques are divided into two categories: lossless and lossy. Lossless compression
provides a way to compress data and reconstitute it into its original state; uncompressed data stays
exactly the same as the original data. Lossy compression throws away some of the original data
during the compression process; uncompressed data is not exactly the same as the original.
Software for compressing data is sometimes referred to as a compression utility or a zip tool. On
laptops and desktop computers, the compression utility is accessed from the same screen used to
manage files.
The process of reconstituting files is called extracting or unzipping. Compressed files may end
with .zip, .gz, .pkg, or .tar.gz.
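Lossless compression can be tried out with Python's standard `zlib` module. The sample data is deliberately repetitive so that it compresses well; real files vary widely.

```python
import zlib

original = b"AAAAABBBBB" * 100            # 1000 bytes of repetitive data
compressed = zlib.compress(original)      # recode the data into fewer bytes
restored = zlib.decompress(compressed)    # "unzip" it again

is_identical = (restored == original)     # True: nothing was lost
saved_bytes = len(original) - len(compressed)
```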
This section starts by explaining what happens internally when you declare a variable, and then
moves on to explain two important concepts: the stack and the heap.
void Method1()   // the method name is illustrative
{
    // Line 1
    int i = 4;
    // Line 2
    int y = 2;
    // Line 3
    Class1 cls1 = new Class1();
}
The code has three lines; let's go through it line by line to understand how things execute internally.
Line 1: When this line is executed, the compiler allocates a small amount of memory in the
stack. The stack is responsible for keeping track of the running memory needed in your
application.
Line 2: Now the execution moves to the next step. As the name says stack, it stacks this
memory allocation on top of the first memory allocation. You can think about stack as a
series of compartments or boxes put on top of each other. Memory allocation and de-
allocation is done using LIFO (Last In First Out) logic. In other words memory is allocated
and de-allocated at only one end of the memory, i.e., top of the stack.
Line 3: In line 3, we have created an object. When this line is executed it creates a pointer
on the stack and the actual object is stored in a different type of memory location called
‘Heap’. ‘Heap’ does not track running memory, it’s just a pile of objects which can be
reached at any moment of time. Heap is used for dynamic memory allocation.
One more important point to note here is that references (pointers) are allocated on the stack.
The statement Class1 cls1; does not allocate memory for an instance of Class1; it only allocates a
stack variable cls1 (and sets it to null). Only when execution reaches the new keyword is memory
allocated on the heap.
Exiting the method: Finally, the execution control starts exiting the method. When it passes the
end of the method, it clears all the variables that were allocated on the stack. In other words,
all the variables on the stack, such as those of the int data type, are de-allocated in LIFO
fashion from the stack.
The big catch – It did not de-allocate the heap memory. This memory will be later de-allocated by
the garbage collector.
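The walk-through above can be mimicked with a toy model in Python. This is only a simulation of the idea, not how any real runtime is implemented: the list `stack`, the dictionary `heap`, and the address `0x1000` are all invented for the sketch.

```python
stack = []    # LIFO: entries are pushed and popped at one end only
heap = {}     # address -> object; entries stay until something collects them


def push(name, value):
    """Allocate a variable on top of the toy stack."""
    stack.append((name, value))


push("i", 4)                          # Line 1
push("y", 2)                          # Line 2
heap[0x1000] = {"type": "Class1"}     # Line 3: the object itself lives on the heap
push("cls1", 0x1000)                  # ...and only a reference goes on the stack

# Exiting the method: stack entries are removed in LIFO order.
popped = []
while stack:
    name, _ = stack.pop()
    popped.append(name)

# popped == ["cls1", "y", "i"]; the heap object is still present,
# waiting for a garbage collector in this model.
```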
Now many of our developer friends must be wondering why two types of memory, can’t we just
allocate everything on just one memory type and we are done?
If you look closely, primitive data types are not complex, they hold single values like ‘int i = 0’.
Object data types are complex, they reference other objects or other primitive data types. In other
words, they hold reference to other multiple values and each one of them must be stored in
memory. Object types need dynamic memory while primitive ones need static type memory. If
the requirement is of dynamic memory, it’s allocated on the heap or else it goes on a stack.
Run-time refers to the time when an application actually executes. In this discussion compile-time
means everything before run-time, that is, compilation, linking, and loading. As programming
languages and environments have become more complicated, managing the storage at run-time
has gotten extremely difficult indeed. Systems and algorithms to manage run-time storage are now
among the most difficult in existence.
For a compiled program, its static structure is the structure of the source program, how it is
organized. The dynamic structure is the structure that evolves during run-time.
It is useful to understand how storage is managed in different programming languages and for
different kinds of data. Three important cases are static allocation, stack allocation, and heap
allocation.
With static storage, the location of every variable is fixed, allocated and known at compile-time.
In principle, every variable has a fixed constant machine address.
One uses the word binding to mean an association of a property with an entity in a programming
language. All bindings of storage locations to program names occur at compile-time. The bindings
are fixed and unchanged throughout run-time.
This is the storage model used for the FORTRAN language up to 1977. It is a simple model, easy
to set up, with very little to manage at run-time. In fact the main changes at run-time are with the
parameters. In FORTRAN, parameters are always passed by reference, that is, the location (or
reference) of the parameter is passed. FORTRAN passes arrays just the same as C or Java, but
FORTRAN also passes simple variables by reference, rather than by value, as C and Java do. In a
FORTRAN program, this is the main thing that changes during run-time, since a function may be
passed different addresses for its parameters during different calls at run-time.
These features of FORTRAN can lead to a curious error: If one passes a constant, say 2, as a
parameter, what is actually passed is a reference to the constant. Suppose inside the function, the
formal parameter (call it x) is incremented, say with x = x + 1. Then this can result in a change of
the constant itself, so that a = 2 would then actually give a the value of 3.
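The curious error can be imitated in Python by modelling each storage location as a one-element list, so that passing the list plays the role of passing a reference. This is a toy model under that assumption, not FORTRAN semantics in full; the names `constant_two` and `increment` are invented.

```python
# One shared storage location that holds the literal constant 2.
constant_two = [2]


def increment(x):
    """The formal parameter x refers to the caller's storage location."""
    x[0] = x[0] + 1       # x = x + 1, applied through the reference


increment(constant_two)   # "pass the constant 2" by reference

# A later "a = 2" reads the same location, which now holds 3.
a = constant_two[0]
```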
The static storage model allows simple allocation (at compile-time) and allows efficient execution
(at run-time). So why don't we still use this method? Why not just use static storage?
Answers:
o Historically storage was expensive, and one wanted to use it sparingly. The static
method is simple to implement and efficient in execution, but it is more wasteful
of storage than other methods. (Though now large cheap memory is available.)
o One can't implement recursion statically.
o One can't have dynamic data structures, that is, the structure of the data can't be
decided on and created at run-time.
All three of C, Java, and C++ allow the declaration of static variables, so at run time, each of
these languages has a static portion of storage for these variables.
Stack Allocation:
A run-time stack is a simple and efficient way to provide the storage needed for function calls.
When a function is called, all storage needed for the function is allocated on the stack in a section
called the activation record. The storage includes room for the return address, the return value
(possibly a pointer), actual parameters passed to the function, and any variables declared within
the function, including static arrays (C and C++ only), static structs and classes (C and C++ only),
and variable-sized arrays (GNU C and Algol, but not standard C, C++, or Java). The activation
record also holds locations for temporary values, and pointers to other parts of the stack (to
facilitate access and deallocation). On return from the function, the storage on the stack is
deallocated. It is always an error to make use of the stack after returning from the function (except
perhaps to retrieve a value returned). For an example of the misuse of stack storage, see dangle
example.
Allocation on the stack can be as simple as modifying a single stacktop pointer, and deallocation
is typically just as easy. Implementing functions using only static storage (as was done in
FORTRAN) does not allow recursive functions, whereas the stack allocation supports recursion.
This method has just a little extra run-time overhead compared with static allocation.
The stack method also uses storage more efficiently: with static allocation, local variables for all
defined functions are allocated, whereas stack allocation only needs to allocate storage for
currently executing functions. For example, if there are three functions A, B, and C, each with a
100-byte local array, then all three arrays must be allocated (300 bytes) in the static model, while
only the arrays for executing functions are allocated in the stack model.
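To make the idea of activation records concrete, the following Python sketch computes a factorial with an explicit stack instead of real recursive calls. It is a simplified model: each record holds only the parameter `n`, whereas a real activation record also holds a return address, temporaries, and so on. The function name `factorial_with_stack` is hypothetical.

```python
def factorial_with_stack(n):
    """Return (n!, deepest stack size) using explicit activation records."""
    stack = []
    max_depth = 0

    # "Call" phase: each recursive call pushes a new activation record.
    while True:
        stack.append({"n": n})
        max_depth = max(max_depth, len(stack))
        if n <= 1:
            break
        n -= 1

    # "Return" phase: records are popped in LIFO order as calls return.
    result = 1
    while stack:
        record = stack.pop()
        result *= max(record["n"], 1)
    return result, max_depth


value, depth = factorial_with_stack(5)    # (120, 5)
```

Storage is needed only for the records of calls that are currently active, which is why the stack can support recursion while a purely static layout cannot.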
This stack allocation method was introduced in 1960 for Algol 60 by Peter Naur. (Of course he
worked on it before then.) It might seem plausible that Naur developed Algol 60, and then hit on
the clever way to implement function calls. The reality is more impressive though: Naur thought
of the stack allocation method and then invented a language that would utilize this method. Since
1960 all "standard" compiled languages have used this method, including later dialects of Algol,
Pascal and descendants, PL/I, C and C++, and Java.
Note that actual parameters (possibly expressions) are evaluated in the environment of the calling
function. Then each resulting value of this evaluation is copied into the spot in the called function's
activation record for the corresponding formal parameter (so that the formal parameter behaves
like a local variable). Then all the additional information is copied into the activation record and
the function is put into execution.
The Pascal language allowed the definition of a "helper" function to be buried inside the definition
of the function it is helping. This provided a rudimentary version of the "object-oriented" approach,
since the information in the helper function was hidden from view (so-called information hiding).
Function definitions could be buried within other functions to an arbitrary depth of nesting, and
this feature made Pascal implementation much more difficult than it otherwise would have been.
(The stack needed extra pointers to access variables at different levels of nesting efficiently.) These
mechanisms provided interesting examples for implementation in compiler construction classes,
but in practice they were seldom used. The developers of C, and those of C++ and Java who
followed, did not allow a function definition to be nested inside another function.
Long before C, language designers recognized the need for true dynamic storage allocation, to
obtain new storage during run-time from somewhere besides a stack. This storage would be
explicitly allocated and would remain until explicitly deallocated. The source of the storage is now
commonly called a heap (an Algol 68 term). The Lisp language, among others needed similar
storage. So Pascal, C, C++, and Java made all stack allocation relatively simple, not allowing an
array to be allocated that was of a size determined at run-time. (Java doesn't allow any arrays on
the stack. When C and C++ allowed only sizes of arrays that were known at compile time on the
stack, that made the stack management simpler.) Then all these languages allowed special storage
of any size (dynamic) to be allocated at run-time.
In each case, a desired amount of storage is allocated and a pointer to the storage is returned. Later
it is possible to return the storage to the heap (sort of like recycling) to be used during other
allocations. Pascal, C, and C++ have explicit deallocation as the norm, whereas automatic
deallocation, called garbage collection, is possible. In Java automatic garbage collection is the
norm, although explicit deallocation is possible.
Here are the different functions for allocating and deallocating in the different languages:
(Note: In Pascal and C, the above are functions. In C++, new and delete are operators, although
you can also use the C functions, assuming that you do not mix them with the C++ operators
(causing a catastrophe). In Java, System.gc() is a function, while new is a keyword, a part of the
language used to allocate storage.)
Pitfalls with allocation and explicit deallocation:
There are two mistakes one can make when using explicit allocation and deallocation in a language
like C or C++. Notice that neither of these mistakes is possible in Java.
Mistake 1: Memory Leak. If one allocates storage but does not deallocate it after it is no longer
in use, the memory used by the running program will keep growing, potentially causing
problems. Specifically, if all pointers to some storage area are overwritten or deallocated
themselves, then the storage has not been deallocated but is inaccessible. Such storage is called
garbage, since it takes up space but can no longer be used or deallocated. The accumulation of
such garbage in a running program creates what is called a memory leak. In a complex
long-running program (such as an operating system), memory leaks are common (because the
program is complex) and are bad (since the program's memory use keeps growing as it continues
to run).
Memory leaks are a serious problem in C, but they are even more of a problem in C++, because of
the complexity of this language, particularly the complexity of hidden allocations and
deallocations. Bjarne Stroustrup, the inventor of C++, had this to say in his FAQ about C++:
Question: How do I deal with memory leaks?
Answer: By writing code that doesn't have any. Clearly, if your code has new operations,
delete operations, and pointer arithmetic all over the place, you are going to mess up
somewhere and get leaks, stray pointers, etc. This is true independently of how
conscientious you are with your allocations: eventually the complexity of the code will
overcome the time and effort you can afford. It follows that successful techniques rely on
hiding allocation and deallocation inside more manageable types. Good examples are the
standard containers. They manage memory for their elements better than you could without
disproportionate effort.
(Several complex examples here) ...
If systematic application of these techniques is not possible in your environment (you have
to use code from elsewhere, part of your program was written by Neanderthals, etc.), be
sure to use a memory leak detector as part of your standard development procedure, or plug
in a garbage collector.
Mistake 2: An active pointer to garbage. A much more serious mistake than the previous
one is to deallocate storage when the program still has an active pointer to the storage.
Mistake 1 just leads to programs that keep using more memory as they run; they have to
be restarted every now and then. But Mistake 2 can lead to a program that crashes -- a
potential catastrophe.
Pointers and References
Machine addresses
Computer memory consists of one long list of addressable bytes.
A pointer is a data item that contains an address.
Recall that an Abstract Data Type (ADT) has a set of values and a set of operations on those values
Pointers and references have the same set of values (memory addresses)
Pointers have more defined operations than references
Pointers are more flexible and more general than references
References are safer than pointers (from error or malicious misuse)
References allow automatic garbage collection (pointers don’t)
A (non-abstract) Data Type also has an implementation
The implementations of pointers and references are similar
Java references carry information about the thing referenced; in C, it’s up to the compiler to figure out what it can
Basically, pointers and references are the same thing; they point to (refer to) something else in memory
A Data Structure is a description of how data is organized in memory
Many (not all) data structures are built from objects pointing/referring to one another
Understanding pointers (references) is fundamental to this course
If this course were taught in C or C++ instead of Java, all the “nuts and bolts” would be the same
This course is in Java, but it’s not about Java
You need to know how to create your own data structures
I will also teach some Java-specific packages
In real life, it’s stupid to redo work that’s already been done for you
A trivial example
A more serious example
A binary tree is a data structure in which every node (object) has zero, one, or two
children (references to other nodes)
Arithmetic expressions can be represented as binary trees
To evaluate an arithmetic expression:
If it is a leaf, return its value
Otherwise, evaluate its two subtrees, and perform the indicated operation
public class BinaryTree {
    public Object value;            // the information in this node
    private BinaryTree leftChild;   // reference to the left subtree, or null
    private BinaryTree rightChild;  // reference to the right subtree, or null
}
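The evaluation rule described above can be sketched as follows. This is only an illustration under stated assumptions: ExprTree is a hypothetical class (it is not the BinaryTree class of the notes), each node stores its value as a String, and only the four basic operators are handled.

```java
// Illustrative sketch; ExprTree and its members are hypothetical names.
class ExprTree {
    final String value;         // an operator ("+", "-", "*", "/") or a number
    final ExprTree leftChild;   // null for a leaf
    final ExprTree rightChild;  // null for a leaf

    // Builds a leaf holding a number.
    ExprTree(String number) {
        this(number, null, null);
    }

    // Builds an interior node holding an operator and two subtrees.
    ExprTree(String value, ExprTree leftChild, ExprTree rightChild) {
        this.value = value;
        this.leftChild = leftChild;
        this.rightChild = rightChild;
    }

    boolean isLeaf() {
        return leftChild == null && rightChild == null;
    }

    // If it is a leaf, return its value; otherwise evaluate both
    // subtrees and perform the indicated operation.
    double evaluate() {
        if (isLeaf()) {
            return Double.parseDouble(value);
        }
        double left = leftChild.evaluate();
        double right = rightChild.evaluate();
        switch (value) {
            case "+": return left + right;
            case "-": return left - right;
            case "*": return left * right;
            case "/": return left / right;
            default:
                throw new IllegalArgumentException("Unknown operator: " + value);
        }
    }
}
```

For example, the expression (2 + 3) * 4 is a "*" node whose left child is a "+" node with leaves 2 and 3, and whose right child is the leaf 4.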
Suppose you have two references to a Vector, and you use one of them to add elements to the Vector
What happens if Java decides to replace this Vector with a bigger one?
It looks like the second reference is a “dangling pointer,” referring to nothing
This doesn’t happen! Java protects you from this error
But how?
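One way to see the answer: both variables refer to the same Vector object, and the Vector in turn holds a private reference to its internal array. When more room is needed, the Vector replaces that internal array; the Vector object itself is never replaced, so neither outside reference is affected. The sketch below illustrates this (SharedVectorDemo and its method are hypothetical names chosen for the example).

```java
import java.util.Vector;

// Illustrative sketch; SharedVectorDemo is a hypothetical name.
class SharedVectorDemo {
    // Adds `count` integers through one reference and reports the size
    // seen through a second reference to the same Vector.
    static int sizeSeenByOtherReference(int count) {
        Vector<Integer> first = new Vector<>(2); // small initial capacity
        Vector<Integer> second = first;          // same object, second reference

        for (int i = 0; i < count; i++) {
            first.add(i);  // the internal array is replaced when it fills up
        }
        // `second` still refers to the same Vector, so it sees every element.
        return second.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeSeenByOtherReference(100));
    }
}
```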
Linked structures
One disadvantage of using arrays to store data is that arrays are static structures and therefore
cannot be easily extended or reduced to fit the data set. Arrays are also expensive to maintain
under insertions and deletions. In this chapter we consider another data structure, the linked
list, that addresses some of the limitations of arrays.
A linked list is a very commonly used linear data structure consisting of a group of nodes in a
sequence, where each element is a separate object.
Each node holds its own data and the address of the next node, forming a chain-like
structure.
Each element (we will call it a node) of a list comprises two items: the data and a reference
to the next node. The last node has a reference to null. The entry point into a linked list is called
the head of the list. It should be noted that the head is not a separate node, but a reference to the
first node. If the list is empty then the head is a null reference.
A linked list is a dynamic data structure. The number of nodes in a list is not fixed and can grow
and shrink on demand. Any application that has to deal with an unknown number of objects is a
natural candidate for a linked list.
One disadvantage of a linked list compared with an array is that it does not allow direct access to
the individual elements. If you want to access a particular item then you have to start at the head
and follow the references until you get to that item.
Another disadvantage is that a linked list uses more memory than an array: each node needs an
extra 4 bytes (on a 32-bit CPU) to store a reference to the next node.
Types of Linked Lists
A singly linked list is described above.
A doubly linked list is a list in which each node has two references, one to the next node and
another to the previous node.
Another important type of linked list is the circular linked list, in which the last node of the list
points back to the first node (or the head) of the list.
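The two variants can be sketched with a single node type. Everything here (ListShapes, DNode, link, backwards, circularLength) is a hypothetical name introduced for illustration. The point to notice is that a doubly linked list can be walked backwards through prev, and that a circular list has no null at the end, so a traversal must stop when it returns to its starting node.

```java
// Illustrative sketch; all names here are hypothetical.
class ListShapes {
    // A node for a doubly linked list: one reference in each direction.
    static class DNode<AnyType> {
        AnyType data;
        DNode<AnyType> prev;
        DNode<AnyType> next;
        DNode(AnyType data) { this.data = data; }
    }

    // Links b directly after a, keeping both directions consistent.
    static <AnyType> void link(DNode<AnyType> a, DNode<AnyType> b) {
        a.next = b;
        b.prev = a;
    }

    // Walks backwards from the given node to the front of the list.
    static <AnyType> String backwards(DNode<AnyType> tail) {
        StringBuilder sb = new StringBuilder();
        for (DNode<AnyType> cur = tail; cur != null; cur = cur.prev) {
            sb.append(cur.data);
        }
        return sb.toString();
    }

    // Counts the nodes of a circular list by walking until we are back
    // at the starting node (a null check would never end the loop).
    static <AnyType> int circularLength(DNode<AnyType> start) {
        if (start == null) return 0;
        int count = 1;
        for (DNode<AnyType> cur = start.next; cur != start; cur = cur.next) {
            count++;
        }
        return count;
    }
}
```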
The Node class
In Java you are allowed to define a class (say, B) inside of another class (say, A). The class A is
called the outer class, and the class B is called the inner class. The purpose of inner classes is
purely to be used internally as helper classes. Here is the LinkedList class with the inner Node
class.
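private static class Node<AnyType>
{
    private AnyType data;        // the element stored in this node
    private Node<AnyType> next;  // the next node, or null at the end of the list

    Node(AnyType data, Node<AnyType> next) { this.data = data; this.next = next; }
}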
We implement the LinkedList class with two inner classes: a static Node class and a non-static
LinkedListIterator class. See LinkedList.java for a complete implementation.
Examples
Let us assume the singly linked list above and trace down the effect of each fragment below. The
list is restored to its initial state before each line executes.
head = head.next;
head.next = head.next.next;
head.next.next.next.next = head;
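The three fragments can be traced with a small program. The pictured list is not reproduced in these notes, so the sketch assumes a four-node list "a" -> "b" -> "c" -> "d"; FragmentTrace and its helpers are hypothetical names. Under that assumption the first fragment drops the first node, the second unlinks the second node, and the third makes the list circular.

```java
// Illustrative sketch; the four-node list "a" -> "b" -> "c" -> "d" is an
// assumption standing in for the list pictured in the notes.
class FragmentTrace {
    static class Node {
        String data;
        Node next;
        Node(String data, Node next) { this.data = data; this.next = next; }
    }

    // Builds a fresh list a -> b -> c -> d and returns its head.
    static Node initial() {
        return new Node("a", new Node("b", new Node("c", new Node("d", null))));
    }

    // Lists up to `limit` nodes, so a circular list cannot loop forever.
    static String show(Node head, int limit) {
        StringBuilder sb = new StringBuilder();
        Node cur = head;
        for (int i = 0; cur != null && i < limit; i++) {
            sb.append(cur.data);
            cur = cur.next;
        }
        return sb.toString();
    }

    static String dropFirst() {
        Node head = initial();
        head = head.next;                 // head now refers to "b"
        return show(head, 10);
    }

    static String dropSecond() {
        Node head = initial();
        head.next = head.next.next;       // "a" now links straight to "c"
        return show(head, 10);
    }

    static String makeCircular() {
        Node head = initial();
        head.next.next.next.next = head;  // "d" now links back to "a"
        return show(head, 6);             // six steps: wraps around once
    }
}
```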
addFirst
The method creates a node and prepends it at the beginning of the list.
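A minimal sketch of prepending, assuming a Node constructor that takes the data and the next reference (as the later fragments in these notes do). PrependList and getFirst are hypothetical names used only for this example.

```java
// Illustrative sketch; PrependList is a hypothetical stand-in for the
// LinkedList class of the notes.
class PrependList<AnyType> {
    private static class Node<AnyType> {
        AnyType data;
        Node<AnyType> next;
        Node(AnyType data, Node<AnyType> next) { this.data = data; this.next = next; }
    }

    private Node<AnyType> head; // null while the list is empty

    // Creates a node whose next reference is the old head, then makes
    // that node the new head. This also works for an empty list.
    public void addFirst(AnyType item) {
        head = new Node<AnyType>(item, head);
    }

    // Returns the data in the first node.
    public AnyType getFirst() {
        if (head == null) throw new java.util.NoSuchElementException();
        return head.data;
    }
}
```

Because only the head reference changes, prepending takes the same time no matter how long the list is.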
Traversing
Start with the head and access each node until you reach null. Do not change the head reference.
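A traversal can be sketched as below, using a temporary reference so that head itself is never moved. TraverseList, size, and the toString format are assumptions made for the example.

```java
// Illustrative sketch; TraverseList is a hypothetical stand-in for the
// LinkedList class of the notes.
class TraverseList<AnyType> {
    private static class Node<AnyType> {
        AnyType data;
        Node<AnyType> next;
        Node(AnyType data, Node<AnyType> next) { this.data = data; this.next = next; }
    }

    private Node<AnyType> head;

    public void addFirst(AnyType item) {
        head = new Node<AnyType>(item, head);
    }

    // Visits every node with a temporary reference; head is never changed.
    public int size() {
        int count = 0;
        Node<AnyType> tmp = head;
        while (tmp != null) {
            count++;
            tmp = tmp.next;
        }
        return count;
    }

    // Same traversal, collecting the data of each node in order.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node<AnyType> tmp = head; tmp != null; tmp = tmp.next) {
            sb.append(tmp.data).append(' ');
        }
        return sb.toString().trim();
    }
}
```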
addLast
The method appends a node to the end of the list. This requires traversing the list, but make sure
you stop at the last node rather than running past it to null.
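A possible version of addLast is sketched here. AppendList is a hypothetical class name, and treating the empty list as a separate case is an assumption about how the method handles a null head. The loop tests tmp.next rather than tmp, which is what makes it stop at the last node.

```java
// Illustrative sketch; AppendList is a hypothetical stand-in for the
// LinkedList class of the notes.
class AppendList<AnyType> {
    private static class Node<AnyType> {
        AnyType data;
        Node<AnyType> next;
        Node(AnyType data, Node<AnyType> next) { this.data = data; this.next = next; }
    }

    private Node<AnyType> head;

    public void addLast(AnyType item) {
        if (head == null) {          // empty list: the new node becomes the head
            head = new Node<AnyType>(item, null);
            return;
        }
        Node<AnyType> tmp = head;
        while (tmp.next != null) {   // stop AT the last node, not past it
            tmp = tmp.next;
        }
        tmp.next = new Node<AnyType>(item, null);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node<AnyType> tmp = head; tmp != null; tmp = tmp.next) {
            sb.append(tmp.data).append(' ');
        }
        return sb.toString().trim();
    }
}
```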
Inserting "after"
Find a node containing "key" and insert a new node after it. In the picture below, we insert a new
node after "e":
if(tmp != null)
tmp.next = new Node<AnyType>(toInsert, tmp.next);
}
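The fragment above shows only the end of the method. A complete version might look like the sketch below; the method name insertAfter, the search loop, and the surrounding InsertAfterList class are assumptions, while key, toInsert, and tmp are taken from the fragment.

```java
// Illustrative sketch; InsertAfterList and the method name insertAfter
// are hypothetical.
class InsertAfterList<AnyType> {
    private static class Node<AnyType> {
        AnyType data;
        Node<AnyType> next;
        Node(AnyType data, Node<AnyType> next) { this.data = data; this.next = next; }
    }

    private Node<AnyType> head;

    public void addFirst(AnyType item) {
        head = new Node<AnyType>(item, head);
    }

    // Finds the first node containing key and inserts toInsert after it.
    // Does nothing if key is not in the list.
    public void insertAfter(AnyType key, AnyType toInsert) {
        Node<AnyType> tmp = head;
        while (tmp != null && !tmp.data.equals(key)) {
            tmp = tmp.next;
        }
        if (tmp != null)
            tmp.next = new Node<AnyType>(toInsert, tmp.next);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node<AnyType> tmp = head; tmp != null; tmp = tmp.next) {
            sb.append(tmp.data).append(' ');
        }
        return sb.toString().trim();
    }
}
```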
Inserting "before"
Find a node containing "key" and insert a new node before that node. In the picture below, we
insert a new node before "a":
For the sake of convenience, we maintain two references prev and cur. When we move along
the list we shift these two references, keeping prev one step before cur. We continue until cur
reaches the node before which we need to make an insertion. If cur reaches null, we don't insert,
otherwise we insert a new node between prev and cur.
while(cur != null && !cur.data.equals(key))
{
prev = cur;
cur = cur.next;
}
//insert between cur and prev
if(cur != null) prev.next = new Node<AnyType>(toInsert, cur);
}
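The fragment above omits the start of the method. One way to complete it is sketched here: the method name insertBefore, the InsertBeforeList class, and the two checks before the loop (empty list, key in the head node) are assumptions. The head check matters because prev is still null when the key is found in the first node.

```java
// Illustrative sketch; InsertBeforeList and the method name insertBefore
// are hypothetical.
class InsertBeforeList<AnyType> {
    private static class Node<AnyType> {
        AnyType data;
        Node<AnyType> next;
        Node(AnyType data, Node<AnyType> next) { this.data = data; this.next = next; }
    }

    private Node<AnyType> head;

    public void addFirst(AnyType item) {
        head = new Node<AnyType>(item, head);
    }

    public void insertBefore(AnyType key, AnyType toInsert) {
        if (head == null) return;             // empty list: nothing to do
        if (head.data.equals(key)) {          // key is in the head node
            addFirst(toInsert);
            return;
        }
        Node<AnyType> prev = null;
        Node<AnyType> cur = head;
        while (cur != null && !cur.data.equals(key)) {
            prev = cur;                       // prev stays one step behind cur
            cur = cur.next;
        }
        // insert between prev and cur
        if (cur != null) prev.next = new Node<AnyType>(toInsert, cur);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node<AnyType> tmp = head; tmp != null; tmp = tmp.next) {
            sb.append(tmp.data).append(' ');
        }
        return sb.toString().trim();
    }
}
```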
Deletion
Find a node containing "key" and delete it. In the picture below we delete a node containing "A".
The algorithm is similar to the insert "before" algorithm. It is convenient to use two references,
prev and cur. When we move along the list we shift these two references, keeping prev one step
before cur. We continue until cur reaches the node which we need to delete. There are three
exceptional cases we need to take care of:
1. list is empty
2. delete the head node
3. node is not in the list
if( head.data.equals(key) )
{
head = head.next;
return;
}
prev = cur;
cur = cur.next;
}
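Only parts of the deletion code survive above. The sketch below is one way to put the pieces together and cover the three exceptional cases. The method name remove and the RemoveList class are assumptions, as is the choice to do nothing silently when the list is empty or the key is absent.

```java
// Illustrative sketch; RemoveList and the method name remove are
// hypothetical.
class RemoveList<AnyType> {
    private static class Node<AnyType> {
        AnyType data;
        Node<AnyType> next;
        Node(AnyType data, Node<AnyType> next) { this.data = data; this.next = next; }
    }

    private Node<AnyType> head;

    public void addFirst(AnyType item) {
        head = new Node<AnyType>(item, head);
    }

    public void remove(AnyType key) {
        if (head == null) return;            // case 1: list is empty
        if (head.data.equals(key)) {         // case 2: delete the head node
            head = head.next;
            return;
        }
        Node<AnyType> prev = null;
        Node<AnyType> cur = head;
        while (cur != null && !cur.data.equals(key)) {
            prev = cur;
            cur = cur.next;
        }
        if (cur == null) return;             // case 3: node is not in the list
        prev.next = cur.next;                // unlink cur from the chain
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node<AnyType> tmp = head; tmp != null; tmp = tmp.next) {
            sb.append(tmp.data).append(' ');
        }
        return sb.toString().trim();
    }
}
```

In Java the unlinked node needs no explicit deallocation: once nothing refers to it, the garbage collector can reclaim it.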
Advantages of Linked Lists
1. They are dynamic in nature, allocating memory only when required.
2. Insertion and deletion operations can be easily implemented.
3. Stacks and queues can be easily implemented.
4. Linked lists reduce the time needed for insertions and deletions, since no elements have to be shifted.
Disadvantages of Linked Lists
1. Extra memory is needed, since every node must also store a pointer.
2. No element can be accessed randomly; the nodes have to be visited sequentially.
3. Reverse traversal is difficult in a singly linked list.
Applications of Linked Lists
1. Linked lists are used to implement stacks, queues, graphs, etc.
2. Linked lists let you insert elements at the beginning and end of the list.
3. In Linked Lists we don't need to know the size in advance.
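As an example of the first application, a stack can be sketched on top of a singly linked list by pushing and popping at the head. LinkedStack and its method names are hypothetical; throwing NoSuchElementException on an empty stack is a design choice made for this sketch.

```java
import java.util.NoSuchElementException;

// Illustrative sketch; LinkedStack is a hypothetical name.
class LinkedStack<AnyType> {
    private static class Node<AnyType> {
        AnyType data;
        Node<AnyType> next;
        Node(AnyType data, Node<AnyType> next) { this.data = data; this.next = next; }
    }

    private Node<AnyType> top; // head of the list; null when the stack is empty

    public boolean isEmpty() {
        return top == null;
    }

    // Push is the same operation as prepending a node to the list.
    public void push(AnyType item) {
        top = new Node<AnyType>(item, top);
    }

    // Pop removes the head node and returns its data.
    public AnyType pop() {
        if (top == null) throw new NoSuchElementException("stack is empty");
        AnyType item = top.data;
        top = top.next;
        return item;
    }
}
```

Both operations touch only the head of the list, so neither needs a traversal and the size never has to be known in advance.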