Module 2 (Data Types)
Data Types
Introduction
Primitive Data Types
Character String Types
User-Defined Ordinal Types
Array Types
Record Types
List Types
Pointer and Reference Types
Introduction
A data type defines a collection of data objects and a set of
predefined operations on those objects
One design issue for all data types: What operations are
defined and how are they specified?
Primitive Data Types
Almost all programming languages provide a set of primitive
data types
Primitive data types: Those not defined in terms of other data
types
Some primitive data types are merely reflections of the
hardware
Others require only a little non-hardware support for their
implementation
1. Numeric Types
a) Integer
The most common primitive numeric data type is integer.
Java includes four signed integer sizes: byte, short, int, and
long.
Each complex value consists of two parts, the real part and the imaginary part.
Decimal types are the primary data types for business data processing and are
therefore essential to COBOL.
The most commonly used character coding was the 8-bit code ASCII (American Standard Code for
Information Interchange), which uses the values 0 to 127 to code 128 different characters.
ISO 8859-1 is another 8-bit character code, but it allows 256 different characters. Ada
95+ uses ISO 8859-1.
The colors type uses the default internal values for the enumeration
constants, 0, 1, . . . , although the constants could have been
assigned any integer literal (or any constant-valued expression).
The enumeration values are coerced to int when they are put in
integer context.
For example, if the current value of myColor is blue, then the
expression myColor++ would assign green to myColor.
In the area of reliability, the enumeration types of Ada, C#, F#, and
Java 5.0 provide two advantages:
(1) no arithmetic operations are legal on enumeration types, which
prevents adding days of the week, for example, and
(2) no enumeration variable can be assigned a value outside its
defined range. If the colors enumeration type has 10 enumeration
constants and uses 0..9 as its internal values, no number greater than 9
can be assigned to a colors type variable.
Ada, C#, and Java 5.0 provide better support for enumeration than
C++ because enumeration type variables in these languages are not
coerced into integer types
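The coercion described above can be sketched in C++. This is a minimal illustration, not taken from the slides: the five constants of colors, the SafeColor type, and the as_int helpers are all assumed names. It contrasts a plain enum, whose values convert to int silently, with a C++11 scoped enum (enum class), which requires an explicit cast.

```cpp
// Hypothetical enumeration; the five constants are assumed for illustration.
enum colors { red, blue, green, yellow, black };

// C++11 scoped enumeration: its values are not implicitly converted to int.
enum class SafeColor { red, blue, green };

// An unscoped enum value is silently coerced to int in an integer context.
int as_int(colors c) {
    int n = c;  // no cast needed: implicit conversion to int
    return n;
}

// A scoped enum needs an explicit cast, so the conversion is visible in the code.
int as_int(SafeColor c) {
    return static_cast<int>(c);
}
```

Writing `int n = SafeColor::blue;` without the cast is rejected by the compiler, which is closer to the behavior the slides attribute to Ada, C#, and Java.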
2. Subrange Types
A subrange type is a contiguous subsequence of an ordinal type.
For example, 12..14 is a subrange of integer type.
Ada’s Design:
In Ada, subranges are included in the category of types called subtypes.
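-- The declarations of Days and Weekdays are assumed here for illustration.
type Days is (Mon, Tue, Wed, Thu, Fri, Sat, Sun);
subtype Weekdays is Days range Mon..Fri;
Day1 : Days;
Day2 : Weekdays;
...
Day2 := Day1;  -- legal unless the value of Day1 is Sat or Sun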
The syntax of array references is fairly universal: The array name is followed by
the list of subscripts, which is surrounded by either parentheses or brackets.
array_name(subscript_value_list) → element
Index Syntax
FORTRAN, PL/I, Ada use parentheses
Ada explicitly uses parentheses to show uniformity between array
references and function calls because both are mappings
Most other languages use brackets
Array Index (Subscript) Types
FORTRAN, C: integer only
Ada: integer or enumeration (includes Boolean and char)
Java: integer types only
Index range checking
- C, C++, Perl, and Fortran do not specify range checking
- Java, ML, C# specify range checking
- In Ada, the default is to require range checking, but it can be
turned off
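The difference between checked and unchecked subscripts can be seen within C++ itself. The sketch below assumes a std::vector as the array and a hypothetical helper is_rejected. The at() member checks the subscript and throws std::out_of_range, whereas operator[] performs no check.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if reading v at index i through at() is rejected at run time.
bool is_rejected(const std::vector<int>& v, std::size_t i) {
    try {
        (void)v.at(i);  // at() compares i against v.size() before accessing
        return false;
    } catch (const std::out_of_range&) {
        return true;    // the subscript was outside 0 .. size() - 1
    }
}
```

Using v[i] with the same out-of-range index would compile and run without any check, and its behavior is undefined.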
Subscript Binding and Array Categories
The binding of the subscript type to an array variable is usually
static, but the subscript value ranges are sometimes dynamically
bound.
A jagged array is one in which the lengths of the rows need not be the
same.
For example, a jagged matrix may consist of three rows, one with 5
elements, one with 7 elements, and one with 12 elements
This also applies to the columns and higher dimensions.
C, C++, and Java support jagged arrays but not rectangular arrays.
Fortran, Ada, C#, and F# support rectangular arrays. (C# and F# also
support jagged arrays.)
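The three-row example above can be sketched in C++ as an array of independently sized rows. The function name make_jagged is an assumption made for this illustration, and std::vector stands in for a built-in array.

```cpp
#include <cstddef>
#include <vector>

// Builds a jagged matrix: one zero-filled row per entry in row_lengths.
std::vector<std::vector<int>> make_jagged(const std::vector<std::size_t>& row_lengths) {
    std::vector<std::vector<int>> rows;
    for (std::size_t len : row_lengths) {
        rows.emplace_back(len, 0);  // each row is a separate array with its own length
    }
    return rows;
}
```

Calling make_jagged({5, 7, 12}) yields rows of 5, 7, and 12 elements, matching the example in the text.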
Slices
A slice of an array is some substructure of that array. For example, if A is a matrix, then the
first row of A is one possible slice.
It is important to realize that a slice is not a new data type. Rather, it is a mechanism for
referencing part of an array as a unit.
In Ruby, the third parameter form for the slice method is a range, which has the form
of an integer expression, two periods, and a second integer
expression.
With a range parameter, slice returns an array of the elements with
the given range of subscripts.
For example, if list is [2, 4, 6, 8, 10], list.slice(1..3) returns [4, 6, 8]
Evaluation of Array Types
Arrays have been included in virtually all programming languages.
Assume row-major order, subscripts that start at 0, and n elements per row.
Because the number of rows above the ith row is i and the number of elements
to the left of the jth column is j, we have
location(a[i, j]) = address of a[0, 0] + (((i * n) + j) * element_size)
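The access function can be checked against a real compiler. The sketch below assumes a 3 x 4 matrix of int stored in row-major order, as C++ does for built-in arrays. The function names are hypothetical.

```cpp
#include <cstddef>

// Byte offset of a[i][j] from a[0][0] for a row-major matrix with n elements per row.
std::size_t element_offset(std::size_t i, std::size_t j,
                           std::size_t n, std::size_t element_size) {
    return ((i * n) + j) * element_size;
}

// Compares the formula with the addresses the compiler computes for a 3 x 4 int matrix.
bool formula_matches_compiler() {
    int a[3][4] = {};
    const char* base = reinterpret_cast<const char*>(&a[0][0]);
    for (std::size_t i = 0; i < 3; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            const char* actual = reinterpret_cast<const char*>(&a[i][j]);
            if (actual != base + element_offset(i, j, 4, sizeof(int))) {
                return false;
            }
        }
    }
    return true;
}
```

For instance, with n = 4 and 4-byte elements, a[2][3] lies ((2 * 4) + 3) * 4 = 44 bytes past a[0][0].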
Single-dimensioned array
Multi-dimensional array
Record Types
A record is a heterogeneous aggregate of data elements in which
the individual elements are identified by names and accessed
through offsets from the beginning of the structure.
Example:
type Employee_Name_Type is record
First : String (1..20);
Middle : String (1..10);
Last : String (1..20);
end record;
type Employee_Record_Type is record
Employee_Name: Employee_Name_Type;
Hourly_Rate: Float;
end record;
Employee_Record: Employee_Record_Type;
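The idea that fields are accessed through offsets from the beginning of the structure can be made visible in C++ with offsetof. The structs below are an assumed translation of the Ada records above, with char arrays standing in for the fixed-length strings. The constant names are hypothetical.

```cpp
#include <cstddef>

// Assumed C++ counterparts of the Ada records; array sizes mirror the String bounds.
struct EmployeeName {
    char first[20];
    char middle[10];
    char last[20];
};

struct EmployeeRecord {
    EmployeeName employee_name;
    float hourly_rate;
};

// Field offsets are compile-time constants that the compiler adds to the record's address.
constexpr std::size_t kFirstOffset = offsetof(EmployeeName, first);
constexpr std::size_t kMiddleOffset = offsetof(EmployeeName, middle);
constexpr std::size_t kLastOffset = offsetof(EmployeeName, last);
constexpr std::size_t kRateOffset = offsetof(EmployeeRecord, hourly_rate);
```

On typical implementations the name fields sit at offsets 0, 20, and 30, while hourly_rate is placed at or after byte 50, possibly with padding for alignment.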
If ptr is a pointer variable with the value 7080 and the cell whose
address is 7080 has the value 206, then an assignment such as
j = *ptr sets j to 206.
Problems with Pointers
Dangling pointers
A pointer that contains the address of a heap-dynamic variable that has been
deallocated.
Dangling pointers are dangerous for several reasons.
First, the location being pointed to may have been reallocated to some new
heap-dynamic variable. If the new variable is not the same type as the old
one, type checks of uses of the dangling pointer are invalid.
Even if the new dynamic variable is the same type, its new value will have no
relationship to the old pointer’s dereferenced value. Furthermore, if the
dangling pointer is used to change the heap-dynamic variable, the value of
the new heap-dynamic variable will be destroyed.
The following sequence of operations creates a dangling pointer in many
languages:
1. A new heap-dynamic variable is created and pointer p1 is set to point at it.
2. Pointer p2 is assigned p1’s value.
3. The heap-dynamic variable pointed to by p1 is explicitly deallocated
(possibly setting p1 to nil), but p2 is not changed by the operation. p2 is now a
dangling pointer. If the deallocation operation did not change p1, both p1 and
p2 would be dangling. (Of course, this is a problem of aliasing: p1 and p2 are
aliases.)
Lost heap-dynamic variables
An allocated heap-dynamic variable that is no longer accessible
to the user program (often called garbage)
Lost heap-dynamic variables are most often created by the
following sequence of operations:
1. Pointer p1 is set to point to a newly created heap-dynamic
variable
2. Pointer p1 is later set to point to another newly created
heap-dynamic variable; the first variable is now inaccessible, or lost
The process of losing heap-dynamic variables is called memory
leakage
Memory leakage is a problem, regardless of whether the
language uses implicit or explicit deallocation.
Pointers in Ada
Ada’s pointers are called access types.
The dangling-pointer problem is partially alleviated by Ada’s
design, at least in theory.
A heap-dynamic variable may be (at the implementor’s option)
implicitly deallocated at the end of the scope of its pointer type.
The Ada language also has an explicit deallocator,
Unchecked_Deallocation.
Unchecked_Deallocation can cause dangling pointers.
The lost heap-dynamic variable problem is not eliminated by Ada’s
design of pointers.
Pointers in C and C++
Extremely flexible but must be used with care. This design offers
no solutions to the dangling pointer or lost heap-dynamic variable
problems.
Pointers can point at any variable regardless of when or where it
was allocated
Used for dynamic storage management and addressing
Pointer arithmetic is possible
Explicit dereferencing and address-of operators
C and C++ include pointers of type void *, which can point at
values of any type.
Type checking is not a problem with void * pointers, because these
languages disallow dereferencing them.
One common use of void * pointers is as the types of parameters of
functions that operate on memory.
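A memory-copying routine is a typical case. The sketch below is a hypothetical copy_bytes function, similar in spirit to the standard memcpy. Its parameters are void * so that it accepts a pointer to any object, and it converts them to unsigned char * before touching memory, because a void * cannot be dereferenced.

```cpp
#include <cstddef>

// Copies n bytes from src to dst. The void * parameters accept pointers to any type;
// they are converted to unsigned char * because void * cannot be dereferenced.
void copy_bytes(void* dst, const void* src, std::size_t n) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t k = 0; k < n; ++k) {
        d[k] = s[k];
    }
}
```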
In C and C++, the asterisk (*) denotes the dereferencing
operation, and the ampersand (&) denotes the operator for
producing the address of a variable.
For example, consider the following code:
int *ptr; int count, init;
...
ptr = &init;
count = *ptr;
The assignment to the variable ptr sets it to the address of init.
The assignment to count dereferences ptr to produce the value
at init, which is then assigned to count.
So, the effect of the two assignment statements is to assign the
value of init to count.
The two assignment statements above are equivalent in their effect
on count to the single assignment
count = init;
Reference Types
A reference type variable is similar to a pointer, with one
important and fundamental difference: A pointer refers to an
address in memory, while a reference refers to an object or a value
in memory.
Reference type variables are specified in definitions by preceding
their names with ampersands (&).
For example,
int result = 0;
int &ref_result = result;
...
ref_result = 100;
In this code segment, result and ref_result are aliases.
C++ includes a special kind of pointer type called a reference type
that is used primarily for formal parameters
Advantages of both pass-by-reference and pass-by-value
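A short sketch of reference parameters follows. The function names are assumptions made for this illustration. The first function changes its caller's variables without any explicit dereferencing. The second takes a const reference, which avoids a copy yet cannot modify the argument, one way of combining the benefits of both parameter-passing modes.

```cpp
// Swaps two ints through reference parameters; a and b are aliases of the arguments.
void swap_ints(int& a, int& b) {
    int tmp = a;
    a = b;
    b = tmp;
}

// A const reference is passed like a reference but is read-only like a value parameter.
int twice(const int& x) {
    return 2 * x;
}
```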
Java extends C++’s reference variables and allows them to replace
pointers entirely
References are references to objects, rather than being
addresses
C# includes both the references of Java and the pointers of C++
Evaluation of Pointers
Dangling pointers and garbage are problems as is heap
management
Pointers have been compared with the goto.
The goto statement widens the range of statements that can be
executed next.
Pointer variables widen the range of memory cells that can be
referenced by a variable.
Pointers or references are necessary for dynamic data structures,
so we can't design a language without them
Implementation of Pointer and Reference Types
In most languages, pointers are used in heap management.
The same is true for Java and C# references, as well as the
variables in Smalltalk and Ruby, so we cannot treat pointers and
references separately.
1. Representations of Pointers and References
In most larger computers, pointers and references are single
values stored in memory cells.
However, in early microcomputers based on Intel
microprocessors, addresses have two parts: a segment and an
offset.
So, pointers and references are implemented in these systems as pairs
of 16-bit cells, one for each of the two parts of an address.
2. Solutions to the Dangling-Pointer Problem
There have been several proposed solutions to the dangling-
pointer problem.
Tombstones:
Every heap-dynamic variable includes a special cell, called a
tombstone, that is itself a pointer to the heap-dynamic variable.
The actual pointer variable points only at tombstones and never to
heap-dynamic variables.
When a heap-dynamic variable is deallocated, the tombstone
remains but is set to nil, indicating that the heap-dynamic variable
no longer exists.
This approach prevents a pointer from ever pointing to a
deallocated variable. Any reference to any pointer that points to a
nil tombstone can be detected as an error.
Tombstones are costly in both time and space. Because
tombstones are never deallocated, their storage is never
reclaimed.
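The tombstone scheme can be simulated on a small model heap. The sketch below is an assumption-laden illustration, not a real allocator: the class TombstoneHeap, its member names, and the use of vector indices as addresses are all hypothetical. Program pointers hold only the index of a tombstone, and every dereference goes through that tombstone.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Simulated heap in which every heap-dynamic variable is reached through a tombstone.
class TombstoneHeap {
public:
    typedef std::size_t Pointer;  // a program pointer is the index of a tombstone

    // Allocates an int cell plus a tombstone that points at it.
    Pointer allocate(int value) {
        cells_.push_back(value);
        tombstones_.push_back(static_cast<long>(cells_.size()) - 1);
        return tombstones_.size() - 1;
    }

    // Deallocation sets the tombstone to nil; the tombstone itself is never reclaimed.
    void deallocate(Pointer p) { tombstones_.at(p) = kNil; }

    // Every access goes through the tombstone, so a dangling use is detected.
    int& deref(Pointer p) {
        long target = tombstones_.at(p);
        if (target == kNil) throw std::runtime_error("dangling pointer");
        return cells_[static_cast<std::size_t>(target)];
    }

private:
    static const long kNil = -1;    // marks a tombstone whose variable is gone
    std::vector<int> cells_;        // storage for the heap-dynamic variables
    std::vector<long> tombstones_;  // one tombstone per allocation
};
```

If p2 is a copy of p1 and the variable is deallocated through p1, a later deref(p2) throws instead of reading stale storage. The extra level of indirection and the never-reclaimed tombstones_ entries correspond to the time and space costs noted above.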
Locks-and-keys approach:
Pointer values are represented as ordered pairs (key, address),
where the key is an integer value.
Heap-dynamic variables are represented as the storage for the
variable plus a header cell that stores an integer lock value.
When a heap-dynamic variable is allocated, a lock value is created
and placed both in the lock cell of the heap-dynamic variable and
in the key cell of the pointer that is specified in the call to new.
Every access to the dereferenced pointer compares the key value
of the pointer to the lock value in the heap-dynamic variable. If
they match, the access is legal; otherwise the access is treated as a
run-time error.
When a heap-dynamic variable is deallocated with dispose, its
lock value is cleared to an illegal lock value. Then, if a pointer
other than the one specified in the dispose is dereferenced, its
address value will still be intact, but its key value will no longer
match the lock, so the access will not be allowed.
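The locks-and-keys scheme can be simulated in the same spirit. In the sketch below the type names, the choice of 0 as the illegal lock value, and the use of vector indices as addresses are all assumptions. A pointer is a (key, address) pair, each cell carries a lock in its header, and every dereference compares the two.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// A pointer value is an ordered pair (key, address).
struct KeyedPointer {
    unsigned key;
    std::size_t address;
};

class LockedHeap {
public:
    // new: create a lock value and place it in both the cell header and the pointer's key.
    KeyedPointer allocate(int value) {
        unsigned lock = next_lock_++;
        cells_.push_back(Cell{lock, value});
        return KeyedPointer{lock, cells_.size() - 1};
    }

    // dispose: clear the lock to an illegal value; copies of the pointer keep the old key.
    void dispose(const KeyedPointer& p) {
        check(p);
        cells_[p.address].lock = kIllegalLock;
    }

    // Every dereference compares the pointer's key with the cell's lock first.
    int& deref(const KeyedPointer& p) {
        check(p);
        return cells_[p.address].value;
    }

private:
    struct Cell { unsigned lock; int value; };
    static const unsigned kIllegalLock = 0;

    void check(const KeyedPointer& p) const {
        if (p.address >= cells_.size() || cells_[p.address].lock != p.key) {
            throw std::runtime_error("key does not match lock");
        }
    }

    std::vector<Cell> cells_;
    unsigned next_lock_ = 1;  // 0 is reserved as the illegal lock
};
```

After dispose, an alias still holds an intact address but a stale key, so the comparison fails and the access is treated as a run-time error.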
3. Heap Management
Heap management can be a very complex run-time process.
We examine the process in two separate situations:
one in which all heap storage is allocated and deallocated in units
of a single size,
and one in which variable-size segments are allocated and
deallocated.
Two approaches to reclaim garbage
Reference counters (eager approach): reclamation is
incremental and is done when inaccessible cells are created
Mark-sweep (lazy approach): reclamation occurs only when the
list of available space becomes empty
Many variations of these two approaches have been developed.
The reference counter method of storage reclamation
accomplishes its goal by maintaining in every cell a counter that
stores the number of pointers that are currently pointing at the cell.
Disadvantages: There are three distinct problems with the reference
counter method.
First, if storage cells are relatively small, the space required for the
counters is significant.
Second, some execution time is obviously required to maintain the
counter values.
Some of the inefficiency of reference counters can be eliminated by
an approach named deferred reference counting, which avoids
reference counters for some pointers.
The third problem is that complications arise when a collection of cells
is connected circularly. The problem here is that each cell in the
circular list has a reference counter value of at least 1, which prevents it
from being collected and placed back on the list of available space.
Advantage: it is intrinsically incremental, so significant delays in the
application execution are avoided
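Both the incremental behavior and the circular-list problem can be observed with std::shared_ptr, which is a reference-counted pointer in the C++ standard library. The sketch below is an illustration under assumptions: Node, live_nodes, and the two functions are hypothetical names, and live_nodes simply counts cells that have not been reclaimed.

```cpp
#include <memory>

// Number of Node objects that currently exist, so reclamation can be observed.
static int live_nodes = 0;

struct Node {
    std::shared_ptr<Node> next;  // counted pointer to another cell
    Node() { ++live_nodes; }
    ~Node() { --live_nodes; }
};

// Builds a two-cell chain, drops the program's pointer, and reports cells still alive.
int live_after_dropping_chain() {
    int before = live_nodes;
    {
        std::shared_ptr<Node> a = std::make_shared<Node>();
        a->next = std::make_shared<Node>();
    }  // counts reach 0 one after the other: both cells are reclaimed at once
    return live_nodes - before;
}

// Same, but the two cells point at each other.
int live_after_dropping_cycle() {
    int before = live_nodes;
    std::weak_ptr<Node> observer;  // uncounted handle, used only for cleanup below
    {
        std::shared_ptr<Node> a = std::make_shared<Node>();
        std::shared_ptr<Node> b = std::make_shared<Node>();
        a->next = b;
        b->next = a;               // each cell now has a count of 2
        observer = a;
    }  // counts drop to 1, never 0: the cycle is garbage but is not reclaimed
    int unreclaimed = live_nodes - before;
    // Break the cycle by hand so that this demonstration does not itself leak.
    if (std::shared_ptr<Node> a = observer.lock()) {
        a->next.reset();
    }
    return unreclaimed;
}
```

The chain is reclaimed the moment its last pointer disappears, while the two cells of the cycle stay allocated until the cycle is broken explicitly.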
The mark-sweep process of garbage collection operates as
follows:
The run-time system allocates storage cells as requested and
disconnects pointers from cells as necessary, without regard for
storage reclamation (allowing garbage to accumulate), until it has
allocated all available cells.
The mark-sweep process consists of three distinct phases.
First, all cells in the heap have their indicators set to indicate they are
garbage.
The second part, called the marking phase, is the most difficult. Every
pointer in the program is traced into the heap, and all reachable cells are
marked as not being garbage.
After this, the third phase, called the sweep phase, is executed: All cells in
the heap that have not been specifically marked as still being used are
returned to the list of available space.
Disadvantages: in its original form, it was done too infrequently.
When done, it caused significant delays in application execution.
Contemporary mark-sweep algorithms avoid this by doing it more
often, an approach called incremental mark-sweep
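The three phases can be sketched on a toy heap of single-size cells. Everything here is assumed for illustration: cells are entries of a vector, pointers are indices into it, and the roots argument stands for the pointers held by the program.

```cpp
#include <cstddef>
#include <vector>

// One single-size heap cell: a garbage indicator plus pointers (indices) to other cells.
struct Cell {
    bool marked = false;
    std::vector<std::size_t> pointers;
};

// Marking helper: follows pointers from one cell and marks everything reachable.
void mark_from(std::vector<Cell>& heap, std::size_t start) {
    std::vector<std::size_t> pending;
    pending.push_back(start);
    while (!pending.empty()) {
        std::size_t i = pending.back();
        pending.pop_back();
        if (heap[i].marked) continue;  // already visited
        heap[i].marked = true;
        for (std::size_t next : heap[i].pointers) pending.push_back(next);
    }
}

// Runs the three phases; returns the indices handed back to the list of available space.
std::vector<std::size_t> mark_sweep(std::vector<Cell>& heap,
                                    const std::vector<std::size_t>& roots) {
    for (Cell& c : heap) c.marked = false;           // phase 1: assume all cells are garbage
    for (std::size_t r : roots) mark_from(heap, r);  // phase 2: mark reachable cells
    std::vector<std::size_t> available;
    for (std::size_t i = 0; i < heap.size(); ++i) {  // phase 3: sweep the unmarked cells
        if (!heap[i].marked) available.push_back(i);
    }
    return available;
}
```

Because reachability is decided by tracing from the program's pointers, an unreachable circular structure is swept like any other garbage, unlike under reference counting.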
Variable-Size Cells:
Managing a heap from which variable-size cells are allocated
has all the difficulties of managing one for single-size cells,
but also has additional problems.
The additional problems posed by variable-size cell
management depend on the method used.
If mark-sweep is used, the following additional problems
occur:
The initial setting of the indicators of all cells in the heap to
indicate that they are garbage is difficult.
The marking process is nontrivial.
Maintaining the list of available space is another source of
overhead.
Type Checking
Type checking is the activity of ensuring that the operands of an
operator are of compatible types.
A compatible type is one that either is legal for the operator or is
allowed under language rules to be implicitly converted by
compiler-generated code (or the interpreter) to a legal type.
This automatic conversion is called a coercion.
For example, if an int variable and a float variable are added in Java,
the value of the int variable is coerced to float and a floating-point add
is done.
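C++ applies an analogous rule, which a compile-time check can confirm. The function name add_mixed is an assumption. The static_assert verifies that the type of the mixed expression is float, meaning the int operand was coerced.

```cpp
#include <type_traits>

// The int operand is implicitly converted to float before the addition is performed.
float add_mixed(int i, float f) {
    static_assert(std::is_same<decltype(i + f), float>::value,
                  "int + float is evaluated as a float addition");
    return i + f;
}
```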
A type error is the application of an operator to an operand of an
inappropriate type.
For example, in the original version of C, if an int value was passed to
a function that expected a float value, a type error would occur
If all type bindings are static, nearly all type checking can be static
If type bindings are dynamic, type checking must be dynamic (type
checking at run time)
Type checking is complicated when a language allows a memory
cell to store values of different types at different times during
execution.
Strong Typing
A programming language is strongly typed if type errors are
always detected.
This requires that the types of all operands can be determined,
either at compile time or at run time.
Advantage of strong typing: its ability to detect all misuses of
variables that result in type errors.
A strongly typed language also allows the detection, at run time,
of uses of the incorrect type values in variables that can store
values of more than one type.
Language examples:
Ada is nearly strongly typed (Unchecked_Conversion
is a loophole)
Java and C#, although they are based on C++, are strongly
typed in the same sense as Ada.
C and C++ are not strongly typed languages because both
include union types, which are not type checked.
ML is strongly typed, even though the types of some function
parameters may not be known at compile time.
F# is strongly typed.
Type Equivalence
Type equivalence is a strict form of type compatibility—
compatibility without coercion.
There are two approaches to defining type equivalence: name type
equivalence and structure type equivalence.
Name type equivalence means that two variables have
equivalent types if they are defined either in the same declaration
or in declarations that use the same type name.
Easy to implement but highly restrictive:
Subranges of integer types are not equivalent with integer
types
Formal parameters must be the same type as their
corresponding actual parameters
Structure type equivalence means that two variables have
equivalent types if their types have identical structures.
More flexible, but harder to implement
Under name type equivalence, only the two type names must
be compared to determine equivalence. Under structure type
equivalence, however, the entire structures of the two types
must be compared. This comparison is not always simple.
Another difficulty with structure type equivalence is that it
disallows differentiating between types with the same structure
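C++ struct types offer a concrete view of the distinction, since the language treats them by name. The sketch below uses hypothetical types Celsius and Fahrenheit with identical structure. They are still distinct types, whereas an alias declared with using names the same type.

```cpp
#include <type_traits>

// Two types with identical structure but different names.
struct Celsius { float degrees; };
struct Fahrenheit { float degrees; };

// An alias introduces a second name for an existing type, not a new type.
using Temperature = Celsius;

// Struct types are compared by name, so identical structure does not make them equivalent.
constexpr bool same_structure_same_type = std::is_same<Celsius, Fahrenheit>::value;

// An alias is equivalent to the type it names.
constexpr bool alias_is_same_type = std::is_same<Temperature, Celsius>::value;
```

Under pure structure type equivalence, Celsius and Fahrenheit could not be told apart, so passing one where the other is expected would go undetected.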