C5: Copenhagen Comprehensive C# Collection Classes and Experience With Generic C#
• Provide a collection class library for C# that is as comprehensive as those of comparable languages.
• Test and document it well enough that it will be widely usable.
History
• Niels Jørgen Kokholm vastly extended and improved the library in his 2004 MSc thesis.
Developed using Whidbey alpha release from August 2003.
Recently converted to March 2004 Community Technology Preview version.
• Funding has been applied for from MSR Unirel for finishing, polishing, testing and documenting the library.
• Use best known data structures and algorithms, even if cumbersome to implement.
• Asymptotics (scalability) are more important than nanosecond efficiency.
• It is OK to sacrifice a constant-factor space overhead to support richer functionality.
• Concurrent read-accesses, including iteration, must be naturally thread-safe (no synchronization overhead).
• Keep the machine model in mind. For example, avoid splay trees: they restructure the tree even on reads and hence incur many write-barrier checks.
• Measure performance of implementation alternatives; e.g. whether to use objects or structs internally.
• Use only managed code, no unsafe operations.
• Test carefully using NUnit, and document test coverage.
• Write all code from scratch and release under an MIT-style license.
• Smalltalk-80 has a comprehensive library: sets, bags, lists, sequences, hash-based and tree-based dictionaries.
Critique by Cook (1992): capabilities and implementation are mixed; functionality is (unsystematically) duplicated.
(Smalltalk has no interfaces and no generic types.)
• Java 1.2 has a comprehensive library: lists, sets and dictionaries; array-based, linked, hashtables, binary trees.
Critique by Evered and Menger (1998): the separation into an interface hierarchy and a class hierarchy is a step forward, but not complete: the interfaces do not describe all capabilities of the implementations.
The Java 1.5 generic collection library is based directly on the non-generic library.
• The C#/.Net Framework class library version 1.1 has no linked lists and no tree-based dictionaries.
The oddly named SortedList is an array-based comparison-based dictionary, described as ‘a HashTable/Array hybrid’.
The version 2.0 generic collection library is less weird (SortedDictionary) and has LinkedList, but still no trees.
[Diagram: part of the C5 interface hierarchy — IEnumerable<T>, ISequenced<T>, IIndexed<T>, ISorted<T>, IIndexedSorted<T>, IList<T>, IPersistentSorted<T>.]
Enumerable: get enumerator; apply function; exists and for-all quantifiers.
Indexed: by integer index; sequenced collection with element access, insertion, removal.
List: indexed collection that supports sorting, updatable views, FindAll(p) and Map(f) with list result.
PriorityQueue: extensible collection that supports efficient FindMin and FindMax.
Sorted: by element ordering; sequenced collection with predecessor, successor, cut by element, element subrange queries.
IndexedSorted: sorted and indexed collection; element subrange counts and queries.
A recent .Net CLI proposal makes System.Collections.Generic.ICollection<T> a mix of collection and editable collection.
Although it is trivial to turn a min-problem into a max-problem, doing so causes confusion in practice.
Since interval heaps are just as fast as single-ended heaps, we provide both min and max functionality in one priority queue.
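A minimal usage sketch (assuming C5's IntervalHeap<T> implementation of IPriorityQueue<T>, and assuming DeleteMin/DeleteMax as the removal operations):
  IPriorityQueue<int> pq = new IntervalHeap<int>();   // assumes: using C5;
  pq.Add(7); pq.Add(2); pq.Add(9);
  int min = pq.FindMin();    // 2
  int max = pq.FindMax();    // 9
  pq.DeleteMin();            // removes the minimum (2)
  pq.DeleteMax();            // removes the maximum (9)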
Cut(c) finds the greatest element low <= c and the least element high > c, if any.
A range query produces a directed enumerable. Hence it can be enumerated (or processed) backwards:
foreach (Talk t in talks.RangeTo(nexttalk).Backwards()) {
...
}
A persistent sorted collection supports making a snapshot (but not snapshots of snapshots).
In principle one could have persistent non-sorted collections, but there are no efficient implementations.
[Diagram: the C5 collection interface hierarchy — ICollection<T>, IExtensible<T>, IEditableCollection<T>, IPriorityQueue<T>, ISequenced<T>, IIndexedSorted<T>, IPersistentSorted<T>, IList<T> — and the dictionary hierarchy — IEnumerable<KeyValuePair<K,V>>, IDictionary<K,V>, ISortedDictionary<K,V>, implemented by HashDictionary<K,V> and RedBlackTreeDictionary<K,V>.]
A view of a list is a sublist of the list. All views work on the same underlying list.
Updates to a view will update the underlying list. Other updates to the underlying list invalidate the view(s).
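A minimal sketch of views (assuming C5's ArrayList<T> and its View(start, count) method):
  IList<int> list = new ArrayList<int>();    // assumes: using C5;
  list.AddAll(new int[] { 7, 9, 13, 7, 25 });
  IList<int> view = list.View(1, 3);         // the sublist 9, 13, 7
  bool b = view.Contains(7);                 // true: 7 occurs within the view
  view.Remove(13);                           // removes 13 from the view and the underlying list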
Common applications:
• Orthogonality: make an operation, such as Contains, independent of the list subrange operated on.
There are many operations: Find, IndexOf, LastIndexOf, Update, FindAll, CopyTo, . . .
In the Smalltalk and .Net libraries, some of these exist in three versions each, causing method proliferation.
Convex hull: the least convex set that encloses a given set of points.
Sort the points (x, y) lexicographically, and separate them into an upper and a lower set.
Considerably clearer than a previous implementation using explicit linked-list nodes.
Later updates to the tree do not affect the snapshot. A snapshot is not updatable (hence semi-persistent).
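A minimal sketch (assuming C5's TreeSet<T>, which implements IPersistentSorted<T>):
  TreeSet<int> tree = new TreeSet<int>();    // assumes: using C5;
  tree.AddAll(new int[] { 3, 1, 4, 1, 5 });
  ISorted<int> snap = tree.Snapshot();       // take a snapshot
  tree.Add(9);                               // a later update to the tree...
  bool b = snap.Contains(9);                 // ...is not visible in the snapshot: false
(In real code the snapshot should be disposed; see the recommended using idiom below.)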
Application 2: Geometric algorithms such as planar point location use many almost-identical dictionaries:
[Figure: planar subdivision; a dashed vertical line through each vertex (x, y).]
Each dashed vertical line can be represented by a snapshot of a dictionary. Much faster than using copies.
Initially a snapshot shares all tree nodes with the underlying red-black tree.
When the underlying tree is updated and any snapshot exists, tree nodes are copied lazily.
This requires extra data, including one reference, in each tree node.
Then all operations remain O(log n) and use only amortized constant extra space.
We also tried path-copying persistence and other variations, but node-copying persistence is faster.
Recommended idiom:
using (ISorted<T> snap = tree.Snapshot()) {
...
}
When all snapshots have been disposed (or finalized), the underlying tree stops making node copies at updates.
Taken separately, views on linked lists and hash-indexes on linked lists are easy to implement.
But the combination is tricky: We want to use a single hashtable for a list and all its views.
How do we know whether a linked-list node found in the hashtable belongs to a given view?
(In an array-based list, just check whether the found item index is within the view’s index range.)
Put increasing integer tags on list nodes; a node’s tag must be between the tags of the view’s first and last node.
The tricky part is to maintain increasing tag order when inserting (and deleting) list nodes.
Uses ideas from Dietz and Sleator (1987) and Bender et al. (2002) to do this in amortized constant time.
Further improved by organizing elements into sufficiently large tag groups; this speeds up updates by nearly a factor of two.
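A conceptual sketch of the tag test (the Node class and its fields here are hypothetical, only to illustrate the idea; the real C5 implementation differs):
  class Node<T> { public T item; public int tag; public Node<T> next; }

  // A view is delimited by its first and last nodes; a node found through the
  // shared hashtable belongs to the view iff its tag lies between their tags.
  static bool InView<T>(Node<T> node, Node<T> first, Node<T> last) {
    return first.tag <= node.tag && node.tag <= last.tag;
  }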
[Figure: a linked list (49 8 45 1 33 44 7 46) with a view spanning a sublist of its nodes.]
C5 assessment
• Adapt to the new standard .Net IComparer<T> and IComparable<T> interfaces, naming scheme, and so on.
Use the (new) standard .Net delegate types for higher-order functions.
• Generally, the August 2003 alpha of the .Net CLR implementation was robust and supported the project very well.
• There was some flakiness in the Visual Studio IDE (but entirely forgivable in an alpha release).
• It is difficult to predict the efficiency of CIL code. One must measure, and then one gets some surprises:
Static field access is very slow; yet another reason to avoid static fields.
Sometimes method parameter access is fast, and then this style is good:
public void M(int x) {
...
while (...) {
... x ...
}
}
But sometimes a local variable is much better:
public void M(int x) {
int y = x;
while (...) {
... y ...
}
}
• An attempt was made to use run-time code generation via System.Reflection.Emit (for generating hashers and comparers), but it failed for lack of support for generics.
One can add or multiply a value of type AddMul<A,R> with an A, giving a result of type R:
interface AddMul<A,R> {
  R Add(A e);   // Addition with A, giving R
  R Mul(A e);   // Multiplication with A, giving R
}
Can define polynomials over E if E supports addition and multiplication and has a zero (made by new E()):
class Polynomial<E> : AddMul<E,Polynomial<E>>,
                      AddMul<Polynomial<E>,Polynomial<E>>
    where E : AddMul<E,E>, new() {
  private readonly E[] cs;   // cs contains the coefficients of x^0, x^1, ...
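As a sketch, polynomial-polynomial addition could be implemented as follows (the constructor taking a coefficient array is an assumption; it is not part of the original fragment):
  public Polynomial(E[] cs) { this.cs = cs; }

  // Implements AddMul<Polynomial<E>,Polynomial<E>>.Add: coefficient-wise addition.
  public Polynomial<E> Add(Polynomial<E> that) {
    E[] longer  = cs.Length >= that.cs.Length ? cs : that.cs,
        shorter = cs.Length >= that.cs.Length ? that.cs : cs;
    E[] res = new E[longer.Length];
    for (int i = 0; i < shorter.Length; i++)
      res[i] = longer[i].Add(shorter[i]);    // uses E's own Add, since E : AddMul<E,E>
    for (int i = shorter.Length; i < longer.Length; i++)
      res[i] = longer[i];
    return new Polynomial<E>(res);
  }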
Hamming numbers: all products of powers of 2, 3 and 5, in increasing order: 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, . . .
In lazy functional languages the sequence is recursively definable as
hamming = 1 : merge(map (*2) hamming, merge(map (*3) hamming, map (*5) hamming))
The method Hamming returns the 2-, 3- and 5-multiples of a given integer sequence xs:
public static IEnumerable<int> Hamming(IEnumerable<int> xs) {
  return Merge(Map<int,int>(delegate(int x) { return 2 * x; }, xs),
               Merge(Map<int,int>(delegate(int x) { return 3 * x; }, xs),
                     Map<int,int>(delegate(int x) { return 5 * x; }, xs)));
}
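The helpers Map and Merge are not shown above; a sketch of what they could look like (the delegate type Mapper and these exact signatures are assumptions, not necessarily C5's own; assumes using System.Collections.Generic):
  public delegate R Mapper<A,R>(A x);

  public static IEnumerable<R> Map<A,R>(Mapper<A,R> f, IEnumerable<A> xs) {
    foreach (A x in xs)
      yield return f(x);
  }

  // Merge two increasing sequences into one increasing sequence without duplicates.
  public static IEnumerable<int> Merge(IEnumerable<int> xs, IEnumerable<int> ys) {
    IEnumerator<int> xe = xs.GetEnumerator(), ye = ys.GetEnumerator();
    bool hasX = xe.MoveNext(), hasY = ye.MoveNext();
    while (hasX || hasY) {
      if (!hasY || (hasX && xe.Current < ye.Current)) {
        yield return xe.Current; hasX = xe.MoveNext();
      } else if (!hasX || ye.Current < xe.Current) {
        yield return ye.Current; hasY = ye.MoveNext();
      } else {                                 // equal heads: emit once, advance both
        yield return xe.Current; hasX = xe.MoveNext(); hasY = ye.MoveNext();
      }
    }
  }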
Benchmark: random ints (1,000,000) or strings (200,000); average time in seconds over 20 runs; 850 MHz mobile Pentium III; Windows 2000; March 2004 CTP of Whidbey.
• Generics is the only way to have generality, type safety, and efficiency.
• The overhead in generics (1.36 vs 0.50) is due mostly to generality: passing the Compare method.
• The generics win is clearly larger for the value type int than for the reference type string.
• (String comparison is quite slow, so a dictionary with String keys should be hash-based not tree-based).
The many language features interact rather well; there were few surprises (one example: value types and readonly fields).
• Method kinds: static, instance, instance virtual, abstract, interface, and explicit interface member implementation.
• By-value and by-reference (ref/out) parameter passing; and value type and reference type arguments.
• Access modifiers; inherit from base class versus getting the method from an enclosing class.
• Parameter arrays; overloading resolution and ‘better conversion’ at compile-time.
• Run-time calls to virtual methods, respecting new in the subclass chain.
• Implicit argument conversions at run-time.
• And on top of this, generic type parameter inference.
Also, the autoboxing of simple type values may introduce performance surprises in connection with generics:
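For instance, a sketch of the kind of surprise meant (an assumed illustration, not taken from the original):
  static int CountEqual<T>(T[] xs, T y) {
    int count = 0;
    foreach (T x in xs)
      if (x.Equals(y))      // resolves to Object.Equals(object): boxes y on every iteration when T is a simple type
        count++;
    return count;
  }

  static int CountEqualNoBoxing<T>(T[] xs, T y) {
    EqualityComparer<T> eq = EqualityComparer<T>.Default;  // System.Collections.Generic
    int count = 0;
    foreach (T x in xs)
      if (eq.Equals(x, y))  // no boxing for simple types such as int
        count++;
    return count;
  }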