The generic collection classes provided by the .Net Framework are well thought-out and perform quite well for the great majority of a developer’s collection handling needs. The only real issues arise when the contents of a collection get too large and you want to use a method like Contains to see if an element exists in the collection. The Contains method is an O(n) operation, meaning that it takes proportionally longer to execute the larger the collection becomes.
Many times the solution to the problem is to use a collection backed by a HashSet. So long as all of the items in the collection are unique, an index can be built and organized using the hash value for each element in the collection making the Contains method of a HashSet an O(1) operation (it takes the same amount of time to execute regardless of the number of elements in the collection).
However, the HashSet can have its limitations as well. Depending on the objects being used it may be possible to generate a hash collision (two objects in the collection produce the same hash value), though this generally won’t happen unless you are using custom types. There’s also the fact that there is only one hash value per element to search on.
For a collection of Strings, in particular, this could be a problem. In the case of strings, you may want the ability to search for a full string in the collection, or just the starting characters to find all strings that begin with those characters. A HashSet would not be suitable for this because there is no way to search for the substrings without enumerating the entire collection.
There is another solution to fast searches over a large collection of strings, but it comes with a cost. It is possible to build a collection that offers performance approaching O(1) for any number of strings in a collection; however, to do so we must be willing to sacrifice memory consumption for performance. If we are willing to consume more memory than the raw contents of the collection requires, we can create a collection of strings backed by a character dictionary of character dictionaries.
A String is, by definition, an Array of Characters. So the string “Hello World” is also an array containing the characters: (0) = ‘H’ (1) = ‘e’ (2) = ‘l’ (3) = ‘l’ (4) = ‘o’ (5) = ‘ ‘ (6) = ‘W’ (7) = ‘o’ (8) = ‘r’ (9) = ‘l’ (10) = ‘d’
Instead of using one array to hold the characters of the string, we can reorganize a string such that each letter is contained in a Dictionary(Of Char, Dictionary(Of Char)) creating a tree-like structure of characters. For example, if we added the words “Hello” and “Help” to the collection, we would have a tree of dictionaries like: (0) = ‘H’ (0) = ‘e’ (0) = ‘l’ (0) = ‘l’ (0) = ‘o’ (1) = ‘p’
So the letter “H” is the first key in the root dictionary. The value for the key “H” is another Dictionary(Of Char) and its first key is the letter “e”, and so on to the “o” in “Hello”. For the word “Help”, the “H”, “e”, and “l” are already in the tree, so the first “l” now gets a second child of the letter “p”.
By this logic, we also have the word “Hell” in our collection, but we didn’t add that word… so how do we tell the difference between words explicitly added to the collection and those added accidentally? We can use Chr(0) as a terminator character to indicate the end of an intended string. In this way, a letter with a child of Chr(0) will indicate the end of a complete word. So our previous example then becomes:
(0) = ‘H’ (0) = ‘e’ (0) = ‘l’ (0) = ‘l’ (0) = ‘o’ (0) = Chr(0) (1) = ‘p’ (0) = Chr(0)
As you can begin to see, with an extra character on each string and a new dictionary for each new letter in a word, the amount of memory consumed is much greater than is required to merely represent the characters in the string. But at the same time, because we have organized the characters into a tree structure, the greatest number of iterations required to find any word in the collection is equal to the longest word in the collection; so searches become very fast. And if we choose to execute a search that does not look for the termination character, we can quickly find all words in the collection that start with a given set of letters.
Although the basic premise of the collection is fairly straightforward, the code implementation can be a bit tricky in that our tree of characters must still be able to expose itself as a collection of complete strings. To achieve this, we need to create a custom enumerator for our collection that knows how to find each complete string within the tree of characters.
We can begin by creating a “SearchableStringCollection” class which implements ICollection(Of String) and declares a backing store of Dictionary(Of Char, SearchableStringCollection):
Public
Class
SearchableStringCollection
Implements
ICollection(Of
String
)
Private
_ListCore
As
New
Dictionary(Of
Char
, SearchableStringCollection)
End
There are a few simple class members that we can implement without first defining our actual character storage and retrieval mechanisms. First we can implement the Count property by simply defining a private _Count field to hold the number of full strings in the collection. We’ll want to track this total manually as items are added to and removed from the collection because our backing store will not have this information readily at hand (we would have to enumerate the entire collection to find and count all of the complete strings, and that would have undesirable performance).
_Count
Integer
ReadOnly
Property
Count
System.Collections.Generic.ICollection(Of
).Count
Get
Return
IsReadOnly
Boolean
).IsReadOnly
False
Sub
Clear()
).Clear
_ListCore.Clear()
_Count = 0
CopyTo(array()
, arrayIndex
).CopyTo
For
Each
entry
In
Me
array(arrayIndex) = entry
arrayIndex += 1
If
arrayIndex = array.Length
Then
Exit
Next
The primary functionality methods Add, Contains, and Remove will require helper methods to assist in their execution, and we’ll look at each more closely later in this article. We’ll also want to add a “ContainsStartingWith” method to see if a partial match exists in the collection, as well as a “Find” method which can return all strings beginning with a given set of characters.
The only remaining class members then are the implementations of GetEnumerator. The implementation coming from IEnumerable can be easily coded to just return the strongly-typed implementation coming from IEnumerable (Of String). We can also change the access level of this member to Protected so that the class only exposes a single GetEnumerator method.
However, the actual implementation of GetEnumerator from IEnumerable(Of String) will need to return an instance of the custom enumerator class which is capable of finding all complete strings within the tree of characters.
Function
GetEnumerator()
System.Collections.Generic.IEnumerator(Of
System.Collections.Generic.IEnumerable(Of
).GetEnumerator
Protected
GetEnumeratorCore()
System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator
Let’s take a look at the definitions of the four primary helper methods used to add, remove, and find strings in the collection:
Overridable
OnAdd(characters()
, index
OnContains(characters()
OnFind(characters()
()
OnRemove(characters()
As you can see, each method takes a character array and an index as parameters. The reason for this design is to facilitate calling each of these methods in a recursive fashion. A recursive method is simply a function or subroutine which continues to call itself while some condition is met. Recursion works well with tree structures of data because you need to “walk the branches” of the tree whenever you work with the data. This means that the primary worker methods only need to know a starting position within the tree at which they should begin their work. Each method can then call itself, passing the original character array it received along with an updated index indicating the next position at which processing needs to continue.
The OnAdd method will provide the functionality of walking the character tree, looking for existing characters and adding new branches when needed. The code for this method consists of four conditional checks with one opportunity for recursion:
Dim
result
=
characters.Length > 0
Not
_ListCore.ContainsKey(characters(index))
_ListCore.Add(characters(index),
SearchableStringCollection)
index = characters.Length - 1
result =
True
index < characters.Length - 1
result = _ListCore(characters(index)).OnAdd(characters, index + 1)
The method begins by declaring a result variable of the returning type (Boolean). This value will be returned at the end of the method to indicate whether or not a new item was added to the collection (if the provided complete string already exists in the collection, no new string will be added). Next it checks the trivial case that the provided character array is empty. Providing that some characters were supplied, the method then begins to look for existing characters or add new branches to the tree.
The method checks to see if the specified character exists in the local character dictionary. If that character does not yet exist, a new branch is created and the character is added to the collection. It is important to note that when this method is first called from an instance of SearchableStringCollection, the _ListCore field is pointing to the “root” dictionary of the tree, however, since each branch of the tree contains another SearchableStringCollection, in subsequent recursive calls the _ListCore field will point to the dictionary for the particular branch of the tree that execution has walked to.
Once the current character has been processed, the method repeats itself for each remaining character in the originally provided string. However, it is again important to note that the target of this second call to OnAdd is the branch stemming from the root collection; it is not a recursive call into the same object instance. This is how each helper method recursively walks the character tree.
_ListCore(characters(index)).OnRemove(characters, index + 1)
_ListCore(characters(index))._ListCore.Count = 0
_ListCore.Remove(characters(index))
Else
_ListCore(characters(index)).OnContains(characters, index + 1)
List(Of
result.AddRange(_ListCore(characters(index)).OnFind(characters, index + 1))
Using enm
SearchableStringCollectionEnumerator(_ListCore(characters(index))._ListCore.GetEnumerator)
While
enm.MoveNext
result.Add(
.Concat(
(characters), enm.Current))
Using
result.ToArray
SearchableStringCollectionEnumerator
IEnumerator(Of
_BaseEnumerator
IEnumerator(Of KeyValuePair(Of
, SearchableStringCollection))
_Enumerators
List(Of IEnumerator(Of KeyValuePair(Of
, SearchableStringCollection)))
_Current
System.Text.StringBuilder
Friend
(baseEnumerator
_BaseEnumerator = baseEnumerator
_Enumerators.Add(_BaseEnumerator)
Current
).Current
_Current.ToString
CurrentCore
Object
System.Collections.IEnumerator.Current
Reset()
System.Collections.IEnumerator.Reset
_Current.Clear()
_Enumerators.Remove(_BaseEnumerator)
_BaseEnumerator.Reset()
enm
enm.Dispose()
_Enumerators.Clear()
MoveNext()
System.Collections.IEnumerator.MoveNext
= BuildCurrent()
_Current.Length > 0
_Current.Length -= 1
BuildCurrent()
_Enumerators.Count > 0
_Enumerators(_Enumerators.Count - 1).MoveNext()
_Current.Append(_Enumerators(_Enumerators.Count - 1).Current.Key)
_Current(_Current.Length - 1) = Chr(0)
_Enumerators.Add(_Enumerators(_Enumerators.Count - 1).Current.Value._ListCore.GetEnumerator)
result = BuildCurrent()
_Enumerators.RemoveAt(_Enumerators.Count - 1)
disposedValue
Dispose(disposing
.disposedValue
disposing
_BaseEnumerator.Dispose()
_BaseEnumerator =
Nothing
.disposedValue =
Dispose()
IDisposable.Dispose
Dispose(
GC.SuppressFinalize(
Add(item
).Add
OnAdd((item & Chr(0)).ToArray, 0)
_Count += 1
Remove(item
).Remove
= OnRemove((item & Chr(0)).ToCharArray, 0)
_Count -= 1
SearchableStringCollectionEnumerator(
._ListCore.GetEnumerator)
Contains(item
).Contains
OnContains((item & Chr(0)).ToCharArray, 0)
ContainsStartingWith(startsWith
OnContains(startsWith.ToCharArray, 0)
Find(startsWith
OnFind(startsWith.ToCharArray, 0)
Option
Strict
On
''' <summary>
''' Represents a collection of strings similar in performance to a HashSet(Of String) that offers partial string searches.
''' </summary>
''' <remarks></remarks>
''' Adds the specified string to the collection if it does not already exist.
''' <param name="item">The string to add to the collection.</param>
''' Clears all strings from the collection.
''' Gets a value indicating whether the collection contains the specified string.
''' <param name="item">The exact string to find.</param>
''' <returns>True if the collection contains the specified string, otherwise false.</returns>
''' Gets a value indicating whether the collection contains any strings that begin with the specified characters.
''' <param name="startsWith">The starting characters to search for.</param>
''' <returns>True if the collection contains one or more strings beginning with the specified characters, otherwise false.</returns>
''' Copies the contents of the collection to the specified array.
''' <param name="array">The array of strings to copy the collection into.</param>
''' <param name="arrayIndex">The location in the array at which copying begins.</param>
''' Gets the number of complete strings contained in the collection.
''' <value></value>
''' <returns></returns>
''' Gets an array of strings that begin with the specified characters.
''' <param name="startsWith">The starting characters to find.</param>
''' <returns>An array of strings beginning with the specified characters.</returns>
''' Gets an enumerator that iterates over the strings in the collection.
''' Returns false indicating that the collection is not read-only.
''' Removes the specified string from the collection.
''' <param name="item">The string to remove from the collection.</param>
''' <returns>True if the string is found and removed, otherwise false.</returns>
''' Enumerates the elements of a SearchableStringCollection.
''' Gets the element in the collection at the current position of the enumerator.
''' Advances the enumerator to the next element in the collection.
''' Sets the enumerator to its initial position, which is before the first element in the collection.
#Region "IDisposable Support"
#End Region