LINQ does Logic? Chapter 4: Composable Hashing

Checking  multiple collections for mutual membership of some item is an emerging theme in our examples. For instance, we want to know if some material has a certain value for specific density, and, if so, whether it’s also in a database of solubles.

The time it takes to scan a collection is proportional to the length of the collection. Imagine looking for a word in a dictionary by starting at the beginning and checking every word. A standard way to speed this up is, of course, hashing. In the case of a paper dictionary, your “hash function” returns the first letter of the word, which sends you to a “hash bucket:” the section of the dictionary with all the words beginning with that letter. You naturally recurse, effecting a trie search.

As the number of collections to cross-check grows, so grows the importance of speeding up the checks. The time to check all collections grows as the product of the sizes of the collections. Imagine checking that a certain word is in four different dictionaries, first using the brute-force-scan method then using the brighter hash method. Now imagine doing likewise for 1,000 words.

In Chapter 2: Conjunctions, we built LINQ and Mathematica examples that check whether materials float on water by looking up densities and solubilities. The running times of these programs over the tiny sample data set were negligible, but we noted that the method used would not scale. That method used the .Contains extension method on IEnumerable and the MemberQ primitive on Lists to scan the entire list of solubles for each candidate material.

In this chapter, we build another example with longer lists and starkly illustrate that the brute-force methods are intolerable. In the offing, we fix the problem using a composable hash-set type. We create a LINQ operation that converts an IEnumerable into a hash set, and transparently insert that operation into a chain of other operations. This is more than just good aesthetics: much later, we shall see that remoting composable LINQ operations opens whole vistas of dynamic scenarios, where pipelines can be set up and taken down on-the-fly.

First, look at “hashing.linq” in the LINQ subdirectory of the “LINQdoesLogic” repo. It refers to a data file of about 350,000 English words adapted from the public-domain MOBY lists. To run this LINQpad script, you will have to adjust the file location or the reference to it in the script — the checked-in version of the script assumes the file is in

c:/tmp/words.txt

The script first reads the file and then does timed lookups of two words, “prehistorically,” which is in the file, and “zzzzzzzz,” which isn’t, as follows:

    var sw = new Stopwatch();
    sw.Start();
    words
        .Contains("prehistorically")
        .Dump("Does the IEnumerable contain \"prehistorically\"?")
        ;
    sw.Stop(); sw.ElapsedMilliseconds.Dump("Elapsed milliseconds
    required to decide whether the words IEnumerable contains
    \"prehistorically\"");

On my dreadnaught-class Lenovo laptop, plugged in, these lookups take around 15 to 30 milliseconds, respectively (about twice that on battery).

These timings are probably not very accurate because they’re too short compared to the included overhead of printing an answer. Later in the file, along with some other digressions that might interest you, there are snippets that look up 1000 words and amortize the overhead, taking around 15 seconds, consistent with the shorter end of the single-word lookup time.

We then build a hash set, first using the built-in “Add” method of the built-in HashSet class in the .NET Framework. This method takes a HashSet and an element and returns true if the element was inserted in the HashSet. This protocol allows us to insert and count all the non-duplicate words in the file into a hash set with a LINQ composition like the following:

    var wordsHashSet = new HashSet<string>();
    sw.Reset(); sw.Start();
    words
        .Where(w => wordsHashSet.Add(w))
        .Count()
        .Dump("Count of unique words in the hash set")
        ;
    sw.Stop(); sw.ElapsedMilliseconds.Dump("Elapsed milliseconds to
    build and enumerate the hash set by transformation");

Note the “Where” method instead of “Select.” With “Where”, we noticed that the original MOBY word list “single.txt” actually contained a duplicate word, “animate,” because the count revealed by LINQ differed from the line count of the file by 1. This mistake has certainly been in the MOBY file for years but it was easy to find via LINQ. I took it out of “words.txt,” but I have left it in the Mathematica version of our program for illustrative purposes.

Ok, this is great, we have a hash set. We next do some timed lookups of our test words and find them to be negligible.

    sw.Reset(); sw.Start();
    wordsHashSet
        .Contains("prehistorically")
        .Dump("Does the hash set contain \"prehistorically\"?")
        ;
    sw.Stop(); sw.ElapsedMilliseconds.Dump("Elapsed milliseconds to 
    decide whether the hash set contains \"prehistorically\"");

However, we have lost something. The only thing we can do with this approach is load the hash set by side effect and then use it later. This requires us to create a variable for the hash set, so it’s not composable. It’s a rule — whenever you must create a variable “out of line” of the LINQ query, the query itself is not composable, because it depends on the variable created in its environment. You can’t cut and paste the query or remote it without its whole environment.

If we could bypass the variable, we would have a composable transform that converts the original word list into an efficient form and only then does the tough work.

In fact, there is an even deeper level of composability available, and that is in the creation of the hash set itself. Our eyes tell us that if ToArray and ToList and ToDictionary are good LINQ, then ToHashSet would be good LINQ, but we must create it. We’ll need a “streaming AddTo” along the way, which will stand us in good stead in future Reactive scenarios. But, getting back to hashing; we treat the hash set as a mere aggregate of the original word list!

    public static class Extensions
    { ...
        public static ICollection<T> AddTo<T>(
            this ICollection<T> target, T item)
        {   target.Add(item);
            return target;
        }
        public static HashSet<T> ToHashSet<T>(
            this IEnumerable<T> source)
        {   return source.Aggregate(
               new HashSet<T>(),
               (hs, elt) => (HashSet<T>)(hs.AddTo(elt)));
        }
    ...
    }

Let’s make the whole exercise much more challenging (see composableHashing.linq in the repo). First, imagine a transform that takes a random sample of a given length out of the original word list

    words
        .ToArray()
        .RandomChoice(words.Count(), 1000)

where

    public static class Extensions
    { ...
        private static Random random = new Random();
        public static IEnumerable<T> RandomChoice<T>(
            this IEnumerable<T> source, int length, int count = 1)
        {   var result = new T [count];
            for (var i = 0; i < count; i++)
                result[i] = source.ElementAt(random.Next(0, length));
            return result;
        }
    ...
    }

The “ToArray” is critical. The “ElementAt” inside “RandomChoice” is efficient for arrays, but not for lists. To convince yourself of this, try timing the above with and without the “ToArray.”

In any event, once we have the sample, we want to run membership tests for every element of the sample on the original word list to stress the membership checking. Semantically, we want to write something like the following (Don’t try this; it takes minutes, but have no fear, we’re fixing it:)

    words
        .ToArray() // comment this out to make it slower
        .RandomChoice(words.Count(), 1000)
        .Where(s => words
            .ToHashSet() // comment this out to make it faster
            .Contains(s))
        .Count()
        .Dump("Number of sample words in the original (expect 1000)")

The “ToHashSet” here is not even wrong because we recreate the hash set for every word in the sample. However, we can create a local name, inline, to contain the hash set, which we make just once. The best way to create local names in composable queries is with anonymous types:

    new [] { new {
            aWords = words.ToArray(), 
                // comment out ToArray to go slow
            hWords = words.ToHashSet() 
                // comment out ToHashSet to go REALLY slow
    }   }
        .Select(o => new {
            hWords = o.hWords,
            sample = o.aWords.RandomChoice(o.aWords.Count(), 1000)
        })
        .Select(o => o.sample
            .Where(s => o.hWords.Contains(s))
        )
        .ElementAt(0)
        .Count()
        .Dump("number of samples in the original set (expect 1000)")
        ;
    }

This isn’t perfect because later lines must know the names created in earlier lines, but the lines can be cut and pasted and remoted freely because they have no dependence on environment outside the query, just on local names created earlier in the query.

This can be mitigated a bit using .NET Tuples. With them, dependent lines must know the positions of required items within a tuple. Either way — with property names in anonymous types or with positional references in Tuples, there is a little, conventional contract amongst the composable parts in the query. This is the maximal decoupling we know how to effect at this time.

Now, to the Mathematica version. First, we can conveniently include the entire word list inline as a “closed” cell to remove the dependence on the file system. Second, we left in the duplicate word “animate” in case you want to experiment with writing some queries to find it.

There are a several ways to represent hash sets in Mathematica. The most versatile is as a list of rules. A natural way to build one is by Mapping (in LINQ: Selecting) a transform over the words. The transform converts a word into a rule that maps the word to True. Here is a sample

    Map[#->True&,words]//RandomChoice[#,5]&//TableForm

    entia->True
    galeoid->True
    manderelle->True
    accent->True
    tricot->True

The “list of rules” representation is incredibly versatile and powerful. It is the natural representation for an object, both in the object-oriented-programming sense and in the JSON sense. For present purposes, we run Mathematica’s built-in Dispatch function over the list of rules, which transparently converts the list into a hash table, which we are underutilizing as a mere hash set:

    hashSet = (Map[# -> True &, words] // Dispatch);

We now make our stress-test sample of 1,000 words

    bigSample=RandomChoice[words,{1000}];

and compare the times of checking for membership using old-slow MemberQ (almost 5 seconds!)

    In[11]:= Timing @ (And @@ Map[MemberQ[words, #] & , bigSample])
    Out[11]= {4.696, True}

versus shiny-new-fast hashing (less than 1/10 of a second!)

    In[12]:= Timing @ 
        (And @@ Map[True === (# /. hashSet) &, bigSample])
    Out[12]= {0.078, True}

Notice the ReplaceAll operator:

/.

The way to look something up in a list of rules (or its optimized equivalent hash table) is just to apply the rules to an expression. In this case, the expression is #, the argument of the function in Map, which takes on the value of each word in bigSample in turn.

As usual, we’re impressed with the terseness of Mathematica, despite its slightly geeky syntax, but we are also gratified that we know how to do exactly the same thing fluidly and fluently in LINQ.

Colophon: The link on “fluently” above conceals a pun. If you wish to investigate local names and chained dependencies in composable queries more deeply , do click on it and read Joe Albahari’s answer.

Advertisements

~ by rebcabin on September 11, 2012.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: