LINQ Does Logic? Chapter 8: Cut and Fail

October 21, 2012

This time, we model Prolog’s cut and fail control pseudopredicates using Chapter 7’s IEnumerable of IEnumerables. In Chapter 7, we found that fail triggers backtracking and that backtracking is natural in LINQ if alternatives are represented with exactly one level of IEnumerable nesting. Now, we dig into Feraudy’s EXAMPL7.SPR [sic] to find a new exploit of that structure: the cut, which immediately bails out of all backtracking, producing “the nearest” result.

In keeping with the exploratory spirit of this blog, rather than attempt a “full-fidelity” emulation of Prolog’s cut, we’ll just reproduce Feraudy’s example as naturally as we can in LINQ, admitting that “naturally” is a subjective judgment, a matter of taste and aesthetics. We also admit that an early attempt at a full-fidelity emulation of cut looked like it was going down a slippery slope to an implementation of Prolog-proper in LINQ, and that is not at all what we want, at least not at this stage of our explorations. Instead, we’re looking for easy ways to get the effect of Prolog using LINQ in a straightforward way.

Feraudy presents the following factbase:

    (lives_in anca usa)
    (lives_in brian usa)
    (lives_in murray canada)

and the following alternative rules:

    ((north_american P)
      (lives_in P usa)
      (cut) 
      /* experiment without the cut; 
       * experiment with fail; 
       * use (trace) to monitor internal execution; 
       */
    )
    ((north_american P)
     (lives_in P canada)
    )

The interpretation is that someone is North-American if he or she lives in the USA or in Canada. The cut just means “as soon as you find someone in the USA, report that person and stop looking for more” (typical chauvinism, eh?).

In Feraudy’s interpreter, you can evaluate this goal with commands like the following:

    (consult "SPR/EXAMPL7.SPR")
    (findall P (north_american P) X)

Let’s try three versions of Feraudy’s example and see what we get: one with no control pseudopredicates (no cut and no fail); one with our old friend fail, which triggers order-preserving backtracking; and one with cut.

First, represent the fact base the way we did in earlier chapters and scripts, only this time using anonymous types and C# dynamic, for reasons that become clear below:

    var lives_in = new dynamic [] {
        new {person = "anca", country = "usa"},
        new {person = "brian", country = "usa"}, 
        new {person = "murray", country = "canada"},
    };

This is just a non-nested IEnumerable because there are no rules here, only facts. Rules are conjunctions, nested one more level. Let’s first write Feraudy’s rule alternatives, with no cut and no fail:

    var north_american_1 = new [] {
        lives_in.Where(l => l.country == "usa"),
        lives_in.Where(l => l.country == "canada"),
    };

When we “evaluate” this rule, essentially the same as asking Feraudy’s Prolog interpreter to findall, we expect to see results for anca, brian, and murray:

    north_american_1.evaluate();

Great. What’s the code for evaluate? Showing the full code will jump us ahead of cut, so let’s take it on faith for the moment.

Next, slip in a fail. Remember, fail says “give up looking down this conjunction and consider the next alternative.” A great way to model this is just to replace results with the empty IEnumerable. SelectMany will flatten those away. The deep thing happening here is that modeling alternatives with exactly one level of nesting sets up SelectMany as the evaluator because it removes exactly one level of nesting. We have a matching pair.

Write fail as follows, inside a public static class named Extensions, so that it’s an extension method:

    public static IEnumerable<dynamic> Fail (this IEnumerable<dynamic> these)
    { return new dynamic [] {}; }

This fail takes whatever it gets and discards it. Easy.

Our intrepid application programmer now modifies the knowledge base to the following:

    var north_american_2 = new [] {
        lives_in.Where(l => l.country == "usa")
                .Fail(),
        lives_in.Where(l => l.country == "canada"),
    };

Think for a moment what you expect the output to be. Every time evaluation gets to the fail, results so far should be discarded and evaluation should proceed to the next alternative. We should see no one from the USA and just murray from Canada. A complying version of evaluate now becomes obvious:

    public static void evaluate(
        this IEnumerable< IEnumerable<dynamic> > alternatives)
    {    alternatives
            .SelectMany(predicates => predicates)
            .Dump()
            ;   
    }

SelectMany flattens out the discarded USA-dwellers, and our results contain just murray from Canada.

Cut says “get all the way out and produce the ‘nearest’ result.” Here, “nearest” must mean “the first result of all the possibilities.” Prolog represents “all possibilities” as a reference into the knowledge base including variables that get instantiated via unification, and we’re not modeling that process. Instead, in LINQ, we represent all possibilities as a filter — a .Where on an IEnumerable. A cut following the statement of such a filter should pick the first item in the filtered list and throw it, wrapped in an exception, to be caught in evaluate. Throwing is the easiest way to bail out of all backtracking.

First, a custom exception that can carry a payload, the “nearest” result:

    public class CutException : Exception
    { public IEnumerable<dynamic> payload; }

Next, a modified evaluate that catches these exceptions:

    public static void evaluate(
        this IEnumerable< IEnumerable<dynamic> > alternatives)
    {    try
        {    alternatives
            .SelectMany(predicates => predicates.evaluate())
            .Dump()
            ;
        }
        catch (CutException ce) { ce.payload.Dump(); }   }

We need an inner evaluate that can test whether an item “wants” to throw a cut exception. A slightly generalized way to do this is to check for the existence of an action property on a dynamic. If present, execute it. This is a general way to insert side effects into data items, at the cost of making everything dynamic. This cost is non-negligible, and, in a real-world application of these techniques, must be soberly traded off against other concerns. But it makes for a very pretty bit of code, here.

The best way to test for the existence of a property on a dynamic object seems to be just to try it and catch the exception thrown if the property is not present. This requires the Microsoft.CSharp.RuntimeBinder namespace and Microsoft.CSharp.dll, part of the DLR runtime. Our inner evaluate is as follows:

    public static IEnumerable<dynamic> evaluate(
        this IEnumerable<dynamic> alternative)
    {   var result = new List<dynamic>();
        foreach (var predicate in alternative)
        {   try { predicate.action(); }
            catch (RuntimeBinderException e) { /*e.Dump();*/ }
            result.Add(predicate);
        }
        return result;   }

Cut now inserts an action that throws the cut exception with the appropriate payload:

    public static IEnumerable<dynamic> Cut (this IEnumerable<dynamic> these)
    {   return new [] {new {action = new Action(() =>
        {   throw new CutException {payload = these.Take(1)}; })}};   }
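For completeness, here is how our intrepid application programmer invokes the cut (the variable name north_american_3 is ours, not Feraudy’s); the usa filter gets a .Cut() appended, just as the fail version got a .Fail():

    var north_american_3 = new [] {
        lives_in.Where(l => l.country == "usa")
                .Cut(),
        lives_in.Where(l => l.country == "canada"),
    };

    north_american_3.evaluate();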

When we run it, we get our expected chauvinistic result: just anca, the first USA-dweller; the cut exception bails out before the canada alternative is ever considered.

This code is available for you to play with in the usual spot.


LINQ Does Logic? Chapter 7: Backtracking

October 14, 2012

In EXAMPL6.SPR [sic], Feraudy makes two points:

  1. Fail is Prolog’s way to dump out of a sequence of goals
  2. Backtracking automatically shifts evaluation to alternative definitions

Feraudy first shows just an ordinary sequence of goals:

    ((demo6)
     (display success_goal1)
     (display success_goal2)
     (display success_goal3)
     (display success_goal4)
     (fail)
     (display success_goal5)
     (display success_goal6)
     (display success_goal7)
    )

When executed, we see displays of success_goals 1 through 4, then a “No,” Prolog’s standard way of reporting that it doesn’t have an answer (we can’t tell the difference between a real “no” and a “don’t know,” as discussed in Chapter 3).

A nice way to mimic this in LINQ is with a sequence of Actions, which are functions that produce no results.

    var demo6 = default(IEnumerable< Action > );

    var displayMaker = new Func<string, Action>
        (s => new Action
            (() => Console.WriteLine(s)));

    var fail = new Action(() =>
    {   throw new ApplicationException("fail");   });

    demo6 = new Action [] {
        displayMaker("success_goal1"),
        displayMaker("success_goal2"),
        displayMaker("success_goal3"),
        displayMaker("success_goal4"),
        fail,
        displayMaker("success_goal5"),
        displayMaker("success_goal6"),
        displayMaker("success_goal7"),
        };

Of course, we need a clean way to execute this. It’s often tricky to decide what to do with exceptions. Here, we’ll just print the string “fail” so that our output resembles that produced by SProlog.

    var performSequence = new Action<
        IEnumerable<Action> > (axs =>
        {    try
            {    axs
                    .Select(ax => {ax(); return 0;})
                    .Distinct()
                    .Dump()
                    ;
            } catch (Exception exception)
            {   Console.WriteLine ("fail");    }    });

    "".Dump("demo6"); 
    performSequence(demo6);

Now, in Feraudy’s demo6_1, there happens to be an additional definition. It first replicates demo6, and then there is an alternative:

    ((demo6_1)
     (display success_goal1)
     (display success_goal2)
     (display success_goal3)
     (display success_goal4)
     (fail)
     (display success_goal5)
     (display success_goal6)
     (display success_goal7)
    )

    ((demo6_1)
     (display haha)
     (nl)
    )

When Prolog encounters the fail, what does it do? Instead of saying “No,” it backtracks. That just means it looks for alternative definitions before bailing out. The order of presentation of the alternative definitions is important. Prolog searches the first one first, the second one next, and so on.

This is very similar to Mathematica’s behavior (Mathematica is the other travelling companion in our exploration of LINQ’s doing Logic). So similar, in fact, that it’s utterly straightforward to emulate EXAMPL6.SPR in Mathematica; it’s barely even an exercise for the reader and we won’t go into it here.

Since the alternatives are presented sequentially, and the order matters, the representation in LINQ is obvious: a sequence of sequences of actions, or

    var demo6_1 = default(
        IEnumerable< IEnumerable< Action > >);

    demo6_1 = new Action [] [] {
        new Action [] {
            displayMaker("success_goal1"),
            displayMaker("success_goal2"),
            displayMaker("success_goal3"),
            displayMaker("success_goal4"),
            fail,
            displayMaker("success_goal5"),
            displayMaker("success_goal6"),
            displayMaker("success_goal7"),
            },
        new Action [] {
            displayMaker("haha"),
        },
    };

All we need to get this going, generally, is a new way of handling the exception that only prints “fail” if there are no alternatives remaining.

    var performSequenceWithBacktracking = new Action<
        IEnumerable< Action > > (axs =>
        {   axs
                .Select(ax => {ax(); return 0;})
                .Distinct()
                .Dump()
                ; });

    var performSequencesWithBacktracking = new Action<
        IEnumerable< IEnumerable< Action > > > (axss =>
        {   axss
                .SelectMany(axs =>
                {   try
                    {   performSequenceWithBacktracking(axs);   }
                    catch (Exception)
                    {   if (axss.Last() == axs)
                            Console.WriteLine ("fail");   }
                    return new [] {0}; })
                .Distinct()
                .Dump()
                ; });

In the following invocation, it won’t print “fail” because the last alternative doesn’t contain a fail.

    "".Dump("demo6_1");
    performSequencesWithBacktracking(demo6_1);

But, if we change the second alternative of demo6_1 to contain an embedded fail

    new Action [] {
        displayMaker("haha"),
        fail,
        displayMaker("boohoo"),
    },

We get the printout of “fail,” and one of the debugging “Dump” calls is skipped: the one internal to “performSequenceWithBacktracking.”

Note especially the use of “SelectMany,” the mother of all LINQ combinators, in “performSequencesWithBacktracking.”

Do we ever need deeper levels of nesting, say sequences of sequences of sequences of Actions? Not if all alternatives to a single definition are presented at the same level of nesting. We see again that a single level of nesting is fundamental. It seems to suffice to capture all kinds of containment structure, much in the way that binary trees can represent n-ary trees.
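To see why, here is a minimal sketch (ours, not Feraudy’s): wrap demo6_1 one level deeper, alongside another group of alternatives, and a single SelectMany immediately restores the canonical one-level shape:

    var deeper = new [] {
        demo6_1,
        new Action [] [] { new Action [] { displayMaker("extra") } },
    };
    // one SelectMany removes the extra level, restoring
    // IEnumerable< IEnumerable< Action > >
    var canonical = deeper.SelectMany(group => group);
    performSequencesWithBacktracking(canonical);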

The LINQPad script for this chapter is available here.

LINQ does Logic? Chapter 6: Remotable Recursion

September 30, 2012

In Chapter 5, we used LINQ’s .Select operator to call recursive methods in a composable fashion. In this chapter, we make the recursive calls remotable via recursive lambda expressions, getting rid of pre-canned methods altogether. This will allow the crowd marketplace to innovate certain functions that are not pre-canned in the cloud, while not completely sacrificing sandbox security.

In earlier posts, we showed how to create remotable recursive lambda expressions in JavaScript. This time, we use Wes Dyer’s version of the same thing in C#, and add a redux in Mathematica and another in LINQPad of his superior derivation and explanation.

Chapter 5 presented a composable version of Feraudy’s EXAMPL5.SPR in Prolog, which implements a vestigial expression evaluator. The example is an architecturally sound foundation for a real-world expression evaluator. For example, we could, in principle, create an evaluator for Mathematica expressions. If those expressions are remotable, we could, for instance, author and test them in Mathematica and evaluate them in a sandbox written in C# or JavaScript, even on another machine. Such a sandbox would not have all the massive power and library knowledge of Mathematica, of course, but we could emulate Mathematica’s fundamental term-rewriting strategy to implement basic symbolic computations and pattern matching to enable advanced scenarios involving, say, units of measure, logic programming, and rules-driven workflow processes.

In Chapter 5, we used instance methods on classes in C# to do the recursion. In particular, the linqEvaluate method on the @expression abstract class is the workhorse. Its recursive override on the @binary subclass looks like this:

public sealed class @binary : @expression
{   public @expression Left  {get; set;}
    public @operator   Op    {get; set;}
    public @expression Right {get; set;}
    public override @expression linqEvaluate ()
    {   return Op.linqApply(Left.linqEvaluate(), Right.linqEvaluate());  }
}

There we see that linqEvaluate on a @binary @expression recursively calls linqEvaluate on the Left and Right operand @expressions, then calls the operator’s linqApply method on the results.

In a remoting context, this will work fine if both the sender and the receiver have the definitions of the C# classes, in particular the CLR code that implements the methods. Such a situation even has security advantages: you can’t remote a call when the receiver does not statically know the class type and all its method code. But it’s also too restrictive. You can’t call any recursive function that isn’t statically pre-canned at a receiver. That means that the receiver must have pre-installed all allowable methods, and that means no innovation in the crowd. Depending on scenario, this may be exactly what you want, or exactly what you DON’T want.

What we can do, technically, is strip ALL methods off the classes, leaving only data. Remoting data is relatively easy — serialize into JSON or XML or whatever. Then, we package CODE as DATA, also, using .NET Expressions. This representation is automatically created from IQueryable and IQbservable so long as the code is stored in lambda expressions in LINQ query operators like .Select and .Where. It’s also possible to write .NET Expressions explicitly, and there will be a lot more to say about this later.
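As a micro-illustration of code as data (ours, not part of Feraudy’s example), assigning a lambda to an Expression type produces a data structure we can inspect, transform, or serialize, rather than compiled IL:

    // requires System.Linq.Expressions
    Expression<Func<int, int>> square = x => x * x;
    square.Body.Dump("the body is data: a BinaryExpression node");
    square.Compile()(6).Dump("compiled back into code: expect 36");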

The current program, though, is to get rid of all methods and replace them with lambda expressions. Then, the calls to evaluate them, namely

    exprs
        .Select(e => Y(@eval)(e))
        .Dump("Good LINQ method: expect IEnum of @expression")
        ;

where

    var c1 = new @constant {Value = 4};
    var exprs = new [] { // IEnumerable<@expression>
        new @binary {Left = new @constant {Value = 4},
                     Op = new @times(),
                     Right = new @binary {Left = new @constant {Value = 5},
                                          Op = new @plus(),
                                          Right = new @constant {Value = 6}}}};

and

    Extensions
        .generate()
        .Return()
        .Dump("Random Expression Tree")
        .Select(e => Y(@eval)(e))
        .Dump("Value of the Tree")
        ;

The magic is in this definition of @eval, which takes its self-application as one input and the expression to evaluate as its second input. The Y Combinator, namely

    delegate Func<A,R> Rec<A,R>(Rec<A,R> _);
    public static Func<A,R> Y<A,R>(Func<Func<A,R>, Func<A,R>> s)
    { Rec<A,R> rec = f => n => s(f(f))(n);
        return rec(rec); // (g |-> g@g) applied to f => …
    }

creates the appropriately delayed self-application so that eval can “call itself” recursively. Here is @eval:

    Func<Func<@expression,@expression>,Func<@expression,@expression>>
        @eval = ev => ex =>
            (ex is @constant) ? ex
            : (ex is @binary) ?
            @apply(
                (ex as @binary).Op,
                ev((ex as @binary).Left),
                ev((ex as @binary).Right))
            : default(@expression)
            ;
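Before moving on, a quick sanity check of Y itself (our toy, not Wes Dyer’s): factorial calls itself with no pre-canned recursive method anywhere:

    Y<int, int>(fact => n => n <= 1 ? 1 : n * fact(n - 1))(5)
        .Dump("factorial via Y: expect 120");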

The entire example is hosted here as a LINQPad script. Please follow the links in this post for detailed explanations of Y and how .NET Expressions are remoted. In this example, the expressions are remotable; actually remoting them involves more machinery such as implementing IQueryable, IQbservable, serialization, and various transforms on LINQ .NET Expressions. Of course, we want to use LINQ itself to perform such transformations, and that will lead us into deeper waters.

Bayes’ Theorem a Triviality?

September 25, 2012

As is so often the case, micro-examples make the otherwise abstract obvious. Almost anyone will be able to reason upward from the following adaptation of a famous example to a general statement of the theorem. For some reason, the little example is so easy to understand and to remember.

Imagine you have a school of 200 students with 60% boys and 40% girls; 90% of the boys wear pants, 10% wear kilts; 35% of the girls wear pants and 65% wear skirts. You see a student in the distance and can tell that he or she is wearing a kilt or a skirt, but you can’t tell whether the student is a boy or a girl. What’s the probability the student is a girl?

This is only easy if you just write out the counts, as in 120 boys: 108 in pants, 12 in kilts; 80 girls: 28 in pants, 52 in skirts. If you try to work in fractions you will go insane with arithmetic.

But Bayes’ theorem becomes a triviality: 64 (12+52) students are in skirts or kilts; 136 (108+28) in pants (marginal counts). Just write out the ratios. Probability that student is girl given that we see skirt or kilt is P(girl|skirt) = P(girl & skirt) / P(skirt) = 52 / 64. Prob that student is in skirt given she is a girl is P(skirt|girl) = P(girl & skirt) / P(girl) = 52 / 80. P(girl & skirt) = P(skirt|girl)*P(girl) = P(girl|skirt)*P(skirt). That’s the usual statement of the theorem.
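In symbols, with skirt standing for “skirt or kilt,” the same counts give:

    P(\text{girl} \mid \text{skirt})
        = \frac{P(\text{skirt} \mid \text{girl})\,P(\text{girl})}{P(\text{skirt})}
        = \frac{(52/80)\,(80/200)}{64/200}
        = \frac{52}{64} = 0.8125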

LINQ does Logic will return soon.

LINQ does Logic? Chapter 5: Composable Recursion

September 18, 2012

Embedding recursive functions in LINQ gives us opportunities for dynamically composable and remotable streams of operations. If we make a library of expression processors — evaluators, optimizers, rewriters of all kinds — all of type Expression -> Expression, then we can put them in LINQ .Select and .SelectMany methods, which we can apply to IEnumerables and IObservables of expressions. We can send these around the system as data via IQueryable and IQbservable, allowing us to build up and tear down whole processing graphs completely dynamically, without rebuilding and redeploying applications and services.

Moving forward through the samples in Feraudy’s original distribution of SProlog, we see Exampl4.SPR is a demonstration of IO primitives. That isn’t specifically about logic programming, but rather about how Prolog interfaces with the “normal world”. We’ll skip it.

Exampl5.SPR is about recursion. This is interesting.

This particular example is a lesson in writing an expression evaluator in Prolog. Now, expression evaluation is a completely standard and classic lesson in recursion and method dispatch, and has been done to death in every programming language under the sun, e.g., F#, JavaScript, Scala, Python, etc. Could there possibly be anything more to say about it?

Maybe. Turns out that embedding such a thing in LINQ does give us a new twist: composable streaming. If we make the evaluator of type Expression -> Expression, then it becomes something we can put in a LINQ .Select method, which we can apply to an IEnumerable of expressions, or even better, to an IObservable of expressions. We can subscribe the evaluator to a potentially infinite stream of expressions and evaluate them as they arrive, forever. We can conceptually generalize from mere evaluators to all kinds of other expression processors.

Let’s first transcribe Feraudy’s example closely. His evaluator takes an expression and produces an integer. In the below, iplus, imult, and iminus are built-in Prolog primitives for integer arithmetic. Feraudy’s grammar presents two kinds of expressions: constants and binary expressions. The constant kind “bottoms out” the recursion and just produces integer values directly. The binary kind recursively walks the expression tree, evaluating left and right branches until they reduce to integers and then combining them with the built-in operators. Standard mathematical induction will furnish a proof of termination of this program on well-formed finite inputs.

((evaluate V V)
 (integer V)
)
((evaluate (E1 Operator E2) V)
 (evaluate E1 V1) /* evaluate calls itself */
 (evaluate E2 V2)
 (apply_operator Operator V1 V2 V)
)

((apply_operator plus X Y Z)
 (iplus X Y Z)
)
((apply_operator minus X Y Z)
 (iminus X Y Z)
)
((apply_operator times X Y Z)
 (imult X Y Z)
)

This is pretty easy to emulate in C# LINQ. We’ll use abstract classes for the expression and operator types, putting specialized implementations of evaluate and apply in sealed subclasses. We won’t try to make the classes generic on the underlying numeric type; we’ll just use integers for now (while it’s interesting to pursue numeric generics, it’s non-trivial and a whole ‘nuther can of worms).

Here’s a real quick hack at it, with names decorated by ‘@’ both to conceal the use of a keyword, namely ‘operator,’ and to give a radiant aura of specialness to these classes:

public abstract class @expression 
{   public abstract int evaluate();   }

public sealed class @constant : @expression
{   public int Value {get; set;}
    public override int evaluate ()
    {   return Value;   }   }

public sealed class @binary : @expression
{   public @expression Left  {get; set;}
    public @operator   Op    {get; set;}
    public @expression Right {get; set;}
    public override int evaluate ()
    {   return Op.apply(Left.evaluate(), Right.evaluate());   }   
}

public abstract class @operator
{   public abstract int apply(int a, int b);   }

public sealed class @plus : @operator
{   public override int apply(int a, int b)
    {   return a + b;   }   }

public sealed class @times : @operator
{   public override int apply(int a, int b)
    {   return a * b;   }   }

public sealed class @minus: @operator
{   public override int apply(int a, int b)
    {   return a - b;   }   }

Just as with the Prolog version, these don’t do much beyond converting infix notation to prefix notation. Here’s an example usage, essentially a unit test:

    // This is a transcription of 4 * (5 + 6) == 44
    new @binary {
        Left  = new @constant {Value = 4},
        Op    = new @times(),
        Right = new @binary {
            Left  = new @constant {Value = 5},
            Op    = new @plus(),
            Right = new @constant {Value = 6}}}
        .evaluate()
        .Dump("Non-LINQ method: expect 44")
        ;

Ok, great. Now, let’s package the expressions in an IEnumerable so we can use .Select to apply the evaluator:

    // LINQ method introduces one level of IEnumerable / IObservable
    var exprs = new [] { // Puts expression in IEnumerable<@expression>
        // The rest is as above
        new @binary {
            Left  = new @constant {Value = 4},
            Op    = new @times(),
            Right = new @binary {
                Left  = new @constant {Value = 5},
                Op    = new @plus(),
                Right = new @constant {Value = 6}}}};

The evaluator is now of type IEnumerable<@expression> to IEnumerable<int>, represented as a lambda expression inside a Select (i.e., a Map):

    exprs
        .Select(e => e.evaluate())
        .Dump("Bad LINQ method: expect IEnum of int containing 44")
        ;

Oops! We can’t compose this with downstream functions that manipulate expressions. We’ve left the world of expressions and entered the world of integers. For composability, the entire evaluator is of the WRONG TYPE! We must make it produce IEnumerable<@expression> and not IEnumerable<int>. Let’s do this by adding linqEvaluate and linqApply methods, and let’s make a different operator (a so-called co-monadic one) to “extract” values from an expression! Extract should only work on leaf-level expressions, so the default implementation in the base class throws an exception:

public abstract class @expression 
{   public virtual int extract()
    {   throw new InvalidOperationException(
        "cannot extract a value from an arbitrary expression");   }
    public abstract @expression linqEvaluate();   }

public sealed class @constant : @expression
{   public int Value {get; set;}
    public override int extract ()
    {   return Value;   }
    public override @expression linqEvaluate ()
    {   return this;   }   }

public sealed class @binary : @expression
{   public @expression Left  {get; set;}
    public @operator   Op    {get; set;}
    public @expression Right {get; set;}
    public override @expression linqEvaluate ()
    {   return Op.linqApply(Left.linqEvaluate(), Right.linqEvaluate());   }   
}

public abstract class @operator
{   public abstract @expression linqApply(@expression a, @expression b);   
}

public sealed class @plus : @operator
{   public override @expression linqApply(@expression a, @expression b)
    {   return new @constant {Value = a.extract() + b.extract()};   }   
}

public sealed class @times : @operator
{   public override @expression linqApply(@expression a, @expression b)
    {   return new @constant {Value = a.extract() * b.extract()};   }   
}

public sealed class @minus: @operator
{   public override @expression linqApply(@expression a, @expression b)
    {   return new @constant {Value = a.extract() - b.extract()};   }   
}

This is just like the old stuff, only it makes sure that the gazoutas are the same type as the gazintas, namely of type @expression. That means we can freely compose expression rewriters — not just evaluators — anywhere in a LINQ pipeline. Such rewriters can do all kinds of things: optimization, customization, repurposing, and reasoning with logic! So we have come full circle (though we’re not nearly done with this series!). While investigating logic programming via LINQ, we found that LINQ’s composability led us directly to opportunities to apply logic programming to expressions. This is a gold mine.
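For instance, here is a sketch of one such rewriter (ours, not Feraudy’s): it commutes the operands of the commutative operators and, because it is of type @expression -> @expression, it composes with the evaluator in the same pipeline, as in exprs.Select(e => e.commute()).Select(e => e.linqEvaluate()):

    public static class Rewriters
    {   // Swap the operands of commutative operators, recursively;
        // constants pass through unchanged; minus is left alone.
        public static @expression commute(this @expression e)
        {   var b = e as @binary;
            if (b == null) return e;
            var left  = b.Left.commute();
            var right = b.Right.commute();
            return (b.Op is @plus || b.Op is @times)
                ? new @binary {Left = right, Op = b.Op, Right = left}
                : new @binary {Left = left,  Op = b.Op, Right = right};
        }
    }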

Let us end with a couple of examples of using these new classes. There is a complete LINQpad script here with all this and more for your experimentation. Here is a usage example generating random expression trees and evaluating them in an IEnumerable:

    Enumerable
        .Range(0, 10)
        .Select(_ => @expression.generate())
        .Select(e => e.linqEvaluate())
        .Dump("Values of 10 random expression trees")
        ;

and here is a similar example evaluating them via a push model in an IObservable:

    Observable
        .Interval(TimeSpan.FromSeconds(1.0))
        .Take(10)
        .Select(_ => @expression.generate())
        .Subscribe(e => new {expression = e, value = e.linqEvaluate()}
            .Dump("A new expression every tick")
            );

LINQ does Logic? Chapter 4: Composable Hashing

September 11, 2012

Checking multiple collections for mutual membership of some item is an emerging theme in our examples. For instance, we want to know if some material has a certain value for specific density, and, if so, whether it’s also in a database of solubles.

The time it takes to scan a collection is proportional to the length of the collection. Imagine looking for a word in a dictionary by starting at the beginning and checking every word. A standard way to speed this up is, of course, hashing. In the case of a paper dictionary, your “hash function” returns the first letter of the word, which sends you to a “hash bucket:” the section of the dictionary with all the words beginning with that letter. You naturally recurse, effecting a trie search.

As the number of collections to cross-check grows, so grows the importance of speeding up the checks: brute force costs a full scan of every collection for every item checked. Imagine checking that a certain word is in four different dictionaries, first using the brute-force-scan method, then using the brighter hash method. Now imagine doing likewise for 1,000 words.

In Chapter 2: Conjunctions, we built LINQ and Mathematica examples that check whether materials float on water by looking up densities and solubilities. The running times of these programs over the tiny sample data set were negligible, but we noted that the method used would not scale. That method used the .Contains extension method on IEnumerable and the MemberQ primitive on Lists to scan the entire list of solubles for each candidate material.

In this chapter, we build another example with longer lists and starkly illustrate that the brute-force methods are intolerable. In the offing, we fix the problem using a composable hash-set type. We create a LINQ operation that converts an IEnumerable into a hash set, and transparently insert that operation into a chain of other operations. This is more than just good aesthetics: much later, we shall see that remoting composable LINQ operations opens whole vistas of dynamic scenarios, where pipelines can be set up and taken down on-the-fly.

First, look at “hashing.linq” in the LINQ subdirectory of the “LINQdoesLogic” repo. It refers to a data file of about 350,000 English words adapted from the public-domain MOBY lists. To run this LINQpad script, you will have to adjust the file location or the reference to it in the script — the checked-in version of the script assumes the file is in

c:/tmp/words.txt

The script first reads the file and then does timed lookups of two words, “prehistorically,” which is in the file, and “zzzzzzzz,” which isn’t, as follows:

    var sw = new Stopwatch();
    sw.Start();
    words
        .Contains("prehistorically")
        .Dump("Does the IEnumerable contain \"prehistorically\"?")
        ;
    sw.Stop();
    sw.ElapsedMilliseconds.Dump(
        "Elapsed milliseconds required to decide whether the words IEnumerable contains \"prehistorically\"");

On my dreadnaught-class Lenovo laptop, plugged in, these lookups take around 15 to 30 milliseconds, respectively (about twice that on battery).

These timings are probably not very accurate because they’re too short compared to the included overhead of printing an answer. Later in the file, along with some other digressions that might interest you, there are snippets that look up 1000 words and amortize the overhead, taking around 15 seconds, consistent with the shorter end of the single-word lookup time.

We then build a hash set, first using the built-in “Add” method of the built-in HashSet class in the .NET Framework. Add takes an element and returns true if the element was actually inserted, that is, if it was not already present. This protocol allows us to insert and count all the non-duplicate words in the file into a hash set with a LINQ composition like the following:

    var wordsHashSet = new HashSet<string>();
    sw.Reset(); sw.Start();
    words
        .Where(w => wordsHashSet.Add(w))
        .Count()
        .Dump("Count of unique words in the hash set")
        ;
    sw.Stop();
    sw.ElapsedMilliseconds.Dump(
        "Elapsed milliseconds to build and enumerate the hash set by transformation");

Note the “Where” method instead of “Select.” With “Where”, we noticed that the original MOBY word list “single.txt” actually contained a duplicate word, “animate,” because the count revealed by LINQ differed from the line count of the file by 1. This mistake has certainly been in the MOBY file for years but it was easy to find via LINQ. I took it out of “words.txt,” but I have left it in the Mathematica version of our program for illustrative purposes.

Ok, this is great, we have a hash set. We next do some timed lookups of our test words and find the times to be negligible.

    sw.Reset(); sw.Start();
    wordsHashSet
        .Contains("prehistorically")
        .Dump("Does the hash set contain \"prehistorically\"?")
        ;
    sw.Stop();
    sw.ElapsedMilliseconds.Dump(
        "Elapsed milliseconds to decide whether the hash set contains \"prehistorically\"");

However, we have lost something. The only thing we can do with this approach is load the hash set by side effect and then use it later. This requires us to create a variable for the hash set, so it’s not composable. It’s a rule — whenever you must create a variable “out of line” of the LINQ query, the query itself is not composable, because it depends on the variable created in its environment. You can’t cut and paste the query or remote it without its whole environment.

If we could bypass the variable, we would have a composable transform that converts the original word list into an efficient form and only then does the tough work.

In fact, there is an even deeper level of composability available, and that is in the creation of the hash set itself. Our eyes tell us that if ToArray and ToList and ToDictionary are good LINQ, then ToHashSet would be good LINQ, but we must create it. We’ll need a “streaming AddTo” along the way, which will stand us in good stead in future Reactive scenarios. But, getting back to hashing: we treat the hash set as a mere aggregate of the original word list!

    public static class Extensions
    { ...
        public static ICollection<T> AddTo<T>(
            this ICollection<T> target, T item)
        {   target.Add(item);
            return target;
        }
        public static HashSet<T> ToHashSet<T>(
            this IEnumerable<T> source)
        {   return source.Aggregate(
               new HashSet<T>(),
               (hs, elt) => (HashSet<T>)(hs.AddTo(elt)));
        }
    ...
    }
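With ToHashSet in hand, a lookup can build its hash set inline, with no out-of-line variable; a minimal sketch:

    words
        .ToHashSet()
        .Contains("prehistorically")
        .Dump("Does the composable hash set contain \"prehistorically\"?")
        ;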

Let’s make the whole exercise much more challenging (see composableHashing.linq in the repo). First, imagine a transform that takes a random sample of a given length out of the original word list

    words
        .ToArray()
        .RandomChoice(words.Count(), 1000)

where

    public static class Extensions
    { ...
        private static Random random = new Random();
        public static IEnumerable<T> RandomChoice<T>(
            this IEnumerable<T> source, int length, int count = 1)
        {   var result = new T [count];
            for (var i = 0; i < count; i++)
                result[i] = source.ElementAt(random.Next(0, length));
            return result;
        }
    ...
    }

The “ToArray” is critical. The “ElementAt” inside “RandomChoice” is efficient for arrays (and for anything else implementing IList<T>), but not for lazily produced sequences, which must be re-scanned from the beginning on every call. To convince yourself of this, try timing the above with and without the “ToArray.”

In any event, once we have the sample, we want to run membership tests for every element of the sample on the original word list to stress the membership checking. Semantically, we want to write something like the following (don’t try this; it takes minutes, but have no fear, we’re fixing it):

    words
        .ToArray() // comment this out to make it slower
        .RandomChoice(words.Count(), 1000)
        .Where(s => words
            .ToHashSet() // comment this out to make it faster
            .Contains(s))
        .Count()
        .Dump("Number of sample words in the original (expect 1000)")

The “ToHashSet” here is not even wrong: we recreate the hash set for every word in the sample. However, we can create a local name, inline, to contain the hash set, which we make just once. The best way to create local names in composable queries is with anonymous types:

    new [] { new {
            aWords = words.ToArray(), 
                // comment out ToArray to go slow
            hWords = words.ToHashSet() 
                // comment out ToHashSet to go REALLY slow
    }   }
        .Select(o => new {
            hWords = o.hWords,
            sample = o.aWords.RandomChoice(o.aWords.Count(), 1000)
        })
        .Select(o => o.sample
            .Where(s => o.hWords.Contains(s))
        )
        .ElementAt(0)
        .Count()
        .Dump("number of samples in the original set (expect 1000)")
        ;

This isn’t perfect because later lines must know the names created in earlier lines, but the lines can be cut and pasted and remoted freely because they have no dependence on environment outside the query, just on local names created earlier in the query.

This can be mitigated a bit using .NET Tuples. With them, dependent lines must know the positions of required items within a tuple. Either way, with property names in anonymous types or with positional references in Tuples, there is a little conventional contract amongst the composable parts in the query. This is the maximal decoupling we know how to effect at this time.
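Here is a sketch of the Tuple version of the same query (System.Tuple, .NET 4), to make the positional contract concrete:

    new [] { Tuple.Create(words.ToArray(), words.ToHashSet()) }
        .Select(t => Tuple.Create(
            t.Item2,                                     // the hash set
            t.Item1.RandomChoice(t.Item1.Length, 1000))) // the sample
        .Select(t => t.Item2.Where(s => t.Item1.Contains(s)))
        .ElementAt(0)
        .Count()
        .Dump("number of samples in the original set (expect 1000)")
        ;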

Now, to the Mathematica version. First, we can conveniently include the entire word list inline as a “closed” cell to remove the dependence on the file system. Second, we left in the duplicate word “animate” in case you want to experiment with writing some queries to find it.

There are several ways to represent hash sets in Mathematica. The most versatile is as a list of rules. A natural way to build one is by Mapping (in LINQ: Selecting) a transform over the words. The transform converts a word into a rule that maps the word to True. Here is a sample:

    Map[#->True&,words]//RandomChoice[#,5]&//TableForm

    entia->True
    galeoid->True
    manderelle->True
    accent->True
    tricot->True

The “list of rules” representation is incredibly versatile and powerful. It is the natural representation for an object, both in the object-oriented-programming sense and in the JSON sense. For present purposes, we run Mathematica’s built-in Dispatch function over the list of rules, which transparently converts the list into a hash table, which we are underutilizing as a mere hash set:

    hashSet = (Map[# -> True &, words] // Dispatch);

We now make our stress-test sample of 1,000 words

    bigSample=RandomChoice[words,{1000}];

and compare the times of checking for membership using old-slow MemberQ (almost 5 seconds!)

    In[11]:= Timing @ (And @@ Map[MemberQ[words, #] & , bigSample])
    Out[11]= {4.696, True}

versus shiny-new-fast hashing (less than 1/10 of a second!)

    In[12]:= Timing @ 
        (And @@ Map[True === (# /. hashSet) &, bigSample])
    Out[12]= {0.078, True}

Notice the ReplaceAll operator:

/.

The way to look something up in a list of rules (or its optimized equivalent hash table) is just to apply the rules to an expression. In this case, the expression is #, the argument of the function in Map, which takes on the value of each word in bigSample in turn.

As usual, we’re impressed with the terseness of Mathematica, despite its slightly geeky syntax, but we are also gratified that we know how to do exactly the same thing fluidly and fluently in LINQ.

Colophon: The link on “fluently” above conceals a pun. If you wish to investigate local names and chained dependencies in composable queries more deeply, do click on it and read Joe Albahari’s answer.

LINQ does Logic? Chapter 3: Rewriting Conjunctions

September 4, 2012

Here is the secret of Prolog: it converts proofs of conjunctions into disproofs of disjunctions. Instead of trying to prove every statement in a bunch connected by “and’s,” it tries to disprove just one statement in a bunch connected by “or’s.” This can give us, in Logical LINQ, a time-saving “quick exit” computation in some cases, but there are some logical subtleties to consider. Here we go with a bit of analysis that is more philosophical than computational:

Here’s our running example of a Prolog rule, just rendered in English:

A material can float on water if its specific density is less than unity and if it’s not soluble in water.

Let’s give each proposition in the rule a short, symbolic name:

F:(A material can Float on water) if L:(its specific density is Less than unity) and S:(it’s not Soluble in water).

and contract the rule into a short, symbolic form

F if L && S

We can examine the logical structure of this short form without distraction from its interpretation in the world of floating materials. Its operational meaning is “to prove F, prove L and prove S.” Logically, if we don’t know that both L and S are true, or even if we know that one or both are false, then we can’t prove F, at least according to this isolated rule. Only if we know that both L and S are true do we know that F is true, otherwise we don’t know, again, according to this one isolated rule.

It follows, then, that “if we disprove L, or disprove S, or show that either one cannot be proved from the facts on hand, then we know that we can’t prove F from this rule and Prolog can say ‘no’ right away.” For example, if we prove that a material’s density is not less than unity, or we don’t have any data on that material’s density, or if we prove that the material is soluble, or if we don’t have any data on its solubility, then we know we can’t prove that the material floats.

Before you cry foul, let me examine the obvious case: we prove that the density of material X is greater than or equal to unity (we disprove L) but we also happen to prove that material X is not soluble (we prove S). I just said that after disproving L we can jump out and declare that we can’t prove F, but you might object that we jumped the gun and, in fact, missed the chance to disprove F.

First, I must narrow down your objection by dismissing any reference to the level of real-world interpretation: that’s the wrong level. Of course it’s true that dense, non-soluble materials don’t float — boats etc. notwithstanding. But that is not the rule we encoded in Prolog. The rule we encoded only addressed the proof of floating. Formally speaking, we have no way of disproving F from the rules and data at hand.

There is another level, though, on which your objection has merit, and that’s the level of the “closed-world assumption.” Under this assumption, our data and rules are a complete description of the domain of discourse and, therefore, failure to prove is proof of falsehood. Under this assumption, our Prolog query

F if L && S

should be read as

F if and only if L && S

This is an intermediate level of interpretation: not at the real-world level dismissed above, and not at Prolog’s mechanical level of scanning lists of facts, but at the level of what we think when Prolog says “no.” When we are contemplating the results of a Prolog query, we are free to use or not use the closed-world assumption, and our conclusion of whether Prolog disproved our query statement or merely failed to prove it depends on that assumption. So, we must thoroughly understand how Prolog decides ‘no’ and what it might mean with and without the closed-world assumption.

Our example query actually has the following form at the finest level of detail, moving closer to the Mathematica notation:

F[m] if D[m,d] && L[d] && S[m]

where

F[m] = material m can float on water

D[m,d] = we have a record in the fact base of m’s specific density d

L[d] = we do a computation proving that specific density d is less than unity

S[m] = we have a record in the fact base that material m is not soluble in water

Prolog will say ‘no’ if any of D[m,d], L[d], S[m] fail (that’s the quick-exit optimization and the subject of this chapter). What does this mean, with and without the closed-world assumption?

D[m,d] will fail if there is no record for material m in the fact base. Now, it’s notable that under the closed-world assumption and in the presence of the next term L[d], we don’t actually need to put any records in the fact base for overdense materials. So long as we have a record for every conceivable material with density less than unity, the combination D && L will robustly fail due to mere absence of a record for all materials with density greater than or equal to unity. But to the extent that we’d like D and L independent, D[m,d] cannot fail under the closed-world assumption unless we didn’t put the density d of material m in the fact base. Failure of D generates a bug report.

Without the closed-world assumption, failure of D means that we don’t know the density of material m; failure of D means failure to prove F; we don’t know whether the material can float, and all is consistent. So much for D; on to L.

L[d] will fail if d is greater than or equal to unity. L is one of a class of terms for which the closed-world assumption is automatically valid. Any number d must be either less than 1 or not less than 1, so failure to prove L[d] is, in fact, always a proof that L[d] is false. Even if we don’t take the closed-world assumption for other parts of the query or for the query as a whole, it’s always valid for L[d]. With or without the closed-world assumption, our interpretation of failure of L is the same: it’s proof that the material is overdense.

Independent failure of L forces immediate failure of F, but does it mean failure to prove floating or proof of failure to float? We would like to say, based on general theory, that under the closed-world assumption, failure of L constitutes disproof of F. Checking against our real-world interpretation, it does: the material won’t float whether it’s soluble or not, so it looks like we’re OK, although we might feel queasy that we checked against the formerly forbidden real-world interpretation.

Without the closed-world assumption, we’d like to say that failure of L means failure to prove F, and this interpretation is always available. If you tell me that you disproved F, I don’t contradict you by saying that I’m not sure I believe you. I just accept that I think F might be false and it might be true. Put another way, the truth of (not F) implies the truth of (F or not F) by disjunction introduction. So much for L; now finally to S.

S[m] will fail if we don’t have a record that m is not soluble. Here the closed-world assumption makes all the difference. Under this assumption, we have complete data and the absence of a record for S[m] proves that m is soluble. Failure of S[m] means disproof of F[m] — m does not float because it’s soluble. Without the closed-world assumption, failure of S and therefore of F means that we don’t know.

The complete analysis of this example shows that when Prolog says ‘no,’ under the closed-world assumption, we can conclude that the material does not float, and without the closed-world assumption, we can only conclude the more modest statement that Prolog can’t prove that the material floats.

Let’s finish with a symbolic rewrite of the conjunction that we may be able to exploit in our Logical LINQ going forward:

Using the material-conditional definition of “if” (though see Entailment for the many faces of “if”), and using “===>” to mean “rewrites to”:

(F if D && L && S) ===> F or not (D && L && S)

and then, by De Morgan’s:

F or not (D && L && S) ===> F or not D or not L or not S

Prolog now tries to disprove D, then L, then S in sequence. It stops with an overall ‘no’ if it gets one solid disproof (or failure to prove) along the way, only concluding that F is proved if it cannot disprove any of the others, and reporting the variable bindings it accumulated on the way.

 