Concur: a Newton machine learning application

Key Technologies
Concur makes use of several key technologies including Apple's Newton technology, machine learning and infrared communications.

Newton Technology

Newton Technology is a collective term for Newton devices, their operating system and programming language. Apple Computer introduced the first Newton product, the Apple MessagePad, based on this technology in August of 1993. The technology has since been licensed by about a dozen firms and to date we have seen the Sharp ExpertPad, the Motorola Marco and the Digital Ocean Tarpon. A larger device called "the slate" and several other devices are expected to come to market in the near future.
The Apple MessagePad is a one pound, programmable, pen-based, handheld computer that recognizes cursive handwriting and sells new for less than $600. It has a 20MHz RISC processor, 4 MB of ROM, 1 MB of RAM and a battery life in the 8-10 hour range. Its communication options include serial communications, LocalTalk networking, FAX and data modem and infrared communications operating at 38K baud. It has built-in support for email, FAX and PostScript printing. Later models have more memory and even longer battery life.
All of the programming and data is stored in either ROM or RAM that has a redundant battery backup; there is no secondary store. There are no files or directories in the traditional sense. Data is organized into soups and the typical element in a soup is a frame. Frames may be likened to records and soups may be likened to files. Many data values are stored only once where multiple occurrences of the data are simply pointers to a single instance. This and a few other tricks causes RAM data to be compressed at about 6:1.
The Newton OS is a preemptive, multi-tasking operating system and Newton devices are programmed using an object-oriented dynamic language called NewtonScript. Hosted on a Macintosh, the combination browser, editor, compiler, linker and debugger generates packages of byte codes for downloading to a tethered Newton device by either a serial or a network connection. NewtonScript is currently an interpreted language, but a native compiler for compute intensive routines is under development.
NewtonScript programs are compiled to byte codes to reduce their memory requirements and increase portability should a Newton Technology licensee choose a different hardware platform or processor.
NewtonScript programs are object-oriented but are prototype based; that is, inheritance is obtained from prototypes and other instances and not from classes. It is possible, however, to simulate classes through the careful use of prototypes. There is also an extensive library of views (widgets) from which to build applications.
The structure of a Newton application is an inheritance tree of views where each view can see the data and methods of parent views. As such, an application is simply another set of views installed on top of the extensible OS which provides event loop and I/O processing support.
All Newton devices share these characteristics. Other products incorporate extensions such as ARDIS radio networking as in the case of Motorola's Marco or a larger writing surface as in the case of the slate. The Digital Ocean Tarpon is ruggedized and capable of withstanding complete immersion to a depth of 20 feet and can withstand a drop to concrete from a height of six feet. One of its options is an internal Global Positioning System, or GPS, making the Tarpon ideal for field data collection tasks. As a platform, Newton is often a good choice because it represents a new model of computing with a new form factor affording the user new dimensions in mobility and price.

Machine Learning
Another key technology Concur depends upon is machine learning. Machine learning applications can learn and adapt to complex and highly individualistic environments, perhaps better than if a programmer attempted to code individual solutions. It also provides a way to handle situations that are sometimes beyond the grasp of the programmer. If a problem were fully understood in those cases, it could conceivably be programmed and machine learning would have little value. There are problems that are understood only sufficiently for a programmer to provide the basic application with a machine learning component. This kind of program can go on to learn and adapt to the complex environment.
Also, machine learning is applicable to situations where programs must adapt to the needs thousands of individual users and the cost of custom programming for each user is prohibitive. Here machine learning provides a means of adapting to many individual situations or users without the programmer having to understand the needs of each. When the number of situations or users is small this is not so much a problem, but it doesn't take many before a machine learning approach is preferred.
There are several machine learning paradigms that might have been used in the design of Concur; decision trees or ID3, Artificial Neural Networks or ANNs and case-based reasoning or CBR. ID3 has the advantage that it returns a quick real-time recommendation because it would have had no more than a dozen decisions to make. ID3 would require, however, significant amounts of off-line time to build the decision tree which should be rebuilt periodically as additional learning becomes available. Also, ID3 doesn't provide an easy mechanism for the notion of relative importance or weighting. Because user control and a minimum of off-line processing were high objectives, ID3 was set aside.
Artificial neural networks were not attractive for several reasons. One is that a supervised learning period will need to be performed by a novice user which conflicts with the ease-of-use objective. Another is that the user might change the relative importance preferences after a period of use precipitating another training period. Finally, ANNs "hide" the rationale for a recommendation making validation difficult. With cased-based reasoning the programmer, and to some degree, the user have an opportunity to understand how recommendations are determined.
CBR requires, perhaps, more storage and time than either ID3 or ANNs but has the advantage of allowing the user to change the relative importance factors at any time with immediate and valid consequences in the recommendations. CBR is easier to comprehend and the algorithms are likewise cleaner. If an appropriate indexing scheme is applied, CBR can perform within a reasonable time. Likewise, efficient data compression techniques can minimize storage requirements to within acceptable limits. For these reasons, case-based reasoning was chosen for implementation in Concur.
CBR works by searching a data base of prior observations for the one that best matches the current situation in the hope that it will provide the best predictor of what action should be taken. The key is the matching algorithm or, as it is sometimes called, the retrieval algorithm. The case that best matches the current situation is called the exemplar or the nearest neighbor. The action stored in the data base with the exemplar becomes the CBR's recommendation for solving the current problem. Whether the exemplar's action is acted upon automatically or not is dependent on the degree of match and the confidence expressed by the programmer or user.
Concur uses a nearest neighbor algorithm with a weighted distance. This is where the distance, or degree of mismatch, of each of the attributes is calculated and the more important attributes are given higher weights. The weighted distances are them summed to provide a total distance for each case and the case with the lowest distance is selected to become the exemplar.
Without time constraints, one could match against the entire case base and be assured of finding the best exemplar for the match criteria. Unfortunately, time is usually of the essence and case bases are often large. Indexing schemes help reduce the time to find the most relevant cases.
Choosing an index for a large case base is frequently more complicated than choosing an index for a traditional database. The problem can become even more interesting in machine learning situations where the new, learned cases are automatically added to the case base or the indices might have to change dynamically over time.
Under ideal conditions, one would like to be able to dynamically choose indices, but in a real-time environment with limited power and processor availability, these indices need to be carefully decided beforehand. The following discussion presents some the issues that must be considered and later, an algorithm is proposed to simplify and proceduralize the index design process.

Case-Based Reasoner Indexing Concepts
The purpose of an index is to subdivide a database for quick access to records of interest. In case-based reasoning, the goal of an index is to feed the matching algorithm a relatively small number of, hopefully, relevant cases. It is not sufficient to simply limit the number of cases, but to present the highest quality subset of cases. And, conversely, the most appropriate cases should not be overlooked simply because the case base was poorly indexed.
Here are several concepts and definitions relevant to choosing an indexing scheme.
Indexing Vocabulary
Each application area has its own words and jargon that have special meaning, perhaps to that subject matter area alone. These words are important descriptors that may be used to classify a case. It is therefore important to choose a vocabulary for describing attributes, or features, and representing cases before dwelling on the specific vocabulary for an index.
From the chosen vocabulary several related but mutually exclusive descriptors may be combined to define an attribute domain which, in turn, may be indexed. These selected descriptors are called the indexing vocabulary. [Kol93]
Surface vs. Structural features
It is important to understand the difference between a surface feature and a structural one. Surface features are straightforward; they are those attributes that are immediately observable. Structural features, on the other hand, usually involve deeper thematic relationships. Consider these examples. [Sei94]

Working late

X has choice of working all night and into the next day to finish project or get sleep and work the next day refreshed. The first provides more time while the second provides less, but higher quality, time.

"Flight of the Phoenix"

In the movie "Flight of the Phoenix", Jimmy Stewart had 7 remaining explosive charges for starting the engine. He could have used all 7 in one shot or he could have used a couple to clear the exhausts of sand and the remainder to actually start the engine.

The working late and Phoenix examples have no surface features in common but are perfect matches when looking at structural features. Both of them have a goal with limited resources and they must decide between the two plans on how to use that resource. Go for one do-or-die shot or some preparation for a briefer, but hopefully, higher quality second attempt? This is an example of an optimization step and a direct-use step.
Another example: consider chess and football. They have a lot in common yet the games seem dissimilar. [Kol93, Sei94] Chess is played on a checkerboard, football on a field; chess has pieces, football has people; chess turns move to the other side after each move, football changes turns when 10 yards are not gained in 4 plays. Both games share few surface features. Checkerboards are unlike football fields and pieces are unlike players.
But chess and football have several common structural features. Both games have two sides, both sides want to win but cannot, the sides are opponents rather than cooperators, both games involve pieces or players positioned on a playing field or a board and the concept of a forking move in chess is quite similar to the option play in football.
These structural features are more abstract and address deeper concepts that are not apparent without some analysis. To observe the similarity between the fork and the option play, some thinking is required. It is these structural attributes that might allow a seemingly dissimilar case to be the best match for formulating a recommendation.

Abstraction
If descriptors are too specific, good cases might not match well. Abstraction may be needed so that cases can be found when there are negligible distinctions. [Kol93] For example, apples, oranges, potatoes and squash may be the actual items in question, but it may be more appropriate to abstract to the level of fruits and vegetables so that yams will match favorably.
Concreteness
Concrete is the opposite of abstract. Hard numbers are concrete; different observers are likely to describe the situation the same way. At the other extreme, "good" and "bad" are abstract and different observers are not as likely to agree. An index should be concrete enough so that different observers will likely choose the same descriptors and find the same case repeatedly.
Casual vs. Correlated
It is helpful to understand the sometimes subtle difference between a causal relationship and a correlated one. Consider another example. [Sei94]

Missed Exit

Y misses a freeway exit by being in the left lane and it is too late to move over. Several days later it is also too late but Y notices a billboard. Third time, Y sees the billboard, moves over and takes the exit as planned.

There is no causal relationship here. The billboard, a concrete feature, is correlated with the desired outcome, the exit. One does not cause the other. They are simply in the same place with one being easier to see than the other. It is more coincidence, and therefore, the relationship is one of correlation.
The working late and Phoenix examples both have a causal relationships. The optimization step of sleeping enhances the following work day (sleep causes higher productivity) and clear exhausts promotes, or causes, a clean start.
Predictive and Useful Indices
In the missed exit example, the billboard feature is predictive because it happens to be correlated with the desired exit. In the working late example, the sleeping feature is predictive of a more productive day because of the causal relationship and likewise, in the Phoenix example, clearing the exhausts is predictive in the sense that a clean start in likely to follow. [Sei94]
All of this discussion leads to the primary point concerning indices. An index should be predictive to be useful. Indices are our tool to limit the search when retrieving a few cases from a large case library. To obtain relevant and useful cases, indices should predict them. Obviously, if an index were to point in some other direction and miss the best cases, it would not be very useful.
In summary, structural features may be more likely to help us solve complex problems and surface features are usually easier and quicker to describe. In both cases indices of these features are likely to reliably predict the utility of the retrieved information. [Sei94] This suggests that useful indices should be based on any features that are apparent at the time of a retrieval.
And "good indexes should be abstract enough to provide coverage but concrete enough to be recognizable." [Kol93]
Appropriateness conditions
A minor distinction between predictive features and appropriateness, or prerequisite, features should be made. It might seem that indices are simply the preconditions of a recommended solution, but not all preconditions are predictive of solution relevance. [Sei94] This distinction is more relevant to applications other than our meeting scheduler.

Proposed Procedure for Choosing Indices
The previous material addresses the many issues to be considered when identifying indices. To help apply the theory, here is a proposed procedure to define the case base data attributes and the steps one might take to choose specific indices for a case base.

describe the cases by building and applying a vocabulary. Use surface or structural features as appropriate. Be sure they are concrete (same values from multiple observers). If too fine grained, abstract as necessary and, perhaps, establish the abstracted descriptors as new attributes if maintaining the original detail is important.

define the matching algorithm as if all cases were available. Assign weights to the attributes as appropriate. Decide which features are most predictive.

rank the predictive features in order of their weights, as candidate indices.

prune the list of indices if it appears the case base will become too fragmented and important cases might get missed.

analyze the complexity, space and speed requirements to choose an indexing structure. Disk and memory requirements, efficiency and programming complexity should be considered.

Finding relevant cases through indices poses an interesting question. If a borderline case is otherwise a good candidate for being an exemplar but could be classified on either branch of an indexing tree structure, should the case be duplicated so that it can be found when appropriate? Redundancy like this may be wasteful and if arbitrarily applied, may break down when choices of indices and attribute weights are dynamically calculated.
Another solution is to consider searching the top two or three indices and merging the selections into one common set for nearest neighbor calculation. If a good case just misses being selected by one index and is truly a strong candidate, it will hopefully match the criteria of another index and be selected. This avoids a number of the problems previously mentioned. Concur will take advantage of this concept.

Continue

Original paper: Copyright � 1995 by Larry Bugbee
This HTML rendering: Copyright � 1997 by Larry Bugbee
All rights reserved.