The Diagram tool is one of the primary features in Think Kit, a new set of tools within our app Paper and the result of over four years of pen-and-touch input research. Diagram's hallmark feature is shape detection, which allows people to draw perfect shapes and connectors, while still maintaining the quality of the human hand.

When using Diagram, strokes drawn in Paper are initially light blue, and only change to the final stroke color once you lift your finger or stylus. This change in color indicates the temporary nature of the initial stroke drawn, which is then analyzed by Paper. If a simple shape is detected, it's replaced with a cleaner shape that still retains the stroke's hand-drawn look.

Similarly, when drawing connectors between shapes, Diagram both straightens the line and neatly places it close to the shapes it connects. Diagram also shows two indicators at the connector's endpoints that, when tapped, place arrowheads on the canvas. Even with all this machinery in the Diagram tool, however, we worked hard to leave your handwriting and freeform sketches untouched.

Detecting the intention behind a hand-drawn stroke is not easy. Imagine someone hands you a piece of paper. On that paper is a single curve—sort of circular, but with the left side a bit straighter than the right. Deciding whether that shape is a square, a circle, or something else entirely (say, a capital ‘D’) can be challenging for actual humans, possibly even undecidable.

The space of all possible touch inputs is a huge continuum in which squares warp smoothly into rounded rectangles, text, and circles. When you then add in lines and connectors (e.g. the letter “L” vs. a right-angled connector), the situation becomes even more challenging given these are fundamentally ambiguous. Consider what you can easily do with a single whiteboard marker:

- Create shapes like rectangles and ellipses
- Create lines, arrows, and connectors between shapes
- Create free form pen strokes, e.g. for text labels and small sketches

The space of all such strokes can be thought of as a linear continuum: on one end are closed shapes like triangles and squares; on the other is handwriting. In between are inputs which could be classified, by computers and humans alike, as either geometric objects or handwriting. The traditional way to resolve this ambiguity is to create multiple tools: one for squares, another for text, another for manipulation, and so on. Building a “polygon” tool that always straightens lines, for example, is clearly the easier path.

Changing tools, however, breaks the mental flow and slows input. Our aim with Think Kit was to bring Paper's immediacy to structured thinking by creating a single “diagram” tool that spans the continuum. This is an immense challenge: resolving the inherent ambiguity requires understanding touch input well enough to draw reliable conclusions about the creator's intent. The engine must be able to differentiate a capital "D" from a shape, or a capital "L" from a right-angled connector.

Think Kit's “Intention Engine” is our way of navigating this space. It's akin to auto-tune for drawing, tangibly retaining qualities of the human hand while simultaneously making drawings appear more symmetric and even. Our algorithms are guided by the intention of a sketching hand, as captured by velocity, statistics of human motion, and multi-scale analysis.

In the following sections we’ll explain two of the signature features of Think Kit: shape detection itself, and then the unique way in which we use it to render organic-looking ink. The first brings structure to the continuum via geometry and statistics; the second uses the continuous nature of the input to make the output look natural.

### Problem and Architecture

As mentioned above, the fundamental problem in creating Think Kit's “Diagram” tool is the continuum of input: there's no clear division between shapes, connectors, and everything else. Think Kit approaches this by first making its best attempt to understand the input, considering both what was drawn and how the user drew it. It also uses shape detection in a unique way: ink rendering is tightly integrated with the recognition and correction steps, and the degree of correction adapts both to what the user drew and how they drew it.

Detection and rendering proceed in several stages. The engine first performs a wavelet-like analysis of the curve, using detectors that look for primitive features: lines, circular arcs, two lines meeting at an angle, and so on. From this output, a short list of possible corners is identified; finally, we prune corners using statistical methods that discard less significant features based on both geometry and intent. Given two similar corner strokes, we are less likely to discard the one drawn slowly and carefully. The result is a decomposition of the curve into an n-sided polygon, along with a measure of goodness-of-fit, which we'll discuss more below.

This output is then fed into a stage which makes decisions about what the shape is (for example, square vs rectangle vs parallelogram) and how much correction to apply. The Intention Engine controls Ink Engine parameters and stroke replay, sometimes several times over. In order to make its decisions, it keeps track of past strokes and their timing, as well as the location and other metadata of all detected shapes. Crucially, correction is never total: some of each stroke's imperfection is deliberately preserved. This serves two purposes. First, it allows the user to focus on the content instead of obsessing about the precise placement of shapes relative to each other. Second, this fuzzy, continuous sort of classification is a way of respecting the continuum of input strokes; the engine can mitigate the visual impact of classification mistakes in ambiguous situations.

### Curve Decompositions and Goodness of Fit

Decomposing a curve into an optimal polygon is all about fitting corners and lines. Doing this robustly requires a mathematical model for corners and lines that understands natural drawing. If people drew neatly, shape detection would be easy. But people don’t draw neatly, and finding a corner in actual data is hard. For example, up close, a corner often looks like a circular arc - it only becomes a corner when it has lines on either side.

In other words, you need to consider both the primitive features you've detected and their relative scales. Further, even if all you want to detect are squares, you'll still need to look for arbitrary n-gons, or you'll incorrectly decide that a 5-sided polygon is a square whenever two of its sides meet at a shallow angle.

One must also consider scale. When you're looking for lines, it's natural to fit a line and measure its goodness of fit with some metric. The choice of metric is complicated (to the degree that it's actually an area of active academic research), and we won't say much about that here. Instead we'll present our examples using the most well-known metric: classical squared error between two sets of sample points. In this setting, a statistic known as $R$-squared is about the simplest possible measure of goodness. From the touch sensor, we are given a set $p_{i}$ of sample points which arrive at times $t_{i}$. If we fit a line $L$ to the sample points, we can measure the *residual* $R_{L}$, which is the amount of error in our fit:

$$R_{L} \equiv \sum_{i = 0}^{N} \| L(t_{i}) - p_{i} \|^{2}$$

We also measure the total sum of squares of the sample points

$$T_{p} \equiv \sum_{i=0}^{N} \| p_{i} - \bar{p} \|^{2}$$

about their mean:

$$\bar{p} \equiv \frac{1}{N} \sum_{i=0}^{N} p_{i}$$

Finally we can inspect the goodness of fit using the $R$-squared statistic:

$$R^{2}(L, p) = 1 - \frac{R_{L}}{T_{p}}$$

This should be interpreted as the fraction of the total sum of squares which is explained by the model. If your model is a perfect explanation, then the residual is zero and $R^{2}$ is one. As the fit gets worse, $R^{2}$ decreases and can even be negative. Let's look at an example. In the left figure below, we have a sinusoidal curve, and we're fitting lines of various lengths about a central point. If we plot $R^{2}$ as a function of line length, we see the following:

On the left, we see least-squares lines of various lengths passing through the central point of the curve. On the right, the local maxima of the $R^{2}$ statistic indicate scales at which the curve looks most like a straight line, relative to the error. The sharp drop in $R^{2}$ corresponds to the leading and trailing segments where the curve makes sharp turns away from the line-like part. At small scales, every smooth arc of a curve will be well-approximated by a line.

The local maxima of $R^{2}$ have identified the length scales at which the curve looks most like a line, relative to the local noise level. Notice that we have more than one local maximum. This is much like looking at a picket fence: up close, it looks like a number of short line segments meeting at right angles. From far away, it looks like a single long straight line. Using these statistics, we can quickly identify the possible places at which a curve might have polygonal sides, and rule out all others. Similarly, by fitting two lines we can find corners, and so on. A fast dynamic programming algorithm can be employed to find the optimal decomposition of the curve into an N-gon.
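To make the statistics concrete, here is a small Python sketch of this multi-scale scan. It is illustrative only, not Think Kit's code: the function names and the synthetic curve are ours, and the line fit is a total-least-squares fit via SVD, scored with the $R^2$ statistic defined above.

```python
import numpy as np

def r_squared_line_fit(points):
    """R^2 of a total-least-squares line fit to a run of 2D samples."""
    p = np.asarray(points, dtype=float)
    centered = p - p.mean(axis=0)
    tss = (centered ** 2).sum()             # total sum of squares T_p
    if tss == 0.0:
        return 1.0                          # degenerate: all samples coincide
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Residual = energy perpendicular to the best-fit direction vt[0].
    residual = tss - ((centered @ vt[0]) ** 2).sum()
    return 1.0 - residual / tss

def line_likeness_by_scale(points, center_idx, max_half_width):
    """R^2 of line fits in windows of growing length about a central sample."""
    return [r_squared_line_fit(points[center_idx - h : center_idx + h + 1])
            for h in range(2, max_half_width + 1)]

# A gently wavy straight run flanked by two sharp hooks.
t = np.linspace(0, 1, 41)
middle = np.stack([t, 0.02 * np.sin(8 * t)], axis=1)
hook_a = np.stack([np.zeros(10), np.linspace(1.0, 0.05, 10)], axis=1)
hook_b = np.stack([np.ones(10), np.linspace(0.05, 1.0, 10)], axis=1)
curve = np.concatenate([hook_a, middle, hook_b])
scores = line_likeness_by_scale(curve, center_idx=30, max_half_width=29)
# R^2 stays near 1 while the window covers only the straight-ish run,
# then drops once the window reaches the hooks at either end.
```

The local maxima of `scores`, as a function of window width, are exactly the "line-like scales" described above.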

As mentioned above, it is not sufficient to simply fit one model class, such as squares, so let's examine another motivating example. When looking for straight lines, you will have much better luck if you fit multi-sided polygons as well. Again, let's look at the $R$-squared statistics for fitting a single line to a 3-sided polyline:

On the left we have a 3-sided open polygon with a very short central side. On the right, we have plotted the goodness-of-fit when fitting one single line to all three segments in the figure at left. When the line is very short, the central horizontal segment on the left is a good fit. This good fit is initially followed by a sharp drop as the line grows longer. However, as the line becomes longer still, the relative error decreases: fitting a vertical line to all three sides is a decent fit, since the error from the central portion is relatively small. The apparent goodness rises.

A naive algorithm might look at the large-scale local maximum on the right and declare the single line segment to be a good fit, which would be a serious error: the correct answer is obviously that a line through the central point should only extend to the ends of the small central segment. When fitting our models, we use our fast solver to simultaneously find a whole family of optimal N-gons, where N goes from 1 to an essentially unlimited number of sides. We then start with the largest possible N and decrease the number of sides. Removing sides typically results in small increases in error until we remove an “important” side (e.g. the short central side in the example above), which creates a sudden, large increase in error along that side. At this point the algorithm terminates and returns a 3-sided polygon.
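The family-of-N-gons idea can be sketched with a small dynamic program. This is a simplified stand-in for the fast solver described above, with invented names and an invented `jump_factor` heuristic: `segment_cost` measures the squared perpendicular error of a line fit to one run of samples, `best_polylines` computes the best achievable error for every segment count, and `choose_sides` walks down from many sides until removing one causes a large jump in error.

```python
import numpy as np

def segment_cost(points, i, j):
    """Squared perpendicular error of a line fit to points[i..j] (inclusive)."""
    p = points[i : j + 1]
    if len(p) < 3:
        return 0.0                       # two points always fit a line exactly
    centered = p - p.mean(axis=0)
    _, s, _ = np.linalg.svd(centered, full_matrices=False)
    return float(s[1] ** 2)              # energy off the best-fit direction

def best_polylines(points, max_sides):
    """errors[k-1] = smallest total error of any k-segment fit of the curve."""
    n = len(points)
    cost = {(i, j): segment_cost(points, i, j)
            for i in range(n) for j in range(i + 1, n)}
    INF = float("inf")
    # dp[k][j] = best error covering points[0..j] with k segments.
    dp = [[INF] * n for _ in range(max_sides + 1)]
    dp[0][0] = 0.0
    for k in range(1, max_sides + 1):
        for j in range(1, n):
            dp[k][j] = min(dp[k - 1][i] + cost[(i, j)] for i in range(j))
    return [dp[k][n - 1] for k in range(1, max_sides + 1)]

def choose_sides(errors, jump_factor=4.0, floor=1e-6):
    """Drop sides until removing one would cause a sudden, large error jump."""
    k = len(errors)                      # start with the most sides
    while k > 1:
        if errors[k - 2] > jump_factor * max(errors[k - 1], floor):
            break                        # an "important" side would be lost
        k -= 1
    return k

# A 3-sided open polyline with a very short central side, as in the example.
a = np.stack([np.zeros(20), np.linspace(1, 0, 20)], axis=1)       # left side
b = np.stack([np.linspace(0, 0.3, 6), np.zeros(6)], axis=1)       # short bottom
c = np.stack([np.full(20, 0.3), np.linspace(0, 1, 20)], axis=1)   # right side
stroke = np.concatenate([a, b, c])
errs = best_polylines(stroke, max_sides=6)
n_sides = choose_sides(errs)
# Dropping from 3 to 2 segments erases the short central side, so the
# error jumps and the chosen decomposition has 3 sides.
```

A production solver would share work across segment counts rather than tabulating every pairwise cost, but the model-selection step, stopping just before the first large error jump, is the same idea.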

### Velocity Response

Geometry itself is only part of the story, however. By also considering how carefully and deliberately something was drawn, we can resolve many ambiguous situations and adjust error thresholds when deciding what was drawn. Two curves may look identical, but if one was drawn very steadily over the course of 3 seconds and the other was dashed off in 100ms, it seems quite natural to make different decisions about the intent behind each. Slow, deliberate motion signals intent to do *something* — the engine then needs to decide what was intended. For example, it might reduce the error required to preserve a side, or allow two sides to meet at shallower angles, if both were drawn deliberately.
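One plausible way to fold velocity into the decision can be sketched in a few lines of Python. Everything here is hypothetical, the function names and all constants are invented for illustration, not FiftyThree's actual tuning; the point is only that local drawing speed near a candidate corner scales the error threshold needed to keep it.

```python
import numpy as np

def local_speed(points, times, idx, half_window=3):
    """Average drawing speed (distance over time) in a window around sample idx."""
    lo = max(idx - half_window, 0)
    hi = min(idx + half_window, len(points) - 1)
    dist = np.linalg.norm(np.diff(points[lo : hi + 1], axis=0), axis=1).sum()
    dt = times[hi] - times[lo]
    return dist / dt if dt > 0 else float("inf")

def corner_keep_threshold(speed, base=1.0, careful_speed=50.0, hasty_speed=500.0):
    """Hypothetical mapping from local speed to the error needed to keep a corner.

    Slow, deliberate motion lowers the bar (weaker geometric evidence
    suffices to preserve the corner); fast motion raises it.
    """
    t = float(np.clip((speed - careful_speed) / (hasty_speed - careful_speed),
                      0.0, 1.0))
    return base * (0.5 + 1.5 * t)      # careful: 0.5x base error; hasty: 2x

# Two geometrically identical strokes; one pauses (bunched timestamps) mid-stroke.
pts = np.stack([np.linspace(0, 10, 11), np.zeros(11)], axis=1)
fast_times = np.linspace(0.0, 0.1, 11)     # whole stroke in 100 ms
slow_times = fast_times.copy()
slow_times[5:] += 1.0                      # a deliberate pause at sample 5
fast_thr = corner_keep_threshold(local_speed(pts, fast_times, 5))
slow_thr = corner_keep_threshold(local_speed(pts, slow_times, 5))
# slow_thr < fast_thr: the pause makes a corner there easier to keep
```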

In this example, the polygon on the right has only slightly more geometry (a hint of a corner on the bottom side) to suggest five sides rather than four. However, the user slowed near the corner, indicating intent. The recognizer interpreted this as an attempt to create a corner, and allowed the 5-sided model. On the left, the bottom arc was drawn quickly. Even though the geometry is nearly identical, the recognizer produced a 4-sided polygon.

Another example shows lines at multiple scales in the same polygon. If the user had drawn more deliberately, the right side of the figure might have been broken into two lines, with a short segment added near the top right corner. In this case, however, the drawing was simply interpreted as a sloppy straight line. The short line segments on the left were still recognized, though, since our error thresholds are local, not global.

### Classifying Shapes

When the previous stage is complete, we have a decomposition of the curve into a number of arcs, with one arc per side of the polygon, and each with an associated goodness of fit score. At this point, we need to once again balance intent and geometry to decide exactly what the user meant to draw. At launch, Think Kit cleans up three classes of shapes - ellipses, triangles and quadrilaterals - and each class has a set of regularized variants: circles, equilateral, isosceles and right triangles, rectangles/squares, parallelograms, diamonds and trapezoids.

For each shape class, we solve for a best-fit shape by optimizing particular kinds of metric error, subject to constraints. For example, a parallelogram can be fit by minimizing metric error, subject to some constraints on the direction vectors for opposing sides. Rectangles can be fit by including constraints on all four side directions, and so on. Let's assume that we've detected four sides in the input samples { $p_i$ } for $i = 0, 1, \ldots, N$, with corners at the samples with indices $n_{1}, n_{2}, n_{3}$ (drawn at times $t_{n_i}$). We need to further classify the type of shape. We can simultaneously fit four line segments $L_i^{*}$ :

\begin{align} L_{1}^{*} & \equiv \arg \min_{L_{1}} \sum_{i=0}^{n_{1}} \| P_{L_{1}}(p_{i}) - p_{i} \|^{2} \\\\ L_{2}^{*} & \equiv \arg \min_{L_{2}} \sum_{i=n_{1}}^{n_{2}} \| P_{L_{2}}(p_{i}) - p_{i} \|^{2} \\\\ L_{3}^{*} & \equiv \arg \min_{L_{3}} \sum_{i=n_{2}}^{n_{3}} \| P_{L_{3}}(p_{i}) - p_{i} \|^{2} \\\\ L_{4}^{*} & \equiv \arg \min_{L_{4}} \sum_{i=n_{3}}^{N} \| P_{L_{4}}(p_{i}) - p_{i} \|^{2} \end{align}

where $P_L(p)$ is the projection of the point p onto the line L. (If we were fitting a single line, this would typically be accomplished using a singular value decomposition (SVD) of the Gram matrix associated with the points.) Further, for each type of shape we’ll impose some constraints on the optimization to ensure the sides are oriented correctly. This can be accomplished by adding constraints on the normal vectors to each side, as shown:
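As a concrete sketch of that parenthetical remark, here is a single-line fit in normal form via an SVD of the centered sample matrix (equivalent to an eigen-decomposition of the associated Gram/covariance matrix). The function names are ours.

```python
import numpy as np

def fit_line_normal_form(points):
    """Total-least-squares line in normal form: returns (n, c) with ||n|| = 1
    minimizing sum_i (n . p_i + c)^2. The optimal n is the right singular
    vector of the centered samples with the smallest singular value, and the
    optimal offset is c = -n . mean."""
    p = np.asarray(points, dtype=float)
    mean = p.mean(axis=0)
    _, _, vt = np.linalg.svd(p - mean, full_matrices=False)
    n = vt[-1]                 # direction of least variance = unit normal
    c = -n @ mean
    return n, c

def residual(n, c, p):
    """Signed distance from p to the line n_x p_x + n_y p_y + c = 0."""
    return n @ np.asarray(p, dtype=float) + c

# Example: samples along y = x; the fitted line has zero residual on them.
n, c = fit_line_normal_form([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

The signed `residual` here is exactly the quantity $c + n_{x}p_{x} + n_{y}p_{y}$ discussed below.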

If $n_{k}$ is the unit normal vector to the $k$-th side, a parallelogram is defined by the additional constraints

$$n_{1} = -n_{3}$$ $$n_{2} = -n_{4}$$

to ensure we have two sets of parallel sides. Other polygons have similar, but slightly different constraints on the solution. In general, if a line L has unit normal vector $n = (n_{x}, n_{y})$ and $p = (p_{x}, p_{y})$ is a point in the plane, then the line will satisfy an equation of the form

$$n_{x}p_{x} + n_{y}p_{y} + c = 0$$

If $p$ does not lie on $L,$ then the residual $r$ (the signed distance from $p$ to $L$) is given by

$$r = c + n_{x}p_{x} + n_{y}p_{y}$$

Minimizing these residuals yields a system of N linear equations subject to constraints on the normals. The metric error keeps the resulting polygon true to the data, and the constraints enforce conditions on the fit shape. This constrained minimization admits an efficient solution using QR decomposition and a single SVD on a small linear system. For each class, the optimization returns the best-fit shape and a fit error. We can then weight those error numbers against each other, with error tolerances all adaptive to perceived intent. We then decide under what circumstances a box, for example, can override a parallelogram, even though its error was higher.
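To illustrate the idea, though not the exact QR/SVD solver described above, here is a simplified parallelogram fit: the constraints $n_1 = -n_3$ and $n_2 = -n_4$ are enforced by averaging the independently fitted normals of opposite sides, each offset is the least-squares choice for a fixed normal, and the vertices come from intersecting adjacent lines. All names are ours, and the averaging step is an approximation of the true constrained minimization.

```python
import numpy as np

def side_normal(points):
    """Unit normal of a total-least-squares line fit to one side's samples."""
    p = np.asarray(points, dtype=float)
    _, _, vt = np.linalg.svd(p - p.mean(axis=0), full_matrices=False)
    return vt[-1]                       # direction of least variance

def fit_parallelogram(sides):
    """Fit a parallelogram to four runs of samples, one per side, in drawing order."""
    means = [np.asarray(s, dtype=float).mean(axis=0) for s in sides]
    raw = [side_normal(s) for s in sides]
    # Fitted normal signs are arbitrary: flip each opposite side to agree,
    # then average, enforcing n1 = -n3 and n2 = -n4.
    n13 = raw[0] + np.sign(raw[0] @ raw[2]) * raw[2]
    n13 /= np.linalg.norm(n13)
    n24 = raw[1] + np.sign(raw[1] @ raw[3]) * raw[3]
    n24 /= np.linalg.norm(n24)
    normals = [n13, n24, -n13, -n24]
    # For a fixed normal, c = -n . mean minimizes the per-side residuals.
    offsets = [-n @ m for n, m in zip(normals, means)]
    # Vertex k is the intersection of side k and side k+1: n . x = -c for both.
    verts = []
    for k in range(4):
        A = np.stack([normals[k], normals[(k + 1) % 4]])
        b = -np.array([offsets[k], offsets[(k + 1) % 4]])
        verts.append(np.linalg.solve(A, b))
    return np.array(verts)

def sample(a, b, n=8):
    """n points evenly spaced from a to b (a synthetic 'side' of samples)."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) * np.asarray(a, dtype=float) + t * np.asarray(b, dtype=float)

# Example: exact parallelogram with vertices (0,0), (4,0), (5,2), (1,2).
V = [np.array([0.0, 0.0]), np.array([4.0, 0.0]),
     np.array([5.0, 2.0]), np.array([1.0, 2.0])]
sides = [sample(V[k], V[(k + 1) % 4]) for k in range(4)]
verts = fit_parallelogram(sides)   # recovers the four vertices (rotated order)
```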

### Rendering Shapes and Connectors

In our model of the space of all curves, polygons typically lie at the opposite end of the spectrum from handwriting. This clear separation means our polygon detection is reliable enough that the engine can make a clear decision about whether a stroke is a shape or not. As described above, if a stroke is recognized to be close enough to any of these regularized shapes, it is corrected towards that shape. If it appears to be polygonal, but the sides don't seem to form one of our recognized classes, then the sides are simply straightened out between the corners. Even though recognition is unambiguous, the cleanup considers intent by looking at how neatly and carefully the shape was drawn. This way, the correction feels more like nudging a stroke into a shape and less like gesture-based shape creation.

For lines, connectors, and circles, however, the fundamental geometric situation is quite different. These kinds of shapes lie in the middle of the spectrum, somewhere between obvious polygons and handwriting (most alphabets, for example, are full of straight line segments). Instead of making a binary decision on whether or not to straighten a line, we therefore treat line straightening as a continuous spectrum. Very long, fast strokes are straightened the most, while we apply less straightening to very short or deliberate strokes. No line segment is ever perfectly straightened.
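A minimal sketch of this continuous straightening, with invented names and constants: the blend factor grows with stroke length and speed but is capped below 1, so no line is ever perfectly straightened.

```python
import numpy as np

def straightening_amount(points, times, full_speed=800.0, full_length=400.0):
    """Hypothetical blend factor in [0, 0.95]: how far to pull a stroke
    toward its fitted line. Long, fast strokes approach (but never reach)
    full straightening; short or deliberate strokes are left mostly alone."""
    p = np.asarray(points, dtype=float)
    length = np.linalg.norm(np.diff(p, axis=0), axis=1).sum()
    duration = times[-1] - times[0]
    speed = length / duration if duration > 0 else 0.0
    blend = min(speed / full_speed, 1.0) * min(length / full_length, 1.0)
    return 0.95 * blend        # capped below 1: never perfectly straight

def straighten(points, amount):
    """Interpolate each sample toward its projection on the TLS line."""
    p = np.asarray(points, dtype=float)
    mean = p.mean(axis=0)
    _, _, vt = np.linalg.svd(p - mean, full_matrices=False)
    d = vt[0]                                    # best-fit line direction
    proj = mean + np.outer((p - mean) @ d, d)    # projections onto that line
    return (1 - amount) * p + amount * proj

# The same wavy stroke, drawn fast (200 ms) vs. deliberately (3 s).
wavy = np.stack([np.linspace(0, 300, 50),
                 5 * np.sin(np.linspace(0, 6, 50))], axis=1)
a_fast = straightening_amount(wavy, np.linspace(0.0, 0.2, 50))
a_slow = straightening_amount(wavy, np.linspace(0.0, 3.0, 50))
smoothed = straighten(wavy, a_fast)   # straighter, but the wiggle never fully vanishes
```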

In some cases, we are able to use context or gestures. If a stroke starts or ends in a shape, but also lies partially outside the shape, Think Kit assumes it is an arrow or connector between shapes. To help sketch clean arrows, blue arrowheads are faded in at each end point, but fade out again if they remain untapped for some time. Connector lines are always straightened. A deliberate pause at the end of a stroke signals to the Intention Engine that a more nuanced correction should be applied: for shapes, the corners remain unchanged and the stroke is only straightened for each segment. Line and free form strokes receive optional arrowheads, and connectors are no longer trimmed to shape boundaries.
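The connector heuristic in the first sentence can be sketched with a point-in-polygon test. This is our hypothetical reconstruction, not Think Kit's geometry code, which, as noted below, maintains a pixel-accurate map of shape borders.

```python
def point_in_polygon(pt, poly):
    """Even-odd ray casting: is pt inside the polygon (list of vertices)?"""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):              # edge crosses the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def looks_like_connector(stroke, shapes):
    """The rule from the text: a stroke that starts or ends inside a shape,
    but also travels outside it, is an arrow/connector candidate."""
    touches = any(point_in_polygon(stroke[0], s) or point_in_polygon(stroke[-1], s)
                  for s in shapes)
    escapes = any(not any(point_in_polygon(p, s) for s in shapes) for p in stroke)
    return touches and escapes

square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
exits_shape = [(1.0, 1.0), (3.0, 1.0), (5.0, 1.0)]   # starts inside, leaves
stays_inside = [(0.5, 1.0), (1.0, 1.0), (1.5, 1.0)]  # contained: not a connector
```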

Nesting and grouping offers yet another form of context. The Intention Engine not only maintains a pixel-accurate map of the location of all shapes and their borders, but also understands shapes nested inside one another. This additional contextual knowledge allows a connector to be precisely trimmed or snapped to a shape's border, even across different shape hierarchies. It also allows nested regions to move with their parent when it is dragged.

### The Prototyping Story: Design and Engineering

One of the most interesting production aspects of this project is the number of prototypes we built along the way. FiftyThree has always had a strong prototyping discipline, yet Think Kit required even more than usual due to the design and engineering complexity.

Prototypes are like time-travel machines. They allow teams to travel into the future and quickly glimpse how approaches may play out, in terms of both user experience and technical feasibility. Some early prototypes provided powerful algorithms for analyzing the nature of strokes. Others demonstrated a novel approach for cleaning up shapes while preserving their organic nature.

We also created hybrid prototypes that explored various capabilities of these engines in tandem with each other. Some prototypes favored freeform drawing, for example, while others corrected shapes more aggressively as you drew. Some prototypes employed heavier visual cues, while others provided a clean canvas.

Each approach revealed its strengths and weaknesses in terms of what it communicates to the user, allowing us to define and fine tune the personality of the tool. Should it be assertive or suggestive? Flexible or opinionated? Ideally just the right amount of each, matching the user’s intent.

Prototypes are also handy for user testing. Instead of handing our testers a Keynote presentation and asking them to “imagine if this was real,” we gave them actual working software, rendering the tests more accurate and valuable.

The road to a polished product is long and hard, and prototyping helped us plot a map of where the treacherous swamps may be hiding. We developed over a dozen full-fledged prototypes in total, each of which could probably be its own app. This was key in allowing us to mold Think Kit into its current state, and get a common intuition for where it should head next.

### Future Directions

In building Think Kit we gained key insights in several areas, from mathematics, to interaction design, to graphics. There are many ways to bring structure to this continuum of inputs, and we now understand much more about where the resulting ambiguities lie, and which are best resolved with science, engineering, or interaction design. Think Kit uses these insights to create a digital whiteboard marker, but the understanding and techniques we evolved along the way have much broader applications. We hope we’ve piqued your interest.