Yeah, yeah, of course a computer won at a math competition. That’s not the point. This story, which concerns a rather amazing program called GeoS from the Allen Institute for Artificial Intelligence (AI2), is about the ability of AI to usefully engage with the world. To a computer, with a brain literally structured for these sorts of operations, the math SAT is not a test of calculation but of reading comprehension. That’s why this story is so interesting: GeoS isn’t as good as the average American at geometry; it’s as good as the average American at the SAT itself.
Specifically, this AI program scored 49% accuracy on official SAT geometry questions, and 61% on practice questions. The 49% figure is basically identical to the average for real human test-takers. The program was not given digitized or specially labeled versions of the test, but looked at the exact same question layout as real students. It read the writing. It interpreted the diagrams. It figured out what the question was asking, and then it solved the problem. It got the right answer only about half the time, which makes it roughly as fallible as a human being.
Of course, GeoS makes errors for different reasons than high-schoolers do. A human being might correctly interpret the question, then apply the wrong formula or muck up the calculation. GeoS, being a computer, will virtually always get the correct answer so long as it truly understands the question. Where it stumbles is earlier: it might not be able to read a word correctly, or the grammar of a question might be too alien for it to parse. Regardless, what we’re really measuring here is the computer’s ability to understand human communication in a form that’s deliberately (pardon the pun) obtuse.
To do this, the researchers had to smash together a whole array of different software technologies. GeoS uses optical character recognition (OCR) algorithms to read the text, and custom language processing to try to understand what it reads. Geometry questions are structured to be difficult to parse, hiding important information as inferences and implications.
The other side of the coin is that though geometry questions are dense and hard to tease apart, they’re also extremely uniform in structure and subject matter. The AI’s programmers can plan for the strict design principles that go into writing the questions. You couldn’t take this same programming and directly apply it to calculus problems, for instance, because those use somewhat different language and mathematical symbols to describe the problem. But a good GeometryBot would also be relatively easy to adapt to those few distinguishing rules. Each successive new area of competence would make the next one easier to acquire.
One intriguing implication of this research is that someday, we might have algorithms quality-checking SAT questions. We could have different AI programs intended to achieve different levels of success on average questions, perhaps even for different reasons. Run proposed new questions through them, and their relative performance could not only weed out bad questions but also point to the source of the problem. Say BadAtReadingAI and BadAtLogicAI did as expected on a question, but BadAtDiagramsAI did terribly: maybe the drawing simply needs to be a little clearer.
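To make that idea concrete, here is a minimal Python sketch of how such a check might work. Everything in it is hypothetical: the `SolverProfile` class, the `diagnose_question` function, the tolerance, and all of the accuracy numbers are invented for illustration and have nothing to do with how GeoS itself is built. The only assumption is that each imaginary solver has a known baseline accuracy on average questions, so a large drop on one candidate question hints at where that question is flawed.

```python
from dataclasses import dataclass


@dataclass
class SolverProfile:
    """A hypothetical solver with one known weakness and a baseline accuracy."""
    name: str
    weakness: str             # e.g. "reading", "logic", "diagrams"
    expected_accuracy: float  # fraction of average questions it usually gets right


def diagnose_question(profiles, observed_accuracy, tolerance=0.15):
    """Return the weaknesses of solvers that did much worse than expected.

    observed_accuracy maps a solver's name to the fraction of trial runs
    in which it answered the candidate question correctly.
    """
    flagged = []
    for profile in profiles:
        shortfall = profile.expected_accuracy - observed_accuracy[profile.name]
        if shortfall > tolerance:
            flagged.append(profile.weakness)
    return flagged


# Invented baselines: each solver gets about half of average questions right.
profiles = [
    SolverProfile("BadAtReadingAI", "reading", 0.50),
    SolverProfile("BadAtLogicAI", "logic", 0.50),
    SolverProfile("BadAtDiagramsAI", "diagrams", 0.50),
]

# Invented results for one proposed question.
observed = {
    "BadAtReadingAI": 0.48,
    "BadAtLogicAI": 0.52,
    "BadAtDiagramsAI": 0.10,
}

print(diagnose_question(profiles, observed))  # ['diagrams']
```

With these made-up numbers only the diagram-impaired solver falls well short of its baseline, so the sketch points at the drawing rather than the wording or the logic.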
This isn’t a sign of the coming AI-pocalypse, or at least not a particularly immediate sign; as dense as geometry questions might be, they’re homogeneous and nowhere near as complex as something like conversational speech. But this study shows how the individual tools available to AI researchers can be assembled to create rather full-featured artificial intelligences. Things will really take off when those same researchers start snapping those amalgamations together into something far more versatile, something not entirely unlike a real biological mind.