Model behavior: Waite teaching machine learning via March Madness

Scott Schrage | University Communication / Shutterstock

Zig, or go Zags? Favor new blood or blue blood? Dance with Cinderella or a stepsister?

Every March Madness bracket is a bet (often literally) on one of 9.2 quintillion possible permutations of winners and losers, front-runners and dark horses, drowsy blowouts and rousing upsets.

While out walking his dog, the University of Nebraska–Lincoln’s Matt Waite realized that the annual rite of spring and college basketball was also an ideal opportunity to apply some lessons he was teaching in Sports Media and Communication 460: Advanced Sports Data Analysis. Some of the 19 undergrads in the class might be unfamiliar with, if not openly wary of, the quantitative realm, but most were already planning to fill out brackets. So he decided to turn the ritual exercise into a class exercise.

“To tell you the truth, it wasn’t on the syllabus when I started the class,” said Waite, professor of practice of journalism and mass communications. “Between the way that the course schedule was working out, the progression that the students were making, and the timing of the tournament, it just sort of all came together.

“That’s something that I really, really try to do in my sports data classes, is make examples of the moment.”

To Waite’s mind, March Madness is especially suited to teaching the fundamentals of machine learning — in the simplest terms, feeding data into a computer algorithm for the sake of training it to predict future outcomes. Analytically inclined college basketball fans and bettors have increasingly looked to machine learning for an assist when filling out their brackets. Waite has even built his own models on the foundations of books like “Basketball on Paper” and other sacred texts of analytics.

“I wanted to have sports communicators dip their toes into the waters of machine learning and predictive analytics — where the tools of doing this have become easy enough to use, but understanding what’s going into the algorithm, and what’s coming out of it, takes some work,” he said. “But once you have some key concepts, you can communicate with it. You can tell stories with the output.”

Waite began by giving his SPMC 460 students access to the box scores of every men’s college basketball game going back to the 2014-15 season. (He tried to do the same for the women’s tournament, but despite his ongoing efforts, a lack of available data made it unworkable. “There is sexism in sports data, just as there is in sports in general. Game-level statistics for women’s basketball are vastly more difficult to get your hands on than men’s,” Waite said.)

Those box scores were stuffed with the raw statistics used to calculate more advanced metrics that have historically proven predictive of successful teams: average margin of victory, points scored per possession, shooting percentages, turnovers, offensive rebounding rates, and so on. But it was up to each student to decide which statistics they would feed into an algorithm, and which of three algorithms would consume those stats.

“Machine learning is not magic, and the algorithms are doing a very specific thing: using input that you give them and coming up with answers,” Waite said. “And you, as a human being, need to be able to evaluate those.”

With those fateful decisions made, the students tested their inputs and algorithms by asking the latter to predict the winners of games that had already been played but whose outcomes were a mystery to the machine. After some fine-tuning, the students were ready to run their newly trained algorithms through the bracket-busting gauntlet of March Madness, picking all 63 games (not including the so-called First Four) ahead of time.
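The testing step described here, predicting games that have already been played but were hidden from the model, is commonly called holdout validation. The Python sketch below illustrates the idea under loud assumptions: the game records are randomly generated rather than real box scores, and the "model" is a deliberately simple stand-in (a single learned cutoff on the difference in shooting percentage), not any of the three algorithms the class used. The names `make_games`, `train_threshold` and `accuracy` are hypothetical.

```python
import random

def make_games(n, seed=0):
    """Generate hypothetical games as (shooting_pct_diff, team_a_won)."""
    rng = random.Random(seed)
    games = []
    for _ in range(n):
        diff = rng.uniform(-10.0, 10.0)  # team A's FG% minus team B's FG%
        # The better-shooting team usually wins, but noise allows upsets.
        won = (diff + rng.gauss(0, 5.0)) > 0
        games.append((diff, won))
    return games

def accuracy(games, threshold):
    """Share of games where 'diff > threshold' matches the real result."""
    correct = sum((diff > threshold) == won for diff, won in games)
    return correct / len(games)

def train_threshold(train):
    """'Train' by picking the cutoff that scores best on the training games."""
    candidates = sorted(diff for diff, _ in train)
    return max(candidates, key=lambda t: accuracy(train, t))

games = make_games(500)
random.Random(1).shuffle(games)

# Hold out 20% of games: the model never sees their outcomes while training.
split = int(0.8 * len(games))
train, test = games[:split], games[split:]

threshold = train_threshold(train)
test_accuracy = accuracy(test, threshold)
print(f"Learned cutoff: {threshold:.2f}")
print(f"Accuracy on held-out games: {test_accuracy:.0%}")
```

The point of the split is that accuracy on the held-out games, not on the games used for training, is the honest estimate of how the model might fare on a bracket it has never seen.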

“My goal was to let them run wild, see where they got, and then talk about where it went wrong after it happened,” Waite said.

Or, in the case of a few students, where it’s gone especially right.

“I’ve got a handful of folks who are just absolute basketball maniacs and were skeptical that some computer was going to tell them better than they knew,” Waite said. “I have a handful who have absolutely no interest in basketball whatsoever. I had to literally explain the rules of basketball to them, and what these statistics are, for them to even be able to function with this. And the irony is (that) two of those folks are in the top five of the class.”

Thomas Baker, a junior who leads the pack with a bracket in the 99th percentile of those submitted to ESPN.com, is an “absolute hoop-head” who can “rattle off names and their season narratives” at the drop of a basketball, Waite said. Baker put that Bilas-esque knowledge to use by occasionally disregarding an algorithm-based prediction. But he also chose a relatively sophisticated algorithm: a so-called random forest that, true to its name, consists of many decision trees, each grown from a random sample of the data, whose combined predictions limit the chance that any single tree’s quirks will skew the result.

“The decision tree learns where to make splits based on the amount of similarity in data,” Waite said. “So you might take all of the teams that shoot better than 40% from the 3-point line and put them over in this group. The teams that shoot worse than that, we’re going to put them over in that group. Then those groups get split by something. And then those (subsequent) groups get split by something (else). So on and so forth, until you get to the end, where if you have a team that matches all of these particular parameters, the model says there’s a 58% chance that they’re going to win the game.”
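Waite's description of a single split can be sketched in a few lines of Python. This is an illustration only: the ten teams below are invented, and Gini impurity is assumed as the measure of how similar the teams in each group are (it is one common choice, but the article does not say which measure the class's software used). The function names are hypothetical.

```python
# Hypothetical teams: (three-point percentage, 1 if the team won, else 0).
TEAMS = [(31, 1), (33, 0), (35, 0), (37, 0), (39, 0),
         (41, 1), (43, 1), (45, 1), (47, 0), (49, 1)]

def gini(rows):
    """Gini impurity: 0 when a group is all wins or all losses."""
    if not rows:
        return 0.0
    p = sum(won for _, won in rows) / len(rows)
    return 1.0 - p**2 - (1.0 - p)**2

def best_split(rows):
    """Try each midpoint between neighboring values; keep the purest split."""
    values = sorted({pct for pct, _ in rows})
    best_threshold, best_score = None, float("inf")
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r for r in rows if r[0] <= threshold]
        right = [r for r in rows if r[0] > threshold]
        # Weight each group's impurity by its share of the teams.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

def win_rate(rows):
    """Fraction of teams in a group that won."""
    return sum(won for _, won in rows) / len(rows)

threshold, score = best_split(TEAMS)
below = [r for r in TEAMS if r[0] <= threshold]
above = [r for r in TEAMS if r[0] > threshold]
print(f"Split at {threshold}% from three-point range")
print(f"Win rate below the split: {win_rate(below):.0%}")
print(f"Win rate above the split: {win_rate(above):.0%}")
```

With this made-up data the split lands at 40%, and the win rate in each resulting group plays the role of the "58% chance" in Waite's example. A full tree would repeat the same search inside each group using other statistics.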

Kaitlynn Johnson, a senior in fourth place and the 96th percentile, could hardly be more different — a total college basketball novice who built “maybe the most simplistic model,” input only some basic shooting stats, and dutifully followed every prediction. Still, Waite said, anyone who’s spent as much time as he has with brackets might have predicted the seemingly unpredictable success of a rookie predictor.

“Before this even got going, I honestly predicted that somebody like that was going to be near the top,” Waite said. “Because it happens in every bracket pool. If you’ve ever filled out a bracket in an office, you know there’s somebody in there who’s like, ‘I don’t know anything about basketball, but those uniforms are cool! Let’s pick those.’ Or, ‘I like Wildcats more than Blue Devils, so I’ll take them.’ And they always seem to do really well. So I saw her coming a mile away.”

As for Waite himself? He’s just glad to no longer be bringing up the rear, where he spent about half of the tournament. Riding a hot streak that began in the Sweet 16, he’s ascended to a respectable ninth place and breached the 84th percentile on ESPN.com. If nothing else, he said, his marginal March should at least help him illustrate an important point to the class: that while the machine needs a properly educated ghost to guide it, that education goes only so far — and even the best-informed ghosts can be busted.

“There is a certain amount of humility and, I would even say, naivete that needs to go into this, where there is such a thing as the curse of knowledge,” Waite said. “I read the canonical basketball analysis book and tried, as close as I could, to implement the analysis steps into a model. I spent hours and hours on mine, used the fanciest algorithms that I could — and immediately just got my head kicked in. Meanwhile, somebody who didn’t know what a field goal was three weeks ago came up with a very simple and, truthfully, elegant model, and is crushing it.”

And if, in the process of tracking their brackets and retracing their missteps and claiming bragging rights for the rest of the semester, the future media professionals begin losing, or even forget, some of their lingering aversion to numbers? So much the better, Waite said.

“The students I’ve got are not computer scientists; they’re not statistics majors,” he said. “They’ve (often) avoided math as much as possible. So, for me, the trick is trying to make this as relevant as possible, and draw them in that way. You know, it’s sort of the spoonful of sugar.

“We’re using the tournament to introduce some pretty complex topics in an environment that is easy to understand, in a way that’s accessible, using something that they’re doing anyway. If you can bring those things together, I think you’re in good territory.”