let’s crack home field advantage

Looking into how to add home field advantage (HFA) to the Elo predictor, I went down a rabbit hole of what numbers to use. The code we had in home_road_splits.py was tuned to use our virtual conference set. That produced the 55% home winning percentage that we employed in the first base case predictor. That’s fine, but it’s a pretty small sample. What do the larger numbers say? If you look up college football home field advantage you find home winning percentages in the range of 61% to 65%. But as we’ve discussed before, many big programs play unbalanced schedules, with more home games than road games. Those non-conference home games are usually against weaker teams. You build the biggest stadiums in the world to give the people what they want, and what they want isn’t Tennessee playing .500 ball in September.

Major college football is not a zero-sum game: wins are shipped up the class structure from low to high. Big programs feed on small programs and FCS schools in order to pad their wins, satisfy their fans and create bowl eligibility for the programs with giant fanbases, the ones most able to deliver 20,000 eager holiday season football tourists.

But that’s non-conference play. Conference games for the most part are a zero-sum game. The spirit of conference play is balanced home/road splits for every team. Eventually there’s no escaping the good teams in your conference. Tennessee can schedule East Tennessee State and Chattanooga as home September dates without ever dreaming of going to play them on their home fields. But when it comes to other SEC powerhouses like Alabama, Tennessee has to play them one year at home and one year on the road.

So let’s break out conference play and non-conference play and see what happens. I modified home_road_splits.py to create a new master query for which games to use: instead of our virtual conference method, let’s use a real conference check. (It’s still a little rough, so we’re not into actual command line options yet.) The last 20 years all look very similar for the major conferences. This is the intra-conference winning percentage for home teams starting in 2001:

overall: 591-491-0

overall: 615-483-0

overall: 621-510-0

overall: 541-408-0

Pac 10/12:

This is all very close to the 55% we were seeing for our MCC virtual conference data. The aggregate for the whole Power 5 over the last 20 years is 55.8%. I sampled a few G5 and it doesn’t look too different.

We can flip a flag to check the inverse (ONLY non-conference games), and the numbers are crazy. In games involving an SEC team playing a team from a different conference, excluding neutral site games, the home team is 401-100 since 2011, for an even .800 winning percentage. It’s not just the SEC: in Big Ten inter-conference games the home winning percentage is .743, and even for the “lowly” Pac-12 it’s .755.
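
The conference/non-conference switch described above might look roughly like this. It is a minimal sketch, not the actual code in home_road_splits.py: the `Game` record, its field names, and the `conference_only` flag are all assumptions made for illustration, and dropping neutral site games in both modes is my choice here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Game:
    # Hypothetical record shape; the real query results may differ.
    season: int
    home_team: str
    away_team: str
    home_conference: Optional[str]
    away_conference: Optional[str]
    home_points: int
    away_points: int
    neutral_site: bool = False


def home_record(games: Iterable[Game], conference: str,
                conference_only: bool = True) -> Tuple[int, int, int]:
    """Return (home wins, home losses, ties) for games involving `conference`.

    conference_only=True keeps only intra-conference games;
    conference_only=False keeps only games against outside opponents.
    """
    wins = losses = ties = 0
    for g in games:
        if g.neutral_site:
            continue  # nobody is really "home" at a neutral site
        involved = conference in (g.home_conference, g.away_conference)
        intra = (g.home_conference == conference
                 and g.away_conference == conference)
        keep = intra if conference_only else (involved and not intra)
        if not keep:
            continue
        if g.home_points > g.away_points:
            wins += 1
        elif g.home_points < g.away_points:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties


def win_pct(wins: int, losses: int, ties: int = 0) -> float:
    """Winning percentage, counting a tie as half a win."""
    total = wins + losses + ties
    return (wins + 0.5 * ties) / total if total else 0.0
```

With this shape, `win_pct(401, 100)` gives the .800 quoted for SEC non-conference home teams.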

So it seems safe to say inter-conference games look very different from intra-conference games. But how much of that is the “positive sum game” for wins that we’ve identified? As we detailed above, inter-conference games are much more likely to be home wins for Power 5 schools because they bring in lesser programs for easier home dates. You know, “cupcakes.”

The whole reason we went down this trail in the first place is that Elo is a raw power rank with no built-in notion of home field advantage. But we should be able to use this “flaw” in our favor. There’s an entry in the cfbd data model for pregame Elo for both teams. If we total up the expected home wins from the pregame Elo calculations, then over time the gap between that total and the actual home wins should show home field advantage while accounting for inferior competition. That’s the theory. As a control, if we run expected Elo W-L against the conference schedule we see numbers like this (from the 2011 season on):

overall: 325-249-0
overall ELO: 290.0-284.0

overall: 339-263-0
overall ELO: 297.9-304.1

Big 10
overall: 327-290-0
overall ELO: 304.9-312.1

Big 12
overall: 261-212-0
overall ELO: 235.8-237.2

overall: 341-277-0
overall ELO: 308.9-309.1
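
For reference, expected records like the ones above fall out of the usual Elo win expectancy, summed over every home team. Here is a minimal sketch, assuming the cfbd pregame Elos follow the standard logistic curve with a 400-point scale (not confirmed against cfbd itself); the function names are illustrative and are not the ones in home_road_splits.py.

```python
from typing import Iterable, Tuple


def elo_expected(home_elo: float, away_elo: float) -> float:
    """Standard Elo win expectancy for the home team, with no HFA term."""
    return 1.0 / (1.0 + 10 ** ((away_elo - home_elo) / 400.0))


def expected_home_record(elo_pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Sum expectancies over (home_elo, away_elo) pairs.

    Returns (expected home wins, expected home losses). The two always
    add up to the number of games, which is why a balanced conference
    slate should land near .500.
    """
    pairs = list(elo_pairs)
    exp_wins = sum(elo_expected(h, a) for h, a in pairs)
    return exp_wins, len(pairs) - exp_wins
```

Two evenly rated teams give the home side 0.5 expected wins, and a 400-point favorite gets 10/11, or about 0.909.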

So far so good… the Elo predicted records are very close to .500, as we would expect in the relatively zero-sum game of conference play. That’s at the very least a nice sanity check on the Elo data. The first time I ran it against the non-conference schedule I saw a convincing and consistent HFA of around .130 in most conferences. Wow, that seems big and interesting? Unfortunately I realized that the totals for the Elo record and the actuals weren’t adding up. The issue is that the Elo data is actually incomplete. Many non-conference games don’t have pregame Elo values for the participants. Here’s a sample from a non-conference ACC run, dumping out which games are missing Elos for one or more participants:

No Elos for 2017 Presbyterian College at Wake Forest
No Elos for 2017 Central Connecticut at Syracuse
No Elos for 2017 Bethune-Cookman at Miami
No Elos for 2017 Youngstown State at Pittsburgh
No Elos for 2017 William & Mary at Virginia
No Elos for 2017 North Carolina Central at Duke
No Elos for 2017 Jacksonville State at Georgia Tech
No Elos for 2017 Delaware at Virginia Tech
No Elos for 2017 Furman at NC State
No Elos for 2017 Murray State at Louisville
No Elos for 2017 Delaware State at Florida State
No Elos for 2017 The Citadel at Clemson
No Elos for 2017 Western Carolina at North Carolina
2017, ACC, 20, 15
2017 ELO 19.403-15.597
No Elos for 2018 James Madison at NC State
No Elos for 2018 Furman at Clemson
No Elos for 2018 Alcorn State at Georgia Tech
No Elos for 2018 Albany at Pittsburgh
No Elos for 2018 Richmond at Virginia
No Elos for 2018 Towson at Wake Forest
No Elos for 2018 Holy Cross at Boston College
No Elos for 2018 William & Mary at Virginia Tech
No Elos for 2018 Wagner at Syracuse
No Elos for 2018 Savannah State at Miami
No Elos for 2018 Indiana State at Louisville
No Elos for 2018 Samford at Florida State
No Elos for 2018 North Carolina Central at Duke
No Elos for 2018 Western Carolina at North Carolina

You can see the pattern here: it’s mostly FCS teams. You can’t really blame the cfbd data since creating a meaningful Elo rating for an FCS team in the context of FBS play is hard. For example, the aggregate record for home teams when an ACC team was playing and there was no Elo data from 2011-present: 133 wins and 4 losses. These games are mismatches the vast majority of the time. Here are the four losses:

2011 Richmond at Duke
2016 Richmond at Virginia
2019 The Citadel at Georgia Tech
2021 Jacksonville State at Florida State
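
Separating out the games with no ratings can be sketched like this. It assumes each game is a plain dict with hypothetical keys (`home_pregame_elo`, `away_pregame_elo`, and so on) and that a missing rating shows up as `None`; the real data may represent it differently.

```python
from typing import Dict, List, Tuple


def split_missing_elos(games: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition games into (rated, missing) by pregame Elo availability.

    A game counts as missing if either participant has no pregame Elo.
    """
    rated, missing = [], []
    for g in games:
        if g["home_pregame_elo"] is None or g["away_pregame_elo"] is None:
            # Same shape as the dump above: season, road team "at" home team.
            print(f"No Elos for {g['season']} {g['away_team']} at {g['home_team']}")
            missing.append(g)
        else:
            rated.append(g)
    return rated, missing


def home_wl(games: List[Dict]) -> Tuple[int, int]:
    """Actual (home wins, home losses) for a list of games, ignoring ties."""
    wins = sum(1 for g in games if g["home_points"] > g["away_points"])
    losses = sum(1 for g in games if g["home_points"] < g["away_points"])
    return wins, losses
```

Running `home_wl` on the `missing` list is what produces a tally like the 133-4 above, and running it on `rated` gives the record that can be compared against Elo expectations.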

Losing to an FCS team is a big deal. It leads the news. Coaches get fired. Sometimes whole Wikipedia articles are written. So what happens when we drop the win/loss data for games with missing Elos? Can we see anything about home field advantage? Here are the runs for the P5 again using 2011-present data:

Pac-12
overall: 178-76-0
overall ELO: 142.0-112.0
missing ELOS: 78-7
HFA on available data: 0.142

SEC
overall: 277-98-0
overall ELO: 250.4-124.6
missing ELOS: 124-2
HFA on available data: 0.071

Big 10
overall: 265-112-0
overall ELO: 230.7-146.3
missing ELOS: 68-3
HFA on available data: 0.091

Big 12
overall: 133-79-0
overall ELO: 123.8-88.2
missing ELOS: 73-6
HFA on available data: 0.043

ACC
overall: 234-145-0
overall ELO: 202.6-176.4
missing ELOS: 133-4
HFA on available data: 0.083

There’s still something there. Even if we eliminate the cupcakes we still see a persistently higher HFA for non-conference games. In the aggregate the HFA is .079.
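
The "HFA on available data" figure in each run is consistent with a simple calculation: actual home wins minus Elo expected home wins, divided by the number of games that had Elos. The sketch below reproduces the five per-conference values from the records printed above. The function name is mine. The conference names on the unlabeled runs are inferred by adding the missing-Elo record back in and matching the totals quoted earlier (401-100 for the SEC, 133-4 missing for the ACC, .755 for the Pac-12).

```python
def hfa(home_wins: int, home_losses: int, elo_expected_wins: float) -> float:
    """Home wins above Elo expectation, per game with Elo data."""
    games = home_wins + home_losses
    return (home_wins - elo_expected_wins) / games


# (conference, actual home wins, actual home losses, Elo expected home wins)
# Records copied from the non-conference runs above; games without Elos excluded.
runs = [
    ("Pac-12", 178, 76, 142.0),
    ("SEC", 277, 98, 250.4),
    ("Big 10", 265, 112, 230.7),
    ("Big 12", 133, 79, 123.8),
    ("ACC", 234, 145, 202.6),
]

for name, wins, losses, expected in runs:
    print(f"{name}: HFA on available data: {hfa(wins, losses, expected):.3f}")
# Pac-12: HFA on available data: 0.142
# SEC: HFA on available data: 0.071
# Big 10: HFA on available data: 0.091
# Big 12: HFA on available data: 0.043
# ACC: HFA on available data: 0.083
```

Applied to the conference-play control, the same calculation measures how far the actual home record sits above the roughly .500 that Elo expects.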

I spent several hours trying to figure out strategies to integrate FCS games into the HFA Monte Carlo model when a small voice spoke to me: “We’ve already decided to eliminate FCS games from the virtual conference code.”

This is a familiar pattern on a software project. An engineer has a pet idea, delivers an intriguing first pass and then it’s “just a few more days of work” to get it really firmed up into something. Meanwhile there are actual bugs and the shape of the emerging feature seems oddly orthogonal to the mission of the project. Fun’s over! We can confidently put a little bifurcated HFA boost into the Elo predictor and move on from this stuff. This is not a betting blog. (But don’t bet against the SEC in FCS games.)
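
One way the bifurcated boost could be wired in is as a flat bump to the Elo win probability, with one value for conference games and another for non-conference games. This is only a sketch of the idea, not the predictor's actual code. The function and parameter names are hypothetical. The default of .058 for conference play is my reading of the 55.8% aggregate against a .500 Elo expectation, and .079 is the non-conference aggregate from above. The standard 400-point Elo curve is also an assumption.

```python
def home_win_probability(home_elo: float, away_elo: float,
                         conference_game: bool,
                         conf_hfa: float = 0.058,
                         nonconf_hfa: float = 0.079) -> float:
    """Elo win expectancy for the home team plus a flat HFA boost.

    The boost depends only on whether the game is a conference game.
    The result is clamped so it stays a valid probability.
    """
    base = 1.0 / (1.0 + 10 ** ((away_elo - home_elo) / 400.0))
    boost = conf_hfa if conference_game else nonconf_hfa
    return min(1.0, max(0.0, base + boost))
```

A flat additive bump is the bluntest option. Its main appeal is that it maps directly onto the measured numbers, at the cost of needing the clamp for heavy favorites.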

The biggest real bug on the board is multiway tiebreakers. That’s a hard problem with an ugly refactor at its heart. No more fun stuff. home_road_splits.py will stay a developer-only minefield for now with all the data experiments checked in.

What about that terrible 2-8 record for home teams this year in the MCC? The Elo expected record using pregame Elo values was 4.7-5.3. This schedule did tilt toward stronger road teams. I suspect if we reran it using final Elos we’d see an even more pronounced slant as a lot of these teams ended up being much worse than expected. (That gives me an idea: maybe we should do that for the whole data set… does final season Elo add more information than the pregame value? Stop. STOP.)