Doing a project as pointless as this lets me really examine some of the pitfalls of how software projects get hung up. Let’s take a look at a verbose run for 2007 under the current codebase:
    UCLA 45 at Stanford 17 on Sep 01, 2007
    Sacramento State 3 at Fresno State 24 on Sep 02, 2007
    San José State 0 at Stanford 37 on Sep 16, 2007
    UC Davis 14 at San José State 34 on Sep 29, 2007
    Stanford 24 at USC 23 on Oct 06, 2007
    California 21 at UCLA 30 on Oct 20, 2007
    San José State 0 at Fresno State 30 on Oct 20, 2007
    USC 24 at California 17 on Nov 11, 2007
    UCLA 7 at USC 24 on Dec 01, 2007
    California 13 at Stanford 20 on Dec 02, 2007

    2007 ordered standings

    Fresno State 2-0
    Stanford 3-1
    USC 2-1
    UCLA 2-1
    San José State 1-2
    Sacramento State 0-1
    UC Davis 0-1
    California 0-3

    2007 MCC winner is Fresno State (2-0)
Fresno State was a good team that year, but their 2-0 record includes a blowout against Sacramento State, then as now an FCS team. (2006 was the early days of the new FCS designation. Before that there was D-IAA and the “College Division”. In all cases we refer to the tier below top-tier football as “FCS”.) It seems like the most reasonable thing is “don’t count FCS games.”
First off, why do we need a code solution to this at all? We’re already pulling the California teams into a separate hand-rolled serialized database, so why don’t we just figure out who’s FCS one time and adjust the set? Because there are slightly tricky historical cases that I want the code to handle robustly. Teams move up and down. San Diego State didn’t jump into the top tier until the end of the 1960s but has stayed there since. UC Santa Barbara played at the top level in the 60s, discontinued the sport and then returned at Division III for a short time in the 80s.
Given all that I feel like the cleaner solution is to identify the fixed geographic stuff once and then let code figure out the level year by year. Especially since we want to make the system eventually work across 120 years of results. So after filing this thought away the devil’s advocate feature creep began. “But why are we excluding FCS entirely? Wouldn’t it be cool if the FCS teams had a role as spoiler? If you lose to an FCS team you’re eliminated from contention for the Cup, no matter how good the rest of your record is.”
Once I got down that road I let the idea of doing a comprehensive FCS solution distract from the simple case. The crux of the issue is that the cfbd API query is by nature extremely denormalized: you can’t construct your own join. When you query games you get a fixed game object. And if we eliminate the FCS games entirely we won’t have them later on in the tiebreak process to check for spoilers. The best method would be to have some data model solution where the spoiler check is an easy, isolated, stateless lookup that we can code by itself and then decide whether to slot it into the final ordered standings.
For instance, check_minimum_wins functions like this. The last part of find_vconf_games is fairly readable:
    # tail end of find_vconf_games
    standings = build_standings(mcc_games)
    if (len(standings) == 0):
        if (verbose):
            print("There are no standings, possibly because no games were completed.")
        return False
    ordered_standings = sorted(standings.values(), reverse = True, key = standings_sortfunc)
    # rule check 1: somebody has to have enough wins
    if not check_minimum_wins(ordered_standings):
        print("No team has enough wins")
        return False
    if (verbose) :
        for line in ordered_standings:
            print(line)
    # rule check 2: tie breaking; on success the winner is first in the list
    if (break_ties(ordered_standings, mcc_games)) :
        if (verbose) :
            print()
            print(str(year) + " ordered standings")
            print()
            for line in ordered_standings:
                print(line)
            print()
        print(str(year) + " MCC winner is " + ordered_standings[0].team_name + " (" + str(ordered_standings[0].wins) + "-" + str(ordered_standings[0].losses) + ")")
        return True
    else:
        print("could not resolve a winner for " + str(year))
        return False
We build the standings, sort the standings, then do the logical rule check of minimum wins and tie breaking. It would make sense to add a check_fcs_spoilers() in there somewhere.
But in order to do that we’d need to keep around the un-pruned gameset, add FCS markers to it ourselves, and make the standings code smart enough to know what doesn’t “count”. The “teams” dictionary as currently implemented is a simple map of id->name, which is good because the denormalized name is actually used as the unique key in most of the cfbd records that are returned. But if we start enhancing our idea of what a team is we should build some kind of richer structure to hold team info, like name and FCS/FBS tier.
Why not just do the FCS check early on when we’re finding games and then mark the team as having lost to FCS at that time? Well that would mean adding win/loss code to the game-finding code. The win/loss stuff is fairly ugly because you have to do the hard work of comparing home and away scores and then pulling the winner from home/away. There’s no easy virtual overlay over the struct to just deal with the “winner” and the “loser” of the game. (Although building an idealized object around the core data model would be a good idea for reasons like this.)
I think the best way is probably to err on the side of deep copying. Prune the gameset of FCS games but keep the original around. At the time you do that, cache the FCS/FBS info in a dictionary of FBS teams, since the dataset is tiny and the query is expensive. Then later if you want to do the isolated spoiler check you walk the original gameset for FCS victories and use the team info you find there to further prune the final ordered standings built from the pruned gameset.
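A minimal sketch of that prune-but-keep-the-original idea. Games here are plain dicts with hypothetical keys (`home`, `away`, `home_pts`, `away_pts`) rather than real cfbd game objects, every game is assumed to be completed, and `prune_fcs_games`, `fcs_spoiled_teams` and `fbs_names` are names invented for illustration, not part of the current codebase. The teams and scores are made up.

```python
import copy

def prune_fcs_games(games, fbs_names):
    """Return (original_copy, pruned); pruned keeps only FBS-vs-FBS games."""
    original = copy.deepcopy(games)  # untouched copy for the later spoiler check
    pruned = [g for g in games
              if g["home"] in fbs_names and g["away"] in fbs_names]
    return original, pruned

def fcs_spoiled_teams(original_games, fbs_names):
    """Return the set of FBS teams that lost to a non-FBS team."""
    spoiled = set()
    for g in original_games:
        if g["home_pts"] == g["away_pts"]:
            continue  # a tie spoils nobody
        if g["home_pts"] > g["away_pts"]:
            winner, loser = g["home"], g["away"]
        else:
            winner, loser = g["away"], g["home"]
        if winner not in fbs_names and loser in fbs_names:
            spoiled.add(loser)
    return spoiled

# Made-up data: "Little C" stands in for an FCS team.
games = [
    {"home": "Team A", "away": "Team B", "home_pts": 21, "away_pts": 14},
    {"home": "Team A", "away": "Little C", "home_pts": 10, "away_pts": 13},
]
fbs_names = {"Team A", "Team B"}

original, pruned = prune_fcs_games(games, fbs_names)
spoiled = fcs_spoiled_teams(original, fbs_names)
print(len(pruned), spoiled)
```

The standings would be built from `pruned` as before, and the spoiler check stays a separate lookup over `original` that can be slotted in or left out.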
If this becomes too cumbersome the next step is to build the idealized object of our dreams and spend the time importing the cfbd query recordset into the objects, which can then have accessor functions that supply the FCS status, countable win, readily accessible winner and loser, etc.
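One way that idealized object might look, as a rough sketch only. `GameView` and all of its field names are hypothetical; the real cfbd game record would have to be mapped into it by an import step that isn't shown, and the FBS flags would come from whatever tier lookup gets cached. The two sample games use scores from the 2007 run above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GameView:
    """Hypothetical read-only wrapper around one game record."""
    home_team: str
    away_team: str
    home_points: Optional[int]  # None if the game was never played
    away_points: Optional[int]
    home_is_fbs: bool
    away_is_fbs: bool

    @property
    def completed(self) -> bool:
        return self.home_points is not None and self.away_points is not None

    @property
    def winner(self) -> Optional[str]:
        # None for unplayed games and for ties
        if not self.completed or self.home_points == self.away_points:
            return None
        if self.home_points > self.away_points:
            return self.home_team
        return self.away_team

    @property
    def loser(self) -> Optional[str]:
        if self.winner is None:
            return None
        return self.away_team if self.winner == self.home_team else self.home_team

    @property
    def countable(self) -> bool:
        # Only completed FBS-vs-FBS games count toward the standings
        return self.completed and self.home_is_fbs and self.away_is_fbs

# "Stanford 24 at USC 23" and "Sacramento State 3 at Fresno State 24"
upset = GameView("USC", "Stanford", 23, 24, True, True)
fcs_game = GameView("Fresno State", "Sacramento State", 24, 3, True, False)
print(upset.winner, upset.loser, upset.countable)
print(fcs_game.winner, fcs_game.countable)
```

With something like this the standings and spoiler code would only ever ask for `winner`, `loser` and `countable`, and the home/away score comparison would live in exactly one place.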
Hang on, we got down this road because I “thought it would be cool” to include FCS spoilers. Does any of this really matter? It would be a lot easier to build a one-off query to check how many times FCS even beat FBS among our California teams. Just doing a quick eyeball/grep test of verbose results from the last 40 years gives us a good answer: not many. There are a few times Davis or Sacto State beat Pacific, but it definitely doesn’t affect the winners.
Deep breath. Simple first.
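    import cfbd

    def remove_fbs_teams(configuration, teams, cur_year):
        # Despite the name, this removes the non-FBS teams from teams, in place.
        api_instance = cfbd.TeamsApi(cfbd.ApiClient(configuration))
        year = cur_year
        fcs_teams = []
        fbs_teams = api_instance.get_fbs_teams(year=year)
        # First pass: collect the ids that are not in the FBS list for this year.
        for team_id in teams:
            found = False
            for cur_team in fbs_teams:
                if (cur_team.id == team_id) :
                    found = True
                    break
            if (not found) :
                fcs_teams.append(team_id)
        # Second pass: delete them now that we are done iterating over teams.
        for fcs_team in fcs_teams :
            del teams[fcs_team]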
That’s our clean nuke of the non-FBS teams from the supplied teams dataset. (We are interested in the intersection of user-supplied teams and cfbd dataset FBS list.) The slightly lame separate delete loop at the end is because I hate doing operations on a collection I’m iterating through. Feels wrong, although it’s possible Python 3 just makes it work. I will check that. The nested loop is because there’s no other way to query the API than to just get a big FBS list.
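For what it's worth, Python 3 does not just make it work: deleting keys from a dict while iterating over it raises `RuntimeError`, so collecting first and deleting afterwards is the right instinct. A possible alternative, sketched here with made-up ids and a hypothetical helper name `keep_only_fbs`, is to turn the FBS list into a set of ids once and build a new dict, which also avoids the nested loop. It assumes the FBS ids have already been pulled out of the API response.

```python
def keep_only_fbs(teams, fbs_ids):
    """Return a new id->name dict containing only the ids found in fbs_ids."""
    fbs_ids = set(fbs_ids)  # set membership instead of an inner loop
    return {team_id: name for team_id, name in teams.items()
            if team_id in fbs_ids}

# Made-up ids; 3 stands in for a team that is not in the FBS list.
teams = {1: "Team A", 3: "Little C", 2: "Team B"}
fbs_ids = [1, 2, 99]

filtered = keep_only_fbs(teams, fbs_ids)
print(filtered)

# Demonstrate the failure mode on a scratch copy: delete while iterating.
scratch = dict(teams)
raised = False
try:
    for team_id in scratch:
        if team_id not in fbs_ids:
            del scratch[team_id]
except RuntimeError:
    raised = True  # "dictionary changed size during iteration"
print(raised)
```

If the caller needs the original dict updated in place, as the current function does, `teams.clear()` followed by `teams.update(filtered)` would get the same effect.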
Here’s our 2007 run after adding that to the mix:
    UCLA 45 at Stanford 17 on Sep 01, 2007
    San José State 0 at Stanford 37 on Sep 16, 2007
    Stanford 24 at USC 23 on Oct 06, 2007
    California 21 at UCLA 30 on Oct 20, 2007
    San José State 0 at Fresno State 30 on Oct 20, 2007
    USC 24 at California 17 on Nov 11, 2007
    UCLA 7 at USC 24 on Dec 01, 2007
    California 13 at Stanford 20 on Dec 02, 2007
    disqualifying insufficient wins (1) from Fresno State

    2007 ordered standings

    Stanford 3-1
    USC 2-1
    UCLA 2-1
    San José State 0-2
    California 0-3

    2007 MCC winner is Stanford (3-1)
Another Stanford cup! Well what do you know… I’ll re-run all the results with that new rule and we’ll see, then go forward with the spoiler insanity some other time.