Tuesday, June 24, 2014

2014 Season Preview

This is the post where I attempt to use math and logic to predict the outcome of a game which is based on randomness and luck.  By the time November rolls around, this post will probably make me look silly.

Nonetheless, this is what you do this time of year, so let's go.

West Division

BC Lions

2013 Record: 11-7
Pythagorean Wins: 10.1 (over-performed by 0.9 wins, 3rd luckiest)
Record in Close Games (decided by 7 points or fewer): 4-1
Simple Rating: 3.0 (3rd)
Turnover Differential: +2 (4th)

The math suggests that the Lions were a bit lucky in 2013; they had the best record in the league in close games, and they over-performed their point differential by a small margin.  The math also suggests that despite these factors, the Lions were the 3rd best team in the league last year.  Unfortunately for BC fans, they were also the 3rd best team in their own division.  Teams that over-perform in the 0.5-1 game range tend to regress by around a game and a half the next year, but in this case, with Calgary and Saskatchewan looking vulnerable and an extra game against Winnipeg on the schedule, I wouldn't expect that to happen.

Prediction: 11-7, 2nd in the West.

Calgary Stampeders

2013 Record: 14-4
Pythagorean Wins: 12.3 (over-performed by 1.7 wins, luckiest in the league)
Record in Close Games: 3-2 (4th)
Simple Rating: 7.2 (1st)
Turnover Differential: +19 (T-1st)

Just because the numbers say a team is the luckiest in the league doesn't mean they aren't also a very good team.  The Stamps were a very good team in 2013, one of only 12 teams since 1990 to finish with at least 14 wins.  They were good in close games, but not overly so, good at taking care of the ball, and solid on defense against the pass.  The loss of Kevin Glenn may hurt them if Drew Tate is unable to stay healthy, but Mitchell has shown flashes of brilliance in his time under center, so the QB play should remain solid.  Teams that over-perform in the 1.5+ range tend to fall back to the pack a bit though, and I expect the Stampeders to follow suit this year.

Prediction: 12-6, 1st in the West.

Edmonton Eskimos

2013 Record: 4-14
Pythagorean Wins: 6.5 (under-performed by 2.5 wins, unluckiest in the league)
Record in Close Games: 1-6 (last)
Simple Rating: -3.8 (7th)
Turnover Differential: -15 (7th)

Let's get this out of the way early: the Eskimos were tremendously unlucky last year.  Only 5 teams since 1990 have fallen short of their expected win total by more than Edmonton did in 2013.  The good news for Eskimo fans?  Of those which played a season the next year (the 1995 Shreveport Pirates folded after their season), each of them finished with at least 9 wins the next season.  Of course, there is also the 1997 Bombers, who come out just ahead of the Eskimos at -2.4 wins, and finished with just 3 the next year.  That said, their record in close games is bound to improve, so it's a good bet that they'll see some improvement.

Prediction: 8-10, 3rd in the West.

Saskatchewan Roughriders

2013 Record: 11-7
Pythagorean Wins: 12.1 (under-performed by 1.1 wins, 2nd unluckiest)
Record in Close Games: 3-5 (6th)
Simple Rating: 6.2 (2nd)
Turnover Differential: +19 (T-1st)

The 2013 Riders were a very good team by the numbers, nearly the equal of the Stampeders, despite a 3 game difference in the standings.  Of course, we all know how it turned out in the end.  Normally, a team that falls short of expectations by 1+ wins would be expected to show improvement next year, in the range of another 1-1.5 wins.  Sadly, this Rider fan doesn't see that happening here.  These math functions can be a great way to judge performance beyond the standings, but they lack situational awareness, and what they don't know is that the 2014 Riders do not look all that much like the 2013 Riders.  Hits to the receiving corps and the loss of Kory Sheets are bound to hurt the offense.  The upside for Rider Nation?  The 2013 team was the best defense in the league, allowing a league-low 398 points and only 20 passing touchdowns.

Prediction: 6-12, 4th in the West

Winnipeg Blue Bombers

2013 Record: 3-15
Pythagorean Wins: 3.8 (under-performed by 0.8 wins, 3rd unluckiest)
Record in Close Games: 1-4 (7th)
Simple Rating: -11.4
Turnover Differential: -27 (last)

The Bombers were really bad in 2013.  The single game they finished behind the Eskimos in the basement of the CFL doesn't tell the full story here.  By the simple rating system (which ranks teams by average point margin, adjusted for opponents), they were a full touchdown per game worse than the Eskimos.  By this metric, the Eskimos were closer to the 3rd place team (BC) than they were to the Bombers.  There are faint signs of hope, however: the turnover margin is likely to improve in 2014, and they were unlucky in close games last year, a stat which is likely to be closer to 50-50 over time.  Still, they have an unproven starter under center and they play in the difficult West; 2014 may be a long season as well.

Prediction: 5-13, 5th in the West
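
For anyone wondering what "average point margin adjusted for opponents" looks like in practice, here is a minimal sketch of one common way to compute a simple rating: start from each team's average margin, then repeatedly add the average rating of the teams it played.  This is an illustration only, not necessarily the exact calculation behind the ratings quoted in this post.  The function name, the game tuple layout and the toy schedule are all made up for the example, and the repeated pass is assumed to settle down, which it does for a normal schedule where everyone plays everyone.

```python
def simple_ratings(games, iterations=100):
    """Estimate opponent-adjusted ratings.

    games: list of (team_a, team_b, points_a, points_b) tuples (hypothetical layout).
    """
    teams = sorted({team for game in games for team in game[:2]})
    margins = {team: [] for team in teams}
    opponents = {team: [] for team in teams}
    for team_a, team_b, points_a, points_b in games:
        margins[team_a].append(points_a - points_b)
        margins[team_b].append(points_b - points_a)
        opponents[team_a].append(team_b)
        opponents[team_b].append(team_a)

    # Step 1: raw average point margin per team.
    avg_margin = {team: sum(m) / len(m) for team, m in margins.items()}

    # Step 2: repeatedly adjust by the average rating of each team's opponents.
    ratings = dict(avg_margin)
    for _ in range(iterations):
        updated = {
            team: avg_margin[team]
            + sum(ratings[opp] for opp in opponents[team]) / len(opponents[team])
            for team in teams
        }
        # Re-center so that an average team sits at zero.
        mean = sum(updated.values()) / len(updated)
        ratings = {team: value - mean for team, value in updated.items()}
    return ratings


# Toy three-team round robin with invented scores.
toy_games = [("A", "B", 30, 20), ("B", "C", 27, 17), ("A", "C", 35, 15)]
ratings = simple_ratings(toy_games)
for team in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{team}: {ratings[team]:+.1f}")
```

A rating of zero is an average team, and the gap between two ratings is roughly the expected point margin if those teams met.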

East Division

Hamilton Tiger-Cats

2013 Record: 10-8
Pythagorean Wins: 8.6 (over-performed by 1.4 wins, second luckiest)
Record in Close Games: 5-3 (3rd)
Simple Rating: -1.9 (6th)
Turnover Differential: -13 (6th)

Hamilton had a nice run to the Grey Cup final in 2013, but a strong record in close games and the second luckiest W/L percentage in the league both point toward regression.  However, this is a team with a strong coach and a pair of young QBs who have shown they can play.  Couple those with a schedule against the weak East Division, and you may have a team able to make another run in 2014.

Prediction: 9-9, 2nd in the East

Montreal Alouettes

2013 Record: 8-10
Pythagorean Wins: 8.7 (under-performed by 0.7 wins, 4th unluckiest)
Record in Close Games: 5-5 (5th)
Simple Rating: -1.1 (5th)
Turnover Differential: -2 (1st)

Montreal finished right about where they should have last year, ending up closer to their expected win total than any other team.  They were .500 in close games, just about even in turnover differential and near zero (average) in simple rating.  The defense was still good, first in the league for takeaways and yards allowed, but the offense will need to take much better care of the ball in 2014 for things to get better.  I think they will, but not by much.

Prediction: 9-9, 2nd in the East

Ottawa Redblacks

2013 Record: n/a
Pythagorean Wins: n/a
Record in Close Games: n/a
Simple Rating: n/a
Turnover Differential: n/a

How do you make a stats-based prediction for a team that's never played a down of football?  You really can't, but we do have some historical data to work with.  Since 1990, seven CFL teams have started from scratch (technically 9, but the 1996 Texans and Alouettes were both relocations). Their combined record was 49-77 (.389).  A veteran QB and a promising backup will provide some optimism, but history is not on Ottawa's side.

Prediction: 7-11, 4th in the East
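
The arithmetic behind leaning on that historical record is short enough to show.  This is only a sketch: it assumes an 18-game season and that Ottawa simply matches the combined 49-77 record of past start-up teams, and the variable names are mine.

```python
# Combined record of the seven start-up teams since 1990 (from the paragraph above).
wins, losses = 49, 77
win_pct = wins / (wins + losses)

# Assumption: an 18-game regular season.
season_games = 18
expected_wins = win_pct * season_games

print(f"{win_pct:.3f} win rate -> about {expected_wins:.1f} wins in {season_games} games")
```

In other words, an exactly average expansion team lands on 7 wins.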

Toronto Argonauts

2013 Record: 11-7
Pythagorean Wins: 10.2 (over-performed by 0.8 wins, 4th luckiest)
Record in Close Games: 6-2 (2nd)
Simple Rating: 1.7 (4th)
Turnover Differential: -13 (6th)

Mathematically, the best team in the East would have been a mere 4th in the strong West Division, but this was a well-balanced team in 2013.  They were 3rd in points for, 3rd in points allowed, and 3rd in turnover differential.  The late season loss of Ricky Ray in the midst of an all-time great season hurt, and they fell short of expectations in the playoffs.  The record in close games is likely to take a hit, but the turnover differential should balance out as well, and a full season with Ray at the helm should keep them at the top of the East again.

Prediction: 11-7, 1st in the East
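
The "luckiest" and "unluckiest" rankings quoted throughout this preview are just actual wins minus Pythagorean wins, sorted.  Here is a small sketch that rebuilds the ranking from the numbers in the team summaries.  The team labels are mine (the last East entry is taken to be Toronto), and Ottawa is left out because it has no 2013 data.

```python
# (actual wins, Pythagorean wins) for 2013, as quoted in the team summaries.
teams = {
    "BC": (11, 10.1),
    "Calgary": (14, 12.3),
    "Edmonton": (4, 6.5),
    "Saskatchewan": (11, 12.1),
    "Winnipeg": (3, 3.8),
    "Hamilton": (10, 8.6),
    "Montreal": (8, 8.7),
    "Toronto": (11, 10.2),
}

# Positive = won more than the point totals suggest (lucky); negative = unlucky.
# Rounded to one decimal to hide floating point noise.
luck = {name: round(actual - expected, 1) for name, (actual, expected) in teams.items()}

ranked = sorted(luck, key=luck.get, reverse=True)
for name in ranked:
    print(f"{name:<13}{luck[name]:+.1f}")
```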

Monday, June 23, 2014

Introducing cflstats.ca

My name is Mike, and I'm a stat-aholic.

It should be obvious by now that I'm mildly obsessed with sports stats. I started this blog mid-season last year with the intention of bringing some of the so-called "advanced stats" up north to the CFL. I don't claim to be a math genius, but I can read a formula, and as a programmer, I'm fairly adept at collecting stats. This made Pythagorean Expectation a perfect place to start, as the required data (points for and against) was easily available, and the formula was fairly simple.

But it was obvious from the start that we CFL fans suffer from a lack of good data.  Towards the end of the season, I set out to improve that.

After some long nights in the off-season, I'm proud to unveil CFLStats.ca.



What is CFLStats.ca?

For the NFL statheads out there, the resemblance to pro-football-reference.com will be immediately apparent.  It was my inspiration and guide throughout the process.  While the code behind cflstats.ca has no ties to PFR, the development would not have been possible without having it as a guide.

What CFLStats.ca provides is a very large searchable database that includes (almost*) every action of every play, for every game processed so far.  Due to time constraints, this means the 2009, 2010 and 2013 seasons.  I hope to finish processing 2011 and 2012 in the near future.

What's missing?

A few things right now, but primarily the data from 2011 and 2012.  It's available and will be there eventually, but importing games is a time consuming process, and they simply weren't ready in time.  Rest assured you'll see that data in time.

A major limitation of the data, however, is the "Games Played" stats and, by association, the "per game" averages.  Unfortunately, it's impossible to determine based on the play by play whether a player actually suited up for a game or not.  Part of the process was to run through player transactions to determine trades as well as active/injury status, but there are some players (primarily backup quarterbacks) who remain active but never appear in the game.  The DB will therefore show them as having "played" in 18 games for the season, while in reality they may have only appeared in a handful, or even none at all.  As a result, the per game stats should be considered an estimate.

What can I search for?

A lot of stuff.

Oh you want more?  Oh alright.  

You can search for team games that match your criteria (how many games were there in 2013 where a team passed for 400 yards?).  (Answer: 4. Toronto did it 3 times.)

You can search for player games that match your criteria (which players had a game where they rushed for 150 yards?).  (Answer: Kory Sheets (3 times), Chris Garrett, Jon Cornish (3 times), Chad Kackert and Brandon Whitaker)

You can search for drives matching your conditions (show me the drives where a team got the ball after an interception or fumble). (Answer: it happened 246 times, and 82 touchdowns were scored)

And you can search for plays matching your conditions (show me the results of plays from the opponent's goal line). (Answer: 73 plays and 51 touchdowns)

Within these searches are a lot of options for filtering, sorting and grouping.  I'm sure there are things which currently can't be searched for, but I think you'll find there are a ton of things you can.


Errors

What errors?  Everything is perfect, or I wouldn't be releasing it, obviously.

Yeah, that's a lie.  There are going to be errors in the data; it's a fact of life with a database this large.  The import process is designed to catch as many as can be identified, and I fix those by hand as I find them, but I'm certain some have slipped through.  At the bottom of every page, you'll see a "report error" link.  If you find something you think is wrong, click that link and send in the details of the error.  The more detail the better.  You don't have to provide an email to send a report, but if you do, I'll update you with the resolution.

* In rare cases, the play by play data was not available for processing or contained errors too numerous to be utilized.  These games are clearly marked when you access them, and will have high level stats available based on game box scores, but no searchable plays.

Wednesday, June 18, 2014

Finding a New Magic Number

In the formula for Pythagorean Expectation, a magic number exists.

Ok, it's not really magic, rather, it started from an assumption made by a very smart man ("2 would be a good number") and ended with rigorous scientific testing by even more very smart people ("1.83 is actually a better value for baseball").

When I started this project, I knew that different sports use different exponents, that more scoring means a higher exponent, and that the CFL has more scoring than the NFL.  Unfortunately, as I was just beginning to collect data, I had no way of determining what the best exponent for the CFL would be.  In the end, after looking at the values used for various sports (MLB = 1.83, EPL = 1.30, NHL = 2.15, NFL = 2.37, NBA = 13.91), I decided the gap between the CFL and NFL was probably small enough that the known exponent for the NFL was likely good enough to be useful for my calculations.
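
For reference, the formula itself is short: expected winning percentage is points for raised to the exponent, divided by the sum of points for and points against each raised to the same exponent.  Here is a minimal sketch that shows how much the exponent matters.  The function name, the 18-game default and the 500/400 point totals are made up for illustration.

```python
def pythagorean_wins(points_for, points_against, games=18, exponent=2.37):
    """Expected wins from points scored and allowed (Pythagorean Expectation)."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return games * pf / (pf + pa)


# Hypothetical team: 500 points for, 400 against over 18 games.
# The same point totals imply very different win totals depending on the exponent.
for exponent in (1.83, 2.37, 13.91):
    wins = pythagorean_wins(500, 400, exponent=exponent)
    print(f"exponent {exponent:>5}: {wins:.1f} expected wins")
```

The higher the exponent, the more a given scoring edge is expected to translate into wins, which is why picking the right value for each league matters.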

Now, however, I have data going back to 1990, and after some prompting from a gentleman from Hamilton, I realized it would be prudent to go back and do the math.

Based on some research, I settled on the method outlined here (external link), which calculates a value for lacrosse.  The calculation itself is fairly simple:

1) Find the expected win total using the Py Expectation formula, and subtract it from the actual win total.
2) Square that value.
3) Calculate this value for every team in every year that I have data for.
4) Add up all the values and divide by the number of team-seasons to get the mean.
5) Find the square root.

This leaves me with the root-mean-square error (RMSE) for the data using whichever exponent I used in step 1. All that's left at this point is to run the calculation with a range of exponents to determine which results in the lowest RMSE.

Thanks to Bill Barnwell and the others who have already done these calculations for the NFL, I had a reasonable clue as to where the exponent would fall, so I calculated the RMSE for 2.00 through 5.00, increasing by 0.01 each time.
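
The steps and the exponent sweep described above can be sketched in a few lines.  This is an illustration under stated assumptions, not the actual script used for this post: the function names, the tuple layout and the sample team-seasons are all invented.  The sketch divides the summed squares by the number of team-seasons before taking the root, which is what makes it a mean; skipping that division would not change which exponent comes out on top.

```python
import math


def pythagorean_wins(points_for, points_against, games, exponent):
    """Expected wins from points scored and allowed."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return games * pf / (pf + pa)


def rmse(seasons, exponent):
    """Root-mean-square error of expected vs actual wins.

    seasons: list of (wins, games, points_for, points_against) tuples (hypothetical layout).
    """
    squared = [
        (wins - pythagorean_wins(pf, pa, games, exponent)) ** 2
        for wins, games, pf, pa in seasons
    ]
    return math.sqrt(sum(squared) / len(squared))


def best_exponent(seasons, low=2.00, high=5.00, step=0.01):
    """Try every exponent from low to high and keep the one with the lowest RMSE."""
    steps = round((high - low) / step)
    candidates = [round(low + i * step, 2) for i in range(steps + 1)]
    return min(candidates, key=lambda exponent: rmse(seasons, exponent))


# Made-up team-seasons, only here to show the call.
sample = [(12, 18, 520, 430), (9, 18, 470, 465), (5, 18, 400, 510), (10, 18, 455, 440)]
print(best_exponent(sample))
```

With only a handful of invented seasons the answer means nothing; the sweep only becomes useful with every team-season back to 1990 loaded in.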

As expected, the value came out higher than the NFL, but not by much:


The most accurate value of the bunch is 2.74 (raw RMSE data), with an error rate approximately 3% lower than the original 2.37 exponent.

So what does this all mean?

Good question. For starters, it means that going forward, I will be using 2.74 in my calculations.  At some point, I will also go back and revise some of the posts discussing historical data to improve their accuracy.  I will not go back and alter the 2013 posts, as they were simply meant to provide a week-by-week rundown, and there would be limited value in correcting them at this point.

So there you have it: 2.74, my new favorite number.