Monday, June 23, 2014

Introducing cflstats.ca

My name is Mike, and I'm a stat-aholic.

It should be obvious by now that I'm mildly obsessed with sports stats. I started this blog mid-season last year with the intention of bringing some of the so-called "advanced stats" up north to the CFL. I don't claim to be a math genius, but I can read a formula, and as a programmer, I'm fairly adept at collecting stats. This made Pythagorean Expectation a perfect place to start, as the required data (points for and against) was easily available, and the formula was fairly simple.

But it was obvious from the start that we CFL fans suffer from a lack of good data.  Towards the end of the season, I set out to improve that.

After some long nights in the off season, I'm proud to unveil CFLStats.ca.



What is CFLStats.ca?

For the NFL statheads out there, the resemblance to pro-football-reference.com will be immediately apparent. It was my inspiration and guide throughout the process. While the code behind cflstats.ca has no ties to PFR, the development would not have been possible without having it as a guide.

What CFLStats.ca provides is a very large searchable database that includes (almost*) every action of every play, for every game processed so far.  Due to time constraints, this means the 2009, 2010 and 2013 seasons.  I hope to finish processing 2011 and 2012 in the near future.

What's missing?

A few things right now, but primarily the data from 2011 and 2012.  It's available and will be there eventually, but importing games is a time consuming process, and they simply weren't ready in time.  Rest assured you'll see that data in time.

A major limitation of the data however is "Games Played" stats, and by association, the "per game" averages.  Unfortunately, it's impossible to determine based on the play by play whether a player actually suited up for a game or not.  Part of the process was to run through player transactions to determine trades as well as active/injury status, but there are some players (primarily backup quarterbacks) who remain active but never appear in the game.  The DB will therefore show them as having "played" in 18 games for the season, while in reality they may have only appeared in a handful, or even none at all.  As a result, the per game stats should be considered an estimate.

What can I search for?

A lot of stuff.

Oh you want more?  Oh alright.  

You can search for team games that match your criteria (how many games were there in 2013 where a team passed for 400 yards?).  (Answer: 4. Toronto did it 3 times.)

You can search for player games that match your criteria (which players had a game where they rushed for 150 yards?).  (Answer: Kory Sheets (3 times), Chris Garrett, Jon Cornish (3 times), Chad Kackert and Brandon Whitaker)

You can search for drives matching your conditions (show me the drives where a team got the ball after an interception or fumble). (Answer: it happened 246 times, and 82 touchdowns were scored)

And you can search for plays matching your conditions (show me the results of plays from the opponent's goal line). (Answer: 73 plays and 51 touchdowns)

Within these searches are a lot of options for filtering, sorting and grouping.  I'm sure there are things which currently can't be searched for, but I think you'll find there are a ton of things you can.


Errors

What errors?  Everything is perfect, or I wouldn't be releasing it, obviously.

Yeah, that's a lie. There are going to be errors in the data; it's a fact of life with a database this large. The import process is designed to catch as many as can be identified, and I fix those by hand as I find them, but I'm certain some have slipped through. At the bottom of every page, you'll see a "report error" link. If you find something you think is wrong, click that link and send in the details of the error. The more detail the better. You don't have to provide an email to send a report, but if you do, I'll update you with the resolution.

* In rare cases, the play by play data was not available for processing or contained errors too numerous to be utilized.  These games are clearly marked when you access them, and will have high level stats available based on game box scores, but no searchable plays.

Wednesday, June 18, 2014

Finding a New Magic Number

In the formula for Pythagorean Expectation, a magic number exists.

Ok, it's not really magic, rather, it started from an assumption made by a very smart man ("2 would be a good number") and ended with rigorous scientific testing by even more very smart people ("1.83 is actually a better value for baseball").

When I started this project, I knew that different sports use different exponents, that more scoring means a higher exponent, and that the CFL has more scoring than the NFL.  Unfortunately, as I was just beginning to collect data, I had no way of determining what the best exponent for the CFL would be.  In the end, after looking at the values used for various sports (MLB = 1.83, EPL= 1.30, NHL = 2.15, NFL = 2.37, NBA = 13.91), I decided the gap between the CFL and NFL was probably small enough that the known exponent for the NFL was likely good enough to be useful for my calculations.
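For anyone who wants to see the exponent at work, here is a minimal Python sketch of the formula. The function name and the sample point totals are made up for illustration; only the 2.37 default comes from the list above.

```python
def pythagorean_expectation(points_for, points_against, exponent=2.37):
    """Expected winning percentage from points scored and points allowed."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)


# Hypothetical team: 500 points for, 450 against, over an 18-game season.
pct = pythagorean_expectation(500, 450)
expected_wins = pct * 18
print(f"{pct:.3f} winning percentage, {expected_wins:.1f} expected wins")
```

A higher exponent stretches the same points gap into a bigger expected edge, which is why the choice of number matters.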

Now, however, I have data going back to 1990, and after some prompting from a gentleman from Hamilton, I realized it would be prudent to go back and do the math.

Based on some research, I settled on the method outlined here (external link), which calculates a value for lacrosse.  The calculation itself is fairly simple:

1) Find the expected win total using the Py Expectation formula, and subtract it from the actual win total.
2) Square that value.
3) Calculate this value for every team in every year that I have data for.
4) Add up all the values.
5) Find the square root.

This leaves me with the root-mean-square error (RMSE) for the data using whichever exponent I used in step 1. All that's left at this point is to run the calculation with a range of exponents to determine which results in the lowest RMSE.

Thanks to Bill Barnwell and the others who have already done these calculations for the NFL, I had a reasonable clue as to where the exponent would fall, so I calculated the RMSE for 2.00 through 5.00, increasing by 0.01 each time.
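The whole search can be sketched in a few lines of Python. This is a rough illustration rather than the exact script behind the numbers: the function names and the sample rows are invented, and ties are ignored. The steps as listed sum the squared errors and take the square root; the sketch also divides by the number of team-seasons so the result is a true mean, which does not change which exponent comes out lowest.

```python
import math


def py_pct(pf, pa, exponent):
    """Pythagorean winning percentage for a given exponent."""
    return pf ** exponent / (pf ** exponent + pa ** exponent)


def rmse(team_seasons, exponent):
    """Root-mean-square error between actual and expected wins.

    team_seasons: list of (points_for, points_against, wins, games) tuples.
    """
    squared = [
        (wins - py_pct(pf, pa, exponent) * games) ** 2
        for pf, pa, wins, games in team_seasons
    ]
    return math.sqrt(sum(squared) / len(squared))


def best_exponent(team_seasons, low=2.00, high=5.00, step=0.01):
    """Try every exponent from low to high and return (exponent, rmse) with the lowest error."""
    n_steps = round((high - low) / step)
    candidates = [round(low + i * step, 2) for i in range(n_steps + 1)]
    return min(((e, rmse(team_seasons, e)) for e in candidates),
               key=lambda pair: pair[1])


# Made-up team-seasons, purely to show the call; real input would be
# one row per team per year.
sample = [
    (500, 400, 13, 18),
    (450, 450, 9, 18),
    (380, 520, 4, 18),
]
exponent, error = best_exponent(sample)
print(exponent, round(error, 3))
```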

As expected, the value came out higher than the NFL, but not by much:


The most accurate value of the bunch is 2.74 (raw RMSE data), with an error rate approximately 3% lower than the original 2.37 exponent.

So what does this all mean?

Good question. For starters, it means that going forward, I will be using 2.74 for my calculations. At some point, I will also go back and revise some of the posts discussing historical data to improve the accuracy. I will not go back and alter the data for 2013, as those posts were simply meant to provide a week by week rundown, and there would be limited value in correcting the data at this point.

So there you have it: 2.74, my new favorite number.

Monday, November 4, 2013

Py Win Rankings - Week 19 and year end wrap-up

The regular season is over and unfortunately for us fans, it ended with a week that meant precisely nothing to the playoff picture.  As a result, we ended up with a weekend slate of games populated by backups.  Too bad, especially since a narrow Edmonton win dragged them closer to their py expectations and made their final numbers a little bit less interesting.

Calgary finishes the season on top, Winnipeg finishes the season at the bottom, and Hamilton and Montreal wind up in a tie despite a 2 game difference in the official standings.

Here are the final numbers:

Luckiest Team: Calgary (+2.1 wins)
Unluckiest Team: Edmonton (-3 wins)

Biggest Jump: Hamilton (+0.8 projected wins)
Biggest Drop: Calgary (-0.5 projected wins)




2013 Recap

Now that the season is complete, we can go back and see how things changed since I posted my first article back in week 10.

As I noted in my introductory article, the primary value of the pythagorean expectation formula is as an indicator of future results. While I used it here as a sort of "mathematical" power ranking, that's really not its purpose.

#1 Calgary
Started: 7-2, 1.2 wins over expectation
Finished: 14-4, 2.1 wins over expectation
All time rank: #32 of 207

The Stamps ignored the odds and finished the second half of the season the same way as the first - 7-2 and roughly 1 win over expectation. Calgary's finishing total of 2.1 wins over expectation is the second highest since 1990, tied with Winnipeg in 2001 (lost in the Grey Cup), and Baltimore in 1995 (won the Grey Cup).


#2 Saskatchewan
Started: 8-1, 1.4 wins over expectation
Finished: 11-7, 0.7 wins below expectation
All time rank: #37 of 207

The Riders started strong but regressed towards expectations over the second half of the season. With the league's best scoring defense and second best offense, this Rider team finishes as the best Rider team since 1990, according to Py win percentage.

#3 Toronto
Started: 5-4, exactly on expectation
Finished: 11-7, 1 win over expectation
All time rank: #75 of 207

Toronto was pretty consistent all year.  They got a bit luckier in the second half of the season after playing right along expectations in the first half.

#4 BC
Started: 6-3, 1.3 wins over expectation
Finished: 11-7, 1.1 wins over expectation
All time rank: #76 of 207

Like Toronto, BC was fairly consistent for most of the year. The #3 and #4 teams jumped back and forth all year, finishing with nearly identical seasons. Toronto scored 3 more points than BC, and BC allowed 2 fewer points than Toronto. They end up back to back in the all time rankings, a mere 0.018 Py wins apart.

#5 Montreal
Started: 4-5, 0.3 wins over expectation
Finished: 8-10, 0.7 wins below expectation
All time rank: #116 of 207


2013 was a rough year for Alouette fans, but they can take some solace in the fact that the math says they are just the tiniest bit better than Hamilton, despite the 2 game difference in records.

#6 Hamilton (tie)
Started: 4-5, 0.1 wins below expectation
Finished: 10-8, 1.3 wins over expectation
All time rank: #119 of 207

The two game difference between Montreal and Hamilton is why stats like this were invented.  Like BC and Toronto, these teams had virtually identical seasons, separated by 6 points offensively, and 3 points defensively, and yet Hamilton finishes 2 games clear of Montreal in the standings.  Expect a close one in Guelph this weekend.  (Side note - is there anyone out there who'd have guessed that Montreal finishes with the better offense, and Hamilton with the better defense?)

#7 Edmonton
Started: 1-8, 2.4 wins below expectation
Finished: 4-14, 3 wins below expectation
All time rank: #160 of 207

From a math standpoint, Edmonton was the most interesting team in the league this year. Their close losses early in the season inspired me to start collecting these stats, and unfortunately for Eskimo fans, their luck did not improve in the second half of the season. A meaningless week 19 win brings their win differential up slightly, but it is still good for a tie for second unluckiest all time at -3.0 wins.

#8 Winnipeg
Started: 1-8, 1.4 wins below expectation
Finished: 3-15, 1.3 wins below expectation
All time rank: #200 of 207

The Bombers were the worst team in the league this year, and it wasn't particularly close.  Their defense allowed 66 points more than 7th place Edmonton, and in a year where half the league scored more than 500 points, Winnipeg wasn't even close to cracking 400.  According to the numbers, only 7 teams since 1990 have been worse, and two of those teams don't even exist anymore (the 1995 Ottawa Rough Riders, and the 1994 Shreveport Pirates).


Playoff Predictions

My research into how Py Wins and Big Wins can be used to project playoff stats is incomplete at this point, but what I do have so far indicates that the answer is probably "not very well".

Again going back to 1990, the best team according to Py Wins has only gone on to make the Grey Cup 56.5% of the time, and only won it 43.5% of the time.  That fares poorly compared to simply using wins as a projector, where the team with the most wins (outright or tied) has made the Grey Cup 65% of the time, and won it 56.5% of the time.

I intend to do some more research in the off season, but my theory at this point is that home field advantage, coupled with the bye that the division winner gets, is a significant enough advantage that it more than offsets any difference in team quality, especially since the two teams playing in the West or East Final games should typically be fairly close in quality.

With that in mind, here are my mathematically unsound, empirically irrelevant predictions:

West Semi - BC @ SSK
BC has been poor on the road (3-6) and the Riders are above average (6-3) at home.   They also beat BC twice fairly handily.  I like Saskatchewan to advance here.

East Semi - MTL @ HAM
Hamilton was good at home (6-3), and Montreal was just OK on the road (4-5), but the math suggests they are very evenly matched, and it took a wacky special teams play for Hamilton to pull off the last one.  I think Montreal will put this one away earlier and avoid the late game shenanigans.

West Final - SSK @ CGY
Calgary is just too good at home (8-1), and despite the two teams playing each other close this season, it just seems like Calgary has had Saskatchewan's number since that early loss at Mosaic. It pains me to write this, but I see Calgary winning this one.

East Final - MTL @ TOR
Remember what I said about top seeds and playoff success?  I don't see this game defying the odds.  Toronto is a better team on both sides of the ball, and I like them to win and set up a rematch of the 2012 Grey Cup.

Grey Cup - CGY vs TOR
I really hope I'm wrong about this matchup, and I get to see the Riders play in the Grey Cup at home.  But there is no room for hope in predictions, only speculation and BS.  Toronto pulled off a crazy upset at McMahon earlier in the year, but this should basically be a home game for Calgary, and the Stamps have been the best team all year.  I like the Stamps to win it all, adding another mark in the "won grey cup" column for both the "Most wins" and "Most Py Wins" statistics.

Tuesday, October 29, 2013

Py Win Rankings - Week 18

One last week of numbers before the end of the season, and it's looking to me like this year is going to stand out on both ends of the spectrum. It looks like a virtual certainty that Edmonton will finish as the second unluckiest team of all time, and now it's looking like Calgary will be the luckiest 14 or 15 win team in history as well. Other teams have finished further above their Py Expectation, but only Baltimore in 1995 has finished with 15 wins and been more than 2 wins above expectation. It's far from empirical, but it's worth noting that Baltimore won the Grey Cup that year.

The rankings themselves haven't changed at all, without even any interesting projection changes.  That's of course because as the season goes on, each game affects the totals by a smaller percentage than previous games, so things are mostly stable by now.  Based on the gaps between teams at this point, I don't anticipate any changes next week either, other than perhaps BC moving up a spot if they win big and Toronto loses.
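The shrinking effect of each game is easy to see with made-up numbers. This sketch assumes a hypothetical team that scores 28 and allows 24 every week, then plays one 40-7 blowout, and compares how far that single result moves the Py percentage after 4, 8 and 16 games. The names and scores are invented; 2.37 is the exponent used for this season's rankings.

```python
def py_pct(pf, pa, exponent=2.37):
    """Pythagorean winning percentage."""
    return pf ** exponent / (pf ** exponent + pa ** exponent)


def shift_from_one_game(games_played, avg_for=28, avg_against=24,
                        game_for=40, game_against=7):
    """Change in Py percentage caused by one extra game after `games_played` games."""
    pf = avg_for * games_played
    pa = avg_against * games_played
    before = py_pct(pf, pa)
    after = py_pct(pf + game_for, pa + game_against)
    return after - before


# The same blowout moves the percentage less the later it happens.
for n in (4, 8, 16):
    print(n, round(shift_from_one_game(n), 3))
```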

Next week after we have the final numbers, I'll take a look at each team and how historically similar teams have fared in the playoffs and future seasons.

Luckiest Team: Calgary (+2.3 wins)
Unluckiest Team: Edmonton (-3.4 wins)

Biggest Jump: Toronto and BC (+0.3 projected wins)
Biggest Drop: Saskatchewan (-0.3 projected wins)

Monday, October 21, 2013

Py Win Rankings - Week 17

2 weeks left in the season (playoff time, for the fantasy football fans).

A bit of shuffling in the ranks this week, as 4 teams switch places.  Toronto and Montreal move up, BC and Hamilton move down.  My broken record repeats as Edmonton continues to be historically unlucky.

Luckiest Team: Calgary (+1.9 wins)
Unluckiest Team: Edmonton (-3.1 wins)

Biggest Jump: Montreal (+0.8 projected wins)
Biggest Drop: Hamilton (-0.7 projected wins)

 * In hindsight, my decision to call column 10 "Projected" was a poor one.  It was never a true projection, it's merely the teams' Py winning percentage extrapolated over 18 games.  It looks quite silly now that Calgary has more real wins than "projected" wins.  I'll find a better name next year, or better yet, work on a proper projection.
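To make that concrete, here is a rough Python sketch of how a "Projected" figure and a wins-versus-expectation figure can be computed as described. The function names and the sample record are hypothetical, and the sketch assumes the 2.37 exponent used for this season's rankings.

```python
SEASON_GAMES = 18


def py_pct(pf, pa, exponent=2.37):
    """Pythagorean winning percentage."""
    return pf ** exponent / (pf ** exponent + pa ** exponent)


def projected_wins(pf, pa):
    """Py winning percentage extrapolated over a full 18-game season."""
    return py_pct(pf, pa) * SEASON_GAMES


def wins_vs_expectation(wins, games_played, pf, pa):
    """Actual wins minus the wins the formula expects so far (positive = lucky)."""
    return wins - py_pct(pf, pa) * games_played


# Hypothetical team: 12-4 with 480 points for and 400 against.
print(round(projected_wins(480, 400), 1))
print(round(wins_vs_expectation(12, 16, 480, 400), 1))
```

Because the first number ignores the actual record entirely, a lucky team can end up with more real wins than "projected" wins.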

Tuesday, October 15, 2013

Py Win Rankings Week 16

Nothing to see here folks, no change at all.

Calgary stays on top, Winnipeg on the bottom. Even the projections for the top 2 teams held steady (to be clear, those aren't a prediction of how many wins I expect a team to finish with; they are just the result of the pythagorean formula taken over 18 games).

One thing to note here: barring some kind of miraculous turnaround, Edmonton is closing in on one of the unluckiest seasons in the past 20+ years. Their current total of -3.0 wins vs expectation would finish in a tie for second place with Hamilton in 2008, only behind Winnipeg's -4.5 in 2010. Eskimo fans take heart - each of those teams followed up their historically unlucky seasons with big turnarounds the next year - 9 wins and a home playoff game for the Tiger-Cats, and 10 wins and a Grey Cup appearance for the Bombers.

Luckiest Team: Calgary (+1.7 wins)
Unluckiest Team: Edmonton (-3 wins)

Biggest Jump: Winnipeg (+0.5 projected wins)
Biggest Drop: BC (-0.5 projected wins)


Thursday, October 10, 2013

Py Rankings Week 15

Little late on this one, sorry to anyone who was looking for this post earlier in the week.

After many weeks of hanging around despite losses, the Riders win this week and still relinquish their hold on top spot, dropping to #2 and leaving Calgary alone at the top, while Montreal and Edmonton swap places near the bottom.

Nothing overly surprising this week; the rankings exactly match the CFL standings.

Luckiest Team: Calgary (+1.5 wins)
Unluckiest Team: Edmonton (-2.6 wins)

Biggest Jump: Montreal (+0.9 projected wins)
Biggest Drop: Edmonton (-0.6 projected wins)