I was asked by TSN to make predictions for the 2013 NCAA Men's Basketball "March Madness" tournament bracket based solely on a statistical analysis, without using any specific knowledge of NCAA teams (which is just as well since, although I like sports and watch them sometimes and even play a bit of neighbourhood pick-up basketball myself, I haven't closely followed any spectator sports in years).
So I proceeded by:
a) Gathering lots of different data variables for each team, for each of the past four regular seasons.
b) Separately gathering the results of each game of each of the past three years' March Madness tournaments.
c) Combining all of that data together for my computer programs to read (which turned out to be very time-consuming, since different data are available on different web sites in different formats with different team name abbreviations, so I had to "teach" my computer to match them all up).
d) Exploring different "non-negative linear combinations" of the data, i.e. formulas which use the data from a given regular season, to give an overall score to each team (I use the phrase "regular season" to include all games from that season prior to the NCAA March Madness tournament, including conference tournament games).
e) Developing computer programs to "fit" the formula based on previous seasons, i.e. to do an extensive search to figure out which of those formulas did the best job of predicting the winners for each game in that year's tournament, using data from the corresponding regular season.
f) Eventually coming up with a single best formula for this, which I call the "Rosenthal Fit."
g) Then, filling in the actual bracket simply by picking, for each game, whichever team has a larger value of their Rosenthal Fit.
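As a rough illustration of the name-matching problem in step (c), here is a minimal Python sketch. It assumes each site's team table has already been downloaded into a dictionary; the alias table, the helper names `canonical_name` and `merge_sources`, and the sample numbers are all hypothetical and are not taken from the actual programs.

```python
# Hypothetical alias table: each web site's spelling maps to one canonical name.
# These particular spellings are made up for illustration.
ALIASES = {
    "ohio st.": "Ohio State",
    "ohio state": "Ohio State",
    "miami (fl)": "Miami (FL)",
    "miami fl": "Miami (FL)",
}


def canonical_name(raw_name: str) -> str:
    """Normalise case and whitespace, then look the name up in ALIASES."""
    key = " ".join(raw_name.lower().split())
    if key not in ALIASES:
        # Fail loudly so unmatched teams get fixed by hand rather than dropped.
        raise KeyError(f"No alias known for team name: {raw_name!r}")
    return ALIASES[key]


def merge_sources(*sources: dict) -> dict:
    """Combine per-site {team: {stat: value}} tables under canonical names."""
    merged: dict = {}
    for source in sources:
        for raw_name, stats in source.items():
            merged.setdefault(canonical_name(raw_name), {}).update(stats)
    return merged


# Made-up numbers, only to show two sites' rows ending up in one record.
site_a = {"Ohio St.": {"WinFrac": 0.75}}
site_b = {"OHIO  STATE": {"SOS": 0.58}}
combined = merge_sources(site_a, site_b)
print(combined)
```

Raising an error on an unknown spelling, rather than guessing, is one possible design choice: it forces every mismatch to be resolved once in the alias table.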
The formula for the Rosenthal Fit, plus an evaluation of how well it performed when applied to data from the previous three years' tournaments, is provided below. Corresponding values for all teams for the 2012-2013 regular season (to be used to predict the 2013 tournament bracket) are listed at the end of this article.
The NCAA tournament is inherently hard to predict. Indeed, the total number of different ways of filling in your bracket predictions is 2^63 (i.e., 63 different 2's all multiplied together), which works out to about 9 x 10 to the 18th, i.e. a nine followed by 18 zeros, which equals nine billion billion, or nine million million million. That's a lot of possibilities!
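The count above is easy to check directly. A short Python sketch (the variable name `total_brackets` is just for illustration):

```python
# Each of the 63 games has two possible winners, so the choices multiply.
total_brackets = 2 ** 63

print(total_brackets)           # 9223372036854775808
print(f"{total_brackets:.2e}")  # 9.22e+18
```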
In fact, even the experts find it challenging. For example, in past tournament games, the higher-seeded team won only about 70 per cent of the games. This means that even when many of the most knowledgeable people get together to seed the teams, they can still only correctly predict the winner about 70 per cent of the time. Individual expert basketball predictors (e.g. Ken Pomeroy at KenPom.com) tend to perform similarly, accurately predicting the winner in only about 70 per cent of the tournament games. Part of the reason is that each matchup is a single-elimination game, rather than e.g. a seven-game series, so there is lots of inherent day-to-day randomness, and it is quite possible for a weaker team to beat a "better" team in any one game, making predictions that much more difficult.
So, despite my extensive computer programming and statistical modeling, I do not expect to do better than calling about 70 per cent of the games correctly.
Indeed, I would say that anyone who does much better than 70 per cent would have to get fairly lucky (in addition to perhaps having a good predictive model and/or good knowledge of the basketball teams).
Statistical Data Considered:
To perform my statistical analysis, I downloaded and considered lots of different statistics, including the following (listed with sources):
- WinFrac: The team's overall game-winning fraction for the entire regular (pre-March Madness) season. (teamrankings.com)
- WinFrac3: The team's game-winning fraction in their final three regular season games. (teamrankings.com)
- CWinFrac: The team's game-winning fraction for games within their own conference. (realtimerpi.com)
- NCWinFrac: The team's game-winning fraction for games outside of their own conference. (realtimerpi.com)
- AdOff: The team's "adjusted" offensive efficiency rating. (KenPom.com)
- AdDef: The team's "adjusted" defensive efficiency rating. (KenPom.com)
- OffEff: The team's unadjusted offensive efficiency rating. (teamrankings.com)
- DefEff: The team's unadjusted defensive efficiency rating. (teamrankings.com)
- SOS: The team's "Strength of Schedule", a measure of the average strength of the opponents they played. (realtimerpi.com)
- RPI: The team's "Ratings Percentage Index". (realtimerpi.com)
- PntPG: The team's average number of points scored per game. (teamrankings.com)
- OpPnt: The team's average number of points scored against them per game. (teamrankings.com)
- I also examined the team statistics provided at ncaa.com and at espn.go.com, but they largely overlapped with the above statistics, so in the end I did not need to use them directly.
Finally, and most importantly, the "outcome" measure was:
- TourRes: The game-by-game, line-by-line win/loss results for each game of each of the past three March Madness tournaments. (kusports.com)
Statistical Modeling Approach Taken:
My approach was to try to figure out which linear combination of (i.e., formula using) the above-listed regular-season statistical values would do the best job of ranking the teams from highest to lowest, in terms of who won which games in the corresponding year's tournament. I computed this using regular-season statistical values, and corresponding tournament game results, for each of the three seasons 2009-2010, 2010-2011, and 2011-2012.
To perform this computation, I wrote computer programs in C and in R, which used such techniques as "linear regression", "constrained linear regression," and finally a "Monte Carlo (randomised) search algorithm," to find an optimal formula.
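To give a flavour of what a randomised search over non-negative weights can look like, here is a minimal Python sketch. It is a generic random-perturbation search, not a reconstruction of the actual C and R programs; the function names, the step size, the iteration count, and the toy data are all assumptions made for illustration.

```python
import random


def count_correct(weights, stats, games):
    """Count games in which the higher-scoring team was the actual winner."""
    def score(team):
        return sum(w * x for w, x in zip(weights, stats[team]))
    return sum(1 for winner, loser in games if score(winner) > score(loser))


def random_search(stats, games, n_vars, n_iter=2000, step=0.5, seed=0):
    """Randomly perturb the weights, keeping any change that does not hurt."""
    rng = random.Random(seed)
    best = [1.0] * n_vars
    best_correct = count_correct(best, stats, games)
    for _ in range(n_iter):
        # Clipping at zero keeps every candidate a non-negative combination.
        candidate = [max(0.0, w + rng.gauss(0.0, step)) for w in best]
        correct = count_correct(candidate, stats, games)
        if correct >= best_correct:
            best, best_correct = candidate, correct
    return best, best_correct


# Toy data (made up): two statistics per team, and (winner, loser) pairs.
toy_stats = {"A": (0.9, 0.2), "B": (0.6, 0.8), "C": (0.5, 0.5), "D": (0.3, 0.9)}
toy_games = [("A", "B"), ("C", "D"), ("A", "C")]

weights, n_correct = random_search(toy_stats, toy_games, n_vars=2)
print(weights, n_correct, "of", len(toy_games))
```

Because the number of correctly predicted games is a step function of the weights, a search like this only ever compares counts; accepting ties lets it drift across flat regions instead of getting stuck.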
Although my computer programs considered all of the above variables, they ultimately selected just a few of those variables as being most relevant for prediction, namely: WinFrac, WinFrac3, OffEff, DefEff, SOS, and NCWinFrac.
Using the above statistical analysis, the resulting best linear combination turned out to be:
Rosenthal Fit = 6.2337 x WinFrac + 1.7180 x WinFrac3 + 1.1179 x OffEff + 1.9189 x DefEff + 11.9846 x SOS + 7.3712 x NCWinFrac
I then applied this linear combination formula to the regular-season statistics for the current (2012-2013) season. This provided an overall numerical rating for each team this year, based on their regular-season statistics. These ratings are listed below, in order from highest to lowest.
Then, to fill out this year's tournament bracket using this Rosenthal Fit, simply choose, for each game, whichever team has a higher value of the Rosenthal Fit.
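The two steps just described, computing a rating and then picking the higher-rated team, can be sketched in a few lines of Python. The coefficients are copied from the formula above and the four ratings from the list of Rosenthal Fit values; the function names are hypothetical, the scaling applied to each input statistic is not stated in this article (so `rosenthal_fit` simply assumes its inputs are already on the scale the formula expects), and the tie-breaking rule is an arbitrary assumption.

```python
# Coefficients copied from the Rosenthal Fit formula.
COEFFS = {
    "WinFrac": 6.2337,
    "WinFrac3": 1.7180,
    "OffEff": 1.1179,
    "DefEff": 1.9189,
    "SOS": 11.9846,
    "NCWinFrac": 7.3712,
}


def rosenthal_fit(stats: dict) -> float:
    """Weighted sum of a team's statistics (inputs assumed pre-scaled)."""
    return sum(coeff * stats[name] for name, coeff in COEFFS.items())


# A few ratings copied from the list of Rosenthal Fit values.
FIT = {"Duke": 24.1150, "Louisville": 23.7559, "Kansas": 23.6584, "Kentucky": 19.3383}


def pick_winner(team_a: str, team_b: str, fit: dict = FIT) -> str:
    """Pick the team with the larger rating (ties go to team_a, arbitrarily)."""
    return team_a if fit[team_a] >= fit[team_b] else team_b


print(pick_winner("Duke", "Louisville"))   # Duke
print(pick_winner("Kentucky", "Kansas"))   # Kansas
```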
Note: The above rating system is based purely on statistical analysis, without taking any other factors into account. Certain late-breaking events (e.g. Kentucky Wildcats superstar Nerlens Noel's major injury on February 12) could potentially have a large impact on a team's tournament performance despite making only small changes to their regular-season statistics, which could throw off my model's predictions. I did consider making a few post-hoc adjustments to account for such developments, but in the end I decided not to - thus keeping the Rosenthal Fit as a purely statistical measure.
Comparison to Other Predictors:
The following table shows how the tournament seedings, the RPI (Ratings Percentage Index), and the Rosenthal Fit would each have done at predicting tournament games in each of the past three tournaments. (In two of the tournaments, there was one game between two equally-seeded teams; those two games are excluded from the evaluation of the tournament seedings.)
|Season||Tournament Seedings||RPI||Rosenthal Fit|
|2009-2010||42/62 (67.74%)||44/63 (69.84%)||48/63 (76.19%)|
|2010-2011||43/63 (68.25%)||38/63 (60.32%)||43/63 (68.25%)|
|2011-2012||46/62 (74.19%)||44/63 (69.84%)||45/63 (71.43%)|
|Total||131/187 (70.05%)||126/189 (66.67%)||136/189 (71.96%)|
This table shows that the Rosenthal Fit compares favourably with RPI and with the tournament seedings. This should not be taken as evidence of any particular superiority, since the Rosenthal Fit was developed precisely to try to maximise these predictions. Still, it does suggest that the Rosenthal Fit is at least roughly comparable in predictive power to these expert measures.
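The totals in the table can be recomputed from the per-season counts with a short Python sketch. The column order (tournament seedings, RPI, Rosenthal Fit) is an assumption inferred from the text: the seedings column is the one with the two excluded games, and the Rosenthal Fit is the one that compares favourably overall.

```python
# (correct, games) per season, copied from the table.
# Assumed column order: tournament seedings, RPI, Rosenthal Fit.
TABLE = {
    "2009-2010": [(42, 62), (44, 63), (48, 63)],
    "2010-2011": [(43, 63), (38, 63), (43, 63)],
    "2011-2012": [(46, 62), (44, 63), (45, 63)],
}
COLUMNS = ["Tournament seedings", "RPI", "Rosenthal Fit"]

# Sum the correct picks and the games played down each column.
totals = [tuple(map(sum, zip(*column))) for column in zip(*TABLE.values())]

for name, (correct, games) in zip(COLUMNS, totals):
    print(f"{name}: {correct}/{games} ({100 * correct / games:.2f}%)")
```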
In a few weeks, we will know how well it performed this year.
Jeffrey Rosenthal is a professor in the Department of Statistics at the University of Toronto, and the author of the bestseller Struck by Lightning: The Curious World of Probabilities. His analysis can be seen during TSN's coverage of the 2013 NCAA Men's Basketball tournament.
List of Rosenthal Fit Values:
Duke 24.1150
Louisville 23.7559
Kansas 23.6584
New Mexico 23.5325
Gonzaga 23.4355
Arizona 23.2148
Indiana 23.0785
Michigan 22.6300
Ohio St. 22.6260
Georgetown 22.5934
Syracuse 22.5526
Creighton 22.5324
Miami (FL) 22.3322
Notre Dame 22.2744
Pittsburgh 22.1597
Memphis 22.1042
Wichita St. 22.0946
Saint Louis 22.0907
Florida 22.0731
Michigan St. 22.0105
Butler 21.9748
Kansas St. 21.9461
Oregon 21.9407
Colorado St. 21.8670
Mississippi 21.8169
UNLV 21.7975
Cincinnati 21.7373
N.C. State 21.7080
VCU 21.6183
Bucknell 21.5939
Oklahoma St. 21.5885
St. Mary's 21.5479
Illinois 21.3910
Maryland 21.3721
Belmont 21.3090
UCLA 21.3080
Marquette 21.2605
Temple 21.2184
North Carolina 21.1325
Wyoming 21.0634
Wisconsin 20.9743
Missouri 20.8896
Charlotte 20.8322
Minnesota 20.8182
Middle Tenn. St. 20.8046
Iowa St. 20.8036
Valparaiso 20.7961
San Diego St. 20.6748
Connecticut 20.6519
Iowa 20.6125
Colorado 20.5972
Boise State 20.5151
Albany 20.3990
Utah St. 20.3426
Akron 20.3190
Southern Miss 20.2688
LaSalle 20.1715
Arizona St. 20.0918
Oklahoma 19.9951
Rutgers 19.8699
LSU 19.7374
Tennessee 19.5588
Villanova 19.5010
Houston 19.4979
Virginia 19.4679
Stanford 19.4496
Santa Clara 19.4331
Kentucky 19.3383
Brigham Young 19.3114
Lehigh 19.2614
Seton Hall 19.2364
Texas A&M 19.2074
California 19.1917
Stony Brook 19.1861
Georgia Tech 19.0646
Ohio 19.0342
New Mexico St. 18.9641
Florida St. 18.8859
S Dakota St. 18.8602
Arkansas 18.8197
Davidson 18.7817
Baylor 18.7774
Alabama 18.7748
Dayton 18.7484
Fla Gulf Cst 18.7107
Tulane 18.6753
Loyola (MD) 18.6450
Texas 18.6347
Murray St. 18.6279
Richmond 18.6116
Rob. Morris 18.5161
Providence 18.4669
Nebraska 18.4523
Air Force 18.4451
Iona 18.4391
Illinois St. 18.3915
Vermont 18.3840
Oregon St. 18.3567
South Florida 18.3112
Indiana St. 18.3080
Washington 18.2090
Evansville 18.2070
Harvard 18.1508
Bryant 17.9622
Denver 17.8817
TX El Paso 17.8263
Xavier 17.7947
W. Kentucky 17.7828
Utah 17.7690
St. John's 17.7554
Canisius 17.6712
Wagner 17.6241
Fairfield 17.5919
Tulsa 17.5297
Montana 17.4721
Pacific 17.4308
Vanderbilt 17.3922
Arkansas St. 17.3845
Penn St. 17.3180
Northern Iowa 17.3111
Northwestern 17.2556
Long Island 17.2556
James Madison 17.2510
Detroit 17.2379
George Mason 17.2111
Bradley 17.0855
Loyola (IL) 17.0722
Elon 17.0680
St. Bonaventure 17.0655
Mercer 17.0336
Drake 17.0289
NW State 17.0187
Wake Forest 17.0182
Niagara 16.9581
Purdue 16.9563
Hartford 16.9487
Texas Tech 16.9233
Boston U 16.8685
Rider 16.8067
Clemson 16.7166
De Paul 16.6454
Nevada 16.5988
Princeton 16.5938
UAB 16.5054
UC Irvine 16.5046
Delaware 16.4777
Towson 16.4171
Georgia 16.3679
Lafayette 16.3253
West Virginia 16.2019
San Diego 16.1158
NC A&T 16.1027
Southern 16.0950
Toledo 16.0701
Hawaii 16.0292
Cal Poly 15.8982
Idaho 15.8592
Cleveland St. 15.7620
IPFW 15.7000
Savannah St. 15.6405
Fresno St. 15.6242
Pepperdine 15.6083
Norfolk St. 15.5815
Holy Cross 15.5070
Marshall 15.4374
Army 15.3794
Oral Roberts 15.3730
USC 15.3022
Sam Houston St. 15.2898
Yale 15.1663
Winthrop 15.1356
Morehead St. 15.0979
Brown 15.0842
Drexel 15.0668
TX San Antonio 15.0024
Oakland 14.9904
McNeese St. 14.9467
Quinnipiac 14.9358
North Texas 14.8990
Duquesne 14.8985
Troy 14.8513
Morgan St. 14.7504
Georgia St. 14.7192
LA Lafayette 14.7140
Lipscomb 14.7121
Long Beach St. 14.7059
Manhattan 14.6780
UC Davis 14.5437
Columbia 14.5091
St. Peter's 14.4304
High Point 14.3977
Auburn 14.3659
Marist 14.3493
Wofford 14.3461
San Jose St. 14.3070
Cornell 14.2636
Buffalo 14.2271
Rhode Island 14.1902
Liberty 14.0328
Portland 13.9293
Delaware St. 13.7218
Miami (OH) 13.6686
South Dakota 13.6241
Stetson 13.5838
Fordham 13.5698
N.C. Asheville 13.5688
UCSB 13.5529
Campbell 13.4454
Colgate 13.4360
North Dakota 13.4358
Monmouth 13.3985
Chattanooga 13.3883
Dartmouth 13.2551
Maine 13.1639
Seattle 13.0385
Radford 12.9002
Montana St. 12.8383
Jacksonville 12.8043
Siena 12.7232
Hampton 12.7056
Navy 12.4556
Chicago St. 12.3891
SE Louisiana 12.2742
N. Colorado 12.1435
Jackson St. 12.1361
Austin Peay 12.0914
Rice 11.8819
E. Tenn. St. 11.8395
Old Dominion 11.7348
Nicholls St. 11.6002
IUPUI 11.5430
LA Monroe 11.2691
Samford 11.2131
Citadel 11.1936
Portland St. 11.1429
Howard 11.0323
Hofstra 11.0204
Alabama St. 10.9835
Longwood 10.7365
Furman 10.6795
Presbyterian 10.5587
New Orleans 10.4705
Lamar 10.2693
Florida A&M 10.0584
UC Riverside 9.9920
Kennesaw St. 9.7770
Binghamton 9.6115
Ste F Austin 9.1443
Weber State 9.1435
Col Charlestn 8.5979
N Dakota St. 8.4912
W Illinois 8.4275
UMass 8.2533
NC Central 8.1872
E Kentucky 8.1479
TX Southern 8.1355
W Michigan 7.9282
Kent State 7.8832
Ark Pine Bl 7.8599
Wright State 7.8351
LA Tech 7.8212
Gard-Webb 7.7709
Mt St.Mary's 7.7613
Jksnville St. 7.7428
Charl South 7.6965
E Carolina 7.6424
TX-Arlington 7.6413
Northeastrn 7.6071
Florida Intl 7.5889
TN State 7.5248
Central FL 7.4869
WI-GrnBay 7.4113
Boston Col 7.3077
SE Missouri 7.3018
St Josephs 7.2322
AR Lit Rock 7.2321
Ball State 7.2072
CS Bakersfld 7.0927
S Alabama 7.0577
San Fransco 7.0325
App State 7.0239
SC Upstate 7.0083
S Illinois 6.8708
VA Military 6.7742
TX-PanAm 6.7519
Fla Atlantic 6.7064
Central Ark 6.6817
Wash State 6.6623
IL-Chicago 6.6607
N Kentucky 6.6580
W Carolina 6.6478
Youngs St. 6.6009
E Michigan 6.5514
TN Tech 6.4862
Beth-Cook 6.4780
E Illinois 6.4418
N JIT 6.4057
Central Conn 6.3871
Prairie View 6.3793
Sac State 6.3764
Houston Bap 6.3659
S Methodist 6.3245
Wm & Mary 6.3052
S Carolina 6.2965
Cal St Nrdge 6.2900
Texas State 6.2669
St Fran (NY) 6.1695
Coastal Car 6.1570
Geo Wshgtn 6.1219
Loyola Mymt 6.0941
N Florida 6.0594
Missouri St. 6.0032
Neb Omaha 5.9981
GA Southern 5.9894
Miss State 5.9503
Utah Val St. 5.8729
Central Mich 5.8298
Bowling Grn 5.7696
CS Fullerton 5.6973
E Washingtn 5.6854
VA Tech 5.6753
Maryland BC 5.5945
TX Christian 5.4742
Alab A&M 5.4483
Coppin State 5.4384
U Penn 5.2858
TN Martin 5.2321
N Arizona 5.2213
N Hampshire 5.1888
NC-Grnsboro 5.1397
American 5.1245
Alcorn State 5.0936
Sacred Hrt 5.0929
UMKC 5.0428
NC-Wilmgton 5.0005
S Utah 4.9083
WI-Milwkee 4.8219
St Fran (PA) 4.7744
TX A&M-CC 4.7630
SIU Edward 4.7194
Idaho State 4.6572
Miss Val St. 4.6007
F Dickinson 4.5634
S Car State 4.4436
N Illinois 3.6664
Maryland ES 3.3431
Grambling St. 3.0220