I was asked by TSN to make predictions for the 2013 NCAA Men's Basketball "March Madness" tournament bracket based solely on a statistical analysis, without using any specific knowledge of NCAA teams (which is just as well since, although I like sports and watch them sometimes and even play a bit of neighbourhood pick-up basketball myself, I haven't closely followed any spectator sports in years).
So I proceeded by:
a) Gathering lots of different data variables for each team, for each of the past four regular seasons.
b) Separately gathering the results of each game of each of the past three years' March Madness tournaments.
c) Combining all of that data together for my computer programs to read (which turned out to be very time-consuming, since different data are available on different web sites in different formats with different team name abbreviations, so I had to "teach" my computer to match them all up).
d) Exploring different "non-negative linear combinations" of the data, i.e. formulas which use the data from a given regular season, to give an overall score to each team (I use the phrase "regular season" to include all games from that season prior to the NCAA March Madness tournament, including conference tournament games).
e) Developing computer programs to "fit" the formula based on previous seasons, i.e. to do an extensive search to figure out which of those formulas did the best job of predicting the winners for each game in that year's tournament, using data from the corresponding regular season.
f) Eventually coming up with a single best formula for this, which I call the "Rosenthal Fit."
g) Then, filling in the actual bracket simply by picking, for each game, whichever team has a larger value of their Rosenthal Fit.
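As a rough illustration of the name-matching chore in step (c), here is a minimal Python sketch. It is not the program actually used for this analysis; the alias table and the function names `normalize` and `canonical` are hypothetical, and a real table would need an entry for every spelling found on every site.

```python
import re

# Hypothetical alias table: each normalised spelling maps to one canonical name.
ALIASES = {
    "iowast": "Iowa St.",
    "iowastate": "Iowa St.",
    "miamifl": "Miami (FL)",
    "miamiflorida": "Miami (FL)",
}

def normalize(name: str) -> str:
    """Lower-case the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def canonical(name: str) -> str:
    """Return the canonical team name, or fail loudly if the spelling is unknown."""
    key = normalize(name)
    if key not in ALIASES:
        raise KeyError(f"unmatched team name: {name!r}")
    return ALIASES[key]

print(canonical("IowaSt."), "|", canonical("Iowa State"), "|", canonical("Miami Florida"))
```

Failing loudly on an unknown spelling is a deliberate choice in this sketch: a silently dropped team would quietly corrupt the combined data set.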
The formula for the Rosenthal Fit, plus an evaluation of how well it performed when applied to data from the previous three years' tournaments, is provided below. Corresponding values for all teams for the 2012-2013 regular season (to be used to predict the 2013 tournament bracket) are listed at the end of this article.
General Observations:
The NCAA tournament is inherently hard to predict. Indeed, the total number of different ways of filling in your bracket predictions is 2^63 (i.e., 63 different 2's all multiplied together), which works out to about 9 x 10 to the 18th, i.e. a nine followed by 18 zeros, which equals nine billion billion, or nine million million million. That's a lot of possibilities!
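The count above can be checked directly. The short Python snippet below assumes only what the paragraph states: 63 games, each with two possible winners.

```python
n_games = 63                 # games in the bracket, as stated above
n_brackets = 2 ** n_games    # two possible winners per game

print(n_brackets)            # 9223372036854775808
print(f"{n_brackets:.2e}")   # 9.22e+18
```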
In fact, even the experts find it challenging. For example, in past tournament games, the higher-seeded team only won about 70 per cent of the games. This means that even when many of the most knowledgeable people get together to seed the teams, they can still only correctly predict the winner about 70 per cent of the time. Individual expert basketball predictors (e.g. Ken Pomeroy at KenPom.com) tend to perform similarly, accurately predicting the winner in only about 70 per cent of the tournament games. Part of the reason is that each matchup is a single-elimination game, rather than e.g. a seven-game series, so there is lots of inherent day-to-day randomness, and it is quite possible for a weaker team to beat a "better" team in any one game, making predictions that much more difficult.
So, despite my extensive computer programming and statistical modeling, I do not expect to do better than calling about 70 per cent of the games correctly.
Indeed, I would say that anyone who does much better than 70 per cent would have to get fairly lucky (in addition to perhaps having a good predictive model and/or good knowledge of the basketball teams).
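To get a feel for how much luck "much better than 70 per cent" would require, one can use a simplified binomial model. The Python sketch below assumes that each of the 63 games is called correctly with probability 0.70, independently of the others. That is a simplification, since in a real bracket a wrong early pick also spoils later picks; the function name `prob_at_least` is just illustrative.

```python
from math import comb

def prob_at_least(k: int, n: int = 63, p: float = 0.70) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k correct calls."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance of calling at least 44, 48 or 52 of the 63 games correctly
# under the simplified independent-games assumption.
for k in (44, 48, 52):
    print(k, round(prob_at_least(k), 3))
```

Under this model the probability falls off quickly as the target number of correct calls rises above about 44 (which is 70 per cent of 63).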
Statistical Data Considered:
To perform my statistical analysis, I downloaded and considered lots of different statistics, including the following (listed with sources):
- WinFrac: The team's overall game-winning fraction for the entire regular (pre-March Madness) season. (teamrankings.com)
- WinFrac3: The team's game-winning fraction in their final three regular season games. (teamrankings.com)
- CWinFrac: The team's game-winning fraction for games within their own conference. (realtimerpi.com)
- NCWinFrac: The team's game-winning fraction for games outside of their own conference. (realtimerpi.com)
- AdOff: The team's "adjusted" offensive efficiency rating. (KenPom.com)
- AdDef: The team's "adjusted" defensive efficiency rating. (KenPom.com)
- OffEff: The team's unadjusted offensive efficiency rating. (teamrankings.com)
- DefEff: The team's unadjusted defensive efficiency rating. (teamrankings.com)
- SOS: The team's "Strength of Schedule", a measure of the average strength of the opponents they played. (realtimerpi.com)
- RPI: The team's "Ratings Percentage Index". (realtimerpi.com)
- PntPG: The team's average number of points scored per game. (teamrankings.com)
- OpPnt: The team's average number of points scored against them per game. (teamrankings.com)
- I also examined the team statistics provided at ncaa.com and at espn.go.com, but they largely overlapped with the above statistics, so in the end I did not need to use them directly.
Finally, and most importantly, the "outcome" measure was:
- TourRes: The game-by-game, line-by-line win/loss results for each game of each of the past three March Madness tournaments. (kusports.com)
Statistical Modeling Approach Taken:
My approach was to try to figure out which linear combination of (i.e., formula using) the above-listed regular-season statistical values would do the best job of ranking the teams from highest to lowest, in terms of who won which games in the corresponding year's tournament. I computed this using regular-season statistical values, and corresponding tournament game results, for each of the three seasons 2009-2010, 2010-2011, and 2011-2012.
To perform this computation, I wrote computer programs in C and in R, which used such techniques as "linear regression", "constrained linear regression," and finally a "Monte Carlo (randomised) search algorithm," to find an optimal formula.
Although my computer programs considered all of the above variables, they ultimately selected just a few of those variables as being most relevant for prediction, namely: WinFrac, WinFrac3, OffEff, DefEff, SOS, and NCWinFrac.
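To make the idea of a randomised search over non-negative linear combinations concrete, here is a minimal Python sketch. It is a generic random hill-climbing search, not the actual C and R programs described above; the function names, the step size, and the toy data are all made up for illustration.

```python
import random

def n_correct(weights, stats, games):
    """Count games in which the team with the higher score actually won.

    stats: dict mapping team -> list of regular-season feature values
    games: list of (winner, loser) pairs from the tournament
    """
    score = {t: sum(w * x for w, x in zip(weights, xs)) for t, xs in stats.items()}
    return sum(score[win] > score[lose] for win, lose in games)

def random_search(stats, games, n_features, n_iter=2000, step=0.5, seed=0):
    """Randomised hill-climbing over non-negative weight vectors."""
    rng = random.Random(seed)
    best = [1.0] * n_features
    best_hits = n_correct(best, stats, games)
    for _ in range(n_iter):
        # Perturb every weight, clipping at zero to keep the combination non-negative.
        cand = [max(0.0, w + rng.gauss(0.0, step)) for w in best]
        hits = n_correct(cand, stats, games)
        if hits >= best_hits:  # accept ties so the search keeps moving
            best, best_hits = cand, hits
    return best, best_hits

# Toy, made-up data (NOT real team statistics): two features per team.
toy_stats = {"A": [0.9, 0.6], "B": [0.7, 0.8], "C": [0.5, 0.5], "D": [0.6, 0.3]}
toy_games = [("A", "C"), ("B", "D"), ("B", "A")]

weights, hits = random_search(toy_stats, toy_games, n_features=2)
print(weights, hits, "of", len(toy_games))
```

The objective here is simply the number of games called correctly, which is not smooth in the weights; that is one reason a randomised search can be a natural alternative to ordinary regression for this kind of problem.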
Final Formula:
Using the above statistical analysis, the resulting best linear combination turned out to be:
Rosenthal Fit = 6.2337 x WinFrac + 1.7180 x WinFrac3 + 1.1179 x OffEff + 1.9189 x DefEff + 11.9846 x SOS + 7.3712 x NCWinFrac
I then applied this linear combination formula to the regular-season statistics for the current (2012-2013) season. This provided an overall numerical rating for each team this year, based on their regular-season statistics. These ratings are listed below, in order from highest to lowest.
Then, to fill out this year's tournament bracket using this Rosenthal Fit, simply choose, for each game, whichever team has a higher value of the Rosenthal Fit.
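The formula and the bracket-filling rule can be sketched in a few lines of Python. The coefficients below are the ones in the formula above, read as decimal numbers, and the three fit values are taken from the list at the end of this article. The function names are illustrative, the exact scaling of each input statistic is not specified here, and breaking ties in favour of the first-named team is an arbitrary assumption.

```python
# Coefficients from the formula above, read as decimals.
COEFFS = {
    "WinFrac": 6.2337,
    "WinFrac3": 1.7180,
    "OffEff": 1.1179,
    "DefEff": 1.9189,
    "SOS": 11.9846,
    "NCWinFrac": 7.3712,
}

def rosenthal_fit(stats: dict) -> float:
    """Weighted sum of a team's regular-season statistics."""
    return sum(COEFFS[name] * stats[name] for name in COEFFS)

def pick_winner(team_a: str, team_b: str, fit: dict) -> str:
    """Pick whichever team has the higher fit value (ties go to team_a)."""
    return team_a if fit[team_a] >= fit[team_b] else team_b

# Fit values copied from the list at the end of the article.
fit_values = {"Duke": 24.1150, "Louisville": 23.7559, "Kansas": 23.6584}

print(pick_winner("Louisville", "Duke", fit_values))    # Duke
print(pick_winner("Kansas", "Louisville", fit_values))  # Louisville
```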
Note: The above rating system is based purely on statistical analysis, without taking any other factors into account. Certain late-breaking events (e.g. Kentucky Wildcats superstar Nerlens Noel's major injury on February 12) could potentially have a large impact on a team's tournament performance despite making only small changes to their regular-season statistics, which could throw off my model's predictions. I did consider making a few post-hoc adjustments to account for such developments, but in the end I decided not to - thus keeping the Rosenthal Fit as a purely statistical measure.
Comparison to Other Predictors:
The following table shows how the Rosenthal Fit (RF), the tournament seedings, and the RPI (Ratings Percentage Index) itself would have done at predicting tournament games in each of the past three tournaments. (In two of the tournaments, there was one game between two equally-seeded teams; those two games are excluded from the evaluation of the tournament seedings.)
| Season | Seedings | RPI | RF |
|---|---|---|---|
| 2009-2010 | 42/62 (67.74%) | 44/63 (69.84%) | 48/63 (76.19%) |
| 2010-2011 | 43/63 (68.25%) | 38/63 (60.32%) | 43/63 (68.25%) |
| 2011-2012 | 46/62 (74.19%) | 44/63 (69.84%) | 45/63 (71.43%) |
| Total | 131/187 (70.05%) | 126/189 (66.67%) | 136/189 (71.96%) |
This table shows that the Rosenthal Fit compares favourably with RPI and with the tournament seedings. This should not be taken as evidence of any particular superiority, since the Rosenthal Fit was developed precisely to try to maximise these predictions. Still, it does suggest that the Rosenthal Fit is at least roughly comparable in predictive power to these expert measures.
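The totals in the table follow directly from the per-season counts, as this short Python check shows. The counts are copied from the table; the function name `total_accuracy` is just illustrative.

```python
# (correct, total) pairs for 2009-2010, 2010-2011 and 2011-2012, from the table.
results = {
    "Seedings": [(42, 62), (43, 63), (46, 62)],
    "RPI": [(44, 63), (38, 63), (44, 63)],
    "RF": [(48, 63), (43, 63), (45, 63)],
}

def total_accuracy(rows):
    """Return (correct, games, percentage) summed over all seasons."""
    correct = sum(c for c, _ in rows)
    games = sum(n for _, n in rows)
    return correct, games, 100.0 * correct / games

for name, rows in results.items():
    correct, games, pct = total_accuracy(rows)
    print(f"{name}: {correct}/{games} ({pct:.2f}%)")
# Seedings: 131/187 (70.05%)
# RPI: 126/189 (66.67%)
# RF: 136/189 (71.96%)
```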
In a few weeks, we will know how well it performed this year.
Jeffrey Rosenthal is a professor in the Department of Statistics at the University of Toronto, and the author of the bestseller Struck by Lightning: The Curious World of Probabilities. His analysis can be seen during TSN's coverage of the 2013 NCAA Men's Basketball tournament.
List of Rosenthal Fit Values:
Duke 24.1150
Louisville 23.7559
Kansas 23.6584
New Mexico 23.5325
Gonzaga 23.4355
Arizona 23.2148
Indiana 23.0785
Michigan 22.6300
Ohio St. 22.6260
Georgetown 22.5934
Syracuse 22.5526
Creighton 22.5324
Miami (FL) 22.3322
Notre Dame 22.2744
Pittsburgh 22.1597
Memphis 22.1042
Wichita St. 22.0946
Saint Louis 22.0907
Florida 22.0731
Michigan St. 22.0105
Butler 21.9748
Kansas St. 21.9461
Oregon 21.9407
Colorado St. 21.8670
Mississippi 21.8169
UNLV 21.7975
Cincinnati 21.7373
N.C. State 21.7080
VCU 21.6183
Bucknell 21.5939
Oklahoma St. 21.5885
St. Mary's 21.5479
Illinois 21.3910
Maryland 21.3721
Belmont 21.3090
UCLA 21.3080
Marquette 21.2605
Temple 21.2184
North Carolina 21.1325
Wyoming 21.0634
Wisconsin 20.9743
Missouri 20.8896
Charlotte 20.8322
Minnesota 20.8182
Middle Tenn.St. 20.8046
Iowa St. 20.8036
Valparaiso 20.7961
San Diego St. 20.6748
Connecticut 20.6519
Iowa 20.6125
Colorado 20.5972
Boise State 20.5151
Albany 20.3990
Utah St. 20.3426
Akron 20.3190
Southern Miss 20.2688
LaSalle 20.1715
Arizona St. 20.0918
Oklahoma 19.9951
Rutgers 19.8699
LSU 19.7374
Tennessee 19.5588
Villanova 19.5010
Houston 19.4979
Virginia 19.4679
Stanford 19.4496
Santa Clara 19.4331
Kentucky 19.3383
Brigham Young 19.3114
Lehigh 19.2614
Seton Hall 19.2364
Texas A&M 19.2074
California 19.1917
Stony Brook 19.1861
Georgia Tech 19.0646
Ohio 19.0342
New Mexico St. 18.9641
Florida St. 18.8859
S Dakota St. 18.8602
Arkansas 18.8197
Davidson 18.7817
Baylor 18.7774
Alabama 18.7748
Dayton 18.7484
Fla Gulf Cst 18.7107
Tulane 18.6753
Loyola (MD) 18.6450
Texas 18.6347
Murray St. 18.6279
Richmond 18.6116
Rob. Morris 18.5161
Providence 18.4669
Nebraska 18.4523
Air Force 18.4451
Iona 18.4391
Illinois St. 18.3915
Vermont 18.3840
Oregon St. 18.3567
South Florida 18.3112
Indiana St. 18.3080
Washington 18.2090
Evansville 18.2070
Harvard 18.1508
Bryant 17.9622
Denver 17.8817
TX El Paso 17.8263
Xavier 17.7947
W. Kentucky 17.7828
Utah 17.7690
St. John's 17.7554
Canisius 17.6712
Wagner 17.6241
Fairfield 17.5919
Tulsa 17.5297
Montana 17.4721
Pacific 17.4308
Vanderbilt 17.3922
Arkansas St. 17.3845
Penn St. 17.3180
Northern Iowa 17.3111
Northwestern 17.2556
Long Island 17.2556
James Madison 17.2510
Detroit 17.2379
George Mason 17.2111
Bradley 17.0855
Loyola (IL) 17.0722
Elon 17.0680
St. Bonaventure 17.0655
Mercer 17.0336
Drake 17.0289
NW State 17.0187
Wake Forest 17.0182
Niagara 16.9581
Purdue 16.9563
Hartford 16.9487
Texas Tech 16.9233
Boston U 16.8685
Rider 16.8067
Clemson 16.7166
De Paul 16.6454
Nevada 16.5988
Princeton 16.5938
UAB 16.5054
UC Irvine 16.5046
Delaware 16.4777
Towson 16.4171
Georgia 16.3679
Lafayette 16.3253
West Virginia 16.2019
San Diego 16.1158
NC A&T 16.1027
Southern 16.0950
Toledo 16.0701
Hawaii 16.0292
Cal Poly 15.8982
Idaho 15.8592
Cleveland St. 15.7620
IPFW 15.7000
Savannah St. 15.6405
Fresno St. 15.6242
Pepperdine 15.6083
Norfolk St. 15.5815
Holy Cross 15.5070
Marshall 15.4374
Army 15.3794
Oral Roberts 15.3730
USC 15.3022
Sam Houston St. 15.2898
Yale 15.1663
Winthrop 15.1356
Morehead St. 15.0979
Brown 15.0842
Drexel 15.0668
TX San Antonio 15.0024
Oakland 14.9904
McNeese St. 14.9467
Quinnipiac 14.9358
North Texas 14.8990
Duquesne 14.8985
Troy 14.8513
Morgan St. 14.7504
Georgia St. 14.7192
LA Lafayette 14.7140
Lipscomb 14.7121
Long Beach St. 14.7059
Manhattan 14.6780
UC Davis 14.5437
Columbia 14.5091
St. Peter's 14.4304
High Point 14.3977
Auburn 14.3659
Marist 14.3493
Wofford 14.3461
San Jose St. 14.3070
Cornell 14.2636
Buffalo 14.2271
Rhode Island 14.1902
Liberty 14.0328
Portland 13.9293
Delaware St. 13.7218
Miami (OH) 13.6686
South Dakota 13.6241
Stetson 13.5838
Fordham 13.5698
N.C. Asheville 13.5688
UCSB 13.5529
Campbell 13.4454
Colgate 13.4360
North Dakota 13.4358
Monmouth 13.3985
Chattanooga 13.3883
Dartmouth 13.2551
Maine 13.1639
Seattle 13.0385
Radford 12.9002
Montana St. 12.8383
Jacksonville 12.8043
Siena 12.7232
Hampton 12.7056
Navy 12.4556
Chicago St. 12.3891
SE Louisiana 12.2742
N. Colorado 12.1435
Jackson St. 12.1361
Austin Peay 12.0914
Rice 11.8819
E. Tenn. St. 11.8395
Old Dominion 11.7348
Nicholls St. 11.6002
IUPUI 11.5430
LA Monroe 11.2691
Samford 11.2131
Citadel 11.1936
Portland St. 11.1429
Howard 11.0323
Hofstra 11.0204
Alabama St. 10.9835
Longwood 10.7365
Furman 10.6795
Presbyterian 10.5587
New Orleans 10.4705
Lamar 10.2693
Florida A&M 10.0584
UC Riverside 9.9920
Kennesaw St. 9.7770
Binghamton 9.6115
Ste F Austin 9.1443
Weber State 9.1435
Col Charlestn 8.5979
N Dakota St. 8.4912
W Illinois 8.4275
UMass 8.2533
NC Central 8.1872
E Kentucky 8.1479
TX Southern 8.1355
W Michigan 7.9282
Kent State 7.8832
Ark Pine Bl 7.8599
Wright State 7.8351
LA Tech 7.8212
Gard-Webb 7.7709
Mt St.Mary's 7.7613
Jksnville St. 7.7428
Charl South 7.6965
E Carolina 7.6424
TX-Arlington 7.6413
Northeastrn 7.6071
Florida Intl 7.5889
TN State 7.5248
Central FL 7.4869
WI-GrnBay 7.4113
Boston Col 7.3077
SE Missouri 7.3018
St Josephs 7.2322
AR Lit Rock 7.2321
Ball State 7.2072
CS Bakersfld 7.0927
S Alabama 7.0577
San Fransco 7.0325
App State 7.0239
SC Upstate 7.0083
S Illinois 6.8708
VA Military 6.7742
TX-PanAm 6.7519
Fla Atlantic 6.7064
Central Ark 6.6817
Wash State 6.6623
IL-Chicago 6.6607
N Kentucky 6.6580
W Carolina 6.6478
Youngs St. 6.6009
E Michigan 6.5514
TN Tech 6.4862
Beth-Cook 6.4780
E Illinois 6.4418
N JIT 6.4057
Central Conn 6.3871
Prairie View 6.3793
Sac State 6.3764
Houston Bap 6.3659
S Methodist 6.3245
Wm & Mary 6.3052
S Carolina 6.2965
Cal St Nrdge 6.2900
Texas State 6.2669
St Fran (NY) 6.1695
Coastal Car 6.1570
Geo Wshgtn 6.1219
Loyola Mymt 6.0941
N Florida 6.0594
Missouri St. 6.0032
Neb Omaha 5.9981
GA Southern 5.9894
Miss State 5.9503
Utah Val St. 5.8729
Central Mich 5.8298
Bowling Grn 5.7696
CS Fullerton 5.6973
E Washingtn 5.6854
VA Tech 5.6753
Maryland BC 5.5945
TX Christian 5.4742
Alab A&M 5.4483
Coppin State 5.4384
U Penn 5.2858
TN Martin 5.2321
N Arizona 5.2213
N Hampshire 5.1888
NC-Grnsboro 5.1397
American 5.1245
Alcorn State 5.0936
Sacred Hrt 5.0929
UMKC 5.0428
NC-Wilmgton 5.0005
S Utah 4.9083
WI-Milwkee 4.8219
St Fran (PA) 4.7744
TX A&M-CC 4.7630
SIU Edward 4.7194
Idaho State 4.6572
Miss Val St. 4.6007
F Dickinson 4.5634
S Car State 4.4436
N Illinois 3.6664
Maryland ES 3.3431
Grambling St. 3.0220