Are you curious which statistics a machine learning model can use to predict the winner of the March Madness tournament? We are!
With the men’s March Madness Tournament in full swing, all of the top men’s basketball teams are competing to win the coveted National Championship trophy. Many of the teams have different styles of play and coaching philosophies. However, the best teams share some characteristics that separate them from the rest of the college basketball programs.
We decided to take a deeper look into different statistics in basketball such as Assists, Rebounds, Steals, Shooting Percentages, and others to determine the most important statistics for great basketball teams. At first, we considered using a team’s winning percentage as a measure of how good a college basketball team is. However, unlike the NBA, some college basketball teams can have much easier schedules than others and may never get the opportunity to play the best teams due to their conference and location in the country. Since there are over 300 Division 1 college basketball teams, it makes sense that every team’s schedule would be vastly different.
Because of this, we decided to use the Simple Rating System (SRS) to rank the college basketball teams. A team’s SRS is calculated by taking its average point differential (the average number of points by which it wins or loses its games) and adding or subtracting points based on how hard or easy its schedule was compared to other teams. For example, if a school wins its games by 8 points on average but faces many great teams, its SRS will be higher than 8. This statistic suits our purpose well because it takes into account both how well a team is playing and its strength of schedule.
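The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the function name, the game margins, and the schedule adjustment values are made up, and the sketch does not show how the schedule adjustment itself is derived.

```python
def simple_rating(point_margins, schedule_adjustment):
    """Average point differential plus a strength-of-schedule adjustment.

    point_margins: per-game margins (positive for a win, negative for a loss).
    schedule_adjustment: hypothetical value, positive for a hard schedule
    and negative for an easy one.
    """
    avg_margin = sum(point_margins) / len(point_margins)
    return avg_margin + schedule_adjustment


# Illustrative numbers only: a team that wins by 8 points on average.
margins = [12, 4, 8, 10, 6]

hard_schedule_srs = simple_rating(margins, 3.5)   # tough opponents raise the rating
easy_schedule_srs = simple_rating(margins, -2.0)  # weak opponents lower it

print(hard_schedule_srs, easy_schedule_srs)
```

With the same 8-point average margin, the team with the harder schedule ends up above 8 and the team with the easier schedule ends up below it.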
Next, we plotted each common basketball statistic (Assists Per Game, Rebounds Per Game, Field Goal Percentage, Steals Per Game, Blocks Per Game, etc.) for every team against their respective SRS value and calculated the correlation coefficient (R-value) between these two variables.
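For readers who want to try this step themselves, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The team values below are invented for illustration and are not taken from our dataset; the function and variable names are our own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to each variance).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for five hypothetical teams.
fg_pct = [0.42, 0.44, 0.45, 0.47, 0.50]
srs = [-6.0, -1.5, 2.0, 9.0, 14.5]

r = pearson_r(fg_pct, srs)
print(f"R-value between FG% and SRS: {r:.3f}")
```

An R-value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.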
The dashboard above shows the four statistics that had the strongest correlation with SRS. However, no single statistic correlates strongly with SRS on its own. To get a better understanding of which statistics are the most important in college basketball, we created a Naive Bayes classifier based on three statistics: Field Goal Percentage, Three-Point Percentage, and Assists Per Game.
The colors on the histogram above show how the different teams were classified based on their SRS. The labeling criteria used are displayed below:
‘Excellent Teams’: SRS greater than 15
‘Good Teams’: SRS between 0 and 15
‘Below Average Teams’: SRS between -15 and 0
‘Bad Teams’: SRS below -15
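The labeling criteria can be written as a small function. This sketch assumes the last category means an SRS below -15, and it makes its own choice about the boundaries (an SRS of exactly 0 counts as ‘Good’ and exactly -15 as ‘Below Average’), since the criteria above do not say which side those values fall on.

```python
def label_team(srs):
    """Map an SRS value to one of the four team categories."""
    if srs > 15:
        return "Excellent Teams"
    if srs >= 0:  # boundary handling at 0 is an assumption
        return "Good Teams"
    if srs >= -15:  # boundary handling at -15 is an assumption
        return "Below Average Teams"
    return "Bad Teams"


# Illustrative SRS values.
for value in (20.0, 8.0, -3.0, -18.02):
    print(value, label_team(value))
```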
The Naive Bayes model was 62% accurate when the three predictors of Field Goal Percentage, Three-Point Percentage, and Assists Per Game were used. One aspect that we found interesting is that offensive stats such as Field Goal Percentage, Three-Point Percentage, and Assists Per Game are much better predictors of SRS than defensive statistics such as Blocks/Steals Per Game and Opponent Field Goal Percentage. This makes sense because both the NBA and college basketball are becoming more offense-driven, especially with the explosion of long-distance shooters. We believe that Gonzaga and Michigan are two of the best teams because of their high field goal percentage and three-point percentage.
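To give a feel for how a Naive Bayes classifier turns these three statistics into a team label, here is a minimal sketch in plain Python. It assumes the Gaussian variant of Naive Bayes, which may differ from the exact model we used, and the six training rows are invented for illustration. The function names are our own.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature means and variances for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps variances positive on tiny samples.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one team."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Made-up rows: (Field Goal %, Three-Point %, Assists Per Game).
rows = [
    (0.50, 0.39, 17.5), (0.49, 0.38, 16.8), (0.51, 0.37, 18.2),
    (0.43, 0.32, 12.1), (0.42, 0.33, 11.5), (0.44, 0.31, 12.8),
]
labels = ["Excellent Teams"] * 3 + ["Below Average Teams"] * 3

model = fit_gaussian_nb(rows, labels)
print(predict(model, (0.50, 0.38, 17.0)))
print(predict(model, (0.43, 0.32, 12.0)))
```

The "naive" part is the assumption that the three statistics are independent within each class, which is why the per-feature log likelihoods are simply added together.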
While this model does provide some insight into the most important stats to look for when deciding which team will win the college championship, it also has some flaws. There are a couple of college basketball teams that were considered outliers because they had low SRS values but shot very efficiently from the field and three-point line. For example, McNeese State had some of the best shooting percentages (49.5% from the field and 39.2% from the three-point line), yet had an SRS of -18.02, which is worse than most college teams. Additionally, there are a lot of variables that can’t be quantified that make it very hard to predict the winner of the March Madness tournament.
We also used a few other machine learning models, specifically the Decision Tree, Random Forest, and XGBoost algorithms, to get a better understanding of which team would perform the best in the tournament. After the dataset was split into training and test sets, we found that the Decision Tree model had the lowest accuracy, around 51%, while the XGBoost model had the highest accuracy at 77%.
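The split-and-score procedure behind those accuracy figures can be sketched as follows. This is a generic illustration in plain Python, not the code we ran: the helper names, the 25% test fraction, the fixed seed, and the sample predictions are all assumptions, and the sketch does not train any of the models named above.

```python
import random


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the data and hold out a fraction of it for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_rows = [rows[i] for i in train_idx]
    test_rows = [rows[i] for i in test_idx]
    train_labels = [labels[i] for i in train_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_rows, test_rows, train_labels, test_labels


def accuracy(predicted, actual):
    """Fraction of test teams whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical predictions for four held-out teams.
actual = ["Good Teams", "Bad Teams", "Good Teams", "Excellent Teams"]
predicted = ["Good Teams", "Bad Teams", "Below Average Teams", "Excellent Teams"]
print(accuracy(predicted, actual))
```

Each model is fit on the training portion only and scored on the held-out portion, so the accuracy reflects how well it generalizes to teams it has not seen.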
All of these models point to Gonzaga as the college basketball team to beat in the March Madness tournament because of their amazing offensive statistics.