Photo by JC Gellidon on Unsplash
Using the NBA API to create your own ML models and predict the best player transaction
Originally published on Medium.
“I have a plan of action, but the game is a game of adjustments.” - Mike Krzyzewski
Inspired by the Greek Freak, who recently reached the peak, I finally gave this project a chance, after keeping it latent for the past few months. NBA it is!
The main scope hereof is to present an end-to-end ML app development procedure, which embodies quite a number of Supervised and Unsupervised algorithms, including Gaussian Mixture Models (GMM), KMeans, Principal Component Analysis (PCA), XGBoost, Random Forest, and Multinomial Logistic Regression Classifiers.
Concept
After successfully clustering Whiskey varieties to boost a Vendor’s sales, Data Corp accepted a new project: assist the Milwaukee Bucks in making the best next move during the 2020 transaction window. That is, to pre-assess the candidate players for the Shooting Guard (SG) position and buy the one who performs best. Being short of basketball knowledge led me to a tricky alternative:
How about requesting the NBA API, fetching player data from the past seasons’ games (e.g. assist-to-turnover ratio, assist % and so on), categorising them in a meaningful way for the General Manager (GM), and finally guiding them on whom they should spend the transfer budget on?
To better communicate the outcomes, a couple of assumptions were made:
#1: We are at the end of the 2020 season (Oct). Bucks GM has prepared a list of 3 candidates for the SG position: Jrue Holiday, Danny Green, and Bogdan Bogdanovic.
#2: To accomplish the mission we have to uncover any insights from the data which may lead the Bucks to increase their performance in the respective domain of attack (max assists, min turnovers etc.), while preserving the rest of the stats (e.g. Weighted Field Goal %). That is, we should not simply suggest that the GM buy the best passer or scorer, as this might compromise the rest of the valuable statistics.
Modus Operandi

Build the dataset; fetch the player-wise statistics per game (from now on, ‘plays’).

Perform EDA; build intuition on the variables’ exploitation, come to earliest conclusions.

Cluster ‘plays’ via KMeans & GMM; reveal underlying patterns and identify the most suitable cluster for the case.

Using the now labeled dataset (clusters = labels), train a number of Multiclass Classifiers, incl. Multinomial Logistic Regression, Random Forest & XGBoost.

Make Predictions on the candidate players’ latest ‘plays’ (2020 season) and benchmark them accordingly.

Serve the trained models to the end-user, by building & serving an API (analysed in the next post).
Workflow Chart (Image by author)
You can either run the notebooks for an explained workflow or the script files (.py) for an automated one.
1. Dataset
The dataset is built in 2 steps: (a) starting from this Kaggle dataset, we query basketball.sqlite to extract GAME_IDs for seasons 2017–2020; (b) we make requests to the nba_api to fetch the player-wise data per game.
The whole procedure is wrapped up in the dataset.py which you may choose to run, or else use the already prepared datasets in the ‘data/raw’ directory.
We use games from seasons 2017–2019 to train both clustering and classification models and keep 2020 data for testing purposes. Here is a sample of the dataset and an adequate explanation of the variables:
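The two steps might be sketched like this (the `Game` table/column names and the in-memory database are assumptions for illustration; the real run queries the downloaded basketball.sqlite and loops over all fetched GAME_IDs):

```python
import sqlite3

# Step (a): query GAME_IDs from the Kaggle sqlite dump.
# Table/column names here are assumptions -- adjust to the actual schema.
conn = sqlite3.connect(":memory:")  # use 'basketball.sqlite' for the real file
conn.execute("CREATE TABLE Game (GAME_ID TEXT, SEASON INTEGER)")
conn.executemany(
    "INSERT INTO Game VALUES (?, ?)",
    [("0021700001", 2017), ("0021900012", 2019), ("0022000003", 2020)],
)
game_ids = [
    row[0]
    for row in conn.execute(
        "SELECT GAME_ID FROM Game WHERE SEASON BETWEEN 2017 AND 2020"
    )
]

# Step (b): fetch player-wise advanced stats per game via nba_api
# (requires `pip install nba_api`; commented out here to avoid network calls):
# from nba_api.stats.endpoints import boxscoreadvancedv2
# df = boxscoreadvancedv2.BoxScoreAdvancedV2(game_id=game_ids[0]).get_data_frames()[0]
```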
To reduce clutter, I do not delve into the data cleaning and preprocessing procedures — you may refer to 00_EDA.ipynb & preprocess.py, respectively.
2. EDA
[A thorough EDA is provided in the 00_EDA.ipynb]
We have to build intuition on what is really important when it comes to assessing an SG’s performance. In this context, we classify features from the least to the most important, based on domain knowledge. This will also make it easier to make the final decision.
# classify features by domain importance
group_1 = ['OFF_RATING', 'AST_PCT', 'AST_TOV', 'TM_TOV_PCT', 'EFG_PCT', 'TS_PCT', 'POSS']
group_2 = ['MIN', 'AST_RATIO', 'DREB_PCT']
group_3 = ['OREB_PCT', 'REB_PCT', 'USG_PCT', 'PACE', 'PACE_PER40', 'PIE']
group_4 = ['START_POSITION']
group_5 = ['DEF_RATING']
In brief, all features are of high quality in terms of null presence, duplicated samples, and low variance, while their boundaries make sense (no suspicious cases of unreasonably extreme values).
Features’ Histograms
However, many of them contain outliers on either side. This is quite anticipated, as we deal with real game ‘plays’ and no one (not even the same player across different games) can always perform within a fixed performance ‘bracket’.
Features’ Whisker Box Plots
Concerning the crucial set of group_1 features, they are almost balanced between left/right-skewed. However, the dominant factor is the heavy presence of outliers beyond the pertinent upper boundary. There are many players who oftentimes perform well above expectations, and this fact comes in line with our initial conclusion:
Induction #1: We have to deeply study group_1, in a way that not only guarantees significant levels for the respective features, but also won’t compromise (the greatest possible number of) the rest.
With that in mind, we initiate a naive approach: sorting the dataset by a master feature (AST_PCT), taking its upper segment (95th percentile) and evaluating the plays ‘horizontally’ (across all features).
The outcome is disappointing. By comparing the population averages with those of the 95th percentile, we see that by maximising along AST_PCT, many of the remaining features get worse, violating Assumption #2. Besides, we wouldn’t like to buy an SG with a great assist ratio but poor Field Goal performance (EFG_PCT)!
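A minimal pandas sketch of that naive approach, on a toy two-feature frame (the values are made up; the column names follow the dataset):

```python
import pandas as pd

# Toy stand-in for the plays dataset (the real data comes from the NBA API).
df = pd.DataFrame({
    "AST_PCT": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
    "EFG_PCT": [0.55, 0.52, 0.50, 0.48, 0.45, 0.40],
})

# Keep plays above the 95th percentile of the 'master' feature AST_PCT ...
top = df[df["AST_PCT"] >= df["AST_PCT"].quantile(0.95)]

# ... and compare it 'horizontally' against the whole population.
comparison = pd.DataFrame({"population": df.mean(), "top_5_pct": top.mean()})
print(comparison)
```

In this toy frame the top-AST_PCT segment indeed shows a worse EFG_PCT average than the population, mirroring the trade-off described above.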
Therefore, it becomes clear that we cannot accomplish our mission of building the optimum SG profile based on plain exploratory techniques. Thus:
Induction #2: We have to build better intuition on the available data and use more advanced techniques, to effectively segment it and capture the underlying patterns, which may lead us to the best SG’s profile.
Clustering picks up the torch…
3. Clustering
[Refer to 01_clustering[kmeans_gmm].ipynb]
KMeans
We begin with the popular KMeans algorithm, but first implement PCA, in order to reduce the dataset’s dimensions while retaining most of the original features’ variance [1].
PCA ~ Explained Variance
We opt for a 4component solution, as it explains at least 80% of the population’s variance. Next, we find the optimum # of clusters (k), by using the Elbow Method and plotting the WCSS line:
WCSS ~ Clusters Plot
The optimal # clusters is 4 and we are ready to fit KMeans.
KMeans Clusters
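The PCA + elbow + fit sequence can be sketched as follows (toy blobs stand in for the scaled plays matrix; the component and cluster counts mirror those chosen above):

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy data standing in for the scaled plays matrix.
X, _ = make_blobs(n_samples=500, centers=4, n_features=16, random_state=42)
X = StandardScaler().fit_transform(X)

# Keep enough components to explain ~80% of the variance (4 for the real data).
pca = PCA(n_components=4, random_state=42)
X_pca = pca.fit_transform(X)
print("explained variance:", pca.explained_variance_ratio_.sum())

# Elbow method: WCSS (inertia) for k = 1..10, then fit the chosen k.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca).inertia_
        for k in range(1, 11)]
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_pca)
```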
The resulting clustering is decent; however, there are many overlapping points between cluster_2 and cluster_3 (turquoise & blue, respectively). Seeking potential enhancement, we are going to examine another clustering algorithm, this time not a distance-based but a distribution-based one: Gaussian Mixture Models [2].
GMM
In general, GMM can handle a greater variety of shapes without assuming the clusters to be circular (like KMeans does). Also, as a probabilistic algorithm, it assigns probabilities to the datapoints, expressing how strong their association with a specific cluster is. Yet, there’s no free lunch: GMM may converge quickly to a local minimum, deteriorating the results. To tackle this, we can initialise it with KMeans, by tweaking the respective class parameter [3].
In order to pick a suitable # of clusters, we can utilise the BayesianGaussianMixture class of Scikit-Learn, which weights clusters, levelling the superfluous ones at or near zero.
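That step might look roughly like the following sketch (the 10-component cap, the KMeans initialisation, the seed, and the toy data are assumptions); the weights reported for the actual dataset follow:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

# Toy 4-component data standing in for the PCA-reduced plays.
X, _ = make_blobs(n_samples=500, centers=4, n_features=4, random_state=42)

# GMM initialised with KMeans to avoid poor local minima.
gmm = GaussianMixture(n_components=4, init_params="kmeans", random_state=42).fit(X)

# BayesianGaussianMixture down-weights superfluous components.
bgm = BayesianGaussianMixture(n_components=10, n_init=3, random_state=42).fit(X)
print(np.round(bgm.weights_, 2))
```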
# returns
array([0.07, 0.19, 0.03, 0.14, 0.19, 0.09, 0.06, 0.18, 0.05, 0.01])
Obviously, only 4 clusters surpass the 0.1 threshold.
GMM Clusters
That’s it! cluster_3 (blue) is better separated this time, while cluster_2 (turquoise) is better contained, too.
Clusters Evaluation
To enhance the clusters’ assessment, we introduce a new variable which depicts the net score of the examined features. Each group is weighted in order to better express the magnitude of its effect on the final performance, and the algebraic sum is calculated. I allocate weights as follows:
NET_SCORE = 0.5*group_1 + 0.3*group_2 + 0.2*group_3 - 0.3*group_5

# group_4 (START_POSITION) shouldn't be scored (categorical feature):
# being a center '5' doesn't mean being 'more' of something a guard '1' stands for!
# group_5 (DEF_RATING) is negative in nature,
# so it should be subtracted from the Net Score
So, let’s score and evaluate clusters.
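A sketch of how the scoring might be applied per cluster (the tiny frame, the single representative feature per group, and the cluster labels are all made up for illustration):

```python
import pandas as pd

# Assumed minimal layout: per-play feature values plus an assigned cluster.
df = pd.DataFrame({
    "cluster": [0, 0, 3, 3],
    "AST_PCT": [0.1, 0.2, 0.5, 0.6],    # group_1 example
    "MIN": [20, 25, 30, 35],            # group_2 example
    "PACE": [95, 98, 100, 102],         # group_3 example
    "DEF_RATING": [110, 108, 100, 98],  # group_5 (negative in nature)
})

group_1, group_2, group_3, group_5 = ["AST_PCT"], ["MIN"], ["PACE"], ["DEF_RATING"]

# Weighted algebraic sum; group_4 (categorical) is excluded.
df["NET_SCORE"] = (
    0.5 * df[group_1].sum(axis=1)
    + 0.3 * df[group_2].sum(axis=1)
    + 0.2 * df[group_3].sum(axis=1)
    - 0.3 * df[group_5].sum(axis=1)
)
print(df.groupby("cluster")["NET_SCORE"].mean())
```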
Apparently, cluster_3 outperforms the rest with a NET_SCORE of approx. 662.49, while cluster_1 takes the position next to it. But what is worth highlighting here is the quantified comparison between the 95th percentile and the newly introduced cluster_3:
NET_SCORE Whisker Box Plots for 95th percentile & cluster_3
It becomes visually clear that cluster_3 dominates the 95th percentile segment, noting an increase of 146.5 NET_SCORE units! Consequently:
Induction #3: cluster_3 encapsulates those ‘plays’ which derive from great SG performance, in a really balanced way — group_1 features reach high levels, while most of the rest keep a decent average. This analysis takes into account more features than the one initially attempted (ref. EDA), which leveraged a single dominant one (AST_PCT). Which proves the point that…
Induction #4: Clustering promotes a more comprehensive separation of data, deriving from signals of more components and along these lines we managed to reveal a clearer indication of what performance to anticipate from a topclass SG.
Now, we are able to manipulate the labelled (with clusters) dataset and develop a way to predict the cluster a new sample (unlabelled ‘play’) belongs to.
4. Classifiers
[Refer to 02_classifying[logres_rf_xgboost].ipynb]
Our problem belongs to the category of MultiClass Classification and the first step to take is choosing a validation strategy to tackle potential overfitting.
# check for the clusters' balance
0    27508
1    17886
3    11770
2     5729
The skewed dataset implies that Stratified K-fold cross-validation has to be chosen over a random one. This will keep the labels’ ratio constant in each fold, so whatever metric we choose to evaluate will give similar results across all folds [4]. And speaking of metrics, the F1 score (harmonic mean of precision and recall) looks more appropriate than accuracy, since the targets are skewed [5].
Next, we normalise the data in order to train our (baseline) Logistic Regression model. Be mindful here to fit the scaler on the training dataset first, and only then transform both the training and testing data. This is crucial to avoid data leakage [6]!
# returns
Mean F1 Score = 0.9959940207018171
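The cross-validation setup can be sketched as follows (a toy imbalanced dataset stands in for the labelled plays; the pipeline refits the scaler within each fold to prevent leakage — note the near-perfect score above came from the full feature set, not from a toy like this):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy imbalanced 4-class stand-in for the labelled plays.
X, y = make_classification(
    n_samples=1000, n_classes=4, n_informative=8,
    weights=[0.45, 0.30, 0.10, 0.15], random_state=42,
)

# The pipeline refits the scaler inside every fold -> no leakage;
# StratifiedKFold keeps the class ratios constant across folds.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_weighted")
print("Mean F1 Score =", scores.mean())
```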
Feature Importance
Such a tremendous score from the very beginning is suspicious. Among the available ways to check feature importance (e.g. MDI), I choose Permutation Feature Importance, which is model-agnostic, hence we are able to transfer any conclusions to all the models [7].
Permutation Feature Importance for: (a) all features, (b) all ≠ START_POSITION, (c) all ≠ START_POSITION, MIN
START_POSITION contributes with extremely high importance (all by itself, it scores F1 = 0.865). Should we check the pertinent descriptive statistics, we see that all group_1 features hit their minimum level when START_POSITION is 0 (i.e. NaN).
This betrays that those players didn’t start the game, so there is a high possibility that they played for less time than the others, hence their worse stats! The same applies to the MIN variable: it precisely expresses the time a player spent on court. Therefore, both cause data leakage and we ignore them. Further to that, we distinguish the most significant features.
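A sketch of the permutation-importance check with Scikit-Learn (toy data; a Random Forest stands in for whichever fitted model is inspected):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy 4-class data standing in for the plays.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Model-agnostic importance: shuffle one feature at a time on held-out data
# and measure how much the F1 score drops.
result = permutation_importance(model, X_te, y_te, scoring="f1_weighted",
                                n_repeats=10, random_state=42)
ranking = result.importances_mean.argsort()[::-1]  # most important first
```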
Feature Engineering
Additionally, we attempt to reduce the # of features by constructing a new, smaller number of variables which capture a significant portion of the original ones’ information. We put PCA in the spotlight once again, this time trying 9 and 7 components. Be careful to only use the remaining normalised features (≠ START_POSITION, MIN)!
Eventually, we result in the following feature ‘buckets’:
all_feats   = [all] - [START_POSITION, MIN]
sgnft_feats = [all_feats] - [OFF_RATING, AST_TOV, PACE, PACE_PER40, PIE]
pca_feats   = [pca x 9]
pca_feats   = [pca x 7]
Hyperparameter Optimisation
After taking care of feature selection, we set about optimising each model’s hyperparameters. GridSearch is quite effective, albeit time-consuming. The procedure is similar for all models — for the sake of simplicity, I only serve out the XGBoost case:
# returns
Best score: 0.7152999106187636
Best parameters set:
colsample_bytree: 1.0
lambda: 0.1,
max_depth: 3,
n_estimators: 200
Models
Now, we declare the optimum hyperparameters per model in model_dispatcher.py, which dispatches the chosen model into train.py. The latter wraps up the whole training procedure, making it easier to train the tuned models with every feature ‘bucket’. We get:
## Logistic Regression ##
used        | num_feats | F1_weighted
=========== | ========= | ===========
all_feats   | 16        | 0.7144
sgnft_feats | 11        | 0.7152
pca_feats   | 9         | 0.7111  # sweetspot
pca_feats   | 7         | 0.7076

## Random Forest ##
used        | num_feats | F1_weighted
=========== | ========= | ===========
all_feats   | 16        | 0.7213
sgnft_feats | 11        | 0.7145
pca_feats   | 9         | 0.7100
pca_feats   | 7         | 0.7049

## XGBoost ##
used        | num_feats | F1_weighted
=========== | ========= | ===========
all_feats   | 16        | 0.7238  # best
sgnft_feats | 11        | 0.7168
pca_feats   | 9         | 0.7104
pca_feats   | 7         | 0.7068
Note: Your results may vary due to either the model’s stochastic nature or the numerical precision.
A classical performance vs simplicity trade-off is introduced; I choose Logistic Regression with the pca_feats (x9) bucket to proceed further.
5. Predictions
Now, for the testing dataset’s plays, we predict their clusters by using the selected model.
Validation
For validation to happen, ground-truth labels are necessary. However, that is not our case, as the testing dataset (test_proc.csv) is not labelled. You may wonder why we don’t label it via clustering, but that would lead us to the very same procedure Cross-Validation has already done 5 times — isolate a small portion of data and validate on it.
Instead, we are going to further evaluate the classifier by conducting qualitative checks. We can either manually review the labels of a portion of the data to ensure they are good, or compare the predicted clusters to the training ones and check that any dominant descriptive statistics still hold.
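The second check might be sketched like so (the tiny frames and the `pred` column name are assumptions):

```python
import pandas as pd

# Assumed layout: the same features in both frames plus cluster labels --
# 'cluster' assigned by GMM on the training plays, 'pred' assigned by the
# classifier on the (unlabelled) testing plays. Values are made up.
train = pd.DataFrame({"cluster": [3, 3, 0, 0], "AST_PCT": [0.55, 0.60, 0.10, 0.15]})
test = pd.DataFrame({"pred": [3, 3, 0, 0], "AST_PCT": [0.50, 0.58, 0.12, 0.11]})

# Qualitative check: the dominant statistics per cluster should still hold.
train_stats = train.groupby("cluster")["AST_PCT"].mean()
test_stats = test.groupby("pred")["AST_PCT"].mean()
print(pd.DataFrame({"train": train_stats, "test": test_stats}))
```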
Indeed, cluster_3 takes the lead again, outperforming the rest with a NET_SCORE of 109.35 units, while noting the highest level along most of the crucial features (OFF_RATING, AST_PCT, AST_TOV and POSS).
Transaction
The last and most interesting part involves decision making. At first, we make predictions on the candidate players’ (Jrue Holiday, Danny Green, Bogdan Bogdanovic) first-half 2020 season ‘plays’ and label them with the respective cluster.
Then we check their membership in the precious cluster_3, ranking them according to the respective ratio of cluster_3_plays/total_plays. So, we run the predict.py script and get:
# Results
{'Jrue Holiday': 0.86, 'Bogdan Bogdanovic': 0.38, 'Danny Green': 0.06}
And guess what?
On November 24th 2020, the Bucks officially announced Jrue Holiday’s transaction! You thought so; an out-of-reality validation…
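For reference, the ranking step inside predict.py might look roughly like this (the frame layout, column names, and values are assumptions for illustration):

```python
import pandas as pd

# Assumed layout of the candidates' first-half 2020 plays after prediction:
# one row per play, with the classifier's cluster in 'pred'. Values made up.
plays = pd.DataFrame({
    "player": ["Jrue Holiday"] * 4 + ["Bogdan Bogdanovic"] * 4 + ["Danny Green"] * 4,
    "pred":   [3, 3, 3, 0,            3, 0, 0, 0,                 0, 0, 0, 1],
})

# Rank candidates by their share of plays landing in cluster_3.
ratios = (
    plays.assign(in_c3=plays["pred"].eq(3))
         .groupby("player")["in_c3"].mean()
         .sort_values(ascending=False)
         .round(2)
)
print(ratios.to_dict())
```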
Conclusion
We have come a long way… Starting from Kaggle & the NBA API, we built a vast dataset, clustered it, and revealed insightful patterns about what it takes to be a really good Shooting Guard. We then trained various Classification models on the labelled dataset, predicting with decent accuracy the cluster a new player entry may be registered in. By doing so, we managed to spotlight the next move the Milwaukee Bucks should (and did!) take to fill the SG position.
Similarly to the DJ vs Data Scientist case, it’s quasi-impossible to assertively answer the question of Data Science’s potential in the Scouting field. Yet, once again, the signs of the times denote a favourable breeding ground for AI in the decision-making side of the Sports Industry…
Photo by Patrick Fore on Unsplash
I dedicate this project to my good friend Panos — an ardent fan of Basketball, astronomy aficionado and IT expert.
Thank you for reading & have a nice week! Should any question arise, feel free to leave a comment below or reach out to me on Twitter/LinkedIn. In any case…
Take your seat, clone the repo and make the next… move
References
[1] https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/
[2] https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/
[3] https://towardsdatascience.com/gaussian-mixture-models-vs-k-means-which-one-to-choose-62f2736025f0
[4] A. Thakur, Approaching (Almost) Any Machine Learning Problem, 1st edition (2020), ISBN-10: 9390274435
[5] https://towardsdatascience.com/comprehensive-guide-on-multiclass-classification-metrics-af94cfb83fbd
[6] https://machinelearningmastery.com/data-preparation-without-data-leakage/
[7] https://scikit-learn.org/stable/modules/permutation_importance.html