FighterForecast is designed to help both those interested in sports betting and those interested in the advanced stats it offers. The predictive measures are designed to maximize return on investment based on casino odds and the model's predictive capabilities. FighterForecast also offers advanced historical ratings across a variety of statistics that amateur and seasoned MMA viewers alike may find valuable.
Outperform both casino odds and other machine learning models by leveraging historically backed data rather than "empty" stats. FighterForecast's model only uses ratings that measure a fighter's ability to perform an action against competition well versed in preventing that action. This is the fundamental backbone behind the ratings.
Anyone from casual fans to bettors looking for a quantitative supplement to their picks.
FighterForecast was designed by a creator who aimed to beat other machine learning models that simply take in "empty" stats such as takedown average or average significant strikes thrown per minute. These stats are considered "empty" because, while they are historical averages, they give no context about who those strikes or takedowns were against or how they affected the fight's outcome.
The website and its backend were designed by I. Ramirez. For inquiries or comments, email: admin@fighterforecast.com
Data Collection was made possible by the efforts of Rajeev Warrier in 2021 and his wonderful scraper available here: https://www.kaggle.com/datasets/rajeevw/ufcdata?resource=download.
Barring a few errors, the data covers the Ultimate Fighting Championship from 1993 to the present (however, only post-1998 data is used). MMA leagues outside of the UFC bear NO WEIGHT on any aspect of the ratings or on the model's ability to predict.
Data is freshly scraped and the backend is reprocessed after every UFC event.
Cleaning had to take place to remove unnecessary data, rework stats and convert the data into a format that the ratings creation files could use.
Removal of "%" based data, referee, location, fight type and other data that was unusable to create ratings.
Converted all measurements to common units, trimmed whitespace, unified naming conventions and made assumptions for incomplete data (for example, when a fighter had no listed reach, their height was used instead).
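As a rough illustration of this cleaning step, the sketch below trims a fighter's name and fills a missing reach with the fighter's height. The field names `name`, `height_cm` and `reach_cm` are hypothetical and are not taken from the actual dataset or backend.

```python
def clean_fighter(raw):
    """Return a cleaned copy of one fighter record.

    The keys 'name', 'height_cm' and 'reach_cm' are hypothetical field names.
    """
    cleaned = dict(raw)
    # Trim stray whitespace and collapse repeated spaces inside the name.
    cleaned["name"] = " ".join(str(raw["name"]).split())
    # Assumption taken from the text: a missing reach falls back to height.
    if cleaned.get("reach_cm") is None:
        cleaned["reach_cm"] = cleaned.get("height_cm")
    return cleaned
```

A record with a listed reach passes through unchanged apart from the name clean-up.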
From the data that only contained numerical information, further trimming took place to remove stats that provided little relevance to the model. Pre-1998 fights almost always ended with the favorite, or red corner, winning, so they had to be removed to prevent the model from overly favoring the red corner. Other data, such as where shots were landed from or submission attempts, were deemed unhelpful for fight predictions. The number of attributes, or features, was also limited to prevent the machine learning model from struggling with too many features and too little data, as only 7000 fights remained relevant.
From the cleaned data, ratings were built on stats relevant to both a fighter's ability to perform an action and their ability to win when they outperformed their opponent in that action. To create the ratings, a graph was built with each node being a fighter and each edge being a fight. Each edge contained all the stored information on the fight, which allowed ratings to be created by iterating through each fight and adjusting each fighter's ratings based on who the opponent was and how well the fighter was able to perform the action.
Ratings were chosen based on a "normal" set of 4 parts: a fighter's ability to perform action X, a fighter's ability to stop an opponent from performing action X, a fighter's ability to win when they perform action X, and a fighter's ability to win despite their opponent performing action X. Examples include takedowns, knockdowns, significant strikes, etc.
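The idea of walking a fighter graph and adjusting ratings fight by fight can be sketched as below for one action, takedowns. This is only an illustration under stated assumptions: the standard Elo expectation formula, the K-factor of 32, the starting rating of 1500 and the use of each fighter's share of takedowns as the "score" are all assumptions, not the site's actual update rule. The `Fight` fields are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Fight:
    """One edge of the graph, joining two fighter nodes."""
    date: str            # ISO date (YYYY-MM-DD), so string order is chronological
    red: str
    blue: str
    red_takedowns: int
    blue_takedowns: int


def build_graph(fights):
    """Map each fighter (node) to the list of fights (edges) they took part in."""
    graph = defaultdict(list)
    for fight in fights:
        graph[fight.red].append(fight)
        graph[fight.blue].append(fight)
    return graph


def expected_share(offense, defense):
    """Standard Elo expectation: the share an attacker is expected to earn."""
    return 1.0 / (1.0 + 10 ** ((defense - offense) / 400.0))


def rate_takedowns(fights, k=32.0, start=1500.0):
    """Walk the fights in date order, adjusting offensive and defensive ratings."""
    offense = defaultdict(lambda: start)
    defense = defaultdict(lambda: start)
    for fight in sorted(fights, key=lambda f: f.date):
        total = fight.red_takedowns + fight.blue_takedowns
        if total == 0:
            continue  # no takedowns landed, so nothing to learn from this fight
        pairs = (
            (fight.red, fight.blue, fight.red_takedowns),
            (fight.blue, fight.red, fight.blue_takedowns),
        )
        for attacker, defender, landed in pairs:
            actual = landed / total  # assumed score: share of all takedowns landed
            expected = expected_share(offense[attacker], defense[defender])
            change = k * (actual - expected)
            offense[attacker] += change  # succeeding against a strong defense earns more
            defense[defender] -= change  # the defender's rating moves the opposite way
    return dict(offense), dict(defense)
```

Because the expectation depends on the opponent's defensive rating, landing takedowns on a fighter who is hard to take down moves the offensive rating more than landing them on a weak defender, which is the context that a plain takedown average lacks.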
The Quality of Win Score (QOW) was created to establish how dominant a victory was. Closer fights receive lower scores and more dominant fights receive higher scores. Ex. Jon Jones vs. Dominick Reyes: QOW: 0.886, Ilia Topuria vs. Josh Emmett: QOW: 0.975. While both of these fights ended in a unanimous decision, the dominance within the fights (at least as shown by QOW) is very different. Similarly for fights with finishes: Leon Edwards vs. Kamaru Usman 2: QOW: 0.863, Tom Aspinall vs. Sergei Pavlovich: QOW: 1.0. The former was a last-minute fifth-round finish, the latter a first-round finish.
Additional measures were put into place to tweak ratings, such as increasing the weight of more recent fights, increasing the amount of rating lost for a loss, cross-updating defensive versus offensive Elos, and other functions.
Using the ratings data in Python, many different iterations of models were tested to arrive at the best model and its chosen hyperparameters. Once a model was chosen, it was trained on the full dataset, where it achieved over 80% accuracy on fights it had been trained on. This number has been kept around 80% to prevent overfitting.
Once the ratings data was ready to be processed, a train/validate/test split was used to show the effectiveness of different models. Training data ran from 1999-2022, validation data from 2022-2023 and test data from 2024 to the present.
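A date-based split like this one can be sketched as follows. The cutoff dates are passed in as parameters because the exact boundaries are not stated (the training and validation ranges both mention 2022); the `date` key is a hypothetical field name.

```python
from datetime import date


def chronological_split(fights, train_end, validate_end):
    """Split fights by date so later fights never leak into training.

    fights: iterable of dicts with a 'date' key holding a datetime.date.
    train_end, validate_end: exclusive upper bounds for the training and
    validation sets; everything on or after validate_end is test data.
    """
    train, validate, test = [], [], []
    for fight in sorted(fights, key=lambda f: f["date"]):
        if fight["date"] < train_end:
            train.append(fight)
        elif fight["date"] < validate_end:
            validate.append(fight)
        else:
            test.append(fight)
    return train, validate, test
```

For example, `chronological_split(fights, date(2022, 7, 1), date(2024, 1, 1))` uses a purely illustrative mid-2022 training cutoff. Splitting by time rather than at random keeps the model from being scored on fights that took place before ones it trained on.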
Many different models were tested, including Logistic Regression, Random Forest Classifier, XGBoost, LightGBM, Neural Networks and more. On relevant models, GridSearchCV was used to evaluate the model's accuracy on over 20000 different sets of hyperparameters.
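To show how a hyperparameter grid grows into thousands of combinations, here is a plain-Python exhaustive search. It stands in for scikit-learn's GridSearchCV and is not a copy of it: it performs no cross-validation, and the `score_fn` callback (which would train a model and return its validation accuracy) is a hypothetical placeholder.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the best-scoring one.

    param_grid: dict mapping a hyperparameter name to a list of candidate values.
    score_fn: callable taking a dict of hyperparameters and returning a score
              where higher is better (for example validation accuracy).
    Returns (best_params, best_score, number_of_combinations_tried).
    """
    names = list(param_grid)
    best_params, best_score, tried = None, float("-inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, tried
```

The number of combinations is the product of the list lengths, so six hyperparameters with five to six candidate values each already exceed 20000 fits. This is why searching every combination quickly becomes a question of time and resources.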
The model had issues such as undersampling of the underdog (blue corner), too many features, and a lack of time and resources to process EVERY hyperparameter combination, among others. Many of these were solved in the final version through oversampling, delta-feature creation and other solutions.
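The two fixes named here can be sketched roughly as below. Both are assumptions about the general technique rather than the site's exact code: delta features are taken to be the red-corner rating minus the blue-corner rating for each stat, and oversampling is shown as simple random duplication of the less common outcome. The `winner` key is a hypothetical field name.

```python
import random


def delta_features(red, blue):
    """Collapse two fighters' ratings into one difference per shared stat."""
    return {f"delta_{name}": red[name] - blue[name] for name in red if name in blue}


def oversample_minority(rows, label_key="winner", seed=0):
    """Randomly duplicate rows of the rarer classes until all classes match in size."""
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement to reach the majority class size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced
```

Delta features halve the number of inputs, since each stat becomes one column instead of one per corner. Oversampling should be applied to the training split only, so that validation and test accuracy still reflect the real share of blue-corner wins.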
Using a scraper, the next UFC event's fighters are grabbed and put through The Odds API as well as the predictive model to arrive at a combination of both casino odds and fighter win probability.
A scraper was built in Python to check for the next UFC event, including Fight Nights, PPVs, etc.
Using The Odds API, casino odds are pulled from various bookmakers and updated regularly so that the betting odds stay up to date.
Estimated Value is the model's probability of winning for a fighter multiplied by the expected return on 1 USD. Estimated Value is theoretically a better guide than the predicted winner alone, as it combines odds and probability. For example, suppose fighter A has odds of $3 and fighter B has odds of $1.5 (betting on A gives you back $3 for your $1, and betting on B gives you back $1.5), and the model thinks that fighter B has a 55% chance of winning. Although fighter A is not the predicted winner, fighter A is the mathematically best bet. This is because a bet on fighter B returns $1.50 55% of the time, while a bet on fighter A returns $3 45% of the time. Over many bets, fighter A should return an average of 3*0.45=1.35 per dollar wagered, which is larger than fighter B's 1.5*0.55=0.825. This is the essence of EV: betting on the underdog when casino odds underestimate the underdog, or betting on the favorite when casino odds underestimate the favorite.
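The calculation above can be written out as a short sketch. The function names are illustrative, and the odds are assumed to be decimal odds (total returned per 1 USD staked), as in the example.

```python
def estimated_value(win_probability, decimal_odds):
    """Average amount returned per 1 USD staked, including the stake itself."""
    return win_probability * decimal_odds


def best_bet(prob_a, odds_a, odds_b):
    """Return ('A' or 'B', value) for the side with the higher Estimated Value.

    prob_a is the model's win probability for fighter A; fighter B is
    assumed to win with probability 1 - prob_a (draws are ignored).
    """
    ev_a = estimated_value(prob_a, odds_a)
    ev_b = estimated_value(1.0 - prob_a, odds_b)
    return ("A", ev_a) if ev_a >= ev_b else ("B", ev_b)
```

With the numbers from the example, `best_bet(0.45, 3.0, 1.5)` picks fighter A. A value above 1 means the bet returns more than the stake on average, so fighter A's 1.35 is a favorable bet while fighter B's 0.825 is not.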
The frontend was developed using Tailwind CSS and other libraries. Information from the Python files is read in by PHP and used to create the statistics.
Eventually the data will be loaded through SQL, but at the moment this is accomplished with CSV files, JS and JSON files.
Tailwind CSS and some plain CSS are used to style the entire site.
The main page displays site information as well as a comprehensive comparison of the upcoming fights. Only fights for which I currently have odds available are listed.
The Elo (ratings) page allows for the lookup of a fighter's ratings as well as the top ratings in some categories.