By: Michael Del Bene
In the NFL, and in football generally, the Quarterback is said to be the most important player on the field. He has the ball in his hands on every offensive play and also makes a lot of the play calls, and he is usually the leader of his team on and off the field. With that in mind, I am interested in how a Quarterback's statistics affect his win rate, so I am going to attempt to predict QB win ratios from their statistics using data analysis techniques.
To begin my data science project, I first need to import all of the libraries I am going to be using, which is what the next cell accomplishes. After this I will scrape the data, which means gathering it from a website that has the information I need for the analysis. I will be using www.pro-football-reference.com, a website that has all the data I want about Quarterbacks from the 2019 season and the 2020 season so far.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
!pip3 install lxml
url = "https://www.pro-football-reference.com/years/2019/passing.htm"
reqs = []
for i in range(2019, 2021):
    reqs.append(requests.get(url))
    # change 2019 to 2020 in the url so the next request fetches the next season
    url = url.replace(str(i), str(i+1))
tables = []
for r in reqs:
    root = BeautifulSoup(r.content, "html.parser")
    tables.append(root.find("table"))
# convert the tables to dataframes. tables[0], the first page we scraped, is the 2019 season
# and tables[1] is the 2020 season
temp = pd.read_html(str(tables[0]))[0]
data_2020 = pd.read_html(str(tables[1]))[0]
temp
My variable temp is storing the 2019 passer data and the variable data_2020 is storing the 2020 passer dataframe. I also want to look at QB rushing statistics so I will be scraping from a different table from the same website to accomplish this.
url = "https://www.pro-football-reference.com/years/2019/rushing.htm"
reqs = []
for i in range(2019, 2021):
    reqs.append(requests.get(url))
    # change 2019 to 2020 in the url so the next request fetches the next season
    url = url.replace(str(i), str(i+1))
tables = []
for r in reqs:
    root = BeautifulSoup(r.content, "html.parser")
    tables.append(root.find("table"))
# convert the tables to dataframes. tables[0], the first page we scraped, is the 2019 season
# and tables[1] is the 2020 season
rushing = pd.read_html(str(tables[0]))[0]
rushing_2020 = pd.read_html(str(tables[1]))[0]
rushing
The variable rushing now stores the rushing data for 2019 and rushing_2020 for the 2020 season.
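Both scraping cells build the next season's URL by replacing the year inside the previous one. A minimal sketch of an alternative, assuming the site keeps the /years/<season>/<table>.htm pattern seen in the two URLs above, builds each URL directly from the season. The helper name season_urls is hypothetical and is not part of the original notebook.

```python
# Assumption: the URL pattern below matches the two URLs used in the notebook.
BASE = "https://www.pro-football-reference.com/years/{year}/{table}.htm"

def season_urls(table, first_year, last_year):
    """Return one URL per season, from first_year through last_year inclusive."""
    return [BASE.format(year=year, table=table)
            for year in range(first_year, last_year + 1)]

passing_urls = season_urls("passing", 2019, 2020)
rushing_urls = season_urls("rushing", 2019, 2020)
print(passing_urls)
print(rushing_urls)
```

Each URL in these lists could then be passed to requests.get in the same way as in the cells above.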
Now that I have scraped all of the data I need, I will tidy it, which means preparing it to be analyzed. First, I will remove unnecessary rows from the data, and I will also remove columns I do not need. I will do this for the 4 dataframes I am going to be working with. As part of the analysis, I only look at QBs who started at least half of the games that season (8 games for 2019, and 6 for the unfinished 2020 season) to reduce possible bias and to keep the stats consistent. Also, because we care about win ratio, we need to parse the column that contains the win-loss-tie record so that we can divide wins (counting each tie as half a win) by games started and create a column called Win Percentage. This becomes the most important column in our dataframes because it is what we compare every other statistic to.
# the column titles are repeated as rows in the middle of the table. this removes them
temp = temp[~temp["GS"].str.contains("GS")]
# preparing QBrec for further analysis
temp["QBrec"] = temp["QBrec"].astype(str)
temp = temp[~temp["QBrec"].str.contains("NaN")]
temp = temp[~temp["QBrec"].str.contains("nan")]
temp["GS"] = pd.to_numeric(temp["GS"])
data_2020 = data_2020[~data_2020["GS"].str.contains("GS")]
data_2020["QBrec"] = data_2020["QBrec"].astype(str)
data_2020 = data_2020[~data_2020["QBrec"].str.contains("NaN")]
data_2020 = data_2020[~data_2020["QBrec"].str.contains("nan")]
data_2020["GS"] = pd.to_numeric(data_2020["GS"])
rushing.columns = ['Rk', 'Player', 'Tm', 'Age','Pos','G','GS','RAtt','RYds', 'RTD', '1D', 'Lng', 'RY/A', 'RY/G', 'Fmb']
# remove column titles from middle of dataframe
rushing = rushing[~rushing["GS"].str.contains("GS")]
rushing["GS"] = pd.to_numeric(rushing["GS"])
rushing = rushing.loc[rushing['Pos'] == 'QB']
rushing = rushing.drop(["1D", "Lng", "Tm", "Age", "Pos", "G", "GS", "Rk"], axis = 1)
rushing_2020.columns = ['Rk', 'Player', 'Tm', 'Age','Pos','G','GS','RAtt','RYds', 'RTD', '1D', 'Lng', 'RY/A', 'RY/G', 'Fmb']
# remove column titles from middle of dataframe
rushing_2020 = rushing_2020[~rushing_2020["GS"].str.contains("GS")]
rushing_2020["GS"] = pd.to_numeric(rushing_2020["GS"])
rushing_2020 = rushing_2020.loc[rushing_2020['Pos'] == 'QB']
rushing_2020 = rushing_2020.drop(["1D", "Lng", "Tm", "Age", "Pos", "G", "GS", "Rk"], axis = 1)
rushing_2020.head()
Here, we have a list of all of the statistics represented by column heads in our dataframes.
Rk -- Rank. This is a count of the rows from top to bottom. It is recalculated following the sorting of a column.
Age -- Player's age on December 31st of that year
Pos -- Position
G -- Games played
GS -- Games started as an offensive or defensive player
QBrec -- Team record in games started by this QB (regular season)
Cmp -- Passes completed
Att -- Passes attempted
Cmp% -- Percentage of Passes Completed
Yds -- Yards Gained by Passing (For teams, sack yardage is deducted from this total)
TD -- Passing Touchdowns
TD% -- Percentage of Touchdowns Thrown when Attempting to Pass
Int -- Interceptions thrown
Int% -- Percentage of Times Intercepted when Attempting to Pass
1D -- First downs passing
Lng -- Longest Completed Pass Thrown (complete since 1975)
Y/A -- Yards gained per pass attempt
AY/A -- Adjusted Yards gained per pass attempt: (Passing Yards + 20 * Passing TD - 45 * Interceptions) / (Passes Attempted)
Y/C -- Yards gained per pass completion (Passing Yards) / (Passes Completed)
Y/G -- Yards gained per game played
Rate -- Quarterback Rating
QBR -- QBR (ESPN's Total Quarterback Rating, calculated since 2006)
Sk -- Times Sacked (first recorded in 1969, player per game since 1981)
Yds -- Yards lost due to sacks (first recorded in 1969, player per game since 1981)
NY/A -- Net Yards gained per pass attempt (Passing Yards - Sack Yards) / (Passes Attempted + Times Sacked)
ANY/A -- Adjusted Net Yards per Pass Attempt: (Passing Yards - Sack Yards + (20 * Passing TD) - (45 * Interceptions)) / (Passes Attempted + Times Sacked)
Sk% -- Percentage of Time Sacked when Attempting to Pass: Times Sacked / (Passes Attempted + Times Sacked)
4QC -- Comebacks led by quarterback. Must be an offensive scoring drive in the 4th quarter, with the team trailing by one score, though not necessarily a drive to take the lead. Only games ending in a win or tie are included.
GWD -- Game-winning drives led by quarterback. Must be an offensive scoring drive in the 4th quarter or overtime that puts the winning team ahead for the last time.
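The per-attempt formulas in the list above can be sketched as small functions. This is only an illustration of the arithmetic: the function names are my own, and the season line used below is made up rather than taken from the scraped data.

```python
def adjusted_yards_per_attempt(yds, td, ints, att):
    """AY/A: (Passing Yards + 20 * TD - 45 * Int) / Attempts."""
    return (yds + 20 * td - 45 * ints) / att

def net_yards_per_attempt(yds, sack_yds, att, sacks):
    """NY/A: (Passing Yards - Sack Yards) / (Attempts + Sacks)."""
    return (yds - sack_yds) / (att + sacks)

def adjusted_net_yards_per_attempt(yds, sack_yds, td, ints, att, sacks):
    """ANY/A: (Yards - Sack Yards + 20 * TD - 45 * Int) / (Attempts + Sacks)."""
    return (yds - sack_yds + 20 * td - 45 * ints) / (att + sacks)

# Hypothetical season line, for illustration only.
ay_a = adjusted_yards_per_attempt(yds=4000, td=30, ints=10, att=500)
ny_a = net_yards_per_attempt(yds=4000, sack_yds=150, att=500, sacks=25)
any_a = adjusted_net_yards_per_attempt(yds=4000, sack_yds=150, td=30,
                                       ints=10, att=500, sacks=25)
print(round(ay_a, 2), round(ny_a, 2), round(any_a, 2))
```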
# Only looking at quarterbacks who started at least half of the season's games
data = temp[temp["GS"] >= 8]
# Drop unnecessary columns
data = data.drop(["Cmp", "Att", "TD%", "1D", "Lng", "NY/A", "ANY/A", "4QC", "GWD"], axis=1)
data_2020 = data_2020[data_2020["GS"] >= 6]
data_2020 = data_2020.drop(["Cmp", "Att", "TD%", "1D", "Lng", "ANY/A", "4QC", "GWD"], axis=1)
data.head()
# Parse the record column so we can form win percentage
data = (data
.assign(Wins= data.QBrec.str.split('-').str.get(0),
Losses = data.QBrec.str.split('-').str.get(-2),
Ties = data.QBrec.str.split('-').str.get(-1)
)
)
# Creation of win percentage column based on record
data["Win Percentage"] = 100 * (data["Wins"].astype(int) + data["Ties"].astype(int)*0.5) / data["GS"].astype(int)
data = data.drop(["QBrec", "Wins", "Losses", "Ties"], axis=1)
# Merge passing and rushing data
final = pd.merge(data, rushing, how = "inner", left_on = "Player", right_on = "Player")
final = final.fillna(0)
# renumber the rows from 0 without hardcoding how many quarterbacks survived the merge
final = final.reset_index(drop=True)
final.head()
# Repeat process for 2020 dataframe
data_2020 = (data_2020
.assign(Wins= data_2020.QBrec.str.split('-').str.get(0),
Losses = data_2020.QBrec.str.split('-').str.get(-2),
Ties = data_2020.QBrec.str.split('-').str.get(-1)
)
)
data_2020["Win Percentage"] = 100 * (data_2020["Wins"].astype(int) + data_2020["Ties"].astype(int)*0.5) / data_2020["GS"].astype(int)
data_2020 = data_2020.drop(["QBrec", "Wins", "Losses", "Ties"], axis=1)
final_2020 = pd.merge(data_2020, rushing_2020, how = "inner", left_on = "Player", right_on = "Player")
final_2020 = final_2020.fillna(0)
final_2020.head()
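The win percentage calculation used in the tidying cells can be sketched for a single record string. This is a minimal sketch, assuming the record always has the form "W-L-T" and that wins, losses, and ties add up to games started (the notebook divides by the GS column instead). The function name win_percentage is hypothetical.

```python
def win_percentage(record):
    """Convert a 'W-L-T' record string into a win percentage.

    A tie counts as half a win, matching the formula in the cells above.
    Assumes W + L + T equals the number of games started.
    """
    wins, losses, ties = (int(part) for part in record.split("-"))
    games = wins + losses + ties
    return 100 * (wins + 0.5 * ties) / games

print(win_percentage("11-5-0"))
print(win_percentage("7-8-1"))
```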
In this section I will try to answer the question "Which Quarterback statistics affect win ratio the most?" This will allow me to then perform an exact linear regression analysis on these statistics so that I can find weights for them in my formula to predict win ratio. I will do this by plotting data from the dataframe using matplotlib so that I can make visual comparisons, and I will also use a line of best fit to visualize trends.
The first statistic I will compare with win ratio is total yards per game (passing plus rushing), because it is generally thought that the more yards a quarterback gains the better, but is this true?
# Yards compared to win rate
x = final['Y/G'].astype(float) + final['RY/G'].astype(float)
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)
plt.title("Winning Percentage vs. Total Yards per Game")
plt.xlabel("Total Yards per Game")
plt.ylabel("Winning Percentage")
for i, txt in enumerate(final["Player"]):
    plt.annotate(txt, (x[i], y[i]), size = 8)
plt.show()
Apparently gaining more yards may not be for the better. I can actually see a negative trend on this plot, which is extremely interesting. An overwhelming majority of Quarterbacks gain between 220 and 300 yards per game, but their win rates fluctuate widely, which suggests that this statistic on its own has little effect on win rate. Quarterbacks like Jameis Winston, Dak Prescott, and Matthew Stafford had the most total yards per game, but were only winning about 50% or less of their games. Let's look deeper and see why this might be.
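The line of best fit on the plot comes from np.polyfit with deg = 1, which fits a straight line by least squares. A minimal sketch of that calculation in plain Python, using made-up points rather than the scraped data, looks like this. The function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a straight line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical yards-per-game and win-percentage values, for illustration only.
yards = [220.0, 250.0, 280.0, 310.0]
win_pct = [60.0, 55.0, 50.0, 45.0]
slope, intercept = fit_line(yards, win_pct)
print(slope, intercept)
```

A negative slope, as in this toy data, is what a downward trend line on the plot corresponds to.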
The next statistic I will look at is touchdowns per game.
x = final['TD'].astype(float)/final['G'].astype(float)
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)
plt.title("Winning Percentage vs. Touchdowns per Game")
plt.xlabel("Touchdowns per Game")
plt.ylabel("Winning Percentage")
for i, txt in enumerate(final["Player"]):
    plt.annotate(txt, (x[i], y[i]), size = 8)
plt.show()
As expected, I can see a clear trend: the more touchdowns a quarterback throws per game, the higher his win percentage tends to be. This makes a lot of sense because you have to score points to win games, and the quarterbacks who throw more touchdowns for their teams are winning more games. However, Jameis Winston, who averaged the most yards per game, also had a high touchdowns per game figure, yet his win ratio was still only around 40%. So clearly touchdowns are not the only thing that contributes to an NFL Quarterback's success.
I will now factor in turnovers, and compare a Quarterback's touchdowns to his fumbles and interceptions, to see if I can get more insights on why some Quarterbacks with high statistics are losing so many games.
# turnovers are interceptions plus fumbles
x = final['TD'].astype(float)/(final['Int'].astype(float) + final['Fmb'].astype(float))
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)
plt.title("Winning Percentage vs. Touchdowns per Turnover")
plt.xlabel("Touchdowns per Turnover")
plt.ylabel("Winning Percentage")
for i, txt in enumerate(final["Player"]):
    plt.annotate(txt, (x[i], y[i]), size = 8)
plt.show()
This looks to be the most telling statistic yet. I finally see the beloved Jameis Winston at the bottom end of this plot. The likely reason he is not winning games, despite throwing for so many yards and touchdowns, is that he throws fewer touchdowns than he has turnovers. His ratio of touchdowns to interceptions and fumbles is almost 0.5:1, which helps explain why his win percentage is so low. The most efficient quarterbacks had the highest win percentages. Russell Wilson, Drew Brees, Patrick Mahomes, Lamar Jackson, and Aaron Rodgers threw between 5 and 7 touchdowns per turnover and also won at least 60% of their games. It is also worth noting that all of these players were selected to the Pro Bowl and Lamar Jackson was the league MVP. So touchdowns per turnover appears to be strongly related to win ratio and is a very important statistic for Quarterbacks. However, there is still more to discover, because some quarterbacks have high win percentages but low touchdowns per turnover.
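The touchdowns per turnover ratio can be sketched for a single quarterback as follows. This is a minimal sketch with made-up numbers, and the function name is hypothetical. The guard for zero turnovers is my own addition, since dividing by zero would otherwise raise an error for a quarterback with no interceptions and no fumbles.

```python
import math

def touchdowns_per_turnover(td, interceptions, fumbles):
    """Passing touchdowns divided by turnovers (interceptions + fumbles)."""
    turnovers = interceptions + fumbles
    if turnovers == 0:
        # Assumption: treat a turnover-free season as an unbounded ratio.
        return math.inf
    return td / turnovers

# Hypothetical stat lines, for illustration only.
print(touchdowns_per_turnover(30, 5, 5))
print(touchdowns_per_turnover(20, 25, 15))
```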
Let's take a look at another negative statistic for Quarterbacks and see how many times they are sacked per game.
x = final['Sk'].astype(float)/final['G'].astype(float)
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)
plt.title("Winning Percentage vs. Sacks per Game")
plt.xlabel("Sacks per Game")
plt.ylabel("Winning Percentage")
for i, txt in enumerate(final["Player"]):
    plt.annotate(txt, (x[i], y[i]), size = 8)
plt.show()
I can clearly see a negative trend here for Quarterbacks who get sacked the most: the more sacks a quarterback takes, the lower his win percentage tends to be. I see familiar names like Lamar Jackson, Patrick Mahomes, and Drew Brees at the better end of my data once again, and names like Jameis Winston and Andy Dalton towards the worse half. This is definitely an important statistic, and I will use it in my calculation for rating Quarterbacks.
Let's now look at completion percentages and see how that statistic plays a factor.
x = final['Cmp%'].astype(float)
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)
plt.title("Winning Percentage vs. Completion Percentage")
plt.xlabel("Completion Percentage")
plt.ylabel("Winning Percentage")
for i, txt in enumerate(final["Player"]):
    plt.annotate(txt, (x[i], y[i]), size = 8)
plt.show()
I see a slight trend here, with a better completion percentage going along with a higher win percentage, but it is not overwhelming. Some Quarterbacks with very high win percentages have low completion percentages. This could be the result of them throwing more passes, so I will have to consider that in my calculation.
I have used plots to visualize winning percentage against QB stats, but I need to be more precise so that I can put weights on the stats in my formula for predicting QB win percentage. So, I will now fit a linear regression with the statsmodels API to get an exact coefficient for how each stat affects win percentage. The regression summary also lets me do some hypothesis testing on whether a stat affects win percentage or not. I can use each coefficient as the weight for that stat in the equation I make, and hopefully my rating will end up in a 1:1 ratio with win percentage.
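To make the link between a coefficient and a hypothesis test concrete, here is a minimal sketch in plain Python of what a one-variable regression reports: the slope, its standard error, and the t statistic (slope divided by standard error). The data is made up and the function name is hypothetical. In the statsmodels summary, the corresponding columns are coef, std err, t, and P>|t|.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (slope, standard error, t statistic) for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, std_err, slope / std_err

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, std_err, t_stat = slope_t_statistic(xs, ys)
print(slope, std_err, t_stat)
```

A large t statistic in absolute value corresponds to a small p-value, and the p-value is the number to compare against a threshold such as 0.05 when deciding whether to reject the null hypothesis that a stat has no effect.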
X = final['Sk'].astype(float)/final['G'].astype(float)
Y = final["Win Percentage"].astype(float)
X = sm.add_constant(X)
model = sm.OLS(endog=Y,exog=X)
results = model.fit()
results.summary()
I see a clear effect here: win percentage decreases by 13.162 percentage points for each additional sack per game. I reject the null hypothesis that sacks per game does not affect win rate.
X = final["Cmp%"].astype(float)
Y = final["Win Percentage"].astype(float)
X = sm.add_constant(X)
model = sm.OLS(endog=Y,exog=X)
results = model.fit()
results.summary()
I can see from the regression summary that win percentage increases by 1.94 percentage points for each one-point increase in completion percentage, so I reject the null hypothesis that completion percentage does not affect win percentage.
X = final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))
Y = final["Win Percentage"].astype(float)
X = sm.add_constant(X)
model = sm.OLS(endog=Y,exog=X)
results = model.fit()
results.summary()
As I saw in my plot, this statistic is very important. Winning percentage increases by 8 percentage points as touchdowns per turnover increases by 1. Again, I am rejecting the null hypothesis.
X = final['Y/G'].astype(float) + final['RY/G'].astype(float)
Y = final["Win Percentage"].astype(float)
X = sm.add_constant(X)
model = sm.OLS(endog=Y,exog=X)
results = model.fit()
results.summary()
The coefficient here is basically 0 (it is less than 0.05), so I fail to reject the null hypothesis that total yards does not affect win percentage, and I will not be using it in my formula.
Now that I have coefficients for the statistics I found to have effects on win percentage, I want to see how accurate my rating actually is.
x = (8 * (final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))) + \
1.94 * (final["Cmp%"].astype(float)) - \
13.162 * final['Sk'].astype(float)/final['G'].astype(float))/2
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)
plt.title("Winning Percentage vs. Rating")
plt.xlabel("Rating")
plt.ylabel("Winning Percentage")
for i, txt in enumerate(final["Player"]):
    plt.annotate(txt, (x[i], y[i]), size = 8)
plt.show()
X = (8 * (final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))) + \
1.94 * (final["Cmp%"].astype(float)) - \
13.162 * final['Sk'].astype(float)/final['G'].astype(float))/2
Y = final["Win Percentage"].astype(float)
X = sm.add_constant(X)
model = sm.OLS(endog=Y,exog=X)
results = model.fit()
results.summary()
My formula looks pretty good. I was able to get a coefficient of 1, which means that the average win percentage goes up by 1 point as my rating goes up by 1, which was my goal. Let's calculate how accurate the prediction was by taking the absolute difference between my rating and the actual win percentage for each player.
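The rating formula from the cells above can be sketched as a function of one quarterback's stats. The weights 8, 1.94, and 13.162 are the regression coefficients reported earlier, and the final division by 2 is the scaling used in the notebook. The function name and the stat line below are hypothetical.

```python
def qb_win_rating(td, interceptions, fumbles, cmp_pct, sacks, games):
    """Predicted win percentage from the three weighted statistics."""
    td_per_turnover = td / (interceptions + fumbles)
    sacks_per_game = sacks / games
    return (8 * td_per_turnover
            + 1.94 * cmp_pct
            - 13.162 * sacks_per_game) / 2

# Hypothetical stat line, for illustration only.
rating = qb_win_rating(td=30, interceptions=5, fumbles=5,
                       cmp_pct=65.0, sacks=32, games=16)
print(round(rating, 3))
```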
final["QBwin"] = (8 * (final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))) + \
1.94 * (final["Cmp%"].astype(float)) - \
13.162 * final['Sk'].astype(float)/final['G'].astype(float))/2
final["difference"] = abs(final["Win Percentage"] - final["QBwin"])
final["difference"].mean()
So my model is on average off by about 12 percentage points, which is actually pretty good: there are only 16 games in a season, so on average my model predicts within a margin of error of about 2 games. Let's take a deeper look at this distribution with a violin plot to get a better understanding of the data.
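The accuracy measure used here is a mean absolute error, and converting it into games is simple arithmetic. A minimal sketch, with hypothetical function names and made-up values:

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def error_in_games(error_points, games_in_season):
    """Translate an error in percentage points into a number of games."""
    return error_points / 100 * games_in_season

# Hypothetical win percentages and ratings, for illustration only.
actual = [50.0, 62.5, 75.0]
predicted = [40.0, 75.0, 60.0]
mae = mean_absolute_error(actual, predicted)
print(mae)
# An error of 12 percentage points over a 16-game season:
print(error_in_games(12.0, 16))
```

With 12 percentage points and 16 games this comes to a little under 2 games, which is where the margin of about 2 games comes from.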
sns.violinplot(x=final["difference"])
I do not see a normal distribution here, which is okay, because the biggest chunk of the data actually falls between 0 and 10 percent, meaning that for many Quarterbacks I was more accurate than the average suggests. The median was around 12% and the IQR ran from 5% to 20%. Let's see how the formula would do in this year's season.
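The median and IQR read off the violin plot can also be computed directly. This is a minimal sketch using Python's statistics module on made-up error values. The exact quartile values depend on the interpolation method, so they may differ slightly from what a plotting library draws.

```python
import statistics

# Hypothetical absolute errors in percentage points, for illustration only.
errors = [5.0, 8.0, 12.0, 15.0, 20.0]

median = statistics.median(errors)
# quantiles with n=4 returns the three cut points: Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(errors, n=4)
iqr = q3 - q1
print(median, q1, q3, iqr)
```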
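# apply the 2019-derived weights to the 2020 stats (final_2020), not to the 2019 dataframe
final_2020["QBwin"] = (8 * (final_2020["TD"].astype(float)/(final_2020["Int"].astype(float) + final_2020["Fmb"].astype(float))) + \
    1.94 * (final_2020["Cmp%"].astype(float)) - \
    13.162 * final_2020['Sk'].astype(float)/final_2020['G'].astype(float))/2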
prediction = abs(final_2020["Win Percentage"] - final_2020["QBwin"])
prediction.mean()
Considering that the model is based on the 2019 season, this is not too bad. There have only been 14 games so far this season, so this could get a little better once the season ends. Also, with fewer games played, 19% represents fewer games, so we are still around a margin of error of 2-3 games.
After analyzing the data from the 2019 season and comparing Quarterback statistics to their win rates, I have found that the most important statistics are touchdowns per turnover, completion percentage, and sacks per game. Given that, teams should focus on acquiring efficient Quarterbacks. Quarterbacks who throw more touchdowns, have fewer turnovers, do not get sacked often, and complete a higher percentage of their passes simply win more games. There is a false narrative that the quarterbacks who throw for the most yards and touchdowns are the better ones, but just because they have higher stats does not mean that they are efficient. Negatives seem to outweigh the positives, as we saw with Jameis Winston, who threw for the most yards and for a lot of touchdowns, and yet had a very bad win rate. This is because he had a poor completion percentage and also had more turnovers than touchdowns. Efficiency reigned supreme for the Quarterbacks who won the most games in the 2019 season, and teams should draft and acquire efficient quarterbacks if they want to win more games.
And the data science process is now complete. First, I scraped data from tables I found on www.pro-football-reference.com that had all of the data I needed. Then I tidied this data and got rid of what I did not need. I merged two dataframes so that I could analyze passing and rushing data for Quarterbacks in 2019. After all of this the data was ready to be analyzed, so I used matplotlib to plot the statistics I hypothesized would affect win rate and then analyzed what I saw. With a regression line I was able to see the average effect that these statistics had on a Quarterback's win ratio that season.
Once I saw which statistics had larger effects on win rate, I used statsmodels to fit a regression model on the data. Here, I was able to get exact coefficients that I could then use as weights in my formula. I was also able to confirm my hypotheses about which statistics affected win rate. Once this was completed I could make my formula for predicting Quarterback win rate and analyze how accurate it was. I found my rating to have around a 12% margin of error for the 2019 season and around 19% for the 2020 season. I considered this a success, since it is only a few games off in each season.
After all of this I can conclude that Quarterback win rate can be predicted with relative accuracy using Quarterback statistics, but there are many more factors that play into whether a Quarterback will win a game or not. Yes, the Quarterback definitely plays a huge role in whether his team is successful, but he is not the only player on the team. Some teams' defenses, and the players that surround a Quarterback, are better or worse than others, and they also play a role in whether the team wins.
So, in conclusion, I have found the statistics that affect the win rate of a quarterback the most and have created a formula to predict win rate based on these statistics, but I understand that more data would need to be considered to predict the exact win rate of a Quarterback.