Predicting Current Quarterback Win Rates

image.png

By: Michael Del Bene

Introduction

In the NFL, and football in general, the Quarterback is said to be the most important player on the field. This is because he has the ball in his hands on every offensive play, and also makes a lot of the play calls. He is usually the leader of his team, on and off the field. With that, I am interested in how a Quarterbacks statistics affect their win rate so I am going to attempt to predict QB win ratios based off of their statistics using data analysis techniques.

Data Scraping

To begin my data science project I will need to first import all of the libraries I am going to be using. Which is what this next cell accomplishes. After this I will begin to scrape the data, which is where I gather data from a website that has the information I need to do a data analysis. I will be using www.pro-football-reference.com, a website that has all the data I wanted about Quarterbacks from the 2019 and 2020 season so far.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
!pip3 install lxml
Requirement already satisfied: lxml in /opt/conda/lib/python3.8/site-packages (4.6.2)
In [2]:
url = "https://www.pro-football-reference.com/years/2019/passing.htm"
reqs = []

for i in range(2019, 2021):
    reqs.append(requests.get(url))
    # change 2010 to 2011 in the url, 2011 to 2012, etc.
    url = url.replace(str(i), str(i+1))
tables = []
for r in reqs:
    root = BeautifulSoup(r.content, "html.parser")
    tables.append(root.find("table"))
# converts the table to a dataframe. this is currently only the 2010 season, because
# tables[0], the first page we scraped, is 2010
temp = pd.read_html(str(tables[0]))[0]
data_2020 = pd.read_html(str(tables[1]))[0]
temp
Out[2]:
Rk Player Tm Age Pos G GS QBrec Cmp Att ... Y/G Rate QBR Sk Yds.1 NY/A ANY/A Sk% 4QC GWD
0 1 Jared Goff LAR 25 QB 16 16 9-7-0 394 626 ... 289.9 86.5 50.2 22 170 6.90 6.46 3.4 1 2
1 2 Jameis Winston TAM 25 QB 16 16 7-9-0 380 626 ... 319.3 84.3 59.1 47 282 7.17 6.15 7.0 2 2
2 3 Matt Ryan ATL 34 QB 15 15 7-8-0 408 616 ... 297.7 92.1 60.4 48 316 6.25 6.08 7.2 3 2
3 4 Tom Brady NWE 42 QB 16 16 12-4-0 373 613 ... 253.6 88.0 54.5 27 185 6.05 6.24 4.2 1 1
4 5 Carson Wentz PHI 27 QB 16 16 9-7-0 388 607 ... 252.4 93.1 64.8 37 230 5.91 6.26 5.7 2 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
100 98 Emmanuel Sanders 2TM 32 NaN 17 16 NaN 1 1 ... 2.1 158.3 NaN 0 0 35.00 55.00 0.0 NaN NaN
101 99 Steven Sims WAS 22 NaN 16 2 NaN 0 1 ... 0.0 39.6 NaN 0 0 0.00 0.00 0.0 NaN NaN
102 100 Courtland Sutton * DEN 24 WR 16 14 NaN 1 1 ... 2.4 118.7 100.0 0 0 38.00 38.00 0.0 NaN NaN
103 101 Alex Tanney NYG 32 NaN 1 0 NaN 1 1 ... 1.0 79.2 1.6 0 0 1.00 1.00 0.0 NaN NaN
104 102 James White NWE 27 NaN 15 1 NaN 1 1 ... 2.3 118.7 99.9 0 0 35.00 35.00 0.0 NaN NaN

105 rows × 31 columns

My variable temp is storing the 2019 passer data and the variable data_2020 is storing the 2020 passer dataframe. I also want to look at QB rushing statistics so I will be scraping from a different table from the same website to accomplish this.

In [3]:
url = "https://www.pro-football-reference.com/years/2019/rushing.htm"
reqs = []

for i in range(2019, 2021):
    reqs.append(requests.get(url))
    # change 2010 to 2011 in the url, 2011 to 2012, etc.
    url = url.replace(str(i), str(i+1))
tables = []
for r in reqs:
    root = BeautifulSoup(r.content, "html.parser")
    tables.append(root.find("table"))
# converts the table to a dataframe. this is currently only the 2010 season, because
# tables[0], the first page we scraped, is 2010
rushing = pd.read_html(str(tables[0]))[0]
rushing_2020 = pd.read_html(str(tables[1]))[0]
rushing
Out[3]:
Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Games Rushing Unnamed: 14_level_0
Rk Player Tm Age Pos G GS Att Yds TD 1D Lng Y/A Y/G Fmb
0 1 Derrick Henry * TEN 25 RB 15 15 303 1540 16 73 74 5.1 102.7 5
1 2 Ezekiel Elliott* DAL 24 RB 16 16 301 1357 12 78 33 4.5 84.8 3
2 3 Nick Chubb* CLE 24 RB 16 16 298 1494 8 62 88 5.0 93.4 3
3 4 Christian McCaffrey*+ CAR 23 RB 16 16 287 1387 15 57 84 4.8 86.7 1
4 5 Chris Carson SEA 25 RB 15 15 278 1230 7 75 59 4.4 82.0 7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
340 330 Danny Vitale GNB 26 NaN 15 4 1 3 0 0 3 3.0 0.2 0
341 331 Greg Ward PHI 24 NaN 7 3 1 5 0 0 5 5.0 0.7 0
342 332 Trevon Wesco NYJ 24 NaN 16 1 1 2 0 1 2 2.0 0.1 0
343 333 Mike Williams LAC 25 WR 15 15 1 2 0 0 2 2.0 0.1 0
344 334 Jarius Wright CAR 30 WR/wr 16 9 1 -7 0 0 -7 -7.0 -0.4 0

345 rows × 15 columns

The variable rushing now stores the rushing data for 2019 and rushing_2020 for the 2020 season.

Data Tidying

Now that I have scraped all of the necessary data I need, I will begin to tidy the data which means to prepare it to be analyzed. First, I will begin removing unnecessary rows from the data. I will also remove columns I do not need. I will do this for the 4 dataframes I am going to be working with. As part of the analysis, I have deemed it necessary to only look at QBs who started in more than half of the games that season to reduce possible bias, and to keep our stats consistent. Also, because we care about win ratio we will need to parse the column that contains wins and losses so that we can divide wins by losses and created a column called win percentage, which becomes the most important column in our dataframes because that is what we are comparing each statistic to.

In [4]:
# column titles. this removes them
temp = temp[~temp["GS"].str.contains("GS")]

# preparing QBrec for further analysis 

temp["QBrec"] = temp["QBrec"].astype(str)
temp = temp[~temp["QBrec"].str.contains("NaN")]
temp = temp[~temp["QBrec"].str.contains("nan")]

temp["GS"] = pd.to_numeric(temp["GS"]) 


data_2020 = data_2020[~data_2020["GS"].str.contains("GS")]
data_2020["QBrec"] = data_2020["QBrec"].astype(str)
data_2020 = data_2020[~data_2020["QBrec"].str.contains("NaN")]
data_2020 = data_2020[~data_2020["QBrec"].str.contains("nan")]

data_2020["GS"] = pd.to_numeric(data_2020["GS"]) 
<ipython-input-4-f7cabff60bdc>:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  temp["QBrec"] = temp["QBrec"].astype(str)
In [5]:
rushing.columns = ['Rk', 'Player', 'Tm', 'Age','Pos','G','GS','RAtt','RYds', 'RTD', '1D', 'Lng', 'RY/A', 'RY/G', 'Fmb']


# remove column titles from middle of dataframe
rushing = rushing[~rushing["GS"].str.contains("GS")]

rushing["GS"] = pd.to_numeric(rushing["GS"]) 

rushing = rushing.loc[rushing['Pos'] == 'QB']
rushing = rushing.drop(["1D", "Lng", "Tm", "Age", "Pos", "G", "GS", "Rk"], axis = 1)


rushing_2020.columns = ['Rk', 'Player', 'Tm', 'Age','Pos','G','GS','RAtt','RYds', 'RTD', '1D', 'Lng', 'RY/A', 'RY/G', 'Fmb']


# remove column titles from middle of dataframe
rushing_2020 = rushing_2020[~rushing_2020["GS"].str.contains("GS")]

rushing_2020["GS"] = pd.to_numeric(rushing_2020["GS"]) 

rushing_2020 = rushing_2020.loc[rushing_2020['Pos'] == 'QB']
rushing_2020 = rushing_2020.drop(["1D", "Lng", "Tm", "Age", "Pos", "G", "GS", "Rk"], axis = 1)
rushing_2020.head()
<ipython-input-5-c4041bf072e4>:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  rushing["GS"] = pd.to_numeric(rushing["GS"])
Out[5]:
Player RAtt RYds RTD RY/A RY/G Fmb
24 Lamar Jackson 135 828 7 6.1 63.7 9
30 Kyler Murray 123 741 11 6.0 52.9 9
31 Cam Newton 122 489 11 4.0 37.6 6
44 Josh Allen 96 383 8 4.0 27.4 9
56 Deshaun Watson 82 394 3 4.8 28.1 6

Here, we have a list of all of the statistics represented by column heads in our dataframes.

Rk -- Rank This is a count of the rows from top to bottom. It is recalculated following the sorting of a column.

Age -- Player's age on December 31st of that year

Pos -- Position

G -- Games played

GS -- Games started as an offensive or defensive player

QBrec -- Team record in games started by this QB (regular season)

Cmp -- Passes completed

Att -- Passes attempted

Cmp% -- Percentage of Passes Completed

Yds -- Yards Gained by Passing (For teams, sack yardage is deducted from this total)

TD -- Passing Touchdowns

TD% -- Percentage of Touchdowns Thrown when Attempting to Pass

Int -- Interceptions thrown

Int% -- Percentage of Times Intercepted when Attempting to Pass

1D -- First downs passing

Lng -- Longest Completed Pass Thrown (complete since 1975)

Y/A -- Yards gained per pass attempt

AY/A -- Adjusted Yards gained per pass attempt (Passing Yards + 20 Passing TD - 45 Interceptions) / (Passes Attempted)

Y/C -- Yards gained per pass completion (Passing Yards) / (Passes Completed)

Y/G -- Yards gained per game played

Rate -- Quarterback Rating

QBR -- QBR (ESPN s Total Quarterback Rating, calculated since 2006)

Sk -- Times Sacked (first recorded in 1969, player per game since 1981)

Yds -- Yards lost due to sacks (first recorded in 1969, player per game since 1981)

NY/A -- Net Yards gained per pass attempt (Passing Yards - Sack Yards) / (Passes Attempted + Times Sacked)

ANY/A -- Adjusted Net Yards per Pass Attempt

(Passing Yards - Sack Yards + (20 Passing TD) - (45 Interceptions)) / (Passes Attempted + Times Sacked)

Sk% -- Percentage of Time Sacked when Attempting to Pass: Times Sacked / (Passes Attempted + Times Sacked)

4QC -- Comebacks led by quarterback. Must be an offensive scoring drive in the 4th quarter, with the team trailing by one score, though not necessarily a drive to take the lead. Only games ending in a win or tie are included.

GWD -- Game-winning drives led by quarterback. Must be an offensive scoring drive in the 4th quarter or overtime that puts the winning team ahead for the last time.

In [6]:
# Only looking at quarterbacks who started atleast half of the seasons games

data = temp[temp["GS"] >= 8]

# Drop unnecessary columns
data = data.drop(["Cmp", "Att", "TD%", "1D", "Lng", "NY/A", "ANY/A", "4QC", "GWD"], axis=1)

data_2020 = data_2020[data_2020["GS"] >= 6]

data_2020 = data_2020.drop(["Cmp", "Att", "TD%", "1D", "Lng", "ANY/A", "4QC", "GWD"], axis=1)

data.head()
Out[6]:
Rk Player Tm Age Pos G GS QBrec Cmp% Yds ... Int% Y/A AY/A Y/C Y/G Rate QBR Sk Yds.1 Sk%
0 1 Jared Goff LAR 25 QB 16 16 9-7-0 62.9 4638 ... 2.6 7.4 7.0 11.8 289.9 86.5 50.2 22 170 3.4
1 2 Jameis Winston TAM 25 QB 16 16 7-9-0 60.7 5109 ... 4.8 8.2 7.1 13.4 319.3 84.3 59.1 47 282 7.0
2 3 Matt Ryan ATL 34 QB 15 15 7-8-0 66.2 4466 ... 2.3 7.3 7.1 10.9 297.7 92.1 60.4 48 316 7.2
3 4 Tom Brady NWE 42 QB 16 16 12-4-0 60.8 4057 ... 1.3 6.6 6.8 10.9 253.6 88.0 54.5 27 185 4.2
4 5 Carson Wentz PHI 27 QB 16 16 9-7-0 63.9 4039 ... 1.2 6.7 7.0 10.4 252.4 93.1 64.8 37 230 5.7

5 rows × 22 columns

In [7]:
# Parse the record column so we can form win percentage
data = (data
         .assign(Wins= data.QBrec.str.split('-').str.get(0),
                 Losses = data.QBrec.str.split('-').str.get(-2),
                 Ties = data.QBrec.str.split('-').str.get(-1)
            )
          )

# Creation of win percentage column based on record
data["Win Percentage"] = 100 * (data["Wins"].astype(int) + data["Ties"].astype(int)*0.5) / data["GS"].astype(int)


data = data.drop(["QBrec", "Wins", "Losses", "Ties"], axis=1)

# Merge passing and rushing data 
final = pd.merge(data, rushing, how = "inner", left_on = "Player", right_on = "Player")

final = final.fillna(0)
final.index = range(0,30) 
final.head()
Out[7]:
Rk Player Tm Age Pos G GS Cmp% Yds TD ... Sk Yds.1 Sk% Win Percentage RAtt RYds RTD RY/A RY/G Fmb
0 1 Jared Goff LAR 25 QB 16 16 62.9 4638 22 ... 22 170 3.4 56.250000 33 40 2 1.2 2.5 10
1 2 Jameis Winston TAM 25 QB 16 16 60.7 5109 33 ... 47 282 7.0 43.750000 59 250 1 4.2 15.6 12
2 3 Matt Ryan ATL 34 QB 15 15 66.2 4466 26 ... 48 316 7.2 46.666667 34 147 1 4.3 9.8 9
3 4 Tom Brady NWE 42 QB 16 16 60.8 4057 24 ... 27 185 4.2 75.000000 26 34 3 1.3 2.1 4
4 5 Carson Wentz PHI 27 QB 16 16 63.9 4039 27 ... 37 230 5.7 56.250000 62 243 1 3.9 15.2 16

5 rows × 28 columns

In [8]:
# Repeat process for 2020 dataframe

data_2020 = (data_2020
         .assign(Wins= data_2020.QBrec.str.split('-').str.get(0),
                 Losses = data_2020.QBrec.str.split('-').str.get(-2),
                 Ties = data_2020.QBrec.str.split('-').str.get(-1)
            )
          )

data_2020["Win Percentage"] = 100 * (data_2020["Wins"].astype(int) + data_2020["Ties"].astype(int)*0.5) / data_2020["GS"].astype(int)


data_2020 = data_2020.drop(["QBrec", "Wins", "Losses", "Ties"], axis=1)

final_2020 = pd.merge(data_2020, rushing_2020, how = "inner", left_on = "Player", right_on = "Player")

final_2020 = final_2020.fillna(0)
final_2020.head()
Out[8]:
Rk Player Tm Age Pos G GS Cmp% Yds TD ... Yds.1 NY/A Sk% Win Percentage RAtt RYds RTD RY/A RY/G Fmb
0 1 Matt Ryan ATL 35 QB 14 14 64.2 4016 22 ... 227 6.50 6.2 28.571429 25 92 1 3.7 6.6 3
1 2 Patrick Mahomes KAN 25 QB 14 14 67.3 4462 36 ... 147 7.62 3.9 92.857143 59 287 2 4.9 20.5 5
2 3 Tom Brady TAM 43 QB 14 14 65.1 3886 32 ... 128 6.70 3.4 64.285714 25 3 3 0.1 0.2 4
3 4 Justin Herbert LAC 22 QB 13 13 66.5 3781 27 ... 171 6.47 4.8 30.769231 45 199 4 4.4 15.3 7
4 5 Ben Roethlisberger PIT 38 QB 13 13 66.2 3292 29 ... 97 6.01 2.1 84.615385 21 13 0 0.6 1.0 3

5 rows × 29 columns

Exploratory Data Analysis

In this section I will try to answer the question of "What Quarterback statistics affects their win ratio the most?" This will allow me to then performa exact Linear Regression analysis on these statistics so that I can find weights for them in my formula to predict win ratio. I will do this by plotting data from the database using matplotlib so that I can make visual comparisons and also will use a line of best fit to visualize trends.

The first statistic I will compare with win ratio is passing yards per game because it is generally thought that the more yards a quarterback throws for the better, but is this true?

In [9]:
# Yards compared to win rate 

x = final['Y/G'].astype(float) + final['RY/G'].astype(float)
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)


plt.title("Winning Percentage vs. Total Yards per Game")
plt.xlabel("Total Yards per Game")
plt.ylabel("Winning Percentage")

for i, txt in enumerate(final["Player"]):
        plt.annotate(txt, (x[i], y[i]), size = 8)

plt.show()

Apparently the more yards a quarterback gains may not be for the better. I can actually see a negative trend on this plot which is extremely interesting. An overwhelming majority of Quarterbacks gain between 220 and 300 yards per game, but the win rates fluctuate so I can conclude that this statistic does not affect win rate. Quarterbacks like Jameis Winston, Dak Prescott, and Matthew Stafford had the most total yards per game, but were only winning about 50% or less of their games. Lets look deeper and see why this might be.

The next statistic I will look at is touchdowns per game.

In [10]:
x = final['TD'].astype(float)/final['G'].astype(float)
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)


plt.title("Winning Percentage vs. Touchdowns per Game")
plt.xlabel("Touchdowns per Game")
plt.ylabel("Winning Percentage")

for i, txt in enumerate(final["Player"]):
        plt.annotate(txt, (x[i], y[i]), size = 8)

plt.show()

As expected, I can see a clear trend that the more touchdowns a quarterback throws per game the higher their win percentage will be. This makes a lot of sense because you have to score points to win games, and the quarterbacks who are throwing more touchdowns for their teams are winning more games. However, Jameis Winston, who averaged the most yards per game, also had a high touchdowns per game statistic, but his win ratio was still only around 40%. So clearly touchdowns are not the only thing that contribute to a Quarterback in the NFL's success.

I will now factor in turnovers, and compare a Quarterback's touchdowns to his fumbles and interceptions, to see if I can get more insights on why some Quarterbacks with high statistics are losing so many games.

In [11]:
x = final['TD'].astype(float)/(final['Int'].astype(float) + final['Int'].astype(float))
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)


plt.title("Winning Percentage vs. Touchdowns per Turnover")
plt.xlabel("Touchdowns per Turnover")
plt.ylabel("Winning Percentage")

for i, txt in enumerate(final["Player"]):
        plt.annotate(txt, (x[i], y[i]), size = 8)

plt.show()

This looks to be the most telling statistic yet. I finally see the beloved Jameis Winston at the bottom end of this plot. I can conclude that the reason he is not winning games despite throwing for so many yards and touchdowns, is because he throws less touchdowns than he has turnovers. He is throwing touchdowns to interceptions and fumbles at a rate of almost 0.5:1 which explains why his win percentage is so low. The most efficient quarterbacks had the highest win percentages. Russell Wilson, Drew Brees, Patrick Mahomes, Lamar Jackson, and Aaron Rodgers threw between 5 and 7 touchdowns per 1 turnover and also won atleast 60% of their games. It is also worth noting that all of these players were selected to the pro bowl and Lamar was the league MVP. So, I can conclude here that Touchdowns per turnover has a direct effect on win ratio and is a very important statistic for Quarterbacks, however there is still more to discover because we still see some quarterbacks with high win percentages but low touchdowns per interceptions thrown.

Let's take a look at another negative statistic for Quarterbacks and see how many times they are sacked per game.

In [12]:
x = final['Sk'].astype(float)/final['G'].astype(float)
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)


plt.title("Winning Percentage vs. Sacks per Game")
plt.xlabel("Sacks per Game")
plt.ylabel("Winning Percentage")

for i, txt in enumerate(final["Player"]):
        plt.annotate(txt, (x[i], y[i]), size = 8)

plt.show()

I can clearly see here that there is a negative trend for Quarterbacks who get sacked the most. The more sacks a quarterback takes results in a lower win percentage. I see similar names like Lamar Jackson, Patrick Mahomes, and Drew Brees at the better end of my data once again, and names like Jameis Winston and Andy Dalton towards the worse half. This is definitely an important statistic and I will use it in our calculation of rating the best winning Quarterbacks.

Let's now look at completion percentages and see how that statistic plays a factor.

In [13]:
x = final['Cmp%'].astype(float)
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)


plt.title("Winning Percentage vs. Completion Percentage")
plt.xlabel("Completion Percentage")
plt.ylabel("Winning Percentage")

for i, txt in enumerate(final["Player"]):
        plt.annotate(txt, (x[i], y[i]), size = 8)

plt.show()

I see a slight trend here in better completion percentage resulting in a higher win percentage, but it is not overwhelming. Some Quarterbacks with very high win percentages have low completion percentages. This could be the result of them throwing more passes, so I will have to consider that in our calculation.

Linear Regression Models

I have used plots to visualize winning percentage vs qb stats, but I need to be more precise so that I can put weights on stats in my formula for predicting QB win percentage. So, I will now use a linear regression from stats models api to get an exact coefficient for how each stat affects win percentage. This coefficient will also be able to do some hypothesis testing with the information I find to test if a stat does affect win percentage or not. I can use this coefficient as the weight for the equation I make, and hopefully will be able to create a 1:1 ratio.

In [14]:
X = final['Sk'].astype(float)/final['G'].astype(float)
Y = final["Win Percentage"].astype(float)

X = sm.add_constant(X)

model = sm.OLS(endog=Y,exog=X)

results = model.fit()
results.summary()
Out[14]:
OLS Regression Results
Dep. Variable: Win Percentage R-squared: 0.220
Model: OLS Adj. R-squared: 0.192
Method: Least Squares F-statistic: 7.903
Date: Mon, 21 Dec 2020 Prob (F-statistic): 0.00891
Time: 17:34:45 Log-Likelihood: -124.79
No. Observations: 30 AIC: 253.6
Df Residuals: 28 BIC: 256.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 84.7075 11.075 7.649 0.000 62.022 107.393
0 -13.1620 4.682 -2.811 0.009 -22.752 -3.572
Omnibus: 1.019 Durbin-Watson: 2.053
Prob(Omnibus): 0.601 Jarque-Bera (JB): 0.821
Skew: 0.009 Prob(JB): 0.663
Kurtosis: 2.190 Cond. No. 10.4


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

I see a clear effect here as win percentage decreased by 13.162 as sacks per game increased by 1. I reject the null hypothesis that sacks per game does not affect win rate.

In [15]:
X = final["Cmp%"].astype(float)
Y = final["Win Percentage"].astype(float)

X = sm.add_constant(X)

model = sm.OLS(endog=Y,exog=X)

results = model.fit()
results.summary()
Out[15]:
OLS Regression Results
Dep. Variable: Win Percentage R-squared: 0.166
Model: OLS Adj. R-squared: 0.137
Method: Least Squares F-statistic: 5.588
Date: Mon, 21 Dec 2020 Prob (F-statistic): 0.0253
Time: 17:34:45 Log-Likelihood: -125.79
No. Observations: 30 AIC: 255.6
Df Residuals: 28 BIC: 258.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -70.1253 52.882 -1.326 0.196 -178.450 38.199
Cmp% 1.9427 0.822 2.364 0.025 0.259 3.626
Omnibus: 0.250 Durbin-Watson: 2.324
Prob(Omnibus): 0.883 Jarque-Bera (JB): 0.444
Skew: 0.050 Prob(JB): 0.801
Kurtosis: 2.412 Cond. No. 1.12e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

I can see from the regression summary that win percentage increases by 1.94% as completion percentage increases, so I reject the null hypothesis that completion percentage does not affect win percentage.

In [16]:
X = final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))
Y = final["Win Percentage"].astype(float)

X = sm.add_constant(X)

model = sm.OLS(endog=Y,exog=X)

results = model.fit()
results.summary()
Out[16]:
OLS Regression Results
Dep. Variable: Win Percentage R-squared: 0.290
Model: OLS Adj. R-squared: 0.265
Method: Least Squares F-statistic: 11.44
Date: Mon, 21 Dec 2020 Prob (F-statistic): 0.00214
Time: 17:34:45 Log-Likelihood: -123.38
No. Observations: 30 AIC: 250.8
Df Residuals: 28 BIC: 253.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 42.1624 4.638 9.092 0.000 32.663 51.662
0 7.9962 2.364 3.383 0.002 3.154 12.839
Omnibus: 0.378 Durbin-Watson: 2.032
Prob(Omnibus): 0.828 Jarque-Bera (JB): 0.539
Skew: -0.152 Prob(JB): 0.764
Kurtosis: 2.419 Cond. No. 3.84


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Like I saw in my plot this statistic is very important. I see that winning percentage increases by 8% as touchdowns per turnover increases by 1. Again, I am rejecting the null hypothesis.

In [17]:
X = final['Y/G'].astype(float) + final['RY/G'].astype(float)
Y = final["Win Percentage"].astype(float)

X = sm.add_constant(X)

model = sm.OLS(endog=Y,exog=X)

results = model.fit()
results.summary()
Out[17]:
OLS Regression Results
Dep. Variable: Win Percentage R-squared: 0.007
Model: OLS Adj. R-squared: -0.029
Method: Least Squares F-statistic: 0.1872
Date: Mon, 21 Dec 2020 Prob (F-statistic): 0.669
Time: 17:34:46 Log-Likelihood: -128.42
No. Observations: 30 AIC: 260.8
Df Residuals: 28 BIC: 263.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 66.0814 26.558 2.488 0.019 11.681 120.482
0 -0.0429 0.099 -0.433 0.669 -0.246 0.160
Omnibus: 0.284 Durbin-Watson: 1.972
Prob(Omnibus): 0.868 Jarque-Bera (JB): 0.469
Skew: -0.066 Prob(JB): 0.791
Kurtosis: 2.402 Cond. No. 2.15e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.15e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

I see a coefficient of basically 0 here as it is less than 0.05 so I will accept the null hypothesis that total yards does not affect win percentage and I will not be using it in my formula.

Now that I have our coefficients for the statistics I analyzed to have affects on win percentage, I want to see how accurate my statistic actually is.

In [18]:
x = (8 * (final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))) + \
          1.94 * (final["Cmp%"].astype(float)) - \
          13.162 * final['Sk'].astype(float)/final['G'].astype(float))/2
        
y = final['Win Percentage'].astype(float)
z = np.polyfit(x = x, y = y, deg = 1)
f = np.poly1d(z)
x2 = np.linspace(x.min(), x.max(), 100)
y2 = f(x2)
plt.figure(figsize=(15,10))
plt.plot(x, y,'o', x2, y2)


plt.title("Winning Percentage vs. Rating")
plt.xlabel("Rating")
plt.ylabel("Winning Percentage")

for i, txt in enumerate(final["Player"]):
        plt.annotate(txt, (x[i], y[i]), size = 8)

plt.show()
          
In [19]:
X = (8 * (final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))) + \
          1.94 * (final["Cmp%"].astype(float)) - \
          13.162 * final['Sk'].astype(float)/final['G'].astype(float))/2
Y = final["Win Percentage"].astype(float)

X = sm.add_constant(X)

model = sm.OLS(endog=Y,exog=X)

results = model.fit()
results.summary()
Out[19]:
OLS Regression Results
Dep. Variable: Win Percentage R-squared: 0.351
Model: OLS Adj. R-squared: 0.328
Method: Least Squares F-statistic: 15.14
Date: Mon, 21 Dec 2020 Prob (F-statistic): 0.000563
Time: 17:34:46 Log-Likelihood: -122.03
No. Observations: 30 AIC: 248.1
Df Residuals: 28 BIC: 250.9
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -0.8900 14.531 -0.061 0.952 -30.655 28.875
0 1.0374 0.267 3.891 0.001 0.491 1.584
Omnibus: 2.158 Durbin-Watson: 2.163
Prob(Omnibus): 0.340 Jarque-Bera (JB): 1.273
Skew: -0.175 Prob(JB): 0.529
Kurtosis: 2.054 Cond. No. 296.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

My formula looks pretty good. I was able to get a coefficient of 1 which means that the average win rate goes up by 1 as my rating goes up by 1 which was my goal. Lets calculate how accurate the prediction was by taking the difference between my rating and the actual win percentage for each player.

In [20]:
final["QBwin"] = (8 * (final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))) + \
          1.94 * (final["Cmp%"].astype(float)) - \
          13.162 * final['Sk'].astype(float)/final['G'].astype(float))/2


final["difference"] = abs(final["Win Percentage"] - final["QBwin"])

final["difference"].mean()
Out[20]:
11.914310751189037

So my model is on average off by about 12% which is actually pretty good considering there are only 16 games in a season so my model on average can predict within a margin of error of 2 games. Lets take a deeper look at this distribution with a violinplot to get a better understanding of the data.

In [21]:
sns.violinplot(x=final["difference"])
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9612b90520>

So I do not see a normal distribution here which is okay because the biggest chunk of our data actually falls between 0 and 10 percent meaning I was a bit more accurate on average. The median was a round 12% and the IQR was from 5% to 20%. Lets see how the data would do in this years season.

In [22]:
final_2020["QBwin"] = (8 * (final["TD"].astype(float)/(final["Int"].astype(float) + final["Fmb"].astype(float))) + \
          1.94 * (final["Cmp%"].astype(float)) - \
          13.162 * final['Sk'].astype(float)/final['G'].astype(float))/2

prediction = abs(final_2020["Win Percentage"] - final_2020["QBwin"])

prediction.mean()
Out[22]:
19.35355429143091

Considering that the model is based on the 2019 season this is not too bad. There have only been 14 games so far this season so this could get a little better once the season ends. Also the amount of games 19% represents is less so we are still around a margin of error of 2-3 games.

Policy and Insight Decisions

After analyzing the data from the 2019 season and comparing Quarterback statistics to their win rates I have found that the statistics that are most important are their touchdown per turnovers, completion percentage, and sacks per game. With that information being discovered teams should focus on acquiring efficient Quarterbacks. Quarterbacks who throw more touchdowns, have fewer turnovers, do not get sacked often, and complete a higher percentage of their passes simply win more games. There is a false narrative that the quarterbacks who throw for the most yards and touchdowns are the better ones, but just because they have higher stats does not mean that they are efficient. Negatives seem to outweigh the positives, like we saw with Jameis Winston, who threw for the most yards and threw for a lot of touchdowns, and yet his win rate was very bad. This is because he had a poor completion percentage and also had more turnovers than touchdowns. Efficiency reigned supreme for Quarterbacks who won the most games in the 2019 season, and teams should draft and acquire efficient quarterbacks if they want to win more games.

Conclusion

And the data science process is now complete. First, I scraped data from a table I found on www.pro-football-reference.com that had all of the necessary data I needed. Then I tidied this data and got rid of the data I did not need. I merged two dataframes so that I could analyze passing and rushing data for Quarterbacks in 2019. After all of this the data was ready to be analyzed, so I used matplotlib to plot statistics we hypothesized to affect win rate and then analyzed what I saw. With a line of regression I was able to see the average affect that these statistics had on the win ratio that a Quarterback had that season. Once I saw what statistics had larger affects on win rate I used statsmodels to perform a regression model on our data. Here, I was able to get exact coefficients that I could then use as weights in our formula. I was also able to confirm our hypotheses for which statistics affected win rate. Once this was completed I could then make my formula for predicting Quarterback win rate and analyze how accurate it was. I found my rating to have around a 10% margin of error for the 2019 season, and a 18% win rate for the 2020 season. I considered this to be a success considering this is only a few games off in each season. After all of this I can conclude that Quarterback win rate can be predicted with relative accuracy using Quarterback statistics, but there are many more factors that play into if a Quarterback will win a game or not. Yes, the Quarterback definitely plays a huge role in whether their team is successful or not, but they are not the only players on the team. Some teams defenses and players that surround a Quarterback are better or worse than others and they also play a role in whether the team wins or not. So, in conclusion I have found statistics that affect the win rate of a quarterback the most, and have created a formula to predict their win rate based on these statistics, but I understand that more data would need to be considered to predict the exact win rate of a Quarterback.

In [ ]: