Sunday, January 3, 2016

First Foray Into Baseball Analytics: Pitch Sequencing

I've been enjoying sports analytics for a long while now, as they're the intersection of two of my favorite things: sports on one hand, and data, statistics, and programming on the other. I started with baseball analytics, mostly thanks to the fine folks at Fangraphs, and I also quite enjoy basketball analytics, especially the work of Kirk Goldsberry, first at Grantland (RIP) and now at FiveThirtyEight. Ultimate frisbee, my favorite sport, still lacks the level of data collection required for insightful analytics; this FiveThirtyEight piece details that.

The first baseball question I chose to tackle has to do with pitch sequencing - the order in which the pitcher and catcher choose to employ the pitcher's arsenal in order to get batters out. It comes with some age-old adages (e.g. using the fastball to set up the change-up), and it seemed like it should have a significant effect on outcomes. Eventually, I would like to be able to answer questions such as:
  • Which pitchers gain the most due to their sequencing, over what would be expected by the strength of their pitches alone?
  • Can the effect of the catcher be separated from the role of the pitcher? 
  • Are certain batters more or less susceptible to certain sequences, beyond their strength or weakness against specific pitches? 
  • Are there any universally strong pitch sequences? Maybe for specific types of pitchers?
Before I tackle these harder questions, I realized I must start with a simpler one: Is there even a statistically significant effect of pitch sequencing? Can it be meaningfully distinguished from random selection from the independent pitches?

To do that, I started with the PitchFX database, which contains every pitch thrown in the past few MLB seasons, as well as its classification into one of around ten pitch types. Looking at the entire 2015 MLB season, I first calculated the individual probability of each pitch type being thrown:

Pitch ID   Pitch Description       Probability
FF         4-Seam Fastball         0.3629
SL         Slider                  0.1463
FT         2-Seam Fastball         0.1365
CH         Change-up               0.1040
CU         Curveball               0.0783
SI         Sinker                  0.0723
FC         Cut Fastball            0.0530
KC         Knuckle Curve           0.0211
FS         Split-finger Fastball   0.0154
KN         Knuckleball             0.0052
FA         Fastball                0.0036
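
As a rough illustration of this first step, here is a minimal sketch of how the individual pitch probabilities could be computed from a flat list of pitch-type codes. It is not the code behind the table above: the function name `pitch_type_probabilities` and the toy `sample_pitches` list are made up for the example, and the real data would come from the database rather than a Python list.

```python
from collections import Counter


def pitch_type_probabilities(pitches):
    """Map each pitch-type code to its relative frequency, most common first."""
    counts = Counter(pitches)
    total = sum(counts.values())
    return {pitch: n / total for pitch, n in counts.most_common()}


# Toy data for illustration only, not real PitchFX rows.
sample_pitches = ["FF", "FF", "SL", "CH", "FF", "SL", "FF", "CU"]
sample_probs = pitch_type_probabilities(sample_pitches)

for pitch, prob in sample_probs.items():
    print(f"{pitch} {prob:.4f}")
```

The returned probabilities sum to one, and the ordering mirrors the table: most frequent pitch type first.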

I then created a new table in the database to store pitch transitions. I defined a transition as two consecutive pitches within the same at-bat; a four-pitch at-bat would generate three transitions. Of course, this loses some information - one-pitch at-bats are not examined, and the first and last pitch of each at-bat appear in the table only once, while the middle pitches appear twice. Still, I believe it's a fitting system. I then calculated the probability of each transition, building the Markov transition table (each row is the first pitch, each column the pitch that followed it):

***** FF SL FT CH CU SI FC KC FS KN FA
FF 0.5105 0.1433 0.0755 0.0951 0.0789 0.0177 0.0389 0.0203 0.0133 0.0014 0.0036
SL 0.3374 0.3397 0.1152 0.0645 0.0405 0.0658 0.0204 0.0039 0.0094 0.0000 0.0027
FT 0.1996 0.1331 0.4162 0.1142 0.0722 0.0001 0.0336 0.0148 0.0115 0.0000 0.0044
CH 0.3233 0.0873 0.1379 0.2536 0.0641 0.0681 0.0436 0.0166 0.0010 0.0002 0.0036
CU 0.3459 0.0842 0.1190 0.1124 0.2159 0.0506 0.0546 0.0000 0.0108 0.0019 0.0037
SI 0.0900 0.1419 0.0001 0.1087 0.0574 0.5170 0.0425 0.0239 0.0153 0.0000 0.0018
FC 0.2528 0.0613 0.0805 0.0876 0.0843 0.0544 0.3253 0.0293 0.0216 0.0000 0.0017
KC 0.3366 0.0372 0.0919 0.0994 0.0000 0.0776 0.0729 0.2809 0.0028 0.0000 0.0004
FS 0.2873 0.0711 0.0854 0.0096 0.0376 0.0602 0.0657 0.0027 0.3776 0.0000 0.0001
KN 0.1041 0.0011 0.0000 0.0037 0.0369 0.0000 0.0004 0.0000 0.0000 0.8328 0.0000
FA 0.3284 0.1155 0.1458 0.1001 0.0750 0.0245 0.0383 0.0027 0.0005 0.0000 0.1692

One expected feature of this table is the high values on the diagonal: a pitcher who can throw a specific pitch can certainly repeat it, while he probably only throws two to four other pitches.
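
The transition bookkeeping described above can be sketched in a few lines. This is an illustrative sketch only, assuming each at-bat is available as an ordered list of pitch-type codes; the helper names `transitions` and `transition_matrix` and the toy `toy_at_bats` data are hypothetical, and the original work stored transitions in a database table instead.

```python
from collections import Counter, defaultdict


def transitions(at_bat):
    """Consecutive pitch pairs within one at-bat: n pitches give n - 1 pairs."""
    return list(zip(at_bat, at_bat[1:]))


def transition_matrix(at_bats):
    """Row-normalised P(next pitch | previous pitch), as nested dicts."""
    counts = defaultdict(Counter)
    for at_bat in at_bats:
        for previous, following in transitions(at_bat):
            counts[previous][following] += 1
    matrix = {}
    for previous, row in counts.items():
        row_total = sum(row.values())
        matrix[previous] = {p: n / row_total for p, n in row.items()}
    return matrix


# Toy at-bats for illustration only; the one-pitch at-bat contributes nothing.
toy_at_bats = [["FF", "SL", "FF", "CH"], ["FF", "FF"], ["SL"]]
toy_matrix = transition_matrix(toy_at_bats)
four_pitch_pairs = transitions(["FF", "SL", "FF", "CH"])
```

Each row of the result sums to one, which matches how the rows of the table above are normalised.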

Next, I compared the observed dependent transitions with hypothetical independent pitch transitions. Under the independent model, the probability of each pair of pitches is the product of the two individual pitch probabilities. Under the dependent model, I multiplied the probability of a transition starting with a certain pitch by the probability of the second pitch given the first one. This gave me two tables of the same size, each holding the joint probabilities of every pitch transition according to its model. To compare the two, I used Shannon's entropy. Entropy is maximized when probabilities are uniform, and I expected the dependent probabilities to be less uniform (because some sequences are preferred over others), so I expected them to show lower entropy. These were the results:

Independent H(X,Y) = 5.49738472712
Dependent H(X,Y) = 5.14084599364
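
To make the comparison concrete, here is a minimal sketch of the two joint-entropy calculations. It assumes base-2 logarithms (entropy in bits), which the post does not state, and the function names and toy inputs are invented for the example rather than taken from the original code. Note that the dependent joint probability, P(first) times P(second given first), is simply the observed relative frequency of each pair.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def independent_joint_entropy(pitch_probs):
    """Entropy of the table whose cells are P(x) * P(y) for every pitch pair."""
    values = list(pitch_probs.values())
    return entropy(px * py for px in values for py in values)


def dependent_joint_entropy(pairs):
    """Entropy of the observed pair frequencies, i.e. P(x) * P(y | x)."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return entropy(n / total for n in counts.values())


# Toy example: two equally likely pitches, but the pitcher always repeats himself.
toy_probs = {"FF": 0.5, "SL": 0.5}
toy_pairs = [("FF", "FF"), ("SL", "SL")] * 2

h_independent = independent_joint_entropy(toy_probs)
h_dependent = dependent_joint_entropy(toy_pairs)
```

In this toy case the independent table has four equally likely cells while the observed pairs only ever fill two of them, so the dependent entropy comes out lower, which is the direction of the effect reported above.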

To test the significance, I performed a permutation test. I sampled individual pitches from the independent probabilities to create the same number of transitions as observed, calculated the entropy of each such permutation, and calculated the Z-score of the observed dependent entropy compared to the distribution of the permutation entropies. The results were striking: a Z-score of -102.57, which is off-the-charts low.
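
A sampling test of this kind might look roughly like the sketch below. It rests on assumptions the post does not spell out: both pitches of every simulated transition are drawn independently from the individual pitch probabilities, the number of simulation rounds is arbitrary, and all names (`simulated_entropies`, `z_score`, the toy inputs) are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def entropy_of_counts(counts):
    """Shannon entropy in bits of the empirical distribution behind a Counter."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def simulated_entropies(pitch_probs, n_transitions, n_rounds, rng):
    """Joint entropies of transitions whose two pitches are drawn independently."""
    pitches = list(pitch_probs)
    weights = list(pitch_probs.values())
    results = []
    for _ in range(n_rounds):
        firsts = rng.choices(pitches, weights=weights, k=n_transitions)
        seconds = rng.choices(pitches, weights=weights, k=n_transitions)
        results.append(entropy_of_counts(Counter(zip(firsts, seconds))))
    return results


def z_score(observed, simulated):
    """Distance of the observed value from the simulated mean, in std. devs."""
    return (observed - statistics.mean(simulated)) / statistics.stdev(simulated)


# Toy inputs: two equally likely pitches and an observed joint entropy of 1 bit.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
sims = simulated_entropies({"FF": 0.5, "SL": 0.5}, n_transitions=200,
                           n_rounds=200, rng=rng)
toy_z = z_score(1.0, sims)
```

With a very large number of transitions the simulated entropies cluster tightly around their mean, so even a modest gap between observed and simulated entropy turns into a very large Z-score.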

This is where I got suspicious. These results were too good to be true. I suspected they were due either to the extremely large sample size (the entire season gave me over 500,000 transitions) or to the limits of each pitcher's arsenal. Because each individual pitcher only throws a few pitch types, he cannot make all of the transitions listed in the table, which makes the league-wide joint distribution more sharply skewed than sequencing alone would.

I decided to test individual players, to see if the results replicate. While this does not separate the two hypotheses, as the sample sizes are indeed much smaller, it felt like a good start. Rather than sampling randomly, I picked three groups of ten pitchers each: the ten pitchers with the most transitions last season, ten around the average, and ten around one standard deviation over the average. I then repeated the same individual and joint probability and entropy calculations, and the permutation tests:

Edit #1: it seems I was a little unclear. The permutation tests for each individual player were sampling from their individual probabilities - counting only the pitches they threw, to establish the frequency of each. That individual distribution is also the basis for the independent entropy (third column).

Group Name Indep. Entropy Cond. Entropy Z-Score P-Value
Pitch transition count closest to mean (~688) 
(until Josh Tomlin)
Francisco Rodriguez 3.7134 3.5884 -2.1538 0.0156
Yimi Garcia 2.6504 2.4703 -1.4416 0.0747
Sergio Romo 3.0703 2.9099 -1.4861 0.0686
Derek Holland 3.3051 3.3169 0.4957 0.6900
Hansel Robles 1.8342 1.8017 -0.2640 0.3959
Carlos Villanueva 4.2170 4.1108 -0.8810 0.1892
John Lamb 3.3414 3.2973 -0.4065 0.3422
Joshua Fields 2.3622 2.3360 -0.1912 0.4242
Santiago Casilla 4.0146 4.0413 0.8235 0.7949
Josh Tomlin 3.6684 3.5650 -0.9883 0.1615
Pitch transition count closest to mean + std. dev (~1380)
(until Tim Hudson)
Roenis Elias 4.2148 4.1512 -1.7099 0.0436
Jake Peavy 4.4581 4.4121 -0.4792 0.3159
Hisashi Iwakuma 4.4289 4.4015 -0.1079 0.4570
Chad Bettis 4.1127 4.0334 -0.9856 0.1622
Jaime Garcia 4.5430 4.4571 -1.3102 0.0951
Williams Perez 2.5985 2.5346 -0.6638 0.2534
Kevin Gausman 2.9166 2.8262 -1.0872 0.1385
Michael Lorenzen 3.9086 3.7307 -1.9956 0.0230
Kendall Graveman 3.8000 3.8172 0.6236 0.7336
Tim Hudson 3.9299 3.9379 0.4617 0.6778
Max pitch transition count (2563-2831)
Collin McHugh 4.3544 4.2274 -3.2359 0.0006
Jose Quintana 3.3734 3.3498 -0.2440 0.4036
Christopher Archer 3.3490 3.2755 -1.9169 0.0276
Cole Hamels 4.4435 4.4177 -0.9683 0.1664
Clayton Kershaw 2.9702 3.0248 2.0182 0.9782
Dallas Keuchel 4.3491 4.3080 -1.3321 0.0914
Johnny Cueto 4.5603 4.5172 -1.3415 0.0899
David Price 4.4113 4.3996 -0.2504 0.4011
Edinson Volquez 3.7520 3.7481 -0.0093 0.4963
Jake Arrieta 4.1939 4.2031 0.5471 0.7079

As is probably evident from the variability of the Z-scores, the results did not quite replicate. With an average Z-score of -0.683, the p-value was 0.25, nowhere near significant (using a t-test with 29 degrees of freedom). To verify this result, I repeated the test for all 297 MLB pitchers at or above the average pitch transition count. The pitch transition counts skew right, as many pitchers threw far more than the average, most of them probably starters and high-usage relievers. The results were similar - a mean Z-score of -0.77, which with the larger sample (df = 296) gave a p-value of 0.22.

While these results are somewhat discouraging, I intend to investigate further, to see if I can find a way to sufficiently distinguish between the independent and dependent transitions. First, I will check my math and code, and verify that my methods were appropriate. I will then try to process another season or two, and see how that changes the results. In order to increase sample sizes, I'm also considering clustering pitchers (perhaps by their individual pitch distributions) and testing them together, to see if such an examination by pitcher archetype makes sense. There are other confounding variables which might be harder to remove, such as the variation in location (more than merely pitch types), and the preferences of the batters.

Either way, this has been very interesting, and great fun, and I'm looking forward to continuing it!

Notes:

  • This project would have been much more difficult and clunky without the help of Professor Michelle Greene, who teaches at Minerva, and another individual (whose consent I'm awaiting to put his name here). Thank you!
  • This question has been on my mind for a while, and I found the time for it when I could also use it as my final project for my first semester at the Minerva Schools, where I'm studying.
  • I will upload my code (Python and SQL) to GitHub shortly.
  • I'm aware how clunky the tables look; I need to find a better solution to post them as HTML, or perhaps write the whole post in some other editor and export the HTML here.
  • If you've made it this far - I would love any comments or feedback!

2 comments:

  1. Do you have data on the count at which the pitch was thrown in? You can also look at the higher order sequences. If you are right, it is quite cool that the pitchers are this unpredictable!

    1. Indeed - I have the full data on the at-bat, so looking at higher-order sequences is possible. My concern is sample size. For individual pitchers, the mean number of two-pitch transitions observed was around 700 in a single season. Three pitch transitions will probably make that number drop even further.

      I guess I should start by figuring out how to calculate a meaningful sample size for a significant effect, and once I do that, see how many seasons' worth of data I need to analyze higher order transitions.
