The first baseball question I chose to tackle has to do with pitch sequencing - the order in which the pitcher and catcher choose to employ the pitcher's arsenal in order to get batters out. It is a topic with some age-old adages (e.g. using the fastball to set up the change-up), and it seemed like it should have a significant effect on outcomes. Eventually, I would like to be able to answer questions such as:
- Which pitchers gain the most due to their sequencing, over what would be expected by the strength of their pitches alone?
- Can the effect of the catcher be separated from the role of the pitcher?
- Are certain batters more or less susceptible to certain sequences, beyond their strength or weakness against specific pitches?
- Are there any universally strong pitch sequences? Maybe for specific types of pitchers?
Before I tackle these harder questions, I realized I must start with a simpler one: Is there even a statistically significant effect of pitch sequencing? Can it be meaningfully distinguished from random selection from the independent pitches?
To do that, I started with the PitchFX database, which contains every pitch thrown in the past few MLB seasons, along with its classification into one of around ten pitch types. Looking at the entire 2015 MLB season, I first calculated the individual probability of each pitch type being thrown:
Pitch ID | Pitch Description | Probability |
FF | 4-Seam Fastball | 0.3629 |
SL | Slider | 0.1463 |
FT | 2-Seam Fastball | 0.1365 |
CH | Change-up | 0.1040 |
CU | Curveball | 0.0783 |
SI | Sinker | 0.0723 |
FC | Cut Fastball | 0.0530 |
KC | Knuckle Curve | 0.0211 |
FS | Split-finger Fastball | 0.0154 |
KN | Knuckleball | 0.0052 |
FA | Fastball | 0.0036 |
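The frequency calculation behind a table like the one above can be sketched in a few lines of Python. This is only an illustration on toy data: the function name `pitch_type_probabilities` and the `sample` list are made up for the example, and the real analysis reads the pitch types from the database instead.

```python
from collections import Counter

def pitch_type_probabilities(pitch_types):
    """Return {pitch_type: relative frequency}, most common type first."""
    counts = Counter(pitch_types)
    total = sum(counts.values())
    return {ptype: n / total for ptype, n in counts.most_common()}

# Toy data standing in for a season of classified pitches (hypothetical).
sample = ["FF", "FF", "SL", "CH", "FF", "SL", "FF", "CU"]
probs = pitch_type_probabilities(sample)
for ptype, p in probs.items():
    print(f"{ptype} | {p:.4f} |")
```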
I then created a new table in the database, to store pitch transitions. I defined a transition as two consecutive pitches within the same at-bat; a four-pitch at-bat would generate three transitions. Of course, this loses some information - one-pitch at-bats are not examined, and the first and last pitch of each at-bat appear in the table only once, while the middle pitches appear twice. Still, I believe it's a fitting system. I then calculated the probability of each transition, building the Markov transition table (each row is the first pitch, each column the pitch that follows it):
***** | FF | SL | FT | CH | CU | SI | FC | KC | FS | KN | FA |
FF | 0.5105 | 0.1433 | 0.0755 | 0.0951 | 0.0789 | 0.0177 | 0.0389 | 0.0203 | 0.0133 | 0.0014 | 0.0036 |
SL | 0.3374 | 0.3397 | 0.1152 | 0.0645 | 0.0405 | 0.0658 | 0.0204 | 0.0039 | 0.0094 | 0.0000 | 0.0027 |
FT | 0.1996 | 0.1331 | 0.4162 | 0.1142 | 0.0722 | 0.0001 | 0.0336 | 0.0148 | 0.0115 | 0.0000 | 0.0044 |
CH | 0.3233 | 0.0873 | 0.1379 | 0.2536 | 0.0641 | 0.0681 | 0.0436 | 0.0166 | 0.0010 | 0.0002 | 0.0036 |
CU | 0.3459 | 0.0842 | 0.1190 | 0.1124 | 0.2159 | 0.0506 | 0.0546 | 0.0000 | 0.0108 | 0.0019 | 0.0037 |
SI | 0.0900 | 0.1419 | 0.0001 | 0.1087 | 0.0574 | 0.5170 | 0.0425 | 0.0239 | 0.0153 | 0.0000 | 0.0018 |
FC | 0.2528 | 0.0613 | 0.0805 | 0.0876 | 0.0843 | 0.0544 | 0.3253 | 0.0293 | 0.0216 | 0.0000 | 0.0017 |
KC | 0.3366 | 0.0372 | 0.0919 | 0.0994 | 0.0000 | 0.0776 | 0.0729 | 0.2809 | 0.0028 | 0.0000 | 0.0004 |
FS | 0.2873 | 0.0711 | 0.0854 | 0.0096 | 0.0376 | 0.0602 | 0.0657 | 0.0027 | 0.3776 | 0.0000 | 0.0001 |
KN | 0.1041 | 0.0011 | 0.0000 | 0.0037 | 0.0369 | 0.0000 | 0.0004 | 0.0000 | 0.0000 | 0.8328 | 0.0000 |
FA | 0.3284 | 0.1155 | 0.1458 | 0.1001 | 0.0750 | 0.0245 | 0.0383 | 0.0027 | 0.0005 | 0.0000 | 0.1692 |
One expected feature of this table is the high values on the diagonal: a pitcher who can throw a specific pitch can certainly repeat it, while he probably throws only two to four other pitches.
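As a rough sketch of the transition counting described above, assuming each at-bat is available as an ordered list of pitch type codes (the names `count_transitions`, `transition_matrix` and the toy `at_bats` are hypothetical, not the actual code or schema):

```python
from collections import Counter, defaultdict

def count_transitions(at_bats):
    """Count (previous pitch, next pitch) pairs within each at-bat."""
    counts = Counter()
    for pitches in at_bats:
        # zip pairs each pitch with its successor; one-pitch at-bats yield nothing.
        counts.update(zip(pitches, pitches[1:]))
    return counts

def transition_matrix(counts):
    """Row-normalise the counts into P(next pitch | previous pitch)."""
    row_totals = defaultdict(int)
    for (prev, _), n in counts.items():
        row_totals[prev] += n
    matrix = defaultdict(dict)
    for (prev, nxt), n in counts.items():
        matrix[prev][nxt] = n / row_totals[prev]
    return dict(matrix)

# Toy at-bats (hypothetical): four pitches, one pitch, two pitches.
at_bats = [["FF", "FF", "SL", "FF"], ["CH"], ["FF", "SL"]]
counts = count_transitions(at_bats)
matrix = transition_matrix(counts)
print(counts)
print(matrix)
```

The four-pitch at-bat contributes three transitions and the one-pitch at-bat contributes none, matching the definition in the text.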
Next, I compared the observed dependent transitions with hypothetical independent pitch transitions. The independent probability for each pair of pitches is the product of the two individual probabilities. For the dependent probability, I multiplied the probability of a transition starting with a certain pitch by the probability of the second pitch given the first one. Thus, I created two tables of the same size, each with the joint probabilities of each pitch transition according to its model. To compare the two, I used Shannon's entropy. Since entropy is maximized when probabilities are uniform, and I expected the dependent probabilities to be less uniform (because some sequences are preferred over others), I expected lower entropy from the dependent probabilities. These were the results:
Independent H(X,Y) = 5.49738472712
Dependent H(X,Y) = 5.14084599364
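To make the comparison concrete, here is a minimal sketch of the two joint entropies on a made-up two-pitch example. The base-2 logarithm is an assumption (it appears consistent with the magnitudes reported above), and the function names and toy probabilities are purely illustrative.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def independent_joint(marginal):
    """Joint table under independence: P(a, b) = P(a) * P(b)."""
    return {(a, b): pa * pb
            for a, pa in marginal.items()
            for b, pb in marginal.items()}

def dependent_joint(first_pitch_probs, transition):
    """Joint table under dependence: P(a, b) = P(a) * P(b | a)."""
    return {(a, b): first_pitch_probs[a] * p_ab
            for a, row in transition.items()
            for b, p_ab in row.items()}

# Toy two-pitch arsenal (hypothetical numbers, not taken from the tables above).
marginal = {"FF": 0.6, "SL": 0.4}
transition = {"FF": {"FF": 0.7, "SL": 0.3},
              "SL": {"FF": 0.45, "SL": 0.55}}

indep = independent_joint(marginal)
dep = dependent_joint(marginal, transition)
h_indep = entropy(indep.values())
h_dep = entropy(dep.values())
print(f"Independent H(X,Y) = {h_indep:.4f}")
print(f"Dependent H(X,Y) = {h_dep:.4f}")
```

In this toy case the dependent entropy comes out lower, because repeating the previous pitch is more likely than the individual frequencies alone would suggest.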
To test the significance, I performed a permutation test. I sampled individual pitches from the independent probabilities to create the same number of transitions as observed, calculated the entropy of each such permutation, and calculated the Z-score of the observed dependent entropy compared to the distribution of the permutation entropies. The results were striking: a Z-score of -102.57, which is off-the-charts low.
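The resampling step can be sketched roughly as follows. This is one possible reading of the procedure, not the actual code: it assumes both pitches of every simulated transition are drawn independently from the individual probabilities, and the names `joint_entropy`, `independence_z_score`, `n_sims` and `seed` are hypothetical.

```python
import math
import random
import statistics
from collections import Counter

def joint_entropy(pairs):
    """Entropy (bits) of the empirical distribution of (first, second) pairs."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def independence_z_score(observed_pairs, marginal, n_sims=200, seed=0):
    """Z-score of the observed joint entropy against independent resamples."""
    rng = random.Random(seed)
    types = list(marginal)
    weights = [marginal[t] for t in types]
    n = len(observed_pairs)
    simulated = []
    for _ in range(n_sims):
        # Draw both pitches of each simulated transition independently.
        firsts = rng.choices(types, weights=weights, k=n)
        seconds = rng.choices(types, weights=weights, k=n)
        simulated.append(joint_entropy(zip(firsts, seconds)))
    observed = joint_entropy(observed_pairs)
    return (observed - statistics.mean(simulated)) / statistics.stdev(simulated)

# Toy data (hypothetical): a pitcher who always repeats the previous pitch.
marginal = {"FF": 0.6, "SL": 0.4}
observed = [("FF", "FF")] * 300 + [("SL", "SL")] * 200
z = independence_z_score(observed, marginal)
print(f"z = {z:.1f}")
```

With such an extreme toy pitcher the observed entropy falls far below every simulated value, so the Z-score is strongly negative.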
This is where I got suspicious. These results were too good to be true. I suspected they were due either to the extremely large sample size (the entire season gave me over 500,000 transitions), or to the limits of each pitcher's arsenal. Because each individual pitcher only throws a few pitch types, he cannot make all of the transitions listed in the table, which makes the pooled joint distribution more sharply peaked than sequencing alone would.
I decided to test individual players, to see if the results replicate. This does not separate the two hypotheses, since looking at a single pitcher both shrinks the sample size and removes the arsenal effect, but it felt like a good start. Rather than sample randomly, I picked three groups of ten pitchers each: the ten pitchers with the most transitions last season, ten around the average, and ten around one standard deviation above the average. I then repeated the same individual and joint probability and entropy calculations, and the permutation tests:
Edit #1: it seems I was a little unclear. The permutation tests for each individual player sampled from that player's individual probabilities - counting only the pitches he threw, to establish the frequency of each. That individual distribution is also the basis for the independent entropy (the Indep. Entropy column).
Group | Name | Indep. Entropy | Cond. Entropy | Z-Score | P-Value |
Pitch transition count closest to mean (~688) | Francisco Rodriguez | 3.7134 | 3.5884 | -2.1538 | 0.0156 |
 | Yimi Garcia | 2.6504 | 2.4703 | -1.4416 | 0.0747 |
 | Sergio Romo | 3.0703 | 2.9099 | -1.4861 | 0.0686 |
 | Derek Holland | 3.3051 | 3.3169 | 0.4957 | 0.6900 |
 | Hansel Robles | 1.8342 | 1.8017 | -0.2640 | 0.3959 |
 | Carlos Villanueva | 4.2170 | 4.1108 | -0.8810 | 0.1892 |
 | John Lamb | 3.3414 | 3.2973 | -0.4065 | 0.3422 |
 | Joshua Fields | 2.3622 | 2.3360 | -0.1912 | 0.4242 |
 | Santiago Casilla | 4.0146 | 4.0413 | 0.8235 | 0.7949 |
 | Josh Tomlin | 3.6684 | 3.5650 | -0.9883 | 0.1615 |
Pitch transition count closest to mean + std. dev (~1380) | Roenis Elias | 4.2148 | 4.1512 | -1.7099 | 0.0436 |
 | Jake Peavy | 4.4581 | 4.4121 | -0.4792 | 0.3159 |
 | Hisashi Iwakuma | 4.4289 | 4.4015 | -0.1079 | 0.4570 |
 | Chad Bettis | 4.1127 | 4.0334 | -0.9856 | 0.1622 |
 | Jaime Garcia | 4.5430 | 4.4571 | -1.3102 | 0.0951 |
 | Williams Perez | 2.5985 | 2.5346 | -0.6638 | 0.2534 |
 | Kevin Gausman | 2.9166 | 2.8262 | -1.0872 | 0.1385 |
 | Michael Lorenzen | 3.9086 | 3.7307 | -1.9956 | 0.0230 |
 | Kendall Graveman | 3.8000 | 3.8172 | 0.6236 | 0.7336 |
 | Tim Hudson | 3.9299 | 3.9379 | 0.4617 | 0.6778 |
Max pitch transition count (2563-2831) | Collin McHugh | 4.3544 | 4.2274 | -3.2359 | 0.0006 |
 | Jose Quintana | 3.3734 | 3.3498 | -0.2440 | 0.4036 |
 | Christopher Archer | 3.3490 | 3.2755 | -1.9169 | 0.0276 |
 | Cole Hamels | 4.4435 | 4.4177 | -0.9683 | 0.1664 |
 | Clayton Kershaw | 2.9702 | 3.0248 | 2.0182 | 0.9782 |
 | Dallas Keuchel | 4.3491 | 4.3080 | -1.3321 | 0.0914 |
 | Johnny Cueto | 4.5603 | 4.5172 | -1.3415 | 0.0899 |
 | David Price | 4.4113 | 4.3996 | -0.2504 | 0.4011 |
 | Edinson Volquez | 3.7520 | 3.7481 | -0.0093 | 0.4963 |
 | Jake Arrieta | 4.1939 | 4.2031 | 0.5471 | 0.7079 |
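For readers who want to reproduce the last column: the P-Value figures appear consistent with a one-sided (lower-tail) p-value from the standard normal distribution. That is an inference from the numbers rather than something stated above, and the helper name `lower_tail_p` is made up for this sketch.

```python
import math

def lower_tail_p(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Two Z-scores copied from the table, one negative and one positive.
for name, z in [("Francisco Rodriguez", -2.1538), ("Clayton Kershaw", 2.0182)]:
    print(f"{name} | {z:.4f} | {lower_tail_p(z):.4f} |")
```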
As is probably evident from the variability of the Z-scores, the results did not quite replicate. With an average Z-score of -0.683, the p-value was 0.25, nowhere near significant (using a t-test with 29 degrees of freedom). To verify this result, I repeated the test for all 297 MLB pitchers at or above the average pitch transition count. The pitch transition counts skew right, as many pitchers threw far more than the average, most of them probably starters and high-usage relievers. The results were similar - a mean Z-score of -0.77, which with the larger sample (df = 296) gave a p-value of 0.22.
While these results are somewhat discouraging, I intend to investigate further, to see if I can find a way to sufficiently distinguish between the independent and dependent transitions. First, I will check my math and code, and verify that my methods were appropriate. I will then try to process another season or two, and see how that changes the results. In order to increase sample sizes, I'm also considering clustering pitchers (perhaps by their individual pitch distributions) and testing them together, to see if such an examination by pitcher archetype makes sense. There are other confounding variables which might be harder to remove, such as the variation in location (beyond pitch type alone), and the preferences of the batters.
Either way, this has been very interesting, and great fun, and I'm looking forward to continuing it!
Notes:
- This project would have been much more difficult and clunky without the help of Professor Michelle Greene, who teaches at Minerva, and another individual (whose consent I'm awaiting to put his name here). Thank you!
- This question has been on my mind for a while, and I found the time for it when I could also use it as my final project for my first semester at the Minerva Schools, where I'm studying.
- I will upload my code (Python and SQL) to GitHub shortly.
- I'm aware how clunky the tables look; I need to find a better solution to post them as HTML, or perhaps write the whole post in some other editor and export the HTML here.
- If you've made it this far - I would love any comments or feedback!