
Current Version: 2.2 (Blyleven)

- 2.2 (2021-09-15): Add QS projections for pitchers, using same settings as for W
- 2.1 (2021-08-19): Update minor league factors. (Previous factors were too high for HR, especially.)
- 2.0 (2021-08-18): Initial release of Blyleven with separate weighting and regression for each stat!

The focus of the second iteration of the Open Projections—called **Blyleven**—is separate weightings and regressions for each component stat in the projection.

We know that some stats (strikeouts, for example) can be projected from a relatively small, recent sample. Other stats (like hits) require a bigger sample and more regression.

This is not a new idea. I'm guessing that this is typical of most closed projection systems. For example:

- Sean Smith (CHONE projections) was thinking about this back in 2009.
- Jared Cross mentioned how they "have different weights for each component and regress some components more heavily than others" in the 2009 Steamer projections.
- Clay Davenport wrote about incorporating this idea in his projections back in 2012.

So, not new, but I'd still guess it's the next best upgrade for the Open Projections.

Version 1 (Aparicio)

- 1.2 (2021-08-19): Update minor league factors. (Previous factors were too high for HR, especially.)
- 1.1 (2021-08-07): Project RP at 250 BF and SP at 800 BF. (Previously both projected to 650 BF.)
- 1.0 (2021-03-31): Aparicio initial release!

Version 1 of the Open Projections, named **Aparicio**, starts with daily game logs for the past 2000 days (roughly five and a half years). By working with daily stats, we are able to build projections for any point in time. A player's projection will gradually shift every day as the new day's stats are collected and old stats fall out of the projection.

Aparicio includes game data for every level of professional baseball, from Rookie leagues to MLB. It even includes spring training and postseason. We're going to throw everything in the mix and try to filter out the most valuable data.

Stats for each game are weighted by recency, so that more recent games are more influential in the projection. Batting stats are weighted at 0.9994^daysAgo. This means that yesterday's games receive almost full weighting (99.94%), while a game from three years ago counts for about half as much (0.9994^1095 = 51.8%).

Pitching stats are weighted at 0.999^daysAgo, so they decay faster than batting stats. The weights for both batting and pitching are taken from Tom Tango and are comparable to the yearly weights used by Marcel. (See the discussion of New PECOTA.)
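
As a minimal sketch of the recency weighting described above: the two decay bases come from the text, while the function and constant names are illustrative and not taken from the project's source.

```python
# Decay bases quoted in the text; all names here are illustrative.
BATTING_DECAY = 0.9994
PITCHING_DECAY = 0.999


def recency_weight(days_ago: int, decay: float) -> float:
    """Weight for a game played `days_ago` days before the projection date."""
    return decay ** days_ago


# Compare how quickly batting and pitching weights fall off.
for days in (1, 365, 1095, 2000):
    bat = recency_weight(days, BATTING_DECAY)
    pit = recency_weight(days, PITCHING_DECAY)
    print(f"{days:>4} days ago: batting {bat:.3f}, pitching {pit:.3f}")
```

Because the pitching base is smaller, a pitcher's games from three years ago carry noticeably less weight than a batter's games from the same date.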

Spring Training and exhibition games receive a weighting of 45%, a number which comes from Neil Paine's study of spring training stats. (See Neil's When Spring Training Matters and the follow-up discussion by Tom Tango.) Note that this treats the most recent Spring Training as roughly equivalent to MLB performance from three years ago, which receives a recency weight of about 52% for batters.

No adjustment is made to postseason stats, which are counted the same as the regular season.
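
One way to picture the game-type adjustment is as a multiplier applied on top of the recency weight. The sketch below assumes the two weights are simply multiplied together, which the text does not spell out; the dictionary keys and function name are hypothetical.

```python
# Game-type multipliers from the text. Combining them with the recency
# weight by multiplication is an assumption made for this sketch.
GAME_TYPE_WEIGHT = {
    "regular": 1.0,
    "postseason": 1.0,   # counted the same as the regular season
    "spring": 0.45,
    "exhibition": 0.45,
}


def game_weight(days_ago: int, game_type: str, decay: float = 0.9994) -> float:
    """Combined weight for one game: game-type multiplier times recency decay."""
    return GAME_TYPE_WEIGHT[game_type] * decay ** days_ago


print(game_weight(30, "spring"))       # a spring training game from a month ago
print(game_weight(30, "postseason"))   # a postseason game from the same day
```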

For minor league stats, each component receives a separate weight for each level of minor league ball. For this, I did a rough reverse engineering of Clay Davenport's Davenport Translations for 2019. For stats not available in DTs, I approximated with a similar stat (e.g. SV for HLD, BB for HBP). With Version 1.2, I re-ran the Davenport Translation numbers from 2021 and updated the minor league factors.

A more advanced approach would adjust for various minor league park factors and year-by-year changes in league quality. But this is enough to get us started.

Level | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAA | 1.00 | 0.79 | 0.95 | 0.80 | 0.84 | 0.66 | 0.79 | 0.72 | 1.01 | 0.78 | 0.90 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 |
AA/Fall | 1.02 | 0.82 | 0.99 | 0.85 | 0.95 | 0.67 | 0.83 | 0.70 | 0.91 | 0.79 | 0.90 | 0.79 | 0.79 | 0.79 | 0.79 | 0.79 |
High-A | 1.03 | 0.78 | 0.95 | 0.79 | 0.81 | 0.61 | 0.75 | 0.51 | 0.93 | 0.72 | 0.93 | 0.72 | 0.72 | 0.72 | 0.72 | 0.72 |
Low-A | 1.05 | 0.69 | 0.92 | 0.76 | 0.69 | 0.65 | 0.68 | 0.46 | 0.84 | 0.64 | 0.89 | 0.64 | 0.64 | 0.64 | 0.64 | 0.64 |
Rookie | 1.08 | 0.51 | 0.77 | 0.63 | 0.26 | 0.39 | 0.48 | 0.27 | 0.48 | 0.54 | 1.06 | 0.54 | 0.54 | 0.54 | 0.54 | 0.54 |

Level | W | L | G | GS | CG | SHO | SV | HLD | BFP | IP | H | ER | R | HR | SO | BB | IBB | HBP | WP | BK |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAA | 0.92 | 1.14 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.96 | 1.00 | 1.00 | 1.13 | 0.92 | 0.92 | 0.88 | 0.56 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 |
AA/Fall | 0.92 | 1.24 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 | 1.25 | 1.05 | 1.05 | 1.02 | 0.50 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
High-A | 0.92 | 1.31 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.96 | 0.98 | 0.98 | 1.28 | 1.03 | 1.03 | 1.28 | 0.46 | 0.92 | 0.92 | 0.92 | 0.92 | 0.92 |
Low-A | 0.86 | 1.27 | 1.00 | 1.00 | 1.00 | 1.00 | 1.04 | 1.04 | 1.00 | 1.00 | 1.30 | 1.04 | 1.04 | 1.56 | 0.45 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 |
Rookie | 0.68 | 1.10 | 1.00 | 1.00 | 1.00 | 1.00 | 1.04 | 1.04 | 0.96 | 0.96 | 1.30 | 1.08 | 1.08 | 2.01 | 0.38 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
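
To illustrate how the level factors in the tables above might be applied, here is a small sketch that scales a minor league batting line by its level's factors. It assumes each raw stat is simply multiplied by the matching factor; only a few columns from the batting table are copied in, and `BATTING_FACTORS` and `translate_line` are hypothetical names.

```python
# A subset of the batting factors from the table above (AAA and Rookie rows).
BATTING_FACTORS = {
    "AAA": {"AB": 1.00, "H": 0.95, "HR": 0.66, "BB": 0.78, "SO": 0.90},
    "Rookie": {"AB": 1.08, "H": 0.77, "HR": 0.39, "BB": 0.54, "SO": 1.06},
}


def translate_line(line: dict, level: str) -> dict:
    """Scale each stat in `line` by the factor for `level` (assumed multiplicative)."""
    factors = BATTING_FACTORS[level]
    return {stat: value * factors[stat] for stat, value in line.items()}


# The same raw line counts for much less when it comes from Rookie ball.
raw = {"AB": 100, "H": 30, "HR": 5, "BB": 10, "SO": 20}
print(translate_line(raw, "AAA"))
print(translate_line(raw, "Rookie"))
```

Factors above 1.00 (AB and SO at the Rookie level, for example) make the translated line worse for the hitter, since outs and strikeouts are expected to rise against MLB competition.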

After each player's stats have been weighted, we roll them up into a single statline. Now it's time to regress.

Our regression component will amount to 15% of the PA (or, for pitchers, BF) for the player with the most PA/BF in the sample. For comparison, Marcel uses 1200 PA for its regression. A fixed number of PA, however, doesn't deal well with longer and shorter season lengths (e.g. the shortened season of 2020), so we instead use a percentage of PA. With Marcel, a top-of-the-lineup hitter might get 700 PA in each of the three seasons Marcel considers. With Marcel's 5/4/3 season weights, that player would have 8400 weighted PA (700 × 12), and 1200 PA of regression would be about 14% of that total. That's what we are trying to match with our 15% regression.
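
The arithmetic behind the 15% figure can be checked in a few lines. The Marcel numbers (5/4/3 season weights, 1200 PA of regression) are the standard published ones; `regression_pa` and the sample player totals are hypothetical and only show how a percentage-based amount would scale with the sample.

```python
# Marcel comparison: a 700 PA hitter in each of three seasons, weighted 5/4/3.
MARCEL_WEIGHTS = (5, 4, 3)
pa_per_season = 700
weighted_pa = sum(w * pa_per_season for w in MARCEL_WEIGHTS)  # 700 * 12 = 8400
marcel_share = 1200 / weighted_pa
print(f"Marcel regression share: {marcel_share:.1%}")


def regression_pa(weighted_pa_by_player: dict, share: float = 0.15) -> float:
    """Regression amount: `share` of the largest weighted PA (or BF) in the sample."""
    return share * max(weighted_pa_by_player.values())


# Made-up weighted PA totals, purely for illustration.
sample = {"player_a": 2000.0, "player_b": 500.0}
print(regression_pa(sample))
```

Tying the regression amount to the sample leader means it shrinks automatically when the window contains a shortened season.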

The regression will be based on the average stats from our universe of players. Before we find the average, though, we discard any players who accumulated less than 10% of the largest number of PA/BF in our set. This clears out pitchers who are hitting, hitters who are pitching, etc. (Question: Should we try to find the average from only MLB stats (i.e. true league average), rather than the average from our universe of players at every level?)

The statline for this average player is added to each player's projected stats.
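
The filtering and regression steps above can be sketched as follows. This is a rough illustration, not the project's code: it assumes the "average player" is a pooled per-PA rate over the qualified players, scaled up to the regression amount, and every name and number in it is hypothetical.

```python
def average_rates(players: dict, min_share: float = 0.10) -> dict:
    """Pooled per-PA rates over players with at least `min_share` of the max PA."""
    max_pa = max(p["PA"] for p in players.values())
    qualified = [p for p in players.values() if p["PA"] >= min_share * max_pa]
    total_pa = sum(p["PA"] for p in qualified)
    stats = qualified[0].keys()
    return {s: sum(p[s] for p in qualified) / total_pa for s in stats}


def regress(line: dict, rates: dict, regression_pa: float) -> dict:
    """Add `regression_pa` plate appearances of average performance to a line."""
    return {s: line[s] + rates[s] * regression_pa for s in line}


# Made-up weighted totals; player C falls below the 10% cutoff.
players = {
    "A": {"PA": 1000.0, "HR": 50.0},
    "B": {"PA": 500.0, "HR": 10.0},
    "C": {"PA": 20.0, "HR": 0.0},
}
rates = average_rates(players)
reg_pa = 0.15 * max(p["PA"] for p in players.values())
regressed = {name: regress(line, rates, reg_pa) for name, line in players.items()}
print(regressed)
```

Note that player C is excluded when computing the average but still receives the regression, which is why low-sample players end up close to the average line.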

For now, playing time projections are kept simple. Every batter's stats are adjusted to exactly 650 PA. Version 1.0 also adjusted every pitcher to 650 BF, but this created obviously unrealistic projections for RP. Version 1.1 tweaks pitcher playing time by setting RP at 250 BF and SP at 800 BF. Pitchers in a mixed role fall in between those two numbers. For example, a pitcher who starts half of his games would be projected halfway between 250 and 800 at 525 BF.
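
The playing-time rule can be written as a linear interpolation between the RP and SP targets. The sketch below assumes the mix is measured by the share of games started (GS/G), which matches the halfway example above; the function names are illustrative.

```python
RP_BF = 250  # projected batters faced for a pure reliever
SP_BF = 800  # projected batters faced for a pure starter


def projected_bf(games: float, games_started: float) -> float:
    """Interpolate between RP_BF and SP_BF by the share of games started."""
    start_share = games_started / games if games else 0.0
    return RP_BF + (SP_BF - RP_BF) * start_share


def scale_to_playing_time(line: dict, key: str, target: float) -> dict:
    """Rescale every stat so that line[key] (PA or BF) equals `target`."""
    factor = target / line[key]
    return {stat: value * factor for stat, value in line.items()}


print(projected_bf(games=40, games_started=20))  # starts half of his games
print(scale_to_playing_time({"PA": 325.0, "HR": 10.0}, "PA", 650))
```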

We're finished! Our Aparicio projections are complete. Hopefully you can see the many ways that this can still be improved, but it gives us a good starting point.