Automated Pitch Type Classification

Introduction

In this report, we investigate the viability of a system for real-time automatic pitch-type recognition, to be used on broadcasts and in-stadium displays. To do so, we develop a pitch-type classifier that utilizes pitch characteristics such as velocity and spin rate that are collected in real time via Statcast. As a proof of concept, we develop a model for a single pitcher, Kevin Gausman; however, the process can easily be replicated for pitchers league-wide. Results indicate that an automated system is likely viable, but we note potential improvements and the additional engineering work required to fully implement such a system.

Methods

To develop a pitch-type classifier, we utilize data from Statcast and train machine learning models using scikit-learn.

Data

Each sample contains Statcast data for a single pitch thrown by Kevin Gausman in either 2024 (train data) or 2025 (test data) during an MLB regular season game.

  • Train: 2024 MLB Season
  • Test: (First Half of) 2025 MLB Season
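Assuming each season has been exported as a Statcast CSV (the file would come from a Statcast Search export; a tiny in-memory stand-in is used here so the sketch is self-contained), loading a split into features and target might look like:

```python
import io

import pandas as pd

# Column names follow the data dictionary below.
FEATURES = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z", "stand"]
TARGET = "pitch_type"

def load_split(csv_source):
    """Read a Statcast CSV export and return (features, target) for modeling."""
    df = pd.read_csv(csv_source)
    return df[FEATURES], df[TARGET]

# Tiny in-memory stand-in for a real Statcast export.
sample = io.StringIO(
    "pitch_type,release_speed,release_spin_rate,pfx_x,pfx_z,stand\n"
    "FF,94.8,2350,-0.6,1.4,R\n"
    "FS,85.1,1450,-0.9,0.3,L\n"
)
X_train, y_train = load_split(sample)
print(X_train.shape, list(y_train))  # (2, 5) ['FF', 'FS']
```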

Full documentation for Statcast data can be found in the Statcast Search CSV Documentation. Descriptions of the variables utilized here can be found in the data dictionary below.

Data Dictionary

pitch_type

  • [object] the type of the pitch

    • FF: 4-Seam Fastball
    • FS: Split-Finger
    • SI: Sinker
    • SL: Slider

release_speed

  • [float64] pitch velocity (miles per hour) measured shortly after leaving the pitcher’s hand

release_spin_rate

  • [float64] pitch spin rate (revolutions per minute) measured shortly after leaving the pitcher’s hand

pfx_x

  • [float64] horizontal movement (feet) of the pitch from the catcher’s perspective

pfx_z

  • [float64] vertical movement (feet) of the pitch from the catcher’s perspective

stand

  • [object] side of the plate the batter stands on, either L (left) or R (right)

Exploratory Data Analysis

Table 1 displays Gausman’s 2024 pitch mix (distribution of pitch utilization). We note that sinkers and sliders are quite rare compared to fastballs and split-fingers.

Pitch Type    Count    Proportion
FF             1488          0.52
FS              959          0.33
SL              240          0.08
SI              181          0.06
Table 1: Pitch mix for Kevin Gausman during the 2024 MLB season (training data).
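The counts and proportions in Table 1 can be reproduced directly from the `pitch_type` column; a minimal sketch using pandas:

```python
import pandas as pd

# Rebuild Table 1 from raw pitch types; counts match the 2024 training data.
pitch_type = pd.Series(["FF"] * 1488 + ["FS"] * 959 + ["SL"] * 240 + ["SI"] * 181)

mix = pd.concat(
    [
        pitch_type.value_counts(),                       # counts, most frequent first
        pitch_type.value_counts(normalize=True).round(2)  # proportions
    ],
    axis=1,
    keys=["Count", "Proportion"],
)
print(mix)
```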

Figure 1 visualizes the relationship between velocity and spin for these pitches. Fastballs, split-fingers, and sliders appear well separated, which suggests they will be relatively easy to classify. However, there is significant overlap between fastballs and sinkers. This does not make classification impossible, as these types may still be distinguishable based on other features. Additionally, because sinkers are the least frequent pitch, misclassifying them will have the smallest potential impact on overall performance.

Figure 1: Relationship between pitch velocity and spin rate for Kevin Gausman pitches from the 2024 MLB season (training data).

Models

To develop a model for pitch-type classification, we establish a pipeline using scikit-learn. Figure 2 summarizes the pipeline and the resultant model.

  • Numeric features were processed using an imputer and scaler.
  • Categorical features were processed with an imputer and one-hot encoder.
  • The preprocessed data was then passed to a \(k\)-nearest neighbors (KNN) classifier.
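The steps above can be constructed in code as follows, matching the fitted pipeline shown in Figure 2:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]
categorical_features = ["stand"]

preprocessor = ColumnTransformer(transformers=[
    # Numeric: median imputation, then standardization (important for KNN).
    ("numeric", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_features),
    # Categorical: most-frequent imputation, then one-hot encoding.
    ("categorical", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder()),
    ]), categorical_features),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", KNeighborsClassifier()),
])
```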

The pipeline was tuned via 5-fold cross-validation using GridSearchCV over a parameter grid that considered the following potential values of \(k\):

k = [1, 5, 10, 15, 20, 25, 50, 100, 250, 500]
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='median')),
                                                                                         ('scaler',
                                                                                          StandardScaler())]),
                                                                         ['release_speed',
                                                                          'release_spin_rate',
                                                                          'pfx_x',
                                                                          'pfx_z']),
                                                                        ('categorical',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('encoder',
                                                                                          OneHotEncoder())]),
                                                                         ['stand'])])),
                                       ('classifier', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'classifier__n_neighbors': [1, 5, 10, 15, 20, 25, 50,
                                                     100, 250, 500]},
             scoring='accuracy')
Figure 2: The learned pipeline, including preprocessing and KNN classifier.

Results

As indicated in Figure 2 and shown in Table 2, the tuned (chosen) model is a KNN classifier with \(k = 50\).

   k    CV Accuracy    Standard Deviation
   1          0.977                 0.006
   5          0.978                 0.005
  10          0.980                 0.004
  15          0.980                 0.006
  20          0.980                 0.004
  25          0.980                 0.005
  50          0.981                 0.005
 100          0.980                 0.004
 250          0.959                 0.007
 500          0.878                 0.005
Table 2: Cross-validation results for values of \(k\) considered in the \(k\)-nearest neighbors classifier.

Figure 3 demonstrates that we have considered models that span a reasonable range of model flexibilities, as indicated by the expected (inverted) U-shaped curve.

Figure 3: Cross-validation accuracy as a function of \(k\) for the KNN classifier.

Evaluating this tuned model on the held-out test data, we obtain a test accuracy of 0.986. Figure 4 further describes the test performance with a confusion matrix.

Figure 4: Confusion matrix showing the performance of the \(k\)-nearest neighbors classifier on the test data.
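As an illustration of how the test accuracy and confusion matrix are computed (using tiny stand-in labels, not the real held-out data):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative only: stand-in true and predicted pitch types.
y_true = ["FF", "FF", "FS", "SI", "SL"]
y_pred = ["FF", "FF", "FS", "FF", "SL"]

labels = ["FF", "FS", "SI", "SL"]  # fix row/column order
acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(acc)  # 0.8
print(cm)   # rows = true type, columns = predicted type
```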

We note that the proportion of sinkers in the test data is even lower than in the training data, where sinkers were already infrequent. In particular, only two pitches in the test data have type SI. We expand on this observation in the discussion section.

Discussion

The model developed here is fast and accurate. Thus, building a real-time automatic pitch-type classification system seems feasible. Of course, the work presented here is only a limited proof-of-concept, and as-is, could only be used for Kevin Gausman. To fully implement a system in production, additional engineering work is required.

We further detail the benefits, risks, limitations, and potential improvements of the work presented in this report.

Benefits

By building a pitch-classification system, we would be able to deliver additional information to fans attending games, and use that information to improve the game broadcasts.

By using a scikit-learn pipeline, the process we developed here is maintainable and reproducible, which provides a foundation for future model development if the system is expanded for use with all MLB pitchers.

Despite using KNN, whose prediction cost grows with the size of the training data, predictions are fast: predicting the full test data on a consumer-grade laptop is nearly instantaneous. In practice, pitches will be predicted one at a time, which is faster still.

Risks

The predictions from this model are used only for informational and entertainment purposes, so incorrect predictions have minimal consequences. While it is of course preferable to show only correct pitch types to fans, errors do not truly create any harm, and the rate at which they occur is relatively low. However, the performance of the model for Gausman should not be assumed to transfer to other pitchers. If a system is developed for all pitchers, performance should be monitored for each pitcher's model.

While not initially proposed, we currently recommend against using the model development techniques seen here to label data with a pitch type as it is collected. For such a task, we would need to develop more performant models (with near-perfect, if not perfect, accuracy) or add human review of each pitch type after automatic labeling.

Limitations

The main and obvious limitation is that we have only developed a model for a single pitcher. However, the techniques used here could easily be applied to every MLB pitcher for whom historical data exists. Using these techniques alone, though, no predictions could be provided for pitchers new to MLB who lack historical data.

We also need to be aware of potential data drift over time, which would cause degraded performance. Data drift could occur due to:

  • Changes in pitch mechanics, that is, modifying how a specific pitch is thrown, and thus the resulting characteristics.
  • Changes in pitch mix, that is, modifying how often each pitch is thrown.
  • Changes to data collection tools (Statcast measurement tools) and procedures.
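A simple pitch-mix drift check could compare each pitch type's share of a recent window against the training mix and flag large shifts. A minimal sketch (the 0.05 threshold and example counts are arbitrary assumptions):

```python
import pandas as pd

def mix_drift(train_types, recent_types, threshold=0.05):
    """Return pitch types whose recent proportion differs from the
    training proportion by more than `threshold`."""
    train_mix = pd.Series(train_types).value_counts(normalize=True)
    recent_mix = pd.Series(recent_types).value_counts(normalize=True)
    # Types absent from the recent window count as proportion 0.
    diff = (recent_mix.reindex(train_mix.index, fill_value=0.0) - train_mix).abs()
    return diff[diff > threshold].index.tolist()

# Gausman-like example: sinkers vanish in the recent window.
train = ["FF"] * 50 + ["FS"] * 35 + ["SL"] * 9 + ["SI"] * 6
recent = ["FF"] * 54 + ["FS"] * 37 + ["SL"] * 9
print(mix_drift(train, recent))  # ['SI']
```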

We see an example of data drift in the data used here. Table 1 shows that Gausman threw 181 sinkers in 2024. However, Figure 4 indicates that through the first half of 2025, Gausman has thrown only 2 sinkers, which suggests he either has stopped or will stop throwing sinkers altogether. This drift degrades our model's performance on new data, as indicated by the 17 pitches predicted as sinkers in the test data.

Potential remedies for these limitations are included in the improvements section below.

Improvements

To address the limitations discussed above and improve the overall system, we recommend several enhancements.

While the KNN classifier performs well for Gausman, exploring other model classes such as logistic regression, decision trees, or ensemble methods could potentially improve performance or provide better computational efficiency. Additionally, incorporating more features from the Statcast data, such as release position or plate location, could help better distinguish between similar pitch types.

To combat the data drift issues observed with Gausman’s sinker usage, we recommend regularly re-training models after each game or series. This would incorporate the most recent data and allow models to adapt to changes in pitch mechanics and pitch mix over time. This approach would also naturally facilitate model development for pitchers new to MLB as they accumulate sufficient historical data. To best manage drift in pitch mixes, in addition to adding new data when re-training, consideration should be given to similarly removing some old data, on a rolling basis. How to best add data, remove data, and re-train models should be explored.
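One way to implement the rolling window sketched above is to filter training data by date before each re-training run. Here `game_date` follows the Statcast column name; the 365-day cutoff is an assumption that should be tuned:

```python
import pandas as pd

def rolling_window(df, as_of, window_days=365):
    """Keep only pitches thrown within `window_days` before `as_of`."""
    as_of = pd.Timestamp(as_of)
    dates = pd.to_datetime(df["game_date"])
    return df[dates > as_of - pd.Timedelta(days=window_days)]

# Toy example: an early-2024 pitch falls outside the window.
pitches = pd.DataFrame({
    "game_date": ["2024-04-01", "2024-09-15", "2025-05-01"],
    "pitch_type": ["SI", "FF", "FF"],
})
recent = rolling_window(pitches, as_of="2025-06-01")
print(list(recent["pitch_type"]))  # ['FF', 'FF']
```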

Rather than always displaying a predicted pitch type, we could use the model to obtain estimated probabilities for each pitch type. These probabilities represent the model’s belief about the likelihood of each possible pitch type given the observed pitch characteristics.

We can define the model’s confidence as the maximum probability among all pitch types for a given prediction. For example, if the model estimates probabilities of 0.85 for a fastball (FF), 0.10 for a split-finger (FS), 0.03 for a sinker (SI), and 0.02 for a slider (SL), the confidence would be 0.85. A high confidence indicates the model strongly believes one pitch type is most likely, while a low confidence suggests uncertainty between multiple pitch types.
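Using the worked example above, confidence is simply the maximum of the estimated probabilities:

```python
import numpy as np

# In practice these probabilities would come from the pipeline's
# predict_proba output; the values here mirror the worked example above.
labels = ["FF", "FS", "SI", "SL"]
probs = np.array([0.85, 0.10, 0.03, 0.02])

confidence = probs.max()                 # the model's confidence
predicted = labels[int(probs.argmax())]  # the displayed pitch type
print(predicted, confidence)  # FF 0.85
```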

In our test data, the highest-confidence prediction has a confidence of 1.0, with estimated probabilities:

  • FF: 1.0
  • FS: 0.0
  • SI: 0.0
  • SL: 0.0

In contrast, the lowest confidence prediction has a confidence of only 0.5, with estimated probabilities:

  • FF: 0.5
  • FS: 0.0
  • SI: 0.5
  • SL: 0.0

This low-confidence example demonstrates a case where the model cannot clearly distinguish between pitch types based on the available features.

By setting a confidence threshold (for example, 0.60 or 0.70), the system could refrain from displaying a pitch type when the model is uncertain. This would reduce the risk of showing incorrect predictions to viewers and maintain the credibility of the system. The appropriate threshold should be determined based on stakeholder input and the acceptable trade-off between coverage (the percentage of pitches with displayed predictions) and accuracy.
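A hypothetical display rule implementing this threshold (the function name and 0.70 default are illustrative):

```python
import numpy as np

def display_prediction(probs, labels, threshold=0.70):
    """Return the pitch type to display, or None to suppress the display
    when the model's confidence falls below the threshold."""
    probs = np.asarray(probs)
    if probs.max() >= threshold:
        return labels[int(probs.argmax())]
    return None

labels = ["FF", "FS", "SI", "SL"]
print(display_prediction([0.85, 0.10, 0.03, 0.02], labels))  # FF
print(display_prediction([0.50, 0.00, 0.50, 0.00], labels))  # None
```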

For pitchers with limited (or no) historical data, a league-wide “fallback” model trained on data from all pitchers could be used to provide reasonable predictions. While such a model would likely be less accurate than pitcher-specific models, it would ensure the system can always provide some information to viewers.

To fully implement a production version of this system for broadcast and in-stadium use, significant engineering work remains. The system would require:

  • automated re-training (and storage) pipelines
  • a monitoring dashboard to track model performance for each pitcher as re-training occurs
  • an API (using tools such as FastAPI) for integration with data collection and broadcast services.

If work is to proceed, significant input and collaboration with the data engineering and broadcast teams will be necessary.
