Pitches

Assignment

For Fall 2025, the Pitches lab will be used as Lab 01.

Before submitting, especially your report, be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.

The model portion of the lab is due on Saturday, September 27.
The report portion of the lab is due on Saturday, October 4.

Background

It’s tough to make predictions, especially about the future.

– Yogi Berra

What is a pitch type, you might ask? Well, it’s complicated.

YouTube: How to Identify Baseball Pitches

As we’ll see in a moment, while the pitch type is technically defined by what the pitcher claims they threw, we can probably infer the pitch type based on only speed and spin.

Now that you’re a pitch type expert, here’s a game to see how well you can identify pitches from video:

Baseball Savant: Guess the Pitch Type

That game was difficult, wasn’t it? Don’t feel bad! Identifying pitches with your eyes is difficult. It is even more difficult when you realize the cameras are playing tricks on you:

YouTube: Camera Angles in MLB and How It Affects Us, A Deep-Dive

But wait! Then how do television broadcasts of baseball games instantly display the pitch type? You guessed it… machine learning! For a deep dive on how they do this, see here:

MLB Technology Blog: MLB Pitch Classification

Long story short:

Have advanced tracking technology that can instantly record speed, spin, and other measurements for each pitch.
Have a trained classifier for pitch type based on speed, spin, and more.
In real time, make a prediction of the pitch type as soon as the speed and spin are recorded.
Display the result in the stadium and on the broadcast!
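
The steps above can be sketched in a few lines of Python. This is a toy illustration, not MLB's actual system. The measurements are made up, the column names are borrowed from the data dictionary later on this page, and the k-nearest-neighbors classifier is an assumption chosen only to keep the example short.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# made-up measurements standing in for what the tracking technology records:
# speed in miles per hour, spin in revolutions per minute
history = pd.DataFrame(
    {
        "release_speed": [95.1, 94.6, 96.0, 85.2, 84.8, 86.1],
        "release_spin_rate": [2300.0, 2250.0, 2350.0, 1500.0, 1450.0, 1550.0],
        "pitch_type": ["FF", "FF", "FF", "FS", "FS", "FS"],
    }
)

# train a classifier for pitch type based on speed and spin
# (features are left unscaled here, so spin dominates the distance)
features = ["release_speed", "release_spin_rate"]
model = KNeighborsClassifier(n_neighbors=3)
model.fit(history[features], history["pitch_type"])

# a new pitch is recorded, and a prediction is made right away
new_pitch = pd.DataFrame({"release_speed": [95.4], "release_spin_rate": [2310.0]})
predicted = model.predict(new_pitch)[0]

# "display" the result
print(f"Pitch type: {predicted}")
```

Here the new pitch is close to the three made-up fastballs in both speed and spin, so the classifier labels it FF.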

There are several ways to accomplish this task, but one decision we’ll make is to fit one model per pitcher. For the purpose of this lab, we will model the pitches of only a single pitcher, Kevin Gausman.

Baseball Savant: Kevin Gausman

Scenario and Goal

Who are you?

You are a data scientist working for Major League Baseball (MLB) as part of the broadcast operations team. The current date is July 15, 2025, the date of the 2025 MLB All-Star Game.

What is your task?

You are tasked with developing a model to assist with automatically displaying the pitch type for each pitch in real time, both in the stadium and on the television broadcast. You have access to data on every pitch thrown in MLB to date, including characteristics of the pitch such as its velocity and rotation, as well as the type of pitch thrown. Additionally, tracking technology in each stadium will provide data on (at least) the speed and rotation of each pitch in real time. You are asked to create a proof of concept for a single pitcher, Kevin Gausman, but your process should be designed so that it can easily be applied to other pitchers as well.

Who are you writing for?

To summarize your work, you will write a report for your manager, who reports to the Vice President of Media Operations. You can assume your manager is very familiar with baseball and associated broadcasting. With a focus on broadcasting, your manager is especially concerned with the ability of your model to work in real time; that is, it must make predictions nearly instantaneously after the pitch is thrown.
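
One way to speak to the real-time concern in a report is to time how long a fitted model takes to classify a single pitch. The sketch below is only an illustration. The helper `median_latency_ms` is a hypothetical name, the training data is randomly generated, and the decision tree is an arbitrary stand-in for whatever model you fit. Measured times depend on hardware, and this page does not specify a latency requirement.

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# made-up training data: columns are speed (mph) and spin rate (rpm)
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(90, 5, 500), rng.normal(2000, 300, 500)])
y = np.where(X[:, 0] > 90, "FF", "FS")  # artificial labels based on speed only

model = DecisionTreeClassifier(random_state=0).fit(X, y)


def median_latency_ms(model, pitch, n_repeats=200):
    """Median wall-clock time, in milliseconds, to predict one pitch."""
    timings = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        model.predict(pitch)
        timings.append((time.perf_counter() - start) * 1000)
    return float(np.median(timings))


# a single incoming pitch, shaped as one row with two features
one_pitch = np.array([[95.0, 2300.0]])
latency = median_latency_ms(model, one_pitch)
print(f"median single-pitch latency: {latency:.3f} ms")
```

Reporting the median over many repeats, rather than a single timing, reduces the effect of one-off delays.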

Data

To achieve the goal of this lab, we will need historical pitching data. The necessary data is provided in the following files:

pitches-train.parquet
pitches-test.parquet

You are not required to download and manage this data. Below, we provide code to directly import this data into Python.

Source

The original source of the data is Statcast. Specifically, the pybaseball package was used to interface with Statcast Search, which is part of Baseball Savant.

Data Dictionary

Each sample contains Statcast data for a single pitch thrown by Kevin Gausman in either 2024 (train data) or 2025 (test data) during an MLB regular season game.

Here, the train-test split is based on time.

Train: 2024 MLB Season
Test: (First Half of) 2025 MLB Season

Original and (mostly) complete documentation for Statcast data can be found in the Statcast Search CSV Documentation. A more detailed reference can be found in Appendix C of Analyzing Baseball Data with R.¹ Notably, Table C.3 maps the pitch type to a pitch name, as pitch types are shortcodes for the more descriptive pitch (type) names.

The variable descriptions listed below are available in the Markdown file variable-descriptions.md for ease of inclusion in reports.

Variable Descriptions

pitch_type

[object] the type of the pitch

release_speed

[float64] pitch velocity (miles per hour) measured shortly after leaving the pitcher’s hand

release_spin_rate

[float64] pitch spin rate (revolutions per minute) measured shortly after leaving the pitcher’s hand

pfx_x

[float64] horizontal movement (feet) of the pitch from the catcher’s perspective

pfx_z

[float64] vertical movement (feet) of the pitch from the catcher’s perspective

stand

[object] side of the plate on which the batter is standing, either L (left) or R (right)

Data in Python

To load the data in Python, use:

import pandas as pd

pitches_train = pd.read_parquet(
    "https://lab.cs307.org/pitches/data/pitches-train.parquet",
)
pitches_test = pd.read_parquet(
    "https://lab.cs307.org/pitches/data/pitches-test.parquet",
)

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = pitches_train.drop("pitch_type", axis=1)
y_train = pitches_train["pitch_type"]

# create X and y for test
X_test = pitches_test.drop("pitch_type", axis=1)
y_test = pitches_test["pitch_type"]
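
Because the scenario asks for a process that can be reused for other pitchers, it may help to wrap preprocessing and model fitting in a single function. The sketch below is one possible shape for that, not a recommended solution. The helper `fit_pitcher_model` is a hypothetical name, the decision tree is an untuned placeholder that is not claimed to reach the required accuracy, and the small data frame is made up to stand in for `X_train` and `y_train`. It also assumes there are no missing values; if the real data has any, an imputation step would need to be added.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# column names taken from the data dictionary
NUMERIC = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]
CATEGORICAL = ["stand"]


def fit_pitcher_model(X, y):
    """Fit a pitch-type classifier for one pitcher (hypothetical helper)."""
    preprocessor = ColumnTransformer(
        [
            # numeric features are passed through unchanged
            ("num", "passthrough", NUMERIC),
            # batter side (L or R) is one-hot encoded
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ]
    )
    pipeline = Pipeline(
        [
            ("preprocess", preprocessor),
            ("classify", DecisionTreeClassifier(random_state=0)),
        ]
    )
    return pipeline.fit(X, y)


# tiny made-up stand-in for X_train and y_train
X_demo = pd.DataFrame(
    {
        "release_speed": [95.1, 94.6, 96.0, 94.9, 85.2, 84.8, 86.1, 85.5],
        "release_spin_rate": [2300.0, 2250.0, 2350.0, 2280.0, 1500.0, 1450.0, 1550.0, 1480.0],
        "pfx_x": [-0.8, -0.7, -0.9, -0.8, -1.2, -1.1, -1.3, -1.2],
        "pfx_z": [1.4, 1.3, 1.5, 1.4, 0.3, 0.2, 0.4, 0.3],
        "stand": ["L", "R", "L", "R", "L", "R", "L", "R"],
    }
)
y_demo = pd.Series(["FF"] * 4 + ["FS"] * 4, name="pitch_type")

model = fit_pitcher_model(X_demo, y_demo)
print(model.predict(X_demo.head(2)))
```

Keeping the preprocessing inside the pipeline means the same steps are applied automatically at prediction time, and applying the process to another pitcher only requires calling the function on that pitcher's data.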

You can assume that within the autograder, similar processing is performed on the production data.

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
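
The specific statistics to compute are listed on PrairieLearn and are not reproduced here. As a general pattern only, the sketch below shows how pandas can summarize the pitch mix and per-pitch-type velocity. The small data frame is made up and merely stands in for `pitches_train`.

```python
import pandas as pd

# tiny made-up stand-in for pitches_train
pitches_demo = pd.DataFrame(
    {
        "pitch_type": ["FF", "FF", "FS", "FS", "SL"],
        "release_speed": [95.0, 94.0, 86.0, 84.0, 83.0],
        "release_spin_rate": [2300.0, 2200.0, 1500.0, 1400.0, 2100.0],
    }
)

# proportion of pitches of each type
pitch_mix = pitches_demo["pitch_type"].value_counts(normalize=True)

# mean and standard deviation of velocity, by pitch type
speed_by_type = pitches_demo.groupby("pitch_type")["release_speed"].agg(["mean", "std"])

print(pitch_mix)
print(speed_by_type)
```

Note that a pitch type with a single observation, such as SL in this made-up data, has an undefined (NaN) sample standard deviation.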

Models

To obtain the maximum points via the autograder, your submitted model must outperform the following metrics:

Test Accuracy: 0.983
Production Accuracy: 0.983
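
Before submitting, you can compare your own test accuracy against the target locally. The sketch below shows the comparison with made-up labels standing in for `y_test` and your model's predictions on `X_test`. The name `TARGET_ACCURACY` is introduced here for illustration. Production accuracy cannot be checked this way, since the production data is only available within the autograder.

```python
from sklearn.metrics import accuracy_score

TARGET_ACCURACY = 0.983  # test accuracy threshold quoted above

# made-up labels and predictions standing in for y_test and model.predict(X_test)
y_true = ["FF", "FF", "FS", "FS", "SL", "SL", "FF", "FS"]
y_pred = ["FF", "FF", "FS", "FS", "SL", "FF", "FF", "FS"]

# fraction of pitches whose predicted type matches the recorded type
accuracy = accuracy_score(y_true, y_pred)
print(f"accuracy: {accuracy:.3f}")
print("beats target" if accuracy > TARGET_ACCURACY else "below target")
```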

Footnotes

¹ This book is the book if you’re looking to get into baseball analytics. Hopefully a Python version will be available soon.