Swing

Assignment

For Fall 2025 the Swing lab will be used as Lab 04.

Before submission, especially of your report, you should be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.

The model portion of the lab is due on Saturday, November 8.
The report portion of the lab is due on Saturday, November 15.

Background

Swing at the strikes.

– Yogi Berra

While the game of baseball is a competition between two teams, it ultimately reduces to a struggle between a specific batter and pitcher.

The job of a pitcher is to prevent batters from reaching base. They can do this by striking them out, inducing a ground out, or getting the batter to fly out.
The objective of the batter is to reach base, either via a walk or a hit that results from making solid contact.

These conflicting goals are sought via a psychological struggle, with the two players trying to outthink each other. The batter tries to anticipate the type and location of a pitch, while the pitcher tries to deceive the batter.

Depending on the game situation and the pitcher’s characteristics, a pitch is often thrown with the intention of either making the batter swing or making the batter take.

Pitchers want batters to swing at pitches with a low probability of success for the batter, that is, pitches likely to produce either a swing-and-miss (strike) or weak contact that results in an out in the field. They do this by throwing pitches to locations that are hard to hit while making them appear easy to hit.
Pitchers want batters to take (not swing at) pitches that will be called strikes. They do this by throwing pitches that look like balls as they approach the plate, but just barely cross the strike zone.

Modern baseball analytics is interested in studying when batters swing at pitches, both to assist in pitcher development, and as part of larger data analytics systems.

Scenario and Goal

Who are you?

You are a data scientist working for a Major League Baseball (MLB) team as part of their Research & Development department. The current date is October 2, 2023, the final day of the 2023 MLB regular season.

What is your task?

Your goal is to develop a well-calibrated probability model that estimates the probability that a batter swings at a pitch, given pitch characteristics and game situation information, for a particular pitcher. You have access to data on every pitch thrown by your team’s pitchers this season. You are asked to create a model for Zac Gallen, one of your team’s starting pitchers, who may be used in the playoffs.

Who are you writing for?

To summarize your work, you will write a report for the VP of Research & Development. You can assume the VP is a baseball expert, and reasonably familiar with the general concepts of data analysis and machine learning.

Data

To achieve the goal of this lab, we will need data on previously thrown pitches. The necessary data is provided in the following files:

swing-train.parquet
swing-test.parquet

You are not required to download and manage this data. Below, we provide code to directly import this data into Python.

Source

The original source of the data is Statcast. Specifically, the pybaseball package was used to interface with Statcast Search, which is part of Baseball Savant.

Data Dictionary

Each sample contains Statcast data for a single pitch thrown by Zac Gallen in the 2023 MLB regular season.

Baseball Savant: Zac Gallen

Here, the train-test split is based on time.

Train: March 30 through August 31, 2023
Test: September 1 through October 2, 2023

Original and (mostly) complete documentation for Statcast data can be found in the Statcast Search CSV Documentation. A more detailed reference can be found in Appendix C of Analyzing Baseball Data with R.¹

The variable descriptions listed below are available in the Markdown file variable-descriptions.md for ease of inclusion in reports.

Response

swing

[int64] Whether the batter swung (1) or took (0).

Features

While we will certainly not be able to make any truly causal claims about our model, it is important to understand which variables are controlled by the pitcher. We could imagine a coach using this model to help explain to a pitcher where and how to throw a pitch if they want to induce a swing. As such, we will group the feature variables based on the degree of control the pitcher exerts over them.

Fully Pitcher Controlled

This variable is fully controlled by the pitcher. In modern baseball, this information is communicated between the pitcher and catcher before the pitch via PitchCom.

pitch_name

[object] The name of the pitch type to be thrown.

Mostly Pitcher Controlled

These variables are largely controlled by the pitcher, but even at the highest levels of baseball, there will be variance based on skill, fatigue, etc. These variables essentially measure where the pitcher’s arm is located as a pitch is thrown.

release_extension

[float64] Release extension of pitch in feet as tracked by Statcast.

release_pos_x

[float64] Horizontal release position of the ball measured in feet from the catcher’s perspective.

release_pos_y

[float64] Release position of pitch measured in feet from the catcher’s perspective.

release_pos_z

[float64] Vertical release position of the ball measured in feet from the catcher’s perspective.

Somewhat Pitcher Controlled

These variables are in some sense controlled by the pitcher, but less so than the previous variables. At the MLB level, pitchers will have some control here, but even at the highest levels, there can be a lot of variance. The speed and spin features are highly dependent on the pitch type thrown.

release_speed

[float64] Velocity of the pitch thrown (miles per hour).

release_spin_rate

[float64] Spin rate of pitch tracked by Statcast (revolutions per minute).

spin_axis

[float64] The spin axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.

plate_x

[float64] Horizontal position of the ball when it crosses home plate from the catcher’s perspective (feet).

plate_z

[float64] Vertical position of the ball when it crosses home plate from the catcher’s perspective (feet).

Downstream Pitcher Controlled

These variables are pitch characteristics, and may be somewhat controlled by the pitcher, but are largely functions of the previous variables.

pfx_x

[float64] Horizontal movement in feet from the catcher’s perspective.

pfx_z

[float64] Vertical movement in feet from the catcher’s perspective.

Situational Information

These variables describe part of the game situation when the pitch was thrown. (We have omitted some other obvious variables here, like score and inning, for simplicity.) These are fixed before a pitch is thrown, but could affect whether the batter swings. Pitchers and batters often act differently based on the game situation. For example, batters are known to “protect” when there are two strikes, and are thus much more likely to swing.

balls

[int64] Pre-pitch number of balls in count.

strikes

[int64] Pre-pitch number of strikes in count.

on_3b

[int64] Indicator (0 or 1) for runner on 3rd base.

on_2b

[int64] Indicator (0 or 1) for runner on 2nd base.

on_1b

[int64] Indicator (0 or 1) for runner on 1st base.

outs_when_up

[int64] Pre-pitch number of outs.

Fixed Batter Information

These variables give some information about the batter facing the pitcher. In particular, they record whether the batter is a righty or a lefty, and the size of their strike zone, which is a function of their height.

stand

[object] Side of the plate on which the batter is standing, either L (left) or R (right).

sz_top

[float64] Top of the batter’s strike zone set by the operator when the ball is halfway to the plate (feet).

sz_bot

[float64] Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate (feet).
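The location and strike zone variables can be combined to describe where a pitch crossed the plate relative to a particular batter's zone. The sketch below is one hedged way to do that, not a required step of the lab. The helper add_zone_features and the in_zone column are hypothetical names, the toy values are made up, and the check ignores the radius of the ball, so it is only an approximation of the called zone.

```python
import pandas as pd

# Home plate is 17 inches wide; plate_x is measured in feet from its center.
PLATE_HALF_WIDTH_FT = 17 / 12 / 2


def add_zone_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 0/1 flag for pitches inside the zone (hypothetical helper)."""
    out = df.copy()
    inside_horizontally = out["plate_x"].abs() <= PLATE_HALF_WIDTH_FT
    inside_vertically = out["plate_z"].between(out["sz_bot"], out["sz_top"])
    out["in_zone"] = (inside_horizontally & inside_vertically).astype(int)
    return out


# Toy pitches (made-up values): middle of the zone, far outside, and too high.
toy_pitches = pd.DataFrame(
    {
        "plate_x": [0.0, 1.2, -0.3],
        "plate_z": [2.5, 2.5, 4.0],
        "sz_top": [3.5, 3.5, 3.5],
        "sz_bot": [1.5, 1.5, 1.5],
    }
)
toy_with_zone = add_zone_features(toy_pitches)
```

Whether a feature like this helps calibration is an empirical question; flexible models may already recover it from plate_x, plate_z, sz_top, and sz_bot.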

Data in Python

To load the data in Python, use:

import pandas as pd

swing_train = pd.read_parquet(
    "https://lab.cs307.org/swing/data/swing-train.parquet",
)
swing_test = pd.read_parquet(
    "https://lab.cs307.org/swing/data/swing-test.parquet",
)

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = swing_train.drop("swing", axis=1)
y_train = swing_train["swing"]

# create X and y for test
X_test = swing_test.drop("swing", axis=1)
y_test = swing_test["swing"]

You can assume that within the autograder, similar processing is performed on the production data.
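Because pitch_name and stand are stored as strings, they need to be encoded before most sklearn estimators can use them. The following is a minimal sketch of one way to do this inside a pipeline, fit on a small synthetic stand-in for X_train and y_train. The toy data, the subset of columns, and the choice of logistic regression are all assumptions made for illustration, not the expected solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X_train, using only a few of the real column names.
rng = np.random.default_rng(307)
n = 200
X_toy = pd.DataFrame(
    {
        "pitch_name": rng.choice(["4-Seam Fastball", "Curveball", "Changeup"], size=n),
        "stand": rng.choice(["L", "R"], size=n),
        "plate_x": rng.normal(0.0, 0.8, size=n),
        "plate_z": rng.normal(2.5, 0.8, size=n),
        "strikes": rng.integers(0, 3, size=n),
    }
)
# Made-up response: swings are more likely near the horizontal center of the plate.
swing_chance = 1 / (1 + np.exp(2 * X_toy["plate_x"].abs() - 1))
y_toy = (rng.random(n) < swing_chance).astype(int)

categorical_features = ["pitch_name", "stand"]
numeric_features = ["plate_x", "plate_z", "strikes"]

# One-hot encode the string columns and standardize the numeric ones.
preprocessor = ColumnTransformer(
    [
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        ("numeric", StandardScaler(), numeric_features),
    ]
)

model = Pipeline(
    [
        ("preprocess", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
model.fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of a swing (class 1).
swing_prob = model.predict_proba(X_toy)[:, 1]
```

Keeping the preprocessing inside the pipeline means the same steps are applied whenever the fitted model sees new data, and handle_unknown="ignore" keeps a pitch type that never appeared in training from raising an error.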

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
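As a starting point for looking at the data, the sketch below computes an overall swing rate and swing rates within groups. It uses a tiny made-up data frame in place of swing_train, and swing_summary is a hypothetical helper name. The statistics actually requested on PrairieLearn may differ from these.

```python
import pandas as pd

# Tiny made-up stand-in for swing_train.
toy = pd.DataFrame(
    {
        "pitch_name": [
            "4-Seam Fastball",
            "Curveball",
            "4-Seam Fastball",
            "Changeup",
            "Curveball",
            "4-Seam Fastball",
        ],
        "strikes": [0, 2, 1, 2, 0, 2],
        "swing": [0, 1, 1, 1, 0, 1],
    }
)


def swing_summary(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Count pitches and compute the swing rate within each level of `by`."""
    return (
        df.groupby(by)["swing"]
        .agg(n_pitches="count", swing_rate="mean")
        .sort_values("n_pitches", ascending=False)
    )


overall_rate = toy["swing"].mean()          # proportion of pitches swung at
by_pitch = swing_summary(toy, "pitch_name")  # usage and swing rate per pitch type
by_strikes = swing_summary(toy, "strikes")   # swing rate by pre-pitch strike count
```

With the real data, the same calls on swing_train would show, for example, how the swing rate changes with two strikes.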

Models

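To obtain the maximum points via the autograder, your submitted model must outperform the following metrics. Lower is better for all three, so each of your model’s values must fall below the listed threshold: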

Test ECE: 0.065
Test MCE: 0.12
Test Brier Score: 0.19
Production ECE: 0.065
Production MCE: 0.12
Production Brier Score: 0.19

Probability Calibration

What are the metrics ECE and MCE? Brier Score? How are we evaluating models here?

CS 307 Notes: Classifier Calibration

Models will not be evaluated on their ability to classify a swing or not. Instead, we will directly assess their ability to estimate the probability of a swing. Thus, you need a well-calibrated model.

sklearn: Probability Calibration

The above sklearn user guide page will provide some hints. Importantly, you may need to use CalibratedClassifierCV to further calibrate the probability estimates from a classifier.
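As a rough illustration of how CalibratedClassifierCV wraps a classifier, here is a minimal sketch on synthetic data. The random forest base model, the isotonic method, the five folds, and the random split are assumptions chosen for the example. For the lab itself, keep the provided time-based train and test sets and compare choices for yourself.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, standing in for the swing data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=307)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=307
)

# Wrap the base classifier; within each of the 5 folds, the base model is fit
# on part of the data and the calibrator is fit on the held-out part.
base = RandomForestClassifier(n_estimators=100, random_state=307)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_fit, y_fit)

# Calibrated probability of the positive class on data not used for fitting.
p_cal = calibrated.predict_proba(X_eval)[:, 1]
brier = brier_score_loss(y_eval, p_cal)
```

The method argument also accepts "sigmoid". Which option gives better calibration depends on the base model and the amount of data, so it is worth checking both against a calibration plot.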

In the autograder, we will use three metrics to assess your submitted model:

Expected Calibration Error (ECE): This is essentially an average of the distance between the points on a calibration plot and the “perfect” line.
Maximum Calibration Error (MCE): This is essentially the furthest any point on a calibration plot is from the “perfect” line.
Brier Score: This is the mean squared difference between the predicted probabilities and the actual 0/1 outcomes, so it measures how close the predictions are to what happened. The Brier Score can be calculated using sklearn.metrics.brier_score_loss.
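To make the three definitions above concrete, here is a minimal sketch that computes them by hand on a small made-up example. It assumes ten equal-width probability bins and weights each bin by its share of the predictions when averaging. The hypothetical helper calibration_errors is not the course's implementation, which may bin differently and therefore give different values.

```python
import numpy as np
from sklearn.metrics import brier_score_loss


def calibration_errors(y_true, y_prob, n_bins=10):
    """Return (ECE, MCE) using equal-width probability bins (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so a probability of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue  # empty bins contribute nothing
        # Gap between the observed swing rate and the mean predicted probability.
        gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the share of predictions in the bin
        mce = max(mce, gap)
    return ece, mce


# Made-up outcomes and predictions: the 0.25 group is perfectly calibrated
# (1 of 4 swings), while the 0.75 group is overconfident (2 of 4 swings).
y_true = [0, 0, 0, 1, 1, 1, 0, 0]
y_prob = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]

ece, mce = calibration_errors(y_true, y_prob)   # 0.125 and 0.25
brier = brier_score_loss(y_true, y_prob)        # 0.25
```

Here the only miscalibrated bin has a gap of 0.25 and holds half of the predictions, so the ECE is 0.5 × 0.25 = 0.125 while the MCE is the full 0.25.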

We provide Python functions for creating calibration plots and calculating ECE and MCE.

calibration.py

Footnotes

¹ This book is the book if you’re looking to get into baseball analytics. Hopefully a Python version will be available soon.