Wine

Assignment

For Fall 2025 the Wine lab will be used as Lab 02.

Before submitting, and especially before submitting your report, be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.

The model portion of the lab is due on Saturday, October 11.
The report portion of the lab is due on Saturday, October 18.

Background

Wine is a popular alcoholic beverage made from fermented fruit, typically grapes. Wine has been produced for thousands of years and plays an important role in many cultures around the world.

A sommelier is a professional who specializes in wine service, especially wine-food pairings. Sommeliers receive extensive training and certification to develop their expertise in wine tasting and evaluation. They learn to assess wine quality based on appearance, aroma, taste, and texture.

However, human wine evaluation is subjective and can vary between experts. Additionally, training sommeliers is expensive and time-consuming. This raises an interesting question: Can we use objective, measurable properties of wine to predict quality?

Wine quality is influenced by many factors, including:

The grape variety and growing conditions
The fermentation process
Chemical composition (acidity, sugar content, alcohol level, etc.)
Aging and storage conditions

Modern chemistry allows us to measure the physicochemical properties of wine with high precision. If we can establish relationships between these measurable properties and perceived quality, we could potentially automate quality assessment.

Scenario and Goal

Who are you?

You work for a startup that wants to create an AI Sommelier.

What is your task?

Rather than using a highly trained human, you will purchase chemistry equipment to generate physicochemical data for wines, and train models based on previous wine quality reviews by human sommeliers. Your goal is to create a model that predicts a wine’s quality given its physicochemical characteristics.

Who are you writing for?

To summarize your work, you will write a report for the startup’s founders. They are experts in the wine industry, but not necessarily in data science.

Data

To achieve the goal of this lab, we will need wine quality data. The necessary data is provided in the following files:

wine-train.parquet
wine-test.parquet

Source

The original source of the data is the following paper:

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016

However, the data from this paper has become a standard dataset in the machine learning community, and thus is made available via the UC Irvine Machine Learning Repository.

UCI MLR: Wine Quality

The original data contains two separate datasets, one for red wine and one for white wine. Here, we have combined the data and added a column for the color of the wine. We have made additional modifications to the original data.

Data Dictionary

Each observation in the train, test, and (hidden) production data contains information about a particular Portuguese “Vinho Verde” wine.

Vinho verde is a unique product from the Minho (northwest) region of Portugal. Medium in alcohol, it is particularly appreciated for its freshness (especially in the summer).

Original and complete documentation for this data can be found in the original paper. Additionally, minimal documentation is provided by the UCI MLR.

The variable descriptions listed below are available in the Markdown file variable-descriptions.md for ease of inclusion in reports.

Variable Descriptions

quality

[int64] the quality of the wine based on evaluation by a minimum of three sensory assessors (using blind tastes), who graded the wine on a scale that ranges from 0 (very bad) to 10 (excellent)

color

[object] the (human perceivable) color of the wine, red or white

fixed acidity

[float64] grams of tartaric acid per cubic decimeter

volatile acidity

[float64] grams of acetic acid per cubic decimeter

citric acid

[float64] grams of citric acid per cubic decimeter

residual sugar

[float64] grams of residual sugar per cubic decimeter

chlorides

[float64] grams of sodium chloride per cubic decimeter

free sulfur dioxide

[float64] milligrams of free sulfur dioxide per cubic decimeter

total sulfur dioxide

[float64] milligrams of total sulfur dioxide per cubic decimeter

density

[float64] the total density of the wine in grams per cubic centimeter

pH

[float64] the acidity of the wine measured using pH

sulphates

[float64] grams of potassium sulphate per cubic decimeter

alcohol

[float64] percent alcohol by volume

Data in Python

To load the data in Python, use:

import pandas as pd

wine_train = pd.read_parquet(
    "https://lab.cs307.org/wine/data/wine-train.parquet",
)
wine_test = pd.read_parquet(
    "https://lab.cs307.org/wine/data/wine-test.parquet",
)
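
After loading, it can be worth confirming that the columns match the data dictionary. The sketch below is one possible check, not part of the lab's required code. It assumes the column names in the files are spelled exactly as in the Variable Descriptions section, which the lab does not state explicitly. The helper name `missing_and_unexpected` and the one-row `toy` frame are hypothetical and made up for illustration; with the real data you would pass `wine_train` or `wine_test` instead.

```python
import pandas as pd

# Column names copied from the Variable Descriptions section.
# ASSUMPTION: the parquet files use exactly these spellings.
EXPECTED_COLUMNS = [
    "quality", "color", "fixed acidity", "volatile acidity", "citric acid",
    "residual sugar", "chlorides", "free sulfur dioxide",
    "total sulfur dioxide", "density", "pH", "sulphates", "alcohol",
]


def missing_and_unexpected(df):
    """Return (missing, unexpected) column names relative to the data dictionary."""
    present = set(df.columns)
    expected = set(EXPECTED_COLUMNS)
    missing = sorted(expected - present)
    unexpected = sorted(present - expected)
    return missing, unexpected


# Made-up one-row frame (NOT real wine data) with a deliberately misspelled "ph".
toy = pd.DataFrame(
    {"quality": [5], "color": ["red"], "alcohol": [9.4], "ph": [3.5]}
)

missing, unexpected = missing_and_unexpected(toy)
print("missing:", missing)
print("unexpected:", unexpected)
```

Calling `wine_train.dtypes` in the same spirit shows whether each column has the type listed in the data dictionary.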

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = wine_train.drop("quality", axis=1)
y_train = wine_train["quality"]

# create X and y for test
X_test = wine_test.drop("quality", axis=1)
y_test = wine_test["quality"]

You can assume that within the autograder, similar processing is performed on the production data.
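
Because `color` is stored as text while the other features are numeric, one common approach is to encode it inside an sklearn pipeline so that the same preprocessing is applied to any data passed to the model. The following is a minimal sketch under stated assumptions, not the required or recommended model: the choice of `RandomForestRegressor`, the function name `build_model`, and the toy data (random values, not real wine measurements, with only three of the feature columns) are all illustrative. If the modified data contains missing values, which the lab description does not say, an imputation step would also be needed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_model():
    """Hypothetical helper: one-hot encode `color`, pass other columns through."""
    preprocess = ColumnTransformer(
        transformers=[
            ("color", OneHotEncoder(handle_unknown="ignore"), ["color"]),
        ],
        remainder="passthrough",  # numeric columns are left unchanged
    )
    return Pipeline(
        steps=[
            ("preprocess", preprocess),
            # ASSUMPTION: any sklearn regressor could be used here.
            ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
        ]
    )


# Made-up toy data standing in for X_train and y_train.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    {
        "color": rng.choice(["red", "white"], size=40),
        "alcohol": rng.uniform(8.0, 14.0, size=40),
        "pH": rng.uniform(2.8, 3.8, size=40),
    }
)
toy_y = pd.Series(rng.integers(3, 9, size=40), name="quality")

model = build_model()
model.fit(toy_X, toy_y)
toy_pred = model.predict(toy_X)
print(toy_pred[:5])
```

With the real data, `model.fit(X_train, y_train)` followed by `model.predict(X_test)` would take the place of the toy calls.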

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
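
The specific statistics requested on PrairieLearn are not listed here, so the sketch below only illustrates the general pandas pattern for overall and per-group summaries. The `toy_train` frame is made up for illustration and is not real wine data; with the real data, `wine_train` would be used in its place.

```python
import pandas as pd

# Made-up stand-in for wine_train (NOT real data).
toy_train = pd.DataFrame(
    {
        "quality": [5, 6, 7, 5, 6, 6],
        "color": ["red", "red", "red", "white", "white", "white"],
        "alcohol": [9.4, 10.1, 12.0, 9.0, 10.5, 11.2],
    }
)

# Overall summary of the target variable.
overall = toy_train["quality"].agg(["mean", "std", "median"])

# Summary of the target within each wine color.
by_color = toy_train.groupby("color")["quality"].agg(["count", "mean", "std"])

# Proportion of observations for each color.
color_share = toy_train["color"].value_counts(normalize=True)

# Count, mean, std, and quartiles for every numeric column.
numeric_summary = toy_train.describe()

print(overall)
print(by_color)
print(color_share)
print(numeric_summary)
```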

Models

To obtain the maximum points via the autograder, your submitted model must achieve a mean absolute error (MAE) below each of the following thresholds:

Test MAE: 0.5
Production MAE: 0.5
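
MAE is the average absolute difference between the predicted and the true quality. The sketch below shows how it could be computed with sklearn and compared against the threshold. The arrays are made-up values for illustration, and treating the threshold as a strict inequality is an assumption about how the autograder interprets it. With a fitted model, `y_test` and `model.predict(X_test)` would replace the toy arrays.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

MAE_THRESHOLD = 0.5  # threshold listed above

# Made-up true and predicted quality values (NOT real results).
y_true = np.array([5, 6, 7, 5])
y_pred = np.array([5.5, 6.0, 6.0, 5.5])

# Absolute errors are 0.5, 0.0, 1.0, 0.5, so their mean is 0.5.
mae = mean_absolute_error(y_true, y_pred)

# ASSUMPTION: the MAE must be strictly below the threshold.
beats_threshold = bool(mae < MAE_THRESHOLD)

print(f"MAE: {mae:.3f}, beats threshold: {beats_threshold}")
```

In this toy case the MAE equals the threshold exactly, so under the strict reading it would not count as beating it.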