AI Sommelier

Introduction

In this report, we investigate the feasibility of automating wine quality assessment by training a machine learning model (an “AI sommelier”) to replace traditional human sommeliers. The latter require years of expensive training and tend to make subjective judgments, while a machine learning model can be trained cheaply and efficiently to make quality predictions based on measurable properties of wines. To do so, we develop a wine-quality regressor that utilizes physicochemical readings collected from wines. As a proof of concept, we develop a model using data collected from red and white variants of the Portuguese “Vinho Verde” wine, a process that can be easily generalized to additional wines. Results indicate that an automated system is potentially viable but we note possible improvements and additional engineering work required to fully implement such a system.

Methods

To develop a wine-quality regressor, we utilize data from the UCI Machine Learning Repository and train machine learning models using scikit-learn.

Data

Each sample contains physicochemical (input features) and sensory (output target) data for a single “Vinho Verde” wine. The dataset was randomly partitioned into train and test sets.
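The loading-and-partitioning step can be sketched as follows. The small in-memory frames below stand in for the red and white "Vinho Verde" CSV files (the file names and `sep=";"` noted in the comment are assumptions about the UCI distribution, not part of this report):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frames standing in for the red/white "Vinho Verde" files
# (in practice, e.g., pd.read_csv("winequality-red.csv", sep=";")).
red = pd.DataFrame({"alcohol": [9.4, 9.8, 10.5], "quality": [5, 5, 6]})
white = pd.DataFrame({"alcohol": [8.8, 10.1, 11.0], "quality": [6, 6, 7]})

# Tag each sample with its color before combining the two variants.
red["color"] = "red"
white["color"] = "white"
wine = pd.concat([red, white], ignore_index=True)

# Randomly partition into train and test sets, holding out 20% of samples.
train, test = train_test_split(wine, test_size=0.2, random_state=42)
```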

Full documentation for the wine quality data can be found in the paper Modeling Wine Preferences by Data Mining from Physicochemical Properties. Descriptions of the variables utilized here can be found in the data dictionary below.

Data Dictionary

quality

  • [int64] the quality of the wine based on evaluation by a minimum of three sensory assessors (via blind tasting), who graded the wine on a scale from 0 (very bad) to 10 (excellent)

color

  • [object] the (human perceivable) color of the wine, red or white

fixed acidity

  • [float64] grams of tartaric acid per cubic decimeter

volatile acidity

  • [float64] grams of acetic acid per cubic decimeter

citric acid

  • [float64] grams of citric acid per cubic decimeter

residual sugar

  • [float64] grams of residual sugar per cubic decimeter

chlorides

  • [float64] grams of sodium chloride per cubic decimeter

free sulfur dioxide

  • [float64] milligrams of free sulfur dioxide per cubic decimeter

total sulfur dioxide

  • [float64] milligrams of total sulfur dioxide per cubic decimeter

density

  • [float64] the total density of the wine in grams per cubic centimeter

pH

  • [float64] the acidity of the wine measured using pH

sulphates

  • [float64] grams of potassium sulphate per cubic decimeter

alcohol

  • [float64] the alcohol content of the wine in percent by volume

Exploratory Data Analysis

Table 1 displays the distribution of Vinho Verde quality in the training data. We note that excellent (high quality) and poor (low quality) wines are rare compared to medium quality wine.

Quality   Count   Proportion
3         19      0.005
4         133     0.032
5         1385    0.333
6         1810    0.435
7         686     0.165
8         122     0.029
9         2       0.000
Table 1: Distribution of wine qualities for Vinho Verde wine in the training data.

Figure 1 visualizes the relationships between alcohol, chlorides, and wine quality. Quality trends upward with alcohol, but appears to vary little with chlorides. This suggests that some of the available features are useful predictors (alcohol), while others may be less informative (chlorides). However, we will (at least initially) consider all available features when modeling.

Figure 1: Relationship between alcohol, chlorides, and wine quality for (500 randomly selected) training samples.
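The feature-by-feature comparison behind Figure 1 can also be quantified with a simple correlation check. The values below are made up for illustration; in practice this would run on the training frame:

```python
import pandas as pd

# Illustrative subsample; values are fabricated for this sketch.
sample = pd.DataFrame({
    "alcohol":   [9.0, 9.5, 10.0, 10.5, 11.0, 11.5],
    "chlorides": [0.05, 0.04, 0.05, 0.04, 0.05, 0.04],
    "quality":   [3, 4, 5, 6, 7, 8],
})

# Pearson correlation of each feature with quality; a predictive feature
# (alcohol) shows a much larger magnitude than a weak one (chlorides).
corr = sample.corr()["quality"]
```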

Models

To develop a regression model for wine quality, we establish a pipeline using scikit-learn. Figure 2 summarizes the pipeline and the resultant model.

  • Numeric features were processed using an imputer and scaler.
  • The preprocessed data was then passed to a KNN regressor.

The pipeline was tuned via 5-fold cross-validation using GridSearchCV over a parameter grid that considered the following potential numbers of nearest neighbors \(k\):

k = [1, 5, 10, 20, 40, 80, 100]

Additionally, we consider both uniform and distance-based weighting of observations.
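The two weighting schemes differ in how the neighbors' targets are averaged: uniform weighting averages the k nearest targets equally, while distance weighting gives closer neighbors proportionally more influence (weight 1/distance in scikit-learn). A minimal synthetic illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D training data: quality rises linearly with the (scaled) feature.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([4.0, 5.0, 6.0, 7.0])

uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)
distance = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)

# Query at 0.25: neighbors are 0.0 (distance 0.25) and 1.0 (distance 0.75).
query = np.array([[0.25]])
pred_uniform = uniform.predict(query)[0]    # plain mean of 4.0 and 5.0
pred_distance = distance.predict(query)[0]  # pulled toward the closer target 4.0
```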

A tuned model is selected based on mean absolute error (MAE).

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[
                 ('preprocessor',
                  ColumnTransformer(transformers=[
                      ('numeric',
                       Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                       ('scaler', StandardScaler())]),
                       ['fixed acidity', 'volatile acidity', 'citric acid',
                        'residual sugar', 'chlorides', 'free sulfur dioxide',
                        'total sulfur dioxide', 'density', 'pH', 'sulphates',
                        'alcohol']),
                      ('categorical',
                       Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                       ('encoder', OneHotEncoder(sparse_output=False))]),
                       ['color'])])),
                 ('regressor', KNeighborsRegressor())]),
             n_jobs=1,
             param_grid={'regressor__n_neighbors': [1, 5, 10, 20, 40, 80, 100],
                         'regressor__weights': ['uniform', 'distance']},
             scoring='neg_mean_absolute_error')
Figure 2: The learned pipeline, including preprocessing and KNN regressor.

Results

As indicated in Figure 2 and shown in Table 2 and Table 3, the tuned (chosen) model is a KNN regressor with:

  • \(k = 10\) and
  • distance weighting.

k     weights   CV MAE   Standard Deviation
1     uniform   0.514    0.016
5     uniform   0.553    0.010
10    uniform   0.555    0.008
20    uniform   0.563    0.009
40    uniform   0.567    0.011
80    uniform   0.576    0.012
100   uniform   0.580    0.013
Table 2: Cross-validation results for values of \(k\) considered in the \(k\)-nearest neighbors regressor with uniform weighting.

k     weights    CV MAE   Standard Deviation
1     distance   0.514    0.016
5     distance   0.495    0.015
10    distance   0.490    0.013
20    distance   0.495    0.014
40    distance   0.499    0.014
80    distance   0.507    0.014
100   distance   0.510    0.014
Table 3: Cross-validation results for values of \(k\) considered in the \(k\)-nearest neighbors regressor with distance weighting.

Figure 3 demonstrates that we have considered models that span a reasonable range of model flexibilities, as indicated by the expected U-shaped curve for the distance weights. The uniform weights do not demonstrate a U-shaped curve, but we cannot fit a more flexible model than \(k = 1\).

Figure 3: Cross-validated MAE as a function of \(k\) for the KNN regressor.
Figure 4: Predicted versus actual wine quality for the test data.

Evaluating this tuned model on the held-out test data, we obtain a test MAE of 0.473. Figure 4 further describes the test performance using a scatterplot of predicted and actual wine qualities with a reference line for perfect prediction.
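The test metric is simply the mean absolute difference between predicted and true qualities. A sketch with hypothetical values (the arrays below are illustrative, not the actual test data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted qualities for a handful of test wines.
y_true = np.array([5, 6, 7, 5, 6])
y_pred = np.array([5.2, 5.8, 6.5, 5.1, 6.4])

# Mean of |prediction - truth|: (0.2 + 0.2 + 0.5 + 0.1 + 0.4) / 5 = 0.28.
mae = mean_absolute_error(y_true, y_pred)
```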

Discussion

This model has questionable performance. While a test MAE of 0.473 initially suggests a performant model, Figure 4 shows that this single metric does not reveal the full picture.

First, note that the target wine qualities are integer valued, while our predictions are floating point numbers, such as 7.3. This is technically a non-issue, as there are two simple ways to deal with this discrepancy:

  • Simply ignore it, as a wine quality of 7.3 is still easily interpretable despite existing qualities being integer valued.
  • Round or truncate predictions from our model to force them to be integer valued.
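The second option is a one-liner: round to the nearest integer and clip to the valid 0 to 10 scale. The predictions below are hypothetical:

```python
import numpy as np

# Hypothetical floating-point predictions from the regressor.
preds = np.array([7.3, 4.876, 5.502])

# Round to the nearest integer, then clip into the 0-10 quality range.
rounded = np.clip(np.rint(preds), 0, 10).astype(int)
```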

In either case, errors less than 1, or in this case less than 0.5, are considered quite small since the minimum difference in potential quality in the original data is 1. However, the test MAE of 0.473 looks at the average errors, while Figure 4 attempts to show all errors made in the test set. In particular, we should pay close attention to the predictions made for wines with a true quality that is low or high.

The following are the predictions made for wines with a true quality of 3.

5.502
5.132
6.041
5.139
5.282
5.395
4.876

The smallest of these is 4.876. Rounded or not, this is quite far from 3, and the gap is even larger for every other prediction; in other words, the model systematically overrates bad wines. The opposite pattern is observed for predictions of high quality wines, that is, the model systematically underrates those wines.

This pattern is largely hidden in the test MAE value because most wines in the data are of medium quality, as noted in Table 1. However, proper predictions for low quality wine (which consumers would want to avoid) and high quality wine (which consumers would want to seek) are likely much more practically impactful than properly predicting a medium quality wine.
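One way to surface this masking is to compute MAE separately for each true quality level. The test results below are fabricated to mimic the pattern in Figure 4 (many medium wines with small errors, a few extreme wines with large errors):

```python
import pandas as pd

# Fabricated test-set results mimicking the observed error pattern.
results = pd.DataFrame({
    "quality": [5, 5, 6, 6, 6, 6, 3, 8],
    "pred":    [5.2, 4.9, 6.1, 5.8, 6.2, 5.9, 5.1, 6.4],
})
results["abs_error"] = (results["pred"] - results["quality"]).abs()

# Overall MAE looks modest because medium wines dominate...
overall = results["abs_error"].mean()
# ...but grouping by true quality exposes the large errors at the extremes.
by_quality = results.groupby("quality")["abs_error"].mean()
```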

Given this unacceptable performance, we recommend further improvements to this proof-of-concept model before expanding to additional wines or putting this model into practice for Vinho Verde wine.

We further detail the benefits, risks, limitations, and potential improvements of the work presented in this report.

Benefits and Risks

If functional, replacing human sommeliers with machine learning, even in the limited capacity of rating wine quality, would provide numerous benefits. A startup that wants to publish automated wine ratings would save the cost and time needed to train or hire sommeliers. The size of that benefit depends on the cost to obtain, maintain, and operate the chemistry equipment needed to produce the input data, as well as any cost of obtaining the existing human sommelier ratings needed to train the models.

The risk of using a poorly performing model is essentially an existential threat to the core business proposition of developing an AI sommelier. If consumers purchase poor quality wine due to an ML model overrating such a wine, those consumers are unlikely to continue use of the product.

Limitations

The current model has several key limitations that constrain its practical application:

Data Limitations

  • The training data is limited to Vinho Verde wine from Portugal, which restricts its ability to generalize to other wine varieties, regions, and styles.
  • The imbalanced distribution of quality scores (Table 1) means the model has limited exposure to extreme quality wines (both excellent and poor), leading to systematic prediction errors at these extremes.

Model Performance Issues

  • As demonstrated, the model systematically overrates low-quality wines and underrates high-quality wines, which is particularly problematic for consumer decision-making.
  • The model treats all prediction errors equally, whereas in practice, incorrectly rating a poor wine as good has more severe consequences than modest errors on medium-quality wines.

Improvements

Several approaches could address these limitations:

  • Expanded data collection: Gather additional training data, particularly for rare quality classes, and expand to multiple wine varieties to improve model generalization.
  • Alternative models: Explore models better suited to imbalanced regression problems, such as weighted regression, ensemble methods, or ordinal classification approaches that explicitly model the ordered but discrete nature of quality ratings.
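The weighted-regression idea can be sketched with inverse-frequency sample weights. The quality values below are hypothetical, and note that this implies moving away from KNN, since KNeighborsRegressor.fit does not accept sample weights:

```python
import numpy as np

# Hypothetical training qualities exhibiting the imbalance seen in Table 1.
y_train = np.array([5, 5, 5, 6, 6, 6, 6, 3, 8])

# Weight each sample inversely to its quality's frequency, so that rare
# (very good / very bad) wines count for more during model fitting.
values, counts = np.unique(y_train, return_counts=True)
freq = dict(zip(values, counts))
weights = np.array([len(y_train) / (len(values) * freq[q]) for q in y_train])

# Many scikit-learn regressors accept these via fit(X, y, sample_weight=weights),
# e.g. tree ensembles such as RandomForestRegressor.
```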