Housing
Assignment
For Fall 2025 the Housing lab will be used as Lab 05.
- The model portion of the lab is due on Saturday, November 22.
- The report portion of the lab is due on Saturday, December 06.
Background
For many people, buying a home is the largest financial decision of their lives. A typical home purchase involves taking out a mortgage with a repayment schedule that can span 15 to 30 years.[1] For both buyers and sellers, understanding the fair market value of a property is crucial.
Traditionally, determining a home’s value required hiring a professional appraiser who would physically inspect the property and compare it to recent sales of similar homes. This process is time-consuming, expensive, and subjective.
In the 2000s, online real estate platforms began offering automated valuation models (AVMs) that could instantly estimate home values using machine learning. The most famous of these is Zillow’s Zestimate, which provides estimated market values for millions of homes across the United States.

These automated valuations have revolutionized real estate by:
- Providing instant price estimates to buyers and sellers.
- Helping homeowners track their property’s value over time.
- Democratizing access to market information.
However, building accurate home price prediction models is challenging because property values depend on dozens of factors: location, size, age, condition, features, and local market conditions. The goal of this lab is to develop a model that can predict home sale prices based on these characteristics.
Scenario and Goal
Who are you?
- You work as a data scientist for CornHawkHomes, an online real estate listing aggregator startup that hopes to compete with Zillow by focusing specifically on the real estate market in Iowa. Given the data available for this lab, you can assume that the current year is 2011.
What is your task?
- You will develop a model to predict (estimate) the sale price of a home from its features, such as size and number of bathrooms. CornHawkHomes will provide this estimate, branded as KernelEstimate, to users of its website as a competitor to Zillow’s Zestimate. Your goal is to create a model that minimizes the error of these predictions.
Who are you writing for?
- To summarize your work, you will write a report for your manager, who oversees data science at CornHawkHomes. They are knowledgeable about real estate, and have a basic understanding of data science and machine learning.
Data
To achieve the goal of this lab, we will need housing data from Ames, Iowa. The necessary data is provided in the following files:
- housing-train.parquet
- housing-test.parquet
Source
The data for this lab originally comes from the following publication:
- De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3). https://doi.org/10.1080/10691898.2011.11889627
However, the data from this paper has become a standard dataset in the machine learning community, and thus is also available via Kaggle.
You should not use that data directly, but instead use the data provided for this lab.
We have made modifications to the original data, including:
- Splitting the data into train, test, and production sets
- Withholding some data for the production data
Data Dictionary
Each observation in the train, test, and (hidden) production data contains information about a particular home in Ames, Iowa that sold between 2006 and 2010.
Original and complete documentation for this data can be found on Kaggle.
The variable descriptions listed below are available in the Markdown file variable-descriptions.md for ease of inclusion in reports.
Variable Descriptions
SalePrice
[int64] Sale price in USD.
Order
[int64] Observation number.
PID
[int64] Parcel identification number - can be used with city web site for parcel review.
MS SubClass
[int64] Identifies the type of dwelling involved in the sale.
MS Zoning
[object] Identifies the general zoning classification of the sale.
Lot Frontage
[float64] Linear feet of street connected to property.
Lot Area
[int64] Lot size in square feet.
Street
[object] Type of road access to property.
Alley
[object] Type of alley access to property.
Lot Shape
[object] General shape of property.
Land Contour
[object] Flatness of the property.
Utilities
[object] Type of utilities available.
Lot Config
[object] Lot configuration.
Land Slope
[object] Slope of property.
Neighborhood
[object] Physical locations within Ames city limits (map available).
Condition 1
[object] Proximity to various conditions.
Condition 2
[object] Proximity to various conditions (if more than one is present).
Bldg Type
[object] Type of dwelling.
House Style
[object] Style of dwelling.
Overall Qual
[int64] Rates the overall material and finish of the house.
Overall Cond
[int64] Rates the overall condition of the house.
Year Built
[int64] Original construction date.
Year Remod/Add
[int64] Remodel date (same as construction date if no remodeling or additions).
Roof Style
[object] Type of roof.
Roof Matl
[object] Roof material.
Exterior 1st
[object] Exterior covering on house.
Exterior 2nd
[object] Exterior covering on house (if more than one material).
Mas Vnr Type
[object] Masonry veneer type.
Mas Vnr Area
[float64] Masonry veneer area in square feet.
Exter Qual
[object] Evaluates the quality of the material on the exterior.
Exter Cond
[object] Evaluates the present condition of the material on the exterior.
Foundation
[object] Type of foundation.
Bsmt Qual
[object] Evaluates the height of the basement.
Bsmt Cond
[object] Evaluates the general condition of the basement.
Bsmt Exposure
[object] Refers to walkout or garden level walls.
BsmtFin Type 1
[object] Rating of basement finished area.
BsmtFin SF 1
[float64] Type 1 finished square feet.
BsmtFin Type 2
[object] Rating of basement finished area (if multiple types).
BsmtFin SF 2
[float64] Type 2 finished square feet.
Bsmt Unf SF
[float64] Unfinished square feet of basement area.
Total Bsmt SF
[float64] Total square feet of basement area.
Heating
[object] Type of heating.
Heating QC
[object] Heating quality and condition.
Central Air
[object] Central air conditioning.
Electrical
[object] Electrical system.
1st Flr SF
[int64] First Floor square feet.
2nd Flr SF
[int64] Second floor square feet.
Low Qual Fin SF
[int64] Low quality finished square feet (all floors).
Gr Liv Area
[int64] Above grade (ground) living area square feet.
Bsmt Full Bath
[float64] Basement full bathrooms.
Bsmt Half Bath
[float64] Basement half bathrooms.
Full Bath
[int64] Full bathrooms above grade.
Half Bath
[int64] Half baths above grade.
Bedroom AbvGr
[int64] Bedrooms above grade (does not include basement bedrooms).
Kitchen AbvGr
[int64] Kitchens above grade.
Kitchen Qual
[object] Kitchen quality.
TotRms AbvGrd
[int64] Total rooms above grade (does not include bathrooms).
Functional
[object] Home functionality (Assume typical unless deductions are warranted).
Fireplaces
[int64] Number of fireplaces.
Fireplace Qu
[object] Fireplace quality.
Garage Type
[object] Garage location.
Garage Yr Blt
[float64] Year garage was built.
Garage Finish
[object] Interior finish of the garage.
Garage Cars
[float64] Size of garage in car capacity.
Garage Area
[float64] Size of garage in square feet.
Garage Qual
[object] Garage quality.
Garage Cond
[object] Garage condition.
Paved Drive
[object] Paved driveway.
Wood Deck SF
[int64] Wood deck area in square feet.
Open Porch SF
[int64] Open porch area in square feet.
Enclosed Porch
[int64] Enclosed porch area in square feet.
3Ssn Porch
[int64] Three season porch area in square feet.
Screen Porch
[int64] Screen porch area in square feet.
Pool Area
[int64] Pool area in square feet.
Pool QC
[object] Pool quality.
Fence
[object] Fence quality.
Misc Feature
[object] Miscellaneous feature not covered in other categories.
Misc Val
[int64] Value of miscellaneous feature.
Mo Sold
[int64] Month Sold (MM).
Yr Sold
[int64] Year Sold (YYYY).
Sale Type
[object] Type of sale.
Sale Condition
[object] Condition of sale.
Data in Python
To load the data in Python, use:
import pandas as pd

# load the train and test data from the course site
housing_train = pd.read_parquet(
    "https://lab.cs307.org/housing/data/housing-train.parquet",
)
housing_test = pd.read_parquet(
    "https://lab.cs307.org/housing/data/housing-test.parquet",
)
Prepare Data for Machine Learning
Create the X and y variants of the data for use with sklearn:
# create X and y for train
X_train = housing_train.drop("SalePrice", axis=1)
y_train = housing_train["SalePrice"]
# create X and y for test
X_test = housing_test.drop("SalePrice", axis=1)
y_test = housing_test["SalePrice"]
You can assume that within the autograder, similar processing is performed on the production data.
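Because the features mix numeric columns (some with missing values, such as Lot Frontage) and categorical columns (such as Neighborhood), a model usually needs a preprocessing step before it can be fit. The following is a minimal sketch of one way to do that with an sklearn pipeline, not the expected solution. It runs on a small invented stand-in (`X_toy`, `y_toy`) rather than the lab data; the values, the imputation strategies, and the choice of a random forest are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy data that borrows a few column names from the data dictionary.
rng = np.random.default_rng(42)
n = 200
X_toy = pd.DataFrame(
    {
        "Gr Liv Area": rng.integers(600, 3000, size=n),
        "Overall Qual": rng.integers(1, 11, size=n),
        "Lot Frontage": rng.normal(70, 15, size=n),
        "Neighborhood": rng.choice(["NAmes", "CollgCr", "OldTown"], size=n),
    }
)
X_toy.loc[::10, "Lot Frontage"] = np.nan  # simulate missing values
y_toy = (
    50 * X_toy["Gr Liv Area"]
    + 15_000 * X_toy["Overall Qual"]
    + rng.normal(0, 5_000, size=n)
)

# Categorical columns: fill missing values with a placeholder, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Numeric columns: fill missing values with the column median.
preprocessor = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), make_column_selector(dtype_include=np.number)),
        ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
    ]
)

# Preprocessing and model in one estimator, so the same steps apply at predict time.
model = make_pipeline(
    preprocessor,
    RandomForestRegressor(n_estimators=50, random_state=42),
)
model.fit(X_toy.iloc[:150], y_toy.iloc[:150])
preds = model.predict(X_toy.iloc[150:])
```

Keeping the preprocessing inside the pipeline means whatever was learned from the training data (medians, category levels) is reused unchanged on test or production data.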
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
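The specific statistics requested are listed on PrairieLearn and are not reproduced here. As a rough sketch of the kinds of pandas calls that tend to be useful for this step, the example below summarizes a tiny invented DataFrame (`toy`) that borrows a few column names from the data dictionary; the values are made up for illustration.

```python
import pandas as pd

# Invented toy data; not the lab data.
toy = pd.DataFrame(
    {
        "SalePrice": [150_000, 200_000, 250_000, 300_000],
        "Gr Liv Area": [1000, 1500, 2000, 2500],
        "Lot Frontage": [60.0, None, 80.0, None],
        "Central Air": ["Y", "Y", "N", "Y"],
    }
)

# Count, mean, standard deviation, and quartiles of the target.
price_summary = toy["SalePrice"].describe()

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Mean sale price within each level of a categorical feature.
mean_price_by_air = toy.groupby("Central Air")["SalePrice"].mean()

print(price_summary)
print(missing_counts)
print(mean_price_by_air)
```

The same calls should work on `housing_train` once it is loaded, with the column names swapped for whichever statistics are requested.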
Models
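To obtain the maximum points via the autograder, your submitted model must outperform each of the following thresholds (lower is better for MAPE; higher is better for the proportion of predictions within 20% of sale price):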
- Test MAPE: 0.085
- Production MAPE: 0.085
- Test Proportion of Predictions Within 20% of Sale Price: 0.92
- Production Proportion of Predictions Within 20% of Sale Price: 0.92
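As a reference point for the MAPE thresholds above, the sketch below computes mean absolute percentage error on three invented prices, once with sklearn's `mean_absolute_percentage_error` and once by hand. Both return a fraction rather than a percentage, which matches the scale of the 0.085 threshold. The arrays `y_true` and `y_pred` are made-up values, not lab data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Invented sale prices and predictions for illustration.
y_true = np.array([100_000, 200_000, 400_000])
y_pred = np.array([110_000, 180_000, 400_000])

# MAPE via sklearn, returned as a fraction (not multiplied by 100).
mape_sklearn = mean_absolute_percentage_error(y_true, y_pred)

# The same quantity by hand: mean of |error| / |true value|.
# Per-home errors here are 10%, 10%, and 0%.
mape_manual = np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

print(mape_sklearn, mape_manual)  # both approximately 0.0667
```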
For the two proportion metrics, the following function may be used to assist in calculations.
import numpy as np

# proportion of predictions whose absolute percentage error is at most threshold
def proportion_within_threshold(y_true, y_pred, threshold=0.20):
    return np.mean((np.abs(y_pred - y_true) / y_true) <= threshold)
If you would like to tune with this metric, you must create a scorer using exactly this function; otherwise, the autograder will raise an error.
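One way to build such a scorer is sklearn's `make_scorer`, which wraps a metric function so it can be passed as `scoring` during tuning. The sketch below is only an illustration under stated assumptions: the toy data (`X_toy`, `y_toy`), the `Ridge` model, and the `alpha` grid are invented, and nothing here is confirmed to be what the autograder expects beyond reusing the function above unchanged.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV


def proportion_within_threshold(y_true, y_pred, threshold=0.20):
    return np.mean((np.abs(y_pred - y_true) / y_true) <= threshold)


# Wrap the metric as a scorer; higher values are better for this metric.
within_20_scorer = make_scorer(proportion_within_threshold, greater_is_better=True)

# Invented toy data: one size-like feature and a roughly linear price.
rng = np.random.default_rng(0)
X_toy = rng.uniform(500, 3000, size=(120, 1))
y_toy = 100 * X_toy[:, 0] + rng.normal(0, 10_000, size=120)

# Tune a hypothetical hyperparameter grid using the custom scorer.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    scoring=within_20_scorer,
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

Here `best_score_` is the mean cross-validated proportion of predictions within 20% of the true value for the selected `alpha`.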
Footnotes
[1] Recently, there have been suggestions that mortgage terms be extended to 50 years. Such an extension could lower monthly payments by a small, potentially insignificant amount while doubling the interest paid over the term of the loan.