Genetics

Assignment

For Fall 2025, the Genetics lab is a potential Final Project.

Before submitting your report, be sure to review the Project Policy page.

Canvas: Final Project

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.

Your report is due on Tuesday, December 16.

Background

Cancer detection is an unfortunate but important reality. Early detection can significantly improve survival. One consistently researched possibility is the use of genetic information to push detection earlier and earlier. Screening for BRCA mutations is an example of a simple genetic test that can help better estimate the probability of developing breast cancer.

Downstream of DNA itself, gene expression can give insight into the effects of DNA on phenotypical outcomes.

Next-generation sequencing is a constantly evolving set of technologies that can measure gene expression. As these technologies become cheaper to use, and more readily available, they can potentially be used as part of the process of detecting and identifying cancers.

Scenario and Goal

Who are you?

You are a data scientist working for a small biotechnology startup.

What is your task?

You are asked to begin exploring the possibility of developing a “universal” cancer detection and classification model, given gene expression data collected via next-generation sequencing such as RNA-Seq. Your goal is not to create a product that is immediately useful, but instead to work towards a proof of concept.

Who are you writing for?

To summarize your work, you will write a report for your manager, who, in this case, is the CEO and founder of the startup. You can assume your manager is very familiar with biology and related technologies, and reasonably familiar with the general concepts of machine learning. They have worked with groups who have placed machine learning models into practice in the past.

Data

To achieve the goal of this lab, we will need gene expression and clinical outcome data. The necessary data is provided in the following file:

Full Data: genetics.parquet

Source

The underlying source of this data is The Cancer Genome Atlas Pan-Cancer Analysis Project. The data was accessed via synapse.org.

The specific data for this lab was collected and modified based on a submission to the UC Irvine Machine Learning Repository.

UCI MLR: Gene Expression Cancer RNA-Seq

We are providing a modified version of this data for this lab. Modifications include:

- Limiting the number of gene expression features to 2000.
- Withholding some data that will be considered the production data.

Data Dictionary

Each observation in the data contains clinical and gene expression information from a tissue sample of a cancer patient.

The variable descriptions listed below are available in the Markdown file variable-descriptions.md for ease of inclusion in reports.

Variable Descriptions

Response

cancer

[object] the clinically determined cancer type, one of:
- BRCA: Breast Invasive Carcinoma
- PRAD: Prostate Adenocarcinoma
- KIRC: Kidney Renal Clear Cell Carcinoma
- LUAD: Lung Adenocarcinoma
- COAD: Colon Adenocarcinoma

Features

gene_####

[float64] gene expression (for gene number #### in the dataset) quantification as measured by an Illumina HiSeq platform

Data in Python

To load the data in Python, use:

import pandas as pd

genetics = pd.read_parquet(
    "https://lab.cs307.org/genetics/data/genetics.parquet",
)
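
Once the data is loaded, a common first step is to separate the response from the features. The following is a minimal sketch of one way to do that, assuming the column layout described in the data dictionary (a `cancer` column plus columns whose names begin with `gene_`). The `toy` data frame, its three gene column names, its random values, and the helper `split_response_features` are all hypothetical stand-ins so the example runs without downloading anything. With the real data, you would pass `genetics` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: same column layout, made-up values.
rng = np.random.default_rng(0)
cancer_types = ["BRCA", "PRAD", "KIRC", "LUAD", "COAD"]
toy = pd.DataFrame(
    rng.normal(size=(10, 3)),
    columns=["gene_0001", "gene_0002", "gene_0003"],
)
toy.insert(0, "cancer", cancer_types * 2)


def split_response_features(df):
    """Return (X, y): the gene expression columns and the cancer labels."""
    gene_cols = [col for col in df.columns if col.startswith("gene_")]
    return df[gene_cols], df["cancer"]


X, y = split_response_features(toy)

print(X.shape)           # number of samples and number of gene features
print(y.value_counts())  # number of samples per cancer type
```

Checking the class counts early is useful here because the response has five categories, and any imbalance among them affects how classification metrics should be read.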