Genetics
Assignment
For Fall 2025 the Genetics lab will be used as a potential Final Project.
- Your report is due on Tuesday, December 16.
Background
Cancer detection is an unfortunate but important reality. Early detection can significantly improve survival. One consistently researched possibility is the use of genetic information to push detection earlier and earlier. The BRCA mutation is an example of a simple genetic screening that can help better estimate the probability of developing breast cancer.
Downstream of DNA itself, gene expression can give insight into the effects of DNA on phenotypical outcomes.
Next-generation sequencing is a constantly evolving set of technologies that can measure gene expression. As these technologies become cheaper to use, and more readily available, they can potentially be used as part of the process of detecting and identifying cancers.
Scenario and Goal
Who are you?
- You are a data scientist working for a small biotechnology startup.
What is your task?
- You have been asked to begin exploring the possibility of developing a “universal” cancer detection and classification model, given gene expression data collected via next-generation sequencing such as RNA-Seq. Your goal is not to create a product that is immediately useful, but instead to work towards a proof of concept.
Who are you writing for?
- To summarize your work, you will write a report for your manager, who, in this case, is the CEO and founder of the startup. You can assume your manager is very familiar with biology and related technologies, and reasonably familiar with the general concepts of machine learning. They have worked with groups who have placed machine learning models into practice in the past.
Data
To achieve the goal of this lab, we will need gene expression and clinical outcome data. The necessary data is provided in the following files:
Source
The underlying source of this data is The Cancer Genome Atlas Pan-Cancer Analysis Project. The data was accessed via synapse.org.
The specific data for this lab was collected and modified based on a submission to the UCI Machine Learning Repository.
We are providing a modified version of this data for this lab. Modifications include:
- Limiting the number of gene expression features to 2000.
- Withholding some data that will be considered the production data.
Data Dictionary
Each observation in the data contains clinical and gene expression information from a tissue sample of a cancer patient.
The variable descriptions listed below are available in the Markdown file variable-descriptions.md for ease of inclusion in reports.
Variable Descriptions
Response
cancer
[object] the clinically determined cancer type, one of:
- BRCA: Breast Invasive Carcinoma
- PRAD: Prostate Adenocarcinoma
- KIRC: Kidney Renal Clear Cell Carcinoma
- LUAD: Lung Adenocarcinoma
- COAD: Colon Adenocarcinoma
Features
gene_####
[float64] gene expression (for gene number #### in the dataset) quantification as measured by an Illumina HiSeq platform
Data in Python
To load the data in Python, use:
import pandas as pd

genetics = pd.read_parquet(
    "https://lab.cs307.org/genetics/data/genetics.parquet",
)
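Once the data is loaded, a common first step is to separate the gene expression features from the response and check how the cancer types are distributed. The following is a minimal sketch of that step, not a required approach. The helper name `split_features_response` is hypothetical, and the small `toy` frame uses made-up column names and random values that only mimic the `cancer` and `gene_####` layout from the data dictionary, so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd


def split_features_response(df: pd.DataFrame, response: str = "cancer"):
    """Separate gene expression features from the response column.

    Hypothetical helper: assumes feature columns start with "gene_",
    following the gene_#### pattern in the data dictionary.
    """
    gene_columns = [col for col in df.columns if col.startswith("gene_")]
    X = df[gene_columns]
    y = df[response]
    return X, y


# Synthetic stand-in data (random values, made-up gene column names).
# In the lab, pass the loaded `genetics` frame instead of `toy`.
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    rng.random((6, 3)),
    columns=["gene_0001", "gene_0002", "gene_0003"],
)
toy["cancer"] = ["BRCA", "PRAD", "KIRC", "LUAD", "COAD", "BRCA"]

X, y = split_features_response(toy)

# Number of samples per cancer type; useful for spotting class imbalance.
class_counts = y.value_counts()

print(X.shape)
print(class_counts)
```

Selecting columns by name rather than by position means the sketch does not depend on where the `cancer` column sits in the file. With the real data, the same call would be `X, y = split_features_response(genetics)`.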