Tweets
Assignment
For Fall 2025, the Tweets lab will serve as a potential Final Project.
- Your report is due on Tuesday, December 16.
Background
Air travel can be a miserable experience. Travelers have a habit of taking to the platform formerly known as Twitter to complain and seek support from customer service. As such, airlines likely employ machine learning models, in addition to customer service representatives, to help efficiently process these communications.
Scenario and Goal
Who are you?
- You are a data scientist working for the social team of a major US airline in 2015.
What is your task?
- You are tasked with building a sentiment classifier that alerts customer service representatives to respond to negative tweets about the airline and allows positive tweets to be automatically acknowledged. Your goal is to develop a model that accurately classifies tweets as negative, neutral, or positive.
Who are you writing for?
- To summarize your work, you will write a report for your manager, who manages the social media team. You can assume your manager is very familiar with the platform formerly known as Twitter, and somewhat familiar with the general concepts of machine learning.
Data
To achieve the goal of this lab, we will need previous tweets and their sentiment. The necessary data is provided in the tweets.parquet file loaded in the Data in Python section below.
Source
The data for this lab originally comes from Kaggle.
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).
We are providing a modified version of this data for this lab. Modifications include:
- Keeping only the airline_sentiment, text, and airline variables.
- Withholding some data that will be considered the production data.
Data Dictionary
Each observation in the data contains information about a particular tweet.
The variable descriptions listed below are available in the Markdown file variable-descriptions.md for ease of inclusion in reports.
Variable Descriptions
Response
sentiment
[object] the sentiment of the tweet. One of negative, neutral, or positive.
Features
airline
[object] the airline the tweet was “sent” to.
text
[object] the full text of the tweet.
Data in Python
To load the data in Python, use:
import pandas as pd

tweets = pd.read_parquet(
    "https://lab.cs307.org/tweets/data/tweets.parquet",
)
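Before any modeling, it is worth a quick sanity check of what was loaded. A minimal sketch, assuming the column names match the data dictionary above (sentiment, airline, and text):

# check the number of tweets and the balance of the response
print(tweets.shape)
print(tweets["sentiment"].value_counts(normalize=True))
print(tweets["airline"].value_counts())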
Text Processing

To use the text of the tweets as input to machine learning models, you will need to do some preprocessing. The text cannot simply be input into the models we have seen.
tweet_text = tweets["text"]
tweet_text

0 @VirginAmerica What @dhepburn said.
1 @VirginAmerica plus you've added commercials t...
2 @VirginAmerica I didn't today... Must mean I n...
3 @VirginAmerica it's really aggressive to blast...
4 @VirginAmerica and it's a really big bad thing...
...
14635 @AmericanAir thank you we got on a different f...
14636 @AmericanAir leaving over 20 minutes Late Flig...
14637 @AmericanAir Please bring American Airlines to...
14638 @AmericanAir you have my money, you change my ...
14639 @AmericanAir we have 8 ppl so we need 2 know h...
Name: text, Length: 14640, dtype: object
To do so, one strategy is to create a so-called bag-of-words representation. Let’s see what that looks like with a small set of strings.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
word_counter = CountVectorizer()
word_counts = word_counter.fit_transform(
    [
        "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo",
        "The quick brown fox jumps over the lazy dog",
        "",
    ]
).todense()
print(word_counts)

[[0 8 0 0 0 0 0 0 0]
[1 0 1 1 1 1 1 1 2]
[0 0 0 0 0 0 0 0 0]]
pd.DataFrame(
    word_counts,
    columns=sorted(list(word_counter.vocabulary_.keys())),
)

| | brown | buffalo | dog | fox | jumps | lazy | over | quick | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Essentially, we’ve created a number of feature variables, each counting how many times a particular word from the vocabulary appears in a sample’s text. This is an example of feature engineering.
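One caveat before scaling this up: a bag-of-words keeps only counts, so word order is discarded. A small sketch (hypothetical sentences, reusing CountVectorizer from above) shows that two strings with very different meanings can produce identical rows:

# word order is lost: both sentences contain the same words the same number of times
order_demo = CountVectorizer().fit_transform(
    [
        "the dog chased the cat",
        "the cat chased the dog",
    ]
).todense()
print(order_demo)  # both rows are [1 1 1 2] for the columns cat, chased, dog, the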
Let’s find the 100 most common words in the training tweets directed at the airlines.
top_100_counter = CountVectorizer(max_features=100)
tweet_text_100 = top_100_counter.fit_transform(tweet_text)
print("Top 100 Words:")
print(top_100_counter.get_feature_names_out())
print("")Top 100 Words:
['about' 'after' 'again' 'airline' 'all' 'am' 'americanair' 'amp' 'an'
'and' 'any' 'are' 'as' 'at' 'back' 'bag' 'be' 'been' 'but' 'by' 'call'
'can' 'cancelled' 'co' 'customer' 'delayed' 'do' 'don' 'flight'
'flightled' 'flights' 'for' 'from' 'gate' 'get' 'got' 'had' 'has' 'have'
'help' 'hold' 'hour' 'hours' 'how' 'http' 'if' 'in' 'is' 'it' 'jetblue'
'just' 'late' 'like' 'me' 'more' 'my' 'need' 'no' 'not' 'now' 'of' 'on'
'one' 'or' 'our' 'out' 'over' 'phone' 'plane' 'please' 'service' 'so'
'southwestair' 'still' 'thank' 'thanks' 'that' 'the' 'there' 'they'
'this' 'time' 'to' 'today' 'united' 'up' 'us' 'usairways' 've'
'virginamerica' 'was' 'we' 'what' 'when' 'why' 'will' 'with' 'would'
'you' 'your']
tweet_text_100_dense = tweet_text_100.todense()
tweet_text_100_dense

matrix([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 1, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 2, 1],
[0, 0, 0, ..., 0, 0, 0]], shape=(14640, 100))
tweet_text_100.shape

(14640, 100)
plane_idx = np.where(top_100_counter.get_feature_names_out() == "plane")
plane_count = np.sum(tweet_text_100.todense()[:, plane_idx])
print('The Word "plane" Appears:', plane_count)

The Word "plane" Appears: 638
Note that you’ll need to do this same process, but within a pipeline! You will also need to consider other techniques to process text for input to models. A bag-of-words encoding might not be sufficient.
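For example, a pipeline can wrap the text vectorizer and a classifier together so that the vocabulary is learned only from the training tweets. A minimal sketch, assuming the response column is named sentiment as in the data dictionary, and using logistic regression only as a placeholder model (the vectorizer settings and split parameters are illustrative, not recommendations):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# hold out some of the provided tweets to estimate performance
X_train, X_test, y_train, y_test = train_test_split(
    tweets["text"],
    tweets["sentiment"],
    test_size=0.2,
    random_state=42,
)

# the vectorizer lives inside the pipeline, so it is fit only on the training tweets
sentiment_pipeline = Pipeline(
    [
        ("bag_of_words", CountVectorizer(max_features=1000, stop_words="english")),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

sentiment_pipeline.fit(X_train, y_train)
print(sentiment_pipeline.score(X_test, y_test))

Because the CountVectorizer is a pipeline step, calling predict on new tweets (for example, the withheld production data) reuses the vocabulary learned during fitting rather than refitting it.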
Additional information: