Tweets
Assignment
For Fall 2025, the Tweets lab will serve as a potential Final Project.
- Your report is due on Tuesday, December 16.
Background
Air travel can be a miserable experience. Travelers have a habit of taking to the platform formerly known as Twitter to complain and seek support from customer service. As such, airlines likely employ machine learning models, in addition to customer service representatives, to help efficiently process these communications.
Scenario and Goal
Who are you?
- You are a data scientist working for the social team of a major US airline in 2015.
What is your task?
- You are tasked with building a sentiment classifier that alerts customer service representatives to respond to negative tweets about the airline and allows positive tweets to be automatically acknowledged. Your goal is to develop a model that accurately classifies tweets as negative, neutral, or positive.
Who are you writing for?
- To summarize your work, you will write a report for your manager, who manages the social media team. You can assume your manager is very familiar with the platform formerly known as Twitter, and somewhat familiar with the general concepts of machine learning.
Data
To achieve the goal of this lab, we will need previous tweets and their sentiment. The necessary data is provided in the tweets.parquet file loaded in the Data in Python section below.
Source
The data for this lab originally comes from Kaggle.
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).
We are providing a modified version of this data for this lab. Modifications include:
- Keeping only the airline_sentiment, text, and airline variables.
- Withholding some data that will be considered the production data.
Data Dictionary
Each observation in the data contains information about a particular tweet.
The variable descriptions listed below are available in the Markdown file variable-descriptions.md for ease of inclusion in reports.
Variable Descriptions
Response
sentiment
[object] the sentiment of the tweet. One of negative, neutral, or positive.
Features
airline
[object] the airline the tweet was “sent” to.
text
[object] the full text of the tweet.
Data in Python
To load the data in Python, use:
import pandas as pd

tweets = pd.read_parquet(
    "https://lab.cs307.org/tweets/data/tweets.parquet",
)
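Before any modeling, it is worth a quick sanity check of what was loaded. A minimal sketch, assuming the column names match the data dictionary above (sentiment, airline, and text):

# check the number of tweets and the balance of the response
print(tweets.shape)
print(tweets["sentiment"].value_counts(normalize=True))
print(tweets["airline"].value_counts())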
Text Processing

To use the text of the tweets as input to machine learning models, you will need to do some preprocessing. The text cannot simply be input into the models we have seen.
tweet_text = tweets["text"]
tweet_text

0 @VirginAmerica What @dhepburn said.
1 @VirginAmerica plus you've added commercials t...
2 @VirginAmerica I didn't today... Must mean I n...
3 @VirginAmerica it's really aggressive to blast...
4 @VirginAmerica and it's a really big bad thing...
...
14635 @AmericanAir thank you we got on a different f...
14636 @AmericanAir leaving over 20 minutes Late Flig...
14637 @AmericanAir Please bring American Airlines to...
14638 @AmericanAir you have my money, you change my ...
14639 @AmericanAir we have 8 ppl so we need 2 know h...
Name: text, Length: 14640, dtype: object
To do so, one strategy is to create a so-called bag-of-words representation. Let’s see what that looks like with a small set of strings.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
word_counter = CountVectorizer()
word_counts = word_counter.fit_transform(
    [
        "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo",
        "The quick brown fox jumps over the lazy dog",
        "",
    ]
).todense()
print(word_counts)

[[0 8 0 0 0 0 0 0 0]
[1 0 1 1 1 1 1 1 2]
[0 0 0 0 0 0 0 0 0]]
pd.DataFrame(
    word_counts,
    columns=sorted(list(word_counter.vocabulary_.keys())),
)

| | brown | buffalo | dog | fox | jumps | lazy | over | quick | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Essentially, we’ve created a number of feature variables, each counting how many times a particular word from the vocabulary appears in a sample’s text. This is an example of feature engineering.
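One caveat before scaling this up: a bag-of-words keeps only counts, so word order is discarded. A small sketch (hypothetical sentences, reusing CountVectorizer from above) shows that two strings with very different meanings can produce identical rows:

# word order is lost: both sentences contain the same words the same number of times
order_demo = CountVectorizer().fit_transform(
    [
        "the dog chased the cat",
        "the cat chased the dog",
    ]
).todense()
print(order_demo)  # both rows are [1 1 1 2] for the columns cat, chased, dog, the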
Let’s find the 100 most common words in the training tweets directed at the airlines.
top_100_counter = CountVectorizer(max_features=100)
tweet_text_100 = top_100_counter.fit_transform(tweet_text)
print("Top 100 Words:")
print(top_100_counter.get_feature_names_out())
print("")Top 100 Words:
['about' 'after' 'again' 'airline' 'all' 'am' 'americanair' 'amp' 'an'
'and' 'any' 'are' 'as' 'at' 'back' 'bag' 'be' 'been' 'but' 'by' 'call'
'can' 'cancelled' 'co' 'customer' 'delayed' 'do' 'don' 'flight'
'flightled' 'flights' 'for' 'from' 'gate' 'get' 'got' 'had' 'has' 'have'
'help' 'hold' 'hour' 'hours' 'how' 'http' 'if' 'in' 'is' 'it' 'jetblue'
'just' 'late' 'like' 'me' 'more' 'my' 'need' 'no' 'not' 'now' 'of' 'on'
'one' 'or' 'our' 'out' 'over' 'phone' 'plane' 'please' 'service' 'so'
'southwestair' 'still' 'thank' 'thanks' 'that' 'the' 'there' 'they'
'this' 'time' 'to' 'today' 'united' 'up' 'us' 'usairways' 've'
'virginamerica' 'was' 'we' 'what' 'when' 'why' 'will' 'with' 'would'
'you' 'your']
tweet_text_100_dense = tweet_text_100.todense()
tweet_text_100_dense

matrix([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 1, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 2, 1],
[0, 0, 0, ..., 0, 0, 0]], shape=(14640, 100))
tweet_text_100.shape

(14640, 100)
plane_idx = np.where(top_100_counter.get_feature_names_out() == "plane")
plane_count = np.sum(tweet_text_100.todense()[:, plane_idx])
print('The Word "plane" Appears:', plane_count)

The Word "plane" Appears: 638
Note that you’ll need to do this same process, but within a pipeline! You will also need to consider other techniques to process text for input to models. A bag-of-words encoding might not be sufficient.
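For example, a pipeline can wrap the text vectorizer and a classifier together so that the vocabulary is learned only from the training tweets. A minimal sketch, assuming the response column is named sentiment as in the data dictionary, and using logistic regression only as a placeholder model (the vectorizer settings and split parameters are illustrative, not recommendations):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# hold out some of the provided tweets to estimate performance
X_train, X_test, y_train, y_test = train_test_split(
    tweets["text"],
    tweets["sentiment"],
    test_size=0.2,
    random_state=42,
)

# the vectorizer lives inside the pipeline, so it is fit only on the training tweets
sentiment_pipeline = Pipeline(
    [
        ("bag_of_words", CountVectorizer(max_features=1000, stop_words="english")),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

sentiment_pipeline.fit(X_train, y_train)
print(sentiment_pipeline.score(X_test, y_test))

Because the CountVectorizer is a pipeline step, calling predict on new tweets (for example, the withheld production data) reuses the vocabulary learned during fitting rather than refitting it.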
Additional information: