TrillionDollarWords

Documentation for TrillionDollarWords.

TrillionDollarWords.BaselineModelMethod
(mod::BaselineModel)(atomic_model::HGFRobertaForSequenceClassification, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the logits.

source
TrillionDollarWords.BaselineModelMethod
(mod::BaselineModel)(atomic_model::HGFRobertaModel, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the embeddings.
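
These two methods take the underlying Transformers.jl model explicitly. A minimal sketch, assuming load_head is a keyword argument and that the BaselineModel stores the atomic model in a field (the field name model below is hypothetical):

    using TrillionDollarWords

    mod = load_model(; load_head=true)
    queries = ["Inflation remains elevated."]

    # Hypothetical field name for the underlying HGF model:
    logits = mod(mod.model, queries)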

source
TrillionDollarWords.BaselineModelMethod
(mod::BaselineModel)(queries::Vector{String})

Computes a forward pass of the model on the given queries and returns either the logits or the embeddings, depending on whether the model was loaded with the classification head.
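
For example (a minimal sketch; whether logits or embeddings come back depends on how the model was loaded):

    using TrillionDollarWords

    mod = load_model(; load_head=true)
    out = mod(["The Committee decided to raise the target range."])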

source
TrillionDollarWords.get_embeddingsMethod
get_embeddings(atomic_model::HGFRobertaForSequenceClassification, tokens::NamedTuple)

Extends the get_embeddings function to HGFRobertaForSequenceClassification. Performs a forward pass through the model to obtain the embeddings, then passes them through the classification head and returns the activations going into the final linear layer.
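
A hedged sketch of calling this directly; encoding queries via Transformers.jl and the BaselineModel field names used below (tokenizer, model) are assumptions, and the callable interface above handles this in practice:

    using TrillionDollarWords
    using Transformers.TextEncoders: encode

    m = load_model(; load_head=true)

    # Hypothetical field names for the tokenizer and the atomic model:
    tokens = encode(m.tokenizer, ["Inflation remains elevated."])
    emb = TrillionDollarWords.get_embeddings(m.model, tokens)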

source
TrillionDollarWords.layerwise_activationsMethod
layerwise_activations(mod::BaselineModel, queries::DataFrame)

Computes a forward pass of the model on the given queries and returns the layerwise activations in a DataFrame where activations are uniquely identified by the sentence_id. If output_hidden_states=false was passed to load_model (default), only the last layer is returned. If output_hidden_states=true was passed to load_model, all layers are returned. The layer column indicates the layer number.

Each single activation receives its own cell to make it possible to save the output to a CSV file.
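
A minimal sketch, assuming the queries DataFrame needs at least sentence_id and sentence columns (the exact required columns are an assumption):

    using DataFrames, TrillionDollarWords

    mod = load_model(; output_hidden_states=true)  # return all layers
    queries = DataFrame(
        sentence_id = 1:2,
        sentence = ["Inflation remains elevated.", "The labor market is strong."],
    )
    acts = layerwise_activations(mod, queries)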

source
TrillionDollarWords.layerwise_activationsMethod
layerwise_activations(mod::BaselineModel, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the layerwise activations for the HGFRobertaModel. If output_hidden_states=false was passed to load_model (default), only the last layer is returned. If output_hidden_states=true was passed to load_model, all layers are returned. If the model is loaded with the head for classification, the activations going into the final linear layer are returned.
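
For example:

    using TrillionDollarWords

    mod = load_model()
    acts = layerwise_activations(mod, ["Inflation remains elevated."])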

source
TrillionDollarWords.load_all_dataMethod
load_all_data()

Load the combined dataset from the artifact. This dataset combines all sentences and the market data used in the paper.

  • The sentence_id column is the unique identifier of the sentence.
  • The doc_id column is the unique identifier of the document.
  • The date column is the date of the event.
  • The event_type column is the type of event (meeting minutes, speech, or press conference).
  • The labels in label are predicted by the model proposed in the paper.

We use the RoBERTa-large model finetuned on the combined data to label all the filtered sentences in the meeting minutes, speeches, and press conferences.

  • The sentence column is the sentence itself.
  • The score column is the softmax probability of the label.
  • The speaker column is the speaker of the sentence (if applicable).
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
  • The maturity column is the maturity of the UST (if applicable).
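
For example, to load the combined data and inspect it:

    using TrillionDollarWords

    df = load_all_data()
    names(df)      # available columns
    first(df, 5)   # first few rows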
source
TrillionDollarWords.load_all_sentencesMethod
load_all_sentences()

Load the dataset with all sentences from the artifact. This is the complete dataset with sentences from press conferences, meeting minutes, and speeches.

  • The sentence_id column is the unique identifier of the sentence.
  • The doc_id column is the unique identifier of the document.
  • The date column is the date of the event.
  • The event_type column is the type of event (meeting minutes, speech, or press conference).
  • The labels in label are predicted by the model proposed in the paper.

We use the RoBERTa-large model finetuned on the combined data to label all the filtered sentences in the meeting minutes, speeches, and press conferences.

  • The sentence column is the sentence itself.
  • The score column is the softmax probability of the label.
  • The speaker column is the speaker of the sentence (if applicable).
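
For example, to count sentences per predicted label (a sketch using DataFrames.jl):

    using DataFrames, TrillionDollarWords

    sentences = load_all_sentences()
    combine(groupby(sentences, :label), nrow => :n)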
source
TrillionDollarWords.load_cpi_dataMethod
load_cpi_data()

Load the CPI data from the artifact. This is the CPI data used in the paper.

  • The date column is the date of the event.
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
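
For example, to load the CPI series and check the date range it covers:

    using TrillionDollarWords

    cpi = load_cpi_data()
    extrema(cpi.date)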
source
TrillionDollarWords.load_market_dataMethod
load_market_data()

Load the combined market data from the artifact. This dataset combines the CPI, PPI and UST data used in the paper.

  • The date column is the date of the event.
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
  • The maturity column is the maturity of the UST (if applicable).
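
For example, to keep a single indicator (the exact indicator strings, e.g. "CPI", are an assumption):

    using DataFrames, TrillionDollarWords

    market = load_market_data()
    cpi_only = filter(:indicator => ==("CPI"), market)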
source
TrillionDollarWords.load_modelMethod
load_model

Loads the model presented in the paper from HuggingFace. If load_head is true, the model is loaded with the head (i.e. the final layer) for classification. If load_head is false, the model is loaded without the head. The latter is useful for fine-tuning the model on a different task or in case the classification head is not needed. Accepts any additional keyword arguments that are accepted by Transformers.HuggingFace.HGFConfig.
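
A minimal sketch; load_head and output_hidden_states are assumed to be keyword arguments, as implied here and in the layerwise_activations docstrings:

    using TrillionDollarWords

    # With the classification head (calling the model on queries returns logits):
    clf = load_model(; load_head=true)

    # Without the head, keeping hidden states from all layers:
    enc = load_model(; load_head=false, output_hidden_states=true)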

source
TrillionDollarWords.load_ppi_dataMethod
load_ppi_data()

Load the PPI data from the artifact. This is the PPI data used in the paper.

  • The date column is the date of the event.
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
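
For example:

    using TrillionDollarWords

    ppi = load_ppi_data()
    first(ppi, 5)   # first few rows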
source
TrillionDollarWords.load_training_sentencesMethod
load_training_sentences()

Load the dataset with training sentences from the artifact. This is a combined dataset containing sentences from press conferences, meeting minutes, and speeches.

  • The sentence column is the sentence itself.
  • The year column is the year of the event.
  • The labels in label are the manually annotated labels from the paper.
  • The seed column is the seed that was used to split the data into train and test sets in the paper.
  • The sentence_splitting column indicates if the sentence was split or not (see the paper for details).
  • The event_type column is the type of event (meeting minutes, speech, or press conference).
  • The split column indicates if the sentence is in the train or test set.
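
For example, to count labels in the training split (the exact split values, e.g. "train", are an assumption):

    using DataFrames, TrillionDollarWords

    df = load_training_sentences()
    train = filter(:split => ==("train"), df)
    combine(groupby(train, :label), nrow => :n)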
source
TrillionDollarWords.load_ust_dataMethod
load_ust_data()

Load the UST (treasury yields) data from the artifact. This is the UST data used in the paper.

  • The date column is the date of the event.
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
  • The maturity column is the maturity of the UST (if applicable).
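
For example, to list the available maturities:

    using TrillionDollarWords

    ust = load_ust_data()
    unique(ust.maturity)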
source
TrillionDollarWords.prepare_probeMethod
prepare_probe(outcome_data::DataFrame; layer::Int=24, value_var::Symbol=:value)

Prepare data for a linear probe. The outcome_data should be a DataFrame with a sentence_id column containing unique values, plus a column containing the outcome variable. By default, the outcome column is assumed to be called value, but this can be changed with the value_var argument. The layer argument indicates which layer to use for the probe; the default is the last layer (24).
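
A minimal sketch with a toy outcome DataFrame (the values are made up for illustration):

    using DataFrames, TrillionDollarWords

    # Unique sentence_id values plus an outcome column named :value.
    outcome = DataFrame(sentence_id = 1:3, value = [2.1, 2.4, 2.3])
    probe_data = prepare_probe(outcome; layer = 24, value_var = :value)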

source