TrillionDollarWords

Documentation for TrillionDollarWords.

TrillionDollarWords.BaselineModelMethod
(mod::BaselineModel)(atomic_model::HGFRobertaForSequenceClassification, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the logits.

source
TrillionDollarWords.BaselineModelMethod
(mod::BaselineModel)(atomic_model::HGFRobertaModel, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the embeddings.
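
These two methods take the underlying Transformers.jl model explicitly. A minimal sketch, assuming load_head is a keyword argument and that the BaselineModel stores the atomic model in a field (the field name model below is hypothetical):

    using TrillionDollarWords

    mod = load_model(; load_head=true)
    queries = ["Inflation remains elevated."]

    # Hypothetical field name for the underlying HGF model:
    logits = mod(mod.model, queries)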

source
TrillionDollarWords.BaselineModelMethod
(mod::BaselineModel)(queries::Vector{String})

Computes a forward pass of the model on the given queries and returns either the logits or the embeddings, depending on whether the model was loaded with the classification head.
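
For example (a minimal sketch; whether logits or embeddings come back depends on how the model was loaded):

    using TrillionDollarWords

    mod = load_model(; load_head=true)
    out = mod(["The Committee decided to raise the target range."])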

source
TrillionDollarWords.get_embeddingsMethod
get_embeddings(atomic_model::HGFRobertaForSequenceClassification, tokens::NamedTuple)

Extends the get_embeddings function to HGFRobertaForSequenceClassification. Performs a forward pass through the model to obtain the embeddings, then passes them through the classification head and returns the activations going into the final linear layer.
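
A hedged sketch of calling this directly; encoding queries via Transformers.jl and the BaselineModel field names used below (tokenizer, model) are assumptions, and the callable interface above handles this in practice:

    using TrillionDollarWords
    using Transformers.TextEncoders: encode

    m = load_model(; load_head=true)

    # Hypothetical field names for the tokenizer and the atomic model:
    tokens = encode(m.tokenizer, ["Inflation remains elevated."])
    emb = TrillionDollarWords.get_embeddings(m.model, tokens)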

source
TrillionDollarWords.layerwise_activationsMethod
layerwise_activations(mod::BaselineModel, queries::DataFrame)

Computes a forward pass of the model on the given queries and returns the layerwise activations in a DataFrame where activations are uniquely identified by the sentence_id. If output_hidden_states=false was passed to load_model (default), only the last layer is returned. If output_hidden_states=true was passed to load_model, all layers are returned. The layer column indicates the layer number.

Each single activation receives its own cell to make it possible to save the output to a CSV file.
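
A minimal sketch, assuming the queries DataFrame needs at least sentence_id and sentence columns (the exact required columns are an assumption):

    using DataFrames, TrillionDollarWords

    mod = load_model(; output_hidden_states=true)  # return all layers
    queries = DataFrame(
        sentence_id = 1:2,
        sentence = ["Inflation remains elevated.", "The labor market is strong."],
    )
    acts = layerwise_activations(mod, queries)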

source
TrillionDollarWords.layerwise_activationsMethod
layerwise_activations(mod::BaselineModel, queries::Vector{String})

Computes a forward pass of the model on the given queries and returns the layerwise activations for the HGFRobertaModel. If output_hidden_states=false was passed to load_model (default), only the last layer is returned. If output_hidden_states=true was passed to load_model, all layers are returned. If the model is loaded with the head for classification, the activations going into the final linear layer are returned.
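
For example:

    using TrillionDollarWords

    mod = load_model()
    acts = layerwise_activations(mod, ["Inflation remains elevated."])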

source
TrillionDollarWords.load_all_dataMethod
load_all_data()

Load the combined dataset from the artifact. This dataset combines all sentences and the market data used in the paper.

  • The sentence_id column is the unique identifier of the sentence.
  • The doc_id column is the unique identifier of the document.
  • The date column is the date of the event.
  • The event_type column is the type of event (meeting minutes, speech, or press conference).
  • The labels in label are predicted by the model proposed in the paper.

We use the RoBERTa-large model finetuned on the combined data to label all the filtered sentences in the meeting minutes, speeches, and press conferences.

  • The sentence column is the sentence itself.
  • The score column is the softmax probability of the label.
  • The speaker column is the speaker of the sentence (if applicable).
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
  • The maturity column is the maturity of the UST (if applicable).
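
For example, to load the combined data and inspect it:

    using TrillionDollarWords

    df = load_all_data()
    names(df)      # available columns
    first(df, 5)   # first few rows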
source
TrillionDollarWords.load_all_sentencesMethod
load_all_sentences()

Load the dataset with all sentences from the artifact. This is the complete dataset with sentences from press conferences, meeting minutes, and speeches.

  • The sentence_id column is the unique identifier of the sentence.
  • The doc_id column is the unique identifier of the document.
  • The date column is the date of the event.
  • The event_type column is the type of event (meeting minutes, speech, or press conference).
  • The labels in label are predicted by the model proposed in the paper.

We use the RoBERTa-large model finetuned on the combined data to label all the filtered sentences in the meeting minutes, speeches, and press conferences.

  • The sentence column is the sentence itself.
  • The score column is the softmax probability of the label.
  • The speaker column is the speaker of the sentence (if applicable).
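
For example, to count sentences per predicted label (a sketch using DataFrames.jl):

    using DataFrames, TrillionDollarWords

    sentences = load_all_sentences()
    combine(groupby(sentences, :label), nrow => :n)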
source
TrillionDollarWords.load_cpi_dataMethod
load_cpi_data()

Load the CPI data from the artifact. This is the CPI data used in the paper.

  • The date column is the date of the event.
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
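
For example, to load the CPI series and check the date range it covers:

    using TrillionDollarWords

    cpi = load_cpi_data()
    extrema(cpi.date)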
source
TrillionDollarWords.load_market_dataMethod
load_market_data()

Load the combined market data from the artifact. This dataset combines the CPI, PPI and UST data used in the paper.

  • The date column is the date of the event.
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
  • The maturity column is the maturity of the UST (if applicable).
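
For example, to keep a single indicator (the exact indicator strings, e.g. "CPI", are an assumption):

    using DataFrames, TrillionDollarWords

    market = load_market_data()
    cpi_only = filter(:indicator => ==("CPI"), market)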
source
TrillionDollarWords.load_modelMethod
load_model

Loads the model presented in the paper from HuggingFace. If load_head is true, the model is loaded with the head (i.e. the final layer) for classification. If load_head is false, the model is loaded without the head. The latter is useful for fine-tuning the model on a different task or in case the classification head is not needed. Accepts any additional keyword arguments that are accepted by Transformers.HuggingFace.HGFConfig.
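
A minimal sketch; load_head and output_hidden_states are assumed to be keyword arguments, as implied here and in the layerwise_activations docstrings:

    using TrillionDollarWords

    # With the classification head (calling the model on queries returns logits):
    clf = load_model(; load_head=true)

    # Without the head, keeping hidden states from all layers:
    enc = load_model(; load_head=false, output_hidden_states=true)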

source
TrillionDollarWords.load_ppi_dataMethod
load_ppi_data()

Load the PPI data from the artifact. This is the PPI data used in the paper.

  • The date column is the date of the event.
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
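
For example:

    using TrillionDollarWords

    ppi = load_ppi_data()
    first(ppi, 5)   # first few rows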
source
TrillionDollarWords.load_training_sentencesMethod
load_training_sentences()

Load the dataset with training sentences from the artifact. This is a combined dataset containing sentences from press conferences, meeting minutes, and speeches.

  • The sentence column is the sentence itself.
  • The year column is the year of the event.
  • The labels in label are the manually annotated labels from the paper.
  • The seed column is the seed that was used to split the data into train and test sets in the paper.
  • The sentence_splitting column indicates if the sentence was split or not (see the paper for details).
  • The event_type column is the type of event (meeting minutes, speech, or press conference).
  • The split column indicates if the sentence is in the train or test set.
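
For example, to count labels in the training split (the exact split values, e.g. "train", are an assumption):

    using DataFrames, TrillionDollarWords

    df = load_training_sentences()
    train = filter(:split => ==("train"), df)
    combine(groupby(train, :label), nrow => :n)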
source
TrillionDollarWords.load_ust_dataMethod
load_ust_data()

Load the UST (treasury yields) data from the artifact. This is the UST data used in the paper.

  • The date column is the date of the event.
  • The value column is the value of the market indicator (CPI, PPI, or UST).
  • The indicator column is the market indicator (CPI, PPI, or UST).
  • The maturity column is the maturity of the UST (if applicable).
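
For example, to list the available maturities:

    using TrillionDollarWords

    ust = load_ust_data()
    unique(ust.maturity)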
source
TrillionDollarWords.prepare_probeMethod
prepare_probe(outcome_data::DataFrame; layer::Int=24, value_var::Symbol=:value)

Prepare data for a linear probe. The outcome_data should be a DataFrame with a sentence_id column containing unique values, plus a column containing the outcome variable. By default, the outcome column is assumed to be called value, but this can be changed with the value_var argument. The layer argument indicates which layer to use for the probe; the default is the last layer (24).
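
A minimal sketch with a toy outcome DataFrame (the values are made up for illustration):

    using DataFrames, TrillionDollarWords

    # Unique sentence_id values plus an outcome column named :value.
    outcome = DataFrame(sentence_id = 1:3, value = [2.1, 2.4, 2.3])
    probe_data = prepare_probe(outcome; layer = 24, value_var = :value)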

source