corpus

Corpus

class diagnnose.corpus.corpus.Corpus(examples: List[Example], fields: List[Tuple[str, Field]], create_pos_tags: bool = False, sen_column: str = 'sen', labels_column: Optional[str] = None)[source]

Bases: Dataset

classmethod create(path: str, header: Optional[List[str]] = None, header_from_first_line: bool = False, to_lower: bool = False, sen_column: str = 'sen', labels_column: Optional[str] = None, sep: str = '\t', tokenize_columns: Optional[List[str]] = None, convert_numerical: bool = False, create_pos_tags: bool = False, tokenizer: Optional[transformers.PreTrainedTokenizer] = None) → Corpus[source]
static create_examples(raw_corpus: List[List[str]], fields: List[Tuple[str, Field]]) → List[Example][source]
static create_fields(header: List[str], to_lower: bool = False, sen_column: str = 'sen', tokenize_columns: Optional[List[str]] = None, convert_numerical: bool = False, tokenizer: Optional[transformers.PreTrainedTokenizer] = None) → List[Tuple[str, Field]][source]
static create_header(header: Optional[List[str]] = None, header_from_first_line: bool = False, corpus_path: Optional[str] = None, sen_column: str = 'sen', sep: str = '\t') → List[str][source]
static create_raw_corpus(path: str, header_from_first_line: bool = False, sep: str = '\t') → List[List[str]][source]
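
A minimal usage sketch of Corpus.create, assuming a hypothetical tab-separated file whose first line is a header with a "sen" and a "labels" column; the file path, column names, and model name are illustrative only:

from transformers import AutoTokenizer

from diagnnose.corpus.corpus import Corpus

# Hypothetical corpus file; path and column names are assumptions.
corpus_path = "data/example_corpus.tsv"

# Any HuggingFace tokenizer can be passed, so sentences are tokenized
# with the vocabulary of the model that will be probed.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

corpus = Corpus.create(
    corpus_path,
    header_from_first_line=True,
    sen_column="sen",
    labels_column="labels",
    tokenizer=tokenizer,
)

print(len(corpus))  # number of Examples in the corpus
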
slice(sen_ids: List[int]) → Corpus[source]

Returns a new Corpus containing only the examples whose indices appear in sen_ids.

Parameters

sen_ids (List[int]) – List of sentence indices based on which the examples in the current Corpus will be filtered. These indices refer to the sen_idx in the original corpus; the newly sliced corpus will retain the original sen_idx of an Example item.

Returns

subcorpus – A new Corpus instance containing the filtered list of Examples.

Return type

Corpus
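
For example, continuing from the Corpus.create sketch above and assuming each Example carries a sen_idx field, as described:

# Keep only the examples at these sentence indices.
sub_corpus = corpus.slice([0, 3, 7])

# The sliced Corpus retains the original sen_idx of each Example.
for example in sub_corpus:
    print(example.sen_idx, example.sen)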

diagnnose.corpus.corpus.attach_tokenizer(field: Field, tokenizer: transformers.PreTrainedTokenizer) → None[source]

Attaches an existing tokenizer to a Corpus Field.

Parameters
  • field (Field) – Field to which the vocabulary will be attached.

  • tokenizer (PreTrainedTokenizer) – Tokenizer that will convert tokens to their indices.
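
A short sketch, assuming corpus was created as in the sketch above and that its sentence Field is reachable through the torchtext fields dict:

from transformers import AutoTokenizer

from diagnnose.corpus.corpus import attach_tokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Attach the tokenizer to the sentence Field, so tokens are mapped to
# the indices of the tokenizer's vocabulary during batching.
attach_tokenizer(corpus.fields["sen"], tokenizer)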

Create Iterator

Transforms a Corpus into a torchtext.data.Iterator.

Parameters
  • corpus (Corpus) – Corpus containing sentences that will be tokenized and transformed into a batch.

  • batch_size (int, optional) – Number of sentences processed per forward step. A higher batch size increases processing speed, but should be chosen in accordance with the amount of available RAM. Defaults to 1.

  • device (str, optional) – Torch device on which forward passes will be run. Defaults to cpu.

  • sort (bool, optional) – Toggle to sort the corpus based on sentence length. Defaults to False.

Returns

iterator – Iterator containing the batchified Corpus.

Return type

Iterator
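
A usage sketch; the import path and function name are assumed by analogy with the Create Labels module below and may differ:

from diagnnose.corpus.create_iterator import create_iterator  # assumed path

# Batchify the corpus from the sketches above, 16 sentences at a time.
iterator = create_iterator(corpus, batch_size=16, device="cpu", sort=False)

for batch in iterator:
    # Each batch contains the batchified sentences of up to 16 examples.
    print(batch)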

Create Labels

diagnnose.corpus.create_labels.create_labels_from_corpus(corpus: Corpus, selection_func: Callable[[int, Example], bool] = <function <lambda>>, control_task: Optional[Callable[[int, Example], Union[str, int]]] = None) → Tensor[source]

Creates labels based on the selection_func that was used during extraction.

Parameters
  • corpus (Corpus) – Labeled corpus containing sentence and label information.

  • selection_func (SelectFunc, optional) – Function that determines whether a label should be stored.

  • control_task (ControlTask, optional) – Control task function of Hewitt et al. (2019), mapping a corpus item to a random label.
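
For example, a sketch that keeps only the labels of sentences longer than five tokens; the sen field name follows the Corpus defaults above, and the corpus itself is assumed to be labeled:

from diagnnose.corpus.create_labels import create_labels_from_corpus

# selection_func receives the sentence index and the corpus Example and
# returns True for the items whose labels should be kept.
def selection_func(sen_idx, example):
    return len(example.sen) > 5

labels = create_labels_from_corpus(corpus, selection_func=selection_func)
print(labels.shape)  # one label per selected item, as a torch.Tensor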