corpus

Corpus

class diagnnose.corpus.corpus.Corpus(examples: List[Example], fields: List[Tuple[str, Field]], create_pos_tags: bool = False, sen_column: str = 'sen', labels_column: Optional[str] = None)[source]

Bases: Dataset

classmethod create(path: str, header: Optional[List[str]] = None, header_from_first_line: bool = False, to_lower: bool = False, sen_column: str = 'sen', labels_column: Optional[str] = None, sep: str = '\t', tokenize_columns: Optional[List[str]] = None, convert_numerical: bool = False, create_pos_tags: bool = False, tokenizer: Optional[transformers.PreTrainedTokenizer] = None) → Corpus[source]
static create_examples(raw_corpus: List[List[str]], fields: List[Tuple[str, Field]]) → List[Example][source]
static create_fields(header: List[str], to_lower: bool = False, sen_column: str = 'sen', tokenize_columns: Optional[List[str]] = None, convert_numerical: bool = False, tokenizer: Optional[transformers.PreTrainedTokenizer] = None) → List[Tuple[str, Field]][source]
static create_header(header: Optional[List[str]] = None, header_from_first_line: bool = False, corpus_path: Optional[str] = None, sen_column: str = 'sen', sep: str = '\t') → List[str][source]
static create_raw_corpus(path: str, header_from_first_line: bool = False, sep: str = '\t') → List[List[str]][source]
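
A minimal usage sketch of Corpus.create, assuming a hypothetical tab-separated file whose first line is a header with a "sen" and a "labels" column; the file path, column names, and model name are illustrative only:

from transformers import AutoTokenizer

from diagnnose.corpus.corpus import Corpus

# Hypothetical corpus file; path and column names are assumptions.
corpus_path = "data/example_corpus.tsv"

# Any HuggingFace tokenizer can be passed, so sentences are tokenized
# with the vocabulary of the model that will be probed.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

corpus = Corpus.create(
    corpus_path,
    header_from_first_line=True,
    sen_column="sen",
    labels_column="labels",
    tokenizer=tokenizer,
)

print(len(corpus))  # number of Examples in the corpus
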
slice(sen_ids: List[int]) → Corpus[source]

Returns a new Corpus containing only the examples whose indices appear in sen_ids.

Parameters

sen_ids (List[int]) – List of sentence indices based on which the examples in the current Corpus will be filtered. These indices refer to the sen_idx in the original corpus; the newly sliced corpus will retain the original sen_idx of an Example item.

Returns

subcorpus – A new Corpus instance containing the filtered list of Examples.

Return type

Corpus
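
For example, continuing from the Corpus.create sketch above and assuming each Example carries a sen_idx field, as described:

# Keep only the examples at these sentence indices.
sub_corpus = corpus.slice([0, 3, 7])

# The sliced Corpus retains the original sen_idx of each Example.
for example in sub_corpus:
    print(example.sen_idx, example.sen)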

diagnnose.corpus.corpus.attach_tokenizer(field: Field, tokenizer: transformers.PreTrainedTokenizer) → None[source]

Attaches an existing tokenizer to a Corpus Field.

Parameters
  • field (Field) – Field to which the vocabulary will be attached.

  • tokenizer (PreTrainedTokenizer) – Tokenizer that will convert tokens to their indices.
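
A short sketch, assuming corpus was created as in the sketch above and that its sentence Field is reachable through the torchtext fields dict:

from transformers import AutoTokenizer

from diagnnose.corpus.corpus import attach_tokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Attach the tokenizer to the sentence Field, so tokens are mapped to
# the indices of the tokenizer's vocabulary during batching.
attach_tokenizer(corpus.fields["sen"], tokenizer)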

Create Iterator

Transforms a Corpus into a torchtext.data.Iterator.

Parameters
  • corpus (Corpus) – Corpus containing sentences that will be tokenized and transformed into a batch.

  • batch_size (int, optional) – Number of sentences processed per forward step. A higher batch size increases processing speed, but should be chosen in accordance with the amount of available RAM. Defaults to 1.

  • device (str, optional) – Torch device on which forward passes will be run. Defaults to cpu.

  • sort (bool, optional) – Toggle to sort the corpus based on sentence length. Defaults to False.

Returns

iterator – Iterator containing the batchified Corpus.

Return type

Iterator
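
A usage sketch; the import path and function name are assumed by analogy with the Create Labels module below and may differ:

from diagnnose.corpus.create_iterator import create_iterator  # assumed path

# Batchify the corpus from the sketches above, 16 sentences at a time.
iterator = create_iterator(corpus, batch_size=16, device="cpu", sort=False)

for batch in iterator:
    # Each batch contains the batchified sentences of up to 16 examples.
    print(batch)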

Create Labels

diagnnose.corpus.create_labels.create_labels_from_corpus(corpus: Corpus, selection_func: Callable[[int, Example], bool] = <function <lambda>>, control_task: Optional[Callable[[int, Example], Union[str, int]]] = None) → Tensor[source]

Creates labels based on the selection_func that was used during extraction.

Parameters
  • corpus (Corpus) – Labeled corpus containing sentence and label information.

  • selection_func (SelectFunc, optional) – Function that determines whether a label should be stored.

  • control_task (ControlTask, optional) – Control task function of Hewitt et al. (2019), mapping a corpus item to a random label.
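
For example, a sketch that keeps only the labels of sentences longer than five tokens; the sen field name follows the Corpus defaults above, and the corpus itself is assumed to be labeled:

from diagnnose.corpus.create_labels import create_labels_from_corpus

# selection_func receives the sentence index and the corpus Example and
# returns True for the items whose labels should be kept.
def selection_func(sen_idx, example):
    return len(example.sen) > 5

labels = create_labels_from_corpus(corpus, selection_func=selection_func)
print(labels.shape)  # one label per selected item, as a torch.Tensor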