corpus
Corpus
- class diagnnose.corpus.corpus.Corpus(examples: List[Example], fields: List[Tuple[str, Field]], create_pos_tags: bool = False, sen_column: str = 'sen', labels_column: Optional[str] = None)
Bases: Dataset
- classmethod create(path: str, header: Optional[List[str]] = None, header_from_first_line: bool = False, to_lower: bool = False, sen_column: str = 'sen', labels_column: Optional[str] = None, sep: str = '\t', tokenize_columns: Optional[List[str]] = None, convert_numerical: bool = False, create_pos_tags: bool = False, tokenizer: Optional[transformers.PreTrainedTokenizer] = None) -> Corpus
- static create_examples(raw_corpus: List[List[str]], fields: List[Tuple[str, Field]]) -> List[Example]
- static create_fields(header: List[str], to_lower: bool = False, sen_column: str = 'sen', tokenize_columns: Optional[List[str]] = None, convert_numerical: bool = False, tokenizer: Optional[transformers.PreTrainedTokenizer] = None) -> List[Tuple[str, Field]]
- static create_header(header: Optional[List[str]] = None, header_from_first_line: bool = False, corpus_path: Optional[str] = None, sen_column: str = 'sen', sep: str = '\t') -> List[str]
- static create_raw_corpus(path: str, header_from_first_line: bool = False, sep: str = '\t') -> List[List[str]]
- slice(sen_ids: List[int]) -> Corpus
Returns a new Corpus containing only the examples whose sen_idx appears in sen_ids.
- Parameters
sen_ids (List[int]) – Sentence indices used to filter the examples in the current Corpus. These indices refer to the sen_idx in the original corpus; each Example in the sliced corpus retains its original sen_idx.
- Returns
subcorpus – A new Corpus instance containing the filtered list of Examples.
- Return type
Corpus
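The filtering behaviour described for slice can be illustrated without the library. The following is a minimal sketch, not diagnnose's implementation: ToyExample and slice_examples are hypothetical stand-ins for Example and Corpus.slice, and keeping the examples in their original corpus order is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyExample:
    """Hypothetical stand-in for a corpus Example."""
    sen_idx: int       # index of the sentence in the original corpus
    sen: List[str]     # tokenized sentence


def slice_examples(examples: List[ToyExample], sen_ids: List[int]) -> List[ToyExample]:
    """Keep only the examples whose sen_idx is listed in sen_ids.

    The examples are returned unchanged, so each one keeps the sen_idx
    it had in the original corpus (assumed here to stay in corpus order).
    """
    wanted = set(sen_ids)
    return [ex for ex in examples if ex.sen_idx in wanted]


raw = ["the cat sleeps", "dogs bark", "a bird sings loudly"]
examples = [ToyExample(i, s.split()) for i, s in enumerate(raw)]

subset = slice_examples(examples, [0, 2])
kept_ids = [ex.sen_idx for ex in subset]
```

Because the original sen_idx values are retained, a sliced corpus can be sliced again with the same numbering: slice_examples(subset, [2]) still selects the sentence that had index 2 in the full corpus.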
- diagnnose.corpus.corpus.attach_tokenizer(field: Field, tokenizer: transformers.PreTrainedTokenizer) -> None
Attaches a tokenizer to a Corpus Field.
- Parameters
field (Field) – Field to which the tokenizer's vocabulary will be attached.
tokenizer (PreTrainedTokenizer) – Tokenizer that converts tokens to their indices.
Create Iterator
Transforms a Corpus into a torchtext.data.Iterator.
- Parameters
corpus (Corpus) – Corpus containing the sentences that will be tokenized and transformed into batches.
batch_size (int, optional) – Number of sentences processed per forward step. A larger batch size increases processing speed, but it should be chosen according to the amount of available RAM. Defaults to 1.
device (str, optional) – Torch device on which forward passes will be run. Defaults to cpu.
sort (bool, optional) – Toggle to sort the corpus by sentence length. Defaults to False.
- Returns
iterator – Iterator containing the batchified Corpus.
- Return type
Iterator
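The effect of batch_size and sort can be sketched in plain Python. This is an illustration under stated assumptions, not the torchtext or diagnnose implementation: batchify is a hypothetical helper, and the conversion of tokens to tensors on the chosen device is left out.

```python
from typing import List


def batchify(
    sentences: List[List[str]], batch_size: int = 1, sort: bool = False
) -> List[List[List[str]]]:
    """Group tokenized sentences into consecutive batches.

    With sort=True the sentences are first ordered by length, so that
    sentences of similar length end up in the same batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = sorted(sentences, key=len) if sort else list(sentences)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


raw = ["a bird sings loudly today", "dogs bark", "the cat sleeps"]
sentences = [s.split() for s in raw]

batches = batchify(sentences, batch_size=2, sort=True)
# Sentence lengths per batch, useful for seeing the effect of sorting.
batch_lengths = [[len(sen) for sen in batch] for batch in batches]
```

With the default batch_size of 1 every sentence forms its own batch, which needs the least memory but the most forward steps.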
Create Labels
- diagnnose.corpus.create_labels.create_labels_from_corpus(corpus: Corpus, selection_func: Callable[[int, Example], bool] = <function <lambda>>, control_task: Optional[Callable[[int, Example], Union[str, int]]] = None) -> Tensor
Creates labels based on the selection_func that was used during extraction.
- Parameters
corpus (Corpus) – Labeled corpus containing sentence and label information.
selection_func (SelectFunc, optional) – Function that determines whether a label should be stored.
control_task (ControlTask, optional) – Control task function of Hewitt and Liang (2019), mapping a corpus item to a random label.
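How the two callables interact can be sketched as follows. This is a hedged illustration, not the library's code: ToyItem, collect_labels and make_control_task are hypothetical names, the first integer argument is assumed to be the token position within the sentence, and a plain list is returned where the real function returns a Tensor. The control task follows the general idea of assigning each word type one fixed random label.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class ToyItem:
    """Hypothetical stand-in for a labeled corpus Example."""
    sen: List[str]
    labels: List[str]


SelectFunc = Callable[[int, ToyItem], bool]
ControlTask = Callable[[int, ToyItem], Union[str, int]]


def collect_labels(
    corpus: List[ToyItem],
    selection_func: SelectFunc = lambda w_idx, item: True,
    control_task: Optional[ControlTask] = None,
) -> List[Union[str, int]]:
    """Gather one label per token position accepted by selection_func."""
    labels: List[Union[str, int]] = []
    for item in corpus:
        for w_idx in range(len(item.sen)):
            if not selection_func(w_idx, item):
                continue  # this position was not selected
            if control_task is not None:
                labels.append(control_task(w_idx, item))
            else:
                labels.append(item.labels[w_idx])
    return labels


def make_control_task(num_labels: int, seed: int = 0) -> ControlTask:
    """Build a control task giving each word type a fixed random label."""
    rng = random.Random(seed)
    assigned: Dict[str, int] = {}

    def control_task(w_idx: int, item: ToyItem) -> int:
        word = item.sen[w_idx]
        if word not in assigned:
            assigned[word] = rng.randrange(num_labels)
        return assigned[word]

    return control_task


corpus = [
    ToyItem("the cat sleeps".split(), ["DET", "NOUN", "VERB"]),
    ToyItem("the dogs bark".split(), ["DET", "NOUN", "VERB"]),
]

# Select only the final token of each sentence.
gold = collect_labels(corpus, selection_func=lambda w_idx, item: w_idx == len(item.sen) - 1)
# Replace the gold labels by random, word-type-consistent labels.
control = collect_labels(corpus, control_task=make_control_task(num_labels=3))
```

The same selection_func should be used here as during extraction, so that each stored label lines up with a stored position.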