deeppavlov.models.spelling_correction

class deeppavlov.models.spelling_correction.brillmoore.ErrorModel(dictionary: StaticDictionary, window: int = 1, candidates_count: int = 1, *args, **kwargs)[source]

Component that uses a statistics-based error model to find the best candidates in a static dictionary. Based on "An Improved Error Model for Noisy Channel Spelling Correction" by Eric Brill and Robert C. Moore.

Parameters
  • dictionary – a StaticDictionary object

  • window – maximum context window size

  • candidates_count – maximum number of replacement candidates to return for every token in the input

costs

logarithmic probabilities of character sequence replacements

dictionary

a StaticDictionary object

window

maximum context window size

candidates_count

maximum number of replacement candidates to return for every token in the input

__call__(data: Iterable[Iterable[str]], *args, **kwargs) → List[List[List[Tuple[float, str]]]][source]

Propose candidates for tokens in sentences

Parameters

data – batch of tokenized sentences

Returns

batch of lists of probabilities and candidates for every token
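A minimal usage sketch, assuming a StaticDictionary instance named dictionary has already been built (its construction and the model's save/load paths are normally wired up through a DeepPavlov config); the batch contents, candidate counts and scores below are illustrative only:

    from deeppavlov.models.spelling_correction.brillmoore import ErrorModel

    # `dictionary` is assumed to be a prebuilt StaticDictionary instance
    model = ErrorModel(dictionary=dictionary,
                       window=1,             # context window of one character
                       candidates_count=4)   # return at most 4 candidates per token

    batch = [["helllo", "wrld"]]             # one tokenized sentence
    candidates = model(batch)
    # candidates[0][0] is a list of (log-probability, candidate) pairs
    # for the token "helllo", e.g. [(-0.3, "hello"), (-4.1, "hell"), ...]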

fit(x: List[str], y: List[str])[source]

Calculate character sequence replacement probabilities

Parameters
  • x – words with spelling errors

  • y – words without spelling errors
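A short sketch of fitting the model on aligned pairs of misspelled and correct words (the word pairs are illustrative; model is an ErrorModel instance as in the example above):

    erroneous = ["helllo", "wrold", "teh"]   # x: words with spelling errors
    correct   = ["hello",  "world", "the"]   # y: the corresponding correct words

    model.fit(erroneous, correct)   # estimate character sequence replacement log-probabilities
    model.save()                    # persist the learned costs (needs a configured save path)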

save()[source]

Save replacement probabilities to a file

load()[source]

Load replacement probabilities from a file

class deeppavlov.models.spelling_correction.levenshtein.LevenshteinSearcherComponent(words: Iterable[str], max_distance: int = 1, error_probability: float = 0.0001, vocab_penalty: Optional[float] = None, **kwargs)[source]

Component that finds replacement candidates for tokens within a set Damerau-Levenshtein distance

Parameters
  • words – list of every correct word

  • max_distance – maximum allowed Damerau-Levenshtein distance between source words and candidates

  • error_probability – assigned probability for every edit

  • vocab_penalty – assigned probability of an out-of-vocabulary token being the correct one without changes

max_distance

maximum allowed Damerau-Levenshtein distance between source words and candidates

error_probability

assigned logarithmic probability for every edit

vocab_penalty

assigned logarithmic probability of an out-of-vocabulary token being the correct one without changes

__call__(batch: Iterable[Iterable[str]], *args, **kwargs) → List[List[List[Tuple[float, str]]]][source]

Propose candidates for tokens in sentences

Parameters

batch – batch of tokenized sentences

Returns

batch of lists of probabilities and candidates for every token
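A minimal sketch of candidate search within a fixed Damerau-Levenshtein distance; the toy vocabulary, tokens and scores are illustrative only:

    from deeppavlov.models.spelling_correction.levenshtein import LevenshteinSearcherComponent

    searcher = LevenshteinSearcherComponent(
        words=["hello", "world", "word", "help"],  # vocabulary of correct words
        max_distance=1,                            # allow at most one edit per candidate
        error_probability=1e-4)                    # probability assigned to every edit

    batch = [["helo", "wrld"]]                     # one tokenized sentence
    candidates = searcher(batch)
    # candidates[0][0] is a list of (log-probability, candidate) pairs
    # for the token "helo", e.g. [(-4.0, "hello"), (-4.0, "help"), ...]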

class deeppavlov.models.spelling_correction.electors.top1_elector.TopOneElector(*args, **kwargs)[source]

Component that chooses the candidate with the highest base probability for every token

__call__(batch: List[List[List[Tuple[float, str]]]]) → List[List[str]][source]

Choose the best candidate for every token

Parameters

batch – batch of probabilities and string values of candidates for every token in a sentence

Returns

batch of corrected tokenized sentences
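A minimal sketch, assuming the elector can be instantiated without extra arguments; the scores are made-up log-probabilities in the format produced by the candidate searchers above:

    from deeppavlov.models.spelling_correction.electors.top1_elector import TopOneElector

    elector = TopOneElector()

    batch = [[
        [(-0.1, "hello"), (-2.3, "hell")],   # candidates for the first token
        [(-0.5, "world"), (-1.7, "word")],   # candidates for the second token
    ]]
    print(elector(batch))   # [['hello', 'world']]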

class deeppavlov.models.spelling_correction.electors.kenlm_elector.KenlmElector(load_path: Path, beam_size: int = 4, *args, **kwargs)[source]

Component that chooses the candidate with the highest product of base and language model probabilities

Parameters
  • load_path – path to the kenlm model file

  • beam_size – beam size for highest probability search

lm

kenlm language model object

beam_size

beam size for highest probability search

__call__(batch: List[List[List[Tuple[float, str]]]]) → List[List[str]][source]

Choose the best candidate for every token

Parameters

batch – batch of probabilities and string values of candidates for every token in a sentence

Returns

batch of corrected tokenized sentences
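A hedged sketch of re-scoring candidates with a KenLM language model; the model path is a placeholder and should point at the kenlm file referenced by your DeepPavlov config:

    from pathlib import Path
    from deeppavlov.models.spelling_correction.electors.kenlm_elector import KenlmElector

    elector = KenlmElector(load_path=Path("/path/to/lm.binary"),  # placeholder kenlm model path
                           beam_size=4)                           # beam width for the search

    batch = [[
        [(-0.1, "hello"), (-2.3, "hell")],
        [(-0.5, "world"), (-1.7, "word")],
    ]]
    corrected = elector(batch)   # e.g. [['hello', 'world']], depending on the language model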