deeppavlov.models.spelling_correction

class deeppavlov.models.spelling_correction.brillmoore.ErrorModel(dictionary: deeppavlov.vocabs.typos.StaticDictionary, window: int = 1, candidates_count: int = 1, *args, **kwargs)[source]

Component that uses a statistics-based error model to find the best replacement candidates in a static dictionary. Based on "An Improved Error Model for Noisy Channel Spelling Correction" by Eric Brill and Robert C. Moore.

Parameters:
  • dictionary – a StaticDictionary object
  • window – maximum context window size
  • candidates_count – maximum number of replacement candidates to return for every token in the input

costs

logarithmic probabilities of character sequence replacements

dictionary

a StaticDictionary object

window

maximum context window size

candidates_count

maximum number of replacement candidates to return for every token in the input

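In a typical DeepPavlov pipeline the error model is not constructed by hand but built from a spelling-correction config. A minimal sketch, assuming the brillmoore_wikitypos_en config shipped with the library (config names and download contents may differ between releases):

    from deeppavlov import build_model, configs

    # Build the full Brill-Moore spelling-correction pipeline; download=True
    # fetches the pretrained error model, dictionary and language model.
    model = build_model(configs.spelling_correction.brillmoore_wikitypos_en, download=True)

    # The pipeline accepts raw strings and returns corrected strings.
    print(model(['helllo wrld']))  # e.g. ['hello world']
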
__call__(data: Iterable[Iterable[str]], *args, **kwargs) → List[List[List[Tuple[float, str]]]][source]

Propose candidates for tokens in sentences

Parameters: data – batch of tokenized sentences
Returns: batch of lists of probabilities and candidates for every token
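
A minimal sketch of the returned structure, assuming error_model is an already-constructed and fitted ErrorModel instance:

    # batch of tokenized sentences
    batch = [['helllo', 'world']]
    candidates = error_model(batch)

    # candidates[0][0] is a list of (probability, candidate) tuples for 'helllo'
    for tokens, token_candidates in zip(batch, candidates):
        for token, variants in zip(tokens, token_candidates):
            best_score, best_word = max(variants) if variants else (0.0, token)
            print(token, '->', best_word)
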
fit(x: List[str], y: List[str])[source]

Calculate character sequence replacement probabilities

Parameters:
  • x – words with spelling errors
  • y – words without spelling errors
save()[source]

Save replacement probabilities to a file

load()[source]

Load replacement probabilities from a file
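
The replacement statistics can also be estimated and persisted directly through fit, save and load. A minimal sketch, again assuming an already-constructed error_model instance with a save/load path configured at construction time:

    # parallel lists of misspelled and correct word forms
    errors = ['helllo', 'wrold', 'teh']
    corrections = ['hello', 'world', 'the']

    error_model.fit(errors, corrections)  # estimate replacement probabilities
    error_model.save()                    # write them to the configured file
    error_model.load()                    # restore them later
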

class deeppavlov.models.spelling_correction.levenshtein.LevenshteinSearcherComponent(words: Iterable[str], max_distance: int = 1, error_probability: float = 0.0001, vocab_penalty: Optional[float] = None, **kwargs)[source]

Component that finds replacement candidates for tokens within a set Damerau-Levenshtein distance

Parameters:
  • words – list of all correct words
  • max_distance – maximum allowed Damerau-Levenshtein distance between source words and candidates
  • error_probability – assigned probability for every edit
  • vocab_penalty – assigned probability of an out-of-vocabulary token being the correct one without changes
max_distance

maximum allowed Damerau-Levenshtein distance between source words and candidates

error_probability

assigned logarithmic probability for every edit

vocab_penalty

assigned logarithmic probability of an out-of-vocabulary token being the correct one without changes

__call__(batch: Iterable[Iterable[str]], *args, **kwargs) → List[List[List[Tuple[float, str]]]][source]

Propose candidates for tokens in sentences

Parameters: batch – batch of tokenized sentences
Returns: batch of lists of probabilities and candidates for every token
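
A minimal sketch of direct usage with a toy vocabulary (in a pipeline the word list normally comes from a vocabulary component):

    from deeppavlov.models.spelling_correction.levenshtein import LevenshteinSearcherComponent

    searcher = LevenshteinSearcherComponent(words=['hello', 'world', 'word'], max_distance=1)

    candidates = searcher([['helo', 'world']])
    # candidates[0][0] holds (log-probability, candidate) tuples for 'helo',
    # e.g. including 'hello'; candidates for 'world' include 'world' and 'word'.
    print(candidates)
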
class deeppavlov.models.spelling_correction.electors.top1_elector.TopOneElector(*args, **kwargs)[source]

Component that chooses the candidate with the highest base probability for every token

__call__(batch: List[List[List[Tuple[float, str]]]]) → List[List[str]][source]

Choose the best candidate for every token

Parameters: batch – batch of probabilities and string values of candidates for every token in a sentence
Returns: batch of corrected tokenized sentences
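
Since the elector only compares the scores already attached to each candidate, it can be sketched with hand-made input:

    from deeppavlov.models.spelling_correction.electors.top1_elector import TopOneElector

    elector = TopOneElector()

    # one sentence of two tokens, each with (probability, candidate) pairs
    batch = [[
        [(0.7, 'hello'), (0.3, 'hell')],
        [(0.6, 'world'), (0.4, 'word')],
    ]]

    print(elector(batch))  # [['hello', 'world']]
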
class deeppavlov.models.spelling_correction.electors.kenlm_elector.KenlmElector(load_path: pathlib.Path, beam_size: int = 4, *args, **kwargs)[source]

Component that chooses a candidate with the highest product of base and language model probabilities

Parameters:
  • load_path – path to the kenlm model file
  • beam_size – beam size for highest probability search
lm

kenlm.Model object

beam_size

beam size for highest probability search

__call__(batch: List[List[List[Tuple[float, str]]]]) → List[List[str]][source]

Choose the best candidate for every token

Parameters: batch – batch of probabilities and string values of candidates for every token in a sentence
Returns: batch of corrected tokenized sentences
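
A minimal sketch, assuming a KenLM binary model is available locally (the path below is hypothetical; DeepPavlov configs normally download one automatically) and that candidate scores come from an upstream component such as ErrorModel:

    from pathlib import Path
    from deeppavlov.models.spelling_correction.electors.kenlm_elector import KenlmElector

    # hypothetical path to a kenlm language model file
    elector = KenlmElector(load_path=Path('lm/en.arpa.binary'), beam_size=4)

    # illustrative base scores for candidates of two tokens
    batch = [[
        [(0.7, 'hello'), (0.3, 'hell')],
        [(0.6, 'word'), (0.4, 'world')],
    ]]

    # the language model can override the base scores,
    # e.g. preferring 'hello world' over 'hello word'
    print(elector(batch))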