deeppavlov.models.spelling_correction

class deeppavlov.models.spelling_correction.brillmoore.ErrorModel(dictionary: StaticDictionary, window: int = 1, candidates_count: int = 1, *args, **kwargs)[source]

Component that uses a statistics-based error model to find the best candidates in a static dictionary. Based on "An Improved Error Model for Noisy Channel Spelling Correction" by Eric Brill and Robert C. Moore.

Parameters
  • dictionary – a StaticDictionary object

  • window – maximum context window size

  • candidates_count – maximum number of replacement candidates to return for every token in the input

costs

logarithmic probabilities of character sequence replacements

dictionary

a StaticDictionary object

window

maximum context window size

candidates_count

maximum number of replacement candidates to return for every token in the input

__call__(data: Iterable[Iterable[str]], *args, **kwargs) → List[List[List[Tuple[float, str]]]][source]

Propose candidates for tokens in sentences

Parameters

data – batch of tokenized sentences

Returns

batch of lists of probabilities and candidates for every token
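A minimal usage sketch, assuming a StaticDictionary instance named dictionary has already been built (its construction and the model's save/load paths are normally wired up through a DeepPavlov config); the batch contents, candidate counts and scores below are illustrative only:

    from deeppavlov.models.spelling_correction.brillmoore import ErrorModel

    # `dictionary` is assumed to be a prebuilt StaticDictionary instance
    model = ErrorModel(dictionary=dictionary,
                       window=1,             # context window of one character
                       candidates_count=4)   # return at most 4 candidates per token

    batch = [["helllo", "wrld"]]             # one tokenized sentence
    candidates = model(batch)
    # candidates[0][0] is a list of (log-probability, candidate) pairs
    # for the token "helllo", e.g. [(-0.3, "hello"), (-4.1, "hell"), ...]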

fit(x: List[str], y: List[str])[source]

Calculate character sequence replacement probabilities

Parameters
  • x – words with spelling errors

  • y – words without spelling errors
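A short sketch of fitting the model on aligned pairs of misspelled and correct words (the word pairs are illustrative; model is an ErrorModel instance as in the example above):

    erroneous = ["helllo", "wrold", "teh"]   # x: words with spelling errors
    correct   = ["hello",  "world", "the"]   # y: the corresponding correct words

    model.fit(erroneous, correct)   # estimate character sequence replacement log-probabilities
    model.save()                    # persist the learned costs (needs a configured save path)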

save()[source]

Save replacement probabilities to a file

load()[source]

Load replacement probabilities from a file

class deeppavlov.models.spelling_correction.levenshtein.LevenshteinSearcherComponent(words: Iterable[str], max_distance: int = 1, error_probability: float = 0.0001, vocab_penalty: Optional[float] = None, **kwargs)[source]

Component that finds replacement candidates for tokens within a set Damerau-Levenshtein distance

Parameters
  • words – list of every correct word

  • max_distance – maximum allowed Damerau-Levenshtein distance between source words and candidates

  • error_probability – assigned probability for every edit

  • vocab_penalty – assigned probability of an out-of-vocabulary token being the correct one without changes

max_distance

maximum allowed Damerau-Levenshtein distance between source words and candidates

error_probability

assigned logarithmic probability for every edit

vocab_penalty

assigned logarithmic probability of an out-of-vocabulary token being the correct one without changes

__call__(batch: Iterable[Iterable[str]], *args, **kwargs) → List[List[List[Tuple[float, str]]]][source]

Propose candidates for tokens in sentences

Parameters

batch – batch of tokenized sentences

Returns

batch of lists of probabilities and candidates for every token
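A minimal sketch of candidate search within a fixed Damerau-Levenshtein distance; the toy vocabulary, tokens and scores are illustrative only:

    from deeppavlov.models.spelling_correction.levenshtein import LevenshteinSearcherComponent

    searcher = LevenshteinSearcherComponent(
        words=["hello", "world", "word", "help"],  # vocabulary of correct words
        max_distance=1,                            # allow at most one edit per candidate
        error_probability=1e-4)                    # probability assigned to every edit

    batch = [["helo", "wrld"]]                     # one tokenized sentence
    candidates = searcher(batch)
    # candidates[0][0] is a list of (log-probability, candidate) pairs
    # for the token "helo", e.g. [(-4.0, "hello"), (-4.0, "help"), ...]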

class deeppavlov.models.spelling_correction.electors.top1_elector.TopOneElector(*args, **kwargs)[source]

Component that chooses the candidate with the highest base probability for every token

__call__(batch: List[List[List[Tuple[float, str]]]]) → List[List[str]][source]

Choose the best candidate for every token

Parameters

batch – batch of probabilities and string values of candidates for every token in a sentence

Returns

batch of corrected tokenized sentences
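A minimal sketch, assuming the elector can be instantiated without extra arguments; the scores are made-up log-probabilities in the format produced by the candidate searchers above:

    from deeppavlov.models.spelling_correction.electors.top1_elector import TopOneElector

    elector = TopOneElector()

    batch = [[
        [(-0.1, "hello"), (-2.3, "hell")],   # candidates for the first token
        [(-0.5, "world"), (-1.7, "word")],   # candidates for the second token
    ]]
    print(elector(batch))   # [['hello', 'world']]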

class deeppavlov.models.spelling_correction.electors.kenlm_elector.KenlmElector(load_path: Path, beam_size: int = 4, *args, **kwargs)[source]

Component that chooses the candidate with the highest product of base and language model probabilities

Parameters
  • load_path – path to the kenlm model file

  • beam_size – beam size for highest probability search

lm

kenlm language model object

beam_size

beam size for highest probability search

__call__(batch: List[List[List[Tuple[float, str]]]]) → List[List[str]][source]

Choose the best candidate for every token

Parameters

batch – batch of probabilities and string values of candidates for every token in a sentence

Returns

batch of corrected tokenized sentences
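A hedged sketch of re-scoring candidates with a KenLM language model; the model path is a placeholder and should point at the kenlm file referenced by your DeepPavlov config:

    from pathlib import Path
    from deeppavlov.models.spelling_correction.electors.kenlm_elector import KenlmElector

    elector = KenlmElector(load_path=Path("/path/to/lm.binary"),  # placeholder kenlm model path
                           beam_size=4)                           # beam width for the search

    batch = [[
        [(-0.1, "hello"), (-2.3, "hell")],
        [(-0.5, "world"), (-1.7, "word")],
    ]]
    corrected = elector(batch)   # e.g. [['hello', 'world']], depending on the language model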