Spelling correction

1. Introduction to the task

Spelling correction is the detection of misspelled words in a text and their replacement with the correct ones.

For example, the sentence

The platypus lives in eastern Astralia, inkluding Tasmania.

with spelling mistakes (‘Astralia’, ‘inkluding’) will be corrected as

The platypus lives in eastern Australia, including Tasmania.

2. Get started with the model

First make sure you have the DeepPavlov Library installed. More info about the first installation.

[ ]:
!pip install -q deeppavlov

Then make sure that all the required packages for the model are installed.

[ ]:
!python -m deeppavlov install brillmoore_wikitypos_en

brillmoore_wikitypos_en is the name of the model’s config_file. What is a Config File?

There are alternative ways to install the model’s packages that do not require executing a separate command – see the options in the next sections of this page. The full list of models for spelling correction with their config names can be found in the table.

3. Models list

The table presents a list of all of the models for spelling correction available in the DeepPavlov Library.

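========================  ========  ======
Config name               Language  RAM
========================  ========  ======
brillmoore_wikitypos_en   En        6.7 GB
levenshtein_corrector_ru  Ru        8.7 GB
========================  ========  ======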

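We provide two types of pipelines for spelling correction: the Levenshtein corrector, which searches a static dictionary for candidates within a set Damerau-Levenshtein distance, and Brillmoore, which uses a statistics-based error model to find candidates.

In both cases the final correction is chosen among the candidates based on context, with the help of a KenLM language model.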

You can find the comparison of these and other approaches near the end of this readme.
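
The context-based choice among candidates can be illustrated with a small sketch. It is not the library's implementation: a toy bigram table stands in for the KenLM model, every probability is invented for the example, and `best_sequence` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Tuple

# Toy bigram log-probabilities standing in for a KenLM model (invented values).
BIGRAM_LOGP: Dict[Tuple[str, str], float] = {
    ("<s>", "lives"): math.log(0.1),
    ("lives", "in"): math.log(0.4),
    ("in", "australia"): math.log(0.05),
    ("in", "astral"): math.log(0.0001),
}
UNSEEN_LOGP = math.log(1e-6)  # fallback score for bigrams missing from the table


def best_sequence(candidates: List[List[Tuple[str, float]]]) -> List[str]:
    """Viterbi search over per-token candidates.

    Each position holds (word, error_model_log_prob) pairs; the result is the
    word sequence that maximizes language-model score plus error-model score.
    """
    # best maps the last chosen word to (total score, path that ends in it)
    best: Dict[str, Tuple[float, List[str]]] = {"<s>": (0.0, [])}
    for position in candidates:
        new_best: Dict[str, Tuple[float, List[str]]] = {}
        for word, err_logp in position:
            score, path = max(
                (prev_score + BIGRAM_LOGP.get((prev, word), UNSEEN_LOGP) + err_logp, prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values())[1]


# "astralia" has two candidates that the error model scores equally,
# so the language model context decides between them.
example = [
    [("lives", math.log(0.9))],
    [("in", math.log(0.9))],
    [("australia", math.log(0.5)), ("astral", math.log(0.5))],
]
print(best_sequence(example))
```

Here the two candidates for the last token tie on the error model, and the bigram score after "in" breaks the tie.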

4. Use the model for prediction

4.1 Predict using Python

4.1.1 Levenshtein corrector

This component finds all the candidates in a static dictionary that lie within a set Damerau-Levenshtein distance of the source word. It can split one token into two, but it cannot merge two tokens into one.

Component config parameters:

  • in — list with one element: name of this component’s input in chainer’s shared memory

  • out — list with one element: name for this component’s output in chainer’s shared memory

  • class_name is always "spelling_levenshtein" or deeppavlov.models.spelling_correction.levenshtein.searcher_component:LevenshteinSearcherComponent.

  • words — list of all correct words (should be a reference)

  • max_distance — maximum allowed Damerau-Levenshtein distance between source words and candidates

  • error_probability — assigned probability for every edit
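
As a rough illustration of what max_distance controls, the sketch below computes the optimal string alignment distance (a common restricted form of Damerau-Levenshtein) and filters a word list with it. This is an assumption-laden toy, not the component's code: `osa_distance` and `find_candidates` are hypothetical names, the linear scan over the dictionary is for clarity only, and token splitting is not covered.

```python
from typing import Iterable, List


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters cost 1 each."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def find_candidates(token: str, words: Iterable[str], max_distance: int = 1) -> List[str]:
    """Return dictionary words within max_distance edits of the token."""
    return sorted(w for w in words if osa_distance(token, w) <= max_distance)


vocabulary = ["including", "include", "australia", "platypus"]
print(find_candidates("inkluding", vocabulary))
print(find_candidates("astralia", vocabulary))
```

Raising max_distance widens the candidate set, which usually means more work for the language model that ranks the candidates afterwards.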

[ ]:
from deeppavlov import build_model, configs

model = build_model('levenshtein_corrector_ru', download=True)
[ ]:
model(['Утканос живет в Васточной Австралии на обширном ареале от холодных плато Тасмании и Австралийских Альп до дождевых лесов прибрежного Квинсленда.'])
['утконос живет в восточной австралии на обширном ареале от холодных плато тасмании и австралийских альп до дождевых лесов прибрежного квинсленда.']

4.1.2 Brillmoore

This component is based on "An Improved Error Model for Noisy Channel Spelling Correction" by Eric Brill and Robert C. Moore and uses a statistics-based error model to find the best candidates in a static dictionary.

Component config parameters:

  • in — list with one element: name of this component’s input in chainer’s shared memory

  • out — list with one element: name for this component’s output in chainer’s shared memory

  • class_name is always "spelling_error_model" or deeppavlov.models.spelling_correction.brillmoore.error_model:ErrorModel.

  • save_path — path where the model will be saved after a training session

  • load_path — path to the pretrained model

  • window — window size for the error model from 0 to 4, defaults to 1

  • candidates_count — maximum allowed count of candidates for every source token

  • dictionary — description of a static dictionary model, instance of (or inherited from) deeppavlov.vocabs.static_dictionary.StaticDictionary

    • class_name — "static_dictionary" for a custom dictionary or one of two provided:

      • "russian_words_vocab" to automatically download and use a list of Russian words from https://github.com/danakt/russian-words/

      • "wikitionary_100K_vocab" to automatically download a list of the most common words from Project Gutenberg, taken from Wiktionary (https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg)

    • dictionary_name — name of a directory where a dictionary will be built to and loaded from, defaults to "dictionary" for static_dictionary

    • raw_dictionary_path — path to a file with a line-separated list of dictionary words, required for static_dictionary
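
To show how these parameters might fit together, here is a sketch of a component entry written as a Python dictionary that serializes to JSON. Only the keys come from the list above; every value (memory names, paths, counts) is a hypothetical placeholder rather than something taken from a shipped config.

```python
import json

# Keys follow the parameter list above; all values are hypothetical placeholders.
error_model_component = {
    "class_name": "spelling_error_model",
    "in": ["x_tokens"],                        # hypothetical input name in the chainer's shared memory
    "out": ["tokens_candidates"],              # hypothetical output name
    "save_path": "error_model/my_model.tsv",   # hypothetical path written after training
    "load_path": "error_model/my_model.tsv",   # hypothetical path to a pretrained model
    "window": 1,                               # allowed range is 0 to 4
    "candidates_count": 4,                     # hypothetical limit per source token
    "dictionary": {
        "class_name": "static_dictionary",     # custom dictionary
        "dictionary_name": "my_dictionary",    # hypothetical directory name
        "raw_dictionary_path": "my_words.txt", # hypothetical word list, one word per line
    },
}

print(json.dumps(error_model_component, indent=2))
```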

[ ]:
from deeppavlov import build_model, configs

model = build_model('brillmoore_wikitypos_en', download=True)
[ ]:
model(['The platypus lives in Astralia.'])
['the platypus lives in australia.']

4.2 Predict using CLI

You can also get predictions in interactive mode through the CLI (Command Line Interface).

[ ]:
! python -m deeppavlov interact brillmoore_wikitypos_en -d

5. Customize the model

5.1 Training configuration

For the training phase, the config file also needs to include these parameters:

  • dataset_iterator — it should always be set like "dataset_iterator": {"class_name": "typos_iterator"}

    • class_name always equals to typos_iterator

    • test_ratio — ratio of test data to train, from 0. to 1., defaults to 0.

  • dataset_reader

The component’s configuration for spelling_error_model also has to include a fit_on parameter: a list of two elements, the names of the component’s input and true output in chainer’s shared memory.
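
A sketch of these training-related additions, written as Python dictionaries that mirror the JSON config, might look as follows. The memory names "x" and "y" and the test_ratio value are assumptions made for illustration, and the dataset_reader entry is left out because its parameters are not described here.

```python
# Training-related config fragments; "x", "y" and 0.05 are hypothetical values.
dataset_iterator = {
    "class_name": "typos_iterator",
    "test_ratio": 0.05,  # share of the data held out for testing, between 0 and 1
}

error_model_for_training = {
    "class_name": "spelling_error_model",
    "in": ["x"],           # component input in the chainer's shared memory
    "fit_on": ["x", "y"],  # input and true output used during fitting
}

print(dataset_iterator)
print(error_model_for_training)
```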

5.2 Language model

The provided pipelines use KenLM to process language models, so if you want to build your own, we suggest you consult its website. We also provide our own language models for English (5.5 GB) and Russian (3.1 GB).

6. Comparison

We compared our pipelines with Yandex.Speller, JamSpell and PyHunSpell on the test set for the SpellRuEval competition on Automatic Spelling Correction for Russian:

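===================================  =========  ======  =========  ===================
Correction method                    Precision  Recall  F-measure  Speed (sentences/s)
===================================  =========  ======  =========  ===================
Yandex.Speller                       83.09      59.86   69.59
DeepPavlov levenshtein_corrector_ru  59.38      53.44   56.25      39.3
Hunspell + lm                        41.03      48.89   44.61      2.1
JamSpell                             44.57      35.69   39.64      136.2
Hunspell                             30.30      34.02   32.06      20.3
===================================  =========  ======  =========  ===================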
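
The F-measure column appears consistent with the harmonic mean of precision and recall. That reading is an assumption, since the metric is not defined here, and the sketch below only checks it against the reported numbers. Small differences in the last digit are expected because the published precision and recall are already rounded.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the comparison table
reported = {
    "Yandex.Speller": (83.09, 59.86, 69.59),
    "DeepPavlov levenshtein_corrector_ru": (59.38, 53.44, 56.25),
    "Hunspell + lm": (41.03, 48.89, 44.61),
    "JamSpell": (44.57, 35.69, 39.64),
    "Hunspell": (30.30, 34.02, 32.06),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed {f_measure(p, r):.2f}, reported {f:.2f}")
```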