deeppavlov.models.morpho_tagger

class deeppavlov.models.morpho_tagger.morpho_tagger.MorphoTagger(*args, **kwargs)[source]

A class implementing a character-based neural morphological tagger.

Parameters
  • symbols – character vocabulary

  • tags – morphological tags vocabulary

  • save_path – the path where the model is saved

  • load_path – the path from which the model is loaded

  • mode – usage mode

  • word_rnn – the type of character-level network (only cnn implemented)

  • char_embeddings_size – the size of character embeddings

  • char_conv_layers – the number of convolutional layers on character level

  • char_window_size – the width of convolutional filter (filters). It can be a list if several parallel filters are applied, for example, [2, 3, 4, 5].

  • char_filters – the number of convolutional filters for each window width. It can be a number, a list (when there are several windows of different width on a single convolution layer), a list of lists (when there is more than one convolution layer), or None. If None, a layer whose window width is width contains min(char_filter_multiple * width, 200) filters.

  • char_filter_multiple – the ratio between filters number and window width

  • char_highway_layers – the number of highway layers on character level

  • conv_dropout – the ratio of dropout between convolutional layers

  • highway_dropout – the ratio of dropout between highway layers

  • intermediate_dropout – the ratio of dropout between convolutional and highway layers on character level

  • lstm_dropout – dropout ratio in word-level LSTM

  • word_vectorizers – list of parameters for additional word-level vectorizers; for each vectorizer it stores a pair of the vectorizer dimension and the dimension of the corresponding word embedding

  • word_lstm_layers – the number of word-level LSTM layers

  • word_lstm_units – hidden dimensions of word-level LSTMs

  • word_dropout – the ratio of dropout before word level (it is applied to word embeddings)

  • regularizer – l2 regularization parameter

  • verbose – the level of verbosity

A subclass of KerasModel
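The default rule for char_filters described in the parameter list can be sketched as follows. The helper name default_char_filters and the default multiple of 25 are illustrative assumptions and not part of the DeepPavlov API; only the min(char_filter_multiple * width, 200) formula comes from the description above.

```python
from typing import List, Union


def default_char_filters(char_window_size: Union[int, List[int]],
                         char_filter_multiple: int = 25) -> List[int]:
    """Hypothetical helper: filter counts used when char_filters is None.

    The default value of char_filter_multiple here is an assumption
    chosen for illustration only.
    """
    # Accept both a single width and a list of parallel window widths.
    if isinstance(char_window_size, int):
        widths = [char_window_size]
    else:
        widths = list(char_window_size)
    # One filter count per window width, capped at 200.
    return [min(char_filter_multiple * width, 200) for width in widths]


narrow = default_char_filters([2, 3, 4, 5])
wide = default_char_filters([3, 7, 9])
```

With these inputs the cap only takes effect for the width-9 window, since 25 * 9 exceeds 200.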

__call__(*x_batch: numpy.ndarray, **kwargs) → Union[List, numpy.ndarray][source]

Predicts answers on batch elements.

Parameters

x_batch – a batch to predict answers on. It can be either a single array for a basic model or a sequence of arrays for a complex one (configuration file or its lemmatized version).

build()[source]

Builds the network using Keras.

load() → None[source]

Checks whether the model file exists and, if it does, loads the model weights from it.

predict_on_batch(data: Union[List[numpy.ndarray], Tuple[numpy.ndarray]], return_indexes: bool = False) → List[List[str]][source]

Makes predictions on a single batch

Parameters
  • data – model inputs for a single batch. data[0] contains input character encodings and is the only element of data for most models. Subsequent elements of data include the output of additional vectorizers, e.g., a dictionary-based one.

  • return_indexes – whether to return tag indexes in vocabulary or the tags themselves

Returns

a batch of label sequences
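The effect of return_indexes can be illustrated with a minimal sketch. The function decode_batch and the toy vocabulary below are hypothetical and only mimic the documented behaviour (indexes versus tag strings); they do not reproduce the model's actual decoding code.

```python
from typing import List, Sequence, Union


def decode_batch(index_batch: Sequence[Sequence[int]],
                 tag_vocab: Sequence[str],
                 return_indexes: bool = False) -> List[List[Union[int, str]]]:
    """Hypothetical helper: map predicted tag indexes to tag strings."""
    if return_indexes:
        # Keep the raw vocabulary indexes.
        return [list(sentence) for sentence in index_batch]
    # Look each index up in the tag vocabulary.
    return [[tag_vocab[i] for i in sentence] for sentence in index_batch]


toy_vocab = ["PUNCT", "NOUN,Number=Sing", "ADV"]   # assumed toy vocabulary
predicted = [[1, 0], [2, 1, 0]]                     # two sentences

as_tags = decode_batch(predicted, toy_vocab)
as_indexes = decode_batch(predicted, toy_vocab, return_indexes=True)
```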

save() → None[source]

Saves model weights to the save_path provided in the config. The directory is already created by super().__init__, which is called in __init__ of this class.

train_on_batch(*args) → None[source]

Trains the model on a single batch.

Parameters
  • *args – the list of network inputs. The last element of args is the batch of targets; all previous elements are training data batches.
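The argument convention of train_on_batch (targets last, inputs before them) can be sketched like this. The helper split_training_args is hypothetical and only shows how such positional arguments may be separated; it is not taken from the DeepPavlov source.

```python
from typing import Any, List, Tuple


def split_training_args(*args: Any) -> Tuple[List[Any], Any]:
    """Hypothetical helper: separate network inputs from the target batch."""
    if len(args) < 2:
        raise ValueError("expected at least one input batch and one target batch")
    # Everything except the last element is training data; the last is targets.
    *inputs, targets = args
    return inputs, targets


char_batch = [[3, 5, 7], [2, 4]]      # toy character encodings
extra_batch = [[0.1], [0.9]]          # toy output of an additional vectorizer
tag_batch = [[1, 0, 2], [2, 1]]       # toy targets

inputs, targets = split_training_args(char_batch, extra_batch, tag_batch)
```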

deeppavlov.models.morpho_tagger.common.predict_with_model(config_path: Union[pathlib.Path, str], infile: Optional[Union[str, pathlib.Path]] = None, input_format: str = 'ud', batch_size: int = 16, output_format: str = 'basic') → List[Optional[List[str]]][source]

Returns predictions of the morphotagging model specified by the config at config_path.

Parameters

config_path – a path to config

Returns

a list of morphological analyses for each sentence. Each analysis is either a list of tags or a list of full CONLL-U descriptions.

class deeppavlov.models.morpho_tagger.lemmatizer.UDPymorphyLemmatizer(save_path: Optional[str] = None, load_path: Optional[str] = None, rare_grammeme_penalty: float = 1.0, long_lemma_penalty: float = 1.0, **kwargs)[source]

A class that returns the normal form of a Russian word given its morphological tag in UD format. The lemma is selected from the PyMorphy parses of the word: the parse whose tag most closely resembles the known UD tag is chosen.
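The idea of choosing the parse whose tag best matches a known UD tag can be sketched as below. This is a simplified illustration under stated assumptions: candidate parses are given directly as (lemma, UD-style tag) pairs, similarity is the number of shared POS and feature values, and the rare_grammeme_penalty and long_lemma_penalty parameters are not modelled. The names tag_items and pick_lemma are hypothetical.

```python
from typing import List, Set, Tuple


def tag_items(tag: str) -> Set[Tuple[str, ...]]:
    """Turn 'POS,Feat=Val|Feat=Val' into a set of (name, value) pairs."""
    pos, _, feats = tag.partition(",")
    items: Set[Tuple[str, ...]] = {("POS", pos)}
    if feats and feats != "_":
        items.update(tuple(feat.split("=", 1)) for feat in feats.split("|"))
    return items


def pick_lemma(ud_tag: str, candidates: List[Tuple[str, str]]) -> str:
    """Return the lemma of the candidate sharing the most items with ud_tag.

    Ties are resolved in favour of the earliest candidate.
    """
    target = tag_items(ud_tag)
    best_lemma, _ = max(candidates, key=lambda c: len(target & tag_items(c[1])))
    return best_lemma


# Toy candidate parses for the ambiguous Russian word form "стали".
candidates = [
    ("стать", "VERB,Mood=Ind|Number=Plur|Tense=Past"),
    ("сталь", "NOUN,Case=Gen|Gender=Fem|Number=Sing"),
]
as_noun = pick_lemma("NOUN,Case=Gen|Gender=Fem|Number=Sing", candidates)
as_verb = pick_lemma("VERB,Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin", candidates)
```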

__call__(data: List[List[str]], tags: Optional[List[List[str]]] = None) → List[List[str]]

Lemmatizes each word in a batch of sentences.

Parameters
  • data – the batch of sentences (lists of words).

  • tags – the batch of morphological tags (if available).

Returns

a batch of lemmatized sentences.

class deeppavlov.models.morpho_tagger.common.TagOutputPrettifier(format_mode: str = 'basic', return_string: bool = True, begin: str = '', end: str = '', sep: str = '\n', **kwargs)[source]

Class which prettifies morphological tagger output to 4-column or 10-column (Universal Dependencies) format.

Parameters
  • format_mode – output format. In basic mode the output contains 4 columns (id, word, pos, features); in conllu or ud mode it contains 10 columns: id, word, lemma, pos, xpos, feats, head, deprel, deps, misc (see http://universaldependencies.org/format.html for details). Only id, word, tag and pos values are present in the current version; other columns are filled with the _ value.

  • return_string – whether to return a list of strings or a single string

  • begin – a string to prepend at the beginning

  • end – a string to append at the end

  • sep – separator between word analyses
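How return_string, begin, end and sep may interact can be sketched as follows. The helper join_analyses is hypothetical, and the assumption that begin and end wrap the sep-joined lines exactly once is an illustration of the parameter descriptions above, not a guaranteed reproduction of the class internals.

```python
from typing import List, Union


def join_analyses(lines: List[str],
                  return_string: bool = True,
                  begin: str = "",
                  end: str = "",
                  sep: str = "\n") -> Union[List[str], str]:
    """Hypothetical helper: combine per-word analyses into the final output."""
    if return_string:
        # Assumed layout: begin + analyses joined by sep + end.
        return begin + sep.join(lines) + end
    # Otherwise return the per-word analyses as a list of strings.
    return list(lines)


word_lines = ["1\tJohn\tPROPN\tNumber=Sing", "2\tsleeps\tVERB\t_"]
as_string = join_analyses(word_lines, begin="# sentence\n", end="\n")
as_list = join_analyses(word_lines, return_string=False)
```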

__call__(X: List[List[str]], Y: List[List[str]]) → List[Union[List[str], str]][source]

Calls the prettify() function for each input sentence.

Parameters
  • X – a list of input sentences

  • Y – a list of lists of tags for sentence words

Returns

a list of prettified morphological analyses

prettify(tokens: List[str], tags: List[str]) → Union[List[str], str][source]

Prettifies output of morphological tagger.

Parameters
  • tokens – tokenized source sentence

  • tags – list of tags, the output of a tagger

Returns

the prettified output of the tagger.

Examples

>>> from deeppavlov.models.morpho_tagger.common import TagOutputPrettifier
>>> sent = "John really likes pizza .".split()
>>> tags = ["PROPN,Number=Sing", "ADV",
...         "VERB,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
...         "NOUN,Number=Sing", "PUNCT"]
>>> prettifier = TagOutputPrettifier(format_mode='basic')
>>> print(prettifier.prettify(sent, tags))
    1       John    PROPN   Number=Sing
    2       really  ADV     _
    3       likes   VERB    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
    4       pizza   NOUN    Number=Sing
    5       .       PUNCT   _
>>> prettifier = TagOutputPrettifier(format_mode='ud')
>>> print(prettifier.prettify(sent, tags))
    1       John    _       PROPN   _       Number=Sing     _       _       _       _
    2       really  _       ADV     _       _       _       _       _       _
    3       likes   _       VERB    _       Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   _       _       _       _
    4       pizza   _       NOUN    _       Number=Sing     _       _       _       _
    5       .       _       PUNCT   _       _       _       _       _       _
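
The column layout shown in the examples can be reproduced with a short standalone sketch. The functions split_tag and prettify_sketch are hypothetical and written only from the example output above (tags of the form POS,features, tab-separated columns, _ for empty values); they are not the class implementation.

```python
from typing import List, Tuple


def split_tag(tag: str) -> Tuple[str, str]:
    """Split 'POS,features' into (pos, features); missing features become '_'."""
    pos, _, feats = tag.partition(",")
    return pos, feats or "_"


def prettify_sketch(tokens: List[str], tags: List[str],
                    format_mode: str = "basic") -> str:
    """Hypothetical re-creation of the 4-column and 10-column layouts."""
    lines = []
    for i, (word, tag) in enumerate(zip(tokens, tags), start=1):
        pos, feats = split_tag(tag)
        if format_mode == "basic":
            columns = [str(i), word, pos, feats]
        elif format_mode in ("conllu", "ud"):
            # id, word, lemma, pos, xpos, feats, head, deprel, deps, misc
            columns = [str(i), word, "_", pos, "_", feats, "_", "_", "_", "_"]
        else:
            raise ValueError("unknown format_mode: {}".format(format_mode))
        lines.append("\t".join(columns))
    return "\n".join(lines)


sent = "John really likes pizza .".split()
tags = ["PROPN,Number=Sing", "ADV",
        "VERB,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
        "NOUN,Number=Sing", "PUNCT"]
basic = prettify_sketch(sent, tags)
ud = prettify_sketch(sent, tags, format_mode="ud")
```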

set_format_mode(format_mode: str = 'basic') → None[source]

A function that sets format for output and recalculates self.format_string.

Parameters

format_mode – output format. In basic mode the output contains 4 columns (id, word, pos, features); in conllu or ud mode it contains 10 columns: id, word, lemma, pos, xpos, feats, head, deprel, deps, misc (see http://universaldependencies.org/format.html for details). Only id, word, tag and pos values are present in the current version; other columns are filled with the _ value.

class deeppavlov.models.morpho_tagger.common.LemmatizedOutputPrettifier(return_string: bool = True, begin: str = '', end: str = '', sep: str = '\n', **kwargs)[source]

Class which prettifies lemmatized morphological tagger output to the 10-column (Universal Dependencies) format.

Parameters
  • format_mode – output format. In basic mode the output contains 4 columns (id, word, pos, features); in conllu or ud mode it contains 10 columns: id, word, lemma, pos, xpos, feats, head, deprel, deps, misc (see http://universaldependencies.org/format.html for details). Only id, word, lemma, tag and pos columns are predicted in the current version; other columns are filled with the _ value.

  • return_string – whether to return a list of strings or a single string

  • begin – a string to prepend at the beginning

  • end – a string to append at the end

  • sep – separator between word analyses

__call__(X: List[List[str]], Y: List[List[str]], Z: List[List[str]]) → List[Union[List[str], str]][source]

Calls the prettify() function for each input sentence.

Parameters
  • X – a list of input sentences

  • Y – a list of lists of tags for sentence words

  • Z – a list of lemmatized sentences

Returns

a list of prettified morphological analyses

prettify(tokens: List[str], tags: List[str], lemmas: List[str]) → Union[List[str], str][source]

Prettifies output of morphological tagger.

Parameters
  • tokens – tokenized source sentence

  • tags – list of tags, the output of a tagger

  • lemmas – list of lemmas, the output of a lemmatizer

Returns

the prettified output of the tagger.

Examples

>>> from deeppavlov.models.morpho_tagger.common import LemmatizedOutputPrettifier
>>> sent = "John really likes pizza .".split()
>>> tags = ["PROPN,Number=Sing", "ADV",
...         "VERB,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
...         "NOUN,Number=Sing", "PUNCT"]
>>> lemmas = "John really like pizza .".split()
>>> prettifier = LemmatizedOutputPrettifier()
>>> print(prettifier.prettify(sent, tags, lemmas))
    1       John    John    PROPN   _       Number=Sing     _       _       _       _
    2       really  really  ADV     _       _       _       _       _       _
    3       likes   like    VERB    _       Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   _       _       _       _
    4       pizza   pizza   NOUN    _       Number=Sing     _       _       _       _
    5       .       .       PUNCT   _       _       _       _       _       _
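
For downstream processing, the 10-column output can be read back into named fields. The sketch below assumes tab-separated columns in the order listed for the ud mode (id, word, lemma, pos, xpos, feats, head, deprel, deps, misc); the names CONLLU_COLUMNS and parse_rows are hypothetical and not part of DeepPavlov.

```python
from typing import Dict, List

# Column order as listed in the format_mode description.
CONLLU_COLUMNS = ["id", "word", "lemma", "pos", "xpos",
                  "feats", "head", "deprel", "deps", "misc"]


def parse_rows(text: str) -> List[Dict[str, str]]:
    """Hypothetical helper: parse prettified 10-column output into dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between sentences
        values = line.split("\t")
        if len(values) != len(CONLLU_COLUMNS):
            raise ValueError("expected 10 tab-separated columns: {!r}".format(line))
        rows.append(dict(zip(CONLLU_COLUMNS, values)))
    return rows


output = ("1\tJohn\tJohn\tPROPN\t_\tNumber=Sing\t_\t_\t_\t_\n"
          "3\tlikes\tlike\tVERB\t_\tNumber=Sing|Person=3\t_\t_\t_\t_")
rows = parse_rows(output)
lemmas = [row["lemma"] for row in rows]
```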