deeppavlov.models.morpho_tagger

class deeppavlov.models.morpho_tagger.network.MorphoTagger(*args, **kwargs)[source]

A wrapper over CharacterTagger. It inherits from KerasWrapper and accepts the initialization parameters of CharacterTagger.

deeppavlov.models.morpho_tagger.common.predict_with_model(config_path: Union[pathlib.Path, str]) → List[Optional[List[str]]][source]

Returns predictions of the morphological tagging model defined in the config at config_path.

Parameters

config_path – a path to the model config file

Returns

a list of morphological analyses, one per sentence. Each analysis is either a list of tags or a list of full CoNLL-U descriptions.
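
A minimal usage sketch (the config path below is a placeholder, not a file shipped with the library):

>>> from deeppavlov.models.morpho_tagger.common import predict_with_model
>>> # each element of analyses corresponds to one input sentence and is either
>>> # a list of tags or a list of full CoNLL-U lines, depending on the config
>>> analyses = predict_with_model("path/to/morpho_tagger_config.json")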

class deeppavlov.models.morpho_tagger.network.CharacterTagger(symbols: deeppavlov.core.data.vocab.DefaultVocabulary, tags: deeppavlov.core.data.vocab.DefaultVocabulary, word_rnn: str = 'cnn', char_embeddings_size: int = 16, char_conv_layers: int = 1, char_window_size: Union[int, List[int]] = 5, char_filters: Union[int, List[int]] = None, char_filter_multiple: int = 25, char_highway_layers: int = 1, conv_dropout: float = 0.0, highway_dropout: float = 0.0, intermediate_dropout: float = 0.0, lstm_dropout: float = 0.0, word_vectorizers: List[Tuple[int, int]] = None, word_lstm_layers: int = 1, word_lstm_units: Union[int, List[int]] = 128, word_dropout: float = 0.0, regularizer: float = None, verbose: int = 1)[source]

A class for a character-based neural morphological tagger.

Parameters
  • symbols – character vocabulary

  • tags – morphological tags vocabulary

  • word_rnn – the type of character-level network (only cnn implemented)

  • char_embeddings_size – the size of character embeddings

  • char_conv_layers – the number of convolutional layers on character level

  • char_window_size – the width of convolutional filter (filters). It can be a list if several parallel filters are applied, for example, [2, 3, 4, 5].

  • char_filters – the number of convolutional filters for each window width. It can be a number, a list (when several windows of different widths are used in a single convolutional layer), a list of lists (when there is more than one convolutional layer), or None. If None, a layer with window width w contains min(char_filter_multiple * w, 200) filters.

  • char_filter_multiple – the ratio between filters number and window width

  • char_highway_layers – the number of highway layers on character level

  • conv_dropout – the ratio of dropout between convolutional layers

  • highway_dropout – the ratio of dropout between highway layers

  • intermediate_dropout – the ratio of dropout between convolutional and highway layers on character level

  • lstm_dropout – dropout ratio in word-level LSTM

  • word_vectorizers – a list of parameters for additional word-level vectorizers; for each vectorizer it stores a pair of the vectorizer dimension and the dimension of the corresponding word embedding

  • word_lstm_layers – the number of word-level LSTM layers

  • word_lstm_units – hidden dimensions of word-level LSTMs

  • word_dropout – the ratio of dropout before the word level (applied to word embeddings)

  • regularizer – l2 regularization parameter

  • verbose – the level of verbosity
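
A construction sketch, assuming symbol_vocab and tag_vocab are already fitted DefaultVocabulary instances (in practice they are built by the DeepPavlov pipeline rather than by hand):

>>> from deeppavlov.models.morpho_tagger.network import CharacterTagger
>>> tagger = CharacterTagger(
...     symbols=symbol_vocab,           # fitted character vocabulary (assumed)
...     tags=tag_vocab,                 # fitted tag vocabulary (assumed)
...     char_embeddings_size=32,
...     char_window_size=[2, 3, 4, 5],  # four parallel convolution widths
...     char_filter_multiple=25,        # width w gets min(25 * w, 200) filters
...     word_lstm_units=128,
...     word_dropout=0.2)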

build()[source]

Builds the network using Keras.

load(infile) → None[source]

Loads model weights from a file

Parameters

infile – file to load model weights from

predict_on_batch(data: Union[list, tuple], return_indexes: bool = False) → List[List[str]][source]

Makes predictions on a single batch

Parameters
  • data – a batch of word sequences together with additional inputs

  • return_indexes – whether to return tag indexes in vocabulary or tags themselves

Returns

a batch of label sequences
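
A usage sketch, assuming tagger from the construction example above and no additional word-level vectorizers:

>>> batch = [["John", "really", "likes", "pizza", "."],
...          ["Mary", "reads", "."]]
>>> tag_sequences = tagger.predict_on_batch(batch)
>>> index_sequences = tagger.predict_on_batch(batch, return_indexes=True)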

save(outfile) → None[source]

Saves model weights to a file

Parameters

outfile – the file to save model weights to (other model components should be specified in the config)
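
A save/load sketch; the file name is illustrative. Only Keras weights are written, so a CharacterTagger with the same configuration has to be constructed before load is called:

>>> tagger.save("morpho_tagger.hdf5")
>>> new_tagger = CharacterTagger(symbols=symbol_vocab, tags=tag_vocab,
...                              char_window_size=[2, 3, 4, 5])
>>> new_tagger.load("morpho_tagger.hdf5")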

symbols_number_

Character vocabulary size

tags_number_

Tag vocabulary size

train_on_batch(data: List[Iterable], labels: Iterable[list]) → None[source]

Trains model on a single batch

Parameters
  • data – a batch of word sequences

  • labels – a batch of correct tag sequences

Returns

None; the model weights are updated in place
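
A training-loop sketch; train_batches is an assumed iterable of (words, tags) pairs, normally produced by a DeepPavlov data iterator:

>>> for batch_words, batch_tags in train_batches:
...     tagger.train_on_batch(batch_words, batch_tags)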

class deeppavlov.models.morpho_tagger.common.TagOutputPrettifier(format_mode: str = 'basic', return_string: bool = True, begin: str = '', end: str = '', sep: str = '\n', **kwargs)[source]

Class which prettifies morphological tagger output to 4-column or 10-column (Universal Dependencies) format.

Parameters
  • format_mode – output format; in basic mode the output contains 4 columns (id, word, pos, features), while in conllu or ud mode it contains 10 columns: id, word, lemma, pos, xpos, feats, head, deprel, deps, misc (see http://universaldependencies.org/format.html for details). Only the id, word, pos and tag (features) values are present in the current version; the other columns are filled with the _ value.

  • return_string – whether to return each analysis as a single string (True) or as a list of strings (False)

  • begin – a string to prepend to the output

  • end – a string to append to the output

  • sep – the separator between word analyses

__call__(X: List[List[str]], Y: List[List[str]]) → List[Union[List[str], str]][source]

Calls the prettify function for each input sentence.

Parameters
  • X – a list of input sentences

  • Y – a list of tag lists, one for the words of each sentence in X

Returns

a list of prettified morphological analyses
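
A batch-prettification sketch via __call__; the sentences and tag strings are illustrative:

>>> prettifier = TagOutputPrettifier(format_mode='ud')
>>> sentences = [["John", "likes", "pizza", "."]]
>>> tags = [["PROPN,Number=Sing",
...          "VERB,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
...          "NOUN,Number=Sing", "PUNCT"]]
>>> conllu_blocks = prettifier(sentences, tags)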

prettify(tokens: List[str], tags: List[str]) → Union[List[str], str][source]

Prettifies output of morphological tagger.

Parameters
  • tokens – tokenized source sentence

  • tags – list of tags, the output of a tagger

Returns

the prettified output of the tagger.

Examples

>>> sent = "John really likes pizza .".split()
>>> tags = ["PROPN,Number=Sing", "ADV",
...         "VERB,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
...         "NOUN,Number=Sing", "PUNCT"]
>>> prettifier = TagOutputPrettifier(format_mode='basic')
>>> prettifier.prettify(sent, tags)
    1       John    PROPN   Number=Sing
    2       really  ADV     _
    3       likes   VERB    Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
    4       pizza   NOUN    Number=Sing
    5       .       PUNCT   _
>>> prettifier = TagOutputPrettifier(format_mode='ud')
>>> prettifier.prettify(sent, tags)
    1       John    _       PROPN   _       Number=Sing     _       _       _       _
    2       really  _       ADV     _       _       _       _       _       _
    3       likes   _       VERB    _       Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   _       _       _       _
    4       pizza   _       NOUN    _       Number=Sing     _       _       _       _
    5       .       _       PUNCT   _       _       _       _       _       _

set_format_mode(format_mode: str = 'basic') → None[source]

Sets the output format and recalculates self.format_string accordingly.

Parameters

format_mode – output format; in basic mode the output contains 4 columns (id, word, pos, features), while in conllu or ud mode it contains 10 columns: id, word, lemma, pos, xpos, feats, head, deprel, deps, misc (see http://universaldependencies.org/format.html for details). Only the id, word, pos and tag (features) values are present in the current version; the other columns are filled with the _ value.

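A one-line sketch: switching an existing prettifier to 10-column output.

>>> prettifier.set_format_mode('ud')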