Graph Transliterator


A graph-based transliteration tool that lets you convert the symbols of one language or script to those of another using rules that you define.

Transliteration… What? Why?

Moving text or data from one script or encoding to another is a common problem:

  • Many languages are written in multiple scripts, and many people can only read one of them. Moving between them can be a complex but necessary task in order to make texts accessible.

  • The identification of names and locations, as well as machine translation, benefit from transliteration.

  • Library systems often require that metadata be in particular forms of romanization in addition to the original script.

  • Linguists need to move between different methods of phonetic transcription.

  • Documents in legacy fonts must now be converted to contemporary Unicode ones.

  • Complex-script languages are frequently approached in natural language processing and in digital humanities research through transliteration, as it provides disambiguating information about pronunciation, morphological boundaries, and unwritten elements not present in the original script.

Graph Transliterator abstracts transliteration, offering an “easy reading” method for developing transliterators that does not require writing a complex program. It also contains bundled transliterators that are rigorously tested. These can be expanded to handle many transliteration tasks.

Contributions are very welcome!

Features

  • Provides a transliteration tool that can be configured to convert the tokens of an input string into an output string using:

    • user-defined types of input tokens and token classes

    • transliteration rules based on:

      • a sequence of input tokens

      • specific input tokens that precede or follow the token sequence

      • classes of input tokens preceding or following specified tokens

    • “on match” rules for output to be inserted between transliteration rules involving particular token classes

    • defined rules for whitespace, including its optional consolidation

  • Can be set up using:

    • an “easy reading” YAML format that lets you quickly craft settings for the transliteration tool

    • a JSON dump of a transliterator (quicker!)

    • “direct” settings, perhaps passed programmatically, using a dictionary

  • Automatically orders rules by the number of tokens in a transliteration rule

  • Checks for ambiguity in transliteration rules

  • Can provide details about each transliteration rule match

  • Allows optional matching of all possible rules in a particular location

  • Permits pruning of rules with certain productions

  • Validates, as well as serializes to and deserializes from JSON and Python data types, using accessible marshmallow schemas

  • Provides full support for Unicode, including Unicode character names in the “easy reading” YAML format

  • Constructs and uses a directed tree and performs a best-first search to find the most specific transliteration rule in a given context

  • Includes bundled transliterators that you can add to, with full test coverage of the nodes and edges of the internal graph and any “on match” rules

  • Includes a command-line interface to perform transliteration and other tasks

Sample Code and Graph

from graphtransliterator import GraphTransliterator
GraphTransliterator.from_yaml("""
    tokens:
      h: [consonant]
      i: [vowel]
      " ": [whitespace]
    rules:
      h: \N{LATIN SMALL LETTER TURNED I}
      i: \N{LATIN SMALL LETTER TURNED H}
      <whitespace> i: \N{LATIN CAPITAL LETTER TURNED H}
      (<whitespace> h) i: \N{LATIN SMALL LETTER TURNED H}!
    onmatch_rules:
      - <whitespace> + <consonant>: ¡
    whitespace:
      default: " "
      consolidate: true
      token_class: whitespace
    metadata:
      title: "Upside Down Greeting Transliterator"
      version: "1.0.0"
""").transliterate("hi")
'¡ᴉɥ!'

Sample directed tree created by Graph Transliterator. The rule nodes are in double circles, and token nodes are single circles. The numbers are the cost of the particular edge, and less costly edges are searched first. Previous token classes and previous tokens that must be present are found as constraints on the edges incident to the terminal leaf rule nodes.

Get It Now

$ pip install -U graphtransliterator

Citation

To cite Graph Transliterator, please use:

Pue, A. Sean (2019). Graph Transliterator: A graph-based transliteration tool. Journal of Open Source Software, 4(44), 1717, https://doi.org/10.21105/joss.01717

Installation

Stable release

To install Graph Transliterator, run this command in your terminal:

$ pip install graphtransliterator

This is the preferred method to install Graph Transliterator, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can walk you through the process.

From sources

The sources for Graph Transliterator can be downloaded from the Github repo.

You can either clone the public repository:

$ git clone git://github.com/seanpue/graphtransliterator

Or download the tarball:

$ curl  -OL https://github.com/seanpue/graphtransliterator/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install
Required modules

Graph Transliterator requires three Python modules: click, marshmallow, and pyyaml. These modules will be installed automatically using the methods described above.

Usage

To use Graph Transliterator in a project:

from graphtransliterator import GraphTransliterator
Overview

Graph Transliterator requires that you first configure a GraphTransliterator. Then you can transliterate an input string using transliterate(). There are a few additional methods that can be used to extract information for specific use cases, such as details about which rules were matched.

Configuration

Graph Transliterator takes the following parameters:

  1. The acceptable types of tokens in the input string as well as any associated token classes.

  2. The transliteration rules for the transformation of the input string.

  3. Rules for dealing with whitespace.

  4. “On match” rules for strings to be inserted in particular contexts right before a transliteration rule’s output is added (optional).

  5. Metadata settings for the transliterator (optional).

Initialization

Defining the rules for transliteration can be difficult, especially when dealing with complex scripts. That is why Graph Transliterator uses an “easy reading” format that allows you to enter the transliteration rules in the popular YAML format, either from a string (using from_yaml()) or by reading from a file or stream (GraphTransliterator.from_yaml_file()). You can also initialize from the loaded contents of YAML (GraphTransliterator.from_easyreading_dict()).

Here is a quick sample that parameterizes GraphTransliterator using an easy reading YAML string (with comments):

yaml_ = """
  tokens:
    a: [vowel]               # type of token ("a") and its class (vowel)
    bb: [consonant, b_class] # type of token ("bb") and its classes (consonant, b_class)
    ' ': [wb]                # type of token (" ") and its class ("wb", for wordbreak)
  rules:
    a: A       # transliterate "a" to "A"
    bb: B      # transliterate "bb" to "B"
    a a: <2AS> # transliterate ("a", "a") to "<2AS>"
    ' ': ' '   # transliterate ' ' to ' '
  whitespace:
    default: " "        # default whitespace token
    consolidate: false  # whitespace should not be consolidated
    token_class: wb     # whitespace token class
"""
gt_one = GraphTransliterator.from_yaml(yaml_)
gt_one.transliterate('a')
'A'
gt_one.transliterate('bb')
'B'
gt_one.transliterate('aabb')
'<2AS>B'

The example above shows a very simple transliterator that replaces the input token “a” with “A”, “bb” with “B”, “ ” with “ ”, and two “a” in a row with “<2AS>”. It does not consolidate whitespace, and treats “ ” as its default whitespace token. Tokens contain strings of one or more characters.

Input Tokens and Token Class Settings

During transliteration, Graph Transliterator first attempts to convert the input string into a list of tokens. This is done internally using GraphTransliterator.tokenize():

gt_one.tokenize('abba')
[' ', 'a', 'bb', 'a', ' ']

Note that the default whitespace token is added to the start and end of the input tokens.

Tokens can be more than one character, and longer tokens are matched first:

yaml_ = """
  tokens:
    a: []      # "a" token with no classes
    aa: []     # "aa" token with no classes
    ' ': [wb]  # " " token and its class ("wb", for wordbreak)
  rules:
    aa: <DOUBLE_A>  # transliterate "aa" to "<DOUBLE_A>"
    a: <SINGLE_A>   # transliterate "a" to "<SINGLE_A>"
  whitespace:
    default: " "        # default whitespace token
    consolidate: false  # whitespace should not be consolidated
    token_class: wb     # whitespace token class
"""
gt_two = GraphTransliterator.from_yaml(yaml_)
gt_two.transliterate('a')
'<SINGLE_A>'
gt_two.transliterate('aa')
'<DOUBLE_A>'
gt_two.transliterate('aaa')
'<DOUBLE_A><SINGLE_A>'

Here the input “aaa” is transliterated as “<DOUBLE_A><SINGLE_A>”, as the longer token “aa” is matched before “a”.

Tokens can be assigned zero or more classes. Each class is a string of your choice. These classes are used in transliteration rules. In YAML they are defined as a dictionary, but internally the tokens are stored as a dictionary of token strings keyed to a set of token classes. They can be accessed using GraphTransliterator.tokens:

gt_two.tokens
{'a': set(), 'aa': set(), ' ': {'wb'}}
Transliteration Rules

Graph Transliterator can handle a variety of transliteration tasks. To do so, it uses transliteration rules that contain match settings for particular tokens in specific contexts and also a resulting production, or string to be appended to the output string.

Match Settings

Transliteration rules contain the following parameters (ordered by where they would appear in a list of tokens):

  • previous token classes : a list of token classes (optional)

  • previous tokens : a list of tokens (optional)

  • tokens : a list of tokens

  • next tokens : a list of tokens (optional)

  • next token classes : a list of token classes (optional)

One or more tokens must be matched in a particular location. Additionally, specific tokens can be required before (previous tokens) or after (next tokens) those tokens, and particular token classes can be required before (previous token classes) and after (next token classes) all of the specific tokens required (previous tokens, tokens, next tokens).

Depending on their complexity, these match conditions can be entered using the “easy reading” format in the following ways.

If there are no required lookahead or lookbehind tokens, the rule can be as follows:

rules:
   a a: aa  # two tokens ("a", "a"), with production "aa"

If, in addition to tokens, there are specific previous or following tokens that must be matched, the rule can be entered as:

tokens:
  a: []
  b: []
  c: []
  d: []
rules:
   a (b): a_before_b  # matches  token 'a' with the next token 'b'
   (c) a: a_after_c   # matches token 'a' when the previous token is 'c'
   a (b c): a_before_b_and_c # matches token 'a' when next tokens are 'b' then 'c'
   (d) a (b c): a_after_d_and_before_b,c  # matches the token 'a' after 'd' and before 'b' and 'c'

Token class names are indicated between angular brackets (“<classname>”). If preceding and following tokens are not required but classes are, these can be entered as follows:

tokens:
  a: []
  b: [class_b]
  c: []
  ' ': [wb]
rules:
  c <class_b>: c_before_class_b  # match token 'c' before a token of class 'class_b'
  <class_b> a: a_after_class_b   # match token 'a' after a token of class 'class_b'
  <class_b> a <class_b>: a_between_class_b # match token 'a' between tokens of class 'class_b'

If token classes must precede or follow specific tokens, these can be entered as:

tokens:
  a: []
  b: []
  c: [class_c]
  d: [class_d]
  ' ': [wb]
rules:
  d (b <class_c>): d_before_b_and_class_c # match token 'd' before 'b' and a token of class 'class_c'
  (<class_c> b) a: a_after_b_and_class_c  # match token 'a' after 'b' and a token of class 'class_c'
  (<class_c> d) a (b <class_c> <class_d>): x # match 'a' after token of 'class_c' and 'd' and before a token of 'class_c' and of 'class_d'
whitespace:
  default: ' '
  token_class: wb
  consolidate: false
Automatic Ordering of Transliteration Rules

Graph Transliterator automatically orders the transliteration rules based on the number of tokens required by the rule. It picks the rule requiring the longest match in a given context. It does so by assigning a cost to each transliteration rule that decreases as the number of required tokens increases: rules requiring more tokens have lower costs and are matched first:

yaml_ = """
  tokens:
    a: []
    b: []
    c: [class_of_c]
    ' ': [wb]
  rules:
    a: <<A>>
    a b: <<AB>>
    b: <<B>>
    c: <<C>>
    ' ': _
    <class_of_c> a b: <<AB_after_C>>
  whitespace:
    default: " "
    consolidate: false
    token_class: wb
"""
gt_three = GraphTransliterator.from_yaml(yaml_)
gt_three.transliterate("ab")  # should match rule "a b"
'<<AB>>'
gt_three.transliterate("cab") # should match rules: "c", and "<class_of_c> a b"
'<<C>><<AB_after_C>>'

Internally, Graph Transliterator uses a special TransliterationRule class. These can be accessed using GraphTransliterator.rules. Rules are sorted by cost, lowest to highest:

gt_three.rules
[TransliterationRule(production='<<AB_after_C>>', prev_classes=['class_of_c'], prev_tokens=None, tokens=['a', 'b'], next_tokens=None, next_classes=None, cost=0.32192809488736235),
 TransliterationRule(production='<<AB>>', prev_classes=None, prev_tokens=None, tokens=['a', 'b'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='<<A>>', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='<<B>>', prev_classes=None, prev_tokens=None, tokens=['b'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='<<C>>', prev_classes=None, prev_tokens=None, tokens=['c'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='_', prev_classes=None, prev_tokens=None, tokens=[' '], next_tokens=None, next_classes=None, cost=0.5849625007211562)]
Whitespace Settings

Whitespace is often very important in transliteration tasks, as the form of many letters may change at the start or end of words, as in the right-to-left Perso-Arabic and left-to-right Indic scripts. Therefore, Graph Transliterator requires the following whitespace settings:

  • the default whitespace token

  • the whitespace token class

  • whether or not to consolidate whitespace

A whitespace token and token class must be defined for any Graph Transliterator. A whitespace character is added temporarily to the start and end of the input tokens during the transliteration process.

The consolidate option may be useful in particular transliteration tasks. It replaces any sequential whitespace tokens in the input string with the default whitespace character. At the start and end of input, it removes any whitespace:

yaml_ = """
  tokens:
    a: []
    ' ': [wb]
  rules:
    <wb> a: _A
    a <wb>: A_
    <wb> a <wb>: _A_
    a: a
    ' ': ' '
  whitespace:
    default: " "        # default whitespace token
    consolidate: true   # whitespace should be consolidated
    token_class: wb     # whitespace token class
"""
gt = GraphTransliterator.from_yaml(yaml_)
gt.transliterate('a')   # whitespace present at start and end of string
'_A_'
gt.transliterate('aa')  # whitespace present at start and end of string
'_AA_'
gt.transliterate(' a')  # consolidate removes whitespace at start of string
'_A_'
gt.transliterate('a ')  # consolidate removes whitespace at end of string
'_A_'

Whitespace settings are stored internally as WhitespaceRules and can be accessed using GraphTransliterator.whitespace:

gt.whitespace
WhitespaceRules(default=' ', token_class='wb', consolidate=True)
On Match Rules

Graph Transliterator allows strings to be inserted right before the productions of transliteration rules. These take as parameters:

  • a list of previous token classes, preceding the location of the transliteration rule match

  • a list of next token classes, from the index of the transliteration rule match

  • a production string to insert

In the easy reading YAML format, the onmatch_rules are a list of dictionaries. The key consists of the token class names in angular brackets (“<classname>”), and the previous classes to match are separated from the following classes by a “+”. The production is the value of the dictionary:

yaml_ = """
  tokens:
    a: [vowel]
    ' ': [wb]
  rules:
    a: A
    ' ': ' '
  whitespace:
    default: " "
    consolidate: false
    token_class: wb
  onmatch_rules:
    - <vowel> + <vowel>: ',' # add a comma between vowels
"""
gt = GraphTransliterator.from_yaml(yaml_)
gt.transliterate('aa')
'A,A'

On Match rules are stored internally as an OnMatchRule and can be accessed using GraphTransliterator.onmatch_rules:

gt.onmatch_rules
[OnMatchRule(prev_classes=['vowel'], next_classes=['vowel'], production=',')]
Metadata

Graph Transliterator allows for the storage of metadata as another input parameter, metadata. It is a dictionary, and fields can be added to it:

yaml_ = """
  tokens:
    a: []
    ' ': [wb]
  rules:
    a: A
    ' ': ' '
  whitespace:
    default: " "
    consolidate: false
    token_class: wb
  metadata:
    author: Author McAuthorson
    version: 0.1.1
    description: A sample Graph Transliterator
"""
gt = GraphTransliterator.from_yaml(yaml_)
gt.metadata
{'author': 'Author McAuthorson',
 'version': '0.1.1',
 'description': 'A sample Graph Transliterator'}
Unicode Support

Graph Transliterator allows Unicode characters to be specified by name, including in YAML files, using the format “\N{UNICODE CHARACTER NAME}” or by code point using “\uXXXX” (where XXXX is the four-digit hexadecimal character code):

yaml_ = """
  tokens:
    b: []
    c: []
    ' ': [wb]
  rules:
    b: \N{LATIN CAPITAL LETTER B}
    c: \u0043    # hexadecimal Unicode character code for 'C'
    ' ': ' '
  whitespace:
    default: " "
    consolidate: false
    token_class: wb
"""
gt = GraphTransliterator.from_yaml(yaml_)
gt.transliterate('b')
'B'
gt.transliterate('c')
'C'
Configuring Directly

In addition to using GraphTransliterator.from_yaml() and GraphTransliterator.from_yaml_file(), Graph Transliterator can also be configured and initialized directly using basic Python types passed as a dictionary to GraphTransliterator.from_dict():

settings = {
  'tokens': {'a': ['vowel'],
             ' ': ['wb']},
  'rules': [
      {'production': 'A', 'tokens': ['a']},
      {'production': ' ', 'tokens': [' ']}],
  'onmatch_rules': [
      {'prev_classes': ['vowel'],
       'next_classes': ['vowel'],
       'production': ','}],
  'whitespace': {
      'default': ' ',
      'consolidate': False,
      'token_class': 'wb'},
  'metadata': {
      'author': 'Author McAuthorson'}
}
gt = GraphTransliterator.from_dict(settings)
gt.transliterate('a')
'A'

This feature can be useful if generating a Graph Transliterator using code as opposed to a configuration file.

Ambiguity Checking

Graph Transliterator, by default, will check for ambiguity in its transliteration rules. If two rules of the same cost would match the same string(s) and those strings would not be matched by a less costly rule, an AmbiguousTransliterationRulesException occurs. Details of all exceptions will be reported as a logging.warning():

yaml_ = """
tokens:
  a: [class1, class2]
  b: []
  ' ': [wb]
rules:
  <class1> a: A
  <class2> a: AA # ambiguous rule
  <class1> b: BB
  b <class2>: BB # also ambiguous
whitespace:
  default: ' '
  consolidate: True
  token_class: wb
"""
gt = GraphTransliterator.from_yaml(yaml_)
WARNING:root:The pattern [{'a'}, {'a'}, {' ', 'a', 'b'}] can be matched by both:
  <class1> a
  <class2> a

WARNING:root:The pattern [{'a'}, {'b'}, {'a'}] can be matched by both:
  <class1> b
  b <class2>

AmbiguousTransliterationRulesException

The warning shows the set of possible previous tokens, matched tokens, and next tokens as three sets.

Ambiguity checking is only necessary when using an untested Graph Transliterator. It can be turned off during initialization. To do so, set the initialization parameter check_ambiguity to False.

Ambiguity checking can also be done on demand using check_for_ambiguity().

Ambiguity checking is not performed if loading from a serialized GraphTransliterator using GraphTransliterator.load() or GraphTransliterator.loads().

Setup Validation

Graph Transliterator validates both the “easy reading” configuration and the direct configuration using the marshmallow library.

Transliteration and Its Exceptions

The main method of Graph Transliterator is GraphTransliterator.transliterate(). It will return a string:

GraphTransliterator.from_yaml(
'''
tokens:
  a: []
  ' ': [wb]
rules:
  a: A
  ' ': '_'
whitespace:
  default: ' '
  consolidate: True
  token_class: wb
''').transliterate("a a")
'A_A'

Details of transliteration error exceptions will be logged using logging.warning().

Unrecognizable Input Token

Unless the GraphTransliterator is initialized with, or has, the property ignore_errors set to True, GraphTransliterator.transliterate() will raise UnrecognizableInputTokenException when character(s) in the input string do not correspond to any defined types of input tokens. In both cases, there will be a logging.warning():

from graphtransliterator import GraphTransliterator
yaml_ = """
  tokens:
    a: []
    ' ': [wb]
  rules:
    a: A
    ' ': ' '
  whitespace:
    default: " "
    consolidate: true
    token_class: wb
"""
GraphTransliterator.from_yaml(yaml_).transliterate("a!a") # ignore_errors=False
WARNING:graphtransliterator:Unrecognizable token ! at pos 1 of a!a
UnrecognizableInputTokenException
GraphTransliterator.from_yaml(yaml_, ignore_errors=True).transliterate("a!a") # ignore_errors=True
WARNING:graphtransliterator:Unrecognizable token ! at pos 1 of a!a
'AA'
No Matching Transliteration Rule

Another possible error occurs when no transliteration rule can be identified at a particular index in the input string. In that case, there will be a logging.warning(). If the parameter ignore_errors is set to True, the token index will be advanced. Otherwise, there will be a NoMatchingTransliterationRuleException:

yaml_ = '''
  tokens:
    a: []
    b: []
    ' ': [wb]
  rules:
    a: A
    b (a): B
  whitespace:
    default: ' '
    token_class: wb
    consolidate: False
'''
gt = GraphTransliterator.from_yaml(yaml_)
gt.transliterate("ab")
WARNING:graphtransliterator:No matching transliteration rule at token pos 2 of [' ', 'a', 'b', ' ']
NoMatchingTransliterationRuleException
gt.ignore_errors = True
gt.transliterate("ab")
WARNING:graphtransliterator:No matching transliteration rule at token pos 2 of [' ', 'a', 'b', ' ']
'A'
Additional Methods

Graph Transliterator also offers a few additional methods that may be useful for particular tasks.

Serialization and Deserialization

The settings of a Graph Transliterator can be serialized using GraphTransliterator.dump(), which returns a dictionary of native Python data types. A JSON string of the same can be accessed using GraphTransliterator.dumps(). Validation is not performed during a dump.

By default, GraphTransliterator.dumps() will use compression level 2, which removes the internal graph and indexes tokens and graph node labels. Compression level 1 also indexes tokens and graph node labels and contains the graph. Compression level 0 is human readable and includes the graph. No information is lost during compression. Level 2, the default, loads the fastest and also has the smallest file size.

A GraphTransliterator can be loaded from serialized settings, e.g. in an API context, using GraphTransliterator.load() and from JSON data as GraphTransliterator.loads(). Because they are intended to be quick, neither method performs ambiguity checks or strict validation checking by default.

Serialization can be useful if providing an API or making the configured Graph Transliterator available in other programming languages, e.g. Javascript.

Matching at an Index

The method match_at() is also public. It matches the best transliteration rule at a particular index, which is the rule that contains the largest number of required tokens. The method also has the option match_all which, if set, returns all possible transliteration matches at a particular location:

gt = GraphTransliterator.from_yaml('''
        tokens:
            a: []
            a a: []
            ' ': [wb]
        rules:
            a: <A>
            a a: <AA>
        whitespace:
            default: ' '
            consolidate: True
            token_class: wb
''')
tokens = gt.tokenize("aa")
tokens # whitespace added to ends
[' ', 'a', 'a', ' ']
gt.match_at(1, tokens) # returns index to rule
0
gt.rules[gt.match_at(1, tokens)] # actual rule
TransliterationRule(production='<AA>', prev_classes=None, prev_tokens=None, tokens=['a', 'a'], next_tokens=None, next_classes=None, cost=0.41503749927884376)
gt.match_at(1, tokens, match_all=True) # index to rules, with match_all
[0, 1]
[gt.rules[_] for _ in gt.match_at(1, tokens, match_all=True)] # actual rules, with match_all
[TransliterationRule(production='<AA>', prev_classes=None, prev_tokens=None, tokens=['a', 'a'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='<A>', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562)]
Details of Matches

Each Graph Transliterator has a property last_matched_rules which returns a list of TransliterationRule of the previously matched transliteration rules:

gt.transliterate("aaa")
'<AA><A>'
gt.last_matched_rules
[TransliterationRule(production='<AA>', prev_classes=None, prev_tokens=None, tokens=['a', 'a'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='<A>', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562)]

The particular tokens matched by those rules can be accessed using last_matched_rule_tokens:

gt.last_matched_rule_tokens
[['a', 'a'], ['a']]
Pruning of Rules

In particular cases, it may be useful to remove certain transliteration rules from a more robustly defined Graph Transliterator based on the string output produced by the rules. That can be done using pruned_of():

gt.rules
[TransliterationRule(production='<AA>', prev_classes=None, prev_tokens=None, tokens=['a', 'a'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='<A>', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562)]
gt.pruned_of('<AA>').rules
[TransliterationRule(production='<A>', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562)]
gt.pruned_of(['<A>', '<AA>']).rules
[]
Internal Graph

Graph Transliterator creates a directed tree during its initialization. During calls to transliterate(), it searches that graph to find the best transliteration match at a particular index in the tokens of the input string.

DirectedGraph

The tree is an instance of DirectedGraph that can be accessed using GraphTransliterator.graph. It contains: a list of nodes, each consisting of a dictionary of attributes; a dictionary of edges keyed between the head and tail of an edge that contains a dictionary of edge attributes; and finally an edge list.

gt = GraphTransliterator.from_yaml(
    """
    tokens:
      a: []
      ' ': [wb]
    rules:
      a: b
      <wb> a: B
      ' ': ' '
    whitespace:
      token_class: wb
      default: ' '
      consolidate: false
    """)
gt.graph
<graphtransliterator.graphs.DirectedGraph at 0x7fd14c3c5800>
Nodes

The tree has nodes of three types: Start, token, and rule. A single Start node, the root, is connected to all other nodes. A token node corresponds to a token having been matched. Finally, rule nodes are leaf nodes (with no outgoing edges) that correspond to matched transliteration rules:

gt.graph.node
[{'type': 'Start', 'ordered_children': {'a': [1], ' ': [4]}},
 {'type': 'token', 'token': 'a', 'ordered_children': {'__rules__': [2, 3]}},
 {'type': 'rule', 'rule_key': 0, 'accepting': True, 'ordered_children': {}},
 {'type': 'rule', 'rule_key': 1, 'accepting': True, 'ordered_children': {}},
 {'type': 'token', 'token': ' ', 'ordered_children': {'__rules__': [5]}},
 {'type': 'rule', 'rule_key': 2, 'accepting': True, 'ordered_children': {}}]
Edges

Edges between these nodes may have different constraints in their attributes:

gt.graph.edge
{0: {1: {'token': 'a', 'cost': 0.41503749927884376},
  4: {'token': ' ', 'cost': 0.5849625007211562}},
 1: {2: {'cost': 0.41503749927884376, 'constraints': {'prev_classes': ['wb']}},
  3: {'cost': 0.5849625007211562}},
 4: {5: {'cost': 0.5849625007211562}}}

Before the token nodes, there is a token constraint on the edge that must be matched before the transliterator can visit the token node:

gt.graph.edge[0][1]
{'token': 'a', 'cost': 0.41503749927884376}

On the edges before rules there may be other constraints, such as certain tokens preceding or following tokens of the corresponding transliteration rule:

gt.graph.edge[1][2]
{'cost': 0.41503749927884376, 'constraints': {'prev_classes': ['wb']}}

An edge list is also maintained that consists of a tuple of (head, tail):

gt.graph.edge_list
[(0, 1), (1, 2), (1, 3), (0, 4), (4, 5)]
Search and Preprocessing

Graph Transliterator uses a best-first search, implemented using a stack, that finds the transliteration with the lowest cost. The cost function is:

\[\text{cost}(rule) = \log_2\big(1+\frac{1}{1+\text{count\_of\_tokens\_in}(rule)}\big)\]

It results in a number between 0 and 1 that decreases as more tokens must be matched. Each edge on the graph has a cost attribute that is set to the lowest cost of any transliteration rule following it. When transliterating, Graph Transliterator will try lower cost edges first and will backtrack if the constraint conditions are not met.
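The cost function can be checked directly; the values below match the costs shown in the rule listings above:

```python
import math

def rule_cost(token_count: int) -> float:
    # cost(rule) = log2(1 + 1/(1 + number of tokens required by the rule))
    return math.log2(1 + 1 / (1 + token_count))

rule_cost(1)  # 0.5849625007211562  (single-token rules)
rule_cost(2)  # 0.41503749927884376 (two-token rules, e.g. "a b")
rule_cost(3)  # 0.32192809488736235 (three tokens, e.g. "<class_of_c> a b")
```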

Sample graph

An example graph created for the simple case of a Graph Transliterator that takes as input two token types, a and " " (space), and renders " " as " ", and a as b unless it follows a token of class wb (for wordbreak), in which case it renders a as B. The rule nodes are in double circles, and token nodes are single circles. The numbers are the cost of the particular edge, and less costly edges are searched first. Previous token class (prev_classes) constraints are found on the edge before the leftmost leaf rule node.

To optimize the search, during initialization an ordered_children dictionary is added to each non-leaf node. Its values are lists of node indexes, keyed by the following token and sorted by cost:

>>> gt.graph.node[0]
{'type': 'Start', 'ordered_children': {'a': [1], ' ': [4]}}

Any rule connected to a node is added to each ordered_children. Any rule nodes immediately following the current node are keyed to __rules__:

>>> gt.graph.node[1]
{'type': 'token', 'token': 'a', 'ordered_children': {'__rules__': [2, 3]}}

Because of this preprocessing, Graph Transliterator does not need to iterate through all of the outgoing edges of a node to find the next node to search.
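The lookup that this preprocessing enables can be sketched as follows. This is an illustration of the idea, not the library's internal code, and the helper name is ours:

```python
# Given the next input token, candidate nodes are looked up directly in
# ordered_children instead of scanning all outgoing edges. Rule nodes
# reachable from this node are merged into each token's list and also
# keyed under "__rules__" as a fallback.
def next_candidates(node, next_token):
    children = node["ordered_children"]
    return children.get(next_token, children.get("__rules__", []))

start = {"type": "Start", "ordered_children": {"a": [1], " ": [4]}}
next_candidates(start, "a")  # [1]
```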

Bundled Transliterators

Note

Python code on this page: bundled.py Jupyter Notebook: bundled.ipynb

Graph Transliterator includes bundled transliterators in a Bundled subclass of GraphTransliterator that can be used as follows:

>>> import graphtransliterator.transliterators as transliterators
>>> example_transliterator = transliterators.Example()
>>> example_transliterator.transliterate('a')
'A'

To access transliterator classes, use the iterator transliterators.iter_transliterators():

>>> bundled_iterator = transliterators.iter_transliterators()
>>> next(bundled_iterator)
<example.Example at 0x7fdb727b9f10>

To access the names of transliterator classes, use the iterator transliterators.iter_names():

>>> bundled_names_iterator = transliterators.iter_names()
>>> next(bundled_names_iterator)
'Example'

The actual bundled transliterators are submodules of graphtransliterator.transliterators, but they are loaded into the namespace of transliterators:

>>> from graphtransliterator.transliterators import Example

Each instance of Bundled contains a directory attribute:

>>> transliterator = Example()
>>> transliterator.directory
'/home/docs/checkouts/readthedocs.org/user_builds/graphtransliterator/checkouts/latest/graphtransliterator/transliterators/example'

Each will contain an easy-reading YAML file that you can view:

tokens:
  a: [vowel]
  ' ': [whitespace]
  b: [consonant]
rules:
  a: A
  b: B
  ' ': ' '
  (<consonant> a) b (a <consonant>):  "!B!"
onmatch_rules:
  - <vowel> + <vowel>: ","
whitespace:
  consolidate: False
  default: " "
  token_class: whitespace
metadata:
  name: example
  version: 1.0.0
  description: "An Example Bundled Transliterator"
  url: https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample
  author: Author McAuthorson
  author_email: author_mcauthorson@msu.edu
  license: MIT License
  keywords:
    - example
  project_urls:
    Documentation: https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example
    Source: https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example
    Tracker: https://github.com/seanpue/graphtransliterator/issues

There is also a JSON dump of the transliterator for quick loading:

{"graphtransliterator_version":"1.2.0","compressed_settings":[["consonant","vowel","whitespace"],[" ","a","b"],[[2],[1],[0]],[["!B!",[0],[1],[2],[1],[0],-5],["A",0,0,[1],0,0,-1],["B",0,0,[2],0,0,-1],[" ",0,0,[0],0,0,-1]],[" ","whitespace",0],[[[1],[1],","]],{"name":"example","version":"1.0.0","description":"An Example Bundled Transliterator","url":"https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample","author":"Author McAuthorson","author_email":"author_mcauthorson@msu.edu","license":"MIT License","keywords":["example"],"project_urls":{"Documentation":"https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example","Source":"https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example","Tracker":"https://github.com/seanpue/graphtransliterator/issues"}},null]}
Test Coverage of Bundled Transliterators

Each bundled transliterator requires rigorous testing: every node and edge of its graph, as well as any onmatch rules, must be visited. A separate subclass of GraphTransliterator, CoverageTransliterator, is used during testing.

It logs visits to nodes, edges, and onmatch rules. The tests are found in a subdirectory of the transliterator named “tests”. They are in a YAML file consisting of a dictionary keyed from transliteration input to correct output, e.g.:

# YAML declaration of tests for bundled Graph Transliterator
# These are in the form of a dictionary.
# The key is the source text, and the value is the correct transliteration.
' ': ' '
a: A
aa: A,A
babab: BA!B!AB
b: B
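A runner for such a test file can be sketched as follows. The dictionary below mirrors the YAML above; `transliterate` stands in for the bundled transliterator's method:

```python
# Each key is a source text, each value the expected transliteration.
def run_yaml_tests(transliterate, tests):
    for source, expected in tests.items():
        result = transliterate(source)
        assert result == expected, f"{source!r} -> {result!r}, expected {expected!r}"

tests = {" ": " ", "a": "A", "aa": "A,A", "babab": "BA!B!AB", "b": "B"}
# e.g. run_yaml_tests(transliterators.Example().transliterate, tests)
```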

Once the tests are completed, Graph Transliterator checks that all components of the graph and all of the onmatch rules have been visited.

Class Structure and Naming Conventions

Each transliterator must include a class definition in a submodule of transliterators.

The class name of each transliterator must be unique and follow camel-case conventions, e.g. SourceToTarget. File and directory names should, if applicable, be lowercased as source_to_target.

The bundled files should follow this directory structure, where {{source_to_target}} is the name of the transliterator:

transliterators
├── {{source_to_target}}
|   ├── __init__.py
|   ├── {{source_to_target}}.json
|   ├── {{source_to_target}}.yaml
└── tests
    ├── test_{{source_to_target}}.py
    └── {{source_to_target}}_tests.yaml

The bundled transliterator will:

  • include both an easy-reading YAML file {{source_to_target}}.yaml and a JSON file {{source_to_target}}.json.

  • have tests in a YAML format consisting of a dictionary keyed from transliteration input to correct output in {{source_to_target}}_tests.yaml. The tests must provide complete coverage of the graph: every node and edge, as well as every on-match rule, must be visited during the course of the tests.

  • include metadata about the transliterator in its easy-reading YAML file.

  • have an optional custom test file test_{{source_to_target}}.py. This is useful during development.

Metadata Requirements

Each Bundled transliterator can include the following metadata fields. These fields are a subset of the metadata of setuptools.

name (str)

Name of the transliterator, e.g. “source_to_target”.

version (str, optional)

Version of the transliterator. Semantic versioning (https://semver.org) is recommended.

url (str, optional)

URL for the transliterator, e.g. github repository.

author (str, optional)

Author of the transliterator.

author_email (str, optional)

E-mail address of the author.

maintainer (str, optional)

Name of the maintainer.

maintainer_email (str, optional)

E-mail address of the maintainer.

license (str, optional)

License of the transliterator. An open-source license is required for inclusion in this project.

keywords (list of str, optional)

List of keywords.

project_urls (dict of {str: str}, optional)

Dictionary of project URLs, e.g. Documentation, Source, etc.

Metadata is validated using a BundledMetadataSchema found in transliterators.schemas.

To browse metadata, you can use iter_transliterators():

>>> import pprint
>>> transliterator = next(transliterators.iter_transliterators())
>>> pprint.pprint(transliterator.metadata)
{'author': 'Author McAuthorson',
 'author_email': 'author_mcauthorson@msu.edu',
 'description': 'An Example Bundled Transliterator',
 'keywords': ['example'],
 'license': 'MIT License',
 'name': 'example',
 'project_urls': {'Documentation': 'https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example',
                  'Source': 'https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example',
                  'Tracker': 'https://github.com/seanpue/graphtransliterator/issues'},
 'url': 'https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample',
 'version': '1.0.0'}
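The actual validation is performed by BundledMetadataSchema (a marshmallow schema in transliterators.schemas). As an illustration only, a minimal stand-in check of the same idea might look like this:

```python
# Illustrative only — not the real BundledMetadataSchema. "name" is
# required; the other string fields are optional.
def check_metadata(metadata):
    if not isinstance(metadata.get("name"), str):
        raise ValueError("metadata requires a 'name' string")
    optional_str = ("version", "url", "author", "author_email",
                    "maintainer", "maintainer_email", "license")
    for field in optional_str:
        if field in metadata and not isinstance(metadata[field], str):
            raise ValueError(f"{field} must be a string")
    return True

check_metadata({"name": "example", "version": "1.0.0"})  # True
```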

Command Line Interface

Graph Transliterator has a simple command line interface with seven commands: dump, dump-tests, generate-tests, list-bundled, make-json, test, and transliterate.

$ graphtransliterator --help
Usage: main [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  dump            Dump transliterator as JSON.
  dump-tests      Dump BUNDLED tests.
  generate-tests  Generate tests as YAML.
  list-bundled    List BUNDLED transliterators.
  make-json       Make JSON rules of BUNDLED transliterator(s).
  test            Test BUNDLED transliterator.
  transliterate   Transliterate INPUT.

Dump

The dump command will output the specified transliterator as JSON:

$ graphtransliterator dump --help
Usage: dump [OPTIONS]

  Dump transliterator as JSON.

Options:
  -f, --from <CHOICE TEXT>...     Format (bundled/yaml_file) and source (name or
                                  filename) of transliterator  [required]
  -ca, --check-ambiguity / -nca, --no-check-ambiguity
                                  Check for ambiguity.  [default: no-check-
                                  ambiguity]
  -cl, --compression-level INTEGER
                                  Compression level, from 0 to 2  [default: 2]
  --help                          Show this message and exit.

It requires a --from or -f option with two arguments. The first argument specifies the format of the transliterator (bundled or yaml_file) and the second a parameter for that format (the name of the bundled transliterator or the name of a YAML file).

To load a bundled transliterator, use bundled as the first parameter and give its (class) name, which will be in CamelCase, as the second:

$ graphtransliterator dump --from bundled Example
{"graphtransliterator_version":"1.2.4","compressed_settings":[["consonant","vowel","whitespace"],[" ","a","b"],[[2],[1],[0]],[["!B!",[0],[1],[2],[1],[0],-5],["A",0,0,[1],0,0,-1],["B",0,0,[2],0,0,-1],[" ",0,0,[0],0,0,-1]],[" ","whitespace",0],[[[1],[1],","]],{"name":"example","version":"1.0.0","description":"An Example Bundled Transliterator","url":"https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample","author":"Author McAuthorson","author_email":"author_mcauthorson@msu.edu","license":"MIT License","keywords":["example"],"project_urls":{"Documentation":"https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example","Source":"https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example","Tracker":"https://github.com/seanpue/graphtransliterator/issues"}},null]}

To load from a YAML file, give yaml_file as the first parameter and the name of the file as the second:

$ graphtransliterator dump --from yaml_file ../graphtransliterator/transliterators/example/example.yaml
{"graphtransliterator_version":"1.2.4","compressed_settings":[["consonant","vowel","whitespace"],[" ","a","b"],[[2],[1],[0]],[["!B!",[0],[1],[2],[1],[0],-5],["A",0,0,[1],0,0,-1],["B",0,0,[2],0,0,-1],[" ",0,0,[0],0,0,-1]],[" ","whitespace",0],[[[1],[1],","]],{"name":"example","version":"1.0.0","description":"An Example Bundled Transliterator","url":"https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample","author":"Author McAuthorson","author_email":"author_mcauthorson@msu.edu","license":"MIT License","keywords":["example"],"project_urls":{"Documentation":"https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example","Source":"https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example","Tracker":"https://github.com/seanpue/graphtransliterator/issues"}},null]}

If you want to check for ambiguity in the transliterator before the dump, use the --check-ambiguity or -ca option:

$ graphtransliterator dump --from bundled Example --check-ambiguity # human readable
{"graphtransliterator_version":"1.2.4","compressed_settings":[["consonant","vowel","whitespace"],[" ","a","b"],[[2],[1],[0]],[["!B!",[0],[1],[2],[1],[0],-5],["A",0,0,[1],0,0,-1],["B",0,0,[2],0,0,-1],[" ",0,0,[0],0,0,-1]],[" ","whitespace",0],[[[1],[1],","]],{"name":"example","version":"1.0.0","description":"An Example Bundled Transliterator","url":"https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample","author":"Author McAuthorson","author_email":"author_mcauthorson@msu.edu","license":"MIT License","keywords":["example"],"project_urls":{"Documentation":"https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example","Source":"https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example","Tracker":"https://github.com/seanpue/graphtransliterator/issues"}},null]}

The compression level of the JSON can be specified using the --compression-level or -cl option. Compression level 0 is human readable; compression level 1 is not human readable and includes the generated graph; compression level 2 is not human readable and does not include the graph. Compression level 2, which is the fastest, is the default. No information is lost during compression:

$ graphtransliterator dump --from bundled Example --compression-level 0 # human readable, with graph
{"tokens": {"a": ["vowel"], " ": ["whitespace"], "b": ["consonant"]}, "rules": [{"production": "!B!", "prev_classes": ["consonant"], "prev_tokens": ["a"], "tokens": ["b"], "next_classes": ["consonant"], "next_tokens": ["a"], "cost": 0.22239242133644802}, {"production": "A", "tokens": ["a"], "cost": 0.5849625007211562}, {"production": "B", "tokens": ["b"], "cost": 0.5849625007211562}, {"production": " ", "tokens": [" "], "cost": 0.5849625007211562}], "whitespace": {"default": " ", "token_class": "whitespace", "consolidate": false}, "onmatch_rules": [{"prev_classes": ["vowel"], "next_classes": ["vowel"], "production": ","}], "metadata": {"name": "example", "version": "1.0.0", "description": "An Example Bundled Transliterator", "url": "https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample", "author": "Author McAuthorson", "author_email": "author_mcauthorson@msu.edu", "license": "MIT License", "keywords": ["example"], "project_urls": {"Documentation": "https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example", "Source": "https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example", "Tracker": "https://github.com/seanpue/graphtransliterator/issues"}}, "ignore_errors": false, "onmatch_rules_lookup": {"a": {"a": [0]}}, "tokens_by_class": {"vowel": ["a"], "whitespace": [" "], "consonant": ["b"]}, "graph": {"node": [{"type": "Start", "ordered_children": {"b": [1], "a": [3], " ": [6]}}, {"token": "b", "type": "token", "ordered_children": {"__rules__": [2, 5]}}, {"type": "rule", "accepting": true, "rule_key": 0}, {"token": "a", "type": "token", "ordered_children": {"__rules__": [4]}}, {"type": "rule", "accepting": true, "rule_key": 1}, {"type": "rule", "accepting": true, "rule_key": 2}, {"token": " ", "type": "token", "ordered_children": {"__rules__": [7]}}, {"type": "rule", "accepting": true, "rule_key": 3}], "edge": {"0": {"1": {"token": "b", "cost": 
0.22239242133644802}, "3": {"token": "a", "cost": 0.5849625007211562}, "6": {"token": " ", "cost": 0.5849625007211562}}, "1": {"2": {"cost": 0.22239242133644802, "constraints": {"prev_classes": ["consonant"], "prev_tokens": ["a"], "next_tokens": ["a"], "next_classes": ["consonant"]}}, "5": {"cost": 0.5849625007211562}}, "3": {"4": {"cost": 0.5849625007211562}}, "6": {"7": {"cost": 0.5849625007211562}}}, "edge_list": [[0, 1], [0, 3], [0, 6], [1, 2], [1, 5], [3, 4], [6, 7]]}, "tokenizer_pattern": "(b|a|\\ )", "graphtransliterator_version": "1.2.4"}

$ graphtransliterator dump --from bundled Example --compression-level 1 # not human readable, with graph
{"graphtransliterator_version":"1.2.4","compressed_settings":[["consonant","vowel","whitespace"],[" ","a","b"],[[2],[1],[0]],[["!B!",[0],[1],[2],[1],[0],-5],["A",0,0,[1],0,0,-1],["B",0,0,[2],0,0,-1],[" ",0,0,[0],0,0,-1]],[" ","whitespace",0],[[[1],[1],","]],{"name":"example","version":"1.0.0","description":"An Example Bundled Transliterator","url":"https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample","author":"Author McAuthorson","author_email":"author_mcauthorson@msu.edu","license":"MIT License","keywords":["example"],"project_urls":{"Documentation":"https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example","Source":"https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example","Tracker":"https://github.com/seanpue/graphtransliterator/issues"}},[["Start","rule","token"],[[0,0,{"2":[1],"1":[3],"0":[6]}],[2,0,2,{"-1":[2,5]}],[1,1,0],[2,0,1,{"-1":[4]}],[1,1,1],[1,1,2],[2,0,0,{"-1":[7]}],[1,1,3]],{"0":{"1":[0,-5,2],"3":[0,-1,1],"6":[0,-1,0]},"1":{"2":[[[0],[1],[1],[0]],-5,-1],"5":[0,-1,-1]},"3":{"4":[0,-1,-1]},"6":{"7":[0,-1,-1]}}]]}

$ graphtransliterator dump --from bundled Example --compression-level 2 # default; not human readable, no graph
{"graphtransliterator_version":"1.2.4","compressed_settings":[["consonant","vowel","whitespace"],[" ","a","b"],[[2],[1],[0]],[["!B!",[0],[1],[2],[1],[0],-5],["A",0,0,[1],0,0,-1],["B",0,0,[2],0,0,-1],[" ",0,0,[0],0,0,-1]],[" ","whitespace",0],[[[1],[1],","]],{"name":"example","version":"1.0.0","description":"An Example Bundled Transliterator","url":"https://github.com/seanpue/graphtransliterator/tree/master/transliterator/sample","author":"Author McAuthorson","author_email":"author_mcauthorson@msu.edu","license":"MIT License","keywords":["example"],"project_urls":{"Documentation":"https://github.com/seanpue/graphtransliterator/tree/master/graphtransliterator/transliterators/example","Source":"https://github.com/seanpue/graphtransliterator/tree/graphtransliterator/transliterators/example","Tracker":"https://github.com/seanpue/graphtransliterator/issues"}},null]}

Dump Tests

The dump-tests command dumps the tests of a bundled transliterator:

$ graphtransliterator dump-tests --help
Usage: dump-tests [OPTIONS] BUNDLED

  Dump BUNDLED tests.

Options:
  -t, --to [json|yaml]  Format (json/yaml) in which to dump  [default: yaml]
  --help                Show this message and exit.

By default, it outputs the original YAML tests file, preserving any comments:

$ graphtransliterator dump-tests Example
# YAML declaration of tests for bundled Graph Transliterator
# These are in the form of a dictionary.
# The key is the source text, and the value is the correct transliteration.
' ': ' '
a: A
aa: A,A
babab: BA!B!AB
b: B


To output as JSON, use the --to or -t flag:

$ graphtransliterator dump-tests --to json Example
{" ": " ", "a": "A", "aa": "A,A", "babab": "BA!B!AB", "b": "B"}

Generate Tests

The generate-tests command generates YAML tests keyed from input to desired output covering the entire internal graph. This command can be used to view the output of the transliterator in Unicode. It can also be used to generate starter tests for bundled transliterators:

$ graphtransliterator generate-tests --help
Usage: generate-tests [OPTIONS]

  Generate tests as YAML.

Options:
  -f, --from <CHOICE TEXT>...     Format (bundled/json/json_file/yaml_file) and
                                  source (name, JSON, or filename) of
                                  transliterator  [required]
  -ca, --check-ambiguity / -nca, --no-check-ambiguity
                                  Check for ambiguity.  [default: no-check-
                                  ambiguity]
  --help                          Show this message and exit.

It also requires a --from or -f option with two arguments. The first argument specifies the format of the transliterator (bundled, json, json_file, yaml_file), and the second a parameter for that format (the name of the bundled transliterator, the actual JSON, or the name of a YAML file). Ambiguity checking can be turned on using --check-ambiguity or -ca:

$ graphtransliterator generate-tests --from bundled Example
' ': ' '
a: A
aa: A,A
b: B
babab: BA!B!AB


List Bundled Transliterators

The list-bundled command provides a list of bundled transliterators:

$ graphtransliterator list-bundled --help
Make JSON of Bundled Transliterator(s)

The make-json command makes new JSON files of bundled transliterators:

$ graphtransliterator make-json --help

It also allows regular-expression matching using the --reg-ex or -re flag. Matching starts at the start of the string. This command is for people creating new bundled transliterators.

Test

The test command tests a bundled transliterator:

$ graphtransliterator test --help
Usage: test [OPTIONS] BUNDLED

  Test BUNDLED transliterator.

Options:
  -ca, --check-ambiguity / -nca, --no-check-ambiguity
                                  Check for ambiguity.  [default: no-check-
                                  ambiguity]
  --help                          Show this message and exit.

It can only be used with bundled transliterators, so it only needs the name of the transliterator as its argument. This feature is useful when developing a transliterator. You can write the tests first and then begin developing the transliterator:

$ graphtransliterator test Example
True

Transliterate

The transliterate command will transliterate any following arguments:

$ graphtransliterator transliterate --help
Usage: transliterate [OPTIONS] [INPUT]...

  Transliterate INPUT.

Options:
  -f, --from <CHOICE TEXT>...     Format (bundled/json/json_file/yaml_file) and
                                  source (name, JSON, or filename) of
                                  transliterator  [required]
  -t, --to [json|python]          Format in which to output  [default: python]
  -ca, --check-ambiguity / -nca, --no-check-ambiguity
                                  Check for ambiguity.  [default: no-check-
                                  ambiguity]
  -ie, -nie, --ignore-errors / --no-ignore-errors
                                  Ignore errors.  [default: no-ignore-errors]
  --help                          Show this message and exit.

It also requires a --from or -f option with two arguments. The first argument specifies the format of the transliterator (bundled, json, json_file, yaml_file), and the second a parameter for that format (the name of the bundled transliterator, the actual JSON, or the name of a YAML file).

The transliterate command will transliterate every argument that follows. If there is only one input string, it will return a string:

$ graphtransliterator transliterate --from bundled Example a
A

$ graphtransliterator transliterate -f json_file ../graphtransliterator/transliterators/example/example.json a
A

$ graphtransliterator transliterate -f yaml_file ../graphtransliterator/transliterators/example/example.yaml a
A

Otherwise, it will return a list:

$ graphtransliterator transliterate -f bundled Example a a
['A', 'A']

The transliterate command also accepts an optional --to or -t option that specifies the output format, a python string (default) or a JSON string:

$ graphtransliterator transliterate --from bundled Example a
A

$ graphtransliterator transliterate --from bundled Example --to json a
"A"

$ graphtransliterator transliterate --from bundled Example --to python a a
['A', 'A']

$ graphtransliterator transliterate --from bundled Example --to json a a
["A", "A"]

Tutorial: Using GraphTransliterator

Note

Python code on this page: tutorial.py Jupyter Notebook: tutorial.ipynb

Graph Transliterator is designed to allow you to quickly develop rules for transliterating between languages and scripts. In this tutorial you will use a portion of Graph Transliterator's features, including its token matching, class-based matching, and onmatch rules, using the GraphTransliterator class.

Tutorial Overview

The task for this tutorial will be to design a transliterator between the ITRANS (Indian languages TRANSliteration) encoding for Devanagari (Hindi) and standard Unicode. ITRANS was developed as a means to transliterate Indic-language texts using the Latin alphabet and punctuation marks before Unicode fonts were available.

The Devanagari alphabet is an abugida (alphasyllabary), where each “syllable” is a separate symbol. Vowels, except for the default अ (“a”), have a unique symbol that connects to a consonant. At the start of words, they have a unique shape. Consonants in sequence, without intermediary vowels, change their shape and are joined together. In Unicode, that is accomplished using the virama character.

Graph Transliterator works by first converting the input text into a series of tokens. In this tutorial you will define the tokens of ITRANS and the token classes needed to generate rules for conversion.
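Tokenization matches the longest token first, consistent with the tokenizer pattern Graph Transliterator generates (e.g. "(b|a|\\ )" in the dumped example). A sketch of that idea, with an illustrative token set:

```python
import re

# Longest-match tokenization: try longer tokens before their prefixes.
tokens = ["kh", "k", "aa", "a", " "]
pattern = re.compile("|".join(
    re.escape(t) for t in sorted(tokens, key=len, reverse=True)))

pattern.findall("kha")  # ['kh', 'a'], not ['k', ...]
pattern.findall("kaa")  # ['k', 'aa']
```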

Graph Transliterator allows rule matching by preceding tokens, tokens, and following tokens. It allows token classes to precede or follow any specific tokens. For this task, you will use a preceding token class to identify when to write vowel signs as opposed to full vowel characters.

Graph Transliterator also allows the insertion of strings between matches involving particular token classes. This transliterator will need to insert the virama character between transliteration rules ending with consonants in order to create consonant clusters.

Configuring

Here you will parameterize the Graph Transliterator using its “easy reading” format, which uses YAML. It maps to a dictionary containing up to five keys: tokens, rules, onmatch_rules (optional), whitespace, and metadata (optional).

Token Definitions

Graph Transliterator tokenizes its input before transliterating. The tokens section will map the input tokens to their token classes. The main class you will need is one for consonants, so you can use consonant as the class. Graph Transliterator also requires a dedicated whitespace class, so you can use whitespace.

Graph Transliterator allows the use of Unicode character names in files using \N{UNICODE CHARACTER NAME HERE} notation. You can enter the Unicode characters using that notation or directly. YAML will also unescape \u####, where #### is the hexadecimal notation for a character.
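The \N{...} notation is the same named-character escape Python uses in string literals, so both forms below denote the same character:

```python
# \N{...} looks up a character by its Unicode name;
# \u094d is the same character by hexadecimal code point.
virama = "\N{DEVANAGARI SIGN VIRAMA}"
assert virama == "\u094d"
```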

Here is a subsection of that definition:

tokens:
  k: [consonant]
  kh: [consonant]
  "\N{LATIN SMALL LETTER N WITH DOT ABOVE}": [consonant]
  a: [vowel]
  aa: [vowel]
  A: [vowel]
  ' ': [wb,whitespace]
  "\t": [wb,whitespace]
  .N: [vowel_sign]
Transliteration Rule Definitions

The rule definitions in Graph Transliterator's “easy reading” format are also a dictionary, where the rules are the keys and the production—what the rule should output—is the value. For this task, you just need to match individual tokens and any preceding token classes:

rules:
  b: "\N{DEVANAGARI LETTER B}"
  <consonant> A: "\N{DEVANAGARI VOWEL SIGN AA}"
  A: "\N{DEVANAGARI LETTER AA}"

These rules will replace “b” with its Devanagari equivalent (ब), and “A” with the vowel sign ा if it follows a consonant or otherwise with the full letter आ, e.g. at the start of a word. Graph Transliterator automatically sorts rules by how many tokens are required for them to be matched, and it picks the one that requires the most tokens. So an “A” following a consonant would be matched before an “A” following any other character. Graph Transliterator will also check for ambiguity in these rules, unless check_ambiguity is set to False.
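The "most specific rule wins" ordering can be sketched as follows. The dict shapes mirror the dumped rules shown elsewhere in this documentation; the helper name is ours, not the library's:

```python
# Count the tokens a rule needs to match, including required previous
# classes, and prefer the highest count.
rules = [
    {"production": "\N{DEVANAGARI LETTER AA}", "tokens": ["A"]},
    {"production": "\N{DEVANAGARI VOWEL SIGN AA}", "tokens": ["A"],
     "prev_classes": ["consonant"]},
]

def tokens_required(rule):
    return len(rule["tokens"]) + len(rule.get("prev_classes", []))

most_specific = max(rules, key=tokens_required)
most_specific["production"]  # the vowel sign: matched first after a consonant
```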

While not necessary for this tutorial, Graph Transliterator can also require matching of specific previous or following tokens and also classes preceding and following those tokens, e.g.

k a r (U M g A <wb>): k,a,r_followed_by_U,M,g,A_and_a_wordbreak
s o (n a): s,o_followed_by_n,a
(<wb> p y) aa r: aa,r_preceded_by_a_wordbreak,p,and_y

Here is a subsection of the rules:

rules:
  "\t": "\t"
  ' ': ' '
  ',': ','
  .D: "\N{DEVANAGARI LETTER DDDHA}"
  <consonant> A: "\N{DEVANAGARI VOWEL SIGN AA}"
  "\N{LATIN SMALL LETTER N WITH DOT ABOVE}": "\N{DEVANAGARI LETTER NGA}"
On Match Rule Definitions

You will want to insert the Virama character between consonants so that they will join together in Unicode output. To do so, add an “onmatch_rules” section:

onmatch_rules:
  - <consonant> + <consonant>: "\N{DEVANAGARI SIGN VIRAMA}"

Unlike the tokens and rules, the onmatch rules are ordered: the first rule matched is applied. In YAML, they consist of a list of dictionaries, each with a single key and value. The value is the production string to be inserted between matches, and the `+` in the key marks where that insertion happens. So in the input string kyA, which would tokenize as [' ','k','y','A',' '], a virama character would be inserted when y is matched, as it is of class “consonant” and the previously matched transliteration rule for “k” ends with a “consonant”.
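The check behind an onmatch rule can be sketched as below. This is a simplification for illustration (the library consults a precomputed onmatch_rules_lookup), and the helper name is ours:

```python
# Return the production of the first onmatch rule whose classes straddle
# the boundary between the previous match and the next one.
def onmatch_production(prev_classes, next_classes, onmatch_rules):
    for (before, after), production in onmatch_rules:
        if before in prev_classes and after in next_classes:
            return production
    return ""

onmatch_rules = [(("consonant", "consonant"), "\N{DEVANAGARI SIGN VIRAMA}")]
onmatch_production(["consonant"], ["consonant"], onmatch_rules)  # virama inserted
onmatch_production(["consonant"], ["vowel"], onmatch_rules)      # nothing inserted
```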

Whitespace Definitions

The final required setup parameter is for whitespace. Its settings include the default whitespace token, which is temporarily added before and after the input tokens; the consolidate option, which replaces sequential whitespace tokens with a single default whitespace token; and the token_class of whitespace tokens:

whitespace:
  consolidate: false
  default: ' '
  token_class: whitespace
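The effect of the consolidate option can be sketched on a token list. This is an illustration of the behavior described above, not the library's internal code:

```python
# Collapse runs of whitespace tokens into a single default token.
WHITESPACE_TOKENS = {" ", "\t"}

def consolidate_whitespace(tokens, default=" "):
    out = []
    for token in tokens:
        if token in WHITESPACE_TOKENS:
            if out and out[-1] == default:
                continue  # skip repeated whitespace
            out.append(default)
        else:
            out.append(token)
    return out

consolidate_whitespace(["k", " ", "\t", " ", "b"])  # ['k', ' ', 'b']
```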
Metadata Definitions

Graph Transliterator also allows metadata to be added to its settings:

metadata:
  title: "ITRANS Devanagari to Unicode"
  version: "0.1.0"
Creating a Transliterator

Now that the settings are ready, you can create a Graph Transliterator. Since you have been using the “easy reading” format, you can use GraphTransliterator.from_yaml_file() to read from a specific file or GraphTransliterator.from_yaml() to read from a YAML string. You can also read the loaded contents of an “easy reading” YAML file using GraphTransliterator.from_dict(). Graph Transliterator will convert those settings into basic Python types and then return a GraphTransliterator:

  1from graphtransliterator import GraphTransliterator
  2easyreading_yaml = """
  3tokens:
  4  k: [consonant]
  5  kh: [consonant]
  6  g: [consonant]
  7  gh: [consonant]
  8  ~N: [consonant]
  9  "\N{LATIN SMALL LETTER N WITH DOT ABOVE}": [consonant]
 10  ch: [consonant]
 11  chh: [consonant]
 12  Ch: [consonant]
 13  j: [consonant]
 14  jh: [consonant]
 15  ~n: [consonant]
 16  T: [consonant]
 17  Th: [consonant]
 18  D: [consonant]
 19  Dh: [consonant]
 20  N: [consonant]
 21  t: [consonant]
 22  th: [consonant]
 23  d: [consonant]
 24  dh: [consonant]
 25  n: [consonant]
 26  ^n: [consonant]
 27  p: [consonant]
 28  ph: [consonant]
 29  b: [consonant]
 30  bh: [consonant]
 31  m: [consonant]
 32  y: [consonant]
 33  r: [consonant]
 34  R: [consonant]
 35  l: [consonant]
 36  ld: [consonant]
 37  L: [consonant]
 38  zh: [consonant]
 39  v: [consonant]
 40  sh: [consonant]
 41  Sh: [consonant]
 42  s: [consonant]
 43  h: [consonant]
 44  x: [consonant]
 45  kSh: [consonant]
 46  GY: [consonant]
 47  j~n: [consonant]
 48  dny: [consonant]
 49  q: [consonant]
 50  K: [consonant]
 51  G: [consonant]
 52  J: [consonant]
 53  z: [consonant]
 54  .D: [consonant]
 55  .Dh: [consonant]
 56  f: [consonant]
 57  Y: [consonant]
 58  a: [vowel]
 59  aa: [vowel]
 60  A: [vowel]
 61  i: [vowel]
 62  ii: [vowel]
 63  I: [vowel]
 64  ee: [vowel]
 65  u: [vowel]
 66  uu: [vowel]
 67  U: [vowel]
 68  RRi: [vowel]
 69  R^i: [vowel]
 70  LLi: [vowel]
 71  L^i: [vowel]
 72  RRI: [vowel]
 73  LLI: [vowel]
 74  a.c: [vowel]
 75  ^e: [vowel]
 76  e: [vowel]
 77  ai: [vowel]
 78  A.c: [vowel]
 79  ^o: [vowel]
 80  o: [vowel]
 81  au: [vowel]
 82  ' ': [wb,whitespace]
 83  "\t": [wb,whitespace]
 84  ',': [wb]
 85  .h: [wb]
 86  H: [wb]
 87  OM: [wb]
 88  AUM: [wb]
 89  '|': [wb]
 90  '||': [wb]
 91  '0': [wb]
 92  '1': [wb]
 93  '2': [wb]
 94  '3': [wb]
 95  '4': [wb]
 96  '5': [wb]
 97  '6': [wb]
 98  '7': [wb]
 99  '8': [wb]
100  '9': [wb]
101  Rs.: [wb]
102  ~Rs.: [wb]
103  .a: [wb]
104  a.e: [vowel_sign]
105  .N: [vowel_sign]
106  .n: [vowel_sign]
107  M: [vowel_sign]
108  .m: [vowel_sign]
rules:
  "\t": "\t"
  ' ': ' '
  ',': ','
  .D: "\N{DEVANAGARI LETTER DDDHA}"
  .Dh: "\N{DEVANAGARI LETTER RHA}"
  .N: "\N{DEVANAGARI SIGN CANDRABINDU}"
  .a: "\N{DEVANAGARI SIGN AVAGRAHA}"
  .h: "\N{DEVANAGARI SIGN VIRAMA}\N{ZERO WIDTH NON-JOINER}"
  .m: "\N{DEVANAGARI SIGN ANUSVARA}"
  .n: "\N{DEVANAGARI SIGN ANUSVARA}"
  '0': "\N{DEVANAGARI DIGIT ZERO}"
  '1': "\N{DEVANAGARI DIGIT ONE}"
  '2': "\N{DEVANAGARI DIGIT TWO}"
  '3': "\N{DEVANAGARI DIGIT THREE}"
  '4': "\N{DEVANAGARI DIGIT FOUR}"
  '5': "\N{DEVANAGARI DIGIT FIVE}"
  '6': "\N{DEVANAGARI DIGIT SIX}"
  '7': "\N{DEVANAGARI DIGIT SEVEN}"
  '8': "\N{DEVANAGARI DIGIT EIGHT}"
  '9': "\N{DEVANAGARI DIGIT NINE}"
  <consonant> A: "\N{DEVANAGARI VOWEL SIGN AA}"
  <consonant> A.c: "\N{DEVANAGARI VOWEL SIGN CANDRA O}"
  <consonant> I: "\N{DEVANAGARI VOWEL SIGN II}"
  <consonant> LLI: "\N{DEVANAGARI VOWEL SIGN VOCALIC LL}"
  <consonant> LLi: "\N{DEVANAGARI VOWEL SIGN VOCALIC L}"
  <consonant> L^i: "\N{DEVANAGARI VOWEL SIGN VOCALIC L}"
  <consonant> RRI: "\N{DEVANAGARI VOWEL SIGN VOCALIC RR}"
  <consonant> RRi: "\N{DEVANAGARI VOWEL SIGN VOCALIC R}"
  <consonant> R^i: "\N{DEVANAGARI VOWEL SIGN VOCALIC R}"
  <consonant> U: "\N{DEVANAGARI VOWEL SIGN UU}"
  <consonant> ^e: "\N{DEVANAGARI VOWEL SIGN SHORT E}"
  <consonant> ^o: "\N{DEVANAGARI VOWEL SIGN SHORT O}"
  <consonant> a: ''
  <consonant> a.c: "\N{DEVANAGARI VOWEL SIGN CANDRA E}"
  <consonant> aa: "\N{DEVANAGARI VOWEL SIGN AA}"
  <consonant> ai: "\N{DEVANAGARI VOWEL SIGN AI}"
  <consonant> au: "\N{DEVANAGARI VOWEL SIGN AU}"
  <consonant> e: "\N{DEVANAGARI VOWEL SIGN E}"
  <consonant> ee: "\N{DEVANAGARI VOWEL SIGN II}"
  <consonant> i: "\N{DEVANAGARI VOWEL SIGN I}"
  <consonant> ii: "\N{DEVANAGARI VOWEL SIGN II}"
  <consonant> o: "\N{DEVANAGARI VOWEL SIGN O}"
  <consonant> u: "\N{DEVANAGARI VOWEL SIGN U}"
  <consonant> uu: "\N{DEVANAGARI VOWEL SIGN UU}"
  A: "\N{DEVANAGARI LETTER AA}"
  A.c: "\N{DEVANAGARI LETTER CANDRA O}"
  AUM: "\N{DEVANAGARI OM}"
  Ch: "\N{DEVANAGARI LETTER CHA}"
  D: "\N{DEVANAGARI LETTER DDA}"
  Dh: "\N{DEVANAGARI LETTER DDHA}"
  G: "\N{DEVANAGARI LETTER GHHA}"
  GY: "\N{DEVANAGARI LETTER JA}\N{DEVANAGARI SIGN VIRAMA}\N{DEVANAGARI LETTER NYA}"
  H: "\N{DEVANAGARI SIGN VISARGA}"
  I: "\N{DEVANAGARI LETTER II}"
  J: "\N{DEVANAGARI LETTER ZA}"
  K: "\N{DEVANAGARI LETTER KHHA}"
  L: "\N{DEVANAGARI LETTER LLA}"
  LLI: "\N{DEVANAGARI LETTER VOCALIC LL}"
  LLi: "\N{DEVANAGARI LETTER VOCALIC L}"
  L^i: "\N{DEVANAGARI LETTER VOCALIC L}"
  M: "\N{DEVANAGARI SIGN ANUSVARA}"
  N: "\N{DEVANAGARI LETTER NNA}"
  OM: "\N{DEVANAGARI OM}"
  R: "\N{DEVANAGARI LETTER RRA}"
  RRI: "\N{DEVANAGARI LETTER VOCALIC RR}"
  RRi: "\N{DEVANAGARI LETTER VOCALIC R}"
  R^i: "\N{DEVANAGARI LETTER VOCALIC R}"
  Rs.: "\N{INDIAN RUPEE SIGN}"
  Sh: "\N{DEVANAGARI LETTER SSA}"
  T: "\N{DEVANAGARI LETTER TTA}"
  Th: "\N{DEVANAGARI LETTER TTHA}"
  U: "\N{DEVANAGARI LETTER UU}"
  Y: "\N{DEVANAGARI LETTER YYA}"
  ^e: "\N{DEVANAGARI LETTER SHORT E}"
  ^n: "\N{DEVANAGARI LETTER NNNA}"
  ^o: "\N{DEVANAGARI LETTER SHORT O}"
  a: "\N{DEVANAGARI LETTER A}"
  a.c: "\N{DEVANAGARI LETTER CANDRA E}"
  a.e: "\N{DEVANAGARI LETTER CANDRA A}"
  aa: "\N{DEVANAGARI LETTER AA}"
  ai: "\N{DEVANAGARI LETTER AI}"
  au: "\N{DEVANAGARI LETTER AU}"
  b: "\N{DEVANAGARI LETTER BA}"
  bh: "\N{DEVANAGARI LETTER BHA}"
  ch: "\N{DEVANAGARI LETTER CA}"
  chh: "\N{DEVANAGARI LETTER CHA}"
  d: "\N{DEVANAGARI LETTER DA}"
  dh: "\N{DEVANAGARI LETTER DHA}"
  dny: "\N{DEVANAGARI LETTER JA}\N{DEVANAGARI SIGN VIRAMA}\N{DEVANAGARI LETTER NYA}"
  e: "\N{DEVANAGARI LETTER E}"
  ee: "\N{DEVANAGARI LETTER II}"
  f: "\N{DEVANAGARI LETTER FA}"
  g: "\N{DEVANAGARI LETTER GA}"
  gh: "\N{DEVANAGARI LETTER GHA}"
  h: "\N{DEVANAGARI LETTER HA}"
  i: "\N{DEVANAGARI LETTER I}"
  ii: "\N{DEVANAGARI LETTER II}"
  j: "\N{DEVANAGARI LETTER JA}"
  jh: "\N{DEVANAGARI LETTER JHA}"
  j~n: "\N{DEVANAGARI LETTER JA}\N{DEVANAGARI SIGN VIRAMA}\N{DEVANAGARI LETTER NYA}"
  k: "\N{DEVANAGARI LETTER KA}"
  kSh: "\N{DEVANAGARI LETTER KA}\N{DEVANAGARI SIGN VIRAMA}\N{DEVANAGARI LETTER SSA}"
  kh: "\N{DEVANAGARI LETTER KHA}"
  l: "\N{DEVANAGARI LETTER LA}"
  ld: "\N{DEVANAGARI LETTER LLA}"
  m: "\N{DEVANAGARI LETTER MA}"
  n: "\N{DEVANAGARI LETTER NA}"
  o: "\N{DEVANAGARI LETTER O}"
  p: "\N{DEVANAGARI LETTER PA}"
  ph: "\N{DEVANAGARI LETTER PHA}"
  q: "\N{DEVANAGARI LETTER QA}"
  r: "\N{DEVANAGARI LETTER RA}"
  s: "\N{DEVANAGARI LETTER SA}"
  sh: "\N{DEVANAGARI LETTER SHA}"
  t: "\N{DEVANAGARI LETTER TA}"
  th: "\N{DEVANAGARI LETTER THA}"
  u: "\N{DEVANAGARI LETTER U}"
  uu: "\N{DEVANAGARI LETTER UU}"
  v: "\N{DEVANAGARI LETTER VA}"
  x: "\N{DEVANAGARI LETTER KA}\N{DEVANAGARI SIGN VIRAMA}\N{DEVANAGARI LETTER SSA}"
  y: "\N{DEVANAGARI LETTER YA}"
  z: "\N{DEVANAGARI LETTER ZA}"
  zh: "\N{DEVANAGARI LETTER LLLA}"
  '|': "\N{DEVANAGARI DANDA}"
  '||': "\N{DEVANAGARI DOUBLE DANDA}"
  ~N: "\N{DEVANAGARI LETTER NGA}"
  ~Rs.: "\N{INDIAN RUPEE SIGN}"
  ~n: "\N{DEVANAGARI LETTER NYA}"
  "\N{LATIN SMALL LETTER N WITH DOT ABOVE}": "\N{DEVANAGARI LETTER NGA}"
onmatch_rules:
- <consonant> + <consonant>: "\N{DEVANAGARI SIGN VIRAMA}"
whitespace:
  consolidate: false
  default: ' '
  token_class: whitespace
metadata:
  title: ITRANS to Unicode
  version: 0.1.0
"""
gt = GraphTransliterator.from_yaml(easyreading_yaml)
Transliterating

With the transliterator created, you can now transliterate using GraphTransliterator.transliterate():

gt.transliterate("aaj mausam ba.Daa beiimaan hai, aaj mausam")
'आज मौसम बड़ा बेईमान है, आज मौसम'
Other Information

Graph Transliterator has a few other built-in tools for more specialized applications.

If you want the details of the most recent transliteration, access GraphTransliterator.last_matched_rules to get the list of rules matched:

gt.last_matched_rules
[TransliterationRule(production='आ', prev_classes=None, prev_tokens=None, tokens=['aa'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ज', prev_classes=None, prev_tokens=None, tokens=['j'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production=' ', prev_classes=None, prev_tokens=None, tokens=[' '], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='म', prev_classes=None, prev_tokens=None, tokens=['m'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ौ', prev_classes=['consonant'], prev_tokens=None, tokens=['au'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='स', prev_classes=None, prev_tokens=None, tokens=['s'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='', prev_classes=['consonant'], prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='म', prev_classes=None, prev_tokens=None, tokens=['m'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production=' ', prev_classes=None, prev_tokens=None, tokens=[' '], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ब', prev_classes=None, prev_tokens=None, tokens=['b'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='', prev_classes=['consonant'], prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='ड़', prev_classes=None, prev_tokens=None, tokens=['.D'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ा', prev_classes=['consonant'], prev_tokens=None, tokens=['aa'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production=' ', prev_classes=None, prev_tokens=None, tokens=[' '], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ब', prev_classes=None, prev_tokens=None, tokens=['b'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='े', prev_classes=['consonant'], prev_tokens=None, tokens=['e'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='ई', prev_classes=None, prev_tokens=None, tokens=['ii'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='म', prev_classes=None, prev_tokens=None, tokens=['m'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ा', prev_classes=['consonant'], prev_tokens=None, tokens=['aa'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='न', prev_classes=None, prev_tokens=None, tokens=['n'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production=' ', prev_classes=None, prev_tokens=None, tokens=[' '], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ह', prev_classes=None, prev_tokens=None, tokens=['h'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ै', prev_classes=['consonant'], prev_tokens=None, tokens=['ai'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production=',', prev_classes=None, prev_tokens=None, tokens=[','], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production=' ', prev_classes=None, prev_tokens=None, tokens=[' '], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='आ', prev_classes=None, prev_tokens=None, tokens=['aa'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ज', prev_classes=None, prev_tokens=None, tokens=['j'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production=' ', prev_classes=None, prev_tokens=None, tokens=[' '], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='म', prev_classes=None, prev_tokens=None, tokens=['m'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='ौ', prev_classes=['consonant'], prev_tokens=None, tokens=['au'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='स', prev_classes=None, prev_tokens=None, tokens=['s'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
 TransliterationRule(production='', prev_classes=['consonant'], prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='म', prev_classes=None, prev_tokens=None, tokens=['m'], next_tokens=None, next_classes=None, cost=0.5849625007211562)]
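The cost values above follow a visible pattern: rules matching a single token all cost 0.5849… = log2(3/2), while rules that also constrain a preceding class cost 0.4150… = log2(4/3). This is consistent with a cost of log2(1 + 1/(1 + n)) for a rule matching n elements, so more constrained rules get lower costs and are tried first (a hypothesis read off the listed values, not a documented formula):

```python
import math

def rule_cost(n):
    """Hypothesized cost of a rule matching n elements: log2(1 + 1/(1 + n)).

    Lower cost = more specific rule. (Inferred from the values above,
    not an official formula of the library.)
    """
    return math.log2(1 + 1 / (1 + n))

print(rule_cost(1))  # ~0.5849625007211562: one matched token, e.g. tokens=['aa']
print(rule_cost(2))  # ~0.41503749927884376: a token plus a preceding class constraint
```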

Or if you just want to know the tokens matched by each rule, check GraphTransliterator.last_matched_rule_tokens:

gt.last_matched_rule_tokens
[['aa'],
 ['j'],
 [' '],
 ['m'],
 ['au'],
 ['s'],
 ['a'],
 ['m'],
 [' '],
 ['b'],
 ['a'],
 ['.D'],
 ['aa'],
 [' '],
 ['b'],
 ['e'],
 ['ii'],
 ['m'],
 ['aa'],
 ['n'],
 [' '],
 ['h'],
 ['ai'],
 [','],
 [' '],
 ['aa'],
 ['j'],
 [' '],
 ['m'],
 ['au'],
 ['s'],
 ['a'],
 ['m']]

You can access the directed graph used by GraphTransliterator using GraphTransliterator.graph:

gt.graph
<graphtransliterator.graphs.DirectedGraph at 0x7fb20445f680>
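The DirectedGraph serializes to plain dictionaries under "node", "edge", and "edge_list" keys (visible in the dump() example in the API reference below): a start node, token nodes, and accepting rule nodes as leaves. A rough sketch of matching a single token by walking that serialized structure (illustrative only; this mirrors the data layout, not the library's internal matcher):

```python
# Serialized form of the two-rule 'a'/' ' graph from the dump() example.
graph = {
    "node": [
        {"type": "Start", "ordered_children": {"a": [1], " ": [3]}},
        {"token": "a", "type": "token", "ordered_children": {"__rules__": [2]}},
        {"type": "rule", "accepting": True, "rule_key": 0},
        {"token": " ", "type": "token", "ordered_children": {"__rules__": [4]}},
        {"type": "rule", "accepting": True, "rule_key": 1},
    ],
}

def rule_key_for(token):
    """Walk Start -> token node -> accepting rule node for a single token."""
    start = graph["node"][0]
    for i in start["ordered_children"][token]:        # candidate token nodes
        for j in graph["node"][i]["ordered_children"]["__rules__"]:
            node = graph["node"][j]                   # rule leaves
            if node.get("accepting"):
                return node["rule_key"]
    raise KeyError(token)

print(rule_key_for("a"), rule_key_for(" "))  # 0 1
```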

Advanced Tutorial: Bundling a Transliterator

This advanced tutorial builds upon the original tutorial to show you how to bundle a transliterator for inclusion in Graph Transliterator.

Contributions to Graph Transliterator are strongly encouraged!

You will make a very simple transliterator while going through the steps of bundling it into Graph Transliterator.

Git Basics: Fork, Branch, Sync, Commit
Fork

The first thing to do, if you have not already, is to create a fork of Graph Transliterator. See https://help.github.com/en/articles/fork-a-repo

(From here on out, we will be using the command line.)

After creating a fork, clone your forked repo:

git clone https://github.com/YOUR-USERNAME/graphtransliterator
Branch

Once you have done that, go into that directory and create a new branch:

cd graphtransliterator
git checkout -b [name_of_your_transliterator_branch]

For this example, you can use the branch a_to_b:

cd graphtransliterator
git checkout -b a_to_b

Then, push that branch to the origin (your personal github fork):

git push origin [name_of_your_transliterator_branch]

Here that would be:

git push origin a_to_b

Next, add a remote upstream for Graph Transliterator (the official Graph Transliterator repo):

git remote add upstream https://github.com/seanpue/graphtransliterator.git
Sync

To update your local copy of the remote (official Graph Transliterator repo), run:

git fetch upstream

To sync your personal fork with the remote, run:

git merge upstream/master

See https://help.github.com/en/articles/syncing-a-fork for more info. You can run the previous two commands at any time.

Commit

You can commit your changes by running:

git commit -m 'comment here about the commit'
Adding A Transliterator

To add a transliterator, the next step is to create a subdirectory in transliterators. For this tutorial, you can name it a_to_b.

Note that this will be under graphtransliterator/transliterators, so from the root directory enter:

cd graphtransliterator/transliterators
mkdir [name_of_your_transliterator]
cd [name_of_your_transliterator]

For this example, you would enter:

cd graphtransliterator/transliterators
mkdir a_to_b
cd a_to_b

In the graphtransliterator/transliterators/[name_of_your_transliterator] directory, you will add:

  • an __init__.py

  • a YAML file in the “easy reading format”

  • a JSON file that is a serialization of the transliterator (optional)

  • a tests directory including a file named [name_of_your_transliterator]_tests.yaml

  • a Python test named test_[name_of_your_transliterator].py (optional)

Here is a tree showing the file organization:

transliterators
├── {{source_to_target}}
│   ├── __init__.py
│   ├── {{source_to_target}}.json
│   ├── {{source_to_target}}.yaml
│   └── tests
│       ├── test_{{source_to_target}}.py
│       └── {{source_to_target}}_tests.yaml
YAML File

The YAML file should contain the “easy reading” version of your transliterator. For this example, create a file called a_to_b.yaml. Add a metadata field to the YAML file, as well, following the guidelines.

tokens:
  a: [a_class]
  ' ': [whitespace]
rules:
  a: A
onmatch_rules:
  - <a_class> + <a_class>: ","
whitespace:
  default: ' '
  token_class: whitespace
  consolidate: false
metadata:
  name: A to B
  version: 0.0.1
  url: http://website_of_project.com
  author: Your Name is Optional
  author_email: your_email@is_option.al
  maintainer: Maintainer's Name is Optional
  maintainer_email: maintainers_email@is_option.al
  license: MIT or Other Open Source License
  keywords: [add, keywords, here, as, a, list]
  project_urls:
     Documentation: https://link_to_documentation.html
     Source: https://link_to_sourcecode.html
     Tracker: https://link_to_issue_tracker.html

For most use cases, the project_urls can link to the Graph Transliterator Github page.

JSON File

To create a JSON file, you can use the command line interface:

$ graphtransliterator dump --from yaml_file a_to_b.yaml > a_to_b.json

Alternatively, you can use the make-json command:

$ graphtransliterator make-json AToB

The JSON file loads more quickly than the YAML one, but it is not necessary during development.

__init__.py

The __init__.py will create the bundled transliterator as a subclass of Bundled, which is itself a subclass of GraphTransliterator.

Following convention, you need to name your transliterator’s class in CamelCase. For this example, it would be AToB:

from graphtransliterator.transliterators import Bundled

class AToB(Bundled):
    """
    A to B Bundled Graph Transliterator
    """

    def __init__(self, **kwargs):
        """Initialize transliterator from YAML."""
        self.from_YAML(
            **kwargs
        )  # defaults to check_ambiguity=True, check_coverage=True
        # When ready, remove the previous lines and initialize more quickly from JSON:
        # self.init_from_JSON(**kwargs) # check_ambiguity=False, check_coverage=False

When you load the bundled transliterator from YAML using from_YAML, it will check for ambiguity as well as the coverage of the tests. You can turn those features off temporarily here.

When a transliterator is added into Graph Transliterator, it will likely be set to load from JSON by default. Tests will check for ambiguity and coverage.

Tests

Graph Transliterator requires that all bundled transliterators have tests that visit every edge and node of the internal graph and that use all on-match rules. The test file should be a YAML file defining a dictionary that maps each input string to its correct output.
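For example, a tests file for the a_to_b transliterator from this tutorial (a hypothetical tests/a_to_b_tests.yaml) might look like:

```yaml
a: A
aa: A,A  # the <a_class> + <a_class> on-match rule inserts a comma
```

Each key is an input string and each value is its required transliteration; the coverage check then verifies that these inputs exercise every node, edge, and on-match rule.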

You can test the transliterator as you are developing it by adding YAML tests and running the command:

graphtransliterator test [name_of_your_transliterator]

Tests can be generated using the command line interface:

mkdir tests
graphtransliterator generate-tests --from bundled [name_of_your_transliterator] > tests/[name_of_your_transliterator]_tests.yaml
Testing the Transliterator

You should test the transliterator to make sure everything is correct, including its metadata. To do that, navigate back to the root directory of graphtransliterator and execute the command:

py.test tests/test_transliterators.py

You can also run the complete suite of tests by running:

tox
Pushing Your Transliterator

When you are finished with a version of your transliterator, you should once again commit it to your github branch after syncing your branch with the remote. Then you can make a pull request to include the transliterator in Graph Transliterator. You can do that from the Graph Transliterator Github page. See https://help.github.com/en/articles/creating-a-pull-request-from-a-fork.

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

Contributor Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Types of Contributions

You can contribute in many ways:

Report Bugs

Report bugs at https://github.com/seanpue/graphtransliterator/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.

  • Any details about your local setup that might be helpful in troubleshooting.

  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

Graph Transliterator could always use more documentation, whether as part of the official Graph Transliterator docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/seanpue/graphtransliterator/issues.

If you are proposing a feature:

  • Explain in detail how it would work.

  • Keep the scope as narrow as possible, to make it easier to implement.

  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Add Transliterators

We welcome new transliterators to be added to the bundled transliterators!

See the documentation about Bundled Transliterators and look at Example as a model.

Raise an issue on Github, https://github.com/seanpue/graphtransliterator/issues

Then create a new branch with the new transliterator. Make sure the transliterator passes all of these requirements:

  • is a submodule of graphtransliterator.transliterators

  • has a unique name, preferably in format source_to_target

  • has the following files:
    - __init__.py
    - {{source_to_target}}.yaml
    - {{source_to_target}}.json
    - tests/{{source_to_target}}_tests.yaml
    - tests/test_{{source_to_target}}.py (optional)

  • has a classname in camel case, e.g. SourceToTarget

  • has complete test coverage of all nodes and edges of the generated graph and of all onmatch rules, if present

  • has required metadata in the YAML file.

When all the requirements are fulfilled, submit a pull request, and it will be reviewed for inclusion in a near-future release.

Get Started!

Ready to contribute? Here’s how to set up graphtransliterator for local development.

  1. Fork the graphtransliterator repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your_name_here/graphtransliterator.git
    
  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    $ mkvirtualenv graphtransliterator
    $ cd graphtransliterator/
    $ python setup.py develop
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you’re done making changes, format your code using the Black code formatter. (You can do that in your editor, as well). Then check that your changes pass flake8 and the tests, including testing other Python versions with tox:

    $ black graphtransliterator
    $ flake8 graphtransliterator tests
    $ python setup.py test or py.test
    $ tox
    

    To get black, flake8, and tox, just pip install them into your virtualenv.

    You should also test your coverage using make:

    $ make coverage

  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests.

  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.

  3. The pull request should work for Python 3.7 and 3.8, and for PyPy. Check https://travis-ci.org/seanpue/graphtransliterator/pull_requests and make sure that the tests pass for all supported Python versions.

Tips

To run a subset of tests:

$ py.test tests/test_graphtransliterator.py
Deploying

A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.rst). Then run:

$ bumpversion patch # possible: major / minor / patch
$ git push
$ git push --tags

The module uses GitHub Actions to deploy to TestPyPI and to PyPI.

API Reference

The full API reference for all public classes and functions is below.

Public members can (and should) be imported from graphtransliterator:

from graphtransliterator import GraphTransliterator

Bundled transliterators require that graphtransliterator.transliterators be imported:

from graphtransliterator import transliterators
transliterators.iter_names()
Core Classes
class graphtransliterator.GraphTransliterator(tokens, rules, whitespace, onmatch_rules=None, metadata=None, ignore_errors=False, check_ambiguity=True, onmatch_rules_lookup=None, tokens_by_class=None, graph=None, tokenizer_pattern=None, graphtransliterator_version=None, **kwargs)[source]

A graph-based transliteration tool that lets you convert the symbols of one language or script to those of another using rules that you define.

Transliteration of tokens of an input string to an output string is configured by: a set of input token types with classes, pattern-matching rules involving sequences of tokens as well as preceding or following tokens and token classes, insertion rules between matches, and optional consolidation of whitespace. Rules are ordered by specificity.
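"Ordered by specificity" here means that lower-cost (more constrained) rules sort first. A minimal sketch with a simplified stand-in for TransliterationRule (illustrative only, not the library's internal sort):

```python
from collections import namedtuple

# Simplified stand-in for TransliterationRule; only the fields used here.
Rule = namedtuple("Rule", ["production", "prev_classes", "tokens", "cost"])

rules = [
    Rule("अ", None, ["a"], 0.5849625007211562),              # matches one token
    Rule("\u093e", ["consonant"], ["aa"], 0.41503749927884376),  # also checks preceding class
]

# More specific rules carry lower cost, so ascending sort puts them first.
ordered = sorted(rules, key=lambda r: r.cost)
print([r.tokens for r in ordered])  # [['aa'], ['a']]
```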

Note

This constructor does not validate settings and should typically not be called directly. Use from_dict() instead. For “easy reading” support, use from_easyreading_dict(), from_yaml(), or from_yaml_file(). Keyword parameters used here (ignore_errors, check_ambiguity) can be passed from those other constructors.

Parameters:
  • tokens (dict of {str: set of str}) – Mapping of input token types to token classes

  • rules (list of TransliterationRule) – list of transliteration rules ordered by cost

  • onmatch_rules (list of OnMatchRule, or None) – Rules for output to be inserted between tokens of certain classes when a transliteration rule has been matched but before its production string has been added to the output

  • whitespace (WhitespaceRules) – Rules for handling whitespace

  • metadata (dict or None) – Metadata settings

  • ignore_errors (bool, optional) – If true, transliteration errors are ignored and do not raise an exception. The default is false.

  • check_ambiguity (bool, optional) – If true (default), transliteration rules are checked for ambiguity. load() and loads() do not check ambiguity by default.

  • onmatch_rules_lookup (dict of {str: dict of {str: list of int}}, optional) – OnMatchRules lookup, used internally, will be generated if not present.

  • tokens_by_class (dict of {str: set of str}, optional) – Tokens by class, used internally, will be generated if not present.

  • graph (DirectedGraph, optional) – Directed graph used by Graph Transliterator, will be generated if not present.

  • tokenizer_pattern (str, optional) – Regular expression pattern for input string tokenization, will be generated if not present.

  • graphtransliterator_version (str, optional) – Version of graphtransliterator, added by dump() and dumps().

Example

from graphtransliterator import GraphTransliterator, OnMatchRule, TransliterationRule, WhitespaceRules
settings = {
    'tokens': {'a': {'vowel'}, ' ': {'wb'}},
    'onmatch_rules': [
        OnMatchRule(prev_classes=['vowel'], next_classes=['vowel'], production=',')
    ],
    'rules': [
        TransliterationRule(production='A', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562),
        TransliterationRule(production=' ', prev_classes=None, prev_tokens=None, tokens=[' '], next_tokens=None, next_classes=None, cost=0.5849625007211562)
    ],
    'metadata': {'author': 'Author McAuthorson'},
    'whitespace': WhitespaceRules(default=' ', token_class='wb', consolidate=False)
}
gt = GraphTransliterator(**settings)
gt.transliterate('a')
'A'

See also

from_dict

Constructor from dictionary of settings

from_easyreading_dict

Constructor from dictionary in “easy reading” format

from_yaml

Constructor from YAML string in “easy reading” format

from_yaml_file

Constructor from YAML file in “easy reading” format

dump(compression_level=0)[source]

Dump configuration of Graph Transliterator to Python data types.

Compression is turned off by default.

Parameters:

compression_level (int) – One of 0 (default, no compression), 1 (compression including graph), or 2 (compression without graph)

Returns:

GraphTransliterator configuration as a dictionary with keys:

"tokens"

Mappings of tokens to their classes (OrderedDict of {str: list of str})

"rules"

Transliteration rules in direct format (list of dict of {str: str})

"whitespace"

Whitespace settings (dict of {str: str})

"onmatch_rules"

On match rules (list of OrderedDict)

"metadata"

Dictionary of metadata (dict)

"ignore_errors"

Ignore errors in transliteration (bool)

"onmatch_rules_lookup"

Dictionary keyed by current token to previous token containing a list of indexes of applicable OnMatchRule to try (dict of {str: dict of {str: list of int}})

"tokens_by_class"

Tokens keyed by token class, used internally (dict of {str: list of str})

"graph"

Serialization of DirectedGraph (dict)

"tokenizer_pattern"

Regular expression for tokenizing (str)

"graphtransliterator_version"

Module version of graphtransliterator (str)

Return type:

OrderedDict

Example

yaml_ = '''
tokens:
  a: [vowel]
  ' ': [wb]
rules:
  a: A
  ' ': ' '
whitespace:
  default: " "
  consolidate: false
  token_class: wb
onmatch_rules:
  - <vowel> + <vowel>: ','  # add a comma between vowels
metadata:
  author: "Author McAuthorson"
'''
gt = GraphTransliterator.from_yaml(yaml_)
gt.dump()
OrderedDict([('tokens', {'a': ['vowel'], ' ': ['wb']}),
             ('rules',
              [OrderedDict([('production', 'A'),
                            ('tokens', ['a']),
                            ('cost', 0.5849625007211562)]),
               OrderedDict([('production', ' '),
                            ('tokens', [' ']),
                            ('cost', 0.5849625007211562)])]),
             ('whitespace',
              {'default': ' ', 'token_class': 'wb', 'consolidate': False}),
             ('onmatch_rules',
              [OrderedDict([('prev_classes', ['vowel']),
                            ('next_classes', ['vowel']),
                            ('production', ',')])]),
             ('metadata', {'author': 'Author McAuthorson'}),
             ('ignore_errors', False),
             ('onmatch_rules_lookup', {'a': {'a': [0]}}),
             ('tokens_by_class', {'vowel': ['a'], 'wb': [' ']}),
             ('graph',
              {'node': [{'type': 'Start',
                 'ordered_children': {'a': [1], ' ': [3]}},
                {'token': 'a',
                 'type': 'token',
                 'ordered_children': {'__rules__': [2]}},
                {'type': 'rule', 'accepting': True, 'rule_key': 0},
                {'token': ' ',
                 'type': 'token',
                 'ordered_children': {'__rules__': [4]}},
                {'type': 'rule', 'accepting': True, 'rule_key': 1}],
               'edge': {0: {1: {'token': 'a', 'cost': 0.5849625007211562},
                 3: {'token': ' ', 'cost': 0.5849625007211562}},
                1: {2: {'cost': 0.5849625007211562}},
                3: {4: {'cost': 0.5849625007211562}}},
               'edge_list': [(0, 1), (0, 3), (1, 2), (3, 4)]}),
             ('tokenizer_pattern', '(a|\\ )'),
             ('graphtransliterator_version', '1.2.4')])
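Also visible in the dump above is tokenizer_pattern, an ordinary regular expression that alternates over the token types (in larger transliterators, longer tokens come first so multi-character tokens match greedily). A sketch of how such a pattern tokenizes input, using the '(a|\ )' pattern from this example (illustrative; the library performs tokenization internally):

```python
import re

tokenizer_pattern = r"(a|\ )"  # from the dump() output above
tokens = [m.group(1) for m in re.finditer(tokenizer_pattern, "a a")]
print(tokens)  # ['a', ' ', 'a']
```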

See also

dumps

Dump Graph Transliterator configuration to JSON string

load

Load Graph Transliteration from configuration in Python data types

loads

Load Graph Transliteration from configuration as a JSON string

dumps(compression_level=2)[source]

Dump settings of Graph Transliterator to Javascript Object Notation (JSON). Compression is turned on by default.

Parameters:
  • compression_level (int) – One of 0 (no compression), 1 (compression including graph), or 2 (default, compression without graph)

  • separators (tuple of str) – Separators used by json.dumps(); the default is compact

Returns:

JSON string

Return type:

str

Examples

yaml_ = '''
  tokens:
    a: [vowel]
    ' ': [wb]
  rules:
    a: A
    ' ': ' '
  whitespace:
    default: " "
    consolidate: false
    token_class: wb
  onmatch_rules:
    - <vowel> + <vowel>: ','  # add a comma between vowels
  metadata:
    author: "Author McAuthorson"
'''
gt = GraphTransliterator.from_yaml(yaml_)
gt.dumps()
'{"graphtransliterator_version":"1.2.4","compressed_settings":[["vowel","wb"],[" ","a"],[[1],[0]],[["A",0,0,[1],0,0,-1],[" ",0,0,[0],0,0,-1]],[" ","wb",0],[[[0],[0],","]],{"author":"Author McAuthorson"},null]}'

See also

dump

Dump Graph Transliterator configuration to Python data types

load

Load Graph Transliterator from configuration in Python data types

loads

Load Graph Transliterator from configuration as a JSON string

static from_dict(dict_settings, **kwargs)[source]

Generate GraphTransliterator from dict settings.

Parameters:

dict_settings (dict) – Dictionary of settings

Returns:

Graph transliterator

Return type:

GraphTransliterator

static from_easyreading_dict(easyreading_settings, **kwargs)[source]

Constructs GraphTransliterator from a dictionary of settings in “easy reading” format, i.e. the loaded contents of a YAML string.

Parameters:

easyreading_settings (dict) –

Settings dictionary in easy reading format with keys:

"tokens"

Mappings of tokens to their classes (dict of {str: list of str})

"rules"

Transliteration rules in “easy reading” format (list of dict of {str: str})

"onmatch_rules"

On match rules in “easy reading” format (list of dict of {str: str}, optional)

"whitespace"

Whitespace definitions, including the default whitespace token, the class of whitespace tokens, and whether or not to consolidate (dict of {'default': str, 'token_class': str, 'consolidate': bool}, optional)

"metadata"

Dictionary of metadata (dict, optional)

Returns:

Graph Transliterator

Return type:

GraphTransliterator

Note

Called by from_yaml().

Example

tokens = {
    'ab': ['class_ab'],
    ' ': ['wb']
}
whitespace = {
    'default': ' ',
    'token_class': 'wb',
    'consolidate': True
}
onmatch_rules = [
    {'<class_ab> + <class_ab>': ','}
]
rules = {'ab': 'AB',
         ' ': '_'}
settings = {'tokens': tokens,
            'rules': rules,
            'whitespace': whitespace,
            'onmatch_rules': onmatch_rules}
gt = GraphTransliterator.from_easyreading_dict(settings)
gt.transliterate("ab abab")
'AB_AB,AB'

See also

from_yaml

Constructor from YAML string in “easy reading” format

from_yaml_file

Constructor from YAML file in “easy reading” format

static from_yaml(yaml_str, charnames_escaped=True, **kwargs)[source]

Construct GraphTransliterator from a YAML str.

Parameters:
  • yaml_str (str) – YAML mappings of tokens, rules, and (optionally) onmatch_rules

  • charnames_escaped (boolean) – Unescape Unicode during YAML read (default True)

Note

Called by from_yaml_file() and calls from_easyreading_dict().

Example

yaml_ = '''
tokens:
  a: [class1]
  ' ': [wb]
rules:
  a: A
  ' ': ' '
whitespace:
  default: ' '
  consolidate: True
  token_class: wb
onmatch_rules:
  - <class1> + <class1>: "+"
'''
gt = GraphTransliterator.from_yaml(yaml_)
gt.transliterate("a aa")
'A A+A'

See also

from_easyreading_dict

Constructor from dictionary in “easy reading” format

from_yaml

Constructor from YAML string in “easy reading” format

from_yaml_file

Constructor from YAML file in “easy reading” format

static from_yaml_file(yaml_filename, **kwargs)[source]

Construct GraphTransliterator from YAML file.

Parameters:

yaml_filename (str) – Name of YAML file, containing tokens, rules, and (optionally) onmatch_rules

Note

Calls from_yaml().

See also

from_yaml

Constructor from YAML string in “easy reading” format

from_easyreading_dict

Constructor from dictionary in “easy reading” format

property graph

Graph used in transliteration.

Type:

DirectedGraph

property graphtransliterator_version

Graph Transliterator version.

Type:

str

property ignore_errors

Ignore transliteration errors setting.

Type:

bool

property last_input_tokens

Last tokenization of the input string, with whitespace at start and end.

Type:

list of str

property last_matched_rule_tokens

Last matched tokens for each rule.

Type:

list of list of str

property last_matched_rules

Last transliteration rules matched.

Type:

list of TransliterationRule

static load(settings, **kwargs)[source]

Create GraphTransliterator from settings as Python data types.

Parameters:

settings

GraphTransliterator configuration as a dictionary with keys:

"tokens"

Mappings of tokens to their classes (dict of {str: list of str})

"rules"

Transliteration rules in direct format (list of OrderedDict of {str: str})

"whitespace"

Whitespace settings (dict of {str: str})

"onmatch_rules"

On match rules (list of OrderedDict, optional)

"metadata"

Dictionary of metadata (dict, optional)

"ignore_errors"

Ignore errors. (bool, optional)

"onmatch_rules_lookup"

Dictionary keyed by current token to previous token containing a list of indexes of applicable OnmatchRule to try (dict of {str: dict of {str: list of int}}, optional)

"tokens_by_class"

Tokens keyed by token class, used internally (dict of {str: list of str}, optional)

"graph"

Serialization of DirectedGraph (dict, optional)

"tokenizer_pattern"

Regular expression for tokenizing (str, optional)

"graphtransliterator_version"

Module version of graphtransliterator (str, optional)

Returns:

Graph Transliterator

Return type:

GraphTransliterator

Example

from collections import OrderedDict
settings = {'tokens': {'a': ['vowel'], ' ': ['wb']},
 'rules': [OrderedDict([('production', 'A'),
               # Can be compacted, removing None values
               # ('prev_tokens', None),
               ('tokens', ['a']),
               ('next_classes', None),
               ('next_tokens', None),
               ('cost', 0.5849625007211562)]),
  OrderedDict([('production', ' '),
               ('prev_classes', None),
               ('prev_tokens', None),
               ('tokens', [' ']),
               ('next_classes', None),
               ('next_tokens', None),
               ('cost', 0.5849625007211562)])],
 'whitespace': {'default': ' ', 'token_class': 'wb', 'consolidate': False},
 'onmatch_rules': [OrderedDict([('prev_classes', ['vowel']),
               ('next_classes', ['vowel']),
               ('production', ',')])],
 'metadata': {'author': 'Author McAuthorson'},
 'onmatch_rules_lookup': {'a': {'a': [0]}},
 'tokens_by_class': {'vowel': ['a'], 'wb': [' ']},
 'graph': {'edge': {0: {1: {'token': 'a', 'cost': 0.5849625007211562},
    3: {'token': ' ', 'cost': 0.5849625007211562}},
   1: {2: {'cost': 0.5849625007211562}},
   3: {4: {'cost': 0.5849625007211562}}},
  'node': [{'type': 'Start', 'ordered_children': {'a': [1], ' ': [3]}},
   {'type': 'token', 'token': 'a', 'ordered_children': {'__rules__': [2]}},
   {'type': 'rule',
    'rule_key': 0,
    'accepting': True,
    'ordered_children': {}},
   {'type': 'token', 'token': ' ', 'ordered_children': {'__rules__': [4]}},
   {'type': 'rule',
    'rule_key': 1,
    'accepting': True,
    'ordered_children': {}}],
  'edge_list': [(0, 1), (1, 2), (0, 3), (3, 4)]},
 'tokenizer_pattern': '(a|\\ )',
 'graphtransliterator_version': '0.3.3'}
gt = GraphTransliterator.load(settings)
gt.transliterate('aa')
'A,A'
# can be compacted
settings.pop('onmatch_rules_lookup')
GraphTransliterator.load(settings).transliterate('aa')
'A,A'
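The serialized graph above is what drives matching: the Start node's ordered_children map input tokens to token nodes, whose __rules__ entries point to accepting rule nodes. A minimal plain-Python sketch of following those links, using the node list from the example above (rule_for is a hypothetical helper for illustration, not part of the API):

```python
# Node list copied from the 'graph' serialization in the example above.
graph_node = [
    {'type': 'Start', 'ordered_children': {'a': [1], ' ': [3]}},
    {'type': 'token', 'token': 'a', 'ordered_children': {'__rules__': [2]}},
    {'type': 'rule', 'rule_key': 0, 'accepting': True, 'ordered_children': {}},
    {'type': 'token', 'token': ' ', 'ordered_children': {'__rules__': [4]}},
    {'type': 'rule', 'rule_key': 1, 'accepting': True, 'ordered_children': {}},
]

def rule_for(token):
    """Follow Start -> token node -> accepting rule node; return its rule_key."""
    for token_node_idx in graph_node[0]['ordered_children'].get(token, []):
        token_node = graph_node[token_node_idx]
        for rule_node_idx in token_node['ordered_children'].get('__rules__', []):
            rule_node = graph_node[rule_node_idx]
            if rule_node.get('accepting'):
                return rule_node['rule_key']

rule_for('a')  # 0, the index of the 'a' -> 'A' rule
```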

See also

dump

Dump Graph Transliterator configuration to Python data types

dumps

Dump Graph Transliterator configuration to JSON string

loads

Load Graph Transliterator from configuration as a JSON string

static loads(settings, **kwargs)[source]

Create GraphTransliterator from JavaScript Object Notation (JSON) string.

Parameters:

settings – JSON settings for GraphTransliterator

Returns:

Graph Transliterator

Return type:

GraphTransliterator

Example

JSON_settings = '''{"tokens": {"a": ["vowel"], " ": ["wb"]}, "rules": [{"production": "A", "prev_classes": null, "prev_tokens": null, "tokens": ["a"], "next_classes": null, "next_tokens": null, "cost": 0.5849625007211562}, {"production": " ", "prev_classes": null, "prev_tokens": null, "tokens": [" "], "next_classes": null, "next_tokens": null, "cost": 0.5849625007211562}], "whitespace": {"default": " ", "token_class": "wb", "consolidate": false}, "onmatch_rules": [{"prev_classes": ["vowel"], "next_classes": ["vowel"], "production": ","}], "metadata": {"author": "Author McAuthorson"}, "ignore_errors": false, "onmatch_rules_lookup": {"a": {"a": [0]}}, "tokens_by_class": {"vowel": ["a"], "wb": [" "]}, "graph": {"node": [{"type": "Start", "ordered_children": {"a": [1], " ": [3]}}, {"type": "token", "token": "a", "ordered_children": {"__rules__": [2]}}, {"type": "rule", "rule_key": 0, "accepting": true, "ordered_children": {}}, {"type": "token", "token": " ", "ordered_children": {"__rules__": [4]}}, {"type": "rule", "rule_key": 1, "accepting": true, "ordered_children": {}}], "edge": {"0": {"1": {"token": "a", "cost": 0.5849625007211562}, "3": {"token": " ", "cost": 0.5849625007211562}}, "1": {"2": {"cost": 0.5849625007211562}}, "3": {"4": {"cost": 0.5849625007211562}}}, "edge_list": [[0, 1], [1, 2], [0, 3], [3, 4]]}, "tokenizer_pattern": "(a| )", "graphtransliterator_version": "1.2.2"}'''

gt = GraphTransliterator.loads(JSON_settings)
gt.transliterate('a')
'A'

See also

dump

Dump Graph Transliterator configuration to Python data types

dumps

Dump Graph Transliterator configuration to JSON string

load

Load Graph Transliterator from configuration in Python data types

match_at(token_i, tokens, match_all=False)[source]

Match best (least costly) transliteration rule at a given index in the input tokens and return the index to that rule. Optionally, return all rules that match.

Parameters:
  • token_i (int) – Location in tokens at which to begin

  • tokens (list of str) – List of tokens

  • match_all (bool, optional) – If True, return the indexes of all rules matching at the given index. The default is False.

Returns:

Index of the matching transliteration rule in GraphTransliterator.rules, or None if no rule matches. If match_all is True, returns a list of int (possibly empty).

Return type:

int, None, or list of int

Note

Expects a whitespace token at the beginning and end of tokens.

Examples

gt = GraphTransliterator.from_yaml('''
        tokens:
            a: []
            a a: []
            ' ': [wb]
        rules:
            a: <A>
            a a: <AA>
        whitespace:
            default: ' '
            consolidate: True
            token_class: wb
''')
tokens = gt.tokenize("aa")
tokens # whitespace added to ends
[' ', 'a', 'a', ' ']
gt.match_at(1, tokens) # returns index to rule
0
gt.rules[gt.match_at(1, tokens)] # actual rule
TransliterationRule(production='<AA>', prev_classes=None, prev_tokens=None, tokens=['a', 'a'], next_tokens=None, next_classes=None, cost=0.41503749927884376)
gt.match_at(1, tokens, match_all=True) # index to rules, with match_all
[0, 1]
[gt.rules[_] for _ in gt.match_at(1, tokens, match_all=True)]
[TransliterationRule(production='<AA>', prev_classes=None, prev_tokens=None, tokens=['a', 'a'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='<A>', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562)]
property metadata

Metadata of transliterator

Type:

dict

property onmatch_rules

Rules for productions between matches.

Type:

list of OnMatchRule

property onmatch_rules_lookup

On Match Rules lookup

Type:

dict
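As an illustration of this structure, using the lookup value shown in the load() example elsewhere in this reference: the lookup is keyed by the current token, then the previous token, and yields indexes of candidate on-match rules to try (candidate_rule_indexes is a hypothetical helper, not part of the API):

```python
# Shape taken from the load() example: current token -> previous token -> rule indexes.
onmatch_rules_lookup = {'a': {'a': [0]}}

def candidate_rule_indexes(prev_token, curr_token):
    """Indexes of on-match rules applicable at this token boundary."""
    return onmatch_rules_lookup.get(curr_token, {}).get(prev_token, [])

candidate_rule_indexes('a', 'a')  # [0]
candidate_rule_indexes(' ', 'a')  # []
```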

property productions

List of productions of each transliteration rule.

Type:

list of str

pruned_of(productions)[source]

Remove transliteration rules with specific output productions.

Parameters:

productions (str, or list of str) – list of productions to remove

Returns:

Graph transliterator pruned of certain productions.

Return type:

graphtransliterator.GraphTransliterator

Note

Uses original initialization parameters to construct a new GraphTransliterator.

Examples

gt = GraphTransliterator.from_yaml('''
        tokens:
            a: []
            a a: []
            ' ': [wb]
        rules:
            a: <A>
            a a: <AA>
        whitespace:
            default: ' '
            consolidate: True
            token_class: wb
''')
gt.rules
[TransliterationRule(production='<AA>', prev_classes=None, prev_tokens=None, tokens=['a', 'a'], next_tokens=None, next_classes=None, cost=0.41503749927884376),
 TransliterationRule(production='<A>', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562)]
gt.pruned_of('<AA>').rules
[TransliterationRule(production='<A>', prev_classes=None, prev_tokens=None, tokens=['a'], next_tokens=None, next_classes=None, cost=0.5849625007211562)]
gt.pruned_of(['<A>', '<AA>']).rules
[]
property rules

Transliteration rules sorted by cost.

Type:

list of TransliterationRule

tokenize(input)[source]

Tokenizes an input string.

Adds initial and trailing whitespace, which can be consolidated.

Parameters:

input (str) – String to tokenize

Returns:

List of tokens, with default whitespace token at beginning and end.

Return type:

list of str

Raises:

ValueError – Unrecognizable input, such as a character that is not in a token

Examples

tokens = {'ab': ['class_ab'], ' ': ['wb']}
whitespace = {'default': ' ', 'token_class': 'wb', 'consolidate': True}
rules = {'ab': 'AB', ' ': '_'}
settings = {'tokens': tokens, 'rules': rules, 'whitespace': whitespace}
gt = GraphTransliterator.from_easyreading_dict(settings)
gt.tokenize('ab ')
[' ', 'ab', ' ']
property tokenizer_pattern

Tokenizer pattern from transliterator

Type:

str
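The pattern lists tokens longest-first (the changelog notes tokens are sorted by length, then by string) so that regex alternation, which tries branches left to right, matches the longest token at each position. A self-contained sketch of how such a pattern behaves, using the token set from the match_at() example (this builds an illustrative pattern, not the library's exact one):

```python
import re

tokens = ['a', 'a a', ' ']
# Longest-first (then lexicographic), so 'a a' is tried before 'a'.
pattern = '(' + '|'.join(re.escape(t)
                         for t in sorted(tokens, key=lambda t: (-len(t), t))) + ')'
tokenizer = re.compile(pattern)
tokenizer.match('a a').group(0)  # 'a a', not just 'a'
```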

property tokens

Mappings of tokens to their classes.

Type:

dict of {str: set of str}

property tokens_by_class

Tokens keyed by token class.

Type:

dict of {str: list of str}

transliterate(input)[source]

Transliterate an input string into an output string.

Parameters:

input (str) – Input string to transliterate

Returns:

Transliteration output string

Return type:

str

Raises:

ValueError – Cannot parse input

Note

Whitespace will be temporarily appended to start and end of input string.

Example

GraphTransliterator.from_yaml(
'''
tokens:
  a: []
  ' ': [wb]
rules:
  a: A
  ' ': '_'
whitespace:
  default: ' '
  consolidate: True
  token_class: wb
''').transliterate("a a")
'A_A'
property whitespace

Whitespace rules.

Type:

WhiteSpaceRules

class graphtransliterator.CoverageTransliterator(*args, **kwargs)[source]

Subclass of GraphTransliterator that logs visits to graph and on_match rules.

Used to confirm that tests cover the entire graph and onmatch_rules.

check_coverage(raise_exception=True)[source]

Check coverage of graph and onmatch rules.

First checks graph coverage, then checks onmatch rules.

check_onmatchrules_coverage(raise_exception=True)[source]

Check coverage of onmatch rules.

clear_visited()[source]

Clear visited flags from graph and onmatch_rules.

Bundled Transliterators
graphtransliterator.transliterators

Bundled transliterators are loaded by explicitly importing graphtransliterator.transliterators. Each is an instance of graphtransliterator.bundled.Bundled.

class graphtransliterator.transliterators.Bundled(*args, **kwargs)[source]

Subclass of GraphTransliterator used for bundled Graph Transliterator.

property directory

Directory of bundled transliterator, used to load settings.

from_JSON(check_ambiguity=False, coverage=False, **kwargs)[source]

Initialize from bundled JSON file (best for speed).

Parameters:
  • check_ambiguity (bool) – Whether ambiguity should be checked. Default is False.

  • coverage (bool) – Whether test coverage should be checked. Default is False.

from_YAML(check_ambiguity=True, coverage=True, **kwargs)[source]

Initialize from bundled YAML file (best for development).

Parameters:
  • check_ambiguity (bool) – Whether ambiguity should be checked. Default is True.

  • coverage (bool) – Whether test coverage should be checked. Default is True.

generate_yaml_tests(file=None)[source]

Generates YAML tests with complete coverage.

Uses the first token in a class as a sample. Assumes, for onmatch rules, that the first sample token in a class has a unique production, which may not be the case; the generated tests should be checked and edited.

load_yaml_tests()[source]

Iterator for YAML tests.

Assumes tests are found in the subdirectory tests of the module, with name NAME_tests.yaml, e.g. source_to_target/tests/source_to_target_tests.yaml.

property name

Name of bundled transliterator, e.g. ‘Example’

classmethod new(method='json', **kwargs)[source]

Return a new class instance from method (json/yaml).

Parameters:

method (str (json or yaml)) – How to load bundled transliterator, JSON or YAML.

run_tests(transliteration_tests)[source]

Run transliteration tests.

Parameters:

transliteration_tests (dict of {str: str}) – Dictionary of tests mapping source to correct target.

run_yaml_tests()[source]

Run YAML tests in MODULE/tests/MODULE_tests.yaml

property yaml_tests_filen

Filename of bundled YAML tests

Type:

str

class graphtransliterator.transliterators.Example(**kwargs)[source]

Example Bundled Graph Transliterator.

class graphtransliterator.transliterators.ITRANSDevanagariToUnicode(**kwargs)[source]

ITRANS Devanagari to Unicode Transliterator.

class graphtransliterator.transliterators.MetadataSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for Bundled metadata.

graphtransliterator.transliterators.iter_names()[source]

Iterate through bundled transliterator names.

graphtransliterator.transliterators.iter_transliterators(**kwds)[source]

Iterate through instances of bundled transliterators.

Graph Classes
class graphtransliterator.DirectedGraph(node=None, edge=None, edge_list=None)[source]

A very basic dictionary- and list-based directed graph. Nodes are a list of dictionaries of node data. Edges are nested dictionaries keyed from the head -> tail -> edge properties. An edge list is maintained. Can be exported as a dictionary.

node

List of node data

Type:

list of dict

edge

Mapping from head to tail of edge, holding edge data

Type:

dict of {int: dict of {int: dict}}

edge_list

List of head and tail of each edge

Type:

list of tuple of (int, int)

Examples

from graphtransliterator import DirectedGraph
DirectedGraph()
<graphtransliterator.graphs.DirectedGraph at 0x7f8d24500140>
add_edge(head, tail, edge_data=None)[source]

Add an edge to a graph and return its attributes as dict.

Parameters:
  • head (int) – Index of head of edge

  • tail (int) – Index of tail of edge

  • edge_data (dict, default {}) – Edge data

Returns:

Data of created edge

Return type:

dict

Raises:

ValueError – Invalid head or tail, or edge_data is not a dict.

Examples

g = DirectedGraph()
g.add_node()
(0, {})
g.add_node()
(1, {})
g.add_edge(0, 1, {'data_key_1': 'some edge data here'})
{'data_key_1': 'some edge data here'}
g.edge
{0: {1: {'data_key_1': 'some edge data here'}}}
add_node(node_data=None)[source]

Create node and return (int, dict) of node key and object.

Parameters:

node_data (dict, default {}) – Data to be stored in created node

Returns:

Index of created node and its data

Return type:

tuple of (int, dict)

Raises:

ValueError – node_data is not a dict

Examples

g = DirectedGraph()
g.add_node()
(0, {})
g.add_node({'datakey1': 'data value'})
(1, {'datakey1': 'data value'})
g.node
[{}, {'datakey1': 'data value'}]
class graphtransliterator.VisitLoggingDirectedGraph(graph)[source]

A DirectedGraph that logs visits to all nodes and edges.

Used to measure the coverage of tests for bundled transliterators.

check_coverage(raise_exception=True)[source]

Checks that all nodes and edges are visited.

Parameters:

raise_exception (bool, default True) – Raise IncompleteGraphCoverageException on incomplete coverage

Raises:

IncompleteGraphCoverageException – Not all nodes/edges of a graph have been visited.

clear_visited()[source]

Clear all visited attributes on nodes and edges.

Rule Classes
class graphtransliterator.TransliterationRule(production, prev_classes, prev_tokens, tokens, next_tokens, next_classes, cost)[source]

A transliteration rule containing the specific match conditions and string output to be produced, as well as the rule’s cost.

production

Output produced on match of rule

Type:

str

prev_classes

List of previous token classes to be matched before tokens or, if they exist, prev_tokens

Type:

list of str, or None

prev_tokens

List of tokens to be matched before tokens

Type:

list of str, or None

tokens

List of tokens to match

Type:

list of str

next_tokens

List of tokens to match after tokens

Type:

list of str, or None

next_classes

List of token classes to be matched after tokens or, if they exist, next_tokens

Type:

list of str, or None

cost

Cost of the rule, where less specific rules are more costly

Type:

float
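The cost values that appear throughout this reference follow the stated principle that less specific rules are more costly. The documented figures (0.5849… for single-token rules, 0.4150… for the two-token rule '<AA>') are consistent with log2(1 + 1/(1 + n)) for n matched items; the sketch below reproduces them, but the formula is inferred from those values, so treat it as illustrative:

```python
import math

def rule_cost(n_matched):
    # Inferred from the documented values: more matched tokens -> lower cost.
    return math.log2(1 + 1 / (1 + n_matched))

rule_cost(1)  # ~0.585, matching the single-token rules above
rule_cost(2)  # ~0.415, matching the two-token rule '<AA>'
```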

class graphtransliterator.OnMatchRule(prev_classes, next_classes, production)[source]

Rules about adding text between certain combinations of matched rules.

When a transliteration rule has been matched and before its production is added to the output, the production string of an OnMatchRule is added if the previously matched tokens and current tokens are of the specified classes.

prev_classes

List of previously matched token classes required

Type:

list of str

next_classes

List of current and following token classes required

Type:

list of str

production

String to be added before the current rule’s production

Type:

str
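Concretely, for the vowel example used throughout this reference (<vowel> + <vowel>: ','), an on-match check reduces to class-membership tests on the previously matched token and the upcoming token. A simplified sketch for the single-class case (check_onmatch is a hypothetical helper, not the library's implementation):

```python
# Token classes and rule taken from the dumps() example above.
token_classes = {'a': {'vowel'}, ' ': {'wb'}}
rule = (['vowel'], ['vowel'], ',')  # (prev_classes, next_classes, production)

def check_onmatch(prev_token, next_token):
    """Return the production to insert, or '' if the classes do not match."""
    prev_classes, next_classes, production = rule
    if (prev_classes[0] in token_classes[prev_token]
            and next_classes[0] in token_classes[next_token]):
        return production  # inserted before the current rule's production
    return ''

check_onmatch('a', 'a')  # ','
check_onmatch(' ', 'a')  # ''
```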

class graphtransliterator.WhitespaceRules(default, token_class, consolidate)[source]

Whitespace rules of GraphTransliterator.

default

Default whitespace token

Type:

str

token_class

Whitespace token class

Type:

str

consolidate

Whether to consolidate consecutive whitespace tokens and render them as a single instance of the specified default whitespace token.

Type:

bool
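With consolidation on, a run of whitespace tokens collapses to one default whitespace token. A plain-Python sketch of that behavior (consolidate_whitespace is illustrative, not the library's internal function):

```python
def consolidate_whitespace(tokens, whitespace_tokens, default=' '):
    """Collapse runs of whitespace tokens into a single default token."""
    out = []
    for t in tokens:
        if t in whitespace_tokens:
            if out and out[-1] == default:
                continue  # drop repeated whitespace
            out.append(default)  # render as the default whitespace token
        else:
            out.append(t)
    return out

consolidate_whitespace(['a', ' ', ' ', ' ', 'a'], {' '})  # ['a', ' ', 'a']
```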

Exceptions
exception graphtransliterator.GraphTransliteratorException[source]

Base exception class. All Graph Transliterator-specific exceptions should subclass this class.

exception graphtransliterator.AmbiguousTransliterationRulesException[source]

Raised when multiple transliteration rules can match the same pattern. Details of ambiguities are given in a logging.warning().

exception graphtransliterator.NoMatchingTransliterationRuleException[source]

Raised when no transliteration rule can be matched at a particular location in the input string’s tokens. Details of the location are given in a logging.warning().

exception graphtransliterator.UnrecognizableInputTokenException[source]

Raised when a character in the input string does not correspond to any tokens in the GraphTransliterator’s token settings. Details of the location are given in a logging.warning().

Schemas
class graphtransliterator.DirectedGraphSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for DirectedGraph.

Validates graph somewhat rigorously.

class graphtransliterator.EasyReadingSettingsSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for easy reading settings.

Provides initial validation based on easy reading format.

class graphtransliterator.GraphTransliteratorSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for Graph Transliterator.

class graphtransliterator.OnMatchRuleSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for OnMatchRule.

class graphtransliterator.SettingsSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for settings in dictionary format.

Performs validation.

class graphtransliterator.TransliterationRuleSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for TransliterationRule.

class graphtransliterator.WhitespaceDictSettingsSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for Whitespace definition as a dict.

class graphtransliterator.WhitespaceSettingsSchema(*, only: Sequence[str] | AbstractSet[str] | None = None, exclude: Sequence[str] | AbstractSet[str] = (), many: bool = False, context: dict | None = None, load_only: Sequence[str] | AbstractSet[str] = (), dump_only: Sequence[str] | AbstractSet[str] = (), partial: bool | Sequence[str] | AbstractSet[str] | None = None, unknown: str | None = None)[source]

Schema for Whitespace definition that loads as WhitespaceRules.

Credits

Development Lead
Contributors

Acknowledgements

Software development was supported by an Andrew W. Mellon Foundation New Directions Fellowship (Grant Number 11600613) and by matching funds provided by the College of Arts and Letters, Michigan State University.

Kudos

Graph Transliterator’s developers acknowledge the following open-access projects, which have been particularly helpful in Graph Transliterator’s development. These include: astropy (guide for documentation style expanding on numpy), click (command line interface), contributor_covenant (basis for the code of conduct), cookiecutter-pypackage (initial Python module template), jupyter-sphinx (renderer of live code results in the docs), and marshmallow (object serializer/deserializer).

Those from which code/text has been adopted are mentioned in NOTICE.


History

[Unreleased - Maybe]
  • save match location in tokenize using token_details

  • allow insertion of transliteration error messages into output

  • fix Devanagari output in Sphinx-generated Latex PDF

  • add translated messages

  • add static typing with mypy

  • adjust IncorrectVersionException to only consider major, minor versioning not patch

  • Adjust CSS for CLI output in docs

  • add doc making test to commit

[To do]
  • Add on/off switch characters

  • Update module publication

1.2.4 (2023-10-15)
  • switched to poetry for module publishing, based on cookiecutter-poetry

  • fixed click version command

  • added code coverage

1.2.3 (2023-10-09)
  • added python 3.10, 3.11, removed <=3.8

  • updated dependencies (used pur)

  • updated jupyter-download syntax

  • reformatted with black

  • adjusted flake8 line length

  • removed collect_ignore for pytest

  • updated Github actions

1.2.2 (2021-08-11)
  • updated CONTRIBUTING.rst for new Python versions

  • added github actions to publish to pypi and testpypi

  • shifted to github CI

  • updated dependencies

  • fixed tox.ini

  • updated schema.py error message

  • updated docs/conf.py for jupyter_sphinx

1.2.1 (2020-10-29)
  • updated docs/conf.py for jupyter_sphinx

1.2.0 (2020-05-13)
  • changes to bundled.py and cli.py with dump-tests command

  • updated cli.rst

1.1.2 (2020-04-29)
  • updated LICENSE, minor code updates, security updates

1.1.1 (2020-04-21)
  • Added test to check compressed dump is uniform

  • Fixed sorting of class id in compressed dump to make JSON output uniform

  • Added Python 3.8 support

1.1.0 (2020-01-10)
  • Added pre-commit hook to rebuild bundled transliterators with bump2version

  • remove to_dict from DirectedGraph, since it is handled through Marshmallow schemas.

  • Adjust documentation to mention compression.

  • added list-bundled CLI command

  • added –regex/-re flag to graphtransliterator make-json CLI command to allow regular expressions

  • removed coverage keyword from GraphTransliterator

  • reorganized core.py

  • converted from_dict, from_easyreading_dict, from_yaml, and from_yaml_file to static methods from class methods

  • moved ambiguity-checking functions to ambiguity.py and tests to test_ambiguity.py

  • set three levels of compression: 0 (Human-readable), 1 (no data loss, includes graph), 2 (no data loss, and no graph); 2 is fastest and set to default.

  • set check_ambiguity to read keyword during JSON load

  • allowed empty string productions during JSON compression

  • added compression.py with decompress_config() and compress_config() to compress JSON

  • added tests/test_compression.py to test compression.py

  • added sorting of edge_list to DirectedGraph to allow dumped JSON comparison in tests

  • adjusted _tokenizer_string_from() to sort by length then string for JSON comparison

1.0.7 (2019-12-22)
  • added IncorrectVersionException, if serialized version being loaded is from a later version than the current graphtransliterator version

  • added automatic edge_list creation if edge parameter in DirectedGraph

  • added fields to and started using NodeDataSchema

  • added pre_dump to GraphTransliteratorSchema, NodeDataSchema to remove empty values to compress Serialization

  • removed rule from graph leaves and updated docs accordingly

1.0.6 (2019-12-15)
  • fixed serialization of graph node indexes as integer rather than strings

1.0.5 (2019-12-14)
  • added JOSS citation to README

  • added –version to cli

  • removed some asserts

  • removed rule dictionaries from graph leaves to compress and simplify serialization

1.0.4 (2019-11-30)
  • updates to docs

1.0.3 (2019-11-30)
  • update to paper

1.0.2 (2019-11-30)
  • updates for Zenodo

1.0.1 (2019-11-29)
  • updated requirements_dev.txt

1.0.0 (2019-11-26)
  • removed extraneous files

  • updated development status in setup.py

  • set to current jupyter-sphinx

0.4.10 (2019-11-04)
  • fixed typo in requirements_dev.txt

0.4.9 (2019-11-04)
  • quick fix to requirements_dev.txt due to readthedocs problem with not reading changes

0.4.8 (2019-11-04)
  • twine update to 2.0

0.4.7 (2019-11-04)
  • temp switch back to dev version of jupyter-sphinx for overflow error

  • Dropped Python 3.5 support for twine 2.0 update

0.4.6 (2019-11-04)
  • switched to latest jupyter-sphinx

  • travis adjustments

0.4.5 (2019-10-31)
  • Adjusted make-json CLI test to restore original example.json

0.4.4 (2019-10-24)
  • moved README.rst to include in index.rst

  • fixed error in advanced_tutorial.rst

0.4.3 (2019-10-24)
  • fixed requirements_dev.txt

0.4.2 (2019-10-24)
  • fixed README.rst for PyPI

0.4.1 (2019-10-24)
  • fixed links to code in docs

  • fixed link to NOTICE

  • added acknowledgements

0.4.0 (2019-10-24)
  • added bundled transliterators to api.rst

  • adjustments to usage.rst

  • adjustments to tutorial.rst

  • fixes to docs (linking module)

  • adjustments to advanced_tutorial.rst

  • adjustments to README.rst

  • fixes to AUTHORS.rst

  • added kudos.rst to docs to acknowledge inspirational projects

  • added advanced tutorial on bundling a transliterator.

  • added cli.rst to docs

  • fixed regex in get_unicode_char to allow hyphen

  • added cli.py and adjusted setup.py

  • updated tutorial

  • added statement of need to README. Thanks @rlskoeser.

  • Removed continue-linenos jupyter-sphinx directive in favor of configuration settings

  • added preface to documentation source files with links to production version, etc. Thanks @rlskoeser.

  • added custom css for jupyter-sphinx cells

  • added jupyter-sphinx documentation with line numbering

  • removed pkg_resources as the source for the version, due to a problem loading from PYTHONPATH for jupyter-sphinx on Read the Docs; used __version__ instead

  • adjusted path in docs/conf.py to fix docs error

  • added bundled/schemas.py with MetadataSchema for bundled transliterator metadata

  • added coverage to from_dict()

  • added allow_none in onmatch_rules in GraphTransliteratorSchema

  • adjusted core.py so that all edges are visited during search, even if they have no constraints

  • removed _count_of_tokens() in favor of cost

  • added IncompleteGraphCoverageException to exceptions.py

  • added VisitLoggingDirectedGraph to graphs.py

  • added tests/test_transliterator.py

  • partially updated transliterators/README.rst

  • removed transliterators/sample/*

  • added yaml and json to package_data in setup.py

  • added CoverageTransliterator class to core.py, which tracks visits to edges, nodes, and onmatch rules and allows clearing visits and checking coverage, to ensure tests are comprehensive

  • created test/test_coverage.py to test CoverageTransliterator

  • created transliterators/bundled.py with class Bundled for bundled transliterators

  • added load_from_YAML() and load_from_JSON() initializers to Bundled to load from bundled YAML (for development) and JSON (for speed)

  • added load_yaml_tests(), run_yaml_tests(), and run_tests() to Bundled

  • created transliterators/__init__.py that finds bundled transliterators in a subdirectory and adds them to the graphtransliterator.transliterators namespace

  • added iter_names() and iter_transliterators() to transliterators/__init__.py

  • created test/test_transliterator.py to check bundled transliterator loading and functions

  • created in transliterators/example/ __init__.py, example.json, example.yaml

  • created in transliterators/example/tests test_example.py and example_tests.yaml

0.3.8 (2019-09-18)
  • fixed load() docstring example

  • updated check_ambiguity() to use cost

0.3.7 (2019-09-17)
  • Adjusted docs to show readme as first page

  • Added sample graph and code to README.rst

  • moved images in docs to _static

0.3.6 (2019-09-17)
  • adjusted installation.rst renaming libraries to modules

  • updated paper and bibliography.

0.3.5 (2019-09-15)
  • flake8 fix for core.py

  • fixed bug in schemas.py whereby, during load(), DirectedGraphSchema() was modifying input settings

  • added tests for modifications to settings by load()

  • adjusted DirectedGraphSchema to allow for compacted transliteration rule settings

  • adjusted GraphTransliteratorSchema to allow for compacted settings

  • added tests to confirm all optional fields passed to load() are really optional

  • added ValidationError if onmatch_rules_lookup present without onmatch_rules

  • adjusted DirectedGraphSchema edge definition to remove str if loading from JSON

  • added more rigorous schema definitions for edge_list and node in DirectedGraphSchema

  • fixed flake8 warning in graphs.py

  • adjusted docstrings in core.py for dump(), dumps(), load(), and loads()

0.3.4 (2019-09-15)
  • added sphinx-issues and settings to requirements_dev.txt, docs/conf.py

  • added .readthedocs.yml configuration file to accommodate sphinx-issues

  • removed history from setup.py due to sphinx-issues

  • fixed GraphTransliteratorException import in __init__.py

  • added docs/_static directory

  • fixed emphasis error and duplicate object description in docs/usages.rst

  • fixed docstring in core.py

  • added python versions badge to README.rst (openjournals/joss-reviews#1717). Thanks @vc1492a.

  • added NOTICE listing licenses of open-source text and code

  • added Dependencies information to docs/install.rst (openjournals/joss-reviews#1717). Thanks @vc1492a.

  • updated AUTHORS.rst

  • minor updates to README.rst

0.3.3 (2019-09-14)
  • fixed missing marshmallow dependency (#47). Thanks @vc1492a.

  • removed unused code from test (#47). Thanks @vc1492a.

  • removed cerberus dependency

0.3.2 (2019-08-30)
  • fixed error in README.rst

0.3.1 (2019-08-29)
  • adjustments to README.rst

  • cleanup in initialize.py and core.py

  • fix to docs/api.rst

  • adjusted setup.cfg for bumpversion of core.py

  • adjusted requirements.txt

  • removed note about namedtuple in dump docs

  • adjusted docs (api.rst, etc.)

0.3.0 (2019-08-23)
  • Removed _tokens_of() from init

  • Removed serialize()

  • Added load() to GraphTransliterator, without ambiguity checking

  • Added dump() and dumps() to GraphTransliterator to export configuration

  • renamed _tokenizer_from() to _tokenizer_pattern_from(), so that the regex is compiled on load and passed as a pattern string (tokenizer_pattern)

  • added settings parameters to DirectedGraph

  • added OnMatchRule as namedtuple for consistency

  • added a new GraphTransliterator.from_dict(), which validates input from from_yaml()

  • renamed GraphTransliterator.from_dict() to GraphTransliterator.from_easyreading_dict()

  • added schemas.py

  • removed validate.py

  • removed cerberus and added marshmallow to validate.py

  • adjusted tests

  • Removed check_settings parameter

0.2.14 (2019-08-15)
  • minor code cleanup

  • removed yaml from validate.py

0.2.13 (2019-08-03)
  • changed setup.cfg for double quotes in bumpversion due to Black formatting of setup.py

  • added version test

0.2.12 (2019-08-03)
  • fixed version error in setup.py

0.2.11 (2019-08-03)
  • fixed Travis issue

0.2.10 (2019-08-03)
  • fixed test for version not working on travis

0.2.9 (2019-08-03)
  • Used Black code formatter

  • Adjusted tox.ini, contributing.rst

  • Set development status to Beta in setup.py

  • Added black badge to README.rst

  • Fixed comments and minor changes in initialize.py

0.2.8 (2019-07-30)
  • Fixed ambiguity check if no rules present

  • Updates to README.rst

0.2.7 (2019-07-28)
  • Modified docs/conf.py

  • Modified equation in docs/usage.rst and paper/paper.md to fix doc build

0.2.6 (2019-07-28)
  • Fixes to README.rst, usage.rst, paper.md, and tutorial.rst

  • Modifications to core.py documentation

0.2.5 (2019-07-24)
  • Fixes to HISTORY.rst and README.rst

  • 100% test coverage.

  • Added draft of paper.

  • Added graphtransliterator_version to serialize().

0.2.4 (2019-07-23)
  • minor changes to readme

0.2.3 (2019-07-23)
  • added xenial to travis.yml

0.2.2 (2019-07-23)
  • added CI

0.2.1 (2019-07-23)
  • fixed HISTORY.rst for PyPI

0.2.0 (2019-07-23)
  • Fixed module naming in docs using __module__.

  • Converted DirectedGraph nodes to a list.

  • Added Code of Conduct.

  • Added GraphTransliterator class.

  • Updated module dependencies.

  • Added requirements.txt

  • Added check_settings parameter to skip validating settings.

  • Added tests for ambiguity and check_ambiguity parameter.

  • Changed name to Graph Transliterator in docs.

  • Created core.py, validate.py, process.py, rules.py, initialize.py, exceptions.py, graphs.py

  • Added ignore_errors property and setter for transliteration exceptions (UnrecognizableInputToken, NoMatchingTransliterationRule)

  • Added logging to graphtransliterator

  • Added positive cost function based on number of matched tokens in rule

  • added metadata field

  • added documentation

0.1.1 (2019-05-30)
  • Adjusted copyright in docs.

  • Removed Python 2 support.

0.1.0 (2019-05-30)
  • First release on PyPI.
