ajmc.commons package

ajmc.commons contains the utilities (helpers, functions, objects, hard-coded variables) that are shared by, and must be accessible to, the task-specific repositories. Notably, these include:

  • file_management utilities, which handle files systematically within ajmc’s data organisation (see notebooks/data_organisation.ipnb for more) and retrieve information from the various project spreadsheets.

  • arithmetic.py contains helper maths functions, mainly for dealing with intervals, which are a common object in our Canonical format (of which more below).

  • docstrings.py centralizes common function and class docstrings in a single place and provides a decorator to retrieve them easily.

  • geometry.py provides helper functions and an object, Shape, to deal with geometrical objects such as contours and bounding boxes.

  • image.py provides helper functions and an object, AjmcImage, to deal with images.

  • miscellaneous.py receives everything which doesn’t fit anywhere else. It notably contains generic functions and decorators, lazy objects for efficiency, etc.

  • variables contains all the hard-coded variables such as PATHS, COLORS, SPREADSHEET_IDS, CHARSETS and many more.

Submodules

ajmc.commons.arithmetic module

This module contains basic arithmetic functions.

ajmc.commons.arithmetic.are_intervals_within_intervals(contained: List[Tuple[int, int]], container: List[Tuple[int, int]]) -> bool

Applies is_interval_within_interval to a list of intervals, checking that every contained interval lies within at least one of the container intervals.

ajmc.commons.arithmetic.compute_interval_overlap(i1: Tuple[int, int], i2: Tuple[int, int])

Computes the overlap between two intervals, each defined by its start and its stop, both included.

Parameters:
  • i1 – A Tuple[int, int] defining the included boundaries of an interval, with start <= stop.

  • i2 – A Tuple[int, int] defining the included boundaries of an interval, with start <= stop.

Returns:

The length of the overlap.

Return type:

int
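The inclusive-boundary convention matters when computing the overlap length, because both endpoints count. The following is a minimal sketch under that assumption, not the ajmc implementation; the name inclusive_overlap is hypothetical.

```python
from typing import Tuple


def inclusive_overlap(i1: Tuple[int, int], i2: Tuple[int, int]) -> int:
    """Length of the overlap of two intervals whose start and stop are both included."""
    start = max(i1[0], i2[0])  # the overlap starts at the later of the two starts
    stop = min(i1[1], i2[1])   # and stops at the earlier of the two stops
    # +1 because both boundaries are included; clamp at 0 for disjoint intervals.
    return max(0, stop - start + 1)


print(inclusive_overlap((1, 5), (3, 8)))  # 3 (the points 3, 4 and 5)
print(inclusive_overlap((1, 2), (4, 6)))  # 0 (disjoint)
```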

ajmc.commons.arithmetic.is_interval_within_interval(contained: Tuple[int, int], container: Tuple[int, int]) -> bool

Checks if the contained interval is included in the container interval.

Parameters:
  • container – A Tuple[int, int] defining the included boundaries of an interval, with start <= stop.

  • contained – A Tuple[int, int] defining the included boundaries of an interval, with start <= stop.
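The two containment checks described in this module can be sketched as follows. This is an illustration under the stated convention (included boundaries, start <= stop), not the ajmc source; interval_within and all_within_any are hypothetical names.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def interval_within(contained: Interval, container: Interval) -> bool:
    """True if `contained` lies entirely inside `container` (boundaries included)."""
    return container[0] <= contained[0] and contained[1] <= container[1]


def all_within_any(contained: List[Interval], container: List[Interval]) -> bool:
    """True if every interval of `contained` lies inside at least one interval of `container`."""
    return all(any(interval_within(c, k) for k in container) for c in contained)


print(interval_within((2, 4), (1, 5)))                     # True
print(all_within_any([(2, 4), (7, 8)], [(1, 5), (6, 9)]))  # True
print(all_within_any([(2, 6)], [(1, 5), (6, 9)]))          # False: (2, 6) straddles both
```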

ajmc.commons.arithmetic.safe_divide(dividend, divisor)

Simple division which returns np.nan if divisor equals zero.
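A minimal sketch of the behaviour described above. To stay dependency-free it returns float("nan") where the documented function returns np.nan (np.nan is itself a float NaN); safe_divide_sketch is a hypothetical name.

```python
import math


def safe_divide_sketch(dividend, divisor):
    """Divide, returning NaN instead of raising ZeroDivisionError."""
    return dividend / divisor if divisor != 0 else float("nan")


print(safe_divide_sketch(6, 3))              # 2.0
print(math.isnan(safe_divide_sketch(1, 0)))  # True
```

Because NaN never compares equal to itself, test the result with math.isnan (or np.isnan) rather than with ==.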

ajmc.commons.docstrings module

This file contains generic docstring chunks to be formatted using docstring_formatter.

ajmc.commons.docstrings.docstring_formatter(**kwargs)

Decorator with arguments used to format the docstring of a function.

docstring_formatter is a decorator with arguments: it takes any set of kwargs and returns a decorator. It should therefore always be called with parentheses (unlike traditional decorators; see below). Placeholders follow the grammar of str.format(), i.e. {my_format_value}.

Example

This code snippet:

@docstring_formatter(greeting = 'hello')
def my_func():
    "A simple greeter that says {greeting}"
    # Do your stuff

is equivalent to:

def my_func():
    "A simple greeter that says {greeting}"
    # Do your stuff

my_func.__doc__ = my_func.__doc__.format(greeting = 'hello')

Note

Best practice is to name your arguments in compliance with docstrings.docstrings in order to simply call @docstring_formatter(**docstrings.docstrings).
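For readers unfamiliar with decorators that take arguments, the pattern can be sketched as below. This is an illustrative reimplementation based only on the equivalence shown above, not the ajmc source; docstring_formatter_sketch is a hypothetical name.

```python
def docstring_formatter_sketch(**kwargs):
    """Return a decorator that formats the decorated function's docstring with kwargs."""

    def decorator(func):
        # Functions without a docstring have __doc__ set to None: leave them alone.
        if func.__doc__:
            func.__doc__ = func.__doc__.format(**kwargs)
        return func

    return decorator


@docstring_formatter_sketch(greeting='hello', unused='ignored')
def my_func():
    """A simple greeter that says {greeting}"""


print(my_func.__doc__)  # A simple greeter that says hello
```

str.format() ignores keyword arguments that the string does not reference, which is why a whole dictionary of chunks can be passed at once. Literal braces in a docstring must be doubled ({{ and }}), otherwise str.format() will try to interpret them as placeholders.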

ajmc.commons.file_management module

ajmc.commons.geometry module

ajmc.commons.image module

ajmc.commons.miscellaneous module

ajmc.commons.unicode_utils module

This file contains Unicode variables and functions used to process Unicode characters.

ajmc.commons.unicode_utils.chunk_string_by_charsets(string: str, fallback: str = 'latin')

Chunk a string by character set, returning a list of tuples of the form (chunk, charset).

Example

>>> chunk_string_by_charsets('Hello Γειά σου Κόσμε World')
[('Hello ', 'latin'), ('Γειά σου Κόσμε ', 'greek'), ('World', 'latin')]
Parameters:

string (str) – The string to chunk.

Returns:

A list of tuples of the form (chunk, charset).

Return type:

list
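One way to obtain this kind of chunking is sketched below. It is an assumption-laden illustration, not the ajmc implementation: guess_charset and chunk_by_charset are hypothetical names, the charset is guessed from the character's Unicode name, and neutral characters such as spaces are assumed to inherit the charset of the current chunk.

```python
import unicodedata
from typing import List, Optional, Tuple


def guess_charset(char: str) -> Optional[str]:
    """Guess a charset from the character's Unicode name; None for neutral characters."""
    name = unicodedata.name(char, "")
    if "GREEK" in name:
        return "greek"
    if "LATIN" in name:
        return "latin"
    return None


def chunk_by_charset(string: str, fallback: str = "latin") -> List[Tuple[str, str]]:
    chunks: List[Tuple[str, str]] = []
    for char in string:
        charset = guess_charset(char)
        if charset is None:
            # Neutral characters join the current chunk, or use the fallback at the start.
            charset = chunks[-1][1] if chunks else fallback
        if chunks and chunks[-1][1] == charset:
            chunks[-1] = (chunks[-1][0] + char, charset)  # extend the current chunk
        else:
            chunks.append((char, charset))  # open a new chunk
    return chunks


print(chunk_by_charset('Hello Γειά σου Κόσμε World'))
# [('Hello ', 'latin'), ('Γειά σου Κόσμε ', 'greek'), ('World', 'latin')]
```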

ajmc.commons.unicode_utils.count_chars_by_charset(string: str, charset: str) -> int

Counts the characters in string that belong to a given Unicode character set.

Example

count_chars_by_charset('γεια σας, world', 'greek') returns 7 as there are 7 Greek characters in string.

Parameters:
  • string – an NFC-normalized string (default). For NFD-normalized strings, use count_chars_by_charset_nfd.

  • charset – should be 'greek', 'latin', 'numeral', 'punctuation'.

Returns:

the number of charset-matching characters in string.

Return type:

int

ajmc.commons.unicode_utils.count_chars_by_charset_nfd(string: str, charset: str) -> int

Counts the characters in string that belong to a given Unicode character set.

Example

count_chars_by_charset_nfd('γεια σας, world', 'greek') returns 7 as there are 7 Greek characters in string.

Parameters:
  • string – an NFD-normalized string. For NFC-normalized strings, use count_chars_by_charset.

  • charset – should be 'greek', 'latin', 'numeral', 'punctuation'.

Returns:

the number of charset-matching characters in string.

Return type:

int
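The NFC and NFD variants exist because the same text has a different number of code points in each form, so a per-character count depends on the normalization. A short illustration using only the standard library:

```python
import unicodedata

word = "θεά"
nfc = unicodedata.normalize("NFC", word)  # accented letters stay precomposed
nfd = unicodedata.normalize("NFD", word)  # accents become separate combining marks

print(len(nfc), len(nfd))             # 3 4
print(unicodedata.name(nfd[-1]))      # COMBINING ACUTE ACCENT
```

In NFD the final accent is a standalone combining mark, which a naive per-character count would treat as a character of its own.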

ajmc.commons.unicode_utils.get_all_chars_from_range(start: str, end: str) -> str

Get all characters from a range of unicode characters.

Parameters:
  • start (str) – The first character in the range.

  • end (str) – The last character in the range.

Returns:

A string containing all characters in the range.

Return type:

str

ajmc.commons.unicode_utils.get_all_chars_from_ranges(ranges: List[Tuple[str, str]]) -> str

Get all characters from a list of ranges of unicode characters.

Parameters:

ranges (list) – A list of tuples of unicode characters ranges.

Returns:

A string containing all characters in the ranges.

Return type:

str
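Expanding a range of characters amounts to iterating over code points. The sketch below assumes that the end character is included, as the parameter description ("the last character in the range") suggests; chars_from_range and chars_from_ranges are hypothetical names, not the ajmc functions.

```python
from typing import List, Tuple


def chars_from_range(start: str, end: str) -> str:
    """All characters from `start` to `end`, both included (assumption)."""
    return "".join(chr(code_point) for code_point in range(ord(start), ord(end) + 1))


def chars_from_ranges(ranges: List[Tuple[str, str]]) -> str:
    """Concatenate the expansion of several (start, end) ranges."""
    return "".join(chars_from_range(start, end) for start, end in ranges)


print(chars_from_range("a", "e"))                  # abcde
print(chars_from_ranges([("a", "c"), ("α", "γ")]))  # abcαβγ
```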

ajmc.commons.unicode_utils.get_char_charset(char: str, fallback: str = 'fallback') -> str

Returns the charset of a character, if any, fallback otherwise.

ajmc.commons.unicode_utils.get_char_unicode_name(char: str) -> str

Returns the unicode name of a character.

ajmc.commons.unicode_utils.get_string_charset(string: str, fallback: str = 'latin') -> str

Returns the charset of a string, if any, fallback otherwise.

ajmc.commons.unicode_utils.harmonise_ligatures(text: str) -> str
ajmc.commons.unicode_utils.harmonise_miscellaneous_symbols(text: str) -> str
ajmc.commons.unicode_utils.harmonise_non_printable(text: str) -> str
ajmc.commons.unicode_utils.harmonise_punctuation(text: str) -> str
ajmc.commons.unicode_utils.harmonise_spaces(text: str) -> str
ajmc.commons.unicode_utils.harmonise_unicode(text: str, harmonise_functions: Tuple[Callable[[str], str]] = (harmonise_punctuation, harmonise_miscellaneous_symbols, harmonise_ligatures)) -> str

Harmonise unicode characters.

Note

This function takes an NFC string and returns an NFC string.

Parameters:
  • text (str) – The text to harmonise.

  • harmonise_functions (tuple) – A tuple of functions to apply to the text. Each function should take an NFC string as input and return an NFC string as output.

  • harmonise_space_chars (bool) – Whether to harmonise space characters.

Returns:

The harmonised text (an NFC string).

Return type:

str
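The harmonise_functions parameter describes a pipeline of str-to-str functions applied one after the other. The sketch below illustrates that pattern only: the two replacement functions and harmonise_sketch are hypothetical stand-ins, and re-normalizing to NFC at the end is an assumption made to honour the NFC-in, NFC-out contract.

```python
import unicodedata
from typing import Callable, Tuple


def straighten_quotes(text: str) -> str:
    """Hypothetical step: map curly single quotes to the ASCII apostrophe."""
    return text.replace("\u2018", "'").replace("\u2019", "'")


def expand_fi_ligature(text: str) -> str:
    """Hypothetical step: expand the 'fi' ligature (U+FB01) to two letters."""
    return text.replace("\ufb01", "fi")


def harmonise_sketch(text: str,
                     functions: Tuple[Callable[[str], str], ...] = (straighten_quotes,
                                                                    expand_fi_ligature)) -> str:
    for function in functions:
        text = function(text)  # each step takes a string and returns a string
    return unicodedata.normalize("NFC", text)


print(harmonise_sketch("\u2018\ufb01ne\u2019"))  # 'fine'
```

Passing the steps as a tuple makes the pipeline configurable: callers can drop a step or append their own without modifying the function.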

ajmc.commons.unicode_utils.is_charset_string(string: str, charset: str, threshold: float = 0.5, strict: bool = True) -> bool

Returns True if the proportion of characters in string that belong to charset exceeds threshold, False otherwise.

Parameters:
  • string – the string to test.

  • charset – should be 'greek', 'latin', 'numeral', 'punctuation' or a valid re-pattern, for instance r'([ô-ÿ])'

  • threshold – the threshold above which the function returns True

  • strict – if True, only chars in charset are considered, if False, chars in charset, 'numeral' and 'punctuation' are considered.
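The interplay of threshold and strict can be illustrated with a simplified sketch. The patterns below are hypothetical and far narrower than real charsets, the function name is made up, and dividing by the full string length (spaces included) is an assumption rather than the documented behaviour.

```python
import re

# Hypothetical, deliberately small patterns used for illustration only.
PATTERNS = {
    "greek": re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]"),
    "latin": re.compile(r"[A-Za-z]"),
    "numeral": re.compile(r"[0-9]"),
    "punctuation": re.compile(r"[.,;:!?]"),
}


def charset_ratio_exceeds(string: str, charset: str,
                          threshold: float = 0.5, strict: bool = True) -> bool:
    if not string:
        return False
    # In non-strict mode, numerals and punctuation also count as matching.
    names = [charset] if strict else [charset, "numeral", "punctuation"]
    matching = sum(1 for char in string if any(PATTERNS[name].match(char) for name in names))
    return matching / len(string) > threshold


text = 'γεια σας, world'                                        # 15 characters, 7 of them Greek
print(charset_ratio_exceeds(text, 'greek'))                # False: 7/15 is below 0.5
print(charset_ratio_exceeds(text, 'greek', strict=False))  # True: the comma brings it to 8/15
```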

ajmc.commons.unicode_utils.is_charset_string_nfd(string: str, charset: str, threshold: float = 0.5, strict: bool = True) -> bool

Returns True if the proportion of characters in string that belong to charset exceeds threshold, False otherwise.

Parameters:
  • string – an NFD-normalized string. For NFC-normalized strings, use is_charset_string.

  • charset – should be 'greek', 'latin', 'numeral', 'punctuation'.

  • threshold – the threshold above which the function returns True

  • strict – if True, only chars in charset are considered, if False, chars in charset, 'numeral' and 'punctuation' are considered.

ajmc.commons.unicode_utils.remove_diacritics(s: str) -> str

Removes diacritical marks via NFKD normalization and recombination. Useful for building search indexes (and searching against them).

Example:

>>> remove_diacritics("μῆνιν ἄειδε, θεά")
'μηνιν αειδε θεα'
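The decompose, filter, recombine approach described above can be sketched with the standard library alone. This is an illustration rather than the ajmc source: strip_diacritics is a hypothetical name, and unlike the example above this sketch leaves punctuation untouched.

```python
import unicodedata


def strip_diacritics(s: str) -> str:
    # 1. Decompose: each accented letter becomes a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", s)
    # 2. Filter: combining marks have a non-zero canonical combining class.
    stripped = "".join(char for char in decomposed if not unicodedata.combining(char))
    # 3. Recombine whatever is left into composed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("μῆνιν ἄειδε, θεά"))  # μηνιν αειδε, θεα
print(strip_diacritics("café"))              # cafe
```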

ajmc.commons.variables module