ajmc.commons package¶

ajmc.commons contains all the utilities (helpers, functions, objects, hard-coded variables) which are common (i.e. which must be accessible) to task-specific repositories. These include notably:

file_management utilities, which make it possible to handle files systematically within the ajmc data organisation (see notebooks/data_organisation.ipnb for more) and to retrieve information from the various project spreadsheets.
arithmetic.py contains helper maths functions, mainly to deal with intervals, which are a common object in our Canonical format (of which more below).
docstrings.py centralizes common function and class docstrings in a single place and provides a decorator to retrieve them easily.
geometry.py provides helper functions and an object, Shape, to deal with geometrical objects such as contours and bounding boxes.
image.py provides helper functions and an object, AjmcImage, to deal with images.
miscellaneous.py receives everything which doesn’t fit anywhere else. It notably contains generic functions and decorators, lazy objects for efficiency, etc.
variables contains all the hard-coded variables such as PATHS, COLORS, SPREADSHEET_IDS, CHARSETS and many more.

Submodules¶

ajmc.commons.arithmetic module¶

This module contains basic arithmetic functions.

ajmc.commons.arithmetic.are_intervals_within_intervals(contained: List[Tuple[int, int]], container: List[Tuple[int, int]]) → bool[source]¶: Applies is_interval_within_interval to a list of intervals, checking that every contained interval lies within at least one of the container intervals.

ajmc.commons.arithmetic.compute_interval_overlap(i1: Tuple[int, int], i2: Tuple[int, int])[source]¶

Computes the overlap between two intervals, each defined by its start and stop, both included.

Parameters:

i1 – A Tuple[int, int] defining the included boundaries of an interval, with start <= stop.
i2 – A Tuple[int, int] defining the included boundaries of an interval, with start <= stop.

Returns:

The length of the overlap.

Return type:

int
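
Because both boundaries are included, two intervals that share a single position overlap by 1 rather than 0. The following is a minimal sketch of that convention, not the package's implementation; the name interval_overlap is hypothetical.

```python
from typing import Tuple


def interval_overlap(i1: Tuple[int, int], i2: Tuple[int, int]) -> int:
    """Length of the overlap between two intervals with inclusive boundaries."""
    start = max(i1[0], i2[0])
    stop = min(i1[1], i2[1])
    # The +1 accounts for the inclusive stop; disjoint intervals clamp to 0.
    return max(0, stop - start + 1)


print(interval_overlap((1, 5), (3, 8)))  # 3: positions 3, 4 and 5 are shared
print(interval_overlap((3, 5), (5, 8)))  # 1: only position 5 is shared
print(interval_overlap((1, 2), (4, 6)))  # 0: the intervals are disjoint
```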

ajmc.commons.arithmetic.is_interval_within_interval(contained: Tuple[int, int], container: Tuple[int, int]) → bool[source]¶

Checks if the contained interval is included in the container interval.

Parameters:

container – A Tuple[int, int] defining the included boundaries of an interval, with start <= stop.
contained – A Tuple[int, int] defining the included boundaries of an interval, with start <= stop.
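
The containment check, and its extension to lists of intervals, can be sketched as follows. This is an illustration assuming inclusive boundaries as described above; the names within and all_within are hypothetical and the package's own code may differ.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def within(contained: Interval, container: Interval) -> bool:
    """True if `contained` lies entirely inside `container` (inclusive bounds)."""
    return container[0] <= contained[0] and contained[1] <= container[1]


def all_within(contained: List[Interval], containers: List[Interval]) -> bool:
    """True if every contained interval fits inside at least one container."""
    return all(any(within(c, box) for box in containers) for c in contained)


print(within((2, 4), (1, 5)))                          # True
print(within((2, 6), (1, 5)))                          # False
print(all_within([(2, 4), (7, 8)], [(1, 5), (6, 9)]))  # True
print(all_within([(2, 4), (5, 7)], [(1, 5), (6, 9)]))  # False: (5, 7) straddles both
```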

ajmc.commons.arithmetic.safe_divide(dividend, divisor)[source]¶: Simple division which returns np.nan if divisor equals zero.
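
A minimal sketch of the behaviour described above. To stay dependency-free it returns float("nan"), which is the same float value that np.nan holds; the name safe_divide_sketch is hypothetical.

```python
import math


def safe_divide_sketch(dividend, divisor):
    """Divide, returning NaN instead of raising when the divisor is zero."""
    if divisor == 0:
        return float("nan")
    return dividend / divisor


print(safe_divide_sketch(6, 3))              # 2.0
print(math.isnan(safe_divide_sketch(1, 0)))  # True
```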

ajmc.commons.docstrings module¶

This file contains generic docstring chunks to be formatted using docstring_formatter.

ajmc.commons.docstrings.docstring_formatter(**kwargs)[source]¶

Decorator with arguments used to format the docstring of a function.

docstring_formatter is a decorator with arguments, which means that it takes any set of kwargs as arguments and returns a decorator. It should therefore always be called with parentheses (unlike traditional decorators, see below). It follows the grammar of str.format(), i.e. the {my_format_value} syntax.

Example

This code snippet:

@docstring_formatter(greeting = 'hello')
def my_func():
    "A simple greeter that says {greeting}"
    # Do your stuff

is actually equivalent to:

def my_func():
    "A simple greeter that says {greeting}"
    # Do your stuff

my_func.__doc__ = my_func.__doc__.format(greeting = 'hello')

Note

Best practice is to name your arguments in compliance with docstrings.docstrings so that you can simply call @docstring_formatter(**docstrings.docstrings).
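
For readers unfamiliar with decorators that take arguments, here is a minimal sketch of how such a formatter could be written. It is an illustration, not the package's source: docstring_formatter_sketch and shared_chunks are hypothetical names, with shared_chunks standing in for docstrings.docstrings.

```python
def docstring_formatter_sketch(**kwargs):
    """Return a decorator that formats the decorated function's docstring."""

    def decorator(func):
        # Functions without a docstring are left untouched.
        if func.__doc__:
            func.__doc__ = func.__doc__.format(**kwargs)
        return func

    return decorator


# Hypothetical shared chunks; unused keys are simply ignored by str.format().
shared_chunks = {"greeting": "hello", "farewell": "goodbye"}


@docstring_formatter_sketch(**shared_chunks)
def my_func():
    """A simple greeter that says {greeting}"""


print(my_func.__doc__)  # A simple greeter that says hello
```

Since str.format() ignores keyword arguments that the docstring does not use, the whole dictionary of shared chunks can be passed to every decorated function.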

ajmc.commons.file_management module¶

ajmc.commons.geometry module¶

ajmc.commons.image module¶

ajmc.commons.miscellaneous module¶

ajmc.commons.unicode_utils module¶

This file contains Unicode variables and functions used to process Unicode characters.

ajmc.commons.unicode_utils.chunk_string_by_charsets(string: str, fallback: str = 'latin')[source]¶

Chunk a string by character set, returning a list of tuples of the form (chunk, charset).

Example

>>> chunk_string_by_charsets('Hello Γειά σου Κόσμε World')
[('Hello ', 'latin'), ('Γειά σου Κόσμε ', 'greek'), ('World', 'latin')]

Parameters: string (str) – The string to chunk.
Returns: A list of tuples of the form (chunk, charset).
Return type: list
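
The chunking shown in the example can be approximated as follows. This sketch rests on two assumptions that the documentation does not confirm: whitespace stays attached to the chunk that precedes it, and a character is Greek if it falls in the Greek or Greek Extended Unicode blocks. The names char_charset and chunk_by_charset are hypothetical.

```python
import unicodedata
from typing import List, Tuple


def char_charset(char: str, fallback: str = "latin") -> str:
    """Very rough classifier: Greek blocks versus everything else."""
    if "\u0370" <= char <= "\u03ff" or "\u1f00" <= char <= "\u1fff":
        return "greek"
    return fallback


def chunk_by_charset(string: str, fallback: str = "latin") -> List[Tuple[str, str]]:
    chunks: List[Tuple[str, str]] = []
    for char in unicodedata.normalize("NFC", string):
        charset = char_charset(char, fallback)
        # Assumption: whitespace never opens a new chunk once one exists.
        if chunks and (char.isspace() or charset == chunks[-1][1]):
            chunks[-1] = (chunks[-1][0] + char, chunks[-1][1])
        else:
            chunks.append((char, charset))
    return chunks


for chunk, charset in chunk_by_charset("Hello Γειά σου Κόσμε World"):
    print(repr(chunk), charset)
```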

ajmc.commons.unicode_utils.count_chars_by_charset(string: str, charset: str) → int[source]¶

Counts the characters in string that belong to a given Unicode character set.

Example

count_chars_by_charset('γεια σας, world', 'greek') returns 7 as there are 7 greek chars in string.

Parameters:

string – an NFC-normalized string (default). For NFD-normalized strings, use count_chars_by_charset_nfd.
charset – should be 'greek', 'latin', 'numeral', 'punctuation'.

Returns:

the number of charset-matching characters in string.

Return type:

int
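
A regex-based count along these lines could look like the sketch below. The patterns are rough, hypothetical stand-ins; the package defines its own charsets, and count_chars is not its actual function.

```python
import re

# Hypothetical, simplified patterns for illustration only.
CHARSET_PATTERNS = {
    "greek": re.compile(r"[\u0370-\u03ff\u1f00-\u1fff]"),
    "latin": re.compile(r"[A-Za-z]"),
    "numeral": re.compile(r"[0-9]"),
    "punctuation": re.compile(r"[.,;:!?()-]"),
}


def count_chars(string: str, charset: str) -> int:
    """Number of characters in `string` matched by the pattern for `charset`."""
    return len(CHARSET_PATTERNS[charset].findall(string))


print(count_chars("γεια σας, world", "greek"))        # 7
print(count_chars("γεια σας, world", "latin"))        # 5
print(count_chars("γεια σας, world", "punctuation"))  # 1
```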

ajmc.commons.unicode_utils.count_chars_by_charset_nfd(string: str, charset: str) → int[source]¶

Counts the characters in an NFD-normalized string that belong to a given Unicode character set.

Example

count_chars_by_charset_nfd('γεια σας, world', 'greek') returns 7 as there are 7 greek chars in string.

Parameters:

string – an NFD-normalized string. For NFC-normalized strings, use count_chars_by_charset.
charset – should be 'greek', 'latin', 'numeral', 'punctuation'.

Returns:

the number of charset-matching characters in string.

Return type:

int
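
The reason separate NFC and NFD variants matter is that the two normalization forms represent accented letters with a different number of code points. The short illustration below uses only the standard library; count_base_chars is a hypothetical helper, not part of the package.

```python
import unicodedata

word = "\u03b8\u03b5\u03ac"  # "θεά": in NFC the accented alpha is one code point
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

print(len(nfc))  # 3
print(len(nfd))  # 4: the accent becomes a separate combining mark (U+0301)


def count_base_chars(string: str) -> int:
    """Count characters while skipping combining marks."""
    return sum(1 for char in string if not unicodedata.combining(char))


print(count_base_chars(nfc), count_base_chars(nfd))  # 3 3
```

A naive per-code-point count on an NFD string would therefore overcount or miss characters unless combining marks are handled explicitly.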

ajmc.commons.unicode_utils.get_all_chars_from_range(start: str, end: str) → str[source]¶

Get all characters from a range of unicode characters.

Parameters:

start (str) – The first character in the range.
end (str) – The last character in the range.

Returns:

A string containing all characters in the range.

Return type:

str

ajmc.commons.unicode_utils.get_all_chars_from_ranges(ranges: List[Tuple[str, str]]) → str[source]¶

Get all characters from a list of ranges of unicode characters.

Parameters: ranges (list) – A list of tuples of unicode characters ranges.
Returns: A string containing all characters in the ranges.
Return type: str
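
Both range helpers can be sketched with ord() and chr(). The sketch assumes that the end character is included, as the parameter description suggests; chars_from_range and chars_from_ranges are hypothetical names.

```python
from typing import List, Tuple


def chars_from_range(start: str, end: str) -> str:
    """All characters from `start` to `end`, both included."""
    return "".join(chr(code) for code in range(ord(start), ord(end) + 1))


def chars_from_ranges(ranges: List[Tuple[str, str]]) -> str:
    """Concatenate the characters of several (start, end) ranges."""
    return "".join(chars_from_range(start, end) for start, end in ranges)


print(chars_from_range("a", "e"))                   # abcde
print(chars_from_ranges([("a", "c"), ("0", "3")]))  # abc0123
```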

ajmc.commons.unicode_utils.get_char_charset(char: str, fallback: str = 'fallback') → str[source]¶: Returns the charset of a character, or fallback if the character matches no charset.

ajmc.commons.unicode_utils.get_char_unicode_name(char: str) → str[source]¶: Returns the unicode name of a character.
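
Python's standard library exposes character names through unicodedata.name. The wrapper below is a sketch of how a name lookup could be built on it; whether the package does so is an assumption, and char_name is a hypothetical name.

```python
import unicodedata


def char_name(char: str, default: str = "") -> str:
    """Unicode name of `char`, or `default` for characters without a name."""
    return unicodedata.name(char, default)


print(char_name("\u03b1"))         # GREEK SMALL LETTER ALPHA
print(char_name("\n", "UNNAMED"))  # UNNAMED: control characters have no name
```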

ajmc.commons.unicode_utils.get_string_charset(string: str, fallback: str = 'latin') → str[source]¶: Returns the charset of a string, or fallback if the string matches no charset.

ajmc.commons.unicode_utils.harmonise_ligatures(text: str) → str[source]¶

ajmc.commons.unicode_utils.harmonise_miscellaneous_symbols(text: str) → str[source]¶

ajmc.commons.unicode_utils.harmonise_non_printable(text: str) → str[source]¶

ajmc.commons.unicode_utils.harmonise_punctuation(text: str) → str[source]¶

ajmc.commons.unicode_utils.harmonise_spaces(text: str) → str[source]¶

ajmc.commons.unicode_utils.harmonise_unicode(text: str, harmonise_functions: Tuple[Callable[[str], str]] = (harmonise_punctuation, harmonise_miscellaneous_symbols, harmonise_ligatures)) → str[source]¶

Harmonise unicode characters.

Note

This function takes an NFC string and returns an NFC string.

Parameters:

text (str) – The text to harmonise.
harmonise_functions (tuple) – A tuple of functions to apply to the text. Each function should take an NFC string as input and return an NFC string as output.
harmonise_space_chars (bool) – Whether to harmonise space characters.

Returns:

The harmonised text (an NFC string).

Return type:

str
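
The harmonise_functions parameter describes a simple pipeline: each function receives the output of the previous one. A minimal sketch of that pattern follows; the two replacement functions are hypothetical toy examples and do not reproduce the package's harmonisation rules.

```python
from typing import Callable, Tuple


def replace_curly_quotes(text: str) -> str:
    """Toy rule: map curly double quotes to straight ones."""
    return text.replace("\u201c", '"').replace("\u201d", '"')


def replace_fi_ligature(text: str) -> str:
    """Toy rule: expand the 'fi' ligature into two letters."""
    return text.replace("\ufb01", "fi")


def harmonise(
    text: str,
    functions: Tuple[Callable[[str], str], ...] = (replace_curly_quotes, replace_fi_ligature),
) -> str:
    # Apply each function in order, feeding it the previous result.
    for function in functions:
        text = function(text)
    return text


print(harmonise("\u201cde\ufb01nition\u201d"))  # "definition"
```

Passing a different tuple restricts or extends the pipeline, for instance harmonise(text, functions=(replace_fi_ligature,)).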

ajmc.commons.unicode_utils.is_charset_string(string: str, charset: str, threshold: float = 0.5, strict: bool = True) → bool[source]¶

Returns True if the proportion of chars in string that are in charset exceeds threshold, False otherwise.

Parameters:

string – the string to test (NFC-normalized; for NFD-normalized strings, use is_charset_string_nfd).
charset – should be 'greek', 'latin', 'numeral', 'punctuation' or a valid re-pattern, for instance r'([ô-ÿ])'
threshold – the threshold above which the function returns True
strict – if True, only chars in charset are considered, if False, chars in charset, 'numeral' and 'punctuation' are considered.
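
The interplay of threshold and strict can be illustrated with the sketch below. It assumes that the proportion is computed over all characters of the string, spaces included, which the documentation does not state; the patterns and the name is_mostly_greek are hypothetical.

```python
import re

# Hypothetical, simplified patterns for illustration only.
GREEK = re.compile(r"[\u0370-\u03ff\u1f00-\u1fff]")
NEUTRAL = re.compile(r"[0-9.,;:!?()-]")  # numerals and punctuation


def is_mostly_greek(string: str, threshold: float = 0.5, strict: bool = True) -> bool:
    if not string:
        return False
    matches = len(GREEK.findall(string))
    if not strict:
        # In non-strict mode, numerals and punctuation also count as matches.
        matches += len(NEUTRAL.findall(string))
    return matches / len(string) > threshold


print(is_mostly_greek("γεια σας, world"))                # False: 7 of 15 chars
print(is_mostly_greek("γεια σας, world", strict=False))  # True: 8 of 15 chars
```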

ajmc.commons.unicode_utils.is_charset_string_nfd(string: str, charset: str, threshold: float = 0.5, strict: bool = True) → bool[source]¶

Returns True if the proportion of chars in string that are in charset exceeds threshold, False otherwise.

Parameters:

string – an NFD-normalized string. For NFC-normalized strings, use is_charset_string.
charset – should be 'greek', 'latin', 'numeral', 'punctuation'.
threshold – the threshold above which the function returns True
strict – if True, only chars in charset are considered, if False, chars in charset, 'numeral' and 'punctuation' are considered.

ajmc.commons.unicode_utils.remove_diacritics(s: str) → str[source]¶

Removes diacritical marks via NFKD normalization and recombination. Useful for building search indexes (and searching against them).

Example:

>>> remove_diacritics("μῆνιν ἄειδε, θεά")
'μηνιν αειδε, θεα'
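
The decompose, filter and recombine approach named above can be sketched with the standard library. This is an illustration of the technique, not the package's source; strip_diacritics is a hypothetical name.

```python
import unicodedata


def strip_diacritics(s: str) -> str:
    """Decompose with NFKD, drop combining marks, then recompose with NFC."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("caf\u00e9"))  # cafe
print(strip_diacritics("\u1fc6"))     # η (eta without its circumflex)
```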
