Understanding the goal of the library

The functioning of a tool is usually easier to grasp when its utility and purpose are made plain. I’ll therefore start with a brief overview of our project goals. As explained in the README.md and on our webpage, our main goal is to create a dynamic interface to query, display and compare classical commentaries. To this end, we rely on a single source of data: scanned images of commentaries. To go all the way from images to information extraction, we have to complete several steps:

  1. Image processing aims at preparing and enhancing our images.

  2. OCR then converts the images to machine-readable text.

  3. OLR performs layout analysis to extract the regions we are interested in.

  4. Alignment maps single comments to the snippet of text they are actually commenting on.

  5. NLP extracts meaningful information from these comments.

From goal to code

How do these steps translate into code? You may have seen that each of the aforementioned steps has a directory in ajmc/. These step-oriented directories are called task-specific directories. That said, you may also have noticed two exceptions:

  1. First, as we could generally get good-quality scans, we take image processing for granted. There are only a few helpers for binarization, which can be found in commons/image.py.

  2. Secondly, there are two directories in ajmc/ which do not correspond directly to any of the steps mentioned above: ajmc/commons and ajmc/text_processing.

It would be too long to go through all the task-specific directories. For a general understanding of the code, it is better to start with commons and text_processing, as these two are heavily used by every task.

The commons directory

TODO include commons.__init__ in the docs

The text_processing directory

The meaning of “text processing” in ajmc/text_processing is not to be confused with (natural) language processing in ajmc/nlp. As a matter of fact, text_processing doesn’t deal with NLP at all. It allows for the manipulation of source texts (generally OCR outputs) and exports them to different formats for later use. Its most important function is to unify the diversity of OCR output formats into a single canonical json. For a more detailed view of our text pipeline, please check /notebooks/commentary_importation_pipeline.ipynb.

There are therefore three types of objects in text_processing:

  1. Ocr objects, which can be found in text_processing/ocr_classes/ and deal with OCR output files.

  2. Canonical objects, which can unsurprisingly be found in text_processing/canonical_classes and deal with the canonical-format jsons.

  3. Generic objects, which can be found in text_processing/generic_classes. They are abstract objects which define the methods and properties used by both canonical and ocr objects.
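Assuming the module paths simply mirror the directory names above (a sketch, not guaranteed import paths), the three families translate into imports like these:

```python
# Module paths are assumptions based on the directory names listed above;
# adjust them if the actual package layout differs.
from ajmc.text_processing.generic_classes import TextContainer          # abstract base shared by all objects
from ajmc.text_processing.ocr_classes import RawCommentary              # built from OCR output files
from ajmc.text_processing.canonical_classes import CanonicalCommentary  # built from canonical jsons
```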

All these objects have some common attributes, but also particularities due to their nature and to the level of text container they represent. Indeed, both Ocr- and Canonical- objects can represent various levels of text containers: commentaries, pages, page regions, lines and words. We hence end up with the following classes:

All these objects inherit from a common ancestor, TextContainer. The detailed inheritance scheme can be seen below:

| Level | Generic | Canonical | Ocr |
|-------|---------|-----------|-----|
| Text container (abstract) | TextContainer | CanonicalTextContainer | RawTextContainer |
| Commentary | Commentary | CanonicalCommentary | RawCommentary |
| Page | Page | CanonicalPage | RawPage |
| Region | | CanonicalRegion | OcrRegion |
| Line | | CanonicalLine | RawLine |
| Word | | CanonicalWord | RawWord |

Main classes inheritance

Note. The choice of going for multiple objects is motivated by the fact that each object behaves in a slightly different way depending on its source and level. Though this structure may seem complicated, it is the result of several explorations (single classes with type- or level-conditioned functions, no inheritance…) and appeared to be the best balance between conflicting design principles such as systematicity, simplicity, efficiency and maintainability.

Both canonical and ocr objects have specific attributes whose nitty-gritty we will not go into here. Let us now review the most common and important attributes and methods.

Each object must implement a method to call its direct parents and children. Note that “parent” and “children” do not refer to the inheritance of Python objects here. Rather, they refer to the inclusion of textual objects within each other. Hence, when I say that a Line counts among a Page’s “children”, I do not mean that Line-level classes are child-classes of Page-level classes. Similarly, a Region-level object must be able to call its parent Page and its children lines. The grammar for this is TextContainer.parents.level or TextContainer.children.level[+s]. Notice the singular for parents (e.g. self.parents.page) and the plural for children (e.g. self.children.pages), reflecting the fact that RawLine.parents.page returns a single RawPage, whereas RawCommentary.children.pages returns a list of RawPages. Under the hood, children and parents call self._get_children(children_type) and self._get_parents(parent_type) respectively. Every TextContainer must implement these methods.
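To make this grammar concrete, here is a minimal sketch of the navigation, assuming a commentary object (whatever its concrete type) has already been instantiated:

```python
# Downwards: `children` is accessed with a plural level name and returns a
# list of containers.
pages = commentary.children.pages          # all pages of the commentary
lines = pages[0].children.lines            # all lines of the first page
first_word = lines[0].children.words[0]    # first word of the first line

# Upwards: `parents` is accessed with a singular level name and returns a
# single container.
same_line = first_word.parents.line        # the line containing the word
same_page = same_line.parents.page         # the page containing that line
```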

This system allows for a very fluid navigation between the elements of commentaries, and turns complex queries (e.g. getting all the potentially abbreviated words in all the footnotes of a commentary) into relatively simple one-liners (in our case, [w for r in commentary.children.regions if r.region_type == "footnote" for w in r.children.words if w.text.endswith('.')]).
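Spelled out over several lines, that one-liner reads as follows:

```python
# All potentially abbreviated words in all the footnotes of a commentary,
# i.e. the one-liner above spread out for readability.
abbreviation_candidates = [
    w
    for r in commentary.children.regions   # every region of the commentary
    if r.region_type == "footnote"         # keep only footnote regions
    for w in r.children.words              # every word in those footnotes
    if w.text.endswith('.')                # words ending with a full stop
]
```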

Page-level objects have a direct link to their image, stored as an ajmc.commons.image.AjmcImage and listed in Commentary.images together with all the other page images. As for regions, lines and words, their image corresponds to the page image cropped to their bounding box. Commentary is a special case, as it contains multiple images. It therefore comes with a self.images attribute, which is equivalent to [page.image for page in self.children.pages].

Each TextContainer below Commentary-level has a single self.bbox attribute representing its bounding box, i.e. the minimal rectangle around the object’s words. From a Python perspective, self.bbox is an ajmc.commons.geometry.Shape.

Finally, every TextContainer has a self.text attribute, which results from the concatenation of the text contained by its words.
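Putting these attributes together, and again assuming an already-instantiated commentary, a typical access pattern looks like this:

```python
page = commentary.children.pages[0]

page.image           # an ajmc.commons.image.AjmcImage: the scan of this page
commentary.images    # equivalent to [p.image for p in commentary.children.pages]

region = page.children.regions[0]
region.bbox          # an ajmc.commons.geometry.Shape: the minimal rectangle around its words
region.image         # the page image cropped to region.bbox
region.text          # the concatenated text of the region's words
```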

All these attributes can add up to relatively heavy computations. To keep object instantiation light and swift, lazy_objects.lazy_property is the default way to create an object’s attributes. lazy_property is a decorator which allows for computing and storing properties only when they are called. When first created, objects should require no computation at all. This explains the inheritance-based __init__ system: each text container should expect only required arguments, and (almost) nothing happens at __init__ time. Notice that lazy properties can be fixed at __init__ using the **kwargs system. For instance, if you want a lazy property self.foo to be set at __init__ time, you can do so by passing foo="bar" to __init__. This will prevent the lazy property from being computed when it is accessed for the first time.
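To illustrate the behaviour, here is a minimal stand-in for lazy_property together with the **kwargs pattern. This is not ajmc’s actual implementation; the class DummyContainer and its text property are invented for the example:

```python
import functools

def lazy_property(method):
    """Minimal stand-in: compute on first access, then cache on the instance."""
    attr = '_' + method.__name__

    @property
    @functools.wraps(method)
    def wrapper(self):
        if not hasattr(self, attr):
            setattr(self, attr, method(self))  # compute once, store the result
        return getattr(self, attr)

    return wrapper


class DummyContainer:
    """Invented for illustration, not an actual ajmc class."""

    def __init__(self, **kwargs):
        # The **kwargs system: a pre-set lazy property never has to be computed.
        for name, value in kwargs.items():
            setattr(self, '_' + name, value)

    @lazy_property
    def text(self):
        print('computing text...')  # only printed on the first real access
        return 'an expensive concatenation of words'


fast = DummyContainer(text='already known')
print(fast.text)   # 'already known', no computation happens
slow = DummyContainer()
print(slow.text)   # prints 'computing text...' once, then the result is cached
```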