Commentary importation pipeline

This notebook goes through all the steps involved in the creation of CanonicalCommentarys from OCR outputs.

We will therefore:

  1. See how to import a RawCommentary from OCR outputs.

  2. See how to optimise this commentary and transform it into a CanonicalCommentary.

  3. See how to export it to a canonical json format for later use.

Creating a RawCommentary

RawCommentarys need access to (at least) three kinds of information:

  • OCR output files, which represent single pages and serve as a basis to create RawPage objects.

  • The corresponding images (after which the former are named).

  • A via-project.json containing information about the layout.

Using the data provided in ajmc/data/sample_commentaries, we can create our first RawCommentary.

[1]:
from ajmc.text_processing.raw_classes import RawCommentary
from ajmc.commons import variables as vs

comm_id = 'sophoclesplaysa05campgoog'
ocr_run_id = '3467O2_tess_retrained'

ocr_commentary = RawCommentary(id=comm_id,
                               ocr_dir=vs.get_comm_ocr_outputs_dir(comm_id, ocr_run_id),
                               via_path=vs.get_comm_via_path(comm_id),
                               image_dir=vs.get_comm_img_dir(comm_id))

Providing all these paths can be cumbersome. ajmc therefore has a systematic directory structure (see ajmc/notebooks/data_organisation.ipynb), which allows us to create a commentary directly from its OCR outputs directory, provided it complies with the project’s folder structure. As ../data/sample_commentaries is ajmc-compliant, we can simply:

[3]:
# Note that our path conforms to the structure pattern: '/abspath/to/root_dir/[comm_id]/ocr/runs/[ocr_run_id]/outputs'
ocr_commentary = RawCommentary.from_ajmc_data(id=comm_id)

Creating a RawCommentary takes care of creating its pages, lines, regions and words. However, it is also possible to instantiate any of these objects directly:

[5]:
from ajmc.text_processing.raw_classes import RawPage


page = RawPage(ocr_path=(vs.get_comm_ocr_outputs_dir(comm_id, ocr_run_id) / 'sophoclesplaysa05campgoog_0148.hocr'),
               image_path=vs.get_comm_img_dir(comm_id) / 'sophoclesplaysa05campgoog_0148.png',
               commentary=ocr_commentary)

Note: It is not necessary to provide all the arguments shown here. For instance, if you omit commentary=..., the object will still be functional, but you won’t be able to retrieve commentary-level information, such as the via_project.
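
For instance, here is a minimal sketch of such a standalone page, reusing the paths from above but dropping the commentary argument:

[ ]:
# A page created without a commentary still works on its own, but has no access
# to commentary-level information such as the via_project.
standalone_page = RawPage(ocr_path=vs.get_comm_ocr_outputs_dir(comm_id, ocr_run_id) / 'sophoclesplaysa05campgoog_0148.hocr',
                          image_path=vs.get_comm_img_dir(comm_id) / 'sophoclesplaysa05campgoog_0148.png')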

Why should one bother creating CanonicalCommentarys?

  • TL;DR: Skip to the next section if you don’t care about the details.

The vagaries of OCR

You may ask yourself: what is actually the problem with RawCommentarys? Why should we care about enhancing a RawCommentary in the first place? Well, the problem is not really the object itself but the many inconsistencies and the noise of the OCR outputs it relies on. To cite a few:

  1. Empty words or non-words

  2. Crummy, elongated, stretched or shrunken word bounding boxes, or even inverted bounding boxes with negative width and height

  3. Labyrinthine reading order (very often due to marginal line numbers)

  4. Single lines spanning multiple columns, multiple lines or side numbers

  5. Diacritics recognized as separate lines

  6. Crummy, elongated, stretched or shrunken line bounding boxes

  7. …
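
To make problems 1 and 2 concrete, here is an illustrative sanity check (a sketch, not ajmc’s actual code), assuming a word is a dict with a text and a (x_min, y_min, x_max, y_max) box:

[ ]:
def is_suspicious(word):
    """Flag empty words (problem 1) and inverted boxes (problem 2)."""
    x0, y0, x1, y1 = word['box']
    empty = not word['text'].strip()
    inverted = (x1 < x0) or (y1 < y0)  # negative width or height
    return empty or inverted


words = [{'text': 'Ajax', 'box': (10, 10, 50, 30)},
         {'text': '', 'box': (55, 10, 80, 30)},
         {'text': 'κείνου', 'box': (90, 30, 60, 10)}]
print([w['text'] for w in words if is_suspicious(w)])  # ['', 'κείνου']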

The weakness of xml formats

To this already long though not exhaustive list of pitfalls, one should add two further caveats:

  • OCR outputs come in different formats (Kraken- or Tesseract-style hocr, alto, xml…).

  • Though very different, each stemming from its own ambition to create a harmonized, overarching standard, these formats all share the same weakness: the nested architecture of xml-like documents. This property alone makes xml-like formats inadequate for our purposes.

Let us look at a simple example. Say we have the following page:

<xml_page attribute_1="..." attribute_2="...">
    <xml_line attribute_1="..." attribute_2="...">
        <xml_word attribute_1="..." attribute_2="...">Foo</xml_word>
        <xml_word attribute_1="..." attribute_2="...">Bar</xml_word>
    </xml_line>
    <xml_line attribute_1="..." attribute_2="...">
        <xml_word attribute_1="..." attribute_2="...">Zig</xml_word>
        <xml_word attribute_1="..." attribute_2="...">Zag</xml_word>
    </xml_line>
</xml_page>

In xml_page we have two xml_line elements, which themselves contain two xml_word elements each. This may already be complicated to navigate, but the most vicious issue is still to come. It appears when you try to overlap different layers of text containers. Say you have a region spanning only the first n words of a line. Should your region be a child of the line? This makes no sense from a global perspective: regions (such as paragraphs) are higher in the hierarchy and should be parents to lines. One could be tempted to create a dedicated line for the region, but then another problem arises: when calling all the lines of a page, should one call the lines from the regions or the lines directly, now that they differ?

The same problem appears with entities (e.g. named entities) that span multiple pages. Say we have an entity starting with the two last words of the last line of page n and ending with the first word of the main text of page n+1. Retrieving the words of such an entity demands extreme precision and absurdly complex chunks of code. In pseudo-python, you would end up with something like my_entity.words = pages[n].children.lines[-1].children.words[-2:] + pages[n+1].children.lines[0].children.words[0]. And this is still a simple case. What if there is a footnote on page n that you don’t want to include? What if the first line of page n+1 is actually the page number and not the main text? I let you imagine the kind of recondite code you end up with (my_entity.words = pages[n].children.find_all("regions", type="main_text")[-1].children.lines[-1].words[-2:] + pages[n+1]...). Not to mention that this code is not dynamic: a simple change in page numbering, word alignment or region reading order completely ruins the pipeline.

The advantages of the canonical format

To tackle these issues, we propose a fairly simple canonical format to store our data. The philosophy of its implementation is to go from a hierarchical, nested structure to a fully horizontal one. Instead of having nested and re-nested text containers, we collect a global list of words and map every other text container to a word range. Here is an example:

{
  "words" : [{"text":"Foo", "attribute_1":"...", "attribute_2":"..."},
             {"text":"Bar", "attribute_1": "...", "attribute_2":"..."},
             {"text":"Zig", "attribute_1": "...", "attribute_2":"..."},
             {"text":"Zag", "attribute_1": "...", "attribute_2":"..."}],
  "pages": [{"word_range": [0,3]}],
  "lines": [{"word_range": [0,1]},
            {"word_range": [2,3]}]
}
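
To make this horizontal access concrete, here is a minimal sketch operating on the toy page above (word ranges are inclusive; the 'entities' layer is an illustrative addition showing an overlapping container):

[ ]:
# The toy page above, as it would come out of json.loads(); an illustrative
# 'entities' layer shows a container overlapping two lines.
canonical = {
    'words': [{'text': 'Foo'}, {'text': 'Bar'}, {'text': 'Zig'}, {'text': 'Zag'}],
    'pages': [{'word_range': [0, 3]}],
    'lines': [{'word_range': [0, 1]}, {'word_range': [2, 3]}],
    'entities': [{'word_range': [1, 2]}],  # spans the end of line 0 and the start of line 1
}


def get_words(container):
    start, end = container['word_range']  # ranges are inclusive, hence the +1
    return [w['text'] for w in canonical['words'][start:end + 1]]


print(get_words(canonical['lines'][1]))     # ['Zig', 'Zag']
print(get_words(canonical['entities'][0]))  # ['Bar', 'Zig'] -- no tree walking needed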

This format comes with a lot of advantages:

  1. It’s a json, not an xml. It’s easily readable both by humans and machines. You can import it as a python dict in two lines of code, with no need for the more complex bs4 or etree objects, which would clearly be overkill for our purposes and offer nothing that jsons or dicts can’t do.

  2. It solves the nesting and overlapping problems at once. You can have overlapping, nested and re-nested text containers. The important thing is that they can all be accessed horizontally, simply by finding the other text containers with included or overlapping word ranges. The same goes for getting a text container’s words: simply call my_tc.words = words[*my_tc.word_range].

  3. It renders the redundant information of xmls unnecessary: to get a line’s bounding box, you simply merge its words’ bounding boxes. This allows an entire 400-page commentary to be stored in a ~35MB file, as opposed to ~85MB of OCR outputs (with no information loss and no optimisation), which transitions well to the next point.

  4. It is computationally efficient. See a simple example here:

[7]:
import time
import re
from ajmc.text_processing.raw_classes import RawCommentary
from ajmc.text_processing.canonical_classes import CanonicalCommentary



def time_commentary_operations(commentary_type):
    print(f'Measuring {commentary_type} importation time and manipulation time')

    json_path = vs.COMMS_DATA_DIR / 'sophoclesplaysa05campgoog/canonical/3467O2_tess_retrained.json'

    start_time = time.time()
    if commentary_type == "RawCommentary":
        commentary = RawCommentary.from_ajmc_data(comm_id)
    else:
        commentary = CanonicalCommentary.from_json(json_path)

    commentary.children.words  # accessing the words triggers their full retrieval
    print("    Time required by importation and word retrieval: {:.2f}s".format(time.time() - start_time))

    start_time = time.time()
    # Keep the lines whose text contains at least one digit.
    [l.text for l in commentary.children.lines if re.search(r'[0-9]', l.text)]
    print("    Time required to retrieve the text lines containing digits: {:.2f}s\n".format(time.time() - start_time))


time_commentary_operations('RawCommentary')
time_commentary_operations('CanonicalCommentary')
Measuring RawCommentary importation time and manipulation time
    Time required by importation and word retrieval: 9.01s
    Time required to retrieve the text lines containing digits: 2.03s

Measuring CanonicalCommentary importation time and manipulation time
    Time required by importation and word retrieval: 3.94s
    Time required to retrieve the text lines containing digits: 0.23s

  5. It allows us to deal with multiple versions of the text easily, simply by creating new lists of words and mapping text containers to any list and any word range (recall how complicated such an implementation would be if it had to be performed in a nested architecture at line or word level). A minimal sketch follows below.
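
Here is an illustrative sketch of that idea (the field names are hypothetical, not ajmc’s actual schema): a single line can point into either a raw or a corrected word list.

[ ]:
# Two versions of the same words; a container simply names the version it maps to.
versions = {
    'raw':       [{'text': 'Fo0'}, {'text': 'Bar'}, {'text': 'Z1g'}, {'text': 'Zag'}],
    'corrected': [{'text': 'Foo'}, {'text': 'Bar'}, {'text': 'Zig'}, {'text': 'Zag'}],
}
line = {'version': 'corrected', 'word_range': [2, 3]}

start, end = line['word_range']
print([w['text'] for w in versions[line['version']][start:end + 1]])  # ['Zig', 'Zag']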

Post-processing OCR outputs

Now, how does this solve the OCR-related issues mentioned above? They are dealt with in post-processing. RawCommentary.to_canonical() therefore launches two operations under the hood:

  1. Post-processing the OCR.

  2. Converting to a CanonicalCommentary.

Since we already covered the second step, let us briefly go through the first one. Post-processing the OCR aims at harmonizing text, bounding boxes and the relations between text containers. It therefore brings a solution to each of the problems listed above:

  • It deletes empty words and non-words.

  • It adjusts word bounding boxes to their content using contour detection (cv2.findContours).

  • It adjusts line and region boxes to the minimal box containing their words (for regions, an _inclusion_threshold is used, which, set to 0.8, proves to be quite robust).

  • It cuts lines according to regions, so that overlapping lines are chunked.

  • It removes empty lines.

  • It resets the reading order from the region level downwards (i.e. it orders regions, then the lines in each region, then the words in each line). The ordering algorithm is robust; use AjmcImage.draw_reading_order to inspect the result.

The box-harmonization step is sketched below.
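
As an illustration of that step, here is a minimal sketch (not ajmc’s actual code; only the 0.8 threshold comes from the description above) of recomputing a region box as the smallest box containing its words:

[ ]:
# Boxes are (x_min, y_min, x_max, y_max); the values are illustrative.
word_boxes = [(10, 10, 50, 30), (55, 12, 90, 29), (95, 11, 130, 31)]
region_box = (0, 0, 92, 40)


def minimal_containing_box(boxes):
    """Smallest box containing every given box (used for lines and regions)."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return min(x0s), min(y0s), max(x1s), max(y1s)


def inclusion_ratio(word, region):
    """Share of a word box's area lying inside a region box."""
    x0, y0 = max(word[0], region[0]), max(word[1], region[1])
    x1, y1 = min(word[2], region[2]), min(word[3], region[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = (word[2] - word[0]) * (word[3] - word[1])
    return inter / area if area else 0.0


# A word belongs to the region if at least 80% of its area lies inside it,
# mirroring the 0.8 _inclusion_threshold mentioned above.
included = [w for w in word_boxes if inclusion_ratio(w, region_box) >= 0.8]
print(minimal_containing_box(included))  # (10, 10, 90, 30)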

All these operations are performed at page level, using RawPage.optimise(), which is itself called internally by RawCommentary.to_canonical():

[ ]:
can_commentary = RawCommentary.from_ajmc_data(ocr_path=vs.get_comm_ocr_outputs_dir(comm_id, ocr_run_id)).to_canonical()

Exporting canonical commentaries to json

This last step is pretty straightforward:

[ ]:
can_commentary.to_json(output_path=None)

If output_path is not provided, the canonical json will automatically be exported to the location determined by ajmc’s directory structure guidelines (i.e. /root_dir/comm_id/canonical/v2/ocr_run_id.json). Under the hood, this calls each CanonicalTextContainer’s own to_json() method.
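
To write to an explicit location instead, simply pass a path (the one below is purely illustrative):

[ ]:
from pathlib import Path

can_commentary.to_json(output_path=Path('/tmp') / f'{comm_id}_{ocr_run_id}.json')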