Understanding the ajmc directory structure.

ajmc has a systematic directory structure, which makes it possible to handle path-related problems efficiently.

The root_dir

The root_dir is the root directory in which every commentary has its data stored, in a directory named after its id. All the data for Campbell should therefore be in root_dir/cu31924087948174, as Campbell’s id is cu31924087948174.

The default root_dir is usually read from variables.COMMS_DATA_DIR, but this can easily be customized.
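
The convention above can be sketched with the standard library alone. This is a minimal illustration, not ajmc's own code: the helper names commentary_dir and list_commentary_ids are hypothetical, and "/path/to/root_dir" is a placeholder rather than the real default.

```python
from pathlib import Path


def commentary_dir(root_dir, commentary_id):
    """Return the directory that holds all the data of one commentary."""
    return Path(root_dir) / commentary_id


def list_commentary_ids(root_dir):
    """Return the ids of all commentaries found directly under root_dir."""
    root = Path(root_dir)
    if not root.is_dir():
        return []
    # Each commentary lives in a subdirectory named after its id.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# "/path/to/root_dir" is a placeholder, not a real ajmc default.
print(commentary_dir("/path/to/root_dir", "cu31924087948174").as_posix())
```

Because the directory name is the commentary id, listing the subdirectories of root_dir is enough to enumerate the available commentaries.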

Commentary directory structure

Each commentary directory observes the following structure:

  • images/png contains the png images. It is the default place to look for images.

  • ocr contains all the OCR outputs and evaluations.

  • ocr/evaluation contains only the corrected pages selected for evaluation of the model and their corresponding:

  • olr contains via_project.json, which stores the OLR information.

All of these paths can be accessed through ajmc.commons.variables:

[2]:
from ajmc.commons import variables as vs

print(f"""The path to ocr runs from commentary root should be `{vs.COMM_OCR_RUNS_REL_DIR}`""")
print(f"""The path to the via_project from commentary root should be `{vs.COMM_VIA_REL_PATH}`""")
The path to ocr runs from commentary root should be `ocr/runs`
The path to the via_project from commentary root should be `olr/via_project.json`
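
Since these constants are relative to the commentary root, they have to be joined with it to obtain a usable path. The sketch below shows one way to do so; it hard-codes the two values printed above instead of importing ajmc, and both the resolve helper and the commentary root are hypothetical placeholders.

```python
from pathlib import PurePosixPath

# Values copied from the output above; in ajmc they would be read from
# vs.COMM_OCR_RUNS_REL_DIR and vs.COMM_VIA_REL_PATH.
OCR_RUNS_REL_DIR = "ocr/runs"
VIA_REL_PATH = "olr/via_project.json"


def resolve(comm_root, rel_path):
    """Join a commentary root with a path relative to it."""
    return PurePosixPath(comm_root) / rel_path


# Placeholder commentary root, for illustration only.
comm_root = "/path/to/root_dir/cu31924087948174"
print(resolve(comm_root, OCR_RUNS_REL_DIR))
print(resolve(comm_root, VIA_REL_PATH))
```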

Note that you can also get most of the paths for a specific commentary id using the various helpers in ajmc.commons.variables:

[4]:
commentary_id = 'sophoclesplaysa05campgoog'

print(f"""The absolute path to the images-dir of {commentary_id} is {vs.get_comm_img_dir(commentary_id)}""")
print(f"""The absolute path to the sections of {commentary_id} is {vs.get_comm_sections_path(commentary_id)}""")

The absolute path to the images-dir of sophoclesplaysa05campgoog is /Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/sophoclesplaysa05campgoog/images/png
The absolute path to the sections of sophoclesplaysa05campgoog is /Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/sophoclesplaysa05campgoog/sections.json

In the end, the overall structure looks like this:

cu31924087948174
     ├── canonical  # Canonical jsons
     │    ├── 1bm0b3_tess_final  # V1 canonical (for annotation/INCEpTION only)
     │    └── v2  # V2 canonical (for `CanonicalCommentary` importation)
     ├── images
     │    ├── pdf
     │    └── png
     ├── ner  # Named entity recognition data
     │    └── annotation
     │        └── xmi
     │            └── 1bm0b3_tess_final
     ├── ocr  # OCR data
     │    ├── annotation  # For Lace annotation
     │    │         └── lace_corrected
     │    ├── general_evaluation  # Evaluation of all `ocr/runs`
     │    ├── groundtruth  # OCR Groundtruth data
     │    │         ├── evaluation
     │    │         ├── images
     │    │         └── retraining
     │    └── runs  # OCR runs
     │        ├── 13p082_lace_base
     │        │         ├── evaluation
     │        │         ├── evaluation_fuzzy
     │        │         └── outputs
     │        ├── 13s0gR_ocrd_vanilla
     │        │         ├── evaluation
     │        │         ├── evaluation_fuzzy
     │        │         └── outputs
     │        ├── 1560he_ocrd_min
     │        │         ├── evaluation
     │        │         ├── evaluation_fuzzy
     │        │         └── outputs
     │        ├── 15o0a0_lace_base_cu31924087948174-...
     │        │         ├── evaluation
     │        │         └── outputs
     │        ├── 1bm0b3_tess_final
     │        │         └── outputs
     │        ├── 2480ei_greek-english_porson_sophoclesplaysa05campgoog
     │        │         ├── evaluation
     │        │         └── outputs
     │        ├── 28qmab_tess_base
     │        │         └── outputs
     │        ├── tess_eng_finetune-grc-pogretra
     │        │         ├── evaluation
     │        │         └── outputs
     │        └── tess_eng_grc
     │                 ├── evaluation
     │                 └── outputs
     └── olr
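
The tree above shows that not every run has the same subdirectories (some have no evaluation or evaluation_fuzzy). A quick way to inspect this for a given commentary is sketched below, assuming only the ocr/runs layout shown in the tree; list_ocr_runs and runs_without_evaluation are hypothetical helpers, not part of ajmc.

```python
from pathlib import Path


def list_ocr_runs(comm_dir):
    """Map each run id under ocr/runs to the names of its subdirectories."""
    runs_dir = Path(comm_dir) / "ocr" / "runs"
    if not runs_dir.is_dir():
        return {}
    return {
        run.name: sorted(sub.name for sub in run.iterdir() if sub.is_dir())
        for run in sorted(runs_dir.iterdir())
        if run.is_dir()
    }


def runs_without_evaluation(comm_dir):
    """Return the run ids that have no `evaluation` subdirectory."""
    return [
        run_id
        for run_id, subdirs in list_ocr_runs(comm_dir).items()
        if "evaluation" not in subdirs
    ]
```

Applied to the tree above, runs_without_evaluation would be expected to report runs such as 1bm0b3_tess_final and 28qmab_tess_base, which only contain outputs.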

OCR run ids

As you may have noticed, the id of a single OCR run (i.e. a single OCRing of a commentary with a given system and parameters) is used throughout the entire repository to identify which OCR run a piece of data derives from. If ner/annotation/xmi contains a directory named 1bm0b3_tess_final, it means that this specific annotation was performed using ocr/runs/1bm0b3_tess_final/outputs as its base.
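
This naming convention makes it possible to trace derived data back to its OCR run, and to check that the run is still present. The following is a minimal sketch under the directory layout shown above; run_outputs_dir and orphan_xmi_runs are hypothetical helpers and do not exist in ajmc.

```python
from pathlib import Path


def run_outputs_dir(comm_dir, run_id):
    """Return the outputs directory of an OCR run inside a commentary."""
    return Path(comm_dir) / "ocr" / "runs" / run_id / "outputs"


def orphan_xmi_runs(comm_dir):
    """Return run ids under ner/annotation/xmi with no matching OCR outputs."""
    xmi_dir = Path(comm_dir) / "ner" / "annotation" / "xmi"
    if not xmi_dir.is_dir():
        return []
    return sorted(
        d.name
        for d in xmi_dir.iterdir()
        if d.is_dir() and not run_outputs_dir(comm_dir, d.name).is_dir()
    )
```

An empty list means every annotation directory can be matched to the outputs of an existing OCR run.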