Understanding the ajmc directory structure
ajmc has a systematic directory structure, which makes it possible to deal efficiently with path-related problems.
The root_dir
The root_dir is the root directory in which every commentary has its data stored, in a directory named after its id. All the data for Campbell should therefore be in root_dir/cu31924087948174, as Campbell’s id is cu31924087948174.
The default root_dir is usually taken from variables.COMMS_DATA_DIR, but this can be easily customized.
Commentary directory structure
Each commentary directory observes the following structure:
images/png contains the png images. It is the default place to look for images.
ocr contains all the outputs and evaluations from OCR runs.
ocr/evaluation contains only the corrected pages selected for evaluation of the model and their corresponding:
olr contains the via_project.json, which stores information about olr.
All of these paths can be accessed through ajmc.commons.variables:
[2]:
from ajmc.commons import variables as vs
print(f"""The path to ocr runs from commentary root should be `{vs.COMM_OCR_RUNS_REL_DIR}`""")
print(f"""The path to the via_project from commentary root should be `{vs.COMM_VIA_REL_PATH}`""")
The path to ocr runs from commentary root should be `ocr/runs`
The path to the via_project from commentary root should be `olr/via_project.json`
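To illustrate how these commentary-relative paths combine with the root_dir and a commentary id, here is a minimal sketch using only pathlib. The two relative paths are copied from the output above and hard-coded so that the sketch runs without ajmc installed; the function name `resolve_comm_path` and the literal `"root_dir"` are hypothetical and not part of ajmc.

```python
from pathlib import Path

# Values copied from the printed output above (hard-coded, not imported from ajmc).
OCR_RUNS_REL_DIR = "ocr/runs"
VIA_REL_PATH = "olr/via_project.json"


def resolve_comm_path(root_dir, commentary_id, rel_path):
    """Join root_dir, the commentary id and a commentary-relative path.

    Hypothetical helper for illustration only; it is not an ajmc function.
    """
    return Path(root_dir) / commentary_id / rel_path


# "root_dir" stands in for whatever root directory is actually in use.
via_path = resolve_comm_path("root_dir", "cu31924087948174", VIA_REL_PATH)
runs_dir = resolve_comm_path("root_dir", "cu31924087948174", OCR_RUNS_REL_DIR)

print(via_path.as_posix())
print(runs_dir.as_posix())
```

Keeping the relative parts in one place means that only the root directory has to change when the data is moved.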
Note that you can also get most of the paths for a specific commentary id using the various helpers in ajmc.commons.variables:
[4]:
commentary_id = 'sophoclesplaysa05campgoog'
print(f"""The absolute path to the images-dir of {commentary_id} is {vs.get_comm_img_dir(commentary_id)}""")
print(f"""The absolute path to the sections of {commentary_id} is {vs.get_comm_sections_path(commentary_id)}""")
The absolute path to the images-dir of sophoclesplaysa05campgoog is /Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/sophoclesplaysa05campgoog/images/png
The absolute path to the sections of sophoclesplaysa05campgoog is /Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/sophoclesplaysa05campgoog/sections.json
In the end, the global structure looks like this:
cu31924087948174
├── canonical  # Canonical jsons
│   ├── 1bm0b3_tess_final  # V1 canonical (for annotation/INCEpTION only)
│   └── v2  # V2 canonical (for `CanonicalCommentary` importation)
├── images
│   ├── pdf
│   └── png
├── ner  # Named entity recognition data
│   └── annotation
│       └── xmi
│           └── 1bm0b3_tess_final
├── ocr  # OCR data
│   ├── annotation  # For Lace annotation
│   │   └── lace_corrected
│   ├── general_evaluation  # Evaluation of all `ocr/runs`
│   ├── groundtruth  # OCR Groundtruth data
│   │   ├── evaluation
│   │   ├── images
│   │   └── retraining
│   └── runs  # OCR runs
│       ├── 13p082_lace_base
│       │   ├── evaluation
│       │   ├── evaluation_fuzzy
│       │   └── outputs
│       ├── 13s0gR_ocrd_vanilla
│       │   ├── evaluation
│       │   ├── evaluation_fuzzy
│       │   └── outputs
│       ├── 1560he_ocrd_min
│       │   ├── evaluation
│       │   ├── evaluation_fuzzy
│       │   └── outputs
│       ├── 15o0a0_lace_base_cu31924087948174-...
│       │   ├── evaluation
│       │   └── outputs
│       ├── 1bm0b3_tess_final
│       │   └── outputs
│       ├── 2480ei_greek-english_porson_sophoclesplaysa05campgoog
│       │   ├── evaluation
│       │   └── outputs
│       ├── 28qmab_tess_base
│       │   └── outputs
│       ├── tess_eng_finetune-grc-pogretra
│       │   ├── evaluation
│       │   └── outputs
│       └── tess_eng_grc
│           ├── evaluation
│           └── outputs
└── olr
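Because every run lives under ocr/runs and follows the same layout, the runs of a commentary can be enumerated with plain directory listing. The following is a rough sketch using only the standard library; `list_ocr_runs` is a hypothetical name, not an ajmc function, and the demo builds a small throwaway tree rather than reading real data.

```python
import tempfile
from pathlib import Path


def list_ocr_runs(commentary_dir):
    """Map each OCR run id to the sorted names of its sub-directories.

    Hypothetical helper for illustration only; it is not part of ajmc.
    """
    runs_dir = Path(commentary_dir) / "ocr" / "runs"
    return {
        run.name: sorted(sub.name for sub in run.iterdir() if sub.is_dir())
        for run in sorted(runs_dir.iterdir())
        if run.is_dir()
    }


# Demo on a temporary tree that mimics a small part of the structure above.
with tempfile.TemporaryDirectory() as tmp:
    comm_dir = Path(tmp) / "cu31924087948174"
    for rel in (
        "ocr/runs/1bm0b3_tess_final/outputs",
        "ocr/runs/tess_eng_grc/evaluation",
        "ocr/runs/tess_eng_grc/outputs",
    ):
        (comm_dir / rel).mkdir(parents=True)
    demo_runs = list_ocr_runs(comm_dir)

for run_id, subdirs in demo_runs.items():
    print(run_id, subdirs)
```

A listing like this makes it easy to see which runs have an evaluation directory and which only have outputs.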
OCR run ids
As you may have noticed, the id of a single OCR run (i.e. a single OCRing of a commentary with a given system and parameters) is used throughout the entire repository to identify the source of a commentary's data. If ner/annotation/xmi contains a directory named 1bm0b3_tess_final, it means that this specific annotation was performed using ocr/runs/1bm0b3_tess_final/outputs as its base.
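This naming convention can be sketched in a few lines: given an annotation directory named after a run id, the matching OCR outputs directory is found by reusing that name under ocr/runs. The function name `ocr_outputs_for_annotation` is hypothetical and not part of ajmc, and the sketch assumes the annotation directory sits exactly at ner/annotation/xmi/<run_id> inside the commentary directory.

```python
from pathlib import Path


def ocr_outputs_for_annotation(annotation_dir):
    """Return the OCR outputs directory an annotation directory is based on.

    Assumes annotation_dir is <commentary_dir>/ner/annotation/xmi/<run_id>.
    Hypothetical helper for illustration only; it is not part of ajmc.
    """
    annotation_dir = Path(annotation_dir)
    run_id = annotation_dir.name
    # Climb past xmi, annotation and ner to reach the commentary directory.
    commentary_dir = annotation_dir.parents[3]
    return commentary_dir / "ocr" / "runs" / run_id / "outputs"


# "root_dir" stands in for whatever root directory is actually in use.
xmi_dir = Path("root_dir/cu31924087948174/ner/annotation/xmi/1bm0b3_tess_final")
outputs_dir = ocr_outputs_for_annotation(xmi_dir)
print(outputs_dir.as_posix())
```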