Basic functionalities¶
ajmc provides a simple framework for working with the project's data. It can be used to compute statistics, to evaluate an OCR output, or simply to navigate through the data.
Navigate through the data¶
Creating a canonical commentary¶
The simplest way to navigate through a commentary is to instantiate a text_processing.canonical_classes.CanonicalCommentary. This object relies on a canonical JSON file, a standardised and optimised storage format for OCR outputs.
[5]:
from ajmc.text_processing.canonical_classes import CanonicalCommentary
from ajmc.commons.variables import COMMS_DATA_DIR
comm_id = 'sophoclesplaysa05campgoog'
can_json_path = COMMS_DATA_DIR / comm_id / 'canonical/3467O2_tess_retrained.json'
comm = CanonicalCommentary.from_json(json_path=can_json_path)
Note. This assumes you are creating a commentary compliant with ajmc’s file structure. If you want to use custom paths to images, OCR outputs or VIA files, please consider creating an ajmc.text_processing.ocr_classes.RawCommentary (see examples/import_from_ocr.ipynb).
The main functionalities of a CanonicalCommentary¶
A CanonicalCommentary is a particular case of the more generic CanonicalTextContainer. It has children (such as pages, regions, lines and words), which are themselves CanonicalTextContainers. It also has images and text. Let us have a look!
[6]:
# Get a commentary's pages:
comm.children.pages
# Get a commentary's regions
comm.children.regions
# Get only the commentary's primary text regions
[r for r in comm.children.regions if r.region_type=='primary_text']
# Select the app_crit regions of the 140th to 159th pages (the slice end is exclusive)
[r for p in comm.children.pages[139:159] for r in p.children.regions if r.region_type=='app_crit']
[6]:
[<ajmc.text_processing.canonical_classes.CanonicalRegion at 0x152decc10>,
<ajmc.text_processing.canonical_classes.CanonicalRegion at 0x152defd90>,
<ajmc.text_processing.canonical_classes.CanonicalRegion at 0x152e030d0>]
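The two filters used above (by region type and by page range) can be combined in a small helper. The following is only a minimal sketch: `regions_of_type` and `make_page` are hypothetical names, not part of ajmc, and the stand-in objects mimic nothing more than the `children.regions` and `region_type` attributes used in the cell above. Page positions are 1-based and inclusive here, which avoids the off-by-one that is easy to make with raw slices.

```python
from types import SimpleNamespace


def regions_of_type(pages, region_type, first=1, last=None):
    """Return the regions of `region_type` on pages `first`..`last` (1-based, inclusive)."""
    # `last` is inclusive, so it maps directly onto the exclusive slice end.
    selected = pages[first - 1:last]
    return [r for p in selected for r in p.children.regions if r.region_type == region_type]


def make_page(*region_types):
    """Build a stand-in page (NOT an ajmc class) exposing `children.regions`."""
    regions = [SimpleNamespace(region_type=t) for t in region_types]
    return SimpleNamespace(children=SimpleNamespace(regions=regions))


# Toy data, used only to show the helper's behaviour.
toy_pages = [
    make_page('primary_text', 'app_crit'),
    make_page('commentary'),
    make_page('app_crit', 'page_number'),
]

all_app_crits = regions_of_type(toy_pages, 'app_crit')          # every page
late_app_crits = regions_of_type(toy_pages, 'app_crit', 2, 3)   # 2nd and 3rd pages only
print(len(all_app_crits), len(late_app_crits))
```

With a real commentary, and assuming its pages expose the same attributes as in the cell above, the call would look like `regions_of_type(comm.children.pages, 'app_crit', 140, 159)`.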
Other text containers¶
Note that page-level containers like the ones mentioned above also have other attributes, such as a bounding box (bbox) and a parent.
Pages, regions and lines¶
[7]:
# Get a single page by its id
page = [p for p in comm.children.pages if p.id == 'sophoclesplaysa05campgoog_0146'][0]
# Get the page's image
page.image
# Get the page's first commentary region
region = [r for r in page.children.regions if r.region_type=='commentary'][0]
# Get its text
region.text
# Count the number of lines in a region
len(region.children.lines)
# Get the coordinates of the region
region.bbox
# Get the average number of characters in page-number regions
page_numbers = [r for r in comm.children.regions if r.region_type=='page_number']
sum([len(r.text) for r in page_numbers])/len(page_numbers)
[7]:
2.411764705882353
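The lookups in the cell above index the result of a list comprehension with `[0]`, which raises an `IndexError` when nothing matches, and the average divides by zero when there is no region of the requested type. A more defensive variant could look like the sketch below. The helper names `find_by_id` and `mean_text_length` are hypothetical (they are not ajmc functions), and the stand-in objects only carry the `id`, `region_type` and `text` attributes used above.

```python
from statistics import mean
from types import SimpleNamespace


def find_by_id(containers, wanted_id):
    """Return the first container whose `id` matches, or None if there is none."""
    return next((c for c in containers if c.id == wanted_id), None)


def mean_text_length(regions, region_type):
    """Average number of characters in regions of `region_type`, or None if there are none."""
    lengths = [len(r.text) for r in regions if r.region_type == region_type]
    return mean(lengths) if lengths else None


# Toy stand-ins (NOT ajmc classes), used only to exercise the helpers.
toy_pages = [SimpleNamespace(id='toy_0001'), SimpleNamespace(id='toy_0002')]
toy_regions = [
    SimpleNamespace(region_type='page_number', text='12'),
    SimpleNamespace(region_type='page_number', text='113'),
    SimpleNamespace(region_type='commentary', text='Some commentary text'),
]

found = find_by_id(toy_pages, 'toy_0002')
missing = find_by_id(toy_pages, 'toy_9999')          # no match, so None
average = mean_text_length(toy_regions, 'page_number')
empty = mean_text_length(toy_regions, 'app_crit')    # no such region, so None
print(found.id, missing, average, empty)
```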
For more information, see the code documentation of canonical_classes.py, which is quite thorough.
Sections¶
[ ]:
# For sections, a dedicated method retrieves a specific section of the commentary quickly.
section = comm.get_section('commentary') # Your desired section type here
# This is actually equivalent to doing:
section = [s for s in comm.children.sections if 'commentary' in s.section_types][0]
# Sections then behave like any other text container. To get, for instance, the pages in a section, simply call
section_pages = section.children.pages
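Since each section carries a list of `section_types` (as the membership test above suggests), one way to inspect all sections at once is to build an index from type to sections. This is a minimal sketch under that assumption: `sections_by_type` is a hypothetical helper, not an ajmc method, and the `label` attribute on the stand-in objects exists only to make the toy output readable.

```python
from collections import defaultdict
from types import SimpleNamespace


def sections_by_type(sections):
    """Map each section type to the list of sections declaring it, in document order."""
    index = defaultdict(list)
    for section in sections:
        for section_type in section.section_types:
            index[section_type].append(section)
    return dict(index)


# Toy stand-ins (NOT ajmc classes); `label` is a hypothetical attribute for display only.
toy_sections = [
    SimpleNamespace(label='s1', section_types=['introduction']),
    SimpleNamespace(label='s2', section_types=['commentary']),
    SimpleNamespace(label='s3', section_types=['commentary', 'index']),
]

index = sections_by_type(toy_sections)
commentary_labels = [s.label for s in index['commentary']]
print(sorted(index), commentary_labels)
```

A section that declares several types appears under each of them, so the first entry of `index['commentary']` corresponds to what the list comprehension above returns.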