Basic functionalities¶

ajmc provides a simple framework for working with the project's data. It can be used to compute statistics, to evaluate an OCR output, or simply to navigate through the data.

Navigate through the data¶

Creating a canonical commentary¶

The simplest way to navigate through a commentary is to instantiate a text_processing.canonical_classes.CanonicalCommentary. This object relies on a canonical JSON file, which is a standardised and optimised storage format for OCR outputs.

[5]:

from ajmc.text_processing.canonical_classes import CanonicalCommentary
from ajmc.commons.variables import COMMS_DATA_DIR

comm_id = 'sophoclesplaysa05campgoog'
can_json_path = COMMS_DATA_DIR / comm_id / 'canonical/3467O2_tess_retrained.json'
comm = CanonicalCommentary.from_json(json_path=can_json_path)

Note. This assumes you are creating a commentary compliant with ajmc’s file structure. If you want to use custom paths to images, OCR outputs or VIA files, please consider creating an ajmc.text_processing.ocr_classes.RawCommentary (see examples/import_from_ocr.ipynb).
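Because the commentary is loaded from a file on disk, it can help to build the path in one place and check it before loading. The following is only a minimal sketch using `pathlib`: the helper `canonical_json_path` and the base directory `/data/commentaries` are hypothetical and not part of ajmc. With ajmc installed, `COMMS_DATA_DIR` would play the role of the base directory.

```python
from pathlib import Path


def canonical_json_path(base_dir: Path, comm_id: str, json_name: str) -> Path:
    """Build <base_dir>/<comm_id>/canonical/<json_name> (hypothetical helper, not part of ajmc)."""
    return base_dir / comm_id / 'canonical' / json_name


# Hypothetical base directory; with ajmc you would pass COMMS_DATA_DIR instead.
base_dir = Path('/data/commentaries')
path = canonical_json_path(base_dir, 'sophoclesplaysa05campgoog', '3467O2_tess_retrained.json')

# Fail early with a readable message rather than inside the loader.
if not path.is_file():
    print(f'No canonical json found at {path}')
```

If the file exists, the resulting path can be passed to `CanonicalCommentary.from_json(json_path=path)` as in the cell above.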

The main functionalities of a `CanonicalCommentary`¶

A CanonicalCommentary is a particular case of the more generic CanonicalTextContainer. It has children (such as pages, regions, lines and words), which are also CanonicalTextContainers. It also has images and text. Let us have a look!

[6]:

# Get a commentary's pages:
comm.children.pages

# Get a commentary's regions
comm.children.regions

# Get only the commentary's primary text regions
[r for r in comm.children.regions if r.region_type=='primary_text']


# Select the app_crit regions of the 140th to 159th pages (the slice end is exclusive)
[r for p in comm.children.pages[139:159] for r in p.children.regions if r.region_type == 'app_crit']

[6]:

[<ajmc.text_processing.canonical_classes.CanonicalRegion at 0x152decc10>,
 <ajmc.text_processing.canonical_classes.CanonicalRegion at 0x152defd90>,
 <ajmc.text_processing.canonical_classes.CanonicalRegion at 0x152e030d0>]
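The filtering pattern used above can be wrapped in a small helper. The sketch below does not use ajmc: it relies on `SimpleNamespace` stand-ins that expose only the attributes seen in the cell (`children.regions` and `region_type`), and the names `regions_of_type` and `make_page` are hypothetical. Note that a Python slice such as `pages[139:159]` excludes its end index, so it covers the 140th to the 159th page.

```python
from types import SimpleNamespace


def regions_of_type(pages, region_type):
    """Collect, in page order, every region of the given type (hypothetical helper)."""
    return [r for p in pages for r in p.children.regions if r.region_type == region_type]


def make_page(page_id, region_types):
    """Build a stand-in page exposing only .id and .children.regions."""
    regions = [SimpleNamespace(region_type=t) for t in region_types]
    return SimpleNamespace(id=page_id, children=SimpleNamespace(regions=regions))


# Stand-in data; with ajmc you would pass comm.children.pages instead.
pages = [
    make_page('p1', ['primary_text', 'app_crit', 'commentary']),
    make_page('p2', ['commentary']),
    make_page('p3', ['primary_text', 'app_crit']),
]

# Restrict the search to the first two pages (the slice end is exclusive).
app_crits = regions_of_type(pages[0:2], 'app_crit')
print(len(app_crits))
```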

Other text containers¶

Note that page-level containers like the ones mentioned above also have other attributes, such as a bounding box (bbox) and a parent (parent).

Pages, regions and lines¶

[7]:

# Get a single page by its id
page = [p for p in comm.children.pages if p.id == 'sophoclesplaysa05campgoog_0146'][0]

# Get the page's image
page.image

# Get the page's first commentary region
region = [r for r in page.children.regions if r.region_type=='commentary'][0]

# Get its text
region.text

# Count the number of lines in a region
len(region.children.lines)

# Get the coordinates of the region
region.bbox

# Get the average number of characters in page-number regions
page_numbers = [r for r in comm.children.regions if r.region_type == 'page_number']
sum(len(r.text) for r in page_numbers) / len(page_numbers)

[7]:

2.411764705882353

For more information, see the code documentation in canonical_classes.py, which is quite detailed.
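Two patterns from the cell above are fragile on real data: looking up a page with `[...][0]` raises an `IndexError` when the id is missing, and the average divides by zero when no region matches. The following sketch shows one possible way around both. It uses plain `SimpleNamespace` stand-ins rather than ajmc objects, and `mean_text_length` and the ids are hypothetical.

```python
from statistics import mean
from types import SimpleNamespace

# Stand-in pages; with ajmc you would iterate over comm.children.pages.
pages = [SimpleNamespace(id='comm_0001'), SimpleNamespace(id='comm_0002')]

# Build an id -> page index once, then look pages up without rescanning the list.
pages_by_id = {p.id: p for p in pages}
page = pages_by_id.get('comm_0002')      # the matching page
missing = pages_by_id.get('comm_9999')   # None instead of an IndexError


def mean_text_length(regions):
    """Average len(region.text); returns 0.0 for an empty list instead of raising."""
    return mean(len(r.text) for r in regions) if regions else 0.0


# Stand-in regions exposing only .text.
page_numbers = [SimpleNamespace(text='12'), SimpleNamespace(text='130')]
average = mean_text_length(page_numbers)
```

The dictionary index pays off when several pages are looked up in a row; for a single lookup the list comprehension in the cell above is just as good.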

Sections¶

[ ]:

# For sections, a dedicated method retrieves a specific section of the commentary quickly.
section = comm.get_section('commentary')  # Your desired section type here

# This is actually equivalent to doing:
section = [s for s in comm.children.sections if 'commentary' in s.section_types][0]

# Sections then behave like any other text container. To get, for instance, the pages in a section, simply call:
section_pages = section.children.pages
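The list-comprehension equivalent shown above raises an `IndexError` when no section of the requested type exists. As a sketch only, the lookup can be written with `next` and a default so that a missing section yields `None`. The stand-in sections below mimic just the two attributes used in the cell (`section_types` and `children.pages`), and `first_section_of_type` is a hypothetical name, not an ajmc function.

```python
from types import SimpleNamespace


def first_section_of_type(sections, section_type):
    """Return the first section whose section_types contains section_type, else None."""
    return next((s for s in sections if section_type in s.section_types), None)


# Stand-in sections; with ajmc you would pass comm.children.sections.
sections = [
    SimpleNamespace(section_types=['introduction'],
                    children=SimpleNamespace(pages=['p1', 'p2'])),
    SimpleNamespace(section_types=['commentary'],
                    children=SimpleNamespace(pages=['p3', 'p4', 'p5'])),
]

section = first_section_of_type(sections, 'commentary')
# Fall back to an empty list when the section type is absent.
section_pages = section.children.pages if section is not None else []
```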