ajmc.corpora package

ajmc.corpora contains utilities for scraping and handling corpora for the AjMC project.

Architecture

The functionality of this package falls into three main categories:

Scraping corpora from the web
Cleaning and preparing downloaded corpora for further processing
Processing and manipulating corpora

Scraping and cleaning

Because each corpus has its own peculiarities, each one also has its own scraping and cleaning script (see corpora._scripts).

Processing

corpora provides a set of functions and objects for processing and manipulating corpora. Each corpus has a type specified in its metadata.json file, and this type determines which functions and objects are used to process it.

The main object is the Corpus object, which wraps the different corpus styles. See corpora_classes.py for details.
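The type-based dispatch described above can be sketched as follows. This is only an illustration of the idea, not the package's actual code: the class names, the `CORPUS_TYPES` registry, the `load_corpus` helper, the type strings and the assumption that the key in metadata.json is called `"type"` are all hypothetical.

```python
import json
from pathlib import Path


class PlainTextCorpus:
    """Hypothetical stand-in for one corpus style (the real classes live in corpora_classes.py)."""

    def __init__(self, root: Path, metadata: dict):
        self.root = root
        self.metadata = metadata


class TeiCorpus(PlainTextCorpus):
    """Hypothetical stand-in for a second corpus style."""


# Hypothetical registry mapping the type string in metadata.json to a class.
CORPUS_TYPES = {"plain_text": PlainTextCorpus, "tei": TeiCorpus}


def load_corpus(root):
    """Read <root>/metadata.json and instantiate the class registered for its type."""
    root = Path(root)
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))
    corpus_type = metadata["type"]  # assumed key name
    if corpus_type not in CORPUS_TYPES:
        raise ValueError(f"Unknown corpus type: {corpus_type!r}")
    return CORPUS_TYPES[corpus_type](root, metadata)
```

Keeping the mapping in a single registry means that supporting a new corpus style only requires adding one entry, while callers keep using the same entry point.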

Basic usage

Submodules

ajmc.corpora.bibliographic_records module

⚙️ WIP: code to process bibliographic records.

class ajmc.corpora.bibliographic_records.DublinCoreRecord(soup: BeautifulSoup)

Bases: object

creator() → str

description() → str

get_property_tag_text(tag_name: str) → str

keywords() → List[str]

keywords_string() → str

language() → str

publisher() → str

title() → str

whole_text() → str

ajmc.corpora.bibliographic_records.get_records_list(xmls_dir: Path | str) → List[BeautifulSoup]
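Since the methods above are undocumented, here is a minimal sketch of how such a Dublin Core wrapper might work. It is not the module's implementation: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup to stay dependency-free, the names `DublinCoreRecordSketch` and `get_records_list_sketch` are hypothetical, the mapping of keywords to `dc:subject` elements is an assumption, and the sample record is invented for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List

# Namespace of the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"


class DublinCoreRecordSketch:
    """Hypothetical wrapper exposing Dublin Core properties of one parsed record."""

    def __init__(self, element: ET.Element):
        self.element = element

    def _texts(self, tag_name: str) -> List[str]:
        # Collect the stripped text of every <dc:tag_name> element in the record.
        tag = f"{{{DC_NS}}}{tag_name}"
        return [el.text.strip() for el in self.element.iter(tag) if el.text]

    def get_property_tag_text(self, tag_name: str) -> str:
        # Join repeated elements with a space; return "" when the tag is absent.
        return " ".join(self._texts(tag_name))

    def title(self) -> str:
        return self.get_property_tag_text("title")

    def keywords(self) -> List[str]:
        # Assumption: keywords are stored one per dc:subject element.
        return self._texts("subject")

    def keywords_string(self) -> str:
        return ", ".join(self.keywords())


def get_records_list_sketch(xmls_dir) -> List[ET.Element]:
    """Parse every *.xml file in a directory, in sorted filename order."""
    return [ET.parse(path).getroot() for path in sorted(Path(xmls_dir).glob("*.xml"))]


# Invented sample record, for illustration only.
sample = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Ajax</dc:title>
  <dc:subject>Sophocles</dc:subject>
  <dc:subject>tragedy</dc:subject>
</record>"""
record = DublinCoreRecordSketch(ET.fromstring(sample))
print(record.title(), "|", record.keywords_string())
```

Routing every accessor through one generic lookup keeps the per-property methods to a single line each, which may be why the real class exposes `get_property_tag_text` alongside them.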

ajmc.corpora.cleaning_utils module

ajmc.corpora.cleaning_utils.basic_clean(text: str) → str

ajmc.corpora.cleaning_utils.find_recurrent_lines(path: str, n_first_elements: int | None = None, recurrence_threshold: int | None = None)
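The signature suggests a helper for spotting lines that repeat within a file, such as running headers or footers left over from scraping. The sketch below is one plausible reading under that assumption, not the documented behaviour: the function name, the return type and the interpretation of both optional parameters (keep at most `n_first_elements` results, and only lines occurring at least `recurrence_threshold` times) are guesses.

```python
from collections import Counter
from pathlib import Path
from typing import List, Optional, Tuple


def find_recurrent_lines_sketch(
    path,
    n_first_elements: Optional[int] = None,
    recurrence_threshold: Optional[int] = None,
) -> List[Tuple[str, int]]:
    """Return (line, count) pairs for non-empty lines, most frequent first."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Count stripped, non-empty lines so indentation differences do not split counts.
    counts = Counter(line.strip() for line in lines if line.strip())
    ranked = counts.most_common()  # sorted by descending count
    if recurrence_threshold is not None:
        # Assumed meaning: drop lines seen fewer than `recurrence_threshold` times.
        ranked = [(line, n) for line, n in ranked if n >= recurrence_threshold]
    if n_first_elements is not None:
        # Assumed meaning: keep only the top `n_first_elements` results.
        ranked = ranked[:n_first_elements]
    return ranked
```

A caller could then review the returned lines and remove the ones that are page furniture rather than text.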

ajmc.corpora.cleaning_utils.harmonise_linebreaks(text: str) → str
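The function's behaviour is not described here. Going by its name, a reasonable guess is that it normalises the different line-break conventions found in scraped text to a single `\n`. The sketch below illustrates that guess only; the function name and the exact set of characters handled are assumptions.

```python
import re

# Windows (\r\n), old Mac (\r), and the Unicode line and paragraph separators.
# \r\n is listed first so that it is replaced as a unit and not as two breaks.
_LINEBREAKS = re.compile("\r\n|\r|\u2028|\u2029")


def harmonise_linebreaks_sketch(text: str) -> str:
    """Replace every recognised line-break variant with a single newline."""
    return _LINEBREAKS.sub("\n", text)
```

Normalising line breaks before any line-based cleaning step avoids treating the same line as different strings because of a trailing `\r`.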