ajmc.corpora package

ajmc.corpora contains utilities for scraping and handling corpora for the AjMC project.

Architecture

The functionality of this package falls into three main categories:

  • Scraping corpora from the web

  • Cleaning and preparing downloaded corpora for further processing

  • Processing and manipulating corpora

Scraping and cleaning

As each corpus has its own peculiarities, it also has its own scraping and cleaning script (see corpora._scripts).

Processing

corpora provides a set of functions and objects for processing and manipulating corpora. Each corpus declares a type in its metadata.json file, and this type determines which functions and objects are used to process the corpus.

The main object is the Corpus object, which is a wrapper around the different corpus styles. See corpora_classes.py for more.
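The type-based selection described above can be sketched as a small registry lookup. This is a minimal illustration, not the package's implementation: the "type" key, the type names "plain_text" and "tei", the two handler classes and load_corpus are all hypothetical names chosen for the example. See corpora_classes.py for the actual Corpus object.

```python
import json
import tempfile
from pathlib import Path


class PlainTextCorpus:
    """Hypothetical handler for corpora stored as plain text files."""

    def __init__(self, root: Path):
        self.root = root


class TeiCorpus:
    """Hypothetical handler for corpora stored as TEI XML files."""

    def __init__(self, root: Path):
        self.root = root


# Hypothetical registry mapping the metadata "type" value to a handler class.
CORPUS_TYPES = {"plain_text": PlainTextCorpus, "tei": TeiCorpus}


def load_corpus(root):
    """Read metadata.json in `root` and build the handler matching its type."""
    root = Path(root)
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))
    corpus_type = metadata["type"]  # assumption: the key is called "type"
    try:
        corpus_class = CORPUS_TYPES[corpus_type]
    except KeyError:
        raise ValueError(f"Unknown corpus type: {corpus_type!r}") from None
    return corpus_class(root)


# Usage: create a throwaway corpus directory and load it.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "metadata.json").write_text(
        json.dumps({"type": "tei"}), encoding="utf-8"
    )
    corpus = load_corpus(tmp)
    print(type(corpus).__name__)
```

Keeping the mapping in a single dictionary means that supporting a new corpus style only requires one new class and one new registry entry.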

Basic usage

Submodules

ajmc.corpora.bibliographic_records module

⚙️ WIP: code for processing bibliographic records

class ajmc.corpora.bibliographic_records.DublinCoreRecord(soup: BeautifulSoup)

Bases: object

creator() → str
description() → str
get_property_tag_text(tag_name: str) → str
keywords() → List[str]
keywords_string() → str
language() → str
publisher() → str
title() → str
whole_text() → str

ajmc.corpora.bibliographic_records.get_records_list(xmls_dir: Path | str) → List[BeautifulSoup]
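To illustrate the kind of lookup the accessors above suggest, here is a minimal sketch of reading Dublin Core properties from an XML record. It is not the DublinCoreRecord implementation: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup to stay dependency-free, the sample record is invented, the helper names are hypothetical, and reading keywords from dc:subject is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample record, for illustration only.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example title</dc:title>
  <dc:creator>Example Author</dc:creator>
  <dc:subject>Greek tragedy</dc:subject>
  <dc:subject>Commentary</dc:subject>
  <dc:language>en</dc:language>
</record>"""


def get_property_texts(root: ET.Element, tag_name: str) -> List[str]:
    """Return the stripped text of every dc:<tag_name> element under root."""
    return [(el.text or "").strip() for el in root.iter(f"{{{DC_NS}}}{tag_name}")]


def get_property_text(root: ET.Element, tag_name: str) -> str:
    """Return all dc:<tag_name> values joined by spaces ('' if none exist)."""
    return " ".join(get_property_texts(root, tag_name))


root = ET.fromstring(SAMPLE)
title = get_property_text(root, "title")
keywords = get_property_texts(root, "subject")  # assumption: keywords = dc:subject
print(title, keywords)
```

Returning an empty string for a missing property, rather than raising, keeps the accessors safe to call on incomplete records. Whether the real class behaves this way is not stated on this page.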

ajmc.corpora.cleaning_utils module

ajmc.corpora.cleaning_utils.basic_clean(text: str) → str
ajmc.corpora.cleaning_utils.find_recurrent_lines(path: str, n_first_elements: int | None = None, recurrence_threshold: int | None = None)
ajmc.corpora.cleaning_utils.harmonise_linebreaks(text: str) → str
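This page does not document what these functions do. Going only by their names, the sketch below shows one plausible form of linebreak harmonisation and basic cleaning. Both functions are hypothetical stand-ins (hence the _sketch suffix), and the specific rules (convert all line endings to "\n", collapse runs of spaces and tabs, keep at most one blank line) are assumptions rather than the package's behaviour.

```python
import re


def harmonise_linebreaks_sketch(text: str) -> str:
    """Convert Windows (\\r\\n) and old Mac (\\r) line endings to \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def basic_clean_sketch(text: str) -> str:
    """Normalise line endings, trim each line, and squeeze blank lines."""
    text = harmonise_linebreaks_sketch(text)
    # Collapse runs of spaces/tabs inside each line and trim its edges.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "First  line\r\n\r\n\r\n\r\nSecond\tline \r"
cleaned = basic_clean_sketch(raw)
print(repr(cleaned))
```

Trimming the lines before squeezing blank lines matters: whitespace-only lines become empty first, so they are counted as blank lines in the final pass.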

ajmc.corpora.corpora_classes module

ajmc.corpora.scraping_utils module

ajmc.corpora.variables module