ajmc.corpora package¶
ajmc.corpora contains utilities for scraping and handling corpora for the AjMC project.
Architecture¶
The functionality of this package can be divided into three main categories:
Scraping corpora from the web
Cleaning and preparing downloaded corpora for further processing
Processing and manipulating corpora
Scraping and cleaning¶
As each corpus has its own peculiarities, it also has its own scraping and cleaning script (see corpora._scripts).
Processing¶
corpora provides a set of functions and objects for processing and manipulating corpora. Each corpus has a type specified in its metadata.json file. This type determines which functions and objects are used to process the corpus.
The main object is the Corpus object, which is a wrapper around the different corpus styles. See corpora_classes.py for more.
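The type-based selection described in this section can be illustrated with a minimal sketch. This is not the actual ajmc implementation: the class names (PlainTextCorpus, TeiCorpus), the CORPUS_CLASSES registry, the load_corpus function, and the assumption that metadata.json stores the type under a "type" key are all hypothetical and only show the general pattern.

```python
import json
from pathlib import Path


class PlainTextCorpus:
    """Hypothetical handler for one corpus type (not part of ajmc)."""

    def __init__(self, root: Path, metadata: dict):
        self.root = root
        self.metadata = metadata


class TeiCorpus:
    """Hypothetical handler for another corpus type (not part of ajmc)."""

    def __init__(self, root: Path, metadata: dict):
        self.root = root
        self.metadata = metadata


# Hypothetical registry mapping a type string to the class that handles it.
CORPUS_CLASSES = {"plain_text": PlainTextCorpus, "tei": TeiCorpus}


def load_corpus(root):
    """Read metadata.json in `root` and build the handler for its type."""
    root = Path(root)
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))
    corpus_type = metadata["type"]  # assumed key name
    if corpus_type not in CORPUS_CLASSES:
        raise ValueError(f"Unknown corpus type: {corpus_type!r}")
    return CORPUS_CLASSES[corpus_type](root, metadata)
```

Keeping the mapping in a single dictionary means that supporting a new corpus type only requires registering one more class, and unknown types fail early with a clear error.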
Basic usage¶
Submodules¶
ajmc.corpora.bibliographic_records module¶
⚙️ WIP: code to process bibliographic records.