SourceCollection ¶

Bases: BaseModel

Represent a collection of sources.

Attributes:

sources (List[Source]) –

List of Source objects.

Methods:

__iter__ –

Return an iterator over the list of Source objects.
__len__ –

Return the number of sources in this collection.
__str__ –

Return a string representation of the SourceCollection.
from_json –

Import the SourceCollection from a JSON file.
get –

Filter sources based on regex patterns for non-optional attributes.
retrieve_all_from_wayback –

Download archived files for all sources in the collection using retrieve_from_wayback.
to_csv –

Export the SourceCollection to a CSV file.
to_dataframe –

Convert the SourceCollection to a pandas DataFrame.
to_json –

Export the SourceCollection to a JSON file, optionally together with a data schema.

sources `instance-attribute` ¶

sources: Annotated[list[Source], Field(description='List of Source objects.')]

__iter__ ¶

__iter__() -> Iterator[Source]

Return an iterator over the list of Source objects.

Returns:

Iterator[Source] –

An iterator over the Source objects contained in the collection.

__len__ ¶

__len__() -> int

Return the number of sources in this collection.

Returns:

int –

The number of Source objects in the sources list.

__str__ ¶

__str__() -> str

Return a string representation of the SourceCollection.

Returns:

str –

A string representation of the SourceCollection, showing the number of sources.
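The three container methods above can be illustrated with a minimal sketch. It uses a plain-Python stand-in, not the real classes: the `Source` fields (`title`, `authors`) are borrowed from the `get` parameters, and the exact text returned by `__str__` is an assumption. Pydantic's `BaseModel` iterates over (field name, value) pairs by default, so the `__iter__` override documented here is what lets `for source in collection` yield Source objects.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Source:
    """Hypothetical stand-in for the real Source model."""

    title: str
    authors: str


class SourceCollection:
    """Stand-in that implements only the container methods."""

    def __init__(self, sources: list[Source]) -> None:
        self.sources = sources

    def __iter__(self) -> Iterator[Source]:
        # Delegate iteration to the underlying list of sources.
        return iter(self.sources)

    def __len__(self) -> int:
        # The length of the collection is the length of the list.
        return len(self.sources)

    def __str__(self) -> str:
        # Assumed format: the real string may be worded differently.
        return f"SourceCollection with {len(self.sources)} sources"


collection = SourceCollection(
    [
        Source(title="Technology catalogue", authors="Agency A"),
        Source(title="Cost baseline", authors="Lab B"),
    ]
)

titles = [source.title for source in collection]  # uses __iter__
count = len(collection)  # uses __len__
summary = str(collection)  # uses __str__
```
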

from_json `classmethod` ¶

from_json(file_path: Path | str) -> Self

Import the SourceCollection from a JSON file.

Parameters:

file_path (Path | str) –

The path to the JSON file to be imported.

Returns:

SourceCollection –

A new SourceCollection populated from the JSON file.
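A rough sketch of how a `from_json` classmethod of this shape might work, using dataclasses in place of the real Pydantic models. The file layout (a top-level `"sources"` list) and the `Source` fields are assumptions made for illustration; the real file format is defined by the library.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Source:
    """Hypothetical stand-in for the real Source model."""

    title: str
    authors: str


@dataclass
class SourceCollection:
    sources: list[Source]

    @classmethod
    def from_json(cls, file_path: Path | str) -> SourceCollection:
        # Accept both str and Path, as the documented signature does.
        raw = Path(file_path).read_text(encoding="utf-8")
        data = json.loads(raw)
        # Assumed layout: {"sources": [{"title": ..., "authors": ...}, ...]}
        return cls(sources=[Source(**item) for item in data["sources"]])


# Write a small example file, then load it back.
json_path = Path(tempfile.mkdtemp()) / "sources.json"
json_path.write_text(
    json.dumps({"sources": [{"title": "Cost baseline", "authors": "Lab B"}]}),
    encoding="utf-8",
)

loaded = SourceCollection.from_json(str(json_path))
```
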

get ¶

get(title: str, authors: str) -> Self

Filter sources based on regex patterns for non-optional attributes.

Parameters:

title (str) –

Regex pattern to filter titles.
authors (str) –

Regex pattern to filter authors.

Returns:

SourceCollection –

A new SourceCollection with filtered sources.
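The filtering described above could look roughly like the following sketch. It assumes that each pattern is applied with `re.search` (unanchored, case-sensitive) and that a source is kept only when both patterns match; the real method may differ on these points. `Source` is again a hypothetical stand-in.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for the real Source model."""

    title: str
    authors: str


@dataclass
class SourceCollection:
    sources: list[Source]

    def get(self, title: str, authors: str) -> SourceCollection:
        # Assumption: unanchored, case-sensitive search on each attribute.
        title_re = re.compile(title)
        authors_re = re.compile(authors)
        kept = [
            source
            for source in self.sources
            if title_re.search(source.title) and authors_re.search(source.authors)
        ]
        # Return a new collection and leave the original untouched.
        return SourceCollection(sources=kept)


collection = SourceCollection(
    [
        Source(title="Energy technology catalogue", authors="Agency A"),
        Source(title="Annual Technology Baseline", authors="Lab B"),
        Source(title="Technology cost review", authors="Lab B"),
    ]
)

annual = collection.get(title="^Annual", authors="Lab B")
lab_b_tech = collection.get(title="[Tt]echnology", authors="^Lab B$")
everything = collection.get(title=".*", authors=".*")
```

Passing `".*"` for an argument effectively disables filtering on that attribute, since both parameters are required.
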

retrieve_all_from_wayback ¶

retrieve_all_from_wayback(download_directory: Path) -> list[Path | None]

Download archived files for all sources in the collection using retrieve_from_wayback.

Parameters:

download_directory (Path) –

The base directory where all files will be saved.

Returns:

list[Path | None] –

List with one entry per source: the path where the file was stored, or None if the download failed for that source.
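The following sketch shows one way to work with the returned list. It performs no network access: the per-source `retrieve_from_wayback` is replaced by a stub, and its signature, the `archived_url` field, and the assumption that results come back in the same order as the sources are all hypothetical.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Source:
    """Hypothetical stand-in for the real Source model."""

    title: str
    archived_url: str | None = None  # hypothetical field

    def retrieve_from_wayback(self, download_directory: Path) -> Path | None:
        # Stub: writes a placeholder file instead of downloading anything.
        if self.archived_url is None:
            return None  # signals a failed download
        target = download_directory / f"{self.title}.txt"
        target.write_text("placeholder for archived content", encoding="utf-8")
        return target


@dataclass
class SourceCollection:
    sources: list[Source]

    def retrieve_all_from_wayback(self, download_directory: Path) -> list[Path | None]:
        download_directory.mkdir(parents=True, exist_ok=True)
        # One result per source, in the same order as self.sources.
        return [s.retrieve_from_wayback(download_directory) for s in self.sources]


collection = SourceCollection(
    [
        Source(title="archived", archived_url="https://example.org/archived"),
        Source(title="missing"),
    ]
)

paths = collection.retrieve_all_from_wayback(Path(tempfile.mkdtemp()))

# Pair each source with its result to find the failed downloads.
failed = [s.title for s, p in zip(collection.sources, paths) if p is None]
```
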

to_csv ¶

to_csv(**kwargs: Path | str | bool) -> None

Export the SourceCollection to a CSV file.

Parameters:

**kwargs (dict, default: {} ) –

Additional keyword arguments passed to pandas.DataFrame.to_csv(). Common options include:

- path_or_buf : str or pathlib.Path or file-like object, optional. File path or object; if None, the result is returned as a string. Default is None.
- sep : str. String of length 1. Field delimiter for the output file. Default is ','.
- index : bool. Write row names (index). Default is True.
- encoding : str. String representing the encoding to use in the output file. Default is 'utf-8'.

Notes

The method converts the collection to a pandas DataFrame using self.to_dataframe() and then writes it to a CSV file using the provided kwargs.
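The kwargs forwarding described in the note can be sketched as follows. This assumes pandas is installed and uses a dataclass stand-in for `Source`; the columns produced by the real `to_dataframe` depend on the actual Source fields.

```python
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

import pandas as pd


@dataclass
class Source:
    """Hypothetical stand-in for the real Source model."""

    title: str
    authors: str


class SourceCollection:
    def __init__(self, sources: list[Source]) -> None:
        self.sources = sources

    def to_dataframe(self) -> pd.DataFrame:
        # One row per source, one column per field.
        return pd.DataFrame([asdict(source) for source in self.sources])

    def to_csv(self, **kwargs) -> None:
        # All keyword arguments go straight to pandas.DataFrame.to_csv().
        self.to_dataframe().to_csv(**kwargs)


collection = SourceCollection(
    [
        Source(title="Technology catalogue", authors="Agency A"),
        Source(title="Cost baseline", authors="Lab B"),
    ]
)

csv_path = Path(tempfile.mkdtemp()) / "sources.csv"
# Semicolon-separated output without the numeric row index.
collection.to_csv(path_or_buf=csv_path, sep=";", index=False)
csv_text = csv_path.read_text(encoding="utf-8")
```

Passing `index=False` is a common choice here, because the default row index is only a running number and rarely belongs in the exported file.
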

to_dataframe ¶

to_dataframe() -> DataFrame

Convert the SourceCollection to a pandas DataFrame.

Returns:

DataFrame –

A DataFrame containing the source data.

to_json ¶

to_json(file_path: Path, schema_path: Path | None = None, output_schema: bool = False) -> None

Export the SourceCollection to a JSON file, optionally together with a data schema.

Parameters:

file_path (Path) –

The path to the JSON file to be created.
schema_path (Path | None, default: None ) –

The path to the JSON schema file to be created. If omitted, the schema file is created next to file_path with a schema suffix.
output_schema (bool, default: False ) –

If True, generates a JSON schema file describing the data structure. The schema will include field descriptions and type information.
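How the three parameters interact can be sketched with a stand-in implementation. Several details are assumptions: the default schema file name (`<name>.schema.json` here), the JSON layout, and the hand-written schema dictionary, which only stands in for whatever schema the real model generates.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

# Hand-written placeholder; the real schema comes from the library's models.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {"sources": {"type": "array", "description": "List of Source objects."}},
}


@dataclass
class Source:
    """Hypothetical stand-in for the real Source model."""

    title: str
    authors: str


@dataclass
class SourceCollection:
    sources: list[Source]

    def to_json(
        self,
        file_path: Path,
        schema_path: Path | None = None,
        output_schema: bool = False,
    ) -> None:
        payload = {"sources": [asdict(source) for source in self.sources]}
        file_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        if output_schema:
            if schema_path is None:
                # Assumed default name: sources.json -> sources.schema.json
                schema_path = file_path.with_suffix(".schema.json")
            schema_path.write_text(json.dumps(EXAMPLE_SCHEMA, indent=2), encoding="utf-8")


collection = SourceCollection([Source(title="Cost baseline", authors="Lab B")])
out_dir = Path(tempfile.mkdtemp())

# Data only: no schema file is written when output_schema is False.
collection.to_json(out_dir / "plain.json")

# Data plus schema, using the assumed default schema location.
collection.to_json(out_dir / "sources.json", output_schema=True)

written = json.loads((out_dir / "sources.json").read_text(encoding="utf-8"))
```
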
