
SourceCollection

Bases: BaseModel

Represent a collection of sources.

Attributes:

  • sources (list[Source]) –

    List of Source objects.

Methods:

  • __iter__

    Return an iterator over the list of Source objects.

  • __len__

    Return the number of sources in this collection.

  • __str__

    Return a string representation of the SourceCollection.

  • from_json

    Import the SourceCollection from a JSON file.

  • get

    Filter sources based on regex patterns for non-optional attributes.

  • retrieve_all_from_wayback

    Download archived files for all sources in the collection using retrieve_from_wayback.

  • to_csv

    Export the SourceCollection to a CSV file.

  • to_dataframe

    Convert the SourceCollection to a pandas DataFrame.

  • to_json

    Export the SourceCollection to a JSON file, optionally together with a JSON schema.

sources instance-attribute

sources: Annotated[list[Source], Field(description='List of Source objects.')]

__iter__

__iter__() -> Iterator[Source]

Return an iterator over the list of Source objects.

Returns:

  • Iterator[Source]

    An iterator over the Source objects contained in the collection.

__len__

__len__() -> int

Return the number of sources in this collection.

Returns:

  • int

    The number of Source objects in the sources list.

__str__

__str__() -> str

Return a string representation of the SourceCollection.

Returns:

  • str

    A string representation of the SourceCollection, showing the number of sources.
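The three special methods above let a collection be used like a built-in container. The following is a minimal sketch using dataclass stand-ins rather than the real pydantic models: the `Source` fields and the exact text returned by `__str__` are assumptions made for illustration only.

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical stand-in; the real Source model has its own fields."""

    title: str
    authors: str


@dataclass
class SourceCollection:
    """Hypothetical stand-in that mirrors the three special methods."""

    sources: list[Source] = field(default_factory=list)

    def __iter__(self) -> Iterator[Source]:
        # Delegate iteration to the underlying list.
        return iter(self.sources)

    def __len__(self) -> int:
        return len(self.sources)

    def __str__(self) -> str:
        # Assumed format; the real representation may be worded differently.
        return f"SourceCollection with {len(self)} sources"


collection = SourceCollection([Source("A", "X"), Source("B", "Y")])
titles = [source.title for source in collection]  # uses __iter__
count = len(collection)                           # uses __len__
text = str(collection)                            # uses __str__
```

Defining `__iter__` and `__len__` is what allows `for source in collection` and `len(collection)` to work without reaching into `collection.sources` directly.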

from_json classmethod

from_json(file_path: Path | str) -> Self

Import the SourceCollection from a JSON file.

Parameters:

  • file_path (Path | str) –

    The path to the JSON file to be imported.

get

get(title: str, authors: str) -> Self

Filter sources based on regex patterns for non-optional attributes.

Parameters:

  • title (str) –

    Regex pattern to filter titles.

  • authors (str) –

    Regex pattern to filter authors.

Returns:

  • Self

    A SourceCollection containing the sources that match the given patterns.
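Regex-based filtering of this kind can be sketched with the standard `re` module. The helper `filter_sources`, the stand-in `Source` dataclass, the sample records, the use of `re.search` (match anywhere in the string), and the requirement that both patterns match are all assumptions for illustration; the actual behaviour of `get` may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in with only the two filtered attributes."""

    title: str
    authors: str


def filter_sources(sources: list[Source], title: str, authors: str) -> list[Source]:
    """Keep the sources whose title AND authors match the given patterns."""
    title_re = re.compile(title)
    authors_re = re.compile(authors)
    return [
        source
        for source in sources
        if title_re.search(source.title) and authors_re.search(source.authors)
    ]


# Placeholder records, not real references.
sources = [
    Source("Energy outlook", "Doe, J."),
    Source("Energy statistics", "Roe, R."),
    Source("Transport review", "Doe, J."),
]
matches = filter_sources(sources, title=r"^Energy", authors=r"Doe")
matched_titles = [source.title for source in matches]
```

Passing `".*"` for one of the patterns leaves that attribute unfiltered in this sketch.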

retrieve_all_from_wayback

retrieve_all_from_wayback(download_directory: Path) -> list[Path | None]

Download archived files for all sources in the collection using retrieve_from_wayback.

Parameters:

  • download_directory (Path) –

    The base directory where all files will be saved.

Returns:

  • list[Path | None]

    List of paths where each file was stored, or None if download failed for a source.
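The return shape, one entry per source with None marking a failed download, can be sketched as a loop that catches per-item errors. Everything here is hypothetical: `retrieve_all`, the injected `fetch` callable, and `fake_fetch` stand in for the real per-source retrieve_from_wayback call, and no network access takes place.

```python
import tempfile
from collections.abc import Callable
from pathlib import Path
from typing import Optional


def retrieve_all(
    urls: list[str],
    download_directory: Path,
    fetch: Callable[[str, Path], Path],
) -> list[Optional[Path]]:
    """Call fetch for every URL; record None where a download fails."""
    download_directory.mkdir(parents=True, exist_ok=True)
    results: list[Optional[Path]] = []
    for url in urls:
        try:
            results.append(fetch(url, download_directory))
        except OSError:
            # Assumption: a failed download is reported as an OSError.
            results.append(None)
    return results


def fake_fetch(url: str, directory: Path) -> Path:
    """Hypothetical downloader that writes a placeholder file."""
    if "missing" in url:
        raise OSError(f"no archived copy for {url}")
    target = directory / url.rsplit("/", 1)[-1]
    target.write_text("archived content", encoding="utf-8")
    return target


base = Path(tempfile.mkdtemp())
results = retrieve_all(
    ["https://example.org/report.pdf", "https://example.org/missing.pdf"],
    base,
    fake_fetch,
)
```

Because the result list keeps the same order and length as the input, a caller can pair each source with its outcome using `zip`.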

to_csv

to_csv(**kwargs: Path | str | bool) -> None

Export the SourceCollection to a CSV file.

Parameters:

  • **kwargs (dict, default: {} ) –

    Additional keyword arguments passed to pandas.DataFrame.to_csv(). Common options include:

      • path_or_buf (str, pathlib.Path, or file-like object, optional) – File path or object. If None, the result is returned as a string. Default is None.

      • sep (str) – Field delimiter for the output file; a string of length 1. Default is ','.

      • index (bool) – Write row names (index). Default is True.

      • encoding (str) – Encoding to use in the output file. Default is 'utf-8'.

Notes

The method converts the collection to a pandas DataFrame using self.to_dataframe() and then writes it to a CSV file using the provided kwargs.
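Since the keyword arguments are forwarded to pandas, their effect can be tried directly on a DataFrame. This sketch assumes pandas is installed; the wrapper `export_csv` and the two-column frame are hypothetical stand-ins for the method and for the output of to_dataframe().

```python
import pandas as pd


def export_csv(frame: pd.DataFrame, **kwargs):
    """Hypothetical wrapper that forwards every keyword to DataFrame.to_csv."""
    return frame.to_csv(**kwargs)


# Placeholder data standing in for the converted collection.
frame = pd.DataFrame({"title": ["A", "B"], "authors": ["X", "Y"]})

# With path_or_buf=None pandas returns the CSV text instead of writing a file.
csv_text = export_csv(frame, path_or_buf=None, sep=";", index=False)
lines = csv_text.splitlines()
```

Passing `index=False` drops the row-number column, which is usually the desired layout when the index carries no meaning.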

to_dataframe

to_dataframe() -> DataFrame

Convert the SourceCollection to a pandas DataFrame.

Returns:

  • DataFrame

    A DataFrame containing the source data.

to_json

to_json(file_path: Path, schema_path: Path | None = None, output_schema: bool = False) -> None

Export the SourceCollection to a JSON file, optionally together with a JSON schema.

Parameters:

  • file_path (Path) –

    The path to the JSON file to be created.

  • schema_path (Path | None, default: None ) –

    The path to the JSON schema file to be created. By default, the schema file is created next to file_path with a schema suffix.

  • output_schema (bool, default: False ) –

    If True, generates a JSON schema file describing the data structure. The schema will include field descriptions and type information.
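How the three parameters interact can be sketched with the standard `json` module. The function `export_json`, the hand-written schema, and in particular the default schema file name (`<stem>.schema.json`) are assumptions for illustration; the real method may name the file differently and derives its schema from the model.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def export_json(
    records: list[dict],
    file_path: Path,
    schema_path: Optional[Path] = None,
    output_schema: bool = False,
) -> None:
    """Write the records as JSON and, on request, a schema file beside them."""
    file_path.write_text(json.dumps({"sources": records}, indent=2), encoding="utf-8")
    if not output_schema:
        return
    if schema_path is None:
        # Assumed naming convention, not confirmed by the documentation.
        schema_path = file_path.with_name(f"{file_path.stem}.schema.json")
    schema = {
        "type": "object",
        "properties": {
            "sources": {"type": "array", "description": "List of Source objects."}
        },
    }
    schema_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")


out_dir = Path(tempfile.mkdtemp())
data_file = out_dir / "sources.json"
export_json([{"title": "A", "authors": "X"}], data_file, output_schema=True)

loaded = json.loads(data_file.read_text(encoding="utf-8"))
schema_file = out_dir / "sources.schema.json"
```

In this sketch the schema is only written when `output_schema` is True, so a call with the defaults produces the data file alone.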