SourceCollection
Bases: BaseModel
Represent a collection of sources.
Attributes:
- sources – List of Source objects.
Methods:
- __iter__ – Return an iterator over the list of Source objects.
- __len__ – Return the number of sources in this collection.
- __str__ – Return a string representation of the SourceCollection.
- from_json – Import the SourceCollection from a JSON file.
- get – Filter sources based on regex patterns for non-optional attributes.
- retrieve_all_from_wayback – Download archived files for all sources in the collection using retrieve_from_wayback.
- to_csv – Export the SourceCollection to a CSV file.
- to_dataframe – Convert the SourceCollection to a pandas DataFrame.
- to_json – Export the SourceCollection to a JSON file, together with a data schema.
sources
instance-attribute
sources: Annotated[list[Source], Field(description='List of Source objects.')]
__iter__
__iter__() -> Iterator[Source]
Return an iterator over the list of Source objects.
Returns:
- Iterator[Source] – An iterator over the Source objects contained in the collection.
__len__
__len__() -> int
Return the number of sources in this collection.
Returns:
- int – The number of Source objects in the sources list.
__str__
__str__() -> str
Return a string representation of the SourceCollection.
Returns:
- str – A string representation of the SourceCollection, showing the number of sources.
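The three special methods above make the collection behave like a sized, iterable container. A minimal sketch of that behavior follows, assuming plain dataclasses as hypothetical stand-ins for the real `Source` and `SourceCollection` models (the real class derives from `BaseModel`, and the exact wording of its string form is not documented here):

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Source:
    """Hypothetical stand-in for the real Source model."""

    title: str
    authors: str


@dataclass
class SourceCollection:
    """Stand-in collection; the documented class is a BaseModel subclass."""

    sources: list[Source] = field(default_factory=list)

    def __iter__(self) -> Iterator[Source]:
        # Delegate iteration to the underlying list of sources.
        return iter(self.sources)

    def __len__(self) -> int:
        # The length of the collection is the length of the sources list.
        return len(self.sources)

    def __str__(self) -> str:
        # Assumed format: the real string only needs to show the count.
        return f"SourceCollection with {len(self.sources)} sources"


collection = SourceCollection([Source("Report A", "Alice"), Source("Report B", "Bob")])
titles = [source.title for source in collection]  # uses __iter__
count = len(collection)                           # uses __len__
summary = str(collection)                         # uses __str__
```

With these methods in place, a collection can be used directly in `for` loops, comprehensions, and `len()` calls without reaching into `sources`.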
from_json
classmethod
from_json(file_path: Path | str) -> Self
Import the SourceCollection from a JSON file.
Parameters:
- file_path (Path | str) – The path to the JSON file to be imported.
get
get(title: str, authors: str) -> Self
Filter sources based on regex patterns for non-optional attributes.
Parameters:
- title (str) – Regex pattern to filter titles.
- authors (str) – Regex pattern to filter authors.
Returns:
- SourceCollection – A new SourceCollection with filtered sources.
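The regex filtering described for `get` can be sketched as follows. This is an illustration, not the library's implementation: `filter_sources` and the dataclass `Source` are hypothetical stand-ins, and the sketch assumes that both patterns must match, that matching uses `re.search` (match anywhere in the string), and that it is case-sensitive. The real method may differ on any of these points.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for the real Source model."""

    title: str
    authors: str


def filter_sources(sources: list[Source], title: str, authors: str) -> list[Source]:
    """Keep sources whose title AND authors match the given patterns (assumed semantics)."""
    title_re = re.compile(title)
    authors_re = re.compile(authors)
    return [
        source
        for source in sources
        if title_re.search(source.title) and authors_re.search(source.authors)
    ]


sources = [
    Source("Technology Data 2024", "Danish Energy Agency"),
    Source("Cost Report", "NREL"),
    Source("Technology Catalogue", "NREL"),
]

# Titles starting with "Technology", authored by NREL.
matches = filter_sources(sources, r"^Technology", r"NREL")
```

Passing a permissive pattern such as `r".*"` for one argument leaves that attribute unconstrained, which is a simple way to filter on the other attribute alone.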
retrieve_all_from_wayback
retrieve_all_from_wayback(download_directory: Path) -> list[Path | None]
Download archived files for all sources in the collection using retrieve_from_wayback.
Parameters:
- download_directory (Path) – The base directory where all files will be saved.
Returns:
- list[Path | None] – List of paths where each file was stored, or None if download failed for a source.
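Because the return value mixes paths with `None`, callers usually want to separate successful downloads from failures. The sketch below shows one way to do that. It assumes the returned list is in the same order as the sources in the collection, which the description suggests but does not state. `split_results` is a hypothetical helper, and the `results` list contains made-up placeholder values rather than real output.

```python
from __future__ import annotations

from pathlib import Path


def split_results(
    titles: list[str], results: list[Path | None]
) -> tuple[dict[str, Path], list[str]]:
    """Pair each source title with its result and separate successes from failures."""
    saved: dict[str, Path] = {}
    failed: list[str] = []
    for title, path in zip(titles, results):
        if path is None:
            # None marks a source whose download failed.
            failed.append(title)
        else:
            saved[title] = path
    return saved, failed


# Placeholder data standing in for a collection and the list it returned.
titles = ["Report A", "Report B", "Report C"]
results = [Path("downloads/report_a.pdf"), None, Path("downloads/report_c.pdf")]

saved, failed = split_results(titles, results)
```

Keeping the failed titles makes it easy to retry or report only the sources that could not be retrieved.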
to_csv
to_csv(**kwargs: Path | str | bool) -> None
Export the SourceCollection to a CSV file.
Parameters:
- **kwargs (dict, default: {}) – Additional keyword arguments passed to pandas.DataFrame.to_csv(). Common options include:
    - path_or_buf (str, pathlib.Path, or file-like object, optional) – File path or object. If None, the result is returned as a string. Default is None.
    - sep (str) – Field delimiter for the output file; a string of length 1. Default is ','.
    - index (bool) – Write row names (index). Default is True.
    - encoding (str) – The encoding to use in the output file. Default is 'utf-8'.
Notes
The method converts the collection to a pandas DataFrame using
self.to_dataframe() and then writes it to a CSV file using the provided
kwargs.
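The keyword forwarding described in the note can be sketched with pandas directly. `export_csv` is a hypothetical helper that mirrors the pattern of passing `**kwargs` straight to `pandas.DataFrame.to_csv()`; unlike the documented method, it returns the result so the effect of the options is visible. The DataFrame stands in for the output of `to_dataframe()`, and its columns are assumptions.

```python
from __future__ import annotations

import pandas as pd


def export_csv(frame: pd.DataFrame, **kwargs) -> str | None:
    """Forward all keyword arguments unchanged to DataFrame.to_csv()."""
    return frame.to_csv(**kwargs)


# Stand-in for the DataFrame produced by to_dataframe(); column names are assumed.
frame = pd.DataFrame(
    {"title": ["Report A", "Report B"], "authors": ["Alice", "Bob"]}
)

# With no path_or_buf, pandas returns the CSV content as a string.
csv_text = export_csv(frame, sep=";", index=False)
csv_lines = csv_text.splitlines()
```

Passing `index=False` is a common choice here, since the default `index=True` writes the row labels as an extra first column.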
to_dataframe
to_dataframe() -> DataFrame
Convert the SourceCollection to a pandas DataFrame.
Returns:
- DataFrame – A DataFrame containing the source data.
to_json
to_json(file_path: Path, schema_path: Path | None = None, output_schema: bool = False) -> None
Export the SourceCollection to a JSON file, together with a data schema.
Parameters:
- file_path (Path) – The path to the JSON file to be created.
- schema_path (Path, default: None) – The path to the JSON schema file to be created. By default, it is created with a schema suffix next to file_path.
- output_schema (bool, default: False) – If True, generates a JSON schema file describing the data structure. The schema will include field descriptions and type information.