SourceCollection Class Documentation¶
Overview¶
The SourceCollection class in technologydata represents a collection of Source objects, providing tools for filtering, exporting, archiving, and data conversion. It is designed to manage multiple bibliographic or web sources, supporting reproducibility and provenance tracking for technology datasets.
Features¶
- Collection Management: Stores and iterates over multiple
Sourceobjects. - Filtering: Supports regex-based filtering by title and authors.
- Archiving: Downloads archived files for all sources using the Wayback Machine.
- Data Export: Converts the collection to pandas DataFrame, CSV, and JSON formats.
- Schema Export: Exports a JSON schema describing the collection structure.
- Integration: Designed for use with technology parameters and datasets.
Usage Examples¶
Creating a SourceCollection¶
from technologydata.source import Source
from technologydata.source_collection import SourceCollection
src1 = Source(title="Source One", authors="Author A", url="http://example.com/1")
src2 = Source(title="Source Two", authors="Author B", url="http://example.com/2")
collection = SourceCollection(sources=[src1, src2])
Filtering Sources¶
from technologydata.source import Source
from technologydata.source_collection import SourceCollection
src1 = Source(title="Source One", authors="Author A", url="http://example.com/1")
src2 = Source(title="Source Two", authors="Author B", url="http://example.com/2")
collection = SourceCollection(sources=[src1, src2])
filtered = collection.get(title="Source One", authors="Author A")
print(filtered) # SourceCollection with sources matching the patterns
Exporting to CSV¶
from technologydata.source import Source
from technologydata.source_collection import SourceCollection
src1 = Source(title="Source One", authors="Author A", url="http://example.com/1")
src2 = Source(title="Source Two", authors="Author B", url="http://example.com/2")
collection = SourceCollection(sources=[src1, src2])
collection.to_csv(path_or_buf="sources.csv")
Exporting to JSON and Schema¶
import pathlib
from technologydata.source import Source
from technologydata.source_collection import SourceCollection
src1 = Source(title="Source One", authors="Author A", url="http://example.com/1")
src2 = Source(title="Source Two", authors="Author B", url="http://example.com/2")
collection = SourceCollection(sources=[src1, src2])
collection.to_json(file_path=pathlib.Path("sources.json"))
Downloading Archived Files¶
import pathlib
from technologydata.source import Source
from technologydata.source_collection import SourceCollection
src1 = Source(title="Source One", authors="Author A", url="http://example.com/1")
src2 = Source(title="Source Two", authors="Author B", url="http://example.com/2")
collection = SourceCollection(sources=[src1, src2])
downloaded_paths = collection.retrieve_all_from_wayback(pathlib.Path("downloads/"))
print(downloaded_paths)
API Reference¶
Please refer to the API documentation for detailed information on the SourceCollection class methods and attributes.
Notes¶
- Filtering: Regex patterns are case-insensitive and applied to non-optional attributes.
- Archiving: Download failures return
Nonefor the corresponding source. - Export: Default CSV export uses UTF-8 encoding and quotes all fields.
- Schema: JSON schema is generated automatically and includes field descriptions.
- Type Checking: The class uses Pydantic for validation and type enforcement.