Source Class Documentation¶
Overview¶
The Source class in technologydata represents bibliographic and web sources, supporting metadata, archiving, and retrieval from the Wayback Machine. It is designed to track provenance, ensure reproducibility, and facilitate the management of references for technology parameters and datasets.
Features¶
- Bibliographic Metadata: Stores the title and authors, plus an optional URL, access date, archive URL, and archive date.
- Equality and Hashing: Implements equality and hashing for use in sets and as dictionary keys.
- String Representation: Provides a readable string summary of the source.
- Wayback Machine Archiving: Ensures URLs are archived and retrieves archive URLs and timestamps from the Wayback Machine.
- File Retrieval: Downloads archived files from the Wayback Machine to a specified directory.
- Automatic File Naming: Determines file extension and save path based on content type or URL.
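The automatic file naming described above can be pictured with a small standalone sketch. It is not the library's implementation: the function name `infer_extension` and its fallback order (content type first, then the URL path) are assumptions made for illustration, and only the Python standard library is used.

```python
import mimetypes
import pathlib
from typing import Optional
from urllib.parse import urlparse


def infer_extension(content_type: Optional[str], url: str) -> str:
    """Return a file extension such as '.pdf' (hypothetical helper).

    The content type is tried first; the URL path suffix is the fallback.
    """
    if content_type:
        # Drop parameters such as '; charset=utf-8' before the lookup.
        mime = content_type.split(";")[0].strip().lower()
        ext = mimetypes.guess_extension(mime)
        if ext:
            return ext
    # Fall back to the suffix of the URL path, if it has one.
    suffix = pathlib.PurePosixPath(urlparse(url).path).suffix
    if suffix:
        return suffix.lower()
    # Mirrors the documented behaviour of raising ValueError when
    # no extension can be determined.
    raise ValueError(f"Cannot infer a file extension for {url!r}")


print(infer_extension("application/pdf", "http://example.com/report"))
print(infer_extension(None, "http://example.com/data.xlsx"))
```

Checking the content type before the URL is a design choice in this sketch: a server-declared type is usually more reliable than a URL that may carry no suffix at all.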
Usage Examples¶
Creating a Source¶
from technologydata.source import Source
src = Source(title="Example Source", authors="The Authors", url="http://example.com")
A Source object can also be used to archive and retrieve PDFs or other file formats, such as Excel spreadsheets.
Archiving a URL¶
from technologydata.source import Source
src = Source(title="Example Source", authors="The Authors", url="http://example.com")
src.ensure_in_wayback()
print(src.url_archive)  # Archived URL
print(src.url_date_archive)  # Archive timestamp
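For context on where an archive URL and timestamp can come from: the Wayback Machine offers a public availability endpoint (`https://archive.org/wayback/available?url=...`) that returns JSON describing the closest snapshot. This page does not say whether `ensure_in_wayback` uses that endpoint, so the sketch below is only an illustration. The helper name `parse_availability` is hypothetical, the payload is a hand-written sample in the endpoint's response shape, and treating the timestamp as UTC is an assumption. No network request is made.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def parse_availability(payload: dict) -> Optional[Tuple[str, datetime]]:
    """Extract (archive_url, archive_time) from an availability response.

    Returns None when no snapshot is reported as available.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    # Wayback timestamps use the compact form YYYYMMDDhhmmss.
    # Attaching UTC here is an assumption of this sketch.
    archived_at = datetime.strptime(
        closest["timestamp"], "%Y%m%d%H%M%S"
    ).replace(tzinfo=timezone.utc)
    return closest["url"], archived_at


# Hand-written sample payload, not a live response.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
}

print(parse_availability(sample))
```

An empty `archived_snapshots` object means no snapshot exists yet, which is the case in which a new archive would have to be requested.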
Downloading an Archived File¶
import pathlib
from technologydata.source import Source
src = Source(title="Example Source", authors="The Authors", url="http://example.com")
output_path = src.retrieve_from_wayback(pathlib.Path("downloads/"))
print(output_path)  # Path to downloaded file
Example Usage for an Excel File¶
import pathlib
import pandas as pd
from technologydata.source import Source
# Create a Source for an Excel file
excel_src = Source(title="Example Spreadsheet", authors="The Authors", url="http://example.com/data.xlsx")
# Ensure the URL is archived (creates an archive if missing)
excel_src.ensure_in_wayback()
print("Archive URL:", excel_src.url_archive)
print("Archive timestamp:", excel_src.url_date_archive)
# Retrieve the archived file to `downloads/`
out_path = excel_src.retrieve_from_wayback(pathlib.Path("downloads/"))
print("Downloaded to:", out_path)
# Read the downloaded Excel file with pandas
df = pd.read_excel(out_path)
print(df.head())
API Reference¶
Please refer to the API documentation for detailed information on the Source class methods and attributes.
Notes¶
- Archiving: If the URL is not set, archiving will raise a ValueError.
- File Extensions: File extension is inferred from content type or URL; unsupported types raise a ValueError.
- HTTP Errors: Download and content type retrieval may raise requests.exceptions.RequestException.
- Duplicates: Equality and hashing are based on all attributes; sources with identical metadata are considered equal.
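The effect of attribute-based equality and hashing can be illustrated with a stand-in class. `SourceSketch` is a hypothetical frozen dataclass with only three of the fields listed under Features. It is not the real `Source` class and says nothing about how that class is implemented; it only shows what "identical metadata means equal" implies for sets and dictionaries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceSketch:
    """Hypothetical stand-in: equality and hash cover every field."""

    title: str
    authors: str
    url: Optional[str] = None


a = SourceSketch("Example Source", "The Authors", "http://example.com")
b = SourceSketch("Example Source", "The Authors", "http://example.com")
c = SourceSketch("Example Source", "The Authors")  # same title, no URL

# a and b carry identical metadata, so they collapse to one set entry;
# c differs in one attribute and stays separate.
unique = {a, b, c}
print(a == b, a == c, len(unique))
```

A practical consequence is that two references to the same document compare as different when any single field differs, for example when only one of them records a URL.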