Manual Input USA Parser Documentation¶

Overview¶

The Manual Input USA data parser manual_input_usa.py demonstrates a data-cleaning and transformation pipeline for converting manually curated, USA-specific tabular data into the technologydata schema files technologies.json and sources.json. The parser is implemented in src/technologydata/parsers/manual_input_usa/manual_input_usa.py.

Dataset Description¶

The original dataset is a manually curated CSV file containing USA-specific technology parameters available at this link. The raw source file is included in the repository at src/technologydata/parsers/raw/manual_input_usa.csv.

The dataset is in CSV format and includes a flat table of technology parameters for various energy technologies relevant to the USA context. Columns include technology, parameter, year, value, unit, currency_year, source, further_description, financial_case, and scenario. Rows are individual parameter records (parameter value + unit + context) for technologies with different scenarios and financial cases.

Parser description¶

The parser is articulated in the following steps.

Command line argument parsing¶

Function CommonsParser.parse_input_arguments() defines and parses the command-line arguments:

--num_digits (int, default 4) — number of decimals used when rounding numeric values. The default value is 4.
--store_source (boolean flag) — whether to store the source on the Wayback Machine. The default value is false.

Read the raw data¶

The script reads the raw data available at src/technologydata/parsers/raw/manual_input_usa.csv in a pandas dataframe. It uses pandas.read_csv(..., dtype=str, na_values="None"). All entries are handled as strings initially except for the value column which is converted to float.

Data cleaning, validation and dealing with missing/null values¶

The data cleaning and validation happens with the following steps.

Function extract_units_carriers_heating_value() extracts standardized units, carriers, and heating values from input unit strings. This function maps complex unit representations to simplified unit, carrier, and heating value combinations using a predefined dictionary of special patterns. Examples include:

USD_2022/MW_FT → unit: USD_2022/MW, carrier: 1/FT, heating_value: 1/LHV
MWh_H2/MWh_FT → unit: MWh/MWh, carrier: H2/FT, heating_value: LHV
MWh_el/MWh_FT → unit: MWh/MWh, carrier: el/FT, heating_value: LHV
t_CO2/MWh_FT → unit: t/MWh, carrier: CO2/FT, heating_value: LHV
USD_2022/kWh_H2 → unit: USD_2022/kWh, carrier: 1/H2, heating_value: LHV
USD_2023/t_CO2/h → unit: USD_2023/t/h, carrier: 1/CO2, heating_value: None
MWh_el/t_CO2 → unit: MWh/t, carrier: el/CO2, heating_value: LHV
MWh_th/t_CO2 → unit: MWh/t, carrier: thermal/CO2, heating_value: LHV

The parser also fills missing values in the scenario column with "not_available".

The parser applies the following unit conversions:

Convert per unit to % and multiply the corresponding value by 100.0, rounding to num_digits decimals.

Function Commons.update_unit_with_currency_year(unit, currency_year) appends currency_year information to currency units when present. This is because technologydata follows the currency pattern \b(?P<cu_iso3>[A-Z]{3})_(?P<year>\d{4})\b, as for example USD_2022.

Populate and export the source and technology collections¶

Function build_technology_collection():

if store_source is set, constructs a Source object for the manual input USA dataset, calls ensure_in_wayback() and writes sources.json; otherwise reads an existing sources.json.
groups the cleaned DataFrame by scenario, year, technology.
for each group, builds a dictionary of Parameter objects (each with magnitude, sources, and optionally carrier, heating_value, units, note).
captures the financial_case value from rows within each group to combine with scenario.
creates a case value by combining scenario and financial_case in the format "{scenario} - {financial_case}" when financial_case is present; otherwise uses scenario alone.
creates a Technology object for each group, with name = technology, detailed_technology = technology, year = year, region = USA, case = combined case value, and collects them into a TechnologyCollection object.
writes the TechnologyCollection object to a technologies.json.

Running the parser¶

Execution instructions¶

From repository root:

Basic run: python src/technologydata/parsers/manual_input_usa/manual_input_usa.py
Example with options: --num_digits 3 --store_source

Outputs¶

The parser generates the following outputs:

src/technologydata/parsers/manual_input_usa/technologies.json.
src/technologydata/parsers/manual_input_usa/sources.json.
Optional schema files moved to src/technologydata/parsers/schemas when --export_schema is used.