Manual Input USA Parser Documentation¶
Overview¶
The Manual Input USA data parser manual_input_usa.py demonstrates a data-cleaning and transformation pipeline for converting manually curated, USA-specific tabular data into the technologydata schema files technologies.json and sources.json. The parser is implemented in src/technologydata/parsers/manual_input_usa/manual_input_usa.py.
Dataset Description¶
The original dataset is a manually curated CSV file containing USA-specific technology parameters available at this link. The raw source file is included in the repository at src/technologydata/parsers/raw/manual_input_usa.csv.
The dataset is in CSV format and includes a flat table of technology parameters for various energy technologies relevant to the USA context. Columns include technology, parameter, year, value, unit, currency_year, source, further_description, financial_case, and scenario. Rows are individual parameter records (parameter value + unit + context) for technologies with different scenarios and financial cases.
Parser description¶
The parser proceeds through the following steps.
Command line argument parsing¶
Function CommonsParser.parse_input_arguments() defines and parses the command-line arguments:
- --num_digits (int, default 4): number of decimals used when rounding numeric values.
- --store_source (boolean flag, default false): whether to store the source on the Wayback Machine.
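The two options above could be declared with argparse roughly as follows. This is a minimal sketch rather than the actual body of CommonsParser.parse_input_arguments(); the standalone function name, the help strings, and the description text are assumptions made for illustration.

```python
import argparse


def parse_input_arguments(argv=None):
    """Hypothetical stand-in for CommonsParser.parse_input_arguments()."""
    parser = argparse.ArgumentParser(description="Manual Input USA parser (sketch)")
    # Number of decimals used when rounding numeric values.
    parser.add_argument("--num_digits", type=int, default=4,
                        help="number of decimals used when rounding")
    # Boolean flag: absent means False, present means True.
    parser.add_argument("--store_source", action="store_true",
                        help="store the source on the Wayback Machine")
    return parser.parse_args(argv)


args = parse_input_arguments(["--num_digits", "3", "--store_source"])
print(args.num_digits, args.store_source)  # prints: 3 True
```

Passing an explicit argument list, as done here, keeps the sketch testable without touching sys.argv.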
Read the raw data¶
The script reads the raw data from src/technologydata/parsers/raw/manual_input_usa.csv into a pandas DataFrame using pandas.read_csv(..., dtype=str, na_values="None"). All entries are initially read as strings; the value column is then converted to float.
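The reading step can be illustrated on a small in-memory CSV. The two data rows below are invented for the example and are not taken from manual_input_usa.csv; only the read_csv arguments and the float conversion of the value column follow the description above.

```python
import io

import pandas as pd

# Invented sample rows that mimic a subset of the real column layout.
raw_csv = io.StringIO(
    "technology,parameter,year,value,unit,scenario\n"
    "electrolysis,investment,2030,1500.5,USD/kW,Moderate\n"
    "electrolysis,efficiency,2030,0.7,per unit,None\n"
)

# Read every column as a string and treat the literal text "None" as missing.
df = pd.read_csv(raw_csv, dtype=str, na_values="None")

# Only the value column is converted to a numeric type.
df["value"] = df["value"].astype(float)

print(df.dtypes)
```

Reading everything as strings first avoids pandas guessing types for columns such as year or currency_year, which are later used as labels rather than numbers.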
Data cleaning, validation and dealing with missing/null values¶
Data cleaning and validation proceed as follows.
Function extract_units_carriers_heating_value() extracts standardized units, carriers, and heating values from input unit strings. This function maps complex unit representations to simplified unit, carrier, and heating value combinations using a predefined dictionary of special patterns. Examples include:
- USD_2022/MW_FT → unit: USD_2022/MW, carrier: 1/FT, heating_value: 1/LHV
- MWh_H2/MWh_FT → unit: MWh/MWh, carrier: H2/FT, heating_value: LHV
- MWh_el/MWh_FT → unit: MWh/MWh, carrier: el/FT, heating_value: LHV
- t_CO2/MWh_FT → unit: t/MWh, carrier: CO2/FT, heating_value: LHV
- USD_2022/kWh_H2 → unit: USD_2022/kWh, carrier: 1/H2, heating_value: LHV
- USD_2023/t_CO2/h → unit: USD_2023/t/h, carrier: 1/CO2, heating_value: None
- MWh_el/t_CO2 → unit: MWh/t, carrier: el/CO2, heating_value: LHV
- MWh_th/t_CO2 → unit: MWh/t, carrier: thermal/CO2, heating_value: LHV
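A dictionary lookup is one plausible way to implement such a mapping. The sketch below is not the code of extract_units_carriers_heating_value(); the name split_unit_string and the fallback for unlisted units (return the unit unchanged, with no carrier or heating value) are assumptions. The dictionary entries are a subset of the examples listed above.

```python
# Special unit strings mapped to (unit, carrier, heating_value).
SPECIAL_PATTERNS = {
    "USD_2022/MW_FT": ("USD_2022/MW", "1/FT", "1/LHV"),
    "MWh_H2/MWh_FT": ("MWh/MWh", "H2/FT", "LHV"),
    "t_CO2/MWh_FT": ("t/MWh", "CO2/FT", "LHV"),
    "USD_2023/t_CO2/h": ("USD_2023/t/h", "1/CO2", None),
    "MWh_th/t_CO2": ("MWh/t", "thermal/CO2", "LHV"),
}


def split_unit_string(unit):
    """Hypothetical helper: look up a unit string in the special patterns.

    Assumed fallback: units without a special pattern are returned as-is,
    with no carrier and no heating value.
    """
    return SPECIAL_PATTERNS.get(unit, (unit, None, None))


print(split_unit_string("MWh_H2/MWh_FT"))  # ('MWh/MWh', 'H2/FT', 'LHV')
print(split_unit_string("USD/kW"))         # ('USD/kW', None, None)
```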
The parser also fills missing values in the scenario column with "not_available".
The parser applies the following unit conversions:
- Convert per unit to % and multiply the corresponding value by 100.0, rounding to num_digits decimals.
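The scenario fill and the per unit to % conversion might look like this in pandas. It is a sketch on invented rows; the column names follow the dataset description, while the boolean-mask approach is an assumption about how the parser applies the conversion.

```python
import pandas as pd

num_digits = 4  # mirrors the --num_digits default

# Invented rows for illustration only.
df = pd.DataFrame(
    {
        "parameter": ["efficiency", "investment"],
        "value": [0.25, 1500.0],
        "unit": ["per unit", "USD/kW"],
        "scenario": ["Moderate", None],
    }
)

# Missing scenarios get the placeholder string.
df["scenario"] = df["scenario"].fillna("not_available")

# Select the per-unit rows once, before the unit column is modified.
is_per_unit = df["unit"] == "per unit"
df.loc[is_per_unit, "value"] = (df.loc[is_per_unit, "value"] * 100.0).round(num_digits)
df.loc[is_per_unit, "unit"] = "%"

print(df)
```

Computing the mask before rewriting the unit column matters: if the unit were changed first, the value rows could no longer be selected.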
Function Commons.update_unit_with_currency_year(unit, currency_year) appends the currency_year to currency units when present. This is needed because technologydata expects currencies to follow the pattern \b(?P<cu_iso3>[A-Z]{3})_(?P<year>\d{4})\b, for example USD_2022.
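The currency pattern can be exercised directly with Python's re module. The helper below is a hypothetical illustration, not the implementation of Commons.update_unit_with_currency_year(): its name, the fixed tuple of known currency codes, and the choice to leave already-tagged units untouched are all assumptions.

```python
import re

# Pattern quoted in the text: ISO3 currency code, underscore, four-digit year.
CURRENCY_PATTERN = re.compile(r"\b(?P<cu_iso3>[A-Z]{3})_(?P<year>\d{4})\b")

# Assumed set of bare currency codes to look for (hypothetical).
KNOWN_CURRENCIES = ("USD", "EUR")
BARE_CURRENCY = re.compile(r"\b(" + "|".join(KNOWN_CURRENCIES) + r")\b")


def append_currency_year(unit, currency_year):
    """Hypothetical helper: turn e.g. 'USD/kW' into 'USD_2022/kW'."""
    # Nothing to do without a year, or if the unit already carries one.
    if currency_year is None or CURRENCY_PATTERN.search(unit):
        return unit
    return BARE_CURRENCY.sub(lambda m: f"{m.group(1)}_{currency_year}", unit)


match = CURRENCY_PATTERN.search("USD_2022/MW")
print(match.group("cu_iso3"), match.group("year"))  # USD 2022
print(append_currency_year("USD/kW", 2022))          # USD_2022/kW
print(append_currency_year("MWh/MWh", 2022))         # MWh/MWh
```

Because the underscore counts as a word character, the bare-code pattern does not match inside USD_2022, so units that already carry a year are never tagged twice.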
Populate and export the source and technology collections¶
Function build_technology_collection():
- if store_source is set, constructs a Source object for the manual input USA dataset, calls ensure_in_wayback() and writes sources.json; otherwise reads an existing sources.json.
- groups the cleaned DataFrame by scenario, year, technology.
- for each group, builds a dictionary of Parameter objects (each with magnitude, sources, and optionally carrier, heating_value, units, note).
- captures the financial_case value from rows within each group to combine with scenario.
- creates a case value by combining scenario and financial_case in the format "{scenario} - {financial_case}" when financial_case is present; otherwise uses scenario alone.
- creates a Technology object for each group, with name=technology, detailed_technology=technology, year=year, region=USA, case= combined case value, and collects them into a TechnologyCollection object.
- writes the TechnologyCollection object to technologies.json.
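The grouping and case-building logic can be sketched without the technologydata classes. Plain dictionaries stand in for Parameter and Technology here, the rows are invented, and the helper name combine_case is hypothetical; only the grouping keys, the case format, and the fixed Technology fields come from the steps above.

```python
from collections import defaultdict

# Invented cleaned rows; None marks a missing financial_case.
rows = [
    {"scenario": "Moderate", "financial_case": "Market", "year": 2030,
     "technology": "electrolysis", "parameter": "investment",
     "value": 1500.0, "unit": "USD_2022/kW"},
    {"scenario": "Moderate", "financial_case": "Market", "year": 2030,
     "technology": "electrolysis", "parameter": "efficiency",
     "value": 70.0, "unit": "%"},
    {"scenario": "not_available", "financial_case": None, "year": 2030,
     "technology": "direct air capture", "parameter": "investment",
     "value": 4000.0, "unit": "USD_2023/t/h"},
]


def combine_case(scenario, financial_case):
    """Build the case label: '{scenario} - {financial_case}' or scenario alone."""
    return f"{scenario} - {financial_case}" if financial_case else scenario


parameters = defaultdict(dict)  # (scenario, year, technology) -> parameters
cases = {}                      # (scenario, year, technology) -> case label
for row in rows:
    key = (row["scenario"], row["year"], row["technology"])
    # A dict stands in for a Parameter object.
    parameters[key][row["parameter"]] = {"magnitude": row["value"], "units": row["unit"]}
    cases[key] = combine_case(row["scenario"], row["financial_case"])

technologies = []
for (scenario, year, technology), params in parameters.items():
    # A dict stands in for a Technology object.
    technologies.append({
        "name": technology,
        "detailed_technology": technology,
        "year": year,
        "region": "USA",
        "case": cases[(scenario, year, technology)],
        "parameters": params,
    })

for tech in technologies:
    print(tech["name"], "|", tech["case"], "|", sorted(tech["parameters"]))
```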
Running the parser¶
Execution instructions¶
From repository root:
- Basic run: python src/technologydata/parsers/manual_input_usa/manual_input_usa.py
- Example with options: --num_digits 3 --store_source
Outputs¶
The parser generates the following outputs:
- src/technologydata/parsers/manual_input_usa/technologies.json.
- src/technologydata/parsers/manual_input_usa/sources.json.
- Optional schema files moved to src/technologydata/parsers/schemas when --export_schema is used.