BONSAI classifications
The BONSAI classifications Python package is a part of the Getting The Data Right project.
This package creates all classifications used in the Bonsai database and stores them as CSV files. The CSV files can be found under /src/classifications/data. The files are organised according to the Bonsai ontology, which gives the following folders:
activitytype (includes: industry_activity, government_activity, treatment_activity, non_profit_institution_serving_household, household_production, household_consumption, market_activity, natural_activity, auxiliary_production_activity, change_in_stock_activity, other_activity)
flowobject (includes: industry_product, material_for_treatment, market_product, government_product, household_product, needs_satisfaction, direct_physical_change, environmental_flow, economic_flow, social_flow)
location
time
Since the Bonsai ontology does not cover all required topics, additional folders are added:
dataquality
uncertainty
Comprehensive documentation of the classification package is available here
Format
The CSV files (tables) of each folder (datapackage) are organised in tabular format. Each of the folders mentioned above represents a valid dataio.datapackage created with the Python package dataio. The following types of tables are used, each with its own prefix:
tree table: tree_
concordance table: conc_
dimension table: dim_
pairwise concordance table: concpair_
tree table
Tree tables are used for classifications which have a tree structure, meaning that the classification is structured hierarchically with multiple levels. The classification starts with broad categories at the top level and then branches out into more specific subcategories as you move down the hierarchy.
The following column names are used:
code: code of the item
parent_code: code of the item's parent
name: name of the item
level: the item's level in the tree structure (from 0 to n)
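As a minimal sketch of this column layout, the following reads a small tree table from CSV text and derives each item's level from its parent_code. The codes and names are invented for illustration and are not taken from the Bonsai classification, and the helper compute_levels is hypothetical rather than part of the package.

```python
import csv
import io

# Hypothetical tree table; the codes and names are invented for illustration.
TREE_CSV = """code,parent_code,name
A,,all activities
A1,A,agriculture
A1a,A1,crop production
A2,A,manufacturing
"""


def compute_levels(rows):
    """Return {code: level}, where root items (empty parent_code) have level 0."""
    parent_of = {row["code"]: row["parent_code"] for row in rows}
    levels = {}
    for code in parent_of:
        level, current = 0, code
        # Walk up the parent chain until a root item is reached.
        while parent_of[current]:
            current = parent_of[current]
            level += 1
        levels[code] = level
    return levels


rows = list(csv.DictReader(io.StringIO(TREE_CSV)))
levels = compute_levels(rows)
for row in rows:
    print(row["code"], row["parent_code"] or "-", levels[row["code"]])
```

The sketch assumes that every parent_code also appears as a code in the same table and that the table contains no cycles.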
concordance table
A concordance table is used to establish equivalences or relationships between different classification systems. It provides mappings between codes of a classification system and codes from another classification system. A relationship between codes can have four different types:
one-to-one (1:1) correspondence: In a one-to-one correspondence, each category or code in one classification system is mapped to exactly one category or code in another classification system, and vice versa. This type of mapping implies a direct and unambiguous correspondence between the two systems. The SKOS URI is http://www.w3.org/2004/02/skos/core#exactMatch
one-to-many (1:M) correspondence: In a one-to-many correspondence, a category or code in one classification system is mapped to multiple categories or codes in another classification system. However, each category or code in the second system is only mapped to one category or code in the first system. This type of mapping implies that one category or code in the first system may encompass multiple categories or codes in the second system. The SKOS URI is http://www.w3.org/2004/02/skos/core#narrowMatch . Stating <A> skos:narrowMatch <B> means “B is narrower than A”.
many-to-one (M:1) correspondence: In a many-to-one correspondence, multiple categories or codes in one classification system are mapped to a single category or code in another classification system. However, each category or code in the first system is only mapped to one category or code in the second system. This type of mapping implies that multiple categories or codes in the first system are aggregated or collapsed into a single category or code in the second system. The SKOS URI is http://www.w3.org/2004/02/skos/core#broadMatch . Stating <A> skos:broadMatch <B> means “B is broader than A”.
many-to-many (M:M) correspondence: In a many-to-many correspondence, multiple categories or codes in one classification system are mapped to multiple categories or codes in another classification system. This type of mapping indicates complex relationships where neither a straightforward one-to-one correspondence nor a parent-child relationship exists. The SKOS URI is http://www.w3.org/2004/02/skos/core#relatedMatch
The following column names are used:
<tree_classification_A>: code of classification A
<tree_classification_B>: code of classification B which is mapped to the code of classification A
comment: comment on the type of concordance
skos_uri: SKOS URI
The requirements for these table types are specified here.
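The four mapping types can be derived by counting how often each code occurs on either side of the concordance. The sketch below illustrates that idea in plain Python; it is not the package's implementation, the function name mapping_types and the example codes are hypothetical, and it assumes the list contains no duplicate pairs.

```python
from collections import Counter

SKOS = "http://www.w3.org/2004/02/skos/core#"


def mapping_types(pairs):
    """Label each (code_a, code_b) pair with a mapping type and SKOS URI.

    Assumes the pairs are unique; duplicates would inflate the counts.
    """
    count_a = Counter(a for a, _ in pairs)  # number of B codes per A code
    count_b = Counter(b for _, b in pairs)  # number of A codes per B code
    result = []
    for a, b in pairs:
        a_many, b_many = count_a[a] > 1, count_b[b] > 1
        if not a_many and not b_many:
            label, uri = "one-to-one", SKOS + "exactMatch"
        elif a_many and not b_many:
            label, uri = "one-to-many", SKOS + "narrowMatch"
        elif not a_many and b_many:
            label, uri = "many-to-one", SKOS + "broadMatch"
        else:
            label, uri = "many-to-many", SKOS + "relatedMatch"
        result.append((a, b, label, uri))
    return result


# Hypothetical codes of two classifications A and B.
pairs = [
    ("01", "X"),                # 1:1
    ("02", "Y1"), ("02", "Y2"),  # 1:M
    ("03", "Z"), ("04", "Z"),    # M:1
    ("05", "W1"), ("05", "W2"), ("06", "W1"), ("06", "W2"),  # M:M
]
typed = mapping_types(pairs)
for row in typed:
    print(row)
```

Each pair is judged only by the counts of its own two codes, which is a simplification; mixed cases may need a rule that looks at the whole group of connected codes.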
dimension table
A dimension table is used for classifications which do not have a tree structure.
The following column names are used:
code: code of the item
name: name of the item
description: description of the item
pairwise concordance table (for Bonsai)
This type of concordance table is used to map pairwise codes. For instance, some data providers such as UNdata and IEA use combined codes for an activity (e.g. “production of”, “electricity production by”) and a flowobject (e.g. “coal”) to express a bonsai_activitytype (“A_COAL”, “A_PowC”). In some cases, the conc_ tables for activitytype and flowobject, which map single relations, are not sufficient to create these pairwise concordances, and it is then reasonable to make the pairwise mapping explicit. The mapping relationships between the pairwise codes can be the same as in the conc_ tables.
The following column names are used: activitytype_from, flowobject_from, activitytype_to, flowobject_to, classification_from, classification_to
activitytype_from: code for activitytype of <from> classification
flowobject_from: code for flowobject of <from> classification
activitytype_to: code for the activitytype of <other> classification
flowobject_to: code for the flowobject of <other> classification
classification_from: name of the <from> classification schema
classification_to: name of the <other> classification schema
skos_uri: SKOS URI
comment: comment on the type of concordance
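A pairwise table can be read as a lookup keyed on the combination of activity and flowobject. The following sketch shows why the pair matters: the same flowobject resolves to different activity codes depending on the activity it is combined with. The activity codes follow the example in the text, while the flowobject_to values and classification names are placeholders, and build_pair_lookup is a hypothetical helper.

```python
import csv
import io

# Illustrative rows only. The activity codes follow the example in the text;
# the flowobject_to values and classification names are placeholders.
CONCPAIR_CSV = """activitytype_from,flowobject_from,activitytype_to,flowobject_to,classification_from,classification_to
production of,coal,A_COAL,placeholder_flow_1,example_provider,bonsai
electricity production by,coal,A_PowC,placeholder_flow_2,example_provider,bonsai
"""


def build_pair_lookup(rows):
    """Map (activitytype_from, flowobject_from) to (activitytype_to, flowobject_to)."""
    return {
        (row["activitytype_from"], row["flowobject_from"]): (
            row["activitytype_to"],
            row["flowobject_to"],
        )
        for row in rows
    }


pair_rows = list(csv.DictReader(io.StringIO(CONCPAIR_CSV)))
pair_lookup = build_pair_lookup(pair_rows)
print(pair_lookup[("production of", "coal")][0])
print(pair_lookup[("electricity production by", "coal")][0])
```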
Usage
To use the classifications, you can install the package via pip. Replace <version> with a specific tag or branch name.
pip install git+ssh://git@gitlab.com/bonsamurais/bonsai/util/classifications@<version>
To install from PyPI, run:
pip install bonsai_classifications
All classifications are provided as dataio.datapackage objects, which include the tables as pandas.DataFrame. For example, you can get the classification tree for industry activities of Bonsai as follows:
import classifications
bonsai_tree = classifications.activitytype.datapackage.tree_bonsai
Note
The datapackage object also includes the tables of other classifications.
You can get the concordance tables and external classifications in a similar way, using the datapackage object.
To access trees without hard-coding their name and path, you can use get_tree():
apple_tree = classifications._utils.get_tree("tree_apple_inc.csv", "flowobject")
This method is preferred for classifications that appear both as an activity and as a product classification.
The activities and flowobjects of Bonsai can also be used directly as objects. The following returns the name of the A_Chick activity.
classifications.activitytype.bonsai.A_Chick.name
Special methods
lookup() for searching strings in code names
get_children() to get all codes that have the same parent code
create_conc() to create a concordance table
disaggregate_bonsai() for adding new codes, which disaggregate an existing code
get_bonsai_schemas_mapping() returns a dict that maps Bonsai schemas to Bonsai codes
print_tree() prints the tree structure of a given code
To search for certain keywords in a table, you can use the line of code below. This returns a pandas DataFrame with the rows that have “coal” in the name column. Note that this lookup is case-sensitive.
bonsai_tree.lookup("coal")
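Because the lookup is case-sensitive, a search for “coal” will miss names that start with “Coal”. The sketch below illustrates the difference in plain Python, assuming a simple substring match; it does not call the package, and the rows and the helper find_in_names are hypothetical.

```python
# Hypothetical rows; the codes and names are invented for illustration.
name_rows = [
    {"code": "X1", "name": "coal mining"},
    {"code": "X2", "name": "Coal-fired electricity"},
    {"code": "X3", "name": "wheat growing"},
]


def find_in_names(rows, keyword, case_sensitive=True):
    """Return rows whose 'name' contains keyword as a substring."""
    if case_sensitive:
        return [row for row in rows if keyword in row["name"]]
    needle = keyword.casefold()
    return [row for row in rows if needle in row["name"].casefold()]


exact = find_in_names(name_rows, "coal")
relaxed = find_in_names(name_rows, "coal", case_sensitive=False)
print([row["code"] for row in exact])
print([row["code"] for row in relaxed])
```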
To get all children of a certain code (here for treatment activities in Bonsai), you can use the following method. By setting the option deep=True, you get all descendants. With deep=False you get only the direct children. The option return_parent=True will include the selected parent code. The option exclude_sut_children=True will return only the children that are included by another code in the SUT.
classifications.activitytype.datapackage.tree_bonsai.get_children(parent_code="at", deep=True, return_parent=False, exclude_sut_children=False)
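The effect of deep and return_parent can be sketched on a plain mapping from code to parent_code. This is an illustration of the behaviour described above and not the package's implementation; the function children_of and the child codes are hypothetical, and exclude_sut_children is left out because it needs information about the SUT.

```python
def children_of(parent_of, parent_code, deep=True, return_parent=False):
    """Return child codes of parent_code from a {code: parent_code} mapping."""
    direct = [code for code, parent in parent_of.items() if parent == parent_code]
    found = list(direct)
    if deep:
        # Recurse into each direct child to collect all descendants.
        for child in direct:
            found.extend(children_of(parent_of, child, deep=True))
    return [parent_code] + found if return_parent else found


# Hypothetical tree below the code "at"; the child codes are invented.
parent_of = {"at": "", "at_1": "at", "at_1a": "at_1", "at_2": "at"}

direct_only = children_of(parent_of, "at", deep=False)
descendants = children_of(parent_of, "at", deep=True)
with_parent = children_of(parent_of, "at", deep=True, return_parent=True)
print(direct_only, descendants, with_parent)
```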
The package also helps to create new concordance tables. Given two concordance tables, one mapping codes of classification a to b and the other mapping b to c, you can use the following:
df_1:
a | b
---|---
01.01 | x
… | …
df_2:
b | c
---|---
x | YXDA
… | …
df_3 = classifications.create_conc(df_1, df_2, source="a", target="c", intermediate="b")
df_3:
a | c
---|---
01.01 | YXDA
… | …
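The chaining step amounts to joining the two tables on the intermediate classification. The sketch below shows that logic in plain Python. It is not the implementation of create_conc(); in particular, dropping codes without a counterpart (an inner join) and removing duplicate pairs are assumptions, and the second row of each input is invented.

```python
def chain_concordances(pairs_ab, pairs_bc):
    """Join (a, b) and (b, c) pairs on b and return the unique (a, c) pairs."""
    c_codes_of = {}
    for b, c in pairs_bc:
        c_codes_of.setdefault(b, []).append(c)
    pairs_ac = []
    for a, b in pairs_ab:
        # Codes of b without a counterpart in pairs_bc are dropped (inner join).
        for c in c_codes_of.get(b, []):
            if (a, c) not in pairs_ac:
                pairs_ac.append((a, c))
    return pairs_ac


# First row of each table as in the example above; second row is hypothetical.
pairs_ab = [("01.01", "x"), ("01.02", "y")]
pairs_bc = [("x", "YXDA"), ("y", "ZZZZ")]

pairs_ac = chain_concordances(pairs_ab, pairs_bc)
print(pairs_ac)
```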
To disaggregate existing codes of the Bonsai classification, you can use the disaggregate_bonsai() method. The method is called on the datapackage of the relevant category, e.g. activitytype or flowobject.
To indicate which code you want to disaggregate, you need to provide a dictionary structured like the example below. Each entry under "disaggregations" names the existing Bonsai code as old_code and lists the new codes under new_codes. Each new code has a code, a description, and a mappings dictionary. The mappings dictionary uses the name of another classification scheme (other than Bonsai) as key, and a list of strings as value, which are the codes of the other classification now represented by the new code.
codes = {
    "disaggregations": [
        {
            "old_code": "A_Paper",
            "new_codes": [
                {
                    "code": "New_Paper1",
                    "description": "new paper production 1",
                    "mappings": {"nace_rev2": ["10.02", "01.13"]},
                },
                {
                    "code": "New_Paper2",
                    "description": "new paper production 2",
                    "mappings": {},
                },
            ],
        }
    ]
}
d = classifications.activitytype.datapackage.disaggregate_bonsai(codes)
Get the pandas DataFrames that are modified.
d["tree_bonsai"]
d["conc_bonsai_nace_rev2"]
To use that function from the terminal, execute python disaggregate_bonsai.py <bonsai_category> <path/to/disaggregation.yaml> <directory/for/updated/files>. <bonsai_category> can be, for instance, activitytype or flowobject.
Note
To disaggregate an existing code, you need to provide at least 2 new codes. It is assumed that all entities covered by the new codes are equal to the entities of the existing code.
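The rule of at least 2 new codes can be checked before the dictionary is passed on. The following is a minimal sketch of such a check; check_disaggregation is a hypothetical helper and not part of the package, and it only covers the rule stated in the note.

```python
def check_disaggregation(spec):
    """Return a list of problems found in a disaggregation dictionary."""
    problems = []
    for entry in spec.get("disaggregations", []):
        old_code = entry.get("old_code")
        new_codes = entry.get("new_codes", [])
        # The note above requires at least 2 new codes per existing code.
        if len(new_codes) < 2:
            problems.append(
                f"{old_code}: needs at least 2 new codes, got {len(new_codes)}"
            )
    return problems


valid_spec = {
    "disaggregations": [
        {
            "old_code": "A_Paper",
            "new_codes": [
                {"code": "New_Paper1", "description": "new paper production 1", "mappings": {}},
                {"code": "New_Paper2", "description": "new paper production 2", "mappings": {}},
            ],
        }
    ]
}
# Same structure, but with only one new code, which violates the rule.
invalid_spec = {
    "disaggregations": [
        {"old_code": "A_Paper", "new_codes": valid_spec["disaggregations"][0]["new_codes"][:1]}
    ]
}

print(check_disaggregation(valid_spec))
print(check_disaggregation(invalid_spec))
```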
The print_tree(toplevelcode) method helps to inspect the tree structure for a given code.
classifications.flowobject.datapackage.tree_bonsai.print_tree("C_Wine")
The differentiation between bold and italic text is only relevant for the Bonsai-SUT. Codes written in italics are “not part” of the inspected toplevelcode, since they appear explicitly in the SUTs. Because these codes are listed separately in the SUT, the toplevelcode is defined as “code, excluding the italic children”.
𝐂_𝐖𝐢𝐧𝐞
├── 𝐟𝐢_𝟐𝟒𝟐𝟏𝟏
├── 𝘊_𝘎𝘳𝘢𝘱𝘵
Methods for creating correspondence tables
The following functions take as input a correspondence table (CSV file) and improve it.
add_mapping_comment(): adds the type of correspondence as ‘comment’
update_levels(): adds the ‘level’ to a tree table (CSV)
The function add_mapping_comment() works slightly differently when the table maps to Bonsai codes.
For these tables, the mapping type is decided for each level of the external classification code. This makes it possible to link one Bonsai code to multiple external codes at different levels, which is useful if we don’t know in advance what aggregation level is used in a dataset (that will be loaded in the load_task).