Writing your own kiara module - the basics¶
Preparation¶
Check out the 'kiara getting started guide'¶
If you haven't already, it would make sense for you to go through the kiara getting started guide. This will give you a good overview of the relevant kiara features, and of how the module(s) you are going to write fit in.
Setting up development environment¶
To get going, we need a Python virtual environment in which to develop. We'll be using conda for that here, but this will work for normal virtual environments as well. As a first step, install conda (if you haven't already). Then:
conda create -n my_kiara_module python=3.9
conda activate my_kiara_module
conda install -c conda-forge mamba  # optional, but makes installs much faster; if you skip it, replace 'mamba' with 'conda' below
mamba install -c conda-forge -c dharpa kiara kiara_plugin.core_types kiara_plugin.tabular
Note
For Linux, if you experience errors, you may also have to execute:
mamba update -c conda-forge libstdcxx-ng
After this, the kiara command-line application should be available to you. You can test whether that works, for example, via kiara operation list.
Creating a kiara plugin project¶
For this tutorial, we'll use a project template to create a bare-bones kiara plugin project, which we will augment with our own module(s).
First we need to install the cruft conda package, which we will use to create our project stub:
mamba install -c conda-forge cruft
Now we run cruft against our template git repo. Feel free to change any of the answers to the questions you'll be asked:
cruft create https://github.com/DHARPA-Project/kiara_plugin.develop.git
full_name []: Markus Binsteiner
email []: markus@frkl.io
project_name [my-kiara-plugin]: my-kiara-module
project_slug [my_kiara_module]: my_kiara_module
project_short_description [my-kiara-module]: A kiara plugin project for learning to create kiara modules.
github_user [DHARPA-Project]:
anaconda_user [dharpa]:
This should have created a new folder named kiara_plugin.my_kiara_module. Next, we initialize and install the new plugin Python project into our conda environment:
cd kiara_plugin.my_kiara_module
git init
git checkout -b develop
pip install -e .
Note
TODO: explain what happened here?
Once this is done, you should see a new operation called my_kiara_module.example:
kiara operation list example
╭─ Filtered operations ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ Id Type(s) Description │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ my_kiara_module.example A very simple example module; concatenate two strings. │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Note
The example string token at the end of the above command filters the output to operations that match the token.
This module comes as example code with the project template, and is located in the modules/__init__.py Python file. It only serves as an example and blueprint for your own modules, and you can delete the module class within the file if you wish.
Pre-loading a table dataset¶
In our tutorial we'll create a module to filter a table. In order to do this we'll need to pre-seed our kiara data store with a tabular dataset. Here is the command to run (with the project root as our working directory):
kiara run --save table=journal_nodes_table import.table.from.local_file_path path=examples/data/journals/JournalNodes1902.csv
╭─ Results ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ field data_type value │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ imported_file file Id,Label,JournalType,City,CountryNetworkTime,PresentDayCountry,Latitude,Longitude,Language │
│ 75,Psychiatrische en neurologische bladen,specialized: psychiatry and neurology,Amsterdam,Netherlands,Netherlands,52.366667,4.9,Dutch │
│ 36,The American Journal of Insanity,specialized: psychiatry and neurology,Baltimore,United States,United States,39.289444,-76.615278,English │
│ 208,The American Journal of Psychology,specialized: psychology,Baltimore,United States,United States,39.289444,-76.615278,English │
│ 295,Die Krankenpflege,specialized: therapy,Berlin,German Empire,Germany,52.52,13.405,German │
│ 296,Die deutsche Klinik am Eingange des zwanzigsten Jahrhunderts,general medicine,Berlin,German Empire,Germany,52.52,13.405,German │
│ 300,Therapeutische Monatshefte,specialized: therapy,Berlin,German Empire,Germany,52.52,13.405,German │
│ 1,Allgemeine Zeitschrift für Psychiatrie,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German │
│ 7,Archiv für Psychiatrie und Nervenkrankheiten,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German │
│ 10,Berliner klinische Wochenschrift,general medicine,Berlin,German Empire,Germany,52.52,13.405,German │
│ 13,Charité Annalen,general medicine,Berlin,German Empire,Germany,52.52,13.405,German │
│ 21,Monatsschrift für Psychiatrie und Neurologie,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German │
│ 29,Virchows Archiv,"specialized: anatomy, physiology and pathology",Berlin,German Empire,Germany,52.52,13.405,German │
│ 31,Zeitschrift für pädagogische Psychologie und Pathologie,specialized: psychology and pedagogy,Berlin,German Empire,Germany,52.52,13.405,German │
│ 42,Vierteljahrsschrift für gerichtliche Medizin und öffentliches Sanitätswesen,"specialized: anthropology, criminology and forensics",Berlin,German Empire,Germany,52.52,13.405,German │
│ 47,Centralblatt für Nervenheilkunde und Psychiatrie,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German │
│ 50,Russische medicinische Rundschau,general medicine,Berlin,German Empire,Germany,52.52,13.405,German │
│ 76,Deutsche Aerzte-Zeitung,general medicine,Berlin,German Empire,Germany,52.52,13.405,German │
│ 87,Monatsschrift für Geburtshülfe und Gynäkologie,specialized: gynecology,Berlin,German Empire,Germany,52.52,13.405,German │
│ 108,Archiv für klinische Chirurgie,specialized: surgery,Berlin,German Empire,Germany,52.52,13.405,German │
│ 113,Zeitschrift für klinische Medicin,general medicine,Berlin,German Empire,Germany,52.52,13.405,German │
│ 159,Deutsche militärärztliche Zeitschrift,specialized: military medicine,Berlin,German Empire,Germany,52.52,13.405,German │
│ 162,Jahresbericht über die Leistungen und Fortschritte auf dem Gebiete der Neurologie und Psychiatrie,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German │
│ 192,Ärztliche Sachverständigen-Zeitung,general medicine,Berlin,German Empire,Germany,52.52,13.405,German │
│ 198,Zeitschrift für die Behandlung Schwachsinniger und Epileptischer,specialized: psychiatry and neurology,Berlin,German Empire,Germany,52.52,13.405,German │
│ 258,Der Pfarrbote,news media,Berlin,German Empire,Germany,52.52,13.405,German │
│ 71,Correspondenz-Blatt für Schweizer Aerzte,general medicine,Bern,Switzerland,Switzerland,46.948056,7.4475,German │
│ 6,Archiv für mikroskopische Anatomie,"specialized: anatomy, physiology and pathology",Bonn,German Empire,Germany,50.733333,7.1,German │
│ 203,The Journal of Abnormal Psychology,specialized: psychology,Boston,United States,United States,42.358056,-71.063611,English │
│ 273,"Correspondenz-Blatt der Deutschen Gesellschaft für Anthropologie, Ethnologie und Urgeschichte","specialized: anthropology, criminology and forensics",Braunschweig,German │
│ Empire,Germany,52.266667,10.516667,German │
│ 303,Policlinique de Bruxelles,general medicine,Brussels,Belgium,Belgium,50.85,4.35,French │
│ 306,Annales de la Société Belge de Neurologie,specialized: psychiatry and neurology,Brussels,Belgium,Belgium,50.85,4.35,French │
│ 19,Journal de neurologie,specialized: psychiatry and neurology,Brussels,Belgium,Belgium,50.85,4.35,French │
│ 25,"Revue internationale d'électrothérapie, de physiologie, de médecine, de chirurgie, d'obstétrique, de thérapeutique, de chimie et de pharmacie",general │
│ medicine,Brussels,Belgium,Belgium,50.85,4.35,French │
│ 35,Bulletin de la Société de Médecine Mentale de Belgique,specialized: psychiatry and neurology,Brussels,Belgium,Belgium,50.85,4.35,French │
│ ... │
│ │
│ ... │
│ table table │
│ Id Label JournalType City CountryNetworkTime PresentDayCountry Latitude Longitude Language │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ 75 Psychiatrische en neurologische bladen specialized: psychiatry and neurology Amsterdam Netherlands Netherlands 52.366667 4.9 Dutch │
│ 36 The American Journal of Insanity specialized: psychiatry and neurology Baltimore United States United States 39.289444 -76.615278 English │
│ 208 The American Journal of Psychology specialized: psychology Baltimore United States United States 39.289444 -76.615278 English │
│ 295 Die Krankenpflege specialized: therapy Berlin German Empire Germany 52.52 13.405 German │
│ 296 Die deutsche Klinik am Eingange des zwanzigsten general medicine Berlin German Empire Germany 52.52 13.405 German │
│ 300 Therapeutische Monatshefte specialized: therapy Berlin German Empire Germany 52.52 13.405 German │
│ 1 Allgemeine Zeitschrift für Psychiatrie specialized: psychiatry and neurology Berlin German Empire Germany 52.52 13.405 German │
│ 7 Archiv für Psychiatrie und Nervenkrankheiten specialized: psychiatry and neurology Berlin German Empire Germany 52.52 13.405 German │
│ 10 Berliner klinische Wochenschrift general medicine Berlin German Empire Germany 52.52 13.405 German │
│ 13 Charité Annalen general medicine Berlin German Empire Germany 52.52 13.405 German │
│ 21 Monatsschrift für Psychiatrie und Neurologie specialized: psychiatry and neurology Berlin German Empire Germany 52.52 13.405 German │
│ 29 Virchows Archiv specialized: anatomy, physiology and pathology Berlin German Empire Germany 52.52 13.405 German │
│ 31 Zeitschrift für pädagogische Psychologie und Pat specialized: psychology and pedagogy Berlin German Empire Germany 52.52 13.405 German │
│ 42 Vierteljahrsschrift für gerichtliche Medizin und specialized: anthropology, criminology and forens Berlin German Empire Germany 52.52 13.405 German │
│ 47 Centralblatt für Nervenheilkunde und Psychiatrie specialized: psychiatry and neurology Berlin German Empire Germany 52.52 13.405 German │
│ 50 Russische medicinische Rundschau general medicine Berlin German Empire Germany 52.52 13.405 German │
│ ... ... ... ... ... ... ... ... ... │
│ ... ... ... ... ... ... ... ... ... │
│ 277 L'arte medica general medicine Turin Italy Italy 45.079167 7.676111 Italian │
│ 288 Allgemeine österreichische Gerichts-Zeitung specialized: anthropology, criminology and forens Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 18 Jahrbücher für Psychiatrie specialized: psychiatry and neurology Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 30 Wiener klinische Rundschau general medicine Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 44 Wiener klinische Wochenschrift general medicine Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 45 Wiener medizinische Wochenschrift general medicine Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 72 Wiener medizinische Presse general medicine Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 81 Monatsschrift für Gesundheitspflege general medicine Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 93 Klinisch-therapeutische Wochenschrift general medicine Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 151 Medicinisch-chirurgisches Centralblatt specialized: surgery Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 199 Der Militärazt specialized: military medicine Vienna Austro-Hungarian Empire Austria 48.2 16.366667 German │
│ 261 Медицинская беседа general medicine Voronezh Russian Empire Russia 51.671667 39.210556 Russian │
│ 77 Medycyna general medicine Warsaw Russian Empire Poland 52.233333 21.016667 Polish │
│ 150 Kronika Lekarska general medicine Warsaw Russian Empire Poland 52.233333 21.016667 Polish │
│ 86 Grenzfragen des Nerven- und Seelenlebens specialized: psychiatry and neurology Wiesbaden German Empire Germany 50.0825 8.24 German │
│ 206 Ergebnisse der Allgemeinen Pathologie und Pathol specialized: anatomy, physiology and pathology Wiesbaden German Empire Germany 50.0825 8.24 German │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Stored result values ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ field data type stored id alias(es) │
│ ──────────────────────────────────────────────────────────────────────────────────────── │
│ imported_file file 64dbc562-b5ed-4d09-89aa-d8d7d41bd3b3 │
│ table table f4bda52f-5dc1-4441-adfd-109dbdf357d0 journal_nodes_table │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
This should have created an item with the alias journal_nodes_table in the kiara data store, which you can confirm with kiara data list.
Writing the kiara module¶
Ok, let's get started and create a kiara module that filters a table, using different filter criteria.
Module skeleton¶
In most cases you'd delete the example module mentioned above, and create your module in the Python file where the example module was, or in a new Python file in the "modules" folder. For the purpose of this tutorial, we can just leave the example module in place, because it can serve as a quick reference for our own module. Use the editor of your choice, and paste the following text below the existing code in modules/__init__.py:
from kiara import KiaraModule


class TutorialModule(KiaraModule):

    def create_inputs_schema(self):
        return {
            "table_input": {
                "type": "table"
            }
        }

    def create_outputs_schema(self):
        return {
            "table_output": {
                "type": "table"
            }
        }

    def process(self, inputs, outputs) -> None:
        pass
This module skeleton describes a kiara module that takes a dataset of type table as input (using table_input as the input field name), and produces another table dataset as output (accordingly, using table_output as the output field name). For your own modules, you'd probably use the field name table for both input and output, but in this tutorial we'll use the longer forms to avoid any confusion.
On the next kiara run, the new module should be picked up by the operation list command:
kiara operation list tutorial_module
╭─ Filtered operations ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ Id Type(s) Description │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ kiara_plugin.my_kiara_module.my_kiara_module.tutorial_module -- n/a -- │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
The id of the module was autogenerated from the full Python path of its class: kiara_plugin.my_kiara_module.my_kiara_module.tutorial_module
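The last segment of the autogenerated id looks like a snake_case form of the class name (TutorialModule becomes tutorial_module). The following is a minimal sketch of such a conversion. It only illustrates the naming pattern: the helper name camel_to_snake is hypothetical, and kiara's actual implementation may well differ.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a CamelCase class name to snake_case (illustrative helper, not kiara API)."""
    # Insert an underscore before every capital letter that is not at the start,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(camel_to_snake("TutorialModule"))
```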
Module id and description¶
In most cases, we don't want such a long and unwieldy module name. We can assign our own custom, meaningful id by setting the _module_type_name class attribute. In addition, we will want to add some documentation about the module and its functionality that is displayed to the user. For this, we use a normal Python docstring in the class body. For the purpose of this tutorial, we'll only add a single sentence, but in most cases you'll want to have a multi-paragraph markdown text here. Taking all that into account, edit the module code to include:
...
...
class TutorialModule(KiaraModule):
    """Filter a table."""

    _module_type_name = "filter.table"

    def create_inputs_schema(self):
        return {
    ...
    ...
The output for our new module in the operation list is much prettier now:
kiara operation list filter
╭─ Filtered operations ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ Id Type(s) Description │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ filter.table Filter a table. │
│ string_filter.tokens filter -- n/a -- │
│ table_filter.drop_columns filter -- n/a -- │
│ table_filter.select_columns filter -- n/a -- │
│ table_filter.select_rows filter -- n/a -- │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
We can also ask kiara what it knows about the operation itself:
kiara operation explain filter.table
╭─ Operation: filter.table ────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ Documentation Filter a table. │
│ │
│ Inputs │
│ field name type description Required Default │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ table_input table -- n/a -- yes -- no default -- │
│ │
│ │
│ Outputs │
│ field name type description │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ table_output table -- n/a -- │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Input/output field documentation¶
As you can see in the explain output above, the information presented to the user is still a bit sparse. In most cases, we'll want to describe the input(s) the user is supposed to provide, as well as what the outputs actually mean. In both cases, we can add a doc attribute to each input and output field.
...
...
    def create_inputs_schema(self):
        return {
            "table_input": {
                "type": "table",
                "doc": "The table to filter."
            }
        }

    def create_outputs_schema(self):
        return {
            "table_output": {
                "type": "table",
                "doc": "The filtered table."
            }
        }
...
...
Run the explain command again to check what kiara thinks of our module now:
kiara operation explain filter.table
╭─ Operation: filter.table ──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ Documentation Filter a table. │
│ │
│ Inputs │
│ field name type description Required Default │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ table_input table The table to filter. yes -- no default -- │
│ │
│ │
│ Outputs │
│ field name type description │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ table_output table The filtered table. │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
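To make the effect of the doc attribute concrete, here is a small sketch that turns a schema dict into rows similar to those in the explain output above, falling back to the "-- n/a --" placeholder when no doc is given. The describe_fields helper is hypothetical and is not how kiara renders its output; it only mirrors the behaviour visible in the two explain listings.

```python
MISSING = "-- n/a --"


def describe_fields(schema: dict) -> list:
    """Return (field name, type, description) rows for a schema dict (illustrative only)."""
    rows = []
    for field_name, field_schema in schema.items():
        # Fall back to a placeholder when no 'doc' attribute was provided.
        description = field_schema.get("doc", MISSING)
        rows.append((field_name, field_schema["type"], description))
    return rows


documented = {"table_input": {"type": "table", "doc": "The table to filter."}}
undocumented = {"table_input": {"type": "table"}}

print(describe_fields(documented))
print(describe_fields(undocumented))
```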
Processing the inputs¶
Specifying the inputs (and outputs) is an important part of designing your module. It is basically the module's 'public API', and you want to avoid changing it (too much, or at all) as your module evolves over time. But of course, the actual processing is where the interesting stuff happens. In kiara, that is the process method of every module. The arguments to this method are called inputs and outputs; they are basically dicts that use the field names specified in create_inputs_schema / create_outputs_schema as keys, and Python objects of class Value as values.
One thing to understand is that a Value object is not the same as the actual data. Instead, it's a reference to it (a means to retrieve it), and it also contains metadata about its provenance (pedigree/lineage) and other properties.
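The distinction between a reference and the data itself can be sketched with a small stand-in class. Everything here (ValueSketch, its fields, the load_table function) is hypothetical and only illustrates the idea of "metadata plus a means to retrieve the data"; kiara's real Value class is more involved.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ValueSketch:
    """Hypothetical stand-in for a value wrapper: metadata plus a way to load the data."""

    value_id: str
    data_type: str
    loader: Callable[[], Any]  # only called when the data is actually needed
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def data(self) -> Any:
        # Retrieve the actual data on demand.
        return self.loader()


load_calls = []


def load_table() -> dict:
    # Record every call, so we can see when the data is really retrieved.
    load_calls.append("loaded")
    return {"City": ["Berlin", "Vienna"]}


value = ValueSketch(value_id="example-id", data_type="table", loader=load_table)
print(value.data_type)     # metadata is available without touching the data
print(value.data["City"])  # accessing .data triggers the loader
```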
This is the signature of the process method, including type hints (which we will omit after this):
from kiara.models.values.value import ValueMap, ValueMapWritable

def process(self, inputs: ValueMap, outputs: ValueMapWritable) -> None:
    ...
    ...
The inputs and outputs arguments to the process method are of type ValueMap; the two main methods to access input data are:
inputs.get_value_obj([field_name]): retrieve the (wrapper) Value object for a field
inputs.get_value_data([field_name]): retrieve the data object for a field
In addition, you can retrieve the data object via the value wrapper:
value = inputs.get_value_obj("field_name")
data = value.data
The important methods for setting outputs are:
outputs.set_value(field_name, result_data): set a single output field
outputs.set_values(field_name_1=result_data_1, field_name_2=result_data_2, ...): set multiple result values at once
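While developing, it can help to try out these accessors without a running kiara instance. The sketch below uses hypothetical stand-in classes (FakeValue, FakeInputs, FakeOutputs) that only mimic the method names listed above; they are assumptions for illustration and not part of kiara.

```python
class FakeValue:
    """Hypothetical stand-in for a value wrapper; only exposes `.data`."""

    def __init__(self, data):
        self.data = data


class FakeInputs:
    """Hypothetical stand-in for the `inputs` argument."""

    def __init__(self, **fields):
        self._values = {name: FakeValue(data) for name, data in fields.items()}

    def get_value_obj(self, field_name):
        return self._values[field_name]

    def get_value_data(self, field_name):
        return self._values[field_name].data


class FakeOutputs:
    """Hypothetical stand-in for the `outputs` argument."""

    def __init__(self):
        self.values = {}

    def set_value(self, field_name, data):
        self.values[field_name] = data

    def set_values(self, **fields):
        for name, data in fields.items():
            self.set_value(name, data)


inputs = FakeInputs(table_input={"City": ["Berlin"]})
outputs = FakeOutputs()

# Both access paths lead to the same data object.
same = inputs.get_value_obj("table_input").data is inputs.get_value_data("table_input")
print(same)

# Set one field, then several at once.
outputs.set_value("table_output", inputs.get_value_data("table_input"))
outputs.set_values(row_count=1, note="unfiltered")
print(outputs.values)
```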
All that out of the way, let's get started implementing our table filter. We'll do it in stages, so hopefully we can cover all the important aspects in this tutorial in a way that makes intuitive sense.
To that end, let's write some code that does... nothing. The first iteration of our module will take the input table and immediately set it as output:
def process(self, inputs, outputs):
    table_obj = inputs.get_value_obj("table_input")

    # Some debug output is usually useful while developing, for example:
    print(f"Filter module, table input value: {table_obj}")
    print(f"Table data instance: {table_obj.data}")

    # Pass the input value through unchanged.
    outputs.set_value("table_output", table_obj)
If we run our module in this state, we should see our debug output, as well as the resulting table (which will be the unmodified input):
kiara run filter.table table_input=alias:journal_nodes_table
Filter module, table input value: Value(id=f4bda52f-5dc1-4441-adfd-109dbdf357d0, type=table, status=set, initialized=True optional=False)
Table data instance: KiaraTable(model_id=-- n/a --, category=kiara_table, fields=[data_path])
╭─ Result ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ field data_type value │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ table_output table │
│ Id Label JournalTyp City CountryNe PresentDay Latitude Longitude Language │
│ ─────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ 75 Psychiatri specialize Amsterdam Netherlan Netherland 52.366667 4.9 Dutch │
│ 36 The Americ specialize Baltimore United St United Sta 39.289444 -76.61527 English │
│ 208 The Americ specialize Baltimore United St United Sta 39.289444 -76.61527 English │
│ 295 Die Kranke specialize Berlin German Em Germany 52.52 13.405 German │
│ 296 Die deutsc general me Berlin German Em Germany 52.52 13.405 German │
│ 300 Therapeuti specialize Berlin German Em Germany 52.52 13.405 German │
│ 1 Allgemeine specialize Berlin German Em Germany 52.52 13.405 German │
│ 7 Archiv für specialize Berlin German Em Germany 52.52 13.405 German │
│ 10 Berliner k general me Berlin German Em Germany 52.52 13.405 German │
│ 13 Charité An general me Berlin German Em Germany 52.52 13.405 German │
│ 21 Monatsschr specialize Berlin German Em Germany 52.52 13.405 German │
│ 29 Virchows A specialize Berlin German Em Germany 52.52 13.405 German │
│ 31 Zeitschrif specialize Berlin German Em Germany 52.52 13.405 German │
│ 42 Vierteljah specialize Berlin German Em Germany 52.52 13.405 German │
│ 47 Centralbla specialize Berlin German Em Germany 52.52 13.405 German │
│ 50 Russische general me Berlin German Em Germany 52.52 13.405 German │
│ ... ... ... ... ... ... ... ... ... │
│ ... ... ... ... ... ... ... ... ... │
│ 277 L'arte med general me Turin Italy Italy 45.079167 7.676111 Italian │
│ 288 Allgemeine specialize Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 18 Jahrbücher specialize Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 30 Wiener kli general me Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 44 Wiener kli general me Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 45 Wiener med general me Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 72 Wiener med general me Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 81 Monatsschr general me Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 93 Klinisch-t general me Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 151 Medicinisc specialize Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 199 Der Militä specialize Vienna Austro-Hu Austria 48.2 16.366667 German │
│ 261 Медицинска general me Voronezh Russian E Russia 51.671667 39.210556 Russian │
│ 77 Medycyna general me Warsaw Russian E Poland 52.233333 21.016667 Polish │
│ 150 Kronika Le general me Warsaw Russian E Poland 52.233333 21.016667 Polish │
│ 86 Grenzfrage specialize Wiesbaden German Em Germany 50.0825 8.24 German │
│ 206 Ergebnisse specialize Wiesbaden German Em Germany 50.0825 8.24 German │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Now it's time to drill a bit deeper into our input table, and figure out how to access the information it contains. kiara wraps data that shares some schema/structure into so-called 'data types'. You can access a list of the data types that are available in your current kiara environment with the data-type list sub-command:
kiara data-type list
╭─ Available data types ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ type name type lineage (qualifier) profiles description │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ any -- n/a -- -- n/a -- 'Any' type, the parent type for most other types. │
│ │
│ array -- n/a -- -- n/a -- An array, in most cases used as a column within a table. │
│ │
│ boolean -- n/a -- -- n/a -- A boolean. │
│ │
│ bytes -- n/a -- -- n/a -- An array of bytes. │
│ │
│ database -- n/a -- -- n/a -- A database, containing one or several tables. │
│ │
│ date -- n/a -- -- n/a -- A date. │
│ │
│ dict -- n/a -- -- n/a -- A dictionary. │
│ │
│ file -- n/a -- -- n/a -- A file. │
│ │
│ file_bundle -- n/a -- -- n/a -- A bundle of files (like a folder, zip archive, etc.). │
│ │
│ float -- n/a -- -- n/a -- A float. │
│ │
│ integer -- n/a -- -- n/a -- An integer. │
│ │
│ list -- n/a -- -- n/a -- A list. │
│ │
│ network_data -- n/a -- -- n/a -- Data that can be assembled into a graph. │
│ │
│ string -- n/a -- -- n/a -- A string. │
│ │
│ table -- n/a -- -- n/a -- Tabular data (table, spreadsheet, data_frame, what have you). │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
To find out more about a specific data type, you can use data-type explain:
kiara data-type explain table
╭─ Data type: table ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ type_name table │
│ type_config {} │
│ │
│ ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ │
│ lineage table │
│ any │
│ qualifier profile(s) -- n/a -- │
│ Documentation │
│ Tabular data (table, spreadsheet, data_frame, what have you). │
│ │
│ The table data is organized in sets of columns (arrays of data of the same type), with each column having a │
│ string identifier. │
│ │
│ kiara uses an instance of the [KiaraTable][kiara_plugin.tabular.models.table.KiaraTable] class to manage the │
│ table data, which let's developers access it in different formats (Apache Arrow Table, Pandas dataframe, │
│ Python dict of lists, more to follow...). │
│ │
│ Please consult the API doc of the KiaraTable class for more information about how to access and query the │
│ data: │
│ │
│ • KiaraTable API doc │
│ │
│ Internally, the data is stored in Apache Feather format -- both in memory and on disk when saved, which │
│ enables some advanced usage to preserve memory and compute overhead. │
│ │
│ Author(s) │
│ Markus Binsteiner markus@frkl.io │
│ │
│ Context │
│ Tags tabular │
│ Labels package: kiara_plugin.tabular │
│ References source_repo: https://github.com/DHARPA-Project/kiara_plugin.tabular │
│ documentation: https://DHARPA-Project.github.io/kiara_plugin.tabular/ │
│ │
│ Python class │
│ python_class_name TableType │
│ python_module_name kiara_plugin.tabular.data_types.table │
│ full_name kiara_plugin.tabular.data_types.table.TableType │
│ │
│ Config class │
│ python_class_name DataTypeConfig │
│ python_module_name kiara.data_types │
│ full_name kiara.data_types.DataTypeConfig │
│ │
│ Value class │
│ python_class_name KiaraTable │
│ python_module_name kiara_plugin.tabular.models.table │
│ full_name kiara_plugin.tabular.models.table.KiaraTable │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Reading this, and following some of the included links, shows us that we can retrieve the table data as a Pandas dataframe using the to_pandas() method. As the documentation states, this loads the whole dataset into memory, which is something we should try to avoid, but in a lot of cases (especially if we are dealing with data smaller than hundreds of megabytes) it's a perfectly acceptable approach. So, let's use our existing knowledge of Pandas to retrieve a list of column names from the table the user provided, and print that information debug-style:
def process(self, inputs, outputs) -> None:
    table_obj = inputs.get_value_obj("table_input")
    print(f"Filter module, table input value: {table_obj}")
    print(f"Table data instance: {table_obj.data}")

    pandas_df = table_obj.data.to_pandas()
    print(f"Column names: {pandas_df.columns}")

    outputs.set_value("table_output", table_obj)
Again, let's run and see what's what (this time suppressing the result output we don't need right now, using --output silent):
kiara run --output silent filter.table table_input=alias:journal_nodes_table
Filter module, table input value: Value(id=f4bda52f-5dc1-4441-adfd-109dbdf357d0, type=table, status=set, initialized=True optional=False)
Table data instance: KiaraTable(model_id=-- n/a --, category=kiara_table, fields=[data_path])
Column names: Index(['Id', 'Label', 'JournalType', 'City', 'CountryNetworkTime',
'PresentDayCountry', 'Latitude', 'Longitude', 'Language'],
dtype='object')
Ok, now we filter. Initially, let's say our module accepts only tables that contain a 'City' column, and returns all rows that have 'Berlin' as a value there:
def process(self, inputs, outputs) -> None:
    from kiara.exceptions import KiaraProcessingException

    table_obj = inputs.get_value_obj("table_input")
    pandas_df = table_obj.data.to_pandas()

    # Refuse tables that do not have the column we want to filter on.
    column_names = pandas_df.columns
    if "City" not in column_names:
        raise KiaraProcessingException("Invalid table, does not contain a column named 'City'.")

    # Keep only the rows whose 'City' value is 'Berlin'.
    berlin_df = pandas_df.loc[pandas_df['City'] == "Berlin"]
    outputs.set_value("table_output", berlin_df)
And again, we run our module using our example dataset, and now we actually get something that is filtered:
kiara run filter.table table_input=alias:journal_nodes_table
╭─ Result ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ field data_type value │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ table_output table │
│ Id Label JournalT City CountryN PresentD Latitude Longitude Language __index_ │
│ ─────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ 295 Die Kran speciali Berlin German E Germany 52.52 13.405 German 3 │
│ 296 Die deut general Berlin German E Germany 52.52 13.405 German 4 │
│ 300 Therapeu speciali Berlin German E Germany 52.52 13.405 German 5 │
│ 1 Allgemei speciali Berlin German E Germany 52.52 13.405 German 6 │
│ 7 Archiv f speciali Berlin German E Germany 52.52 13.405 German 7 │
│ 10 Berliner general Berlin German E Germany 52.52 13.405 German 8 │
│ 13 Charité general Berlin German E Germany 52.52 13.405 German 9 │
│ 21 Monatssc speciali Berlin German E Germany 52.52 13.405 German 10 │
│ 29 Virchows speciali Berlin German E Germany 52.52 13.405 German 11 │
│ 31 Zeitschr speciali Berlin German E Germany 52.52 13.405 German 12 │
│ 42 Viertelj speciali Berlin German E Germany 52.52 13.405 German 13 │
│ 47 Centralb speciali Berlin German E Germany 52.52 13.405 German 14 │
│ 50 Russisch general Berlin German E Germany 52.52 13.405 German 15 │
│ 76 Deutsche general Berlin German E Germany 52.52 13.405 German 16 │
│ 87 Monatssc speciali Berlin German E Germany 52.52 13.405 German 17 │
│ 108 Archiv f speciali Berlin German E Germany 52.52 13.405 German 18 │
│ 113 Zeitschr general Berlin German E Germany 52.52 13.405 German 19 │
│ 159 Deutsche speciali Berlin German E Germany 52.52 13.405 German 20 │
│ 162 Jahresbe speciali Berlin German E Germany 52.52 13.405 German 21 │
│ 192 Ärztlich general Berlin German E Germany 52.52 13.405 German 22 │
│ 198 Zeitschr speciali Berlin German E Germany 52.52 13.405 German 23 │
│ 258 Der Pfar news med Berlin German E Germany 52.52 13.405 German 24 │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
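If the filter expression used in the process method looks unfamiliar, here is a small sketch of the same Pandas technique in isolation, using a hypothetical three-row dataframe. It also shows that the filtered rows keep their original index labels. My assumption (not confirmed by the kiara documentation quoted here) is that this retained index is what shows up as the extra `__index_` column in the result above; if that is not wanted, reset_index(drop=True) renumbers the rows:

```python
import pandas as pd

# Hypothetical miniature version of the journals table.
df = pd.DataFrame(
    {
        "Id": [75, 295, 296],
        "City": ["Amsterdam", "Berlin", "Berlin"],
    }
)

# Comparing a column with a string yields a boolean Series (the mask).
mask = df["City"] == "Berlin"

# `.loc` keeps the rows where the mask is True; the original index labels are kept.
berlin_df = df.loc[mask]
print(list(berlin_df.index))  # [1, 2]

# Optionally renumber the rows from 0 and discard the old index.
berlin_df = berlin_df.reset_index(drop=True)
print(list(berlin_df.index))  # [0, 1]
```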
Of course, a module like this is of very limited value, because the tables it accepts as input must contain a column named 'City', and it only filters for a hardcoded string. Ideally, we'd want the user to provide both the column name and the filter string, along with the table to filter. Let's add those module inputs, and adjust the processing method accordingly:
def create_inputs_schema(self):
    return {
        "table_input": {
            "type": "table",
            "doc": "The table to filter."
        },
        "column_name": {
            "type": "string",
            "doc": "The column containing the element to use as filter.",
            "default": "City"
        },
        "filter_string": {
            "type": "string",
            "doc": "The string to use as filter."
        }
    }

def process(self, inputs, outputs) -> None:
    from kiara.exceptions import KiaraProcessingException
    table_obj = inputs.get_value_obj("table_input")
    column_name = inputs.get_value_data("column_name")
    filter_string = inputs.get_value_data("filter_string")
    pandas_df = table_obj.data.to_pandas()
    column_names = pandas_df.columns
    # fail early, and tell the user which columns are available
    if column_name not in column_names:
        raise KiaraProcessingException(f"Invalid table, does not contain a column named '{column_name}'. Available column names: {', '.join(column_names)}.")
    # keep only the rows whose value in the selected column matches the filter string
    filtered_df = pandas_df.loc[pandas_df[column_name] == filter_string]
    outputs.set_value("table_output", filtered_df)
In this example, I've used a default value ('City') for the column_name input. This probably doesn't make a whole lot of sense here, but it shows how to set defaults for input fields, which in a lot of cases does make sense. We can try to run this command without the filter_string argument, which nicely shows what the kiara command-line interface has to say about a missing input:
kiara run filter.table table_input=alias:journal_nodes_table
╭─ Run info: filter.table ───────────────────────────────────────────────────╮
│ │
│ Can't run operation: invalid or insufficient input(s) │
│ │
│ ──────────────────────────────────────────────────────────────────────────── │
│ │
│ Operation: filter.table │
│ │
│ Filter a table. │
│ │
│ Inputs: │
│ │
│ field name status type description required default │
│ ────────────────────────────────────────────────────────────────────────── │
│ column_name valid string The column no City │
│ containing the │
│ element to use │
│ as filter. │
│ filter_string not set string The string to yes │
│ use as filter. │
│ table_input valid table The table to yes │
│ filter. │
│ │
│ │
│ Outputs: │
│ │
│ field name type description │
│ ────────────────────────────────────────────────────────────────────────── │
│ table_output table The filtered table. │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
As you can see, kiara complains about the missing filter_string input, but has used 'City' as the default for the column_name input, and is therefore fine with the user not providing it. Ok, one more time, let's look for 'Amsterdam':
kiara run filter.table table_input=alias:journal_nodes_table filter_string=Amsterdam
╭─ Result ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ field data_type value │
│ ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ table_output table │
│ Id Label JournalT City CountryN PresentD Latitude Longitud Language __index_ │
│ ─────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ 75 Psychiat speciali Amsterda Netherla Netherla 52.36666 4.9 Dutch 0 │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
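To make the role of the "default" key in the inputs schema a bit more tangible, here is a rough sketch, in plain Python, of the behaviour we observed in the two runs above: inputs the user provides are used as-is, inputs with a default fall back to it, and everything else is reported as missing. The resolve_inputs function is a hypothetical helper written for this illustration. It is not part of kiara, and kiara's actual implementation will look different:

```python
from typing import Any, Dict

# Same shape as the dictionary returned by `create_inputs_schema` above.
INPUTS_SCHEMA: Dict[str, Dict[str, Any]] = {
    "table_input": {"type": "table", "doc": "The table to filter."},
    "column_name": {
        "type": "string",
        "doc": "The column containing the element to use as filter.",
        "default": "City",
    },
    "filter_string": {"type": "string", "doc": "The string to use as filter."},
}


def resolve_inputs(
    schema: Dict[str, Dict[str, Any]], user_inputs: Dict[str, Any]
) -> Dict[str, Any]:
    """Hypothetical helper: merge user inputs with schema defaults."""
    resolved: Dict[str, Any] = {}
    missing = []
    for field_name, field_schema in schema.items():
        if field_name in user_inputs:
            # The user provided a value, so use it.
            resolved[field_name] = user_inputs[field_name]
        elif "default" in field_schema:
            # No user value, but the schema declares a default.
            resolved[field_name] = field_schema["default"]
        else:
            # Neither a user value nor a default: the input is required.
            missing.append(field_name)
    if missing:
        raise ValueError(f"Missing required input(s): {', '.join(missing)}")
    return resolved


# Mirrors the last command: 'column_name' is not given, so the default applies.
resolved = resolve_inputs(
    INPUTS_SCHEMA,
    {"table_input": "alias:journal_nodes_table", "filter_string": "Amsterdam"},
)
print(resolved["column_name"])  # City
```

Calling the helper without a filter_string raises a ValueError that names the missing field, which corresponds to the 'not set' status in the run info shown earlier.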
This should give you a good basis to work on your own kiara module(s). Stay tuned for part II of this tutorial!