Workshop
Digital History: The Story So Far¶
As the field of Digital History continues to grow, so too does the number of tools, software, and coding packages built to support and advance digital history in practice. The range is at times staggering: from applications suited to the most novice of digital historians to coding guides and tools for those working towards more nuanced and specific end goals, researchers can engage with their materials in digital, quantitative ways on a never-before-seen scale. Often we focus primarily on the new findings that come out of this new way of approaching research, but what about the ways we get to those findings?
Regardless of the type of digital analysis being performed or even the software being used, the process is normally the same: input some data, click some buttons or run some code (perhaps a couple of times over to edit the code and adjust the outcomes), and get your end result.
You've got an outcome, but do you know how you got from A to B? It's likely that variables have been overwritten several times along the way, that the data has changed from one type to another or been filtered or added to, and that decision after decision has been made without you necessarily being aware of it. Each little adjustment or re-run of the code has contributed to the research process and is critical to the end output or findings.
But how do we keep track?
Hello kiara.¶
Introducing kiara, a new data orchestration tool.¶
This new tool incorporates a number of different digital research approaches but, most importantly, documents the process and encourages users to reflect critically on it and on their use of DH tools. In doing so, the software opens up the black box of digital research, moving away from button-clicking software and making digital research more transparent and open to commentary, replicability, and criticism. It not only makes the research process itself more open, allowing users to visualise and examine the individual steps from start to finish, but also allows them to track changes to the data itself, something that is either imperceptible or, perhaps more importantly, forgotten about in traditional digital history methods and tools. kiara therefore acts as a 'wrapper' around this digital research process, tracking and documenting the steps and changes to the data, and producing a veritable map of the journey that can be reflected upon and shared.
This tutorial will walk you through installation of kiara in Jupyter Notebooks, and some basic but essential functions that can be built on in further notebooks. At the end, it will showcase the data lineage, having tracked the research process and changes to the data from start to finish.
This tutorial assumes a working knowledge of Python and SQL.
Installation¶
First, we need to check if kiara and its plugins are already installed, and install them if not. There are currently seven plugins:
kiara_plugin.core-types
kiara_plugin.onboarding
kiara_plugin.tabular
kiara_plugin.network_analysis
kiara_plugin.language_processing
kiara_plugin.html
kiara_plugin.streamlit
All of these will be installed automatically alongside kiara, using the code below:
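try:
    # Use the helper if kiara's Jupyter plugin is already installed
    from kiara_plugin.jupyter import ensure_kiara_plugins
except ImportError:
    # Otherwise install it into the environment this notebook is running in
    import sys
    print("Installing 'kiara_plugin.jupyter'...")
    !{sys.executable} -m pip install -q kiara_plugin.jupyter
    from kiara_plugin.jupyter import ensure_kiara_plugins

ensure_kiara_plugins()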
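If you would like to double check which plugins ended up in your environment, a small standard-library sketch like the one below can help. It is not part of kiara: the plugin_status helper is made up for this example, and the import names are an assumption based on the package list above (with core-types written as core_types).

```python
import importlib.util

# Assumed import names: the package list above, with "core-types" written as "core_types"
PLUGIN_MODULES = [
    "kiara_plugin.core_types",
    "kiara_plugin.onboarding",
    "kiara_plugin.tabular",
    "kiara_plugin.network_analysis",
    "kiara_plugin.language_processing",
    "kiara_plugin.html",
    "kiara_plugin.streamlit",
]


def plugin_status(module_names):
    """Map each module name to True if it can be found, else False."""
    status = {}
    for name in module_names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises ModuleNotFoundError when the parent package is missing
            status[name] = False
    return status


status = plugin_status(PLUGIN_MODULES)
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any plugin reported as missing can be installed by re-running the installation cell above.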
Running kiara¶
In order to use kiara, we need to create a KiaraAPI instance. An API allows us to control and interact with kiara and its functions. In kiara this also allows us to get more information about what can be done (and what is happening) to our data as we go. For more on what can be done with the API, see the kiara API documentation.
from kiara import KiaraAPI
kiara = KiaraAPI.instance()
Now that we have an API in place, we can get more information about what we can do in kiara. Let's start by asking kiara to list all the operations that are included with the plugins we just installed.
kiara.list_operation_ids()
Downloading Files¶
Great, now we know the different kinds of operations we can use with kiara. Let's start by introducing some files to our notebook, using the download.file function.
First we want to find out what this operation does, and just as importantly, what inputs it needs to work.
kiara.retrieve_operation_info('download.file')
So from this, we know that download.file will download a single file from a remote location for us to use in kiara.
We need to give the function a url and, if we want, a file name. These are the inputs.
In return, we will get the file and metadata about the file as our outputs.
Let's give this a go using some kiara sample data.
First we define our inputs, then use kiara.run_job with our chosen operation, download.file, and save the result as our outputs.
inputs = {
"url": "https://raw.githubusercontent.com/DHARPA-Project/kiara.examples/main/examples/data/journals/JournalNodes1902.csv",
"file_name": "JournalNodes1902.csv"
}
outputs = kiara.run_job('download.file', inputs=inputs)
Let's print out our outputs and see what that looks like.
outputs
Great! We've successfully downloaded the file, and we can see there's lots of information here.
At the moment, we're most interested in the file output. This contains the actual contents of the file that we have just downloaded.
Let's separate this out and store it in a separate variable for us to use.
downloaded_file = outputs['file']
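To get a feel for what an operation like download.file hands back, here is a rough sketch in plain Python. It is not kiara's implementation: the download_file helper and the keys in its result are made up for illustration, and it reads a local file:// URL so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_file(url, file_name=None):
    """Fetch one file and return its content plus some basic metadata.

    Illustrative only: the keys of the returned dict are chosen for this
    sketch and are not kiara's output schema.
    """
    with urllib.request.urlopen(url) as response:
        content = response.read()
    # Fall back to the last path segment of the URL when no name is given
    name = file_name or Path(urlparse(url).path).name
    return {
        "file": {"file_name": name, "content": content},
        "metadata": {"url": url, "size_in_bytes": len(content)},
    }


# Use a local file:// URL so the example runs without network access
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"Id,City\n1,Berlin\n")
    result = download_file(sample.as_uri())

print(result["file"]["file_name"])
print(result["metadata"]["size_in_bytes"])
```

The point of the sketch is the shape of the result: the file contents and the information about the download travel together, which is what lets us pick out just the file part afterwards.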
New Formats: Creating and Converting¶
What next? We could transform the downloaded file contents into a different format.
Let's go back to the operation list from earlier, and look for something that lets us create something new out of our file.
kiara.list_operation_ids('create')
Our file was originally in CSV format, so let's make a table using create.table.from.file.
Just like when we used download.file, we can double check what this does, and what inputs and outputs this involves.
This time, we're also going to use a variable to store the operation in - this is especially handy if the operation has a long name, or if you want to use the same operation more than once without retyping it.
op_id = 'create.table.from.file'
kiara.retrieve_operation_info(op_id)
Great, we have all the information we need now.
Let's go again.
First we define our inputs, the downloaded file we saved earlier.
Then we use kiara.run_job with our chosen operation, this time stored as op_id.
Once this is saved as our outputs, we can print it out.
inputs = {
"file": downloaded_file
}
outputs = kiara.run_job(op_id, inputs=inputs)
outputs
This has done exactly what we wanted, and shown the contents from the downloaded file as a table. But we are also interested in some general (mostly internal) information and metadata, this time for the new table we have just created, rather than the original file itself.
Let's have a look.
outputs_table = outputs['table']
outputs_table
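If you are curious what turning CSV text into a table involves, the sketch below does a simplified version with Python's csv module. It is only an illustration and says nothing about how kiara stores tables internally: the csv_to_columns helper is made up, and the sample rows are invented (only the City and JournalType column names are taken from the queries later in this tutorial).

```python
import csv
import io

# Invented sample rows; only the City and JournalType column names come from this tutorial
SAMPLE_CSV = """Id,City,JournalType
1,Berlin,general medicine
2,Berlin,surgery
3,Leipzig,general medicine
"""


def csv_to_columns(text):
    """Parse CSV text into a column-oriented table: {column_name: [values]}."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


table = csv_to_columns(SAMPLE_CSV)
print(list(table))         # column names
print(len(table["City"]))  # number of rows
```

Note that every value here is still a string. Deciding which columns hold numbers, dates, or text is one of the quiet decisions made whenever a file becomes a table, which is exactly the kind of step worth keeping track of.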
Querying our Data¶
So now we have downloaded our file and converted it into a table, we want to actually explore it.
To do this, we can query the table using SQL and some functions already included in kiara.
Let's take another look at that operation list, this time looking for functions that let us 'query'.
kiara.list_operation_ids('query')
Well, we already know our file has been converted into a table, so let's have a look at query.table.
kiara.retrieve_operation_info('query.table')
From this information, we can see that we only need to provide the table itself and our query. In the queries below, the table is referred to as data.
Let's work out how many of these journals were published in Berlin.
inputs = {
"table": outputs_table,
"query": "SELECT * from data where City like 'Berlin'"
}
outputs = kiara.run_job('query.table', inputs=inputs)
outputs
The function has returned the table with just the results we were looking for from the SQL query.
Let's narrow this further, and find all the journals that are just about general medicine and published in Berlin.
We can re-use the query.table function and the table we've just made, stored in outputs['query_result']:
inputs = {
"table" : outputs['query_result'],
"query" : "SELECT * from data where JournalType like 'general medicine'"
}
outputs = kiara.run_job('query.table', inputs=inputs)
outputs
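If you want to try the SQL on its own, the sketch below runs the same two filters against a few invented rows. Python's built-in sqlite3 is used purely as a stand-in here, so treat it as an assumption-laden illustration rather than a picture of how kiara runs queries. It also shows two variations: counting the matches with COUNT(*), and combining both conditions with AND in a single query.

```python
import sqlite3

# Invented rows for illustration; sqlite3 is only a stand-in query engine
rows = [
    ("Berlin", "general medicine"),
    ("Berlin", "surgery"),
    ("Leipzig", "general medicine"),
    ("Berlin", "general medicine"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (City TEXT, JournalType TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)", rows)

# Count the Berlin journals instead of listing them
berlin_count = conn.execute(
    "SELECT COUNT(*) FROM data WHERE City LIKE 'Berlin'"
).fetchone()[0]

# Chained approach: keep the first result, then query that result again
conn.execute("CREATE TEMP TABLE berlin AS SELECT * FROM data WHERE City LIKE 'Berlin'")
chained = conn.execute(
    "SELECT * FROM berlin WHERE JournalType LIKE 'general medicine'"
).fetchall()

# The same filter written as a single query
combined = conn.execute(
    "SELECT * FROM data WHERE City LIKE 'Berlin' AND JournalType LIKE 'general medicine'"
).fetchall()

print(berlin_count)
print(chained == combined)
conn.close()
```

Both routes return the same rows. Chaining keeps the intermediate Berlin table available as a step of its own, while the combined query is shorter. Keep in mind that LIKE without wildcards matches the whole value, and that whether it ignores upper and lower case differs between SQL engines.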
Recording and Tracing our Data¶
We've made quite a few changes to this table, so let's double check the information about the new table we've created with our queries.
query_output = outputs['query_result']
query_output
Looks good!
We might have changed things around, but we can still get lots of information about all our data.
More importantly, kiara is able to trace all of these changes, tracking the inputs and outputs and giving them all different identifiers, so you know exactly what has happened to your data.
Check it out!
query_output.lineage
Even though we are only actually asking for the data lineage using the last SQL query and the table it made, kiara shows us everything that has happened since we first downloaded the file. This helps us keep an eye on the research process and the changes we are making to the data at the same time!
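To make the idea of lineage a little more concrete, here is a toy sketch in plain Python. It is emphatically not how kiara works internally: the TrackedValue class, the lineage function, and the example.org URL are all invented for illustration. Each value simply remembers which operation produced it and from which inputs, so walking backwards from the last result recovers every earlier step.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TrackedValue:
    """A piece of data plus a record of the operation that produced it."""
    data: object
    operation: str
    inputs: dict = field(default_factory=dict)  # input name -> TrackedValue or plain parameter
    value_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def lineage(value, depth=0):
    """Return indented lines describing how `value` was produced, newest step first."""
    lines = [f"{'  ' * depth}{value.operation} -> {value.value_id[:8]}"]
    for inp in value.inputs.values():
        # Only tracked values have a history of their own; plain parameters are skipped
        if isinstance(inp, TrackedValue):
            lines.extend(lineage(inp, depth + 1))
    return lines


# Mirror the steps of this tutorial with placeholder data and a placeholder URL
downloaded = TrackedValue("raw csv text", "download.file", {"url": "https://example.org/file.csv"})
table = TrackedValue("table", "create.table.from.file", {"file": downloaded})
berlin = TrackedValue(
    "berlin rows", "query.table",
    {"table": table, "query": "SELECT * from data where City like 'Berlin'"},
)
medicine = TrackedValue(
    "medicine rows", "query.table",
    {"table": berlin, "query": "SELECT * from data where JournalType like 'general medicine'"},
)

steps = lineage(medicine)
print("\n".join(steps))
```

Asking only for the lineage of the final value still prints all four steps, back to the original download, because each value carries a reference to the inputs it was made from.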
What next...?¶
That's it! You've completed the first notebook: you've installed kiara, downloaded a file, tried out some operations, and seen how kiara keeps track of what each step does to your data.
Now you can check out the other plugin packages to explore how this helps you manage and trace your data while using digital analysis tools!