kiara.operations.extract_metadata
ExtractMetadataModule
Base class to use when writing a module to extract metadata from a file.
It's possible to use any arbitrary kiara module for this purpose, but sub-classing this makes it easier.
create_input_schema(self)
Abstract method to be implemented by child classes; it returns a description of the input schema of this module.
If returning a dictionary of dictionaries, the format of the return value is as follows (items with '*' are optional):
{
    "[input_field_name]": {
        "type": "[value_type]",
        "doc*": "[a description of this input]",
        "optional*": [boolean, whether this input is optional or required (defaults to 'False')]
    },
    "[other_input_field_name]": {
        "type": ...
        ...
    }
}
Source code in kiara/operations/extract_metadata.py
def create_input_schema(
    self,
) -> typing.Mapping[
    str, typing.Union[ValueSchema, typing.Mapping[str, typing.Any]]
]:

    input_name = self.value_type
    if input_name == "any":
        input_name = "value_item"

    inputs = {
        input_name: {
            "type": self.value_type,
            "doc": f"A value of type '{self.value_type}'",
            "optional": False,
        }
    }
    return inputs
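As an illustration of the naming rule in the listing above, here is a minimal stand-alone sketch that mirrors the same logic without importing kiara. The function name `sketch_input_schema` is hypothetical and not part of the library; it only shows what the returned dictionary looks like for a concrete value type and for the catch-all type 'any'.

```python
from typing import Any, Dict


def sketch_input_schema(value_type: str) -> Dict[str, Dict[str, Any]]:
    """Hypothetical stand-alone mirror of the create_input_schema logic."""
    # The input field is named after the value type, except for the
    # catch-all type 'any', which gets the generic name 'value_item'.
    input_name = "value_item" if value_type == "any" else value_type
    return {
        input_name: {
            "type": value_type,
            "doc": f"A value of type '{value_type}'",
            "optional": False,
        }
    }


table_schema = sketch_input_schema("table")
any_schema = sketch_input_schema("any")
print(table_schema)
print(any_schema)
```

In both cases the schema has exactly one required input; only the field name differs.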
create_output_schema(self)
Abstract method to be implemented by child classes; it returns a description of the output schema of this module.
If returning a dictionary of dictionaries, the format of the return value is as follows (items with '*' are optional):
{
    "[output_field_name]": {
        "type": "[value_type]",
        "doc*": "[a description of this output]"
    },
    "[other_output_field_name]": {
        "type": ...
        ...
    }
}
Source code in kiara/operations/extract_metadata.py
def create_output_schema(
    self,
) -> typing.Mapping[
    str, typing.Union[ValueSchema, typing.Mapping[str, typing.Any]]
]:

    outputs = {
        "metadata_item": {
            "type": "dict",
            "doc": "The metadata for the provided value.",
        },
        "metadata_item_schema": {
            "type": "string",
            "doc": "The (json) schema for the metadata.",
        },
    }
    return outputs
retrieve_module_profiles(kiara)
classmethod
Retrieve a collection of profiles (pre-set module configs) for this kiara module type.
This is used to automatically create generally useful operations (incl. their ids).
Source code in kiara/operations/extract_metadata.py
@classmethod
def retrieve_module_profiles(
    cls, kiara: "Kiara"
) -> typing.Mapping[str, typing.Union[typing.Mapping[str, typing.Any], Operation]]:

    all_metadata_profiles: typing.Dict[
        str, typing.Dict[str, typing.Dict[str, typing.Any]]
    ] = {}

    value_types: typing.Iterable = cls.get_supported_value_types()
    if "*" in value_types:
        value_types = kiara.type_mgmt.value_type_names

    metadata_key = cls.get_metadata_key()
    all_value_types = set()
    for value_type in value_types:

        if value_type not in kiara.type_mgmt.value_type_names:
            log_message(
                f"Ignoring metadata-extract operation (metadata key: {metadata_key}) for type '{value_type}': type not available"
            )
            continue

        all_value_types.add(value_type)
        sub_types = kiara.type_mgmt.get_sub_types(value_type)
        all_value_types.update(sub_types)

    for value_type in all_value_types:
        op_config = {
            "module_type": cls._module_type_id,  # type: ignore
            "module_config": {"value_type": value_type},
            "doc": f"Extract '{metadata_key}' metadata for value of type '{value_type}'.",
        }
        all_metadata_profiles[
            f"extract.{metadata_key}.metadata.from.{value_type}"
        ] = op_config

    return all_metadata_profiles
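To make the profile-generation steps in the listing above easier to follow, here is a rough stand-alone sketch that runs without kiara. `StubTypeMgmt` and `sketch_profiles` are hypothetical stand-ins, and the type names, metadata key, and module type id used in the example are made up for illustration. How the real `get_sub_types` resolves sub-types is not shown in the source, so the stub simply returns whatever it was given.

```python
from typing import Any, Dict, Iterable, List, Set


class StubTypeMgmt:
    """Hypothetical stand-in for kiara.type_mgmt (illustration only)."""

    def __init__(self, sub_types: Dict[str, List[str]]) -> None:
        self._sub_types = sub_types

    @property
    def value_type_names(self) -> List[str]:
        return list(self._sub_types)

    def get_sub_types(self, value_type: str) -> Set[str]:
        return set(self._sub_types[value_type])


def sketch_profiles(
    module_type_id: str,
    metadata_key: str,
    supported: Iterable[str],
    type_mgmt: StubTypeMgmt,
) -> Dict[str, Dict[str, Any]]:
    value_types = list(supported)
    # A '*' entry means: create profiles for every registered value type.
    if "*" in value_types:
        value_types = type_mgmt.value_type_names

    all_value_types: Set[str] = set()
    for value_type in value_types:
        # Unknown types are skipped (the original also logs a message).
        if value_type not in type_mgmt.value_type_names:
            continue
        all_value_types.add(value_type)
        all_value_types.update(type_mgmt.get_sub_types(value_type))

    profiles: Dict[str, Dict[str, Any]] = {}
    for value_type in sorted(all_value_types):
        profiles[f"extract.{metadata_key}.metadata.from.{value_type}"] = {
            "module_type": module_type_id,
            "module_config": {"value_type": value_type},
            "doc": f"Extract '{metadata_key}' metadata for value of type '{value_type}'.",
        }
    return profiles


# Made-up type hierarchy: 'text_file' is a sub-type of 'file'.
type_mgmt = StubTypeMgmt({"file": ["text_file"], "text_file": [], "table": []})
profiles = sketch_profiles(
    "metadata.file_info", "file_info", ["file", "network_graph"], type_mgmt
)
for op_id in profiles:
    print(op_id)
```

In this example the unregistered type 'network_graph' is dropped, while 'text_file' receives a profile of its own because it is listed as a sub-type of 'file'.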
ExtractMetadataOperationType
Extract metadata from a dataset.
The purpose of this operation type is to extract arbitrary, type-specific metadata from value data. In general, kiara wants to collect (and store alongside the value) as much metadata about the data as possible, but the extraction process should not take a lot of processing time, since it runs whenever a value is registered into a data registry.
As it's hard to predict every kind of metadata that could be interesting for a given value type in specific scenarios, kiara supports a pluggable mechanism for adding new metadata extraction processes: extend the base class ExtractMetadataModule and add that implementation somewhere kiara can find it. Once that is done, kiara will automatically add a new operation with an id that follows this template: <VALUE_TYPE>.extract_metadata.<METADATA_KEY>, where METADATA_KEY is the name under which the metadata will be stored within the value object.
By default, every value type should have at least one metadata extraction module where the METADATA_KEY is the same as the value type name, and which contains basic, type-specific metadata (e.g. for a 'table', that would be number of rows, column names, column types, etc.).
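As a rough idea of what "basic, type-specific metadata" for a 'table' could look like, the following sketch computes the row count, column names, and column types for a table represented as a plain dictionary of columns. This is an assumption made purely for illustration: kiara's actual table type, its metadata keys, and the real extraction module are not shown here, and `sketch_table_metadata` is a hypothetical helper.

```python
from typing import Any, Dict, List


def sketch_table_metadata(table: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Hypothetical helper: cheap, basic metadata for a dict-of-columns table."""
    column_names = list(table)
    # Assumes all columns have the same length; an empty table has 0 rows.
    row_count = len(next(iter(table.values()))) if table else 0
    # Column type is taken from the first cell only, to keep extraction cheap.
    column_types = {
        name: type(values[0]).__name__ if values else "unknown"
        for name, values in table.items()
    }
    return {
        "rows": row_count,
        "column_names": column_names,
        "column_types": column_types,
    }


example_table = {
    "city": ["Berlin", "Paris"],
    "population": [3645000, 2161000],
}
table_metadata = sketch_table_metadata(example_table)
print(table_metadata)
```

Only the first cell of each column is inspected, which reflects the requirement stated above that extraction should not take a lot of processing time.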