Distance and Surprisal Calculations

Amalgam has a number of opcodes that compute distances, and surprisals as distances, across various data types. The opcode generalized_distance calculates these values from two containers, whereas opcodes like query_within_generalized_distance and query_nearest_generalized_distance compute the distances on entity labels, and opcodes like query_entity_convictions use distance or surprisal calculations to compute more advanced metrics. For full information on how these distances are calculated, see the paper “A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring)” by Hazard et al., https://arxiv.org/abs/2510.22809v1.

These opcodes all share a set of common parameters, beginning with the containers or labels from which to compute the distance. Following these, the distance opcodes take the optional parameters below, in the order listed, though not all opcodes have all of these parameters.

list|number selection_bandwidth: The parameter selection_bandwidth specifies either the number of entities to return, or is a list of parameters for more sophisticated bandwidth selection. If selection_bandwidth is a list, the first element of the list specifies the minimum incremental probability or percent of mass that the next largest entity would comprise (e.g., 0.05 would return at most 20 entities if they were all equal in percent of mass), and the other elements of the list are optional. The second element is the minimum number of entities to return, the third element is the maximum number of entities to return, and the fourth indicates the number of additional entities to include after any of the aforementioned thresholds (defaulting to zero). If there is disagreement among the constraints for selection_bandwidth, the constraint yielding the fewest entities will govern the number of entities returned.
list feature_labels: The names of the labels of the features from which to compute the distances.
number p_value: The parameter p_value is the generalized norm parameter. A value of 1, the default, corresponds to probability space and Manhattan distance, a value of 2 corresponds to Euclidean distance, and so on. For surprisal space, using a value of 1 is generally most appropriate.
list|assoc|assoc of assoc weights: If weights is a list, each value maps to its respective element in the vectors. If weights is null, then it will assume that the weights are 1 and additionally will ignore null values for the vectors instead of treating them as unknown differences. If weights is an assoc, then the parameter value_names will select the weights from the assoc. If weights is an assoc of assocs, additionally the parameter weights_selection_features will select which set of weights to use.
list|assoc of assoc|string attributes: The parameter attributes describes the attributes of each feature, which determine how the differences are calculated. Each entry can be either a string or an assoc. If it is a string, then the valid values are “nominal” or “continuous”. If the entry is an assoc, then a wide variety of attributes are available depending on type. The key “difference_type” can be either “nominal” or “continuous” to describe whether the difference will only look at equality or whether more distant values will have larger differences. The key “data_type” can be one of “bool”, “number”, “string”, or “code”, and determines whether all data will be coerced to the corresponding type (null is always allowed), where “code” indicates that no type coercion will occur. The default if omitted is continuous numeric, and the default type if only nominal is specified is nominal string. The additional attributes available depend on the combination of “difference_type” and “data_type”. If “difference_type” is “nominal”, then the key “nominal_count” will specify the number of data points in the data set, but if omitted or null, then it will infer the count from the values available. If the combination is “continuous” and “number”, then the key “cycle_range” specifies the upper bound of the difference of the range between two values. For example, if the “cycle_range” is 360, then the supremum difference between two values will be 360, leading 1 and 359 to have a difference of 2. If the combination of types is “continuous” and “code”, then the keys “types_must_match”, “nominal_numbers”, “nominal_strings”, and “recursive_matching” are applicable. If the key “types_must_match” is true (the default), it will only consider nodes common if the types match. If the key “nominal_numbers” is true (the default is false), then it will assume that all numbers will match only if identical; if false, it will compare similarity of values. The key “nominal_strings” defaults to true, but works similarly to “nominal_numbers” except on strings using string edit distance. If the key “recursive_matching” is true or null, then it will attempt to recursively match any part of the data structure of node1 to node2. If the key “recursive_matching” is false, then it will only attempt to merge the two at the same level, which yields better results if the data structures are common, and additionally will be much faster. Additionally, for distances computed by contained_entities or compute_on_contained_entities, the value for a given feature may be the result of executing code. If the key “call_entity” is specified and is either a call_entity or call_on_entity opcode, then the opcode will be executed and the result will be compared with regard to distance or surprisal. The entity should be set to null, so the parameter should be formed as (call_entity .null ...), and in most cases it will be desirable to put constraints on the call to prevent excess compute for dynamic data.
list|assoc deviations: The values in the parameter deviations are used during distance calculation to specify uncertainty per element, that is, the minimum difference between two values prior to exponentiation. Specifying null as a deviation is equivalent to setting each deviation to 0. The deviation for each feature can be a single value or a list. If it is a single value, that value is used as the deviation, and the differences and deviations for null values will be computed automatically from the data based on the maximum difference. If a deviation is provided as a list, then the first value is the deviation, the second value is the difference to use when one of the values being compared is null, and the third value is the difference to use when both of the values are null. If the third value is omitted, it will use the second value for both. If both of the null values are omitted, then it will compute the maximum difference and use that for both. For nominal types, the value for each feature can be a numeric deviation, an assoc, or a list. If the value is an assoc, it specifies deviation information, where each key of the assoc is the nominal value, and each value of the assoc can be a numeric deviation value, a list, or an assoc, with the list specifying an assoc followed optionally by the default deviation. This inner assoc, regardless of whether it is in a list, maps the value to each actual value’s deviation.
list|string weights_selection_features: If weights_selection_features is a string and weights is an assoc, then it will select the weights for the given feature and rebalance weights for any unused features.
string|number distance_transform: A transform will be applied to the distances based on distance_transform. If distance_transform is “surprisal”, then distances will be calculated as surprisals, and weights will not be applied to the values. If distance_transform is “surprisal_to_prob”, then distances will be calculated as surprisals and will be transformed back into probabilities for aggregating, and then transformed back to surprisals. If distance_transform is a number or omitted, in which case it defaults to 1.0, then it will be treated as a distance weight exponent and will be applied to each distance as distance^distance_weight_exponent, only using entity weights for nonpositive values of distance_transform. Note that the corresponding parameter for generalized_distance is bool surprisal_space, and if it is true, then all distance computations will be performed in surprisal space.
number random_seed: If random_seed is specified, it uses a stream from this seed to break ties when selecting entities.
string radius_label: The parameter radius_label represents the label name of the radius of the entity, which effectively operates as a negative distance so that one point can be inside the hypersphere of another.
string numerical_precision: The parameter numerical_precision can be specified as one of three values: “precise”, which computes every distance with high numerical precision, “fast”, which computes every distance with lower but faster numerical precision, and “recompute_precise”, which computes distances quickly with lower precision but then recomputes any distance values that will be returned with higher precision.
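
To make the interaction of p_value, weights, and the “cycle_range” attribute more concrete, the following is a minimal Python sketch of a weighted generalized-norm distance. It is an illustration only and not the Amalgam implementation: the function names feature_difference and generalized_distance_sketch are hypothetical, nominal mismatches are assumed to contribute a fixed difference of 1 (ignoring “nominal_count” and deviations), weights are assumed to multiply each exponentiated term, nulls are not handled, and p_value is assumed to be a positive finite number.

```python
def feature_difference(a, b, attributes="continuous"):
    """Difference between two values of one feature (hypothetical helper)."""
    # A bare string is shorthand for the difference type, as in the text.
    if isinstance(attributes, str):
        attributes = {"difference_type": attributes}

    if attributes.get("difference_type", "continuous") == "nominal":
        # Assumption for this sketch: nominal values only test equality.
        return 0.0 if a == b else 1.0

    diff = abs(a - b)
    cycle_range = attributes.get("cycle_range")
    if cycle_range is not None:
        # Cyclic feature: measure the shorter way around the cycle.
        diff = diff % cycle_range
        diff = min(diff, cycle_range - diff)
    return diff


def generalized_distance_sketch(x, y, p_value=1.0, weights=None, attributes=None):
    """Weighted generalized-norm distance between two equal-length vectors."""
    n = len(x)
    if weights is None:
        weights = [1.0] * n              # null weights are treated as all 1
    if attributes is None:
        attributes = ["continuous"] * n  # default is continuous numeric

    # Sum the weighted, exponentiated per-feature differences, then take
    # the p-th root (assumes p_value > 0 and finite).
    total = sum(
        w * feature_difference(a, b, attr) ** p_value
        for a, b, w, attr in zip(x, y, weights, attributes)
    )
    return total ** (1.0 / p_value)


# Manhattan (p_value 1) versus Euclidean (p_value 2) on the same vectors.
manhattan = generalized_distance_sketch([0, 0], [3, 4], p_value=1)
euclidean = generalized_distance_sketch([0, 0], [3, 4], p_value=2)

# With a cycle_range of 360, the values 1 and 359 differ by 2.
angle_diff = feature_difference(1, 359, {"cycle_range": 360})
```

In this sketch, the per-feature difference is computed first and the norm is applied afterward, which mirrors the way the attributes parameter determines differences independently of p_value and weights.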