Distance and Surprisal Calculations
Amalgam has a number of opcodes that compute distances, and surprisals as distances, across various data types. The opcode generalized_distance calculates these values from two containers, whereas opcodes like query_within_generalized_distance and query_nearest_generalized_distance compute the distances on entity labels, and opcodes like query_entity_convictions use distance or surprisal calculations to compute more advanced metrics. For full information on how these distances are calculated, see the paper “A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring)” by Hazard et al., https://arxiv.org/abs/2510.22809v1.
These opcodes share a set of common parameters, starting with the containers or labels from which to compute the distance. Following these, the distance opcodes take the optional parameters below, in the order listed, though not every opcode accepts all of them.
- list|number
selection_bandwidth: The parameter selection_bandwidth specifies either the number of entities to return, or is a list of parameters for more sophisticated bandwidth selection. If selection_bandwidth is a list, the first element of the list specifies the minimum incremental probability or percent of mass that the next largest entity would comprise (e.g., 0.05 would return at most 20 entities if they were all equal in percent of mass), and the other elements of the list are optional. The second element is the minimum number of entities to return, the third element is the maximum number of entities to return, and the fourth indicates the number of additional entities to include after any of the aforementioned thresholds (defaulting to zero). If there is disagreement among the constraints for selection_bandwidth, the constraint yielding the fewest entities will govern the number of entities returned.
- list
feature_labels: The names of the labels of the features from which to compute the distances.
- number
p_value: The parameter p_value is the generalized norm parameter, where a value of 1 (the default) corresponds to probability space and Manhattan distance, a value of 2 corresponds to Euclidean distance, and so on. For surprisal space, using a value of 1 is generally most appropriate.
- list|assoc|assoc of assoc
weights: If weights is a list, each value maps to its respective element in the vectors. If weights is null, then it will assume that the weights are 1 and additionally will ignore null values for the vectors instead of treating them as unknown differences. If weights is an assoc, then the parameter value_names will select the weights from the assoc. If weights is an assoc of assocs, additionally the parameter weights_selection_features will select which set of weights to use.
- list|assoc of assoc|string
attributes: The parameter attributes describes the attributes of each feature, which determine how the differences are calculated. Each entry can be either a string or an assoc. If it is a string, the valid values are “nominal” or “continuous”. If the entry is an assoc, a wide variety of attributes are available depending on type. The key “difference_type” can be either “nominal” or “continuous” to describe whether the difference will only look at equality or whether more distant values will have larger differences. The key “data_type” can be one of “bool”, “number”, “string”, or “code”, and will determine whether all data will be coerced to the corresponding type (null is always allowed), where “code” indicates that no type coercion will occur. The default if omitted is continuous numeric, and if only nominal is specified, the default is nominal string. The additional attributes available depend on the combination of “difference_type” and “data_type”. If “difference_type” is “nominal”, then the key “nominal_count” will specify the number of data points in the data set, but if it is omitted or null, the count will be inferred from the values available. If the combination is “continuous” and “number”, then the key “cycle_range” specifies the upper bound of the difference of the range between two values. For example, if the “cycle_range” is 360, then the supremum difference between two values will be 360, leading 1 and 359 to have a difference of 2. If the combination of types is “continuous” and “code”, then the keys “types_must_match”, “nominal_numbers”, “nominal_strings”, and “recursive_matching” are applicable. If the key “types_must_match” is true (the default), it will only consider nodes common if the types match. If the key “nominal_numbers” is true (the default is false), then it will assume that all numbers will match only if identical; if false, it will compare similarity of values.
The key “nominal_strings” defaults to true, but works similarly to “nominal_numbers” except on strings using string edit distance. If the key “recursive_matching” is true or null, then it will attempt to recursively match any part of the data structure of node1 to node2. If the key “recursive_matching” is false, then it will only attempt to merge the two at the same level, which yields better results if the data structures are common, and additionally will be much faster.
- list|assoc
deviations: The values in the parameter deviations are used during distance calculation to specify uncertainty per element, that is, the minimum difference between two values prior to exponentiation. Specifying null as a deviation is equivalent to setting each deviation to 0. Each deviation for each feature can be a single value or a list. If it is a single value, that value is used as the deviation, and the differences and deviations for null values will be automatically computed from the data based on the maximum difference. If a deviation is provided as a list, then the first value is the deviation, the second value is the difference to use when one of the values being compared is null, and the third value is the difference to use when both of the values are null. If the third value is omitted, it will use the second value for both. If both of the null values are omitted, then it will compute the maximum difference and use that for both. For nominal types, the value for each feature can be a numeric deviation, an assoc, or a list. If the value is an assoc, it specifies deviation information, where each key of the assoc is the nominal value, and each value of the assoc can be a numeric deviation value, a list, or an assoc, with the list specifying an assoc optionally followed by the default deviation. This inner assoc, regardless of whether it is in a list, maps the value to each actual value’s deviation.
- list|string
weights_selection_features: If weights_selection_features is a string and weights is an assoc, then it will select the weights for the given feature and rebalance weights for any unused features.
- string|number
distance_transform: A transform will be applied to the distances based on distance_transform. If distance_transform is “surprisal”, then distances will be calculated as surprisals, and weights will not be applied to the values. If distance_transform is “surprisal_to_prob”, then distances will be calculated as surprisals and will be transformed back into probabilities for aggregating, and then transformed back to surprisals. If distance_transform is a number or omitted (in which case it defaults to 1.0), then it will be treated as a distance weight exponent and will be applied to each distance as distance^distance_weight_exponent, only using entity weights for nonpositive values of distance_transform. Note that the corresponding parameter for generalized_distance is the bool surprisal_space; if it is true, then all distance computations will be performed in surprisal space.
- number
random_seed: If random_seed is specified, it uses a stream from this seed to break ties when selecting entities.
- string
radius_label: The parameter radius_label represents the label name of the radius of the entity, which effectively operates as a negative distance so that one point can be inside the hypersphere of another.
- string
numerical_precision: The parameter numerical_precision can be specified as one of three values: “precise”, which computes every distance with high numerical precision; “fast”, which computes every distance faster but with lower numerical precision; and “recompute_precise”, which computes distances quickly with lower precision but then recomputes any distance values that will be returned with higher precision.
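
To make the roles of p_value, weights, and the “cycle_range” attribute more concrete, the following is a minimal sketch in Python (not Amalgam, and not the opcode's implementation). The function names feature_difference and minkowski_distance are hypothetical. It assumes that each weight multiplies its feature's term before the final root is taken, and it handles only positive p_value; deviations, nominal features, and surprisal space are left out.

```python
def feature_difference(a, b, cycle_range=None):
    """Absolute difference between two numbers, wrapping when cycle_range is set."""
    diff = abs(a - b)
    if cycle_range is not None:
        diff %= cycle_range
        # Going around the cycle the other way may be shorter.
        diff = min(diff, cycle_range - diff)
    return diff


def minkowski_distance(x, y, p_value=1.0, weights=None, cycle_ranges=None):
    """Weighted generalized-norm distance between two equal-length vectors.

    Assumption (not taken from the Amalgam source): each weight multiplies
    its term before the 1/p root. Only p_value > 0 is handled here.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p_value <= 0:
        raise ValueError("this sketch only handles p_value > 0")

    total = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        weight = 1.0 if weights is None else weights[i]
        cycle = None if cycle_ranges is None else cycle_ranges[i]
        total += weight * feature_difference(a, b, cycle) ** p_value
    return total ** (1.0 / p_value)


if __name__ == "__main__":
    print(feature_difference(1, 359, cycle_range=360))        # 2
    print(minkowski_distance([0, 0], [3, 4], p_value=1))      # 7.0 (Manhattan)
    print(minkowski_distance([0, 0], [3, 4], p_value=2))      # 5.0 (Euclidean)
```

The first call reproduces the example from the attributes description, where 1 and 359 differ by 2 under a “cycle_range” of 360.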
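
The fallback rules for a continuous feature's entry in deviations (a single value, or a list of up to three values) can be sketched as follows. This is an illustrative Python helper, not Amalgam code; resolve_deviation and its max_difference argument are hypothetical, with max_difference standing in for the maximum difference that the description says is computed from the data. The nominal forms (assocs of deviations) are not covered.

```python
def resolve_deviation(entry, max_difference):
    """Expand one deviations entry into a 3-tuple.

    Returns (deviation, difference_if_one_null, difference_if_both_null).
    max_difference is a hypothetical stand-in for the maximum difference
    computed from the data.
    """
    if entry is None:
        # A null deviation is equivalent to a deviation of 0.
        return (0.0, max_difference, max_difference)

    if isinstance(entry, (int, float)):
        # Single value: null differences fall back to the maximum difference.
        return (float(entry), max_difference, max_difference)

    values = list(entry)
    deviation = float(values[0])
    # Second value: difference when exactly one of the compared values is null.
    one_null = values[1] if len(values) > 1 else max_difference
    # Third value: difference when both are null; defaults to the second value.
    both_null = values[2] if len(values) > 2 else one_null
    return (deviation, one_null, both_null)


if __name__ == "__main__":
    print(resolve_deviation(0.5, 10))          # (0.5, 10, 10)
    print(resolve_deviation([0.5, 3], 10))     # (0.5, 3, 3)
    print(resolve_deviation([0.5, 3, 4], 10))  # (0.5, 3, 4)
```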
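
The list form of selection_bandwidth combines several constraints, and one possible reading of how they interact is sketched below in Python. Everything here is an assumption made for illustration: the function name select_count is hypothetical, the input is taken to be per-entity masses already sorted from largest to smallest, "percent of mass" is read as an entity's mass divided by the cumulative mass including it, the minimum count defaults to 0, and the additional entities are added before the maximum is enforced. The actual opcode may resolve these details differently.

```python
def select_count(masses, selection_bandwidth):
    """Number of entities to keep under one reading of selection_bandwidth.

    masses: per-entity masses, sorted largest first (an assumption).
    selection_bandwidth: a number, or a list
        [min_incremental_mass, min_count, max_count, extra_count].
    """
    total_available = len(masses)
    if isinstance(selection_bandwidth, (int, float)):
        return min(int(selection_bandwidth), total_available)

    params = list(selection_bandwidth)
    min_incremental = params[0]
    min_count = params[1] if len(params) > 1 else 0
    max_count = params[2] if len(params) > 2 else total_available
    extra = params[3] if len(params) > 3 else 0

    # Keep adding entities while the next one still contributes at least
    # min_incremental of the cumulative mass.
    count = 0
    cumulative = 0.0
    for mass in masses:
        cumulative += mass
        if cumulative <= 0 or mass / cumulative < min_incremental:
            break
        count += 1

    # Raise to the minimum, add the extras, then let the tightest cap govern.
    count = max(count, min_count) + extra
    return min(count, max_count, total_available)


if __name__ == "__main__":
    equal = [1.0] * 30
    print(select_count(equal, [0.05]))         # 20
    print(select_count(equal, [0.05, 1, 10]))  # 10
    print(select_count(equal, 7))              # 7
```

With thirty equally weighted entities and a threshold of 0.05, the sketch stops at 20, matching the "at most 20 entities" example in the selection_bandwidth description.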