Chengwei LEI, Ph.D.    Associate Professor

Department of Computer and Electrical Engineering and Computer Science
California State University, Bakersfield

 

(Dis)Similarity / Correlation



Like a distance measure, a (dis)similarity calculation quantifies how similar (or different) two data points are. The term "(dis)similarity" is broader, however: it covers distance measures as well as other measures that capture different kinds of disparity between points. It is used throughout clustering, classification, and other learning algorithms to describe how "(un)like" two points are.

While distance measures are always non-negative and symmetric, a (dis)similarity measure can be asymmetric
(i.e., the (dis)similarity between A and B is not necessarily the same as between B and A), and some measures, such as correlation coefficients, can take negative values.
In the broadest sense, (dis)similarity can refer to any calculation that reflects how similar (or different) two objects are.

 

What methods can we use?

 


Cosine similarity
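
Cosine similarity compares the direction of two vectors and ignores their length: it is the dot product divided by the product of the two Euclidean norms. A minimal Python sketch under that standard definition follows; the function name and the example vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical term-count vectors, for illustration only.
doc1 = [1, 2, 0]
doc2 = [2, 4, 0]  # same direction as doc1, twice the length
doc3 = [0, 0, 3]  # shares no non-zero component with doc1

print(cosine_similarity(doc1, doc2))  # close to 1: same direction
print(cosine_similarity(doc1, doc3))  # 0.0: orthogonal vectors
```

Because only direction matters, doc1 and doc2 count as maximally similar even though their Euclidean distance is not zero.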

 

 

 




Jaccard Similarity
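
Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below assumes that standard definition; the function name, the example sets, and the choice to treat two empty sets as identical are illustrative assumptions.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)

# Hypothetical shopping baskets, for illustration only.
basket1 = {"apple", "banana", "cherry"}
basket2 = {"banana", "cherry", "date"}

print(jaccard_similarity(basket1, basket2))  # 2 shared items / 4 distinct items = 0.5
```

The corresponding Jaccard distance is 1 minus this value.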

 

 

 




Pearson Correlation Coefficient

 

 

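The advantage of the Pearson coefficient over the Euclidean distance is that it is more robust to data that is not normalized: its value does not change when either variable is shifted by a constant or multiplied by a positive constant.

The value ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship.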
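
The robustness to unnormalized data can be checked with a small sketch. It assumes the standard sample definition of the Pearson coefficient (covariance divided by the product of the standard deviations); the function names and the data are illustrative.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10 * v + 100 for v in x]  # same pattern, different scale and offset

print(pearson(x, y))    # close to 1: the pattern is identical
print(euclidean(x, y))  # large: dominated by the scale and offset
```

The Euclidean distance reports the two series as far apart purely because of their units, while the Pearson coefficient sees the same linear pattern.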

 




Spearman's Rank Correlation Coefficient

 

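When there are no tied ranks, the coefficient can be computed as rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where
d_i is the difference between the two ranks of observation i, and
n is the number of observations.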

 

 

In simple words, we first rank the values of each variable separately, then calculate the Pearson Correlation Coefficient on the ranks.

Note that, unlike Pearson's correlation, Spearman's coefficient does not require the data to be normally distributed or the relationship to be linear; it is a nonparametric statistic that depends only on the ranks.
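
The "rank first, then Pearson" recipe can be sketched as follows, together with the shortcut based on the rank differences d_i. The helper names and the data are illustrative assumptions; tied values receive the average of their ranks, and the shortcut is exact only when there are no ties.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Ranks 1..n; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_from_d(x, y):
    """Shortcut 1 - 6*sum(d_i^2) / (n*(n^2 - 1)); exact only without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [10, 20, 30, 40, 50]
y = [2, 1, 4, 3, 5]       # two adjacent pairs are swapped
cubes = [v ** 3 for v in x]  # monotonic but nonlinear in x

print(spearman(x, y), spearman_from_d(x, y))  # both approximately 0.8
print(spearman(x, cubes))                     # close to 1: perfectly monotonic
```

The last line shows why ranking helps: a monotonic but nonlinear relationship still receives a coefficient of 1.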

 






 






Distance Correlation 

 

 

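Distance correlation measures the dependence between two random variables or random vectors, which need not have the same dimension. It takes values between 0 and 1 and can detect any kind of dependence between variables (linear or nonlinear); because it is built only from pairwise distances between observations, it can also be applied to mixed data types once a suitable distance is chosen for each variable.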
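
A minimal sketch of the sample distance correlation for two one-dimensional variables is shown below. It assumes the usual construction from double-centered pairwise distance matrices (the simple, biased sample version); the function names and the data are illustrative.

```python
import math

def _double_centered_distances(values):
    """Pairwise |a - b| matrix with row, column, and grand means removed."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]  # equal to column means (symmetric)
    grand_mean = sum(row_means) / n
    return [[dist[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D sequences."""
    n = len(x)
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar2_x = sum(v * v for row in a for v in row) / n ** 2
    dvar2_y = sum(v * v for row in b for v in row) / n ** 2
    if dvar2_x == 0 or dvar2_y == 0:
        return 0.0  # a constant variable carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar2_x * dvar2_y))

x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]
y_square = [v * v for v in x]  # depends on x, but Pearson correlation is 0

print(distance_correlation(x, y_linear))  # close to 1
print(distance_correlation(x, y_square))  # clearly above 0
```

For y_square the Pearson coefficient is exactly 0 because the relationship is symmetric around 0, yet the distance correlation is well above 0 and so reveals the dependence.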

 




Gower's distance

 

 

Gower's distance is a dissimilarity measure between two mixed-type objects. It can handle different types of data within the same dataset and is particularly useful in cluster analysis or other multivariate statistical techniques.
Variables can be binary, categorical, ordinal, or continuous. It works by computing, for each variable, a difference between the two objects that is normalized to the range 0 to 1, and then taking a weighted average of these per-variable differences.

 

The value is between 0 and 1, with smaller values indicating higher similarity (0 means the two objects agree on every variable).
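
The weighted average of normalized per-variable differences can be sketched as below. This is a simplified illustration, not a full implementation: numeric differences are divided by the variable's range over the dataset, categorical and binary variables use a 0/1 mismatch, and ordinal variables are assumed to be rank-coded and treated as numeric. The function name, the parameters, and the example records are hypothetical.

```python
def gower_distance(a, b, kinds, ranges, weights=None):
    """Gower-style dissimilarity between two records.

    kinds[k]  : "numeric" or "categorical" for variable k
    ranges[k] : max - min of variable k over the dataset (numeric only)
    weights[k]: optional non-negative weight for variable k
    """
    if weights is None:
        weights = [1.0] * len(kinds)
    total = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "numeric":
            d = abs(x - y) / rng if rng else 0.0  # range-normalized difference
        else:
            d = 0.0 if x == y else 1.0            # simple mismatch
        total += w * d
    return total / sum(weights)

# Hypothetical records: (age, smoker, city); age spans 40 years in the dataset.
kinds = ["numeric", "categorical", "categorical"]
ranges = [40, None, None]
person1 = (25, True, "Bakersfield")
person2 = (45, True, "Fresno")

print(gower_distance(person1, person2, kinds, ranges))  # (0.5 + 0 + 1) / 3 = 0.5
```

Each variable contributes a value between 0 and 1, so the weighted average also stays between 0 and 1.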




Kendall Rank Correlation Coefficient

 

 

The Kendall rank correlation coefficient, commonly referred to as Kendall's τ coefficient, is a statistic used to measure the ordinal association between two measured quantities.
It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities.
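
One common way to compute it is to count concordant and discordant pairs of observations. The sketch below implements the simplest variant (often called tau-a), which assumes there are no tied values; the function name and the data are illustrative.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total pairs; intended for data without ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both variables order this pair the same way
        elif sign < 0:
            discordant += 1  # the two variables order this pair oppositely
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]  # two adjacent pairs are swapped

print(kendall_tau_a(x, y))  # (8 - 2) / 10 = 0.6
```

Like the other correlation coefficients above, the result lies between -1 (completely reversed ordering) and 1 (identical ordering).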