How the similarity measure is defined in the k-Means Algorithm

measuring data similarity and dissimilarity in data mining and measuring similarity of interests for clustering web-users and measuring semantic similarity between words
Dr.AlbaNathan Profile Pic
Dr.AlbaNathan,United States,Researcher
Published Date:09-08-2017
Your Website URL(Optional)
11 (Dis)similarity measures 11.1 Introduction Whileexploringandexploitingsimilaritypatternsindataisattheheartoftheclusteringtask andthereforeinherentforallclusteringalgorithms,notallofthemadoptanexplicitsimilarity measuretodrivetheiroperation.Suchsimilarity,oractuallymoreoftendissimilaritymeasures (sincetheytypicallytakeminimumvaluesformaximumsimilarity),arefunctionsthatassign real values to instance pairs from the domain and can be used by clustering algorithms both intheclusterformationandclustermodelingprocesses.Suchalgorithmscanbereferredtoas similarity-basedor –perhapsmoreappropriately –dissimilarity-basedclusteringalgorithms. This chapter presents a selection of the most commonly used general-purpose similar- ity and dissimilarity measures for clustering, providing a necessary common background for presenting the most widely used dissimilarity-based clustering algorithms. The latter are described indetail inChapters 12 and 13. Example 11.1.1 Each dissimilarity measure presented in this chapter will be illustrated with a simple R implementation, applied to theweathercl data. Utility functions EX. 1.5.1 fromthedmr.utilpackageaswellasthestandardizationimplementationavail- able in thedmr.trans package will be also used. The R code presented below loads thepackages and the dataset. library(dmr.util) library(dmr.trans) data(weathercl, package="") 11.2 Measuringdissimilarityandsimilarity A number of dissimilarity and similarity measures have been proposed in the literature and are applied in practice. Their common objective is to measure and express numerically the 314 (DIS)SIMILARITYMEASURES degreetowhichtwoinstancesfromthesamedomain,describedbythesamesetofattributes, are dissimilarfrom or similartoeach other. The most popular general purpose dissimilarityand similaritymeasures presented in this chapter fallintotwocategories: Differencebased. Which transform and aggregate in some way attribute value differences forthe twocompared instances. Correlationbased. Which detect the common pattern of low and high attribute values for thetwocompared instances. Thedistinctionbetweenthemaswellastypesofapplicationstowhichtheyarebestsuitedis presented inthesubsequent sections below. Example11.2.1ThefollowingRcodedenfi es afunctionthatautomatesdissimilaritymatrix generationwithrespecttoagivendissimilaritymeasurethatwillbeusedforthesesubsequent examples. For the sake of simple illustration, it is applied below to generate the dissimilarity matrixfortheweathercldatausingatotallymeaninglessdissimilaritymeasure,denfi ed asthe absolute difference of instance numbers. dissmat - function(data, diss) as.dist(outer(1:nrow(data), 1:nrow(data), Vectorize(function(i, j) if (j=i) diss(datai,, dataj,) else NA)), diag=TRUE, upper=TRUE) dummy dissimilarity matrix for the weathercl data dummy.diss - function(x1, x2) abs(as.integer(row.names(x1))-as.integer(row.names(x2))) dissmat(weathercl, dummy.diss) The function returns a dist object, used in R to represent dissimilarity (or distance) matrices. Only the lower triangle of the matrix is actually calculated and stored to save time and space. 11.3 Difference-baseddissimilarity Difference-baseddissimilaritymeasuresareparticularlycommonandoftenadoptedasdefault fordissimilarity-basedclusteringalgorithmsunlessexistingdomainknowledgesuggeststhat they maybe inappropriate. 11.3.1 Euclidean distance The Euclidean distance –by far the most widely and frequently applied dissimilarity measure – iswell known fromgeometry. For twoinstancesx ,x itiscalculated as 1 2 √ √ n √ ∑ euc √ 2 𝛿 (x ,x ) = (a (x )−a(x )) (11.1) 1 2 i 1 i 2 i=1 DIFFERENCE-BASEDDISSIMILARITY 315 2 whichisactuallytheL normofthedifferenceoftheirattributevaluevectors,a(x )−a(x ). 1 2 This formula is directly applicable to domains with continuous attributes only, but it is not uncommontouseitwhensomeattributesarediscreteaswell,byappropriatelyredenfi ing the differencea (x )−a (x )as 𝛿 (a (x ),a (x )),where i 1 i 2 01 i 1 i 2 0 if𝑣 = 𝑣 1 2 𝛿 (𝑣 ,𝑣 ) = (11.2) 01 1 2 1 otherwise SuchabinarydiscretevaluedifferencemakesitpossibletousetheEuclideandissimilarity(as wellasseveralotherrelateddissimilaritymeasuresthatwillbepresentedlater)whenthereare bothdiscreteandcontinuousattributes,butmaynotalwaysbeappropriatewhenallattributes are discrete. This is when it reduces to measuring dissimilarity by the number of different attributevalues,yieldingasmany(orratheraslittle)distinctdissimilaritylevelsasattributes. For realistically sized datasets this may yield large numbers of equally dissimilar instances. Itisnotamajorproblemsincedissimilarity-basedclusteringalgorithmsaretypicallyapplied to datasets with all or most attributes being continuous anyway, but if necessary –it could be alleviated by denfi ing custom nonbinary per-attribute discrete value difference measures basedondomainknowledge,whichwouldnotconsideralldifferentvaluesequallydifferent. Example11.3.1ThefollowingRcodedenfi es afunctiontocalculatetheEuclideandistance. It handles both continuous and discrete attributes using an auxiliary avdiff function that returns attribute value differences appropriately depending on attribute types. The Euclidean distance-based dissimilaritymatrixfor theweathercl data isthen generated. avdiff - function(x1, x2) mapply(function(v1, v2) ifelse(is.numeric(v1), v1-v2, v1=v2), x1, x2) euc.dist - function(x1, x2) sqrt(sum(avdiff(x1,x2)̂2, na.rm=TRUE)) Euclidean distance dissimilarity matrix for the weathercl data dissmat(weathercl, euc.dist) 11.3.2 Minkowskidistance TheEuclideandistancecanbeconsideredthebestknownandmostpracticallyimportantspe- cialcaseofamoregeneralparameterizeddistancefamily,knownastheMinkowskidistance. When applied to measure the dissimilarity of the two instances x and x , it is denfi ed as 1 2 follows: 1 ( ) n p ∑ mink p 𝛿 (x ,x ) = a(x )−a(x ) (11.3) 1 2 i 1 i 2 i=1 where p≥ 1 is a parameter. The Euclidean distance is obtained for p = 2. It can be easily seen that with increasingp large attribute value differences receive relatively more and more weightthansmallerones.Whereasforp = 1thecontributionofeachattributevaluedifference to the calculated distance is proportional to the difference, for larger p the impact of small differences diminishes and the distance mostly depends on large differences. This may be an 316 (DIS)SIMILARITYMEASURES importantchangewhentherearemanyattributesthatdifferentiatethetwocomparedinstances, butonlyafewofthemhavelargevaluedifferences.Forsmallp,theseinstancescouldstillbe considered quitesimilar(assimilarastwoother instances withallattributevalue differences roughly equal and just slightly larger than the small differences mentioned before), but for largerp they would be considered substantiallydissimilar. Example 11.3.2 The following R code demonstrates Minkowski distance calculation, in the very same manner as presented before for the Euclidean distance. Dissimilarity matrices for theweathercl data are generated usingp = 1 andp = 3. mink.dist - function(x1, x2, p) (sum(abs(avdiff(x1,x2))̂p, na.rm=TRUE))̂(1/p) Minkowski distance dissimilarity matrices for the weathercl data dissmat(weathercl, function (x1, x2) mink.dist(x1, x2, 1)) dissmat(weathercl, function (x1, x2) mink.dist(x1, x2, 3)) 11.3.3 Manhattan distance ThespecialcaseoftheMinkowskidistanceobtainedforp = 1isalsoknownastheManhattan distance. This corresponds to a metaphor of going from one street crossing to another in a grid-arrangedcitybymovingalongstreets(ratherthanfollowingthestraightlinejoiningthe two points, as for the Euclidean distance). As discussed above, it makes the contribution of each attribute value difference proportional to the difference. The formula for the Manhattan distance can besimpliefi d to n ∑ man 𝛿 (x ,x ) = a (x )−a (x ) (11.4) 1 2 i 1 i 2 i=1 Example 11.3.3 The R code presented below denfi es a function for Manhattan distance cal- culation by simply wrapping a call to themink.dist function from the previous example. As before, it isapplied toobtain the dissimilaritymatrixfortheweathercl data. man.dist - function(x1, x2) mink.dist(x1, x2, 1) Manhattan distance dissimilarity matrix for the weathercl data dissmat(weathercl, function (x1, x2) man.dist(x1, x2)) 11.3.4 Canberradistance A modiefi d form of the Manhattan distance that relates the attribute value differences to the values themselves isknown as theCanberradistance and calculated as n ∑ a (x )−a (x ) can i 1 i 2 (x ,x ) = 𝛿 (11.5) 1 2 a (x )+a(x ) i 1 i 2 i=1 This adjusts the contribution of each attribute value difference so that the same absolute dif- ference receives more or less weight depending on whether it is observed for small or large DIFFERENCE-BASEDDISSIMILARITY 317 values.Italsomakesthedissimilaritymeasurelesssensitivetodifferencesinattributeranges thanforotherdifference-baseddissimilaritymeasures.Eachterminthesummationobviously fallstothe 0,1 interval, regardless of the ranges ofparticular attributes. A decfi iency of the Canberra distance is its somewhat degenerate behavior when an attribute’s value for one or both compared instances approaches 0. If one ofa(x ),a (x ) is i 1 i 2 close to 0, the contribution of a to the calculated distance is roughly equal to 1 regardless i of the difference a (x )−a (x ). If both a(x ),a (x ) are close to 0 (and, accordingly, also i 1 i 2 i 1 i 2 close to each other), the contribution of a may also approach 1 even if the attribute value i difference isactually very small. Sincetheformulareferstoattributevaluesthemselvesapartfromtheirdifferences,itmight appearthat,unlikethedifference-baseddissimilaritymeasurespresentedabove,theCanberra distancecannotbedirectlyappliedtodomainswithdiscreteattributessimplybyredenfi ing the attribute value difference appropriately. This is easily solved though by replacing the whole a (x )−a (x ) i 1 i 2 ratio for discrete attributes with 1 if the attribute values differ and 0 if they are a (x )+a (x ) i 1 i 2 the same. Example11.3.4ThefollowingRcodeimplementsCanberradistancecalculationandapplies ittotheweathercldata.Itusesanauxiliaryfunctionravdifftohandlebothcontinuousand discrete attributes. ravdiff - function(x1, x2) mapply(function(v1, v2) ifelse(is.numeric(v1), (v1-v2)/(abs(v1)+abs(v2)), v1=v2), x1, x2) can.dist - function(x1, x2) sum(abs(ravdiff(x1,x2)), na.rm=TRUE) Canberra distance dissimilarity matrix for the weathercl data dissmat(weathercl, function (x1, x2) can.dist(x1, x2)) 11.3.5 Chebyshevdistance The Manhattan distance corresponds to one extreme special case of the Minkowski distance, with small and large attribute value differences having proportionally the same impact. The otherextremespecialcase,obtainedforp = ∞,makesthedistanceequaltothesinglelargest attribute value difference. The resulting dissimilarity measure is called the Chebyshev dis- tance, also known as thechessboarddistance (as it provides the minimum number of moves necessary for a chess king to go from one square to another) or the maximum metric. The definition issimplywrittenas follows: cheb 𝛿 (x ,x ) = maxa(x )−a(x ) (11.6) 1 2 i 1 i 2 i WhiletheMinkowskidistanceisnotdirectlycomputableinmachinearithmeticfortoolarge p, itapproaches the Chebyshev distance even formoderatepvalues. It is not particularly common to see the Chebyshev distance being used as a clustering dissimilarity measure since in many practical clustering tasks it might be inappropriate to entirelyfocusonasinglemostdifferentiatingattributeforeachpairofcomparedinstances.It may be reasonable, though, insome specicfi domains. 318 (DIS)SIMILARITYMEASURES Example 11.3.5 The R code presented below denfi es a function for calculating the Cheby- shev distance and demonstrates its application to the weathercl data. For comparison, the Minkowski distance dissimilarity matrix for p = 10 is also calculated. It can be veriefi d to approximate theChebyshev dissimilaritymatrixquite closely. cheb.dist - function(x1, x2) max(abs(avdiff(x1,x2)), na.rm=TRUE) Chebyshev distance dissimilarity matrix for the weathercl dissmat(weathercl, cheb.dist) roughly the same as dissmat(weathercl, function (x1, x2) mink.dist(x1, x2, 10)) 11.3.6 Hammingdistance Whereas all the difference-based dissimilarity measures presented above are primarily intended for continuous attributes and can be “hacked” to be applicable to discrete attributes when necessary, the Hamming distance is targeted at discrete attributes only and – whilst can be calculated –makes usually little sense for continuous attributes. It is denfi ed as the number of attributesthat take different values forthe twocompared instances: n ∑ ham 𝛿 (x ,x ) = I (11.7) 1 2 a (x )≠a (x ) i 1 i 2 i=1 where the I notation is used to denote an indicator function that yields 1 when the condition condition issatisefi d and 0otherwise. The biggest problem with the Hamming distance is that it takes the very limited set of values0,1, … ,n,wherenisthenumberofattributes.Thismaynotprovidesufcfi ient dis- similaritydiversicfi ation toreasonablydrivetheclusterformationprocess,unlessthenumber ofattributesismuchgreaterthanthenumberofclusterstobecreated.Thisiswhyitisusually pointlesstoapplytheHammingdistanceunlesstherearenocontinuousattributesdenfi ed on the domain. If all attributes are discrete, the previously presented dissimilarity measures suf- ferfromtheverysameproblem,butwithatleastsomecontinuousattributestheyyieldmuch more diversified dissimilaritymatrices. Example 11.3.6 The following R code implements the Hamming distance and applies it to theweathercl data. For the sake of illustration, the fact that the presence of two continuous attributes makes itaquestionable choice isignored. ham.dist - function(x1, x2) sum(x1=x2, na.rm=TRUE) Hamming distance dissimilarity matrix for the weathercl dissmat(weathercl, ham.dist) 11.3.7 Gower’s coefficient The most commonly used difference-based dissimilarity measures presented so far favor either continuous (as for the whole Minkowski distance-based family) or discrete (as for DIFFERENCE-BASEDDISSIMILARITY 319 the Hamming distance) attributes.Gower’scoefcfi ient is a dissimilarity measure specifically designedwiththeintentionofhandlingmixedattributetypes.Thecoefcfi ient iscalculatedas the weighted average of individual attribute contributions, with weights usually used only to indicate which attributevalues could actually be compared meaningfully: gow,i 𝛾 𝛿 (a (x ),a (x )) i i 1 i 2 gow 𝛿 (x ,x ) = (11.8) ∑ 1 2 n 𝛾 i=1 i gow,i where𝛾 istheweightassignedtoattributea and𝛿 istheattributevaluesimilaritymeasure i i for attributei,denfi ed as 𝑣 −𝑣 1 2 ⎧ ifa iscontinuous i max a (x)−min a (x) x∈X i x∈X i ⎪ gow,i 𝛿 (𝑣 ,𝑣 ) = (11.9) 0 ifa 1 2 ⎨ isdiscreteand 𝑣 = 𝑣 i 1 2 ⎪ 1 ifa isdiscreteand 𝑣 ≠ 𝑣 ⎩ i 1 2 Attributeweightsarenormallyequalto1unlessanattribute’svalueforoneorbothinstances is missing, when the corresponding weight is set to 0. It is also possible to incorporate some domain-specicfi attributeweights.Itisprobablymorecommontoencounteraslightlydiffer- entversionofGower’scoefcfi ient formeasuringsimilarityratherthandissimilarity,inwhich the contributions of allattributes are1s complements of those denfi ed above. Although often recommended as the best approach to handling mixed attribute types in dissimilaritymeasuring,Gower’scoefcfi ient isnotactuallyverydifferentfromtheManhattan or Canberra distances, modiefi d to accommodate discrete attributes by redenfi ing attribute value differences. It is particularly closely related to the latter, differing only in the divisor applied to continuous attribute value differences. In Gower’s coefcfi ient, they are divided by the attribute’s range, formally denfi ed above as the difference between the maximum and minimumvalueoftheattributeinthedomain,butinpracticedeterminedbasedonthetraining set used for clustering. This is another and arguably the better approach to ensuring that the contributionofeachcontinuousattributeisbetween0and1,sincethereisnoriskofanomalies fornear-zerovaluestheCanberradistancesuffersfrom.ItdoesnotnecessarilymakeGower’s coefcfi ient superiortothemembersoftheMinkowskidistancefamily,though,since –aswill be discussed below –it is a common practice to have continuous attributes standardized or normalized prior totheir application. Example 11.3.7 The following R code implements a simpliefi d version of Gower’s dissim- ilarity coefcfi ient in which –instead of explicit attribute weighting –terms corresponding to missing attribute values are removed when averaging. The implementation is then applied to the weathercl data, with the ranges function used to determine continuous dmr.util attributeranges. gower.coef - function(x1, x2, rngs) mean(mapply(function(v1, v2, r) ifelse(is.numeric(v1), abs(v1-v2)/r, v1=v2), x1, x2, rngs), na.rm=TRUE) Gower’s coefficient dissimilarity matrix for the weathercl data dissmat(weathercl, function (x1, x2) gower.coef(x1, x2, ranges(weathercl))) 320 (DIS)SIMILARITYMEASURES 11.3.8 Attribute weighting Difference-based dissimilarity measures make it possible and straightforward to incorpo- rate any available domain knowledge about the relative importance of particular attributes for instance dissimilarity assessment. It can be achieved by assigning numerical weights to attributes and weighting the corresponding attribute values accordingly. Assuming that 𝛾 i denotes the weight assigned to attribute a , the dissimilarity measure denfi itions presented i above can be rewritteninthe following weighted form: WeightedEuclideandistance. √ √ n √∑ √ euc 2 2 𝛿 (x ,x ) = 𝛾 (a (x )−a (x )) (11.10) 1 2 i 1 i 2 i i=1 WeightedMinkowskidistance. 1 ) ( p n ∑ p mink p 𝛿 (x ,x ) = 𝛾 a (x )−a (x ) (11.11) 1 2 i 1 i 2 i i=1 WeightedManhattandistance. n ∑ man 𝛿 (x ,x ) = 𝛾 a(x )−a(x ) (11.12) 1 2 i i 1 i 2 i=1 WeightedCanberradistance. n ∑ a(x )−a(x ) i 1 i 2 can 𝛿 (x ,x ) = 𝛾 (11.13) 1 2 i a(x )+a(x ) i 1 i 2 i=1 WeightedChebyshevdistance. cheb 𝛿 (x ,x ) = max𝛾 a (x )−a (x ) (11.14) 1 2 i i 1 i 2 i WeightedHammingdistance. n ∑ ham 𝛿 (x ,x ) = 𝛾 I (11.15) 1 2 i a (x )≠a (x ) i 1 i 2 i=1 Gower’s dissimilarity is missing in this list only because it already incorporates weights in its original form. Although typically only used to eliminate the impact of attributes with missing values, they can also be used to express domain-specicfi knowledge on attribute importance. 11.3.9 Attribute transformation One common problem with the most difference-based dissimilarity or similarity measures is theirobvioussensitivitytodifferencesinattributerangesanddistributions.Withtheexception CORRELATION-BASEDSIMILARITY 321 of the Canberra distance and Gower’s coefcfi ient, which compensate for such differences in someways,theothermeasurespresentedabovemaybeeasilyfooledwhenappliedtodomains withcontinuousattributesthatrepresentquantitiesexpressedinsubstantiallydifferentscales. The same absolute value difference always yields the same contribution to the calculated measure, whereas it may actually indicate similarity for attributes with large or relatively diversified values and dissimilarityfor attributes withsmallor relatively uniform values. A typical solution to this problem for dissimilarity or similarity measures that do not address it somehow internally (as in the case of the Canberra distance or Grower’s coef-fi cient) is to preprocess the data before performing any dissimilarity calculations by applying appropriate attribute transformations. Two transformations that may be useful in this context are standardization and normalization, presented in Sections 17.3.1 and 17.3.2, with the for- merbeingparticularlypopular.Itisnotuncommonforimplementationsofdissimilarity-based clustering algorithms to include such a transformation within their functional scope and per- form it either by default or at the user’s request. If this is not the case, it can be easily applied before running a clustering algorithm. It is usually a good idea to do so, unless some domain-specicfi background knowledge suggests otherwise. In any case, the decision needs tobetakenconsciouslyandthedefaultbehavioroftheparticularclusteringalgorithmimple- mentation should be carefullychecked and understood. AswillbediscussedinSection17.2.5,wheneveranattributetransformationisperformed based on some parameters derived from a dataset, which is subsequently used for creating a predictive model, the very same transformation parameters should be used to transform new datapriortoapplyingthemodelforprediction.Thisisthecase,inparticular,foraclustering modelthatcanbeusedtopredictclustermembershipfornewinstances.Ifthetrainingsetused for model creation is standardized or normalized, the underlying transformation parameters (the mean and standard deviation for the former, the range for the latter) should be retained and re-applied when transformingnew data before generating predictions. Example 11.3.8 The R code presented below performs the standardization of continuous attributes in the weathercl data using the std.all and predict.std func- EX. 17.3.1 tions and then generates the Euclidean distance dissimilarity matrix based on the dmr.trans standardized dataset. ∼ weathercl.std - predict.std(std.all(. ., weathercl), weathercl) dissmat(weathercl.std, euc.dist) 11.4 Correlation-basedsimilarity Measuring instance similarity based on attribute value differences appears perfectly reason- able and is indeed the right way to follow in most situations, but for some domains it may yield misleading results. This is the case whenever instances that differ substantially with respect to attribute values should still be considered similar based on domain knowledge as sharingroughlythesamerelativevaluepattern.Considerattributesthatrepresentfrequencies of some events, word occurrence counts in text documents, performance or quality evalua- tions, ratings or preferences expressed by some individuals, etc. It is not necessarily value differences that really matter for them when comparing two instances but rather patterns of 322 (DIS)SIMILARITYMEASURES “highs” and “lows”: Are mostly the same events particularly frequent; do mostly the same wordsdominate;domostlythesamepeople,organizations,devices,orwhateverotherentities achieve top performance or quality; are mostly the same items highly rated or preferred? Thisconsiderablydifferentkindofsimilaritycanbecapturedbycorrelation-basedmeasures, which – unlike difference-based measures –typically indeed measure similarity rather than dissimilarity,assigninghigh values tosimilarinstances and low values todissimilarones. 11.4.1 Discrete attributes Correlation-based similarity measures are even more oriented toward continuous attributes than most difference-based measures. Actually, the whole justicfi ation behind them is stronglybasedontheassumptionthatattributesarecontinuous(oratleastordered,whichcan be treated as continuous when calculating these measures simply by assigning consecutive integer numbers to their ordered values). This does not limit their utility in any serious way since –whenever it makes sense to consider their application –the assumption is almost alwayssatisefi d. Theonlysituationthatmayneedsomeworkaroundisthatofasmallnumber of discrete attributes accompanying a larger number of continuous attributes that –based on the available domain knowledge –askfor correlation-based similaritymeasures. The workaround is fortunately easily available and the same as used for modeling algo- rithms based on a parametric representation, such as linear classicfi ation and linear regres- sion. This is the binary encoding technique described in Section 17.3.5 that replaces a dis- cretek-valuedattributea ∶X→ 𝑣 ,𝑣 , … ,𝑣 withk(or– actually –k−1,toavoidredun- 1 2 k dancy) binary attributes. These new attributes are then treated as continuous for similarity calculation. Example 11.4.1 The following R code demonstrates how the discode function for binary discrete attribute encoding can be applied to transform selected instances from the weathercl data to an all-continuous representation. This transformation will EX. 17.3.5 be used when calculating correlation-based similarity measures in subsequent dmr.trans examples. ∼ discode( ., weathercl1,) ∼ discode( ., weathercl5,) 11.4.2 Pearson’s correlation similarity The most straightforward way to measure similarity based on attribute value correlation is to use Pearson’s linear correlation, denfi ed in Section 2.5.2. Maximally similar instances are assigned values approaching 1 and maximally dissimilar instances are assigned values approaching −1. Example11.4.2ThefollowingRcodedemonstratestheapplicationofPearson’scorrelation, calculated using a simple wrapper around the standard R corr function, with the discrete attribute handling capability added, to theweathercl data. Despite using thedissmat func- tion, the obtained output isactually asimilarityrather than dissimilaritymatrix. CORRELATION-BASEDSIMILARITY 323 pearson.sim - function(x1, x2) ∼ ∼ cor(unlist(discode( ., x1)), unlist(discode( ., x2)), method="pearson", use="pairwise.complete.obs") Pearson similarity matrix for the weathercl data dissmat(weathercl, pearson.sim ) 11.4.3 Spearman’scorrelation similarity According to Pearson’s correlation, two instances, to be considered highly similar, not only needtosharethesamelow-highpatternofattributevalues,butalsotobelinearlyrelated.The latterrequirementisrelaxedwithSpearman’srankcorrelation,alsodenfi ed inSection2.5.2.It maythereforebeabetterchoiceunlessonedeliberatelyseeksforlinearrelationshipsbetween instances as indications of similarity. Example 11.4.3 Following the pattern of the previous example, the following R code imple- ments and demonstrates thesimilaritymeasure based onSpearman’s correlation. spearman.sim - function(x1, x2) ∼ ∼ cor(unlist(discode( ., x1)), unlist(discode( ., x2)), method="spearman", use="pairwise.complete.obs") Spearman similarity matrix for the weathercl data dissmat(weathercl, spearman.sim ) 11.4.4 Cosine similarity Anotherrelatedsimilaritymeasurethathasgainedhighpopularity,particularlyintextcluster- ingapplications,isthecosinesimilarity,whichconsiderstwocomparedinstancesasattribute value vectors and calculates thecosine of theangle between them: ∑ n a(x )a (x ) i 1 i 2 i=1 cos(x ,x ) = (11.16) √ √ 1 2 ∑ ∑ n n 2 2 a (x ) a (x ) i=1 1 i=1 2 i i or more readably, a(x )⚬a(x ) 1 2 cos(x ,x ) = (11.17) 1 2 a(x ) ⋅a(x ) 1 2 where a(x )⚬a(x ) denotes the dot (inner) product of attribute value vectors representing 1 2 instances x and x , and a(x ) and a(x ) are their Euclidean norms. High cosine 1 2 1 2 values correspond to attribute value vectors pointing roughly in the same direction in the n-dimensional attributespace, indicating high similarity. 324 (DIS)SIMILARITYMEASURES Example 11.4.4 The R code presented below implements cosine similarity calculation and demonstrates its application to the weathercl data. The cosine function performs theactualcosinecalculationfortwocontinuousvectors,withthediscreteattribute dmr.util encoding applied. ∼ ∼ cos.sim - function(x1, x2) cosine(discode( ., x1), discode( ., x2)) cosine similarity matrix for the weathercl data dissmat(weathercl, cos.sim) 11.5 Missingattributevalues Missingattributevalues, acommonproblem forreal-worlddatasets,have anobvious impact oninstancesimilarityassessment.Bothdifference-basedandcorrelation-basedmeasurespre- sented above cannot meaningfully deal with attributes that have missing values for one or bothinstancesbeingcomparedandthusareunabletoestimatetheircontributiontotheoverall instance similarityor dissimilarity.Possibleworkarounds forthis problem include: Omit. Skip attributes with missing values for one or both instances in dissimilarity calculation (and possibly scale up the obtained dissimilarity accordingly, if using a difference-based measure), Impute. Fill-in missing attribute values in a data preprocessing phase using some imputa- tiontechniques, Processinternally. Use some internal techniques to estimate the contribution of attributes withmissingvalues tothecalculated dissimilaritymeasure. The omitting approach is probably the most commonly applied in practice for its sim- plicity, at least unless the number of missing attribute values in the data is prevailing. The functions implementing dissimilarity measure calculation presented in the above examples adoptasimpliefi d versionofthisapproachbyincludingthena.rm=TRUEargumentincalls to aggregating functions such as sum or max, but without subsequent scaling. However, to keep the obtained dissimilarity consistent with those calculated without missing values, the aggregated contributions of attributes with nonmissing values should be scaled up propor- tionallytothenumberofattributesactuallyused,i.e.,multipliedbytheratioofthenumberof all attributes to the number of attributes with nonmissing values for both instances. Gower’s dissimilaritycoefcfi ient assumes that attributeweighting isusedtoimplement thisscheme. The last approach may be sometimes worthwhile to consider when there are many miss- ing attribute values. Clearly it only makes sense for attributes that have missing values for justoneofthetwoinstancesbeingcompared.Theavailablevaluefortheotherinstancescan then contribute to the calculated dissimilarity based on how typical or untypical it is for the attributeinquestion.Moreprecisely,whenevertheexactcontributionofanattributecannotbe determined due to one missing value, itsexpected contribution can be used instead. For con- tinuous attributes, this reduces to substituting the mean attribute value for the missing one, which is equivalent to the most common type of imputation. For discrete attributes, which represent a somewhat more interesting case, the expected contribution can be calculated by FURTHERREADINGS 325 considering all possible values in place of the missing one and averaging the corresponding contributions with weights set to their probabilities estimated from the data. With the con- tribution of a discrete attribute to a difference-based dissimilarity measure being 0 for equal values and 1 otherwise, this reduces to the probability of the attribute’s value being different than itsactual value forthe instance forwhich itisnot missing. 11.6 Conclusion Using explicit dissimilarity measures to guide the cluster formation and modeling processes isaverypopularapproach,adoptedbymanywidelyusedclusteringalgorithms.Thisincludes the following major familiesof clustering algorithms: • k-centers clustering, • agglomerative hierarchical clustering, • divisive clustering (since it is usually based on the repeated application of k-centers algorithms). These are discussedinthe next twochapters. The different dissimilarity and similarity measures presented in this chapter can be con- sidered for use with all (dis)similarity-based algorithms that are efl xible enough not to be tied to a particular single measure. It makes a thoughtful choice possible, based on the avail- able domain knowledge and task requirements, as well as experimental vericfi ation of the effectsofseveralcandidatemeasures.Itisalsopossibletoincorporatedomainknowledgeby appropriate attributetransformation, attribute selection, or attributeweighting. While it is hard to choose the most appropriate (dis)similarity measure for a given clustering task without at least some preliminary experiments, the general choice between difference-based and correlation-based measures is usually much easier if at least some domainknowledge(including,inparticular,attributeinterpretation)isavailable.Itshouldbe sufcfi ient to judge what actually makes instances similar: small attribute value differences or high attribute value correlations. With the former considered the obvious default, the latter may be more appropriate for attributes that represent frequencies of some events, performance or quality of some evaluated entities, ratings, or preferences expressed by some individuals, etc. 11.7 Furtherreadings Clustering falls within the scope of most data mining and many machine learning books and algorithmsbasedoninstance(dis)similaritybelongtothemostpopularones.Thismakesbrief descriptions of basic dissimilarity and similarity measures, accompanying the presentation of dissimilarity-based clustering algorithms, readily available (e.g., Ciosetal. 2007; Witten etal.2011).TheodoridisandKoutroumbas(2008),devotingtheirbookalmostentirelytothe classification and clustering tasks, thoroughly discuss the issue of measuring dissimilarity or similarity for both single instances and sets of instances. A similar discussion is provided by Han et al. (2011) and Tan et al. (2013), as well as in an appendix of the book by Webb (2002). Survey articles and books dedicated to clustering cover a greater variety of measures (Everittetal. 2011; Jain and Dubes 1988; Jainetal. 1999; Kaufman and Rousseeuw 1990), 326 (DIS)SIMILARITYMEASURES including both general-purpose ones for instances represented by attribute value vectors and those designed forspecial nonvector datarepresentations. WiththeEuclideandissimilaritymeasure,aswellasothermeasuresfromtheMinkowski family,beingdistancesforrealvectorspaces,theyarepresentedandtheirpropertiesareexten- sively discussed in mathematical topology literature (e.g., Engelking 1989). The Hamming distance, on the other hand, originates from the study of error-detecting and error-correcting codes (Hamming 1950). These and many other types of distances with different roots and scopes of applications –reaching far beyond clustering and data mining –are systematically presented by Deza and Deza (2013). Gower’s similarity, explicitly permitting mixed discrete-continuous attribute sets, was introduced by Gower (1971). The issue of measuring instance similarity of dissimilarity in the presence of discrete attributes has been addressed by several studies since then, in the contextofbothclustering(DidayandSimon1976;IchinoandYaguchi1994;Ngetal.2007; Sanetal.2004)andmemory-basedlearning(Chengetal.2004;Stanlfi l andWaltz1986;Wil- son and Martinez 1997). They represent various approaches to overcoming the limitation of simplevalueequalitytestsasinGower’scoefcfi ient ortheHammingdistancebyincorporating attributevalue distribution. Simple difference-based dissimilarity measures for continuous attributes have also some limitations.Apartfromthesensitivitytodifferencesinrangesanddistributions,theseinclude ignoring possible relationships among attributes. The Mahalanobis distance is a generaliza- tionoftheEuclideandistancethataddressestheseissues(Mahalanobis1936).Otherpossible improvements include incorporating the context of surrounding instances into dissimilarity calculation (Gowda and Krishna1978; Jarvisand Patrick 1973). Thischapter – asthewholebook –assumestheattribute-valuerepresentationofinstances, with a xfi ed number of discrete or continuous attributes. Dissimilarity measures for several other data representations that may be more appropriate for some domains have been pro- posed, including text strings (Baeza-Yates 1992), text documents (Wajeed and Adilakshmi 2011), tree structures (Zhang 1995), composite symbolic objects (Gowda and Diday 1991), and images (Dubuisson and Jain1994; Huttenlocheretal.1993). References Baeza-Yates RA 1992 Introduction to data structures and algorithms related to information retrieval In Information Retrieval: Data Structures and Algorithms (ed. Frakes WB and Baeza-Yates RA) Prentice-Hall. ChengV,LiCH,KwokJTandLiCK2004Dissimilaritylearningfornominaldata.PatternRecognition 37, 1471–1477. Cios KJ, Pedrycz W, Swiniarski RW and Kurgan L 2007 Data Mining: A Knowledge Discovery Approach. Springer. Deza MMand Deza E 2013EncyclopediaofDistances 2ndedn. Springer. Diday E and SimonJC1976 Clusteringanalysis InDigitalPatternRecognition (ed.FuKS) Springer. DubuissonMPandJainAK1994Amodiefi d HausdorffdistanceforobjectmatchingProceedingsofthe TwelfthInternationalConferenceonPatternRecognition(ICPR-94).IEEEComputerSocietyPress. Engelking R1989GeneralTopology. Heldermann. EverittBS,Landau S,LeeseM andStahl D2011ClusterAnalysis5th edn.Wiley. GowdaKCandDidayE1991Symbolicclusteringusinganewdissimilaritymeasure.PatternRecogni- tion24, 567–578.