Machine Learning – Basic 2 | Software Languages & Systems Laboratory

(** 인터넷 익스플로러에서 보면 파이썬 코드 줄이 제대로 적용되어 보이지 않을 수 있습니다.**)

기계 학습 (Machine Learning) 기초 강의 두 번째로 클러스터링(군집화) 방법을 살펴본다. 군집이란 특성 벡터들의 집합이다. 이 집합으로 대상을 정의하려한다. 예를 들어, 하나의 군집에 포함된 특성 벡터들로 파충류를 정의할 수 있다. 집합 내의 특성 벡터에 지도학습 방법으로 붙인 파충류 라벨을 통해 이 집합이 표현하는 대상을 정의할 수도 있고, 비지도 학습 방법으로 서로 가까운 특성 벡터들 집합들을 각각의 군집으로 묶어 각 군집을 서로 다른 대상으로 분류해서 정의할 수도 있다.

두 표본의 특성 벡터들이 주어졌을 때 하나의 군집에 함께 포함시킬지 서로 다른 군집으로 분리할지 결정하려면 두 특성 벡터가 얼마나 가까운지 측정하는 방법이 필요하다. 동일한 군집(소문자 c)안의 특성벡터들이 얼마나 가까운지 측정하는 방법으로 분산(variance)이 있다. 군집 내의 특성 벡터들의 평균 특성 벡터를 구하고, 각 특성 벡터가 이 평균 특성 벡터와의 떨어져 있는 정도가 분산이다. 평균 특성 벡터는 이 군집의 중심으로 생각할 수 있고, 이를 중심점(센터로이드, centroid)라고 부른다.

다음 파이썬 코드는 Example 클래스와 Cluster 클래스를 선언한다.

class Example(object):
    
    def __init__(self, name, features, label = None):
        #Assumes features is an array of numbers
        self.name = name
        self.features = features
        self.label = label
        
    def dimensionality(self):
        return len(self.features)
    
    def getFeatures(self):
        return self.features[:]
    
    def getLabel(self):
        return self.label
    
    def getName(self):
        return self.name
    
    def distance(self, other):
        return minkowskiDist(self.features, other.getFeatures(), 2)
    
    def __str__(self):
        return self.name +':'+ str(self.features) + ':' + str(self.label)

class Cluster(object):
    
    def __init__(self, examples, exampleType):
        """Assumes examples is a list of example of type exampleType"""
        self.examples = examples
        self.exampleType = exampleType
        self.centroid = self.computeCentroid()
        
    def update(self, examples):
        """Replace the examples in the cluster by new examples
           Return how much the centroid has changed"""
        oldCentroid = self.centroid
        self.examples = examples
        if len(examples) > 0:
            self.centroid = self.computeCentroid()
            return oldCentroid.distance(self.centroid)
        else:
            return 0.0
        
    def members(self):
        for e in self.examples:
            yield e
        
    def size(self):
        return len(self.examples)
    
    def getCentroid(self):
        return self.centroid
    
    def computeCentroid(self):
        dim = self.examples[0].dimensionality()
        totVals = pylab.array([0.0]*dim)
        for e in self.examples:
            totVals += e.getFeatures()
        centroid = self.exampleType('centroid',
                              totVals/float(len(self.examples)))
        return centroid
    
    def variance(self):
        totDist = 0.0
        for e in self.examples:
            totDist += (e.distance(self.centroid))**2
        return totDist**0.5
    
    def __str__(self):
        names = []
        for e in self.examples:
            names.append(e.getName())
        names.sort()
        result = 'Cluster with centroid '\
                 + str(self.centroid.getFeatures()) + ' contains:\n  '
        for e in names:
            result = result + e + ', '
        return result[:-2]

Example 클래스 객체로 이름, 특성벡터, 선택 사항으로 라벨을 저장한다. 이 클래스의 두 객체에 포함된 특성벡터들 간의 민코스키 거리를 구하는 함수도 이 클래스의 멤버함수로 포함되어 있다. Cluster 클래스는 Example 클래스 객체들의 중심점을 계산하는 함수를 제공한다.

특성벡터들이 주어졌을 때 서로 유사한 것들끼리 묶는 군집들로 분류하는 최적의 클러스터링을 구해보자. 이 분류를 위해 `최적'을 정의하는 목적 함수가 필요하다. 일반적으로 이러한 용도로 사용할 최적의 목적 함수를 정하기 어렵기 때문에 미리 몇 개의 군집으로 분류할지 정할 수도 있다. 예를 들어, k개의 군집으로 분류할 것을 정하고, k개의 최적의 클러스터링을 구하는 문제로 바꾸어 풀 수도 있다.

k가 결정된 경우라고 하더라도 최적의 k 클러스터링을 구하는 알고리즘은 매우 시간이 많이 걸린다고 알려져 있다. 따라서 보통 그리디 알고리즘(greedy algorithm)으로 최적의 결과는 아니지만 빠르게 적절한 수준의 클러스터링을 구하는 방식을 취한다.

k-means 클러스터링 알고리즘은 바로 이러한 접근 방법으로 설계한 그리디 알고리즘이다. 이 알고리즘의 입출력과 절차는 다음과 같다.

k-means 클러스터링 알고리즘
입력: k, 특성벡터 데이터
출력: k 군집
1. k개의 중심점을 임의로 정한다.
2. 정해진 k 중심점을 가지고 k 군집을 만든다.
3. 이렇게 만든 k 군집의 중심점을 구한다.
4. 이전 k 중심점과 새로 구한 k 중심점이 같으면 종료하고,
  그렇지 않으면 2단계로 이동해서 반복

def kmeans(examples, exampleType, k, verbose):
    """Assumes examples is a list of examples of type exampleType,
         k is a positive int, verbose is a Boolean
       Returns a list containing k clusters. If verbose is
         True it prints result of each iteration of k-means"""
    #Get k randomly chosen initial centroids
    initialCentroids = random.sample(examples, k)
    
    #Create a singleton cluster for each centroid
    clusters = []
    for e in initialCentroids:
        clusters.append(Cluster([e], exampleType))
        
    #Iterate until centroids do not change
    converged = False
    numIterations = 0
    while not converged:
        numIterations += 1
        #Create a list containing k distinct empty lists
        newClusters = []
        for i in range(k):
            newClusters.append([])

        #Associate each example with closest centroid
        for e in examples:
            #Find the centroid closest to e
            smallestDistance = e.distance(clusters[0].getCentroid())
            index = 0
            for i in range(1, k):
                distance = e.distance(clusters[i].getCentroid())
                if distance < smallestDistance: 
                   smallestDistance = distance 
                   index = i #Add e to the list of examples for the appropriate cluster 
                   newClusters[index].append(e) #Upate each cluster; check if a centroid has changed 
                   converged = True 
                   for i in range(len(clusters)): 
                       if clusters[i].update(newClusters[i]) > 0.0:
                          converged = False
        if verbose:
            print ('Iteration #' + str(numIterations))
            for c in clusters:
                print(c)
            print ('\n') #add blank line
    return clusters

포유류 치아 관련 특성벡터 데이터에 k-means 클러스터링 알고리즘을 적용하여 초식, 육식, 잡식 동물로 분류해보자. 먼저 특성 벡터 데이터는 다음과 같다. 이 데이터가 파일 deltalFormulas.txt에 저장되어 있다고 가정하자.

#Name
#top incisors
#top canines
#top premolars
#top molars
#bottom incisors
#bottom canines
#bottom premolars
#bottom molars
#weight
#Label: 0=herbivore, 1=carnivore, 2=omnivore
Badger,3,1,3,1,3,1,3,2,10,1
Bear,3,1,4,2,3,1,4,3,278,2
Beaver,1,0,2,3,1,0,1,3,20,0
Brown bat,2,1,1,3,3,1,2,3,0.5,1
Cat,3,1,3,1,3,1,2,1,4,1
Cougar,3,1,3,1,3,1,2,1,63,1
Cow,0,0,3,3,3,1,2,1,400,0
Deer,0,0,3,3,4,0,3,3,200,0
Dog,3,1,4,2,3,1,4,3,20,1
Fox,3,1,4,2,3,1,4,3,5,1
Fur seal,3,1,4,1,2,1,4,1,200,1
Grey seal,3,1,3,2,2,1,3,2,268,1
Guinea pig,1,0,1,3,1,0,1,3,1,0
Elk,0,1,3,3,3,1,3,3,500,0
Human,2,1,2,3,2,1,2,3,150,2
Jaguar,3,1,3,1,3,1,2,1,81,1
Kangaroo,3,1,2,4,1,0,2,4,55,0
Lion,3,1,3,1,3,1,2,1,175,1
Mink,3,1,3,1,3,1,3,2,1,1
Mole,3,1,4,3,3,1,4,3,0.75,1
Moose,0,0,3,3,4,0,3,3,900,0
Mouse,1,0,0,3,1,0,0,3,0.3,2
Porcupine,1,0,1,3,1,0,1,3,3,0
Pig,3,1,4,3,3,1,4,3,50,2
Rabbit,2,0,3,3,1,0,2,3,1,0
Raccoon,3,1,4,2,3,1,4,2,40,2
Rat,1,0,0,3,1,0,0,3,.75,2
Red bat,1,1,2,3,3,1,2,3,1,1
Sea lion,3,1,4,1,2,1,4,1,415,1
Skunk,3,1,3,1,3,1,3,2,2,2
Squirrel,1,0,2,3,1,0,1,3,2,2
Woodchuck,1,0,2,3,1,0,1,3,4,2
Wolf,3,1,4,2,3,1,4,3,27,1

파일의 헤더 #Name, #top incisors부터 #weight, #Label은 뒤이어 나열된 데이터 요소들의 설명이다. 헤더 다음의 각 줄에 표본 이름, 특성 벡터, 라벨이 나열되어 있다. 예를 들어, 이름 (Badger), 포유류의 특성벡터 (3, 1, 3, 1, 3, 1, 3, 2, 10), 라벨 (1) 데이터가 각 줄에 작성되어 있다. 각 특성 벡터는 9개 속성으로 구성되어 있고, 라벨로 초식, 육식, 잡식 동물을 구분한다. 이 라벨은 나중에 클러스터링이 잘 되었는지 살필때 사용하고, 클러스터링 과정 자체에서는 사용하지 않는다.

아래의 파이썬 프로그램은 readMammalData() 함수를 정의한다. 이 함수는 지정한 파일에서 앞서 설명한 형식의 데이터를 읽어 각 표본의 특성의 수를 세기 위해 헤더를 처리하고, speciesNames, labelList, featureVals 리스트들을 만든다. 이 리스트들은 각각 표본 이름들, 라벨들, 특성벡터들을 포함한다. 특히 featureVals[i][j]는 j번째 표본의 i번째 특성을 가리킨다.

그리고, 특성 벡터의 각 특성의 크기를 조정하는 scaleFeatures() 함수를 포함한다. 각 특성의 평균이 0, 표준편차가 1이 되도록 크기 비율을 변환한다. 이 크기 조정으로 어느 특성의 절대값이 커서 클러스터링에 영향을 더 미치는 것을 방지할 수 있다.

def scaleFeatures(vals):
    """Assumes vals is a sequence of numbers"""
    result = pylab.array(vals)  # 리스트 vals를 pylab.array (벡터 형식)의 result로 변환
    mean = sum(result)/float(len(result))  
    result = result - mean  # vector 뺄셈 연산
    sd = stdDev(result)
    result = result/sd  # vector 나눗셈 연산
    return result

def readMammalData(fName, scale):
   
    dataFile = open(fName, 'r')
    numFeatures = 0
    #Process lines at top of file
    for line in dataFile: #Find number of features
        if line[0:6] == '#Label': #indicates end of features
            break
        if line[0:5] != '#Name':
            numFeatures += 1
    featureVals = []
    
    #Produce featureVals, speciesNames, and labelList
    featureVals, speciesNames, labelList = [], [], []
    for i in range(numFeatures):
        featureVals.append([])
        
    #Continue processing lines in file, starting after comments
    for line in dataFile:
        dataLine = str.split(line[:-1], ',') #remove newline; then split
        speciesNames.append(dataLine[0])
        classLabel = float(dataLine[-1])
        labelList.append(classLabel)
        for i in range(numFeatures):
            featureVals[i].append(float(dataLine[i+1]))
            
    #Use featureVals to build list containing the feature vectors
    #for each mammal scale features, if needed
    if scale:
        for i in range(numFeatures):
            featureVals[i] = scaleFeatures(featureVals[i])
    featureVectorList = []
    for mammal in range(len(speciesNames)):
        featureVector = []
        for feature in range(numFeatures):
            featureVector.append(featureVals[feature][mammal])
        featureVectorList.append(featureVector)
    return featureVectorList, labelList, speciesNames

testTeeth()함수는 클러스터 수(k), 시도 횟수, 특성의 크기 조정 여부를 지정하면 k-means 알고리즘을 적용하여 k 군집을 구한다. readMammalData()함수로 데이터 파일을 읽고, buildMammalExamples() 함수로 Example 클래스 객체들 리스트를 만든 다음, trykmeans() 함수로 k-means 클러스터링을 구한다. trykemans()함수는 kmeans()함수를 반복해서 실행하여 가장 부동성이 낮은 결과를 선택하는 부가 함수이다.

def buildMammalExamples(featureList, labelList, speciesNames):
    examples = []
    for i in range(len(speciesNames)):
        features = pylab.array(featureList[i])
        example = Example(speciesNames[i], features, labelList[i])
        examples.append(example)
    return examples

def testTeeth(numClusters, numTrials, scale):
    features, classes, species = readMammalData('dentalFormulas.txt', scale)
    examples = buildMammalExamples(features, classes, species)
    bestClustering = trykmeans(examples, Example, numClusters, numTrials)
    for c in bestClustering:
        names = ''
        for p in c.members():
            names += p.getName() + ', '
        print ('\n', names[:-2])
        herbivores, carnivores, omnivores = 0, 0, 0
        for p in c.members():
            if p.getLabel() == 0:
                herbivores += 1
            elif p.getLabel() == 1:
                carnivores += 1
            else:
                omnivores += 1
        print (herbivores, 'herbivores,', carnivores, 'carnivores,',\
              omnivores, 'omnivores')


''' trykmeans 함수

def dissimilarity(clusters):
    totDist = 0.0
    for c in clusters:
        totDist += c.variance()
    return totDist

def trykmeans(examples, exampleType, numClusters, numTrials,
              verbose = False):
    """Calls kmeans numTrials times and returns the result with the
          lowest dissimilarity"""
    best = kmeans(examples, exampleType, numClusters, verbose)
    minDissimilarity = dissimilarity(best)
    for trial in range(1, numTrials):
        clusters = kmeans(examples, exampleType, numClusters, verbose)
        currDissimilarity = dissimilarity(clusters)
        if currDissimilarity < minDissimilarity:
            best = clusters
            minDissimilarity = currDissimilarity
    return best

teethTest(3, 20, True)를 실행하여 포유류 치아 특성벡터 데이터에 적용해본다. 아래와 같이 초식, 육식, 잡식 동물 분류 결과를 얻을 수 있었다. 클러스터링 결과를 특성벡터의 라벨과 비교해서 살펴보면 초식 동물과 육식 동물의 군집을 치아 특성벡터로 잘 구분할 수 있었지만, 잡식 동물의 군집은 정확하게 구분한 것은 아니라는 점을 알 수 있다.

>>> teethTest(3, 20, True)
 Badger, Beaver, Brown bat, Cat, Cougar, Dog, Fox, Guinea pig, Human, Jaguar, Kangaroo, Mink, Mole, Mouse, Porcupine, Pig, Rabbit, Raccoon, Rat, Red bat, Skunk, Squirrel, Woodchuck, Wolf
5 herbivores, 11 carnivores, 8 omnivores

 Bear, Cow, Deer, Fur seal, Grey seal, Elk, Lion, Sea lion
3 herbivores, 4 carnivores, 1 omnivores

 Moose
1 herbivores, 0 carnivores, 0 omnivores
>>>