K-Means Clustering algorithm implementation in Python

This project is a demonstration of the K-means algorithm, one of the simplest unsupervised clustering methods. After creating a generic model, I will test it on a dataset of US Senator voting records to segment Senators by political party affiliation.

First import necessary libraries

In [32]:
import math, random, time, plotly
import pandas as pd
import numpy as np
from numpy import linalg as LA
import plotly.graph_objects as go
import matplotlib.pyplot as plt

The implementation function is as follows:

In [10]:
def k_means(M,k,tol = .01, max_iters = 100):
    
    #------Initializations--------------------------
    steps = 0
    prev = 10000
    current = 1
    m = np.size(M,0) 
    n = np.size(M,1)
    initialize = random.sample(range(n),k) 
    centroids = M[:,initialize]
    partition = [random.randint(0,10) for i in range(n)]
    # note: partition is centroid cluster, k of n datapoints
    
    #-----------------------------------------------
    
    while (abs(prev-current) > tol) & (steps < max_iters):
        steps = steps + 1
        
        #------PARTITION data points (assign to centroids)---
        for i in range(n):
            partition[i] = np.argmin([LA.norm(centroids[:,h]-M[:,i]) for h in range(k)])
            
        #------calculate CENTROIDS-----------------------
        for j in range(k):
            matches = []
            for s in range(n): 
                if partition[s]==j: #need indexes s where partition[s]==j 
                    matches.append(M[:,s])
            if len(matches) != 0:
                centroids[:,j] = (1/(len(matches)))*sum(matches)
            else:
                centroids[:,j] = np.zeros(m)
                
        #----COHERENCE-----------------------------------  
        prev = current
        current = sum([LA.norm(M[:,l]-centroids[:,partition[l]]) for l in range(n)])
        print(f'after {steps} steps the residue is {current}')
    return [centroids, partition]

Next, I shall test the function on a pseudo dataset of only two features to visualize its performance.

2 dimensional data points are created in a numpy array.

In [45]:
data = np.array([
    [1,1],[2,1],[3,2],[1,2],[2,2],[3,1],
    [-1,-2],[-3,-3],[-2,-2],[-3,-1],[-1,-3],[-4,-4], 
    [4,-2],[5,-3],[4,-3],[6,-3],[3,-3],
    [-6,4],[-7,5],[-5,6],[-6,6],[-6,5]         
             ]).T
plt.scatter(cluster[0,:],cluster[1,:])
plt.show()

Since we can visually see there are 4 clusters in the data set above, let k = 4.

In [47]:
k = 4
centroids,partition = k_means(data,k,.0001)
plt.scatter(cluster[0,:],cluster[1,:])
circle_size = 12000
plt.scatter(centroids[0,:],centroids[1,:], circle_size, alpha=0.2)
plt.show()
after 1 steps the residue is 24.378203601230744
after 2 steps the residue is 24.378203601230744

Now that we have a working model, we may test it on a real dataset. I will load a csv file of Senators' voting records for 15 different bills in the 114th United States Congress.

A "1.0" represents a YES vote, and a "0.0" represents NO, while "0.5" means ABSTAINED.

The goal is to predict which party each Senator belongs to by using K-means to segment them into three groups representing Democrats, Republicans, and Independents.

In [100]:
# load data as dataframe
df = pd.read_csv(r'C:\Users\Luna\Documents\SCHOOL\PROJECTS\K_means\114_USA_congress.csv')
print(df.head())
        name party state  00001  00004  00005  00006  00007  00008  00009  \
0  Alexander     R    TN    0.0    1.0    1.0    1.0    1.0    0.0    0.0   
1     Ayotte     R    NH    0.0    1.0    1.0    1.0    1.0    0.0    0.0   
2    Baldwin     D    WI    1.0    0.0    0.0    1.0    0.0    1.0    0.0   
3   Barrasso     R    WY    0.0    1.0    1.0    1.0    1.0    0.0    1.0   
4     Bennet     D    CO    0.0    0.0    0.0    1.0    0.0    1.0    0.0   

   00010  00020  00026  00032  00038  00039  00044  00047  
0    1.0    1.0    1.0    0.0    0.0    0.0    0.0    0.0  
1    1.0    0.0    1.0    0.0    1.0    0.0    1.0    0.0  
2    1.0    0.0    0.0    1.0    1.0    0.0    1.0    1.0  
3    1.0    1.0    1.0    0.0    0.0    1.0    0.0    0.0  
4    1.0    0.0    0.0    0.0    1.0    0.0    1.0    0.0  

Create a matrix with only voting data to feed into the K-means function.

In [101]:
new_df = df.drop(['name','party','state'],axis=1)
matrix = new_df.to_numpy()
In [106]:
k = 3
centroids,partition = k_means(matrix.T,k,tol = .0001, max_iters = 100)
after 1 steps the residue is 59.7574197187986
after 2 steps the residue is 56.05854294950117
after 3 steps the residue is 55.240840776804305
after 4 steps the residue is 55.240840776804305

Result for 100 Senators:

In [107]:
print(partition)
[0, 0, 2, 0, 2, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 0, 0, 2, 0, 2, 0, 0, 0, 2, 1, 0, 2, 0, 0, 0, 0, 2, 2, 0, 2, 0, 2, 0, 1, 2, 0, 2, 0, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 2, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 2, 0, 0, 2, 0, 2, 0, 2, 0, 0, 0, 2, 0, 2, 2, 2, 0, 2]
In [115]:
print(f"Number of elements labeled 0: {partition.count(0)}")
print(f"Number of elements labeled 1: {partition.count(1)}")
print(f"Number of elements labeled 2: {partition.count(2)}")
Number of elements labeled 0: 54
Number of elements labeled 1: 2
Number of elements labeled 2: 44

Before even comparing labeled data set with the partition resulting from K-means clustering for a thorough analysis, we can see that we've accurately segmented the Senators into three groups. In the 114th US Congress there were 54 Republicans, 44 Democrats and 2 Independent.