This project demonstrates the K-means algorithm, one of the simplest unsupervised clustering methods. The algorithm alternates between assigning each data point to its nearest centroid and recomputing each centroid as the mean of its assigned points, repeating until the total distance stops improving. After building a generic implementation, I will test it on a dataset of US Senators' voting records to segment the Senators by political party affiliation.
First, import the necessary libraries:
import random
import pandas as pd
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
The implementation function is as follows:
def k_means(M, k, tol=.01, max_iters=100):
    #------Initializations--------------------------
    steps = 0
    prev = np.inf
    current = 0
    m = np.size(M, 0)  # number of features (rows)
    n = np.size(M, 1)  # number of data points (columns)
    # choose k distinct data points as the starting centroids; cast to float
    # so that the mean updates are not truncated when M has an integer dtype
    initialize = random.sample(range(n), k)
    centroids = M[:, initialize].astype(float)
    # partition[i] is the cluster index (0..k-1) assigned to data point i
    partition = [0 for i in range(n)]
    #-----------------------------------------------
    while (abs(prev - current) > tol) and (steps < max_iters):
        steps = steps + 1
        #------PARTITION: assign each data point to its nearest centroid---
        for i in range(n):
            partition[i] = np.argmin([LA.norm(centroids[:, h] - M[:, i]) for h in range(k)])
        #------CENTROIDS: recompute each centroid as the mean of its cluster---
        for j in range(k):
            matches = [M[:, s] for s in range(n) if partition[s] == j]
            if len(matches) != 0:
                centroids[:, j] = sum(matches) / len(matches)
            else:
                centroids[:, j] = np.zeros(m)  # empty cluster: fall back to the origin
        #----COHERENCE: total distance from each point to its centroid-----
        prev = current
        current = sum([LA.norm(M[:, l] - centroids[:, partition[l]]) for l in range(n)])
        print(f'after {steps} steps the residue is {current}')
    return [centroids, partition]
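As an optional sanity check (a sketch, assuming scikit-learn is installed; it is not part of the original demo), the function can be compared against sklearn.cluster.KMeans on two well-separated blobs. The two label lists should agree up to a permutation of cluster numbers:
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(20, 2) + 4, np.random.randn(20, 2) - 4])  # two separated blobs, samples as rows
print(k_means(X.T, 2)[1])                                    # k_means expects features as rows
print(list(KMeans(n_clusters=2, n_init=10).fit_predict(X)))  # sklearn expects samples as rows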
Next, I will test the function on a synthetic dataset with only two features, so its performance can be visualized.
The two-dimensional data points are created in a NumPy array, transposed so that each column is one point.
data = np.array([
[1,1],[2,1],[3,2],[1,2],[2,2],[3,1],
[-1,-2],[-3,-3],[-2,-2],[-3,-1],[-1,-3],[-4,-4],
[4,-2],[5,-3],[4,-3],[6,-3],[3,-3],
[-6,4],[-7,5],[-5,6],[-6,6],[-6,5]
]).T
plt.scatter(data[0,:], data[1,:])
plt.show()
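The four groups are plain to see here. For data that cannot be plotted, a common way to choose k is the elbow heuristic: run the clustering for several values of k and look for the point where the residue stops dropping sharply. A minimal sketch using the function above (k_means does not return the residue, so it is recomputed from the returned centroids and partition):
residues = []
for trial_k in range(1, 8):
    c, p = k_means(data, trial_k, tol=.0001)
    residues.append(sum(LA.norm(data[:,i] - c[:,p[i]]) for i in range(np.size(data, 1))))
plt.plot(range(1, 8), residues, marker='o')
plt.xlabel('k')
plt.ylabel('residue')
plt.show()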
Since we can see four distinct clusters in the plot above, let k = 4.
k = 4
centroids,partition = k_means(data,k,.0001)
plt.scatter(data[0,:], data[1,:])
circle_size = 12000  # marker area for the translucent circles drawn around the centroids
plt.scatter(centroids[0,:], centroids[1,:], s=circle_size, alpha=0.2)
plt.show()
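To inspect the assignment itself, each point can be colored by its cluster label (a small sketch using matplotlib's c= argument; the red crosses marking the centroids are an illustrative addition):
plt.scatter(data[0,:], data[1,:], c=partition, cmap='viridis')
plt.scatter(centroids[0,:], centroids[1,:], marker='x', s=200, c='red')
plt.show()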
Now that we have a working model, we can test it on a real dataset. I will load a csv file of Senators' voting records on 15 different bills in the 114th United States Congress.
A "1.0" represents a YES vote, a "0.0" represents a NO vote, and a "0.5" means the Senator ABSTAINED.
The goal is to recover each Senator's party affiliation by using K-means to segment them into three groups, representing Democrats, Republicans, and Independents.
# load data as dataframe
df = pd.read_csv(r'C:\Users\Luna\Documents\SCHOOL\PROJECTS\K_means\114_USA_congress.csv')
print(df.head())
Create a matrix with only voting data to feed into the K-means function.
new_df = df.drop(['name','party','state'],axis=1)
matrix = new_df.to_numpy()
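Two quick sanity checks before clustering (illustrative; if the csv is as described, the shape should be 100 rows by 15 columns, and the function expects features as rows, hence the transpose in the call below):
print(matrix.shape)                   # (100, 15): one row per Senator, one column per bill
print(new_df.stack().value_counts())  # should list only the values 1.0, 0.5, and 0.0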
k = 3
centroids,partition = k_means(matrix.T,k,tol = .0001, max_iters = 100)
The resulting partition for the 100 Senators:
print(partition)
print(f"Number of elements labeled 0: {partition.count(0)}")
print(f"Number of elements labeled 1: {partition.count(1)}")
print(f"Number of elements labeled 2: {partition.count(2)}")
Even before comparing the K-means partition against the labeled dataset for a thorough analysis, the group sizes suggest the segmentation is accurate: in the 114th US Congress there were 54 Republicans, 44 Democrats, and 2 Independents.
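A quick way to make that comparison explicit is to cross-tabulate the cluster labels against the 'party' column that was dropped from the matrix (a short sketch):
df['cluster'] = partition
print(pd.crosstab(df['party'], df['cluster']))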