This project demonstrates the K-means algorithm, one of the simplest unsupervised clustering methods. The algorithm alternates between assigning each data point to its nearest centroid and recomputing each centroid as the mean of its assigned points, repeating until the total distance stops improving. After building a generic implementation, I will test it on a dataset of US Senators' voting records to segment the Senators by political party affiliation.
First, import the necessary libraries:
import random
import pandas as pd
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
The implementation function is as follows:
def k_means(M, k, tol=.01, max_iters=100):
    #------Initializations--------------------------
    steps = 0
    prev = np.inf
    current = 0
    m = np.size(M, 0)  # number of features (rows)
    n = np.size(M, 1)  # number of data points (columns)
    # choose k distinct data points as the starting centroids; cast to float
    # so that the mean updates are not truncated when M has an integer dtype
    initialize = random.sample(range(n), k)
    centroids = M[:, initialize].astype(float)
    # partition[i] is the cluster index (0..k-1) assigned to data point i
    partition = [0 for i in range(n)]
    #-----------------------------------------------
    while (abs(prev - current) > tol) and (steps < max_iters):
        steps = steps + 1
        #------PARTITION: assign each data point to its nearest centroid---
        for i in range(n):
            partition[i] = np.argmin([LA.norm(centroids[:, h] - M[:, i]) for h in range(k)])
        #------CENTROIDS: recompute each centroid as the mean of its cluster---
        for j in range(k):
            matches = [M[:, s] for s in range(n) if partition[s] == j]
            if len(matches) != 0:
                centroids[:, j] = sum(matches) / len(matches)
            else:
                centroids[:, j] = np.zeros(m)  # empty cluster: fall back to the origin
        #----COHERENCE: total distance from each point to its centroid-----
        prev = current
        current = sum([LA.norm(M[:, l] - centroids[:, partition[l]]) for l in range(n)])
        print(f'after {steps} steps the residue is {current}')
    return [centroids, partition]
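As an optional sanity check (a sketch, assuming scikit-learn is installed; it is not part of the original demo), the function can be compared against sklearn.cluster.KMeans on two well-separated blobs. The two label lists should agree up to a permutation of cluster numbers:
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(20, 2) + 4, np.random.randn(20, 2) - 4])  # two separated blobs, samples as rows
print(k_means(X.T, 2)[1])                                    # k_means expects features as rows
print(list(KMeans(n_clusters=2, n_init=10).fit_predict(X)))  # sklearn expects samples as rows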
Next, I will test the function on a synthetic dataset with only two features, so its performance can be visualized.
The two-dimensional data points are created in a NumPy array, transposed so that each column is one point.
data = np.array([
[1,1],[2,1],[3,2],[1,2],[2,2],[3,1],
[-1,-2],[-3,-3],[-2,-2],[-3,-1],[-1,-3],[-4,-4],
[4,-2],[5,-3],[4,-3],[6,-3],[3,-3],
[-6,4],[-7,5],[-5,6],[-6,6],[-6,5]
]).T
plt.scatter(data[0,:], data[1,:])
plt.show()
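The four groups are plain to see here. For data that cannot be plotted, a common way to choose k is the elbow heuristic: run the clustering for several values of k and look for the point where the residue stops dropping sharply. A minimal sketch using the function above (k_means does not return the residue, so it is recomputed from the returned centroids and partition):
residues = []
for trial_k in range(1, 8):
    c, p = k_means(data, trial_k, tol=.0001)
    residues.append(sum(LA.norm(data[:,i] - c[:,p[i]]) for i in range(np.size(data, 1))))
plt.plot(range(1, 8), residues, marker='o')
plt.xlabel('k')
plt.ylabel('residue')
plt.show()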
Since we can see four distinct clusters in the plot above, let k = 4.
k = 4
centroids,partition = k_means(data,k,.0001)
plt.scatter(data[0,:], data[1,:])
circle_size = 12000  # marker area for the translucent circles drawn around the centroids
plt.scatter(centroids[0,:], centroids[1,:], s=circle_size, alpha=0.2)
plt.show()
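To inspect the assignment itself, each point can be colored by its cluster label (a small sketch using matplotlib's c= argument; the red crosses marking the centroids are an illustrative addition):
plt.scatter(data[0,:], data[1,:], c=partition, cmap='viridis')
plt.scatter(centroids[0,:], centroids[1,:], marker='x', s=200, c='red')
plt.show()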
Now that we have a working model, we can test it on a real dataset. I will load a csv file of Senators' voting records on 15 different bills in the 114th United States Congress.
A "1.0" represents a YES vote, a "0.0" represents a NO vote, and a "0.5" means the Senator ABSTAINED.
The goal is to recover each Senator's party affiliation by using K-means to segment them into three groups, representing Democrats, Republicans, and Independents.
# load data as dataframe
df = pd.read_csv(r'C:\Users\Luna\Documents\SCHOOL\PROJECTS\K_means\114_USA_congress.csv')
print(df.head())
Create a matrix with only voting data to feed into the K-means function.
new_df = df.drop(['name','party','state'],axis=1)
matrix = new_df.to_numpy()
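Two quick sanity checks before clustering (illustrative; if the csv is as described, the shape should be 100 rows by 15 columns, and the function expects features as rows, hence the transpose in the call below):
print(matrix.shape)                   # (100, 15): one row per Senator, one column per bill
print(new_df.stack().value_counts())  # should list only the values 1.0, 0.5, and 0.0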
k = 3
centroids,partition = k_means(matrix.T,k,tol = .0001, max_iters = 100)
The resulting partition for the 100 Senators:
print(partition)
print(f"Number of elements labeled 0: {partition.count(0)}")
print(f"Number of elements labeled 1: {partition.count(1)}")
print(f"Number of elements labeled 2: {partition.count(2)}")
Even before comparing the K-means partition against the labeled dataset for a thorough analysis, the group sizes suggest the segmentation is accurate: in the 114th US Congress there were 54 Republicans, 44 Democrats, and 2 Independents.
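A quick way to make that comparison explicit is to cross-tabulate the cluster labels against the 'party' column that was dropped from the matrix (a short sketch):
df['cluster'] = partition
print(pd.crosstab(df['party'], df['cluster']))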