The following Python script can be used to create clusters. The input is the trading date, close price, and volume, read from a comma-separated file. The number of clusters is passed as an argument when the script is run; in this example, we cluster the data into 2, 3, 4, and 5 clusters. Also note that if fewer than a specified percentage of the points (set to 5% in the script) fall within a cluster, we believe these points may be the result of some extraordinary event related to that particular stock (outliers), and the script labels them with their trading dates. The right percentage largely depends on the number of data points and the number of clusters.
```python
import sys

import numpy as np
import Pycluster
import matplotlib.pyplot as plt
from scipy import stats

# Arguments: path to the CSV file, then the number of clusters.
with open(sys.argv[1], 'r') as fh:
    lines = fh.readlines()

clusters = int(sys.argv[2])
# Clusters holding at most this percentage of all points are treated as outliers.
max_points_pc = 5

points_r = []  # raw (close price, volume) pairs, used for plotting
dates = []
volumes = []
close_prices = []

# Skip the header row; column 0 is the date, 4 the close price, 5 the volume.
for line in lines[1:]:
    if not line.strip():
        continue
    line_c = line.strip().split(',')
    close_price = float(line_c[4])
    volume = float(line_c[5])
    points_r.append((close_price, volume))
    volumes.append(volume)
    close_prices.append(close_price)
    dates.append(line_c[0])

# Standardize both features so that neither dominates the distance measure.
volume_z = stats.zscore(np.array(volumes))
close_price_z = stats.zscore(np.array(close_prices))
points = np.column_stack((close_price_z, volume_z))

# labels[i] is the index of the cluster assigned to point i.
labels, error, nfound = Pycluster.kcluster(points, clusters)

# Group the raw values and dates by cluster.
x = [[] for _ in range(clusters)]
y = [[] for _ in range(clusters)]
d = [[] for _ in range(clusters)]
for i in range(len(points_r)):
    index = labels[i]
    x[index].append(points_r[i][0])
    y[index].append(points_r[i][1])
    d[index].append(dates[i])

for i in range(clusters):
    plt.plot(x[i], y[i], 'o')
    # Label the points of small clusters with their trading dates.
    if len(x[i]) <= max_points_pc * len(points) / 100:
        for j in range(len(x[i])):
            plt.annotate(d[i][j], (x[i][j], y[i][j]))

plt.xlabel('Close Price')
plt.ylabel('Volume')
plt.grid()
plt.show()
```
The following figure shows the output of the above Python script with 2 clusters:
The following figure shows the output of the above Python script with 3 clusters:
The following figure shows the output of the above Python script with 4 clusters:
The following figure shows the output of the above Python script with 5 clusters:
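To see how the outlier percentage interacts with the number of points and clusters, the small-cluster rule from the script can be looked at in isolation. The sketch below is only an illustration: the helper name `small_clusters` and the label list are made up for this example and are not taken from the script or from real stock data. It applies the same test the script uses (cluster size at most `max_points_pc` percent of all points).

```python
from collections import Counter


def small_clusters(labels, max_points_pc=5):
    """Return the sorted ids of clusters that hold at most
    max_points_pc percent of all points (hypothetical helper)."""
    threshold = max_points_pc * len(labels) / 100
    counts = Counter(labels)
    return sorted(cid for cid, size in counts.items() if size <= threshold)


# Made-up label assignment: 40 points in 3 clusters of sizes 25, 13 and 2.
labels = [0] * 25 + [1] * 13 + [2] * 2

# With 40 points and a 5% limit the threshold is 2 points,
# so only the cluster with 2 members is flagged.
flagged = small_clusters(labels, max_points_pc=5)
print(flagged)
```

With the same 5% limit, a larger data set raises the threshold in absolute terms, while a larger number of clusters shrinks the typical cluster size and pushes more clusters toward it, which is why the percentage usually needs adjusting per data set.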