# XNode2Vec - An Alternative Data Clustering Procedure
Description
-----------
This repository proposes an alternative method for data classification and clustering, based on the Node2Vec algorithm applied to a suitably transformed N-dimensional dataset.
The original [Node2Vec](https://github.com/aditya-grover/node2vec) algorithm was replaced with a much faster implementation, [FastNode2Vec](https://github.com/louisabraham/fastnode2vec). The algorithm is applied through a function that works with user-friendly **networkx** objects. At the moment only a few simple data transformations are available; more complex and effective ones will be added.
Installation
------------
In order to install the Xnode2vec package simply use pip:
- ``` pip install Xnode2vec ```
*If there are some problems with the installation, please read the "Note" below.*
How to Use
----------
The idea behind the method is straightforward:
1. Take a dataset, or generate one.
2. Apply the proper transformation to the dataset.
3. Build a **networkx** object that embeds the dataset with its crucial properties.
4. Perform a node classification analysis with the Node2Vec algorithm.
```python
import numpy as np
import networkx as nx  # needed for nx.Graph() below
import pandas as pd
import Xnode2vec as xn2v
x1 = np.random.normal(4, 1, 20)
y1 = np.random.normal(5, 1, 20)
x2 = np.random.normal(17, 2, 20)
y2 = np.random.normal(13, 1, 20)
family1 = np.column_stack((x1, y1)) # REQUIRED ARRAY FORMAT
family2 = np.column_stack((x2, y2)) # REQUIRED ARRAY FORMAT
dataset = np.concatenate((family1,family2),axis=0) # Generic dataset
transf_dataset = xn2v.best_line_projection(dataset) # Points transformation
df = xn2v.complete_edgelist(transf_dataset) # Pandas edge list generation
edgelist = xn2v.generate_edgelist(df)
G = nx.Graph()
G.add_weighted_edges_from(edgelist) # Feed the graph with the edge list
nodes, similarity = xn2v.similar_nodes(G, dim=128, walk_length=20, context=5, picked=10, p=0.1, q=0.9, workers=4)
similar_points = xn2v.recover_points(dataset,G,nodes) # Final cluster
```
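For intuition about the projection step used above, here is a minimal NumPy sketch of fitting a least-squares line to 2D points and projecting them onto it. It is based only on the description of `best_line_projection()` in this README, not on the package's code; the name `project_on_best_line` is hypothetical, and the real function may return a different format.

```python
import numpy as np

def project_on_best_line(points):
    """Project 2D points onto their least-squares line y = slope * x + intercept.

    Hypothetical helper for illustration; not part of Xnode2vec.
    """
    x, y = points[:, 0], points[:, 1]
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
    # Unit vector along the fitted line and one point lying on it.
    direction = np.array([1.0, slope]) / np.hypot(1.0, slope)
    origin = np.array([0.0, intercept])
    # Signed position of each point along the line, then back to 2D coordinates.
    t = (points - origin) @ direction
    return origin + np.outer(t, direction)

rng = np.random.default_rng(0)
points = np.column_stack((rng.normal(4, 1, 20), rng.normal(5, 1, 20)))
projected = project_on_best_line(points)
print(projected.shape)  # (20, 2)
```

Every projected point lies on the fitted line, so the dataset is effectively reduced to positions along one direction before the network is built.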
With the same imports as in the first example, we can run a more complex analysis on four clusters:
```python
import matplotlib.pyplot as plt  # needed for the scatter plot at the end
x1 = np.random.normal(16, 2, 100)
y1 = np.random.normal(9, 2, 100)
x2 = np.random.normal(25, 2, 100)
y2 = np.random.normal(25, 2, 100)
x3 = np.random.normal(2, 2, 100)
y3 = np.random.normal(1, 2, 100)
x4 = np.random.normal(30, 2, 100)
y4 = np.random.normal(70, 2, 100)
family1 = np.column_stack((x1, y1)) # REQUIRED ARRAY FORMAT
family2 = np.column_stack((x2, y2)) # REQUIRED ARRAY FORMAT
family3 = np.column_stack((x3, y3)) # REQUIRED ARRAY FORMAT
family4 = np.column_stack((x4, y4)) # REQUIRED ARRAY FORMAT
dataset = np.concatenate((family1,family2,family3,family4),axis=0) # Generic dataset
df = xn2v.complete_edgelist(dataset) # Pandas edge list generation
df = xn2v.generate_edgelist(df) # Networkx edgelist format
G = nx.Graph()
G.add_weighted_edges_from(df)
graph = xn2v.nx_to_Graph(G) # Load the Graph object to avoid multiple network readings
nodes_families, unlabeled_nodes = xn2v.clusters_detection(G, graph=graph, cluster_rigidity=0.85,
                                                          spacing=15, dim_fraction=0.8,
                                                          picked=100, dim=100, context=5,
                                                          Weight=True, walk_length=20)
# Map each family of nodes back to the original points
points_families = []
for family in nodes_families:
    points_families.append(xn2v.recover_points(dataset, G, family))
# Points that were not assigned to any cluster
points_unlabeled = xn2v.recover_points(dataset, G, unlabeled_nodes)
plt.scatter(dataset[:,0], dataset[:,1])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Generic Dataset', fontweight='bold')
plt.show()
```
Now the list ```points_families``` contains the detected clusters, which should match the four generated families up to statistical fluctuations. In many situations the results are surprisingly good.
Results
-------
The analysis automatically prints the following to the terminal:
- Number of clusters found.
- Number of nodes analyzed.
- Number of *clustered* nodes.
- Number of *non-clustered* nodes.
- Number of nodes in each cluster.
The output looks like this:
```properties
--------- Clusters Information ---------
- Number of Clusters: 5
- Total nodes: 400
- Clustered nodes: 251
- Number of unlabeled nodes: 149
- Nodes in cluster 1: 16
- Nodes in cluster 2: 52
- Nodes in cluster 3: 83
- Nodes in cluster 4: 64
- Nodes in cluster 5: 36
```
The clustered objects are stored in a list of numpy arrays returned by the function *clusters_detection()*. It is important to become familiar with the *parameter selection*, which determines the criteria by which the nodes are labeled.
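To inspect the returned clusters beyond the printed summary, something like the following sketch may help. It assumes `points_families` is a list of `(n_i, d)` numpy arrays and `points_unlabeled` a single array, as produced by `recover_points()` in the example above; the helper `summarize_clusters` and the placeholder data are hypothetical and not part of the package.

```python
import numpy as np

def summarize_clusters(points_families, points_unlabeled):
    """Collect sizes and centroids of each cluster (hypothetical helper)."""
    sizes = [len(family) for family in points_families]
    centroids = [np.mean(family, axis=0) for family in points_families]
    clustered = sum(sizes)
    return {
        "n_clusters": len(points_families),
        "total": clustered + len(points_unlabeled),
        "clustered": clustered,
        "unlabeled": len(points_unlabeled),
        "sizes": sizes,
        "centroids": centroids,
    }

# Placeholder arrays standing in for the output of recover_points().
rng = np.random.default_rng(0)
points_families = [
    rng.normal((16, 9), 2, size=(30, 2)),
    rng.normal((25, 25), 2, size=(45, 2)),
]
points_unlabeled = rng.normal((2, 1), 2, size=(25, 2))

summary = summarize_clusters(points_families, points_unlabeled)
for i, (size, centroid) in enumerate(zip(summary["sizes"], summary["centroids"]), start=1):
    print(f"cluster {i}: {size} points, centroid {np.round(centroid, 2)}")
```

Comparing the centroids and sizes across runs with different `cluster_rigidity` or `spacing` values is one simple way to see how the parameters affect the labeling.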
Objects Syntax
--------------
Here we report the list of structures required to use the Xnode2vec package:
- Dataset: ``` dataset = np.array([[1,2,3,..], ..., [1,2,3,..]])```; each row corresponds to a point and each column to a coordinate.
- Edge List: ``` edgelist = [(node_a,node_b,weight), ... , (node_c,node_d,weight)] ```; this is a list of tuples, each structured as (starting_node, arriving_node, weight).
- DataFrame: ``` pandas.DataFrame(np.array([[1, 2, 3.7], ..., [2, 7, 12]]), columns=['node1', 'node2', 'weight']) ```
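The three structures above can be built by hand with numpy and pandas, as in the sketch below. It does not call Xnode2vec: the weight `exp(-distance)` is an illustrative assumption rather than the package's metric, and the conversion with `itertuples` only mimics what `generate_edgelist()` is described as doing.

```python
import numpy as np
import pandas as pd

# Dataset: one row per point, one column per coordinate.
dataset = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Complete edge list: one row per unordered pair of points.
# The weight exp(-distance) is an assumed, illustrative choice.
rows = []
for a in range(len(dataset)):
    for b in range(a + 1, len(dataset)):
        distance = np.linalg.norm(dataset[a] - dataset[b])
        rows.append((a, b, np.exp(-distance)))

# DataFrame with the required header (node1, node2, weight).
df = pd.DataFrame(rows, columns=["node1", "node2", "weight"])

# Edge list: plain (starting_node, arriving_node, weight) tuples.
edgelist = list(df.itertuples(index=False, name=None))
print(len(edgelist))  # 3
```

A list in this form can be passed directly to `G.add_weighted_edges_from(edgelist)` on a **networkx** graph, as in the examples above.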
Functions Description
---------------------
- ```nx_to_Graph()``` : Converts a **networkx** graph to the **fastnode2vec** graph format, which is necessary to work with fastnode2vec objects.
- ```labels_modifier()```: Changes the labels of the created networkx graph. It can be useful if we want to select rows from a dataframe that we can't recover only with their positions in the vector.
- ```generate_edgelist()```: Reads a pandas DataFrame and generates an edge list to eventually build a networkx graph. The header format is fixed and can't be changed; it must be: (node1, node2, weight).
- ```edgelist_from_csv()```: Reads a .csv file into a pandas DataFrame and generates an edge list to eventually build a networkx graph. The header format of the file is fixed and can't be changed.
- ```complete_edgelist()```: This function performs a **data transformation** from the space points to a network. It generates links between specific points and gives them weights according to the specified metric.
- ```stellar_edgelist()```: This function performs a **data transformation** from the space points to a network. It generates links between specific points and gives them weights according to specific conditions.
- ```low_limit_network()```: This function performs a **network transformation**. It sets a link weight to 0 if its initial value is below a given threshold. The threshold is a constant times the average link weight.
- ```best_line_projection()```: Performs a linear best fit of the dataset points and projects them on the line itself.
- ```cluster_generation()```: This function selects the nodes whose similarity is higher than the value set by *cluster_rigidity*.
- ```clusters_detection()```: This function detects the **clusters** that compose a generic dataset. The dataset must be given as a **networkx** graph, using the proper data transformation. The clustering procedure uses the Node2Vec algorithm to find the most similar nodes in the network.
- ```recover_points()```: Recovers the spatial points from the analyzed network. It relies on the fact that the nodes of the network are in the same order as the dataset rows, so there is a one-to-one correspondence between nodes and points.
- ```similar_nodes()```: Runs the FastNode2Vec algorithm with full control over the crucial parameters. In particular, this function allows the user to keep working with networkx objects, which are generally quite user-friendly, instead of the ones required by the fastnode2vec algorithm.
- ```load_model()```: Loads the saved Gensim.Word2Vec model.
- ```draw_community()```: Draws a networkx p