# README
## _**Please read this document on GitBook:**_ [_**https://wangbch.gitbook.io/soapml-document/**_](https://wangbch.gitbook.io/soapml-document/)
## **soapml**
_**soapml is a chemist-friendly tool: a few lines of simple Python are enough to test whether it is useful for your work, and no extra machine learning knowledge is needed.**_
soapml is based on SOAPLite: [https://github.com/SINGROUP/SOAPLite](https://github.com/SINGROUP/SOAPLite)
soapml is a machine learning tool for regression on SOAP \(Smooth Overlap of Atomic Positions\) encodings of molecules, surfaces, and similar structures. It helps find relationships between atomic positions and energy, activity, or other physicochemical properties.
### How to install?
1. The test environment is Python 3. Anaconda3 is highly recommended.
2. Install SOAPLite: [https://github.com/SINGROUP/SOAPLite](https://github.com/SINGROUP/SOAPLite). If you cannot install it on Windows, use Linux.
3. Install my machine learning tools: [https://github.com/B-C-WANG/MachineLearningTools](https://github.com/B-C-WANG/MachineLearningTools) \(download the files and run "python setup.py install"\)
4. Install my tools for extracting information from VASP dirs: [https://github.com/B-C-WANG/VDE-VaspDataExtract](https://github.com/B-C-WANG/VDE-VaspDataExtract) \(download the files and run "python setup.py install"\)
5. Install soapml: download the files and run "python setup.py install"
## Demo
### Train & Test
```python
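# Dataset and Model are provided by soapml; import them before running this demo
# prepare data
data = ...
# vasp_file_path is a list of strings, e.g. ["\public\Pt_OH1","\public\Pt_OH2"]
# y is the list of final adsorption energies, one per VASP dir
vasp_file_path, y = data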
# make dataset
dataset = Dataset.from_vasp_dir_and_energy_list(vasp_file_path,final_ads_energy=y,
description="""
the carbon nanotube data,
doped with N or B,
adsorbate: OH
""")
# drop the first 15% of samples in each VASP dir (each VASP dir is one sample group)
dataset.sample_filter(ratio=0.15)
# apply periodicity along vector C (the Z direction in this case): repeat at +Z, 0 and -Z
dataset.apply_period(direction=2,repeat_count=1)
dataset.soap_encode(center_atom_cases=[8],encode_atom_cases=[5,6,7])
# save for loading later
dataset.save("dataset.smld")
# do machine learning
dataset = Dataset.load("dataset.smld")
model = Model(dataset)
# use gradient boost regression
model.fit_gbr(n_estimators=200,shuffle=True,test_split_ratio=0.3)
model.save("model.smlm")
```
Result: an average error of 0.027. The figure compares the model's predicted\_y with the true\_y:
![X: model predicted y, Y: true y from DFT](.gitbook/assets/fig3.png)
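The two preprocessing steps in the training demo can be pictured in plain Python. This sketch is not soapml's implementation: `drop_first_fraction` and `repeat_along_axis` are hypothetical helpers. It assumes the filter discards the earliest frames of each VASP dir, and that the periodic repeat shifts every atom by whole lattice vectors taken from a 3x3 box tensor.

```python
def drop_first_fraction(frames, ratio):
    """Drop the first `ratio` share of frames (hypothetical stand-in for sample_filter)."""
    n_drop = int(len(frames) * ratio)
    return frames[n_drop:]


def repeat_along_axis(positions, box_tensor, direction, repeat_count):
    """Copy atoms shifted by -repeat_count..+repeat_count lattice vectors along one axis."""
    lattice_vector = box_tensor[direction]
    repeated = []
    for k in range(-repeat_count, repeat_count + 1):
        for x, y, z in positions:
            repeated.append((x + k * lattice_vector[0],
                             y + k * lattice_vector[1],
                             z + k * lattice_vector[2]))
    return repeated


# 20 frames from one VASP dir; the first 15% (3 frames) are discarded
frames = list(range(20))
kept = drop_first_fraction(frames, ratio=0.15)

# Illustrative orthogonal cell with a 5.0 long C vector, and a single atom
box_tensor = [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 5.0)]
atoms = [(1.0, 2.0, 3.0)]
images = repeat_along_axis(atoms, box_tensor, direction=2, repeat_count=1)

print(len(kept))
print(images)
```

With `direction=2` and `repeat_count=1`, each atom appears three times: once in the original cell and once in each neighbouring cell along Z.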
### Validate
```python
# load trained model
model = Model.load("soapmlGbrModel_test.smlm")
# offer a new sample_group from a VASP dir
validate_vasp_dir_path = ...
final_ads_energy = ...
dataset = Dataset.from_vasp_dir_and_energy_list(
vasp_dirs=[validate_vasp_dir_path],
final_ads_energy=[final_ads_energy],)
# model.encode encodes the new dataset the same way as the dataset
# that was passed in when the Model was created (the dataset in Model(dataset))
dataset = model.encode(dataset,center_atom_cases=[8],sample_filter_ratio=0.15)
model.predict_and_validate(dataset)
```
### Use
```python
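import numpy as np

# load the model
model = Model.load("soapmlGbrModel_test.smlm")
# use an existing VASP dir, without providing y
# (x_file_path is your list of VASP dir paths; it is not defined in this snippet)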
vasp_dir_index = 5
dataset = Dataset.from_vasp_dir_and_energy_list(
vasp_dirs=[x_file_path[vasp_dir_index]],
only_x=True)
# obtain a slab_structure from existing VASP samples
slab_structure = dataset.give_a_sample_from_dataset(sample_group_index=0,
sample_index=-1,
use_repeated=False)
# use the same box_tensor
box_tensor = dataset.box_tensor
# use custom center_position
center_position = np.array([[0,0,0],[-10,-10,-10],[5,5,5]])
# make a new dataset for prediction
dataset = Dataset.from_slab_and_center_position(slab_structure=slab_structure,
center_position=center_position,
box_tensor=box_tensor)
# encode the same way as dataset in model
dataset = model.encode(dataset, center_position=center_position)
print(dataset.datasetx)
# predict y and output
pred_y = model.predict(dataset)
print(pred_y)
```
### Other Demo - Coming Soon ...
soapml can also predict the energy at any position. Here we predicted the OH\* adsorption energy on a plane whose z equals the z-coordinate of the O atom:
![Example of predicting energy on any position using soapml](.gitbook/assets/1_.png)
## What can soapml do?
By training on data that captures a relationship such as **position-energy**, soapml can predict the energy at any position, which makes it possible to find positions with higher or lower energy.
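To scan a plane at fixed height, you first need a grid of candidate positions. The sketch below builds one in plain Python. `plane_grid` is a hypothetical helper, not part of soapml, and the ranges and height are illustrative values. The resulting list can be wrapped in `np.array(...)` and used as the center\_position shown in the "Use" demo.

```python
def plane_grid(x_range, y_range, z, n_x, n_y):
    """Return [x, y, z] points on a regular n_x by n_y grid at fixed height z.

    Both n_x and n_y must be at least 2 so that the grid spacing is defined.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    points = []
    for i in range(n_x):
        x = x_min + (x_max - x_min) * i / (n_x - 1)
        for j in range(n_y):
            y = y_min + (y_max - y_min) * j / (n_y - 1)
            points.append([x, y, z])
    return points


# Illustrative values: a 5 x 3 grid at the height of the adsorbate's O atom
center_position = plane_grid((0.0, 4.0), (0.0, 2.0), z=12.5, n_x=5, n_y=3)
print(len(center_position), center_position[0], center_position[-1])
```

Sorting the predicted energies for these points is then enough to locate the lowest or highest energy positions on the plane.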
## Can I use soapml now?
soapml currently supports _ΔE_ prediction for a background structure \(slab\) with one small active group \(ads\). There should be only one ads on the slab, the position of the ads should have a strong influence on _ΔE_, and _ΔE_ should be defined as _E\(slab+ads\) - E\(slab\)._
{% hint style="info" %}
_**Example:** We have a Pt\(111\) surface as the background structure \(slab\) and one OH\* on it, which moves freely over the slab. We need the energy of every frame of OH\* on Pt\(111\) \(Ea, a vector\) and the energy of the optimized slab \(Es\); ΔE is then Ea - Es \(known as the adsorption energy\). Another feasible approach is to use only the OH\* + slab structures together with ΔE values calculated in some other way._
{% endhint %}
## How to use soapml? -- Make input as dataset
_**You can build your input with soapml.Dataset.**_ soapml needs structures \(atom index and x, y, z coordinates\) and energies as input. It currently supports **VASP result dirs** \(a dir containing the OUTCAR, CONTCAR and OUT.ANI files of VASP\) and direct use of a **structure array list and energy array list**. You can write a note in the "description" parameter to help you remember what the dataset is.

_**Use one of the following methods to build your input; they all return an instance of Dataset. To build a dataset that contains only X \(needed for prediction\), pass only\_x = True.**_
_**——————————**_
### 1. @static - Dataset.from\_vasp\_dir\_and\_energy\_table
_**\(vasp\_dir\_table, only\_x, description\)**_
First, prepare a dir containing many VASP dirs, each holding a slab + ads structure. Use **@static - Dataset.generate\_vasp\_dir\_energy\_table\(vasp\_dir, to\_csv\)** to generate an Excel or CSV table, then fill in the **slab energy of every VASP dir**, like:
| Vasp Dirs | slab energy |
| :--- | :--- |
| Pt\_OH1 | 1.0 |
| Pt\_OH2 | 2.0 |
| Pt\_OH3 | 3.5 |
Then use **@static - Dataset.from\_vasp\_dir\_and\_energy\_table\(vasp\_dir\_table,description\)** and set vasp\_dir\_table to the excel or csv filename.
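Before handing the filled table to soapml, it can help to check that every row has a numeric slab energy. The sketch below does this with Python's standard csv module. The exact file layout that soapml generates is not shown in this document, so the column names "Vasp Dirs" and "slab energy" are assumed from the table above, and `read_energy_table` is a hypothetical helper.

```python
import csv
import io

# Stand-in for a filled CSV file; the column names mirror the table above (an assumption)
table_text = """Vasp Dirs,slab energy
Pt_OH1,1.0
Pt_OH2,2.0
Pt_OH3,3.5
"""


def read_energy_table(handle):
    """Map each VASP dir name to its slab energy; float() raises on a blank or bad cell."""
    reader = csv.DictReader(handle)
    return {row["Vasp Dirs"]: float(row["slab energy"]) for row in reader}


slab_energy_by_dir = read_energy_table(io.StringIO(table_text))
print(slab_energy_by_dir)
```

For a real file, replace the `io.StringIO` object with `open(filename, newline="")`.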
_**——————————**_
### **2. @static - Dataset.from\_vasp\_dir\_and\_energy\_list**
_**\(vasp\_dirs, slab\_energy, final\_ads\_energy, only\_x, description\)**_
vasp\_dirs is a list of strings containing your VASP dir paths; slab\_energy or final\_ads\_energy is a list with the same length as vasp\_dirs. **Give either slab\_energy or final\_ads\_energy, not both:**
{% tabs %}
{% tab title="slab\_energy" %}
Regression target _Et_ will be: _**energy of every step**_ **-** _**slab\_energy**_
{% endtab %}
{% tab title="final\_ads\_energy" %}
Regression target _Et_ will be: _**energy of every step - energy of final step + final\_ads\_energy**_.
{% endtab %}
{% endtabs %}
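The two target definitions can be written out as a short sketch. The function names are hypothetical and the numbers are illustrative; this shows the arithmetic described in the tabs, not soapml's internal code.

```python
def target_from_slab_energy(step_energies, slab_energy):
    """Et = energy of every step - slab_energy."""
    return [e - slab_energy for e in step_energies]


def target_from_final_ads_energy(step_energies, final_ads_energy):
    """Et = energy of every step - energy of final step + final_ads_energy."""
    final_step = step_energies[-1]
    return [e - final_step + final_ads_energy for e in step_energies]


# Illustrative energies of three steps in one VASP dir
step_energies = [-99.8, -98.7, -96.5]

from_slab = target_from_slab_energy(step_energies, slab_energy=-100.0)
from_final = target_from_final_ads_energy(step_energies, final_ads_energy=3.5)

print([round(v, 3) for v in from_slab])
print([round(v, 3) for v in from_final])
```

The two routes give the same targets whenever final\_ads\_energy equals the final step energy minus the slab energy, as it does for these numbers.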
{% hint style="info" %}
_**Example:**_ _If we have vasp\_dirs: \[Pt\_OH01, Pt\_OH02\], we can provide slab\_energy such as \[-100.0, -102.0\]. The step energies of Pt\_OH01, e.g. \[-99.8, -98.7, -96.5\], then have -100.0 subtracted from them to give the target energies \[0.2, 1.3,