# **ezSASRec**
## Documentation
https://ezsasrec.netlify.app
## References
### repos
1. [kang205 SASRec](https://github.com/kang205/SASRec)
2. [nnkkmto/SASRec-tf2](https://github.com/nnkkmto/SASRec-tf2)
3. [microsoft recommenders](https://github.com/microsoft/recommenders)
### papers
1. [Self-Attentive Sequential Recommendation](https://arxiv.org/pdf/1808.09781.pdf)
2. [A Case Study on Sampling Strategies for Evaluating Neural Sequential Item Recommendation Models](https://www.informatik.uni-wuerzburg.de/datascience/staff/dallmann/?tx_extbibsonomycsl_publicationlist%5Baction%5D=download&tx_extbibsonomycsl_publicationlist%5Bcontroller%5D=Document&tx_extbibsonomycsl_publicationlist%5BfileName%5D=main.pdf&tx_extbibsonomycsl_publicationlist%5BintraHash%5D=23f589b27e22018936753bb64b33971d&tx_extbibsonomycsl_publicationlist%5BuserName%5D=dallmann&cHash=dd7c54126f6c20972a502e9cc223cec2)
---------------
# **QuickStart**
Example data source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
```python
import pandas as pd
import pickle
from sasrec.util import filter_k_core, SASRecDataSet, load_model
from sasrec.model import SASREC
from sasrec.sampler import WarpSampler
import multiprocessing
```
## Preprocessing
```python
path = 'your path'  # directory where trained models will be saved (a subfolder per exp_name is created)
```
```python
df = pd.read_csv('ratings.csv')
df = (df.rename({'userId': 'userID', 'movieId': 'itemID', 'timestamp': 'time'}, axis=1)
        .sort_values(by=['userID', 'time'])
        .drop(['rating', 'time'], axis=1)
        .reset_index(drop=True))
```
```python
df.head()
```
<div id="df-f0146c0d-8a79-4924-9daa-e3b1bad88db4">
<div class="colab-df-container">
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>userID</th>
<th>itemID</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>2762</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>54503</td>
</tr>
<tr>
<th>2</th>
<td>1</td>
<td>112552</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>96821</td>
</tr>
<tr>
<th>4</th>
<td>1</td>
<td>5577</td>
</tr>
</tbody>
</table>
</div>
</div>
```python
# filter data with a 7-core filter:
# every user and item appears at least 7 times in filtered_df
filtered_df = filter_k_core(df, 7)
```
Original: 270896 users and 45115 items
Final: 243377 users and 24068 items
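As a quick sanity check (plain pandas, not part of the library), you can confirm the 7-core property holds:
```python
# every user and item should now appear at least 7 times
assert filtered_df['userID'].value_counts().min() >= 7
assert filtered_df['itemID'].value_counts().min() >= 7
```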
```python
# build encoders mapping raw IDs to consecutive integers starting at 1
user_set, item_set = set(filtered_df['userID'].unique()), set(filtered_df['itemID'].unique())
user_map = {user: u + 1 for u, user in enumerate(user_set)}
item_map = {item: i + 1 for i, item in enumerate(item_set)}
maps = (user_map, item_map)
```
```python
# encode filtered_df with the maps
filtered_df["userID"] = filtered_df["userID"].map(user_map)
filtered_df["itemID"] = filtered_df["itemID"].map(item_map)
```
```python
# save the interaction data in the tab-separated format SASRecDataSet expects
filtered_df.to_csv('sasrec_data.txt', sep="\t", header=False, index=False)
# save the ID maps
with open('maps.pkl', 'wb') as f:
    pickle.dump(maps, f)
```
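In a later session, restore the maps with pickle before decoding any predictions:
```python
# reload the saved ID maps
with open('maps.pkl', 'rb') as f:
    user_map, item_map = pickle.load(f)
```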
## Load Data and Train Model
```python
# load data
data = SASRecDataSet('sasrec_data.txt')
data.split()  # train/validation/test split:
# the last interaction of each user is held out for testing,
# the second-to-last is used for validation,
# and the rest are used for training
```
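To confirm what was loaded, the dataset exposes the user and item counts that are passed to the sampler and model below:
```python
# counts inferred from sasrec_data.txt
print(f'users: {data.usernum}, items: {data.itemnum}')
```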
```python
# build the model and a WarpSampler for batched training
max_len = 100
hidden_units = 128
batch_size = 2048
model = SASREC(
    item_num=data.itemnum,
    seq_max_len=max_len,
    num_blocks=2,
    embedding_dim=hidden_units,
    attention_dim=hidden_units,
    attention_num_heads=2,
    dropout_rate=0.2,
    conv_dims=[hidden_units, hidden_units],
    l2_reg=0.00001,
)
sampler = WarpSampler(
    data.user_train,
    data.usernum,
    data.itemnum,
    batch_size=batch_size,
    maxlen=max_len,
    n_workers=multiprocessing.cpu_count(),
)
```
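`WarpSampler` pre-generates training batches in worker processes. Assuming it keeps the original kang205-style sampler interface (an assumption; check your installed version), you can optionally peek at one batch before training:
```python
# peek at one pre-generated batch (assumes a kang205-style next_batch() interface)
users, seqs, pos, neg = sampler.next_batch()
print(len(users))  # -> batch_size
```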
```python
# train the model; with auto_save=True the best-HR@10 model is saved under path/exp_name
model.train(
    data,
    sampler,
    num_epochs=3,
    batch_size=batch_size,
    lr=0.001,
    val_epoch=1,             # evaluate every val_epoch epochs
    val_target_user_n=1000,  # number of users sampled for evaluation
    target_item_n=-1,        # -1: rank against all negative candidates
    auto_save=True,
    path=path,
    exp_name='exp_example',
)
```
epoch 1 / 3 -----------------------------
Evaluating...
epoch: 1, test (NDCG@10: 0.04607630127474612, HR@10: 0.097)
best score model updated and saved
epoch 2 / 3 -----------------------------
Evaluating...
epoch: 2, test (NDCG@10: 0.060855185638025944, HR@10: 0.118)
best score model updated and saved
epoch 3 / 3 -----------------------------
Evaluating...
epoch: 3, test (NDCG@10: 0.07027207563856912, HR@10: 0.139)
best score model updated and saved
## Predict
```python
# load the trained model saved by auto_save under path/exp_example
model = load_model(path, 'exp_example')
```
### get score
```python
# get user-item scores
# build an inverse user map to recover raw user IDs
inv_user_map = {v: k for k, v in user_map.items()}
# sample target users
model.sample_val_users(data, 100)
encoded_users = model.val_users
# get scores
score = model.get_user_item_score(
    data,
    [inv_user_map[u] for u in encoded_users],  # user_list of raw (not encoded) user IDs
    [1, 2, 3],                                 # item_list of raw (not encoded) item IDs
    user_map,
    item_map,
    batch_size=10,
)
```
100%|██████████| 10/10 [00:00<00:00, 29.67batch/s]
```python
score.head()
```
<div id="df-556484ef-c5ea-4d4f-b3ec-ec343da88e4e">
<div class="colab-df-container">
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>user_id</th>
<th>1</th>
<th>2</th>
<th>3</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1525</td>
<td>5.596944</td>
<td>4.241653</td>
<td>3.804743</td>
</tr>
<tr>
<th>1</th>
<td>1756</td>
<td>4.535607</td>
<td>2.694459</td>
<td>0.858440</td>
</tr>
<tr>
<th>2</th>
<td>2408</td>
<td>5.883061</td>
<td>4.655960</td>
<td>4.691791</td>
</tr>
<tr>
<th>3</th>
<td>2462</td>
<td>5.084695</td>
<td>2.942075</td>
<td>2.773376</td>
</tr>
<tr>
<th>4</th>
<td>3341</td>
<td>5.532438</td>
<td>4.348150</td>
<td>4.073740</td>
</tr>
</tbody>
</table>
</div>
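Since `score` is an ordinary DataFrame, downstream selection is plain pandas; for instance, picking each user's highest-scoring item among the three queried ones:
```python
# column label of the max score per row, i.e. each user's best of items 1, 2, 3
best_item = score.set_index('user_id').idxmax(axis=1)
```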
### get recommendation
```python
# get top-N recommendations
reco = model.recommend_item(
    data,
    user_map,
    [inv_user_map[u] for u in encoded_users],
    is_test=True,  # recommend from each user's sequence excluding the held-out test item
    top_n=5,
)
```
100%|██████████| 100/100 [00:04<00:00, 21.10it/s]
```python
# the returned dict maps each user to its top-N (encoded item ID, score) pairs
reco
```
{1525: [(456, 6.0680223),
(355, 6.033769),
(379, 5.9833336),
(591, 5.9718275),
(776, 5.8978705)],
1756: [(7088, 5.735977),
(15544, 5.5946136),
(5904, 5.500249),
(355, 5.492655),
(22149, 5.4117346)],
2408: [(456, 5.976555),
(328, 5.8824606),
(588, 5.8614006),
(264, 5.7114534),
(299, 5.649914)],
2462: [(259, 6.3445344),
(591, 6.2664876),
(295, 6.105361),
(355, 6.0698805),
(1201, 5.8477645)],
3341: [(110, 5.510764),
(1, 5.4927354),
(259, 5.4851904),
(161, 5.467624),
(208, 5.2486935)], ...}
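The item IDs inside `reco` are encoded labels, so map them back to raw movie IDs with an inverse item map (a small sketch using the maps built during preprocessing):
```python
# decode encoded item IDs back to raw movie IDs
inv_item_map = {v: k for k, v in item_map.items()}
decoded = {user: [(inv_item_map[i], s) for i, s in items]
           for user, items in reco.items()}
```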