Table of Contents
=================
- What is LIBFFM
- Overfitting and Early Stopping
- Installation
- Data Format
- Command Line Usage
- Examples
- OpenMP and SSE
- Building Windows Binaries
- FAQ
- Contributors
What is LIBFFM
==============
LIBFFM is a library for field-aware factorization machine (FFM).
Field-aware factorization machine is an effective model for CTR prediction. It has been used to win top-3 positions
in the following competitions:
* Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge
* Avazu: https://www.kaggle.com/c/avazu-ctr-prediction
* Outbrain: https://www.kaggle.com/c/outbrain-click-prediction
* RecSys 2015: http://dl.acm.org/citation.cfm?id=2813511&dl=ACM&coll=DL&CFID=941880276&CFTOKEN=60022934
You can find more information about FFM in the following paper / slides:
* http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf
* http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf
* https://arxiv.org/abs/1701.04099
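For reference, the model described in these papers scores an instance x by summing pairwise interactions between its
features, where each feature keeps a separate k-dimensional latent vector per field (notation follows the FFM paper):

    \phi(\mathbf{w}, \mathbf{x}) = \sum_{j_1} \sum_{j_2 > j_1} \langle \mathbf{w}_{j_1, f_2}, \mathbf{w}_{j_2, f_1} \rangle \, x_{j_1} x_{j_2}

Here f_1 and f_2 denote the fields of features j_1 and j_2, and each \mathbf{w}_{j,f} is a latent vector of length k
(the `-k' option described below).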
Overfitting and Early Stopping
==============================
FFM is prone to overfitting, and the solution we have so far is early stopping. See how FFM behaves on a certain data
set:
> ffm-train -p va.ffm -l 0.00002 tr.ffm
iter tr_logloss va_logloss
1 0.49738 0.48776
2 0.47383 0.47995
3 0.46366 0.47480
4 0.45561 0.47231
5 0.44810 0.47034
6 0.44037 0.47003
7 0.43239 0.46952
8 0.42362 0.46999
9 0.41394 0.47088
10 0.40326 0.47228
11 0.39156 0.47435
12 0.37886 0.47683
13 0.36522 0.47975
14 0.35079 0.48321
15 0.33578 0.48703
We see that the best validation loss is achieved at the 7th iteration; if we keep training, overfitting begins. It is
worth noting that increasing the regularization parameter does not help:
> ffm-train -p va.ffm -l 0.0002 -t 50 -s 12 tr.ffm
iter tr_logloss va_logloss
1 0.50532 0.49905
2 0.48782 0.49242
3 0.48136 0.48748
...
29 0.42183 0.47014
...
48 0.37071 0.47333
49 0.36767 0.47374
50 0.36472 0.47404
To avoid overfitting, we recommend always providing a validation set with the option `-p'. You can use the option
`--auto-stop' to stop at the iteration that reaches the best validation loss:
> ffm-train -p va.ffm -l 0.00002 --auto-stop tr.ffm
iter tr_logloss va_logloss
1 0.49738 0.48776
2 0.47383 0.47995
3 0.46366 0.47480
4 0.45561 0.47231
5 0.44810 0.47034
6 0.44037 0.47003
7 0.43239 0.46952
8 0.42362 0.46999
Auto-stop. Use model at 7th iteration.
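The rule behind `--auto-stop' is simple and can be reproduced outside LIBFFM as well. Below is a minimal C++ sketch of
the idea, not LIBFFM's actual implementation; the callables train_one_epoch, validate, and save_snapshot are
hypothetical placeholders:

    #include <cstdio>
    #include <functional>
    #include <limits>

    // Keep training while the validation logloss improves; stop at the first
    // iteration where it gets worse and keep the best model seen so far.
    int train_with_auto_stop(int max_iters,
                             const std::function<double()> &train_one_epoch, // returns tr_logloss
                             const std::function<double()> &validate,        // returns va_logloss
                             const std::function<void()> &save_snapshot)     // remembers the current model
    {
        double best_va_loss = std::numeric_limits<double>::max();
        int best_iter = 0;
        for(int iter = 1; iter <= max_iters; iter++)
        {
            double tr_loss = train_one_epoch();
            double va_loss = validate();
            std::printf("%4d %12.5f %12.5f\n", iter, tr_loss, va_loss);
            if(va_loss < best_va_loss)
            {
                best_va_loss = va_loss;   // still improving: remember this iteration
                best_iter = iter;
                save_snapshot();
            }
            else
            {
                std::printf("Auto-stop. Use model at %dth iteration.\n", best_iter);
                break;                    // first increase of validation loss: stop
            }
        }
        return best_iter;
    }

The run above matches this behavior: the validation loss improves through iteration 7, worsens at iteration 8, and the
model from iteration 7 is kept.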
Installation
============
Requirement: a C++11-compatible compiler. We also use OpenMP to provide multi-threading. If OpenMP is not
available on your platform, please refer to the section `OpenMP and SSE.'
- Unix-like systems:
Type `make' in the command line.
- Windows:
See `Building Windows Binaries' to compile.
Data Format
===========
The data format of LIBFFM is:
<label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...
.
.
.
`field' and `feature' should be non-negative integers. See an example `bigdata.tr.txt.'
It is important to understand the difference between `field' and `feature'. For example, if we have raw data like this:
Click Advertiser Publisher
===== ========== =========
0 Nike CNN
1 ESPN BBC
Here, we have
* 2 fields: Advertiser and Publisher
* 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC
Usually you will need to build two dictionaries, one for fields and one for features, like this:
DictField[Advertiser] -> 0
DictField[Publisher] -> 1
DictFeature[Advertiser-Nike] -> 0
DictFeature[Publisher-CNN] -> 1
DictFeature[Advertiser-ESPN] -> 2
DictFeature[Publisher-BBC] -> 3
Then, you can generate FFM format data:
0 0:0:1 1:1:1
1 0:2:1 1:3:1
Note that because these features are categorical, the values here are all ones.
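To make the conversion concrete, here is a small standalone C++ sketch (not part of LIBFFM) that builds the two
dictionaries on the fly and prints exactly the two FFM lines shown above:

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct Row { int click; std::string advertiser, publisher; };

    int main()
    {
        // The toy table above: (Click, Advertiser, Publisher).
        std::vector<Row> rows = { {0, "Nike", "CNN"}, {1, "ESPN", "BBC"} };

        std::map<std::string, int> dict_field;    // field name   -> field index
        std::map<std::string, int> dict_feature;  // feature name -> feature index

        for(const Row &r : rows)
        {
            // insert() keeps the index that was first assigned to a key.
            int f_adv = dict_field.insert({"Advertiser", (int)dict_field.size()}).first->second;
            int f_pub = dict_field.insert({"Publisher",  (int)dict_field.size()}).first->second;
            int x_adv = dict_feature.insert({"Advertiser-" + r.advertiser,
                                             (int)dict_feature.size()}).first->second;
            int x_pub = dict_feature.insert({"Publisher-" + r.publisher,
                                             (int)dict_feature.size()}).first->second;

            // Categorical features all get value 1.
            std::printf("%d %d:%d:1 %d:%d:1\n", r.click, f_adv, x_adv, f_pub, x_pub);
        }
        return 0;
    }

Categorical features get value 1; numerical features would instead keep their real values.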
Command Line Usage
==================
- `ffm-train'
usage: ffm-train [options] training_set_file [model_file]
options:
-l <lambda>: set regularization parameter (default 0.00002)
-k <factor>: set number of latent factors (default 4)
-t <iteration>: set number of iterations (default 15)
-r <eta>: set learning rate (default 0.2)
-s <nr_threads>: set number of threads (default 1)
-p <path>: set path to the validation set
--quiet: quiet mode (no output)
--no-norm: disable instance-wise normalization
--auto-stop: stop at the iteration that achieves the best validation loss (must be used with -p)
By default we do instance-wise normalization. That is, we normalize the 2-norm of each instance to 1. You can use
`--no-norm' to disable this function.
A binary file `training_set_file.bin' will be generated to store the data in binary format.
Because FFM usually needs early stopping for better test performance, we provide an option `--auto-stop' to stop at
the iteration that achieves the best validation loss. Note that you need to provide a validation set with `-p' when
you use this option.
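As an illustration of the instance-wise normalization mentioned above, here is a short C++ sketch of the idea (the
`Node' struct is illustrative, not LIBFFM's internal type): every value in an instance is divided by the instance's
2-norm, so the normalized instance has 2-norm 1.

    #include <cmath>
    #include <vector>

    struct Node { int field, feature; float value; };   // illustrative only

    // Scale one instance so that its 2-norm becomes 1 (this is what --no-norm disables).
    void normalize_instance(std::vector<Node> &x)
    {
        double sq = 0.0;
        for(const Node &n : x)
            sq += (double)n.value * n.value;
        if(sq == 0.0)
            return;                                      // nothing to scale
        float inv_norm = (float)(1.0 / std::sqrt(sq));
        for(Node &n : x)
            n.value *= inv_norm;                         // e.g. three values of 1 each become 1/sqrt(3)
    }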
- `ffm-predict'
usage: ffm-predict test_file model_file output_file
Examples
========
Download a toy data set from:
zip: https://drive.google.com/open?id=1HZX7zSQJy26hY4_PxSlOWz4x7O-tbQjt
tar.gz: https://drive.google.com/open?id=12-EczjiYGyJRQLH5ARy1MXRFbCvkgfPx
This data set is a 1% subsample of the data from Criteo's challenge.
> tar -xzf libffm_toy.tar.gz
or
> unzip libffm_toy.zip
> ./ffm-train -p libffm_toy/criteo.va.r100.gbdt0.ffm libffm_toy/criteo.tr.r100.gbdt0.ffm model
train a model using the default parameters
> ./ffm-predict libffm_toy/criteo.va.r100.gbdt0.ffm model output
do prediction
> ./ffm-train -l 0.0001 -k 15 -t 30 -r 0.05 -s 4 --auto-stop -p libffm_toy/criteo.va.r100.gbdt0.ffm libffm_toy/criteo.tr.r100.gbdt0.ffm model
train a model using the following parameters:
regularization cost = 0.0001
latent factors = 15
iterations = 30
learning rate = 0.05
threads = 4
let it auto-stop
OpenMP and SSE
==============
We use OpenMP to do parallelization. If OpenMP is not available on your
platform, then please comment out the following lines in Makefile.
DFLAG += -DUSEOMP
CXXFLAGS += -fopenmp
Note: Please run `make clean all' if these flags are changed.
We use SSE instructions to perform fast computation. If you do not want to use it, comment out the following line:
DFLAG += -DUSESSE
Then, run `make clean all'.
Building Windows Binaries
=========================
The Windows part is maintained by a different maintainer, so it may not always support the latest version.
The latest version it supports is v1.21.
To build them via command-line tools of Visual C++, use the following steps:
1. Open a DOS command box (or Developer Command Prompt for Visual Studio) and go to the LIBFFM directory. If the
environment variables of VC++ have not been set, type
"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"
You may have to modify the above command according to which version of VC++ you have and where it is installed.
2. Type
nmake -f Makefile.win clean all
FAQ
===
Q: Why do I get the same model size when k = 1 and k = 4?
A: This is because we use SSE instructions. In order to use SSE, the memory needs to be aligned. So even if you assign
k = 1, we still fill in dummy zeros from k = 2 to 4.
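To illustrate the padding (the constant 4 below reflects the four floats in a 128-bit SSE register; the exact rounding
code in LIBFFM may differ):

    #include <cstdio>

    // The latent dimension is padded up to a multiple of the SSE width
    // (4 floats per 128-bit register); the padded entries are stored as zeros.
    int main()
    {
        const int kALIGN = 4;
        for(int k = 1; k <= 8; k++)
        {
            int k_aligned = (k + kALIGN - 1) / kALIGN * kALIGN;   // round up to a multiple of 4
            std::printf("k = %d  ->  k_aligned = %d\n", k, k_aligned);
        }
        return 0;
    }

This also explains why increasing k from 4 to 5 roughly doubles the model size (5 is padded up to 8), as reported in
the `k_aligned & memory requirements' comment below.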
Q: Why is the logloss slightly different on the same data when I run the program multiple times with multi-threading?
A: When there is more than one thread, the program becomes non-deterministic. To make it deterministic, use only one thread.
Contributors
============
Yuchin Juan, Wei-Sheng Chin, and Yong Zhuang
For questions, comments, feature requests, or bug reports, please send your email to:
Yuchin Juan ([email protected])
For Windows related questions, please send your email to:
Wei-Sheng Chin ([email protected])
Comments
========
- Segmentation fault (opened by skirpichenko)
  Hello,
  Thank you for your excellent method, software and description.
  I faced a problem trying to employ LIBFFM in my ML task: I get a segmentation fault when using the cross-validation
  option. Here are my setup and data (Ubuntu 13.10):
      ~/libffm$ ./ffm-train -k 5 -t 30 -r 0.03 -v 2 data.txt
      fold   logloss
         0    0.1080
      Segmentation fault (core dumped)
  The data.txt can be downloaded here: https://drive.google.com/open?id=0B9HyQ7ZccW4-VFE0VWtxUHF2R3c
  The problem arises only when working with big data files like that. If you cut it to 100K lines (it is around 250K
  lines), everything works fine.
  Regards, Sergey
- Train and validation data sets both have labels, but there is no label in the test data set. How to fill it in?
  (opened by altmanWang)
  Thanks for your amazing LIBFFM.
  When using ffm-predict, I have a problem: how should I fill in the label column for the test set?
  Thanks again.
- "-nan" value appeared during training (opened by lxjhk)
  When I was training the model, the first few iterations worked fine, but subsequent iterations returned "-nan" for
  the log losses of the training and validation data sets. Any ideas what went wrong?
  Sample of the data used for training:
      1 0:400492:1 1:977206:1 2:861366:1 3:223345:1 4:4:0.0 5:5:9567.0 6:6:31835.0 7:7:0.300471105528 8:8:0.0 9:9:0.0 10:35822:1 11:486386:1 12:528723:1 13:662860:1 14:990282:1 15:406964:1 16:698517:1 17:585048:1 18:18:0.38219606197 19:19:0.125217833586 20:20:0.438929013305 21:21:0.216453092359 22:923220:1 23:63477:1 24:216531:1 25:461117:1
      0 0:400492:1 1:203267:1 2:861366:1 3:223345:1 4:4:0.0 5:5:1642.0 6:6:9441.0 7:7:0.173830192674 8:8:0.0 9:9:0.0644 10:709579:1 11:486386:1 12:528723:1 13:662860:1 14:778015:1 15:581435:1 16:698517:1 17:181797:1 18:18:0.581693006318 19:19:0.097000178732 20:20:0.367630745198 21:21:0.182764132116 22:923220:1 23:63477:1 24:216531:1 25:461117:1
- k_aligned & memory requirements (opened by mpekalski)
  - It would be useful to mention in the README that memory allocation depends on k_aligned, not just k. So changing k
    from 4 to 5 actually doubles the memory requirement.
  - Is there any particular reason why you align k to a power of 2?
- ffm-train not found (opened by JoshuaC3)
  Hi, I am trying to use LIBFFM on Ubuntu 16.04. I have C++11 and OpenMP installed via apt-get, downloaded LIBFFM, and
  ran `make'. I am in the libffm directory, ran the following, and got:
      josh:~/libffm-master$ ffm-train bigdata.tr.txt model
      ffm-train: command not found
  When I check the directory you can see it is there:
      josh:~/libffm-master$ dir
      bigdata.te.txt ffm.cpp ffm-predict ffm-train.cpp README bigdata.tr.txt ffm.h ffm-predict.cpp
      Makefile COPYRIGHT ffm.o ffm-train Makefile.win
  Any help would be great. Thanks.
- Refactor build scripts (opened by c-bata)
  Changes:
  - [x] Add CMakeLists.txt for CLion users.
  - [x] Update Makefile
  - [x] Add description to build macOS binaries.
  - [x] Update .gitignore
  How to build on macOS
  Apple clang (use libomp):
      $ brew install libomp
      $ make OMP_CXXFLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix libomp)/include" OMP_LDFLAGS="-L$(brew --prefix libomp)/lib -lomp"
  or cmake:
      $ brew install libomp
      $ mkdir build
      $ cd build
      $ cmake \
          -DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix libomp)/include" \
          -DOpenMP_CXX_LIB_NAMES="omp" \
          -DOpenMP_omp_LIBRARY=$(brew --prefix libomp)/lib/libomp.dylib \
          ..
      $ make
  See https://cmake.org/cmake/help/latest/module/FindOpenMP.html
  Using gcc (installed by homebrew):
      $ brew install gcc
      $ make CXX="g++-8"
  or cmake:
      $ brew install gcc
      $ export CXX=g++-8
      $ mkdir build && cd build
      $ cmake ..
      $ make
  Disable OpenMP:
      $ make USEOMP=OFF
  or cmake:
      $ mkdir build && cd build
      $ cmake -DUSE_OPENMP=OFF ..
      $ make
- viewing the model (opened by shgidi)
  I used this package a few months ago, and I remember I was able to run `head model' and see the model weights. It
  seems that the model is now encoded somehow (binarized?). Am I correct? Is there a way to see the model as before?
- Does parallel operation of the train function in ffm.cpp ensure thread safety? (opened by heekyungyoon)
  Regarding train() in ffm.cpp lines 228-375, I have a question on thread safety. Below are lines 288-312:
      #if defined USEOMP
      #pragma omp parallel for schedule(static) reduction(+: tr_loss)
      #endif
      for(ffm_int ii = 0; ii < (ffm_int)order.size(); ii++)
      {
          ffm_int i = order[ii];
          ffm_float y = tr->Y[i];
          ffm_node *begin = &tr->X[tr->P[i]];
          ffm_node *end = &tr->X[tr->P[i+1]];
          ffm_float r = R_tr[i];
          ffm_float t = wTx(begin, end, r, *model);
          ffm_float expnyt = exp(-y*t);
          tr_loss += log(1+expnyt);
          ffm_float kappa = -y*expnyt/(1+expnyt);
          wTx(begin, end, r, *model, kappa, param.eta, param.lambda, true);
      }
  I am new to OpenMP parallel operations. I am curious whether this ensures thread safety for the wTx operation at the
  very bottom:
      wTx(begin, end, r, *model, kappa, param.eta, param.lambda, true);
  It seems that since wTx with do_update = true updates the weights, it could interfere with other threads updating
  the weights. Waiting for a reply.
- fix the numerical problem in the log loss calculation (opened by ianlini)
  When some predictions are very near to 0 or 1, they may produce log(0) = -inf. I use epsilon = 1e-15 to limit the
  range of the prediction (the same as sklearn and all the competitions on Kaggle). The value should be configurable
  with a command line argument in the future. I also got -nan before using this (like in #11), but I am not sure why
  -nan is produced. (BTW, some redundant spaces were automatically removed by my editor.)
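  A minimal C++ sketch of the clipping described in this pull request (illustrative only, not the actual patch):

      #include <algorithm>
      #include <cmath>

      // Bound the predicted probability away from 0 and 1 before taking the log,
      // so that log(0) never turns the reported logloss into -inf or nan.
      double clipped_logloss_term(double y, double p)      // y in {0, 1}, p = predicted P(y = 1)
      {
          const double eps = 1e-15;
          p = std::min(std::max(p, eps), 1.0 - eps);
          return y > 0.5 ? -std::log(p) : -std::log(1.0 - p);
      }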
- Unknown features (opened by ralovets)
  Unknown features (like a new app_id or device_id that was not in the training data) lead to random probabilities
  (too small or too high). Could you suggest a workaround for using LIBFFM in that case?
- libffm-linear prediction (opened by gediminaszylius)
  Hello,
  I am trying to use the libffm-linear library. Here are my outputs:
      libffm-linear>windows\ffm-train -s 2 -l 0 -k 10 -t 50 -r 0.01 --auto-stop -p test_data.txt train_data.txt model
      iter   tr_logloss   va_logloss
         1      0.25510      0.25017
         2      0.25129      0.24927
         3      0.25070      0.24882
         4      0.25041      0.24843
         5      0.25020      0.24821
         6      0.25005      0.24808
         7      0.24990      0.24801
         8      0.24977      0.24800
         9      0.24968      0.24820
      Auto-stop. Use model at 8th iteration.
      libffm-linear>windows\ffm-predict test_data.txt model output_file
      logloss = 0.34800
  Why does the prediction logloss differ from the validation logloss on the same file?
- How to use tags as features with FFM? (opened by sumitsidana)
  How do I use tags associated with an item as a field in FFM? In FFM, only one feature for a given field can be
  turned on, but for tags we have several features set to "1" for that field. So, how can tags be used as a field in
  FFM?
- almost no comments in the code (opened by lmxhappy)
  In the implementation there are almost no comments, which makes it hard to read and learn from. C code is already
  harder to read than Python, and the lack of comments makes it much harder for learners. All in all, the
  implementation is unfriendly to readers. Please add the necessary comments; at the very least, the members of the
  structs should be commented. Thank you on behalf of everyone.
- Java wrapper (opened by RochanMehrotra)
  Hello!
  I am about to finish a generalised wrapper for the "predict" and "ffm_load_model" functions in Java. It would be
  great if you could review my code and add it to your library if you deem it fit.
  Thank you
- make error (opened by einvince)
      g++ -Wall -O3 -std=c++0x -march=native -fopenmp -DUSESSE -DUSEOMP -c -o ffm.o ffm.cpp
      /tmp/cc2xJsit.s: Assembler messages:
      /tmp/cc2xJsit.s:3277: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
      /tmp/cc2xJsit.s:3286: Error: suffix or operands invalid for `vpaddd'
      /tmp/cc2xJsit.s:3598: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
      /tmp/cc2xJsit.s:3609: Error: suffix or operands invalid for `vpaddd'
      /tmp/cc2xJsit.s:3949: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
      /tmp/cc2xJsit.s:3955: Error: suffix or operands invalid for `vpaddd'
      /tmp/cc2xJsit.s:4273: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'
      /tmp/cc2xJsit.s:4284: Error: suffix or operands invalid for `vpaddd'
Releases
========
- v123 (Nov 14, 2017)
- v122 (Jul 16, 2017)
- v121 (Jun 2, 2017)
- v120 (May 28, 2017)
  - Binary model
    In older versions the model was stored as a text file, which was very slow to save and load. To make this faster,
    we decided to use a binary format.
  - Removed C API support
    In the old version, in order to support a pure C API, the code inside LIBFFM was written in a mixed C++ / C style.
    This was buggy and ugly, so we decided to stop providing the C API in this version. If you need it, let us know
    and we will consider writing a wrapper.
  - Removed cross-validation
    FFM has so far been shown to be useful for large-scale categorical data. Because such data sets are usually large,
    cross-validation takes a very long time. Indeed, we ourselves have never used cross-validation (including when we
    were attending the Criteo and Avazu contests). We think this function is overkill, so we decided to remove it.
  - Removed in-memory training
    We found that on-disk training has very similar performance to in-memory training while consuming far less memory,
    so we decided to remove in-memory training and keep only the on-disk version.
  - Support randomization in on-disk mode
    In previous versions the selection of data points was not randomized in on-disk mode.
  - Binary data file reuse
    Converting a text file to a binary file is slow. In this version you only need to convert once, and we
    automatically reuse the binary file afterwards.
  - Added a timer
    We now output the training time.
Related Projects
================
- QRec: a Python framework for quick implementation of recommender systems (TensorFlow based; supported by Python
  3.7.4 and TensorFlow 1.14+) in which a number of influential and newly state-of-the-art recommendation models are
  implemented.
- Recommender System Papers: included conferences are SIGIR 2020, SIGKDD 2020, RecSys 2020, CIKM 2020, AAAI 2021,
  WSDM 2021, and WWW 2021.
- ToR[e]cSys: a PyTorch framework to implement recommendation system algorithms, including but not limited to
  click-through-rate (CTR) prediction, learning-to-rank (LTR), and matrix/tensor embedding.
- RecList: an open source library providing behavioral, "black-box" testing for recommender systems.
- Collaborative Variational Bandwidth Auto-encoder (VBAE) for recommender systems: code associated with the paper of
  the same name.
- MKM-SR: incorporating user micro-behaviors and item knowledge into multi-task learning for session-based
  recommendation (paper, data, and code).
- PPGN: code for the CIKM 2019 paper "Cross-Domain Recommendation via Preference Propagation GraphNet".
- Books-Recommendation: books recommendation with Python.
- RetaGNN: Relational Temporal Attentive Graph Neural Networks for Holistic Sequential Recommendation (PyTorch-based
  implementation).
- RecSim NG: toward principled uncertainty modeling for recommender ecosystems; a scalable, modular, differentiable
  multi-agent recommender-systems simulator implemented in Edward2 and TensorFlow.
- Spotify API Recommender System: accesses your last listened songs on Spotify via its API, asks the user to select
  5 favorite songs from that list, and then makes 50 recommendations.
- Spotlight: deep and shallow recommender models using PyTorch, with building blocks for loss functions.
- L0-SIGN: Detecting Beneficial Feature Interactions for Recommender Systems (AAAI 2021).
- DMBGN: Deep Multi-Behaviors Graph Networks for Voucher Redemption Rate Prediction (SIGKDD 2021 Applied Data Science
  Track).
- GES: TensorFlow implementation of "Graph-based Embedding Smoothing for Sequential Recommendation" (TKDE, 2021).
- DGCN: official implementation of "DGCN: Diversified Recommendation with Graph Convolutional Networks" (WWW '21).
- Reinforcement Knowledge Graph Reasoning for Explainable Recommendation: source code of the SIGIR 2019 paper.
- LESSR: a PyTorch implementation of LESSR (Lossless Edge-order preserving aggregation and Shortcut graph attention
  for Session-based Recommendation).
- exemplo-de-sistema-especialista: an example of an expert system in Python for recommending TV series; its goal is to
  help the user choose a show.
- MusicPlayer: a music player system based on personalized recommendation. It was the author's senior-year graduation
  project, now open-sourced, and is built with Python's tkinter and pygame. Overall the code is fairly rough, it runs
  after installing a few basic libraries, and the data has not yet been uploaded to GitHub.