Experiences of choosing Graph Embeddings Training Platform on Large Graphs


I've been having a headache lately over how to train embeddings for Wikidata. I have an experiment in mind that requires a large number of graph vectors for testing, but I keep running into all sorts of bugs. I'm currently facing a few difficulties:
  • First, the amount of data is huge: the whole of Wikidata amounts to roughly 200 million facts and 60 million nodes;
  • Second, the computing resources at hand are limited: the best machine I can use has 250 GB of memory, shared with others, so it is not possible to hold all of the high-dimensional entity embeddings in memory at once;
  • Finally, time is also fairly limited: I have tentatively selected four embedding models that need to be tuned, and a single training run can take several days, so training speed is a major consideration.

Therefore, choosing the right training framework became a top priority.
After testing and trying a variety of current frameworks, I finally settled on one. Below is my trial-and-error experience, kept as a personal memo:

PyKEEN

Code: https://github.com/pykeen/pykeen
Paper: https://www.jmlr.org/papers/volume22/20-825/20-825.pdf

A framework that has become very popular recently and supports many kinds of embedding models, but unfortunately my attempt failed: single-core training is far too slow, and memory consumption is heavy. A minimal sketch of the kind of pipeline I tried is below.
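
The snippet below is only a sketch of the standard PyKEEN pipeline on a custom triples file; the file name, embedding dimension and epoch count are placeholders, not the exact values from my experiments.

# Hedged sketch: load (head, relation, tail) triples from a TSV file and train TransE.
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Placeholder path to a tab-separated triples file.
tf = TriplesFactory.from_path("wikidata_triples.tsv")
training, testing = tf.split([0.9, 0.1])  # 90/10 train/test split

result = pipeline(
    training=training,
    testing=testing,
    model="TransE",
    model_kwargs=dict(embedding_dim=200),          # placeholder dimension
    training_kwargs=dict(num_epochs=100, batch_size=1024),
)
result.save_to_directory("output/transe_wikidata")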

AmpliGraph

Code: https://github.com/Accenture/AmpliGraph

Failed, for the same reasons as above.

GraphVite

Code: https://github.com/DeepGraphLearning/graphvite
Paper: https://arxiv.org/abs/1903.00757

A framework with multi-GPU support.
Unfortunately, I ran into the following unknown bug, and after several attempts and inquiries I still have not found a solution.
The problem might be caused by a mismatched CUDA version, or, as always, by the memory limitation (Aborted (core dumped)). I have set it aside for now, since I could not break the deadlock.
Anyone who can help is welcome to reach out.

(gra) graphvite baseline quick start
running baseline: demo/quick_start.yaml
loading graph from /root/.graphvite/dataset/blogcatalog/blogcatalog_train.txt
0.00018755%

Graph<uint32>
#vertex: 10308, #edge: 327429
as undirected: yes, normalization: no

[time] GraphApplication.load: 0.0948522 s
Check failed: error == cudaSuccess CUDA error unknown error at /network/home/zhuzhaoc/.local/envs/build/conda-bld/graphvite_1584598935508/work/include/core/solver.h:203
*** Check failure stack trace: ***
    @     0x7f53f019f1c3  google::LogMessage::Fail()
    @     0x7f53f01a425b  google::LogMessage::SendToLog()
    @     0x7f53f019eebf  google::LogMessage::Flush()
    @     0x7f53f019f6ef  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f51cd2bbfdf  graphvite::CudaCheck()
    @     0x7f51cd41ed64  graphvite::SolverMixin<>::SolverMixin()
    @     0x7f51cd464f1d  _ZZN8pybind1112cpp_function10initializeIZNS_6detail8initimpl11constructorIJSt6vectorIiSaIiEEimEE7executeINS_6class_IN9graphvite11GraphSolverILm128EfjEEJEEEJNS_10call_guardIJNS_18gil_scoped_releaseEEEENS_5arg_vESI_SI_ELi0EEEvRT_DpRKT0_EUlRNS2_16value_and_holderES7_imE_vJSQ_S7_imEJNS_4nameENS_9is_methodENS_7siblingENS2_24is_new_style_constructorESH_SI_SI_SI_EEEvOSJ_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4_FUNES17_
    @     0x7f51cd378529  pybind11::cpp_function::dispatcher()
    @     0x55b27acde424  _PyMethodDef_RawFastCallDict
    @     0x55b27acdeffa  method_call
    @     0x55b27acc3101  PyObject_Call
    @     0x55b27ad2e168  slot_tp_init
    @     0x55b27acc29fa  type_call
    @     0x55b27acc3101  PyObject_Call
    @     0x55b27ad6edb3  _PyEval_EvalFrameDefault
    @     0x55b27acb1ea2  _PyEval_EvalCodeWithName
    @     0x55b27acb341f  _PyFunction_FastCallDict
    @     0x55b27acf3fe1  slot_tp_new
    @     0x55b27acfa360  _PyObject_FastCallKeywords
    @     0x55b27acfb269  call_function
    @     0x55b27ad71cba  _PyEval_EvalFrameDefault
    @     0x55b27acb1ea2  _PyEval_EvalCodeWithName
    @     0x55b27acb341f  _PyFunction_FastCallDict
    @     0x55b27acdf093  method_call
    @     0x55b27acc3101  PyObject_Call
    @     0x55b27ad6edb3  _PyEval_EvalFrameDefault
    @     0x55b27acb1ea2  _PyEval_EvalCodeWithName
    @     0x55b27acb341f  _PyFunction_FastCallDict
    @     0x55b27ad6edb3  _PyEval_EvalFrameDefault
    @     0x55b27acb1ea2  _PyEval_EvalCodeWithName
    @     0x55b27acb341f  _PyFunction_FastCallDict
    @     0x55b27acdf093  method_call
Aborted (core dumped)

DGL-KE

Code: https://github.com/awslabs/dgl-ke
Paper: https://dl.acm.org/doi/pdf/10.1145/3397271.3401172

It should be one of the fastest among the existing frameworks.
Unfortunately, on my machine I can only train vectors with a maximum dimension of 20;
anything larger gets killed outright because of memory problems.
I asked the author but did not get a better solution; still, thanks to the author for his patient reply :).

Pytorch-Biggraph

Code: https://github.com/facebookresearch/PyTorch-BigGraph
Paper: https://mlsys.org/Conferences/2019/doc/2019/71.pdf

I started training and immediately ran into the old problem of memory explosion. After reading the documentation in detail, I found that the framework can split the entities into partitions and then train them separately; the only catch is that, since edges are grouped into buckets by their source and destination partitions, the number of buckets to iterate over grows quadratically with the number of partitions, so each epoch takes longer.
I simply set:

def get_torchbiggraph_config():
    config = dict( 
        # I/O data
        entity_path="data/wikidata",
        edge_paths=[
            "data/trainGpu",
            "data/validGpu",
            "data/testGpu",
        ],
        checkpoint_path="....",
        # Graph structure
        entities={"all": {"num_partitions": 8}}, #Only need to add this and it's ok
        relations=[
            {
                "name": "all_edges",
                "lhs": "all",
                "rhs": "all",
                "operator": "translation",
            }
        ],
        dynamic_relations=True,
        # Scoring model
        dimension=300,
        global_emb=False,
        comparator="cos",
        # Training
        num_epochs=500,
        batch_size=1024,
        num_batch_negs=64,
        num_uniform_negs=64,
        loss_fn="ranking",
        lr=0.05,
        regularization_coef=1e-3,
        # Evaluation during training
        eval_fraction=0,
        # GPU
        num_gpus=4,
    )

    return config

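For reference, here is roughly how such a config file gets picked up and run from Python, following the pattern in PyTorch-BigGraph's own examples; the file name config_wikidata.py is a placeholder, and it assumes the edge lists under data/ have already been converted to PBG's partitioned format (for example with the torchbiggraph_import_from_tsv tool).

# Hedged launch sketch, assuming the config above is saved as config_wikidata.py
# and the data has already been imported and partitioned.
from torchbiggraph.config import parse_config
from torchbiggraph.train import train

config = parse_config("config_wikidata.py")  # loads get_torchbiggraph_config()
train(config)
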
The curious(?) thing is that training on the CPU (50 cores) turned out to be about one and a half times faster than on the GPU (2× Tesla K80).

Author: Jixiong LIU (Yansera-衍之)
Link: https://www.yansera.com/2022/01/17/Training%20Embeddings%20on%20Large%20Graphs/
Copyright Notice: This article is licensed under CC BY-NC-SA 3.0.