Training big random forests with little resources
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
Training big random forests with little resources. / Gieseke, Fabian; Igel, Christian.
KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Association for Computing Machinery, 2018. p. 1445-1454.
Bibtex
@inproceedings{gieseke2018training,
  title     = {Training big random forests with little resources},
  author    = {Gieseke, Fabian and Igel, Christian},
  booktitle = {KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  publisher = {ACM Association for Computing Machinery},
  year      = {2018},
  pages     = {1445--1454},
  doi       = {10.1145/3219819.3220124},
  isbn      = {9781450355520},
}
RIS
TY - GEN
T1 - Training big random forests with little resources
AU - Gieseke, Fabian
AU - Igel, Christian
PY - 2018
Y1 - 2018
N2 - Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that allows to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees' leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
AB - Without access to large compute clusters, building random forests on large datasets is still a challenging problem. This is, in particular, the case if fully-grown trees are desired. We propose a simple yet effective framework that allows to efficiently construct ensembles of huge trees for hundreds of millions or even billions of training instances using a cheap desktop computer with commodity hardware. The basic idea is to consider a multi-level construction scheme, which builds top trees for small random subsets of the available data and which subsequently distributes all training instances to the top trees' leaves for further processing. While being conceptually simple, the overall efficiency crucially depends on the particular implementation of the different phases. The practical merits of our approach are demonstrated using dense datasets with hundreds of millions of training instances.
KW - Classification
KW - Ensemble methods
KW - Large-scale data analytics
KW - Machine learning
KW - Random forests
KW - Regression trees
UR - http://www.scopus.com/inward/record.url?scp=85051471641&partnerID=8YFLogxK
U2 - 10.1145/3219819.3220124
DO - 10.1145/3219819.3220124
M3 - Article in proceedings
AN - SCOPUS:85051471641
SN - 9781450355520
SP - 1445
EP - 1454
BT - KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PB - ACM Association for Computing Machinery
T2 - 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018
Y2 - 19 August 2018 through 23 August 2018
ER -
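
For readers who want the gist of the method, below is a minimal sketch of the multi-level construction scheme as described in the abstract: grow a shallow top tree on a small random subset, route all training instances to the top tree's leaves, then grow a fully-grown bottom tree per leaf. It uses scikit-learn decision trees; the names build_multilevel_tree, top_subset_size, and top_max_depth are illustrative assumptions, not the authors' implementation, whose efficiency the paper attributes to how each phase is engineered.

# Minimal sketch, assuming scikit-learn trees; not the paper's implementation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_multilevel_tree(X, y, top_subset_size=100_000, top_max_depth=8, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    # Phase 1: grow a shallow "top tree" on a small random subset of the data.
    idx = rng.choice(len(X), size=min(top_subset_size, len(X)), replace=False)
    top = DecisionTreeClassifier(max_depth=top_max_depth)
    top.fit(X[idx], y[idx])
    # Phase 2: distribute ALL training instances to the top tree's leaves.
    leaf_ids = top.apply(X)
    # Phase 3: grow a fully-grown "bottom tree" per leaf on its instances.
    # Every leaf saw at least one subset instance, so every leaf is non-empty here.
    bottom = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        sub = DecisionTreeClassifier()  # no depth limit: fully grown
        sub.fit(X[mask], y[mask])
        bottom[leaf] = sub
    return top, bottom

def predict_multilevel(top, bottom, X):
    # Route each instance through the top tree, then query the bottom tree
    # attached to the leaf it lands in.
    leaf_ids = top.apply(X)
    out = np.empty(len(X), dtype=top.classes_.dtype)
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        out[mask] = bottom[leaf].predict(X[mask])
    return out

A forest would repeat this construction per ensemble member with a fresh random subset; the paper's claimed gains on hundreds of millions of instances depend on an efficient implementation of the distribution phase, which this sketch does not attempt to reproduce.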