Sublinear Algorithms, Online Packing, Hashing and Clustering

Research output: Book/ReportPh.D. thesisResearch

Standard

Sublinear Algorithms, Online Packing, Hashing and Clustering. / Beretta, Lorenzo; Thorup, Mikkel (Supervisor); Abrahamsen, Mikkel (Supervisor).

Department of Computer Science, Faculty of Science, University of Copenhagen, 2024. 240 p.

Research output: Book/ReportPh.D. thesisResearch

Harvard

Beretta, L, Thorup, M & Abrahamsen, M 2024, Sublinear Algorithms, Online Packing, Hashing and Clustering. Department of Computer Science, Faculty of Science, University of Copenhagen.

APA

Beretta, L., Thorup, M., & Abrahamsen, M. (2024). Sublinear Algorithms, Online Packing, Hashing and Clustering. Department of Computer Science, Faculty of Science, University of Copenhagen.

Vancouver

Beretta L, Thorup M, Abrahamsen M. Sublinear Algorithms, Online Packing, Hashing and Clustering. Department of Computer Science, Faculty of Science, University of Copenhagen, 2024. 240 p.

Author

Beretta, Lorenzo ; Thorup, Mikkel ; Abrahamsen, Mikkel. / Sublinear Algorithms, Online Packing, Hashing and Clustering. Department of Computer Science, Faculty of Science, University of Copenhagen, 2024. 240 p.

Bibtex

@phdthesis{f9e93cfb18e84644b31fe5df994d7835,
title = "Sublinear Algorithms, Online Packing, Hashing and Clustering",
abstract = "In this thesis we study the theoretical properties of several algorithmic problems. The first two problems aim at completing a certain task in sublinear time, namely using a number of operation that is sublinear in the input size. A short description of these tasks follows.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Sum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleThen, we study two problems concerned with online packing of polygons. In both cases, we give upper and lower bounds on the competitive ratio of these problems, described below.Online packing of rectangles: We are given a sequence of axis-parallel rectangles online and we have to irrevocably place them on the plane so as to minimize the area or perimeter of their axis-parallel bounding box.Translational packing of convex polygons: We are given a sequence of convex polygons online and we have to irrevocably place them in a given container on the plane without rotating them. The goal is to use as little container space as possible. We prove a surprising superconstant lower bound on the competitive ratio for several well-studied container types.Besides packing, we study hash functions.Locally uniform hashing: We design Tornado tabulation, a new tabulation-based hash function that aims at filling the gap between (unrealistic) fully-random hashing often used in the analysis of algorithms and practical hash functions with weak theoretical guarantees. In theory, Tornado has strong distributional properties, ensuring that its keys enjoy a certain local full randomness. Moreover, Tornado is practical and it is implementable in a few lines of C code.Finally, we include a result on clustering.Local-search seeding for k-means: The most popular heuristic for k-means clustering, Lloyd{\textquoteright}s algorithm, inherits its theoretical guarantees from the seeding strategy employed to initialize cluster centers. We design a local-search algorithm to initialize cluster centers that improves the approximation ratios of previous seeding strategies. We claim that our algorithm is practical and include experiments to support its effectiveness.Locally uniform hashing: We design Tornado tabulation, a new tabulation-based hash function that aims at filling the gap between (unrealistic) fully-random hashing often used in the analysis of algorithms and practical hash functions with weak theoretical guarantees. In theory, Tornado has strong distributional properties, ensuring that its keys enjoy a certain local full randomness. Moreover, Tornado is practical and it is implementable in a few lines of C code.Sum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleEarth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover{\textquoteright}s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.",
author = "Lorenzo Beretta and Mikkel Thorup and Mikkel Abrahamsen",
year = "2024",
language = "English",
publisher = "Department of Computer Science, Faculty of Science, University of Copenhagen",

}

RIS

TY - BOOK

T1 - Sublinear Algorithms, Online Packing, Hashing and Clustering

AU - Beretta, Lorenzo

A2 - Thorup, Mikkel

A2 - Abrahamsen, Mikkel

PY - 2024

Y1 - 2024

N2 - In this thesis we study the theoretical properties of several algorithmic problems. The first two problems aim at completing a certain task in sublinear time, namely using a number of operation that is sublinear in the input size. A short description of these tasks follows.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Sum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleThen, we study two problems concerned with online packing of polygons. In both cases, we give upper and lower bounds on the competitive ratio of these problems, described below.Online packing of rectangles: We are given a sequence of axis-parallel rectangles online and we have to irrevocably place them on the plane so as to minimize the area or perimeter of their axis-parallel bounding box.Translational packing of convex polygons: We are given a sequence of convex polygons online and we have to irrevocably place them in a given container on the plane without rotating them. The goal is to use as little container space as possible. We prove a surprising superconstant lower bound on the competitive ratio for several well-studied container types.Besides packing, we study hash functions.Locally uniform hashing: We design Tornado tabulation, a new tabulation-based hash function that aims at filling the gap between (unrealistic) fully-random hashing often used in the analysis of algorithms and practical hash functions with weak theoretical guarantees. In theory, Tornado has strong distributional properties, ensuring that its keys enjoy a certain local full randomness. Moreover, Tornado is practical and it is implementable in a few lines of C code.Finally, we include a result on clustering.Local-search seeding for k-means: The most popular heuristic for k-means clustering, Lloyd’s algorithm, inherits its theoretical guarantees from the seeding strategy employed to initialize cluster centers. We design a local-search algorithm to initialize cluster centers that improves the approximation ratios of previous seeding strategies. We claim that our algorithm is practical and include experiments to support its effectiveness.Locally uniform hashing: We design Tornado tabulation, a new tabulation-based hash function that aims at filling the gap between (unrealistic) fully-random hashing often used in the analysis of algorithms and practical hash functions with weak theoretical guarantees. In theory, Tornado has strong distributional properties, ensuring that its keys enjoy a certain local full randomness. Moreover, Tornado is practical and it is implementable in a few lines of C code.Sum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleEarth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.

AB - In this thesis we study the theoretical properties of several algorithmic problems. The first two problems aim at completing a certain task in sublinear time, namely using a number of operation that is sublinear in the input size. A short description of these tasks follows.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Sum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleThen, we study two problems concerned with online packing of polygons. In both cases, we give upper and lower bounds on the competitive ratio of these problems, described below.Online packing of rectangles: We are given a sequence of axis-parallel rectangles online and we have to irrevocably place them on the plane so as to minimize the area or perimeter of their axis-parallel bounding box.Translational packing of convex polygons: We are given a sequence of convex polygons online and we have to irrevocably place them in a given container on the plane without rotating them. The goal is to use as little container space as possible. We prove a surprising superconstant lower bound on the competitive ratio for several well-studied container types.Besides packing, we study hash functions.Locally uniform hashing: We design Tornado tabulation, a new tabulation-based hash function that aims at filling the gap between (unrealistic) fully-random hashing often used in the analysis of algorithms and practical hash functions with weak theoretical guarantees. In theory, Tornado has strong distributional properties, ensuring that its keys enjoy a certain local full randomness. Moreover, Tornado is practical and it is implementable in a few lines of C code.Finally, we include a result on clustering.Local-search seeding for k-means: The most popular heuristic for k-means clustering, Lloyd’s algorithm, inherits its theoretical guarantees from the seeding strategy employed to initialize cluster centers. We design a local-search algorithm to initialize cluster centers that improves the approximation ratios of previous seeding strategies. We claim that our algorithm is practical and include experiments to support its effectiveness.Locally uniform hashing: We design Tornado tabulation, a new tabulation-based hash function that aims at filling the gap between (unrealistic) fully-random hashing often used in the analysis of algorithms and practical hash functions with weak theoretical guarantees. In theory, Tornado has strong distributional properties, ensuring that its keys enjoy a certain local full randomness. Moreover, Tornado is practical and it is implementable in a few lines of C code.Sum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleSum estimation via weighted sampling: We are given a set U of items and each item u ∈ U has a weight w(u). We can sample elements from U with probability proportional to their weight. Our task is to estimate the combined weight P u∈U w(u) using as few samples as possibleEarth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.Earth Mover’s Distance: We are given two distribution μ, ν over a generic metric space (M, d). A transport plan between μ and ν is a distribution ξ over M2 that has μ and ν as its marginals. We define the cost of a transport plan ξ as E(x,y)∼ξ [d(x, y)]. Our task is to compute the minimum cost of transport plans between μ and ν.

M3 - Ph.D. thesis

BT - Sublinear Algorithms, Online Packing, Hashing and Clustering

PB - Department of Computer Science, Faculty of Science, University of Copenhagen

ER -

ID: 399069255