Wednesday, February 19, 2025

Undergraduate Upends a 40-Year-Old Data Science Conjecture about hashing and hash tables

Amazing stuff!

"... demonstrated in a January 2025 paper that this new hash table can indeed find elements faster than was considered possible. In so doing, they had disproved a conjecture long held to be true. ...

Hash tables are among the oldest data structures we have. And they’re still one of the most efficient ways to store data.” Yet open questions remain about how they work, he said. “This paper answers a couple of them in surprising ways.” ...

In a 1985 paper(opens a new tab), the computer scientist Andrew Yao, who would go on to win the A.M. Turing Award, asserted that among hash tables with a specific set of properties, the best way to find an individual element or an empty spot is to just go through potential spots randomly — an approach known as uniform probing. He also stated that, in the worst-case scenario, where you’re searching for the last remaining open spot, you can never do better than x. for 40 years, most computer scientists assumed that Yao’s conjecture was true. ...

[Krapivin] explorations with tiny pointers led to a new kind of hash table — one that did not rely on uniform probing. And for this new hash table, the time required for worst-case queries and insertions is proportional to (log x)2 — far faster than x. This result directly contradicted Yao’s conjecture. ... show that (log x)2 is the optimal, unbeatable bound for the popular class of hash tables Yao had written about. ...

In addition to refuting Yao’s conjecture, the new paper also contains what many consider an even more astonishing result. It pertains to a related, though slightly different, situation: In 1985, Yao looked not only at the worst-case times for queries, but also at the average time taken across all possible queries. He proved that hash tables with certain properties — including those that are labeled “greedy,” which means that new elements must be placed in the first available spot — could never achieve an average time better than log x.

... wanted to see if that same limit also applied to non-greedy hash tables. They showed that it did not by providing a counterexample, a non-greedy hash table with an average query time that’s much, much better than log x. In fact, it doesn’t depend on x at all. “You get a number,” ... “something that is just a constant and doesn’t depend on how full the hash table is.” The fact that you can achieve a constant average query time, regardless of the hash table’s fullness, was wholly unexpected — even to the authors themselves. ..."

From the abstract:
"In this paper, we revisit one of the simplest problems in data structures: the task of inserting elements into an open-addressed hash table so that elements can later be retrieved with as few probes as possible. We show that, even without reordering elements over time, it is possible to construct a hash table that achieves far better expected search complexities (both amortized and worst-case) than were previously thought possible. Along the way, we disprove the central conjecture left by Yao in his seminal paper ``Uniform Hashing is Optimal''. All of our results come with matching lower bounds."

Caveat: I did not read the paper.

Undergraduate Upends a 40-Year-Old Data Science Conjecture | Quanta Magazine "A young computer scientist and two colleagues show that searches within data structures called hash tables can be much faster than previously deemed possible."

No comments: