Cache-conscious String Data Structures, Sorting and Algorithms: theory and practice
String Data Structures Implementations in C
Large-scale Database Sorting
Scalable Genome Pattern Recognition
Email:
W elcome to my homepage. My name is Dr. Nikolas Askitis and I am an
ex-postdoc researcher in computer science, specializing in the discipline of high-performance algorithms and string
data structures. My motivation for completing a PhD on data structures was to show programmers
and researchers how to implement basic string data structures in C/C++, using simple techniques that could dramatically boost performance while simultaneously reducing space.
First things first, however, were my publications. In my opinion, there is not much value in developing or applying new concepts if they can not be published into well-established refereed
Journals, such as VLDB. Why is that? Well, ultimately, you would want your techniques to be known and applied, and the best way to
do that would be to publish it to places where people go to acquire knowledge that has been refereed by scientists ... this process alone took me over 5 years.
So now that this is all behind me, I am finally in the position to share with you my software and knowledge of cache-conscious string
data structures and algorithms. On that note, I would like to thank all of you who contacted me and supported me
throughout these years. I really appreciate it and I thank you for your interest (and patience) with my work.
Enjoy! Please tell others about my website and feel free to contact me if you have any questions regarding my work.
Below you will find my publications and implementations of the cache-conscious data structures that I developed for my PhD.
In addition, I also provide my implementation of the copy-based burst sort
algorithm, which is currently one of the fastest solutions available for sorting strings. And for those of you who are interested in RepMaestro , you can download the
software as well as my published Oxford Bioinformatics paper --- check out my publications list below.
PhD Thesis: Efficient Data Structures For Cache Architectures
If you want to learn more about cache-efficient string data structures, you should check out my PhD thesis as it contains a wealth of information that can help you
gain a much better understanding on how these data structures work, what other researchers have published, and what to
expect out of performance. You download my PhD thesis here . Note: you may require the latest Adobe Acrobat Reader.
OzSort 2.0 --- Fast External Sorting
Want to learn how to sort over 252GB of data for a Penny? Easy!
Follow these steps:
Have a read of my refereed Industry paper on external sorting here . This should give you a good understanding on how
the algorithm works and the basic outline of hardware and performance expectations ... not to mention a few tips and tricks on fine-tuning both software and hardware.
Study my 2009 and 2010 PennySort winning submissions to see first-hand how the algorithm is implemented. You can find the non-commerical
version of the source here .
Then try to write your own ... and you will be well on your way to besting the best in external sorting :)
HAT-trie
Interested in learning how to efficiently implement the HAT-trie in C? I now offer my original
implementation which you can download below. Please note, however, that my software is for non-commercial use only. This means that my code (in whole or
any part of it) can not be used in a commercial (profit/non-profit) environment, product or service, without my written
permission. This is important. See the statement
shown at the top of my code for more details. You can download the HAT-trie here .
Please read my papers on the HAT-trie and also study my code; and feel free to ask any questions you may have regarding my work.
Once understood, try beating it! It's possible :)
Some ground rules:
Your data structure must support null-terminated variable-length strings. No fancy string-terminators like '0xff' instead of '\0', which can lead to faster string comparisons.
The length of the string can not be pre-calculated prior to using the data structure / its interfaces or API.
No HugeTLB, custom Kernels, or over-clocked hardware.
An improvement in performance must not come at a high cost in space! ... both speed and space must be accounted for ... and do not forget to take the operating system overheads into account.
There is no point in attaining speed if the solution is not scalable ... a scalable solution is a practical one!
Have fun!
Array hash table
Interested in learning how to efficiently implement the array hash table in C? My original
implementation can be download below. Please note, however, that my software is for non-commercial use only. This means that my code (in whole or
any part of it) can not be used in a commercial (profit/non-profit) environment, product or service, without my written
permission. This is important. See the statement
shown at the top of my code for more details. You can download the array hash here here .
Please read my papers on the array hash table and also study my code; and feel free to ask any questions you may have regarding my work.
Once understood, try beating it! ... the array hash, however, might be a little tougher to beat than the HAT-trie, lol :)
Same ground rules as before:
Your data structure must support null-terminated variable-length strings. No fancy string-terminators like '0xff' instead of '\0', which can lead to faster string comparisons.
The length of the string can not be pre-calculated prior to using the data structure / its interfaces or API.
No HugeTLB, custom Kernels, or over-clocked hardware.
An improvement in performance must not come at a high cost in space! ... both speed and space must be accounted for ... and do not forget to take the operating system overheads into account.
There is no point in attaining speed if the solution is not scalable ... a scalable solution is a practical one!
Have fun!
The B-tree and B-trie
Learn how to efficiently implement the B-tree in C along with the newer alternative, the B-trie. These data structures
are disk-resident, so they are not as fast as the in-memory variants but they can scale to many hundreds of gigabytes in size. My original
implementations can now be download below. Please note, however, that my software is for non-commercial use only. This means that my code (in whole or
any part of it) can not be used in a commercial (profit/non-profit) environment, product or service, without my written
permission. This is important. See the statement
shown at the top of my code for more details.
Please read my paper on the B-trie to better your understanding on disk-resident data structures and also study my code; and feel free to ask any questions you may have regarding my work. You can download the disk-based data structures here .
Enjoy!
Complete set of data structures
Here is everything I have implemented over the years for my research on cache-efficient string data structures.
Please note, however, that my software is for non-commercial use only. This means that my code (in whole or
any part of it) can not be used in a commercial (profit/non-profit) environment, product or service, without my written
permission. This is important. Having said that, enjoy and feel free to ask any questions you may have regarding my work.
You can download the list of data structures here .
Array hash table (see Chapter 3 and 4 of my PhD thesis).
Standard-chain hash table
Array burst trie
Standard-chain burst trie
HAT-trie (see Chapter 5 of my PhD thesis).
Array binary search tree (BST)
Standard-chain binary search tree (BST)
Standard-chain splay tree
Standard-chain red-black tree
Standard-chain Ternary Search Trie (TST)
Buffered B-tree
Buffered B-trie (see Chapter 6 of my PhD thesis).
Array hash table for 32-bit Integer keys
Please show your support by checking out some of the other links shown on this site. Thank you.
Public feedback :)
I'm thrilled that my research in data structures has inspired others to investigate my claims and also implement
my data structures in other languages, such as C++ and Java. Keep up the good work guys! Here are just some of the links,
and if you know of other links/papers let me know and I'll add them here ;)
Some examples of email feedback I've received (sender details have been filtered) .
1.
Subject: Thanks for the 'B-tries for Disk-based String Management' paper it was instrumental in the success of my project
Sender Mike***
Hi Dr Askitis,
I used your ' B-tries for Disk-based String Management' paper as the basis for a Java in memory
Burst-Trie (although it uses the hybrid and pure splitting described in your paper).
It is so much faster and efficient than the compressed patricia trie and ternary search tree's that I
implemented at first. My primary job is to write the software used to conduct a large scale (5% sample)
travel survey conducted over the telephone and online in Southern Ontario, Canada.
We have a database with 50,000+ street names and municipalities and we use the Burst Trie to index them
for fast retrieval. At first I wrote it for use in the web survey to remove any need to hit the db
(and associated 200ms delay) when returning matches but now we use it in our call center
application because of its speed.
We also support any-string matching (although not ordered) by using the speed of the
trie to store all the suffixes of indexed strings.
I was planning on writing a report documenting our use of the Burst-Trie and crediting you and Justin Zobel ...
You will both be credited in the Conduct of the Survey report but that probably won't be written until 2013 or 2014 ...
Thanks for your highly understandable paper, it was enormously useful when writing our Java Burst Trie implementation.
Mike***
2.
Subject: Your work is great
Sender Alex***
Hey there Nikolas,
My name is Alex***, I'm really excited about your work on
the cache-conscious data structures -- this is awesome. Efficient
data structures is my passion too.
It'd be absolutely great if you could send to me the paper and the
sources for the following topics:
- Cache-conscious Collision Resolution in String Hash Tables
- HAT-trie: A Cache-conscious Trie-based Data Structure for Strings
- B-tries for Disk-based String Management
... oh... you know I just have realized that I crave to consume _all_
your papers and sources... If you would kindly agree to pass those
magnificent pieces to me that'd be really appreciated.
I plan to build either a python implementation (using cython) or a python
wrapper for the C implementation (using ctypes) of at least the HAT-tries.
Python lacks a good space/time efficient ordered dictionary.
...
-- Alex***
3.
Subject: Very interested in your work
Sender: Kris***
Hi Nikolas, I am very intrigued by your research.
Would it be possible to get access to the fundamental data structures that you have been discussing in your papers?
--Kris***
4.
Subject: Repmaestro
Sender: Nih***
Hi,
I read and enjoyed your recent work; RepMaestro. Can you send me the link to download the software please?
Best,
Real-world string data sets
These are real-world variable-length string datasets, ideal (if not essential) for evaluating the practical performance of string data structures such as the
HAT-trie and the Judy trie. The approximate download size of each dataset is shown. The datasets were compressed using bzip2 (you can uncompress them in Linux with the
command bunzip2 ). While downloading these datasets, please consider showing your support for my research. Thank you.
Copy-based burst sort algorithm in ANSI C
The copy-based burst sort algorithm is one of the fastest in-memory
sorting algorithms for variable-length strings, as detailed in
the computing literature. Nonetheless, the original implementation
of the copy-based burst sort by its authors is not readily available.
As such, I have written my own implementation of the copy-based burst sort in C,
using my knowledge of cache-conscious string data structures.
I am not sure if my implementation is as fast or as memory-efficient as the
original used by the authors, but given the results below, I am confident that it can serve
as a good baseline for your experiments :)
My implementation of the copy-based burst sort algorithm is a little different from how its
explained in literature. First, the strings that are stored
in containers are length-encoded --- the original algorithm simply stored
null-terminated strings. Second, containers are not sized to fit into the L2 cache of a typical CPU,
as is the intention of the burst sort algorithm. Instead, much
like my implementation of the array burst trie and HAT-trie, containers are
burst once they store more than a given number of strings, which, in my opinion, is a simpler and
more optimal approach. The table below
shows preliminary results of time (in seconds) and memory (in megabytes) required to sort the DISTINCT
dataset provided above (28772169 unique strings),
compared to the standard Linux sort command. The machine used to conduct the experiments was the
AMD server described above. The source is available below.
Sorting algorithm (64-bit)
Container size
Actual process size (MB)
Wall time to sort (sec)
Copy-based burst sort
64
~520.9
19.3
Copy-based burst sort
128
~382.1
20.2
Copy-based burst sort
256
~290.6
23.7
Linux Sort
---
~682.9
129.8
64-bit Copy-based burst sort for strings:
My implementation of the copy based
burst sort algorithm (for non-commercial use) is provided below for your education. Enjoy!
You can also access the binary for Linux x86_64 platforms here . (Usage: ./copy-based-burstsort [container-size] [number-of-file-fragments-to-sort] [file1] [file2] ... )
Some important notes: The code below handles ASCII-7 printable characters only and supports variable-length null-terminated
strings, up to 32768 characters (including the null character). If you wish to compile my source code, the command is:
gcc -O3 -fomit-frame-pointer -w -DPAGING -o nikolas_askitis_copybased_burst_sort_paging nikolas_askitis_copybased_burst_sort.c sort_module.o common_burstsort.c
Utility files: common_burstsort.c , sort_module.h , sort_module.o , sort_module.c and include/common_burstsort.h
How do they stack up on more recent hardware?
Lets consider the performance of my data structures on a more modern computing architecture, namely an AMD II X4 965 3.2Ghz Processor installed on a high-performance
Asus motherboard with 8GB of DDR2-800 dual-channel 5-5-5-15 RAM, running Linux Kubuntu 10.01 64-bit (Kernel 2.6.35-22-generic SMP x86_64). The data structures were
first constructed using 28772169 (304MB) unique variable-length null-terminated strings, provided above. The time and space required to construct
was captured and shown in the Table below. We then measured the time required to search the data structures using
over 177 million (1.1GB) variable-length null-terminated strings (Skew Set 1, provided above). These strings are of a
highly-skew distribution (as is typically observed in practice). Notice, we report the actual process size of these data structures,
which is important, as this captures any space overheads and memory fragmentation caused by the allocation and frequent resizing of
dynamic arrays. I have compared the performance of my data structures against the Judy trie (version 1.0.5). The results
below we captured without HugeTLB.
Data structure (64-bit)
Actual process size (MB)
Estimated memory (MB)
Time to insert (sec)
Time to search (sec)
Judy trie SL (v1.0.5)
1078.9
888.6
19.8
34.1
HAT-trie (param: 512 slots / 16384 bucket lim)
476.3
450.6
14.3
19.3
Array burst trie (param: 256 bucket lim)
256.4
235.1
28.1
25.4
Array hash table (524288 slots)
330.1
317.6
18.1
10.6
Publications
Doctoral Citations:
Dr Askitis has developed dramatic improvements to some of the most fundamental string data structures used in computation by redesigning them for use on current computer architectures.
His new string algorithms are much faster and require much less space than their predecessors, some of which have been the subject of research for over fifty years. These improvements have the potential to greatly
improve the efficiency of a wide range of computing applications including databases, web search and data processing.
N. Askitis and J. Zobel, Redesigning the String Hash Table, Burst Trie and BST to Exploit Cache
Published to the prestigious ACM Journal of Experimental Algorithmics (ACM JEA), Vol. 15, Article 1.7, February 2011.
N. Askitis, OzSort 2.0: Sorting up to 252GB for a Penny
Winner of the 2010 International PennySort contest hosted by Microsoft & HP, and the first to solo it =)
The entry was refereed and published on the International Sort Benchmark
HomePage May 2010 (medal awarded at ACM SIGMOD 2010).
N. Askitis and R. Sinha, Engineering Scalable, Cache and Space Efficient Tries for Strings
Published to the prestigious International VLDB Journal , October 2010, Volume 19, Number 5, 2010, Pg 633-660.
N. Askitis and J. Zobel, B-tries for Disk-based String Management
Published to the prestigious International VLDB Journal , January 2009, Volume 18, Number 1, Pg 157-179.
N. Askitis, Fast and Compact Hash Tables for Integer Keys
In proceedings of the 32nd Australasian Computer Science Conference, Wellington, New Zealand, January 2009. Vol 91. Pg 101-110.
N. Askitis and R. Sinha, OzSort: Sorting over 246GB for a Penny
Winner of the 2009 International PennySort contest hosted by Microsoft & HP. Entry was refereed and published on the International Sort Benchmark HomePage May 2009. We were the first Australian team to win the PennySort competition.
R. Sinha and N. Askitis, OzSort: Sorting 100GB for less than 87kJoules
Winner of the 2009 International JouleSort contest hosted by Microsoft & HP. Entry was refereed and published on the International Sort Benchmark HomePage
May 2009. We were the first Australian team to win the JouleSort competition, and the first to demonstrate how budget desktop components could be used to deliver
energy-efficient sorting.
N. Askitis and R. Sinha, HAT-trie: A Cache-conscious Trie-based Data Structure for Strings
In proceedings of the 30th Australasian Computer Science Conference, Ballarat, Australia, January 2007, Volume 62, Pg 97-105.
N. Askitis and J. Zobel, Cache-conscious Collision Resolution in String Hash Tables
In proceedings of the 12th International String Processing and Information Retrieval (SPIRE) Conference, Buenos Aires, Argentina, November 2005, LNCS Vol. 3772, Pg 91-102.
N. Askitis and R. Sinha, RepMaestro: Scalable Repeat Detection on Disk-based Genome Sequences
[ Download Software ]
Published to the prestigious Bioinformatics Oxford Journal , October 2010, Volume 26, Number 19, Pg 2368-2374.
The software and results of this paper have been confirmed by the Oxford Bioinformatics committee.
Your Support
Thank you for visiting.
Please show your support by telling others of my website and by checking out some of the other links shown. A small donation to help me cover the cost of maintaining this site is also welcomed and appreciated :-)
Yours Sincerely,
Dr. Nikolas Askitis.
<a href="http://www.seoheap.com/whois" _wpro_href="http://www.seoheap.com/whois" target="_blank">Whois Site</a>