Develop high performance string data structures

The best place to learn how to code fast, compact and cache-conscious dynamic string data structures and sorting algorithms in C, C++ and Java

Learn more »   Consultation / development service




Brief intro.

Hello and welcome to my homepage. My name is Dr. Nikolas Askitis, and I have studied and advanced the discipline of practical, dynamic string data structures in Computer Science. A data structure is simply a piece of software that provides a service to store, retrieve and delete words and numbers on a computer. It is dynamic because you can store and delete any number of words. You can think of a data structure as an empty dictionary into which you can write millions of words (including their definitions), and then later search for words to determine whether they exist in the dictionary, to access their definitions, or to remove them from the dictionary.

Image of the HAT-trie data structure populated with strings

A data structure is absolutely critical to the success of many software products and solutions, including the web browser you are currently viewing this website with. As a result, it is of paramount importance that we know how to develop these data structures efficiently, to make the best use of our current computers. That is, we need to know how to develop cache-conscious, or cache-efficient, data structures. Before my research was published, there was virtually no detailed information available to students and developers that provided clear instructions on how to develop these types of data structures on modern computers, or on what to expect in terms of performance. It sounds unbelievable, but it was true!
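
To see concretely why cache-consciousness matters, consider the toy C program below (an illustrative sketch of mine, not code from my thesis). It sums ten million integers twice: once from a contiguous array, and once by chasing pointers through a randomly ordered linked list. Both loops do the same arithmetic; only the memory layout differs, yet the array traversal is typically several times faster because it keeps the CPU cache and hardware prefetcher happy.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000

    typedef struct node { long value; struct node *next; } node;

    int main(void)
    {
        int  *order = malloc(N * sizeof *order);
        long *arr   = malloc(N * sizeof *arr);
        node *pool  = malloc(N * sizeof *pool);
        long long sum = 0;

        for (int i = 0; i < N; i++) { arr[i] = i; order[i] = i; }

        /* shuffle the visiting order so each list hop is a likely cache miss */
        srand(42);
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1), t = order[i];
            order[i] = order[j]; order[j] = t;
        }
        for (int i = 0; i < N; i++) {
            pool[order[i]].value = order[i];
            pool[order[i]].next  = (i + 1 < N) ? &pool[order[i + 1]] : NULL;
        }

        clock_t t0 = clock();
        for (int i = 0; i < N; i++) sum += arr[i];       /* contiguous: prefetch-friendly */
        clock_t t1 = clock();
        for (node *p = &pool[order[0]]; p; p = p->next)  /* scattered: cache-hostile */
            sum += p->value;
        clock_t t2 = clock();

        printf("array: %.2fs   list: %.2fs   (checksum %lld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        return 0;
    }

That difference in layout, applied to dictionaries of strings rather than lists of integers, is exactly what cache-conscious data structure design exploits.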

My motivation for completing a PhD in data structures was thus to provide this information: to teach you how to actually code cache-conscious data structures (in C, C++, Java and other high-level programming languages). Sounds interesting so far? Keep reading to learn more, as the results of my research have attracted a lot of interest world-wide, particularly my cache-efficient implementations of burst sort and my external genome pattern-recognition software.




Have your data structures developed by me
Software consultation and development service



Using an inefficient data structure can be a huge risk for your business, especially when it comes time to scale to meet client demand.

Why not let me develop your data structure for you?

I know how to choose the right data structure for your business, and I know how to develop it to maximize performance while using the least amount of space.

Take the first step and contact me now.


 




Recent clients




Refereed PhD thesis: Efficient Data Structures for Cache Architectures.

Do you want to learn practical skills for coding fast and space-efficient data structures (both in-memory and on disk) in popular programming languages such as C, C++, C# and Java?

You've come to the right place.

In just a few minutes, you can learn the basics of coding cache-efficient data structures and apply them to your work. My PhD thesis offers a wealth of information that is easy to read, providing step-by-step instructions on how to code large-scale, cache-conscious data structures efficiently in real-world applications, and what to expect in terms of performance.

Since 2007, my thesis has been downloaded and read by thousands of people world-wide, and I have assisted hundreds of individuals in implementing high-performance data structures and algorithms.

I have done all of the hard work for you, so all you need to do is read my thesis and apply the basic programming techniques shown to your work.

Click the button below to get immediate access to my PhD thesis for just $10 and begin your online education. It's that easy.

Note: Please wait until PayPal returns you back to this site after payment to complete the order. Thank you.








Data structure source code in C / C++


The cache-conscious hash table, cache-conscious burst trie, the B-trie and the HAT-trie

After reading my PhD thesis, advance your education further by downloading a copy of my (non-commercial) source code. Included is the original refereed source code implemented for my PhD, namely the array burst trie, array binary search tree, array hash table, the HAT-trie, the B-trie and the B-tree, along with a set of high-quality benchmark data structures. The source code has been fine-tuned over many years and is thus of high quality (performance-wise). You will not find better implementations of these data structures that can match the performance outlined in my PhD thesis!

With my source code at hand, you can quickly learn how to implement cache-conscious string data structures by example. Importantly, you can run the experiments yourself and witness the gains in performance that are attainable by applying the techniques outlined. Not many authors are willing to back up their claims with source code, so download your copy now (while it's still available) to complete your online education on efficient data structures. It's that easy.



Note: Please wait until PayPal returns you back to this site after payment to complete the order. Thank you.









Burst Sort in C / C++


The cache-conscious copy-based burst sort algorithm

The copy-based burst sort algorithm is the fastest in-memory sorting algorithm for variable-length strings, as detailed in the computing literature. Nonetheless, its original implementation is not available online, which makes it difficult to learn how to code the sorting algorithm efficiently. I have therefore written my own implementation of copy-based burst sort in C/C++, using my expertise in cache-conscious string data structures. The table below highlights the performance of my implementation of burst sort relative to the standard and well-known Linux sort command. The task was to sort the DISTINCT dataset (available for download below), which contains about 28 million unique variable-length strings.


Sorting algorithm (64-bit)     Memory usage (MB)       Time to sort (sec)
Cache-efficient burst sort     290.6  (57% smaller)    23.7  (82% faster)
Linux sort                     682.9                   129.8

If you want to learn how to develop cache-efficient sorting algorithms yourself and reproduce the results shown above in your software projects, then look no further: my (non-commercial) implementation of burst sort is the best available online. Download your copy now (while it is still available) to complete your online education. It's that easy.

Note: Please wait until PayPal returns you back to this site after payment to complete the order. Thank you.









Award-winning External Merge Sort ( Ozsort 2.0 )


Ozsort 2.0 written in C / C++ for Linux 64-bit platforms

Here is a rare opportunity to learn how to implement a high-performance external merge sort algorithm that works efficiently on both high-end and budget computers. Unlike burst sort, which only works in memory (RAM), external merge sort allows you to sort millions or even billions of strings efficiently on disk (i.e., strings or database records stored in a file on your hard drive).

Image of medals awarded for OzSort 2.0

My implementation of the external merge sort algorithm is called OzSort, and it has won international awards and recognition from the SortBenchmark programming committee.

If you want to learn how to code a fast and energy-efficient external sorting algorithm in C, C++, C# or Java, there is no better place on the web than here. You will have the opportunity to learn first-hand and at your own pace, simply by studying the source code of Ozsort 2.0 -- a refereed and accredited software solution.

There are no other implementations available online that are as well documented, tested and accredited as OzSort. Look no further. Download your non-commercial copy of OzSort now and start your online education. It's that easy.

Note: Please wait until PayPal returns you back to this site after payment to complete the order. Thank you.




OzSort 2.0 --- Seriously Fast and Energy-efficient External Sorting

Want to learn how to sort over 252GB of data for a Penny, and to do it in a way that is more eco-friendly?   Easy!

Follow these steps:

  1. Have a read of my refereed industry paper on external sorting here. This should give you a good understanding of how the algorithm works and a basic outline of hardware and performance expectations ... not to mention a few tips and tricks on fine-tuning both software and hardware.
  2. Study my 2009 and 2010 PennySort and JouleSort winning submissions to see first-hand how the algorithm is implemented. You can request a free non-commercial version of the source (written in C for Linux) by contacting me.
  3. Then try to write your own ... and you will be well on your way to besting the best in external sorting :) A minimal sketch to get you started is shown below.
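
To help with step 3, here is a minimal two-pass external merge sort sketch in C. It is my own illustrative simplification, not the OzSort source: it sorts lines of text, caps the number of runs at 64, and merges with a linear scan of the run heads (a heap scales better to many runs).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define RUN_LINES 100000   /* lines per in-memory run; tune to your RAM */
    #define MAX_LINE  1024
    #define MAX_RUNS  64       /* simplification: at most 64 runs */

    static int cmpline(const void *a, const void *b)
    {
        return strcmp(*(char * const *)a, *(char * const *)b);
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s input output\n", argv[0]); return 1; }
        FILE *in = fopen(argv[1], "r");
        if (!in) { perror(argv[1]); return 1; }

        /* pass 1: read a chunk, sort it in memory, write it out as a run */
        char **lines = malloc(RUN_LINES * sizeof *lines);
        char buf[MAX_LINE], name[32];
        int nruns = 0, done = 0;
        while (!done && nruns < MAX_RUNS) {
            int n = 0;
            while (n < RUN_LINES && fgets(buf, sizeof buf, in))
                lines[n++] = strdup(buf);
            if (n < RUN_LINES) done = 1;   /* reached end of input */
            if (n == 0) break;
            qsort(lines, n, sizeof *lines, cmpline);
            snprintf(name, sizeof name, "run%d.tmp", nruns);
            FILE *r = fopen(name, "w");
            for (int i = 0; i < n; i++) { fputs(lines[i], r); free(lines[i]); }
            fclose(r);
            nruns++;
        }
        fclose(in);

        /* pass 2: k-way merge -- repeatedly emit the smallest run head */
        FILE *runs[MAX_RUNS], *out = fopen(argv[2], "w");
        char head[MAX_RUNS][MAX_LINE];
        int live[MAX_RUNS];
        for (int i = 0; i < nruns; i++) {
            snprintf(name, sizeof name, "run%d.tmp", i);
            runs[i] = fopen(name, "r");
            live[i] = fgets(head[i], MAX_LINE, runs[i]) != NULL;
        }
        for (;;) {
            int best = -1;
            for (int i = 0; i < nruns; i++)
                if (live[i] && (best < 0 || strcmp(head[i], head[best]) < 0))
                    best = i;
            if (best < 0) break;
            fputs(head[best], out);
            live[best] = fgets(head[best], MAX_LINE, runs[best]) != NULL;
        }
        fclose(out);
        for (int i = 0; i < nruns; i++) fclose(runs[i]);
        return 0;
    }

This is only the skeleton; see the paper above for the engineering that makes a real competition entry fast.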

HAT-trie

Interested in learning how to efficiently implement the HAT-trie in C?   Have a read of my papers on the HAT-trie and also study my code, and feel free to ask any questions you may have regarding my work. Once you understand it, try beating it!   It's possible :) A layout sketch is given after the ground rules below.

Some ground rules:
  • Your data structure must support null-terminated variable-length strings. No fancy string-terminators like '0xff' instead of '\0', which can result in faster string comparisons.
  • The length of the string cannot be pre-calculated prior to using the data structure.
  • No HugeTLB, custom Kernels, or over-clocked hardware.
  • An improvement in performance must not come at a high cost in space; both speed and space must be accounted for ... and do not forget to take the operating system overheads into account.  
  • There is no point in attaining speed if the solution is not scalable ... a scalable solution is a practical one! 
  • Have fun!
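
To get you started, here is a rough C layout sketch of a HAT-trie node (my simplification for illustration only; the real structure also distinguishes pure from hybrid buckets and hashes strings within each bucket):

    #include <stddef.h>
    #include <stdlib.h>

    #define FANOUT 256          /* one slot per possible leading byte */
    #define BURST_LIMIT 16384   /* e.g. the bucket limit used in the table above */

    typedef struct {
        char  *strings;         /* suffixes stored back-to-back in one buffer */
        size_t used, cap;       /* bytes used / allocated */
        size_t count;           /* number of suffixes held */
    } Bucket;

    typedef struct HatNode HatNode;
    struct HatNode {
        unsigned char is_bucket[FANOUT];          /* what does child[c] hold? */
        union { HatNode *node; Bucket *bucket; } child[FANOUT];
    };

    /* When child[c] is a bucket and bucket->count exceeds BURST_LIMIT,
     * "burst" it: allocate a fresh HatNode, re-insert every suffix of the
     * bucket under the new node keyed by its next character, then replace
     * the bucket with the node.  Access paths stay short, and each bucket
     * stays small enough to scan cache-efficiently. */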

Array hash table

Interested in learning how to efficiently implement the array hash table in C? Have a read of my papers on the array hash table and also study my code, and feel free to ask any questions you may have regarding my work. Once you understand it, try beating it!  ... the array hash, however, might be a little tougher to beat than the HAT-trie, lol :) A runnable sketch of the core idea is given after the rules below.

Same ground rules as before:
  • Your data structure must support null-terminated variable-length strings. No fancy string-terminators like '0xff' instead of '\0', which can result in faster string comparisons.
  • The length of the string cannot be pre-calculated prior to using the data structure.
  • No HugeTLB, custom Kernels, or over-clocked hardware.
  • An improvement in performance must not come at a high cost in space; both speed and space must be accounted for ... and do not forget to take the operating system overheads into account.  
  • There is no point in attaining speed if the solution is not scalable ... a scalable solution is a practical one! 
  • Have fun!
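
To convey the core idea, here is a minimal runnable C sketch of an array hash table (my illustration, not my refereed code: the slot count and hash function are arbitrary here, and deletion and exact-fit growth are omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NSLOTS 8192   /* illustrative; the table below uses 524288 slots */

    /* Each slot is ONE contiguous buffer holding its strings back-to-back,
     * null-terminated.  A lookup is a single linear scan of that buffer:
     * no pointer chasing and no per-node allocation overhead. */
    typedef struct { char *buf; size_t used, cap; } Slot;
    static Slot table[NSLOTS];

    static size_t hash(const char *s)
    {
        size_t h = 5381;                      /* any decent string hash works */
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NSLOTS;
    }

    static int contains(const char *s)
    {
        Slot *t = &table[hash(s)];
        for (size_t i = 0; i < t->used; i += strlen(t->buf + i) + 1)
            if (strcmp(t->buf + i, s) == 0)
                return 1;
        return 0;
    }

    static void insert(const char *s)
    {
        Slot *t = &table[hash(s)];
        size_t len = strlen(s) + 1;
        if (contains(s)) return;              /* keep strings unique */
        if (t->used + len > t->cap) {         /* grow the slot's single array */
            t->cap = (t->used + len) * 2;
            t->buf = realloc(t->buf, t->cap);
        }
        memcpy(t->buf + t->used, s, len);
        t->used += len;
    }

    int main(void)
    {
        insert("cache"); insert("conscious"); insert("cache");
        printf("%d %d %d\n", contains("cache"),
               contains("conscious"), contains("miss"));  /* prints: 1 1 0 */
        return 0;
    }

The refereed version stores length-encoded strings and grows each slot exact-fit, which is where much of the space saving comes from; my papers cover those details.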

The B-tree and B-trie

Learn how to efficiently implement the B-tree in C, along with its newer alternative, the B-trie. These data structures are disk-resident, so they are not as fast as the in-memory variants, but they can scale to many hundreds of gigabytes in size. Please note, however, that my B-trie software is currently for non-commercial use only.

Please read my paper on the B-trie to better understand disk-resident data structures and also study my code; feel free to ask any questions you may have regarding my work. A minimal sketch of a disk-resident node layout is shown below.
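
To make "disk-resident" concrete, here is a small C sketch of one way to lay out a fixed-size B-tree page and fetch it in a single positioned read. It is purely illustrative: the constants are arbitrary, and a real B-tree or B-trie over variable-length strings needs a slotted page layout rather than the fixed-width keys used here.

    #include <stdint.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096   /* one node == one disk page == one I/O */
    #define ORDER     100    /* chosen so the struct fits in a page */
    #define KEY_MAX   24     /* fixed-width keys for simplicity only */

    typedef struct {
        uint16_t nkeys;
        uint8_t  is_leaf;
        uint64_t child[ORDER + 1];      /* page numbers, not pointers */
        char     keys[ORDER][KEY_MAX];
    } BTreePage;

    /* fetch page `pageno` of the index file with one positioned read */
    static int read_page(int fd, uint64_t pageno, BTreePage *out)
    {
        off_t off = (off_t)pageno * PAGE_SIZE;
        return pread(fd, out, sizeof *out, off) == (ssize_t)sizeof *out;
    }

    int main(int argc, char **argv)
    {
        BTreePage root;
        int fd = argc > 1 ? open(argv[1], O_RDONLY) : -1;
        if (fd >= 0 && read_page(fd, 0, &root))
            printf("root: %u keys, %s\n", root.nkeys,
                   root.is_leaf ? "leaf" : "internal");
        if (fd >= 0) close(fd);
        return 0;
    }

Searching then costs one page read per level of the tree; the B-trie improves on this for strings by splitting buckets on characters rather than whole keys, as my paper explains.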


Public feedback
Some examples (out of hundreds) of email feedback that I have received over the years regarding my research (sender details have been filtered).

1.
Subject: Thanks for the 'B-tries for Disk-based String Management' paper it was instrumental in the success of my project 			
Sender: Mike***

Hi Dr Askitis, 

I used your ' B-tries for Disk-based String Management' paper as the basis for a Java in memory 
Burst-Trie (although it uses the hybrid and pure splitting described in your paper).

It is so much faster and efficient than the compressed patricia trie and ternary search tree's that I 
implemented at first. My primary job is to write the software used to conduct a large scale (5% sample) 
travel survey conducted over the telephone and online in Southern Ontario, Canada. 

We have a database with 50,000+ street names and municipalities and we use the Burst Trie to index them 
for fast retrieval.  At first I wrote it for use in the web survey to remove any need to hit the db 
(and associated 200ms delay) when returning matches but now we use it in our call center 
application because of its speed.

We also support any-string matching (although not ordered) by using the speed of the 
trie to store all the suffixes of indexed strings.

I was planning on writing a report documenting our use of the Burst-Trie and crediting you and Justin Zobel ... 
You will both  be credited in the Conduct of the Survey report but that probably won't be written until 2013 or 2014 ...

Thanks for your highly understandable paper, it was enormously useful when writing our Java Burst Trie implementation.

Mike***

                                    

2.
Subject: Your work is great			
Sender: Alex***

Hey there Nikolas,

My name is Alex***, I'm really excited about your work on
the cache-conscious data structures -- this is awesome. Efficient
data structures is my passion too.

It'd be absolutely great if you could send to me the paper and the
sources for the following topics:

- Cache-conscious Collision Resolution in String Hash Tables
- HAT-trie: A Cache-conscious Trie-based Data Structure for Strings
- B-tries for Disk-based String Management
... oh... you know I just have realized that I crave to consume _all_
 your papers and sources... If you would kindly agree to pass those
 magnificent pieces to me that'd be really appreciated.

I plan to build either a python implementation (using cython) or a python
wrapper for the C implementation (using ctypes) of at least the HAT-tries.
Python lacks a good space/time efficient ordered dictionary.
...

-- Alex***
                                    


3.
Subject: Very interested in your work
Sender: Kris***

Hi Nikolas, I am very intrigued by your research.  
Would it be possible to get access to the fundamental data structures that you have been discussing in your papers?

--Kris***
                                    


4.
Subject: Repmaestro
Sender: Nih***

Hi,

I read and enjoyed your recent work; RepMaestro. Can you send me the link to download the software please?

Best,  
                                    



Real-world string data sets

These are real-world variable-length string datasets, ideal (if not essential) for evaluating the practical performance of string data structures such as the HAT-trie and the Judy trie. The approximate download size of each dataset is shown. The datasets were compressed using bzip2 (you can uncompress them in Linux with the command bunzip2).


How do they stack up on more recent hardware?

Let's consider the performance of my data structures on a more modern computing architecture: an AMD Phenom II X4 965 3.2GHz processor installed on a high-performance Asus motherboard with 8GB of DDR2-800 dual-channel 5-5-5-15 RAM, running Kubuntu Linux 10.10 64-bit (kernel 2.6.35-22-generic SMP x86_64). The data structures were first constructed using 28772169 (304MB) unique variable-length null-terminated strings. The time and space required for construction were captured and are shown in the table below. We then measured the time required to search the data structures using over 177 million (1.1GB) variable-length null-terminated strings (Skew Set 1). These strings have a highly skewed distribution (as is typically observed in practice). Note that we report the actual process size of these data structures, which is important, as this captures any space overheads and memory fragmentation caused by the allocation and frequent resizing of dynamic arrays. I have compared the performance of my data structures against the Judy trie (version 1.0.5). The results below were captured without HugeTLB.


Data structure (64-bit)                      Actual process size (MB)   Estimated memory (MB)   Time to insert (sec)   Time to search (sec)
Judy trie SL (v1.0.5)                        1078.9                     888.6                   19.8                   34.1
HAT-trie (512 slots / 16384 bucket limit)    476.3                      450.6                   14.3                   19.3
Array burst trie (256 bucket limit)          256.4                      235.1                   28.1                   25.4
Array hash table (524288 slots)              330.1                      317.6                   18.1                   10.6


Copy-based burst sort algorithm in ANSI C

The copy-based burst sort algorithm is one of the fastest in-memory sorting algorithms for variable-length strings, as detailed in the computing literature. Nonetheless, the original implementation of copy-based burst sort by its authors is not readily available. As such, I have written my own implementation of copy-based burst sort in C, using my knowledge of cache-conscious string data structures. This implementation may not be as fast or as memory-efficient as the original used by the authors, but given the results below, I am confident that it can serve as a good baseline for your experiments :)

My implementation of the copy-based burst sort algorithm differs a little from how it is explained in the literature. First, the strings stored in containers are length-encoded --- the original algorithm simply stored null-terminated strings. Second, containers are not sized to fit into the L2 cache of a typical CPU, as is the intention of the burst sort algorithm. Instead, much like my implementations of the array burst trie and HAT-trie, containers are burst once they store more than a given number of strings, which, in my opinion, is a simpler and more effective approach. The table below shows preliminary results for the time (in seconds) and memory (in megabytes) required to sort the DISTINCT dataset (28772169 unique strings), compared to the standard Linux sort command. The machine used to conduct the experiments was the AMD system described for the benchmark test cases (shown in the previous tab). A simplified sketch of the algorithm follows the table.


Sorting algorithm (64-bit)   Container size   Actual process size (MB)   Wall time to sort (sec)
Copy-based burst sort        64               ~520.9                     19.3
Copy-based burst sort        128              ~382.1                     20.2
Copy-based burst sort        256              ~290.6                     23.7
Linux sort                   ---              ~682.9                     129.8
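
To make the description above concrete, here is a compact, runnable burst sort sketch in C. It is my illustration, not the implementation sold above: for brevity it stores pointers to string tails in each container instead of copying length-encoded tails into contiguous buffers, and it bursts a container once it holds more than a fixed number of strings, as described above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define ALPHABET    256
    #define BURST_LIMIT 256   /* container size, as in the table above */

    typedef struct { const char **s; int n, cap; } Bucket;
    typedef struct Node Node;
    struct Node {
        Node  *child[ALPHABET];    /* sub-trie once bucket[c] has burst */
        Bucket bucket[ALPHABET];   /* bucket[c] holds tails that follow char c */
        int    eos;                /* strings that end exactly at this node */
    };

    static void push(Bucket *b, const char *s)
    {
        if (b->n == b->cap) {
            b->cap = b->cap ? b->cap * 2 : 16;
            b->s = realloc(b->s, b->cap * sizeof *b->s);
        }
        b->s[b->n++] = s;
    }

    static void insert(Node *t, const char *s)
    {
        for (;;) {
            unsigned char c = (unsigned char)*s;
            if (c == 0) { t->eos++; return; }   /* string fully consumed */
            s++;
            if (t->child[c]) { t = t->child[c]; continue; }
            Bucket *b = &t->bucket[c];
            push(b, s);
            if (b->n > BURST_LIMIT) {           /* burst: container -> sub-trie */
                Node *n = calloc(1, sizeof *n);
                for (int i = 0; i < b->n; i++) insert(n, b->s[i]);
                free(b->s); b->s = NULL; b->n = b->cap = 0;
                t->child[c] = n;
            }
            return;
        }
    }

    static int cmp(const void *a, const void *b)
    {
        return strcmp(*(const char * const *)a, *(const char * const *)b);
    }

    /* In-order traversal: each small container is sorted only when reached,
     * then emitted as prefix + tail.  The trie provides the global order. */
    static void emit(Node *t, char *prefix, int depth)
    {
        prefix[depth] = '\0';
        for (int i = 0; i < t->eos; i++) puts(prefix);
        for (int c = 1; c < ALPHABET; c++) {
            prefix[depth] = (char)c;
            prefix[depth + 1] = '\0';
            if (t->child[c]) {
                emit(t->child[c], prefix, depth + 1);
            } else if (t->bucket[c].n) {
                Bucket *b = &t->bucket[c];
                qsort(b->s, b->n, sizeof *b->s, cmp);
                for (int i = 0; i < b->n; i++) printf("%s%s\n", prefix, b->s[i]);
            }
        }
    }

    int main(void)
    {
        const char *words[] = { "burst", "sort", "is", "trie-based", "and", "fast" };
        Node *root = calloc(1, sizeof *root);
        char prefix[1024] = { 0 };
        for (size_t i = 0; i < sizeof words / sizeof *words; i++)
            insert(root, words[i]);
        emit(root, prefix, 0);   /* prints the words in sorted order */
        return 0;
    }

The copy-based variant instead copies each tail into the container's own contiguous buffer, so sorting a container touches one block of memory; that is where its cache efficiency comes from.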



PennySort and JouleSort International External Sorting Competitions

Check these out (: If you have the time, motivation and patience to compete in the PennySort or JouleSort competitions, then I highly recommend them. Go get them!

PennySort and JouleSort Medals :D


Publications
VLDB Journal (array hash table, array burst trie, array BST, HAT-trie, Hybrid HAT-trie, HAT+trie & B-trie)  ·  Oxford Bioinformatics (RepMaestro)  ·  SPIRE Conference (preliminary array hash table)  ·  ACSC Conference (preliminary HAT-trie & Hybrid HAT-trie)  ·  ACM JEA


Doctoral Citation:

Dr Askitis has developed dramatic improvements to some of the most fundamental string data structures used in computation by redesigning them for use on current computer architectures. His new string algorithms are much faster and require much less space than their predecessors, some of which have been the subject of research for over fifty years. These improvements have the potential to greatly improve the efficiency of a wide range of computing applications including databases, web search and data processing.

Image of the HAT-trie data structure populated with strings

  1. N. Askitis and J. Zobel,   Redesigning the String Hash Table, Burst Trie and BST to Exploit Cache
    Published in the prestigious ACM Journal of Experimental Algorithmics (ACM JEA), Vol. 15, Article 1.7, February 2011.
  2. N. Askitis,   OzSort 2.0: Sorting up to 252GB for a Penny.  PDF paper: OzSort 2010 Penny
    Winner of the 2010 International PennySort contest hosted by Microsoft & HP, and the first solo winner =)
    The entry was refereed and published on the International Sort Benchmark homepage, May 2010 (medal awarded at ACM SIGMOD 2010).
  3. N. Askitis and R. Sinha,   Engineering Scalable, Cache and Space Efficient Tries for Strings
    Published in the prestigious International VLDB Journal, October 2010, Volume 19, Number 5, Pg 633-660.
  4. N. Askitis and J. Zobel,   B-tries for Disk-based String Management
    Published in the prestigious International VLDB Journal, January 2009, Volume 18, Number 1, Pg 157-179.
  5. N. Askitis,   Fast and Compact Hash Tables for Integer Keys
    In proceedings of the 32nd Australasian Computer Science Conference, Wellington, New Zealand, January 2009, Vol 91, Pg 101-110.
  6. N. Askitis and R. Sinha,   OzSort: Sorting over 246GB for a Penny.  PDF paper: OzSort 09 Penny
    Winner of the 2009 International PennySort contest hosted by Microsoft & HP. The entry was refereed and published on the International Sort Benchmark homepage, May 2009. We were the first Australian team to win the PennySort competition.
  7. R. Sinha and N. Askitis,   OzSort: Sorting 100GB for less than 87kJoules.  PDF paper: OzSort 09 Joule
    Winner of the 2009 International JouleSort contest hosted by Microsoft & HP. The entry was refereed and published on the International Sort Benchmark homepage, May 2009. We were the first Australian team to win the JouleSort competition, and the first to demonstrate how budget desktop components can deliver energy-efficient sorting.
  8. N. Askitis and R. Sinha,   HAT-trie: A Cache-conscious Trie-based Data Structure for Strings
    In proceedings of the 30th Australasian Computer Science Conference, Ballarat, Australia, January 2007, Volume 62, Pg 97-105.
  9. N. Askitis and J. Zobel,   Cache-conscious Collision Resolution in String Hash Tables
    In proceedings of the 12th International String Processing and Information Retrieval (SPIRE) Conference, Buenos Aires, Argentina, November 2005, LNCS Vol. 3772, Pg 91-102.
  10. N. Askitis and R. Sinha,   RepMaestro: Scalable Repeat Detection on Disk-based Genome Sequences.  PDF paper: Bioinformatics RepMaestro [Download Software]
    Published in the prestigious Oxford journal Bioinformatics, October 2010, Volume 26, Number 19, Pg 2368-2374. The software and results of this paper have been confirmed by the Oxford Bioinformatics committee.







 