What should you do if you’re storing lots of data and retrieving them? Sure, you can just write a basic program that does so. But for huge databases of millions of items, these retrievals will become expensive and take longer to complete than you want. So how do you increase the speed of these operations? You use other, more advanced data structures. Without further ado, let’s find out.
- Linked lists
- Hash maps
We’ll first start with the simplest data structure, an array. It is an ordered collection of data with a fixed size, stored in a contiguous (i.e., adjacent) range of memory locations.
For example, if you would like to store four 64-bit integers in an array, the program looks for a range of four consecutive 64-bit memory blocks to store the things. If you want to access anything through the array, you must first know its index — the location in memory where the item is stored relative to the first item in the array. For example, if you’re looking for the fourth item in the above array, it will be at index 3 (note that zero-indexing applies here). Then, the program adds three to the position of the first item (which is stored when the array is first created), and retrieves its value. It takes constant time (O(1)) in general, but if the index is not known, the program needs to search through the entire array. That means the complexity of this operation becomes linear time (O(n)).
One disadvantage of the array, however, is that its length cannot be changed. That means the length of the array, at least the upper limit of it, has to be known at compile time. Therefore, if you don’t know the length in advance, you need to use a dynamic-size array — an array that can resize itself. If a new element is added to an array of length n, it searches for available space adjacent to it. If the array is full, it looks for a block of n+1 free memory addresses, copies the entire array over there, and adds the element. This is an expensive operation, but other than that, it performs the same as a standard array.
2. Linked lists
Linked lists are also flexible collections of items, being able to change their size. But instead of the above operation, it establishes a link between adjacent items (nodes). Each node contains the data that it contains, as well as the memory address of the next node in the linked list.
To add an item to the end of the list, you first establish a link between the current last node and the node you’ve just added. Then, update the “last node” variable to be the “value to add” to facilitate future additions. Adding an item to the start of the list is a similar process. Update the “first node” variable to be the “value to add”, and then link it to the previous first node.
Removing an item from the list is also simple. Link the item in the previous node directly to the next node. This excludes the node from the linking system such that it’s no longer accessible (and hopefully gets garbage-collected). This is similar to adding an item to the middle of the list, where the links are also manipulated in a certain way.
But this flexibility comes at a cost — it takes constant time to retrieve an item from the list. You need to start from the beginning, and sift through the things one by one until you find what you need.
A stack mimics a physical stack of items. The last thing that comes in must be the first to come out. This is a last-in-first-out (LIFO) structure. If you would like to add anything to a stack, you need to add it to the top. Likewise, if you delete something from it, you need to remove the first item, not anything in the middle.
But there’s also one crucial feature that stacks also have: its maximum size. That’s because stacks, like arrays, are located on contiguous memory blocks. If we try to add an item to a full stack, that item has nowhere to go — and it results in a stack overflow error, which often results in a program crash.
One important application of the stack is recursion. This means the function calls itself while it is being executed. When a recursive function is executed, the program adds each execution to the call stack. When the base case is met (i.e., the function stops referring to itself), it stops adding items to the call stack. Then, it starts going through the stack, popping each item from the top as each function call is executed, until it stops as the call stack becomes empty. The program then returns a value at this point.
Queues are very similar to stacks, except for using a first-in-first-out (FIFO) structure. This means you can only add values to the end of the queue or remove them from the beginning. This works like a queue you see in real life, hence the name. Like stacks, queues also have fixed maximum sizes and can overflow if too many things are added.
The applications of a queue are more obvious, as they involve waiting in line, just like actual queues. For instance, suppose you’re running a breadth-first search and need to track how much it has covered. This process uses a queue. When you have to search a specific depth, it adds the nodes of that depth to the queue. It then removes each node from the queue once it is finished with that node, and when it becomes empty, it moves on to the next depth and starts the process over again.
5. Hash Maps
Before talking about hash maps, let’s talk about maps in general. Each entry in a map is a pair of data points — the key and the value, in which the key is used to access a value. This structure is known as a key-value pair. Technically, you can use two arrays to achieve this; the location index of the key in the first array is the same as the location of the value in the second array. But if we would like to retrieve the item, what should the program do? It has to scan through the first list to find the index of the desired item, which can take longer than necessary.
The solution is to use a hash function to make the index. Hash functions are deterministic, which means they return the same result when given the same input. They convert a data point of any type (like a string, a boolean value, and so on) into a number, and because the size of the array limits that number (the index), these functions are not reversible as well (which means that there is more than one preimage of each possible value). When accessing a value through the hash map, you need to run the key through the hash function, and access the value using the hash.
But what happens if two keys get mapped to the same index? This is a hash collision, and there are a few ways to deal with it. You can place the element in the next available index, in a technique known as linear probing. However, this could lead to clustering, where many consecutive indices are occupied, leading to performance issues in placing and retrieving values. Other methods to deal with collisions can circumvent this problem, such as nonlinear probing and open addressing. As these methods still increase the computational cost of operating on a hash map, it’s beneficial to use a collision-resistant algorithm as well, where collisions are hard to find, often due to the large size of the output value.
The five data structures we’ve mentioned are linear, meaning that data is ordered one-dimensionally. But trees are different — it is a hierarchical data structure that models connections between nodes, in which a node can link to any number of nodes.
One example of a tree is a file system. The root directory, expressed as “/” in Unix-like systems, is the root node of the file system tree. It links to more directories, which are parent nodes of other files and subdirectories. Finally, the files, that don’t contain any more files (as they’re not directories), are the leaf nodes. Each file and directory has a path (e.g., “/Users/centralgalaxy/data.txt”, which is the path from the root node through the file system tree in order to find the item.
There are two ways that you can traverse a tree; a breadth-first-search (BFS) and a depth-first search (DFS). A breadth-first search searches depths (distances from the node) one by one until the tree is exhausted. It uses a queue, and is actually the example in the queues section. On the other hand, a depth-first search traverses a specific path until it reaches a leaf node, and then backtracks until it finds another path. It searches the tree recursively, so a stack is used.
Among the tree structures, there is one type that is one of the most useful — a heap. Specifically, it is a tree that satisfies the heap property — the value of each node is either smaller or larger than its parent. In a max-heap, the nodes are smaller than their parents, ensuring that the largest item is in the root. In a min-heap, the opposite is true — the nodes are greater than their parents so that the root is the smallest. Note that, in this context, cases where the parent node is equal to the children nodes are also allowed. Also, a heap needs to be a complete binary tree — a tree where every parent node has two children (except on the last layer).
But given the delicate structure of a perfect heap, how would we add anything to it? That might violate the heap property, right? Well, the way to go is to first add the item to the end of the heap to become one of its leaf nodes. Then, we fix the heap so that it follows the heap property again. If the child and the parent are in the wrong arrangement, we swap those items. For example, if a child node happens to be larger than the parent node in a max-heap, we exchange the values of the two nodes. This process is iterated until the heap is back in place again. Likewise, if we remove a value from the heap, we insert the rightmost value in the last layer to take its place. Then, again, we move the values around to fix the heap. But if we add or remove things frequently, a heap might not be the best data structure. It takes O(log n) time there, but in a linked list it just takes O(1).
However, there is one thing that heaps are good at because they add and remove things efficiently. That’s implementing a priority queue. In a priority queue, each item is assigned a numerical value (a priority), and only the item with the highest priority can be dequeued. But how do you keep track of the largest item? If you’re implementing a priority queue using an array, you must search through it every time, which becomes expensive as the queue lengthens. You can store a sorted version of the array to solve this problem, but what if you add something to it? You still need to scan through it!
But if you’re doing that using heaps, both adding and removing items becomes simple. Specifically, you only need to do at most one swap operation for one layer of the tree. And given that it’s a complete binary tree, where every layer except the last is filled, the number of layers is approximately the binary log of the number of items. Therefore, these insertion and deletion operations cost just O(log n), which is very efficient compared to a sorted list. Heaps can also keep track of the maximum values just as easily, as it is just the root node (in a max heap).
The trees and the heaps, mentioned above, are a subset of graphs. A graph is a set of nodes (vertices), connected by a set of pairs of nodes (edges). Sometimes, the link is symmetric, and the order in the pair does not matter. In that case, the graph is undirected. Otherwise, the order of the pair does matter (i.e., the link goes only from one node to another, but not vice versa). In this case, it is a directed graph, which is what trees, heaps, and linked lists are.
Graphs are essential tools in modeling many things in real life. For example, in public transport systems, each station is a node, while the links between one station to another are the edges of the graph. On the Internet, where pages link together, each webpage is a node, while each link is an edge. In most cases, the transport system is an undirected graph. If station A links to station B, station B probably links to station A as well. However, the Internet is a directional graph. If page A links to page B, page B doesn’t necessarily link to page A.
Graphs are so important that there is a whole field of mathematics dedicated to them — graph theory. This study of graphs has led to many results, such as the development of pathfinding algorithms. They are optimization algorithms that find the shortest path between two nodes on a graph. For this purpose, Dijkstra’s algorithm and the A* algorithm are commonly used.
To recap, we’ve briefly introduced 8 common data structures in this article. They include:
- Linked lists
- Hash maps
Each data structure has its advantages and disadvantages, making it useful in some applications but not others. So next time, if you need to process large amounts of data, make sure you use the right data structure. That will save you (and your program) lots of time!