The data management is the backbone of any application, usually in small to medium-sized applications and programmers need not worry much about data structures because of high-end system, but it is critical that you know when you load data that you should choose efficient data structure for it if you are to create highly scalable applications.
This article is not intended to explain various data structures from scratch and, assuming they are familiar with it, programmers who already know them will be able to jog their memory through it and learn some of its applications.
Big O Notation
Before we dive further into it, first of all, I’d like to mention how do we determine if code is good and scalable. There is a term called Big O notation which is used to evaluate the performance of any data structure for different operations.
Depending upon the number of operations needed to be done on a number of inputs, performance gets calculated and normally denoted in following formats like O(n), O(1), O(log n)
If you going to search anything within an array and you had to put a simple loop with a traverse through all elements (for the worst-case scenario), it will be linear search and denote as O(n) as for 100 items, 100 times operation performed and imagine for 1000,000 items for big applications and that is considered less than ideal.
For some operations, we don’t have to go through any list as we know index like in the case of Array or Hash tables and we can pick up the item then it’s the very fast operation and denoted as O(1) as its constant speed.
But in the case of a nested loop, it could be a nightmare especially in case of the worst-case scenario and big data and for such cases, it could be O(n²)
Nevertheless, I got a simple cheat sheet which should help to jog your memory about Big O
#Big O Cheat Sheet:
O(1) Constant- no loops
O(log N) Logarithmic- usually searching algorithms have log n if they are sorted (Binary Search)
O(n) Linear- for loops, while loops through n items
O(n log(n)) Log Linear- usually sorting operations
O(n²) Quadratic- every element in a collection needs to be compared to every another element. Two nested loops
O(2^n) Exponential- recursive algorithms that solve a problem of size N
O(n!) Factorial- you are adding a loop for every element
Iterating through half a collection is still O(n)
Two separate collections: O(a * b)
-What can cause time in a function?-
Operations (+, -, *, /)
Comparisons (<, >, ==)
Looping (for, while)
Outside Function call (function())
-Rule Book
Rule 1: Always worst Case
Rule 2: Remove Constants
Rule 3: Different inputs should have different variables. O(a+b). A and B arrays nested would be
O(a*b)
+ for steps in order
* for nested steps
Rule 4: Drop Non-dominant terms
When we are executing the program, not only time matters that how quickly it performs an action but memory space also matters, how efficient it is memory-wise. Some data structure are efficient in terms of time and some are in terms of space, so it all depends on how much is data, what matter the most and what kind of operations needed to run mostly. sometimes you have to trade-off, whether better performance or better memory consumption, especially when dealing with big data. Following are a few pointers which cause memory consumption mainly in any program
Space complexity depends upon:
Variables
Data Structures
Function Call
Allocations
Big O helps us decide which algorithm and data structure we should choose for any operation, as there is no single best solution for all kind of operations, it all depends on the type of operation and type of data but there are still some good data structures which mostly fit in many situations so its good to know them well.
ref: bigocheatsheet.com
Array
Probably the most common and most used data structure. There are two types of array, static and dynamic, as the name implies, static is a fixed size and dynamic is not. Most of the operations in the array are very fast means O(1) but with the dynamic array, we had to compromise on space complexity along with performance comparable to a static array but as I mentioned earlier, it all depends how much data and how much it matters to you in given circumstances.
Static Array Big O
Dynamic Array Big O
As you can see there is not much difference except with push/append method, as it could be a linear loop and could cause the array to be replicated on new space which would cause O(n) level of complexity sometimes.
For some reason, I would prefer Hash tables instead of Dynamic array most of the time but as I said before, it all depends on the situation. I’ll explain later in the article why I said, as Dory said in Finding Nemo, just keep swimming ;-)
Hash Tables
Being a programmer, I am pretty sure you must have heard of Hash, Hashing, Hash functions, Hash table sounds a bit scary but it’s not, implementation wise, its something between Static and Dynamic array with an ingredient of Hash function. Hash functions are special functions which used to generate hash (gibberish) text, hash functions are not only used for Hash tables but also you can use it in security like generate a hash of a password to store, as it can’t be reversed so even that hash is stolen, it can’t be reversed into original text so if you were comparing with encryption, DONT!
Hash is generated and used as an index for Array to store/lookup any data.
Now you must be thinking, if it is Big O(1) which is considered to be the best performance then why don’t we use it for everything, believe me, I thought the same thing after looking at it but there are some pros and cons of Hash tables.
Hash tables don’t distribute data in order or evenly fashion, as one input always going to have same hash so if the input is repeating, the hash location would always be same so even their data is different but it will point to the same location and caused Hash table collision, it can add multiple data in a single bucket, which obviously solvable through the linked list and other ways. So you can see if data bucket is already existing, add data and add a link which creates linked list, obviously going to explain further next.
Array vs Hash table
Linked List
Arrays are simple but quite limited, especially size-wise, no matter it’s static or dynamic as dynamic has to be recreated again upon hitting the limit. which cost O (n) in most operations.
Hash tables much better choice but not ordered so can’t run many algorithms like binary sort or such which required ordered data so the search could be an issue if we don’t know the address/index.
Linked List is there to save the day as we can have ordered linked list. There are different types of Linked Lists, like Singly, Doubly, Circular, It all depends on what kind of linkage is required but obviously Singly is more space-efficient but it has its own trade-off.
Singly Linked list, only traverse forward. Head is the first node, Tail is the last node linked to null.
Doubly Linked list, can traverse back and forth but more space consuming. Each node is linked to each other.
Circular Linked list, head and tail nodes are linked. so quicker where required.
Stack and Queues
Both are quite similar as both are linear as handle data sequentially, can’t jump around like in Array or Hash tables. You add data in a similar way but removing data is totally opposite. The stack is First In Last Out or can say Last in First Out but with Queue First in First out like in real-world Qs.
Stacks Big O performance is fast but you can’t use Stack everywhere, only where data needs to be handled sequentially and naturally applicable to take full advantage of stacks.
About Q performance it’s also very fast and same as Stack but application is quite different.
Considering its model, one question arises, why can’t we simply use an array instead of Q. Simple reason is when we remove first (shift) item from Array, we have to shift all items, resize the array, which cost a lot of operations. Hence implementing Q with Array would be a mistake. But to implement Stack, you can certainly consider Stack.
Using Linked List to build Stack/Q makes more sense and you can also build Q using Stack if that is built using Linked List, the only difference is in DeQueue method, you will be popping from the head instead of a tail.
Tree
Like in a real tree, trees have different branches, leaf, root, conceptually its quite similar with Tree data structure. The following image will jog your memory.
if you have experience with HTML DOM, you can relate with it, the HTML tag is Root, Head/Body tags are a child of Root and rest of the other mostly tags are their children and it could go to N level just like real DOM.
There are different types of trees just like in real life like Binary Tree, AVL, Red-Black Tree and yes we going to discuss them and can be easily built using the Linked list as they are Uni directly as you can see arrows pointing in one direction in the illustration above. For both directions, we have directed Graphs which I am going to discuss later.
Binary Tree
As the name implies, Binary can have two at max and 0 at least. so each node can link to two other nodes at max. A binary tree is good for storing information which needs to be stored in order but not necessarily. Binary tree, we try to keep the balance for better performance, to keep the balance we use techniques like AVL/Black Red which I’ll describe later.
Performance-wise they are much better than linear search so for 100 items so Log of 100 = 2 but that depends if data is sorted and you apply Binary Search Algorithm. Always remember Big O is not exact and depends also on different factors.
If Tree is not balanced it could be O(n) in worst case scenario performance which is something we want to avoid. So keeping tree balance is very important. There are a couple of Tree structures are AVL and Red-Black Tree which self-balance itself when new node inserted or any node deleted.
There are a few factors you can choose one over the other. Like AVL lookup is very fast and if mostly Lookup operations are required, prefer AVL tree but if Insert operation is quite extensively used then I’d prefer Red-Black tree.
Binary Heap
First of all, don’t confuse it with memory heap. A binary tree is good if you are storing information comparative from left to right as in Binary Tree, left is always going to be bigger than right but if you want to store information in a way that either top root node is either small or biggest then you can choose Binary heap as think about it, if you adding data Age-wise and you want to choose all those nodes which are bigger than the age of 18 then you can simply choose all nodes after certain nodes as we know from that node upward or downwards all nodes are either small or bigger.
In case the root node is smallest then its called Min Heap otherwise Max Heap.
Binary heaps are self-balancing by restructuring itself. It’s always a balanced tree. With Binary Heap we can easily implement Priority Q. It’s not same as Q we discussed earlier but as in real life Priority, Q has elements which have elements with defined priority so they could be processed first so whenever new node created it will be inserted according to its priority. It does not first come first serve, imagine the Boarding gate Q, Captain and its crew should have been allowed to move prior than passengers to start preparing, so the captain could be a top priority, staff after him and then comes we, poor passengers. With Binary Search Tree, it can be achieved if there are only 3 types of priority, imagine if there are more classification of a priority then we can’t achieve through Binary Search tree.
In a nutshell, Binary Search tree is ordered, flexible, faster, better than linear search on average or if kept well balanced. Whereas Binary Heap is not as faster as BST but still better than O(n) and can define priority Q, flexible in size and fast insertion and keep the tree is balanced.
Trie
A specialized tree for searching for text can have more than 2 children. Good for dictionary implementation. auto-completion, IP routing. It also saves space by not needing to repeat nodes. In the image below, for word News and Not, N node is not repeating, It allows prefix so if we going to search a word like NEWS, we go through Start-> A → D -> N -> E -> W ->S… we don’t’ have to go further or go deep in those nodes which are not matching and we can match with each letter at each level. Obviously if word not found, still we won’t be traversing through all nodes so in worst case its not as bad as linear search.
Graph
The graph is very useful to model real-life relations, nodes are inter-connected through edges. Used for networks whether its roads network, a social network like friends, co-workers, computer networks, airports, airline routes.
There are different types of graphs, Directed Graph and Undirected Graph. Facebook friends are undirected graph, as when I add somebody as a friend and they accept we both can see other profile but in case of Twitter, that’s more like Directed Graph, where when somebody follows me, I don’t automatically follow him/her.
Other types are Weighted and Unweighted Graph. Weighted graph kind of assign value to Edges as well like in the image below, very good to find the shortest path for google maps or any router application.
The graph is good in terms of saving relationship-based data, could be quite complicated when scalable.
There are many many other data structures which are somehow subsets of aforementioned ones which you can learn, following are some real-world test cases and applications of these data structures which should help you to give an idea which type of data structure you can use next.
Applications of Data Structure
Array is very primitive and mostly used to build other data structures like dynamic arrays, hash tables, heaps and matrices. Also used for many sorting algorithms like merge, insertion, quick and bubble.
Hash Table is kind of my favourite and can be easily be chosen where performance is paramount. You can easily remove duplication of data, in Chess to backtrack their moves, can store objects quickly and make them accessible as key-value pairs
Linked List is widely used, where data is a different type (which can’t be held in Array) and data needs to be stored in order (Hash table can’t be used). In the real world, I would say compilers can use it for symbols table or operating system can simply use to load the list of open application to allow users to switch between (circular linked list).
Stack can be used when evaluating a mathematical type of expression, backtracking, process most recent firsts, Undo/Redo operations are stack-based. Even clipboard is stack-based. Functions calling for an engine like JVM.
Queue is more like when you put in a list to process in first come first serve. Messaging Queue and Service bus. Order processing Q system.
Tree as I mentioned earlier, HTML DOM, Parsers, Database indexes, Filesystem, 3D computer graphics, IP routing, Priority Q, sorted and comparable data to query quickly, Trie for the dictionary, textual searching, IP subnet masking.
Graph its more like real-world applications like relationships (human to human) like Facebook, Linkedin (co-workers), computer networks/internet, airline routes, maps, relational databases and much more.
Obviously, it’s not fixed and nor only possible applications, it’s just to give an idea. Good luck and share your thoughts.