Implementation of B-Tree based Database using C Programming Language

A database is crucial to any organization to store, add, modify and retrieve details about the company. Additionally, they form the core of the backend of any imaginable web application that we see around us. Thus, a thorough understanding of internal working a database is necessary for creating meaningful software, or for making full use of the DBMS. We attempt to do so by implementing a simple B-Tree based database, following the SQLite architecture.


Introduction
Databases, or more specifically, relational databases, helps in storing and taking care or large amounts of data, and also to 'relate' different kinds of information based on one or more common properties or attributes they share. Every database has a database engine, which offers a multitude of services such as sorting, saving, changing or serving the information on the database. And finally, a database system is a computer program for managing electronic databases. Databases are the backbone of many businesses currently operational around the world.
The motivation of this project is to gain a better insight into the structure of an RDBMS by building one. This would help one answer a lot of questions like:  What format is the data saved in? (in memory and on disk)  When does it move from memory to disk?  Why can there be only one primary key per table?  How does rolling back a transaction work?  How are indexes formatted?  When and how does a full table scan happen?  What format is a prepared statement saved in? We will be using C Programming language for the implementation. The entire database in stored into a single file which must be provided as a command line argument.

The SQLite Architecture
A query goes through a chain of components in order to retrieve of modify data. The front end consists of:  Tokenizer  Parser  Code generator The back end consists of the: The virtual machine takes bytecode generated by the frontend as instructions. It can then perform operations on one or more tables or indexes, each of which is stored in a data structure called a B-Tree. The VM is essentially a big switch statement on the type of bytecode instruction.
Each B-tree consists of many nodes. Each node is one page in length. The B-tree can retrieve a page from disk or save it back to disk by issuing commands to the pager.
The pager receives commands to read or write pages of data. It is a responsible for reading/writing at appropriate offsets in the database file. It also keeps a cache of recently-accessed pages in memory, and determines when those pages need to be written back to disk.
The OS interface is the layer that differs depending on which operating system SQLite was compiled for.

Implementation
We start with a simple read-execute-print-loop (REPL). For that, our main function will have an infinite loop that prints the prompt, gets a line of input, then processes that line of input.
Next, we add functionality to support meta-commands (Non-SQL commands, like '.exit' are called meta-commands). Also, we display an 'unrecognized command' message to handle the exception where one enters an invalid or unsupported metacommand. Now, we take turn to implement executable commands like insertion. For, this we impose some restriction and define a fixed schema for only a single table which it supports. The schema is shown in the table 1.
At this stage, we create an array to store the data. Following which the 'select' and 'insert' operations are defined to be able read/write data from/to the array-based table data structure, using an abstraction known as the 'Pager'. Note that, currently, 145 data is available only as long as the program runs. Hence, now we shift our focus on to be able to write the data in a file so that we be able to open it again for modification using our program. Using file handling and the pager abstraction, we implement the functionality for the program to pass a filename as a command line argument to write the data into. At this point, we have a fully-functional program which we can use to insert rows into a table, store the data and also view it.
Next, we refactor the code to change the array data structure to store entries to a B-Tree data structure. B-Tree has the advantage of providing the functionality to store large amount of data (many more rows as compared to the array-based database).
For the above task, we begin by coding the 'Cursor' abstraction. The cursor will be used to access the row it is pointing to. Also, one needs to create a cursor at the beginning of the table, and we must be able to advance it to the next row.
The basic idea behind the B-Tree is as follows: a particular node in it can only store a certain amount of data, or rather just a specific no. of key-value pairs. It all starts with a single, root node, and once the limit is reached, new leaf nodes are added. This process occurs recursively with the addition of more rows.

Conclusion
We have successfully implemented a simple database which follows the SQLite architecture using C programming language. Data can conveniently be written into a database file by inserting rows, and can be viewed by executing the select