#Relational database management system (RDBMS)
A relational database like SQL is a collection of data items organized in tables.
ACID is a set of properties of relational database transactions.
Atomicity - Each transaction is all or nothing(For example, a transaction will either be committed or will abort.)
The below image shows that both debit and credit operations are done successfully. Thus the transaction is atomic.
Thus, when the amount loses atomicity, then in the bank systems, this becomes a huge issue, and so the atomicity is the main focus in the bank systems.
Consistency - Any transaction will bring the database from one valid state to another(If we perform any operation over the data, its value before and after the operation should be preserved. For example, the account balance before and after the transaction should be correct, i.e., it should remain conserved.)
In the above figure, there are three accounts, A, B, and C, where A is making a transaction T one by one to both B & C. There are two operations that take place, i.e., Debit and Credit. Account A firstly debits $50 to account B, and the amount in account A is read $300 by B before the transaction. After the successful transaction T, the available amount in B becomes $150. Now, A debits $20 to account C, and that time, the value read by C is $250 (that is correct as a debit of $50 has been successfully done to B). The debit and credit operation from account A to C has been done successfully. We can see that the transaction is done successfully, and the value is also read correctly. Thus, the data is consistent. In case the value read by B and C is $300, which means that data is inconsistent because when the debit operation executes, it will not be consistent.
Isolation - Executing transactions concurrently has the same results as if the transactions were executed serially(For example, when multiple transactions occur at the same time, one transaction effects should not be visible to the other transactions in the database.
It means if two operations are being performed on two different databases, they may not affect the value of one another. In the case of transactions, when two or more transactions occur simultaneously, the consistency should remain maintained. Any changes that occur in any particular transaction will not be seen by other transactions until the change is not committed in the memory.
Durability - Once a transaction has been committed, it will remain so(data changes should remain permanent)
Durability ensures the permanency of something. In DBMS, the term durability ensures that the data after the successful execution of the operation becomes permanent in the database. The durability of the data should be so perfect that even if the system fails or leads to a crash, the database still survives. However, if gets lost, it becomes the responsibility of the recovery manager for ensuring the durability of the database. For committing the values, the COMMIT command must be used every time we make changes.
There are many techniques to scale a relational database:
master-slave replication, master-master replication, federation, sharding, denormalization, and SQL tuning.
1)Master-slave replication
The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned.
Disadvantage(s): master-slave replication
Additional logic is needed to promote a slave to a master.
See Disadvantage(s): replication for points related to both master-slave and master-master.
2)Master-master replication
Both masters serve reads and writes and coordinate with each other on writes. If either master goes down, the system can continue to operate with both reads and writes.
Disadvantage(s): master-master replication
You'll need a load balancer or you'll need to make changes to your application logic to determine where to write.
Most master-master systems are either loosely consistent (violating ACID) or have increased write latency due to synchronization.
Conflict resolution comes more into play as more write nodes are added and as latency increases.
See Disadvantage(s): replication for points related to both master-slave and master-master.
Disadvantage(s): replication
There is a potential for loss of data if the master fails before any newly written data can be replicated to other nodes.
Writes are replayed to the read replicas. If there are a lot of writes, the read replicas can get bogged down with replaying writes and can't do as many reads.
The more read slaves, the more you have to replicate, which leads to greater replication lag.
On some systems, writing to the master can spawn multiple threads to write in parallel, whereas read replicas only support writing sequentially with a single thread.
Replication adds more hardware and additional complexity.
3)Federation
Federation (or functional partitioning) splits up databases by function. For example, instead of a single, monolithic database, you could have three databases: forums, users, and products, resulting in less read and write traffic to each database and therefore less replication lag. Smaller databases result in more data that can fit in memory, which in turn results in more cache hits due to improved cache locality. With no single central master serializing writes you can write in parallel, increasing throughput.
Disadvantage(s): federation
Federation is not effective if your schema requires huge functions or tables.
You'll need to update your application logic to determine which database to read and write.
Joining data from two databases is more complex with a server link.
Federation adds more hardware and additional complexity.
4)Sharding
Sharding distributes data across different databases such that each database can only manage a subset of the data. Taking a users database as an example, as the number of users increases, more shards are added to the cluster.
Similar to the advantages of federation, sharding results in less read and write traffic, less replication, and more cache hits. Index size is also reduced, which generally improves performance with faster queries. If one shard goes down, the other shards are still operational, although you'll want to add some form of replication to avoid data loss. Like federation, there is no single central master serializing writes, allowing you to write in parallel with increased throughput.
Common ways to shard a table of users is either through the user's last name initial or the user's geographic location.
Disadvantage(s): sharding
You'll need to update your application logic to work with shards, which could result in complex SQL queries.
Data distribution can become lopsided in a shard. For example, a set of power users on a shard could result in increased load to that shard compared to others.
- Rebalancing adds additional complexity. A sharding function based on consistent hashing can reduce the amount of transferred data.
Joining data from multiple shards is more complex.
Sharding adds more hardware and additional complexity.
5)Denormalization
Denormalization attempts to improve read performance at the expense of some write performance. Redundant copies of the data are written in multiple tables to avoid expensive joins. Some RDBMS such as PostgreSQL and Oracle support materialized views which handle the work of storing redundant information and keeping redundant copies consistent.
Once data becomes distributed with techniques such as federation and sharding, managing joins across data centers further increases complexity. Denormalization might circumvent the need for such complex joins.
In most systems, reads can heavily outnumber writes 100:1 or even 1000:1. A read resulting in a complex database join can be very expensive, spending a significant amount of time on disk operations.
Disadvantage(s): denormalization
Data is duplicated.
Constraints can help redundant copies of information stay in sync, which increases complexity of the database design.
A denormalized database under heavy write load might perform worse than its normalized counterpart.
SQL tuning
SQL tuning is a broad topic and many books have been written as reference.
It's important to benchmark and profile to simulate and uncover bottlenecks.
Benchmark - Simulate high-load situations with tools such as ab.
Profile - Enable tools such as the slow query log to help track performance issues.
Benchmarking and profiling might point you to the following optimizations.
Tighten up the schema
MySQL dumps to disk in contiguous blocks for fast access.
Use
CHAR
instead ofVARCHAR
for fixed-length fields.CHAR
effectively allows for fast, random access, whereas withVARCHAR
, you must find the end of a string before moving onto the next one.
Use
TEXT
for large blocks of text such as blog posts.TEXT
also allows for boolean searches. Using aTEXT
field results in storing a pointer on disk that is used to locate the text block.Use
INT
for larger numbers up to 2^32 or 4 billion.Use
DECIMAL
for currency to avoid floating point representation errors.Avoid storing large
BLOBS
, store the location of where to get the object instead.VARCHAR(255)
is the largest number of characters that can be counted in an 8 bit number, often maximizing the use of a byte in some RDBMS.Set the
NOT NULL
constraint where applicable to improve search performance.
Use good indices
Columns that you are querying (
SELECT
,GROUP BY
,ORDER BY
,JOIN
) could be faster with indices.Indices are usually represented as self-balancing B-tree that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
Placing an index can keep the data in memory, requiring more space.
Writes could also be slower since the index also needs to be updated.
When loading large amounts of data, it might be faster to disable indices, load the data, then rebuild the indices.
NoSQL
NoSQL is a collection of data items represented in a key-value store, document store, wide column store, or a graph database. Data is denormalized, and joins are generally done in the application code. Most NoSQL stores lack true ACID transactions and favor eventual consistency.
BASE is often used to describe the properties of NoSQL databases. In comparison with the CAP Theorem, BASE chooses availability over consistency.
Basically available - the system guarantees availability.
Soft state - the state of the system may change over time, even without input.
Eventual consistency - the system will become consistent over a period of time, given that the system doesn't receive input during that period.
In addition to choosing between SQL or NoSQL, it is helpful to understand which type of NoSQL database best fits your use case(s). We'll review key-value stores, document stores, wide column stores, and graph databases in the next section.
Key-value store
Abstraction: hash table
A key-value store generally allows for O(1) reads and writes and is often backed by memory or SSD. Data stores can maintain keys in lexicographic order, allowing efficient retrieval of key ranges. Key-value stores can allow for storing of metadata with a value.
Key-value stores provide high performance and are often used for simple data models or for rapidly-changing data, such as an in-memory cache layer. Since they offer only a limited set of operations, complexity is shifted to the application layer if additional operations are needed.
A key-value store is the basis for more complex systems such as a document store, and in some cases, a graph database.
- (Key-value storage: It is the simplest type of database storage where it stores every single item as a key (or attribute name) holding its value, together.)
Document store
Abstraction: key-value store with documents stored as values
A document store is centered around documents (XML, JSON, binary, etc), where a document stores all information for a given object. Document stores provide APIs or a query language to query based on the internal structure of the document itself. Note, many key-value stores include features for working with a value's metadata, blurring the lines between these two storage types.
Based on the underlying implementation, documents are organized by collections, tags, metadata, or directories. Although documents can be organized or grouped together, documents may have fields that are completely different from each other.
Some document stores like MongoDB and CouchDB also provide a SQL-like language to perform complex queries. DynamoDB supports both key-values and documents.
Document stores provide high flexibility and are often used for working with occasionally changing data.
(Document-oriented Database: A type of database used to store data as JSON-like document. It helps developers in storing data by using the same document-model format as used in the application code)
Wide column store
Abstraction: nested map
ColumnFamily<RowKey, Columns<ColKey, Value, Timestamp>>
A wide column store's basic unit of data is a column (name/value pair). A column can be grouped in column families (analogous to a SQL table). Super column families further group column families. You can access each column independently with a row key, and columns with the same row key form a row. Each value contains a timestamp for versioning and for conflict resolution.
Google introduced Bigtable as the first wide column store, which influenced the open-source HBase often-used in the Hadoop ecosystem, and Cassandra from Facebook. Stores such as BigTable, HBase, and Cassandra maintain keys in lexicographic order, allowing efficient retrieval of selective key ranges.
Wide column stores offer high availability and high scalability. They are often used for very large data sets.
(Wide-column stores: It is similar to the data represented in relational databases. Here, data is stored in large columns together, instead of storing in rows.)
Graph database
Abstraction: graph
In a graph database, each node is a record and each arc is a relationship between two nodes. Graph databases are optimized to represent complex relationships with many foreign keys or many-to-many relationships.
Graphs databases offer high performance for data models with complex relationships, such as a social network. They are relatively new and are not yet widely-used; it might be more difficult to find development tools and resources. Many graphs can only be accessed with REST APIs.
[Graph Databases: It is used for storing vast amounts of data in a graph-like structure. Most commonly, social networking websites use the graph database.]
Cloud Database
A type of database where data is stored in a virtual environment and executes over the cloud computing platform. It provides users with various cloud computing services (SaaS, PaaS, IaaS, etc.) for accessing the database. There are numerous cloud platforms, but the best options are:
Amazon Web Services(AWS)
Microsoft Azure
Kamatera
PhonixNAP
ScienceSoft
Google Cloud SQL, etc.
Object-oriented Databases
The type of database that uses the object-based data model approach for storing data in the database system. The data is represented and stored as objects which are similar to the objects used in the object-oriented programming language.
SQL or NoSQL
Reasons for SQL:
Structured data
Strict schema
Relational data
Need for complex joins
Transactions
Clear patterns for scaling
More established: developers, community, code, tools, etc
Lookups by index are very fast
Reasons for NoSQL:
Semi-structured data
Dynamic or flexible schema
Non-relational data
No need for complex joins
Store many TB (or PB) of data
Very data intensive workload
Very high throughput for IOPS
Sample data well-suited for NoSQL:
Rapid ingest of clickstream and log data
Leaderboard or scoring data
Temporary data, such as a shopping cart
Frequently accessed ('hot') tables
Metadata/lookup tables
Data model Schema and Instance
The data which is stored in the database at a particular moment of time is called an instance of the database.
The overall design of a database is called schema.
A database schema is the skeleton structure of the database. It represents the logical view of the entire database.
A schema contains schema objects like table, foreign key, primary key, views, columns, data types, stored procedure, etc
A database schema can be represented by using the visual diagram. That diagram shows the database objects and relationship with each other.
A database schema is designed by the database designers to help programmers whose software will interact with the database. The process of database creation is called data modeling.
A schema diagram can display only some aspects of a schema like the name of record type, data type, and constraints.
Other aspects can't be specified through the schema diagram. For example, the given figure neither show the data type of each data item nor the relationship among various files.
In the database, actual data changes quite frequently.
example, in the given figure, the database changes whenever we add a new grade or add a student. The data at a particular moment of time is called the instance of the database.
Data Independence
Data independence can be explained using the three-schema architecture.
Data independence refers characteristic of being able to modify the schema at one level of the database system without altering the schema at the next higher level.
There are two types of data independence:
1. Logical Data Independence
Logical data independence refers characteristic of being able to change the conceptual schema without having to change the external schema.
Logical data independence is used to separate the external level from the conceptual view.
If we do any changes in the conceptual view of the data, then the user view of the data would not be affected.
Logical data independence occurs at the user interface level.
2.Physical Data Independence
Physical data independence can be defined as the capacity to change the internal schema without having to change the conceptual schema.
If we do any changes in the storage size of the database system server, then the Conceptual structure of the database will not be affected
Physical data independence is used to separate conceptual levels from the internal levels.
Physical data independence occurs at the logical interface level.
Fig: Data Independence
Database Languages in DBMS
A DBMS has appropriate languages and interfaces to express database queries and updates.
Database languages can be used to read, store and update the data in the database.