A database is a collection of logically interrelated data and description of this data, designed to meet the information needs for organization. The primary objective of creating a database is to store, update and retrieve data efficiently and use it for analysis. The data stored can also include any transient data such as input documents, reports and intermediate results obtained during processing of the data. To achieve this, database must have different characteristics and some of them are:
- It should have a structure which deals with data types and data behavior.
- It should have a proper retrieval method which could be a declarative query language (like SQL, which we'll talk about in this article) or a procedural database programming language (like Java or Python).
- The storing and retrieving process should be quick and correct
- Data must be secure and must be protected from unauthorized access
Database system provides the abstract view of data to the users. It does so with the help of following three levels:
- Physical level: The lowest level of abstraction. It describes how the data are actually stored.
- Logical Level: This level describes the data stored in database, and the relationships that exist among the data. It describes the entire database in terms of small number of relatively simple structure. For example:
type doctor= recordID : string;Name : string;Hospital name: string;Salary : integer;end;
- View Level: This is the highest level of data abstraction. In this level the application programs hide details of data types. Views can also hide information which is considered vulnerable (such as an employee’s salary) for security purposes.
Structured Query Language (SQL) is an ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems. SQL is based on the relational tuple calculus and is a standard language for relational database management system. SQL works with database programs like MS Access, DB2, Informix, MS SQL Server, Oracle, Sybase, etc. Below, we'll see how to create, read, and alter data using SQL.
Data-Definition Language (DDL) is used to create and modify the structure of database objects in database.
DDL compiler generates a set of tables stored in a data dictionary. Data dictionary is a special set of tables which contains metadata (i.e., data about data that describes the object in the database). The DDL used in SQL are:
CREATE can be used to create database, tables, views of table and other object in database.
CREATE DATABASE DB;
This syntax will create a database named DB.
Similarly, we can create table called table as:
CREATE TABLE table_name(Column1 datatype,Column2 datatype,………ColumnN datatype,PRIMARY KEY( one or more columns));
CREATE TABLE Instructor (ID char(5),Name varchar(20),Department varchar(20),Salary numeric(8,2),PRIMARY KEY( one or more columns));
ALTER is used to modify existing database objects like tables.
ALTER TABLE table_name ADD attribute_name datatype;
This query adds new attribute to the table called table_name.
ALTER TABLE Instructor ADD FirstName varchar(20);
DROP is used to delete an entire table, a view of table or other object in the database. The query given below remove a table called table_name from the database.
DROP TABLE table_name
Data-Manipulation Language (DML) is used to retrieve, store, modify, delete, insert and update data in database. It is also called query language. Some of the DML in SQL are discussed below. We will use the table Instructor that we created above to explore these commands.
Select can be used to get particular data from a table. The query given below provide id, name and salary from all columns of instructor table.
SELECT ID, NameFROM instructor
It is use to sort the data in descending or ascending order.
SELECT column-listFROM table_nameORDER BY column1, column2 .. [ASC|DESC];
SELECT NameFROM InstructorORDER BY Salary DESC
GROUP BY command is used with the SELECT statement to arrange identical data into groups. The syntax of group by is:
SELECT column-listFROM table_nameWHERE [conditions]GROUP BY column1, column2
SELECT COUNT(ID) AS Count, DepartmentFROM Instructor
GROUP BY Department
Update can be used to update values of different column;
UPDATE table_nameSET column1 = value1, column2 = value2 .. , columnN = valueNWHERE [Condition];
UPDATE InstructorSET Salary = 130000Where ID = 1;
INSERT INTO is used to add new rows of data in a table. The syntax is given as:
INSERT INTO (column1, column2, ... columnN)VALUES (value1, value2, ... valueN);
INSERT INTO Instructor (ID, Name, Department, Salary)VALUES (6, 'Dinesh', 'Physics', 140000);
DELETE command is used to delete existing data from a table. The syntax for DELETE is:
DELETE FROM table_name,WHERE [condition];
DELETE FROM InstructorWHERE ID = 5;
Join operations take two relations and return another relation as a result. There are different kind of JOIN like inner join, full outer join, left outer join and right inner join.
An example of join is given below:
course NATURAL LEFT OUTER JOIN prereqSELECT course.course_id, course.title, course.dept_name,course.credits, prereq.prereq_idFROM courseINNER JOIN prereqON course.course_id = prereq.course_id;
This operation will join two tables course and prereq and give output as shown in figure below. The joined table has null as prereq_id for course CS-315 as prereq_id for it is not present.
There is a special way to make querying in SQL fast i.e. Index. Index is a kind of lookup table which is used by database search engine to make data retrieval fast. But sometimes index can slow down the process too like when updating a table as it involves updating the index too. So one should create index for only frequently searched columns.
To create an index on a table ,we can use CREATE operation like when we created table.
CREATE [UNIQUE] INDEX index_nameON table_name (column1,column2…….,columnN);
The unique modifier in above syntax indicates that the combination of value in the indexed columns must be unique.
We can also drop the index using DROP command:
DROP INDEX index_name;
The easiest way in which we can store data on a computer is in plain text files. Although that solution works simpler systems and small datasets, different problems can arise for more complex systems. Some of these problems are:
- File system is redundant as multiple file formats are used and duplication of information takes place.
- Concurrent access of file system can cause different anomalies and inconsistencies.
- Failures may leave file system in an inconsistent state with partial updates carried out like while transferring funds from one account to another but such transactions must be either complete or not happen at all.
Database system provides efficient solutions to this problems.
NoSQL stands for Not only SQL. It is a non-relational database system designed to operate in huge datasets. With advent of big data, the traditional database management system fell short of different requirements. Companies like Google, Facebook, Amazon collects terabytes of data in daily basis and this data requires distributed data storage mechanisms. Due to this the need for a fixed schema that was mandatory for relational database like SQL has diminished and NoSQL has been used instead. NoSQL is basically based on CAP theorem (Brewer’s Theorem) which states requirements:
- Consistency: The data should be remain consistent to all user after any kind of updates to it. For example, when a Facebook user edits his comment in a post, everybody should be able to see same edited version of comment.
- Availability: Usually NoSQL database are implemented in cloud service and it is of primary importance to make the database always available. For instance, the twitter user should be able to go through the tweets any time of the day.
- Partition tolerance: The database should continue to function even when the servers which are partitioned into different groups has unreliable communication.
In practice, it is not possible to fulfill all the requirements of the CAP theorem so database systems follow different combination of C, A, P like CA, AP, etc. There are different databases based on NoSQL. For example, Hadoop, Cassandra, Amazon SimpleDB, MongoDB, Elastic Search, CouchDB, RethinkDB, etc. These database systems have benefits like high data velocity, real location independence, continuous data availability, flexible data models and many more. While choosing the database from this pool, one should consider performance, cost, manageability, scalability and other factors which are essential for your project.
- Intro to SQL for Data Science | Datacamp
- SQL Tutorial | Tutorials Point