Various database products have different characteristics, but they share common ground in the main database concepts: they all provide various database objects and different levels of security protection, and they all emphasize performance management and daily operation and maintenance (O&M) of the database.

This chapter covers the main responsibilities and contents of database management and introduces some common but important basic database concepts, laying a foundation for the next stage of learning. After completing this chapter, readers will be able to describe the main work of database management, including distinguishing different backup methods, listing security management measures and describing performance management work, as well as describing the important basic concepts of databases and the usage of each database object.

2.1 Overview of Database Management

2.1.1 Database Management and Its Scope of Work

Database management refers to the management and maintenance of the DBMS. Its core goals are to ensure the stability, security and data consistency of the database, as well as the high performance of the system.

Stability refers to the high availability of the database. Master-slave, multi-master, distributed and other highly available architectures are used to ensure the availability and stability of the database system.

Security refers to the security of the content stored in the database to avoid illegal access and use of data content.

The database itself provides many functions to ensure data consistency, such as foreign key constraints and non-null constraints on tables. Data consistency here, however, refers to using the synchronization technology, replication tools and similar mechanisms provided by the database to ensure data consistency among multiple databases when building master-standby, master-slave and other multi-master systems. This assurance work is part of database management.

High system performance mainly involves optimization, monitoring, troubleshooting and other work within database management.

Database administrator (DBA) is a collective term for the personnel who manage and maintain the DBMS; it denotes a role rather than a particular person. Some companies also call the DBA a database engineer. The work of the two roles is basically the same: ensuring 7 × 24 stable and efficient operation of the database.

Database management work includes management of database objects, database security, backup and recovery, performance and environment.

Database object management is actually the management of the data in the database, including physical design and implementation work. Physical design work refers to understanding the features and functions provided by different database objects, and transforming the data model in conceptual design and logical design into physical database objects on the basis of following reasonable relational design principles. Physical implementation work refers to the creation, deletion, modification and optimization of database objects.

Database security management refers to preventing unauthorized access to avoid leakage of protected information, as well as preventing security breaches and improper data modifications, so that data is available only to authorized users. Database security management includes, but is not limited to, the management of system security, data security and network security. An enterprise database security strategy includes laying a solid foundation with authentication, authorization, access control, recovery, classification and batch management; protecting critical information and data privacy through defensive data protection measures such as encryption and desensitization without affecting application functions; and building database intrusion detection with audit monitoring and vulnerability assessment, while developing security policies and standards to ensure role separation and availability.

Backup and recovery management means developing a reasonable backup strategy and performing regular backups, so as to ensure the fastest possible recovery with minimum loss for the database system in case of disaster.

Database performance management means monitoring and optimizing the factors that affect database performance and optimizing the resources available to the database, in order to increase system throughput and reduce contention so that the largest possible workload can be handled. The factors involved in database performance optimization are workload, throughput, resources and contention. The workload of a database is the usage demand submitted by users, expressed in different forms such as online transactions, batch jobs, analytical queries and ad hoc queries. The workload varies over time, and the overall workload has a great impact on database performance. Throughput refers to the overall processing capacity of the database software, i.e., the number of queries and transactions that can be processed per unit of time. Resources include CPU, I/O, network, storage, processes, threads, and all other hardware and software objects available to the database. Contention refers to simultaneous demands by multiple workloads for the same resource; this conflict arises because the amount of resources is smaller than what the workload demands.

Database environment management covers operation and maintenance work such as installation, configuration, upgrade and migration, ensuring the normal operation of the IT infrastructure, including the database systems.

2.1.2 Object Management

A database object is a general term for the various concepts and structures used to store and point to data in the database. Object management is the process of creating, modifying or deleting various database objects using object definition languages and tools. Basic database objects generally include tables, views, indexes, sequences, stored procedures and functions, as shown in Table 2.1.

Table 2.1 Basic database objects

The database product itself does not impose strict naming restrictions, but naming objects arbitrarily leads to an uncontrollable and unmaintainable system, and can even make the entire system difficult to maintain. Developing a naming convention is a basic requirement of database design, because a good naming convention means a good start.

There are several suggestions for naming convention as follows.

  1. (1)

    Unify the case of names. Case can be standardized on a project basis, such as all uppercase, all lowercase, or initial capitals.

  2. (2)

    Use prefixes to identify the type of object, such as the table name prefix “t_”, view prefix “v_”, function prefix “f_” and so on.

  3. (3)

    Try to choose meaningful, easy-to-remember, descriptive, short and unique English words for naming; Chinese Pinyin is not recommended.

  4. (4)

    Use a naming dictionary to define common abbreviations on a project basis, such as “amt” for “amount”.

Some commercial databases set length limits on table and view names in early versions, for example, no more than 30 characters. Overly long names are not easy to remember or communicate, nor convenient when writing SQL code. Public database naming specifications can be used as a blueprint for developing industry- and project-oriented naming conventions according to project characteristics, as shown in Table 2.2.
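
As a simple illustration of these suggestions, the following sketch applies a lowercase, prefix-based convention; the table name t_order, the view name v_order_summary and the column names are invented for this example.

-- Table name prefixed with "t_", all lowercase, abbreviations from the project dictionary
CREATE TABLE t_order (
    order_id   INT PRIMARY KEY,
    cust_id    INT NOT NULL,       -- "cust" as the agreed abbreviation for "customer"
    order_amt  DECIMAL(12, 2),     -- "amt" as the agreed abbreviation for "amount"
    order_date DATE
);

-- View name prefixed with "v_"
CREATE VIEW v_order_summary AS
SELECT order_date, SUM(order_amt) AS total_amt
FROM t_order
GROUP BY order_date;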

Table 2.2 Database naming convention

2.1.3 Backup and Recovery Management

There are many possible causes of data loss, mainly storage media failure, user operation errors, server failure, virus intrusion, natural disasters, etc. Backing up a database means additionally storing the data in the database, together with the related information needed to keep the database system running, so that it can be used to restore the database upon system failure.

The objects of database backup include, but are not limited to, the data itself and data-related database objects, users, permissions, and the database environment (configuration files, scheduled tasks), etc. Data recovery is the activity of restoring a database system from a failed or paralyzed state to an operational one and restoring the data to an acceptable state.

For enterprises and other organizations, the database system and other application systems together constitute a larger information system platform, so database backup and recovery is not an isolated task; it should be considered together with the other application systems from the perspective of the overall disaster recovery of the whole information system platform. This is the so-called enterprise-level disaster recovery.

Disaster backup refers to the process of backing up data, data processing systems, network systems, infrastructure, specialized technical information and operational management information for the purpose of recovery after a disaster occurs. Disaster backup has two objectives: the recovery time objective (RTO) and the recovery point objective (RPO). RTO is the time limit within which recovery must be completed after a disaster has stopped an information system or business function. RPO is the requirement on the point in time to which the system and data are recovered after a disaster. For example, if the RPO requirement is one day, the system and data must be recovered to a state no more than 24 h before the failure caused by the disaster, and data loss within those 24 h is allowed; if the data can only be restored to the state two days (48 h) before the failure, the requirement of RPO = 1 day is not satisfied. RTO emphasizes the availability of the service: the smaller the RTO, the smaller the loss of service. RPO targets data loss: the smaller the RPO, the less data is lost. A typical enterprise disaster recovery goal is RTO < 30 min with zero data loss (RPO = 0).

China’s GB/T 20988-2007: Information security technology—Disaster recovery specifications for information systems divides disaster recovery into six levels, as shown in Fig. 2.1.

  • Level 1: Basic support. The data backup system is required to guarantee data backup at least once a week, and the backup media can be stored off-site, with no specific requirements for the backup data processing system and backup network system. For example, it is required to store the data backup on a tape placed in another location in the same city.

  • Level 2: Alternate site support. On the basis of meeting Level 1, it is required to equip part of the data processing equipment required for disaster recovery, or to deploy the required data processing equipment to the backup site within a predetermined time after a disaster; it is also required to equip part of the communication lines and corresponding network equipment, or to deploy the required communication lines and network equipment to the backup site within a predetermined time after a disaster.

  • Level 3: Electronic transmission and equipment support. It is required to conduct at least one full data backup every day, and the backup media is stored off-site, while using communication equipment to transfer critical data to the backup site in batches at regular intervals several times a day; part of the data processing equipment, communication lines and corresponding network equipment required for disaster recovery should be equipped.

  • Level 4: Electronic transfer and complete device support. On the basis of Level 3, it is required to configure all data processing equipment, communication lines and corresponding network equipment required for disaster recovery, which must be in ready or operational status.

  • Level 5: Real-time data transfer and complete device support. In addition to requiring at least one full data backup per day and backup media stored off-site, it also requires the use of remote data replication technology to replicate critical data to the backup site in real time through the communication network.

  • Level 6: Zero data loss and remote cluster support. It is required to realize remote real-time backup with zero data loss; the backup data processing system should have the same processing capability as the production data processing system, and the application software should be “clustered” and can be switched seamlessly in real time.

Table 2.3 exemplifies the disaster recovery levels defined by the Information security technology—Disaster recovery specifications for information systems.

Fig. 2.1 Disaster recovery levels

Table 2.3 Disaster recovery levels

The higher the disaster recovery level, the better the protection of the information system, but this also means a sharp increase in cost. Therefore, a reasonable disaster recovery level needs to be determined for service systems based on the cost-risk balance principle (i.e., balancing the cost of disaster recovery resources against the potential loss due to risk). For example, the disaster recovery capability specified for core financial service systems is Level 6, while non-core services are generally specified as Level 4 or Level 5 depending on the scope of service and industry standards; the disaster recovery level for SMS networks in the telecom industry is Level 3 or Level 4. Each industry should follow the specifications to assess the importance of its own service systems to determine the disaster recovery level of each system.

Different databases provide different backup tools and mechanisms, but all involve various “backup strategies”. Backup strategies are divided into full backup, differential backup and incremental backup according to the scope of data collected; into hot backup, warm backup and cold backup according to whether the database remains in service during the backup; and into physical backup and logical backup according to the backup content.

Full backup, also called complete backup, refers to a complete backup of all data and the corresponding structures at a specified point in time. A full backup contains the most complete data, serves as the basis for differential and incremental backups, and is the most secure backup type, but its backup and recovery time increases significantly as the data volume grows. While important, full backup also comes at a cost in time and expense, and tends to affect the performance of the entire system.

Each full backup involves a large amount of data and takes a long time, so despite offering the highest data security it should not be performed too frequently. A differential backup backs up the data that has changed since the last full backup. An incremental backup backs up the data that has changed since the previous backup of any type, as shown in Fig. 2.2.

Fig. 2.2 Differential and incremental backups

Incremental backups have the advantage of not backing up data repeatedly: each incremental backup involves a small volume of data and takes very little time, but the reliability of every backup in the chain must be guaranteed. For example, when a system failure occurs in the early hours of Thursday morning and the system needs to be restored, the full backup from Sunday and the incremental backups from Monday, Tuesday and Wednesday must all be available and restored in chronological order. If Tuesday’s incremental backup file is corrupted, Wednesday’s incremental backup also becomes unusable, so the data can only be restored to its state at 12:00 PM on Monday.

A differential backup shares the advantages of an incremental backup: the volume of data per backup is small and the backup time is short, provided the availability of the backup data is guaranteed. Restoration only needs the last full backup and the most recent differential backup. For example, if a failure occurs early Thursday morning and the system needs to be restored, simply prepare the full backup from Sunday and the differential backup from Wednesday.

In terms of backup volume, a full backup is the largest, followed by a differential backup and then an incremental backup. Usually, full backup + differential backup is recommended when the backup time window allows. If the changed data volume for a differential backup is too large for the backup to complete within the allowed time window, full backup + incremental backup can be used instead.

Hot backup is performed when the database is running normally, where read/write operations can be performed on the database during the backup period.

Warm backup means that only database read operations can be performed during the backup period, and no write operation is allowed, where the database availability is weaker than hot backup.

Cold backup means that read/write operations are not available during the backup period, and the backup data is the most reliable.

In the case that the database application does not allow the service to stop, a hot backup solution must be used, but the absolute accuracy of the data cannot be guaranteed. In the case where the read/write service of the application can be stopped and the accuracy of the backup data is required, the cold backup solution is preferred. For example, the hot backup solution should be used as much as possible for routine daily backups, while a cold backup solution is recommended in the case of system migration, so as to ensure data accuracy.

A physical backup is a direct backup of the data files corresponding to the database or even the entire disk. Logical backup refers to exporting data from the database and archiving the exported data for backup. The difference between the two is shown in Table 2.4.

Table 2.4 Physical and logical backups

Backup portability means that the backup results of a database can be restored to different database versions and database platforms. In terms of recovery efficiency, a physical backup only needs to restore data files or data blocks directly, which is highly efficient; a logical backup is equivalent to re-executing SQL statements during recovery, so the system overhead is high and recovery is inefficient when the data volume is large. Compared with the strong dependence of physical backup on the physical format of logs, logical backup is based only on logical changes of data, which makes it more flexible and enables cross-version replication, replication to heterogeneous databases, and customization when the table structures of the source and target databases are inconsistent.

Logical backups only support backing up to SQL script files. Logical backups take up less space than physical backups, because the latter copy the data files. It is also possible to back up only metadata, in which case the backup result takes up the least space.
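
For illustration, a logical backup file is essentially a replayable SQL script containing object definitions followed by the data as SQL statements; the table and values below are invented for this sketch.

-- Hypothetical content of a logical backup (export) file
CREATE TABLE t_order (
    order_id  INT PRIMARY KEY,
    order_amt DECIMAL(12, 2)
);

INSERT INTO t_order (order_id, order_amt) VALUES (1, 99.50);
INSERT INTO t_order (order_id, order_amt) VALUES (2, 250.00);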

2.1.4 Security Management

In a broad sense, the database security framework is divided into three levels: network security, operating system security, and DBMS security.

  1. (1)

    Network security. The main technologies for maintaining network security are encryption technology, digital signature technology, firewall technology, and intrusion detection technology. The security at the network level focuses on the encryption of transmission contents. Before transmission through the network, the transmission content should be encrypted, and the receiver should decrypt the data after receiving it to ensure the security of the data in the transmission process.

  2. (2)

    Operating system security. Encryption at the operating system level refers to encrypting the data files stored in the operating system; the core is to ensure the security of the server, mainly in terms of the server’s user accounts, passwords, access rights, etc. Data security is mainly reflected in encryption technology, the security of data storage and the security of data transmission, using technologies such as Kerberos, IPsec, SSL and VPN.

    Kerberos is a computer network authentication protocol used to authenticate communicating parties in a non-secure network by secure means. It was originally designed to provide strong authentication between client and server applications through a key system. In a cluster using Kerberos authentication, the client does not authenticate directly with the server; instead, both parties authenticate each other through the key distribution center (KDC).

    Internet Protocol Security (IPsec) is a family of network protocols (a collection of interrelated protocols) that protects IP communications by encrypting and authenticating IP packets.

    The secure sockets layer (SSL) protocol and its successor, transport layer security (TLS), are security protocols that provide security and data integrity for network communications. TLS and SSL encrypt network connections at the transport layer.

  3. (3)

    DBMS security. Encryption at the DBMS level refers to encrypting and decrypting data as it is read and written, by means of custom functions or built-in system functions; it involves database encryption, data access control, security auditing and data backup.

To summarize, all three levels of security involve encryption. At the network level, the focus is on encrypting the transmission content: the sender encrypts the content before transmission and the receiver decrypts it after receiving, thus securing the transmission. At the operating system level, encryption refers to encrypting the data files stored in the operating system. At the DBMS level, encryption refers to encrypting and decrypting data during reads and writes by means of custom functions or built-in system functions.

Security control means providing protection against intentional and unintentional damage at different levels of the database application system, for example:

  1. (1)

    Encryption of access data → intentional illegal activities.

  2. (2)

    User authentication and restriction of operation rights → intentional illegal operations.

  3. (3)

    Improvement of system reliability and data backup → unintentional damage behavior.

The security control model shown in Fig. 2.3 is only a schematic diagram; every database product nowadays has its own security control model. When a user needs to access a database, he or she first has to enter the database system: the user provides an identity to the application, the application submits the user’s identity to the DBMS for authentication, and only legitimate users can proceed to the next step. When a legitimate user performs a database operation, the DBMS further verifies that the user has the corresponding operation rights; only then is the operation allowed, otherwise it is denied. The operating system also has its own protection measures, such as setting access rights on files and encrypting files stored on disk, so that the data is unreadable even if it is stolen. In addition, multiple copies of data files can be kept to avoid data loss when accidents occur.

Fig. 2.3 Security control model

The authentication of database users is the outermost security protection provided by the DBMS to prevent access by unauthorized users.

As database applications now commonly use the “user name + password” authentication mode, it is necessary to enhance the password strength, the main measures for which are as follows.

  1. (1)

    Longer strings, such as 8–20 characters, should be used.

  2. (2)

    A mixture of numbers, letters and symbols should be used.

  3. (3)

    Passwords should be changed regularly.

  4. (4)

    Passwords should not be used repeatedly.

The password security policy mainly covers password complexity, password reuse, password validity period, password modification and password verification, and prohibits displaying passwords in plain text. In general, interactive login with real-time password entry is recommended; regularly running scripts or programs should be deployed on a specific, trusted server, for which a dedicated password-free login method is configured so that only the code and scripts executed on that server can log in to the database without a password.

GaussDB (for MySQL) sets a password security policy for new database users created on the client side.

  The password length is at least eight characters.

  The password should contain at least one uppercase letter, one lowercase letter, one digit and one special character.

  The password should be changed periodically.
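
As an illustration only, and assuming that GaussDB (for MySQL) accepts MySQL-compatible account management statements, creating a user whose password meets the above rules and expires periodically might look like the following; the user name and password are invented.

-- Hypothetical example: the password satisfies the length and complexity rules above
CREATE USER 'report_user'@'%' IDENTIFIED BY 'Str0ng#Pass2024';

-- Require the password to be changed periodically (here every 90 days)
ALTER USER 'report_user'@'%' PASSWORD EXPIRE INTERVAL 90 DAY;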

Access control is the most effective method of database security management, and also the most problem-prone link. Its basic principle is to assign different rights to different users according to the classification requirements of sensitive data.

  1. (1)

    The principle of least privilege. Grant only the minimum set of rights required; arbitrarily expanding the scope of granted rights is not allowed. For example, a user who only needs to query data should be granted the SELECT right only, not the DELETE or UPDATE rights.

  2. (2)

    Check the key rights. Rights such as DROP, TRUNCATE, UPDATE and DELETE, which cause data to disappear or change, should be granted cautiously, and it is also necessary to check whether the users who have obtained these rights still use them.

  3. (3)

    Check the rights of key database objects. The access rights of system tables, data dictionaries and sensitive database tables should be strictly checked.

Role-based access control (RBAC) is mainly used for rights management in large database systems or systems with a large number of users.

A database “role” is a collection of operations that one user or a group of users can perform in a database. Roles can be created based on different job responsibilities, and users are then assigned to the corresponding roles. Users can easily switch roles or hold multiple roles.

The starting point of RBAC is to remove the direct association between users and database objects: rights are assigned to roles, and users obtain the rights needed to access the corresponding database objects only by being granted the corresponding roles.

For example, if User A wants to query the data of Table T, we can either grant User A the right to query Table T directly, or create Role R, grant the right to view Table T to Role R, and finally grant Role R to User A, as sketched below.
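
A minimal sketch of both approaches in standard SQL, using lowercase identifiers (user_a, r, t) for the user, role and table from the example above; the exact statements supported may vary by database product.

-- Approach 1: grant the right directly to the user
GRANT SELECT ON t TO user_a;

-- Approach 2: role-based access control
CREATE ROLE r;              -- create the role
GRANT SELECT ON t TO r;     -- assign the right to the role
GRANT r TO user_a;          -- assign the role to the user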

Auditing can help database administrators find vulnerabilities in the existing architecture and its usage. Auditing of users and database administrators analyzes and reports on their operations, such as creating, deleting and modifying instances, resetting passwords, backing up and restoring, and creating, modifying and deleting parameter templates.

The levels of database audit are as follows.

  1. (1)

    Access and authentication audit: analysis of database users’ login and logout information, such as login and logout time, connection method and parameters, login path, etc.

  2. (2)

    User and database administrator audit: analysis and reporting on the activities performed by users and database administrators.

  3. (3)

    Security activity monitoring: recording of any unauthorized or suspicious activities in the database and generation of audit reports.

  4. (4)

    Vulnerability and threat audit: identification of possible vulnerabilities in the database and the “users” who intend to exploit them.

Database encryption is divided into two layers: kernel-layer encryption and outer-layer encryption. Kernel-layer encryption means that data is encrypted or decrypted before physical access, which is transparent to database users; if encrypted storage is used, the encryption runs on the server side, which increases the load on the server to some extent. Outer-layer encryption means developing special encryption and decryption tools, or defining encryption and decryption methods, which allows the granularity of the encrypted objects to be controlled and data to be encrypted and decrypted at the table or field level; users only need to focus on the range of sensitive information.

When enabling kernel-layer encryption for high-load systems, it is important to carefully consider its impact on performance as the functionality is enabled at the entire database management system level.

Outer-layer encryption requires extra development time; the encryption and decryption algorithms for different data objects and data types are complicated, and certain business rules still need to be satisfied after encrypting business-critical data, for example, fields such as names may still need to be associated (joined) across tables after being encrypted. Implementing a good encryption engine is therefore a very large project.
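
As a simplified sketch of outer-layer, field-level encryption, assuming the MySQL-compatible built-in functions AES_ENCRYPT and AES_DECRYPT are available; the table, column, key and sample values are invented.

-- Store the sensitive column in encrypted form
CREATE TABLE t_customer (
    cust_id   INT PRIMARY KEY,
    cust_name VARBINARY(256)      -- ciphertext of the name field
);

-- Encrypt on write; only authorized code knows the key
INSERT INTO t_customer (cust_id, cust_name)
VALUES (1, AES_ENCRYPT('Zhang San', 'demo_key'));

-- Decrypt on read
SELECT cust_id,
       CAST(AES_DECRYPT(cust_name, 'demo_key') AS CHAR) AS cust_name
FROM t_customer;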

2.1.5 Performance Management

There are upper limits on the processing capacity of resources: disk space is limited, and CPU frequency, memory size and network bandwidth also have upper limits. Resources are divided into supply resources and concurrency control resources. Supply resources, also called basic resources, are the resources corresponding to the computer hardware, including the resources managed by the operating system; their processing speed is ordered as “CPU > memory >> disk ≈ network”. Concurrency control resources include, but are not limited to, locks, queues, caches and mutex semaphores, which are resources managed by the database system. The basic principle of performance management is to make full use of resources and not to waste them.

Unlike the even supply of resources, the use of resources is uneven. For example, if a distributed system fails to choose a reasonable data sharding method, the nodes with more data will be heavily loaded and their resources strained, while the nodes with less data will be lightly loaded and their resources relatively “idle”, as shown in Table 2.5.

1 ns = 10⁻⁹ s

Table 2.5 Performance indicators

Resource bottlenecks can be traded against each other. For example, a system with low I/O performance but sufficient memory can compensate by consuming more memory and CPU, and a system with limited network bandwidth can improve data transfer efficiency by compressing the transferred data, i.e., using the CPU to handle compression and decompression. This is the optimization idea of trading space for time and time for space.

For database usage, the ideal situation would be infinite resources: a CPU with infinite processing speed, infinite memory, infinite disk space and infinite network bandwidth. In reality, the database always runs in a limited environment. Effective resource management ensures that the database system can meet users’ performance requirements during peak periods, and the meaning of performance management lies in the efficient use of resources. Real-time performance monitoring through the logs or tools provided by the database makes it possible to respond to system problems in a timely manner, analyze existing problems based on historical performance data, identify potential problems, and propose preventive measures based on development trends. The data collected by performance management is also the basis for capacity planning and other forward planning, because data speaks with facts, not with feelings.

Regarding performance management, the basic indicators of a database system are throughput and response time. The performance management goals of OLTP and OLAP should be treated differently, but the two indicators should be analyzed together; neither indicator can be pursued unilaterally. For OLTP, the goal is to provide the highest possible throughput on top of acceptable response times: reduce the resource consumption of each unit of work, pass through shared areas quickly to improve concurrency, and reduce bottleneck constraints. For OLAP, the goal is to reduce response time as much as possible within the limited resources: a job should make full use of resources to shorten its processing time. Taking SQL as an example, SQL optimization for OLTP should minimize the resources used by each SQL statement, whereas OLAP systems require SQL to make the best possible use of the available resources. For OLAP systems processing batch jobs, the higher the resource utilization the better, provided the jobs complete within the required time window.

Some of the main scenarios to which performance optimization applies are as follows.

  1. (1)

    Performance optimization when going online or when performance falls below expectations. For OLTP systems it may become obvious only after going online that performance fails to meet expectations, because the development and test environments tend to focus on functional development, and even the stress tests only cover certain typical query types. For OLAP batch operations, performance in a full-data or historical-data environment can differ greatly from that observed with a small volume of sample data.

  2. (2)

    Performance optimization when the response gradually becomes slower some time after going online. As the data volume and the business grow, the data model and data scale of the system deviate from the original design, and performance changes accordingly. In this case, it is usually necessary to analyze long-term accumulated performance data to find out which factors are involved.

  3. (3)

    System optimization for a sudden slowdown during system operation (emergency processing). In the emergency processing scenario, performance problems never occur without a reason. Sudden performance changes are often caused by changes such as newly developed business being put into production, new requirement changes, DDL changes, unexpected configuration changes, database upgrades, etc. Such problems are generally highly urgent and often require the intervention of experienced personnel and a quick response.

  4. (4)

    Performance optimization for the situation where the system suddenly becomes slow and then returns to normal after a period of time. This is generally due to bottlenecks that limit throughput during peak periods, and capacity expansion is the simplest solution. However, because of the extra investment and the time required, this method needs to be supported by sufficient resources. A more natural solution is to reduce the number of operations per unit of time (concurrency control) or to reduce the resource consumption of each operation.

  5. (5)

    System optimization based on reducing resource consumption. In this scenario, the system generally does not suffer from obvious performance problems; the emphasis is on the effectiveness of resource usage, and there is ample time and little pressure. An example is analyzing and optimizing the ten jobs in the application that consume the most resources and have the longest response times.

  6. (6)

    Preventive daily inspection. Inspection work is generally applicable to scenarios where the whole system does not have obvious performance problems.

The data to be collected for performance management include CPU usage data, space utilization, users and roles using the database system, response time of heartbeat queries, performance data submitted to the database with SQL as the basic unit, and job-related performance data submitted by database tools (such as load, unload, backup, restore, etc.). As far as the timing of data collection is concerned, some daily data collection can be arranged, or data collection can be carried out during the time period when users use the system intensively in one day, or during the time period when the system pressure is relatively high.

After data collection is completed, corresponding performance reports need to be generated. For example, periodic performance reports or performance trend analysis reports. There are many monitoring reports that can be extracted in the database system, for example, regular performance reports (daily, weekly and monthly reports) can be established by using performance-related data; the performance trend analysis report can be established by using common indicators to obtain an intuitive display of the current system performance; you can also generate reports of specific trend types, such as reports based on abnormal events, reports of SQL or jobs that consume a lot of resources, reports of resource consumption of specific users and user groups, and reports of resource consumption of specific applications.

Built-in resource views and monitoring reports are advanced features: some databases provide them, while others do not.

2.1.6 O&M Management

  1. 1.

    Database installation

    The basic principles adopted by different database products are similar, but each product has its own characteristics and precautions, which users need to understand and learn before installation.

    The first is the installation of the database, the process of which is shown in Fig. 2.4.

    The premise of database installation is some basic preparations, mainly as follows.

    1. (1)

      To understand the theory of relational database.

    2. (2)

      To understand the knowledge of operating system.

    3. (3)

      To understand the characteristics of database products and server architecture.

      The software architecture is the composition of the components within the database product. What needs to be understood is which components are basic, major, and must be installed, and which are optional.

      Generally speaking, for the network architecture it is necessary to understand the networks required for the database server to run and be managed, as well as the planning of the database network.

      The database network refers to the internal network used by the database for data synchronization between the primary host, the standby hosts and any cascaded standbys.

      Management network generally refers to the communication network used between management module and agent module.

      To understand the server architecture is to understand the considerations of building the product in single-host mode, master-standby mode, cluster mode or distributed mode.

    4. (4)

      To know and understand the proper nouns and specific terms of the target database. The meaning of the same word may vary greatly between database products and cannot be generalized.

    5. (5)

      To read the installation manual, especially the installation notes.

  2. 2.

    Database uninstallation

    Before the database is upgraded, it is necessary to uninstall and clean up the old version of the database. The basic steps of traditional database uninstallation are as follows.

    1. (1)

      (Optional) Make a full backup of the database.

    2. (2)

      Stop the database service.

    3. (3)

      Uninstall the database.

      The basic steps of cloud database uninstallation are as follows.

    4. (1)

      (Optional) Make a full backup of the database.

    5. (2)

      Delete the data instance from the cloud platform.

      The uninstallation approaches are similar for single-host, master-standby and one-master-multi-standby architectures, and the uninstallation operations to be performed on each node are the same. Uninstallation of distributed clusters generally uses proprietary uninstallation tools. Some users need to destroy the data on the storage media after uninstalling the database in order to prevent data leakage.

  3. 3.

    Database migration

    Database migration requires designing different migration schemes for different migration scenarios. The factors to be considered are as follows:

    1. (1)

      the available time window for migration;

    2. (2)

      the available tools for migration;

    3. (3)

      whether the data source system stops writing operations during the migration process;

    4. (4)

      the network conditions between the data source system and the target system during the migration;

    5. (5)

      estimated backup recovery time based on the volume of data migrated;

    6. (6)

      audit the data consistency between the source database system and the target database system after migration.

      Assessing the network situation during database migration mainly provides a reference for deciding whether a direct data connection can be used. If the network condition at both ends is good, direct migration without landing the data on disk can be more efficient, because it avoids the large disk I/O overhead of landing the data. For the data consistency audit, a quick comparison method is to count the records of the same table on both sides and confirm that the record counts are the same. Comparing the results of aggregation operations on specific columns can also be used, for example summing an amount field and comparing the results, or grouping by a date field and checking whether the record counts for each day match, as sketched at the end of this item.

      Data migration often faces the challenge of completing a huge amount of work within a limited time, and designing multiple scenarios and contingency plans is a prerequisite for successful data migration.
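
      A minimal SQL sketch of such consistency checks, to be run on both the source and the target database; the table and column names are invented for this example.

      -- 1. Compare the total record count of the same table on both sides
      SELECT COUNT(*) AS row_cnt FROM t_order;

      -- 2. Compare an aggregate over a specific column (e.g., the sum of an amount field)
      SELECT SUM(order_amt) AS total_amt FROM t_order;

      -- 3. Group by the date field and compare the daily record counts
      SELECT order_date, COUNT(*) AS daily_cnt
      FROM t_order
      GROUP BY order_date
      ORDER BY order_date;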

  4. 4.

    Database expansion

    The capacity of any database system is determined by estimating the volume of data at a certain future point in time. When determining capacity, not only the data storage volume should be considered; the following shortcomings should also be avoided:

    1. (1)

      Inadequacy of computing power (average daily busy level of CPU of the whole system > 90%).

    2. (2)

      Insufficient response and concurrency capability (qps and tps are significantly reduced, failing to meet the SLA).

      SLA is the abbreviation of Service Level Agreement. When signing a contract with a customer, performance commitments are generally made, for example, that the database system provided can handle 10,000 queries per second, that the response time of a single query does not exceed 30 ms, and that other database-related service indicators are met. An SLA may also include service commitments such as 7 × 24 response.

    3. (3)

      Insufficient data capacity. The definition of insufficient capacity is different between OLTP and OLAP .

      The differences between vertical and horizontal capacity expansion are as follows.

    4. (1)

      Vertical expansion refers to upgrading the hardware of the database server, such as adding memory, adding storage, increasing network bandwidth, and improving the performance configuration of the hardware. This method is relatively simple, but it eventually hits the bottleneck of single-node hardware performance.

    5. (2)

      Horizontal expansion refers to increasing the number of servers horizontally, taking advantage of the servers in the cluster to improve the overall system performance.

      The differences between downtime expansion and smooth expansion are as follows:

    6. (1)

      Downtime expansion is simple, but it is limited by the time window: once a problem occurs the expansion fails, and if it takes too long it will not be acceptable to customers.

    7. (2)

      Smooth expansion has no impact on database services, but its technical implementation is relatively complex. In particular, the complexity of expansion rises sharply as the number of database servers increases.

  5. 5.

    Routine maintenance

    For routine maintenance, a rigorous work plan should be formulated for each job and implemented, so that risks are checked and the database system operates safely and efficiently.

    Database troubleshooting mainly involves the following matters.

    1. (1)

      Configure database monitoring indicators and alarm thresholds.

    2. (2)

      Set the alarming process for the fault events at each level.

    3. (3)

      Receive the alarm and locate the fault according to the logs.

    4. (4)

      Record the original information in detail for the problems encountered.

    5. (5)

      Strictly abide by the operating procedures and industry safety regulations.

    6. (6)

      For major operations, the operation feasibility should be confirmed before operation, and the operation personnel with authority should perform them after the corresponding backup, emergency and safety measures are in place.

      Database health inspection mainly involves the following matters.

    7. (1)

      View health inspection tasks.

    8. (2)

      Manage health inspection reports.

    9. (3)

      Modify the health inspection configuration.

Fig. 2.4 Database installation

2.2 Key Concepts of Database

2.2.1 Database and Database Instance

A database system is made for managing data, and a database is actually a collection of data, physically expressed as a collection of operating system files or disk data blocks, such as data files, index files and structure files. But not all database systems are file-based; there are also databases that write data directly into memory.

A database instance refers to a set of processes in the operating system and the memory allocated for these processes; it is the channel through which the database is accessed. Generally speaking, one database instance corresponds to one database, as shown in Fig. 2.5.

Fig. 2.5 Database instance

A database is a collection of physically stored data, and a database instance is the collection of software processes, threads and memory that access data. Oracle is process-based, so its instance refers to a series of processes; and a MySQL instance is a series of threads and the memory associated with the threads.

Multi-instance means building and running multiple database instances on one physical server, each listening on a different port through a different socket and each having a separate parameter file. Multi-instance operation can make full use of hardware resources and maximize the service performance of the database.

A distributed database presents a unified instance externally and generally does not allow users to connect directly to the instances on data nodes. A distributed cluster is a set of mutually independent servers that form a computer system through a high-speed network. Each server may hold a complete or partial copy of the database, and all servers are connected to each other through the network, together forming a complete, global, large-scale database that is logically centralized and physically distributed.

Multi-instance and distributed cluster are shown in Fig. 2.6.

Fig. 2.6 Multi-instance and distributed cluster

2.2.2 Database Connection and Session

A database connection is a physical communication connection: the network connection established between a client and a dedicated or shared server. When establishing a connection, connection parameters must be specified, such as the server host name or IP address, the port number, and the user name and password for the connection.

A database session refers to the logical connection between the client and the database; it is a context that lasts from the beginning to the end of the communication. This context resides in the memory of the server and records the client of the connection, the corresponding application process number, the corresponding user login information, and other information.

The session and the connection are established simultaneously; they describe the same thing at different levels, as shown in Fig. 2.7. Simply put, the connection is the physical communication link between the client and the server, while the session is the logical communication interaction between the user and the server. In a database connection, the dedicated server is an instance on the database server; the scheduling server generally refers to the server in a distributed cluster that hosts the external interface component, which in GaussDB (DWS) corresponds to the coordinator node (CN).

Fig. 2.7 Flow chart of establishing a database connection

Frequent creation and closing of database connections is costly; the allocation and release of connection resources becomes a bottleneck and erodes the performance of the database system. A connection pool is used to reuse database connections and is responsible for allocating, managing and releasing them. It allows an application to reuse an existing database connection instead of creating a new one, enabling efficient and secure reuse of connections.

The basic idea of a connection pool is to store database connections as objects in memory during system initialization. A user who needs to access the database then does not establish a new connection, but directly takes an established idle connection object from the pool, as shown in Fig. 2.8. Instead of closing the connection after use, the user returns it to the pool for the next request. Connection creation and disconnection are managed by the connection pool itself, and the initial number of connections, the upper and lower limits on the number of connections, the maximum number of uses per connection and the maximum idle time can be controlled through the pool’s parameters. Alternatively, the number and usage of database connections can be monitored through the database’s own management mechanism. Connections also vary by database product: Oracle’s connection overhead is large, while MySQL’s is relatively small. For highly concurrent service scenarios with many accumulated connections, database administrators should also consider the overall connection cost of the whole database.

Fig. 2.8 Database connection pool

2.2.3 Schema

A schema is a collection of related database objects that allows multiple users to share the same database without interfering with each other. Schemas organize database objects into logical groups for easier management and form namespaces to avoid object name conflicts. A schema can contain tables and other database objects, data types, functions, operators, etc.

“table_a” shown in Fig. 2.9 indicates tables with the same name. Since they belong to different schemas, they are allowed to use the same name, but in fact they may store different data and have different structures. When accessing one of the tables with the same name, it is necessary to specify the schema name to explicitly point to the target table.

Conceptually, a schema is a set of interrelated database objects. Different databases may adopt different concepts to reflect the schema, so database users generally use the English word “schema” to express the concept.
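
As a brief illustration, using standard SQL syntax that most products support in some form (the schema names sales and finance and the table columns are invented), two same-named tables can coexist in different schemas and are referenced with schema-qualified names.

-- Two schemas, each with its own table_a
CREATE SCHEMA sales;
CREATE SCHEMA finance;

CREATE TABLE sales.table_a   (id INT, region  VARCHAR(20));
CREATE TABLE finance.table_a (id INT, account VARCHAR(20));

-- The schema name makes explicit which table_a is meant
SELECT * FROM sales.table_a;
SELECT * FROM finance.table_a;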

Fig. 2.9 Schema

2.2.4 Tablespace

A tablespace is composed of one or more data files; through tablespaces you can define where database object files are stored. All objects in the database are logically stored in tablespaces and physically stored in the data files belonging to those tablespaces.

The function of a tablespace is to arrange the physical storage location of data according to the usage pattern of database objects, so as to improve database performance. For example, frequently used indexes can be placed on disks with stable performance and fast access speed, while archived tables that are used infrequently and require lower access performance can be placed on slower disks.

You can also specify the physical disk space occupied by data through tablespaces and set the upper limit of physical space usage to avoid running out of disk space.

Since tablespaces correspond to physical data files, they effectively associate data with storage: the tablespace specifies the storage location of database objects such as tables and indexes. After the database administrator creates a tablespace, it can be referenced when creating database objects, as sketched below.
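
A minimal sketch of creating and then referencing a tablespace, assuming MySQL-style general tablespace syntax is accepted by GaussDB (for MySQL); the tablespace, data file and table names are invented.

-- Create a tablespace backed by a specific data file
CREATE TABLESPACE ts_fast ADD DATAFILE 'ts_fast.ibd';

-- Reference the tablespace when creating a table
CREATE TABLE t_hot_data (
    id  INT PRIMARY KEY,
    val VARCHAR(50)
) TABLESPACE ts_fast;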

Figure 2.10 shows the tablespaces of GaussDB (for MySQL). When the database is created, the system predefines six tablespaces: the SYSTEM, TEMP, TEMP2, TEMP2_UNDO, UNDO and USERS tablespaces, as described in Table 2.6.

Fig. 2.10 Tablespaces

Table 2.6 Tablespaces within GaussDB (for MySQL)

The TEMP tablespace holds the intermediate result sets of SQL statements and is used by ordinary user temporary tables. When DML operations (insert, update, delete, etc.) are executed, the old data as it was before the operation is written to the UNDO tablespace, which is mainly used to implement transaction rollback, database instance recovery, read consistency and flashback queries.

2.2.5 Table

In a relational database, a database table is a collection of two-dimensional arrays that represents the relationships between the stored data objects. Each row in the table is called a record and consists of several fields; each column is called a field and has two attributes, a column name and a data type, as shown in Table 2.7.

Table 2.7 Database table

GaussDB (for MySQL) supports the creation of temporary tables, which are used to hold the data needed for a session or a transaction. When a session exits, or when a user commits or rolls back a transaction, the data in the temporary table is automatically cleared, but the table structure remains.

The data in a temporary table is temporary and procedural, with no need to be retained permanently like a normal data table.

Temporary tables cannot be displayed using the [SHOW TABLES] command.

To avoid deleting a permanent table with the same name, use the [DROP TEMPORARY TABLE staff_history_session;] command when dropping the table structure.

The data in the temporary table exists only for the life of the session and is automatically cleared when the user exits the session and the session ends, as shown below.

CREATE TEMPORARY TABLE staff_history_session ( startdate DATE, enddate DATE );

Temporary tables with the same name can be created for different sessions. The name of the temporary table can be the same as the name of the permanent table.

Execute the following command to create a temporary table in GaussDB (for MySQL):

CREATE TEMPORARY TABLE test_tpcds (a INT, b VARCHAR (10));

2.2.6 How the Table Is Stored

According to the way data is stored, tables are divided into row-store and column-store tables, as shown in Fig. 2.11. GaussDB (for MySQL) currently supports only row store, while GaussDB (DWS) supports both row store and column store, with row store as the default. The two modes differ only in how the data is physically stored: in their presentation, tables in both modes still hold two-dimensional data that accords with the relational theory of relational databases.

Fig. 2.11 How the table is stored

A row-store table stores all the columns of a row together, so a record can be written in a single operation when performing INSERT and UPDATE; when querying, however, all the data of a row is read even if only a few columns are needed.

A column-store table splits each row when writing data: the row is split into multiple columns, and the data of the same column is stored in adjacent physical areas. Therefore, in column-store mode, writing one row record requires noticeably more write operations than in row-store mode, which leads to higher overhead and poorer performance than a row-store table for INSERT and UPDATE operations. When querying, however, a column-store table scans and reads only the columns involved, so the I/O scan and read range is much smaller than for a row-store table. A column-store query can eliminate irrelevant columns: if only a few columns need to be queried, the amount of data to be read is greatly reduced and the query is accelerated. In addition, in a column-store table each column holds data of the same data type, and data of the same type can be compressed with a lightweight compression algorithm to achieve a good compression ratio, so a column-store table occupies relatively little space.

Row store tables, on the other hand, are difficult to compress, because the field types within a row are not uniform and cannot be compressed dynamically unless they are determined in advance.

Regarding the choice of store mode, row store is the default. Column store suits queries of the statistical-analysis type (scenarios with many GROUP and JOIN operations), OLAP, data mining and other applications that issue a large number of query requests. One of the main advantages of column storage is that it greatly reduces the I/O consumed during reads, especially when querying massive data, and I/O has always been one of the main bottlenecks of such systems. Row store is suitable for point queries (simple, index-based queries that return few records), lightweight OLTP-style transactions, and scenarios that involve many write operations, that is, frequent data additions, deletions and changes. In short, row store is better suited to OLTP, such as traditional applications based on add, delete, change and query operations; column store is better suited to OLAP and works well in the data warehousing field, such as data analysis, mass storage and business intelligence, which mainly involve infrequently updated data.
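The store mode of a GaussDB (DWS) table can be chosen when the table is created. The following is a minimal sketch, assuming the GaussDB (DWS) WITH (ORIENTATION = ...) storage parameter; the table and column names are hypothetical.

-- Row store table (the default store mode); suited to OLTP-style point queries and frequent writes.
CREATE TABLE sales_row ( sale_id INTEGER, sale_amt NUMERIC(9,2) ) WITH (ORIENTATION = ROW);

-- Column store table; suited to OLAP-style statistical analysis that touches only a few columns.
CREATE TABLE sales_col ( sale_id INTEGER, sale_amt NUMERIC(9,2) ) WITH (ORIENTATION = COLUMN);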

2.2.7 Partition

A partitioned table is obtained by dividing the data of a large table into many small subsets of data. The main types of partitioned tables are as follows.

  1. (1)

    Range-partitioned table: The data is mapped to each partition based on a range determined by the partition key specified when the partition table is created. This is the most commonly used partition method, and the date is often used as the partitioning key, for example, the sales data is partitioned by month.

  2. (2)

    List-partitioned table: The data is mapped to partitions according to discrete values (a list of values) of the partition key specified when the table is created, for example, partitioning sales data by region; a hedged sketch in MySQL-style syntax follows the range-partition example below.

  3. (3)

    Hash-partitioned tables: In many cases, users cannot predict the range of data changes on a particular column, and therefore cannot create a fixed number of range partitions or list partitions. In this case, hash-partitioned tables provide a way to divide the data equally among a specified number of partitions, so that the data written to the table is evenly distributed among the partitions; however, the user cannot predict which partition the data will be written to. For example, if the sales cities are spread all over the country, it is difficult to partition the table in a list, and then the table can be hash-partitioned.

  4. (4)

    Interval-partitioned table: It is a special kind of range-partitioned table. For an ordinary range partition, users pre-create partitions, and if inserted data does not fall into any existing partition, the database reports an error; the user then has to add the partition manually or use interval partitioning instead. For example, a user may use a range-partitioned table with one partition per day and create a batch of partitions (e.g., for 3 months) when the service is deployed, but after 3 months new partitions must be created again, otherwise subsequent data entry will report an error. This approach increases maintenance costs unless the kernel supports automatic partition creation. With interval partitioning, the user does not need to care about creating subsequent partitions, which reduces partition design and maintenance costs.

Example: The code for range-partitioning a date is as follows.

CREATE TABLE tp ( id INT, name VARCHAR(50), purchased DATE ) PARTITION BY RANGE( YEAR(purchased)) ( PARTITION p0 VALUES LESS THAN (2015), PARTITION p1 VALUES LESS THAN (2016), PARTITION p2 VALUES LESS THAN (2017), PARTITION p3 VALUES LESS THAN (2018), PARTITION p4 VALUES LESS THAN (2019), PARTITION p5 VALUES LESS THAN (2020) );
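For comparison, the following are hedged sketches of list partitioning and hash partitioning in the same MySQL-style syntax; the tables, columns and value lists are hypothetical.

-- List partitioning: rows are mapped to partitions by discrete region_id values.
CREATE TABLE sales_by_region ( id INT, region_id INT ) PARTITION BY LIST(region_id) ( PARTITION p_north VALUES IN (1, 2, 3), PARTITION p_south VALUES IN (4, 5, 6) );

-- Hash partitioning: rows are spread evenly across a fixed number of partitions.
CREATE TABLE sales_by_city ( id INT, city_id INT ) PARTITION BY HASH(city_id) PARTITIONS 8;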

The advantages of partitioned tables are as follows.

  1. (1)

    Improved query performance: When querying partitioned objects, you can search only the partitions you care about (also known as partition pruning), which improves retrieval efficiency.

  2. (2)

    Enhanced availability: If a partition in a partitioned table fails, the data in other partitions of the table is still available.

  3. (3)

    Easy maintenance: If a partition of a partitioned table fails and you need to repair the data, you can repair only that partition.

  4. (4)

    Balanced I/O: Different partitions can be mapped to different disks to balance I/O and improve the performance of the whole system.

When the data covered by the query condition is located in one partition, the SQL statement only needs to scan that partition during the query instead of scanning the whole table, as shown in Fig. 2.12. Suppose the table contains 10 years of data: without partitioning, all 10 years of data have to be scanned to compute the result; with a partitioned table, only 1 year of data in one partition needs to be scanned, so the amount of scanned data is only 1/10.

Fig. 2.12 Partition pruning
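As a hedged illustration of partition pruning with the tp table defined above (whether pruning actually happens depends on the optimizer), the following query only needs to touch partition p2, which holds the 2016 purchases.

-- Only partition p2 (YEAR(purchased) = 2016) needs to be scanned instead of the whole table.
SELECT * FROM tp WHERE purchased BETWEEN '2016-01-01' AND '2016-12-31';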

Table 2.8 shows the applicable scenarios of partitioning.

  • Scenario 1 (Row 1 in Table 2.8): This usually occurs in the WHERE clause, where the filter condition uses the partition field and equals a value that maps to one partition, or a “BETWEEN...AND...” condition covers only a few partitions. When the query statement scans data, partition pruning lets it search only the relevant partitions instead of scanning the whole table. In general, the I/O overhead of partition scanning is n/m of scanning the entire table, where m is the total number of partitions and n is the number of partitions that satisfy the WHERE condition.

  • Scenario 2 (Row 2 in Table 2.8): Inserting data into an empty partition is similar to loading data into an empty table, and the efficiency of inserting data is higher with this internal implementation.

  • Scenario 3 (Row 3 in Table 2.8): If data is to be deleted or truncated, the relevant partitions can be processed directly; the quick positioning and deletion capability of partitions makes this much more efficient than the same operation without partitioning.

Table 2.8 Applicable scenarios of partitioning

2.2.8 Data Distribution

The data tables of GaussDB (DWS ) distributed database are scattered on all data nodes (DNs), so you need to specify the distribution columns when you create the tables, as shown in Table 2.9.

Table 2.9 Data distribution

The sample code for the Hash distribution is as follows.

CREATE TABLE sales_fact ( region_id INTEGER, depart_id INTEGER, product_id INTEGER, sale_amt NUMERIC (9,2), sale_qty INTEGER ) DISTRIBUTE BY HASH(region_id,depart_id,product_id);

The sample code for the Replication distribution is as follows.

CREATE TABLE depart_dim ( depart_id INTEGER, depart_name VARCHAR2(60) ) DISTRIBUTE BY REPLICATION;

The data policy selection is shown in Fig. 2.13.

Fig. 2.13 Data policy selection

It should be noted that GaussDB (DWS) supports only the Hash and Replication distribution methods, while GaussDB (for MySQL), which is based on shared storage, does not involve data distribution at present.

2.2.9 Data Types

Data in the database is classified into basic types, compound types, sequence (serial) types and geometry types. Basic types include numeric, character, binary, date and time, Boolean, enumeration and other types, as shown in Table 2.10.

Table 2.10 Data types

FLOAT and DOUBLE floating-point numbers are inexact and sacrifice accuracy. Inexact means that some values cannot be converted exactly into the internal format and are stored as approximations, so some precision may be lost when the data is stored and then output. Therefore, in applications with strict precision requirements such as financial calculations, high-precision data types like DECIMAL and NUMERIC should be preferred.

The CHAR type is a fixed-length string that automatically pads the remaining positions when the inserted string is shorter than the defined length. For example, with a definition of CHAR(10), when the string “abc” is inserted, it is padded with 7 spaces so that the whole string is 10 bytes long.

Basic data type is the built-in data type of the database, including INTEGER, CHAR, VARCHAR and other data types.

Regarding the field design, considering the query efficiency, the design suggestions are as follows.

  1. (1)

    Give priority to efficient data types. It is recommended to use data types with higher execution efficiency wherever possible. Generally speaking, operations on integer data (including the usual comparison operations =, >, <, ≥, ≤, ≠, as well as GROUP BY) are more efficient than on strings and floating-point numbers. Shorter data types are also recommended: they not only reduce the size of data files and improve I/O performance, but also reduce memory consumption during related calculations and improve computational performance. For example, for integer data, prefer SMALLINT over INT and INT over BIGINT where the value range allows. Also note that when inserting data with SQL statements, the statement will not report an error if characters are truncated, so ensure that the specified maximum length is greater than the maximum number of characters to be stored in order to avoid truncation.

  2. (2)

    Use consistent data types. Try to use the same data type for the associated columns in the table; otherwise, the database must dynamically convert them into the same data type for comparison, which will bring some performance overhead. When there is a logical relationship between multiple tables, the fields representing the same meaning should use the same data type.

For string data, it is recommended to use the data type of variable-length string and specify the maximum length. Ensure that the specified maximum length is greater than the maximum character number to be stored to avoid truncation of characters when the maximum length is exceeded.
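The following is a minimal sketch applying these field-design suggestions; the table and column names, lengths and types are hypothetical.

CREATE TABLE member (
  member_id INT,          -- integer comparisons and GROUP BY are generally faster than strings
  member_level SMALLINT,  -- the shortest integer type that covers the expected value range
  member_name VARCHAR(60) -- variable-length string whose maximum length exceeds the longest expected value
);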

2.2.10 View

Unlike a base table, a view is not physically stored; it is a virtual (dummy) table. If the data in the base table changes, the data queried from the view changes accordingly. In this sense, a view is a window through which the user sees the data of interest in the database and its changes, and the view's defining query is executed each time the view is referenced.

The “author_v1” view shown in Fig. 2.14 splits the data vertically: only two columns of the base table are visible through the view, and the other columns are hidden. The “author_v2” view splits the data horizontally: only the rows whose age value is greater than 20 are visible, but all columns are visible. In either case, the data of the “author_v1” and “author_v2” views is not actually stored in the database. When the user accesses a view with a SELECT statement, the data in the underlying base table is accessed through the view, which is why a view is called a “dummy table”. To the user, accessing a view is exactly the same as accessing a table.

Fig. 2.14 View

The main functions of a view are as follows.

  1. (1)

    Simplifies operations. Queries often have to use aggregate functions, display other fields and join other tables, which leads to long statements. If such a query is used frequently, we can create a view and then simply execute SELECT * FROM view.

  2. (2)

    Improves security. Users can query and modify only the data they are allowed to see, because the view is virtual and not physically present; it merely presents a collection of data. The view is a dynamic collection: its data is updated as the base table is updated. We can expose the important fields of the base table to the user through the view, while the user cannot change or delete the underlying data at will, which protects the security of the data.

  3. (3)

    Achieves logical independence and shields the impact from the structure of real tables. Views allow the application and database tables to be somewhat independent of each other. Without a view, the application must be built on top of the table; but with a view, the application can be built on top of the view. The application is separated from the database table by the view.

The following sample code encapsulates more complex logic through views.

CREATE VIEW stu_class(id,name,class) AS SELECT student.s_id,student.name,stu_info.class FROM student, stu_info WHERE student.s_id=stu_info.s_id;

The user uses the same simplified SQL query statement as the normal table, with the code shown below.

SELECT * FROM stu_class WHERE class='Beijing';

However, the view also has its limitations, mainly as follows.

  1. (1)

    Performance issues: A query on a view may look simple, but the statement encapsulated by the view can be complex, and the database still has to execute that complex statement, so the query may be slower than expected.

  2. (2)

    Modification restrictions: For complex views, users cannot modify the base table data through the view.

However, if the view is a single table queried directly using the SELECT statement as follows:

CREATE VIEW v_abc(a,b,c) AS SELECT a,b,c FROM tableA;

This form is called a simple view, which enables the modification of the table through the view, for example, using the “UPDATE v_abc SET a=‘101’ WHERE b=‘xxxx’;” statement.

However, if the view contains aggregate functions, summary functions, or GROUP BY grouping calculations, or if it is the result of joining multiple tables, it is a complex view and cannot be used to modify the base table data.
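The following is a hedged example of such a complex view: because it contains GROUP BY and an aggregate function, base-table data cannot be modified through it. The view name is hypothetical; stu_info is the table used in the earlier example.

-- Counts students per class; a complex view like this is read-only with respect to the base table.
CREATE VIEW class_count(class, student_cnt) AS SELECT stu_info.class, COUNT(*) FROM stu_info GROUP BY stu_info.class;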

2.2.11 Index

An index provides pointers to data values stored in specified columns of a table, like a table of contents of a book. It can speed up table queries, but also increase the processing time of insertion, update, and deletion operations.

If you want to add an index to a table, which fields to build the index on is a question that must be considered before creating it. It is also necessary to analyze the application's business processing and data usage, and to identify the fields that are often used as query conditions or for sorting, in order to decide whether to create an index.

When creating indexes, the following suggestions are used as a reference.

  1. (1)

    Create indexes on columns that are frequently required to be searched and queried, which can speed up the search and query.

  2. (2)

    Create an index on a column that is used as the primary key, which emphasizes the uniqueness of the column and organizes the arrangement structure of the data in the table.

  3. (3)

    Create indexes on columns that often need to be searched based on ranges as the ordering of indexes can ensure the continuity of the specified ranges.

  4. (4)

    Create indexes on columns that need to be ordered frequently as the ordering of indexes can reduce query time.

  5. (5)

    Create indexes on columns that are often used in the WHERE clause to speed up condition evaluation.

  6. (6)

    Create indexes for fields that often follow the keywords ORDER BY, GROUP BY, and DISTINCT.

A created index may not always be used; after the index is created successfully, the system automatically decides when to use it. Indexes are used when the system judges that using them is faster than scanning the table sequentially. A successfully created index must be kept synchronized with the table so that new data can be found accurately, which increases the overhead of data modification operations. Useless indexes should therefore be removed periodically, and the EXPLAIN statement can be used to view the execution plan and determine whether an index is used, as shown below.
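As a hedged example (the staff table, staff_id column and index name are hypothetical), the execution plan can be checked as follows to see whether the optimizer chooses the index.

CREATE INDEX idx_staff_id ON staff(staff_id);
-- EXPLAIN shows the execution plan, including whether idx_staff_id is used for this query.
EXPLAIN SELECT * FROM staff WHERE staff_id = 100;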

The indexing methods are shown in Table 2.11.

Table 2.11 Indexing methods

If a table declares a unique constraint or primary key, a unique index (possibly a multi-field index) is automatically created on the fields that make up the unique constraint or primary key to implement those constraints.

Create a normal index

CREATE INDEX index_name ON table_name(col_name);

Create a unique index

CREATE UNIQUE INDEX index_name ON table_name(col_name);

Create a normal combined index

CREATE INDEX index_name ON table_name(col_name_1,col_name_2);

Create a unique combined index

CREATE UNIQUE INDEX index_name ON table_name(col_name_1,col_name_2);

Create a full-text index

CREATE FULLTEXT INDEX index_contents ON article(contents);

2.2.12 Constraints

Data integrity refers to the correctness and consistency of data. Integrity constraints can be defined at the time of defining a table. Integrity constraint itself is a rule that does not occupy database space. Integrity constraints are stored in the data dictionary together with the table structure definitions.

Figure 2.15 shows the common constraint types, as follows.

  1. (1)

    Unique (UNIQUE) and primary key (PRIMARY KEY) constraints. When the values in a field should never be duplicated, you can add a unique constraint to that field, for example an ID card field or an employee number field. If a table has no unique constraint, duplicate records can appear in it. If a field can be guaranteed to satisfy both the unique constraint and the not-null constraint, a primary key constraint can be used; usually a table can have only one primary key constraint.

  2. (2)

    Reference (foreign key) constraint is used to establish a relationship between two tables; it is necessary to specify which column of the primary (referenced) table is referenced.

  3. (3)

    Check constraint is a constraint on the range of legal values in a field. For example, the balance in the savings account table is not allowed to be negative, so a check constraint can be added to the balance field so that the balance field takes a value ≥ 0.

  4. (4)

    Not-null constraint. If the current field should not have null values or unknown data in service sense, you can add not-null constraints to ensure that the inserted data are all not-null data, such as the ID card field of personal information.

  5. (5)

    Default constraint. When inserting data, if no specific value is given, then the default constraint will be used to give a default initial value, for example, if the default value of the initial member’s rank is 0, when a new member record is added, the member’s rank will be 0.

If field values can always be filled in at the service level, it is recommended not to use a default constraint, to avoid unintended results when data is loaded. Add a not-null constraint to a field that clearly should never be null, so that the optimizer can optimize it automatically, and explicitly name constraints where explicit naming is allowed. Explicit naming is supported for all constraint types except not-null and default constraints.

Fig. 2.15 Common constraint types

If a default constraint is used, a value is in effect assigned by default in unexpected cases, and such default values may hide potential problems. For OLAP systems, therefore, default constraints should be used carefully or sparingly; in OLTP systems they are relatively more common, as sketched below.
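The following is a minimal sketch that combines the common constraint types in one table definition. The table, columns and rules are hypothetical, the REFERENCES clause assumes the depart_dim table shown earlier, and whether inline CHECK and foreign key clauses are enforced depends on the specific engine.

CREATE TABLE account (
  account_id INT PRIMARY KEY,                     -- primary key constraint: unique and not null
  id_card CHAR(18) UNIQUE NOT NULL,               -- unique and not-null constraints
  member_rank INT DEFAULT 0,                      -- default constraint: initial rank is 0
  balance DECIMAL(12,2) CHECK (balance >= 0),     -- check constraint: balance must not be negative
  depart_id INT REFERENCES depart_dim(depart_id)  -- foreign key constraint referencing the primary table
);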

A summary of database objects is shown in Fig. 2.16.

Fig. 2.16 Relationship between database objects

Schema: A database can contain one or more named schemata. A schema is a logical concept, including tables, indexes and other database objects.

Tablespace: A tablespace is used to specify where to store database objects such as tables and indexes in the database, which is a physical concept. After the database administrator creates a tablespace, he or she can refer to it when creating database objects.

Table: A tablespace can contain multiple tables. All the data in the database exists in the form of tables, and tables are built in the database. Tables with the same name can be stored in different databases or in different schemas of the same database.

  • Schema 1 includes objects Table 0, Table 1 and Table 2.

  • Schema 2 includes objects Table 3, Table 5, Table 6, View3 and Index5.

  • Schema 3 includes objects Table 3 and Table 4.

There are two tables named Table3, but they are in Schema2 and Schema3 respectively, so they can have the same name and are distinguished as Schema2.Table3 and Schema3.Table3.

View3 corresponds to Table3, which is a dummy table that does not occupy actual physical space.

Index5 corresponds to Table5; a table and its index do not have to reside in the same tablespace.

The objects of physical data stored in Tablespace1 are Table1, Table2, Table3 and Index5.

The objects stored in Tablespace2 are Table3, Table4, and Table5.
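The following is a hedged sketch, in GaussDB (DWS)-style SQL, of how the schema prefix distinguishes the two same-named tables in Fig. 2.16; the column definitions are hypothetical.

CREATE SCHEMA schema2;
CREATE SCHEMA schema3;
CREATE TABLE schema2.table3 (a INT);  -- one table named table3 in schema2
CREATE TABLE schema3.table3 (a INT);  -- another table named table3 in schema3
SELECT * FROM schema2.table3;         -- the schema prefix removes the ambiguity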

2.2.13 Transaction

A transaction is a user-defined sequence of data operations that is executed as a single, complete unit of work. The data in a database is shared, allowing multiple users to access the same data at the same time. When multiple users add, delete or modify the same piece of data simultaneously, data exceptions can occur if no measures are taken.

All statements within a transaction, as a whole, are either all executed or none executed.

For example, when Account A transfers $1000 to Account B, the first operation subtracts $1000 from Account A and the second adds $1000 to Account B. By wrapping them in a transaction, the operations either all succeed or all fail.

The ACID characteristics of the transactions are shown below.

  1. (1)

    Atomicity. A transaction is the logical unit of work in the database; the operations in the transaction are either all done or none done.

  2. (2)

    Consistency. The result of transaction execution must be a transition from one consistent state to another.

  3. (3)

    Isolation. The execution of a transaction in the database cannot be interfered with by other transactions; that is, a transaction's internal operations and its use of data are isolated from other transactions, and concurrently executing transactions cannot interfere with each other. For example, while Account A is transferring money to Account B, if Account C also transfers money to Account A, the transaction from C to A should be isolated from the transaction from A to B, without mutual interference. If the isolation level is insufficient, various data inconsistencies can occur.

  4. (4)

    Durability. Once a transaction is committed, its changes to the data in the database are permanent; subsequent operations or failures have no effect on the outcome of the transaction. For example, a transaction reads the value of A as 100, calculates A to be 200 and commits, and then continues with subsequent operations; at this point the database fails. After recovery, the value of A fetched from the database should be 200, not the initial value 100 or some other value.

There are two markers for the end of a transaction: normal end—COMMIT (commit the transaction); and abnormal end—ROLLBACK (roll back the transaction).

After committing a transaction, all operations of the transaction are physically stored in the database as permanent operations. After rolling back a transaction, all operations in the transaction are undone and the database returns to the state it was in before the transaction started.

There are two types of transaction processing models.

  1. (1)

    Explicit commit: Transactions have explicit start and end marks.

  2. (2)

    Implicit commit: Each data operation statement automatically becomes a transaction. GaussDB (for MySQL) adopts implicit commit by default; no COMMIT statement needs to be added, and each statement is automatically committed as a transaction.

Implicit commit can be turned off with the SET autocommit = 0 statement.

The code to set explicit commit is as follows.

CREATE TABLE customer (a INT, b CHAR (20), INDEX (a)); START TRANSACTION; INSERT INTO customer VALUES (10, 'Heikki'); COMMIT; SET autocommit=0; INSERT INTO customer VALUES (15, 'John'); INSERT INTO customer VALUES (20, 'Paul'); DELETE FROM customer WHERE b = 'Heikki'; ROLLBACK; SELECT * FROM customer;

Figure 2.17 shows the specific operations of transaction commit and rollback.

Fig. 2.17 Transaction commit and rollback

GaussDB (for MySQL) is an OLTP database that adopts an explicit transaction processing model, but it does not provide a statement that explicitly marks the start of a transaction; instead, the first executable SQL statement is taken as the start of the transaction.

Under concurrent execution you may face a data inconsistency called the dirty read, which means that one transaction reads data that another transaction has modified but not yet committed. The uncommitted data is called “dirty” data because it may still be rolled back.

The transaction T1 shown in Fig. 2.18 transfers $200 from Account A to Account B, where the initial balance of Account A is $1000 and the initial balance of Account B is $500.

Fig. 2.18 Dirty read. (a) Transaction T1 changes the value of A from 1000 to 800 and the value of B from 500 to 700, but has not yet committed. (b) At this time, Transaction T2 starts to read the data and gets the value 800 of A modified by Transaction T1. (c) Transaction T1 is rolled back; because it was never committed, A recovers its initial value 1000 and B its value 500, yet for Transaction T2 the value of A is still 800. This is a dirty read: Transaction T2 has read data that Transaction T1 never committed.

There is another kind of data inconsistency, the non-repeatable read (NRR), which means that a transaction reads data that can be modified by other transactions. It is called non-repeatable because a transaction reads the same data several times during its processing (repeated reads), but the data may change between reads, as shown in Fig. 2.19.

Fig. 2.19 Non-repeatable read (NRR)

Phantom read is a special case of non-repeatable read: after Transaction T1 reads data based on certain conditions (using a WHERE filter), Transaction T2 deletes some records or inserts new records that happen to satisfy that WHERE filter. When Transaction T1 reads the data again under the same conditions, it finds that some rows have inexplicably disappeared or appeared.

Such missing or increased data is called phantom data.

  1. A.

    Transaction T1 calculates that the sum of A and B is 300.

  2. B.

    Transaction T2 reads the value of B, modifies it so that it becomes 400, and commits the change.

  3. C.

    Transaction T1 reads the values of A and B again and sums them, but this time the result is 500. For Transaction T1, reading the same data source several times within one transaction yields different results. This is the so-called non-repeatable read.

The ANSI SQL standard defines 4 transaction isolation levels to avoid 3 kinds of data inconsistency. The transaction levels, from high to low, are shown below.

  1. (1)

    Serializable. All transactions in the system are executed one after another in a serial manner, so all data inconsistencies can be avoided. However, controlling concurrent transactions in this exclusive, serialized way forces transactions to queue and significantly reduces the concurrency of the system, so it should be used with great caution.

    Here serialization means that all operations are serially queued, for example:

    Q1 indicates the insert operation “INSERT INTO TA VALUES (1,2,3)”; Q2 indicates the query operation “SELECT * FROM TA”.

    Under the serializable transaction isolation mechanism, Q2 must wait for Q1 to complete before getting the returned result. If Q1 is not completed, Q2 is always in the queuing state.

  2. (2)

    Repeatable read. Once a transaction is started, the data it has read is not allowed to be modified by other transactions for the duration of the transaction. This isolation level cannot solve the problem of phantom reads: it only “protects” the data it has already read from being modified, while other data can still be modified. If other data is modified in a way that satisfies the current transaction's filter conditions (WHERE clause), a phantom read occurs.

    Suppose transaction Q1 executes the “SELECT * FROM TA WHERE order_date=‘2019-01-01’” statement and gets 100 rows, then goes on to query other tables. At this point transaction Q2 performs the insert operation “INSERT INTO TA VALUES (1,2,3,‘2019-01-01’)”, adding one record with order_date=2019-01-01. If transaction Q1 then executes the “SELECT * FROM TA WHERE order_date=‘2019-01-01’” statement again, the result becomes 101 rows. For Q1, the same query over the same range within one transaction returns different results. This is the so-called phantom read.

  3. (3)

    Read committed. A transaction can read data that has been committed by other transactions. If a piece of data is read repeatedly during processing and happens to be modified and committed by another transaction in between, the current transaction will encounter a data inconsistency (a non-repeatable read).

  4. (4)

    Read uncommitted. A transaction can read data that has been modified by other transactions but has not yet been committed. Data modified by other transactions but not yet committed may be rolled back. The read of this “uncommitted” data is a dirty read, which may occur at this isolation level.

    GaussDB (for MySQL) implements two of these transaction isolation levels: read committed and repeatable read; a hedged example of setting the session isolation level follows. The correspondence between the four transaction isolation levels and the problems they allow is shown in Table 2.12.
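The following sketch switches the isolation level for the current session using MySQL-compatible syntax, reusing the customer table from the earlier transaction example.

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT * FROM customer;  -- sees only data already committed by other transactions
COMMIT;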

Table 2.12 Correspondence between 4 transaction isolation levels and problems

2.3 Summary

This chapter described the core objectives of database management and the scope of database management work, explained the basic concepts of database object management, backup and recovery, and disaster recovery levels, and introduced the important basic concepts of the database. Concepts that are easily confused were compared, and the important but rather obscure concepts were analyzed with the help of scenarios.

2.4 Exercises

  1. 1.

    [Multiple Choice] To migrate data from a database to other heterogeneous databases, you can use the ( ) approach.

    1. A.

      Physical backup

    2. B.

      Logical backup

  2. 2.

    [Multiple Choice] To improve the speed of table queries, you can create the database object ( ).

    1. A.

      View

    2. B.

      Function

    3. C.

      Index

    4. D.

      Sequence

  3. 3.

    [Single Choice] When an organization sets disaster recovery standards, it wants to have the ability to restore the system to an externally serviceable state within 1 h after a disaster occurs. This indicator refers to ( ).

    1. A.

      RTO

    2. B.

      RPO

  4. 4.

    [Multiple Choice] To add an index to a table, on which fields is it recommended to create the index? ( )

    1. A.

      Creating indexes on columns that are frequently required to be searched and queried can speed up the search.

    2. B.

      Creating an index on a column that is used as the primary key emphasizes the uniqueness of the column and organizes the arrangement structure of the data in the table.

    3. C.

      Create indexes on the columns that often use the WHERE clause to speed up the judgment of the condition.

    4. D.

      Create indexes for fields that often follow the keywords ORDER BY, GROUP BY, and DISTINCT.

  5. 5.

    [Single Choice] Among the following statements about the selection of data types, ( ) is incorrect.

    1. A.

      It is recommended to use data types with higher execution efficiency as much as possible.

    2. B.

      The data type of the short field is also recommended.

    3. C.

      For string data, try to use the fixed-length string and specify the string length.

    4. D.

      When there is a logical relationship between multiple tables, the fields representing the same meaning should use the same data type.

  6. 6.

    [Multiple Choice] Among the following options, ( ) is a transaction characteristic.

    1. A.

      Atomicity

    2. B.

      Isolation

    3. C.

      Durability

    4. D.

      Consistency

  7. 7.

    [Multiple Choice] Which of the following situations will not occur under the Repeatable Read? ( )

    1. A.

      Dirty read

    2. B.

      Non-repeatable read

    3. C.

      Phantom read