Top 50 Database System Concepts Interview Questions You Must Prepare 19.Mar.2024

Reasons for not keeping several search indices include:

  1. Every index requires additional CPU time and disk I/O overhead during inserts and deletions.
  2. Indices on non-primary keys might have to be changed on updates, although an index on the primary key might not (this is because updates typically do not modify the primary key attributes).
  3. Each extra index requires additional storage space.
  4. For queries that involve conditions on several search keys, efficiency may still be acceptable even if only some of those keys have indices on them. Therefore, adding further indices improves database performance less and less once many indices already exist; the sketch following this list illustrates the per-insert cost.
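
As a brief sketch of the insert-time cost mentioned above (the account relation, its columns, and the index names are illustrative, and create index, while not part of the SQL standard, is supported by virtually all systems):

  create index branch_idx  on account (branch_name);
  create index balance_idx on account (balance);

  -- this single insert must now update the table plus both index structures
  insert into account (account_number, branch_name, balance)
  values ('A-101', 'Perryridge', 500);

Each additional index on account makes every insert and delete a little more expensive, which is exactly the trade-off the list describes.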

In a generalization–specialization hierarchy, it must be possible to decide which entities are members of which lower-level entity sets. In a condition-defined design constraint, membership in the lower-level entity sets is evaluated on the basis of whether or not an entity satisfies an explicit condition or predicate. User-defined lower-level entity sets are not constrained by a membership condition; rather, entities are assigned to a given entity set by the database user.

Condition-defined constraints alone can be handled automatically by the system. Whenever a tuple is inserted into the database, its membership in the various lower-level entity sets can be decided automatically by evaluating the respective membership predicates. Similarly, when a tuple is updated, its membership in the various entity sets can be re-evaluated automatically.

A general purpose database manager (DBM) has five responsibilities:

  1. interaction with the file manager.
  2. integrity enforcement.
  3. security enforcement.
  4. backup and recovery.
  5. concurrency control.

  • A distributed database gives a user convenient and transparent access to data that is not stored at the user's own site, while allowing each site control over its own local data. A distributed database can be made more reliable than a centralized system: if one site fails, the rest of the database can continue functioning, whereas if a centralized system fails, the database can no longer continue its normal operation at all. A distributed database also allows parallel execution of queries, and a single query can be split into many parts to increase throughput.
  • A centralized system is easier to design and implement, and it is cheaper to operate because messages do not have to be sent between sites.

Increasing contention for shared resources prevents linear scale-up with increasing parallelism. In a shared-memory system, contention for memory (which implies bus contention) causes scale-up to fall off as parallelism increases. In a shared-disk system, it is contention for disk and bus access that limits scale-up. In a shared-nothing system, inter-process communication overheads are the main impediment: since there is no shared memory, acquiring locks and other activities that require message passing between processes take more time as parallelism increases.

It is preferable to use a dense index instead of a sparse index when the file is not sorted on the indexed field (such as when the index is a secondary index) or when the index file is small compared to the size of memory.

4NF is more desirable than BCNF because it reduces the repetition of information. If we consider a BCNF schema that is not in 4NF (see Exercise 7.28), we observe that decomposition into 4NF does not lose information provided that a lossless-join decomposition is used, yet redundancy is reduced.

A physical OID needs to contain a unique identifier in addition to a pointer to a physical storage location. The unique identifier makes it possible to detect dereferences of dangling pointers: if the object has been deleted and its space reused, the identifier stored at that location no longer matches the identifier in the OID.

If an object gets forwarded multiple times, retrieval slows down because accessing the object requires following the chain of locations from which it has successively been forwarded to its current location. Multiple accesses can be avoided by always keeping the latest address of the object in its oldest location: while forwarding, check whether the object has already been forwarded and, if so, update the forwarding address at the oldest location. Thus, at most two accesses will be required.

Usually, a well-designed view and security mechanism can avoid conflicts between ease of access and security. However, as the following example shows, the two purposes do conflict in case the mechanisms are not designed carefully.

Suppose we have a database of employee data and a user whose view involves employee data for employees earning less than $10,000. If this user inserts employee Jones, whose salary is $9,000, but accidentally enters $90,000, several existing database systems will accept this update as a valid update through the view. However, the security mechanism will then deny the user access to delete this erroneous tuple.
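
A sketch of how the standard with check option clause closes this particular hole (the relation, view, and column names are illustrative):

  create view low_paid_emp as
    select employee_name, salary
    from employee
    where salary < 10000
  with check option;

  -- accepted: the new tuple satisfies the view predicate
  insert into low_paid_emp values ('Jones', 9000);

  -- rejected: 90000 violates salary < 10000, so the erroneous tuple never appears
  insert into low_paid_emp values ('Jones', 90000);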

When a transaction explicitly locks a node in shared or exclusive mode, it implicitly locks all the descendants of that node in the same mode. The transaction need not explicitly lock the descendant nodes. There is no difference in the functionality of these locks; the only difference is in the way they are acquired and in how their presence is tested.

The ACID properties, and the need for each of them, are:

  • Consistency: Execution of a transaction in isolation (that is, with no other transaction executing concurrently) preserves the consistency of the database. This is typically the responsibility of the application programmer who codes the transactions.
  • Atomicity: Either all operations of the transaction are reflected properly in the database, or none are. Clearly, lack of atomicity will lead to inconsistency in the database.
  • Isolation: When multiple transactions execute concurrently, it should be the case that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing concurrently with it. The user view of a transaction system requires the isolation property, and the property that concurrent schedules take the system from one consistent state to another. These requirements are satisfied by ensuring that only serializable schedules of individually consistency-preserving transactions are allowed.
  • Durability: After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Let tgrid be a two-dimensional integer array of size n × m.

    • The physical level would simply be m × n (probably consecutive) storage locations of whatever size is specified by the implementation (e.g., 32 bits each).
    • The conceptual level is a grid of boxes, each possibly containing an integer, which is n boxes high by m boxes wide.
    • There are 2^(m×n) possible views. For example, a view might be the entire array, a particular row of the array, or all n rows but only columns 1 through i.
    • Consider the following Pascal declarations: type tgrid = array[1..n, 1..m] of integer; var vgrid1, vgrid2 : tgrid; Then tgrid is a schema, whereas the values of the variables vgrid1 and vgrid2 are instances.
    • To illustrate further, consider the schema array[1..2, 1..2] of integer. Two instances of this schema are any two 2 × 2 arrays of integers, for example ((1, 2), (3, 4)) and ((17, 90), (412, 8)).

In a total design constraint, each higher-level entity must belong to a lower-level entity set. The same need not be true in a partial design constraint. For instance, some employees may belong to no work-team.

  1. Encrypted data allows authorized users to access data without worrying about other users or the system administrator gaining any information.
  2. Encryption of data may simplify or even strengthen other authorization mechanisms. For example, distributing the cryptographic key only among trusted users is both a simple way to control read access and an added layer of security above that offered by views.

Deadlock avoidance is preferable if the consequences of abort are serious (as in interactive transactions), and if there is high contention and a resulting high probability of deadlock.

  1. Security conditions may require that the entire logical database not be visible to all users.
  2. We may wish to create a personalized collection of relations that is better matched to a certain user’s intuition than is the actual logical model (both uses are sketched below).
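
A minimal sketch of both uses, assuming a hypothetical employee(employee_name, street, city, salary) relation and a clerk_role authorization role:

  -- security: hide salaries from clerical staff
  create view emp_address as
    select employee_name, street, city
    from employee;

  grant select on emp_address to clerk_role;

  -- personalization: present only the data a particular user cares about
  create view brooklyn_employees as
    select employee_name, salary
    from employee
    where city = 'Brooklyn';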

Open hashing may place keys with the same hash-function value in different buckets. Closed hashing always places such keys together in the same bucket. Thus, in this case, different buckets can be of different sizes, though the implementation may link together fixed-size buckets using overflow chains. Deletion is difficult with open hashing, as all the buckets may have to be inspected before we can ascertain that a key value has been deleted, whereas in closed hashing only the bucket whose address is obtained by hashing the key value needs to be inspected. Deletions are more common in databases, and hence closed hashing is more appropriate for them. For a small, static set of data, lookups may be more efficient using open hashing; the symbol table of a compiler would be a good example.

A schedule in which all the instructions belonging to any one transaction appear together is called a serial schedule. A serializable schedule has a weaker restriction: it need only be equivalent to some serial schedule. There are two standard definitions of schedule equivalence, conflict equivalence and view equivalence.

With a central server, the individual sites do not have to remember which site to contact when a particular data item is requested; the central server alone needs to remember this, so data items can be moved around easily depending on which sites access which items most frequently. Other house-keeping tasks are also centralized rather than distributed, making the system easier to develop and maintain. Of course, there is the disadvantage of a total shutdown if the central server becomes unavailable. Even when it is running, it may become a bottleneck because every request has to be routed through it.

Your answer will be based on the computers and storage media that you use. Typical examples would be hard disks, floppy disks, and CD-ROM drives.

Dangling tuples can arise when one tuple is inserted into a decomposed relation but no corresponding tuple is inserted into the other relations in the decomposition. They can cause incorrect values to be returned by queries that form the join of a decomposed relation, since the dangling tuple might not be included. Dangling tuples can be avoided by the specification of referential-integrity constraints.

In the validation-based concurrency control scheme, choosing Start(Ti) as the timestamp of Ti gives a subset of the schedules allowed by choosing Validation(Ti) as the timestamp. Using Start(Ti) means that whichever transaction started first must finish first. Transactions could certainly be made to enter the validation phase in the same order in which they began executing, but this is overly restrictive. Since choosing Validation(Ti) causes fewer non-conflicting transactions to restart, it gives better response times.

The phantom phenomenon arises when, due to an insertion or deletion, two transactions logically conflict despite not locking any data items in common. The insertion case is described in the book. Deletion can also lead to this phenomenon: suppose Ti deletes a tuple from a relation while Tj scans the relation. If Ti deletes the tuple and then Tj reads the relation, Ti should be serialized before Tj. Yet there is no tuple on which Ti and Tj conflict.

An interpretation of 2PL as locking just the accessed tuples in a relation is incorrect. There is also the index, or other relation metadata, that records which tuples are in the relation. This information is read by any transaction that scans the relation, and it is modified by transactions that update, insert into, or delete from the relation. Hence locking must also be performed on the index or relation metadata, and this avoids the phantom phenomenon.
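
The insertion case can be made concrete with a hypothetical account relation; neither statement below touches a tuple the other has locked, yet the two transactions conflict on the predicate branch_name = 'Perryridge':

  -- Tj: scans the relation
  select sum(balance)
  from account
  where branch_name = 'Perryridge';

  -- Ti: concurrently inserts a tuple that satisfies Tj's scan predicate
  insert into account (account_number, branch_name, balance)
  values ('A-999', 'Perryridge', 700);

If only individual tuples were locked, both transactions could proceed; locking the index or relation metadata that both must touch prevents this.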

Consider the balance in an account, replicated at N sites. Let the current balance be $100, consistent across all sites. Consider two transactions T1 and T2, each depositing $10 in the account; the balance should therefore be $120 after both transactions are executed. Let the transactions execute in sequence: T1 first and then T2. Suppose the copy of the balance at one of the sites, say s, is not consistent with the primary copy after transaction T1 is executed (due to the lazy replication strategy), and let transaction T2 read this stale copy of the balance. One can see that the balance at the primary site would then be $110 at the end instead of $120, so T1's deposit is lost.

An entity is simply a collection of variables or data items. An object is an encapsulation of data as well as the methods (code) to operate on the data. The data members of an object are directly visible only to its methods. The outside world can gain access to the object’s data only by passing pre-defined messages to it, and these messages are implemented by the methods.

RAID level 1 (mirroring) is the one which facilitates rebuilding of a failed disk with minimum interference with the on-going disk accesses. This is because rebuilding in this case involves copying data from just the failed disk’s mirror. In the other RAID levels, rebuilding involves reading the entire contents of all the other disks.

There are several steps in the creation of a file: a storage area is assigned to the file in the file system, a unique i-number is given to the file, and an i-node entry is inserted into the i-list. Deletion of a file involves exactly the opposite steps.

For the file-system user in UNIX, durability is important for obvious reasons, but atomicity is generally not relevant since the file system does not support transactions. To the file-system implementor, though, many of the internal file-system actions need to have transaction semantics: all the steps involved in the creation or deletion of a file must be atomic, otherwise there will be unreferenceable files or unusable areas in the file system.

The possible sequences of states are:

  1. active → partially committed → committed. This is the normal sequence a successful transaction follows. After executing all its statements, it enters the partially committed state. After enough recovery information has been written to disk, the transaction finally enters the committed state.
  2. active → partially committed → aborted. After executing the last statement of the transaction, it enters the partially committed state. But before enough recovery information is written to disk, a hardware failure may occur, destroying the memory contents. In this case the changes the transaction made to the database are undone, and it enters the aborted state.
  3. active → failed → aborted. After the transaction starts, if it is discovered at some point that normal execution cannot continue (either due to internal program errors or external errors), it enters the failed state. It is then rolled back, after which it enters the aborted state.

If the type of an attribute is x, then in each tuple of the table the corresponding value is an actual object of type x. If its type is ref(x), then in each tuple the corresponding value is a reference to some object of type x. We choose a reference type for an attribute if that attribute's intended purpose is to refer to an independent object.
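
A SQL:1999-style sketch of the distinction (the type, table, and column names are illustrative, and the exact ref syntax varies across systems):

  create type Department as (
    dept_name varchar(20),
    budget    numeric(12,2)
  ) not final;

  -- a typed table whose rows are objects of type Department;
  -- each row carries a system-generated reference
  create table departments of Department
    ref is dept_oid system generated;

  create table employees (
    emp_name varchar(20),
    manager  varchar(20),                       -- ordinary value attribute
    dept     ref(Department) scope departments  -- reference to an independent object
  );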

Autonomy is the amount of control a single site has over its local database. It is important because users at that site want quick and correct access to local data items, especially since local data will be the most frequently accessed data at a site. Transparency hides the distributed nature of the database. It is important because users should not be required to know about the location, replication, fragmentation, or other implementation aspects of the database.

If a transaction is very long, or if it fetches data from a slow disk, it takes a long time to complete. In the absence of concurrency, other transactions would have to wait for a long period of time, and average response time would increase. Moreover, while the transaction is reading data from disk, the CPU is idle, so resources are not properly utilized. Hence concurrent execution becomes important in this case. However, when the transactions are short or the data is available in memory, these problems do not occur.

The assertion name is arbitrary; we have chosen the name perry. Note that since the assertion applies only to the Perryridge branch, we must restrict attention to the Perryridge tuple of the branch relation rather than writing a constraint on the entire relation.

  create assertion perry check
    (not exists (select *
                 from branch
                 where branch-name = 'Perryridge' and
                       assets <> (select sum (amount)
                                  from loan
                                  where branch-name = 'Perryridge')))

Note that indices must operate on the encrypted data, or else someone could gain access to the index and use it to interpret the data; otherwise, the index would have to be restricted so that only certain users could access it. To keep the data in sorted order, the index scheme would have to decrypt the data at each level in a tree. Note that hash-based index schemes would not be affected.

The primary key of a weak entity set can be inferred from its relationship with the strong entity set. If we add primary key attributes to the weak entity set, they will be present in both the entity set and the relationship set and they have to be the same. Hence there will be redundancy.

BCNF is not always dependency preserving. Therefore, we may want to choose another normal form (specifically, 3NF) in order to make checking dependencies easier during updates. This would avoid joins to check dependencies and increase system performance.

We have weak entities for several reasons:

  • We want to avoid the data duplication and consequent possible inconsistencies caused by duplicating the key of the strong entity.
  • Weak entities reflect the logical structure of an entity being dependent on another entity.
  • Weak entities can be deleted automatically when their strong entity is deleted.
  • Weak entities can be stored physically with their strong entities.

  1. If a pair of entity sets are connected by a path in an E-R diagram, the entity sets are related, though perhaps indirectly. A disconnected graph implies that there are pairs of entity sets that are unrelated to each other. If we split the graph into connected components, we have, in effect, a separate database corresponding to each connected component.
  2. As indicated in the answer to the previous part, a path in the graph between a pair of entity sets indicates a (possibly indirect) relationship between the two entity sets. If there is a cycle in the graph, then every pair of entity sets on the cycle is related to each other in at least two distinct ways. If the E-R diagram is acyclic, then there is a unique path between every pair of entity sets and, thus, a unique relationship between every pair of entity sets.

Most of the concurrency control protocols (protocols for ensuring that only serializable schedules are generated) used in practice are based on conflict serializability—they actually permit only a subset of conflict serializable schedules. The general form of view serializability is very expensive to test, and only a very restricted form of it is used for concurrency control.

Let us consider a two-dimensional grid array. When a bucket overflows, we can split the ranges corresponding to that row and column into two, in both the linear scales. Thus the linear scales will get one additional entry each, and the bucket is split into four buckets. The ranges should be split in such a way as to ensure that the four resultant buckets have nearly the same number of values.

There can be several other heuristics for deciding how to reorganize the ranges, and hence the linear scales and grid array.

Some main differences between a database management system and a file-processing system are:

  • Both systems contain a collection of data and a set of programs which access that data. A database management system coordinates both the physical and the logical access to the data, whereas a file-processing system coordinates only the physical access.
  • A database management system reduces the amount of data duplication by ensuring that a physical piece of data is available to all programs authorized to have access to it, whereas data written by one program in a file-processing system may not be readable by another program.
  • A database management system is designed to allow flexible access to data (i.e., queries), whereas a file-processing system is designed to allow predetermined access to data (i.e., compiled programs).
  • A database management system is designed to coordinate multiple users accessing the same data at the same time. A file-processing system is usually designed to allow one or more programs to access different data files at the same time. In a file-processing system, a file can be accessed by two programs concurrently only if both programs have read-only access to the file.

An edge from class A to class B in the DAG representing inheritance means that an object of class B is also an object of class A. It has all the properties that objects of class A have, plus additional ones of its own. In particular, it inherits all the variables and methods of class A. It can, of course, provide its own implementations for the inherited methods. An edge from class A to class B in the object-containment DAG means that an object of class A contains an object of class B. There need not be any similarities in the properties of A and B. Neither B nor A inherits anything from the other. They function as independent types, to the extent that an object of class A can access the variables of the B object contained in it only via the B object's methods.

Certain functional dependencies are called trivial functional dependencies because they are satisfied by all relations.

  create view salinfo as
    select manager-name, avg(salary)
    from manages m, works w
    where m.employee-name = w.employee-name
    group by manager-name

Updates should not be allowed in this view because there is no way to determine how to change the underlying data. For example, suppose the request is “change the average salary of employees working for Smith to $200”. Should everybody who works for Smith have their salary changed to $200? Or should the first (or more, if necessary) employee found who works for Smith have their salary adjusted so that the average is $200? Neither approach really makes sense.

We define a trigger for each relation whose primary key is referred to by the foreign key of some other relation. The trigger would be activated whenever a tuple is deleted from the referred-to relation. The action performed by the trigger would be to visit all the referring relations and delete from them all the tuples whose foreign-key attribute value is the same as the primary-key attribute value of the deleted tuple in the referred-to relation. This set of triggers takes care of the on delete cascade operation.
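
A sketch of one such trigger, using SQL:1999-style trigger syntax (the branch and loan relations are illustrative, and real systems differ in details, for example Oracle's :old notation):

  create trigger branch_delete_cascade
  after delete on branch
  referencing old row as orow
  for each row
    delete from loan
    where loan.branch_name = orow.branch_name;

In practice the same effect is usually obtained declaratively, by declaring the foreign key with references branch on delete cascade.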

The primary index is on the field which specifies the sequential order of the file. There can be only one primary index while there can be many secondary indices.
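
A dialect-specific sketch of the distinction (the clustered keyword below follows SQL Server's syntax; other systems express the clustering property differently, for example with index-organized tables, and the account relation is illustrative):

  -- primary (clustering) index: at most one, on the field that orders the file
  create clustered index acct_idx on account (account_number);

  -- secondary indices: as many as needed, on other search keys
  create index branch_name_idx on account (branch_name);
  create index balance_amt_idx on account (balance);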

If a transaction needs to access a large set of items, multiple-granularity locking requires fewer locks, whereas if only one item needs to be accessed, single-granularity locking allows this with just one lock. Because all the desired data items are locked and unlocked together in the multiple-granularity scheme, the locking overhead is low, but concurrency is also reduced.
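
A rough illustration of the two extremes in SQL terms (the lock table statement follows PostgreSQL/Oracle syntax, select ... for update acquires row-level locks in most systems, and the account relation is illustrative):

  -- coarse granularity: one lock covers every tuple in the relation
  lock table account in share mode;

  -- fine granularity: one lock per accessed row
  select balance
  from account
  where account_number = 'A-101'
  for update;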

Index and resource authorization should be special categories to allow certain users to create relations (and the indices to operate on them) while preventing these time-consuming and schema-changing operations from being available to many users. Separating index and resource authorization allows a user to build an index on existing relations, say, for optimization purposes, but allows us to deny that user the right to create new relations.

Even in this case the recovery manager is needed to perform rollback of aborted transactions.