TOOLS FOR BUILDING INFORMATION SYSTEMS:DATABASE MANAGEMENT TOOLS
DATABASE MANAGEMENT TOOLS
DBM Concepts
One of the most important functions that an information system must perform is data management (i.e., record keeping). Prior to the advent of true database software, computer data management was characterized by excessive data redundancy (or duplication), large data dependence (or coupling),
and diffused data ownership (McLeod 1998). In early data processing systems, the data files were often created with ‘‘little thought as to how those files affected other systems. Perhaps much, or even all, of the new data in a new file was already contained in an existing file’’ (McLeod 1998). This resulted in significant data redundancy and several associated problems. Obviously, data duplication results in wasted space, but worse than this are synchronization problems. ‘‘Multiple copies of data were often updated on different frequencies. One file would be updated daily, another weekly, and a third monthly. Reports based on one file could conflict with reports based on another, with the user being unaware of the differences’’ (McLeod 1998). These early systems also exhibited a high degree of dependence between the data specifications and the programs written to manipulate the data. Every time the representation of the data or the file storage format changed, all of the programs that depended on that data would also need to be changed. Other problems resulted from a lack of agreement on data ownership and standardization. ‘‘This situation of the 1950s and early 1960s may have been the reason why many of the first management information systems failed in the late 1960s. Information specialists appeared unable to deliver systems that provided consistent, accurate, and reliable information’’ (McLeod 1998).
Database systems sought to address these issues by separating the logical organization of the data from the physical organization. The logical organization is how the user of the system might view the data. The physical organization, however, is how the computer views the data. A database, then, may be defined as ‘‘an integrated collection of computer data, organized and stored in a manner that facilitates easy retrieval’’ by the user (McLeod 1998). It seeks to minimize data redundancy and data dependence. A database is often considered to consist of a hierarchy of data: files, records, and fields. A field is the smallest organized element of data, consisting of a group of characters that has a specific meaning. For example, a person’s last name and telephone number would be common fields. Related fields are collected and stored in a structure referred to as a record. A specific person’s record might consist of his or her name, address, telephone number, and so on (all values associated with that individual). Related records are grouped into a file, such as a file of employees or customers. Keep in mind that the physical representation of the database will be shielded from the user and should not limit the user’s ability retrieve information from the system.
A database management system (DBMS) is a ‘‘collection of programs that manage the database structure and controls access to the data stored in the database. The DBMS makes it possible to share the data in the database among multiple applications or users’’ (Rob and Coronel 1997). In a sense, it sits between the user and the data, translating the users’ requests into the necessary program code needed to accomplish the desired tasks. General Electric’s Integrated Data Store (IDS) system, intro- duced in 1964, is considered by most to be the first generalized DBMS (Elmasri and Navathe 2000).
Database systems are often characterized by their method of data organization. Several popular models have been proposed over the years. Conceptual models focus on the logical nature of the data organization, while implementation models focus on how the data are represented in the database itself. Popular conceptual models include the entity-relationship (E-R) model and the object-oriented model discussed below. These models typically describe relationships or associations among data as one-to-one, one-to-many, or many-to-many. For example, in most organizations an employee is as- sociated with one department only but a department may have several employees. This is a one-to- many relationship between department and employee (one department has many employees). But the relationship between employee and skills would be many-to-many, as a single employee might pos- sess several skills and a single skill might be possessed by several employees (many employees have many skills).
Implementation models include the hierarchical model, network model, and relational model. Each model has advantages and disadvantages that made it popular in its day. The relational model is easily understood and has gained widespread acceptance and use. It is currently the dominant model in use by today’s database management systems.
Relational DBMS
The relational database model, introduced by E. F. Cod in 1970, is considered a major advance for users and database designers alike. At the time, however, many considered the model impractical because its ‘‘conceptual simplicity was bought at the expense of computer overhead’’ (Rob and Coronel 1997). As processing power has increased, the cost of implementing the relational model has dropped rapidly. Now you will find relational DBMS implementations from mainframes down to microcomputers in products like DB2, Oracle, and even Microsoft’s Access.
To the user, the relational database consists of a collection of tables (or relations in the formal language of the model). Each table may be thought of as a matrix of values divided into rows and columns. Each row represents a record and is referred to as a tuple in the model. Each column is a field and is called an attribute in the model. Below are some simple tables for Employee and De- partment.
Pay special attention to how relationships between tables are established by sharing a common field value. In this example, there is a one-to-many relationship between the Employee and Depart- ment tables established by the common DepartmentID field. Here, Robert Adams and Mary Roberts are in the Engineering department (DepartmentID 20). The tables represented here are independent of one another yet easily connected in this manner. It is also important to recognize that the table structure is only a logical structure. How the relational DBMS chooses to represent the data physically is of little concern to the user or database designer. They are shielded from these details by the DBMS.
Processing
One of the main reasons for the relational model’s wide acceptance and use is its powerful and flexible ad hoc query capability. Most relational DBMS products on the market today use the same query language, Structured Query Language (SQL). SQL is considered a fourth-generation language (4GL) that ‘‘allows the user to specify what must be done without specifying how it must be done’’ (Rob and Coronel 1997). The relational DBMS then translates the SQL request into whatever program actions are needed to fulfill the request. This means far less programming for the user than with other database models or earlier file systems.
SQL statements exist for defining the structure of the database as well as manipulating its data. You may create and drop tables from the database, although many people use tools provided by the DBMS software itself for these tasks. SQL may also be used to insert, update, and delete data from existing tables. The most common task performed by users, however, is data retrieval. For this, the SELECT statement is used. The typical syntax for SELECT is as follows:
For example, to produce a list of employee names and telephone numbers from the Engineering department using the tables introduced earlier, you would write:
In this example, the rows returned by the query are established by matching the common values of DepartmentID stored in both the Employee and Department tables. These joins may be performed across several tables at a time, creating meaningful reports without requiring excessive data dupli- cation.
It is also possible to sort the results returned by a query by adding the ORDER BY clause to the SELECT statement. For example, to create an alphabetical listing of department names and IDs, you would write:
Object-Oriented Databases
The evolution of database models (from hierarchical to network to relational) is driven by the need to represent and manipulate increasingly complex real-world data (Rob and Coronel 1997). Today’s systems must deal with more complex applications that interact with multimedia data. The relational approach now faces a challenge from new systems based on an object-oriented data model (OODM). Just as OO concepts have influenced programming languages, so too are they gaining in popularity with database researchers and vendors.
There is not yet a uniformly accepted definition of what an OODM should consist of. Different systems support different aspects of object orientation. This section will explore some of the similar characteristics shared by most of these new systems. Rob and Coronel (1997) suggest that, at the very least, an OODM
1. Must support the representation of complex objects
2. Must be extensible, or capable of defining new data types and operations
3. Must support encapsulation (as described in Section 2)
4. Must support inheritance (as described in Section 2)
5. Must support the concept of object identity, which is similar to the notion of a primary key from the relational model that uniquely identifies an object. This object identity is used to relate objects and thus does not need to use JOINs for this purpose.
Consider an example involving data about a Person (from Rob and Coronel 1997). In a typical relational system, Person might become a table with fields for storing name, address, date of birth, and so on as strings. In an OODM, Person would be a class, describing how all instances of the class (Person objects) will be represented. Here, additional classes might be used to represent name, address, date of birth, and so on. The Address class might store city, state, street, and so on as separate attributes. Age might be an attribute of a Person object but whose value is controlled by a method, or function of the class. This inclusion of methods makes objects an active component in the system. In traditional systems, data are considered passive components waiting to be acted upon. In an OODM, we can use inheritance to create specialized classes based on the common character- istics of an existing class, such as the Person class. For example, we could create a new class called Employee by declaring Person as its superclass. An Employee object might have additional attributes of salary and social security number. Because of inheritance, an Employee is a Person and, hence, would also gain the attributes that all Person objects share—address, date of birth, name, and so on. Employee could be specialized even further to create categories of employees, such as manager, secretary, cashier, and the like, each with its own additional data and methods for manipulating that data.
Rob and Coronel (1997) summarize the advantages and disadvantages of object-oriented database systems well. On the positive side, they state that OO systems allow the inclusion of more semantic information than a traditional database, providing a more natural representation of real-world objects. OO systems are better at representing complex data, such as required in multimedia systems, making them especially popular in CAD / CAM applications. They can provide dramatic performance im- provements over relational systems under certain circumstances. Faster application development time can be achieved through the use of inheritance and reuse of classes. These OO systems are not without problems, however. Most of them are based on immature technology, which can lead to product instability. This is exacerbated by the lack of standards and an agreed-upon theoretical foun- dation. OO database systems especially lack a standard ad hoc query language (like SQL for relational systems). In addition, the initial learning curve is steep, particularly for individuals who are well accustomed to the relational model. On the whole, it seems clear that object-oriented databases are
worth consideration. As standards develop over time and people become more familiar with object- oriented systems in general, OO databases could prove to be the successor to the relational model.
Data Warehouses
One of the major changes in the information systems area within the past decade is a recognition that there are two fundamentally different types of information systems in all enterprises: operational systems and informational systems. Operational systems are just what their name implies: they are the systems that help us run the enterprise day-to-day. These are the backbone systems of any enterprise: the order entry, inventory, manufacturing, payroll, and accounting systems. Because of their importance to the organization, operational systems were almost always the first parts of the enterprise to be computerized. Over the years, these operational systems have been enhanced and maintained to the point that they are completely integrated into the organization. Indeed, most large enterprises around the world today couldn’t operate without their operational systems and the data that these systems maintain. Operational data normally is current (the present) and is focused on an area within the enterprise.
On the other hand, other functions go on within the enterprise that have to do with planning, forecasting, and managing the organization. These functions are also critical to the survival of the organization, especially in our fast-paced world. Functions like marketing planning, engineering plan- ning, and financial analysis also require information systems to support them. But these functions are different from operational ones, and the types of systems and information required are also different. The information needed to aid such major decisions that will affect how the enterprise will operate, now and in the future, typically is obtained by analyzing operational data over a long time period and covering many different areas within an enterprise.
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data to support management decision making. The data in a data warehouse are stored in a manner that will support the information and analytical processing needs of the organization over a long historical time period. They are typically physically separated from the operational databases that an organi- zation must maintain and update to function effectively on a day-to-day basis (Singh 1998; Barqu´ın and Edelstein 1997).
There are many data flows from the operational databases to the data warehouse and from the data warehouse to users. Data must flow from the enterprise’s legacy systems to a more compact form, then to the data warehouse storage, and finally out to consumers. The flows necessary to manage the warehouse itself (metaflow) must also be considered (Hackathorn 1995). For each flow, data warehouses require tools to make these processes function effectively (Mattison 1996).
The tools for developing data warehouses can be grouped into three categories, based on their activities: acquisition tools (for inflow), storage tools (for upflow and downflow), and access products (for outflow) (Mattison 1996). Acquisition tools are necessary to perform tasks such as modeling, designing, and populating data warehouses. These tools extract data from various sources (operational databases and external sources) and transform it (i.e., condition it, clean it up, and denormalize it) to make the data usable in the data warehouse. They also establish the metadata, where information about the data in the warehouse is stored (Francett 1994).
Storage is usually managed by relational DBMSs and other special tools in ways that enable the use of data for effective decision support (Mattison 1996). Access to warehouse contents is provided by data mining tools using such methods as neural networks and data discovery mechanisms (Mat- tison 1996), plus multidimensional analysis software that supports end users in accessing and ana- lyzing the data in the warehouse in various ways (Francett 1994). Data mining is the process of making a discovery from large amounts of detailed data (Barry 1995) by drilling down through detailed data in the warehouse. Data mining tools sift through large volumes of data to find patterns or similarities in the data. Data mining and online analytical processing tools translate warehouse contents into business intelligence by means of a variety of statistical analyses and data visualization methods (Brown 1995; Fogarty 1994). Table 1 provides a list of some of the common terms used within the data warehouse community.
A number of commercial products attempt to fulfill warehousing needs, including the SAS Insti- tute’s SAS / Warehouse Administrator, IBM’s Visual Warehouse, Cognos’s PowerPlay, Red Brick Systems’ Red Brick Warehouse, and Information Builders’ FOCUS. A discussion of these tools is beyond the scope of this chapter.
Comments
Post a Comment