How Database Design postgresql

Database Design
Dr.MohitBansal Profile Pic
Published Date:26-10-2017
Your Website URL(Optional)
Comment Page 357 Tuesday, March 8, 2005 2:21 PM Database Design So far in this book, we have been working with a database for our simple customer/orders/ products data, but we have taken the design of the tables and columns mostly for granted. Now that we understand more about the capabilities of relational databases, we are in a position to backtrack a little and look at a very important aspect of databases: designing the database structure, more formally known as a database schema. When researching this chapter, we asked a friend with excellent database design skills, honed over several years, what he thought was the most important aspect of database design. His simple answer was, “Practice.” Unfortunately, we can’t provide a substitute for practice, but we can provide a foundation for understanding database design as you gain experience. We will explain the basics in this chapter. Also, we’ll work through how we arrived at the design in our sample database. In this chapter, we’ll be looking at the following aspects of database design: � What constitutes good database design � Stages in database design �Logical design � Physical database model development � Normal forms � Common design patterns What Is a Good Database Design? The very first step in designing a database is to understand the problem. Just as when you are designing applications, it is important to understand the problem area well, before getting immersed in any detailed design. With an understanding of the problem, you can determine what you are trying to achieve with your database design. Understanding the Problem Is your planned system going to replace an existing system? If so, you have a head start, because whatever its failings or shortcomings, an existing system will have captured many important Page 358 Tuesday, March 8, 2005 2:21 PM 358 CHAPTER 12 ■ DATABASE DESIGN features required of the replacement system. Even if there is an existing system, it’s important that you talk to the potential users of the system. If it’s a database for your personal use, you still need to ask questions, but ask them of yourself. When interviewing, particularly if you are interviewing more than one person at a time, there are some steps you can take to make the interview as productive as possible: � Don’t try to interview too many people at the same time. Two or three is about the most you should talk to at any one time. � Inform people in advance abou t what you are trying to discover and send them your main questions a couple of days prior to the interview. � See if you can get a helper to jot down note s for you, so you can concentrate on under- standing what the users are saying. � Keep the interview session short and make su re you cover the major issues, even if you need to leave some minor details undecided during the actual meeting. If some items are left unresolved, request answers by a specified date, say in a week. � Always circulate detailed minutes after the meeting, certainly within two working days, with an explicit request to return comments within a week if any points are disputed. The actual questions you ask will depend on your particular application. At initial interviews, start out by asking users to describe the purpose of the system and its principal functions. Try to avoid the “how” and focus on the “what.” People will often try to tell you how things are done in the current system. However, you need to know why they are done, so you can under- stand the purpose better. Potential users hold the key to a good design, even if they don’t know it. If you are creating a system for your personal use, it is worthwhile to take the time to consider precisely what you need to do, and to try to anticipate how this may change over the time. Taking Design Aspects into Account It’s important to understand what you are trying to achieve with a database design. Different features will be important in different systems. For example, you may be building a database to collect some survey data, where once the results have been extracted, there will be no further use for the database. In this case, designing in flexibility for future expansion is usually not the most effective use of your time and energy. Let’s look at the aspects of design that may need to be taken into account when designing a database. Ability to Hold the Required Data The ability to hold the required data is a fairly crucial requirement of all databases, since storing data is the very reason for having a database. However, even this apparently universal require- ment can have degrees of necessity. If you are designing a reasonably complex database that you expect to evolve over time, you should seriously consider what are the “must-have” require- ments and implement those first, putting to one side the “nice-to-have” requests. Page 359 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 359 Database design usually evolves through a number of design iterations, just as in the spiral model of application design, where the design iterates through a number of design-code- implement loops as the system evolves, or the Rational Unified Process, where a number of iterations occur, always working on the key requirements first. With database design, getting the fundamentals correct the first time tends to be even more important than with application design. Once the first iteration of the database is in use and storing real data, significant design changes to the core structure will generally prove difficult and time-consuming, and may require design changes in applications accessing the database. In most database designs, even very complex ones, only about 25% of the tables (at most) in the final database implementation are fundamental to the design. Identifying and designing these core tables must be the first goal. The remaining tables are important, but they are usually peripheral to the core of the design. Ability to Support the Required Relationships The design of the database should support the relationships among the data entities. It is all too easy to become so focused on the details of the data to be stored that you overlook relation- ships between the data items, yet this is the key breakthrough of relational databases. An application using a database design that captures all the data, but neglects these relationships, will almost always eventually suffer from data-integrity problems and excessive complexity, as other parts of the system attempt to make up for the design failings in the underlying database. Ability to Solve the Problem The best-designed databases are worthless if they don’t solve the problem that they were created to tackle. Throughout the design process, you must stay in touch with the problem area. If possible, communicate with the intended users of the database and explain the design to them as the major design decisions are being made. Simply mailing the users copies of your database schema will almost certainly not do. You need to sit with them and talk through the design, explaining in business terms what the design achieves, and more important, what assumptions you have made and any limitations the design imposes. When you do this, remember to explain carefully how each major data entity can relate to other entities. If your design allows only a local IT support person to support a single department, you must mention such limitations. It’s also important, where practical, to carefully select the users you consult. The most valuable people to talk to are usually those with the broadest experience of the problem. Unfortunately, these also tend to be most senior personnel, and therefore often the busiest and most difficult to get time with. Ability to Impose Data Integrity The data-integrity aspect is closely related to the earlier point about relationships. The whole purpose of a database is to store data, and the quality of that data must be very important to database designers. A lot of real-world data inevitably has deficiencies: uncertainties, hand- written forms that have illegible entries, or missing information. These are never excuses for allowing any further deterioration in data quality in the database. Page 360 Tuesday, March 8, 2005 2:21 PM 360 CHAPTER 12 ■ DATABASE DESIGN Choose data types with care, impose column constraints, and if necessary, write trigger functions to maintain the data in the database with as much rigor as is reasonably practical. Of course, some common sense and pragmatism is called for sometimes, but never invent data if something is missing. If you are entering a survey into the database, for example, and some users were unable to answer some questions, it is better to store the fact that the answer was unknown than to enter a best guess. Ability to Impose Data Efficiency Data efficiency is a difficult aspect of database design, because, as Donald Knuth (Professor Emeritus of The Art of Computer Programming at Stanford University) is widely quoted as saying, “Premature optimization is the root of all evil.” Although he was referring to application design, this is just as true, perhaps even more so, with database design. Unfortunately, in a large, heavily used database, it is sometimes necessary to do things that spoil the purity of the design in order to achieve more practical performance goals. You should always get the design right first, before you even consider any optimizations. Often, there are quite simple things, such as adding an index or rewriting a query, that can provide dramatic performance improvements, without compromising the core design. What you should avoid is the temptation to arbitrarily make many small changes, such as changing a varchar type to a char type, or experimenting with indexes on different attributes. Generally, these are a waste of time and can result in a poor and inconsistent database schema, which will be difficult to maintain. You need to invest time in profiling the application first, to determine where any bottlenecks lie, and only then consider what may need changing. Even then, changing the database design itself (as opposed to less-structural changes such as adding an index or rewriting a query) should be very much a last resort. Ability to Accommodate Future Change People in the software business are often surprised at just how long software remains in use, usually well beyond its design lifetime. With databases, this is even more noticeable, because migrating data from an old design to a new one is often a significant problem in its own right. There will always be pressure to enhance the existing database design, rather than start from scratch and then migrate the data at a later date. Often, you will find that any changes you have made to your design in the supposed interests of efficiency make your design harder to evolve. As Alan Perlis said in one of his programming epigrams, “Optimization hinders evolution” ( quotes.html). Stages in Database Design Once you know what you are trying to achieve with the database design, you’re ready to begin the design process. As we hinted earlier when discussing the need to understand the problem, database design is rarely a purely technical problem. A significant aspect is to understand the needs and expectations of users before converting those requirements into a technical design. After gathering information, you can proceed to logical design, and then determining Page 361 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 361 Gathering Information The first stage in designing a database is to gather information about what it is for. Why are you designing a database in the first place? It is important to have a clear objective before you attempt to collect more detailed requirements. You should be able to define, in a small number of sentences—perhaps just a single sentence—your aim with the database. If you can’t come up with a simple way of describing your objective, then perhaps the objective is not yet well understood or defined. Bear this initial simple definition in mind, and if further down the track, it all seems to be getting overcomplicated and suffering from “feature-bloat,” then go back to basics, focusing again on the key objectives. Once you have a clear idea of what you are trying to achieve, you can start to expand on this initial requirement. If your new database will be replacing an existing database, your first task should be to understand the structure of the original database—whether it’s relational, flat file, or perhaps just a spreadsheet. Even if the existing system is badly flawed, you can still learn from it, both good things and bad. It’s likely that many of the items it currently stores will also be required in the new system, and seeing some existing data can often give you a good feel for what real- world data looks like. Ask what the existing system does well and what it does badly, or not at all. This will give you clues as to how the existing design needs to be amended. You should write down what the system needs to do, because writing things down focuses the mind. If reports will be generated, try creating a mock-up for users to comment on. If it will take data that comes from existing paper-based forms, get hold of a copy, preferably with some real data already filled in. At this stage, you should also be thinking about relationships and business rules, and noting any specific features and requirements that are mentioned. You need to be careful to determine which are simply rather arbitrary “this is the way we do things” type rules and prone to change, and which are factual rules about the nature of things and much less likely to change. The former are rules you will probably choose to enforce only with triggers or at the application level, so they are easy to modify. The latter are rules you should probably build into the design of the database, enforcing data integrity at a low level, since they are fundamental and unlikely to change. Developing a Logical Design The first stage of actual design is to develop a logical design. This has several steps. This stage concentrates on the logical database structure, rather than focusing on implementation detail. Determining Entities Once you have gathered information about the initial objectives and business requirements, you should be in a position to identify the principal entities (the key objects that will need to appear in the database). At this point, you shouldn’t worry too much about minor entities. You just need to stay focused on the big picture and pick out the key objects that define the problem area. In our sample database, we would identify customers, orders, and products as the key objects that we need to work with. Additional details, such as the need to track stock or how entities relate, are not important at this Page 362 Tuesday, March 8, 2005 2:21 PM 362 CHAPTER 12 ■ DATABASE DESIGN Once you believe you have identified the major components of your database, you need to identify the attributes of those components, in an informal way. For example, for our sample database, we would probably draw up a list of our main components, with the attributes written in plain language, like this: Customers and Potential Customers Name Address Phone number Orders Products ordered Date placed Date delivered Shipping information Product Information Description Buy price Sell price Barcodes Stock on hand ■Note Name is currently not a reserved keyword in the SQL standard, but it may become a reserved word in the future. Currently, PostgreSQL will accept this as a column identifier, but at some point in the future, it may become illegal, so it is best avoided. At this stage, we are just working with plain language, so for initial design purposes, we will continue to use Name as an attribute of customers. Notice that we have not yet worried about how we might store an address, nor about minor complexities, such as the possibility that each product might have several different barcodes. We have also kept the attributes names quite general, for example “Address” and “Shipping information.” This helps to keep the list of attributes reasonably short and general, so we avoid focusing on the finer details too early and losing sight of the core of the design. At this stage, some people find it helpful to write a brief description of each entity. In our small database, this is a little superfluous, as the components are so simple, but in larger data- bases, particularly those dealing with more abstract ideas, this can be helpful. If we were writing descriptions for the “Product information” attribute, we might have the Page 363 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 363 Product Information Description Description Up to 75 characters that describe the product Buy price The price paid to the supplier per item of product, excluding any delivery costs or tax Sell price The price to be paid for the item, excluding sales tax and shipping costs Barcodes The EAN13 barcode Stock on hand The quantity in stock, including any corrections applied during an inventory check Once you have finished this stage, take the time to check the information you gathered initially, and make sure nothing important has been overlooked. Converting Entities to Tables Now you are ready to take a more technical step, and start converting components and attribute lists into something that will begin to look like a database. First, pick some sensible names for the tables. We always name our tables in the singular form, and try to stick to a 250 single word, even if that is slightly artificial. In our sample database, it’s easy to convert our names to more succinct versions such as customer or order. So rather than Product Information, we use item. Some designers prefer to use the plural for table names, but the key is to be consistent. Next, convert the attributes into more meaningful names, and also break down some of the more general descriptions into the columns you would like to see in a database. When breaking down descriptions into column names, it’s very important to ensure that each column holds just a single attribute. As we will see later in the chapter, this is essential to ensuring the database is in first normal form, a key design requirement for relational databases. Again consistency is important, so be consistent in the way you pick attribute names. The more consistent you are in your choice of names, column types, sizes, and so on, the easier your database will be to maintain in the long term. Starting with our customer table, we have three main attributes: Name, Address, and Phone number. Name is reasonably easy to break down. People normally have a title of some form, such as Mr., Mrs., or Dr., so we need to have a column for this. Names are quite complex. People are often tempted to use a single column for names, assuming that they can always break down the names later if required. The clue is in the word assume. Never assume—making assumptions is always risky, even more so in the early stages of a database design. Suppose you have a customer with a double-barrelled last name, such as Rose Martin, or a Germanic last name, such as von Neumann. Some people might choose to enter two first names, as well as a last name, such as Jennifer Ann Stones. We may have a table of data like Page 364 Tuesday, March 8, 2005 2:21 PM 364 CHAPTER 12 ■ DATABASE DESIGN Title Name Miss Jennifer Ann Stones Dr John von Neumann Mr Andrew Stones Mr Adrian Alan Matthew Mr Robert Rose Martin With this structure, it would be impossible to reliably extract the first and last names at a later date. If you need to separate the components of the name, it is much better to capture the separation of names at the point of entry, and store them separately in the database, like this: Title Fname Lname Miss Jennifer Stones Dr John von Neumann Mr Andrew Stones Mr Adrian Matthew Mr Robert Rose Martin Notice that we have also decided that we are not interested in middle names, and we decide as a point of principle to store only a single first name. Now it’s possible at some point in the future to handle the components of the name separately, so we can write to Dr. von Neumann, and start the letter “Dear John,” rather than “Dear John von.” That sort of carelessness does not impress customers. Our next item is Address. Addresses are always hard to handle in a database, because the form of address varies widely even in a single country. For example, in the United Kingdom, addresses are written in the form: 20 James Road, Great Barr, Birmingham M11 2BA Another address might have no house number at all: Arden House, Warwick Road, Acocks Green, Birmingham B27 Page 365 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 365 United States addresses are similar, although the ZIP code is slightly different from the British postcode style: 2560 Ninth Street, Suite 219, Berkeley, California 94710 In Germany and Austria, addresses are written very differently: Getreidegasse 9 A-5020 Salzburg (Which just happens to be a very attractive street where Mozart was born.) Designing a standard address structure is not easy, and often there is no perfect solution. Usually, a minimum design would be to separate out a postal town and ZIP code or equivalent, which is what we have done in our sample database. In real use, it is probably better to have at least three lines for an address, a town, a ZIP code, a state (if applicable), and a country if that might be required. ■Note If you live outside the United States, a fault you sometimes see on web forms is assuming that everyone has a state part of the address and providing a handy drop-down box to select the state, or making it a mandatory field, but forgetting to allow the “not relevant” option for the rest of the world. It is very annoying for people outside the United States trying to enter an address and discovering that state entry is mandatory, when it has no meaning for most of the world’s population It is usually best to avoid insisting on a house number, as you will cause problems for people in office buildings with a name, or people who live in apartments in condominiums and have an apartment number as well as a street address number. Another possibility is to accept an undefined number of address lines, by splitting the address lines out into a separate table. If you do this, you must remember to impose an order on the lines, so you get the address details in the correct order. Generally, most designers consider this overkill, and splitting the address into a fixed number of lines is sufficient. In general, we would recommend ensuring that the town and ZIP code are separately identified, just leaving additional lines for the remainder of the address. Occasionally, too much subdivision is a bad thing. Assuming a simplified design for our address columns, we get the following for the customer table Page 366 Tuesday, March 8, 2005 2:21 PM 366 CHAPTER 12 ■ DATABASE DESIGN Customer Title Fname Lname Addressline Town Zipcode Phone Our item (Product Information) table is already very close to having columns described: Item Description Buy price Sell price Barcodes (may be several) Stock quantity Notice that we have postponed the problem of multiple barcodes per item for now. We will return to this later. Our order table is similar: Order Items ordered Quantity of each item Date placed Date delivered Shipping information We have again postponed the details of some issues, such as multiple products being put on the same order. It’s clear we will need to further break down this table before we can imple- ment it in a real database. Determining Relationships and Cardinality At this point, you should have a list of the main entities, and although it might not be a complete list, it should be at least a reasonable first pass at the main attributes for each entity. Now comes an important phase in designing a database: breaking out those attributes that can occur several times for each entity and deciding how the different entities relate to each other. This is often referred to as cardinality. Some people like to consider the relationships even before generating an attribute list. We find that listing the main attributes helps in understanding the entities, so we perform that step first. There is no definitive right and wrong way; use whichever works best for Page 367 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 367 Drawing Relationship Diagrams With databases, a graphical representation of the structure of the data can be extremely helpful in understanding the design. At this stage, you are working on what is termed a conceptual model. You are not yet concerned about the finer implementation detail, but more about the logical structure of our data. In a conceptual data model, tables are shown as boxes, with rela- tionships between the tables shown using lines, with symbols at the end of the line indicating the type of relationship, or the cardinality. Relationships between tables are always in two directions; therefore, there will always be a symbol at each end, and you read the diagram toward the table of interest. The symbols we will be using here are shown in Table 12-1. ■Note There are many different diagramming techniques and styles in use in database circles. We will use a common notation; you will find other notation styles in use. Table 12-1. Cardinality Symbols Relationship Symbol Zero or one Exactly one Zero or many One or many Suppose we had a relationship between two tables, A and B, as shown in Figure 12-1. Figure 12-1. Simple relationship between two tables This means that the tables have the following relationship: � For each row in table A, there mu st be exactly one row in table B. � For each row in table B, there can be zero, one, or many rows in table Page 368 Tuesday, March 8, 2005 2:21 PM 368 CHAPTER 12 ■ DATABASE DESIGN For example, if table A is order and table B is customer, this would say, “For each order, there must be exactly one customer. For each customer there can be zero, one, or many orders.” Now that we have the basics of the diagram elements for drawing table relationships, we can look at our example with customers, orders, and products. Our customer table has no multiple attributes, so we can leave it alone for now. Let’s tackle our item table next, as this is reasonably straightforward. Our only difficulty with the item table is that each item could have more than one barcode. As we discussed earlier in the book, having an unknown number of repeating columns in a database table is not generally possible. (PostgreSQL does have an array data type, but that is quite unusual and should be used with caution; we prefer to stick to standard column types.) Suppose most items have two barcodes, but some have three, so we decide that an easy solu- tion is to add three columns to the item table: barcode1, barcode2, and barcode3. This seems like a nice solution to the problem, but it doesn’t stand up to closer scrutiny. What happens when a product comes along that has four barcodes? Do we redesign our database structure to add a fourth barcode column? How many columns are “enough”? As we saw in Chapter 2, having repeated columns is very inflexible, and is almost always the wrong solution. Another solution we might think of is to have a variable-length string, and “hide” barcodes in that string, perhaps separated by a character we know doesn’t typically appear in barcodes, such as a semicolon. Again, this is a very bad solution, because we have stored many pieces of information in the same location. As with a good spreadsheet, it’s very important to ensure that each entity is stored separately, so entities can be processed independently. We need to separate the repeating information—the barcodes—into a new table. That way, we can arrange to store an arbitrary number of barcodes for each item. While we are breaking out the barcode, we also need to consider the relationship between an item and a barcode. Thinking from the item side first, we know that each item could have no barcodes, one barcode, or many barcodes. Thinking from the barcode end, we know that each barcode must be asso- ciated with exactly one item. A barcode on a product is always the lowest level of identifier, identifying different versions of products, such as promotional packs or overfill packs, while the core product remains the same. We can draw this relationship as shown in Figure 12-2. Figure 12-2. The relationship between item and barcode entities This shows that each item can have zero, one, or many barcodes, but a barcode belongs to exactly one item. Notice that we have not identified any columns to join the two tables. This will come later. The important thing at this point is to determine relationships, not how we will enforce them in the database. Now we can move on to the order table, which is slightly harder to analyze. The first problem is how to represent the products that have been ordered. Often, orders will consist of more than one product, so we know that we have a repeating set of information relating to orders. As before, this means that we must separate the products being ordered into another table. We will call our main order table orderinfo, and call the table we split out to hold the products Page 369 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 369 ordered orderline, since we can imagine each row of this table corresponding to a line on a paper order. Now we need to think about the relationship between the orderinfo and orderline tables. It makes no sense to have an order for nothing, or to prevent a single order from having multiple items, so we know that orderinfo to orderline must have a one-to-many relationship. Thinking about an orderline, we realize that each orderline must relate to exactly one actual order, so the relationship between the two is that for each orderline entry, there must be exactly one orderinfo entry. Figure 12-3 illustrates this relationship. Figure 12-3. The initial design for the orderline to orderinfo relationship If you think about this a little more carefully, you can see a possible snag. When people go into a shop, they do not generally order things one at a time: � I’d like a coffee please. � I’d like a coffee please. � I’d like a donut please. � I’d like a milkshake please. � I’d like a coffee please. � I’d like a donut please. They are much more likely to express their order as follows: I’d like three coffees and two donuts and a milkshake please. Currently, our design copes perfectly with the first situation, but it can cope with the second situation only by converting it to the many single lines situation. Now we might decide this is okay, but if we are going to print out an order for a large round of coffees, milkshakes, and donuts, it’s going to look a bit silly to the customer if each item has a separate line. We are also making life difficult for ourselves if we do a discount on multiple items ordered at the same time. For these reasons, we decide it would be better to store a quantity against each line, as shown in Figure 12-4. This way, we can store each type of product in an order only once, and store the quantity of the product required in a separate column. Figure 12-4. The corrected design for the orderline to orderinfo Page 370 Tuesday, March 8, 2005 2:21 PM 370 CHAPTER 12 ■ DATABASE DESIGN Now we have a basic conceptual design for all our entities, as shown in Figure 12-5. It’s time to relate them to each other. Figure 12-5. First full set of entities We can see that we have three core groups of entities, and look at how the three groups relate to each other. In this simple database, it’s immediately obvious that customer rows must relate to orderinfo rows. Looking at the relationship between items and orders, we can see that the relationship is not between the orderinfo and the item, it is between the orderline and the item. How exactly do customers relate to orders? Clearly, each order must relate to a single customer, and each customer could have many orders, but could a customer have no orders? Although not very likely, it could happen, perhaps while a customer account is being set up, so we will allow the possibility of a customer with no orders. Similarly, we must define the exact relationship between item and orderline. Each orderline is for an item, so this relationship is exactly one. In the opposite direction, item to orderline, any individual item could have never been ordered, or could appear on many different order lines, so the relationship is zero or many. Adding these relationships gives us Figure 12-6. We now have what we believe to be a complete map of all the major entities and their most important attributes, broken down where we think we need to store them in individual columns, and a diagram showing the relationship between them. We have our first conceptual database Page 371 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 371 Figure 12-6. The full conceptual data model Validating the Conceptual Design At this point, it’s vital that you stop and validate your initial conceptual design. A mistake at this stage will be much harder to correct later. It is a well-known tenet of software engineering that the earlier you find an error, the less it costs to fix. Some studies have suggested that the cost of correcting an error increases by a factor of ten for each stage in the development process. Invest in getting the requirements capture correct and the initial design right. This doesn’t mean you can’t take an iterative approach if you prefer, but it is a little harder with database design. This is because after the first iteration, you may have significant volumes of live data in your database. Migrating this data to a later design can be challenging in its own right, never mind the application developers not being impressed with needing changes in their code to handle an “improvement” in the underlying database design If you have access to the future users of the system, this is the point at which you should go back and talk to them. Show them the diagram, and explain to them what it means, step by step, to check that what you have designed conforms to their expectations of the system. If your design is partially based on an existing database, go back and revisit the original, to check that you have not missed anything vital. Most users will understand a basic entity relationship diagram such as this, provided that you sit with them and talk them through it. Not only does it help you validate the design, but it also makes users feel involved and consulted in the development. Converting to a Physical Model Once you have a logical model of the data, which has been checked for logical correctness, you can start to move toward a physical representation of this design. This stage also has several Page 372 Tuesday, March 8, 2005 2:21 PM 372 CHAPTER 12 ■ DATABASE DESIGN Establishing Primary Keys The first step is usually to decide what the primary keys of each table will be. Here, we will work through our tables one at a time, considering them individually, and decide which piece of data (attribute) in each row will make that row unique, if any. We will be generating candidate keys—possible data items that make each row uniquely identifiable—and then picking one of the candidate keys to be the primary key. If we can’t find any good candidate keys, we may resort to a logical primary key, which is an attribute created specifically to act as a primary key. ■Tip If you do find that you need to create a special key to act as a primary key, this may be an indication that your attribute list is not complete. It’s always worth revisiting your attribute list if you find there is no obvious primary key. We will first check for a single column that will be unique, and then look for combinations that will be unique. We must also check that none of the columns in our candidate key could ever be NULL. It would make no sense to have a primary key whose value, or part of whose value, could be unknown. Indeed, SQL databases, including PostgreSQL, will automatically enforce the restriction that you may not store a NULL value in a column being used as a primary key. When looking for columns to use as a primary key, be aware that the shorter the field length, the more efficient searching for particular values will be and the smaller the overhead in the database will be. When you make a column a primary key, an index is constructed for that column, both to enforce the requirement that its values are unique and also to enable the database to find values in the column efficiently. Generally, tables are searched using their primary key columns far more often than any other column, so it is important that this can be done efficiently. You can imagine that searching a column for a description that is 1,000 characters long will be much slower than searching for a particular integer value. Having a primary key column that has many characters also makes the index tree that must be built very large, adding to the overhead. For these reasons, it is important that you try to choose columns with small fields as primary keys; integer values are ideal, short strings, particularly fixed-length strings, are tolerable. Using other data types as primary key columns is usually best avoided. Now let’s identify primary keys for the tables in our sample database: barcode table: This is straightforward. We have only one column, and there is only one candidate key: barcode. Barcodes are unique, and generally short; therefore, this candidate key makes a good primary key. customer table: It’s reasonably easy to see that no single column will give us a unique key for each row, so we move on to look at combinations of columns we might use. Let’s consider some possibilities: � First names and last name combined. This might be unique, but we can’t be certain we will never have two customers with the same Page 373 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 373 � Last name and ZIP code. This is better, bu t still not guaranteed to be unique, since there could be a husband and wife who are both customers. � First name, last name, and ZIP code. This is probably unique, but again not a certainty. It’s also rather messy and inefficient to need to use three columns to get to a unique key. One is much preferable, though we will accept two. There is no clear candidate key for the customer table, so we will need to generate a logical key that is unique for each customer. To be consistent, we will always name logical keys table name_id, which gives us customer_id. orderinfo table: This table has exactly the same problem as the customer table. There is no clear way of uniquely identifying each row, so again, we will create a key: orderinfo_id. item table: We could use the description here, but descriptions could be quite a large text string, and long text strings do not make good keys, since they are slow to search. There is also a small risk that descriptions might not always be unique, even though they probably should be. Again, we will create a key: item_id. orderline table: This table sits between the orderinfo table and the item table. If we decide that any particular item will appear on an order only once, because we handle multiple items on the same order using a quantity column, we could consider the item to be a candidate key. In practice, this won’t work, because if two different customers order the same item, it will appear in two different orderline rows. We know that we will need to find some way of relating each orderline row to its parent order in orderinfo, and since there is no column present yet that can do this, we know we will need to add one. We can postpone briefly the problem of candidate keys in the orderline table, and come back to it in a moment. Establishing Foreign Keys After establishing primary keys, you can work on the mechanism to use to relate the tables together. The conceptual model shows the way the tables relate to each other, and you have also established what uniquely identifies each row in a table. When you establish foreign keys, often all you need to do is ensure that the column you have in one table identified as a primary key also appears in all the other tables that are directly related to that table. After adjusting some column names in our tables to make them a little more meaningful, and changing the relationship lines to a physical model version, where we simply draw an arrow that points at the “must exist” table, we have a diagram that looks like Figure 12-7. Notice how the diagram has changed from the conceptual model as we move to the physical model. Now we are showing information about how tables could be physically related, not about the cardinality of those relationships. We have shown the primary key columns underlined. Don’t worry about the data types or sizes for columns yet; that will be a later step. We have deliberately left all the column types as char(10). We will revisit the type and sizes of all the columns shortly. For now, we need to work out how to relate tables. Usually, this simply entails checking that the primary key in the “must exist” table also exists in the table that is related to it. In this case, we needed to add customer_id to orderinfo, orderinfo_id to orderline, and item_id to Page 374 Tuesday, March 8, 2005 2:21 PM 374 CHAPTER 12 ■ DATABASE DESIGN Figure 12-7. Initial conversion to a physical data model Notice the orderline table in Figure 12-7. We can see that the combination of item_id and orderinfo_id will always be unique. Adding in the extra column we need has solved our missing primary key problem. We have one last optimization to make to our schema. We know that, for our particular business, we have a very large number of items, but wish to keep only a few of them in stock. This means that for our item table, quantity_in_stock will almost always be zero. For just a single column, this is unimportant, but consider the problem if we wanted to store a large amount of information for a stocked item, such as the date it arrived at the warehouse, a warehouse location, expiry dates, and batch numbers. These columns would always be empty for unstocked items. For the purposes of demonstration, we will separate the stock information from the item information, and hold it in its own table. This is sometimes referred to as a subsidiary table. Our physical design, with relationships added, primary keys defined (shown as underlined), and the stock information broken out, looks like Figure 12-8. Notice we have been careful to ensure that all related columns have the same name. We didn’t need to do this. We could have had a customer_ident in the orderinfo table that matched customer_id in the customer table. However, as we stressed earlier, database designs that empha- size consistency are much easier to work with. So, unless there are very good reasons to do otherwise, we strongly urge you to keep column names identical for columns that are related to each Page 375 Tuesday, March 8, 2005 2:21 PM CHAPTER 12 ■ DATABASE DESIGN 375 Figure 12-8. Conversion to physical data model with stock as a subsidiary table It’s also a good idea to be consistent in your naming. If you need an ident column as a primary key for a table, then stick to a naming rule, preferably one that is table name_something. It doesn’t matter if you use id, ident, key, or pk as the suffix. What is important is that the naming is consistent across the database. Establishing Data Types Once you have the tables, columns, and relationships, you can work through each table in turn, adding data types to each column. At this stage, you also need to identify any columns that will need to accept NULL values, and declare the remaining columns as NOT NULL. Notice that we start from the assumption that columns should be declared NOT NULL, and look for exceptions. This is a better approach than assuming NULL is allowed, because, as explained in Chapter 2, NULL values in columns are often hard to handle, so you should minimize their occurrence where you can. Generally, columns to be used as primary keys or foreign keys should be set to a native data type that can be efficiently stored and processed, such as integer. PostgreSQL will auto- matically enforce a constraint to prevent primary keys from storing NULL Page 376 Tuesday, March 8, 2005 2:21 PM 376 CHAPTER 12 ■ DATABASE DESIGN Assigning a data type for currency is often a difficult choice. Some people prefer a money type, if the database supports it. PostgreSQL does have a money type, but the documentation urges people to use numeric instead, which is what we have chosen to do in our sample data- base. You should generally avoid using a type with undefined rounding characteristics, such as a floating-point type like float(P). Fixed-precision types, such as numeric(P,S), are much safer for working with financial information, because the rounding behavior is defined. For text strings, there are a wide choice of options. When you know the length of a field exactly, and it is a fixed length, such as barcode, you will generally choose a char(N) type, where N is the required length. For other short text strings, we also prefer to us fixed-length strings, such as char(4) for a title. This is largely a matter of preference, however, and it would be just as valid to use a variable-length type for these strings. For variable-length text columns, PostgreSQL has the text type, which supports variable- length character strings. Unfortunately, this is not standard and, although similar extensions do appear in other databases, the ISO/ANSI definition defines only a varchar(N) text type, where N specifies a maximum length of the string. We value portability quite highly, so we stick with the more standard varchar(N) type. Again consistency is very important. Make sure all your numeric type fields have exactly the same precision. Check that commonly used columns such as description and name, which might appear in several tables in your database, aren’t defined differently (and thus used in different ways) in each. The fewer unique types and character lengths that you need to use, the easier your database will be to manage. Let’s work through the customer table, seeing how we assign types. The first thing to do is give a type to customer_id. It’s a column we added specially to be a primary key, so we can make it efficient by using an integer type. Titles will be things like Mr, Mrs, or Dr. This is always a short string of characters; therefore, we make it a char(4) type. Some designers prefer to always use varchar to reduce the number of types being used, and that would also be a perfectly valid choice. It’s possible not to know someone’s title, so we will allow this field to store NULL values. We then come to fname and lname, for first and last names. It’s unlikely these ever need to exceed 32 characters, but we know the length will be quite variable, so we make them both varchar(32). We also decide that we could accept fname being a NULL, but not lname. Not knowing a customer’s last name seems unreasonable. In this database, we have chosen to keep all the address parts together, in a single long field. As was discussed earlier, this is probably oversimplified for the real world, but addresses are always a design challenge; there is no fixed right answer. You need to do what is appropriate for each particular design. Notice that we store phone as a character string. It is almost always a mistake to store phone numbers as numbers in a database, because that approach does not allow interna- tional dialing codes to be stored. For example, +44 (0)116 … would be a common way of representing a United Kingdom dialing code, where the country code is 44, but if you are already in the United Kingdom, you need to add a 0 before the area code, rather than dialing the +44. Also, storing a number with leading zeros will not work in a numeric field, and in phone numbers, leading zeros are very important. We continue assigning types to columns in this way. The final type allocation for our physical database design is shown in Figure 12-9.

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.