What is AWS VPC | Amazon Virtual Private Cloud
Understanding the concept of a Virtual Private Cloud (VPC) can be daunting. Of course, you don’t need to know the innermost details about how a VPC works to use one.
But it helps to understand what it can do for you so that you can use it more effectively. The first section of this blog examines what the Amazon VPC provides in the way of features.
You can try to define where you can use the Amazon VPC in your organization and where you might want to avoid it. In many cases, an experimenter can ignore the VPC completely. In fact, a small business may not need to pay much attention to it, either.
However, after your organization starts to get to a certain size, you can’t really ignore the VPC any longer because you need to configure it to interact with your setup in the correct manner.
This means performing some basic configuration tasks, as a minimum. This blog doesn’t get involved in low-level configuration details, which could actually require a whole blog to talk about.
The default VPC can serve the needs of most individuals and small businesses. In fact, even a medium-sized business can probably use the default VPC without any problem.
However, some situations exist, such as when you need to create custom subnets or obtain access to special VPC features, in which creating a custom VPC becomes important.
This blog gives you an overview of the process for creating a custom VPC. What you end up with is an idea of why, how, and when you use a custom VPC to meet specific business needs without delving into details that will cause your head to spin.
Virtual Private Cloud (VPC) Features
The idea behind a VPC is to create an environment in which a system separates the physical world from an execution environment. Essentially, VPC is a kind of virtual machine combined with a Virtual Private Network (VPN) and some additions that you probably won’t find with similar setups. Even so, the concept of using VPC as a virtual machine is the same as any other virtual machine.
You can read more about the benefits of using a virtual machine at https://www.linux.com/learn/why-when-and-how-use-virtual-machine. The connectivity provided by a VPC is akin to the connectivity provided by any other VPN.
When working with AWS, you never actually see or interact with the physical device running the code that makes the resources you create active. You don’t know where the physical hardware resides or whether other VPCs are also using the same physical hardware as you are.
In fact, you have no idea of whether the code used to create an EC2 instance even resides on just one physical machine. The virtual environment — this execution environment that doesn’t exist in the physical world — lets you improve overall reliability and make recovering from crashes easier.
In addition, the virtual nature of the environment fully separates the code that your organization executes from code that any other organization executes. This concept of total separation tends to make the environment more secure as well.
The following sections describe what a VPC is in more detail and why you can benefit from one in making your organization Internet-friendly.
Defining the VPC and the reason you need it
The Internet is the public cloud. Anyone can access the Internet at any time given the correct software. You don’t even need a browser. Applications access the Internet all the time without using one — a browser is simply a special kind of Internet access application.
Despite the public nature of the Internet, it actually provides four levels or steps that you follow from being completely public to being nearly private:
1. The public, unrestricted Internet
2. Sites that limit access to community data using logins and other means
3. Sites that provide access to individually identifiable data for pay or other considerations through a secure connection
4. A nearly private connection that is accessible only between consenting parties (and any hackers that may be listening)
The initial step that everyone takes is the public Internet.
You do something to access the Internet; you may use your smart television or some alternative means, but you take this initial step every time you begin a session.
What the Internet really provides is access to a much larger network in which anyone can find resources and use them to meet specific needs. For example, you might read the news stories on a site while someone else downloads precisely the same information and analyzes it in some manner.
Seeing the data of the Internet is important because it helps you understand that the Internet isn’t about games or information; rather, it’s about connections to resources that mainly revolve around data.
To take the next step, you need to consider all the sites out there that limit your access to the data that the Internet provides. For example, when you want to read news stories on some sites, you must first log in to the site.
The need to log in to the site represents a connectivity hurdle that you must overcome in order to gain access to the resource, which is data. Whether you read the data or analyze it, you must still log in. The site is still public. Anyone who has an account can access it.
A third step is public sites that host private data. For example, when you make a purchase at Amazon, you first log in to your account. The data is visible only to you, not anyone else with an account. All others with an account see only their private data as well. However, the site itself is still public.
A VPC is the fourth step. In this case, you separate everything possible from everything else using a variety of software-oriented techniques, with a little machine-level hardware reinforcement.
Keeping everything separated reduces security issues. After all, you don’t want another organization (or a hacker) to know anything about what you’re doing.
Realize that you are potentially using the same physical hardware and definitely using the same cable as other people. The lack of capability to create a separate physical environment is the reason that hackers continue to create methods of overcoming security and gaining access to your resources anyway.
The reason you need a VPC is to ensure that your cloud computing is secure, or at least as secure as possible when it comes to allowing any connectivity to the outside world. In fact, without a VPC, you couldn’t use the cloud for any sensitive data, even if you had no requirement for keeping the data secure legally or ethically.
Without a VPC, every communication would be akin to creating a post on Facebook: Anyone could see it. VPCs are actually quite common because they’re so incredibly useful. Here are some other vendors that make VPCs available as part of their offerings:
HP Hybrid Cloud – HPE Helion
Microsoft Azure
Other offerings exist, especially on the regional level. VPC certainly isn’t unique to Amazon; it’s becoming a common technology, and you need to ensure that the Amazon offering suits your needs.
Of course, if you want to use a VPC with a product such as EC2, you really do need the Amazon offering because both are part of AWS.
Getting an overview of the connectivity options
How you make a connection to a VPC is important because different connection types have different features and characteristics. Choosing the right connection option will yield significant gains in efficiency, reliability, and security. You might also see a small boost in speed. The following list describes the common VPC connectivity options:
AWS Hardware VPN: You generally use a hardware router and gateway to provide the Internet Protocol Security (IPSec) connection to your VPN. The AWS router connects to your customer gateway (and through it to your router) using a Virtual Private Gateway (VPG). You can discover more about this option at http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_VPN.html.
AWS Direct Connect: This high-end option relies on a dedicated connection between your network and AWS. The “Connecting directly to AWS with Direct Connect” sidebar, later in this blog, provides additional information about using this connectivity option.
AWS VPN CloudHub: Sometimes you need to connect more than one customer network to a single AWS VPC. CloudHub works like having multiple AWS Hardware VPNs in many respects.
Software VPN: A software VPN offers a minimalistic approach to creating a connection between a VPC and your network. You rely on software to simulate the actions normally performed by hardware to create the connection.
Of all the options, this one is the slowest because you rely on software to perform a task best done with hardware. However, small and even medium-sized businesses may find that it works without a problem.
The only issue is that AWS doesn’t actually provide software VPN support, so you need to rely on one of the third parties listed in the AWS Marketplace.
Discovering the typical use cases
Connectivity comes in many forms today, so you need to know the connectivity options that a particular solution offers. Choosing a particular kind of connectivity affects how you use your VPC. Here’s an overview of the kinds of connectivity you can achieve using the AWS VPC:
Public subnet: Connect directly to the Internet from your EC2 or other supported service instance. Using a public subnet makes the information you present directly accessible from the Internet.
This doesn’t mean that anyone can access the information — you can still put security in place — but it does mean that anyone can gain access using a standard URL.
Private subnet: Connects EC2 or other supported service instances to the Internet using Network Address Translation (NAT). The IP address is now private, which means that the NAT controls access.
You rely on a technique called port forwarding to assign a port to the virtual machine. Requests sent to the NAT on that port gain admission to the virtual machine. In this way, the NAT provides an additional level of security.
VPN: Creates a connection to a data center using an encrypted IPsec hardware VPN connection. The resulting connection relies on the VPN to ensure both privacy and security.
VPC: Defines a peering connection between two VPCs, even VPCs owned by other AWS accounts. When working with AWS, the connection relies on private IP addresses and the existing VPC infrastructure rather than a separate piece of hardware, so no single point of failure exists for the connection.
Direct service: Creates a connection directly to an AWS service, such as S3, which allows you to interact with the content of that service, such as your S3 buckets, without having to rely on an Internet gateway or NAT.
To use this approach, you need to create endpoints in the VPC to allow service access.
None of these options has to stand alone. You can create any combination of connectivity types required by your setup.
For example, creating both private and public subnets as needed is entirely possible. The point is that you use these connectivity options to make working with AWS easier.
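As a rough illustration of the difference between the public and private subnet options, the following sketch classifies a subnet by looking at where its default route points. It assumes a simplified, hypothetical route format (a list of dictionaries with destination and target keys) rather than anything AWS returns directly; the igw- and nat- prefixes are the usual identifier prefixes for Internet gateways and NAT gateways.

```python
def classify_subnet(routes):
    """Classify a subnet from its route table entries.

    `routes` is a hypothetical, simplified structure: a list of dicts
    with "destination" (a CIDR string) and "target" (a resource ID).
    """
    for route in routes:
        if route["destination"] != "0.0.0.0/0":
            continue  # only the default route decides Internet reachability
        target = route["target"]
        if target.startswith("igw-"):
            return "public"          # default route goes to an Internet gateway
        if target.startswith("nat-"):
            return "private (NAT)"   # outbound traffic is translated by a NAT
    return "isolated"                # no default route to the Internet at all


# Example route tables using made-up resource IDs.
public_routes = [
    {"destination": "172.31.0.0/16", "target": "local"},
    {"destination": "0.0.0.0/0", "target": "igw-0example"},
]
private_routes = [
    {"destination": "172.31.0.0/16", "target": "local"},
    {"destination": "0.0.0.0/0", "target": "nat-0example"},
]

print(classify_subnet(public_routes))
print(classify_subnet(private_routes))
```

A VPC that combines both kinds of subnet simply has more than one route table, each classified independently.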
Managing the Default VPC
The moment you create an instance of anything in AWS, you also create a default VPC. Actually, Amazon creates the default VPC for you, and you use it to launch instances of any service you want to use.
AWS provides a default VPC in each region you work in, so you don’t need to create a new VPC if you decide to create instances of services in other regions as your business grows.
Your default VPC includes these features:
A default subnet in each availability zone to support networking functionality
An Internet gateway enabling you to connect to your VPC
A route table that ensures that Internet traffic goes where it’s supposed to go
The default security group used to keep your VPC secure
The default network Access Control List (ACL) used to define which inbound and outbound traffic is allowed to reach the subnets in the VPC
Dynamic Host Configuration Protocol support to provide IP addresses and other information associated with your VPC to requestors
Most businesses will never need anything more than the default VPC. However, many businesses will need to modify the default VPC so that it better meets their needs.
The direct service option using an endpoint is quite attractive for less complex needs because it requires less effort and promises significantly reduced costs compared to other options.
Currently, AWS supports endpoints only for the S3 service. However, Amazon promises to make endpoints available for other services in the near future.
The most important feature of endpoints is that they offer a configuration-only option that you don’t have to jump through hoops to use. An endpoint is actually a VPC component that provides the same redundancy and scaling that the VPC provides.
The following steps help you create an S3 endpoint:
1. Sign in to AWS using your administrator account.
2. Navigate to the VPC Management Console at https://console.aws.amazon.com/vpc.
You see the VPC Dashboard page of the VPC Management Console. Notice the Navigation pane on the left side of the screen. These entries represent the kinds of connection that you can create using the VPC, along with some configuration options.
The Navigation pane also contains entries for configuring security and for creating some of the more complex connection types.
AWS provides two main options for configuring an endpoint. Using the VPC Wizard creates a custom VPC for defining connectivity to a service.
You use this option when you need to ensure that the communication remains private and doesn’t interfere with your default VPC. However, this solution also adds to your costs because now you’re running two VPCs (or possibly more).
Manually configuring the endpoint is more flexible. This is the option used for the example in this blog because it allows you to create an S3 connection to your existing, default VPC.
Both options will allow you to connect to S3 at some point, but depending on your setup, one option may provide a better solution. Generally, if you’re in doubt as to which approach to use, try the manual configuration first. You can always delete the nonfunctional default VPC endpoint and configure one using the wizard later.
3. Click Endpoints in the Navigation pane.
You see the Endpoints page, which doesn’t have anything in it now except a Create Endpoint button (or possibly two) and an Actions button.
4. Click the Create Endpoint button.
AWS presents the Create Endpoint page. This page lets you choose a VPC (you likely have only the default VPC) and a service (which is limited to S3 for now). You must also choose the level of access to provide.
Normally you create a custom policy to ensure that only the people who want to have access to your S3 setup can do so. For the purposes of this example (to keep things simple), the steps use the Full Access option.
5. Choose the default VPC entry in the VPC field.
6. Choose the S3 service entry that you want to use in the Service field.
7. Choose the Full Access option and then click Next Step.
You see the Step 2: Configure Route Tables page. Notice that you aren’t actually configuring anything. All you really need to do is select the only route table offered in the list. Unless you have created specialized route tables for your default VPC configuration, you don’t need to provide anything more than a selection in this step.
8. Select the route table that you want to use and then click Create Endpoint.
You see a Creation Status page, telling you that AWS has created the endpoint.
9. Click the View Endpoints button.
You see a list of endpoints for the selected VPC. At this point, you have an endpoint, something to which you can connect. However, you don’t have the means to connect to it.
To create the required connection from your PC, you need an external connection to your VPC. After you have the connection to the VPC, you also have access to S3 through the endpoint.
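The steps above use the Full Access option, but step 4 notes that you normally create a custom policy. The following is a minimal sketch of what such a policy document might look like when built in Python. The bucket name example-bucket and the choice of allowed actions are assumptions made for illustration; the overall shape (Version, Statement, Effect, Principal, Action, Resource) follows the standard AWS policy document layout.

```python
import json


def build_endpoint_policy(bucket_name, actions=("s3:GetObject", "s3:ListBucket")):
    """Build a policy document that limits an S3 endpoint to one bucket.

    The bucket name and action list are illustrative assumptions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": list(actions),
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",      # the bucket itself
                    f"arn:aws:s3:::{bucket_name}/*",    # every object in it
                ],
            }
        ],
    }


policy = build_endpoint_policy("example-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of text you would paste into the custom policy field instead of choosing Full Access.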
Working with subnets
Generally, you begin with a number of subnets for your AWS setup, using one for each of the availability zones in your region.
For example, when working in the us-west-2 region, you have three subnets: us-west-2a, us-west-2b, and us-west-2c. You want to avoid confusing the region with the availability zone.
The region is a grouping of one or more availability zones. Each availability zone is a specific physical location within the region. When you’re working with the command line and other AWS features, AWS might ask you to provide your region, not your availability zone (or vice versa).
Using the wrong value can result in commands that don’t work or that incorrectly configure features.
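One way to avoid mixing up the two values is to derive the region from the availability zone name. The sketch below assumes the common naming pattern in which an availability zone is the region name followed by a single letter (us-west-2a belongs to us-west-2); the helper name is hypothetical, and zone names that don’t follow this pattern are rejected rather than guessed at.

```python
import re

# Region name (for example "us-west-2") followed by one zone letter.
_AZ_PATTERN = re.compile(r"^([a-z]{2}(?:-[a-z]+)+-\d+)([a-z])$")


def split_availability_zone(name):
    """Return (region, zone_letter) for a name such as 'us-west-2a'."""
    match = _AZ_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not look like an availability zone")
    return match.group(1), match.group(2)


region, letter = split_availability_zone("us-west-2a")
print(region, letter)
```

Passing a region such as us-west-2 to this function raises an error, which is a quick way to catch the wrong value before it reaches a command.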
To access these subnets, choose Subnets in the Navigation pane. You see a listing of subnets. Each subnet lists its status along with other essential information that you need to access features in AWS.
These three subnets are internal. You use them as part of working with AWS. Deleting these subnets will cause you to lose access to AWS functionality, so the best idea is to leave them alone unless you need to perform specific configuration tasks.
The following sections describe some subnet-specific tasks that you can perform in the VPC Management Console.
Creating a new subnet
In some cases, you need to create new subnets to support specific VPC functionality. The best option is to allow the various wizards to create these subnets as needed for you, but sometimes you need to create them manually. In that case, you click Create Subnet to define a new subnet. You see the Create Subnet dialog box.
Type a descriptive name for the subnet in the Name Tag field. You also need to define a Classless Inter-Domain Routing (CIDR) entry in the CIDR Block field.
The “Creating a New VPC” section, later in this blog, describes how a CIDR works. You can find a calculator for creating one at http://www.ipaddressguide.com/cidr.
If a CIDR is outside the expected range, AWS displays an error message telling you what is wrong with the entry you typed. To create the subnet, click Yes, Create.
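You can catch most CIDR mistakes before typing the value into the console. The following sketch uses Python’s standard ipaddress module to check a proposed subnet block against a VPC block. The example blocks are assumptions chosen for illustration, and the /16 to /28 limit reflects the IPv4 block sizes that AWS accepts.

```python
import ipaddress


def check_subnet_cidr(vpc_cidr, subnet_cidr):
    """Return a list of problems with subnet_cidr; an empty list means it looks fine."""
    vpc = ipaddress.IPv4Network(vpc_cidr)
    try:
        # strict parsing rejects entries such as 10.0.1.5/24 (host bits set)
        subnet = ipaddress.IPv4Network(subnet_cidr)
    except ValueError as exc:
        return [f"not a valid CIDR block: {exc}"]

    problems = []
    if not 16 <= subnet.prefixlen <= 28:
        problems.append("prefix length must be between /16 and /28")
    if not subnet.subnet_of(vpc):
        problems.append(f"{subnet} is outside the VPC range {vpc}")
    return problems


print(check_subnet_cidr("172.31.0.0/16", "172.31.64.0/20"))
print(check_subnet_cidr("172.31.0.0/16", "10.0.0.0/24"))
```

The check doesn’t look for overlap with subnets that already exist in the VPC; that would require the list of current subnets as an additional input.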
Removing an existing subnet
At some point, you may also need to remove an existing subnet. To perform this task, select the subnet entry on the Subnets page and then choose Subnet Actions ➪ Delete Subnet.
AWS asks whether you’re sure that you want to delete the subnet. Click Yes, Delete to complete the action.
Modifying the network ACLs
The Network ACL tab of a selected subnet on the Subnets page contains the Access Control List (ACL) associated with that subnet. The ACL controls the inbound and outbound rules for accessing that subnet.
If you click Edit on that tab, you can choose a different Network ACL policy, but you can’t change any of the rules. In general, you use the same policy for all the availability zones for a particular region. Consequently, the default configuration contains only one Network ACL to choose from.
The Network ACLs page (selected by choosing Network ACLs in the Navigation pane) controls the actual rules used to govern the subnet access. The default entry doesn’t include a name, but you can give it one by clicking in the empty field associated with the Name column.
The status information includes a listing of the number of subnets associated with the Network ACL. Only the initial Network ACL will contain Yes in the Default column.
Network ACLs consist of a series of inbound and outbound rules. The inbound rules control access to the associated resources from outside sources, while the outbound rules control access to outside resources by inside sources. Both sets of rules play an important role in keeping your configuration safe.
The inbound rules appear on the Inbound Rules tab of a selected Network ACL. The Outbound Rules tab looks the same and works the same as the Inbound Rules tab.
To change any of the rules, add new rules, or delete existing rules, click Edit. The display changes to show each of the existing rules with fields that you can modify. You can change the rule number, traffic type, protocol, port range, and the sources or destinations that are allowed access.
To remove a rule, click the X at the end of its entry in the list of rules. Likewise, to add a rule, click Add Another Rule. The rule changes don’t take effect until you click Save to save them.
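The rule number matters because a Network ACL evaluates its rules in ascending numeric order and applies the first rule that matches; traffic that matches nothing is denied. The sketch below models that behavior with a hypothetical AclRule class. The field names and sample addresses are assumptions for illustration, not the console’s own data format.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    number: int      # lower numbers are evaluated first
    protocol: str    # "tcp", "udp", or "all"
    port_from: int
    port_to: int
    cidr: str        # source (inbound) or destination (outbound) range
    allow: bool


def evaluate(rules, protocol, port, ip):
    """Return "ALLOW" or "DENY" for one packet, using first-match-wins ordering."""
    address = ipaddress.ip_address(ip)
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.protocol not in ("all", protocol):
            continue
        if not rule.port_from <= port <= rule.port_to:
            continue
        if address in ipaddress.ip_network(rule.cidr):
            return "ALLOW" if rule.allow else "DENY"
    return "DENY"  # nothing matched, so the implicit final deny applies


inbound = [
    AclRule(100, "tcp", 443, 443, "0.0.0.0/0", True),
    AclRule(90, "tcp", 443, 443, "203.0.113.0/24", False),
]

print(evaluate(inbound, "tcp", 443, "203.0.113.7"))
print(evaluate(inbound, "tcp", 443, "198.51.100.1"))
```

Rule 90 blocks one address range before rule 100 gets a chance to allow it, which is why leaving gaps between rule numbers makes later insertions easier.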
Creating a Network ACL
To create a new Network ACL, click Create Network ACL. You see the Create Network ACL dialog box. Type a name for the new Network ACL in the Name Tag field. Choose a VPC to associate it with in the VPC field.
Click Yes, Create to complete the process. Even though you see the new Network ACL in the list at this point, you still need to configure it. The default settings don’t allow any access in or out, which is a safety feature to ensure that you don’t have rules that allow unwanted access.
The VPC Management Console won’t let you delete the default Network ACL or a Network ACL that’s currently in use. You can, however, delete Network ACLs that you no longer need.
Select the Network ACL that you no longer want and then click Delete. AWS asks whether you’re certain that you want to delete the Network ACL. Click Yes, Delete to complete the process.
Creating a New VPC
You may decide to create a new VPC for any of a number of reasons. Perhaps you simply want to ensure that your private network remains completely separated from any public-facing applications that you install in AWS.
The point is that you can create a custom VPC when needed to perform specific tasks. In most cases, you want the custom VPC in addition to the default VPC that AWS created for you when you started using AWS.
As with most AWS objects, deleting your default VPC removes it entirely. There is no undo feature, so the default VPC is completely gone. It’s essential to keep your default VPC in place until you no longer need it.
The new VPC you create will definitely perform the tasks you assign to it, but keeping the old VPC around until you’re certain that you no longer need it is always the best idea.
Deleting a VPC also deletes all the subnets, security groups, ACLs, VPN attachments, gateways, route tables, network interfaces, and VPC peer connections associated with that VPC, along with any service instances attached to the VPC.
The “Managing the Default VPC” section, earlier in this blog, tells you how to perform some essential VPC tasks. The following steps help you create a custom VPC that you can then configure as needed to perform the tasks you have in mind for it. When you get finished, you have an empty VPC that you can fill with anything you want.
1. Sign in to AWS using your administrator account.
2. Navigate to the VPC Management Console at https://console.aws.amazon.com/vpc.
You see the VPC Management Console. The VPC Dashboard provides all the statistics for your default setup. In this example, the setup currently has one VPC containing all the usual default elements.
3. Click Start VPC Wizard.
You see the Select a VPC Configuration page. Selecting the right VPC template saves configuration time later and ensures that you get the VPC you want with lower potential for mistakes. The basic templates shown offer the range of access options that most businesses need and can modify to address specific requirements.
4. Choose the VPC with a Single Public Subnet option and then click Select.
The wizard displays the Step 2: VPC with a Single Public Subnet page. However, if you had selected one of the other options, you would see a similar page with configuration entries suited to that VPC template.
A few of the entries might look quite mysterious. The Classless Inter-Domain Routing (CIDR) entry simply defines the number of IP addresses available to your VPC. You can read about it at http://whatismyipaddress.com/cidr.
The handy CIDR calculator at http://www.ipaddressguide.com/cidr is also quite helpful. If you change the settings for the IP address range, the wizard automatically updates the number of available IP addresses for you.
You must also decide whether you plan to use S3. If so, you need to add an endpoint to it so that people can access it. Given that the VPC has only one subnet, you have only one choice in selecting a subnet for S3.
5. Type MyVPC in the VPC name field. Adding a name to your custom VPC makes it easier to identify.
6. Select Public Subnet in the Subnet field.
The display changes to show the policy that will control access to S3. The default option provides full access to S3. However, you can create a custom security policy to provide controlled access to S3 as needed.
7. Click Create VPC.
AWS creates the new VPC for you. You see a success message and additional instructions for launching an EC2 instance into the subnet.
The “Working with the Identity and Access Management (IAM) Console” section discusses many of the issues surrounding VPC security. After you create your custom VPC, you need to create security for it. The VPC Wizard doesn’t perform security tasks, such as creating a key pair, for you.
The best way to ensure that your custom VPC is accessible and will do what you need it to do at a basic level is to create an EC2 instance using the same techniques, and then work with it to begin performing various tasks.
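The CIDR entry described in step 4 determines how many addresses the wizard reports as available. The following sketch shows the arithmetic, assuming example IPv4 blocks and the AWS convention of reserving five addresses in every subnet (the first four and the last). The function name is hypothetical.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_subnet_addresses(cidr):
    """Return (total, usable) address counts for a subnet CIDR block."""
    network = ipaddress.ip_network(cidr)
    total = network.num_addresses            # 2 ** (32 - prefix length) for IPv4
    usable = max(total - AWS_RESERVED_PER_SUBNET, 0)
    return total, usable


for block in ("10.0.0.0/24", "10.0.0.0/28"):
    total, usable = usable_subnet_addresses(block)
    print(f"{block}: {total} addresses, {usable} usable")
```

Each step toward a longer prefix halves the block, so a /28, the smallest size allowed, leaves only a handful of addresses for instances.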
Moving Data Using Database Migration Service
Moving data between databases is an essential administration task. You can find all sorts of reasons to move data. Some of the most common reasons are
Changing the database vendor
Creating a common platform for all elements of an organization
Upgrading to obtain an improved feature set
Changing platforms (such as moving from a corporate server to the cloud)
Many other reasons exist for moving data, but the essential goal is to make data available to end users. If you consider all the kinds of data movement for a moment, you find that user needs trump everything else.
Even data analysis boils down to serving a user need in some respect, such as the creation of recommender systems to help improve sales or productivity by predicting other choices that the user might want to have from a complex list of choices.
The first section of this blog helps you understand how the Amazon Web Services (AWS) Database Migration Service (DMS) improves your capability to move data quickly, efficiently, and, most important, without errors.
This last requirement is hard to meet in many cases because different databases have different structures, type support, features, and all sorts of other issues that make movement nearly impossible without some kind of mistake.
Database movement occurs in two scenarios: homogenous moves between installations of the same Database Management System (DBMS) product (the software that performs the actual management of the data you send to it for storage) and heterogeneous moves among different DBMS products.
Homogenous moves are easiest because you don’t need to consider issues such as differences in database features nearly as often (except, possibly, when performing an upgrade move). The blog covers homogenous moves first for this reason. However, the blog does discuss both homogenous and heterogeneous moves.
This service is free, but the compute time, data transfer time, and storage resources above a certain amount aren’t. The charges for these items are quite small, however. According to Amazon’s documentation, you can migrate a 1TB database for as little as $3.
A list of prices appears at https://aws.amazon.com/dms/pricing/. Pricing varies by EC2 instance type (with the t2.micro instance used in the “Creating an instance” section of blog4 costing the least).
Data transfer charges don’t exist when you transfer information into a database, but you are charged when you transfer data out. Storage prices vary, but you get a certain amount of storage free (50GB in the case of the setup described in this blog). The setups in this blog won't cost you any money to perform.
Actually completing a migration will cost you money, but not much, and it provides experience in completing the tasks described. You need to decide how far you want to go in performing the exercises in this blog.
Considering the Database Migration Service Features
It’s important to know what to expect from the DMS before you begin using it in an actual project. For example, the main page at https://aws.amazon.com/dms/ advertises zero downtime.
However, when you read the associated text, you discover that some downtime is actually involved in migrating the database, which makes sense because you can’t migrate a database containing open records (even with continuous replication between the source and target).
The fact is that you experience some downtime in migrating any database, so you have to be careful about taking any claims to the contrary at face value. Likewise, the merits of a claim that a service is easy to use depend on the skills of the person performing the migration.
An expert DBA will almost certainly find the DMS easy to use, but a less experienced administrator may encounter difficulties. With these caveats in mind, the following sections clarify what you can expect from the DMS in terms of features you can use to make your job easier.
Choosing a target database
You already have a source database in place on your local network. If you’re happy with that database and simply want to move it to the cloud, you can perform a homogenous migration. A homogenous migration is the simplest type, in most cases, as long as you follow a few basic rules:
Ensure that the source and target database are the same version, have the same updates installed, and use the same extensions.
Configure the target database to match the source database if at all possible (understanding that the configuration may not provide optimal speed, reliability, and security in a cloud environment).
Define the same characteristics for the target database as are found in the source database, such as ensuring that both databases support the same security.
Perform testing during each phase of the move to ensure that the source and target databases really do perform the same way.
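The first three rules above amount to comparing a few properties of the source and target before you start. A minimal sketch of that comparison follows; the property names and sample values are hypothetical and would come from whatever inventory you keep for your databases.

```python
def find_mismatches(source, target, keys=("engine", "version", "extensions", "security")):
    """Return {property: (source_value, target_value)} for each property that differs."""
    mismatches = {}
    for key in keys:
        if source.get(key) != target.get(key):
            mismatches[key] = (source.get(key), target.get(key))
    return mismatches


# Made-up descriptions of a local source and a cloud target.
source_db = {"engine": "mysql", "version": "5.7", "extensions": ["audit"], "security": "tls"}
target_db = {"engine": "mysql", "version": "8.0", "extensions": ["audit"], "security": "tls"}

print(find_mismatches(source_db, target_db))
```

An empty result doesn’t prove that the move is safe; it only tells you that the listed properties agree, so the testing rule still applies.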
Don’t make the error of thinking that moving Microsoft SQL Server to Amazon Aurora is a homogenous data move.
Anytime that you must marshal the data (make the source database data match the type, format, and context of the destination database data) or rely on a product such as the AWS Database Migration Service to move the data, you are performing a heterogeneous data move (despite what the vendor might say).
Even if the two DBMSs are compatible, being compatible means that they aren’t precisely the same, which means that you can encounter issues related to heterogeneous moves.
Treating a move that involves two different products, even when those products are compatible, as a heterogeneous move is a smart way to view the process. Otherwise, you’re opening yourself to potential unexpected delays.
In some cases, you may decide to move data from a source database that works well in a networked environment to a target database that works well in the cloud environment.
The advantage of performing a heterogeneous move (one in which the source and target aren’t the same) is that you can experience gains in speed, reliability, and security. In addition, the target database may include features that your current source database lacks.
The disadvantage is that you must perform some level of marshaling (modifying the data of the source database to match the target database) to ensure that your move is successful.
Modifying data usually results in some level of content (the actual value of the data) or context (the data’s value when associated with other data) loss. In addition, you may find yourself rewriting scripts that perform well on the source database but may not work at all with the target database.
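To make the idea of content loss concrete, the following sketch marshals a single row into a hypothetical target schema that allows shorter text and fewer decimal places than the source. The schema format, column names, and values are all assumptions for illustration; real DBMS type mappings are considerably more involved.

```python
from decimal import Decimal, ROUND_HALF_UP


def marshal_row(row, target_schema):
    """Fit each value to the target column definition and report which columns changed.

    target_schema maps column name -> (kind, limit), where limit is the maximum
    length for "varchar" or the number of decimal places for "decimal".
    """
    converted, losses = {}, []
    for column, value in row.items():
        kind, limit = target_schema[column]
        if kind == "varchar":
            new_value = value[:limit]                       # truncate long text
        elif kind == "decimal":
            step = Decimal(10) ** -limit                    # e.g. 0.01 for two places
            new_value = value.quantize(step, rounding=ROUND_HALF_UP)
        else:
            new_value = value
        if new_value != value:
            losses.append(column)                           # content was altered
        converted[column] = new_value
    return converted, losses


row = {"name": "Alexandria Example", "price": Decimal("19.999")}
schema = {"name": ("varchar", 10), "price": ("decimal", 2)}

converted, losses = marshal_row(row, schema)
print(converted)
print("columns that lost content:", losses)
```

Logging the affected columns during a trial run gives you a list of places to check before committing to the target database.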
A decision to move to a new target database may come with some surprises as well (mostly of the bad sort). For example, you can move data from your Microsoft SQL Server database to the Amazon Aurora, MySQL, PostgreSQL, or MariaDB DBMS. Each of these target databases has advantages and disadvantages that you must consider before making the move.
For example, Amazon provides statistics to show that Amazon Aurora performs faster than most of its competitors, but it also locks you into using AWS with no clear migration strategy to other cloud-vendor products.
In addition, Amazon Aurora contains features that may not allow you to move your scripts with ease, making recoding an issue.
You also need to research the realities of some moves. For example, some people may feel that moving to MySQL has advantages in providing a larger platform support base than Microsoft SQL Server.
However, Microsoft SQL Server has run on Linux since SQL Server 2017, which makes platform independence less of an issue. The point is that choosing a target for your cloud-based DBMS will require time and an understanding of your organization's specific needs when making the move.
No matter what a vendor tries to tell you, you will have some downtime when migrating data of any kind from any source to any target. The amount of time varies, but some sort of downtime is guaranteed, so you must plan for it. The following list provides some common sources of downtime during a migration:
Performing the data transfer often means having all records locked, which means that users can’t make changes (although they can still potentially use the data for read-only purposes).
Data marshaling problems usually incur a time penalty as administrators, DBAs, developers, and DevOps all work together to discover solutions that will work.
Changing applications to use a new data source always incurs a time penalty. The changeover could result in major downtime when the change doesn’t work as expected.
Unexpected scripting issues can mean everything from data errors to reports that won’t work as expected. Repairs are usually time-consuming at best.
Modifications that work well in the lab suddenly don’t work in the production environment because the lab setup didn’t account for some real-world difference.
Users who somehow don’t get a required update end up using outdated data sources or applications that don’t work well with the new data source.
Schema conversions can work well enough to transfer the data, but they can change its content or context just enough to cause problems with the way in which applications interact with the data.
Consequently, full application testing when performing a heterogeneous move of any sort is a requirement that some organizations skip (and end up spending more time remediating than if they had done the proper testing in the first place).
Differences in the cloud environment add potential latency or other timing issues not experienced in the local network configuration.
An essential part of keeping downtime to a minimum, despite these many sources of problems, is to be sure to use real-world data for testing in a lab environment that duplicates your production environment as closely as possible.
This means that you need to address even the small issues, such as ensuring that the lab systems rely on the same hardware and use the same configuration as your production environment.
You also need to perform real-world testing that relies on users who will actually use the application when it becomes part of the production environment. If you don’t perform real-world testing under the strictest possible conditions, the amount of downtime you experience will increase significantly.
Not only does an optimistic lab setup produce unrealistic expectations, but it also creates a domino effect in which changes, procedures, and policies that would work with proper testing don’t work because they aren’t properly tested and verified in the lab.
You must also use as many tools as you can to make the move simpler. The “Understanding the AWS Schema Conversion Tool” section, later in this blog, discusses the use of this tool to make moves between heterogeneous databases easier.
However, a great many other tools are on the market, so you may find one that works better for your particular situation.
Organizations can end up with data in a number of different DBMSs because of mergers and inefficiencies within the organization itself. A workgroup database may eventually see use at the organization level, so some of these DBMS scenarios also occur as a result of growth.
Whatever the source of the multitude of DBMSs, consolidating the data into a single DBMS (and sometimes a single database) can result in significant improvement in organizational efficiency.
However, when planning the consolidation, view it as multiple homogenous or heterogeneous moves rather than a single big move. Each move will require special considerations, so each move is unique. All you’re really doing is moving multiple sources to the same target.
A potential issue with data consolidation occurs when multiple source databases have similar data. When you consolidate the data, not only do you have to consider marshaling the data from the source schema to the destination schema, but you must also consider the effects of combining the data into a coherent whole.
This means considering what to do with missing, errant, outdated, or conflicting data. One database can quite possibly have data that doesn’t match a similar entry in another database. Consequently, test runs that combine the data and then look for potential data issues are an essential part of making a consolidation work.
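A test run of this sort can start out very simply. The following Python sketch assumes that you have already exported matching records from each source database into dictionaries keyed by a shared record ID; the function name find_conflicts and the sample data are hypothetical. It reports only fields whose values disagree between sources, so a field that is missing from one source isn't flagged here and would need a separate check.

```python
from collections import defaultdict


def find_conflicts(sources):
    """Return, per record key, the fields whose values differ across sources.

    `sources` maps a source database name to {record_key: {field: value}}.
    Field values must be hashable (strings, numbers, and so on).
    """
    seen = defaultdict(lambda: defaultdict(set))
    for records in sources.values():
        for key, fields in records.items():
            for field, value in fields.items():
                seen[key][field].add(value)

    conflicts = {}
    for key, fields in seen.items():
        differing = {f: vals for f, vals in fields.items() if len(vals) > 1}
        if differing:
            conflicts[key] = differing
    return conflicts


if __name__ == "__main__":
    # Hypothetical sample data from two departmental databases.
    sample = {
        "sales_db": {"C001": {"name": "Ann Lee", "email": "ann@sales.example"}},
        "support_db": {"C001": {"name": "Ann Lee", "email": "ann@support.example"}},
    }
    print(find_conflicts(sample))
```

Someone still has to decide which value wins for each conflict, but at least the decision happens in the lab and not in production.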
One of the ways in which you can use the AWS DMS is to replicate data. Data replication to a cloud source has a number of uses, which include:
Providing continuous backup
Acting as a data archive
Performing the role of the main data storage while the local database acts as a cache
Creating an online data source for users who rely on mobile applications
Developing a shareable data source for partners
When used in this way, the AWS DMS sits between the source database and one or more target databases. You can use a local, networked, or cloud database as the source.
Normally, the target resides in the cloud. Theoretically, you can create a heterogeneous replication, but homogenous replications are far more reliable because you don’t need to worry about constantly marshaling the data between different source and target DBMSs.
Moving Data between Homogenous Databases
Moving data between homogeneous databases (those of precisely the same type) is the easiest kind of move because you have a lot less to worry about than when performing a heterogeneous move (described in the “Moving Data between Heterogeneous Databases” section, later in this blog).
For example, because both databases are the same, you don’t need to consider the need to marshal (convert from one type to another) data between data types. In addition, the databases will have access to similar features, and you don’t necessarily need to consider issues such as database storage limitations.
The definition for homogenous can differ based on what you expect in the way of functionality. For the purposes of this blog, a homogenous data move refers to moving data between copies of precisely the same DBMS.
A move between copies of SQL Server 2016 is homogenous, but moving between SQL Server and Oracle isn’t, even though both DBMSs support relational functionality. Even a move between SQL Server 2016 and SQL Server 2014 could present problems because the two versions have differing functionality.
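The definition used in this blog is strict enough to write down as a rule. The following Python sketch is only an illustration of that rule; the Dbms type and the classify_move function are names invented for this example, and the engine and version strings are whatever labels you choose to use in your own inventory.

```python
from typing import NamedTuple


class Dbms(NamedTuple):
    engine: str   # for example "sqlserver" or "oracle"
    version: str  # for example "2016"


def classify_move(source: Dbms, target: Dbms) -> str:
    """Classify a move using the strict definition from this blog."""
    if source.engine.lower() != target.engine.lower():
        return "heterogeneous"
    if source.version != target.version:
        # Same product, but differing functionality between versions.
        return "same engine, different version"
    return "homogeneous"


if __name__ == "__main__":
    print(classify_move(Dbms("sqlserver", "2016"), Dbms("sqlserver", "2016")))
    print(classify_move(Dbms("sqlserver", "2016"), Dbms("oracle", "12c")))
    print(classify_move(Dbms("sqlserver", "2016"), Dbms("sqlserver", "2014")))
```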
Trying a homogenous move before you attempt a heterogeneous move is important because the homogenous move presents an opportunity to separate database issues from movement issues.
The following sections help you focus on the mechanics of a move that doesn’t involve any database issues. You can use these sections to build your knowledge of how moves are supposed to work and to ensure that you fully understand how moves work within AWS.
Obtaining access to a source and target database
This example relies on a second database that you create using the same procedure found in the “Accessing the RDS Management Console” section. The name of the target database is MyTarget.
The target database won’t be immediately available when you first create it. You must wait until the Status field of the Instances page of the RDS Management Console shows an Available indicator before you can access the database. If you try to test the database connection before then, the connection will fail.
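If you script your setup rather than watch the console, the same wait can be written as a polling loop. The following Python sketch is deliberately generic: it takes any function that returns the current status as text, so it runs without an AWS account. The function name wait_until_available and its default timings are assumptions made for this example. With boto3, one plausible way to supply the status would be to read the DBInstanceStatus value returned by the RDS describe_db_instances call, but that wiring isn't shown here.

```python
import time
from typing import Callable


def wait_until_available(get_status: Callable[[], str],
                         timeout_s: float = 1200.0,
                         poll_s: float = 30.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_status() until it reports "available" or the timeout expires.

    Returns True when the database became available, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status().lower() == "available":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status lookup.
    statuses = iter(["creating", "backing-up", "available"])
    ready = wait_until_available(lambda: next(statuses), sleep=lambda s: None)
    print("Ready to test the connection:", ready)
```

Passing the sleep function as a parameter is a small design choice that lets you test the loop instantly, as the demonstration at the bottom does.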
Defining the move
Unlike many of the other tasks that you perform with AWS, performing a data migration is a task, rather than an object creation.
As a result, you create one or more tasks that define what you want AWS DMS to do, rather than configure a virtual server (as with EC2) or a new database (as with RDS). The following steps are an overview of the series of steps that you might take in creating a migration task:
1. Sign into AWS using your administrator account.
2. Navigate to the DMS Management Console at https://console.aws.amazon.com/dms.
You see a Welcome page that contains interesting information about DMS and what it can do for you. Notice that the Navigation pane contains options for creating new tasks and configuring endpoints.
This console also lets you define replication instances. In the lower-right corner, you find information about the AWS Schema Conversion Tool, an application that you download to your local system rather than use online.
Moving the data
To move data, you must create a migration task. The following steps describe how to create a task that will migrate data from the source test database to the target test database:
1. Click Create Migration.
A Welcome page appears that tells you about the process for migrating a database. This page also specifies the steps you need to perform in the Navigation pane and provides a link for downloading the AWS Schema Conversion Tool.
2. Click Next.
The wizard displays the Create Replication Instance page. This page helps you define all the requirements for performing the migration task.
3. Type MoveMySQLData in the Name field.
Be sure to name your task something descriptive. You may end up using the replication task more than once, and trying to remember what the task is for is hard if you don’t use a descriptive name.
4. (Optional) Type a detailed description of the task’s purpose in the Description field.
5. Choose the dms.t2.micro option from the Instance Class field.
The move relies on your EC2 instance. To get free-tier EC2 support, you need to use the dms.t2.micro option.
However, consider the cost of using the service. All incoming data is free. You can also transfer data between Amazon RDS and Amazon EC2 Instances in the same Availability Zone free. Any other transfers will cost the amount described at https://aws.amazon.com/dms/pricing/.
6. Click the down arrow next to the Advanced heading.
You see the advanced options for transferring the data.
7. Type 30 (or less) in the Allocated Storage GB field.
Remember that you get only 30GB of free EBS storage per month (see https://aws.amazon.com/free/), so experimenting with a larger storage amount will add to your costs.
8. Choose Default-Launch in the VPC Security Group(s) field.
Using this security group ensures that you have access to the migration as needed.
9. Click Next.
10. Fill out the individual fields for the source and target database.
11. Click Run Test under each of the test databases to ensure that you can connect to them.
The Run Test button doesn’t become available until after AWS completes an initial configuration and you completely fill in the required blanks. You want to test each connection individually to ensure that it actually works before proceeding.
12. Click Next.
You see the Create Task page. The fields on this page configure the task so that you can use it. You use the settings to determine how AWS performs the task, what sorts of data that AWS moves from one database to another, and precisely which tables AWS moves.
13. Type TestDataMove in the Task Name field.
14. Choose Migrate Existing Data in the Migration Type field.
The migration type determines how AWS performs the task. You have the following options when setting this field:
Migrate Existing Data: Copies all the data from the source database to the target database.
Migrate Existing Data and Replicate Ongoing Changes: Performs the initial data copy and then monitors the source database for changes. When AWS detects changes, it copies just the changes to the target database (saving resources and maintaining speed).
Replicate Data Changes Only: Assumes that the source and target databases are already in sync. AWS monitors the source database and copies only the changes to the target.
15. Select the Start Task on Create check box.
This option specifies that you want the migration to start immediately. If you deselect this check box, you need to start the task manually.
16. Click Create Task.
After a few moments, you see the Tasks page of the DMS Management Console. The task’s Status field contains Creating until the creation process is complete.
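The task settings from the preceding steps can also be expressed as data, which helps when you want to create the same task repeatedly. The following Python sketch builds a request dictionary whose keys mirror the keyword arguments of the boto3 DMS create_replication_task call, and it maps the three console choices to the MigrationType values that the DMS API uses. The function name build_task_request is invented for this example, the ARN strings are placeholders rather than real resources, and the single selection rule (include every table in every schema) is an assumption that you would normally narrow. The code doesn’t contact AWS.

```python
import json

# Console label -> MigrationType value used by the AWS DMS API.
MIGRATION_TYPES = {
    "Migrate Existing Data": "full-load",
    "Migrate Existing Data and Replicate Ongoing Changes": "full-load-and-cdc",
    "Replicate Data Changes Only": "cdc",
}


def build_task_request(task_name, console_choice,
                       source_arn, target_arn, instance_arn):
    """Build the settings for a migration task without calling AWS."""
    # Assumption: select every table in every schema ("%" is a wildcard).
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-tables",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": task_name,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": MIGRATION_TYPES[console_choice],
        "TableMappings": json.dumps(table_mappings),  # the API expects a JSON string
    }


if __name__ == "__main__":
    request = build_task_request(
        "TestDataMove", "Migrate Existing Data",
        "arn:placeholder:source", "arn:placeholder:target",
        "arn:placeholder:instance",
    )
    print(json.dumps(request, indent=2))
```

Keeping the request as a plain dictionary means that you can review it, store it in version control, and pass it to the API call only when you’re satisfied with it.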
Moving Data between Heterogeneous Databases
DBMSs come in many different forms because people expect them to perform a wide variety of tasks. The relational DBMS serves the interests of businesses because it provides organized data storage that ensures data integrity and moderately fast response times.
However, the relational database often doesn’t work well for data that isn’t easy to organize, such as large quantities of text, which means that you must use a text-based DBMS instead.
The popular NoSQL DBMSs provide a nontabular approach to working with big data and real-time applications that aren’t easy to model using relational strategies.
In short, the need for multiple DBMS types is well established and reasonable because each serves a different role. However, moving data between DBMSs of different types can become quite difficult.
The following sections can’t provide you with a detailed description of every move of this type (which would require an entire blog), but they do give you an overview of the AWS perspective of heterogeneous data moves.
Considering the essential database differences
Even if providing a detailed discussion of every possible difference between DBMSs were possible, considering the wealth of available DBMSs, the resulting text would be immense.
Fortunately, you don’t need to know the particulars of every DBMS; all you really need to think about are the types of differences you might encounter so that you’re better prepared to deal with them.
The following list presents essential database differences by type and in order of increasing complexity. As a difference becomes more complex to handle, the probability of successfully dealing with it becomes lower. In some cases, you must rely on compromises of various sorts to achieve success.
Features: Whenever a vendor introduces a new version of a DBMS product, the new version contains new features. Fortunately, many of these products provide methods for saving data in a previous version format, making the transition between versions easy.
This same concept holds for working with products (the data target) that can import specific versions of another product’s data (the data source). Exporting the data from the source DBMS in the required version makes data transfers to the target easier.
Functionality: One DBMS may offer the capability to store graphics directly in the database, while another can store only links to graphics data. The transition may entail exporting the graphic to another location and then providing that location as input to the new DBMS as a link.
Platform: Some platform differences can prove quite interesting to solve. For example, one platform may store paths and filenames in a manner in which case doesn’t matter, while another stores this same information in a case-sensitive way. The data exchange may require the use of some level of automation to ensure consistency of path and filename case.
Data types: Most data type issues are relatively easy to fix because software commonly provides methods to marshal (change) one data type to another.
However, you truly can’t convert a Binary Large Object (BLOB) type text field into a fixed-length text field of the sort used by relational databases, so you must create a custom conversion routine of some sort. In short, data type conversions can become tricky because you can change the context, meaning,
Automation: Adding automation, such as code stored in data fields, to a DBMS significantly increases the complexity of moving data from one DBMS to another. In many cases, you must choose to leave the automation behind when making the data move or represent it in some other way.
Data organization: Dealing with DBMSs of different types, such as moving data from a NoSQL database to a relational database, can involve some level of data loss because the organization of the data between the two DBMSs is so different.
Any conversion will result in data loss in this case. In addition, you may have to calculate some values, replace missing values, and perform other sorts of conversions to successfully move the data from one DBMS to another of a completely different organizational type.
Storage methodology: The reason that storage methodology can incur so many issues is that the mechanics of working with the data are now different. Having different storage technologies in play means that you must now consider all sorts of issues that you don’t ordinarily need to consider, such as whether the data requires encryption in the target database to ensure that the storage meets any legal requirements.
Given that cloud storage is inherently different from storage on a local drive, you always encounter this particular difference when moving your data to AWS, and you really need to think about all that a change in storage methodology entails.
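The platform difference in the preceding list is a good candidate for a small automated check. The following Python sketch assumes that you can export the stored paths from the source database as a list of strings; the function name find_case_collisions and the sample paths are hypothetical. It groups paths that differ only in letter case, which are the entries that would collide (or point to the wrong file) on a platform that ignores case.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Return groups of distinct paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        # casefold() is a stricter form of lower() meant for comparisons.
        groups[path.casefold()].append(path)
    return [sorted(set(group)) for group in groups.values()
            if len(set(group)) > 1]


if __name__ == "__main__":
    stored_paths = ["img/Logo.png", "img/logo.png", "docs/readme.txt"]
    for group in find_case_collisions(stored_paths):
        print("Differ only by case:", group)
```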
Understanding the AWS Schema Conversion Tool
The AWS Schema Conversion Tool makes marshaling data from a source database to a target database relatively easy. To follow this example, you must create a target PostgreSQL example database using the same technique that you used to create MyTarget. This odd requirement exists because the AWS Schema Conversion Tool supports specific source and target database combinations.
Fortunately, a free-tier version of PostgreSQL is available for development use, just as one is for MySQL. Name your target database MyTarget2. The following sections help you get started with the AWS Schema Conversion Tool.
Getting, installing, and configuring the AWS Schema Conversion Tool
You download this product to your system and install it. Amazon provides versions of this tool for Windows, Mac OS X, Fedora Linux (Red Hat Package Manager, RPM), and Ubuntu Linux (Debian). Even though the following steps show the Windows version of the product, versions for other platforms work in the same way.
1. Obtain and install a copy of the AWS Schema Conversion Tool for your platform.
You can find a list of the platforms at the bottom of the page at https://aws.amazon.com/dms/.
2. Start the application.
You see a Create New Database Migration Project dialog box.
3. Choose MySQL in the Source Database Engine field.
The appearance of the dialog box will change to reflect the needs of the particular database engine you use.
4. Type the endpoint information associated with your copy of MyDatabase without the port.
5. Type the port information for your copy of MyDatabase.
6. Provide the username and password for your copy of MyDatabase.
7. Provide the location of a MySQL driver.
Amazon isn’t very clear about where to obtain the driver. If you already have MySQL installed on your system, theoretically you also have the driver.
However, if you installed only MySQL Workbench to interact with the cloud-based version of your MySQL database, you won’t have the driver installed. You can obtain a copy of the driver needed for this example from http://www.mysql.com/downloads/connector/j/.
8. Click Test Connection. You see a success message if the connection is successful.
9. Click Next. The wizard shows you a list of tables and asks which one you’d like to analyze.
10. Select the first database entry and click Next.
You see a database migration report that tells you about the issues that you might encounter in migrating the database. The example database contains a single table with a couple of fields as a sample. It won’t have any migration issues.
11. Click Next.
You see a target database dialog box. Notice that this dialog box asks for the same information as the source dialog box does, including the location of the PostgreSQL driver. You can download this driver from http://jdbc.postgresql.org/.
12. Fill out the target database information using the same approach as you did for the source database.
Note that PostgreSQL uses 5432 as its default port. Make sure to enter the correct port number when filling out the form. The example assumes that you created a database named the first database as part of creating the database.
13. Click Test Connection to verify the connection to the database. You see a success message.
14. Click Finish.
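When a connection test in the preceding steps fails, it helps to know whether the problem is the network or the database credentials. The following Python sketch checks only whether a TCP connection to an endpoint and port can be opened, using nothing but the standard library. The function name can_reach is invented for this example, and the check is not a substitute for the tool’s own Test Connection button: a reachable port tells you nothing about usernames, passwords, or drivers.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True when a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

You would call it with your own endpoint, for example can_reach("your-endpoint-here", 5432) for the PostgreSQL target or port 3306 (the MySQL default) for the source. A result of False usually points to a security group, a firewall, or a mistyped endpoint rather than to the database itself.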