Bulk loading data into PGD clusters
This guidance is specifically for environments where there's no direct access to the PGD nodes, only PGD Proxy endpoints, such as Cloud Service's distributed high availability deployments of PGD.
Without care, bulk loading data into a PGD cluster can place a heavy replication load on the cluster. With that in mind, this content describes a process to mitigate that replication load.
Provision or prepare a PGD cluster
You must provision a PGD cluster, either manually, using TPA, or on Cloud Service. This will be the target database for the migration. Ensure that you provision it with sufficient storage capacity to hold the migrated data.
We recommend that, when provisioning or, if needed, after provisioning, you set the following Postgres GUC variables.
| GUC variable | Setting |
|---|---|
| `maintenance_work_mem` | 1GB |
| `wal_sender_timeout` | 60min |
| `wal_receiver_timeout` | 60min |
| `max_wal_size` | Set to either a multiple (2 or 3) of your largest table, or more than one third of the capacity of your dedicated WAL disk (if configured) |
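If you have SQL access as a sufficiently privileged user, one way to apply these settings is with `ALTER SYSTEM` (a sketch; on Cloud Service deployments you may instead need to change these through the cluster's configuration settings, and the `max_wal_size` value shown is an example only):

```sql
-- Suggested starting values from the table above.
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET wal_sender_timeout = '60min';
ALTER SYSTEM SET wal_receiver_timeout = '60min';
-- Example only: size this to your largest table or WAL disk, as described above.
ALTER SYSTEM SET max_wal_size = '100GB';
-- These settings take effect on configuration reload.
SELECT pg_reload_conf();
```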
Make note of the target's proxy hostname (target-proxy) and port (target-port). You also need a user (target-user) and password (target-password) for the target cluster.
The following instructions give examples for a cluster named `ab-cluster` with an `ab-group` subgroup and three nodes: `ab-node-1`, `ab-node-2`, and `ab-node-3`. The cluster is accessed through a host named `ab-proxy` (the target-proxy).
On Cloud Service, a cluster is configured, by default, with an `edb_admin` user (the target-user) that can be used for the bulk upload. The target-password for the target-user is available from the Cloud Service dashboard for the cluster. A database named `bdrdb` (the target-dbname) was also created.
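With those details, a connection to the target cluster through the proxy looks like this (a sketch; the port depends on your deployment, and `5432` here is an assumption):

```shell
psql -h ab-proxy -p 5432 -U edb_admin -d bdrdb
```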
Identify your data source
You need the source hostname (source-host), port (source-port), database name (source-dbname), user, and password for your source database.
You also need a list of the tables in the source database that you want to migrate to the target database.
Prepare a bastion server
Create a virtual machine with your preferred operating system in the cloud to orchestrate your bulk loading.
- Use your EDB account.
- Obtain your EDB repository token from the EDB Repos 2.0 page.
- Set environment variables.
  - Set the `EDB_SUBSCRIPTION_TOKEN` environment variable to the repository token.
- Configure the repositories.
- Run the automated installer to install the repositories.
- Install the required software.
- Install and configure:
- psql
- PGD CLI
- Migration Toolkit
- LiveCompare
Use your EDB account
Go to the EDB Repos 2.0 page and log in with your EDB account. Make a note of the repository token that you will use to configure the repositories on the bastion server.
Set environment variables
Set the `EDB_SUBSCRIPTION_TOKEN` environment variable to the repository token you obtained from the EDB Repos 2.0 page.
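For example, in a bash shell:

```shell
# Replace the placeholder with your actual token from the EDB Repos 2.0 page.
export EDB_SUBSCRIPTION_TOKEN="your-repository-token"
```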
Configure repositories
The required software is available from the EDB repositories. You need to install the EDB repositories on your bastion server.
- Red Hat
- Ubuntu/Debian
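As a sketch, the EDB Repos 2.0 automated installers follow this pattern (this assumes the `postgres_distributed` repository and requires `EDB_SUBSCRIPTION_TOKEN` to be set; check the EDB Repos 2.0 page for the exact command for your platform):

```shell
# Debian/Ubuntu
curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/postgres_distributed/setup.deb.sh" | sudo -E bash

# Red Hat
curl -1sLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/postgres_distributed/setup.rpm.sh" | sudo -E bash
```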
Install the required software
Once the repositories are configured, you can install the required software.
Installing `psql` and `pg_dump`/`pg_restore`/`pg_dumpall`
The psql command is the interactive terminal for working with PostgreSQL. It's a client application and can be installed on any operating system. Packaged with psql are pg_dump and pg_restore, command-line utilities for dumping and restoring PostgreSQL databases.
- Ubuntu
- Red Hat
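For example, the client tools can typically be installed from the configured repositories like this (the package names are assumptions and vary by distribution and Postgres version; check your repository for the exact names):

```shell
# Debian/Ubuntu
sudo apt-get install -y postgresql-client

# Red Hat
sudo dnf install -y postgresql
```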
To simplify logging in to the databases, create a .pgpass file for both your source and target servers:
Create the file in your home directory and change its permissions to read/write only for the owner.
Ensure that your passwords are appropriately escaped in the `.pgpass` file. If an entry needs to contain `:` or `\`, escape that character with `\`.
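A sketch of creating the file, using placeholder hostnames, ports, and credentials that you should replace with your own:

```shell
# Create ~/.pgpass with one entry each for the source and target servers.
# Format: hostname:port:database:username:password
cat > "$HOME/.pgpass" <<'EOF'
source-host:5432:source-dbname:source-user:source-password
ab-proxy:5432:bdrdb:edb_admin:target-password
EOF
# psql ignores the file unless it's readable and writable only by the owner.
chmod 0600 "$HOME/.pgpass"
```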
Installing PGD CLI
PGD CLI is a command-line interface for managing and monitoring PGD clusters. It's a Go application and can be installed on any operating system.
- Ubuntu
- Red Hat
Create a configuration file for the PGD CLI:
For the example `ab-cluster`, save it as `pgd-cli-config.yml`.
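A minimal sketch of the file, assuming the proxy endpoint and `edb_admin` user described earlier (adjust the cluster name and connection string to match your deployment):

```yaml
# pgd-cli-config.yml
cluster:
  name: ab-cluster
  endpoints:
    - host=ab-proxy port=5432 dbname=bdrdb user=edb_admin
```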
See also Installing PGD CLI.
Installing Migration Toolkit
EDB's Migration Toolkit (MTK) is a command-line tool you can use to migrate data from a source database to a target database. It's a Java application and requires a Java runtime environment to be installed.
- Ubuntu
- Red Hat
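With the repositories configured, the toolkit can typically be installed like this (the package name `edb-migrationtoolkit` is an assumption; verify it for your platform, and note that a Java runtime is also required):

```shell
# Debian/Ubuntu
sudo apt-get install -y edb-migrationtoolkit

# Red Hat
sudo dnf install -y edb-migrationtoolkit
```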
See also Installing Migration Toolkit.
Installing LiveCompare
EDB LiveCompare is an application you can use to compare two databases and generate a report of the differences. You'll use it later in this process to verify the data migration.
- Ubuntu
- Red Hat
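Similarly, LiveCompare can typically be installed from the configured repositories like this (the package name `edb-livecompare` is an assumption; verify it for your platform):

```shell
# Debian/Ubuntu
sudo apt-get install -y edb-livecompare

# Red Hat
sudo dnf install -y edb-livecompare
```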
See also LiveCompare requirements.
Set up and tune the target cluster
On the target cluster and within the regional group required, select one node to be the destination for the data.
If you have a group `ab-group` with `ab-node-1`, `ab-node-2`, and `ab-node-3`, you can select `ab-node-1` as the destination node.
Set up a fence
Fence off all other nodes except for the destination node.
Connect to any node on the destination group using the psql command.
Use `bdr.alter_node_option` to set the `route_fence` option to `true` for each node in the group apart from the destination node:
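For the example cluster, with `ab-node-1` as the destination, the fencing calls would look like this (a sketch; note that the option value is passed as a string):

```sql
-- Fence ab-node-2 and ab-node-3 so the proxy routes only to ab-node-1.
SELECT bdr.alter_node_option('ab-node-2', 'route_fence', 'true');
SELECT bdr.alter_node_option('ab-node-3', 'route_fence', 'true');
```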
The next time you connect with psql, you're directed to the write leader, which should be the destination node. To ensure that it is, you need to send two more commands.
Make the destination node both write and raft leader
To minimize the possibility of disconnections, move the raft and write leader roles to the destination node.
Make the destination node the raft leader using `bdr.raft_leadership_transfer`. You need to specify the node and the name of the group that the node is a member of:
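For the example cluster, the call would look like this (a sketch; the argument order shown — node name, wait-for-completion flag, group name — should be verified against the function's signature for your PGD version):

```sql
-- Transfer raft leadership to the destination node, waiting for completion.
SELECT bdr.raft_leadership_transfer('ab-node-1', true, 'ab-group');
```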
Because you fenced off the other nodes in the group, this command triggers a write leader election in `ab-group` that elects `ab-node-1` as write leader.
Record then clear default commit scopes
You need to make a record of the default commit scopes in the cluster. The next step overwrites the settings. (At the end of this process, you need to restore them.) Run:
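One way to record the current defaults is to query the `bdr.node_group_summary` catalog view (a sketch; save the output so you can restore the values at the end of the process):

```sql
-- Record the default commit scope of each node group before overwriting it.
SELECT node_group_name, default_commit_scope
FROM bdr.node_group_summary;
```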