aws-glue-test-data-generator

AWS Glue Test Data Generator for S3 Data Lakes and DynamoDB

Test data generation plays a critical role in evaluating system performance, validating accuracy, identifying bugs, enhancing reliability, assessing scalability, ensuring regulatory compliance, training machine learning models, and supporting CI/CD processes. It helps uncover potential issues and confirm that systems operate as intended across diverse scenarios.

The AWS Glue Test Data Generator provides a configurable framework for test data generation using serverless AWS Glue PySpark jobs. The description of the required test data is fully configurable through a YAML configuration file.

Code Repository on GitHub

The source code and deployment instructions are accessible through this link: GitHub Code Repository

Supported data types

The Test Data Generation Framework currently supports the following types:

Solution Architecture

The Test Data Generator is based on a PySpark library that is invoked as a PySpark AWS Glue job. All generator settings are configured through a YAML file stored in the S3 artefacts bucket. Deployment to the AWS account is done using the AWS Cloud Development Kit (CDK):

  1. AWS CDK generates the CloudFormation template and deploys it in the hosting AWS account
  2. CloudFormation creates:

    1. The artefacts S3 bucket, and uploads the TDG PySpark library and YAML configuration file into it.

    2. The TDG PySpark Glue job.

    3. The service IAM role required by the TDG PySpark Glue job.

  3. The TDG PySpark Glue job is invoked to generate the test data.

Deployment

  1. Clone the GitHub repository in your local development environment

  2. Set the following environment variables:

AWS_ACCOUNT to the AWS account id where you intend to deploy the Test Data Generator

AWS_REGION to the AWS region id where you intend to deploy the Test Data Generator

  3. Use aws configure to configure the AWS CLI with the access key for the AWS account
  4. If the account is not CDK bootstrapped, run the following command:

cdk bootstrap

  5. Open a terminal in the workspace path and run the following CDK command to deploy the solution

$<workspace-path>/AWSGluePysparkTDG> cdk deploy
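Before running the CDK commands, it can help to confirm that the two environment variables from step 2 are set. The following is a minimal sketch of such a check; the helper name missing_deploy_vars is hypothetical and not part of the repository.

```python
import os

# Variables named in the deployment steps above.
REQUIRED_VARS = ("AWS_ACCOUNT", "AWS_REGION")


def missing_deploy_vars(env=None):
    """Return the required deployment variables that are unset or empty.

    `env` defaults to the current process environment; pass a dict to
    check a different mapping.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list means both variables are present and the deployment commands can be run.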

Configuration

Configuration File

The Test Data Generator is configured through the YAML file TDG_configuration_file.yml found in the artefacts bucket at the following path:

s3://tdg-artefacts-<account-id>/tgd_glue_job/Config/TDG_configuration_file.yml
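To locate the configuration file programmatically, the bucket name and object key can be derived from the account id using the path above. This is a small sketch; the function name tdg_config_location is hypothetical, and the key is spelled exactly as in the path shown.

```python
def tdg_config_location(account_id):
    """Return (bucket, key) for the TDG configuration file of an account.

    The layout mirrors the S3 path documented above; nothing else about
    the bucket is assumed.
    """
    bucket = f"tdg-artefacts-{account_id}"
    key = "tgd_glue_job/Config/TDG_configuration_file.yml"
    return bucket, key


def tdg_config_uri(account_id):
    """Return the full s3:// URI of the configuration file."""
    bucket, key = tdg_config_location(account_id)
    return f"s3://{bucket}/{key}"
```

The bucket and key pair can then be passed to whichever S3 client you use to download or upload the file.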

Configuration Parameters

number_of_generated_records

Number of desired generated records

attributes_list

Descriptor of the generated record fields/columns. You can configure the following data types:

ColumnName: Column name

Generator: key_generator

DataDescriptor:

Prefix: (optional) prefix to the key generated values

LeadingZeros: (optional) number of digits used to format the key values. Key values are padded with leading zeros to produce a fixed number of digits
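The effect of Prefix and LeadingZeros can be illustrated in plain Python. This is only a sketch of the formatting, not the generator's implementation: it assumes keys are sequential integers starting at 1, which the documentation does not state, and the function name generate_keys is hypothetical.

```python
def generate_keys(count, prefix="", leading_zeros=0, start=1):
    """Build `count` key strings such as CUST000001, CUST000002, ...

    Assumption: keys are sequential integers beginning at `start`.
    `leading_zeros` is the total number of digits; shorter numbers are
    left-padded with zeros, longer ones are left unchanged.
    """
    return [
        f"{prefix}{str(number).zfill(leading_zeros)}"
        for number in range(start, start + count)
    ]
```

For instance, generate_keys(3, prefix="CUST", leading_zeros=6) yields CUST000001, CUST000002 and CUST000003.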

ColumnName: Column name

Generator: child_key_generator

DataDescriptor:

Prefix: prefix should match the parent key prefix

LeadingZeros: should match the parent key LeadingZeros

ChildCountPerSublevel: a list of the number of nodes per hierarchy sub-level. For example, the following list describes three levels of hierarchy, where level 1 has 10 nodes, level 2 has 100 nodes, and level 3 has 1000 nodes.

   - 10
   - 100
   - 1000

1. Strings from a List

ColumnName: Column name

Generator: string_generator

DataDescriptor:

Values: a list of string values.

2. Strings from a Pattern

ColumnName: Column name

Generator: string_generator

DataDescriptor:

Pattern: a pattern of expressions separated by #. Available expressions:

  1. Constant strings: any constant string, such as: Contact Details, @, Title:, etc.
  2. Random Numbers: ^N, for example to specify 8 digits: ^N8
  3. Random Alphabetic Strings: ^A, for example to specify a random string of 10 characters: ^A10
  4. Random Alphanumeric Strings: ^X, for example to specify a random alphanumeric string of 5 characters: ^X5

For example, the following pattern

Contact Details: Email: #^X8#__#^N2#@#^A4#.#^A3# Phone: #^N8

will result in the following sample values:

Contact Details: Email: dTJeG0vO__65@rAeF.Dsh Phone: 9643728

Contact Details: Email: H8bmzlVP__8@KlVQ.Swc Phone: 84716259

Contact Details: Email: FAoNEfDV__6@HAYI.Jkp Phone: 4651938
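The pattern syntax can be approximated in plain Python as follows. This is an illustrative sketch, not the generator's code: the function name expand_pattern is hypothetical, and it always emits exactly N characters per expression, whereas some of the sample values above show shorter numbers, so the real generator may treat ^N differently.

```python
import random
import re
import string

# Character sets for each expression type described above.
_ALPHABETS = {
    "N": string.digits,
    "A": string.ascii_letters,
    "X": string.ascii_letters + string.digits,
}
_EXPRESSION = re.compile(r"\^([NAX])(\d+)")


def expand_pattern(pattern, rng=random):
    """Expand a #-separated pattern into one random string.

    Segments of the form ^N<len>, ^A<len> or ^X<len> become random
    characters; every other segment is copied through as a constant.
    """
    parts = []
    for segment in pattern.split("#"):
        match = _EXPRESSION.fullmatch(segment)
        if match:
            alphabet = _ALPHABETS[match.group(1)]
            length = int(match.group(2))
            parts.append("".join(rng.choice(alphabet) for _ in range(length)))
        else:
            parts.append(segment)
    return "".join(parts)
```

Calling expand_pattern with the pattern above returns a string with the same shape as the samples: constant text interleaved with random alphanumeric, numeric and alphabetic runs.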

3. Random Strings

ColumnName: Column name

Generator: string_generator

DataDescriptor:

Random: 'True'

NumChar: length of generated alphanumeric strings

ColumnName: Column name

Generator: integer_generator

DataDescriptor:

Range: lower value, upper value
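As an illustration of how a Range value might be interpreted, the sketch below parses the "lower value, upper value" text and draws integers from it. It assumes both bounds are inclusive, which the documentation does not specify, and the function names are hypothetical.

```python
import random


def parse_range(text):
    """Parse 'lower, upper' into a tuple of two integers."""
    lower, upper = (int(part) for part in text.split(","))
    if lower > upper:
        raise ValueError(f"lower bound {lower} exceeds upper bound {upper}")
    return lower, upper


def random_integers(range_text, count, rng=random):
    """Draw `count` integers from the range (assumed inclusive)."""
    lower, upper = parse_range(range_text)
    return [rng.randint(lower, upper) for _ in range(count)]
```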

ColumnName: Column name

Generator: float_generator

DataDescriptor:

Expression: SQL expression such as: rand(42) * 3000

ColumnName: Column name

Generator: date_generator

DataDescriptor:

StartDate: start date of the date range in the format DD/MM/YYYY

EndDate: end date of the date range in the format DD/MM/YYYY
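The date range semantics can be sketched with the standard library. This is an approximation under stated assumptions: dates are drawn uniformly and both StartDate and EndDate are treated as inclusive, neither of which is confirmed by the documentation. The function name random_date is hypothetical.

```python
import random
from datetime import datetime, timedelta

# DD/MM/YYYY, as required by StartDate and EndDate.
DATE_FORMAT = "%d/%m/%Y"


def random_date(start_date, end_date, rng=random):
    """Return a random date between two DD/MM/YYYY strings (inclusive)."""
    start = datetime.strptime(start_date, DATE_FORMAT).date()
    end = datetime.strptime(end_date, DATE_FORMAT).date()
    if end < start:
        raise ValueError("EndDate must not be earlier than StartDate")
    offset_days = rng.randint(0, (end - start).days)
    return start + timedelta(days=offset_days)
```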

ColumnName: Column name

Generator: close_date_generator

DataDescriptor:

StartDateColumnName: column name of the generated open date

CloseDateRangeInDays: maximum span from the open date in days
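The relationship between the open date column and the generated close date can be sketched as below. It assumes the close date falls on or after the open date, with a uniformly random offset of at most CloseDateRangeInDays days; the function name close_date is hypothetical.

```python
import random
from datetime import date, timedelta


def close_date(open_date, close_date_range_in_days, rng=random):
    """Return a close date within the allowed span after `open_date`.

    Assumption: the offset is between 0 and the configured maximum,
    both inclusive.
    """
    offset_days = rng.randint(0, close_date_range_in_days)
    return open_date + timedelta(days=offset_days)
```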

ColumnName: Column name

Generator: ip_address_generator

DataDescriptor:

IpRanges: list of ranges for the four numeric parts of the IP address, each in the form lower value, upper value. For example:

 - 9,10
 - 1,254
 - 1,128
 - 2,20
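The IpRanges list above can be read as one range per part of the address. The following sketch builds an address from such a list, assuming each range is inclusive and limited to 0 through 255; the function name random_ip is hypothetical.

```python
import random


def random_ip(ip_ranges, rng=random):
    """Build a dotted IPv4 address from four 'lower,upper' range strings."""
    if len(ip_ranges) != 4:
        raise ValueError("exactly four ranges are required, one per part")
    parts = []
    for text in ip_ranges:
        lower, upper = (int(value) for value in text.split(","))
        if not 0 <= lower <= upper <= 255:
            raise ValueError(f"invalid range: {text!r}")
        parts.append(str(rng.randint(lower, upper)))
    return ".".join(parts)
```

With the example ranges, random_ip(["9,10", "1,254", "1,128", "2,20"]) returns an address whose first part is 9 or 10 and whose last part lies between 2 and 20.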

target_list

The list of targets for the generator. The generator performs automatic data type conversion for every specified target. Currently, the generator supports the following targets:

Invocation

From the AWS Glue Console:

  1. Navigate to Data Integration and ETL > AWS Glue Studio > Jobs
  2. Select the TestDataGeneratorJob job and press Run Job
  3. Once the job completes successfully, check for the generated data in the configured targets.
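As an alternative to the console, the job could also be started from a script. The sketch below takes a Glue client (for example one created with boto3.client("glue")) and uses its start_job_run and get_job_run calls. The helper name run_tdg_job and the polling interval are assumptions, and the job name is the one shown in the console steps above.

```python
import time

# Glue job run states that mean the run is still in progress.
_IN_PROGRESS = ("STARTING", "WAITING", "RUNNING", "STOPPING")


def run_tdg_job(glue_client, job_name="TestDataGeneratorJob", poll_seconds=30):
    """Start the Glue job and wait until it leaves the in-progress states.

    Returns (run_id, final_state), e.g. ("jr_...", "SUCCEEDED").
    """
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state not in _IN_PROGRESS:
            return run_id, state
        time.sleep(poll_seconds)
```

Passing the client in as a parameter keeps the function free of credentials handling and makes it easy to exercise with a stub client.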

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Contributors