Test data generation plays a critical role in evaluating system performance, validating accuracy, identifying bugs, enhancing reliability, assessing scalability, ensuring regulatory compliance, training machine learning models, and supporting CI/CD processes. It helps uncover potential issues and confirms that systems operate as intended across diverse scenarios.
The AWS Glue Test Data Generator provides a configurable framework for test data generation using AWS Glue PySpark serverless jobs. The description of the required test data is fully configurable through a YAML configuration file.
The source code and deployment instructions are accessible through this link: GitHub Code Repository
The Test Data Generation Framework currently supports the following types:
Unique Key Generator
This generator produces formatted unique values that can be used as a partition key. You can specify a prefix and the number of leading zeros if required.
Child Key Generator
This generator produces a child key referencing the primary key. This is useful for generating multi-level hierarchical data. You can specify the number of levels and how many nodes to generate per level.
String Data Generator
This generator produces string data through several mechanisms:
Random Strings: you can specify the number of characters and the type of generated characters: numeric, alphabetic or alphanumeric. This can be used for generating random serial numbers, ordinal data, codes, identity numbers, etc.
Strings from a Dictionary: you can provide a dictionary of words for the generator to pick from at random. This can be used to generate categorical columns with a predefined set of values, such as order status, product type, marital status, gender, etc.
Strings from a Pattern: you can provide a generic pattern for your string data. This can be used to generate fake emails, formatted phone numbers, comments, address-like data, etc.
Integer Data Generator
This generator produces random integer data from a specified range.
Float/Double Data Generator
This generator produces random float/double data from an expression. This can be used to generate float values such as salary, temperature, profit, statistical data, etc.
Internet Address Data Generator
This generator produces random IP addresses. This can be used to generate IP address ranges for testing applications used for internet traffic monitoring or filtering.
Date Data Generator
This generator produces random dates from a configurable date range.
Close Date Data Generator
This generator produces random dates from a configurable start date column and a range. This can be used to generate dates that follow another date within a given interval, such as a support ticket close date, deceased date, expiration date, etc.
The Test Data Generator is based on a PySpark library that is invoked as a PySpark AWS Glue job. All generator settings are configured through a YAML file stored in the S3 artefact bucket. Deployment to an AWS account is done with the AWS Cloud Development Kit (CDK).
CloudFormation creates:
The artefacts S3 Bucket and uploads the TDG PySpark library and YAML configuration file into it.
The TDG PySpark Glue job
The service IAM role required by the TDG PySpark Glue job.
Clone the GitHub repository in your local development environment
Set the following environment variables:
AWS_ACCOUNT: the AWS account ID where you intend to deploy the Test Data Generator
AWS_REGION: the AWS region ID where you intend to deploy the Test Data Generator
cdk bootstrap
$<workspace-path>/AWSGluePysparkTDG> cdk deploy
The Test Data Generator is configured through the YAML file TDG_configuration_file.yml, found in the artefacts bucket at the following path:
s3://tdg-artefacts-<account-id>/tgd_glue_job/Config/TDG_configuration_file.yml
Number of desired generated records
Descriptor of the generated record fields/columns. You can configure the following data types:
ColumnName: Column name
Generator: key_generator
DataDescriptor:
Prefix: (optional) prefix to the key generated values
LeadingZeros: (optional) number of digits used to format the key values. Key values are padded with leading zeros to produce a fixed number of digits
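To illustrate how Prefix and LeadingZeros shape a key, here is a minimal Python sketch. It is not the framework's implementation: the function name `format_key` and the use of a sequential index as the unique value are assumptions made for the example.

```python
def format_key(index: int, prefix: str = "", leading_zeros: int = 0) -> str:
    """Format a sequential index as a key.

    Assumption: LeadingZeros is the total digit width, and shorter
    numbers are left-padded with zeros (hypothetical helper).
    """
    return f"{prefix}{str(index).zfill(leading_zeros)}"


# Prefix "CUST" with keys padded to 6 digits.
keys = [format_key(i, prefix="CUST", leading_zeros=6) for i in range(1, 4)]
print(keys)  # ['CUST000001', 'CUST000002', 'CUST000003']
```

With both options omitted, the key is just the bare number.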
ColumnName: Column name
Generator: child_key_generator
DataDescriptor:
Prefix: prefix should match the parent key prefix
LeadingZeros: should match the parent key LeadingZeros
ChildCountPerSublevel: a list with the number of nodes for each hierarchy sub-level. For example, the following list describes three levels of hierarchy, where level 1 has 10 nodes, level 2 has 100 nodes and level 3 has 1000 nodes.
- 10
- 100
- 1000
1. Strings from a Dictionary
ColumnName: Column name
Generator: string_generator
DataDescriptor:
Values: a list of string values.
2. Strings from a Pattern
ColumnName: Column name
Generator: string_generator
DataDescriptor:
Pattern: a pattern of expressions separated by #. Available expressions:
- Constant strings: any constant string, such as: Contact Details, @, Title:, etc.
- Random Numbers: ^N, for example to specify 8 digits: ^N8
- Random Alphabetic Strings: ^A, for example to specify a random string of 10 characters: ^A10
- Random Alphanumeric Strings: ^X, for example to specify a random alphanumeric string of 5 characters: ^X5
For example, the following pattern
Contact Details: Email: #^X8#__#^N2#@#^A4#.#^A3# Phone: #^N8
will result in the following sample values:
Contact Details: Email: dTJeG0vO__65@rAeF.Dsh Phone: 9643728
Contact Details: Email: H8bmzlVP__8@KlVQ.Swc Phone: 84716259
Contact Details: Email: FAoNEfDV__6@HAYI.Jkp Phone: 4651938
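The pattern syntax can be sketched in plain Python as follows. This is an illustrative reading of the rules above, not the generator's own code: the names `expand_pattern`, `ALPHABETS` and `TOKEN` are hypothetical. The sample values above contain numeric parts shorter than the requested length, so the real generator may drop leading zeros; this sketch always emits exactly the requested number of characters.

```python
import random
import re
import string

# Hypothetical token table: marker letter -> characters to draw from.
ALPHABETS = {
    "N": string.digits,
    "A": string.ascii_letters,
    "X": string.ascii_letters + string.digits,
}
# A token is ^N, ^A or ^X followed by a length, e.g. ^X8.
TOKEN = re.compile(r"\^([NAX])(\d+)")


def expand_pattern(pattern: str, rng: random.Random) -> str:
    """Expand a '#'-separated pattern into one random string."""
    parts = []
    for piece in pattern.split("#"):
        match = TOKEN.fullmatch(piece)
        if match is None:
            parts.append(piece)  # constant string, copied as-is
        else:
            alphabet = ALPHABETS[match.group(1)]
            length = int(match.group(2))
            parts.append("".join(rng.choice(alphabet) for _ in range(length)))
    return "".join(parts)


rng = random.Random(42)  # seeded only to make the example repeatable
sample = expand_pattern(
    "Contact Details: Email: #^X8#__#^N2#@#^A4#.#^A3# Phone: #^N8", rng
)
print(sample)
```

Splitting on # first keeps constant strings and random tokens cleanly separated, which is why constants such as @ or . need no escaping in this sketch.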
3. Random Strings
ColumnName: Column name
Generator: string_generator
DataDescriptor:
Random: 'True'
NumChar: length of generated alphanumeric strings
ColumnName: Column name
Generator: integer_generator
DataDescriptor:
Range: lower value, upper value
ColumnName: Column name
Generator: float_generator
DataDescriptor:
Expression: SQL expression such as: rand(42) * 3000
ColumnName: Column name
Generator: date_generator
DataDescriptor:
StartDate: start date of the date range in the format DD/MM/YYYY
EndDate: end date of the date range in the format DD/MM/YYYY
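As a rough illustration of what a date range in DD/MM/YYYY format means, the following Python sketch draws random dates between two bounds. It is not the framework's code: the function name `random_date` is hypothetical, and treating EndDate as inclusive is an assumption.

```python
import random
from datetime import date, datetime, timedelta

DATE_FORMAT = "%d/%m/%Y"  # DD/MM/YYYY, as used by StartDate and EndDate


def random_date(start: str, end: str, rng: random.Random) -> date:
    """Pick a random date between start and end (both inclusive here)."""
    start_date = datetime.strptime(start, DATE_FORMAT).date()
    end_date = datetime.strptime(end, DATE_FORMAT).date()
    span_days = (end_date - start_date).days
    return start_date + timedelta(days=rng.randint(0, span_days))


rng = random.Random(7)  # seeded only to make the example repeatable
dates = [random_date("01/01/2023", "31/12/2023", rng) for _ in range(5)]
print([d.strftime(DATE_FORMAT) for d in dates])
```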
ColumnName: Column name
Generator: close_date_generator
DataDescriptor:
StartDateColumnName: column name of the generated open date
CloseDateRangeInDays: maximum span from the open date in days
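The relationship between the open date and the close date can be sketched in Python as follows. This is only an interpretation of the two settings above: the function name `random_close_date` is hypothetical, and allowing a close date equal to the open date (an offset of zero days) is an assumption.

```python
import random
from datetime import date, timedelta


def random_close_date(
    open_date: date, close_date_range_in_days: int, rng: random.Random
) -> date:
    """Return a date between 0 and close_date_range_in_days after open_date."""
    offset = rng.randint(0, close_date_range_in_days)
    return open_date + timedelta(days=offset)


rng = random.Random(3)  # seeded only to make the example repeatable
open_date = date(2023, 5, 10)
close_dates = [random_close_date(open_date, 30, rng) for _ in range(5)]
print(close_dates)
```

Deriving the close date from the open date column guarantees that a ticket never closes before it opens, which independent random dates would not.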
ColumnName: Column name
Generator: ip_address_generator
DataDescriptor:
IpRanges: list of ranges for the four numeric parts of the IP address, each in the form lower value, upper value. For example:
- 9,10
- 1,254
- 1,128
- 2,20
The list of targets for the generator. The generator performs automatic data type conversion for every specified target. Currently, the generator supports the following targets:
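To show how four per-part ranges combine into one address, here is a small Python sketch. It is illustrative only: the function name `random_ip` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
import random


def random_ip(ip_ranges, rng: random.Random) -> str:
    """Build a dotted IPv4 address, one random value per (low, high) range."""
    return ".".join(str(rng.randint(low, high)) for low, high in ip_ranges)


# The example ranges from the IpRanges description, one pair per address part.
ip_ranges = [(9, 10), (1, 254), (1, 128), (2, 20)]
rng = random.Random(11)  # seeded only to make the example repeatable
ip = random_ip(ip_ranges, rng)
print(ip)
```

Narrowing the first ranges, as in this example, keeps generated addresses inside a small set of networks, which suits filtering tests.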
target: S3
attributes:
BucketArn: S3 bucket ARN including the prefix
mode: S3 bucket writing mode (overwrite, append)
header: include a header in the generated data (True, False)
delimiter: CSV file delimiter
target: Dynamodb
attributes:
dynamodb.output.tableName: DynamoDB table name
dynamodb.throughput.write.percent: throughput write percent
From the AWS Glue Console:
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.