Test Data Generation

Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.

Note

Throughout this guide, we use pb.preview() to display generated datasets with nice HTML formatting. This is optional: pb.generate_dataset() returns a standard DataFrame that you can display or manipulate however you prefer.

Quick Start

Generate test data using a schema with field constraints:

import pointblank as pb

# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 5 columns

|   | user_id (Int64) | name (String) | email (String) | age (Int64) | status (String) |
|---|---|---|---|---|---|
| 1 | 7188536481533917197 | Vivienne Rios | vivienne.rios@gmail.com | 77 | pending |
| 2 | 2674009078779859984 | William Schaefer | williamschaefer@aol.com | 67 | active |
| 3 | 7652102777077138151 | Lily Hansen | lilyhansen@hotmail.com | 78 | active |
| 4 | 157503859921753049 | Shirley Mays | shirley.mays27@aol.com | 36 | inactive |
| 5 | 2829213282471975080 | Sean Dawson | sean.dawson29@aol.com | 75 | pending |
| … | … | … | … | … | … |
| 96 | 7027508096731143831 | Kathryn Green | kathryn.green@hotmail.com | 55 | active |
| 97 | 6055996548456656575 | Daniel Morris | dmorris@yahoo.com | 39 | inactive |
| 98 | 3822709996092631588 | William Cooper | williamcooper@protonmail.com | 24 | inactive |
| 99 | 1522653102058131295 | Lane Sawyer | l_sawyer@zoho.com | 41 | active |
| 100 | 5690877051669225499 | Paisley Sandoval | paisley_sandoval@gmail.com | 75 | pending |

Field Types

Pointblank provides helper functions for defining typed columns with constraints:

| Function | Description | Key Parameters |
|---|---|---|
| int_field() | Integer columns | min_val, max_val, allowed, unique |
| float_field() | Float columns | min_val, max_val, allowed |
| string_field() | String columns | preset, pattern, allowed, unique |
| bool_field() | Boolean columns | p_true (probability of True) |
| date_field() | Date columns | min_val, max_val |
| datetime_field() | Datetime columns | min_val, max_val |
| time_field() | Time columns | min_val, max_val |
| duration_field() | Duration columns | min_val, max_val |

Integer Fields

Integer fields support range constraints with min_val and max_val, discrete allowed values with allowed, and uniqueness enforcement with unique=True:

schema = pb.Schema(
    id=pb.int_field(min_val=1000, max_val=9999, unique=True),
    quantity=pb.int_field(min_val=1, max_val=100),
    rating=pb.int_field(allowed=[1, 2, 3, 4, 5]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 3 columns

|   | id (Int64) | quantity (Int64) | rating (Int64) |
|---|---|---|---|
| 1 | 5749 | 100 | 3 |
| 2 | 2368 | 38 | 1 |
| 3 | 1279 | 11 | 1 |
| 4 | 6025 | 3 | 5 |
| 5 | 7942 | 76 | 3 |
| … | … | … | … |
| 96 | 5330 | 64 | 2 |
| 97 | 8634 | 31 | 1 |
| 98 | 9982 | 43 | 2 |
| 99 | 4221 | 70 | 1 |
| 100 | 8520 | 19 | 5 |

The unique=True constraint ensures no duplicate values appear in that column, which is useful for generating primary keys or identifiers.
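
Conceptually, unique=True amounts to sampling without replacement from the field's range. A minimal stdlib sketch of the idea (illustrative only, not Pointblank's internal implementation; unique_ints is a hypothetical helper):

```python
import random

def unique_ints(n: int, min_val: int, max_val: int, seed=None) -> list[int]:
    """Draw n distinct integers in [min_val, max_val] (sampling without replacement)."""
    rng = random.Random(seed)
    span = max_val - min_val + 1
    if n > span:
        raise ValueError(f"cannot draw {n} unique values from a range of {span}")
    return rng.sample(range(min_val, max_val + 1), n)

ids = unique_ints(100, 1000, 9999, seed=23)
assert len(set(ids)) == 100                      # no duplicates
assert all(1000 <= v <= 9999 for v in ids)       # range respected
```

Note that the range must contain at least n values, or uniqueness is impossible to satisfy.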

Float Fields

Float fields work similarly to integers, with min_val and max_val defining the range of generated values:

schema = pb.Schema(
    price=pb.float_field(min_val=0.0, max_val=1000.0),
    discount=pb.float_field(min_val=0.0, max_val=0.5),
    temperature=pb.float_field(min_val=-40.0, max_val=50.0),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 3 columns

|   | price (Float64) | discount (Float64) | temperature (Float64) |
|---|---|---|---|
| 1 | 924.8652516259452 | 0.4624326258129726 | 43.23787264633508 |
| 2 | 948.6057779931772 | 0.47430288899658857 | 45.37452001938594 |
| 3 | 892.4333440485793 | 0.44621667202428966 | 40.31900096437214 |
| 4 | 83.55067683068363 | 0.04177533841534181 | -32.48043908523847 |
| 5 | 592.0272268857353 | 0.29601361344286764 | 13.282450419716177 |
| … | … | … | … |
| 96 | 444.6925279641446 | 0.2223462639820723 | 0.022327516773010814 |
| 97 | 342.7762214585577 | 0.17138811072927884 | -9.150140068729808 |
| 98 | 892.3288689140903 | 0.4461644344570452 | 40.309598202268134 |
| 99 | 813.7559456012128 | 0.4068779728006064 | 33.238035104109144 |
| 100 | 895.1816604808429 | 0.44759083024042146 | 40.56634944327587 |

Values are uniformly distributed across the specified range, making this useful for simulating measurements, prices, or any continuous numeric data.
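
In stdlib terms, each cell is an independent uniform draw over the range, along the lines of:

```python
import random

# Each float_field value is conceptually one uniform draw over [min_val, max_val]
rng = random.Random(23)
prices = [rng.uniform(0.0, 1000.0) for _ in range(1_000)]

assert all(0.0 <= p <= 1000.0 for p in prices)   # every draw stays in range
```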

String Fields with Presets

Presets generate realistic data like names, emails, and addresses. When you include related fields like name and email in the same schema, Pointblank ensures coherence (e.g., the email address will be derived from the person’s name), making the generated data more realistic:

schema = pb.Schema(
    full_name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 4 columns

|   | full_name (String) | email (String) | company (String) | city (String) |
|---|---|---|---|---|
| 1 | Kingston Miller | k_miller@zoho.com | Innovative Systems Solutions | Hollywood |
| 2 | Kaden Mosley | kaden.mosley9@protonmail.com | Sterling Engineering | Santa Ana |
| 3 | Brooks Wilkerson | brooks703@yahoo.com | Goldman Sachs | Rochester |
| 4 | Juliana Mitchell | jmitchell@zoho.com | Simmons LLC | Bloomington |
| 5 | Barbara Walters | barbara662@icloud.com | Frontier Systems | Toledo |
| … | … | … | … | … |
| 96 | Cheryl Robinson | cheryl.robinson@zoho.com | Watts Retail | Henderson |
| 97 | Elijah Cunningham | ecunningham22@hotmail.com | National Solutions International | Aurora |
| 98 | Magnolia Mosley | magnolia_mosley@aol.com | Silver Consulting International | Vancouver |
| 99 | Stella Gray | stella_gray@mail.com | Elite Realty Services | Syracuse |
| 100 | Harrison Allen | harrison.allen25@outlook.com | Tate Ltd | Plano |

This coherence extends to other related fields like user_name, which will also reflect the person’s name when included alongside name and email fields.

String Fields with Patterns

Use regex patterns to generate strings matching specific formats:

schema = pb.Schema(
    product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"),
    phone=pb.string_field(pattern=r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"),
    hex_color=pb.string_field(pattern=r"#[0-9A-F]{6}"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 3 columns

|   | product_code (String) | phone (String) | hex_color (String) |
|---|---|---|---|
| 1 | CAS-6685 | (109) 668-2347 | #209DCB |
| 2 | XGI-0397 | (397) 117-0865 | #68E07E |
| 3 | DCW-6086 | (309) 293-9594 | #32FD0D |
| 4 | YBG-9529 | (917) 797-2285 | #161B56 |
| 5 | XLS-9459 | (911) 609-9495 | #B9A2F5 |
| … | … | … | … |
| 96 | THG-2900 | (993) 511-5415 | #A7A37B |
| 97 | CHC-3681 | (065) 802-0822 | #47E498 |
| 98 | HKT-3552 | (927) 701-4276 | #AF75D8 |
| 99 | OEW-4157 | (365) 419-1062 | #5CCD95 |
| 100 | FSX-8948 | (897) 459-3038 | #0F3220 |

Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.
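
One practical consequence: any generated column can be round-trip checked against its own pattern with Python's re module. Using values from the preview above:

```python
import re

# The same patterns used in the schema above
PATTERNS = {
    "product_code": r"[A-Z]{3}-[0-9]{4}",
    "phone": r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}",
    "hex_color": r"#[0-9A-F]{6}",
}

# Sample values taken from the generated preview
samples = {
    "product_code": "CAS-6685",
    "phone": "(109) 668-2347",
    "hex_color": "#209DCB",
}

# fullmatch ensures the entire string conforms, not just a substring
for col, pattern in PATTERNS.items():
    assert re.fullmatch(pattern, samples[col]) is not None
```

This kind of check is a cheap sanity test when you add a new pattern field to a schema.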

Boolean Fields

Control the probability of True values:

schema = pb.Schema(
    is_active=pb.bool_field(p_true=0.8),      # 80% True
    is_premium=pb.bool_field(p_true=0.2),     # 20% True
    is_verified=pb.bool_field(),              # 50% True (default)
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 3 columns

|   | is_active (Boolean) | is_premium (Boolean) | is_verified (Boolean) |
|---|---|---|---|
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | True | True | True |
| 5 | True | False | False |
| … | … | … | … |
| 96 | True | False | True |
| 97 | True | False | True |
| 98 | False | False | False |
| 99 | False | False | False |
| 100 | False | False | False |

This probabilistic control is helpful when you need to simulate real-world distributions where certain states are more common than others.
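
Under the hood this is equivalent to an independent Bernoulli draw per row; a stdlib sketch of the idea (bernoulli_column is an illustrative helper, not part of Pointblank's API):

```python
import random

def bernoulli_column(n: int, p_true: float, seed=None) -> list[bool]:
    """Generate n booleans where each is True with probability p_true."""
    rng = random.Random(seed)
    return [rng.random() < p_true for _ in range(n)]

col = bernoulli_column(10_000, 0.8, seed=23)
share_true = sum(col) / len(col)
assert abs(share_true - 0.8) < 0.05   # observed share is close to p_true for large n
```

For small n the observed proportion can drift noticeably from p_true, which is expected sampling variance.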

Date and Datetime Fields

Temporal fields accept Python date and datetime objects for their range boundaries, generating values uniformly distributed within the specified period:

from datetime import date, datetime

schema = pb.Schema(
    birth_date=pb.date_field(
        min_val=date(1960, 1, 1),
        max_val=date(2005, 12, 31)
    ),
    created_at=pb.datetime_field(
        min_val=datetime(2024, 1, 1),
        max_val=datetime(2024, 12, 31)
    ),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 2 columns

|   | birth_date (Date) | created_at (Datetime) |
|---|---|---|
| 1 | 1986-01-03 | 2024-12-25 04:22:08 |
| 2 | 1967-06-30 | 2024-10-29 16:22:23 |
| 3 | 1961-07-13 | 2024-04-22 14:13:08 |
| 4 | 1987-07-09 | 2024-12-12 14:04:53 |
| 5 | 1998-01-06 | 2024-11-18 04:49:47 |
| … | … | … |
| 96 | 1969-04-14 | 2024-07-29 13:15:44 |
| 97 | 1975-03-23 | 2024-04-28 08:49:29 |
| 98 | 1981-05-29 | 2024-12-13 09:42:37 |
| 99 | 1982-09-14 | 2024-10-28 23:35:39 |
| 100 | 1968-12-21 | 2024-06-25 14:22:27 |

The same pattern applies to time_field() and duration_field(), allowing you to generate realistic temporal data for any use case.
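
Conceptually, a uniform date draw just picks a day offset uniformly within the span; a stdlib sketch (random_date is a hypothetical helper, not Pointblank's implementation):

```python
import random
from datetime import date, timedelta

def random_date(min_val: date, max_val: date, rng: random.Random) -> date:
    """Pick a day uniformly between min_val and max_val (inclusive)."""
    span = (max_val - min_val).days
    return min_val + timedelta(days=rng.randint(0, span))

rng = random.Random(23)
d = random_date(date(1960, 1, 1), date(2005, 12, 31), rng)
assert date(1960, 1, 1) <= d <= date(2005, 12, 31)
```

Datetime generation works the same way with a second-level (or finer) offset instead of whole days.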

Available Presets

The preset= parameter in string_field() supports many data types:

Personal Data:

  • name: full name (first + last)
  • name_full: full name with optional prefix/suffix (e.g., “Dr. Ana Sousa”, “Prof. Tanaka Yuki”)
  • first_name: first name only
  • last_name: last name only
  • email: email address
  • phone_number: phone number in country-specific format

Location Data:

  • address: full street address
  • city: city name
  • state: state/province name
  • country: country name
  • postcode: postal/ZIP code
  • latitude: latitude coordinate
  • longitude: longitude coordinate

Business Data:

  • company: company name
  • job: job title
  • catch_phrase: business catch phrase

Internet Data:

  • url: website URL
  • domain_name: domain name
  • ipv4: IPv4 address
  • ipv6: IPv6 address
  • user_name: username
  • password: password

Financial Data:

  • credit_card_number: credit card number
  • iban: International Bank Account Number
  • currency_code: currency code (USD, EUR, etc.)

Identifiers:

  • uuid4: UUID version 4
  • md5: MD5 hash (32 hex characters)
  • sha1: SHA-1 hash (40 hex characters)
  • sha256: SHA-256 hash (64 hex characters)
  • ssn: Social Security Number (country-specific format)
  • license_plate: vehicle license plate (location-aware for CA, US, DE, AU, GB)

Barcodes:

  • ean8: EAN-8 barcode with valid check digit
  • ean13: EAN-13 barcode with valid check digit

Date/Time:

  • date_this_year: a date within the current year
  • date_this_decade: a date within the current decade
  • date_between: a random date between 2000 and 2025
  • date_range: two dates joined with an en-dash (e.g., "2012-05-12 – 2015-11-22")
  • future_date: a date up to 1 year in the future
  • past_date: a date up to 10 years in the past
  • time: a time value

Text:

  • word: single word
  • sentence: full sentence
  • paragraph: paragraph of text
  • text: multiple paragraphs

Miscellaneous:

  • color_name: color name
  • file_name: file name
  • file_extension: file extension
  • mime_type: MIME type
  • user_agent: browser user agent string (country-weighted)

Country-Specific Data

One of the most powerful features is generating locale-aware data. Use the country= parameter to generate data specific to a country. This affects names, cities, addresses, and other locale-sensitive presets.

Let’s create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures consistency across related fields. The city, address, postcode, and coordinates will all correspond to the same location:

# Schema with linked location fields
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    address=pb.string_field(preset="address"),
    postcode=pb.string_field(preset="postcode"),
    latitude=pb.string_field(preset="latitude"),
    longitude=pb.string_field(preset="longitude"),
)

Here’s German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="DE"))

Polars DataFrame: 200 rows × 6 columns

|   | name (String) | city (String) | address (String) | postcode (String) | latitude (String) | longitude (String) |
|---|---|---|---|---|---|---|
| 1 | Ines Flohr | Sachsenhausen | Oppenheimer Landstraße 8446, 60559 Sachsenhausen | 60569 | 50.102657 | 8.698808 |
| 2 | Joachim Pohlmann | St. Pauli | Talstraße 3672, 20302 St. Pauli | 20392 | 53.540407 | 9.968037 |
| 3 | Elfriede Sander | Kreuzberg | Mariannenstraße 990, 10911 Kreuzberg | 10927 | 52.489220 | 13.409102 |
| 4 | Wilhelm Opitz | Dessau-Roßlau | Königstraße 1418, 06784 Dessau-Roßlau | 06116 | 51.810370 | 12.259340 |
| 5 | Ursula Westphal | Wiesbaden | Mozartstraße 8328, Whg. 683, 65008 Wiesbaden | 65936 | 50.087802 | 8.256100 |
| … | … | … | … | … | … | … |
| 196 | Hildegard Reinhardt | Berlin | Rüdesheimer Platz 4345, Whg. 595, 10036 Berlin | 10642 | 52.443180 | 13.612304 |
| 197 | Arnold Münz | Stuttgart | Rotebühlplatz 5931, Whg. 911, 70562 Stuttgart | 70441 | 48.759266 | 9.178001 |
| 198 | Dominik Bachmann | Ulm | Bahnhofstraße 9887, 89272 Ulm | 89286 | 48.404127 | 10.016080 |
| 199 | Alexander Busch | Prenzlauer Berg | Fehrbelliner Straße 3918, Whg. 378, 10467 Prenzlauer Berg | 10411 | 52.546146 | 13.434788 |
| 200 | Bianca Bollmann | Augsburg | Karlstraße 9976, 86392 Augsburg | 86227 | 48.413405 | 10.903475 |

Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan’s geographic boundaries:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="JP"))

Polars DataFrame: 200 rows × 6 columns

|   | name (String) | city (String) | address (String) | postcode (String) | latitude (String) | longitude (String) |
|---|---|---|---|---|---|---|
| 1 | Tadayuki Hara | Fukuyama | 720-8233 Hiroshima Fukuyama Nishi-cho 9502-435 | 720-4531 | 34.514784 | 133.376591 |
| 2 | Takafumi Kato | Kamakura | 248-7907 Kanagawa Kamakura Hase 8877-274 | 248-9166 | 35.325474 | 139.545896 |
| 3 | Gota Kashiwagi | Ikebukuro | 171-1830 Tokyo Ikebukuro Shiinamachi-dori 6360-350 | 171-8706 | 35.726125 | 139.702343 |
| 4 | Shinya Fujishima | Mihara | 723-6084 Hiroshima Mihara Onomichi-dori 8740-969 | 723-5615 | 34.393466 | 133.116512 |
| 5 | Nodoka Kuwata | Fuji | 416-2282 Shizuoka Fuji Shin-Fuji Eki-mae 3347-837 | 416-8803 | 35.183740 | 138.673396 |
| … | … | … | … | … | … | … |
| 196 | Manami Inagawa | Nagaoka | 940-4484 Niigata Nagaoka Ojiya-dori 288-565 | 940-8762 | 37.477659 | 138.841431 |
| 197 | Takuya Komori | Chiba | 260-8345 Chiba Chiba Kaihin-makuhari 2538-65 | 260-9734 | 35.591983 | 140.145145 |
| 198 | Hayato Sakamoto | Kure | 737-4862 Hiroshima Kure Nishimachi 2690-670 | 737-2541 | 34.256791 | 132.585931 |
| 199 | Mitsuko Tateno | Chigasaki | 253-7355 Kanagawa Chigasaki Shonan-dori 8470 | 253-6378 | 35.344598 | 139.419824 |
| 200 | Toshiko Tominaga | Hakodate | 040-3017 Hokkaido Hakodate Omoricho 5503 | 040-7893 | 41.808678 | 140.736790 |

Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil’s CEP format:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="BR"))

Polars DataFrame: 200 rows × 6 columns

|   | name (String) | city (String) | address (String) | postcode (String) | latitude (String) | longitude (String) |
|---|---|---|---|---|---|---|
| 1 | Iraci Quiroga | Porto Velho | Rua Brasília, 8740, 76084-235 Porto Velho - RO | 76813-377 | -8.783359 | -63.944499 |
| 2 | Almir Pires | Aracaju | Rua Lagarto, 8874, 49572-600 Aracaju - SE | 49876-654 | -10.910802 | -37.069824 |
| 3 | Ubaldo Leite | Porto Alegre | Rua Coronel Fernando Machado, 3821, Apto 888, 90352-709 Porto Alegre - RS | 90522-720 | -30.097326 | -51.222829 |
| 4 | Edilene Rabello | Ribeirão Preto | Rua Barão do Amazonas, 3087, Apto 470, 14878-711 Ribeirão Preto - SP | 14678-905 | -21.107871 | -47.773963 |
| 5 | Theo Moura | Londrina | Avenida Voluntários da Pátria, 9139, Apto 338, 86513-524 Londrina - PR | 86338-199 | -23.343806 | -51.184776 |
| … | … | … | … | … | … | … |
| 196 | Eunice Tavares | Curitiba | Avenida Cândido de Abreu, 1437, Apto 392, 80883-545 Curitiba - PR | 80653-272 | -25.443571 | -49.223253 |
| 197 | Saulo Vilanova | Duque de Caxias | Avenida Washington Luís, 1691, 25794-818 Duque de Caxias - RJ | 25192-165 | -22.796907 | -43.287355 |
| 198 | Cauã Beltrão | Porto Alegre | Avenida Borges de Medeiros, 1400, Apto 798, 90327-342 Porto Alegre - RS | 90197-296 | -30.029122 | -51.214312 |
| 199 | Caetano Duarte | Campinas | Avenida Aquidabã, 6485, Apto 775, 13037-569 Campinas - SP | 13674-387 | -22.896391 | -47.005293 |
| 200 | Luiz Serrano | Campinas | Rua Sacramento, 3353, Apto 354, 13184-367 Campinas - SP | 13766-592 | -22.901606 | -47.083738 |

This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally-consistent location data matters.

Data Coherence

Pointblank automatically links related columns to produce realistic rows. There are three coherence systems that activate based on which presets appear together in a schema:

Address coherence activates when any address-related preset is present (address, city, state, postcode, latitude, longitude, phone_number, license_plate). All of these fields will refer to the same location within each row.

Person coherence activates when any person-related preset is present (name, name_full, first_name, last_name, email, user_name). The email and username are derived from the person’s name.

Business coherence activates when both job and company are present. When active:

  • the company and job title are drawn from the same industry (e.g., a nurse will work at a hospital, not a law firm).
  • name_full gains profession-matched titles: a doctor may appear as “Dr. Ana Sousa” and a professor as “Prof. Tanaka Yuki”. For German-speaking countries (DE, AT, CH), the honorific stacks before the professional title (e.g., “Herr Dr. med. Klaus Weber”).
  • integer columns whose name contains age (e.g., age, person_age) are automatically constrained to a working-age range (22–65).

Here’s an example showing all three coherence systems working together:

schema = pb.Schema(
    name=pb.string_field(preset="name_full"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    job=pb.string_field(preset="job"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    license_plate=pb.string_field(preset="license_plate"),
    age=pb.int_field(),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23, country="DE"))

Polars DataFrame: 100 rows × 8 columns

|   | name (String) | email (String) | company (String) | job (String) | city (String) | state (String) | license_plate (String) | age (Int64) |
|---|---|---|---|---|---|---|---|---|
| 1 | Herr Ben Neumann | bneumann62@posteo.de | Global Immobilien Gruppe | Immobilienmakler | Sachsenhausen | Hessen | F-HA 754 | 40 |
| 2 | Herr Gerd Fleischer | gerd153@gmail.com | St. Pauli Gesamtschule | Lehrer | St. Pauli | Hamburg | HH-TS 5054 | 27 |
| 3 | Frau Annette Graf | annette.graf96@t-online.de | Technische Fertigung Industrie | Maschinenbauingenieur | Kreuzberg | Berlin | B-N 4646 | 23 |
| 4 | Herr Dr. rer. nat. Leonhard Wagner | leonhard.wagner56@gmx.de | Finke-Klinik | Apotheker | Dessau-Roßlau | Sachsen-Anhalt | DE-VV 904 | 59 |
| 5 | Herr Dr. med. Arnold Bormann | a_bormann@gmail.com | Bayer | Arzt | Wiesbaden | Hessen | WI-YH 726 | 41 |
| … | … | … | … | … | … | … | … | … |
| 96 | Frau Dr. rer. nat. Karin Meier | kmeier@web.de | Augsburg Klinikum | Apotheker | Augsburg | Bayern | A-AB 846 | 24 |
| 97 | Herr Wilhelm Wimmer | wilhelm.wimmer@posteo.de | Innovative Analytik | Data Scientist | Homburg | Saarland | HOM-SD 101 | 25 |
| 98 | Frau Grete Bormann | gretebormann@arcor.de | Finke Partner | Buchhalter | Dessau-Roßlau | Sachsen-Anhalt | DE-P 031 | 62 |
| 99 | Frau Auguste Ziegler | auguste_ziegler@mail.de | Lufthansa | Betriebsleiter | Harburg | Hamburg | HH-GS 2524 | 48 |
| 100 | Herr Berthold Schulze | berthold_schulze@web.de | Universität Heilbronn | Lehrer | Heilbronn | Baden-Württemberg | HN-G 215 | 47 |

License plate coherence is part of address coherence. For CA, US, DE, AU, and GB, license plates follow real subregion-specific formats when location fields are present. For example, an Ontario row produces plates like "CABC 123" while a British Columbia row produces "AB1 23C". Letters I, O, Q, and U are excluded from plate generation, matching real-world restrictions.

Supported Countries

Pointblank currently supports 55 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., "US") or alpha-3 codes (e.g., "USA").

Europe (32 countries):

  • Austria (AT), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czech Republic (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Germany (DE), Greece (GR), Hungary (HU), Iceland (IS), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Russia (RU), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), Switzerland (CH), United Kingdom (GB)

Americas (7 countries):

  • Argentina (AR), Brazil (BR), Canada (CA), Chile (CL), Colombia (CO), Mexico (MX), United States (US)

Asia-Pacific (12 countries):

  • Australia (AU), China (CN), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), New Zealand (NZ), Philippines (PH), Singapore (SG), South Korea (KR), Taiwan (TW), Thailand (TH)

Middle East & Africa (4 countries):

  • Nigeria (NG), South Africa (ZA), Turkey (TR), United Arab Emirates (AE)

Additional countries and expanded coverage are planned for future releases.

Mixing Multiple Countries

When you need test data that spans multiple locales (e.g., simulating an international customer base), you can pass a list or dict to the country= parameter instead of a single string.

Passing a list of country codes splits rows equally across those countries. Here, 200 rows are divided evenly among the US, Germany, and Japan (~67 each):

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    postcode=pb.string_field(preset="postcode"),
)

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country=["US", "DE", "JP"]))

Polars DataFrame: 200 rows × 3 columns

|   | name (String) | city (String) | postcode (String) |
|---|---|---|---|
| 1 | Mitzi Klinger | Trier | 54657 |
| 2 | Tsubasa Kitazaki | Urayasu | 279-3480 |
| 3 | Theodore Garrett | Durham | 27762 |
| 4 | Barbara Woodward | Port St. Lucie | 34982 |
| 5 | Ilona Schönfeld | Ingolstadt | 85529 |
| … | … | … | … |
| 196 | Eita Imaoka | Fukuoka | 810-2690 |
| 197 | Eric Dixon | Lake Charles | 70604 |
| 198 | Carson Bridges | Grand Rapids | 49505 |
| 199 | Fiona Bartsch | Düsseldorf | 40182 |
| 200 | Haruma Higashino | Iwata | 438-6071 |

To control the proportion of rows per country, pass a dict mapping country codes to weights. The following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:

pb.preview(
    pb.generate_dataset(
        schema, n=200, seed=23,
        country={"US": 0.7, "DE": 0.2, "FR": 0.1},
    )
)

Polars DataFrame: 200 rows × 3 columns

|   | name (String) | city (String) | postcode (String) |
|---|---|---|---|
| 1 | Genesis Donovan | Scottsdale | 85228 |
| 2 | Bernd Neuhaus | Berlin | 10017 |
| 3 | Albert Murphy | Durham | 27760 |
| 4 | Donna Clay | Port St. Lucie | 34911 |
| 5 | Cora Meyer | Irving | 75027 |
| … | … | … | … |
| 196 | Patrick Schreiber | Schöneberg | 10778 |
| 197 | Brynlee Schmidt | Lake Charles | 70673 |
| 198 | Grayson Adams | Grand Rapids | 49519 |
| 199 | Judith Myers | Reading | 19681 |
| 200 | Éméline Deschamps | Pau | 64010 |

Weights are auto-normalized, so {"US": 7, "DE": 2, "FR": 1} is equivalent to the example above. Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly n.
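
Largest-remainder apportionment floors each country's ideal share and then hands leftover rows to the largest fractional remainders. A small illustrative implementation (not Pointblank's internal code):

```python
from math import floor

def largest_remainder(n: int, weights: dict[str, float]) -> dict[str, int]:
    """Allocate n rows across countries so the counts sum to exactly n."""
    total = sum(weights.values())                              # normalization is implicit
    quotas = {c: n * w / total for c, w in weights.items()}    # ideal fractional shares
    counts = {c: floor(q) for c, q in quotas.items()}          # floor each share
    leftover = n - sum(counts.values())
    # hand remaining rows to the largest fractional remainders
    for c in sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)[:leftover]:
        counts[c] += 1
    return counts

assert largest_remainder(200, {"US": 0.7, "DE": 0.2, "FR": 0.1}) == {"US": 140, "DE": 40, "FR": 20}
assert largest_remainder(200, {"US": 7, "DE": 2, "FR": 1}) == {"US": 140, "DE": 40, "FR": 20}
```

The two assertions show that raw weights and pre-normalized fractions produce identical allocations.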

By default, rows from different countries are interleaved randomly (shuffle=True). Set shuffle=False to keep rows grouped by country in the order the countries are listed:

pb.preview(
    pb.generate_dataset(
        schema, n=120, seed=23,
        country=["US", "DE", "JP"], shuffle=False,
    )
)

Polars DataFrame: 120 rows × 3 columns

|   | name (String) | city (String) | postcode (String) |
|---|---|---|---|
| 1 | Theodore Harmon | Hialeah | 33061 |
| 2 | Claire Bell | Bend | 97736 |
| 3 | Simon Villegas | Raleigh | 27651 |
| 4 | Autumn Kelly | Brooklyn | 11230 |
| 5 | Leo Conner | Lake Charles | 70697 |
| … | … | … | … |
| 116 | Maiko Endo | Fukuoka | 810-5676 |
| 117 | Chiharu Taniguchi | Hakodate | 040-2391 |
| 118 | Kaede Namiki | Arashiyama | 616-7994 |
| 119 | Eiji Takai | Sakae | 460-0768 |
| 120 | Takahiro Matsunaga | Fukuoka | 810-3742 |

All coherence systems (address, person, business) work correctly within each country’s batch of rows. A French row will have a French name with a matching French email; a Japanese row will have a Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates) are generated independently for each batch but still respect their field constraints.

Output Formats

The generate_dataset() function supports multiple output formats via the output= parameter, making it easy to integrate with your preferred data processing library.

schema = pb.Schema(
    id=pb.int_field(min_val=1),
    name=pb.string_field(preset="name"),
)

The default output is a Polars DataFrame, which offers excellent performance and a modern API for data manipulation:

polars_df = pb.generate_dataset(schema, n=100, seed=23, output="polars")

pb.preview(polars_df)

Polars DataFrame: 100 rows × 2 columns

|   | id (Int64) | name (String) |
|---|---|---|
| 1 | 7188536481533917197 | Vivienne Rios |
| 2 | 2674009078779859984 | William Schaefer |
| 3 | 7652102777077138151 | Lily Hansen |
| 4 | 157503859921753049 | Shirley Mays |
| 5 | 2829213282471975080 | Sean Dawson |
| … | … | … |
| 96 | 7027508096731143831 | Kathryn Green |
| 97 | 6055996548456656575 | Daniel Morris |
| 98 | 3822709996092631588 | William Cooper |
| 99 | 1522653102058131295 | Lane Sawyer |
| 100 | 5690877051669225499 | Paisley Sandoval |

If your workflow uses Pandas, simply specify output="pandas" to get a Pandas DataFrame:

pandas_df = pb.generate_dataset(schema, n=100, seed=23, output="pandas")

pb.preview(pandas_df)

Pandas DataFrame: 100 rows × 2 columns

|   | id (int64) | name (str) |
|---|---|---|
| 1 | 7188536481533917197 | Vivienne Rios |
| 2 | 2674009078779859984 | William Schaefer |
| 3 | 7652102777077138151 | Lily Hansen |
| 4 | 157503859921753049 | Shirley Mays |
| 5 | 2829213282471975080 | Sean Dawson |
| … | … | … |
| 96 | 7027508096731143831 | Kathryn Green |
| 97 | 6055996548456656575 | Daniel Morris |
| 98 | 3822709996092631588 | William Cooper |
| 99 | 1522653102058131295 | Lane Sawyer |
| 100 | 5690877051669225499 | Paisley Sandoval |

Both formats work seamlessly with Pointblank’s validation functions, so you can choose whichever fits best with your existing data pipeline.

Using Generated Data for Validation Testing

A common use case is generating synthetic data to test your validation rules:

# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)

validation
Pointblank Validation · 2026-02-18 18:55:42 · Polars

| STEP | COLUMNS | VALUES | UNITS | PASS | FAIL |
|---|---|---|---|---|---|
| 1: col_vals_gt() | user_id | 0 | 100 | 100 (1.00) | 0 (0.00) |
| 2: col_vals_regex() | email | .+@.+\..+ | 100 | 100 (1.00) | 0 (0.00) |
| 3: col_vals_between() | age | [18, 100] | 100 | 100 (1.00) | 0 (0.00) |
| 4: col_vals_in_set() | status | active, pending, inactive | 100 | 100 (1.00) | 0 (0.00) |

Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.

Pytest Fixture

When Pointblank is installed, a generate_dataset pytest fixture is automatically available in all your test files. There is no need to import anything or add configuration to conftest.py: the fixture is registered via pytest’s plugin system.

The fixture works identically to pb.generate_dataset(), but with one key difference: when you don’t supply a seed= parameter, a deterministic seed is automatically derived from the test’s fully-qualified name. This means:

  • the same test always produces the same data: no manual seed management required.
  • different tests get different seeds, so they exercise different datasets.
  • you can still pass an explicit seed= to override the automatic seed when needed.
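
The derivation can be pictured as hashing the test's node id into an integer seed; the sketch below illustrates the idea (seed_from_test_name is a hypothetical helper; the fixture's actual scheme may differ):

```python
import hashlib

def seed_from_test_name(nodeid: str, call_index: int = 0) -> int:
    """Map a pytest node id (plus a per-test call counter) to a stable 64-bit seed."""
    digest = hashlib.sha256(f"{nodeid}:{call_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

s1 = seed_from_test_name("test_pipeline.py::test_etl_handles_nulls")
s2 = seed_from_test_name("test_pipeline.py::test_etl_handles_nulls")
s3 = seed_from_test_name("test_pipeline.py::test_merge_pipeline")
assert s1 == s2   # same test name -> same seed, run after run
assert s1 != s3   # different tests -> different seeds
```

The call counter is what lets a single test request several distinct-but-deterministic datasets.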

Basic Usage

Use it by adding generate_dataset to your test function’s parameter list:

test_pipeline.py
import polars as pl
import pointblank as pb

def test_etl_handles_nulls(generate_dataset):
    schema = pb.Schema(
        user_id=pb.int_field(unique=True),
        email=pb.string_field(preset="email", nullable=True, null_probability=0.3),
        age=pb.int_field(min_val=0, max_val=120),
    )

    df = generate_dataset(schema, n=500)
    result = my_etl_pipeline(df)
    assert result.filter(pl.col("email").is_null()).shape[0] == 0

All parameters from generate_dataset() are supported: n=, seed=, output=, and country=:

def test_german_data(generate_dataset):
    schema = pb.Schema(
        name=pb.string_field(preset="name"),
        city=pb.string_field(preset="city"),
    )

    df = generate_dataset(schema, n=200, country="DE", output="pandas")
    assert len(df) == 200

Multiple Datasets in One Test

Calling the fixture multiple times within the same test produces different (but still deterministic) data on each call:

def test_merge_pipeline(generate_dataset):
    customers = generate_dataset(customer_schema, n=1000, country="US")
    orders = generate_dataset(order_schema, n=5000)

    # Each call gets a unique seed derived from the test name + call index,
    # so both DataFrames are deterministic and different from each other.
    result = merge_pipeline(customers, orders)
    assert result.shape[0] > 0

Testing Across Locales

The fixture makes locale testing particularly concise when combined with pytest.mark.parametrize:

import pytest
import pointblank as pb

@pytest.mark.parametrize("country", ["US", "DE", "JP", "BR"])
def test_name_normalizer(generate_dataset, country):
    schema = pb.Schema(name=pb.string_field(preset="name_full"))
    df = generate_dataset(schema, n=100, country=country)
    result = normalize_names(df)
    assert result["name"].str.len_chars().min() > 0

Sharing Schemas Across Tests

Define schemas as fixtures in conftest.py and compose them with generate_dataset:

conftest.py
import pytest
import pointblank as pb

@pytest.fixture
def customer_schema():
    return pb.Schema(
        id=pb.int_field(unique=True),
        name=pb.string_field(preset="name"),
        email=pb.string_field(preset="email"),
        city=pb.string_field(preset="city"),
    )

test_validation.py
def test_customer_validation(generate_dataset, customer_schema):
    df = generate_dataset(customer_schema, n=200, country="DE")
    validation = pb.Validate(df).col_vals_not_null(columns="email").interrogate()
    assert validation.all_passed()

test_export.py
def test_customer_export(generate_dataset, customer_schema):
    df = generate_dataset(customer_schema, n=50, country="JP")
    exported = export_to_parquet(df)
    assert exported.exists()

Debugging with Seed Introspection

The fixture callable exposes two attributes that make debugging failed tests straightforward:

  • generate_dataset.default_seed: the base seed derived from the test name (available before any call)
  • generate_dataset.last_seed: the seed actually used for the most recent call (accounts for the call counter and explicit overrides)

Include .last_seed in assertion messages so failures are immediately reproducible:

def test_age_range(generate_dataset):
    schema = pb.Schema(age=pb.int_field(min_val=18, max_val=100))
    df = generate_dataset(schema, n=500)
    min_age = df["age"].min()
    assert min_age >= 18, (
        f"Expected min age >= 18, got {min_age} (seed={generate_dataset.last_seed})"
    )

You can also use .default_seed to reproduce the exact dataset outside of pytest:

# In a REPL or notebook, reproduce the data from a failed test:
import pointblank as pb
df = pb.generate_dataset(schema, n=500, seed=<default_seed_from_output>)

Seed Stability

A given seed (whether explicit or auto-derived) is guaranteed to produce identical output within the same Pointblank version. Across versions, changes to country data files or generator logic may alter the output for a given seed.

For CI pipelines that require bit-exact data across library upgrades, we recommend saving generated DataFrames as Parquet or CSV snapshot files rather than relying on cross-version seed stability. This is the same approach used by snapshot-testing tools like pytest-snapshot and syrupy.

Conclusion

Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows. By incorporating test data generation into your process, you can:

  • quickly prototype validation rules before working with production data
  • create reproducible test fixtures for automated testing and CI/CD pipelines
  • generate locale-specific data for internationalization testing across 55 countries
  • ensure coherent relationships between related fields like names, emails, addresses, jobs, and license plates
  • produce datasets of any size with consistent, realistic values

Whether you’re building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.