Test Data Generation

Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.

Note

Throughout this guide, we use pb.preview() to display generated datasets with nice HTML formatting. This is optional: pb.generate_dataset() returns a standard DataFrame that you can display or manipulate however you prefer.

Quick Start

Generate test data using a schema with field constraints:

import pointblank as pb

# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 5 columns

|   | user_id (Int64) | name (String) | email (String) | age (Int64) | status (String) |
|---|---|---|---|---|---|
| 1 | 7188536481533917197 | Vivienne Rios | vivienne.rios@gmail.com | 77 | pending |
| 2 | 2674009078779859984 | William Schaefer | williamschaefer@aol.com | 67 | active |
| 3 | 7652102777077138151 | Lily Hansen | lilyhansen@hotmail.com | 78 | active |
| 4 | 157503859921753049 | Shirley Mays | shirley.mays27@aol.com | 36 | inactive |
| 5 | 2829213282471975080 | Sean Dawson | sean.dawson29@aol.com | 75 | pending |
| … | … | … | … | … | … |
| 96 | 7027508096731143831 | Kathryn Green | kathryn.green@hotmail.com | 55 | active |
| 97 | 6055996548456656575 | Daniel Morris | dmorris@yahoo.com | 39 | inactive |
| 98 | 3822709996092631588 | William Cooper | williamcooper@protonmail.com | 24 | inactive |
| 99 | 1522653102058131295 | Lane Sawyer | l_sawyer@zoho.com | 41 | active |
| 100 | 5690877051669225499 | Paisley Sandoval | paisley_sandoval@gmail.com | 75 | pending |

Field Types

Pointblank provides helper functions for defining typed columns with constraints:

| Function | Description | Key Parameters |
|---|---|---|
| int_field() | Integer columns | min_val, max_val, allowed, unique |
| float_field() | Float columns | min_val, max_val, allowed |
| string_field() | String columns | preset, pattern, allowed, unique |
| bool_field() | Boolean columns | p_true (probability of True) |
| date_field() | Date columns | min_val, max_val |
| datetime_field() | Datetime columns | min_val, max_val |
| time_field() | Time columns | min_val, max_val |
| duration_field() | Duration columns | min_val, max_val |

Integer Fields

Integer fields support range constraints with min_val and max_val, discrete allowed values with allowed, and uniqueness enforcement with unique=True:

schema = pb.Schema(
    id=pb.int_field(min_val=1000, max_val=9999, unique=True),
    quantity=pb.int_field(min_val=1, max_val=100),
    rating=pb.int_field(allowed=[1, 2, 3, 4, 5]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 3 columns

|   | id (Int64) | quantity (Int64) | rating (Int64) |
|---|---|---|---|
| 1 | 5749 | 100 | 3 |
| 2 | 2368 | 38 | 1 |
| 3 | 1279 | 11 | 1 |
| 4 | 6025 | 3 | 5 |
| 5 | 7942 | 76 | 3 |
| … | … | … | … |
| 96 | 5330 | 64 | 2 |
| 97 | 8634 | 31 | 1 |
| 98 | 9982 | 43 | 2 |
| 99 | 4221 | 70 | 1 |
| 100 | 8520 | 19 | 5 |

The unique=True constraint ensures no duplicate values appear in that column, which is useful for generating primary keys or identifiers.
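
Conceptually, unique=True amounts to sampling without replacement from the field's range. A minimal stdlib sketch of the idea (illustrative only, not Pointblank's internal implementation; unique_ints is a hypothetical helper):

```python
import random

def unique_ints(n: int, min_val: int, max_val: int, seed=None) -> list[int]:
    """Draw n distinct integers in [min_val, max_val] (sampling without replacement)."""
    rng = random.Random(seed)
    span = max_val - min_val + 1
    if n > span:
        raise ValueError(f"cannot draw {n} unique values from a range of {span}")
    return rng.sample(range(min_val, max_val + 1), n)

ids = unique_ints(100, 1000, 9999, seed=23)
assert len(set(ids)) == 100                      # no duplicates
assert all(1000 <= v <= 9999 for v in ids)       # range respected
```

Note that the range must contain at least n values, or uniqueness is impossible to satisfy.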

Float Fields

Float fields work similarly to integers, with min_val and max_val defining the range of generated values:

schema = pb.Schema(
    price=pb.float_field(min_val=0.0, max_val=1000.0),
    discount=pb.float_field(min_val=0.0, max_val=0.5),
    temperature=pb.float_field(min_val=-40.0, max_val=50.0),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 3 columns

|   | price (Float64) | discount (Float64) | temperature (Float64) |
|---|---|---|---|
| 1 | 924.8652516259452 | 0.4624326258129726 | 43.23787264633508 |
| 2 | 948.6057779931772 | 0.47430288899658857 | 45.37452001938594 |
| 3 | 892.4333440485793 | 0.44621667202428966 | 40.31900096437214 |
| 4 | 83.55067683068363 | 0.04177533841534181 | -32.48043908523847 |
| 5 | 592.0272268857353 | 0.29601361344286764 | 13.282450419716177 |
| … | … | … | … |
| 96 | 444.6925279641446 | 0.2223462639820723 | 0.022327516773010814 |
| 97 | 342.7762214585577 | 0.17138811072927884 | -9.150140068729808 |
| 98 | 892.3288689140903 | 0.4461644344570452 | 40.309598202268134 |
| 99 | 813.7559456012128 | 0.4068779728006064 | 33.238035104109144 |
| 100 | 895.1816604808429 | 0.44759083024042146 | 40.56634944327587 |

Values are uniformly distributed across the specified range, making this useful for simulating measurements, prices, or any continuous numeric data.
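
In stdlib terms, each cell is an independent uniform draw over the range, along the lines of:

```python
import random

# Each float_field value is conceptually one uniform draw over [min_val, max_val]
rng = random.Random(23)
prices = [rng.uniform(0.0, 1000.0) for _ in range(1_000)]

assert all(0.0 <= p <= 1000.0 for p in prices)   # every draw stays in range
```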

String Fields with Presets

Presets generate realistic data like names, emails, and addresses. When you include related fields like name and email in the same schema, Pointblank ensures coherence (e.g., the email address will be derived from the person’s name), making the generated data more realistic:

schema = pb.Schema(
    full_name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 4 columns

|   | full_name (String) | email (String) | company (String) | city (String) |
|---|---|---|---|---|
| 1 | Kingston Miller | k_miller@zoho.com | Innovative Systems Solutions | Hollywood |
| 2 | Kaden Mosley | kaden.mosley9@protonmail.com | Sterling Engineering | Santa Ana |
| 3 | Brooks Wilkerson | brooks703@yahoo.com | Goldman Sachs | Rochester |
| 4 | Juliana Mitchell | jmitchell@zoho.com | Simmons LLC | Bloomington |
| 5 | Barbara Walters | barbara662@icloud.com | Frontier Systems | Toledo |
| … | … | … | … | … |
| 96 | Cheryl Robinson | cheryl.robinson@zoho.com | Watts Retail | Henderson |
| 97 | Elijah Cunningham | ecunningham22@hotmail.com | National Solutions International | Aurora |
| 98 | Magnolia Mosley | magnolia_mosley@aol.com | Silver Consulting International | Vancouver |
| 99 | Stella Gray | stella_gray@mail.com | Elite Realty Services | Syracuse |
| 100 | Harrison Allen | harrison.allen25@outlook.com | Tate Ltd | Plano |

This coherence extends to other related fields like user_name, which will also reflect the person’s name when included alongside name and email fields.

String Fields with Patterns

Use regex patterns to generate strings matching specific formats:

schema = pb.Schema(
    product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"),
    phone=pb.string_field(pattern=r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"),
    hex_color=pb.string_field(pattern=r"#[0-9A-F]{6}"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 3 columns

|   | product_code (String) | phone (String) | hex_color (String) |
|---|---|---|---|
| 1 | CAS-6685 | (109) 668-2347 | #209DCB |
| 2 | XGI-0397 | (397) 117-0865 | #68E07E |
| 3 | DCW-6086 | (309) 293-9594 | #32FD0D |
| 4 | YBG-9529 | (917) 797-2285 | #161B56 |
| 5 | XLS-9459 | (911) 609-9495 | #B9A2F5 |
| … | … | … | … |
| 96 | THG-2900 | (993) 511-5415 | #A7A37B |
| 97 | CHC-3681 | (065) 802-0822 | #47E498 |
| 98 | HKT-3552 | (927) 701-4276 | #AF75D8 |
| 99 | OEW-4157 | (365) 419-1062 | #5CCD95 |
| 100 | FSX-8948 | (897) 459-3038 | #0F3220 |

Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.
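
One practical consequence: any generated column can be round-trip checked against its own pattern with Python's re module. Using values from the preview above:

```python
import re

# The same patterns used in the schema above
PATTERNS = {
    "product_code": r"[A-Z]{3}-[0-9]{4}",
    "phone": r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}",
    "hex_color": r"#[0-9A-F]{6}",
}

# Sample values taken from the generated preview
samples = {
    "product_code": "CAS-6685",
    "phone": "(109) 668-2347",
    "hex_color": "#209DCB",
}

# fullmatch ensures the entire string conforms, not just a substring
for col, pattern in PATTERNS.items():
    assert re.fullmatch(pattern, samples[col]) is not None
```

This kind of check is a cheap sanity test when you add a new pattern field to a schema.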

Boolean Fields

Control the probability of True values:

schema = pb.Schema(
    is_active=pb.bool_field(p_true=0.8),      # 80% True
    is_premium=pb.bool_field(p_true=0.2),     # 20% True
    is_verified=pb.bool_field(),              # 50% True (default)
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 3 columns

|   | is_active (Boolean) | is_premium (Boolean) | is_verified (Boolean) |
|---|---|---|---|
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | True | True | True |
| 5 | True | False | False |
| … | … | … | … |
| 96 | True | False | True |
| 97 | True | False | True |
| 98 | False | False | False |
| 99 | False | False | False |
| 100 | False | False | False |

This probabilistic control is helpful when you need to simulate real-world distributions where certain states are more common than others.
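
Under the hood this is equivalent to an independent Bernoulli draw per row; a stdlib sketch of the idea (bernoulli_column is an illustrative helper, not part of Pointblank's API):

```python
import random

def bernoulli_column(n: int, p_true: float, seed=None) -> list[bool]:
    """Generate n booleans where each is True with probability p_true."""
    rng = random.Random(seed)
    return [rng.random() < p_true for _ in range(n)]

col = bernoulli_column(10_000, 0.8, seed=23)
share_true = sum(col) / len(col)
assert abs(share_true - 0.8) < 0.05   # observed share is close to p_true for large n
```

For small n the observed proportion can drift noticeably from p_true, which is expected sampling variance.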

Date and Datetime Fields

Temporal fields accept Python date and datetime objects for their range boundaries, generating values uniformly distributed within the specified period:

from datetime import date, datetime

schema = pb.Schema(
    birth_date=pb.date_field(
        min_val=date(1960, 1, 1),
        max_val=date(2005, 12, 31)
    ),
    created_at=pb.datetime_field(
        min_val=datetime(2024, 1, 1),
        max_val=datetime(2024, 12, 31)
    ),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

Polars DataFrame: 100 rows × 2 columns

|   | birth_date (Date) | created_at (Datetime) |
|---|---|---|
| 1 | 1986-01-03 | 2024-12-25 04:22:08 |
| 2 | 1967-06-30 | 2024-10-29 16:22:23 |
| 3 | 1961-07-13 | 2024-04-22 14:13:08 |
| 4 | 1987-07-09 | 2024-12-12 14:04:53 |
| 5 | 1998-01-06 | 2024-11-18 04:49:47 |
| … | … | … |
| 96 | 1969-04-14 | 2024-07-29 13:15:44 |
| 97 | 1975-03-23 | 2024-04-28 08:49:29 |
| 98 | 1981-05-29 | 2024-12-13 09:42:37 |
| 99 | 1982-09-14 | 2024-10-28 23:35:39 |
| 100 | 1968-12-21 | 2024-06-25 14:22:27 |

The same pattern applies to time_field() and duration_field(), allowing you to generate realistic temporal data for any use case.
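
Conceptually, a uniform date draw just picks a day offset uniformly within the span; a stdlib sketch (random_date is a hypothetical helper, not Pointblank's implementation):

```python
import random
from datetime import date, timedelta

def random_date(min_val: date, max_val: date, rng: random.Random) -> date:
    """Pick a day uniformly between min_val and max_val (inclusive)."""
    span = (max_val - min_val).days
    return min_val + timedelta(days=rng.randint(0, span))

rng = random.Random(23)
d = random_date(date(1960, 1, 1), date(2005, 12, 31), rng)
assert date(1960, 1, 1) <= d <= date(2005, 12, 31)
```

Datetime generation works the same way with a second-level (or finer) offset instead of whole days.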

Available Presets

The preset= parameter in string_field() supports many data types:

Personal Data:

  • name: full name (first + last)
  • name_full: full name with optional prefix/suffix (e.g., “Dr. Ana Sousa”, “Prof. Tanaka Yuki”)
  • first_name: first name only
  • last_name: last name only
  • email: email address
  • phone_number: phone number in country-specific format

Location Data:

  • address: full street address
  • city: city name
  • state: state/province name
  • country: country name
  • postcode: postal/ZIP code
  • latitude: latitude coordinate
  • longitude: longitude coordinate

Business Data:

  • company: company name
  • job: job title
  • catch_phrase: business catch phrase

Internet Data:

  • url: website URL
  • domain_name: domain name
  • ipv4: IPv4 address
  • ipv6: IPv6 address
  • user_name: username
  • password: password

Financial Data:

  • credit_card_number: credit card number
  • iban: International Bank Account Number
  • currency_code: currency code (USD, EUR, etc.)

Identifiers:

  • uuid4: UUID version 4
  • md5: MD5 hash (32 hex characters)
  • sha1: SHA-1 hash (40 hex characters)
  • sha256: SHA-256 hash (64 hex characters)
  • ssn: Social Security Number (country-specific format)
  • license_plate: vehicle license plate (location-aware for CA, US, DE, AU, GB)

Barcodes:

  • ean8: EAN-8 barcode with valid check digit
  • ean13: EAN-13 barcode with valid check digit

Date/Time:

  • date_this_year: a date within the current year
  • date_this_decade: a date within the current decade
  • date_between: a random date between 2000 and 2025
  • date_range: two dates joined with an en-dash (e.g., "2012-05-12 – 2015-11-22")
  • future_date: a date up to 1 year in the future
  • past_date: a date up to 10 years in the past
  • time: a time value

Text:

  • word: single word
  • sentence: full sentence
  • paragraph: paragraph of text
  • text: multiple paragraphs

Miscellaneous:

  • color_name: color name
  • file_name: file name
  • file_extension: file extension
  • mime_type: MIME type
  • user_agent: browser user agent string (country-weighted)

Country-Specific Data

One of the most powerful features is generating locale-aware data. Use the country= parameter to generate data specific to a country. This affects names, cities, addresses, and other locale-sensitive presets.

Let’s create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures consistency across related fields. The city, address, postcode, and coordinates will all correspond to the same location:

# Schema with linked location fields
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    address=pb.string_field(preset="address"),
    postcode=pb.string_field(preset="postcode"),
    latitude=pb.string_field(preset="latitude"),
    longitude=pb.string_field(preset="longitude"),
)

Here’s German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="DE"))

Polars DataFrame: 200 rows × 6 columns

|   | name (String) | city (String) | address (String) | postcode (String) | latitude (String) | longitude (String) |
|---|---|---|---|---|---|---|
| 1 | Ines Flohr | Sachsenhausen | Oppenheimer Landstraße 8446, 60559 Sachsenhausen | 60569 | 50.102657 | 8.698808 |
| 2 | Joachim Pohlmann | St. Pauli | Talstraße 3672, 20302 St. Pauli | 20392 | 53.540407 | 9.968037 |
| 3 | Elfriede Sander | Kreuzberg | Mariannenstraße 990, 10911 Kreuzberg | 10927 | 52.489220 | 13.409102 |
| 4 | Wilhelm Opitz | Dessau-Roßlau | Königstraße 1418, 06784 Dessau-Roßlau | 06116 | 51.810370 | 12.259340 |
| 5 | Ursula Westphal | Wiesbaden | Mozartstraße 8328, Whg. 683, 65008 Wiesbaden | 65936 | 50.087802 | 8.256100 |
| … | … | … | … | … | … | … |
| 196 | Hildegard Reinhardt | Berlin | Rüdesheimer Platz 4345, Whg. 595, 10036 Berlin | 10642 | 52.443180 | 13.612304 |
| 197 | Arnold Münz | Stuttgart | Rotebühlplatz 5931, Whg. 911, 70562 Stuttgart | 70441 | 48.759266 | 9.178001 |
| 198 | Dominik Bachmann | Ulm | Bahnhofstraße 9887, 89272 Ulm | 89286 | 48.404127 | 10.016080 |
| 199 | Alexander Busch | Prenzlauer Berg | Fehrbelliner Straße 3918, Whg. 378, 10467 Prenzlauer Berg | 10411 | 52.546146 | 13.434788 |
| 200 | Bianca Bollmann | Augsburg | Karlstraße 9976, 86392 Augsburg | 86227 | 48.413405 | 10.903475 |

Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan’s geographic boundaries:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="JP"))

Polars DataFrame: 200 rows × 6 columns

|   | name (String) | city (String) | address (String) | postcode (String) | latitude (String) | longitude (String) |
|---|---|---|---|---|---|---|
| 1 | Tadayuki Hara | Fukuyama | 720-8233 Hiroshima Fukuyama Nishi-cho 9502-435 | 720-4531 | 34.514784 | 133.376591 |
| 2 | Takafumi Kato | Kamakura | 248-7907 Kanagawa Kamakura Hase 8877-274 | 248-9166 | 35.325474 | 139.545896 |
| 3 | Gota Kashiwagi | Ikebukuro | 171-1830 Tokyo Ikebukuro Shiinamachi-dori 6360-350 | 171-8706 | 35.726125 | 139.702343 |
| 4 | Shinya Fujishima | Mihara | 723-6084 Hiroshima Mihara Onomichi-dori 8740-969 | 723-5615 | 34.393466 | 133.116512 |
| 5 | Nodoka Kuwata | Fuji | 416-2282 Shizuoka Fuji Shin-Fuji Eki-mae 3347-837 | 416-8803 | 35.183740 | 138.673396 |
| … | … | … | … | … | … | … |
| 196 | Manami Inagawa | Nagaoka | 940-4484 Niigata Nagaoka Ojiya-dori 288-565 | 940-8762 | 37.477659 | 138.841431 |
| 197 | Takuya Komori | Chiba | 260-8345 Chiba Chiba Kaihin-makuhari 2538-65 | 260-9734 | 35.591983 | 140.145145 |
| 198 | Hayato Sakamoto | Kure | 737-4862 Hiroshima Kure Nishimachi 2690-670 | 737-2541 | 34.256791 | 132.585931 |
| 199 | Mitsuko Tateno | Chigasaki | 253-7355 Kanagawa Chigasaki Shonan-dori 8470 | 253-6378 | 35.344598 | 139.419824 |
| 200 | Toshiko Tominaga | Hakodate | 040-3017 Hokkaido Hakodate Omoricho 5503 | 040-7893 | 41.808678 | 140.736790 |

Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil’s CEP format:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="BR"))

Polars DataFrame: 200 rows × 6 columns

|   | name (String) | city (String) | address (String) | postcode (String) | latitude (String) | longitude (String) |
|---|---|---|---|---|---|---|
| 1 | Iraci Quiroga | Porto Velho | Rua Brasília, 8740, 76084-235 Porto Velho - RO | 76813-377 | -8.783359 | -63.944499 |
| 2 | Almir Pires | Aracaju | Rua Lagarto, 8874, 49572-600 Aracaju - SE | 49876-654 | -10.910802 | -37.069824 |
| 3 | Ubaldo Leite | Porto Alegre | Rua Coronel Fernando Machado, 3821, Apto 888, 90352-709 Porto Alegre - RS | 90522-720 | -30.097326 | -51.222829 |
| 4 | Edilene Rabello | Ribeirão Preto | Rua Barão do Amazonas, 3087, Apto 470, 14878-711 Ribeirão Preto - SP | 14678-905 | -21.107871 | -47.773963 |
| 5 | Theo Moura | Londrina | Avenida Voluntários da Pátria, 9139, Apto 338, 86513-524 Londrina - PR | 86338-199 | -23.343806 | -51.184776 |
| … | … | … | … | … | … | … |
| 196 | Eunice Tavares | Curitiba | Avenida Cândido de Abreu, 1437, Apto 392, 80883-545 Curitiba - PR | 80653-272 | -25.443571 | -49.223253 |
| 197 | Saulo Vilanova | Duque de Caxias | Avenida Washington Luís, 1691, 25794-818 Duque de Caxias - RJ | 25192-165 | -22.796907 | -43.287355 |
| 198 | Cauã Beltrão | Porto Alegre | Avenida Borges de Medeiros, 1400, Apto 798, 90327-342 Porto Alegre - RS | 90197-296 | -30.029122 | -51.214312 |
| 199 | Caetano Duarte | Campinas | Avenida Aquidabã, 6485, Apto 775, 13037-569 Campinas - SP | 13674-387 | -22.896391 | -47.005293 |
| 200 | Luiz Serrano | Campinas | Rua Sacramento, 3353, Apto 354, 13184-367 Campinas - SP | 13766-592 | -22.901606 | -47.083738 |

This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally-consistent location data matters.

Data Coherence

Pointblank automatically links related columns to produce realistic rows. There are three coherence systems that activate based on which presets appear together in a schema:

Address coherence activates when any address-related preset is present (address, city, state, postcode, latitude, longitude, phone_number, license_plate). All of these fields will refer to the same location within each row.

Person coherence activates when any person-related preset is present (name, name_full, first_name, last_name, email, user_name). The email and username are derived from the person’s name.

Business coherence activates when both job and company are present. When active:

  • the company and job title are drawn from the same industry (e.g., a nurse will work at a hospital, not a law firm).
  • name_full gains profession-matched titles: a doctor may appear as “Dr. Ana Sousa” and a professor as “Prof. Tanaka Yuki”. For German-speaking countries (DE, AT, CH), the honorific stacks before the professional title (e.g., “Herr Dr. med. Klaus Weber”).
  • integer columns whose name contains age (e.g., age, person_age) are automatically constrained to a working-age range (22–65).

Here’s an example showing all three coherence systems working together:

schema = pb.Schema(
    name=pb.string_field(preset="name_full"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    job=pb.string_field(preset="job"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    license_plate=pb.string_field(preset="license_plate"),
    age=pb.int_field(),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23, country="DE"))

Polars DataFrame: 100 rows × 8 columns

|   | name (String) | email (String) | company (String) | job (String) | city (String) | state (String) | license_plate (String) | age (Int64) |
|---|---|---|---|---|---|---|---|---|
| 1 | Herr Ben Neumann | bneumann62@posteo.de | Global Immobilien Gruppe | Immobilienmakler | Sachsenhausen | Hessen | F-HA 754 | 40 |
| 2 | Herr Gerd Fleischer | gerd153@gmail.com | St. Pauli Gesamtschule | Lehrer | St. Pauli | Hamburg | HH-TS 5054 | 27 |
| 3 | Frau Annette Graf | annette.graf96@t-online.de | Technische Fertigung Industrie | Maschinenbauingenieur | Kreuzberg | Berlin | B-N 4646 | 23 |
| 4 | Herr Dr. rer. nat. Leonhard Wagner | leonhard.wagner56@gmx.de | Finke-Klinik | Apotheker | Dessau-Roßlau | Sachsen-Anhalt | DE-VV 904 | 59 |
| 5 | Herr Dr. med. Arnold Bormann | a_bormann@gmail.com | Bayer | Arzt | Wiesbaden | Hessen | WI-YH 726 | 41 |
| … | … | … | … | … | … | … | … | … |
| 96 | Frau Dr. rer. nat. Karin Meier | kmeier@web.de | Augsburg Klinikum | Apotheker | Augsburg | Bayern | A-AB 846 | 24 |
| 97 | Herr Wilhelm Wimmer | wilhelm.wimmer@posteo.de | Innovative Analytik | Data Scientist | Homburg | Saarland | HOM-SD 101 | 25 |
| 98 | Frau Grete Bormann | gretebormann@arcor.de | Finke Partner | Buchhalter | Dessau-Roßlau | Sachsen-Anhalt | DE-P 031 | 62 |
| 99 | Frau Auguste Ziegler | auguste_ziegler@mail.de | Lufthansa | Betriebsleiter | Harburg | Hamburg | HH-GS 2524 | 48 |
| 100 | Herr Berthold Schulze | berthold_schulze@web.de | Universität Heilbronn | Lehrer | Heilbronn | Baden-Württemberg | HN-G 215 | 47 |

License plate coherence is part of address coherence. For CA, US, DE, AU, and GB, license plates follow real subregion-specific formats when location fields are present. For example, an Ontario row produces plates like "CABC 123" while a British Columbia row produces "AB1 23C". Letters I, O, Q, and U are excluded from plate generation, matching real-world restrictions.

Supported Countries

Pointblank currently supports 55 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., "US") or alpha-3 codes (e.g., "USA").

Europe (32 countries):

  • Austria (AT), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czech Republic (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Germany (DE), Greece (GR), Hungary (HU), Iceland (IS), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Russia (RU), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), Switzerland (CH), United Kingdom (GB)

Americas (7 countries):

  • Argentina (AR), Brazil (BR), Canada (CA), Chile (CL), Colombia (CO), Mexico (MX), United States (US)

Asia-Pacific (12 countries):

  • Australia (AU), China (CN), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), New Zealand (NZ), Philippines (PH), Singapore (SG), South Korea (KR), Taiwan (TW), Thailand (TH)

Middle East & Africa (4 countries):

  • Nigeria (NG), South Africa (ZA), Turkey (TR), United Arab Emirates (AE)

Additional countries and expanded coverage are planned for future releases.

Mixing Multiple Countries

When you need test data that spans multiple locales (e.g., simulating an international customer base), you can pass a list or dict to the country= parameter instead of a single string.

Passing a list of country codes splits rows equally across those countries. Here, 200 rows are divided evenly among the US, Germany, and Japan (~67 each):

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    postcode=pb.string_field(preset="postcode"),
)

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country=["US", "DE", "JP"]))

Polars DataFrame: 200 rows × 3 columns

|   | name (String) | city (String) | postcode (String) |
|---|---|---|---|
| 1 | Mitzi Klinger | Trier | 54657 |
| 2 | Tsubasa Kitazaki | Urayasu | 279-3480 |
| 3 | Theodore Garrett | Durham | 27762 |
| 4 | Barbara Woodward | Port St. Lucie | 34982 |
| 5 | Ilona Schönfeld | Ingolstadt | 85529 |
| … | … | … | … |
| 196 | Eita Imaoka | Fukuoka | 810-2690 |
| 197 | Eric Dixon | Lake Charles | 70604 |
| 198 | Carson Bridges | Grand Rapids | 49505 |
| 199 | Fiona Bartsch | Düsseldorf | 40182 |
| 200 | Haruma Higashino | Iwata | 438-6071 |

To control the proportion of rows per country, pass a dict mapping country codes to weights. The following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:

pb.preview(
    pb.generate_dataset(
        schema, n=200, seed=23,
        country={"US": 0.7, "DE": 0.2, "FR": 0.1},
    )
)

Polars DataFrame: 200 rows × 3 columns

|   | name (String) | city (String) | postcode (String) |
|---|---|---|---|
| 1 | Genesis Donovan | Scottsdale | 85228 |
| 2 | Bernd Neuhaus | Berlin | 10017 |
| 3 | Albert Murphy | Durham | 27760 |
| 4 | Donna Clay | Port St. Lucie | 34911 |
| 5 | Cora Meyer | Irving | 75027 |
| … | … | … | … |
| 196 | Patrick Schreiber | Schöneberg | 10778 |
| 197 | Brynlee Schmidt | Lake Charles | 70673 |
| 198 | Grayson Adams | Grand Rapids | 49519 |
| 199 | Judith Myers | Reading | 19681 |
| 200 | Éméline Deschamps | Pau | 64010 |

Weights are auto-normalized, so {"US": 7, "DE": 2, "FR": 1} is equivalent to the example above. Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly n.
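
Largest-remainder apportionment floors each country's ideal share and then hands leftover rows to the largest fractional remainders. A small illustrative implementation (not Pointblank's internal code):

```python
from math import floor

def largest_remainder(n: int, weights: dict[str, float]) -> dict[str, int]:
    """Allocate n rows across countries so the counts sum to exactly n."""
    total = sum(weights.values())                              # normalization is implicit
    quotas = {c: n * w / total for c, w in weights.items()}    # ideal fractional shares
    counts = {c: floor(q) for c, q in quotas.items()}          # floor each share
    leftover = n - sum(counts.values())
    # hand remaining rows to the largest fractional remainders
    for c in sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)[:leftover]:
        counts[c] += 1
    return counts

assert largest_remainder(200, {"US": 0.7, "DE": 0.2, "FR": 0.1}) == {"US": 140, "DE": 40, "FR": 20}
assert largest_remainder(200, {"US": 7, "DE": 2, "FR": 1}) == {"US": 140, "DE": 40, "FR": 20}
```

The two assertions show that raw weights and pre-normalized fractions produce identical allocations.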

By default, rows from different countries are interleaved randomly (shuffle=True). Set shuffle=False to keep rows grouped by country in the order the countries are listed:

pb.preview(
    pb.generate_dataset(
        schema, n=120, seed=23,
        country=["US", "DE", "JP"], shuffle=False,
    )
)

Polars DataFrame: 120 rows × 3 columns

|   | name (String) | city (String) | postcode (String) |
|---|---|---|---|
| 1 | Theodore Harmon | Hialeah | 33061 |
| 2 | Claire Bell | Bend | 97736 |
| 3 | Simon Villegas | Raleigh | 27651 |
| 4 | Autumn Kelly | Brooklyn | 11230 |
| 5 | Leo Conner | Lake Charles | 70697 |
| … | … | … | … |
| 116 | Maiko Endo | Fukuoka | 810-5676 |
| 117 | Chiharu Taniguchi | Hakodate | 040-2391 |
| 118 | Kaede Namiki | Arashiyama | 616-7994 |
| 119 | Eiji Takai | Sakae | 460-0768 |
| 120 | Takahiro Matsunaga | Fukuoka | 810-3742 |

All coherence systems (address, person, business) work correctly within each country’s batch of rows. A French row will have a French name with a matching French email; a Japanese row will have a Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates) are generated independently for each batch but still respect their field constraints.

Output Formats

The generate_dataset() function supports multiple output formats via the output= parameter, making it easy to integrate with your preferred data processing library.

schema = pb.Schema(
    id=pb.int_field(min_val=1),
    name=pb.string_field(preset="name"),
)

The default output is a Polars DataFrame, which offers excellent performance and a modern API for data manipulation:

polars_df = pb.generate_dataset(schema, n=100, seed=23, output="polars")

pb.preview(polars_df)

Polars DataFrame: 100 rows × 2 columns

|   | id (Int64) | name (String) |
|---|---|---|
| 1 | 7188536481533917197 | Vivienne Rios |
| 2 | 2674009078779859984 | William Schaefer |
| 3 | 7652102777077138151 | Lily Hansen |
| 4 | 157503859921753049 | Shirley Mays |
| 5 | 2829213282471975080 | Sean Dawson |
| … | … | … |
| 96 | 7027508096731143831 | Kathryn Green |
| 97 | 6055996548456656575 | Daniel Morris |
| 98 | 3822709996092631588 | William Cooper |
| 99 | 1522653102058131295 | Lane Sawyer |
| 100 | 5690877051669225499 | Paisley Sandoval |

If your workflow uses Pandas, simply specify output="pandas" to get a Pandas DataFrame:

pandas_df = pb.generate_dataset(schema, n=100, seed=23, output="pandas")

pb.preview(pandas_df)

Pandas DataFrame: 100 rows × 2 columns

|   | id (int64) | name (str) |
|---|---|---|
| 1 | 7188536481533917197 | Vivienne Rios |
| 2 | 2674009078779859984 | William Schaefer |
| 3 | 7652102777077138151 | Lily Hansen |
| 4 | 157503859921753049 | Shirley Mays |
| 5 | 2829213282471975080 | Sean Dawson |
| … | … | … |
| 96 | 7027508096731143831 | Kathryn Green |
| 97 | 6055996548456656575 | Daniel Morris |
| 98 | 3822709996092631588 | William Cooper |
| 99 | 1522653102058131295 | Lane Sawyer |
| 100 | 5690877051669225499 | Paisley Sandoval |

Both formats work seamlessly with Pointblank’s validation functions, so you can choose whichever fits best with your existing data pipeline.

Using Generated Data for Validation Testing

A common use case is generating synthetic data to test your validation rules:

# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)

validation
Pointblank Validation · 2026-02-18 18:55:42 · Polars

| STEP | COLUMNS | VALUES | UNITS | PASS | FAIL |
|---|---|---|---|---|---|
| 1: col_vals_gt() | user_id | 0 | 100 | 100 (1.00) | 0 (0.00) |
| 2: col_vals_regex() | email | .+@.+\..+ | 100 | 100 (1.00) | 0 (0.00) |
| 3: col_vals_between() | age | [18, 100] | 100 | 100 (1.00) | 0 (0.00) |
| 4: col_vals_in_set() | status | active, pending, inactive | 100 | 100 (1.00) | 0 (0.00) |

Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.

Pytest Fixture

When Pointblank is installed, a generate_dataset pytest fixture is automatically available in all your test files. There is no need to import anything or add configuration to conftest.py: the fixture is registered via pytest’s plugin system.

The fixture works identically to pb.generate_dataset(), but with one key difference: when you don’t supply a seed= parameter, a deterministic seed is automatically derived from the test’s fully-qualified name. This means:

  • the same test always produces the same data: no manual seed management required.
  • different tests get different seeds, so they exercise different datasets.
  • you can still pass an explicit seed= to override the automatic seed when needed.
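
The derivation can be pictured as hashing the test's node id into an integer seed; the sketch below illustrates the idea (seed_from_test_name is a hypothetical helper; the fixture's actual scheme may differ):

```python
import hashlib

def seed_from_test_name(nodeid: str, call_index: int = 0) -> int:
    """Map a pytest node id (plus a per-test call counter) to a stable 64-bit seed."""
    digest = hashlib.sha256(f"{nodeid}:{call_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

s1 = seed_from_test_name("test_pipeline.py::test_etl_handles_nulls")
s2 = seed_from_test_name("test_pipeline.py::test_etl_handles_nulls")
s3 = seed_from_test_name("test_pipeline.py::test_merge_pipeline")
assert s1 == s2   # same test name -> same seed, run after run
assert s1 != s3   # different tests -> different seeds
```

The call counter is what lets a single test request several distinct-but-deterministic datasets.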

Basic Usage

Use it by adding generate_dataset to your test function’s parameter list:

test_pipeline.py
import polars as pl
import pointblank as pb

def test_etl_handles_nulls(generate_dataset):
    schema = pb.Schema(
        user_id=pb.int_field(unique=True),
        email=pb.string_field(preset="email", nullable=True, null_probability=0.3),
        age=pb.int_field(min_val=0, max_val=120),
    )

    df = generate_dataset(schema, n=500)
    result = my_etl_pipeline(df)
    assert result.filter(pl.col("email").is_null()).shape[0] == 0

All parameters from generate_dataset() are supported: n=, seed=, output=, and country=:

def test_german_data(generate_dataset):
    schema = pb.Schema(
        name=pb.string_field(preset="name"),
        city=pb.string_field(preset="city"),
    )

    df = generate_dataset(schema, n=200, country="DE", output="pandas")
    assert len(df) == 200

Multiple Datasets in One Test

Calling the fixture multiple times within the same test produces different (but still deterministic) data on each call:

def test_merge_pipeline(generate_dataset):
    customers = generate_dataset(customer_schema, n=1000, country="US")
    orders = generate_dataset(order_schema, n=5000)

    # Each call gets a unique seed derived from the test name + call index,
    # so both DataFrames are deterministic and different from each other.
    result = merge_pipeline(customers, orders)
    assert result.shape[0] > 0

Testing Across Locales

The fixture makes locale testing particularly concise when combined with pytest.mark.parametrize:

import pytest
import pointblank as pb

@pytest.mark.parametrize("country", ["US", "DE", "JP", "BR"])
def test_name_normalizer(generate_dataset, country):
    schema = pb.Schema(name=pb.string_field(preset="name_full"))
    df = generate_dataset(schema, n=100, country=country)
    result = normalize_names(df)
    assert result["name"].str.len_chars().min() > 0

Sharing Schemas Across Tests

Define schemas as fixtures in conftest.py and compose them with generate_dataset:

conftest.py
import pytest
import pointblank as pb

@pytest.fixture
def customer_schema():
    return pb.Schema(
        id=pb.int_field(unique=True),
        name=pb.string_field(preset="name"),
        email=pb.string_field(preset="email"),
        city=pb.string_field(preset="city"),
    )

test_validation.py
def test_customer_validation(generate_dataset, customer_schema):
    df = generate_dataset(customer_schema, n=200, country="DE")
    validation = pb.Validate(df).col_vals_not_null(columns="email").interrogate()
    assert validation.all_passed()

test_export.py
def test_customer_export(generate_dataset, customer_schema):
    df = generate_dataset(customer_schema, n=50, country="JP")
    exported = export_to_parquet(df)
    assert exported.exists()

Debugging with Seed Introspection

The fixture callable exposes two attributes that make debugging failed tests straightforward:

  • generate_dataset.default_seed: the base seed derived from the test name (available before any call)
  • generate_dataset.last_seed: the seed actually used for the most recent call (accounts for the call counter and explicit overrides)

Include .last_seed in assertion messages so failures are immediately reproducible:

def test_age_range(generate_dataset):
    schema = pb.Schema(age=pb.int_field(min_val=18, max_val=100))
    df = generate_dataset(schema, n=500)
    min_age = df["age"].min()
    assert min_age >= 18, (
        f"Expected min age >= 18, got {min_age} (seed={generate_dataset.last_seed})"
    )

You can also use .default_seed to reproduce the exact dataset outside of pytest:

# In a REPL or notebook, reproduce the data from a failed test:
import pointblank as pb
df = pb.generate_dataset(schema, n=500, seed=<default_seed_from_output>)

Seed Stability

A given seed (whether explicit or auto-derived) is guaranteed to produce identical output within the same Pointblank version. Across versions, changes to country data files or generator logic may alter the output for a given seed.

For CI pipelines that require bit-exact data across library upgrades, we recommend saving generated DataFrames as Parquet or CSV snapshot files rather than relying on cross-version seed stability. This is the same approach used by snapshot-testing tools like pytest-snapshot and syrupy.

Conclusion

Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows. By incorporating test data generation into your process, you can:

  • quickly prototype validation rules before working with production data
  • create reproducible test fixtures for automated testing and CI/CD pipelines
  • generate locale-specific data for internationalization testing across 55 countries
  • ensure coherent relationships between related fields like names, emails, addresses, jobs, and license plates
  • produce datasets of any size with consistent, realistic values

Whether you’re building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.