The string_field() function defines the constraints and behavior of a string column when generating synthetic data with generate_dataset(). Strings can be produced in one of four ways: (1) random strings bounded by min_length= and max_length=, (2) strings matching a regular expression given in pattern=, (3) realistic data from a named preset= (e.g., "email", "name", "address"), or (4) values drawn from a fixed set given in allowed=. At most one of preset=, pattern=, or allowed= can be specified.
When none of preset=, pattern=, or allowed= is given, random alphanumeric strings are generated with lengths between min_length= and max_length= (1 and 20 characters by default).
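The default mode can be pictured with a small standard-library sketch. This is not the library's implementation: the helper name random_alphanumeric(), the ASCII letters-plus-digits alphabet, and the uniform choice of length are all assumptions made for illustration. Only the 1 and 20 defaults come from the text above.

```python
import random
import string

# Assumption: "alphanumeric" means ASCII letters and digits.
ALPHANUMERIC = string.ascii_letters + string.digits


def random_alphanumeric(rng, min_length=None, max_length=None):
    """Return one random alphanumeric string (hypothetical helper)."""
    lo = 1 if min_length is None else min_length    # documented default
    hi = 20 if max_length is None else max_length   # documented default
    if lo > hi:
        raise ValueError("min_length exceeds max_length")
    length = rng.randint(lo, hi)  # inclusive on both ends
    return "".join(rng.choice(ALPHANUMERIC) for _ in range(length))


rng = random.Random(0)  # seeded so the sketch is repeatable
samples = [random_alphanumeric(rng) for _ in range(5)]
fixed = random_alphanumeric(rng, min_length=8, max_length=8)
print(samples, fixed)
```

Setting min_length= and max_length= to the same value, as in the last call, yields fixed-width strings.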
Parameters
min_length:int | None=None
Minimum string length for random string generation. Default is None, which is treated as 1. Only applies when preset=, pattern=, and allowed= are all None.
max_length:int | None=None
Maximum string length for random string generation. Default is None, which is treated as 20. Only applies when preset=, pattern=, and allowed= are all None.
pattern:str | None=None
Regular expression pattern that generated strings must match. Supports character classes (e.g., [A-Z], [0-9]), quantifiers (e.g., {3}, {2,5}), alternation, and groups. Cannot be combined with preset= or allowed=.
preset:str | None=None
Preset name for generating realistic data. When specified, values are produced using locale-aware data generation, and the country= parameter of generate_dataset() controls the locale. Cannot be combined with pattern= or allowed=. See the Available Presets section below for the full list.
allowed:list[str] | None=None
List of allowed string values (categorical constraint). Values are sampled uniformly from this list. Cannot be combined with preset= or pattern=.
nullable:bool=False
Whether the column can contain null values. Default is False.
null_probability:float=0.0
Probability of generating a null value for each row when nullable=True. Must be between 0.0 and 1.0. Default is 0.0.
unique:bool=False
Whether all values must be unique. Default is False. When True, the generator will retry until it produces n distinct values, one for each requested row.
generator:Callable[[], Any] | None=None
Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single string value.
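The retry behavior described for unique= and the per-row probability described for null_probability= can be sketched with plain Python. The helpers draw_unique() and with_nulls(), the max_attempts cap, and the ticket_id() callable are hypothetical names introduced here; they are not part of the library. The value source is written as a zero-argument callable, the same shape that generator= expects. Whether the library applies unique= or null_probability= on top of a custom generator= is not stated above, so the combination below is illustrative only.

```python
import random


def draw_unique(generator, n, max_attempts=10_000):
    """Call a zero-argument callable until n distinct values are collected.

    max_attempts is an assumption added so the loop cannot run forever.
    """
    seen = set()
    values = []
    attempts = 0
    while len(values) < n:
        if attempts >= max_attempts:
            raise RuntimeError("could not produce enough distinct values")
        attempts += 1
        value = generator()
        if value not in seen:  # duplicates are discarded and redrawn
            seen.add(value)
            values.append(value)
    return values


def with_nulls(values, null_probability, rng):
    """Replace each value with None independently with the given probability."""
    return [None if rng.random() < null_probability else v for v in values]


rng = random.Random(42)


def ticket_id():
    # Hypothetical custom generator: no arguments, returns a single string.
    return f"TICKET-{rng.randint(0, 999):03d}"


ids = draw_unique(ticket_id, n=50)
column = with_nulls(ids, null_probability=0.2, rng=rng)
print(column[:5])
```

A retry loop like this only terminates when the value space is large enough: asking for more distinct values than the source can produce would exhaust the attempt budget.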
Returns
StringField
A string field specification that can be passed to Schema().
Raises
ValueError
If more than one of preset=, pattern=, or allowed= is specified; if allowed= is an empty list; if min_length= or max_length= is negative; if min_length= exceeds max_length=; or if preset= is not a recognized preset name.
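The conditions listed above can be restated as a standalone check. This is a sketch of the documented rules, not the library's own validation code: check_string_field_args(), rejected(), and the KNOWN_PRESETS subset are hypothetical, and the order in which the real function tests these conditions is not documented here.

```python
# Illustrative subset only; the full preset list is given under Available Presets.
KNOWN_PRESETS = {"name", "email", "address"}


def check_string_field_args(min_length=None, max_length=None,
                            pattern=None, preset=None, allowed=None):
    """Raise ValueError for the argument combinations documented as invalid."""
    modes = [m for m in (preset, pattern, allowed) if m is not None]
    if len(modes) > 1:
        raise ValueError("only one of preset=, pattern=, allowed= may be given")
    if allowed is not None and len(allowed) == 0:
        raise ValueError("allowed= must not be empty")
    for name, value in (("min_length", min_length), ("max_length", max_length)):
        if value is not None and value < 0:
            raise ValueError(f"{name}= must not be negative")
    # Assumption: the bounds are compared only when both are given explicitly.
    if min_length is not None and max_length is not None and min_length > max_length:
        raise ValueError("min_length= exceeds max_length=")
    if preset is not None and preset not in KNOWN_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")


def rejected(**kwargs):
    """Return True if the arguments would raise ValueError."""
    try:
        check_string_field_args(**kwargs)
    except ValueError:
        return True
    return False


print(rejected(preset="email", pattern="[A-Z]{3}"))  # two modes at once
print(rejected(min_length=2, max_length=5))          # a valid combination
```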
Available Presets
The preset= parameter accepts one of the following preset names, organized by category. When a preset is used, the country= parameter of generate_dataset() controls the locale for region-specific formatting (e.g., address formats, phone number patterns).
Personal: "name" (first + last name), "name_full" (full name with possible prefix or suffix), "first_name", "last_name", "email" (realistic email address), "phone_number", "address" (full street address), "city", "state", "country", "postcode", "latitude", "longitude"
Barcodes: "ean8" (EAN-8 barcode with valid check digit), "ean13" (EAN-13 barcode with valid check digit)
Date/Time (as strings): "date_this_year", "date_this_decade", "date_between" (random date between 2000 and 2025), "date_range" (two dates joined with an en-dash, e.g., "2012-05-12 – 2015-11-22"), "future_date" (up to 1 year ahead), "past_date" (up to 10 years back), "time"
Miscellaneous: "color_name", "file_name", "file_extension", "mime_type", "user_agent" (browser user agent string with country-specific browser weighting)
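For the barcode presets, "valid check digit" refers to the standard EAN rule: weight the payload digits 3, 1, 3, 1, ... starting from the rightmost one, sum them, and pick the digit that brings the total to a multiple of 10. The sketch below can be used to verify generated values. The helper names gs1_check_digit() and is_valid_ean() are introduced here for illustration and are not part of the library.

```python
def gs1_check_digit(payload):
    """Compute the EAN check digit for a payload of 7 or 12 digits."""
    if not (payload.isascii() and payload.isdigit()):
        raise ValueError("payload must contain only the digits 0-9")
    # Weights alternate 3, 1, 3, 1, ... starting from the rightmost payload digit.
    total = sum(
        int(digit) * (3 if index % 2 == 0 else 1)
        for index, digit in enumerate(reversed(payload))
    )
    return (10 - total % 10) % 10


def is_valid_ean(code):
    """Check an EAN-8 or EAN-13 string, including its final check digit."""
    if len(code) not in (8, 13) or not (code.isascii() and code.isdigit()):
        return False
    return gs1_check_digit(code[:-1]) == int(code[-1])


print(is_valid_ean("4006381333931"))  # EAN-13
print(is_valid_ean("73513537"))       # EAN-8
```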
Coherent Data Generation
When multiple columns in the same schema use related presets, the generated data will be coherent across those columns within each row. Specifically:
Person-related presets ("name", "name_full", "first_name", "last_name", "email", "user_name"): the email and username will be derived from the person’s name.
Address-related presets ("address", "city", "state", "postcode", "phone_number", "latitude", "longitude"): the city, state, and postcode will correspond to the same location within the address.
This coherence is automatic and requires no additional configuration.
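To make the idea of row-level coherence concrete, here is a minimal sketch of deriving related person fields from a single draw. It does not reflect how the library builds its values: the name lists, the example domains, the first.last email format, and the helper coherent_person_row() are all assumptions chosen for illustration.

```python
import random

# Tiny illustrative pools; a real generator would use locale-aware data.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
DOMAINS = ["example.com", "example.org"]


def coherent_person_row(rng):
    """Draw one person, then derive every related field from that draw."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    local_part = f"{first}.{last}".lower()  # assumed email format
    return {
        "first_name": first,
        "last_name": last,
        "name": f"{first} {last}",
        "email": f"{local_part}@{rng.choice(DOMAINS)}",
    }


rng = random.Random(7)
rows = [coherent_person_row(rng) for _ in range(3)]
for row in rows:
    print(row)
```

The point of the design is that the name is sampled once per row and the other columns are computed from it, rather than each column being sampled independently.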
Examples
The preset= parameter generates realistic personal data, while allowed= restricts values to a categorical set:
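A sketch of such a schema follows. It assumes the module that provides string_field(), Schema(), and generate_dataset() is passed in as pb, that Schema() accepts one keyword argument per column, and that generate_dataset() takes the row count as n=. These details are not confirmed on this page, so treat the call shapes as assumptions; build_schema() and generate_users() are hypothetical helper names.

```python
def build_schema(pb):
    """Build a three-column schema from string_field() specifications.

    `pb` is the module exposing string_field() and Schema() (assumed).
    """
    return pb.Schema(
        # Realistic personal data from presets
        name=pb.string_field(preset="name"),
        email=pb.string_field(preset="email"),
        # Categorical column restricted to a fixed set of values
        status=pb.string_field(allowed=["active", "pending", "inactive"]),
    )


def generate_users(pb, n=5, country="US"):
    """Generate rows for the schema; n= is an assumed parameter name."""
    return pb.generate_dataset(build_schema(pb), n=n, country=country)
```

With the library imported under the name pb, calling generate_users(pb) would be expected to return a small table with name, email, and status columns. Because "name" and "email" are both person-related presets, the email in each row should be derived from that row's name.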