GH-49505: [Python] Implement DictionaryArray converter for type pa.large_string and pa.large_binary #49658

Open
fotinosk wants to merge 4 commits into apache:main from fotinosk:dictarray-type-converter

@fotinosk fotinosk commented Apr 3, 2026

Rationale for this change

Resolves the enhancement request in #49505.

Previously, attempting to convert a Python sequence into a dictionary array whose value type is large_string or large_binary raised an ArrowNotImplementedError:

>>> import pyarrow as pa
>>> pa.array([], pa.dictionary(pa.int32(), pa.large_binary()))
Traceback (most recent call last):
  ...
ArrowNotImplementedError: DictionaryArray converter for type dictionary<values=large_binary, indices=int32, ordered=0> not implemented

What changes are included in this PR?

  • C++ Core (cpp/src/arrow/util/converter.h):
    • Added DICTIONARY_CASE(LargeBinaryType) and DICTIONARY_CASE(LargeStringType) to the MakeConverterImpl::Visit(const DictionaryType&) dispatch table so the C++ core knows how to route the large types.
    • Updated the PyDictionaryConverter template for enable_if_has_string_view to use std::string_view in the Append method. This lets the underlying Arrow builders handle the offset-width dispatch (32-bit vs 64-bit) internally.
  • Python Tests (python/pyarrow/tests/test_array.py):
    • Added test_dictionary_large_string_and_binary to verify sequence conversion for both large_string and large_binary dictionary types.
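
As a rough illustration of what the 32-bit vs 64-bit distinction means for the dictionary values, here is a pure-Python sketch. It is a conceptual model only, not Arrow's implementation: the helper names dictionary_encode and pack_offsets are hypothetical, and the sketch assumes only that binary values use 32-bit offsets while large_binary values use 64-bit offsets.

```python
import struct


def dictionary_encode(values):
    """Map each value to an index into a list of unique values.

    Hypothetical helper for illustration; None entries stay None.
    """
    dictionary = []   # unique values, in first-seen order
    positions = {}    # value -> index into `dictionary`
    indices = []
    for value in values:
        if value is None:
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def pack_offsets(dictionary, large):
    """Pack cumulative byte offsets as int64 (large) or int32 (regular)."""
    offsets = [0]
    for value in dictionary:
        offsets.append(offsets[-1] + len(value))
    code = "q" if large else "i"  # 8-byte vs 4-byte signed integers
    return struct.pack("<%d%s" % (len(offsets), code), *offsets)


indices, dictionary = dictionary_encode([b"foo", b"bar", b"foo", None])
regular_offsets = pack_offsets(dictionary, large=False)
large_offsets = pack_offsets(dictionary, large=True)
```

The indices and the unique values are identical in both cases; only the width of each stored offset differs, which is why the dispatch can be left to the value builder.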

Are these changes tested?

Yes. Added test_dictionary_large_string_and_binary to python/pyarrow/tests/test_array.py, which validates both the resolved array type and the values in the resulting pylist.

Are there any user-facing changes?

Yes. Users can now use pa.large_string() and pa.large_binary() as the value type of pa.dictionary() when calling pa.array() to ingest Python sequences.

@fotinosk fotinosk force-pushed the dictarray-type-converter branch from c248773 to f0f29ef Compare April 3, 2026 14:14