Polars: rs-0.46.0 Release

Release date: January 26, 2025
Previous version: rs-0.45.0.1 (released December 8, 2024)
Magnitude: 34,010 Diff Delta
Contributors: 48 total committers

474 Commits in this Release

Top Contributors in rs-0.46.0

ritchie46
coastalwhite
orlp
nameexhaustion
bschoenmaeckers
mcrumiller
alexander-beedie
lukemanley
itamarst
etiennebacher

Release Notes Published

πŸ† Highlights

  • Add new Int128Type (#20232)
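
For context on the new type: a signed 128-bit integer covers the range -2^127 to 2^127 - 1, far beyond what 64 bits can hold. The sketch below is plain standard-library Python, not Polars code, and the helper name fits_in_signed is made up for illustration.

```python
def fits_in_signed(value: int, bits: int) -> bool:
    """Return True if value fits in a signed two's-complement integer of the given width."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


big = 1 << 100  # too large for 64 bits, fine for 128 bits
print(fits_in_signed(big, 64))
print(fits_in_signed(big, 128))
```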

💥 Breaking changes

  • Support writing partitioned parquet to cloud (#20590)

🚀 Performance improvements

  • Use BitmapBuilder in yet more places (#20868)
  • Make an owned version of append (#20800)
  • Use BitmapBuilder in a lot more places (#20776)
  • Extend functionality on BitmapBuilder and use in Growables (#20754)
  • Specialize first/last agg for simple types in new-streaming engine (#20728)
  • Improve state caching and parallelism of window functions (#20689)
  • Broadcast without materialization in concat_arr (#20681)
  • Cache rolling groups (#20675)
  • Use downcast_ref instead of dtype equality in <dyn SeriesTrait as AsRef<ChunkedArray<T>>> (#20664)
  • Fix performance regression for DataFrame serialization/pickling (#20641)
  • Make Parquet verify_dict_indices SIMD (#20623)
  • Move to zlib-rs by default and use zstd::with_buffer (#20614)
  • Skip filter expansion in eager (#20586)
  • Use AtomicWaker in async engine task joiner (#20604)
  • Move morsel distribution to the computational async engine (#20600)
  • Improve unique pred-pd (#20569)
  • Collapse expanded filters in eager (#20493)
  • Remove predicate from IR::DataFrame (#20492)
  • Add proper distributor to new-streaming parquet reader (#20372)
  • Use different binview dedup strategy depending on chunks ratio (#20451)
  • Generalize the arg_sort fast path onto Column (#20437)
  • Dedup binviews up front (#20449)
  • Re-enable common subplan elim for new-streaming engine (#20443)
  • Don't collect all LHS arrays in gather (#20441)
  • Remove prepare_series for gather kernels (#20439)
  • Don't always take all data buffers when gathering views (#20435)
  • Order observability optimizations (#20396)
  • Purge ChunkedArray Metadata (#20371)
  • Drop probe tables in parallel in new-streaming equi-join (#20373)
  • Explicit transpose in new-streaming equi-join finalize (#20363)
  • Cache dtype on ExprIR (#20331)
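
Several entries above mention BitmapBuilder, which the notes do not describe. Assuming it builds an Arrow-style packed bitmap (one bit per value, least-significant bit first within each byte), the idea can be sketched as follows. TinyBitmapBuilder and its methods are hypothetical names for illustration only and do not reflect the actual Polars implementation.

```python
class TinyBitmapBuilder:
    """Hypothetical builder that packs booleans into bytes, LSB first."""

    def __init__(self) -> None:
        self._bytes = bytearray()
        self._len = 0

    def push(self, bit: bool) -> None:
        # Start a new byte every 8 bits.
        if self._len % 8 == 0:
            self._bytes.append(0)
        if bit:
            self._bytes[-1] |= 1 << (self._len % 8)
        self._len += 1

    def get(self, i: int) -> bool:
        if not 0 <= i < self._len:
            raise IndexError(i)
        return bool((self._bytes[i // 8] >> (i % 8)) & 1)

    def finish(self) -> tuple[bytes, int]:
        # Return the packed bytes together with the logical bit length.
        return bytes(self._bytes), self._len


builder = TinyBitmapBuilder()
for bit in [True, False, True, True, False, False, False, False, True]:
    builder.push(bit)
packed, length = builder.finish()
print(list(packed), length)
```

Packing eight values per byte is what makes such a structure cheaper to build and copy than a list of individual booleans.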

✨ Enhancements

  • Expose descending and nulls last in window order-by (#20919)
  • Add linear_space (#20678)
  • Implement df.unique() on new-streaming engine (#20875)
  • Add unique operations for Decimal dtype (#20855)
  • Add NDJson sink for the new streaming engine (#20805)
  • Support nested keys in window functions (#20837)
  • Add CSV sink for the new streaming engine (#20804)
  • Periodically check python signals ('CTRL-C' handling) (#20826)
  • Experimental unity catalog client (#20798)
  • Support cumulative aggregations for Decimal dtype (#20802)
  • Improve window function caching strategy (#20791)
  • Allow different python versions for pickle (#20740)
  • Add SQL support for the NORMALIZE string function (#20705)
  • Add 'allow_exact_matches' to 'join_asof' (#20723)
  • Add new-streaming first/last aggregations (#20716)
  • Add Parquet Sink to new streaming engine (#20690)
  • Expose IRBuilder (#20710)
  • Make automatic use of Azure storage account keys opt-in (#20652)
  • Improve GroupsProxy/GroupsPosition to be sliceable and cheaply cloneable (#20673)
  • Add str.normalize() (#20483)
  • Allow more group_by agg expressions in the new streaming engine (#20663)
  • Support writing partitioned parquet to cloud (#20590)
  • Add hint to error message for extra struct field in JSON (#20612)
  • Add index_of() function to Series and Expr (#19894)
  • Update sqlparser-rs, enabling "LEFT" keyword to be optional for anti/semi joins in SQL queries (#20576)
  • Add cat.starts_with/cat.ends_with (#20257)
  • Add Int128 IO support for csv & ipc (#20535)
  • Support arbitrary expressions in 'join_where' (#20525)
  • Allow more join lossless casting (#20474)
  • Always resolve dynamic types in schema (#20406)
  • Order observability optimizations (#20396)
  • Add FirstArgLossless supertype (#20394)
  • Add dt.replace (#19708)
  • Polars build for Pyodide (#20383)
  • Add Azure credential provider using DefaultAzureCredential() (#20384)
  • Add env var to ignore file cache allocate error (#20356)
  • Enable joins between compatible differing numeric key columns (#20332)
  • Cache dtype on ExprIR (#20331)
  • Serialize DataFrame/Series using IPC in serde (#20266)
  • Improve error message on SchemaError (#20326)
  • Use better error messages when opening files (#20307)
  • Add 'skip_lines' for CSV (#20301)
  • Allow subtraction of time dtype columns (#20300)
  • Add bin.reinterpret (#20263)
  • Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet (#20248)
  • Add new Int128Type (#20232)
  • IR formatting QoL improvements (#20246)
  • Add cat.len_chars and cat.len_bytes (#20211)
  • Expose AexprArena (#20230)
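
To illustrate what an option like 'allow_exact_matches' changes in an as-of join, here is a minimal standard-library sketch of a backward as-of lookup over sorted keys. It assumes the flag decides whether a right key equal to the left key counts as a match. The function asof_backward is hypothetical and is not the Polars API.

```python
from bisect import bisect_left, bisect_right


def asof_backward(left_keys, right_keys, right_values, allow_exact_matches=True):
    """For each left key, take the value of the last right key at or before it.

    right_keys must be sorted ascending. With allow_exact_matches=False,
    only right keys strictly before the left key are considered.
    """
    out = []
    for key in left_keys:
        # bisect_right counts right keys <= key; bisect_left counts right keys < key.
        if allow_exact_matches:
            pos = bisect_right(right_keys, key)
        else:
            pos = bisect_left(right_keys, key)
        out.append(right_values[pos - 1] if pos > 0 else None)
    return out


right_keys = [1, 5, 10]
right_values = ["a", "b", "c"]
print(asof_backward([5, 7], right_keys, right_values))
print(asof_backward([5, 7], right_keys, right_values, allow_exact_matches=False))
```

The lookup relies on the right keys being sorted, which is also why unsorted as-of keys are worth a warning.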

🐞 Bug fixes

  • Fix from_numpy returning Null dtype for empty 1D numpy array (#20907)
  • Fix map_elements panicking with Decimal type (#20905)
  • Warn if asof keys not sorted (#20887)
  • Avoid name collisions and panicking in object conversion (#20890)
  • Incorrect scale used in log and exp for Decimal type (#20888)
  • Don't deep clone manuallydrop in GroupsPosition (#20886)
  • Fix DuplicateError when selecting columns after join_where or cross join + filter (#20865)
  • Incorrect Decimal value for fill_null(strategy="one") (#20844)
  • Fix one edge case (out of many) of int128 literals not working (#20830)
  • Add height check to frame-level row indexing when key is int (#20778)
  • Remove assert that panics on group_by followed by head(n), where n is larger than the frame height (#20819)
  • Fix panic InvalidHeaderValue scanning from S3 on Windows (#20820)
  • Fix clip for Decimal returning wrong values (#20814)
  • Incorrect height from slicing after projecting only the file path column (#20817)
  • Shift mask when skipping Bitpacked values in Parquet (#20810)
  • Error instead of truncate if length mismatch for several str functions (#20781)
  • Support cumulative aggregations for Decimal dtype (#20802)
  • Do not print sensitive information to output on POLARS_VERBOSE (#20797)
  • Ignore file cache allocation error if fallocate() is not permitted (#20796)
  • Incorrect logic in assert_series_equal for infinities (#20763)
  • Avoid blocking on async runtime when resolving cloud scans (#20750)
  • Fix allow_invalid_certificates being ignored in storage_options (#20744)
  • Incorrect output type for map_groups returning all-NULL column (#20743)
  • Fix unique(maintain_order=True) raising InvalidOperationError for null array (#20737)
  • Don't collapse into a Nested Loop Join if the cross join maintains order (#20729)
  • Don't serialize credentials provider (#20741)
  • Fix Series.n_unique raising for list of struct (#20724)
  • Fix incorrect top-k by sorted column, fix head() returning extra rows (#20722)
  • Add outer validity to AnyValueBufferTrusted for structs (#20713)
  • Don't partition group-by with non-scalar literals in agg (#20704)
  • Incorrect view buffer dedup (#20691)
  • Only verify Parquet ConvertedType if no LogicalType is given (#20682)
  • Validate length of schema_overrides in read_csv (#20672)
  • Fix map_elements ignoring skip_nulls=True for struct dtype (#20668)
  • Check for MAP-GROUPS in cloud-eligible (#20662)
  • Fix empty output of to_arrow() on filtered unit height DataFrame (#20656)
  • Add .default to azure credential provider scope URL (#20651)
  • Fix join_asof panicking for invalid tolerance input (#20643)
  • Incorrect flag check on is_elementwise (#20646)
  • Don't panic but set null type if type is unknown (#20647)
  • Fix performance regression for DataFrame serialization/pickling (#20641)
  • Fix Int128 dtype serialization (#20629)
  • Ensure that SQL LIKE and ILIKE operators support multi-line matches (#20613)
  • Properly broadcast in sort_by (#20434)
  • Properly load nested Parquet Statistics (#20610)
  • AWS environment config was not loaded when credential provider was used (#20611)
  • Fix order observability of group-by-dyn (#20615)
  • Soundness when loading Parquet string statistics (#20585)
  • Fix error filtering after with_columns() on unit height LazyFrame (#20584)
  • Restore symbols on Apple by bumping nightly version (#20563)
  • Fix variable name in error message for "unsupported data type" in rolling and upsampling operations (#20553)
  • Output index type instead of u32 for sum_horizontal with boolean inputs (#20531)
  • Fix more global categorical issues (#20547)
  • Update eager join doctest on multiple columns (#20542)
  • Revert categorical unique code (#20540)
  • Add unique fast path for empty categoricals (#20536)
  • Fix various Int128 operations (#20515)
  • Fix global cat unique (#20524)
  • Fix union (#20523)
  • Fix rolling aggregations for various integer types (#20512)
  • Ensure ignore_nulls is respected in horizontal sum/mean (#20469)
  • Fix incorrectly added sorted flag after append for lexically ordered categorical series (#20414)
  • More Int128 testing and related fixes (#20494)
  • Validate column names in unique() for empty DataFrames (#20411)
  • Implement list.min and list.max for list[i128] (#20488)
  • Decimal from physical in horizontal min/max and shift (#20487)
  • Don't remove sort if first/last strategy is set in unique (#20481)
  • Fix join literal behavior (#20477)
  • Validate asof join by args in IR resolving phase (#20473)
  • Fix align_frames with single row panicking (#20466)
  • Allow multiple column sort for Decimal (#20452)
  • Fix mode panicking for String dtype (#20458)
  • Return correct schema for sum_horizontal with boolean dtype (#20459)
  • Properly handle to_physical_repr of nested types (#20413)
  • Workaround for mmap crash under Emscripten (#20418)
  • Fix using new_columns in scan_csv with compressed file (#20412)
  • Fix decimal arithmetic schema (#20398)
  • Raise on categorical search_sorted (#20395)
  • Don't try to load non-existent List/FSL statistics (#20388)
  • Propagate nulls for float methods on all numeric types (#20386)
  • Add env var to ignore file cache allocate error (#20356)
  • Flip order on right join (#20358)
  • Fix incorrect object store caching for ADLS URI (#20357)
  • Use the same encoding for nullable as non-nullable arrays (#20323)
  • Improve error message on SchemaError (#20326)
  • Boolean optional slice pushdown (#20315)
  • Properly handle from_physical for List/Array (#20311)
  • Ignore quotes in csv comments (#20306)
  • Ensure pl.datetime returns empty column when input columns are empty (#20278)
  • Ensure output height does not change on lazy projection pushdown with aggregations (#20223)
  • Fix error writing on Windows to locations outside of C drive (#20245)
  • Incorrect comparison in some cases with filtered list/array columns (#20243)
  • Ensure height is maintained in SQL SELECT 1 FROM (#20241)
  • Properly account for updated Categorical in .unique() kernel (#20235)
  • Fix incorrect lazy select(len()) with some select orderings (#20222)
  • Fix assertion panic on LazyFrame scratch.is_empty() (#20219)
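
The notes do not say what was wrong with the infinity handling in assert_series_equal (#20763). As general background only, a common pitfall in tolerance-based comparison is that inf - inf is NaN, so a naive absolute-difference check rejects two equal infinities. The standard-library sketch below shows that pitfall and one way around it. Both helper names are hypothetical and are not taken from Polars.

```python
import math


def naive_close(a: float, b: float, atol: float = 1e-8) -> bool:
    # inf - inf is nan, and every ordering comparison with nan is False.
    return abs(a - b) <= atol


def safe_close(a: float, b: float, atol: float = 1e-8) -> bool:
    # Handle exact equality first so matching infinities compare equal.
    if a == b:
        return True
    # An infinity is never within a finite tolerance of a different value.
    if math.isinf(a) or math.isinf(b):
        return False
    return abs(a - b) <= atol


inf = float("inf")
print(naive_close(inf, inf))
print(safe_close(inf, inf))
print(safe_close(inf, -inf))
```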

📖 Documentation

  • Update source URL for legislators-historical.csv (#20858)
  • Fix typo in sql functions (cosinus -> cosine) (#20676)
  • Fix small typo in plugins (polars-dt -> polars-st) (#20657)
  • Add polars-h3 and polars-st to plugin list (#20653)
  • Add docs reference for Field (#20625)
  • Miscellaneous minor updates/fixes (#20573)
  • Update "group_by_rolling" (deprecated) to "rolling" in user guide (#20548)
  • Fix flaky doctests (#20516)
  • Clarify the join pre-condition of join_asof (#20509)
  • Fix Expr.all description of Kleene logic (#20409)
  • Improve docstring clarity (#20416)
  • Fix "forcolumnar" typo in docs (#20401)
  • Remove Plugins overview page without information (#20348)
  • Small fixes/clarifications in user guide (#20335)
  • Improve docs about NaN (#20310)
  • Fix typo in fork warning (#20258)
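
One documentation entry above concerns Kleene logic in Expr.all. As a reminder of what three-valued "all" means, the sketch below uses None for an unknown (null) value: any False decides the result, otherwise an unknown value leaves the result unknown. The function kleene_all is a hypothetical standard-library illustration, not the Polars implementation.

```python
from typing import Iterable, Optional


def kleene_all(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """Three-valued AND over booleans, with None meaning 'unknown'."""
    seen_unknown = False
    for v in values:
        if v is False:
            # A single False settles the answer regardless of unknowns.
            return False
        if v is None:
            seen_unknown = True
    return None if seen_unknown else True


print(kleene_all([True, True]))
print(kleene_all([True, None]))
print(kleene_all([False, None]))
```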

🛠️ Other improvements

  • Add tests for already resolved issues (#20921)
  • Fix the verify_dict_indices codegen (#20920)
  • Add ProjectionContext in projection pushdown opt (#20918)
  • Disable 'catalog' in build (#20897)
  • Implement negative slice for new streaming IPC (#20866)
  • Remove last instances of itoa (#20881)
  • Reduce bloat in static_array_collect by using BitmapBuilders (#20891)
  • Use defunctionalization in polars-core scalar.rs in order to reduce code duplication (#20377)
  • Simplify decimal formatting and remove itoap dep (#20880)
  • Remove polars(_core)::export (#20869)
  • Debloat Series bitops (#20873)
  • Move sum kernel to polars-compute (#20867)
  • Remove todo and test restriction for new-streaming (#20861)
  • Dispatch to the in-mem engine for AExpr::Gather (#20862)
  • Dispatch to the in-memory engine for multifile sources (#20860)
  • Add tests for open issues (#20857)
  • Mark 'register_startup' as unsafe (#20841)
  • Reduce mode bloat (#20839)
  • Rename ContainsMany to ContainsAny (#20785)
  • Unpin NumPy in type checking workflow (#20792)
  • Add various tests (#20768)
  • Small drive-by's (#20772)
  • Touch the upload probe for the remote benchmark (#20767)
  • Fix remote benchmark script (#20755)
  • Fix tests (#20745)
  • Simplify hive predicate handling in NEW_MULTIFILE (#20730)
  • Add tests for various open issues (#20720)
  • Add tests for various open issues that have been fixed (#20680)
  • Don't include debug symbols in benchmark run (#20571)
  • Remove implicit reverse from AExpr::replace_inputs() (#20659)
  • Implement CSV, IPC and NDJson in the MultiScanExec node (#20648)
  • Fix Python deps installation in remote-benchmark workflow (#20619)
  • Fix rust-analyzer misinterpretation (#20595)
  • Remove unused file (#20594)
  • Rename is_numeric to is_primitive_numeric (#20574)
  • Reduce size of ArrowDataType by boxing heavy variants (#20588)
  • Bump multiversion from 0.7 to 0.8 (#20543)
  • Groundwork for allowing multi-output nodes in the new streaming engine (#20550)
  • Improve bin size info (#20551)
  • Increase categorical test coverage (#20514)
  • Report wheel sizes (#20541)
  • Add tests for floor/ceil on integers (#20479)
  • Expose and rewrite 'can_pre_agg' (#20450)
  • Skip test on windows; kuzu import segfaults (#20463)
  • Add a TypeCheckRule to the optimizer (#20425)
  • Fix duplicate cols in new-streaming parquet prefilter (#20419)
  • Move gather kernels to polars-compute (#20415)
  • Temporarily disable common subplan elim for new-streaming (#20374)
  • Remove unused IR::Reduce node (#20392)
  • Enable masked out list, struct and array elements in parametric tests (#20365)
  • Dispatch slice/filter lowering properly (#20390)
  • Move hive partitioning/multi-file handling outside of readers (#20203)
  • Purge ChunkedArray Metadata (#20371)
  • Add equi joins to new streaming engine (#19869)
  • Make parametric tests include pl.List and pl.Array by default (#20319)
  • Use Column in Row Encoding (#20312)
  • Don't warn on fork hook (#20309)
  • Don't deconstruct CsvParseOptions (#20302)
  • Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet (#20248)
  • Add FunctionCastOptions and conservative IR-level cast type-checking (#20286)
  • Add more descriptive error message for failure of vstack/extend (#20299)
  • Expose AexprArena (#20230)

Thank you to all our contributors for making this release possible! @Biswas-N, @FBruzzesi, @IndexSeek, @Jesse-Bakker, @MarcoGorelli, @MoizesCBF, @Prathamesh-Ghatole, @SamuelAllain, @Terrigible, @ZemanOndrej, @alexander-beedie, @arnabanimesh, @balbok0, @beckernick, @braaannigan, @brifitz, @bschoenmaeckers, @burakemir, @coastalwhite, @deanm0000, @dependabot[bot], @dimfeld, @eitsupi, @etiennebacher, @georgestagg, @hamdanal, @haocheng6, @ion-elgreco, @itamarst, @jqnatividad, @kszlim, @lukemanley, @mcrumiller, @nameexhaustion, @noexecstack, @orlp, @ptiza, @r-brink, @ritchie46, @rodrigogiraoserrao, @siddharth-vi, @stijnherfst, @stinodego, @tswast and @zero-stroke