In this comparison, which is also available as a Jupyter notebook for you to try out, I demonstrate typical operations in
Polars and Pandas side by side to show how much easier Polars is to use.
```python
from pathlib import Path
from pandas.io.common import get_handle

# Download a huge csv as a test. Takes a while and only needed once...
big_csv = Path("./big.csv")
csv_url = "http://sdm.lbl.gov/fastbit/data/star2002-full.csv.gz"
if not big_csv.exists():
    with get_handle(csv_url, compression="gzip", mode="r") as fh_in, open(big_csv, "bw") as fh_out:
        fh_out.write(fh_in.handle.buffer.read())
```
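The eager DataFrame edf used by the indexing examples further down is not created in this excerpt; presumably it was read in roughly like this (a sketch, assuming the header-less CSV downloaded above):

```python
import polars as pl

# sketch, not the original cell: read the downloaded, header-less CSV eagerly into `edf`
edf = pl.read_csv(str(big_csv), has_headers=False)
```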
- it’s much harder to read since it’s not operator chaining,
- it’s more verbose if you assign actual variable names to your dataframes and not just use df all the time. Check out this filtering example: agg_metric_df[agg_metric_df["metric_1"] < 0.9]. Using col to refer to the column of the current dataframe is much cleaner, as the sketch after this list shows,
- it’s not possible to switch later from eager to lazy execution.
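For comparison, here is a minimal sketch of that filter in Polars; the dataframe and column names are the made-up ones from the bullet above:

```python
from polars import col

# illustrative only: `col` always refers to a column of the dataframe the expression is
# applied to, no matter what variable name that dataframe has
agg_metric_df.filter(col("metric_1") < 0.9)
```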
Lazy Execution
Just switching read_csv to scan_csv is all it takes to go from eager to lazy in this example. collect or fetch is then used to trigger the execution.
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-aaa4e676d17c> in <module>
      1 ldf = pl.scan_csv(str(big_csv), has_headers=False)
----> 2 ldf[ldf["column_1"] == 1][["column_9"]].head()

TypeError: 'LazyFrame' object is not subscriptable
```
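Instead of subscripting, a LazyFrame is queried with expressions. A minimal sketch, reusing the column names from the failing call above:

```python
from polars import col

# filter and select via expressions, then trigger execution explicitly
ldf = pl.scan_csv(str(big_csv), has_headers=False)
ldf.filter(col("column_1") == 1).select([col("column_9")]).fetch(5)  # or .collect() for all rows
```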
Slicing & Indexing
Slicing and indexing in Polars works with the help of the subscript syntax similar to NumPy, i.e. df[1] or df[1, 2]. Some simple rules apply:
- indexing by a single dimension
  - returns one or several rows if indexed by an integer, e.g. df[42], df[42:],
  - returns one or several columns if indexed by a string, e.g. df["my_col"], df[["col1", "col2"]],
- indexing by two dimensions
  - returns the row(s) indexed by an integer in the first dimension and the column(s) indexed by integer or string in the second dimension, e.g. df[69, 42] or df[69, "col_42"].

In the case of integers, slices such as 1: are also possible.
```python
edf[1, [2, 3]]  # index by (row, column) but returns a data frame
```
| column_3 | column_4 |
| --- | --- |
| i64 | f64 |
| 808 | 2.0011015223e7 |
```python
edf[1:4, "column_4":]  # slice by row and column's name
```
| column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 | column_11 | column_12 | column_13 | column_14 | column_15 | column_16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | f64 | i64 | f64 | f64 | f64 |
| 2.0011015223e7 | 1613424 | 886 | 0 | 0 | 61 | 371 | 2.0011204115e7 | 23.326 | 2288071 | -2.47e-1 | 0.456 | 57.811 |
| 2.0011015223e7 | 1613424 | 638 | 0 | 0 | 7 | 121 | 2.0011204115e7 | 2.444 | 2288071 | -3.91e-1 | 0.59 | 167.757 |
| 2.0011015223e7 | 1613424 | 4259 | 0 | 0 | 1024 | 1302 | 2.0011204115e7 | 9.522 | 2288071 | -2.9e-1 | 0.446 | 8.644 |
Since Pandas has an explicit index that can be of any type, not just integer, and columns that can have any immutable datatype, it has to work around several ambiguities with special accessors like iloc, loc, at, iat, etc.
```python
pdf = edf.to_pandas()
pdf.iloc[1, 3]
```
20011015.222604
```python
# when mixing indexing by integer and by string it gets less comprehensible in Pandas
pdf["column_4"].iloc[1]
```
```python
# for slicing with column names and guaranteed indexing by integer we have to write:
pdf.loc[:, "column_4":].iloc[1:4]
# `pdf.loc[1:4, "column_4":]` works only as long as the index is set correctly.
```
Note that the last element of the c column is null, not NaN as in Pandas, and the datatype is still int and not automatically converted to float as in Pandas.
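As a minimal illustration of that difference (not taken from the original notebook), consider a single integer column with a missing value:

```python
import pandas as pd
import polars as pl

# Polars keeps the integer dtype and marks the missing entry as null
pl.DataFrame({"c": [1, 2, None]})["c"]
# Pandas with its default dtypes falls back to float64 and stores the gap as NaN
pd.Series([1, 2, None])
```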
Select certain elements from a column by filtering on another:
```python
df.select(col("names").filter(col("random") > 0.4))
```
| names |
| --- |
| str |
| “ham” |
| “spam” |
| “egg” |
| null |
The syntax in Pandas is way less readable:
```python
pdf.loc[pdf["random"] > 0.4][["names"]]
```
|   | names |
| --- | --- |
| 1 | ham |
| 2 | spam |
| 3 | egg |
| 4 | None |
Another way in Pandas is to use the query style:
```python
pdf.query("random > 0.4")[["names"]]
```
|   | names |
| --- | --- |
| 1 | ham |
| 2 | spam |
| 3 | egg |
| 4 | None |
The problem with Pandas’ query style is that it is basically string hacking with no checks whatsoever until the code is executed. Also, the reuse of expressions is highly limited if all you have is strings.
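A small sketch of what expression reuse looks like in Polars (names taken from the example above): an expression is an ordinary Python object that can be assigned to a variable and passed around, whereas a query string has to be repeated verbatim.

```python
from polars import col

# the predicate is a first-class object and can be reused in different contexts
good_random = col("random") > 0.4
df.select(col("names").filter(good_random))
df.filter(good_random)
```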
Complex expressions are also possible
All expressions in Polars are embarrassingly parallel by design and thus automatically parallelized.
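The Polars snippet for this example is not part of this excerpt; judging from the Pandas code below, it might have looked roughly like the following window-expression sketch (the alias names are taken from the Pandas version):

```python
from polars import col

# hedged reconstruction, not the original cell: window expressions compute an aggregate per
# group and broadcast the result back onto every row of that group
df.select([
    col("*"),
    col("random").sum().over("groups").alias("sum[random]/groups"),
    col("random").list().over("names").alias("random/name"),
])
```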
Doing the same in Pandas is a bit more complex. Also note that there is an unexpected NaN in the last row. This is because when inserting pdf.groupby(['names'], dropna=False)['random'].apply(list) we compare NaN to NaN, which is False by definition. This is just another subtle problem caused by the fact that Pandas uses NaN to express NA.
Also note that Polars needs no explicit index like Pandas to do operations like this, just as Spark has no way to set an index explicitly.
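The first Pandas variant is also missing from this excerpt; it probably assigned the aggregates back column by column, roughly like this sketch:

```python
# hedged sketch, not the original cell: the per-group sum is broadcast with transform, the
# per-name list is looked up via map; the NaN in `names` fails the lookup, hence the NaN row
pdf["sum[random]/groups"] = pdf.groupby("groups")["random"].transform("sum")
pdf["random/name"] = pdf["names"].map(
    pdf.groupby(["names"], dropna=False)["random"].apply(list)
)
```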
```python
# or alternatively using `join`, which also avoids the NaN problem
pdf.join(
    pdf.groupby('groups').random.sum().rename("sum[random]/groups"), on="groups"
).join(
    pdf.groupby('names', dropna=False).random.apply(list).rename("random/name"), on="names"
)
```
```python
(
    df.lazy()  # allows for working only on a subset using limit
    .groupby("first_name")
    .agg([
        col("party").count().alias("n_party"),  # renaming an aggregated column is a bliss
        col("gender").list(),
        col("last_name").first(),
    ])
    .sort("n_party", reverse=True)
    .limit(5)
    .collect()
)
```
| first_name | n_party | gender_agg_list | last_name_first |
| --- | --- | --- | --- |
| str | u32 | list | str |
| “John” | 19 | “[M, M, … M]” | “Barrasso” |
| “Mike” | 12 | “[M, M, … M]” | “Kelly” |
| “Michael” | 11 | “[M, M, … M]” | “Bennet” |
| “David” | 11 | “[M, M, … M]” | “Cicilline” |
| “James” | 9 | “[M, M, … M]” | “Inhofe” |
Note how easily we can deal with lists of strings by aggregating over gender using list().
In Pandas the same operation feels more like string hacking, and renaming happens as a separate step with unnecessary repetitions of the column names. Everything is, of course, eagerly evaluated.
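The Pandas cell itself is not included in this excerpt; a possible counterpart using named aggregation (a sketch, assuming pdf is here the Pandas version of the senators dataframe df) could look like this, with every output column repeating the input column name as a string:

```python
# hedged sketch, not the original cell
(
    pdf.groupby("first_name")
    .agg(
        n_party=("party", "count"),
        gender_agg_list=("gender", list),
        last_name_first=("last_name", "first"),
    )
    .sort_values(by="n_party", ascending=False)
    .head(5)
    .reset_index()
)
```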
```python
def compute_age() -> pl.Expr:
    # Date64 is time in ms
    ms_to_year = 1e3 * 3600 * 24 * 365
    return (lit(datetime(2021, 1, 1)) - col("birthday")) / (ms_to_year)


def avg_age(gender: str) -> pl.Expr:
    return (
        compute_age()
        .filter(col("gender") == gender)
        .mean()
        .alias(f"avg {gender} age")
    )


(
    df.lazy()
    .groupby(["state"])
    .agg([
        avg_age("M"),
        avg_age("F"),
    ])
    .sort("state")
    .limit(5)
    .collect()
)
```
| state | avg M age | avg F age |
| --- | --- | --- |
| str | f64 | f64 |
| “AK” | 71.899 | 63.657 |
| “AL” | 65.167 | 56.038 |
| “AR” | 58.325 | null |
| “AS” | null | 73.06 |
| “AZ” | 60.004 | 59.168 |
Translating this to Pandas is really hard since we have no way to refer to a column of the current dataframe. Also, with Pandas’ agg we only have access to the aggregation column and cannot filter by another, thus we have to use apply.
```python
def p_compute_age(grp: pd.DataFrame):
    # in Pandas we work in seconds via total_seconds()
    s_to_year = 3600 * 24 * 365
    return ((datetime(2021, 1, 1) - grp["birthday"]).dt.total_seconds()) / (s_to_year)


def p_avg_age(grp: pd.DataFrame, gender: str):
    age = p_compute_age(grp)
    mean_age = age[grp["gender"] == gender].mean()
    return pd.Series([mean_age], index=[f"avg {gender} age"])


(
    pdf.groupby("state")
    .apply(lambda grp: pd.concat([p_avg_age(grp, gender="M"), p_avg_age(grp, gender="F")]))
    .reset_index()
    .sort_values(by="state")
    .head(5)
)
```
|   | state | avg M age | avg F age |
| --- | --- | --- | --- |
| 0 | AK | 71.898630 | 63.657534 |
| 1 | AL | 65.167466 | 56.038356 |
| 2 | AR | 58.324658 | NaN |
| 3 | AS | NaN | 73.060274 |
| 4 | AZ | 60.003767 | 59.168037 |
The same code in Pandas just does not feel as clean as in Polars, which nicely shows the power that comes with Polars’ composable expressions.