| tags: [ development pandas benchmark ] categories: [Development ]
Benchmarking DataFrame Row Access (iloc vs to_dict)
Intro
In my current role I do a lot of code reviews as part of our PR process for the framework my team is working on. These code reviews give me the opportunity to see a lot of different approaches to solving data access problems. I get to see the good and the bad.
Getting the top row
Recently I’ve seen a several situations where a developer is doing a group by
operation in SQL, loading the result set into a DataFrame and then trying to access the top row. In this comparison,
I’m not going to focus on the SQL or group by
part of the work, but on the ‘accessing the top row of a dataframe’
part.
Example
Here is an example dataframe that I will use to demonstrate the desired operation:
index rint rfloat rstr
0 28 0.174798 testing
0 32 0.572884 that
1 16 0.847770 that
In this example, the desire is to retrieve the first row.
to_dict
approach
Here is the approach that I have seen in code reviews. It works.
def access_row_using_to_dict(df):
return df.to_dict(orient='records')[0]['rstr']
In words, this function, receives a dataframe, converts the dataframe to a list of dictionaries. Then access the
0th index of the list and retrieves the value of the dictionary by key ‘rstr’.
Here is an example of the DataFrame to_dict
structure:
[
{'index': 0, 'rint': 28, 'rfloat': 0.1747977850239647, 'rstr': 'testing'},
{'index': 0, 'rint': 32, 'rfloat': 0.5728837861868796, 'rstr': 'that'},
{'index': 1, 'rint': 16, 'rfloat': 0.847770298364003, 'rstr': 'that'}
]
df.to_dict(orient='records')[0]
will yield:
{'index': 0, 'rint': 28, 'rfloat': 0.1747977850239647, 'rstr': 'testing'}
So this approach will work, but it seems like there should be a better way.
iloc
approach
def access_row_using_iloc(df):
return df.iloc[0].rstr
This approach uses a method in DataFrame
to retrieve the 0th row from the dataframe as a pandas.core.series.Series
row = df.iloc[0]
row
index 0
rint 28
rfloat 0.174798
rstr testing
Name: 0, dtype: object
Then the specific value in the Series can be retrieved either by property name or key lookup:
row.rstr
'testing'
row['rstr']
'testing'
Comparing the two
Generally, the different between the two approaches:
to_dict
approach transforms the whole dataframe into a new data structure and then the value is retrieved by
navigating that new data structure. the iloc
approach uses the existing data structure.
Which is faster?
In order to compare the two access methods, I am using pytest with the benchmarking module. I created a benchmark function for each of the two methods:
test_benchmark_to_dict_index
@pytest.mark.parametrize('row_count',[2,200,200000])
def test_benchmark_to_dict_index(rdf, row_count, benchmark):
df = build_dataframe_with_test_row(rdf, row_count)
benchmark(access_row_using_to_dict, df)
test_benchmark_iloc_index
@pytest.mark.parametrize('row_count',[2,200,200000])
def test_benchmark_iloc_index(rdf, row_count, benchmark):
df = build_dataframe_with_test_row(rdf, row_count)
benchmark(access_row_using_iloc, df)
Each of these benchmark functions only exercise the access row part of the work, not generating the dataframe. The benchmark functions are parameterized to run for several row count sizes to show any variation that may happen as a result of row count size.
Benchmark Results
Here is a demonstration run of the two benchmark functions. This result shows the difference between the two access functions for all the row counts
(.venv) RandomDataFrame % pytest test_to_dict_vs_iloc.py --benchmark-only --benchmark-columns=median,iqr,outliers,ops,rounds,iterations
======================================================================================================== test session starts =========================================================================================================
platform darwin -- Python 3.9.10, pytest-7.1.2, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/brucelowther/src/python/RandomDataFrame/RandomDataFrame
plugins: benchmark-3.4.1
collected 9 items
test_to_dict_vs_iloc.py s...s...s [100%]
----------------------------------------------------------- benchmark: 6 tests -----------------------------------------------------------
Name (time in us) Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_iloc_index[200] 31.6250 (1.0) 1.2500 (2.72) 157;267 31,262.3316 (1.0) 9147 1
test_benchmark_iloc_index[200000] 32.2920 (1.02) 0.4590 (1.0) 73;1546 30,679.6681 (0.98) 6192 1
test_benchmark_iloc_index[2] 32.2080 (1.02) 1.1260 (2.45) 166;270 30,593.6293 (0.98) 7223 1
test_benchmark_to_dict_index[2] 64.1250 (2.03) 1.5830 (3.45) 61;293 15,327.4559 (0.49) 3531 1
test_benchmark_to_dict_index[200] 311.1665 (9.84) 8.6660 (18.88) 670;60 3,219.5607 (0.10) 2698 1
test_benchmark_to_dict_index[200000] 263,655.4580 (>1000.0) 5,919.6145 (>1000.0) 2;0 3.7931 (0.00) 5 1
------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
==================================================================================================== 6 passed, 3 skipped in 5.76s ====================================================================================================
Summary
Three interesting results:
- In all cases, the
iloc
approach is faster than theto_dict
approach. - The
iloc
approach appears to be stable over different row count sizes. (The median time for 2 rows is nearly the same as the median time for 200000 rows.) - The
to_dict
approach is very dependent upon the row count size of the dataframe
I always find interesting results when I do benchmark comparisions between various techniques. I hope this write up, and the gitlab repo can help you learn where time is being spent in your own python code.
Code Reference
GitLab Repository containing the benchmark and related packages.