Intro

In my current role I do a lot of code reviews as part of our PR process for the framework my team is working on. These code reviews give me the opportunity to see a lot of different approaches to solving data access problems. I get to see the good and the bad.

Getting the top row

Recently I’ve seen a several situations where a developer is doing a group by operation in SQL, loading the result set into a DataFrame and then trying to access the top row. In this comparison, I’m not going to focus on the SQL or group by part of the work, but on the ‘accessing the top row of a dataframe’ part.

Example

Here is an example dataframe that I will use to demonstrate the desired operation:

index  rint    rfloat     rstr
    0    28  0.174798  testing
    0    32  0.572884     that
    1    16  0.847770     that

In this example, the desire is to retrieve the first row.

`to_dict` approach

Here is the approach that I have seen in code reviews. It works.

def access_row_using_to_dict(df):
    return df.to_dict(orient='records')[0]['rstr']

In words, this function, receives a dataframe, converts the dataframe to a list of dictionaries. Then access the 0th index of the list and retrieves the value of the dictionary by key ‘rstr’. Here is an example of the DataFrame to_dict structure:

[
    {'index': 0, 'rint': 28, 'rfloat': 0.1747977850239647, 'rstr': 'testing'}, 
    {'index': 0, 'rint': 32, 'rfloat': 0.5728837861868796, 'rstr': 'that'}, 
    {'index': 1, 'rint': 16, 'rfloat': 0.847770298364003, 'rstr': 'that'}
]

df.to_dict(orient='records')[0] will yield:

    {'index': 0, 'rint': 28, 'rfloat': 0.1747977850239647, 'rstr': 'testing'}

So this approach will work, but it seems like there should be a better way.

`iloc` approach

def access_row_using_iloc(df):
    return df.iloc[0].rstr

This approach uses a method in DataFrame to retrieve the 0th row from the dataframe as a pandas.core.series.Series

row = df.iloc[0]
row
index            0
rint            28
rfloat    0.174798
rstr       testing
Name: 0, dtype: object

Then the specific value in the Series can be retrieved either by property name or key lookup:

row.rstr
'testing'
row['rstr']
'testing'

Comparing the two

Generally, the different between the two approaches: to_dict approach transforms the whole dataframe into a new data structure and then the value is retrieved by navigating that new data structure. the iloc approach uses the existing data structure.

Which is faster?

In order to compare the two access methods, I am using pytest with the benchmarking module. I created a benchmark function for each of the two methods:

`test_benchmark_to_dict_index`

@pytest.mark.parametrize('row_count',[2,200,200000])
def test_benchmark_to_dict_index(rdf, row_count, benchmark):
    df = build_dataframe_with_test_row(rdf, row_count)
    benchmark(access_row_using_to_dict, df)

test_benchmark_iloc_index

@pytest.mark.parametrize('row_count',[2,200,200000])
def test_benchmark_iloc_index(rdf, row_count, benchmark):
    df = build_dataframe_with_test_row(rdf, row_count)
    benchmark(access_row_using_iloc, df)

Each of these benchmark functions only exercise the access row part of the work, not generating the dataframe. The benchmark functions are parameterized to run for several row count sizes to show any variation that may happen as a result of row count size.

Benchmark Results

Here is a demonstration run of the two benchmark functions. This result shows the difference between the two access functions for all the row counts


(.venv) RandomDataFrame % pytest test_to_dict_vs_iloc.py --benchmark-only --benchmark-columns=median,iqr,outliers,ops,rounds,iterations
======================================================================================================== test session starts =========================================================================================================
platform darwin -- Python 3.9.10, pytest-7.1.2, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/brucelowther/src/python/RandomDataFrame/RandomDataFrame
plugins: benchmark-3.4.1
collected 9 items                                                                                                                                                                                                                    

test_to_dict_vs_iloc.py s...s...s                                                                                                                                                                                              [100%]


----------------------------------------------------------- benchmark: 6 tests -----------------------------------------------------------
Name (time in us)                              Median                   IQR            Outliers          OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_iloc_index[200]                31.6250 (1.0)          1.2500 (2.72)      157;267  31,262.3316 (1.0)        9147           1
test_benchmark_iloc_index[200000]             32.2920 (1.02)         0.4590 (1.0)       73;1546  30,679.6681 (0.98)       6192           1
test_benchmark_iloc_index[2]                  32.2080 (1.02)         1.1260 (2.45)      166;270  30,593.6293 (0.98)       7223           1
test_benchmark_to_dict_index[2]               64.1250 (2.03)         1.5830 (3.45)       61;293  15,327.4559 (0.49)       3531           1
test_benchmark_to_dict_index[200]            311.1665 (9.84)         8.6660 (18.88)      670;60   3,219.5607 (0.10)       2698           1
test_benchmark_to_dict_index[200000]     263,655.4580 (>1000.0)  5,919.6145 (>1000.0)       2;0       3.7931 (0.00)          5           1
------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
==================================================================================================== 6 passed, 3 skipped in 5.76s ====================================================================================================

Summary

Three interesting results:

In all cases, the iloc approach is faster than the to_dict approach.
The iloc approach appears to be stable over different row count sizes. (The median time for 2 rows is nearly the same as the median time for 200000 rows.)
The to_dict approach is very dependent upon the row count size of the dataframe

I always find interesting results when I do benchmark comparisions between various techniques. I hope this write up, and the gitlab repo can help you learn where time is being spent in your own python code.

Code Reference

GitLab Repository containing the benchmark and related packages.

https://gitlab.com/osunderdog/pandasrandomdataframe

Benchmarking DataFrame Row Access (iloc vs to_dict)