Ibis and Trino
Lester Martin
Developer Advocate
Starburst


A bird and a bunny walk into a bar…
The bird says, "I'm the Python dataframe library with tons of optionality". The bunny says, "I'm the SQL engine with heaps of optionality". They instantly became BFFs. The end.
OK… that "joke" did stink, but using Ibis with Trino for the backend is the complete opposite. It's optionality² and that's a pretty sweet thing!!
This post is a repeat of pystarburst (the dataframe api), but this time using Ibis. As before, this post is NOT attempting to teach you everything you need to know about the DataFrame API, but it does provide some insight into the subject matter. Let's get into it!
Set up your environment
Like before, I'm assuming you already have Python 3 and pip installed (go here if you need some help). We can start by installing Ibis and the dependencies for the Trino backend.
$ pip install 'ibis-framework[trino]'
I'm using these instructions to continue on with the setup activities and will be using my personal Starburst Galaxy tenant to test with. Unlike in the instructions, I'm just hard-coding my connection details in my boilerplate code (masked, of course).
import ibis

ibis.options.interactive = True

user = "lXXXXXXXXm/aXXXXXXXXXn"
password = "<password>"
host = "lXXXXXXXXXo"
port = "443"
catalog = "tpch"
schema = "tiny"

con = ibis.trino.connect(
    user=user, password=password, host=host, port=port,
    database=catalog, schema=schema
)

tiny_region = con.table("region")
print(tiny_region[0:5])
Test the boilerplate code
I'm using the CLI, but you could easily run this in your favorite web-based notebook.
$ python3 ibis-main.py
Yep, we basically ran a simple SELECT statement, and we can verify in Starburst Galaxy's Query history screen that it executed.
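If you'd rather verify from the Python side, Ibis can also render the SQL it generates for an expression via ibis.to_sql(). A minimal sketch:

# Render the SQL Ibis generates for this expression without running it
print(ibis.to_sql(tiny_region[0:5]))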
Explore the API
There's a lot of solid documentation on the Ibis site, but the Basic operations page is as good a starting point as any to get your hands dirty writing some code.
If you're interested in a side-by-side comparison of the DataFrame API code from Ibis and PyStarburst, just pull up my pystarburst (the dataframe api) blog alongside this one, since I'm porting that code to work with Ibis.
Select a full table
custDF = con.table("customer")
print(custDF.head(10))
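Worth noting: head(n) is equivalent to limit(n) in Ibis, so either spelling works here.

# head(10) and limit(10) produce the same result
print(custDF.limit(10))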
Use projection
projectedDF = custDF.select("name", "acctbal", "nationkey")
print(projectedDF.head(10))
Filter the rows
filteredDF = projectedDF.filter(projectedDF["acctbal"] > 9900.0)
print(filteredDF.head(100))
Notice that even though 100 records were requested, only 7 records meet this criterion.
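Rather than counting rows by eye, you can ask the DataFrame directly; a quick sketch:

# Confirm how many rows survive the filter
print(filteredDF.count())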
Select a second table
Let's create a DataFrame from the nation table to later join with customer. In the example below, we are chaining methods together instead of assigning each output to a distinct variable as we have done up until now.
# Grab new table, drop 2 cols, and rename 2 others
nationDF = con.table("nation") \
    .drop("regionkey", "comment") \
    .rename(
        dict(
            nation_name="name",
            n_nationkey="nationkey"
        )
    )
print(nationDF.head(10))
Join the tables
This is the EXACT same syntax used in PyStarburst (and yes, PySpark, too).
joinedDF = filteredDF.join(nationDF,
    filteredDF.nationkey == nationDF.n_nationkey)
print(joinedDF.head(10))
As you can see, the join results include all of the columns from both DataFrames.
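If you want to see exactly which columns came through without printing any rows, something like this works:

# List the column names present on the joined DataFrame
print(joinedDF.columns)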
Project the joined result
Like before, we can just discard the unwanted columns.
projectedJoinDF = joinedDF.drop("nationkey", "n_nationkey")
print(projectedJoinDF.head(10))
Apply a sort
This approach looks a little bit different than PyStarburst's, but it is easy enough to follow.
orderedDF = projectedJoinDF.order_by([ibis.desc("acctbal")])
print(orderedDF.head(10))
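If you ever need a secondary sort key, order_by accepts a list of keys. A small sketch mixing descending and ascending order (the nation_name tie-breaker is just for illustration):

# Sort by acctbal descending, breaking ties on nation_name ascending
tieBreakDF = projectedJoinDF.order_by([ibis.desc("acctbal"), "nation_name"])
print(tieBreakDF.head(10))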
Put it all together
While the creation of multiple DataFrame objects was shown above, in practice (as discussed when fetching the nation table) most DataFrame API programmers chain many methods together so the code looks a bit more like this.
nationDF = con.table("nation") \
    .drop("regionkey", "comment") \
    .rename(
        dict(
            nation_name="name",
            n_nationkey="nationkey"
        )
    )

custDF = con.table("customer") \
    .select("name", "acctbal", "nationkey") \
    .filter(ibis._["acctbal"] > 9900.0)  # ibis._ is a deferred reference to the table in this chain

apiSQL = custDF.join(nationDF,
        custDF.nationkey == nationDF.n_nationkey) \
    .drop("nationkey", "n_nationkey") \
    .order_by([ibis.desc("acctbal")])
print(apiSQL.head(10))
Comparing with PyStarburst
The results are the same as in the original PyStarburst post whose code we just ported to Ibis. The APIs are different enough that you would likely want to pick one and stick with it instead of trying to use them both daily. For me personally, I'm currently in the PyStarburst camp.
Sure, I work at Starburst, which has SOMETHING to do with that, but it is really because it lines up more closely with the PySpark implementation I have spent a number of years working with. If I were starting from scratch, I wouldn't have the prior experience to drive me to think this way, and I would likely do more personal research and comparison.
It surely is NOT a dig against Ibis and its optionality of being able to run the same DataFrame program against multiple backend SQL engines. That is an INCREDIBLE foundational feature of this framework. Fortunately, Trino's connector architecture and long list of integrations offer that same kind of optionality at the SQL engine layer.
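To see that optionality in action, Ibis can compile the very same expression for another backend's SQL dialect; a quick sketch (the duckdb dialect is just one example):

# Compile the final expression for two different SQL dialects
print(ibis.to_sql(apiSQL, dialect="trino"))
print(ibis.to_sql(apiSQL, dialect="duckdb"))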
Of course, if you need this flexibility for a backend in the list below that is not a supported Trino integration (quack-quack comes to mind, but it isn't the only one in that list not currently integrated with Trino), then you'll have to take a hard look at Ibis.
For Trino (and Starburst) fans like myself, it is surely a win if you choose to go down the Ibis path, as we do believe in this integration and want to only make it better over time.
As for performance & optimization… I was running all these simple examples on the TPCH connector (and the tiny schema at that), which absolutely does not allow any inference to be made from the limited set of examples I ran for this blog post. One would assume that the CBO would ultimately decide to tackle the problem the same way regardless of which DataFrame API implementation was used.
In fact, we ended up getting a VERY SIMILAR query plan for PyStarburst and Ibis as expected. The DAG on the left is from PyStarburst and the one on the right from the Ibis invocation.
If those pictures look like hieroglyphics and all that CBO & DAG talk was mumbo-jumbo, and you want to learn more, check out these free training modules from Starburst Academy.
Back to the visualizations. Yes, the text is amazingly small (and fuzzy) and is almost completely unreadable, but I do see something on the left that didn't happen on the right.
The PyStarburst execution ended up running ScanFilterProjects instead of the plain ScanFilters that Ibis produced. Again… do NOT read anything into this, especially with the data generator connector I was using. It just lets me know I need to do some more side-by-side research.
For that, pushing the TPCH-generated data into a modern table format like Iceberg and using a bigger schema could offer some real testing opportunities.
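That could be as simple as a CTAS issued through the same connection; a rough sketch, assuming an Iceberg-backed catalog named iceberg_lake with a tpch_bulk schema already exists (both names are made up for illustration):

# Hypothetical: copy a larger TPCH schema into an Iceberg-backed catalog
con.raw_sql(
    "CREATE TABLE iceberg_lake.tpch_bulk.lineitem AS "
    "SELECT * FROM tpch.sf100.lineitem"
)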
All in all, I'm happy I stayed up until 2:00 am on a work night doing some exploration and sharing my initial findings.
Originally published here.