PySpark has always provided wonderful SQL and Python APIs for querying data. As of Databricks Runtime 12.1 and Apache Spark 3.4, parameterized queries support safe and expressive ways to query data with SQL using Pythonic programming paradigms.
This post explains how to make parameterized queries with PySpark and when this is a good design pattern for your code.
Parameters are helpful for making your Spark code easier to reuse and test. They also encourage good coding practices. This post will demonstrate the two different ways to parameterize PySpark queries:
- PySpark custom string formatting
- parameter markers
Let's look at how to use both types of PySpark parameterized queries and explore why the built-in functionality is better than the alternatives.
Advantages of parameterized queries
Parameterized queries encourage the "don't repeat yourself" (DRY) pattern, make unit testing easier, and make SQL easier to reuse. They also prevent SQL injection attacks, which can pose security vulnerabilities.
It can be tempting to copy and paste large chunks of SQL when writing similar queries. Parameterized queries encourage abstracting patterns and writing code with the DRY pattern.
Parameterized queries are also easier to test. You can parameterize a query so it's easy to run on production and test datasets.
On the other hand, manually parameterizing SQL queries with Python f-strings is a poor alternative. Consider the following disadvantages:
- Python f-strings don't protect against SQL injection attacks (see the sketch after this list).
- Python f-strings don't understand Python native objects such as DataFrames, columns, and special characters.
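To make the first point concrete, here's a minimal sketch (the table name and the user_supplied value are hypothetical) of how an f-string splices untrusted input straight into the SQL text:
# A hypothetical user-supplied value that smuggles extra SQL into the query.
user_supplied = "id016' OR '1' = '1"

query = f"SELECT * FROM some_table WHERE id1 = '{user_supplied}'"
print(query)
# SELECT * FROM some_table WHERE id1 = 'id016' OR '1' = '1'
# The WHERE clause is now always true, so the filter is silently bypassed.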
Let's look at how to parameterize queries with parameter markers, which protect your code from SQL injection vulnerabilities and support automatic type conversion of common PySpark instances to SQL string format.
Parameterized queries with PySpark custom string formatting
Suppose you have the following data table called h20_1e9 with 9 columns:
+-----+-----+------------+---+---+-----+---+---+---------+
|  id1|  id2|         id3|id4|id5|  id6| v1| v2|       v3|
+-----+-----+------------+---+---+-----+---+---+---------+
|id008|id052|id0000073659| 84| 89|82005|  5| 11|64.785802|
|id079|id037|id0000041462|  4| 35|28153|  1|  1|28.732545|
|id098|id031|id0000027269| 27| 38|13508|  5|  2|59.867875|
+-----+-----+------------+---+---+-----+---+---+---------+
You'd like to parameterize the following SQL query:
SELECT id1, SUM(v1) AS v1
FROM h20_1e9
WHERE id1 = "id089"
GROUP BY id1
You'd like to make it easy to run this query with different values of id1. Here's how to parameterize and run the query with different id1 values.
question = """SELECT id1, SUM(v1) AS v1
FROM h20_1e9
WHERE id1 = {id1_val}
GROUP BY id1"""
spark.sql(question, id1_val="id016").present()
+-----+------+
|  id1|    v1|
+-----+------+
|id016|298268|
+-----+------+
Now rerun the query with another argument:
spark.sql(query, id1_val="id089").show()
+-----+------+
|  id1|    v1|
+-----+------+
|id089|300446|
+-----+------+
The PySpark string formatter also lets you execute SQL queries directly on a DataFrame without explicitly defining temporary views.
Suppose you have the following DataFrame called person_df:
+---------+--------+
|firstname| country|
+---------+--------+
|    frank|     usa|
|   sourav|   india|
|    rahul|   india|
|      sim|bulgaria|
+---------+--------+
Here's how to query the DataFrame with SQL.
spark.sql(
    "select country, count(*) as num_ppl from {person_df} group by country",
    person_df=person_df,
).show()
+--------+-------+
| country|num_ppl|
+--------+-------+
|     usa|      1|
|   india|      2|
|bulgaria|      1|
+--------+-------+
Running queries on a DataFrame using SQL syntax without having to manually register a temporary view is very nice!
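This also makes queries easier to test. Here's a minimal sketch (the count_by_country function and the tiny test DataFrame are hypothetical) of a reusable query function that works equally well on production data and on a small in-memory DataFrame in a unit test:
from pyspark.sql import DataFrame

def count_by_country(people: DataFrame) -> DataFrame:
    # {people} is replaced with an auto-generated temporary view for the
    # DataFrame, so the same query runs against whatever DataFrame is passed in.
    return spark.sql(
        "select country, count(*) as num_ppl from {people} group by country",
        people=people,
    )

# In a test, call it with a small in-memory DataFrame instead of a production table.
test_df = spark.createDataFrame(
    [("frank", "usa"), ("sourav", "india")], ["firstname", "country"]
)
count_by_country(test_df).show()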
Let's now see how to parameterize queries with arguments in parameter markers.
Parameterized queries with parameter markers
You can also use a dictionary of arguments to formulate a parameterized SQL query with parameter markers.
Suppose you have the following view named some_purchases:
+-------+------+-------------+
|   item|amount|purchase_date|
+-------+------+-------------+
|  socks|  7.55|   2022-05-15|
|handbag| 49.99|   2022-05-16|
| shorts|  25.0|   2023-01-05|
+-------+------+-------------+
Here's how to make a parameterized query with named parameter markers to calculate the total amount spent on a given item.
query = "SELECT item, sum(amount) from some_purchases group by item having item = :item"
Compute the total amount spent on socks.
spark.sql(
    query,
    args={"item": "socks"},
).show()
+-----+-----------+
| item|sum(amount)|
+-----+-----------+
|socks|       7.55|
+-----+-----------+
You can also parameterize queries with unnamed parameter markers; see here for more information.
Apache Spark sanitizes parameter markers, so this parameterization approach also protects you from SQL injection attacks.
How PySpark sanitizes parameterized queries
Here's a high-level description of how Spark sanitizes named parameterized queries:
- The SQL query arrives with an optional list of key/value parameters.
- Apache Spark parses the SQL query and replaces the parameter references with corresponding parse tree nodes.
- During analysis, a Catalyst rule runs to replace these references with the provided parameter values.
- This approach protects against SQL injection attacks because it only supports literal values, as the sketch below illustrates. Regular string interpolation applies substitution on the SQL string; that technique can be vulnerable to attacks if the string contains SQL syntax other than the intended literal values.
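Here's a quick sketch of that protection in action, reusing the some_purchases view from above (the malicious value is made up for illustration):
# A value that tries to smuggle extra SQL into the query. Because parameter
# markers only bind literal values, the whole string is compared as-is and
# matches no item, so the query simply returns an empty result.
malicious = "socks' OR 1=1 --"

spark.sql(
    "SELECT item, sum(amount) FROM some_purchases GROUP BY item HAVING item = :item",
    args={"item": malicious},
).show()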
As previously mentioned, there are two types of parameterized queries supported in PySpark:
The {} syntax does a string substitution on the SQL query on the client side for ease of use and better programmability. However, it doesn't protect against SQL injection attacks since the query text is substituted before being sent to the Spark server.
Parameterization uses the args argument of the sql() API and passes the SQL text and parameters separately to the server. The SQL text gets parsed with the parameter placeholders, and the values of the parameters specified in args are substituted into the analyzed query tree.
There are two flavors of server-side parameterized queries: named parameter markers and unnamed parameter markers. Named parameter markers use the :<param_name> syntax for placeholders. See the documentation for more information on how to use unnamed parameter markers.
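As a reference point, here's a minimal sketch of the unnamed flavor; it assumes a Spark version recent enough to accept a list for the args argument (unnamed markers use ? placeholders and are bound positionally):
# Unnamed parameter markers: ? placeholders are bound from a list, in order.
spark.sql(
    "SELECT item, sum(amount) FROM some_purchases GROUP BY item HAVING item = ?",
    args=["socks"],
).show()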
Parameterized queries vs. string interpolation
You can also use regular Python string interpolation to parameterize queries, but it's not as convenient.
Here's how we might have to parameterize a query with Python f-strings:
some_df.createOrReplaceTempView("whatever")
the_date = "2021-01-01"
min_value = "4.0"
table_name = "whatever"
query = f"""SELECT * from {table_name}
WHERE the_date > '{the_date}' AND amount > {min_value}"""
spark.sql(query).show()
This isn't as nice for the following reasons:
- It requires creating a temporary view.
- We need to represent the date as a string, not a Python date.
- We need to wrap the date in single quotes in the query to format the SQL string properly.
- This doesn't protect against SQL injection attacks.
In sum, the built-in query parameterization capabilities are safer and more effective than string interpolation.
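For contrast, here's a sketch of the same query using the built-in mechanisms; it assumes that the client-side {} formatting and the server-side args argument can be mixed in a single sql() call, and that some_df has the_date and amount columns:
from datetime import date

# No temporary view, the date stays a real Python date, no manual quoting,
# and the values are sent as parameters instead of spliced into the SQL text.
spark.sql(
    "SELECT * FROM {some_df} WHERE the_date > :the_date AND amount > :min_value",
    some_df=some_df,
    args={"the_date": date(2021, 1, 1), "min_value": 4.0},
).show()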
Conclusion
PySpark parameterized queries give you new capabilities to write clean code with familiar SQL syntax. They're convenient when you want to query a Spark DataFrame with SQL. They let you use common Python data types like floating point values, strings, dates, and datetimes, which automatically convert to SQL values under the hood. In this way, you can leverage common Python idioms and write beautiful code.
Start leveraging PySpark parameterized queries today, and you will immediately enjoy the benefits of a higher-quality codebase.