Handling date fields is an everyday task in the data world. Whether it’s a common audit date field or other fields providing specific day, month, or year information, managing these fields correctly is vital. You might even have processes parametrized by execution variables like date ranges to constrain operations effectively.
While determining the appropriate functions for setting dates is straightforward (e.g., Scala’s default Java-based date handling or Spark’s current_date(): Column
function), the real challenge lies in establishing a correct, normalized date format. More intriguingly, how do you handle any conceivable input format and output a consistent, normalized format?
Regular Expressions (Regex) #
There’s no shortage of resources online that offer tips and solutions for normalizing dates. Here, we’ll focus on using regular expressions in Scala to accept any date format as input and normalize it for downstream processes reliant on dates as execution parameters.
Date Formats #
To effectively utilize regular expressions, it’s essential to understand date formats. A date typically consists of three parts: day, month, and year. For example:
- A month is a number from 1 to 12.
- A year is typically a four-digit number.
Here’s an example Regex to normalize dates in the format dd-mm-yyyy
:
val dateRegex = "([0-9]{2})-([0-9]{2})-([0-9]{4})"
This expression has three groups (enclosed in parentheses) representing day, month, and year, respectively. For instance, February 28, 2022, would be written as 28-02-2022
. Adjusting the format to use /
as a separator instead of -
is straightforward:
val dateRegex = "([0-9]{2})/([0-9]{2})/([0-9]{4})"
Use Case: Filtering Data Based on Dates #
Let’s apply this concept to a process where, given a specific input date, we filter and display all rows from a CSV file containing stock prices within a specified range.
Base Code #
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{DataFrame, SparkSession}
object RegexDate {
val spark: SparkSession =
SparkSession
.builder()
.master("local[*]")
.getOrCreate()
/** Format Date: YYYY-mm-dd */
val data: DataFrame =
spark.read.option("header", true)
.csv("src/main/resources/historical-data/csv/stocks/AAPL.csv")
def showFrom(date: String): Unit =
data.where(col("Date") >= date).show(false)
def main(args: Array[String]): Unit = {
val date: String = "2021-07-10"
showFrom(date)
}
}
CSV Content #
The dataset, downloaded from Yahoo Finance, contains Apple stock prices:
Date,Open,High,Low,Close,Volume,Adj Close
2021-07-10,120.99,121.0,119.98,120.99,1000,120.99
...
Dataframe #
Date | Open | High | Low | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
2021-07-10 | 120.99 | 121.0 | 119.98 | 120.99 | 1000 | 120.99 |
Validating Input Dates with Preconditions or require
#
In the base code, any misformatted date input may cause confusing errors. For instance, using 07/10/2021
(mm/dd/yyyy
format) could result in the process interpreting the year as 07
.
Using require
statements can ensure inputs adhere to the expected format (yyyy-mm-dd
):
def showFrom(date: String): Unit = {
require(date.contains("-") && date.split("-").length == 3, "Date separator must be '-'")
require(date.split("-").head.length == 4, "Date format must be yyyy-mm-dd")
data.where(col("Date") >= date).show(false)
}
Normalizing Dates with Regular Expressions #
Instead of relying solely on preconditions, let’s normalize dates programmatically. Assume input dates come in two formats: dd-mm-yyyy
and yyyy/mm/dd
.
Example: Normalization Function #
val normalizeDateSep: String = "-"
val dateRegex1: Regex = "([0-9]{4})/([0-9]{2})/([0-9]{2})".r
val dateRegex2: Regex = "([0-9]{2})-([0-9]{2})-([0-9]{4})".r
def normalizeDate(date: String): String =
date match {
case dateRegex1(year, month, day) => Array(year, month, day).mkString(normalizeDateSep)
case dateRegex2(day, month, year) => Array(year, month, day).mkString(normalizeDateSep)
}
With this basic normalization, you can accept multiple input formats and ensure consistency in your data processing pipeline.
.r
at the end of a string converts it into a Regex object, allowing for pattern matching.
This example forms the foundation for building more flexible and robust date normalization utilities.
Stay tuned for part two, where we’ll enhance this utility with smarter constructors for more semantic and concise code!