If you have ever encountered a CSV with any or many inconsistently formatted text columns, I’m sure you have come to understand the importance of standardizing the data and extracting important information for analysis. Try looking at the Australian Commercial Real Estate dataset from Kaggle for instance - you’ll immediately notice that the inconsistencies in the price column make it hard to analyze property prices. Thankfully, with Sigma, it’s relatively easy to make this data more useful.
Let’s run with the Kaggle dataset example for a minute…
- Download the CSV here and upload the CSV into a Sigma workbook table.
- Create a new column called [Lower of Price] with the formula:
RegexpReplace(RegexpReplace(Lower([price]), ",", ""), " million", "m")
- This will replace any commas with blank spaces “”, and it will replace " million" with “m”.
- Add a new column called [Extracted Price Text] using the formula:
RegexpExtract([Lower of Price], "[0-9]+.[0-9]+[a-z]|[0-9]+")
- This will extract any digits separated by a decimal and followed by a letter ([0-9]+.[0-9]+[a-z]), or (|) any number of digits ([0-9]+). Note that this would not work if I switched the order of extraction (i.e., if I wrote “[0-9]+|[0-9]+.[0-9]+[a-z]” as the second argument in the RegexpExtract function, it would never return any numbers that followed a decimal point).
- Add a new column called [Extracted Price] using the formula:
If(Contains([Extracted Price Text], "m"), Number(RegexpReplace([Extracted Price Text], "m", "")) * 1000000, Number([Extracted Price Text]))
- If the [Extracted Price Text] column contains the letter “m” for “million”, this replaces the “m” with a blank space “” and returns the remaining decimal number times 1 million.
- If the letter “m” is not in the [Extracted Price Text] column, this simply returns the number from the [Extracted Price Text] column.
Congratulations! Now that you’ve standardized, extracted, and transformed the price information, you can proceed with analysis.
If you have any questions on the above, or any questions about how to use regular expressions in your particular use-case, please leave a comment! I’m happy to provide further guidance.