Created May 10, 2025 13:18
Demo of regular expressions with Unicode support: split a text into words
(def split-into-words (partial re-seq #"\w+"))

(comment
  ;; split-into-words, implemented with the trivial \w+ regular
  ;; expression, works fine in English:
  (split-into-words "Have a nice day.")
  ;; => ("Have" "a" "nice" "day")

  ;; But it fails with other languages, such as Spanish. In the
  ;; following example, the word "día" is split into "d" and "a":
  (split-into-words "Que tenga un buen día")
  ;; => ("Que" "tenga" "un" "buen" "d" "a")
  )
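
;; A likely explanation for the split above, sketched under the
;; assumption that this code runs on the JVM: java.util.regex treats \w
;; as ASCII-only ([a-zA-Z_0-9]) unless the UNICODE_CHARACTER_CLASS flag
;; is set, so "í" (written here as \u00ED) ends the match.
(comment
  (re-seq #"\w" "\u00ED")
  ;; => nil

  ;; The embedded flag (?U) turns that flag on. Note that \w still
  ;; accepts digits and underscores, which \p{IsAlphabetic} does not:
  (re-seq #"(?U)\w+" "Que tenga un buen d\u00EDa")
  ;; => ("Que" "tenga" "un" "buen" "día")
  )
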
(def split-into-words (partial re-seq #"\p{IsAlphabetic}+"))

(comment
  ;; Now split-into-words, implemented with a Unicode-aware regular
  ;; expression, works correctly with languages like Spanish:
  (split-into-words "Que tenga un buen día")
  ;; => ("Que" "tenga" "un" "buen" "día")
  )
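
;; One caveat, offered as a sketch and not as part of the original demo:
;; text stored in decomposed form (NFD) represents "í" as "i" followed by
;; U+0301 (combining acute accent). That mark is not alphabetic, so
;; \p{IsAlphabetic}+ would still split the word. Normalizing to NFC with
;; java.text.Normalizer first is one way around it. The name
;; split-into-words-nfc is hypothetical, introduced only for illustration.
(import '(java.text Normalizer Normalizer$Form))

(defn split-into-words-nfc [text]
  ;; Compose base letters and combining marks before matching.
  (re-seq #"\p{IsAlphabetic}+"
          (Normalizer/normalize text Normalizer$Form/NFC)))

(comment
  ;; Decomposed input, without normalization:
  (re-seq #"\p{IsAlphabetic}+" "di\u0301a")
  ;; => ("di" "a")

  ;; The same input, normalized to NFC first:
  (split-into-words-nfc "di\u0301a")
  ;; => ("día")
  )
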
;; Demonstration of regular expressions with Unicode support.
;; Splits a text into words.

(def split-into-words (partial re-seq #"\w+"))

(comment
  ;; split-into-words, implemented with the trivial regular expression
  ;; \w+, works fine in English:
  (split-into-words "Have a nice day.")
  ;; => ("Have" "a" "nice" "day")

  ;; But it fails with other languages, such as Spanish. In the
  ;; following example, the word "día" is split into "d" and "a":
  (split-into-words "Que tenga un buen día")
  ;; => ("Que" "tenga" "un" "buen" "d" "a")
  )

(def split-into-words (partial re-seq #"\p{IsAlphabetic}+"))

(comment
  ;; Now, by contrast, split-into-words, implemented with a regular
  ;; expression that uses Unicode support, works correctly with
  ;; languages such as Spanish:
  (split-into-words "Que tenga un buen día")
  ;; => ("Que" "tenga" "un" "buen" "día")
  )