An unhelpful summary of PyConES 2024

Last week I went to Vigo to attend the Spanish PyCon, PyConES. The trip was fully sponsored by Datamaran, the company I work for. I attended as a speaker, since a colleague and I submitted a talk proposal that got accepted (something something ESG with clustering and LLM reformulation).

Day 1 🌦

The first day had only two time slots for the workshop tracks. I attended:

Overcoming the One Billion Row Challenge with Python

Jordi Contestí, Kiko Correoso, Ernesto Coloma Rotger

Read more...

Taking a look at (some) tokenizers

Recently I have been working on writing a tokenizer from scratch in Rust. In the process, I wanted to really understand the implementation of some commonly used tokenizers. Fun, I know.

Moses Tokenizer

Webpage: http://www2.statmt.org/. The implementation that I’ll talk about is tokenizer.perl.

If you already know Moses, you know. Silent nod. For everyone else, Moses is an NLP framework written in Perl, focused on statistical machine translation. It is super easy to install anywhere and everyone loves it for that. Aside from language models and statistical machine translation utilities, it also includes tools to clean text, namely punctuation normalization, tokenization and corpus cleaning by limiting sentence length. Of these, I am concerned with the normalize-punctuation script and the tokenizer itself, because I don’t remember ever using the tokenizer without punctuation normalization.
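
To make those three cleaning steps concrete, here is a toy Python sketch of the general idea. It is not a port of the Moses scripts: the function names, the quote mapping, the regex and the length limits are all assumptions made up for illustration, and the real tokenizer handles many more cases.

```python
import re

# Hypothetical, heavily simplified stand-ins for the three cleaning steps.
# None of these rules are copied from Moses; they only show the shape of the pipeline.

# Map a few "fancy" quotation marks to their plain ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "“": '"', "”": '"', "«": '"', "»": '"',
    "‘": "'", "’": "'",
})


def normalize_punctuation(text):
    """Replace fancy quotes and collapse runs of whitespace."""
    text = text.translate(QUOTE_MAP)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def clean_corpus(sentences, min_len=1, max_len=80):
    """Normalize and tokenize each sentence, keeping those within the length limits."""
    kept = []
    for sentence in sentences:
        tokens = tokenize(normalize_punctuation(sentence))
        if min_len <= len(tokens) <= max_len:
            kept.append(tokens)
    return kept


if __name__ == "__main__":
    print(tokenize(normalize_punctuation("“Hello,   world!”")))
    print(clean_corpus(["a b c", "a b c d e f", ""], max_len=4))
```

Normalizing before tokenizing means the tokenizer only has to deal with one kind of quote and one kind of space, which is presumably why the two steps tend to travel together.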

Read more...

An unhelpful summary of AMTA 2022

Last September I virtually attended the 15th biennial conference of the Association for Machine Translation in the Americas, a.k.a. AMTA 2022. There I presented an adaptation of my master’s thesis, written in collaboration with my tutor (shameless self-promotion here). Even though I am not currently enrolled in any course, he helped me with several revisions and provided the budget for attendance, so I wanted to thank him before I move on.

Read more...