Me, lifting a cup of jujube tea.


Hi there, I’m Martín and this is my personal web thingy. I sometimes write about things that have caught my eye, most of which have to do with linguistics, computer science, both, or neither.

You can contact me through any of the social links in the About Me tab.

♦ ♦ ♦

Blog

An unhelpful summary of PyConES 2024

Last week I went to Vigo to attend the Spanish PyCon, PyConES. I traveled there fully sponsored by Datamaran, the company that I work for. I attended as a speaker, since a colleague and I prepared a talk proposal that got accepted (something something ESG with clustering and LLM reformulation).

Day 1 🌦

The first day had only two slots for workshop tracks. I attended:

Overcoming the One Billion Row Challenge with Python

Jordi Contestí, Kiko Correoso, Ernesto Coloma Rotger

Read more...

Taking a look at (some) tokenizers

Recently I have been working on writing a tokenizer from scratch in Rust. In the process, I wanted to really understand the implementation of some commonly used tokenizers. Fun, I know.

Moses Tokenizer

Webpage: http://www2.statmt.org/. The implementation that I’ll talk about is tokenizer.perl.

If you already know Moses, you know. Silent nod. For everyone else, Moses is an NLP framework written in Perl, focused on statistical machine translation. It is super easy to install anywhere and everyone loves it because of it. Aside from language models and statistical machine translation utilities, it also includes tools to clean text, namely punctuation normalization, tokenization, and corpus cleaning by limiting sentence length. Of these, I am concerned with the normalize-punctuation script and the tokenizer itself, because I don’t remember ever using the tokenizer without punctuation normalization.

Read more...

An unhelpful summary of AMTA 2022

Last September I virtually attended the 15th biennial conference of the Association for Machine Translation in the Americas, a.k.a. AMTA 2022. There I presented an adaptation of my master’s thesis in collaboration with my tutor (shameless self-promotion here). Even though I am not enrolled in any course at the moment, he helped me with several revisions and provided the budget for attendance, so I wanted to thank him before I move on.

Read more...

Things I have read (I)

This is the first of a series of posts where I’ll write some impressions on the latest books that I have read. It started as a personal log, but since I tend to read a ton of book reviews and they are my main way of discovering new stuff, I reckon that someone who reads it may also get something out of it.

The Grownup

I’m currently in Seoul for a while, and my choices for new stuff to read are limited to second-hand bookstores that have an English section, so not much reading in Spanish will be done until I am back home. So I was happy to find this one. I quite liked Gone Girl, and The Grownup seemed to be a nice light story. It reads easily and has enjoyable prose; there are a couple of nice surprises, and I finished it in the time that it takes me to go through a tea. Flynn seems to really like a specific type of character, which becomes that much more apparent now that I am reading Sharp Objects. Kind of like an actor who always plays themselves.

Read more...