What are embeddings?
Hello friends, it’s been a while since we’ve had a Normcore Tech. This has become my signature first line every time I release a newsletter, so sorry in advance, but also hi again and hope you’re doing well.
Last time we chatted, I was feeling lost on social media. The time before that, I was running a conference. And now, I am coming to your inbox because I’ve been working on something even bigger and more ridiculous.
Bigger than reverse-engineering the Duolingo owl? Bigger than reverse-engineering diamond hands? Yes and yes. You may be noticing a pattern here, and one of the themes of this newsletter has been that I like to reverse-engineer stuff until it makes sense to me.
The latest thing I reverse-engineered was large language models. I wrote a 70+ page paper about embeddings, for all audiences. Here is the site introducing the project. Here is the PDF. And here is the code. As I mentioned on Twitter, all of this is free, but it cost me everything.
A few years back at work, I started using embeddings, a foundational data structure that neural networks, including today's large language models, rely on. They were great, but a lot about the way they're presented in industry didn't make sense to me. So, I started to dig.
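If the term is new to you: an embedding is just a dense vector of numbers that stands in for a discrete item like a word, so that "similar" items end up near each other in that vector space. Here's a minimal, purely illustrative sketch (not from the paper, and with made-up toy data) of an embedding lookup table and a similarity check:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "dog", "pizza"]
dim = 4  # real models use hundreds or thousands of dimensions

# An embedding table maps each vocabulary item to a dense vector.
# Here the vectors are random; in a trained model they are learned.
embedding_table = {word: rng.normal(size=dim) for word in vocab}

def cosine_similarity(a, b):
    """How aligned two vectors are, from -1 (opposite) to 1 (same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(embedding_table["cat"])
# With trained embeddings, you'd expect cat/dog to score higher than cat/pizza;
# with random vectors like these, the numbers are meaningless.
print(cosine_similarity(embedding_table["cat"], embedding_table["dog"]))
```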
What started as a simple blog post grew longer and longer and morphed into a survey paper written in LaTeX: a deep dive into how embeddings work, why we use them in a business context, their engineering considerations, a history of NLP (kind of like this one), and how we use them in recommender systems.
Over the past year, I’ve worked on refining this understanding, stealing away hours before the kids woke up at five in the morning to reason through hundreds of academic papers and tie together lots of loose threads. Even though it doesn’t contribute any new research to the field, it was a very grueling process, one I imagine many researchers go through on a regular basis. At one point I texted my friends, who have PhDs, and asked, “Is it supposed to be this hard and lonely? Was it this bad for you?” and they said yes and then told me never to bring up their dissertations again.
But it’s also been extremely rewarding to ignore the unbearable noise of the LLM hype train on social media and really dive deep into something I wanted to understand, and, hopefully, come out mostly unscathed on the other side with a much deeper understanding.
Now I’m done, and it’s here. Although it was mostly a learning process for myself, I hope that others looking to learn more about embeddings get something out of it, too.
Enjoy! See you in the latent space, and I’m hoping to put out a real newsletter soon, once I stop seeing LaTeX symbols in my mind.