Neural nets are just people all the way down
When I was in seventh grade, we had to take a class called home ec. Everyone brushed it off as a super easy class. “All you have to do is cook and sew,” everyone said. One of our first projects, after learning how sewing machines work, was sewing a pair of pajama pants.
You’d think it’s a pretty simple process. Cut the front and the bottom out according to a pattern. Sew them together. Insert an elastic string so you can tie the pants. Easy peasy. The project took about a month of work, and, if I recall correctly, I ended up getting an 86 on it, to the mortification of my immigrant parents, who didn’t accept anything below a 95. The pajama pants fell apart after about five months of wearing them. Also, they were too short.
I was reminded of this when I was browsing MetaFilter a couple weeks ago and read this compilation of links about how you can’t automate sewing.
It begins with a link to a tweet,
The thread continues, “But yes. When you go into a store - H&M, F21, Macy's, Bergdorf's -- every piece of clothing you see was assembled by a person.”
I was absolutely floored by this. I consider myself a (relatively) educated person, but I simply had no idea that my clothes were mostly made by people.
It turns out that the reason for this is that robots are terrible at stuff that is completely easy for people to do:
...garment creation is an industry that has proven impossible to automate. Human intuition and dexterity when it comes to manipulating fabrics is difficult to program efficiently for a variety of reasons:
1. The variables of fabrics are vast. Fabrics have different levels of stretch, thickness and weaves. Fabrics can crease, fabrics can fold, and there are often imperfections and pulls that occur.
2. The garments being manufactured change frequently. Two words: Fast Fashion. H&M and Forever21 both get DAILY shipments of new styles. With each new style, an automated sewing machine would need to be programmed with a new set of rules.
3. The movements required to fabricate are complex. While sewing you are often pulling or easing the fabric. Pieces of fabric have to be lined up perfectly or panels won’t match, buttons and holes won’t align, and even something as simple as a zipper won’t work.
Now, every time I put on a piece of clothing, I think about the fact that it was put together by hand.
This led me down the rabbit hole of watching videos of robots folding towels. They really suck at it (and by the way, the video is shown at 50x speed; it actually takes an excruciatingly long time to fold a single towel).
This, in turn, made me think about parts of a field I’m familiar with, machine learning, that other people may not realize are also done mostly by hand.
So let’s talk about creating training data sets.
A training data set is an example data set that’s used in teaching a machine learning model to make correct predictions. Say you have an app, HotDog or NotHotDog, like in the show Silicon Valley:
In order to identify whether something is or is not a hot dog, the neural net needs to know what a “hot dog” looks like.
For humans, it’s easy. But computers need to understand the dimensions of the hot dog: how long hot dogs can be, how wide, what color, whether they usually have toppings, and on and on, so that when it looks at a picture of something, it can say: well, this piece of food is longer than it is wide, has a pink piece of something in the middle between two pieces of bread, and is kind of round at the ends, therefore I’m pretty sure it’s a hot dog.
But how does a computer program know what a hot dog looks like? How long or short it is? By comparing it to other pictures of hot dogs. The neural net compares the pixels of the labeled hot dog images with those of the new mystery food, pixel by pixel (with lots of math mixed in), and if the shapes look alike, it classifies the new food correctly.
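If it helps to see the “compare against labeled examples” idea in code, here’s a minimal sketch of my own (not the actual Not Hotdog approach, which uses a real neural net): a nearest-neighbor classifier that gives a mystery image the label of the training image it most resembles, pixel by pixel. The random arrays below are stand-ins for real hand-labeled photos.

```python
import numpy as np

def nearest_label(mystery, labeled_images, labels):
    """Return the label of the training image whose pixels differ least from mystery."""
    # squared pixel-by-pixel differences, summed per training image
    diffs = ((labeled_images - mystery) ** 2).sum(axis=(1, 2, 3))
    return labels[int(np.argmin(diffs))]

rng = np.random.default_rng(0)
# Pretend 32x32 RGB images; in reality these arrays would come from hand-labeled photos.
hot_dogs = rng.random((10, 32, 32, 3))
not_hot_dogs = rng.random((10, 32, 32, 3))

training_images = np.concatenate([hot_dogs, not_hot_dogs])
training_labels = ["hot dog"] * 10 + ["not hot dog"] * 10

mystery_food = rng.random((32, 32, 3))
print(nearest_label(mystery_food, training_images, training_labels))
```

The important part isn’t the math. It’s that none of this works without the labels, and the labels come from people.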
(If you are interested in the nitty gritty in a very well-explained way, I strongly recommend checking out Brandon’s videos on how neural nets work.)
Ok, but where do the images of hot dogs and pizza come from?
Well, initially, they came from a database called ImageNet. ImageNet was the brainchild of Dr. Fei-Fei Li, a professor working in AI at Stanford.
Wired has a good article about the project:
At a time when most AI research focused on models and algorithms, Li wanted to expand and improve the data available to train AI algorithms.
Her next challenge was to get the giant thing built. That meant a lot of people would have to spend a lot of hours doing the tedious work of tagging photos. Li tried paying Princeton students $10 an hour, but progress was slow going. Then a student asked her if she’d heard of Amazon Mechanical Turk.
Suddenly she could corral many workers, at a fraction of the cost. But expanding a workforce from a handful of Princeton students to tens of thousands of invisible Turkers had its own challenges. Li had to factor in the workers’ likely biases. “Online workers, their goal is to make money the easiest way, right?” she says. “If you ask them to select panda bears from 100 images, what stops them from just clicking everything?” So she embedded and tracked certain images—such as pictures of golden retrievers that had already been correctly identified as dogs—to serve as a control group.
If the Turkers labeled these images properly, they were working honestly.
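That “embed known images” trick is simple enough to sketch. Here’s a rough, hypothetical version of the quality check (the function name, data, and threshold are all invented for illustration, not ImageNet’s actual pipeline):

```python
# Mix images with already-known labels into each worker's batch, then score the
# worker only on those control images.
def worker_is_honest(answers, gold_labels, min_accuracy=0.9):
    """answers and gold_labels both map image ids to labels for the control images."""
    checked = [img for img in gold_labels if img in answers]
    if not checked:
        return False  # no control images answered, so no evidence either way
    correct = sum(answers[img] == gold_labels[img] for img in checked)
    return correct / len(checked) >= min_accuracy

gold = {"img_001": "dog", "img_042": "dog", "img_077": "not a dog"}
submitted = {"img_001": "dog", "img_042": "dog", "img_077": "dog", "img_100": "dog"}
print(worker_is_honest(submitted, gold))  # False: they clicked "dog" on a known non-dog
```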
The article goes on,
In 2009, Li’s team felt that the massive set—3.2 million images—was comprehensive enough to use, and they published a paper on it, along with the database. (It later grew to 15 million.) At first the project got little attention. But then the team had an idea: They reached out to the organizers of a computer-vision competition taking place the following year in Europe and asked them to allow competitors to use the ImageNet database to train their algorithms. This became the ImageNet Large Scale Visual Recognition Challenge.
And finally,
In 2012, University of Toronto researcher Geoffrey Hinton entered the ImageNet competition, using the database to train a type of AI known as a deep neural network. It turned out to be far more accurate than anything that had come before—and he won. Li hadn’t planned to go see Hinton get his award; she was on maternity leave, and the ceremony was happening in Florence, Italy. But she recognized that history was being made. So she bought a last-minute ticket and crammed herself into a middle seat for an overnight flight.
Li went on to do a bunch of other cool AI stuff. Eventually, as is the case with many academics who work in and around Silicon Valley, she caught the attention of Google, and went on to work there. Eventually, she became caught up in a bit of controversy around Google and Project Maven (but I’ll save that deep dive for another newsletter.)
The point is that ImageNet became the gold standard in machine learning research. Li rightly gained fame because she (and her graduate students) had spearheaded the creation of a clean, large dataset of correctly labeled images.
But the dataset was, really, created by hundreds of thousands of people manually identifying what the pictures were. To date, more than 14 million images have been labeled for ImageNet, aka by people from around the world looking at images and clicking on buttons for cents.
And, let’s dig a bit deeper. Where did the images for ImageNet come from? The ImageNet paper (always an interesting read) says:
So the images were collected by putting something called a WordNet synonym set into search engines, which returned images for those words. But what’s WordNet, and how was it built?
WordNet is a special kind of word database that was put together by Princeton researchers George Miller and Christiane Fellbaum to be both a dictionary and a thesaurus for specific words. For example, if you put “pizza” into WordNet online, you get back something that looks like this:
All the words under “S” are the “synsets”, or “groupings of synonymous words that express the same concept.” So pizza and pizza pie are the same. But who came up with the idea that pizza and pizza pie are the same? Who made the choice to include Sicilian pizza as a hyponym, and who defined it as a pizza made with a thick crust?
Some of this was the work of computer programs, but much of it, as Miller and Fellbaum describe in their book “WordNet” (a really interesting read), was based on things that people were pulling together from various dictionaries, various word corpuses, and manual classification. (By the way, as is the case with a lot of professors, writers, etc., Miller mentions that his wife helped with WordNet, his most famous work. She’s not mentioned in the obituary Princeton wrote about him.)
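(You can also poke at the resulting synsets yourself. Here’s a quick sketch using NLTK’s WordNet interface; you’ll need to run nltk.download("wordnet") once first, and the exact hyponyms you get back depend on the WordNet version bundled with NLTK.)

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets("pizza"):
    print(synset.name(), "-", synset.definition())
    for hyponym in synset.hyponyms():  # e.g. Sicilian pizza and friends
        print("   ", hyponym.lemma_names(), "-", hyponym.definition())
```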
Ok! Now we’re getting to the bottom, right? We have something called the Brown Corpus, which identifies how words are tagged. But who made the Brown Corpus?
After joining the faculty of Brown, Francis took a course in computational linguistics from Henry Kučera, who taught as a member of the Slavic Department staff. In the early 1960s, they began collaborating on compiling a one-million-word computerized cross-section of American English, which was entitled the Brown Standard Corpus of Present-Day American English, but commonly known as the Brown Corpus. The work was compiled between 1963 and 1964, using books, magazines, newspapers, and other edited sources of informative and imaginative prose published in 1961. Once completed, the Brown Corpus was published in 1964. Each word in the corpus is tagged with its part of speech and the subject matter category of its source.
Wikipedia doesn’t say that Kučera and Francis used graduate students to help with this monstrous task, but it’s safe to say that it probably was the case.
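(That hand-tagged corpus is still around, by the way. It ships with NLTK, so you can look at the 1960s work directly; a quick sketch, assuming you’ve run nltk.download("brown") once:)

```python
from nltk.corpus import brown

print(brown.categories()[:5])                     # source categories, e.g. 'news', 'fiction'
print(brown.tagged_words(categories="news")[:8])  # (word, part-of-speech tag) pairs
```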
Ok! Now we’re at the bottom.
So, if you’re doing image recognition in 2019, it’s highly likely you’re using an image recognition system built from images tagged by people on Mechanical Turk in 2007, which in turn sits on top of language classification systems built by graduate students prowling newspaper clippings in the 1960s.
Simply put, every single piece of decision-making in a high-tech neural network initially rests on a human being manually putting something together and making a choice.
As this fantastic essay on the topic says,
At their core, training sets for imaging systems consist of a collection of images that have been labeled in various ways and sorted into categories. As such, we can describe their overall architecture as generally consisting of three layers: the overall taxonomy (the aggregate of classes and their hierarchical nesting, if applicable), the individual classes (the singular categories that images are organized into, e.g., “apple,”), and each individually labeled image (i.e., an individual picture that has been labeled an apple). Our contention is that every layer of a given training set’s architecture is infused with politics.
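To picture those three layers concretely, here’s a toy sketch of my own (purely illustrative, not how ImageNet actually stores anything): a taxonomy at the top, classes inside it, and individually labeled images inside each class.

```python
training_set = {
    "food": {                                        # taxonomy node
        "apple": ["img_0001.jpg", "img_0002.jpg"],   # class -> individually labeled images
        "hot dog": ["img_0003.jpg"],
    },
    "animal": {
        "golden retriever": ["img_0004.jpg"],
    },
}
# Every level here was a human choice: which categories exist, how they nest,
# and which label each individual picture gets.
```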
Lately, these systems have gotten a ton of bad press (check Rachel’s wonderful site for resources on ethics around AI).
To call attention to the mislabels that these people-built systems have applied to various objects, people, and faces, a recent project, ImageNet Roulette, invited people to upload their images and labeled them in return:
The site, ImageNet Roulette, pegged one man as an “orphan.” Another was a “nonsmoker.” A third, wearing glasses, was a “swot, grind, nerd, wonk, dweeb.”
ImageNet Roulette is a digital art project intended to shine a light on the quirky, unsound and offensive behavior that can creep into the artificial-intelligence technologies that are rapidly changing our everyday lives, including the facial recognition services used by internet companies, police departments and other government agencies.
In light of all of this swirling around the ecosystem, ImageNet, with Dr. Li again involved, is rethinking how it does things.
Over the past year, we have been conducting a research project to systematically identify and remedy fairness issues that resulted from the data collection process in the person subtree of ImageNet. So far we have identified three issues and have proposed corresponding constructive solutions, including removing offensive terms, identifying the non-visual categories, and offering a tool to rebalance the distribution of images in the person subtree of ImageNet.
While conducting our study, since January 2019 we have disabled downloads of the full ImageNet data, except for the small subset of 1,000 categories used in the ImageNet Challenge. We are in the process of implementing our proposed remedies.
How is the team doing this?
So far out of 2,832 synsets within the person subtree we’ve identified 1,593 offensive synsets (including “unsafe” and “sensitive”).
As with all things related to high technology (and sewing), they’re starting by hand.
Art: Weaver, Van Gogh, 1884
What I’m reading lately
Pinterest…the normcore of the worst offender tech sites
Journalists can do more in the current anti-tech climate
A thread on coding music (if you haven’t seen David Beazley’s Python talks, highly recommended)
This book looks interesting:
About the Author and Newsletter
I’m a data scientist in Philadelphia. This newsletter is about issues in tech that I’m not seeing covered in the media or blogs and want to read about. Most of my free time is spent wrangling a preschooler and an infant, reading, and writing bad tweets. I also have longer opinions on things. Find out more here or follow me on Twitter.
If you like this newsletter, forward it to friends!