Embeddings have pervaded the data scientist's toolkit and dramatically changed how NLP, computer vision, and recommender systems work. However, many data scientists find them arcane and confusing, and many more use them blindly without understanding what they are. In this article, we'll dive deep into what embeddings are, how they work, and how they are often operationalized in real-world systems.

To understand embeddings, we must first understand the basic requirements of a machine learning model. Specifically, most machine learning algorithms can only take low-dimensional numerical data as inputs. In the neural network below, each of the input features must be numeric.

*[Figure: a neural network whose input features are all numeric]*

That means that in domains such as recommender systems, we must transform non-numeric variables (e.g., items and users) into numbers and vectors.

We could try to represent items by a product ID; however, neural networks treat numerical inputs as continuous variables. That means higher numbers are treated as "greater than" lower numbers, and numbers that are close together are treated as similar items. This makes perfect sense for a field like "age," but it is nonsensical when the numbers represent a categorical variable.

Prior to embeddings, one of the most common methods was one-hot encoding. This technically works in turning a category into a set of continuous variables, but we end up with a huge vector of 0s with a single 1 (or a handful of 1s). For variables with many unique categories, it creates an unmanageable number of dimensions. And since each item is technically equidistant in vector space, the encoding omits any context around similarity: closely related categories end up no closer together than categories that have nothing in common. This means that the terms "Hotdog" and "Hamburger" are no closer together than "Hotdog" and "Pepsi," so we have no way of evaluating the relationship between two entities.

We could generate more one-to-one mappings, or attempt to group categories and look for similarities, but this requires extensive manual labeling that is typically infeasible. Intuitively, what we want is a denser representation of the categories, one that maintains some of the implicit relationship information between items. We need a way to reduce the number of dimensions so that we can place items of similar categories closer together.
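To make the equidistance problem concrete, here is a minimal sketch in plain NumPy. The 2-dimensional "embedding" values are invented purely for illustration (they are not the output of any trained model), but the one-hot distances are exactly what the encoding produces:

```python
import numpy as np

items = ["Hotdog", "Hamburger", "Pepsi"]

# One-hot encoding: each item becomes a vector of 0s with a single 1.
one_hot = {name: vec for name, vec in zip(items, np.eye(len(items)))}

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart, so the
# encoding says nothing about which items are actually similar.
print(np.linalg.norm(one_hot["Hotdog"] - one_hot["Hamburger"]))  # 1.414...
print(np.linalg.norm(one_hot["Hotdog"] - one_hot["Pepsi"]))      # 1.414...

# A dense embedding (hypothetical hand-picked values, for illustration only)
# can instead place related items near each other:
embedding = {
    "Hotdog":    np.array([0.9, 0.1]),
    "Hamburger": np.array([0.8, 0.2]),
    "Pepsi":     np.array([0.1, 0.9]),
}
print(np.linalg.norm(embedding["Hotdog"] - embedding["Hamburger"]))  # ~0.14, similar
print(np.linalg.norm(embedding["Hotdog"] - embedding["Pepsi"]))      # ~1.13, dissimilar
```

The point is not the particular numbers but the geometry: in the one-hot space every pair of items looks the same, while in the dense space the distance itself encodes how related two items are.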