In this set of videos, we'll reintroduce the concept of unsupervised learning and what it entails. This will serve as the foundation for the remainder of this course. In the last set of courses, we dove into the algorithms available when we have a known outcome in our dataset. In this course, we're going to talk about a whole other class of machine learning algorithms called unsupervised learning. This class of algorithms is relevant when we don't have an outcome we are trying to predict, but rather are interested in finding structure within our dataset, and perhaps want to partition our dataset into smaller pieces. Now, there are a couple of major use cases for unsupervised learning. One popular use case is clustering, where we use our unlabeled data to identify an unknown structure; an example of this may be segmenting our customers into different groups. The other major use case for unsupervised algorithms is dimensionality reduction, namely using structural characteristics to reduce the size of our dataset without losing much of the information contained in the original dataset. Now, in regards to clustering, we'll be covering the k-means algorithm, the hierarchical agglomerative clustering algorithm, the DBSCAN algorithm, and the mean shift algorithm. Then, in regards to dimensionality reduction, we'll be covering principal component analysis, or PCA, as well as non-negative matrix factorization. Now, we won't go into the details of all of these here, but we'll go into each of them in more depth as we get through these videos. Now, just to give you some intuition as to why dimensionality reduction is important, let's talk about the infamous curse of dimensionality, or at least infamous for those of us in these circles. Now, dimensionality refers to the number of features in our data.
Theoretically, and in ideal situations, the more features we have, the better the model should perform, since models have more things to learn from and should therefore be more successful. However, real life is more complicated than that. There are several reasons why too many features may end up leading to worse performance in practice. Maybe some of those features are spurious correlations, meaning they correlate within your dataset, but maybe not outside your dataset as new data comes in. Too many features may also create more noise than signal, and algorithms find it harder to sort through non-meaningful features when there are too many of them. On top of that, the number of training examples required increases exponentially with the dimensionality. This becomes especially clear when we think about distance-based algorithms, such as the k-nearest neighbors algorithm that we talked about in our last course. Imagine that we have a survey question with 10 possible responses. To get 60 percent coverage of those 10 possible responses, we only need six different people to answer. If we add on another survey question with 10 possible response values, there are now 100 possible combinations, so in order to get that same 60 percent coverage, so that your k-nearest neighbors are at the same distance from whatever new value comes in, we would need 60 people to respond. We need 60 different rows of data in order to get the same coverage that we had with just six rows in one dimension. Then you can imagine that we increase that to three dimensions, so we have three different survey questions, each one with 10 possible positions, for 1,000 possible combinations. In order to get that same coverage, with each neighbor as close as it was in that original one dimension with only 10 positions, we would need 600 different rows. You see how the more dimensions you add on, the more rows you need, the more data you need to get that same amount of coverage.
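To make the survey arithmetic concrete, here is a minimal sketch. The function name `rows_needed` and its parameters are made up for this illustration, and it assumes "coverage" simply means the fraction of all possible response combinations that appear in at least one row.

```python
def rows_needed(coverage_pct: int, values_per_feature: int, n_features: int) -> int:
    """Rows needed to cover a given percentage of all possible combinations.

    Assumption for this sketch: every feature takes the same number of
    discrete values, and each row lands on a distinct combination.
    """
    # Each added feature multiplies the number of possible combinations.
    total_combinations = values_per_feature ** n_features
    # Ceiling division in integers, so no floating-point rounding is involved.
    return -(-coverage_pct * total_combinations // 100)


# One, two, and three survey questions with 10 possible responses each.
for n_features in (1, 2, 3):
    print(n_features, "feature(s):", rows_needed(60, 10, n_features), "rows")
```

Under these assumptions the loop reproduces the 6, 60, and 600 rows from the survey example, and each additional feature multiplies the requirement by another factor of 10.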
On top of that, higher dimensions will often lead to slower performance, as dealing with more columns is more computationally expensive. The incidence of outliers also tends to increase as the number of dimensions increases. To mitigate some, but not all, of the problems I just mentioned, one usually needs a lot of rows to train on, which may not be possible in real life. You may not be able to gather those 600 different examples, and in practice we would usually have many more than three dimensions, so we would need that many more rows to get a certain amount of coverage. Therefore, we often need to reduce the dimensionality of our dataset. So far we've seen feature selection as a way of achieving this. In this course, we'll discuss how we can accomplish the same goal using unsupervised machine learning models, such as principal component analysis, or PCA, which we just mentioned. Now, to think about this in a real-life example, this curse of dimensionality comes up often in applications. If we consider the customer churn example that we discussed in earlier courses, the original dataset had 54 different columns, so 54 different features, and some, like age, under 30, and senior citizen, will obviously be very closely related. Others, such as latitude, are essentially duplicated throughout the dataset. Even if we remove duplicates and non-numeric columns, this curse of dimensionality can still apply; we can still have too many columns even if they are not perfectly correlated. As for things that we can do with this churn dataset, clustering can help identify groups of similar customers without us thinking about whether or not they churn, which may allow us to segment our customers into different groupings.
Then dimensionality reduction can improve both the performance, because reducing the number of features speeds things up, and the interpretability of each of the groupings that we just came up with. Now, just a high-level overview. When we're working with unsupervised learning, we start off with an unlabeled dataset. We then fit the model that we choose to that unlabeled dataset, and we get our fitted model. Then, once we have that model, we can look at new data. We're still working with unlabeled data, but we take this new data, use the model that we just fit, and use it to predict the groupings for the new data, or to apply the dimensionality reduction to it, depending on whether we are doing clustering or dimensionality reduction. As an example for clustering, say we want to group news articles by topic, and we don't have those topics as labels. So our starting point is text articles of unknown topics. We then create our model, whether it's k-means or any of the other models that we will discuss, in order to see what groupings we naturally find in our dataset. We fit that model to the dataset, so that according to certain features within these articles, such as certain words showing up, we come up with certain groupings. We then take another group of text articles of unknown topics. We use the model that we just fit, and that model can look at certain words, certain features, in order to determine the groupings. We can then predict which of the new articles are similar to which of the articles that we had in our original dataset.
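The fit-then-predict workflow for the news article example can be sketched roughly as follows. Everything here is hypothetical and chosen only for illustration: the `TinyKMeans` class, the `vectorize` helper, the hand-picked `vocabulary`, and the toy articles. It is a stripped-down k-means with a deterministic start, not the implementation a library would use, although library estimators such as scikit-learn's `KMeans` expose the same `fit` and `predict` steps.

```python
from collections import Counter


def vectorize(text, vocabulary):
    """Turn an article into word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinyKMeans:
    """Hypothetical minimal k-means, only to illustrate fit and predict."""

    def __init__(self, n_clusters, n_iter=10):
        self.n_clusters = n_clusters
        self.n_iter = n_iter
        self.centroids = []

    def fit(self, points):
        # Deterministic start (an assumption of this sketch):
        # use the first n_clusters points as the initial centroids.
        self.centroids = [list(p) for p in points[: self.n_clusters]]
        for _ in range(self.n_iter):
            labels = self.predict(points)
            for k in range(self.n_clusters):
                members = [p for p, label in zip(points, labels) if label == k]
                if members:  # keep the old centroid if a cluster is empty
                    self.centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        return self

    def predict(self, points):
        # Assign each point to the index of its nearest centroid.
        return [
            min(range(self.n_clusters), key=lambda k: squared_distance(p, self.centroids[k]))
            for p in points
        ]


# Hand-picked vocabulary and toy articles with no topic labels.
vocabulary = ["team", "goal", "match", "bank", "rates", "markets"]
train_articles = [
    "the team won the match with a late goal",
    "the bank raised interest rates as markets fell",
    "the striker scored a goal and the team celebrated",
    "markets rallied after the bank cut interest rates",
]
new_articles = [
    "the goal decided the match",
    "the bank expects rates to rise",
]

# Fit on the original unlabeled articles.
model = TinyKMeans(n_clusters=2)
model.fit([vectorize(a, vocabulary) for a in train_articles])

# Reuse the fitted model on new, still unlabeled articles.
train_labels = model.predict([vectorize(a, vocabulary) for a in train_articles])
new_labels = model.predict([vectorize(a, vocabulary) for a in new_articles])
print(train_labels, new_labels)
```

The cluster numbers themselves carry no meaning; what matters is which articles share a number. In this toy data the two sports-style articles end up in one group and the two finance-style articles in the other, and each new article is assigned to the group whose word counts it most resembles.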