{"id":22787,"date":"2023-07-02T23:55:24","date_gmt":"2023-07-02T23:55:24","guid":{"rendered":"https:\/\/www.goodacademic.com\/blog\/questions\/data-mining-this-assignment-is-about-two-things-1-doing-a-simple-knn-analysis-using-a-built-in-package-from-scikit-learn-sklearn-2-making-an-elbow-plot\/"},"modified":"2023-07-02T23:55:24","modified_gmt":"2023-07-02T23:55:24","slug":"data-mining-this-assignment-is-about-two-things-1-doing-a-simple-knn-analysis-using-a-built-in-package-from-scikit-learn-sklearn-2-making-an-elbow-plot","status":"publish","type":"questions","link":"https:\/\/www.goodacademic.com\/blog\/questions\/data-mining-this-assignment-is-about-two-things-1-doing-a-simple-knn-analysis-using-a-built-in-package-from-scikit-learn-sklearn-2-making-an-elbow-plot\/","title":{"rendered":"Data Mining  This assignment is about two things: (1) doing a simple KNN analysis using a built-in package from Scikit-learn\/sklearn (2) making an \u2018elbow\u2019 plot."},"content":{"rendered":"<p>.<span style=\"font-size: unset; color: var(--color-1); font-family: var(--ion-font-family, inherit); text-align: initial;\">Data Mining<\/span><\/p>\n<p style=\"line-height: 1.2; cursor: auto; font-size: unset;\"><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">This assignment is about two things:<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">(1) doing a simple KNN analysis using a built-in package from Scikit-learn\/sklearn<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">(2) making an \u2018elbow\u2019 plot.<\/span><\/p>\n<p 
style=\"line-height: 1.2; cursor: auto; font-size: unset;\"><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">(1)<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">(a)<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">Now before we get into the data mining, let\u2019s first \u2018pre-process\u2019 the dataset. So, we<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">pretend we only have access to 70 percent of this iris data set.<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">If you think of the entire data set as a dataframe, then you are going to only use<span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">the top<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span>70% of the <\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">rows for your \u2018data mining\u2019 purpose. 
You will reserve the bottom 30% of the rows as the data<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">set to see how well your model performs. To make it more relatable, imagine the data vendor<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">tells you that the bottom 30% is premium data which you would have to pay a lot of money for<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">the access. For the sake of clarity, the top 70% of the data is called \u2018seen_data_set\u2019 and the bottom <\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">30% is called \u2018unseen_data_set\u2019.<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">Please write a few lines of code to do this. It should be easy.<\/span><\/p>\n<p style=\"line-height: 1.2; cursor: auto; font-size: unset;\"><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">(b)<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">Now we are going to further split this seen_data_set into 80\/20. 
Make sure<span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">to use<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span>random_state =<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">1234 in the split for the result to be reproducible. Set the K value to be 5. Find the <\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">corresponding test_score, which measures the precision of your prediction on the test data set.<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><\/p>\n<p style=\"line-height: 1.2; cursor: auto; font-size: unset;\"><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">(2)<\/span><\/p>\n<p style=\"line-height: 1.2; cursor: auto; font-size: unset;\"><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">Please write a short report to answer this question.<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">The test_score you obtain in the first part is probably very good. <\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">But how does the K value affect the precision of your prediction? 
Suppose K value varies from 3 to<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">15.<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">Also, how good is it when we apply our mining model to the \u2018unseen_data_set\u2019? Imagine the<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">bottom 30% of the original data is so prohibitively expensive that you are going to have to<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">predict it by yourself.<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">Make a plot where the X measures how K value<span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">varies,<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span>and Y measures the precision score of<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">the test_data_set of \u2018seen_data_set\u2019.<\/span><span style=\"line-height: 21.6px; 
cursor: auto; font-size: unset;\"><br style=\"cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">Make a plot where the X measures how K value varies and Y <\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">measures the precision score when<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">you apply your model to the unseen data set.<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">Write the code required to make these two plots, include the two plots in the report,<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\"><span style=\"cursor: auto; font-size: unset;\">&nbsp;<\/span><\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">and make a few comments on both of the plots<\/span><span style=\"line-height: 21.6px; cursor: auto; font-size: unset;\">.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data Mining This assignment is about two things: (1) doing a simple KNN analysis using a built-in package from Scikit-learn\/sklearn (2) making an \u2018elbow\u2019 plot. (1)(a) Now before we get into the data mining, let\u2019s first \u2018pre-process\u2019 the dataset. 
So, we&nbsp;pretend we only have access to 70 percent of this iris data set.&nbsp;If you think of the entire [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"closed","template":"","meta":[],"disciplines":[734],"paper_types":[],"tagged":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/questions\/22787"}],"collection":[{"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/questions"}],"about":[{"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/types\/questions"}],"author":[{"embeddable":true,"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/comments?post=22787"}],"version-history":[{"count":0,"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/questions\/22787\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/media?parent=22787"}],"wp:term":[{"taxonomy":"disciplines","embeddable":true,"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/disciplines?post=22787"},{"taxonomy":"paper_types","embeddable":true,"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/paper_types?post=22787"},{"taxonomy":"tagged","embeddable":true,"href":"https:\/\/www.goodacademic.com\/blog\/wp-json\/wp\/v2\/tagged?post=22787"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}