Feature Selection

Originally published at dev.to

Sometimes there is a lot of 'noise' in the data, meaning columns that are not relevant to the target variable. Feature selection methods measure the impact of each column on the outcome variable and keep only the columns whose impact is high enough.

There are methods that iterate through all the columns and identify not only which individual columns are relevant, but also which combinations of columns affect the target variable the most. For example, take the question "what is the likelihood of a user posting on social media?", with the columns "amount-of-posts-seen-before", "internet-connection-quality", "followings-activity", and "time-of-day". Suppose "amount-of-posts-seen-before", "internet-connection-quality", and "followings-activity" are determined to be the relevant columns. It can still turn out that the subset of just "amount-of-posts-seen-before" and "internet-connection-quality" is more predictive of the target variable than all three relevant columns together. In that case, after feature selection the data that is kept would be the columns "amount-of-posts-seen-before" and "internet-connection-quality", along with the target variable.
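The subset search described above can be sketched in a few lines. This is a toy illustration, not a production method: the dataset is synthetic (generated so that only two of the four columns actually drive the target), each candidate subset is scored by the R² of a plain least-squares fit, and a small per-column penalty is an assumed stand-in for the complexity penalties real selectors use.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic versions of the columns from the example above.
posts_seen = rng.normal(size=n)      # amount-of-posts-seen-before
connection = rng.normal(size=n)      # internet-connection-quality
followings = rng.normal(size=n)      # followings-activity
time_of_day = rng.normal(size=n)     # time-of-day

# In this toy data, only the first two columns truly influence the target,
# mirroring the post's final selection.
y = 2.0 * posts_seen + 1.5 * connection + rng.normal(scale=0.1, size=n)

X = np.column_stack([posts_seen, connection, followings, time_of_day])
names = ["amount-of-posts-seen-before", "internet-connection-quality",
         "followings-activity", "time-of-day"]

def r_squared(cols):
    """Fit least squares on the chosen columns (plus intercept), return R^2."""
    A = np.column_stack([X[:, cols], np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

# Exhaustive search over every non-empty subset of columns; the 0.01-per-column
# penalty is an arbitrary choice that discourages keeping irrelevant features.
best = max(
    (cols for r in range(1, X.shape[1] + 1)
          for cols in combinations(range(X.shape[1]), r)),
    key=lambda cols: r_squared(cols) - 0.01 * len(cols),
)

selected = [names[i] for i in best]
print(selected)
```

Exhaustive search is only feasible for a handful of columns (2^k subsets for k columns); real libraries use greedy forward/backward selection or per-feature scores instead.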

