Feature Selection

Originally published at dev.to

Sometimes there is a lot of 'noise' in the data. By noise I mean columns that are not relevant to the target variable. Feature selection methods estimate the impact of each column on the target variable and then keep only the columns whose impact is high enough.
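As a rough illustration of scoring each column on its own, here is a minimal sketch that ranks columns by the absolute Pearson correlation with the target and keeps those above a cutoff. The sample values, the `0.3` threshold, and the helper names `pearson`, `relevance_scores`, and `select_columns` are assumptions made for this example, and Pearson correlation is only one possible score (it only captures linear relationships).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no information about the target.
        return 0.0
    return cov / sqrt(var_x * var_y)


def relevance_scores(columns, target):
    """Score every column by the absolute correlation with the target."""
    return {name: abs(pearson(values, target)) for name, values in columns.items()}


def select_columns(columns, target, threshold):
    """Keep the columns whose score reaches the (hypothetical) threshold."""
    scores = relevance_scores(columns, target)
    return [name for name in columns if scores[name] >= threshold]


# Made-up sample data: 1 means the user posted, 0 means they did not.
target = [0, 0, 1, 1, 0, 1, 1, 0]
columns = {
    "amount-of-posts-seen-before": [1, 2, 8, 9, 2, 7, 9, 1],
    "internet-connection-quality": [2, 3, 4, 5, 1, 5, 4, 2],
    "time-of-day": [3, 15, 9, 21, 21, 3, 15, 9],
}

selected = select_columns(columns, target, threshold=0.3)
print(selected)
```

In this made-up data, "time-of-day" is constructed to be uncorrelated with the target, so it falls below the cutoff and is dropped as noise.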

There are also methods that iterate through the columns and identify not only which individual columns are relevant but also which combination of columns affects the target variable the most. For example, suppose the question is "what is the likelihood of a user posting on social media?" and the columns are "amount-of-posts-seen-before", "internet-connection-quality", "followings-activity", and "time-of-day". Let's say "amount-of-posts-seen-before", "internet-connection-quality", and "followings-activity" are determined to be the relevant columns. However, it could be the case that the subset of "amount-of-posts-seen-before" and "internet-connection-quality" alone has more impact on the target variable than all of the relevant columns together. In that case, the data kept after feature selection would be the columns "amount-of-posts-seen-before" and "internet-connection-quality", along with the target variable.
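The search over combinations can be sketched as an exhaustive subset search: score every non-empty subset of columns and keep the best one. This is only a sketch under stated assumptions. The data values are made up and assumed to be pre-scaled to the range 0 to 1, the helper names `loo_1nn_accuracy` and `best_subset` are hypothetical, and the score used here (leave-one-out accuracy of a 1-nearest-neighbour classifier) is just one possible choice of scoring function.

```python
from itertools import combinations


def loo_1nn_accuracy(rows, target, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the columns in `subset`."""
    correct = 0
    for i, row in enumerate(rows):
        nearest, nearest_dist = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue  # never compare a row with itself
            dist = sum((row[c] - other[c]) ** 2 for c in subset)
            if dist < nearest_dist:
                nearest, nearest_dist = j, dist
        if target[nearest] == target[i]:
            correct += 1
    return correct / len(rows)


def best_subset(rows, target, columns, score=loo_1nn_accuracy):
    """Score every non-empty subset of `columns`; return (best, all_scored)."""
    scored = []
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            scored.append((score(rows, target, subset), subset))
    # Highest score wins; on a tie the smaller subset is preferred.
    best = max(scored, key=lambda item: (item[0], -len(item[1])))
    return best, scored


column_names = [
    "amount-of-posts-seen-before",
    "internet-connection-quality",
    "followings-activity",
    "time-of-day",
]

# Made-up sample rows, one list of values per row, in column_names order.
raw_rows = [
    [0.1, 0.2, 0.3, 0.1],
    [0.2, 0.3, 0.6, 0.6],
    [0.8, 0.8, 0.5, 0.4],
    [0.9, 0.9, 0.7, 0.9],
    [0.2, 0.1, 0.4, 0.9],
    [0.7, 0.9, 0.6, 0.1],
    [0.9, 0.8, 0.8, 0.6],
    [0.1, 0.2, 0.5, 0.4],
]
rows = [dict(zip(column_names, values)) for values in raw_rows]
target = [0, 0, 1, 1, 0, 1, 1, 0]  # 1 means the user posted

(best_score, best_columns), scored = best_subset(rows, target, column_names)
print(best_columns, best_score)
```

With n columns there are 2^n - 1 non-empty subsets to score, so a full search like this is only practical for a small number of columns.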
