
Feature engineering

Leon Bottou

COS 424 – 4/22/2010


Summary

I. The importance of features
II. Feature relevance
III. Selecting features
IV. Learning features


I. The importance of features


Simple linear models

People like simple linear models with convex loss functions

– Training has a unique solution.

– Easy to analyze and easy to debug.

Which basis functions Φ?

– Also called the features.

Many basis functions

– Poor testing performance.

Few basis functions

– Poor training performance, in general.

– Good training performance if we pick the right ones.

– The testing performance is then good as well.
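As a concrete (made-up) illustration, here is a minimal sketch of a simple linear model over a handful of hand-chosen basis functions Φ, trained by least squares; the basis functions and the data are assumptions for the example, not part of the lecture.

```python
import numpy as np

# Hypothetical basis functions Phi applied to a raw scalar input x.
def phi(x):
    return np.array([1.0, x, x**2, np.sin(x)])  # bias, linear, quadratic, periodic

# Toy data (made up): y depends on only some of the basis functions.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 0.5 * x - 1.2 * np.sin(x) + rng.normal(scale=0.1, size=200)

Phi = np.stack([phi(v) for v in x])            # design matrix, one row per example
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # convex (least-squares) training: unique solution
print(w)                                       # weights on the chosen basis functions
```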


Explainable models

Modelling for prediction

– Sometimes one builds a model for its predictions.

– The model is the operational system.

– Better prediction =⇒ $$$.

Modelling for explanations

– Sometimes one builds a model for interpreting its structure.

– The human acquires knowledge from the model.

– The human then designs the operational system.

(we need humans because our modelling technology is insufficient.)

Selecting the important features

– More compact models are usually easier to interpret.

– A model optimized for explainability is not optimized for accuracy.

– Identification problem vs. emulation problem.


Feature explosion

Initial features

– The initial pick of features is always an expression of prior knowledge.

images −→ pixels, contours, textures, etc.
signal −→ samples, spectrograms, etc.
time series −→ ticks, trends, reversals, etc.
biological data −→ DNA, marker sequences, genes, etc.
text data −→ words, grammatical classes and relations, etc.

Combining features

– Combinations that a linear system cannot represent:

polynomial combinations, logical conjunctions, decision trees.

– Total number of features then grows very quickly.

Solutions
– Kernels (with caveats, see later)
– Feature selection (but why should it work at all?)
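To make the claim above that the number of combined features "grows very quickly" concrete, here is a small sketch (illustrative numbers only) counting the monomials generated by polynomial combinations of d raw features:

```python
from math import comb

# Number of monomials of total degree <= k over d raw features:
# choose exponents summing to at most k  ->  C(d + k, k)  (stars and bars).
def n_poly_features(d, k):
    return comb(d + k, k)

for d in (10, 100, 1000):
    print(d, n_poly_features(d, 2), n_poly_features(d, 3))
# 1000 raw features already give ~5e5 quadratic and ~1.7e8 cubic monomials
```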


II. Relevant features

Assume we know the distribution p(X, Y).

Y : output
X : input, all features
Xi : one feature
Ri = X \ Xi : all features but Xi


Probabilistic feature relevance

Strongly relevant feature

– Definition: Xi ⊥̸⊥ Y | Ri.
Feature Xi brings information that no other feature contains.

Weakly relevant feature

– Definition: Xi ⊥̸⊥ Y | S for some strict subset S of Ri.

Feature Xi brings information that also exists in other features.

Feature Xi brings information in conjunction with other features.

Irrelevant feature

– Definition: neither strongly relevant nor weakly relevant.

Stronger than Xi ⊥⊥ Y. See the XOR example.

Relevant feature

– Definition: not irrelevant.
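The XOR example mentioned above can be made concrete with a quick numerical sketch (made-up sampling): each feature is marginally independent of Y, yet the pair determines Y exactly, so neither feature is irrelevant.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=100_000)
x2 = rng.integers(0, 2, size=100_000)
y = x1 ^ x2                       # XOR: Y is determined by (x1, x2) jointly

# Marginally, each feature looks useless: P(Y=1 | Xi) is about 0.5 either way.
for xi, name in ((x1, "x1"), (x2, "x2")):
    print(name, y[xi == 0].mean(), y[xi == 1].mean())

# Jointly, the pair predicts Y perfectly.
print("joint accuracy:", ((x1 ^ x2) == y).mean())
```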


Interesting example


Correlated variables may be useless by themselves.


Interesting example


Strongly relevant variables may be useless for classification.


Bad news

Forward selection

– Start with the empty set of features S0 = ∅.
– Incrementally add features Xt such that Xt ⊥̸⊥ Y | St−1.

Will find all strongly relevant features.

May not find some weakly relevant features (e.g. XOR).

Backward selection

– Start with the full set of features S0 = X.
– Incrementally remove features Xt such that Xt ⊥⊥ Y | St−1 \ Xt.
Will keep all strongly relevant features.

May eliminate some weakly relevant features (e.g. redundant).

Finding all relevant features is NP-hard.

– Possible to construct a distribution that demands an exhaustive search through all the subsets of features.
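On finite data the conditional-independence tests above have to be replaced by estimates. A minimal sketch of forward selection driven by validation accuracy; the logistic-regression scorer and the stopping rule are assumptions for illustration, not part of the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def forward_select(Xtr, ytr, Xva, yva, max_feats):
    selected, remaining = [], list(range(Xtr.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_feats:
        # Try adding each remaining feature; keep the one helping validation most.
        scores = {}
        for j in remaining:
            cols = selected + [j]
            clf = LogisticRegression(max_iter=1000).fit(Xtr[:, cols], ytr)
            scores[j] = clf.score(Xva[:, cols], yva)
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break                       # no candidate improves validation accuracy
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```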


III. Selecting features

How to select relevant features

when p(x, y) is unknown

but data is available?


Selecting features from data

Training data is limited

– Restricting the number of features is a capacity control mechanism.

– We may want to use only a subset of the relevant features.

Notable approaches

– Feature selection using regularization.

– Feature selection using wrappers.

– Feature selection using greedy algorithms.


L0 structural risk minimization

[Figure: nested structure of model classes S1 ⊂ S2 ⊂ · · · ⊂ Sd, with Sr using r features.]

Algorithm

1. For r = 1 . . . d, find the system fr ∈ Sr that minimizes the training error.

2. Evaluate fr on a validation set.

3. Pick $f^\star = \arg\min_r E_{valid}(f_r)$.

Note

– The NP-hardness remains hidden in step (1).
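For a small number of raw features d, the procedure can be spelled out directly, reading Sr as the classifiers that use r of the features. A brute-force sketch (assumed scikit-learn estimator) that makes the exponential cost of step (1) visible:

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_score(Xa, ya, Xb, yb, cols):
    clf = LogisticRegression(max_iter=1000).fit(Xa[:, cols], ya)
    return clf.score(Xb[:, cols], yb)

def l0_srm(Xtr, ytr, Xva, yva):
    d = Xtr.shape[1]
    best_cols, best_valid = None, -np.inf
    for r in range(1, d + 1):
        # Step 1: exhaustive search over all size-r subsets (the NP-hard part).
        f_r = max(combinations(range(d), r),
                  key=lambda c: fit_score(Xtr, ytr, Xtr, ytr, list(c)))
        # Step 2: evaluate the best size-r system on the validation set.
        valid = fit_score(Xtr, ytr, Xva, yva, list(f_r))
        # Step 3: keep the subset with the best validation score.
        if valid > best_valid:
            best_cols, best_valid = list(f_r), valid
    return best_cols, best_valid
```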


L0 structural risk minimization


Let $E_r = \min_{f \in S_r} E_{test}(f)$. The following result holds (Ng 1998):

$$E_{test}(f^\star) \;\le\; \min_{r=1\dots d}\left[\, E_r + O\!\left(\sqrt{\frac{h_r}{n_{train}}}\right) + O\!\left(\sqrt{\frac{r \log d}{n_{train}}}\right) \right] + O\!\left(\sqrt{\frac{\log d}{n_{valid}}}\right)$$

Assume Er is quite good for a low number of features r, meaning that few features are relevant.

Then we can still find a good classifier if hr and log d are reasonable. We can filter an exponential number of irrelevant features.


L0 regularisation

$$\min_w \;\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \;+\; \lambda\,\mathrm{count}\{w_j \neq 0\}$$

This would be the same as L0-SRM.

But how can we optimize that?


L1 regularisation

The L1 norm is the first convex Lp norm.

$$\min_w \;\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \;+\; \lambda\,\|w\|_1$$

Same logarithmic property (Tsybakov 2006).

L1 regularization can weed out an exponential number of irrelevant features.

See also “compressed sensing”.
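A small sketch (synthetic data; scikit-learn's Lasso assumed as the solver) of L1 regularization zeroing out the weights of irrelevant features:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 50 features, only 3 relevant
y = 2 * X[:, 0] - 3 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)        # alpha plays the role of lambda
print("nonzero weights:", np.flatnonzero(model.coef_))   # typically just [0, 1, 2]
```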


L2 regularisation

The L2 norm is the same as the maximum margin idea.

$$\min_w \;\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \;+\; \lambda\,\|w\|_2^2$$

Logarithmic property is lost.

Rotationally invariant regularizer!

SVMs do not have magic properties for filtering out irrelevant features. They perform best when dealing with lots of relevant features.


L1/2 regularization?

$$\min_w \;\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \;+\; \lambda\,\|w\|_{1/2}$$

This is non-convex, and therefore hard to optimize.

Initialize with the L1 norm solution, then perform gradient steps.
This is surely not optimal, but gives sparser solutions than L1 regularization!

Works better than L1 in practice.

But this is a secret!
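A rough sketch of the recipe above for a squared loss: start from the L1 solution, then take gradient steps on the L1/2-penalized objective. The step size, iteration count, and epsilon-smoothing of the penalty gradient are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l_half_refine(X, y, lam=0.1, lr=1e-3, steps=500, eps=1e-8):
    w = Lasso(alpha=lam).fit(X, y).coef_.copy()      # initialize with the L1 solution
    n = len(y)
    for _ in range(steps):
        grad_loss = X.T @ (X @ w - y) / n            # gradient of the squared-loss term
        # (sub)gradient of lam * sum_j |w_j|^(1/2), smoothed near zero
        grad_pen = lam * 0.5 * np.sign(w) / (np.sqrt(np.abs(w)) + eps)
        w -= lr * (grad_loss + grad_pen)
    return w
```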


Wrapper approaches

Wrappers

– Assume we have chosen a learning system and algorithm.

– Navigate feature subsets by adding/removing features.

– Evaluate on the validation set.

Backward selection wrapper

– Start with all features.

– Try removing each feature and measure validation set impact.

– Remove the feature that causes the least harm.

– Repeat.

Notes

– There are many variants (forward, backtracking, etc.)

– Risk of overfitting the validation set.

– Computationally expensive.

– Quite effective in practice.
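A minimal sketch of the backward-selection wrapper described above; the estimator, validation split, and stopping rule are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def backward_wrapper(Xtr, ytr, Xva, yva, min_feats=1):
    feats = list(range(Xtr.shape[1]))

    def val_score(cols):
        clf = LogisticRegression(max_iter=1000).fit(Xtr[:, cols], ytr)
        return clf.score(Xva[:, cols], yva)

    current = val_score(feats)
    while len(feats) > min_feats:
        # Try removing each feature; keep the removal that hurts validation least.
        trials = {j: val_score([f for f in feats if f != j]) for j in feats}
        j, score = max(trials.items(), key=lambda kv: kv[1])
        if score < current:
            break                       # every removal hurts: stop
        feats.remove(j)
        current = score
    return feats
```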


Greedy methods

Algorithms that incorporate features one by one.

Decision trees

– Each decision can be seen as a feature.

– Pruning the decision tree prunes the features.

Ensembles

– Ensembles of classifiers involving few features.

– Random forests.

– Boosting.


Greedy method example

The Viola-Jones face recognizer

Lots of very simple features: $\sum_{R \in \mathrm{Rects}} \alpha_R \sum_{(i,j) \in R} x[i,j]$

Quickly evaluated by first precomputing $X_{i_0 j_0} = \sum_{i \le i_0} \sum_{j \le j_0} x[i,j]$.

Run AdaBoost with weak classifiers based on these features.
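The precomputed table above is the integral image; a short NumPy sketch of how it makes any rectangle sum a constant-time lookup:

```python
import numpy as np

def integral_image(x):
    # X[i0, j0] = sum of x[i, j] for i <= i0, j <= j0
    return x.cumsum(axis=0).cumsum(axis=1)

def rect_sum(X, i0, j0, i1, j1):
    # Sum of x over the rectangle [i0, i1] x [j0, j1], via at most 4 lookups.
    total = X[i1, j1]
    if i0 > 0:
        total -= X[i0 - 1, j1]
    if j0 > 0:
        total -= X[i1, j0 - 1]
    if i0 > 0 and j0 > 0:
        total += X[i0 - 1, j0 - 1]
    return total

img = np.arange(16.0).reshape(4, 4)
X = integral_image(img)
assert rect_sum(X, 1, 1, 2, 3) == img[1:3, 1:4].sum()
```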


IV. Feature learning


Feature learning in one slide

Suppose we have a weight on a feature X.

Suppose we prefer a closely related feature X + ε.



Feature learning and multilayer models



Feature learning for image analysis

2D Convolutional Neural Networks

– 1989: isolated handwritten digit recognition

– 1991: face recognition, sonar image analysis

– 1993: vehicle recognition

– 1994: zip code recognition

– 1996: check reading

[Figure: LeNet-5-style convolutional network — INPUT 32×32 → C1: 6 feature maps 28×28 (convolutions) → S2: 6 maps 14×14 (subsampling) → C3: 16 maps 10×10 (convolutions) → S4: 16 maps 5×5 (subsampling) → C5: 120 units (full connection) → F6: 84 units (full connection) → OUTPUT: 10 (Gaussian connections).]
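A rough PyTorch sketch of the LeNet-5-style network in the figure; average pooling stands in for the original trainable subsampling, and a plain linear layer for the Gaussian output connections.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 6 maps, 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # S2: 6 maps, 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # C3: 16 maps, 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # S4: 16 maps, 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),       # C5
            nn.Tanh(),
            nn.Linear(120, 84),               # F6
            nn.Tanh(),
            nn.Linear(84, 10),                # output: 10 classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.zeros(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```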


Feature learning for face recognition

Note: more powerful but slower than Viola-Jones


Feature learning revisited

Handcrafted features

– Result from knowledge acquired by the feature designer.

– This knowledge was acquired on multiple datasets associated with related tasks.

Multilayer features

– Trained on a single dataset (e.g. CNNs).

– Requires lots of training data.

– Interesting training data is expensive.

Multitask/multilayer features

– In the vicinity of an interesting task with costly labels, there are related tasks with abundant labels.

– Example: face recognition ↔ face comparison.

– More during the next lecture!


