Introduction
This
article describes how to use model filtering, a powerful new feature in SQL Server 2008
Data Mining. Filtering allows you to take full advantage of the mining
structure / mining model dichotomy, separating the data staging (in the structure)
from the modeling (in the mining model). This means that you can define and
process a single mining structure for your problem space and then build mining
models on specific slices of interest within that space.
SQL
Server 2008 Data Mining allows specifying, for each mining model, a filter to
be applied on the training data. The filter acts as a partitioning mechanism
inside the mining structure and it is applied on top of any training/testing
partitioning already existing at the structure level. Furthermore,
filters can be applied to data in nested tables. Filters allow you to build
different models for various partitions of the data with minimal effort, either
to obtain better overall accuracy (think of it as manually specifying the first
split in a decision tree), or to compare the patterns between different
partitions.
We will cover a lot of ground in
this article. Use the links below to jump straight to the sections you're most
interested in:
Creating Filtered Models
So,
let’s start with a simple scenario, using the same old
Adventure Works data -
building a mining model that predicts how likely a customer is to purchase a
bike based on various demographic information. Let’s also assume that you know -
based on your experience with the data - that the Geography factor has an
important role and you don’t want your North American patterns to hide (because
of volume) the peculiarities of the Pacific market.
In
this section, we will show you both the DMX and UI ways of working with
filtered models. Pick your path and follow it!
Suppose
you start with a mining structure defined like below:
|
DMX:
CREATE MINING STRUCTURE TestStructure
(
CustomerKey LONG KEY,
Region TEXT DISCRETE,
YearlyIncome DOUBLE DISCRETIZED,
EnglishOccupation TEXT DISCRETE,
CommuteDistance TEXT DISCRETE,
BikeBuyer BOOLEAN DISCRETE
) WITH HOLDOUT( 30 PERCENT )
Then,
train the mining structure with an INSERT INTO statement:
INSERT INTO MINING STRUCTURE TestStructure
(
CustomerKey,
Region,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer
)
OPENQUERY([Adventure
Works DW], ‘SELECT
CustomerKey,
Region,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer FROM dbo.vTargetMail‘)
|
UI:
To create the structure using BI
Development Studio, you just need to select “New Mining Structure”. On the
“Create the Mining Structure” page of the wizard select “Create mining
structure with no models”. Assuming you already have a data source view for
Adventure Works 2008, select it and then select vTargetMail in the “Specify table type” in the wizard. Select the
following columns and change the content and data types to match those as shown
below:

Now,
the modeling part. As usual, you would create a mining model as follows:
|
DMX:
ALTER MINING STRUCTURE TestStructure
ADD MINING MODEL BikeBuyerClass
(
CustomerKey,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH DRILLTHROUGH
|
UI:
In the UI, you can create the mining
model by selecting create related mining model. Name the model BikeBuyerClass
and select Decision Trees as the algorithm. Once the model is selected you will
have to set the BikeBuyer column to Predict
and the Region column to ignore as
shown below:

Now
you can add a mining model to the mining structure using only the data for the
Pacific area:
|
DMX:
ALTER MINING STRUCTURE TestStructure
ADD MINING MODEL DT_Pacific
(
CustomerKey,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH FILTER( Region=’Pacific’ ), DRILLTHROUGH
The
new syntax element is the FILTER construct, which instructs the model to use
in training only those cases where the Region column contains the ‘Pacific’
value.
|
UI:
To create the filtered model in the UI,
you will need to create the model first and then add the filter. To use fewer
steps, you can create the new model based on BikeBuyerClass model we just
created. Just right click on the BikeBuyerModel name and select “New Mining
Model” as shown below.

This
can also be accomplished by picking the mining model you want to clone and
using the Mining Model drop down from the file menu and selecting “New Mining
Model”.
Now,
to add the filter you have to select “Set Model Filter” from the Mining Model
menu while the mining model is selected or from the right click contextual
menu. This will bring the Model Filter dialog.
The
dialog presents data in tabular form. Each row is a condition. Each condition
can have up to four columns:
·
And
/ Or: This will do a logical And/Or between the different conditions. Valid
values are “And” and “Or”. The value can only be set from the second condition
on. Current version doesn’t allow condition grouping from the grid.
·
Mining
Structure Column: This can be any column from the mining structure, values are
prefilled so you only need to select a column name from a drop down. Only
columns that will not show are key columns since you can’t filter on keys.
·
Operator:
This is a drop down with valid operators based on the data type of the selected
mining structure column (more on this below).
·
Value:
This is the value used to compare against the structure column using the operator. Value is not always necessary
(more on this below).

As
you can see above, the same expression used in the DMX version of the filter is
displayed in the Expression portion
of the dialog.
Note:
An expression can be edited manually by selecting “Edit Query” in the dialog.
Be careful: once the expression is edited you will not be able to use the grid
to add elements to the filter. Everything will need to be added by hand in the
Expression text box.
Similarly,
add a new model for the North America region:
|
DMX:
ALTER MINING STRUCTURE TestStruct
ADD MINING MODEL DT_NorthAmerica
(
CustomerKey,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH FILTER(
Region=’North America’ ), DRILLTHROUGH
|
UI:
Clone
DT_Pacific and edit the filter to be North America instead of Pacific as shown below.

Note that if you clone the DT_Pacific
model to create the DT_NorthAmerica model using the steps described above, the
filters in the base mining model will also be cloned. This can be a real time-saver when creating multiple models with similar filters
- the time-saving adds up when you start building complex filters.
|
DMX Deep-Dive on
Filter Syntax
As you
notice, FILTER is syntactically a function, taking a Boolean expression.
The Boolean expression may be a simple predicate (of the
Column=Value kind), an EXISTS predicate or a Boolean combination
(using AND, OR and NOT operators) of multiple
predicates.
The
Column part of a simple predicate is always the name of a mining
structure column.
The
corresponding mining structure column does not have to be included in the
mining model. In the examples above, Region is no longer an interesting
attribute when all the training cases seen by a model have the same
value (e.g. Pacific), therefore it is not part of the model.
A
simple predicate is defined as <Column> , where
the accepted set of operators depends on the content type of the column,
and Value must always be a constant. Value generally has
to have the same type as the column but can also be the NULL
constant.
UI:
In the Filter dialog in BI Studio the NULL constant is
denoted by two added values in the operator drop downs as IS NULL and IS
NOT NULL to demote <Column> = NULL or <Column> <> NULL
respectively.
If the
column is:
-
A
case level KEY - then no operator can be applied (no key filters are
supported) . Nested KEY on the other hand is just a mapping between
case and nested tables and can be used to filter (more on nested
table filtering later)
-
DISCRETE or DISCRETIZED - supports the = and <> operators and also
IS NULL and IS NOT NULL in the Filter dialog
-
CONTINUOUS - supports the =, <>, <, <=, >, >= operators and also IS
NULL and IS NOT NULL in the Filter dialog
XML
serialization of the floating point double numbers as well as rounding
errors may result in losses of less significant digits, therefore using
an = or <> operator with a double value may result in unexpected
results, so it is safer to use a combination of other predicates.
In case
of discretized (numeric or date/time) values, the = and <> operators
have a semantic similar with the PREDICTION JOIN operator: they evaluate
the bucket that contains the value and then apply the predicate to the
bucket index. Therefore, for a discretized YearlyIncome column with
buckets (0, 10000], (10000, 20000] etc. the filters below mean exactly
the same thing:
-
(YearlyIncome=5000)
-
(YearlyIncome=9000)
-
(YearlyIncome=999.999)
Applying a filter on a mining model means, effectively, training the
mining model with the data returned by a query like below:
SELECT
… FROM MINING STRUCTURE TestStructure.CASES
WHERE IsTrainingCase() AND ()
For
those of you who use OLAP mining models, filters are somewhat similar
to defining a cube slice on a mining structure - but at the model level.
If your
filtered model has drillthrough enabled, then all the
structure cases that match the filter will be linked with a content
node, not only the training cases from the structure. Of course, only
the structure training cases matching the filter will be
used in training.
Therefore, the following queries are equivalent
SELECT
… FROM MINING MODEL DT_NorthAmerica.CASES
WHERE IsTrainingCase()
SELECT
… FROM MINING STRUCTURE TestStructure.CASES
WHERE IsTrainingCase()AND (Region=’North America’) |
Model Accuracy in Filtered Models
Accuracy
stored procedures can be applied on various data partitions (test or training).
Filters add a new dimension to the data partitions, which can now be filtered
test data or filtered training data. Accuracy stored procedures have syntax
like below:
CALL SystemGetLiftTable(BikeBuyerClass, 2, ‘BikeBuyer’, true)
// - to get the lift chart for the TRUE state of the BikeBuyer attribute
CALL SystemGetClassificationMatrix(BikeBuyerClass, 2,
‘BikeBuyer’) // - to get the full classification matrix for the BikeBuyer
attribute
The
second parameter (2) is a bitmask identifying the data partition(s) to be used
in the accuracy stored procedure. Here are some bit values:
- 1 identifies the training
partition
- 2 identifies the test
partition
- 3 (bitwise OR combination of
1 and 2) identifies ALL the data in the mining structure (training + test)
Filters
introduce another bit in the bitmask, with a value of 4. Therefore, here are
some new values for the second parameter of the accuracy stored procedures:
-
5 (4 bitwise OR 1) identifies the filtered training data seen by the model in
training
- 6 (4 bitwise OR 2) identifies the filtered test data - those test cases
from the mining structure where the model filter applies
-
7 (4 bitwise OR 1 bitwise OR 2) identifies all the cases (training or test) in
the structure where the model filter applies
Therefore
the following calls might return different lift tables:
CALL SystemGetLiftTable(DT_Pacific, 2, ‘BikeBuyer’, true) // -
to get the lift chart for the TRUE state of the BikeBuyer attribute on all the
test cases in the structure
CALL SystemGetLiftTable(DT_Pacific, 6, ‘BikeBuyer’, true) // -
to get the lift chart for the TRUE state of the BikeBuyer attribute on the test
cases in the structure that match the filter defined in DT_Pacific
Filtering Nested Tables
The
most common use of nested tables in SQL Server Data Mining is to model
transaction tables. That is, model Customer entities together with the “bag” of
products they purchased.
Nested
table filters primarily serve two purposes:
-
Use (in the
model) only those nested table rows with a certain property (e.g. consider
only products that are not “Coca Cola”, because most people do buy “Coca
Cola”, so it is not that interesting)
-
Use (in the
model) only those cases where the associated nested tables have certain
properties (e.g. - build a prediction model on customers that have
purchased Coca Cola)
Notice
the difference between the filters: while the first filter removes products
from shopping baskets before modeling, the second removes customers from
the model (those that have not purchased Coca Cola).
Let’s
start with a mining structure containing customers, some demographic
information and a list of products they bought:
CREATE MINING STRUCTURE TestStructure2
(
OrderNumber TEXT KEY,
IncomeGroup TEXT DISCRETE,
Products TABLE
(
Product TEXT KEY,
LineNumber LONG
DISCRETE
)
)
Next
step, train the mining structure:
INSERT INTO MINING STRUCTURE TestStructure2
(
OrderNumber,
IncomeGroup,
Products
(
SKIP,
Product,
LineNumber
)
)
SHAPE {OPENQUERY([Adventure
Works DW], ‘SELECT
OrderNumber,
IncomeGroup
FROM dbo.vAssocSeqOrders ORDER BY OrderNumber‘)
}
APPEND ({ OPENQUERY([Adventure
Works DW], ‘SELECT
OrderNumber,
Model,
LineNumber
FROM dbo.vAssocSeqLineItems ORDER BY
OrderNumber‘) }
RELATE OrderNumber TO
OrderNumber) AS Products
Now,
the modeling part: for the first kind of filter described in the beginning, the
goal is to build a model that predicts some, for example, demographics
(IncomeGroup) based on the products a customer purchased, without considering a
certain product (say, the ‘Mountain-500′ bike):
ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL IncomeGroupFromProducts
(
OrderNumber,
IncomeGroup PREDICT,
Products
(
Product
) WITH FILTER(Product<>’Mountain-500′)
)USING Microsoft_Decision_Trees
The
syntax of the nested table filter is very similar with the syntax of model
filters - it effectively specifies a WHERE clause to be applied on the rows of
the nested table. The training data for the model can be obtained with the
following drillthrough query:
SELECT
OrderNumber, IncomeGroup,
(SELECT Product FROM
Products WHERE Product<>’Mountain-500′)
FROM MINING STRUCTURE TestStructure2.CASES WHERE IsTrainingCase()
For
the UI create a structure with no models and add a new Model setting the column
LineNumber to ignore (as described in
the previous section). The main difference you will notice from the previous
example is that now if you select the nested table column a new element is selectable
in the Mining Model / Contextual menu: Set
Nested Table Row Filter. Select this option to open the Nested table filter
dialog. You will now you opened the correct dialog because it will say Nested
Table Row Filter for <Model>.
The only items that will be available in the filter are nested table columns.
The filter for IncomeGroupFromProducts is shown below.

The
syntax changes for the second filter described before. Now the goal is to build
a model predicting IncomeGroup only for those customers that did
buy a certain product:
ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL IncomeGroupFromMountain500
(
OrderNumber,
IncomeGroup PREDICT,
Products
(
Product
)
)USING Microsoft_Decision_Trees WITH
FILTER(
EXISTS (SELECT * FROM
Products WHERE Product=’Mountain-500′)
)
The
new syntax element is the EXISTS predicate, which evaluates to TRUE if the
query specified as argument returns at least one row.
Note
that the list of columns for the EXISTS query is never used, so * is as good as
anything else. The WHERE clause of the EXISTS predicate supports, in the case
of nested table filters, only the simple
predicates (or Boolean combinations of such predicates)
described before.
The
same filter can be applied using the Model Filter dialog in BI Studio. The main
difference will be the available mining structure columns where you can select
any of the structure’s columns or the nested table. You need to select the
nested table as the Structure Column. Valid operators for nested tables are: Contains which translates into
Exists and Not Contains which is the equivalent of Not Exists.
If
you select the nested table then a ‘plus’ sign will appear to the left of that
condition, this means you can expand the nested table filter. The nested table
conditions work pretty much as regular conditions with the same <Column> <Operator> <Value>
rules. Everything inside the nested conditions will be used to create an Exists clause (with or without the
Not). You can have multiple of these
clauses and use AND / OR to create Boolean combinations. You can see the filter
for IncomeGroupFromMountain500 below:

The
training data for the model can be obtained with the following drillthrough
query:
SELECT
OrderNumber, IncomeGroup,
(SELECT Product FROM
Products)
FROM MINING STRUCTURE TestStructure2.CASES
WHERE
IsTrainingCase() AND
EXISTS( SELECT * FROM Products
WHERE Product=’Mountain-500′)
The
query actually works in DMX, which implies that DMX contains now the EXISTS
function which evaluates to TRUE if the sub-query argument returns at least one
row. Note that, when used in a regular DMX query (i.e. not in a filter), the
EXISTS sub-query can be applied to any nested table (including function
results, such as PredictHistogram) and can take any WHERE clause that is valid
in that context, including UDFs (therefore, outside of a filter, the EXISTS
clause is not limited to structure table columns and simple predicates).
Combinations
of filters and other SQL Server 2008 DMX features can be used to have very
specialized data views, allowing complex modeling scenarios.
Multiple Filters on
Nested Tables
Notice
how the structure above contains a LineNumber column in the nested table, a
numeric field indicating the category of the product on the same line. The
model below detects cross-sales rules that lead from products in line 1 (bikes)
to products in any other category:
ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL LineNumber1Recommendations
(
OrderNumber,
Products AS LineNumber1Products
(
Product
) WITH FILTER(LineNumber=1),
Products AS OtherLineNumberProducts PREDICT_ONLY
(
Product
) WITH FILTER(LineNumber<>1)
)USING Microsoft_Association_Rules
WITH FILTER(
EXISTS(SELECT * FROM Products WHERE
LineNumber=1) AND
EXISTS(SELECT * FROM Products WHERE
LineNumber<>1)
)
Here
is a breakout of the filters and constructs:
-
LineNumber1Products
is an input nested table, based on data in the Products structure column,
and containing only products in the category denoted by LineNumber 1
-
OtherLineNumberProducts
is a predictable (PREDICT_ONLY) nested table, also based on data in the
Products structure column, but containing only products in the categories
other than the one denoted by LineNumber 1
-
The model is
trained only on those customers that have both at least one product in the
LineNumber 1category and at least one product in a different category,
because any other customers are not relevant for the problem the model is
trying to solve.
-
All the filters
(both the nested table row filters and the model case filter) are applied
on mining structure columns which do not necessarily show in the model
(the case of LineNumber). When they do show in the model under an alias,
the filter is applied on the source structure column name.
UI:
The same can be accomplished in BI Studio by
creating Nested Filters for each nested table and then on Model filter with
both Exist clauses.
[Acknowledgement: Most of the content for this
article comes from a two-part
post on this topic on Bogdan
Crivat’s blog.]