BigQuery ML 分類模型

Pei Lee
13 min readApr 1, 2019

--

用 BigQuery 的機器學習做顧客分類,預測顧客未來是否會購物。這邊資料集用 BigQuery 上的公開資料 data-to-insights。

一、資料探索

  • 實際進行過購物行為的顧客佔所有顧客的幾 %?
  • 最暢銷的前五大商品?
  • 有多少比例的顧客會回購?

二、預測標的

顧客未來是否會購物

三、選擇特徵並建立訓練資料集

選擇特徵是機器學習非常重要的一環,因為模型最好也只能跟你給的 input 資料一樣好,如果給的特徵的資訊不足以讓模型學習到特徵跟標的之間的關係,就沒辦法取得準確的模型。這邊先選擇兩個特徵:

  • 跳出次數
  • 停留時間
#standardSQL 
SELECT * EXCEPT(fullVisitorId)
FROM # features
(
SELECT fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.newVisits = 1
)
JOIN (
SELECT fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND
totals.newVisits IS NULL) > 0, 1, 0)
AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
USING (fullVisitorId)
ORDER BY time_on_site DESC
LIMIT 10;
  • EXCEPT ():排除欄位
  • USING ():在大部分情況下等同於 ON,但在 SELECT * 時會有一些狀況,詳情可看官方文件

四、建立模型

目前有兩種模型可以用:

來源:QwikLabs

第一個是用線性迴歸預測數值,第二個是用羅吉斯迴歸做分類,這邊我們要用的是後者,語法如下:

#standardSQL  
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS (
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)

AS
#standardSQL
SELECT * EXCEPT(fullVisitorId)
FROM # features
(
SELECT fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430'
)
# train on first 9 months
JOIN (
SELECT fullvisitorid,
IF(COUNTIF(totals.transactions > 0
AND totals.newVisits IS NULL) > 0, 1, 0)
AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
USING (fullVisitorId) ;

五、評估模型

在分類問題中,我們希望能夠最小化 False Positive Rate(猜顧客會買但他沒買)並且最大化 True Positive Rate(猜顧客會買而且他真的買了),這樣的關係我們會畫成 ROC 曲線來做視覺化,而在曲線下的面積(Area Under the Curve)我們叫做 AUC。在 BigQuery 中很神奇的是,AUC 是可以 Query 的!下面我們就來 Query 一下:

#standardSQL 
SELECT roc_auc,
CASE WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor'
END AS model_quality
FROM ML.EVALUATE( MODEL ecommerce.classification_model,
( SELECT * EXCEPT(fullVisitorId)
FROM # features
( SELECT fullVisitorId,
IFNULL(totals.bounces, 0)
AS bounces,
IFNULL(totals.timeOnSite, 0)
AS time_on_site
FROM `data-to-
insights.ecommerce.web_analytics`
WHERE totals.newVisits = 1
AND date BETWEEN '20170501'
AND '20170630')
# eval on 2 months
JOIN ( SELECT fullvisitorid,
IF(COUNTIF(totals.transactions
> 0 AND totals.newVisits IS
NULL) > 0, 1, 0)
AS will_buy_on_return_visit
FROM `data-to-
insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
USING (fullVisitorId)
));

Query 結果:

六、加強模型後再評估

剛才結果不甚理想,這邊我們再新增特徵:

  • 顧客來源媒體
  • 顧客來源國家
  • 顧客使用裝置
  • 初次造訪抵達的轉換階段
#standardSQL 
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS (
    model_type='logistic_reg',
    labels = ['will_buy_on_return_visit']
    )

AS
WITH all_visitor_stats AS (
SELECT fullvisitorid,
IF(COUNTIF(totals.transactions > 0
AND totals.newVisits IS NULL) > 0, 1, 0)
AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
# add in new features
SELECT * EXCEPT(unique_session_id)
FROM ( SELECT CONCAT(fullvisitorid, CAST(visitId AS STRING))
AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64))
AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats
USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);

接著來評估模型:(上面那串實在太大一沱了,取個變數名稱叫 [一大沱] 取代從 WITH 開始到結束的語法)

#standardSQL 
SELECT roc_auc,
CASE WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor'
END AS model_quality
FROM ML.EVALUATE(MODEL ecommerce.classification_model_2, ([一大沱]));

結果:

0.91,到達了一個相當不錯的分數(滿分為 1)。

七、預測

得到滿意的模型後,下一步就是實際做預測了:

#standardSQL
SELECT *
FROM ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(
WITH all_visitor_stats AS (
SELECT fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits
IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT CONCAT(fullvisitorid, '-',CAST(visitId AS STRING))
AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64))
AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats
USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month
GROUP BY unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
)
ORDER BY predicted_will_buy_on_return_visit DESC;

執行結果:

這邊新增了三個欄位,第一個 predicted_will_buy_on_return_visit 代表最後的預測結果,第二個 predicted_will_buy_on_return_visit_probs.label 只是列出標籤而已,第三個 predicted_will_buy_on_return_visit.prob 是每位顧客在該標籤的信心水準(confidence)。

以上。

一開始接觸機器學習的時候就是用 R 跟 Python,沒有想過有一天可以用下 Query 的方式來執行,操作起來覺得還算直覺,蠻不錯的,可惜的是現在模型的種類比較有限。不過以 Google 的進展來說,相信不久後也能有 Boosting 的模型可以用了吧。

看其他我寫的 BigQuery 文章:

--

--

Pei Lee
Pei Lee

Written by Pei Lee

Data Team Lead @ Retail & E-commerce Industry

No responses yet