用 BigQuery 的機器學習做顧客分類,預測顧客未來是否會購物。這邊資料集用 BigQuery 上的公開資料 data-to-insights。
一、資料探索
- 實際進行過購物行為的顧客佔所有顧客的幾 %?
- 最暢銷的前五大商品?
- 有多少比例的顧客會回購?
二、預測標的
顧客未來是否會購物
三、選擇特徵並建立訓練資料集
選擇特徵是機器學習非常重要的一環,因為模型最好也只能跟你給的 input 資料一樣好,如果給的特徵的資訊不足以讓模型學習到特徵跟標的之間的關係,就沒辦法取得準確的模型。這邊先選擇兩個特徵:
- 跳出次數
- 停留時間
#standardSQL
SELECT * EXCEPT(fullVisitorId)
FROM # features
(
SELECT fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.newVisits = 1
)
JOIN (
SELECT fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND
totals.newVisits IS NULL) > 0, 1, 0)
AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
USING (fullVisitorId)
ORDER BY time_on_site DESC
LIMIT 10;
- EXCEPT ():排除欄位
- USING ():在大部分情況下等同於 ON,但在 SELECT * 時會有一些狀況,詳情可看官方文件。
四、建立模型
目前有兩種模型可以用:
第一個是用線性迴歸預測數值,第二個是用羅吉斯迴歸做分類,這邊我們要用的是後者,語法如下:
#standardSQL
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS (
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS
#standardSQL
SELECT * EXCEPT(fullVisitorId)
FROM # features
(
SELECT fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430'
)
# train on first 9 months
JOIN (
SELECT fullvisitorid,
IF(COUNTIF(totals.transactions > 0
AND totals.newVisits IS NULL) > 0, 1, 0)
AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
USING (fullVisitorId) ;
五、評估模型
在分類問題中,我們希望能夠最小化 False Positive Rate(猜顧客會買但他沒買)並且最大化 True Positive Rate(猜顧客會買而且他真的買了),這樣的關係我們會畫成 ROC 曲線來做視覺化,而在曲線下的面積(Area Under the Curve)我們叫做 AUC。在 BigQuery 中很神奇的是,AUC 是可以 Query 的!下面我們就來 Query 一下:
#standardSQL
SELECT roc_auc,
CASE WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor'
END AS model_quality
FROM ML.EVALUATE( MODEL ecommerce.classification_model,
( SELECT * EXCEPT(fullVisitorId)
FROM # features
( SELECT fullVisitorId,
IFNULL(totals.bounces, 0)
AS bounces,
IFNULL(totals.timeOnSite, 0)
AS time_on_site
FROM `data-to-
insights.ecommerce.web_analytics`
WHERE totals.newVisits = 1
AND date BETWEEN '20170501'
AND '20170630')
# eval on 2 months
JOIN ( SELECT fullvisitorid,
IF(COUNTIF(totals.transactions
> 0 AND totals.newVisits IS
NULL) > 0, 1, 0)
AS will_buy_on_return_visit
FROM `data-to-
insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
USING (fullVisitorId)
));
Query 結果:
六、加強模型後再評估
剛才結果不甚理想,這邊我們再新增特徵:
- 顧客來源媒體
- 顧客來源國家
- 顧客使用裝置
- 初次造訪抵達的轉換階段
#standardSQL
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS (
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS WITH all_visitor_stats AS (
SELECT fullvisitorid,
IF(COUNTIF(totals.transactions > 0
AND totals.newVisits IS NULL) > 0, 1, 0)
AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
) # add in new features
SELECT * EXCEPT(unique_session_id)
FROM ( SELECT CONCAT(fullvisitorid, CAST(visitId AS STRING))
AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64))
AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats
USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);
接著來評估模型:(上面那串實在太大一沱了,取個變數名稱叫 [一大沱] 取代從 WITH 開始到結束的語法)
#standardSQL
SELECT roc_auc,
CASE WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor'
END AS model_quality
FROM ML.EVALUATE(MODEL ecommerce.classification_model_2, ([一大沱]));
結果:
0.91,到達了一個相當不錯的分數(滿分為 1)。
七、預測
得到滿意的模型後,下一步就是實際做預測了:
#standardSQL
SELECT *
FROM ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(
WITH all_visitor_stats AS (
SELECT fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits
IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
) SELECT CONCAT(fullvisitorid, '-',CAST(visitId AS STRING))
AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64))
AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats
USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month
GROUP BY unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
)
ORDER BY predicted_will_buy_on_return_visit DESC;
執行結果:
這邊新增了三個欄位,第一個 predicted_will_buy_on_return_visit 代表最後的預測結果,第二個 predicted_will_buy_on_return_visit_probs.label 只是列出標籤而已,第三個 predicted_will_buy_on_return_visit.prob 是每位顧客在該標籤的信心水準(confidence)。
以上。
一開始接觸機器學習的時候就是用 R 跟 Python,沒有想過有一天可以用下 Query 的方式來執行,操作起來覺得還算直覺,蠻不錯的,可惜的是現在模型的種類比較有限。不過以 Google 的進展來說,相信不久後也能有 Boosting 的模型可以用了吧。
看其他我寫的 BigQuery 文章: