C4.5アルゴリズム

決定木を作成するアルゴリズム

C4.5は、ロス・クインランによって開発された決定木を生成するためのアルゴリズムです。^{[1] C4.5は、クインランの以前の}ID3アルゴリズムの拡張版です。C4.5によって生成された決定木は分類に使用できるため、C4.5は統計的分類器と呼ばれることがよくあります。2011年、 Weka機械学習ソフトウェアの開発者は、C4.5アルゴリズムを「おそらく今日まで実践で最も広く使用されている機械学習の主力製品である、画期的な決定木プログラム」と評しました。^[2]

2008年にSpringer LNCSが発表した優れた論文「データマイニングにおけるトップ10アルゴリズム」で1位にランクインして以来、非常に人気が高まりました。^[3]

アルゴリズム

C4.5は、 ID3と同様に、情報エントロピーの概念を用いて、学習データセットから決定木を構築します。学習データは、既に分類済みのサンプルの集合です。各サンプルはp次元ベクトルで構成され、ベクトルはサンプルの属性値または特徴と、そのクラスを表します。 $S={s_{1},s_{2},...}$ $s_{i}$ $(x_{1,i},x_{2,i},...,x_{p,i})$ $x_{j}$ $s_{i}$

C4.5は、木の各ノードにおいて、サンプルセットを最も効果的にいずれかのクラスに偏ったサブセットに分割するデータ属性を選択します。分割基準は、正規化された情報ゲイン（エントロピーの差）です。正規化された情報ゲインが最も高い属性が決定に使用されます。その後、C4.5アルゴリズムは分割されたサブリストを再帰的に処理します。

このアルゴリズムにはいくつかの基本ケースがあります。

リスト内のすべてのサンプルは同じクラスに属します。この場合、そのクラスを選択するように指示する決定木のリーフノードが作成されます。
いずれの特徴も情報ゲインを提供しません。この場合、C4.5はクラスの期待値を使用して、ツリーの上位に決定ノードを作成します。
これまでに見たことのないクラスのインスタンスが見つかりました。C4.5は、期待値を使用して、ツリーの上位に決定ノードを作成します。

擬似コード

擬似コードでは、決定木を構築するための一般的なアルゴリズムは次のようになります。^[4]

上記の基本ケースを確認します。
各属性aについて、 aでの分割から得られる正規化された情報ゲイン比を求めます。
正規化された情報ゲインが最も高い属性をa_bestとします。
a_bestで分割する決定ノードを作成します。
Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node.

Implementations

J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool.

Improvements from ID3 algorithm

C4.5 made a number of improvements to ID3. Some of these are:

Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.^[5]
Handling training data with missing attribute values - C4.5 allows attribute values to be marked as ? for missing. Missing attribute values are simply not used in gain and entropy calculations.
Handling attributes with differing costs.
Pruning trees after creation - C4.5 goes back through the tree once it's been created and attempts to remove branches that do not help by replacing them with leaf nodes.

Improvements in C5.0/See5 algorithm

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows) which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are:^[6]^[7]

Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude)
Memory usage - C5.0 is more memory efficient than C4.5
Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
Support for boosting - Boosting improves the trees and gives them more accuracy.
Weighting - C5.0 allows you to weight different cases and misclassification types.
Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.

Source for a single-threaded Linux version of C5.0 is available under the GNU General Public License (GPL).

References

^ Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
^ Ian H. Witten; Eibe Frank; Mark A. Hall (2011). "Data Mining: Practical machine learning tools and techniques, 3rd Edition". Morgan Kaufmann, San Francisco. p. 191. Archived from the original on 2020-11-27. Retrieved 2017-07-04.
^ Umd.edu - Top 10 Algorithms in Data Mining
^ S.B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica 31(2007) 249-268, 2007
^ J. R. Quinlan. Improved use of continuous attributes in c4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.
^ See5/C5.0 は C4.5 より優れていますか?
^ M. KuhnとK. Johnson、「応用予測モデリング」、Springer 2013

外部リンク

Ross Quinlan のホームページ上のオリジナルの実装: http://www.rulequest.com/Personal/
See5とC5.0

[1] Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

[2] Ian H. Witten; Eibe Frank; Mark A. Hall (2011). "Data Mining: Practical machine learning tools and techniques, 3rd Edition". Morgan Kaufmann, San Francisco. p. 191. Archived from the original on 2020-11-27. Retrieved 2017-07-04.

[3] Umd.edu - Top 10 Algorithms in Data Mining

[4] S.B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica 31(2007) 249-268, 2007

[5] J. R. Quinlan. Improved use of continuous attributes in c4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.

[6] See5/C5.0 は C4.5 より優れていますか?

[7] M. KuhnとK. Johnson、「応用予測モデリング」、Springer 2013