How the presence of a BOM affects reading a UTF-8 CSV with pandas

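The encoding option of pandas' read_csv() specifies the character encoding of the CSV being read (the default is utf-8). That made me wonder whether a CSV saved as UTF-8 with BOM is still read correctly when it is read as utf-8. Incidentally, UTF-8 with BOM is what you get when a CSV is created with Excel, or with Notepad on Windows 10 before the May 2019 Update. And conversely, is it safe to read a BOM-less UTF-8 file as UTF-8 with BOM? I happened to have a chance to look into this, so here is a summary.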

Questions

1. Read a UTF-8 CSV with read_csv() as utf8 (the normal case)
2. Read a UTF-8-with-BOM CSV with read_csv() as utf8
3. Read a UTF-8 CSV with read_csv() as utf_8_sig
4. Read a UTF-8-with-BOM CSV with read_csv() as utf_8_sig (the normal case)

Cases 1 and 4 should be fine, but will cases 2 and 3 be read correctly? Note that utf_8_sig is Python's codec name for UTF-8 with BOM.

Test environment

python 3.8.0
pandas 0.25.3

Verification

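I prepared two CSV files, one in UTF-8 without BOM and one in UTF-8 with BOM. They look identical when printed, but a hex dump shows the BOM (ef bb bf) in the first three bytes of the second file.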

$ cat data_nobom.csv
year,month,date
2020,1,1
2020,1,2
2020,1,3

$ od -A n -t x1 data_nobom.csv | xargs -L1
79 65 61 72 2c 6d 6f 6e 74 68 2c 64 61 74 65 0a
32 30 32 30 2c 31 2c 31 0a 32 30 32 30 2c 31 2c
32 0a 32 30 32 30 2c 31 2c 33 0a

$ cat data_bom.csv
year,month,date
2020,1,1
2020,1,2
2020,1,3

$ od -A n -t x1 data_bom.csv | xargs -L1
ef bb bf 79 65 61 72 2c 6d 6f 6e 74 68 2c 64 61
74 65 0a 32 30 32 30 2c 31 2c 31 0a 32 30 32 30
2c 31 2c 32 0a 32 30 32 30 2c 31 2c 33 0a
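
For anyone who wants to reproduce the two files without a text editor, here is a minimal Python sketch. The file names match the ones above; the temporary directory and the helper name write_sample_csvs are my own assumptions for illustration.

```python
import codecs
import tempfile
from pathlib import Path

CSV_BODY = b"year,month,date\n2020,1,1\n2020,1,2\n2020,1,3\n"


def write_sample_csvs(directory):
    """Write the BOM-less and BOM-prefixed sample files and return their paths."""
    directory = Path(directory)
    nobom = directory / "data_nobom.csv"
    bom = directory / "data_bom.csv"
    nobom.write_bytes(CSV_BODY)
    # codecs.BOM_UTF8 is the three bytes ef bb bf.
    bom.write_bytes(codecs.BOM_UTF8 + CSV_BODY)
    return nobom, bom


with tempfile.TemporaryDirectory() as tmp:
    nobom_path, bom_path = write_sample_csvs(tmp)
    nobom_head = nobom_path.read_bytes()[:3]
    bom_head = bom_path.read_bytes()[:3]
    print(nobom_head.hex(" "))  # 79 65 61
    print(bom_head.hex(" "))    # ef bb bf
```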

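If read_csv() did not handle this correctly, then for the UTF-8-with-BOM file the BOM at the start of the file would be treated as a character and end up in the first column name.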

In [30]: year = '\ufeffyear'

In [31]: print(year)
year

In [32]: print(year.encode())
b'\xef\xbb\xbfyear'
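
To make that failure mode concrete, here is a minimal sketch using only the standard library csv module, which does no BOM handling of its own. The sample bytes are made up to mirror the files above; the point is only to show what a header looks like when the BOM is kept.

```python
import codecs
import csv
import io

raw = codecs.BOM_UTF8 + b"year,month,date\n2020,1,1\n"

# Decoding as plain utf-8 keeps the BOM as the character U+FEFF.
header_utf8 = next(csv.reader(io.StringIO(raw.decode("utf-8"))))
# Decoding as utf-8-sig drops it before the csv module sees the text.
header_sig = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(repr(header_utf8[0]))  # '\ufeffyear'
print(repr(header_sig[0]))   # 'year'
```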

1. Read a UTF-8 CSV with read_csv() as utf8

In [33]: df = pd.read_csv('data_nobom.csv', encoding='utf8')

In [35]: print(df.columns[0].encode())
b'year'

2. Read a UTF-8-with-BOM CSV with read_csv() as utf8

In [36]: df = pd.read_csv('data_bom.csv', encoding='utf8')

In [38]: print(df.columns[0].encode())
b'year'

3. Read a UTF-8 CSV with read_csv() as utf_8_sig

In [39]: df = pd.read_csv('data_nobom.csv', encoding='utf_8_sig')

In [41]: print(df.columns[0].encode())
b'year'

4. Read a UTF-8-with-BOM CSV with read_csv() as utf_8_sig

In [42]: df = pd.read_csv('data_bom.csv', encoding='utf_8_sig')

In [44]: print(df.columns[0].encode())
b'year'

Conclusion

In short, pandas' read_csv() worked correctly whether the file was UTF-8 with BOM or plain UTF-8, and whether or not utf_8_sig was given as the encoding option. (Of course, a Shift-JIS file still has to be read with encoding set to shift-jis.)
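
The four checks can also be run as one script. This is a rough sketch that assumes pandas is installed and feeds the bytes through io.BytesIO instead of files on disk, which is my own shortcut and not what the session above did.

```python
import codecs
import io

import pandas as pd

CSV_BODY = b"year,month,date\n2020,1,1\n2020,1,2\n2020,1,3\n"
SAMPLES = {
    "no BOM": CSV_BODY,
    "BOM": codecs.BOM_UTF8 + CSV_BODY,
}

# Try every combination of file contents and encoding option.
results = {}
for label, raw in SAMPLES.items():
    for encoding in ("utf8", "utf_8_sig"):
        df = pd.read_csv(io.BytesIO(raw), encoding=encoding)
        results[(label, encoding)] = df.columns[0]
        print(label, encoding, df.columns[0].encode())
```

If any combination kept the BOM, its first column name would encode to bytes starting with ef bb bf instead of plain b'year'.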

Checking the code

Looking at the pandas source, there is indeed code that checks for a BOM.

        # This was the first line of the file,
        # which could contain the BOM at the
        # beginning of it.
        if self.pos == 1:
            line = self._check_for_bom(line)

↑https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L2844

    def _check_for_bom(self, first_row):
        """
        Checks whether the file begins with the BOM character.
        If it does, remove it. In addition, if there is quoting
        in the field subsequent to the BOM, remove it as well
        because it technically takes place at the beginning of
        the name, not the middle of it.
        """

↑https://github.com/pandas-dev/pandas/blob/master/pandas/io/parsers.py#L2731

Regardless of the encoding, _check_for_bom() is always called when the first line of the CSV is processed, and it removes the BOM if one is present. If there is no BOM, it does nothing.
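
The idea can be sketched in a few lines. The function below is a hypothetical simplification written for this post, not pandas' actual implementation, and it ignores the quoting case mentioned in the docstring above.

```python
BOM = "\ufeff"


def strip_bom_from_first_row(first_row):
    """Return the first row with a leading BOM removed, if one is present.

    Hypothetical helper for illustration only; not pandas' real code.
    """
    if not first_row or not isinstance(first_row[0], str):
        return first_row
    if not first_row[0].startswith(BOM):
        return first_row  # no BOM: leave the row untouched
    # Drop the BOM character from the first field only.
    return [first_row[0][len(BOM):]] + list(first_row[1:])


with_bom = strip_bom_from_first_row(["\ufeffyear", "month", "date"])
without_bom = strip_bom_from_first_row(["year", "month", "date"])
print(with_bom)     # ['year', 'month', 'date']
print(without_bom)  # ['year', 'month', 'date']
```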

That explains why "2. Read a UTF-8-with-BOM CSV with read_csv() as utf8" works. But what happens when encoding='utf_8_sig' is specified? The mystery of "3. Read a UTF-8 CSV with read_csv() as utf_8_sig" remains.

        if encoding:
            # Encoding
            f = open(path_or_buf, mode, encoding=encoding, newline="")

↑https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L440

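Apart from the BOM handling, the file itself appears to be opened with Python's standard open(). Even if this is not the exact code path, it is easy to imagine that pandas calls open() or the csv module somewhere and passes the encoding option through. So what happens when open() reads a BOM-less UTF-8 file as utf_8_sig?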

On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

https://docs.python.org/3/library/codecs.html#standard-encodings

The official documentation spells it out.

                if input[:3] == codecs.BOM_UTF8:
                    (output, consumed) = \
                       codecs.utf_8_decode(input[3:], errors, final)
                    return (output, consumed+3)

↑https://github.com/python/cpython/blob/master/Lib/encodings/utf_8_sig.py#L65

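Having come this far, I wanted to check the code as well. The codec does have a branch that skips the first three bytes only when they are the BOM. In other words, when a plain UTF-8 file is read as utf_8_sig, no utf_8_sig-specific processing happens and the file is decoded as ordinary UTF-8.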
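
This is easy to confirm with open() directly. The sketch below uses only the standard library; the file names and the temporary directory are assumptions made for the example.

```python
import codecs
import os
import tempfile

body = b"year,month,date\n2020,1,1\n"

first_lines = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, raw in (("nobom", body), ("bom", codecs.BOM_UTF8 + body)):
        path = os.path.join(tmp, name + ".csv")
        with open(path, "wb") as f:
            f.write(raw)
        # Read the same file back with both encodings.
        for encoding in ("utf8", "utf_8_sig"):
            with open(path, encoding=encoding, newline="") as f:
                first_lines[(name, encoding)] = f.readline()

for key, line in first_lines.items():
    print(key, repr(line))
```

Only the combination of a BOM file read as utf8 keeps the U+FEFF character at this level; a BOM-less file read as utf_8_sig comes out identical to reading it as utf8.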

That settles the mystery of "3. Read a UTF-8 CSV with read_csv() as utf_8_sig" too! pandas is smart!

dorapon2000’s diary
