Admin Tools
Editing file: charsetprober.cpython-311.pyc
� �܋f, � �p � d dl Z d dlZd dlmZmZ ddlmZmZ ej d� � Z G d� d� � Z dS )� N)�Optional�Union� )�LanguageFilter�ProbingStates% [a-zA-Z]*[�-�]+[a-zA-Z]*[^a-zA-Z�-�]?c �` � e Zd ZdZej fdeddfd�Zdd�Zede e fd�� � Zede e fd�� � Zd e eef defd �Zedefd�� � Zdefd�Zed e eef defd�� � Zed e eef defd�� � Zed e eef defd�� � ZdS )� CharSetProbergffffff�?�lang_filter�returnNc � � t j | _ d| _ || _ t j t � � | _ d S )NT) r � DETECTING�_state�activer �logging� getLogger�__name__�logger)�selfr s �L/opt/cloudlinux/venv/lib64/python3.11/site-packages/chardet/charsetprober.py�__init__zCharSetProber.__init__, s1 � �"�,������&����'��1�1����� c �( � t j | _ d S �N)r r r �r s r �resetzCharSetProber.reset2 s � �"�,����r c � � d S r � r s r �charset_namezCharSetProber.charset_name5 s � ��tr c � � t �r ��NotImplementedErrorr s r �languagezCharSetProber.language9 s � �!�!r �byte_strc � � t �r r )r r# s r �feedzCharSetProber.feed= s � �!�!r c � � | j S r )r r s r �statezCharSetProber.state@ s � ��{�r c � � dS )Ng r r s r �get_confidencezCharSetProber.get_confidenceD s � ��sr �bufc �2 � t j dd| � � } | S )Ns ([ -])+� )�re�sub)r* s r �filter_high_byte_onlyz#CharSetProber.filter_high_byte_onlyG s � ��f�&��c�2�2��� r c � � t � � }t � | � � }|D ]Z}|� |dd� � � |dd� }|� � � s|dk rd}|� |� � �[|S )u7 We define three types of bytes: alphabet: english alphabets [a-zA-Z] international: international characters [-ÿ] marker: everything else [^a-zA-Z-ÿ] The input buffer can be thought to contain a series of words delimited by markers. This function works to filter all words that contain at least one international character. All contiguous sequences of markers are replaced by a single space ascii character. This filter applies to all scripts which do not use English characters. 
N���� �r, )� bytearray�INTERNATIONAL_WORDS_PATTERN�findall�extend�isalpha)r* �filtered�words�word� last_chars r �filter_international_wordsz(CharSetProber.filter_international_wordsL s� � � �;�;�� ,�3�3�C�8�8��� '� '�D��O�O�D��"��I�&�&�&� �R�S�S� �I��$�$�&�&� !�9�w�+>�+>� � ��O�O�I�&�&�&�&��r c �v � t � � }d}d}t | � � � d� � } t | � � D ]U\ }}|dk r|dz }d}�|dk r<||k r4|s2|� | ||� � � |� d� � d}�V|s|� | |d � � � |S ) a[ Returns a copy of ``buf`` that retains only the sequences of English alphabet and high byte characters that are not between <> characters. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by ``Latin1Prober``. Fr �c� >r � <r, TN)r3 � memoryview�cast� enumerater6 )r* r8 �in_tag�prev�curr�buf_chars r �remove_xml_tagszCharSetProber.remove_xml_tagsn s� � � �;�;��������o�o�"�"�3�'�'��'��n�n� � �N�D�(� �4����a�x������T�!�!��$�;�;�v�;� �O�O�C��T� �N�3�3�3��O�O�D�)�)�)���� � (� �O�O�C����J�'�'�'��r )r N)r � __module__�__qualname__�SHORTCUT_THRESHOLDr �NONEr r �propertyr �strr r"
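
The two docstrings that survive in the bytecode above describe byte-level filters: one keeps only words containing at least one high byte, the other drops everything between `<` and `>`. The following is a minimal sketch of that behavior as read from those docstrings and the visible names; it is not the decompiled source. The function names `keep_international_words` and `strip_tags` and the constant `WORD_WITH_HIGH_BYTE` are hypothetical, and the byte range 0x80-0xFF for "international" bytes is an assumption.

```python
import re

# Assumed pattern: optional ASCII letters, one or more high bytes (0x80-0xFF),
# optional ASCII letters, then at most one trailing marker byte.
WORD_WITH_HIGH_BYTE = re.compile(
    rb"[a-zA-Z]*[\x80-\xff]+[a-zA-Z]*[^a-zA-Z\x80-\xff]?"
)


def keep_international_words(buf: bytes) -> bytes:
    """Keep only words with at least one high byte (hypothetical helper)."""
    kept = bytearray()
    for word in WORD_WITH_HIGH_BYTE.findall(buf):
        body, last = word[:-1], word[-1:]
        # A trailing ASCII marker (not a letter, not a high byte) becomes a space.
        if not last.isalpha() and last < b"\x80":
            last = b" "
        kept.extend(body)
        kept.extend(last)
    return bytes(kept)


def strip_tags(buf: bytes) -> bytes:
    """Keep only the bytes that are not between '<' and '>' (hypothetical helper)."""
    kept = bytearray()
    in_tag = False
    prev = 0  # start of the current run of text outside any tag
    for curr in range(len(buf)):
        ch = buf[curr:curr + 1]
        if ch == b">":
            prev = curr + 1
            in_tag = False
        elif ch == b"<":
            if curr > prev and not in_tag:
                # Flush the text seen since the last tag, followed by one space.
                kept.extend(buf[prev:curr])
                kept.extend(b" ")
            in_tag = True
    if not in_tag:
        # Flush trailing text when the buffer does not end inside a tag.
        kept.extend(buf[prev:])
    return bytes(kept)
```

For example, `keep_international_words(b"hello caf\xe9, na\xefve world")` drops the pure-ASCII words `hello` and `world` and replaces the comma with a space, while `strip_tags(b"<p>caf\xe9</p> ok")` keeps only the text outside the tags.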