qianfan.dataset package

A library that helps developers interact with datasets

class qianfan.dataset.DataSource[source]

Bases: ABC

basic data source class

abstract async afetch(**kwargs: Any) → str[source]

Asynchronously fetch data from source

Args:: **kwargs (Any): optional arguments
Returns:: str: content retrieved from data source

abstract async asave(data: str, **kwargs: Any) → bool[source]

Asynchronously export the data to the data source and return whether the export succeeded

Args:

data (str): data to be saved
**kwargs (Any): optional arguments

Returns:

bool: whether saving succeeded

abstract fetch(**kwargs: Any) → str[source]

Fetch data from source

Args:: **kwargs (Any): optional arguments
Returns:: str: content retrieved from data source

abstract format_type() → FormatType[source]

Get format type binding to source

Returns:: FormatType: format type binding to source

abstract save(data: str, **kwargs: Any) → bool[source]

Export the data to the data source and return whether the export succeeded

Args:

data (str): data to be saved
**kwargs (Any): optional arguments

Returns:

bool: whether saving succeeded

abstract set_format_type(format_type: FormatType) → None[source]

Set format type binding to source

Args:: format_type (FormatType): format type binding to source
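As a rough illustration of the interface above, the sketch below implements the six abstract methods on an in-memory source. To stay runnable without the SDK installed it does not import `qianfan`: `InMemorySource` and the string-valued format type are hypothetical stand-ins, and a real subclass would derive from `qianfan.dataset.DataSource` and use `FormatType`.

```python
import asyncio
from typing import Any


class InMemorySource:
    """Hypothetical stand-in that mirrors the DataSource method names."""

    def __init__(self, format_type: str = "jsonl") -> None:
        self._buffer = ""  # content is held in memory instead of a file
        self._format_type = format_type

    def fetch(self, **kwargs: Any) -> str:
        return self._buffer

    def save(self, data: str, **kwargs: Any) -> bool:
        self._buffer = data
        return True

    async def afetch(self, **kwargs: Any) -> str:
        # Nothing to await for an in-memory buffer; delegate to the sync path.
        return self.fetch(**kwargs)

    async def asave(self, data: str, **kwargs: Any) -> bool:
        return self.save(data, **kwargs)

    def format_type(self) -> str:
        return self._format_type

    def set_format_type(self, format_type: str) -> None:
        self._format_type = format_type


source = InMemorySource()
saved = source.save('{"prompt": "hi"}\n')
fetched = asyncio.run(source.afetch())
```

The async methods simply wrap the synchronous ones here; a source backed by network or disk I/O would presumably do real asynchronous work instead.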

class qianfan.dataset.Dataset(*, inner_table: Table, inner_data_source_cache: Optional[DataSource] = None, inner_schema_cache: Optional[Schema] = None)[source]

Bases: Table

append(elem: Any) → Self[source]

append an element to dataset

Args:: elem (Union[List[Dict], Tuple[Dict], Dict]): elements added to dataset
Returns:: Self: Dataset itself

col_append(elem: Any) → Self[source]

append a column to dataset

Args:

elem (Dict[str, List]): dict containing the element added to dataset: must have column name “name” and column data list “data”

Returns:

Self: Dataset itself
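The expected shape of `elem` can be sketched with plain Python. The snippet below only models the documented dict (a `name` key and a `data` list) against a column-oriented dict; `append_column` is a hypothetical helper, not part of the SDK, and the length check is an assumption about what a table would require.

```python
from typing import Any, Dict, List


def append_column(
    columns: Dict[str, List[Any]], elem: Dict[str, Any]
) -> Dict[str, List[Any]]:
    """Hypothetical helper: add elem["data"] under the key elem["name"]."""
    name, data = elem["name"], elem["data"]
    # Assumption: a new column needs exactly one value per existing row.
    if columns and len(data) != len(next(iter(columns.values()))):
        raise ValueError("column length does not match row count")
    return {**columns, name: list(data)}


table = {"prompt": ["a", "b"]}
table = append_column(table, {"name": "response", "data": ["x", "y"]})
```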

col_delete(index: Union[int, str]) → Self[source]

delete a column from dataset

Args:: index (str): column name to delete
Returns:: Self: Dataset itself

col_filter(op: Callable[[Any], bool]) → Self[source]

filter on dataset’s column

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: Dataset itself

col_list(by: Optional[Union[slice, int, str, List[int], Tuple[int], List[str], Tuple[str]]] = None) → Any[source]

get column(s) from dataset

Args:

by (Optional[Union[int, str, Sequence[int], Sequence[str]]]): index or indices of columns, default to None, in which case all dataset columns are returned as a Python list

Returns:

Any: dataset column list

col_map(op: Callable[[Any], Any]) → Self[source]

map on dataset’s column

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: Dataset itself

col_names() → List[str][source]

get column name list

Returns:: List[str]: column name list

classmethod create_from_pyarrow_table(table: Table, schema: Optional[Schema] = None) → Dataset[source]

create a dataset from pyarrow table

Args:

table (pyarrow.Table): pyarrow table object used to create dataset
schema (Optional[Schema]): schema used to validate before exporting data, default to None

Returns:

Dataset: a dataset instance

classmethod create_from_pyobj(data: Union[List[Dict[str, Any]], Dict[str, List]], schema: Optional[Schema] = None) → Dataset[source]

create a dataset from python dict or list

Args:

data (Union[List[Dict[str, Any]], Dict[str, List]]): python object used to create dataset
schema (Optional[Schema]): schema used to validate before exporting data, default to None

Returns:

Dataset: a dataset instance
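The two accepted shapes of `data` are a list of row dicts and a dict of column lists. The sketch below shows both for the same records using only the standard library; `rows_to_columns` is a hypothetical helper written for this illustration, and the commented calls assume the SDK is installed.

```python
from typing import Any, Dict, List

# Row-oriented shape: List[Dict[str, Any]]
rows: List[Dict[str, Any]] = [
    {"prompt": "1+1", "response": "2"},
    {"prompt": "2+2", "response": "4"},
]


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Hypothetical helper: pivot a list of row dicts into a dict of columns."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Column-oriented shape: Dict[str, List]
columns = rows_to_columns(rows)

# With the SDK installed, either object could be passed as `data`:
#   Dataset.create_from_pyobj(rows)
#   Dataset.create_from_pyobj(columns)
```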

delete(index: Union[int, str]) → Self[source]

delete an element from dataset

Args:: index (Union[int, str]): element index to delete
Returns:: Self: Dataset itself

filter(op: Callable[[Any], bool]) → Self[source]

filter on dataset

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: Dataset itself

inner_data_source_cache: Optional[DataSource]

inner_schema_cache: Optional[Schema]

list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None, **kwargs: Any) → Any[source]

get element(s) from dataset

Args:

by (Optional[Union[slice, int, Sequence[int]]]): index or indices of elements, default to None, in which case all dataset rows are returned as a Python list

Returns:

Any: dataset row list
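The documented forms of `by` (None, a single index, a slice, or a sequence of indices) can be pictured on a plain list of rows. This is only a sketch of the selection semantics as described above, not the SDK's implementation; `select_rows` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Sequence, Union

Row = Dict[str, Any]


def select_rows(
    rows: List[Row], by: Optional[Union[slice, int, Sequence[int]]] = None
) -> Any:
    """Hypothetical helper mirroring the documented forms of `by`."""
    if by is None:
        return list(rows)  # no selector: every row
    if isinstance(by, (int, slice)):
        return rows[by]  # a single row, or a contiguous range of rows
    return [rows[i] for i in by]  # explicit indices, in the given order


rows = [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}]
```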

classmethod load(source: Optional[DataSource] = None, data_file: Optional[str] = None, qianfan_dataset_id: Optional[int] = None, huggingface_name: Optional[str] = None, schema: Optional[Schema] = None, **kwargs: Any) → Dataset[source]

Read data from the source or create a source from the parameters and create a Table instance. If a schema is specified, perform validation after importing.

Args:

source (Optional[DataSource]): where the dataset is loaded from, default to None, in which case a data source is created inside the dataset using the parameters below
data_file (Optional[str]): dataset local file path, default to None
qianfan_dataset_id (Optional[int]): qianfan dataset ID, default to None
huggingface_name (Optional[str]): Hugging Face dataset name, not available now
schema (Optional[Schema]): schema used to validate loaded data, default to None
kwargs (Any): optional arguments

Returns:

Dataset: a dataset instance
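A local file for `data_file` can be prepared with the standard library. The sketch below writes a small JSON Lines file and reads it back; the file name and records are made up for illustration, and whether `load` infers the format from the extension is an assumption. The commented call assumes the SDK is installed.

```python
import json
import os
import tempfile

records = [
    {"prompt": "hello", "response": "world"},
    {"prompt": "foo", "response": "bar"},
]

# Write one JSON object per line (JSON Lines) to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# With the SDK installed, the file could then be loaded by path:
#   from qianfan.dataset import Dataset
#   ds = Dataset.load(data_file=path)

# Read the file back to confirm the content survives a round trip.
with open(path, encoding="utf-8") as f:
    round_trip = [json.loads(line) for line in f]
```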

map(op: Callable[[Any], Any]) → Self[source]

map on dataset

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: Dataset itself
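The handler signatures for `filter` (`Callable[[Any], bool]`) and `map` (`Callable[[Any], Any]`) can be sketched as ordinary functions. The example assumes each handler receives one row as a dict, which the reference above does not state explicitly, and uses Python's built-in `filter` as a stand-in for the dataset methods.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def has_response(row: Row) -> bool:
    """Filter handler: keep rows whose response is non-empty."""
    return bool(row.get("response"))


def strip_prompt(row: Row) -> Row:
    """Map handler: return a new row with whitespace trimmed from the prompt."""
    return {**row, "prompt": row["prompt"].strip()}


rows = [
    {"prompt": "  hi ", "response": "hello"},
    {"prompt": "empty", "response": ""},
]

# Built-in filter plus a comprehension stand in for Dataset.filter / Dataset.map.
cleaned = [strip_prompt(r) for r in filter(has_response, rows)]
```

Because both methods return the dataset itself, calls on a real `Dataset` could presumably be chained in the same filter-then-map order.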

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'inner_data_source_cache': FieldInfo(annotation=Union[DataSource, NoneType], required=False), 'inner_schema_cache': FieldInfo(annotation=Union[Schema, NoneType], required=False), 'inner_table': FieldInfo(annotation=Table, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

online_data_process(operators: List[QianfanOperator]) → Dict[str, Any][source]

create an online ETL task on qianfan. Not available currently

Args:

operators (List[QianfanOperator]): operators applied to ETL task

Returns:

Dict[str, Any]: ETL task info, containing 3 fields:

is_succeeded (bool): whether the ETL task succeeded
etl_task_id (Optional[int]): ETL task id, only exists when the ETL task is created successfully
new_dataset_id (Optional[int]): id of the dataset which stores data after ETL, only exists when the ETL task succeeded
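Since two of the three fields are only present in some outcomes, a caller would likely read the returned dict defensively. A minimal sketch, with a hypothetical helper and made-up sample values:

```python
from typing import Any, Dict, Optional


def new_dataset_id_or_none(info: Dict[str, Any]) -> Optional[int]:
    """Hypothetical helper: return the output dataset ID only on success."""
    if not info.get("is_succeeded", False):
        return None
    # new_dataset_id is documented as present only when the task succeeded.
    return info.get("new_dataset_id")


# Sample dicts shaped like the documented return value (values are invented).
ok = {"is_succeeded": True, "etl_task_id": 12, "new_dataset_id": 34}
failed = {"is_succeeded": False}
```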

save(destination: Optional[DataSource] = None, data_file: Optional[str] = None, qianfan_dataset_id: Optional[int] = None, qianfan_dataset_create_args: Optional[Dict[str, Any]] = None, huggingface_name: Optional[str] = None, schema: Optional[Schema] = None, **kwargs: Any) → bool[source]

Write data to source. If a schema has been passed, validate data before exporting

Args:

destination (Optional[DataSource]): data source the dataset is exported to, default to None, in which case a data source is created inside the dataset using the parameters below
data_file (Optional[str]): dataset local file path, default to None
qianfan_dataset_id (Optional[int]): qianfan dataset ID, default to None
qianfan_dataset_create_args (Optional[Dict[str, Any]]): create arguments for creating a bare dataset on qianfan, default to None
huggingface_name (Optional[str]): Hugging Face dataset name, not available now
schema (Optional[Schema]): schema used to validate before exporting data, default to None
kwargs (Any): optional arguments

Returns:

bool: whether saving succeeded

class qianfan.dataset.FileDataSource(*, path: str, file_format: Optional[FormatType] = None)[source]

Bases: DataSource, BaseModel

file data source

async afetch(**kwargs: Any) → str[source]

Asynchronously read data from file. Not available currently

Args:: **kwargs (Any): Arbitrary keyword arguments.
Returns:: str: A string containing the data read from the file.

async asave(data: str, **kwargs: Any) → bool[source]

Asynchronously write data to file. Not available currently

Args:

data (str): data waiting to be written
**kwargs (Any): optional arguments

Returns:

bool: whether the data has been written successfully

fetch(**kwargs: Any) → str[source]

Read data from file.

Args:: **kwargs (Any): Arbitrary keyword arguments.
Returns:: str: A string containing the data read from the file.

file_format: Optional[FormatType]

format_type() → FormatType[source]

Get format type binding to source

Returns:: FormatType: format type binding to source

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'file_format': FieldInfo(annotation=Union[FormatType, NoneType], required=False), 'path': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

path: str

save(data: str, **kwargs: Any) → bool[source]

Write data to file.

Args:

data (str): data waiting to be written
**kwargs (Any): optional arguments

Returns:

bool: whether the data has been written successfully

set_format_type(format_type: FormatType) → None[source]

Set format type binding to source

Args:: format_type (FormatType): format type binding to source

class qianfan.dataset.QianfanDataSource(*, id: int, group_id: int, name: str, set_type: DataSetType, project_type: DataProjectType, template_type: DataTemplateType, version: int, storage_type: DataStorageType, storage_id: str, storage_path: str, storage_raw_path: Optional[str] = None, storage_name: str, storage_region: Optional[str] = None, info: Dict[str, Any] = {}, download_when_init: bool = False, data_format_type: FormatType, ak: Optional[str] = None, sk: Optional[str] = None)[source]

Bases: DataSource, BaseModel

Qianfan data source

async afetch(**kwargs: Any) → str[source]

Asynchronously read data from qianfan or local cache. Not available currently

Args:: **kwargs (Any): Arbitrary keyword arguments.
Returns:: str: A string containing the data.

ak: Optional[str]

async asave(data: str, is_annotated: bool = False, **kwargs: Any) → bool[source]

Asynchronously write data to qianfan. Currently only writing to user BOS storage is supported.

Not available currently

Args:

data (str): data waiting to be uploaded
is_annotated (bool): whether the data has been annotated
**kwargs (Any): optional arguments

Returns:

bool: whether the data has been uploaded successfully

classmethod create_bare_dataset(name: str, template_type: DataTemplateType, storage_type: DataStorageType = DataStorageType.PublicBos, storage_id: Optional[str] = None, storage_path: Optional[str] = None, addition_info: Optional[Dict[str, Any]] = None, ak: Optional[str] = None, sk: Optional[str] = None, **kwargs: Any) → QianfanDataSource[source]

create a bare (empty) dataset on qianfan as a data source

Args:

name (str): dataset name you want
template_type (DataTemplateType): template type applied to the dataset
storage_type (DataStorageType): data storage type used to store your data, default to PublicBos
storage_id (Optional[str]): private BOS bucket name, needed when storage_type is PrivateBos, default to None
storage_path (Optional[str]): private BOS file path, needed when storage_type is PrivateBos, default to None
addition_info (Optional[Dict[str, Any]]): additional info you want to have, default to None
ak (Optional[str]): console ak related to your dataset and bos, default to None
sk (Optional[str]): console sk related to your dataset and bos, default to None
kwargs (Any): other arguments

Returns:

QianfanDataSource: a data source representing your dataset on Qianfan
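The dependency between `storage_type` and the two private BOS arguments can be checked before calling the API. The sketch below is an assumption-laden illustration: `StorageType` is a hypothetical stand-in for `DataStorageType` (its values are invented), and `check_storage_args` is not an SDK function.

```python
from enum import Enum
from typing import Optional


class StorageType(Enum):
    """Hypothetical stand-in for DataStorageType; values are illustrative."""

    PublicBos = "public"
    PrivateBos = "private"


def check_storage_args(
    storage_type: StorageType,
    storage_id: Optional[str] = None,
    storage_path: Optional[str] = None,
) -> None:
    """Raise ValueError if private BOS storage lacks a bucket name or path."""
    if storage_type is StorageType.PrivateBos and not (storage_id and storage_path):
        raise ValueError("PrivateBos requires both storage_id and storage_path")


# Public storage needs neither argument; private storage needs both.
check_storage_args(StorageType.PublicBos)
check_storage_args(StorageType.PrivateBos, "my-bucket", "/datasets/")
```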

data_format_type: FormatType

download_when_init: bool

fetch(**kwargs: Any) → str[source]

Read data from qianfan or local cache.

Args:: **kwargs (Any): Arbitrary keyword arguments.
Returns:: str: A string containing the data.

format_type() → FormatType[source]

Get format type binding to qianfan data source

Returns:: FormatType: format type binding to qianfan data source

classmethod get_existed_dataset(dataset_id: int, is_download_to_local: bool = True, ak: Optional[str] = None, sk: Optional[str] = None, **kwargs: Any) → QianfanDataSource[source]

Load a dataset from qianfan as data source

Args:

dataset_id (int): dataset id on Qianfan, shown as “数据集版本 ID” (dataset version ID)
is_download_to_local (bool): whether to download the dataset file when initializing the object, default to True
ak (Optional[str]): console ak related to your dataset and bos, default to None
sk (Optional[str]): console sk related to your dataset and bos, default to None
kwargs (Any): other arguments

Returns:

QianfanDataSource: a data source representing your dataset on Qianfan

group_id: int

id: int

info: Dict[str, Any]

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'ak': FieldInfo(annotation=Union[str, NoneType], required=False), 'data_format_type': FieldInfo(annotation=FormatType, required=True), 'download_when_init': FieldInfo(annotation=bool, required=False, default=False), 'group_id': FieldInfo(annotation=int, required=True), 'id': FieldInfo(annotation=int, required=True), 'info': FieldInfo(annotation=Dict[str, Any], required=False, default={}), 'name': FieldInfo(annotation=str, required=True), 'project_type': FieldInfo(annotation=DataProjectType, required=True), 'set_type': FieldInfo(annotation=DataSetType, required=True), 'sk': FieldInfo(annotation=Union[str, NoneType], required=False), 'storage_id': FieldInfo(annotation=str, required=True), 'storage_name': FieldInfo(annotation=str, required=True), 'storage_path': FieldInfo(annotation=str, required=True), 'storage_raw_path': FieldInfo(annotation=Union[str, NoneType], required=False), 'storage_region': FieldInfo(annotation=Union[str, NoneType], required=False), 'storage_type': FieldInfo(annotation=DataStorageType, required=True), 'template_type': FieldInfo(annotation=DataTemplateType, required=True), 'version': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

name: str

project_type: DataProjectType

release_dataset() → bool[source]

release the dataset

Returns:: bool: Whether releasing succeeded

save(data: str, is_annotated: bool = False, does_release: bool = False, **kwargs: Any) → bool[source]

Write data to qianfan. Currently only writing to user BOS storage is supported.

Args:

data (str): data waiting to be uploaded
is_annotated (bool): whether the data has been annotated, default to False
does_release (bool): whether to release the dataset after saving successfully, default to False
**kwargs (Any): optional arguments

Returns:

bool: whether the data has been uploaded successfully

set_format_type(format_type: FormatType) → None[source]

Set format type binding to qianfan data source. Not available

TextOnly -> Jsonl
MultiModel -> Json

set_type: DataSetType

sk: Optional[str]

storage_id: str

storage_name: str

storage_path: str

storage_raw_path: Optional[str]

storage_region: Optional[str]

storage_type: DataStorageType

template_type: DataTemplateType

version: int

class qianfan.dataset.Table(*, inner_table: Table)[source]

Bases: BaseModel, Appendable, Listable, Processable

in-memory dataset representation based on pyarrow.Table, implementing the interfaces in process_interface.py

class Config[source]

Bases: object

arbitrary_types_allowed = True

append(elem: Any) → Self[source]

append an element to pyarrow table

Args:: elem (Union[List[Dict], Tuple[Dict], Dict]): elements added to pyarrow table
Returns:: Self: Table itself

col_append(elem: Any) → Self[source]

append a column to pyarrow table

Args:

elem (Dict[str, List]): dict containing the element added to pyarrow table: must have column name “name” and column data list “data”

Returns:

Self: Table itself

col_delete(index: Union[int, str]) → Self[source]

delete a column from pyarrow table

Args:: index (str): column name to delete
Returns:: Self: Table itself

col_filter(op: Callable[[Any], bool]) → Self[source]

filter on pyarrow table’s column

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: Table itself

col_list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) → Any[source]

get column(s) from pyarrow table

Args:

by (Optional[Union[int, str, Sequence[int], Sequence[str]]]): index or indices of columns, default to None, in which case all pyarrow table columns are returned as a Python list

Returns:

Any: pyarrow table column list

col_map(op: Callable[[Any], Any]) → Self[source]

map on pyarrow table’s column

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: Table itself

col_names() → List[str][source]

get column name list

Returns:: List[str]: column name list

delete(index: Union[int, str]) → Self[source]

delete an element from pyarrow table

Args:: index (Union[int, str]): element index to delete
Returns:: Self: Table itself

filter(op: Callable[[Any], bool]) → Self[source]

filter on pyarrow table’s row

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: Table itself

get_column_count() → int[source]

get pyarrow table column count.

Returns:: int: column count.

get_row_count() → int[source]

get pyarrow table row count.

Returns:: int: row count.

inner_table: Table

list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) → Any[source]

get element(s) from pyarrow table

Args:

by (Optional[Union[slice, int, Sequence[int]]]): index or indices of elements, default to None, in which case all pyarrow table rows are returned as a Python list

Returns:

Any: pyarrow table row list

map(op: Callable[[Any], Any]) → Self[source]

map on pyarrow table’s row

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: Table itself

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'inner_table': FieldInfo(annotation=Table, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

to_pydict() → Dict[source]

convert a pyarrow table to dict

Returns:: Dict: a dict

to_pylist() → List[source]

convert a pyarrow table to list

Returns:: List: a list

Submodules

qianfan.dataset.consts module

constants for dataset using

qianfan.dataset.data_operator module

data operator for qianfan online. Not available currently

class qianfan.dataset.data_operator.DeduplicationSimhash(*, operator_name: str = 'deduplication_simhash', operator_type: str = 'deduplication', distance: float)[source]

Bases: Deduplicator

Deduplicator class to deduplicate by simhash

distance: float

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'distance': FieldInfo(annotation=float, required=True), 'operator_name': FieldInfo(annotation=str, required=False, default='deduplication_simhash'), 'operator_type': FieldInfo(annotation=str, required=False, default='deduplication')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.Deduplicator(*, operator_name: str, operator_type: str = 'deduplication')[source]

Bases: QianfanOperator

Deduplicator class for online ETL operator

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=True), 'operator_type': FieldInfo(annotation=str, required=False, default='deduplication')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_type: str

class qianfan.dataset.data_operator.DesensitizationProcessor(*, operator_name: str, operator_type: str = 'desensitization')[source]

Bases: QianfanOperator

Sensitive data processor class for online ETL operator

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=True), 'operator_type': FieldInfo(annotation=str, required=False, default='desensitization')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_type: str

class qianfan.dataset.data_operator.ExceptionRegulator(*, operator_name: str, operator_type: str = 'clean')[source]

Bases: QianfanOperator

Exception class for online ETL operator

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=True), 'operator_type': FieldInfo(annotation=str, required=False, default='clean')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_type: str

class qianfan.dataset.data_operator.Filter(*, operator_name: str, operator_type: str = 'filter')[source]

Bases: QianfanOperator

Filter class for online ETL operator

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=True), 'operator_type': FieldInfo(annotation=str, required=False, default='filter')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_type: str

class qianfan.dataset.data_operator.FilterCheckCharacterRepetitionRemoval(*, operator_name: str = 'filter_check_character_repetition_removal', operator_type: str = 'filter', character_repetition_max_cutoff: float)[source]

Bases: Filter

Filter class to check character repetition removal

character_repetition_max_cutoff: float

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'character_repetition_max_cutoff': FieldInfo(annotation=float, required=True), 'operator_name': FieldInfo(annotation=str, required=False, default='filter_check_character_repetition_removal'), 'operator_type': FieldInfo(annotation=str, required=False, default='filter')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.FilterCheckFlaggedWords(*, operator_name: str = 'filter_check_flagged_words', operator_type: str = 'filter', flagged_words_max_cutoff: float)[source]

Bases: Filter

Filter class to check flagged words

flagged_words_max_cutoff: float

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'flagged_words_max_cutoff': FieldInfo(annotation=float, required=True), 'operator_name': FieldInfo(annotation=str, required=False, default='filter_check_flagged_words'), 'operator_type': FieldInfo(annotation=str, required=False, default='filter')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.FilterCheckLangId(*, operator_name: str = 'filter_check_lang_id', operator_type: str = 'filter', lang_id_min_cutoff: float)[source]

Bases: Filter

Filter class to check lang id

lang_id_min_cutoff: float

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'lang_id_min_cutoff': FieldInfo(annotation=float, required=True), 'operator_name': FieldInfo(annotation=str, required=False, default='filter_check_lang_id'), 'operator_type': FieldInfo(annotation=str, required=False, default='filter')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.FilterCheckNumberWords(*, operator_name: str = 'filter_check_number_words', operator_type: str = 'filter', number_words_min_cutoff: int = 1, number_words_max_cutoff: int = 10000)[source]

Bases: Filter

Filter class to check number of words

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'number_words_max_cutoff': FieldInfo(annotation=int, required=False, default=10000), 'number_words_min_cutoff': FieldInfo(annotation=int, required=False, default=1), 'operator_name': FieldInfo(annotation=str, required=False, default='filter_check_number_words'), 'operator_type': FieldInfo(annotation=str, required=False, default='filter')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

number_words_max_cutoff: int

number_words_min_cutoff: int

operator_name: str
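The field defaults listed in the signature above can be pictured with a small stand-in. `NumberWordsFilter` below is a hypothetical dataclass that only mirrors the documented field names and defaults so the snippet runs without the SDK; the real class is a pydantic model, and the min/max check is an added assumption rather than documented behavior.

```python
from dataclasses import asdict, dataclass


@dataclass
class NumberWordsFilter:
    """Hypothetical stand-in mirroring the fields of FilterCheckNumberWords."""

    operator_name: str = "filter_check_number_words"
    operator_type: str = "filter"
    number_words_min_cutoff: int = 1
    number_words_max_cutoff: int = 10000

    def __post_init__(self) -> None:
        # Assumption: a minimum above the maximum would match nothing.
        if self.number_words_min_cutoff > self.number_words_max_cutoff:
            raise ValueError("min cutoff exceeds max cutoff")


default_op = NumberWordsFilter()
short_text_op = NumberWordsFilter(number_words_max_cutoff=200)
```

Only the cutoffs would normally be overridden; `operator_name` and `operator_type` identify the operator and keep their defaults.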

class qianfan.dataset.data_operator.FilterCheckPerplexity(*, operator_name: str = 'filter_check_perplexity', operator_type: str = 'filter', perplexity_max_cutoff: int)[source]

Bases: Filter

Filter class to check perplexity

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='filter_check_perplexity'), 'operator_type': FieldInfo(annotation=str, required=False, default='filter'), 'perplexity_max_cutoff': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

perplexity_max_cutoff: int

class qianfan.dataset.data_operator.FilterCheckSpecialCharacters(*, operator_name: str = 'filter_check_special_characters', operator_type: str = 'filter', special_characters_max_cutoff: float)[source]

Bases: Filter

Filter class to check special characters

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='filter_check_special_characters'), 'operator_type': FieldInfo(annotation=str, required=False, default='filter'), 'special_characters_max_cutoff': FieldInfo(annotation=float, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

special_characters_max_cutoff: float

class qianfan.dataset.data_operator.FilterCheckWordRepetitionRemoval(*, operator_name: str = 'filter_check_word_repetition_removal', operator_type: str = 'filter', word_repetition_max_cutoff: float)[source]

Bases: Filter

Filter class to check word repetition removal

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='filter_check_word_repetition_removal'), 'operator_type': FieldInfo(annotation=str, required=False, default='filter'), 'word_repetition_max_cutoff': FieldInfo(annotation=float, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

word_repetition_max_cutoff: float

class qianfan.dataset.data_operator.QianfanOperator(*, operator_name: str, operator_type: str)[source]

Bases: BaseModel

Basic class for online ETL operator

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=True), 'operator_type': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

operator_type: str

class qianfan.dataset.data_operator.RemoveEmoji(*, operator_name: str = 'remove_emoji', operator_type: str = 'clean')[source]

Bases: ExceptionRegulator

Exception class to remove emoji

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='remove_emoji'), 'operator_type': FieldInfo(annotation=str, required=False, default='clean')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.RemoveInvisibleCharacter(*, operator_name: str = 'remove_invisible_character', operator_type: str = 'clean')[source]

Bases: ExceptionRegulator

Exception class to remove invisible character

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='remove_invisible_character'), 'operator_type': FieldInfo(annotation=str, required=False, default='clean')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.RemoveNonMeaningCharacters(*, operator_name: str = 'remove_non_meaning_characters', operator_type: str = 'clean')[source]

Bases: ExceptionRegulator

Exception class to remove non-meaning characters

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='remove_non_meaning_characters'), 'operator_type': FieldInfo(annotation=str, required=False, default='clean')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.RemoveWebIdentifiers(*, operator_name: str = 'remove_web_identifiers', operator_type: str = 'clean')[source]

Bases: ExceptionRegulator

Exception class to remove web identifiers

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='remove_web_identifiers'), 'operator_type': FieldInfo(annotation=str, required=False, default='clean')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.ReplaceEmails(*, operator_name: str = 'replace_emails', operator_type: str = 'desensitization')[source]

Bases: DesensitizationProcessor

Sensitive data processor class to replace emails

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='replace_emails'), 'operator_type': FieldInfo(annotation=str, required=False, default='desensitization')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.ReplaceIdentifier(*, operator_name: str = 'replace_identifier', operator_type: str = 'desensitization')[source]

Bases: DesensitizationProcessor

Sensitive data processor class to replace identifier

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='replace_identifier'), 'operator_type': FieldInfo(annotation=str, required=False, default='desensitization')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.ReplaceIp(*, operator_name: str = 'replace_ip', operator_type: str = 'desensitization')[source]

Bases: DesensitizationProcessor

Sensitive data processor class to replace ip

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='replace_ip'), 'operator_type': FieldInfo(annotation=str, required=False, default='desensitization')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.ReplaceTraditionalChineseToSimplified(*, operator_name: str = 'replace_traditional_chinese_to_simplified', operator_type: str = 'clean')[source]

Bases: ExceptionRegulator

Exception class to convert Traditional Chinese to Simplified Chinese

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='replace_traditional_chinese_to_simplified'), 'operator_type': FieldInfo(annotation=str, required=False, default='clean')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

class qianfan.dataset.data_operator.ReplaceUniformWhitespace(*, operator_name: str = 'replace_uniform_whitespace', operator_type: str = 'clean')[source]

Bases: ExceptionRegulator

Exception class to replace uniform whitespace

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'operator_name': FieldInfo(annotation=str, required=False, default='replace_uniform_whitespace'), 'operator_type': FieldInfo(annotation=str, required=False, default='clean')}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

operator_name: str

qianfan.dataset.data_source module

data source which is related to download/upload

class qianfan.dataset.data_source.DataSource[source]

Bases: ABC

basic data source class

abstract async afetch(**kwargs: Any) → str[source]

Asynchronously fetch data from source

Args:: **kwargs (Any): optional arguments
Returns:: str: content retrieved from data source

abstract async asave(data: str, **kwargs: Any) → bool[source]

Asynchronously export the data to the data source and return whether the export succeeded

Args:
data (str): data to be saved
**kwargs (Any): optional arguments

Returns:: bool: whether saving succeeded

abstract fetch(**kwargs: Any) → str[source]

Fetch data from source

Args:: **kwargs (Any): optional arguments
Returns:: str: content retrieved from data source

abstract format_type() → FormatType[source]

Get format type binding to source

Returns:: FormatType: format type binding to source

abstract save(data: str, **kwargs: Any) → bool[source]

Export the data to the data source and return whether the export succeeded

Args:
data (str): data to be saved
**kwargs (Any): optional arguments

Returns:: bool: whether saving succeeded

abstract set_format_type(format_type: FormatType) → None[source]

Set format type binding to source

Args:: format_type (FormatType): format type binding to source
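The interface above can be illustrated with a minimal in-memory sketch. To stay runnable without the SDK installed, it defines its own stand-ins: DataSourceSketch, InMemoryDataSource and the local FormatType enum are hypothetical names that only mirror the documented method signatures and are not part of qianfan.

```python
import asyncio
from abc import ABC, abstractmethod
from enum import Enum
from typing import Any


class FormatType(Enum):
    """Local stand-in mirroring the documented FormatType values."""

    Json = "json"
    Jsonl = "jsonl"
    Csv = "csv"
    Text = "txt"


class DataSourceSketch(ABC):
    """Stand-in for the documented abstract interface (not the real class)."""

    @abstractmethod
    def fetch(self, **kwargs: Any) -> str: ...

    @abstractmethod
    def save(self, data: str, **kwargs: Any) -> bool: ...

    @abstractmethod
    async def afetch(self, **kwargs: Any) -> str: ...

    @abstractmethod
    async def asave(self, data: str, **kwargs: Any) -> bool: ...

    @abstractmethod
    def format_type(self) -> FormatType: ...

    @abstractmethod
    def set_format_type(self, format_type: FormatType) -> None: ...


class InMemoryDataSource(DataSourceSketch):
    """Keeps the saved string in memory instead of a file or remote store."""

    def __init__(self, format_type: FormatType = FormatType.Jsonl) -> None:
        self._content = ""
        self._format_type = format_type

    def fetch(self, **kwargs: Any) -> str:
        return self._content

    def save(self, data: str, **kwargs: Any) -> bool:
        self._content = data
        return True

    async def afetch(self, **kwargs: Any) -> str:
        # The async variants simply delegate to the synchronous ones here.
        return self.fetch(**kwargs)

    async def asave(self, data: str, **kwargs: Any) -> bool:
        return self.save(data, **kwargs)

    def format_type(self) -> FormatType:
        return self._format_type

    def set_format_type(self, format_type: FormatType) -> None:
        self._format_type = format_type


source = InMemoryDataSource()
saved = source.save('{"prompt": "hi"}')
fetched_async = asyncio.run(source.afetch())
```

A real subclass would inherit from qianfan.dataset.data_source.DataSource instead of the stand-in and implement the same six methods.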

class qianfan.dataset.data_source.FileDataSource(*, path: str, file_format: Optional[FormatType] = None)[source]

Bases: DataSource, BaseModel

file data source

async afetch(**kwargs: Any) → str[source]

Asynchronously read data from file. Not currently available.

Args:: **kwargs (Any): Arbitrary keyword arguments.
Returns:: str: A string containing the data read from the file.

async asave(data: str, **kwargs: Any) → bool[source]

Asynchronously write data to file. Not currently available.

Args:
data (str): data to be written
**kwargs (Any): optional arguments

Returns:: bool: whether the data was written successfully

fetch(**kwargs: Any) → str[source]

Read data from file.

Args:: **kwargs (Any): Arbitrary keyword arguments.
Returns:: str: A string containing the data read from the file.

file_format: Optional[FormatType]

format_type() → FormatType[source]

Get format type binding to source

Returns:: FormatType: format type binding to source

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'file_format': FieldInfo(annotation=Union[FormatType, NoneType], required=False), 'path': FieldInfo(annotation=str, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

path: str

save(data: str, **kwargs: Any) → bool[source]

Write data to file.

Args:
data (str): data to be written
**kwargs (Any): optional arguments

Returns:: bool: whether the data was written successfully

set_format_type(format_type: FormatType) → None[source]

Set format type binding to source

Args:: format_type (FormatType): format type binding to source

class qianfan.dataset.data_source.FormatType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Enum for data source format type

Csv = 'csv'

Json = 'json'

Jsonl = 'jsonl'

Text = 'txt'
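Because each member's value matches a common file extension, a format type can be looked up from a file path. The sketch below is only an illustration: it redefines a local FormatType stand-in with the values listed above, and guess_format_type is a hypothetical helper, not a function provided by the SDK.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class FormatType(Enum):
    """Local stand-in mirroring the documented enum values."""

    Csv = "csv"
    Json = "json"
    Jsonl = "jsonl"
    Text = "txt"


def guess_format_type(path: str) -> Optional[FormatType]:
    """Map a file extension to a FormatType, or None if it is unknown."""
    suffix = Path(path).suffix.lstrip(".").lower()
    try:
        # Enum lookup by value raises ValueError for unknown extensions.
        return FormatType(suffix)
    except ValueError:
        return None


guessed = guess_format_type("data/train.jsonl")
```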

class qianfan.dataset.data_source.QianfanDataSource(*, id: int, group_id: int, name: str, set_type: DataSetType, project_type: DataProjectType, template_type: DataTemplateType, version: int, storage_type: DataStorageType, storage_id: str, storage_path: str, storage_raw_path: Optional[str] = None, storage_name: str, storage_region: Optional[str] = None, info: Dict[str, Any] = {}, download_when_init: bool = False, data_format_type: FormatType, ak: Optional[str] = None, sk: Optional[str] = None)[source]

Bases: DataSource, BaseModel

Qianfan data source

async afetch(**kwargs: Any) → str[source]

Asynchronously read data from qianfan or local cache. Not currently available.

Args:: **kwargs (Any): Arbitrary keyword arguments.
Returns:: str: A string containing the data.

ak: Optional[str]

async asave(data: str, is_annotated: bool = False, **kwargs: Any) → bool[source]

Asynchronously write data to qianfan. Currently only writing to user BOS storage is supported.

Not currently available.

Args:
data (str): data to be uploaded
is_annotated (bool): whether the data has been annotated
**kwargs (Any): optional arguments

Returns:: bool: whether the data was uploaded successfully

classmethod create_bare_dataset(name: str, template_type: DataTemplateType, storage_type: DataStorageType = DataStorageType.PublicBos, storage_id: Optional[str] = None, storage_path: Optional[str] = None, addition_info: Optional[Dict[str, Any]] = None, ak: Optional[str] = None, sk: Optional[str] = None, **kwargs: Any) → QianfanDataSource[source]

Create a bare (empty) dataset on qianfan as data source

Args:

name (str): name of the dataset to create
template_type (DataTemplateType): template type applied to the dataset
storage_type (Optional[DataStorageType]): data storage type used to store your data, defaults to PublicBos
storage_id (Optional[str]): private BOS bucket name, needed when storage_type is PrivateBos, defaults to None
storage_path (Optional[str]): private BOS file path, needed when storage_type is PrivateBos, defaults to None
addition_info (Optional[Dict[str, Any]]): additional info you want to attach, defaults to None
ak (Optional[str]): console ak related to your dataset and BOS, defaults to None
sk (Optional[str]): console sk related to your dataset and BOS, defaults to None
kwargs (Any): other arguments

Returns:: QianfanDataSource: a data source representing your dataset on Qianfan

data_format_type: FormatType

download_when_init: bool

fetch(**kwargs: Any) → str[source]

Read data from qianfan or local cache.

Args:: **kwargs (Any): Arbitrary keyword arguments.
Returns:: str: A string containing the data.

format_type() → FormatType[source]

Get format type binding to qianfan data source

Returns:: FormatType: format type binding to qianfan data source

classmethod get_existed_dataset(dataset_id: int, is_download_to_local: bool = True, ak: Optional[str] = None, sk: Optional[str] = None, **kwargs: Any) → QianfanDataSource[source]

Load a dataset from qianfan as data source

Args:

dataset_id (int): dataset id on Qianfan, shown as “数据集版本 ID” (dataset version ID)
is_download_to_local (bool): whether to download the dataset file when initializing the object, defaults to True
ak (Optional[str]): console ak related to your dataset and BOS, defaults to None
sk (Optional[str]): console sk related to your dataset and BOS, defaults to None
kwargs (Any): other arguments

Returns:

QianfanDataSource: a data source representing your dataset on Qianfan

group_id: int

id: int

info: Dict[str, Any]

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'ak': FieldInfo(annotation=Union[str, NoneType], required=False), 'data_format_type': FieldInfo(annotation=FormatType, required=True), 'download_when_init': FieldInfo(annotation=bool, required=False, default=False), 'group_id': FieldInfo(annotation=int, required=True), 'id': FieldInfo(annotation=int, required=True), 'info': FieldInfo(annotation=Dict[str, Any], required=False, default={}), 'name': FieldInfo(annotation=str, required=True), 'project_type': FieldInfo(annotation=DataProjectType, required=True), 'set_type': FieldInfo(annotation=DataSetType, required=True), 'sk': FieldInfo(annotation=Union[str, NoneType], required=False), 'storage_id': FieldInfo(annotation=str, required=True), 'storage_name': FieldInfo(annotation=str, required=True), 'storage_path': FieldInfo(annotation=str, required=True), 'storage_raw_path': FieldInfo(annotation=Union[str, NoneType], required=False), 'storage_region': FieldInfo(annotation=Union[str, NoneType], required=False), 'storage_type': FieldInfo(annotation=DataStorageType, required=True), 'template_type': FieldInfo(annotation=DataTemplateType, required=True), 'version': FieldInfo(annotation=int, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

name: str

project_type: DataProjectType

release_dataset() → bool[source]

Release the dataset

Returns:: bool: Whether releasing succeeded

save(data: str, is_annotated: bool = False, does_release: bool = False, **kwargs: Any) → bool[source]

Write data to qianfan. Currently only writing to user BOS storage is supported.

Args:
data (str): data to be uploaded
is_annotated (bool): whether the data has been annotated, defaults to False
does_release (bool): whether to release the dataset after saving successfully, defaults to False
**kwargs (Any): optional arguments

Returns:: bool: whether the data was uploaded successfully

set_format_type(format_type: FormatType) → None[source]

Set format type binding to qianfan data source. Not available.

TextOnly -> Jsonl, MultiModel -> Json

set_type: DataSetType

sk: Optional[str]

storage_id: str

storage_name: str

storage_path: str

storage_raw_path: Optional[str]

storage_region: Optional[str]

storage_type: DataStorageType

template_type: DataTemplateType

version: int

qianfan.dataset.dataset module

dataset core concept, a wrapper around data processing, data transmission and data validation

class qianfan.dataset.dataset.Dataset(*, inner_table: Table, inner_data_source_cache: Optional[DataSource] = None, inner_schema_cache: Optional[Schema] = None)[source]

Bases: Table

append(elem: Any) → Self[source]

append an element to dataset

Args:: elem (Union[List[Dict], Tuple[Dict], Dict]): elements added to dataset
Returns:: Self: Dataset itself

col_append(elem: Any) → Self[source]

append a column to dataset

Args:

elem (Dict[str, List]): dict containing the element added to dataset: must have column name “name” and column data list “data”

Returns:

Self: Dataset itself

col_delete(index: Union[int, str]) → Self[source]

delete a column from dataset

Args:: index (str): column name to delete
Returns:: Self: Dataset itself

col_filter(op: Callable[[Any], bool]) → Self[source]

filter on dataset’s column

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: Dataset itself

col_list(by: Optional[Union[slice, int, str, List[int], Tuple[int], List[str], Tuple[str]]] = None) → Any[source]

get column(s) from dataset

Args:

by (Optional[Union[int, str, Sequence[int], Sequence[str]]]):: index or indices for columns, default to None, in which case return a python list of dataset column

Returns:

Any: dataset column list

col_map(op: Callable[[Any], Any]) → Self[source]

map on dataset’s column

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: Dataset itself

col_names() → List[str][source]

get column name list

Returns:: List[str]: column name list

classmethod create_from_pyarrow_table(table: Table, schema: Optional[Schema] = None) → Dataset[source]

create a dataset from pyarrow table

Args:
table (pyarrow.Table): pyarrow table object used to create dataset
schema (Optional[Schema]): schema used to validate before exporting data, defaults to None

Returns:: Dataset: a dataset instance

classmethod create_from_pyobj(data: Union[List[Dict[str, Any]], Dict[str, List]], schema: Optional[Schema] = None) → Dataset[source]

create a dataset from python dict or list

Args:

data (Union[List[Dict[str, Any]], Dict[str, List]]):: python object used to create dataset
schema (Optional[Schema]):: schema used to validate before exporting data, default to None

Returns:

Dataset: a dataset instance
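The data parameter accepts either a row-oriented list of dicts or a column-oriented dict of lists. The sketch below shows how the two shapes relate, in plain Python without the SDK. rows_to_columns and columns_to_rows are hypothetical helper names, and the sketch assumes every row carries the same keys.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn [{'a': 1}, {'a': 2}] into {'a': [1, 2]} (all rows share keys)."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn {'a': [1, 2]} back into [{'a': 1}, {'a': 2}]."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"prompt": "hi", "response": "hello"},
    {"prompt": "bye", "response": "goodbye"},
]
columns = rows_to_columns(rows)
```

Either shape could then be passed as data to create_from_pyobj, according to the signature above.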

delete(index: Union[int, str]) → Self[source]

delete an element from dataset

Args:: index (Union[int, str]): element index to delete
Returns:: Self: Dataset itself

filter(op: Callable[[Any], bool]) → Self[source]

filter on dataset

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: Dataset itself

inner_data_source_cache: Optional[DataSource]

inner_schema_cache: Optional[Schema]

inner_table: PyarrowTable

list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None, **kwargs: Any) → Any[source]

get element(s) from dataset

Args:

by (Optional[Union[slice, int, Sequence[int]]]):: index or indices for elements, default to None, in which case return a python list of dataset row

Returns:

Any: dataset row list

classmethod load(source: Optional[DataSource] = None, data_file: Optional[str] = None, qianfan_dataset_id: Optional[int] = None, huggingface_name: Optional[str] = None, schema: Optional[Schema] = None, **kwargs: Any) → Dataset[source]

Read data from the source or create a source from the parameters and create a Table instance. If a schema is specified, perform validation after importing.

Args:

source (Optional[DataSource]):: where the dataset is loaded from, defaults to None, in which case a data source is created inside the dataset from the parameters below
data_file (Optional[str]):: dataset local file path, defaults to None
qianfan_dataset_id (Optional[int]):: qianfan dataset ID, defaults to None
huggingface_name (Optional[str]):: Hugging Face dataset name, not currently available
schema (Optional[Schema]):: schema used to validate loaded data, defaults to None

kwargs (Any): optional arguments

Returns:

Dataset: a dataset instance

map(op: Callable[[Any], Any]) → Self[source]

map on dataset

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: Dataset itself

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'inner_data_source_cache': FieldInfo(annotation=Union[DataSource, NoneType], required=False), 'inner_schema_cache': FieldInfo(annotation=Union[Schema, NoneType], required=False), 'inner_table': FieldInfo(annotation=Table, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

online_data_process(operators: List[QianfanOperator]) → Dict[str, Any][source]

Create an online ETL task on qianfan. Not currently available.

Args:

operators (List[QianfanOperator]): operators applied to ETL task

Returns:

Dict[str, Any]: ETL task info, containing 3 fields:

is_succeeded (bool): whether the ETL task succeeded
etl_task_id (Optional[int]): ETL task id, only exists when the ETL task is created successfully
new_dataset_id (Optional[int]): id of the dataset which stores data after ETL, only exists when the ETL task succeeded
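Since two of the three fields may be absent, a caller would probably read the result defensively. The sketch below works on a plain dict with the documented keys. summarize_etl_result is a hypothetical helper, and the sample dicts are made-up values rather than real API output.

```python
from typing import Any, Dict


def summarize_etl_result(result: Dict[str, Any]) -> str:
    """Describe an ETL result dict using the three documented keys."""
    # etl_task_id only exists when the task was created successfully.
    task_id = result.get("etl_task_id")
    if task_id is None:
        return "ETL task was not created"
    if not result.get("is_succeeded", False):
        return f"ETL task {task_id} did not succeed"
    # new_dataset_id only exists when the task succeeded.
    return f"ETL task {task_id} succeeded, new dataset id: {result.get('new_dataset_id')}"


# Made-up sample values for illustration only.
ok_summary = summarize_etl_result(
    {"is_succeeded": True, "etl_task_id": 7, "new_dataset_id": 42}
)
```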

save(destination: Optional[DataSource] = None, data_file: Optional[str] = None, qianfan_dataset_id: Optional[int] = None, qianfan_dataset_create_args: Optional[Dict[str, Any]] = None, huggingface_name: Optional[str] = None, schema: Optional[Schema] = None, **kwargs: Any) → bool[source]

Write data to the destination. If a schema has been passed, validate the data before exporting.

Args:

destination (Optional[DataSource]):: data source the dataset is exported to, defaults to None, in which case a data source is created inside the dataset from the parameters below
data_file (Optional[str]):: dataset local file path, defaults to None
qianfan_dataset_id (Optional[int]):: qianfan dataset ID, defaults to None
qianfan_dataset_create_args (Optional[Dict[str, Any]]):: arguments for creating a bare dataset on qianfan, defaults to None
huggingface_name (Optional[str]):: Hugging Face dataset name, not currently available
schema (Optional[Schema]):: schema used to validate before exporting data, defaults to None

kwargs (Any): optional arguments

Returns:

bool: whether saving succeeded

qianfan.dataset.process_interface module

interface file

class qianfan.dataset.process_interface.Appendable[source]

Bases: ABC

make object ‘appendable’

abstract append(elem: Any) → Self[source]

append an element to an Appendable object

Args:: elem (Any): element to append
Returns:: Self: a new Appendable object after appending

class qianfan.dataset.process_interface.Listable[source]

Bases: ABC

make object ‘listable’

abstract list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) → Any[source]

get an element from object

Args:

by (Optional[Union[slice, int, str, Sequence[int], Sequence[str]]]):: index used to get data or data list, defaults to None

Returns:

Any: elements

class qianfan.dataset.process_interface.Processable[source]

Bases: ABC

make object ‘processable’

abstract delete(index: Union[int, str]) → Self[source]

delete an element from Processable object

Args:: index (Union[int, str]): element index to delete
Returns:: Self: a new Processable object after delete

abstract filter(op: Callable[[Any], bool]) → Self[source]

filter on a Processable object

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: a new Processable object after filtering

abstract map(op: Callable[[Any], Any]) → Self[source]

map on a Processable object

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: a new Processable object after mapping
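The three Processable methods each return an object of the same type, so calls can be chained. A minimal list-backed sketch of that contract follows. RowList is a hypothetical stand-in that does not inherit from the real Processable class, and it supports integer indices only in delete.

```python
from typing import Any, Callable, Iterable, List


class RowList:
    """List-backed stand-in that follows the map/filter/delete contract."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self.rows: List[Any] = list(rows)

    def map(self, op: Callable[[Any], Any]) -> "RowList":
        # Apply op to every element and wrap the results in a new object.
        return RowList(op(row) for row in self.rows)

    def filter(self, op: Callable[[Any], bool]) -> "RowList":
        # Keep only the elements for which op returns True.
        return RowList(row for row in self.rows if op(row))

    def delete(self, index: int) -> "RowList":
        # Drop the element at the given position (integer index only here).
        return RowList(row for i, row in enumerate(self.rows) if i != index)


result = (
    RowList([1, 2, 3, 4, 5])
    .map(lambda x: x * 10)        # 10, 20, 30, 40, 50
    .filter(lambda x: x > 20)     # 30, 40, 50
    .delete(0)                    # 40, 50
)
```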

qianfan.dataset.schema module

schema for validation. Currently only qianfan schemas are provided.

class qianfan.dataset.schema.QianfanGenericText[source]

Bases: QianfanSchema

validator for generic text dataset

validate(table: Table) → bool[source]

validate a table

Args:: table (Table): table to be validated
Returns:: bool: whether the table is valid

class qianfan.dataset.schema.QianfanNonSortedConversation[source]

Bases: QianfanSchema

validator for non-sorted, conversational dataset

validate(table: Table) → bool[source]

validate a table

Args:: table (Table): table to be validated
Returns:: bool: whether the table is valid

class qianfan.dataset.schema.QianfanQuerySet[source]

Bases: QianfanSchema

validator for query set dataset

validate(table: Table) → bool[source]

validate a table

Args:: table (Table): table to be validated
Returns:: bool: whether the table is valid

class qianfan.dataset.schema.QianfanSchema[source]

Bases: Schema

validate(table: Table) → bool[source]

validate a dataset.Table object. Currently only fields and types are checked, not the content of the table.

Args:: table (Table): table to be validated
Returns:: bool: whether the table is valid

class qianfan.dataset.schema.QianfanSortedConversation[source]

Bases: QianfanSchema

validator for sorted, conversational dataset

validate(table: Table) → bool[source]

validate a table

Args:: table (Table): table to be validated
Returns:: bool: whether the table is valid

class qianfan.dataset.schema.QianfanText2Image[source]

Bases: QianfanSchema

validator for text to image dataset

validate(table: Table) → bool[source]

validate a table

Args:: table (Table): table to be validated
Returns:: bool: whether the table is valid

class qianfan.dataset.schema.Schema[source]

Bases: ABC

abstract validate(table: Table) → bool[source]

validate a dataset.Table object. Currently only fields and types are checked, not the content of the table.

Args:: table (Table): table to be validated
Returns:: bool: whether the table is valid
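A field-and-type check of this kind can be sketched without the SDK. In the example below the table is represented as a plain dict of column lists rather than a dataset.Table, and RequiredColumnsSchema is a hypothetical class that does not inherit from the real Schema. It only illustrates the shape of a validate method that returns a bool.

```python
from typing import Any, Dict, List


class RequiredColumnsSchema:
    """Checks that required columns exist and hold values of the given type."""

    def __init__(self, required: Dict[str, type]) -> None:
        self.required = required

    def validate(self, table: Dict[str, List[Any]]) -> bool:
        for name, expected_type in self.required.items():
            # Field check: the column must be present.
            if name not in table:
                return False
            # Type check: every value in the column must match.
            if not all(isinstance(value, expected_type) for value in table[name]):
                return False
        return True


schema = RequiredColumnsSchema({"prompt": str, "response": str})
is_valid = schema.validate({"prompt": ["hi"], "response": ["hello"]})
```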

qianfan.dataset.table module

wrapper for pyarrow.Table

class qianfan.dataset.table.Table(*, inner_table: Table)[source]

Bases: BaseModel, Appendable, Listable, Processable

in-memory dataset representation wrapping pyarrow.Table, implementing the interfaces in process_interface.py

class Config[source]

Bases: object

arbitrary_types_allowed = True

append(elem: Any) → Self[source]

append an element to pyarrow table

Args:: elem (Union[List[Dict], Tuple[Dict], Dict]): elements added to pyarrow table
Returns:: Self: Table itself

col_append(elem: Any) → Self[source]

append a column to pyarrow table

Args:

elem (Dict[str, List]): dict containing the element added to pyarrow table: must have column name “name” and column data list “data”

Returns:

Self: Table itself

col_delete(index: Union[int, str]) → Self[source]

delete a column from pyarrow table

Args:: index (str): column name to delete
Returns:: Self: Table itself

col_filter(op: Callable[[Any], bool]) → Self[source]

filter on pyarrow table’s column

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: Table itself

col_list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) → Any[source]

get column(s) from pyarrow table

Args:

by (Optional[Union[int, str, Sequence[int], Sequence[str]]]):: index or indices for columns, default to None, in which case return a python list of pyarrow table column

Returns:

Any: pyarrow table column list

col_map(op: Callable[[Any], Any]) → Self[source]

map on pyarrow table’s column

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: Table itself

col_names() → List[str][source]

get column name list

Returns:: List[str]: column name list

delete(index: Union[int, str]) → Self[source]

delete an element from pyarrow table

Args:: index (Union[int, str]): element index to delete
Returns:: Self: Table itself

filter(op: Callable[[Any], bool]) → Self[source]

filter on pyarrow table’s row

Args:: op (Callable[[Any], bool]): handler used to filter
Returns:: Self: Table itself

get_column_count() → int[source]

get pyarrow table column count

Returns:: int: column count

get_row_count() → int[source]

get pyarrow table row count

Returns:: int: row count

inner_table: Table

list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) → Any[source]

get element(s) from pyarrow table

Args:

by (Optional[Union[slice, int, Sequence[int]]]):: index or indices for elements, default to None, in which case return a python list of pyarrow table row

Returns:

Any: pyarrow table row list
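The by argument selects rows by position. The sketch below mimics that selection on a plain Python list of row dicts. select_rows is a hypothetical helper, and returning a single row for an integer index is an assumption, since the exact return shape is not stated above.

```python
from typing import Any, Dict, List, Optional, Sequence, Union

Row = Dict[str, Any]


def select_rows(
    rows: List[Row],
    by: Optional[Union[slice, int, Sequence[int]]] = None,
) -> Any:
    """Select rows by None (all), an int, a slice, or a sequence of ints."""
    if by is None:
        return list(rows)
    if isinstance(by, (int, slice)):
        # Assumption: an int yields one row, a slice yields a list of rows.
        return rows[by]
    return [rows[i] for i in by]


rows = [{"id": 0}, {"id": 1}, {"id": 2}]
picked = select_rows(rows, [2, 0])
```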

map(op: Callable[[Any], Any]) → Self[source]

map on pyarrow table’s row

Args:: op (Callable[[Any], Any]): handler used to map
Returns:: Self: Table itself

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'inner_table': FieldInfo(annotation=Table, required=True)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

to_pydict() → Dict[source]

convert a pyarrow table to dict

Returns:: Dict: a dict

to_pylist() → List[source]

convert a pyarrow table to list

Returns:: List: a list