qianfan.dataset package
Library that helps developers interact with datasets
- class qianfan.dataset.DataExportDestinationType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: int, Enum
- PlatformBos: int = 0
Export to the platform BOS
- PrivateBos: int = 1
Export to a private BOS
- class qianfan.dataset.DataProjectType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: int, Enum
Project type used by Qianfan Data
- Conversation: int = 20
Conversation
- GenericText: int = 401
Generic text
- QuerySet: int = 402
Query set
- Text2Image: int = 705
Text-to-image
- class qianfan.dataset.DataSetType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: int, Enum
- MultiModel: int = 7
Multimodal dataset
- TextOnly: int = 4
Text-only dataset
- class qianfan.dataset.DataSource[source]
Bases: ABC
Basic data source class
- abstract async afetch(**kwargs: Any) Union[str, List[str]][source]
Asynchronously fetch data from source
- Args:
**kwargs (Any): optional arguments
- Returns:
Union[str, List[str]]: content retrieved from data source
- abstract async asave(data: str, **kwargs: Any) bool[source]
Asynchronously export the data to the data source and return whether the export succeeded
- Args:
data (str): data to be saved
**kwargs (Any): optional arguments
- Returns:
bool: whether saving succeeded
- abstract fetch(**kwargs: Any) Union[str, List[str]][source]
Fetch data from source
- Args:
**kwargs (Any): optional arguments
- Returns:
Union[str, List[str]]: content retrieved from data source
- abstract format_type() FormatType[source]
Get format type binding to source
- Returns:
FormatType: format type binding to source
- abstract save(data: str, **kwargs: Any) bool[source]
Export the data to the data source and return whether the export succeeded
- Args:
data (str): data to be saved
**kwargs (Any): optional arguments
- Returns:
bool: whether saving succeeded
- abstract set_format_type(format_type: FormatType) None[source]
Set format type binding to source
- Args:
format_type (FormatType): format type binding to source
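The abstract interface above can be exercised with a small in-memory implementation. The sketch below deliberately does not import the SDK: `SketchDataSource` is a stand-in ABC that only mirrors the documented method names, `InMemoryDataSource` is hypothetical, and a plain string stands in for `FormatType`.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, List, Union


class SketchDataSource(ABC):
    """Stand-in mirroring the documented DataSource methods (not the SDK class)."""

    @abstractmethod
    def fetch(self, **kwargs: Any) -> Union[str, List[str]]: ...

    @abstractmethod
    def save(self, data: str, **kwargs: Any) -> bool: ...

    @abstractmethod
    async def afetch(self, **kwargs: Any) -> Union[str, List[str]]: ...

    @abstractmethod
    async def asave(self, data: str, **kwargs: Any) -> bool: ...

    @abstractmethod
    def format_type(self) -> str: ...

    @abstractmethod
    def set_format_type(self, format_type: str) -> None: ...


class InMemoryDataSource(SketchDataSource):
    """Hypothetical source that keeps the saved content in a string attribute."""

    def __init__(self, format_type: str = "jsonl") -> None:
        self._content = ""
        self._format_type = format_type

    def fetch(self, **kwargs: Any) -> Union[str, List[str]]:
        return self._content

    def save(self, data: str, **kwargs: Any) -> bool:
        self._content = data
        return True

    async def afetch(self, **kwargs: Any) -> Union[str, List[str]]:
        # Delegates to the synchronous version; nothing here actually blocks.
        return self.fetch(**kwargs)

    async def asave(self, data: str, **kwargs: Any) -> bool:
        return self.save(data, **kwargs)

    def format_type(self) -> str:
        return self._format_type

    def set_format_type(self, format_type: str) -> None:
        self._format_type = format_type


source = InMemoryDataSource()
saved = source.save('{"prompt": "hi"}')
fetched = asyncio.run(source.afetch())
```

A real subclass would derive from `qianfan.dataset.DataSource` and return a `FormatType` member instead of a string.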
- class qianfan.dataset.DataSourceType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: int, Enum
- PrivateBos: int = 1
Private BOS
Share link to a zip archive
- class qianfan.dataset.DataStorageType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: str, Enum
- PrivateBos: str = 'usrBos'
User's private BOS
- PublicBos: str = 'sysBos'
Platform public BOS
- class qianfan.dataset.DataTemplateType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases: int, Enum
Template type used by Qianfan Data
- GenericText: int = 40100
Generic text
- NonSortedConversation: int = 2000
Conversation without ranking
- QuerySet: int = 40200
Query set
- SortedConversation: int = 2001
Conversation with ranking
- Text2Image: int = 70500
Text-to-image
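Because these enums mix in `int`, members can be looked up by value and compared with plain integers. The sketch below re-declares the listed values locally (the `...Sketch` classes are stand-ins, not the SDK enums). The `project_of` helper is hypothetical and rests only on an observation about the numbers listed here, namely that each template value divided by 100 equals a project value; the SDK may relate the two differently.

```python
from enum import Enum


class ProjectTypeSketch(int, Enum):
    """Values copied from the DataProjectType listing."""
    Conversation = 20
    GenericText = 401
    QuerySet = 402
    Text2Image = 705


class TemplateTypeSketch(int, Enum):
    """Values copied from the DataTemplateType listing."""
    NonSortedConversation = 2000
    SortedConversation = 2001
    GenericText = 40100
    QuerySet = 40200
    Text2Image = 70500


def project_of(template: TemplateTypeSketch) -> ProjectTypeSketch:
    # Observation from the listed numbers only, not documented behavior.
    return ProjectTypeSketch(template.value // 100)


# Map every template name to the name of its inferred project type.
pairs = {t.name: project_of(t).name for t in TemplateTypeSketch}
```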
- class qianfan.dataset.Dataset(inner_table: Table, inner_data_source_cache: Optional[DataSource] = None, inner_schema_cache: Optional[Schema] = None, input_columns: Optional[List[str]] = None, reference_column: Optional[str] = None, **kwargs: Any)[source]
Bases: Table
- add_default_group_column() Self[source]
add a “_group” column to the Dataset; the values in the “_group” column increase sequentially
- Returns:
Self: Dataset itself
- append(elem: Any, add_new_group: bool = False, is_grouped: bool = True) Self[source]
append element(s) to dataset
- Args:
- elem (Union[List[List[Dict]], List[Dict], Tuple[Dict], Dict]):
Elements added to dataset
- add_new_group (bool):
Whether elem has a new group id. Only used when dataset is grouped.
- is_grouped (bool):
Whether the elements in elem belong to the same group. Only used when the dataset is grouped, elem is a Sequence, and add_new_group is True. Defaults to True, in which case all elements share one group. If False, each element gets a sequentially increasing group id, continuing from the last available group id.
- Returns:
Self: Dataset itself
- async atest_using_llm(model_id: Optional[int] = None, model_version_id: Optional[int] = None, service_model: Optional[str] = None, service_endpoint: Optional[str] = None, is_chat_service: bool = True, does_show_latency: bool = True, **kwargs: Any) Dataset[source]
Use the arguments to initialize an LLM instance and asynchronously get its output on the current dataset. Set only the model arguments or only the service arguments to instantiate it.
- Args:
- model_id (Optional[int]):
id of your own model, default to None
- model_version_id (Optional[int]):
version id of your own model, default to None
- service_model (Optional[str]):
name of model you want to use as service, default to None
- service_endpoint (Optional[str]):
endpoint of service, default to None
- is_chat_service (bool):
the service type of service, default to True. Service will be Completion if False
- does_show_latency (bool):
whether the result dataset contains latency info columns when a Service is the evaluated object. Depending on the request mode (stream or non-stream), it will contain request_complete_latency or the (first_token_latency, request_complete_latency) combination. Default to True
- **kwargs (Any):
optional argument dict
- Returns:
Dataset: a dataset containing inputs, reference outputs and LLM outputs
- col_append(elem: Any) Self[source]
append a column to dataset
- Args:
- elem (Dict[str, List]): a dict describing the column to add, which
must have the column name under “name” and the column data list under “data”
- Returns:
Self: Dataset itself
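The `elem` shape expected by `col_append` can be illustrated on a plain dict of lists. This is a minimal sketch, not the SDK's implementation: `col_append_sketch` is a hypothetical helper, and the duplicate-name and length checks are assumptions about what a reasonable implementation would enforce.

```python
from typing import Any, Dict, List

Columns = Dict[str, List[Any]]


def col_append_sketch(table: Columns, elem: Dict[str, Any]) -> Columns:
    """Append a column described as {"name": ..., "data": [...]}."""
    name, data = elem["name"], elem["data"]
    if name in table:
        raise ValueError(f"column {name!r} already exists")
    if table and len(data) != len(next(iter(table.values()))):
        raise ValueError("new column length must match the row count")
    table[name] = list(data)
    return table


table: Columns = {"prompt": ["1+1=?", "2+2=?"]}
col_append_sketch(table, {"name": "response", "data": ["2", "4"]})
```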
- col_delete(index: Union[int, str]) Self[source]
delete a column from dataset
- Args:
index (str): column name to delete
- Returns:
Self: Dataset itself
- col_filter(op: Callable[[Any], bool]) Self[source]
filter on dataset’s column
- Args:
op (Callable[[Any], bool]): handler used to filter
- Returns:
Self: Dataset itself
- col_insert(elem: Any, index: Any) Self[source]
insert a column into dataset
- Args:
- elem (Dict[str, List]): a dict describing the column to add, which
must have the column name under “name” and the column data list under “data”
- index (int): where to insert the new column
- Returns:
Self: Dataset itself
- col_list(by: Optional[Union[slice, int, str, List[int], Tuple[int], List[str], Tuple[str]]] = None) Any[source]
get column(s) from dataset
- Args:
- by (Optional[Union[int, str, Sequence[int], Sequence[str]]]):
index or indices for columns, default to None, in which case return a python list of dataset column
- Returns:
Any: dataset column list
- col_map(op: Callable[[Any], Any]) Self[source]
map on dataset’s column
- Args:
op (Callable[[Any], Any]): handler used to map
- Returns:
Self: Dataset itself
- col_renames(new_names: List[str]) Self[source]
rename all dataset columns
- Args:
new_names (List[str]): All new names for columns
- Returns:
Self: A brand-new Dataset with new name
- classmethod create_from_pyarrow_table(table: Table, schema: Optional[Schema] = None, **kwargs: Any) Dataset[source]
create a dataset from pyarrow table
- Args:
- table (pyarrow):
pyarrow table object used to create dataset.
- schema (Optional[Schema]):
schema used to validate before exporting data, default to None
- **kwargs (Any):
optional arguments
- Returns:
Dataset: a dataset instance
- classmethod create_from_pyobj(data: Union[List[Dict[str, Any]], Dict[str, List]], schema: Optional[Schema] = None, **kwargs: Any) Dataset[source]
create a dataset from python dict or list
- Args:
- data (Union[List[Dict[str, Any]], Dict[str, List]]):
python object used to create dataset.
- schema (Optional[Schema]):
schema used to validate before exporting data, default to None
- **kwargs (Any):
optional arguments
- Returns:
Dataset: a dataset instance
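The `data` parameter accepts two shapes: a row-oriented `List[Dict[str, Any]]` or a column-oriented `Dict[str, List]`. The sketch below shows how the two relate using plain Python. Both helper functions are hypothetical, they assume every row has the same keys and every column the same length, and they say nothing about how the SDK converts the data internally.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn row-oriented records into column-oriented lists."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Inverse conversion; assumes every column has the same length."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"prompt": "a", "response": "b"},
    {"prompt": "c", "response": "d"},
]
columns = rows_to_columns(rows)
round_trip = columns_to_rows(columns)
```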
- delete(index: Union[int, str]) Self[source]
delete an element from dataset
- Args:
index (Union[int, str]): element index to delete
- Returns:
Self: Dataset itself
- delete_group_column() Self[source]
remove “_group” column from Dataset
- Returns:
Self: Dataset itself
- filter(op: Callable[[Any], bool]) Self[source]
filter on dataset
- Args:
op (Callable[[Any], bool]): handler used to filter
- Returns:
Self: Dataset itself
- property get_input_data: Dict[str, List[Any]]
get input columns data in dataset
- Returns:
- Dict[str, List[Any]]: a dict
which indicates the “column name-column data” pairs
- property get_reference_data: List[Any]
get reference data in dataset
- Returns:
List[Any]: list of output data column
- insert(elem: Any, index: Any, group_id: int = -1, add_new_group: bool = False, is_grouped: bool = True) Self[source]
insert element(s) to dataset
- Args:
- elem (Union[List[List[Dict]], List[Dict], Tuple[Dict], Dict]):
Elements added to dataset
- index (int):
where to insert element(s)
- group_id (int):
which group id to apply to the new element(s). Defaults to -1, which means the group id is inferred automatically from the table.
- add_new_group (bool):
Whether elem has a new group id. Only used when dataset is grouped and group_id is -1
- is_grouped (bool):
Whether the elements in elem belong to the same group. Only used when the dataset is grouped, elem is a Sequence, and add_new_group is True. Defaults to True, in which case all elements share one group. If False, each element gets a sequentially increasing group id, continuing from the last available group id.
- Returns:
Self: Dataset itself
- is_dataset_generic_text() bool[source]
tell whether current dataset is generic text dataset
- Returns:
bool: whether current dataset is generic text dataset
- is_dataset_located_in_qianfan() bool[source]
tell whether current dataset is cloud-based dataset
- Returns:
bool: whether current dataset is cloud-based dataset
- list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None, **kwargs: Any) Any[source]
get element(s) from dataset
- Args:
- by (Optional[Union[slice, int, Sequence[int]]]):
index or indices for elements, default to None, in which case return a python list of dataset row
- Returns:
Any: dataset row list
- classmethod load(source: Optional[DataSource] = None, data_file: Optional[str] = None, qianfan_dataset_id: Optional[int] = None, bos_load_args: Optional[Dict[str, Any]] = None, huggingface_dataset: Optional[Any] = None, schema: Optional[Schema] = None, organize_data_as_group: bool = False, **kwargs: Any) Dataset[source]
Read data from the source or create a source from the parameters and create a Table instance. If a schema is specified, perform validation after importing.
- Args:
- source (Optional[DataSource]): where the dataset is loaded from,
defaults to None, in which case a data source is created inside the dataset from the parameters below
- data_file (Optional[str]):
dataset local file path, default to None
- qianfan_dataset_id (Optional[int]):
qianfan dataset ID, default to None
- bos_load_args: (Optional[Dict[str, Any]]):
create a dataset and import initial dataset content from args
- huggingface_dataset (Optional[Dict[str, Any], Any]):
Huggingface dataset object, only support DatasetDict and Dataset of Huggingface datasets.
- schema (Optional[Schema]):
schema used to validate loaded data, default to None
- organize_data_as_group (bool):
only available when the data source’s format is FormatType.Jsonl. Indicates whether to organize the data within the dataset in group format. Defaults to False; when True, the default format is a group-based 2D structure.
- **kwargs (Any):
optional arguments
- Returns:
Dataset: a dataset instance
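To make the "group-based 2D structure" mentioned for `organize_data_as_group` concrete, here is a sketch using only the standard library. It assumes, without confirmation from this section, that each JSONL line holds a JSON array of records forming one group, and that group ids start at 0. The `_group` column name is the one used by `add_default_group_column`.

```python
import json
from typing import Any, Dict, List

jsonl_text = "\n".join([
    '[{"prompt": "hi", "response": "hello"}]',
    '[{"prompt": "1+1=?", "response": "2"}, {"prompt": "and 2+2?", "response": "4"}]',
])

# Group-based 2D structure: outer list = groups, inner list = records.
groups: List[List[Dict[str, Any]]] = [
    json.loads(line) for line in jsonl_text.splitlines() if line.strip()
]

# Flat alternative: one row per record, tagged with a "_group" id.
flat = [
    {**record, "_group": group_id}
    for group_id, group in enumerate(groups)
    for record in group
]
```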
- map(op: Callable[[Any], Any]) Self[source]
map on dataset
- Args:
op (Callable[[Any], Any]): handler used to map
- Returns:
Self: Dataset itself
- online_data_process(operators: List[QianfanOperator]) Dict[str, Any][source]
create an online ETL task on qianfan
- Args:
operators (List[QianfanOperator]): operators applied to ETL task
- Returns:
- Dict[str, Any]: ETL task info, containing 3 fields:
- is_succeeded (bool): whether the ETL task succeeded
- etl_task_id (Optional[int]): ETL task id, only
exists when the ETL task is created successfully
- new_dataset_id (Optional[int]): id of the dataset which
stores the data after ETL, only exists when the ETL task has succeeded
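A caller has to account for the two optional fields in the returned dict. The helper below is a hypothetical sketch of that handling; the field names come from the description above, while the example values are made up for illustration.

```python
from typing import Any, Dict, Optional


def summarize_etl_result(result: Dict[str, Any]) -> str:
    """Describe the dict documented for online_data_process."""
    task_id: Optional[int] = result.get("etl_task_id")
    new_id: Optional[int] = result.get("new_dataset_id")
    if task_id is None:
        return "ETL task was not created"
    if not result.get("is_succeeded", False) or new_id is None:
        return f"ETL task {task_id} did not succeed"
    return f"ETL task {task_id} succeeded, new dataset id {new_id}"


# Hypothetical values, for illustration only.
ok = summarize_etl_result(
    {"is_succeeded": True, "etl_task_id": 12, "new_dataset_id": 345}
)
failed = summarize_etl_result({"is_succeeded": False, "etl_task_id": 12})
not_created = summarize_etl_result({"is_succeeded": False})
```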
- save(destination: Optional[DataSource] = None, data_file: Optional[str] = None, qianfan_dataset_id: Optional[int] = None, qianfan_dataset_create_args: Optional[Dict[str, Any]] = None, schema: Optional[Schema] = None, replace_source: bool = False, **kwargs: Any) bool[source]
Write data to source. If a schema has been passed, validate the data before exporting
- Args:
- destination (Optional[DataSource]):
data source the dataset is exported to, defaults to None, in which case a data source is created inside the dataset from the parameters below
- data_file (Optional[str]):
dataset local file path, default to None
- qianfan_dataset_id (Optional[int]):
qianfan dataset ID, default to None
- qianfan_dataset_create_args (Optional[Dict[str, Any]]):
create arguments for creating a bare dataset on qianfan, default to None
- schema (Optional[Schema]):
schema used to validate before exporting data, default to None
- replace_source (bool):
whether to replace the original source, default to False
- **kwargs (Any):
optional arguments
- Returns:
bool: whether saving succeeded
- start_online_data_process_task(operators: List[QianfanOperator]) int[source]
create an online ETL task on qianfan
- Args:
operators (List[QianfanOperator]): operators applied to ETL task
- Returns:
int: etl task id
- test_using_llm(model_id: Optional[int] = None, model_version_id: Optional[int] = None, service_model: Optional[str] = None, service_endpoint: Optional[str] = None, is_chat_service: bool = True, does_show_latency: bool = True, **kwargs: Any) Dataset[source]
Use the arguments to initialize an LLM instance and get its output on the current dataset. Set only the model arguments or only the service arguments to instantiate it.
- Args:
- model_id (Optional[int]):
id of your own model, default to None
- model_version_id (Optional[int]):
version id of your own model, default to None
- service_model (Optional[str]):
name of model you want to use as service, default to None
- service_endpoint (Optional[str]):
endpoint of service, default to None
- is_chat_service (bool):
the service type of service, default to True. Service will be Completion if False
- does_show_latency (bool):
whether the result dataset contains latency info columns when a Service is the evaluated object. Depending on the request mode (stream or non-stream), it will contain request_complete_latency or the (first_token_latency, request_complete_latency) combination. Default to True
- **kwargs (Any):
optional argument dict
- Returns:
Dataset: a dataset containing inputs, reference outputs and LLM outputs
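Both `test_using_llm` and `atest_using_llm` take two groups of arguments, one for a model and one for a service. The function below is a hypothetical caller-side guard, not SDK code: it assumes that exactly one group should be supplied and reports which one. Whether the SDK itself raises when both groups are set is not stated here.

```python
from typing import Optional


def evaluation_target(
    model_id: Optional[int] = None,
    model_version_id: Optional[int] = None,
    service_model: Optional[str] = None,
    service_endpoint: Optional[str] = None,
) -> str:
    """Return which argument group is set; reject none or both."""
    has_model = model_id is not None or model_version_id is not None
    has_service = service_model is not None or service_endpoint is not None
    if has_model == has_service:
        raise ValueError("set either the model arguments or the service arguments")
    return "model" if has_model else "service"


# "example-service" is a placeholder, not a real service name.
target = evaluation_target(service_model="example-service")
```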
- class qianfan.dataset.FileDataSource(*, path: str, file_format: Optional[FormatType] = None, save_as_folder: bool = False)[source]
Bases: DataSource, BaseModel
File data source
- async afetch(**kwargs: Any) Union[str, List[str]][source]
Asynchronously Read data from file. Not available currently
- Args:
**kwargs (Any): Arbitrary keyword arguments.
- Returns:
- Union[str, List[str]]:
String or list of string containing the data read from the file.
- async asave(data: Union[str, List[str]], **kwargs: Any) bool[source]
Asynchronously write data to file. Not available currently
- Args:
data (Union[str, List[str]]): data waiting to be written.
**kwargs (Any): optional arguments.
- Returns:
bool: whether the data has been written successfully
- fetch(**kwargs: Any) Union[str, List[str]][source]
Read data from file.
- Args:
**kwargs (Any): Arbitrary keyword arguments.
- Returns:
- Union[str, List[str]]:
String or list of string containing the data read from the file.
- file_format: Optional[FormatType]
- format_type() FormatType[source]
Get format type binding to source
- Returns:
FormatType: format type binding to source
- path: str
- save(data: Union[str, List[str]], **kwargs: Any) bool[source]
Write data to file.
- Args:
data (Union[str, List[str]]): data waiting to be written.
**kwargs (Any): optional arguments.
- Returns:
bool: whether the data has been written successfully
- save_as_folder: bool
- set_format_type(format_type: FormatType) None[source]
Set format type binding to source
- Args:
format_type (FormatType): format type binding to source
- class qianfan.dataset.QianfanDataSource(*, id: int, group_id: int, name: str, set_type: DataSetType, project_type: DataProjectType, template_type: DataTemplateType, version: int, storage_type: DataStorageType, storage_id: str, storage_path: str, storage_raw_path: Optional[str] = None, storage_name: str, storage_region: Optional[str] = None, info: Dict[str, Any] = {}, download_when_init: bool = False, data_format_type: FormatType, ak: Optional[str] = None, sk: Optional[str] = None)[source]
Bases: DataSource, BaseModel
Qianfan data source
- async afetch(**kwargs: Any) Union[str, List[str]][source]
Asynchronously read data from qianfan or local cache. Not available currently
- Args:
**kwargs (Any): Arbitrary keyword arguments.
- Returns:
Union[str, List[str]]: content retrieved from data source
- ak: Optional[str]
- async asave(data: str, is_annotated: bool = False, **kwargs: Any) bool[source]
Asynchronously write data to qianfan. Currently only writing to user BOS storage is supported.
Not available currently
- Args:
data (str): data waiting to be uploaded.
is_annotated (bool): whether the data has been annotated
**kwargs (Any): optional arguments.
- Returns:
bool: whether the data has been uploaded successfully
- classmethod create_bare_dataset(name: str, template_type: DataTemplateType, storage_type: DataStorageType = DataStorageType.PublicBos, storage_id: Optional[str] = None, storage_path: Optional[str] = None, addition_info: Optional[Dict[str, Any]] = None, ak: Optional[str] = None, sk: Optional[str] = None, **kwargs: Any) QianfanDataSource[source]
create a bare (empty) dataset on qianfan as data source
- Args:
- name (str): dataset name you want
- template_type (DataTemplateType): template type applied to the dataset
- storage_type (Optional[DataStorageType]):
data storage type used to store your data, default to PublicBos
- storage_id (Optional[str]): private BOS bucket name,
needed when storage_type is PrivateBos, default to None
- storage_path (Optional[str]): private BOS file path,
needed when storage_type is PrivateBos, default to None
- addition_info (Optional[Dict[str, Any]]):
additional info you want to have, default to None
- ak (Optional[str]):
console ak related to your dataset and bos, default to None
- sk (Optional[str]):
console sk related to your dataset and bos, default to None
- **kwargs (Any): other arguments
- Returns:
QianfanDataSource: a data source representing your dataset on Qianfan
- classmethod create_from_bos_file(name: str, template_type: DataTemplateType, storage_id: str, storage_path: str, file_name: str, is_data_annotated: bool, storage_type: DataStorageType = DataStorageType.PrivateBos, addition_info: Optional[Dict[str, Any]] = None, ak: Optional[str] = None, sk: Optional[str] = None, is_download_to_local: bool = True, **kwargs: Any) QianfanDataSource[source]
create a dataset on qianfan as data source, importing its data from the specified BOS location
- Args:
- name (str): dataset name you want
- template_type (DataTemplateType): template type applied to the dataset
- storage_id (str): private BOS bucket name
- storage_path (str): private BOS file path
- file_name (str): file that needs to be uploaded
- is_data_annotated (bool): whether the data in BOS is annotated
- storage_type (Optional[DataStorageType]):
data storage type used to store your data, default to PrivateBos
- addition_info (Optional[Dict[str, Any]]):
additional info you want to have, default to None
- ak (Optional[str]):
console ak related to your dataset and bos, default to None
- sk (Optional[str]):
console sk related to your dataset and bos, default to None
- is_download_to_local (bool):
whether the dataset downloads its file when the object is initialized, default to True
- **kwargs (Any): other arguments
- Returns:
QianfanDataSource: a data source representing your dataset on Qianfan
- data_format_type: FormatType
- download_when_init: bool
- fetch(**kwargs: Any) Union[str, List[str]][source]
Read data from qianfan or local cache.
- Args:
**kwargs (Any): Arbitrary keyword arguments.
- Returns:
Union[str, List[str]]: content retrieved from data source
- format_type() FormatType[source]
Get format type binding to qianfan data source
- Returns:
FormatType: format type binding to qianfan data source
- classmethod get_existed_dataset(dataset_id: int, is_download_to_local: bool = True, ak: Optional[str] = None, sk: Optional[str] = None, **kwargs: Any) QianfanDataSource[source]
Load a dataset from qianfan as data source
- Args:
- dataset_id (int): dataset id on Qianfan, shown in the console as “数据集版本 ID” (dataset version ID)
- is_download_to_local (bool):
whether the dataset downloads its file when the object is initialized, default to True
- ak (Optional[str]):
console ak related to your dataset and bos, default to None
- sk (Optional[str]):
console sk related to your dataset and bos, default to None
- **kwargs (Any): other arguments
- Returns:
QianfanDataSource: a data source representing your dataset on Qianfan
- group_id: int
- id: int
- info: Dict[str, Any]
- name: str
- project_type: DataProjectType
- release_dataset(**kwargs: Any) bool[source]
make a dataset released
- Returns:
bool: Whether releasing succeeded
- save(data: Optional[str] = None, zip_file_path: Optional[str] = None, is_annotated: bool = False, does_release: bool = False, sup_storage_id: str = '', sup_storage_path: str = '', sup_storage_region: str = '', **kwargs: Any) bool[source]
Write data to qianfan. Currently only writing to user BOS storage is supported
- Args:
- data (str):
data waiting to be uploaded, default to None
- zip_file_path (Optional[str]):
zip file path which contains data files, default to None.
- is_annotated (bool):
whether the data has been annotated, default to False
- does_release (bool):
whether to release the dataset after saving successfully, default to False
- sup_storage_id (Optional[str]):
BOS bucket name used for uploading. Recommended when your destination dataset on qianfan is stored in public BOS. Default to empty str
- sup_storage_path (Optional[str]):
BOS bucket file path used for uploading. Recommended when your destination dataset on qianfan is stored in public BOS. Default to empty str
- sup_storage_region (Optional[str]):
BOS bucket region used for uploading. Recommended when your destination dataset on qianfan is stored in public BOS. Default to empty str
- **kwargs (Any):
optional arguments.
- Returns:
bool: whether the data has been uploaded successfully
- set_format_type(format_type: FormatType) None[source]
Set format type binding to qianfan data source. Not available.
TextOnly -> Jsonl
MultiModel -> Json
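The two arrows above read as a fixed mapping from dataset type to format. A minimal sketch of that lookup follows. `DataSetTypeSketch` re-declares the listed values locally, and the strings "Jsonl" and "Json" stand in for `FormatType` members, which are not listed in this section.

```python
from enum import Enum


class DataSetTypeSketch(int, Enum):
    """Values copied from the DataSetType listing."""
    TextOnly = 4
    MultiModel = 7


# Plain strings stand in for FormatType members (an assumption of this sketch).
FORMAT_BY_SET_TYPE = {
    DataSetTypeSketch.TextOnly: "Jsonl",
    DataSetTypeSketch.MultiModel: "Json",
}

text_format = FORMAT_BY_SET_TYPE[DataSetTypeSketch(4)]
```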
- set_type: DataSetType
- sk: Optional[str]
- storage_id: str
- storage_name: str
- storage_path: str
- storage_raw_path: Optional[str]
- storage_region: Optional[str]
- storage_type: DataStorageType
- template_type: DataTemplateType
- version: int
- class qianfan.dataset.Table(inner_table: Table)[source]
Bases: Addable, Listable, Processable
In-memory dataset representation, inherited from pyarrow.Table, implementing the interfaces in process_interface.py
- append(elem: Any, add_new_group: bool = False, is_grouped: bool = True) Self[source]
append an element to pyarrow table
- Args:
- elem (Union[List[Dict], Tuple[Dict], Dict]):
Elements added to pyarrow table
- add_new_group (bool):
Whether elem has a new group id. Only used when table is grouped.
- is_grouped (bool):
Whether the elements in elem belong to the same group. Only used when the table is grouped, elem is a Sequence, and add_new_group is True. Defaults to True, in which case all elements share one group. If False, each element gets a sequentially increasing group id, continuing from the last available group id.
- Returns:
Self: Table itself
- col_append(elem: Any) Self[source]
append a column to pyarrow table
- Args:
- elem (Dict[str, List]): a dict describing the column to add to the pyarrow table, which
must have the column name under “name” and the column data list under “data”
- Returns:
Self: Table itself
- col_delete(index: Union[int, str]) Self[source]
delete a column from pyarrow table
- Args:
index (str): column name to delete
- Returns:
Self: Table itself
- col_filter(op: Callable[[Any], bool]) Self[source]
filter on pyarrow table’s column
- Args:
op (Callable[[Any], bool]): handler used to filter
- Returns:
Self: Table itself
- col_insert(elem: Any, index: Any) Self[source]
insert a column into pyarrow table
- Args:
- elem (Dict[str, List]): a dict describing the column to add to the pyarrow table, which
must have the column name under “name” and the column data list under “data”
- index (int): where to insert the new column
- Returns:
Self: Table itself
- col_list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) Any[source]
get column(s) from pyarrow table
- Args:
- by (Optional[Union[int, str, Sequence[int], Sequence[str]]]):
index or indices for columns, default to None, in which case return a python list of pyarrow table column
- Returns:
Any: pyarrow table column list
- col_map(op: Callable[[Any], Any]) Self[source]
map on pyarrow table’s column
- Args:
op (Callable[[Any], Any]): handler used to map
- Returns:
Self: Table itself
- col_renames(new_names: List[str]) Self[source]
rename all dataset column
- Args:
new_names (List[str]): All new names for columns
- Returns:
Self: A brand-new Table with new name
- delete(index: Union[int, str]) Self[source]
delete an element from pyarrow table
- Args:
index (Union[int, str]): element index to delete
- Returns:
Self: Table itself
- filter(op: Callable[[Any], bool]) Self[source]
filter on pyarrow table’s row
- Args:
op (Callable[[Any], bool]): handler used to filter
- Returns:
Self: Table itself
- insert(elem: Any, index: Any, group_id: int = -1, add_new_group: bool = False, is_grouped: bool = True) Self[source]
insert an element to pyarrow table
- Args:
- elem (Union[List[Dict], Tuple[Dict], Dict]):
Elements added to pyarrow table
- index (int):
where to insert element(s)
- group_id (int):
which group id to apply to the new element(s). Defaults to -1, which means the group id is inferred automatically from the table.
- add_new_group (bool):
Whether elem has a new group id. Only used when table is grouped and group_id is -1
- is_grouped (bool):
Whether the elements in elem belong to the same group. Only used when the table is grouped, elem is a Sequence, and add_new_group is True. Defaults to True, in which case all elements share one group. If False, each element gets a sequentially increasing group id, continuing from the last available group id.
- Returns:
Self: Table itself
- list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) Any[source]
get element(s) from pyarrow table
- Args:
- by (Optional[Union[slice, int, Sequence[int]]]):
index or indices for elements, default to None, in which case return a python list of pyarrow table row
- Returns:
Any: pyarrow table row list
- map(op: Callable[[Any], Any]) Self[source]
map on pyarrow table’s row
- Args:
op (Callable[[Any], Any]): handler used to map
- Returns:
Self: Table itself
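Most of the methods above return `Self`, which allows calls to be chained. The sketch below illustrates only that pattern on a hypothetical `MiniTable` backed by a list of dicts. It assumes `op` receives one row at a time and that `map` replaces each row with the returned value, which this section does not spell out.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class MiniTable:
    """Hypothetical row store; only the return-self pattern mirrors the docs."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = list(rows)

    def map(self, op: Callable[[Row], Row]) -> "MiniTable":
        # Replace every row with the handler's result, then allow chaining.
        self.rows = [op(row) for row in self.rows]
        return self

    def filter(self, op: Callable[[Row], bool]) -> "MiniTable":
        # Keep only rows for which the handler returns True.
        self.rows = [row for row in self.rows if op(row)]
        return self


table = MiniTable([{"prompt": " hi "}, {"prompt": ""}, {"prompt": "bye"}])
result = (
    table
    .map(lambda row: {**row, "prompt": row["prompt"].strip()})
    .filter(lambda row: bool(row["prompt"]))
)
```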
Submodules
qianfan.dataset.consts module
constants for dataset using
qianfan.dataset.data_operator module
data operators for Qianfan online data processing. Not available currently
- class qianfan.dataset.data_operator.DeduplicationSimhash(*, operator_name: str = 'deduplication_simhash', operator_type: str = 'deduplication', distance: float)[source]
Bases: Deduplicator
Deduplicator class to deduplicate by simhash
- distance: float
- operator_name: str
- class qianfan.dataset.data_operator.Deduplicator(*, operator_name: str, operator_type: str = 'deduplication')[source]
Bases: QianfanOperator
Deduplicator class for online ETL operator
- operator_type: str
- class qianfan.dataset.data_operator.DesensitizationProcessor(*, operator_name: str, operator_type: str = 'desensitization')[source]
Bases: QianfanOperator
Sensitive data processor class for online ETL operator
- operator_type: str
- class qianfan.dataset.data_operator.ExceptionRegulator(*, operator_name: str, operator_type: str = 'clean')[source]
Bases: QianfanOperator
Exception class for online ETL operator
- operator_type: str
- class qianfan.dataset.data_operator.Filter(*, operator_name: str, operator_type: str = 'filter')[source]
Bases: QianfanOperator
Filter class for online ETL operator
- operator_type: str
- class qianfan.dataset.data_operator.FilterCheckCharacterRepetitionRemoval(*, operator_name: str = 'filter_check_character_repetition_removal', operator_type: str = 'filter', default_character_repetition_max_cutoff: float)[source]
Bases: Filter
Filter class to check character repetition removal
- default_character_repetition_max_cutoff: float
- operator_name: str
- class qianfan.dataset.data_operator.FilterCheckFlaggedWords(*, operator_name: str = 'filter_check_flagged_words', operator_type: str = 'filter', flagged_words_max_cutoff: float)[source]
Bases: Filter
Filter class to check flagged words
- flagged_words_max_cutoff: float
- operator_name: str
- class qianfan.dataset.data_operator.FilterCheckLangId(*, operator_name: str = 'filter_check_lang_id', operator_type: str = 'filter', lang_id_min_cutoff: float)[source]
Bases: Filter
Filter class to check lang id
- lang_id_min_cutoff: float
- operator_name: str
- class qianfan.dataset.data_operator.FilterCheckNumberWords(*, operator_name: str = 'filter_check_number_words', operator_type: str = 'filter', number_words_min_cutoff: int = 1, number_words_max_cutoff: int = 10000)[source]
Bases: Filter
Filter class to check number of words
- number_words_max_cutoff: int
- number_words_min_cutoff: int
- operator_name: str
- class qianfan.dataset.data_operator.FilterCheckPerplexity(*, operator_name: str = 'filter_check_perplexity', operator_type: str = 'filter', perplexity_max_cutoff: int)[source]
Bases: Filter
Filter class to check perplexity
- operator_name: str
- perplexity_max_cutoff: int
- class qianfan.dataset.data_operator.FilterCheckSpecialCharacters(*, operator_name: str = 'filter_check_special_characters', operator_type: str = 'filter', special_characters_max_cutoff: float)[source]
Bases: Filter
Filter class to check special characters
- operator_name: str
- special_characters_max_cutoff: float
- class qianfan.dataset.data_operator.FilterCheckWordRepetitionRemoval(*, operator_name: str = 'filter_check_word_repetition_removal', operator_type: str = 'filter', word_repetition_max_cutoff: float)[source]
Bases: Filter
Filter class to check word repetition removal
- operator_name: str
- word_repetition_max_cutoff: float
- class qianfan.dataset.data_operator.QianfanOperator(*, operator_name: str, operator_type: str)[source]
Bases: BaseModel
Basic class for online ETL operator
- operator_name: str
- operator_type: str
- class qianfan.dataset.data_operator.RemoveEmoji(*, operator_name: str = 'remove_emoji', operator_type: str = 'clean')[source]
Bases: ExceptionRegulator
Exception class to remove emoji
- operator_name: str
- class qianfan.dataset.data_operator.RemoveInvisibleCharacter(*, operator_name: str = 'remove_invisible_character', operator_type: str = 'clean')[source]
Bases: ExceptionRegulator
Exception class to remove invisible characters
- operator_name: str
- class qianfan.dataset.data_operator.RemoveNonMeaningCharacters(*, operator_name: str = 'remove_non_meaning_characters', operator_type: str = 'clean')[source]
Bases: ExceptionRegulator
Exception class to remove non-meaning characters
- operator_name: str
- class qianfan.dataset.data_operator.RemoveWebIdentifiers(*, operator_name: str = 'remove_web_identifiers', operator_type: str = 'clean')[source]
Bases: ExceptionRegulator
Exception class to remove web identifiers
- operator_name: str
- class qianfan.dataset.data_operator.ReplaceEmails(*, operator_name: str = 'replace_emails', operator_type: str = 'desensitization')[source]
Bases: DesensitizationProcessor
Sensitive data processor class to replace emails
- operator_name: str
- class qianfan.dataset.data_operator.ReplaceIdentifier(*, operator_name: str = 'replace_identifier', operator_type: str = 'desensitization')[source]
Bases: DesensitizationProcessor
Sensitive data processor class to replace identifiers
- operator_name: str
- class qianfan.dataset.data_operator.ReplaceIp(*, operator_name: str = 'replace_ip', operator_type: str = 'desensitization')[source]
Bases:
DesensitizationProcessor
Sensitive data processor class to replace IP addresses
- operator_name: str
- class qianfan.dataset.data_operator.ReplaceTraditionalChineseToSimplified(*, operator_name: str = 'replace_traditional_chinese_to_simplified', operator_type: str = 'clean')[source]
Bases:
ExceptionRegulator
Exception class to replace Traditional Chinese with Simplified Chinese
- operator_name: str
- class qianfan.dataset.data_operator.ReplaceUniformWhitespace(*, operator_name: str = 'replace_uniform_whitespace', operator_type: str = 'clean')[source]
Bases:
ExceptionRegulator
Exception class to replace whitespace with uniform whitespace
- operator_name: str
qianfan.dataset.data_source module
Data sources used for downloading and uploading data
- class qianfan.dataset.data_source.DataSource[source]
Bases:
ABC
Basic data source class
- abstract async afetch(**kwargs: Any) Union[str, List[str]][source]
Asynchronously fetch data from source
- Args:
**kwargs (Any): optional arguments
- Returns:
Union[str, List[str]]: content retrieved from data source
- abstract async asave(data: str, **kwargs: Any) bool[source]
Asynchronously export the data to the data source and return whether the export succeeded
- Args:
data (str): data to be saved
**kwargs (Any): optional arguments
- Returns:
bool: whether saving succeeded
- abstract fetch(**kwargs: Any) Union[str, List[str]][source]
Fetch data from source
- Args:
**kwargs (Any): optional arguments
- Returns:
Union[str, List[str]]: content retrieved from data source
- abstract format_type() FormatType[source]
Get format type binding to source
- Returns:
FormatType: format type binding to source
- abstract save(data: str, **kwargs: Any) bool[source]
Export the data to the data source and return whether the export succeeded
- Args:
data (str): data to be saved
**kwargs (Any): optional arguments
- Returns:
bool: whether saving succeeded
- abstract set_format_type(format_type: FormatType) None[source]
Set format type binding to source
- Args:
format_type (FormatType): format type binding to source
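A minimal sketch of what a concrete data source could look like, assuming only the abstract interface listed above. To keep the snippet runnable without qianfan installed, `DataSource` and `FormatType` are mirrored locally as stand-ins, and `InMemoryDataSource` is a hypothetical name that is not part of the library.

```python
import asyncio
from abc import ABC, abstractmethod
from enum import Enum
from typing import Any, List, Union


class FormatType(Enum):
    """Local stand-in mirroring qianfan.dataset.data_source.FormatType."""

    Json = "json"
    Jsonl = "jsonl"
    Csv = "csv"
    Text = "txt"


class DataSource(ABC):
    """Local stand-in mirroring the abstract methods documented above."""

    @abstractmethod
    def fetch(self, **kwargs: Any) -> Union[str, List[str]]: ...

    @abstractmethod
    def save(self, data: str, **kwargs: Any) -> bool: ...

    @abstractmethod
    async def afetch(self, **kwargs: Any) -> Union[str, List[str]]: ...

    @abstractmethod
    async def asave(self, data: str, **kwargs: Any) -> bool: ...

    @abstractmethod
    def format_type(self) -> FormatType: ...

    @abstractmethod
    def set_format_type(self, format_type: FormatType) -> None: ...


class InMemoryDataSource(DataSource):
    """Hypothetical source that keeps its content in a string attribute."""

    def __init__(self, format_type: FormatType = FormatType.Jsonl) -> None:
        self._content = ""
        self._format = format_type

    def fetch(self, **kwargs: Any) -> Union[str, List[str]]:
        return self._content

    def save(self, data: str, **kwargs: Any) -> bool:
        self._content = data
        return True

    async def afetch(self, **kwargs: Any) -> Union[str, List[str]]:
        # Delegates to the synchronous version; nothing here blocks.
        return self.fetch(**kwargs)

    async def asave(self, data: str, **kwargs: Any) -> bool:
        return self.save(data, **kwargs)

    def format_type(self) -> FormatType:
        return self._format

    def set_format_type(self, format_type: FormatType) -> None:
        self._format = format_type


source = InMemoryDataSource()
saved = source.save('{"prompt": "hi"}\n')
fetched = asyncio.run(source.afetch())
```

With the real library, the subclass would presumably derive from `qianfan.dataset.data_source.DataSource` instead of the local stand-in; that substitution is not verified here.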
- class qianfan.dataset.data_source.FileDataSource(*, path: str, file_format: Optional[FormatType] = None, save_as_folder: bool = False)[source]
Bases:
DataSource, BaseModel
File data source
- async afetch(**kwargs: Any) Union[str, List[str]][source]
Asynchronously read data from file. Not currently available.
- Args:
**kwargs (Any): Arbitrary keyword arguments.
- Returns:
Union[str, List[str]]: string or list of strings containing the data read from the file.
- async asave(data: Union[str, List[str]], **kwargs: Any) bool[source]
Asynchronously write data to file. Not currently available.
- Args:
data (Union[str, List[str]]): data to be written
**kwargs (Any): optional arguments
- Returns:
bool: whether the data has been written successfully
- fetch(**kwargs: Any) Union[str, List[str]][source]
Read data from file.
- Args:
**kwargs (Any): Arbitrary keyword arguments.
- Returns:
Union[str, List[str]]: string or list of strings containing the data read from the file.
- file_format: Optional[FormatType]
- format_type() FormatType[source]
Get format type binding to source
- Returns:
FormatType: format type binding to source
- path: str
- save(data: Union[str, List[str]], **kwargs: Any) bool[source]
Write data to file.
- Args:
data (Union[str, List[str]]): data to be written
**kwargs (Any): optional arguments
- Returns:
bool: whether the data has been written successfully
- save_as_folder: bool
- set_format_type(format_type: FormatType) None[source]
Set format type binding to source
- Args:
format_type (FormatType): format type binding to source
- class qianfan.dataset.data_source.FormatType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]
Bases:
Enum
Enum for data source format type
- Csv = 'csv'
- Json = 'json'
- Jsonl = 'jsonl'
- Text = 'txt'
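Because `file_format` is optional on `FileDataSource`, a caller may want to derive a format from a file name before constructing one. The sketch below mirrors the enum locally so that it runs without qianfan installed; `guess_format` is a hypothetical helper, and whether the library performs similar inference internally is not stated in this reference.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class FormatType(Enum):
    """Local stand-in mirroring the enum values listed above."""

    Csv = "csv"
    Json = "json"
    Jsonl = "jsonl"
    Text = "txt"


def guess_format(path: str) -> Optional[FormatType]:
    """Map a file extension onto a FormatType member, or None if unknown."""
    suffix = Path(path).suffix.lstrip(".").lower()
    try:
        # Enum lookup by value: FormatType("jsonl") is FormatType.Jsonl.
        return FormatType(suffix)
    except ValueError:
        return None


jsonl_format = guess_format("data/train.jsonl")
text_format = guess_format("notes.TXT")
unknown_format = guess_format("archive.zip")
```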
- class qianfan.dataset.data_source.QianfanDataSource(*, id: int, group_id: int, name: str, set_type: DataSetType, project_type: DataProjectType, template_type: DataTemplateType, version: int, storage_type: DataStorageType, storage_id: str, storage_path: str, storage_raw_path: Optional[str] = None, storage_name: str, storage_region: Optional[str] = None, info: Dict[str, Any] = {}, download_when_init: bool = False, data_format_type: FormatType, ak: Optional[str] = None, sk: Optional[str] = None)[source]
Bases:
DataSource, BaseModel
Qianfan data source
- async afetch(**kwargs: Any) Union[str, List[str]][source]
Asynchronously read data from Qianfan or the local cache. Not currently available.
- Args:
**kwargs (Any): Arbitrary keyword arguments.
- Returns:
Union[str, List[str]]: content retrieved from data source
- ak: Optional[str]
- async asave(data: str, is_annotated: bool = False, **kwargs: Any) bool[source]
Asynchronously write data to Qianfan. Currently only writing to user BOS storage is supported.
Not currently available.
- Args:
data (str): data to be uploaded
is_annotated (bool): whether the data has been annotated
**kwargs (Any): optional arguments
- Returns:
bool: whether the data has been uploaded successfully
- classmethod create_bare_dataset(name: str, template_type: DataTemplateType, storage_type: DataStorageType = DataStorageType.PublicBos, storage_id: Optional[str] = None, storage_path: Optional[str] = None, addition_info: Optional[Dict[str, Any]] = None, ak: Optional[str] = None, sk: Optional[str] = None, **kwargs: Any) QianfanDataSource[source]
Create a bare (empty) dataset on Qianfan to use as a data source
- Args:
name (str): name of the dataset to create
template_type (DataTemplateType): template type applied to the dataset
storage_type (Optional[DataStorageType]): storage type used to store your data, default to PublicBos
storage_id (Optional[str]): private BOS bucket name, needed when storage_type is PrivateBos, default to None
storage_path (Optional[str]): private BOS file path, needed when storage_type is PrivateBos, default to None
addition_info (Optional[Dict[str, Any]]): additional info you want to attach, default to None
ak (Optional[str]): console ak related to your dataset and BOS, default to None
sk (Optional[str]): console sk related to your dataset and BOS, default to None
**kwargs (Any): other arguments
- Returns:
QianfanDataSource: A datasource represents your dataset on Qianfan
- classmethod create_from_bos_file(name: str, template_type: DataTemplateType, storage_id: str, storage_path: str, file_name: str, is_data_annotated: bool, storage_type: DataStorageType = DataStorageType.PrivateBos, addition_info: Optional[Dict[str, Any]] = None, ak: Optional[str] = None, sk: Optional[str] = None, is_download_to_local: bool = True, **kwargs: Any) QianfanDataSource[source]
Create a dataset on Qianfan to use as a data source, importing its data from the specified BOS location
- Args:
name (str): name of the dataset to create
template_type (DataTemplateType): template type applied to the dataset
storage_id (str): private BOS bucket name
storage_path (str): private BOS file path
file_name (str): file to be uploaded
is_data_annotated (bool): whether the data in BOS is annotated
storage_type (Optional[DataStorageType]): storage type used to store your data, default to PrivateBos
addition_info (Optional[Dict[str, Any]]): additional info you want to attach, default to None
ak (Optional[str]): console ak related to your dataset and BOS, default to None
sk (Optional[str]): console sk related to your dataset and BOS, default to None
is_download_to_local (bool): whether to download the dataset file when the object is initialized, default to True
**kwargs (Any): other arguments
- Returns:
QianfanDataSource: A datasource represents your dataset on Qianfan
- data_format_type: FormatType
- download_when_init: bool
- fetch(**kwargs: Any) Union[str, List[str]][source]
Read data from Qianfan or the local cache.
- Args:
**kwargs (Any): Arbitrary keyword arguments.
- Returns:
Union[str, List[str]]: content retrieved from data source
- format_type() FormatType[source]
Get format type binding to qianfan data source
- Returns:
FormatType: format type binding to qianfan data source
- classmethod get_existed_dataset(dataset_id: int, is_download_to_local: bool = True, ak: Optional[str] = None, sk: Optional[str] = None, **kwargs: Any) QianfanDataSource[source]
Load a dataset from qianfan as data source
- Args:
dataset_id (int): dataset id on Qianfan, shown as “数据集版本 ID” (dataset version ID)
is_download_to_local (bool): whether to download the dataset file when the object is initialized, default to True
ak (Optional[str]): console ak related to your dataset and BOS, default to None
sk (Optional[str]): console sk related to your dataset and BOS, default to None
**kwargs (Any): other arguments
- Returns:
QianfanDataSource: A datasource represents your dataset on Qianfan
- group_id: int
- id: int
- info: Dict[str, Any]
- name: str
- project_type: DataProjectType
- release_dataset(**kwargs: Any) bool[source]
Release the dataset
- Returns:
bool: Whether releasing succeeded
- save(data: Optional[str] = None, zip_file_path: Optional[str] = None, is_annotated: bool = False, does_release: bool = False, sup_storage_id: str = '', sup_storage_path: str = '', sup_storage_region: str = '', **kwargs: Any) bool[source]
Write data to Qianfan. Currently only writing to user BOS storage is supported
- Args:
data (Optional[str]): data to be uploaded, default to None
zip_file_path (Optional[str]): path of a zip file which contains data files, default to None
is_annotated (bool): whether the data has been annotated, default to False
does_release (bool): whether to release the dataset after saving successfully, default to False
sup_storage_id (Optional[str]): BOS bucket name used for uploading, recommended when your destination dataset on Qianfan is stored in public BOS. Default to empty str
sup_storage_path (Optional[str]): BOS bucket file path used for uploading, recommended when your destination dataset on Qianfan is stored in public BOS. Default to empty str
sup_storage_region (Optional[str]): BOS bucket region used for uploading, recommended when your destination dataset on Qianfan is stored in public BOS. Default to empty str
**kwargs (Any): optional arguments
- Returns:
bool: has data been uploaded successfully
- set_format_type(format_type: FormatType) None[source]
Set the format type bound to the Qianfan data source. Not available.
The format follows the dataset type: TextOnly -> Jsonl, MultiModel -> Json
- set_type: DataSetType
- sk: Optional[str]
- storage_id: str
- storage_name: str
- storage_path: str
- storage_raw_path: Optional[str]
- storage_region: Optional[str]
- storage_type: DataStorageType
- template_type: DataTemplateType
- version: int
qianfan.dataset.dataset module
Core dataset concept: a wrapper around data processing, data transmission and data validation
- class qianfan.dataset.dataset.Dataset(inner_table: Table, inner_data_source_cache: Optional[DataSource] = None, inner_schema_cache: Optional[Schema] = None, input_columns: Optional[List[str]] = None, reference_column: Optional[str] = None, **kwargs: Any)[source]
Bases:
Table
- add_default_group_column() Self[source]
add a “_group” column to the Dataset; the values in the “_group” column increase sequentially
- Returns:
Self: Dataset itself
- append(elem: Any, add_new_group: bool = False, is_grouped: bool = True) Self[source]
append element(s) to dataset
- Args:
- elem (Union[List[List[Dict]], List[Dict], Tuple[Dict], Dict]):
Elements added to dataset
- add_new_group (bool):
Whether elem has a new group id. Only used when dataset is grouped.
- is_grouped (bool):
Whether the elements in elem belong to the same group. Only used when dataset is grouped, elem is a Sequence and add_new_group is set to True. Default to True, in which case all elements will be in the same group. If it’s False, each element will have a sequentially incrementing group id, continuing from the last available group id.
- Returns:
Self: Dataset itself
- async atest_using_llm(model_id: Optional[int] = None, model_version_id: Optional[int] = None, service_model: Optional[str] = None, service_endpoint: Optional[str] = None, is_chat_service: bool = True, does_show_latency: bool = True, **kwargs: Any) Dataset[source]
Initialize an LLM instance from the arguments and asynchronously get its output on the current dataset. Set only the model arguments or only the service arguments when instantiating
- Args:
- model_id (Optional[int]):
id of your own model, default to None
- model_version_id (Optional[int]):
version id of your own model, default to None
- service_model (Optional[str]):
name of model you want to use as service, default to None
- service_endpoint (Optional[str]):
endpoint of service, default to None
- is_chat_service (bool):
whether the service is a chat service, default to True. The service will be Completion if False
- does_show_latency (bool):
whether the result dataset contains latency info columns when a Service is the evaluated object. Depending on the request mode (stream or non-stream), it will contain request_complete_latency or the (first_token_latency, request_complete_latency) combo. Default to True
- **kwargs (Any):
optional argument dict
- Returns:
Dataset: a dataset containing inputs, reference outputs and LLM outputs
- col_append(elem: Any) Self[source]
append a column to dataset
- Args:
- elem (Dict[str, List]): a dict containing the element added to dataset, which
must have the column name under “name” and the column data list under “data”
- Returns:
Self: Dataset itself
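The `elem` argument of `col_append` is described as a dict carrying a “name” and a “data” entry. The sketch below builds and checks such a dict in plain Python. `make_column` and `check_column` are hypothetical helpers, and the requirement that the data list matches the row count is an assumption based on how tabular data generally behaves, not something this reference states.

```python
from typing import Any, Dict, List


def make_column(name: str, data: List[Any]) -> Dict[str, Any]:
    """Build a dict in the {"name": ..., "data": [...]} shape described above."""
    return {"name": name, "data": list(data)}


def check_column(elem: Dict[str, Any], row_count: int) -> None:
    """Raise ValueError if elem lacks the two keys or has the wrong length."""
    if set(elem) != {"name", "data"}:
        raise ValueError("elem must contain exactly the keys 'name' and 'data'")
    if len(elem["data"]) != row_count:
        # Assumption: a new column needs one value per existing row.
        raise ValueError("column data length must equal the number of rows")


column = make_column("label", [0, 1])
check_column(column, row_count=2)  # passes silently
```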
- col_delete(index: Union[int, str]) Self[source]
delete a column from dataset
- Args:
index (str): column name to delete
- Returns:
Self: Dataset itself
- col_filter(op: Callable[[Any], bool]) Self[source]
filter on dataset’s column
- Args:
op (Callable[[Any], bool]): handler used to filter
- Returns:
Self: Dataset itself
- col_insert(elem: Any, index: Any) Self[source]
insert a column into dataset
- Args:
- elem (Dict[str, List]): dict containing the element added to dataset, which
must have the column name under “name” and the column data list under “data”
index (int): where to insert the new column
- Returns:
Self: Dataset itself
- col_list(by: Optional[Union[slice, int, str, List[int], Tuple[int], List[str], Tuple[str]]] = None) Any[source]
get column(s) from dataset
- Args:
- by (Optional[Union[int, str, Sequence[int], Sequence[str]]]):
index or indices for columns, default to None, in which case return a python list of dataset column
- Returns:
Any: dataset column list
- col_map(op: Callable[[Any], Any]) Self[source]
map on dataset’s column
- Args:
op (Callable[[Any], Any]): handler used to map
- Returns:
Self: Dataset itself
- col_renames(new_names: List[str]) Self[source]
rename all dataset column
- Args:
new_names (List[str]): All new names for columns
- Returns:
Self: A brand-new Dataset with the new column names
- classmethod create_from_pyarrow_table(table: Table, schema: Optional[Schema] = None, **kwargs: Any) Dataset[source]
create a dataset from pyarrow table
- Args:
- table (pyarrow):
pyarrow table object used to create dataset.
- schema (Optional[Schema]):
schema used to validate before exporting data, default to None
- **kwargs (Any):
optional arguments
- Returns:
Dataset: a dataset instance
- classmethod create_from_pyobj(data: Union[List[Dict[str, Any]], Dict[str, List]], schema: Optional[Schema] = None, **kwargs: Any) Dataset[source]
create a dataset from python dict or list
- Args:
- data (Union[List[Dict[str, Any]], Dict[str, List]]):
python object used to create dataset.
- schema (Optional[Schema]):
schema used to validate before exporting data, default to None
- **kwargs (Any):
optional arguments
- Returns:
Dataset: a dataset instance
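The `data` parameter of `create_from_pyobj` accepts two shapes: a list of row dicts (`List[Dict[str, Any]]`) or a dict of column lists (`Dict[str, List]`). The plain-Python sketch below shows how the same records look in each shape. `rows_to_columns` and `columns_to_rows` are hypothetical helpers written for illustration, and they assume every row has the same keys.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn row-oriented records into column-oriented lists."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn column-oriented lists back into row-oriented records."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


row_shape = [
    {"prompt": "1+1=?", "response": "2"},
    {"prompt": "2+2=?", "response": "4"},
]
column_shape = rows_to_columns(row_shape)
round_trip = columns_to_rows(column_shape)
```

Either `row_shape` or `column_shape` matches the type annotation of `data`; which one is more convenient depends on how the records were produced.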
- delete(index: Union[int, str]) Self[source]
delete an element from dataset
- Args:
index (Union[int, str]): element index to delete
- Returns:
Self: Dataset itself
- delete_group_column() Self[source]
remove “_group” column from Dataset
- Returns:
Self: Dataset itself
- filter(op: Callable[[Any], bool]) Self[source]
filter on dataset
- Args:
op (Callable[[Any], bool]): handler used to filter
- Returns:
Self: Dataset itself
- property get_input_data: Dict[str, List[Any]]
get input columns data in dataset
- Returns:
- Dict[str, List[Any]]: a dict
which indicates the “column name-column data” pairs
- property get_reference_data: List[Any]
get reference data in dataset
- Returns:
List[Any]: list of output data column
- insert(elem: Any, index: Any, group_id: int = -1, add_new_group: bool = False, is_grouped: bool = True) Self[source]
insert element(s) to dataset
- Args:
- elem (Union[List[List[Dict]], List[Dict], Tuple[Dict], Dict]):
Elements added to dataset
index (int): where to insert element(s)
- group_id (int):
which group id you want to apply to new element(s). Default to -1, which means the group id is inferred automatically from the table.
- add_new_group (bool):
Whether elem has a new group id. Only used when dataset is grouped and group_id is -1
- is_grouped (bool):
Whether the elements in elem belong to the same group. Only used when dataset is grouped, elem is a Sequence and add_new_group is set to True. Default to True, in which case all elements will be in the same group. If it’s False, each element will have a sequentially incrementing group id, continuing from the last available group id.
- Returns:
Self: Dataset itself
- is_dataset_generic_text() bool[source]
tell whether current dataset is generic text dataset
- Returns:
bool: whether current dataset is generic text dataset
- is_dataset_located_in_qianfan() bool[source]
tell whether current dataset is cloud-based dataset
- Returns:
bool: whether current dataset is cloud-based dataset
- list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None, **kwargs: Any) Any[source]
get element(s) from dataset
- Args:
- by (Optional[Union[slice, int, Sequence[int]]]):
index or indices for elements, default to None, in which case return a python list of dataset row
- Returns:
Any: dataset row list
- classmethod load(source: Optional[DataSource] = None, data_file: Optional[str] = None, qianfan_dataset_id: Optional[int] = None, bos_load_args: Optional[Dict[str, Any]] = None, huggingface_dataset: Optional[Any] = None, schema: Optional[Schema] = None, organize_data_as_group: bool = False, **kwargs: Any) Dataset[source]
Read data from the source or create a source from the parameters and create a Table instance. If a schema is specified, perform validation after importing.
- Args:
- source (Optional[DataSource]): where dataset load from,
default to None, in which case a data source will be created inside the dataset using the parameters below
- data_file (Optional[str]):
dataset local file path, default to None
- qianfan_dataset_id (Optional[int]):
qianfan dataset ID, default to None
- bos_load_args: (Optional[Dict[str, Any]]):
create a dataset and import initial dataset content from args
- huggingface_dataset (Optional[Any]):
Huggingface dataset object, only support DatasetDict and Dataset of Huggingface datasets.
- schema (Optional[Schema]):
schema used to validate loaded data, default to None
- organize_data_as_group (bool):
only available when data source’s format is FormatType.Jsonl. Indicates whether organize data within dataset in group format, default to False, and when it’s True, the default format will be a group-based 2D structure.
**kwargs (Any): optional arguments
- Returns:
Dataset: a dataset instance
- map(op: Callable[[Any], Any]) Self[source]
map on dataset
- Args:
op (Callable[[Any], Any]): handler used to map
- Returns:
Self: Dataset itself
- online_data_process(operators: List[QianfanOperator]) Dict[str, Any][source]
create an online ETL task on qianfan
- Args:
operators (List[QianfanOperator]): operators applied to ETL task
- Returns:
- Dict[str, Any]: ETL task info, containing 3 fields:
is_succeeded (bool): whether the ETL task succeeded
etl_task_id (Optional[int]): ETL task id, only exists when the ETL task is created successfully
new_dataset_id (Optional[int]): id of the dataset which stores the data after ETL, only exists when the ETL task has succeeded
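The returned dict has optional fields, so callers need to branch on `is_succeeded` and on whether `etl_task_id` is present. A minimal sketch of that handling follows; `summarize_etl_result` is a hypothetical helper, and the sample dicts are hand-written for illustration rather than real service output.

```python
from typing import Any, Dict, Optional


def summarize_etl_result(result: Dict[str, Any]) -> str:
    """Describe an ETL result dict using the three documented fields."""
    task_id: Optional[int] = result.get("etl_task_id")
    new_dataset_id: Optional[int] = result.get("new_dataset_id")
    if result.get("is_succeeded"):
        return f"ETL task {task_id} succeeded; output is in dataset {new_dataset_id}"
    if task_id is not None:
        # The task exists but did not finish successfully.
        return f"ETL task {task_id} was created but did not succeed"
    return "ETL task could not be created"


# Hand-written sample values, not real responses.
ok_message = summarize_etl_result(
    {"is_succeeded": True, "etl_task_id": 7, "new_dataset_id": 42}
)
failed_message = summarize_etl_result({"is_succeeded": False, "etl_task_id": 7})
not_created_message = summarize_etl_result({"is_succeeded": False})
```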
- save(destination: Optional[DataSource] = None, data_file: Optional[str] = None, qianfan_dataset_id: Optional[int] = None, qianfan_dataset_create_args: Optional[Dict[str, Any]] = None, schema: Optional[Schema] = None, replace_source: bool = False, **kwargs: Any) bool[source]
Write data to the destination. If a schema has been passed, validate the data before exporting
- Args:
- destination (Optional[DataSource]):
data source the dataset is exported to, default to None, in which case a data source will be created inside the dataset using the parameters below
- data_file (Optional[str]):
dataset local file path, default to None
- qianfan_dataset_id (Optional[int]):
qianfan dataset ID, default to None
- qianfan_dataset_create_args (Optional[Dict[str, Any]]):
arguments for creating a bare dataset on qianfan, default to None
- schema (Optional[Schema]):
schema used to validate before exporting data, default to None
- replace_source (bool):
whether to replace the original source, default to False
**kwargs (Any): optional arguments
- Returns:
bool: whether saving succeeded
- start_online_data_process_task(operators: List[QianfanOperator]) int[source]
create an online ETL task on qianfan
- Args:
operators (List[QianfanOperator]): operators applied to ETL task
- Returns:
int: etl task id
- test_using_llm(model_id: Optional[int] = None, model_version_id: Optional[int] = None, service_model: Optional[str] = None, service_endpoint: Optional[str] = None, is_chat_service: bool = True, does_show_latency: bool = True, **kwargs: Any) Dataset[source]
Initialize an LLM instance from the arguments and get its output on the current dataset. Set only the model arguments or only the service arguments when instantiating
- Args:
- model_id (Optional[int]):
id of your own model, default to None
- model_version_id (Optional[int]):
version id of your own model, default to None
- service_model (Optional[str]):
name of model you want to use as service, default to None
- service_endpoint (Optional[str]):
endpoint of service, default to None
- is_chat_service (bool):
whether the service is a chat service, default to True. The service will be Completion if False
- does_show_latency (bool):
whether the result dataset contains latency info columns when a Service is the evaluated object. Depending on the request mode (stream or non-stream), it will contain request_complete_latency or the (first_token_latency, request_complete_latency) combo. Default to True
- **kwargs (Any):
optional argument dict
- Returns:
Dataset: a dataset containing inputs, reference outputs and LLM outputs
qianfan.dataset.dataset_utils module
Utilities needed by dataset
- qianfan.dataset.dataset_utils.log_latency_info(result: QfResponse, index: int, stream_index: int = 1) Tuple[source]
qianfan.dataset.process_interface module
interface file
- class qianfan.dataset.process_interface.Addable[source]
Bases:
ABC
Make an object ‘addable’
- class qianfan.dataset.process_interface.Processable[source]
Bases:
ABC
Make an object ‘processable’
- abstract delete(index: Union[int, str]) Self[source]
delete an element from Processable object
- Args:
index (Union[int, str]): element index to delete
- Returns:
Self: a new Processable object after delete
qianfan.dataset.schema module
Schemas for validation; currently Qianfan schemas only
- class qianfan.dataset.schema.QianfanGenericText[source]
Bases:
QianfanSchema
Validator for generic text datasets
- class qianfan.dataset.schema.QianfanNonSortedConversation[source]
Bases:
QianfanSchema
Validator for non-sorted conversational datasets
- class qianfan.dataset.schema.QianfanQuerySet[source]
Bases:
QianfanSchema
Validator for query set datasets
- class qianfan.dataset.schema.QianfanSortedConversation[source]
Bases:
QianfanSchema
Validator for sorted conversational datasets
- class qianfan.dataset.schema.QianfanText2Image[source]
Bases:
QianfanSchema
Validator for text-to-image datasets
qianfan.dataset.table module
wrapper for pyarrow.Table
- class qianfan.dataset.table.Table(inner_table: Table)[source]
Bases:
Addable, Listable, Processable
In-memory dataset representation based on pyarrow.Table, implementing the interfaces in process_interface.py
- append(elem: Any, add_new_group: bool = False, is_grouped: bool = True) Self[source]
append an element to pyarrow table
- Args:
elem (Union[List[Dict], Tuple[Dict], Dict]): Elements added to pyarrow table
- add_new_group (bool):
Whether elem has a new group id. Only used when table is grouped.
- is_grouped (bool):
Whether the elements in elem belong to the same group. Only used when table is grouped, elem is a Sequence and add_new_group is set to True. Default to True, in which case all elements will be in the same group. If it’s False, each element will have a sequentially incrementing group id, continuing from the last available group id.
- Returns:
Self: Table itself
- col_append(elem: Any) Self[source]
append a column to pyarrow table
- Args:
- elem (Dict[str, List]): dict containing the element added to pyarrow table, which
must have the column name under “name” and the column data list under “data”
- Returns:
Self: Table itself
- col_delete(index: Union[int, str]) Self[source]
delete a column from pyarrow table
- Args:
index (str): column name to delete
- Returns:
Self: Table itself
- col_filter(op: Callable[[Any], bool]) Self[source]
filter on pyarrow table’s column
- Args:
op (Callable[[Any], bool]): handler used to filter
- Returns:
Self: Table itself
- col_insert(elem: Any, index: Any) Self[source]
insert a column into pyarrow table
- Args:
- elem (Dict[str, List]): dict containing the element added to pyarrow table, which
must have the column name under “name” and the column data list under “data”
index (int): where to insert the new column
- Returns:
Self: Table itself
- col_list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) Any[source]
get column(s) from pyarrow table
- Args:
- by (Optional[Union[int, str, Sequence[int], Sequence[str]]]):
index or indices for columns, default to None, in which case return a python list of pyarrow table column
- Returns:
Any: pyarrow table column list
- col_map(op: Callable[[Any], Any]) Self[source]
map on pyarrow table’s column
- Args:
op (Callable[[Any], Any]): handler used to map
- Returns:
Self: Table itself
- col_renames(new_names: List[str]) Self[source]
rename all table columns
- Args:
new_names (List[str]): All new names for columns
- Returns:
Self: A brand-new Table with the new column names
- delete(index: Union[int, str]) Self[source]
delete an element from pyarrow table
- Args:
index (Union[int, str]): element index to delete
- Returns:
Self: Table itself
- filter(op: Callable[[Any], bool]) Self[source]
filter on pyarrow table’s row
- Args:
op (Callable[[Any], bool]): handler used to filter
- Returns:
Self: Table itself
- insert(elem: Any, index: Any, group_id: int = -1, add_new_group: bool = False, is_grouped: bool = True) Self[source]
insert an element to pyarrow table
- Args:
elem (Union[List[Dict], Tuple[Dict], Dict]): Elements added to pyarrow table
index (int): where to insert element(s)
- group_id (int):
which group id you want to apply to new element(s). Default to -1, which means the group id is inferred automatically from the table.
- add_new_group (bool):
Whether elem has a new group id. Only used when table is grouped and group_id is -1
- is_grouped (bool):
Whether the elements in elem belong to the same group. Only used when table is grouped, elem is a Sequence and add_new_group is set to True. Default to True, in which case all elements will be in the same group. If it’s False, each element will have a sequentially incrementing group id, continuing from the last available group id.
- Returns:
Self: Table itself
- list(by: Optional[Union[slice, int, str, Sequence[int], Sequence[str]]] = None) Any[source]
get element(s) from pyarrow table
- Args:
- by (Optional[Union[slice, int, Sequence[int]]]):
index or indices for elements, default to None, in which case return a python list of pyarrow table row
- Returns:
Any: pyarrow table row list
- map(op: Callable[[Any], Any]) Self[source]
map on pyarrow table’s row
- Args:
op (Callable[[Any], Any]): handler used to map
- Returns:
Self: Table itself
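The row-level `map` and `filter` methods above both take a handler `op`. The sketch below illustrates the general pattern on a plain list of dicts, under the assumption that `op` is called once per row and receives that row as a dict; the reference only types the handler as `Callable[[Any], Any]` or `Callable[[Any], bool]`, so the exact argument shape is not confirmed. `map_rows` and `filter_rows` are hypothetical stand-ins, not library functions.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def map_rows(rows: List[Row], op: Callable[[Row], Row]) -> List[Row]:
    """Apply op to every row and collect the results."""
    return [op(row) for row in rows]


def filter_rows(rows: List[Row], op: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows for which op returns True."""
    return [row for row in rows if op(row)]


rows = [
    {"prompt": " hello ", "response": "hi"},
    {"prompt": "", "response": "empty prompt"},
]

# Strip whitespace from prompts, then drop rows whose prompt is empty.
stripped = map_rows(rows, lambda row: {**row, "prompt": row["prompt"].strip()})
kept = filter_rows(stripped, lambda row: len(row["prompt"]) > 0)
```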
qianfan.dataset.table_utils module
Utilities needed by table