qianfan.evaluation package

A library that helps developers evaluate their models on Qianfan.

class qianfan.evaluation.EvaluationManager(*, local_evaluators: Optional[List[LocalEvaluator]] = None, qianfan_evaluators: Optional[List[QianfanEvaluator]] = None, task_id: Optional[str] = None)[source]

Bases: BaseModel

Central controller for the evaluation logic.

eval(llms: Sequence[Union[Model, Service]], dataset: Dataset, **kwargs: Any) Optional[EvaluationResult][source]

Evaluate the performance of models on the dataset.

Args:
llms (Sequence[Union[Model, Service]]):

Models or services to be evaluated.

dataset (Dataset):

The dataset on which models will be evaluated.

**kwargs (Any):

Other keyword arguments.

Returns:

Optional[EvaluationResult]: Evaluation result of models on the dataset.
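As a minimal sketch of how the constructor and eval signatures above fit together, the helper below builds a manager with one rule evaluator and runs it. The helper name run_rule_evaluation is hypothetical, and the sketch assumes the caller already has Model or Service objects and a Dataset; actually calling it needs the qianfan SDK installed and, presumably, valid platform credentials.

```python
from typing import Any, Optional, Sequence


def run_rule_evaluation(llms: Sequence[Any], dataset: Any) -> Optional[Any]:
    """Evaluate `llms` on `dataset` with a rule-based Qianfan evaluator.

    `llms` is expected to hold qianfan Model or Service objects and `dataset`
    a qianfan Dataset; creating them is outside the scope of this sketch.
    """
    # Imported lazily so that defining this helper does not require the SDK.
    from qianfan.evaluation import EvaluationManager
    from qianfan.evaluation.evaluator import QianfanRuleEvaluator

    # Keyword-only arguments, as in the documented constructor signatures.
    manager = EvaluationManager(
        qianfan_evaluators=[QianfanRuleEvaluator(using_accuracy=True)],
    )
    # eval is documented as returning Optional[EvaluationResult],
    # so callers should be prepared to handle None.
    return manager.eval(llms, dataset)
```

Because the documented return type is Optional, a caller would typically check the result for None before using it.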

eval_only(dataset: Dataset, **kwargs: Any) EvaluationResult[source]

Run evaluation only, on the given dataset.

Args:
dataset (Dataset):

A dataset produced by batch inference, or one with the same structure as batch-inference output.

**kwargs (Any):

Other keyword arguments.

Returns:

EvaluationResult: Evaluation result of models on the dataset.

local_evaluators: Optional[List[LocalEvaluator]]
qianfan_evaluators: Optional[List[QianfanEvaluator]]
task_id: Optional[str]

class qianfan.evaluation.EvaluationResult(result_dataset: Dataset, metrics: Optional[Dict[str, Dict[str, Any]]] = None)[source]

Bases: object

Evaluation Result

Submodules

qianfan.evaluation.consts module

Constants used by the evaluation package.

qianfan.evaluation.evaluation_manager module

Manager that controls the whole evaluation procedure.

class qianfan.evaluation.evaluation_manager.EvaluationManager(*, local_evaluators: Optional[List[LocalEvaluator]] = None, qianfan_evaluators: Optional[List[QianfanEvaluator]] = None, task_id: Optional[str] = None)[source]

Bases: BaseModel

Central controller for the evaluation logic.

eval(llms: Sequence[Union[Model, Service]], dataset: Dataset, **kwargs: Any) Optional[EvaluationResult][source]

Evaluate the performance of models on the dataset.

Args:
llms (Sequence[Union[Model, Service]]):

Models or services to be evaluated.

dataset (Dataset):

The dataset on which models will be evaluated.

**kwargs (Any):

Other keyword arguments.

Returns:

Optional[EvaluationResult]: Evaluation result of models on the dataset.

eval_only(dataset: Dataset, **kwargs: Any) EvaluationResult[source]

Run evaluation only, on the given dataset.

Args:
dataset (Dataset):

A dataset produced by batch inference, or one with the same structure as batch-inference output.

**kwargs (Any):

Other keyword arguments.

Returns:

EvaluationResult: Evaluation result of models on the dataset.

local_evaluators: Optional[List[LocalEvaluator]]
qianfan_evaluators: Optional[List[QianfanEvaluator]]
task_id: Optional[str]

qianfan.evaluation.evaluation_result module

The result of an evaluation.

class qianfan.evaluation.evaluation_result.EvaluationResult(result_dataset: Dataset, metrics: Optional[Dict[str, Dict[str, Any]]] = None)[source]

Bases: object

Evaluation Result

qianfan.evaluation.evaluator module

Collection of evaluators.

class qianfan.evaluation.evaluator.Evaluator[source]

Bases: BaseModel, ABC

A class for evaluating a single entry.

abstract evaluate(input: Union[str, List[Dict[str, Any]]], reference: str, output: str) Dict[str, Any][source]

Evaluate one entry.

class qianfan.evaluation.evaluator.LocalEvaluator[source]

Bases: Evaluator, ABC

Base class for evaluators that run locally.

Users who want to implement their own LocalEvaluator should override the evaluate method, where input is the input string or chat history, reference is the standard answer for that input, and output is the LLM's output string.

The return value should be a Dict that maps evaluation metric names to their values for a single LLM output.
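The contract described above can be sketched as follows. So that the example runs without the qianfan SDK, it subclasses LocalEvaluatorStandIn, a hypothetical stand-in that only mirrors the documented evaluate signature; in real code the subclass would derive from qianfan.evaluation.evaluator.LocalEvaluator instead. The ExactMatchEvaluator class and the exact_match metric name are also illustrative assumptions, not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Union


class LocalEvaluatorStandIn(ABC):
    """Hypothetical stand-in mirroring the documented evaluate signature."""

    @abstractmethod
    def evaluate(
        self,
        input: Union[str, List[Dict[str, Any]]],
        reference: str,
        output: str,
    ) -> Dict[str, Any]:
        ...


class ExactMatchEvaluator(LocalEvaluatorStandIn):
    """Scores 1.0 when the model output equals the reference answer."""

    def evaluate(
        self,
        input: Union[str, List[Dict[str, Any]]],
        reference: str,
        output: str,
    ) -> Dict[str, Any]:
        # Ignore leading and trailing whitespace when comparing.
        matched = output.strip() == reference.strip()
        # Return a mapping from metric name to metric value for this entry.
        return {"exact_match": 1.0 if matched else 0.0}


evaluator = ExactMatchEvaluator()
scores = evaluator.evaluate("What is 2 + 2?", "4", " 4\n")
print(scores)  # {'exact_match': 1.0}
```

An evaluator written this way would be passed to EvaluationManager through its local_evaluators argument.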

class qianfan.evaluation.evaluator.ManualEvaluatorDimension(*, dimension: str, description: Optional[str] = None)[source]

Bases: BaseModel

An evaluation dimension used in manual mode.

description: Optional[str]
dimension: str

class qianfan.evaluation.evaluator.QianfanEvaluator[source]

Bases: Evaluator

Base class for Qianfan evaluators, with an empty implementation of evaluate.

evaluate(input: Union[str, List[Dict[str, Any]]], reference: str, output: str) Dict[str, Any][source]

Evaluate one entry.

class qianfan.evaluation.evaluator.QianfanManualEvaluator(*, evaluation_dimensions: List[ManualEvaluatorDimension] = [ManualEvaluatorDimension(dimension='满意度', description=None)])[source]

Bases: QianfanEvaluator

Configuration class for the Qianfan manual evaluator.

classmethod dimension_validation(input_dict: Any) Any[source]

evaluation_dimensions: List[ManualEvaluatorDimension]

class qianfan.evaluation.evaluator.QianfanRefereeEvaluator(*, app_id: int, prompt_metrics: str = '综合得分', prompt_steps: str = '\n1.仔细阅读所提供的问题,确保你理解问题的要求和背景。\n2.仔细阅读所提供的标准答案,确保你理解问题的标准答案\n3.阅读答案,并检查是否用词不当\n4.检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n', prompt_max_score: int = 5)[source]

Bases: QianfanEvaluator

Configuration class for the Qianfan referee evaluator.

app_id: int
prompt_max_score: int
prompt_metrics: str
prompt_steps: str

class qianfan.evaluation.evaluator.QianfanRuleEvaluator(*, using_similarity: bool = False, using_accuracy: bool = False, stop_words: Optional[str] = None)[source]

Bases: QianfanEvaluator

Configuration class for the Qianfan rule evaluator.

stop_words: Optional[str]
using_accuracy: bool
using_similarity: bool

qianfan.evaluation.local_evaluator module

qianfan.evaluation.opencompass_evaluator module

OpenCompass evaluator.

class qianfan.evaluation.opencompass_evaluator.OpenCompassLocalEvaluator[source]

Bases: LocalEvaluator