CPSC2020

class torch_ecg.databases.CPSC2020(db_dir: str | bytes | PathLike | None = None, working_dir: str | bytes | PathLike | None = None, verbose: int = 1, **kwargs: Any)

Bases: CPSCDataBase

The 3rd China Physiological Signal Challenge 2020: Searching for Premature Ventricular Contraction (PVC) and Supraventricular Premature Beat (SPB) from Long-term ECGs

ABOUT

  1. training data consists of 10 single-lead ECG recordings collected from arrhythmia patients, each lasting about 24 hours

  2. data and annotations are stored in v5 .mat files

  3. A02, A03, A08 are patients with atrial fibrillation

  4. sampling frequency = 400 Hz

  5. Detailed information:

    ====  ====  =========  =========  =========  =========  =============
    rec   ?AF   Length(h)  # N beats  # V beats  # S beats  # Total beats
    ====  ====  =========  =========  =========  =========  =============
    A01   No    25.89      109,062    0          24         109,086
    A02   Yes   22.83      98,936     4,554      0          103,490
    A03   Yes   24.70      137,249    382        0          137,631
    A04   No    24.51      77,812     19,024     3,466      100,302
    A05   No    23.57      94,614     1          25         94,640
    A06   No    24.59      77,621     0          6          77,627
    A07   No    23.11      73,325     15,150     3,481      91,956
    A08   Yes   25.46      115,518    2,793      0          118,311
    A09   No    25.84      88,229     2          1,462      89,693
    A10   No    23.64      72,821     169        9,071      82,061
    ====  ====  =========  =========  =========  =========  =============

  6. challenging factors for accurate detection of SPB and PVC: amplitude variation; morphological variation; noise

  7. Challenge official website [1].
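The beat counts in item 5 imply a strong class imbalance. The following is a minimal sketch that recomputes the per-record fractions of V and S beats from those counts. The names `BEAT_COUNTS` and `premature_fraction` are illustrative only and are not part of torch_ecg.

```python
from typing import Tuple

# Beat counts (N, V, S) per record, copied from the table in item 5.
BEAT_COUNTS = {
    "A01": (109_062, 0, 24),
    "A02": (98_936, 4_554, 0),
    "A03": (137_249, 382, 0),
    "A04": (77_812, 19_024, 3_466),
    "A05": (94_614, 1, 25),
    "A06": (77_621, 0, 6),
    "A07": (73_325, 15_150, 3_481),
    "A08": (115_518, 2_793, 0),
    "A09": (88_229, 2, 1_462),
    "A10": (72_821, 169, 9_071),
}


def premature_fraction(rec: str) -> Tuple[float, float]:
    """Return the fractions of V and S beats among all beats of `rec`."""
    n, v, s = BEAT_COUNTS[rec]
    total = n + v + s
    return v / total, s / total


for rec in BEAT_COUNTS:
    v_frac, s_frac = premature_fraction(rec)
    print(f"{rec}: V = {v_frac:.2%}, S = {s_frac:.2%}")
```

Summed over all ten records, V and S beats make up roughly 4.2% and 1.7% of the 1,004,797 annotated beats.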

Note

  1. the records can roughly be classified into 4 groups:

    ==  ==================
    N   A01, A03, A05, A06
    V   A02, A08
    S   A09, A10
    VS  A04, A07
    ==  ==================

  2. premature beats and atrial fibrillation can co-exist, which makes the situation more complicated. This can be seen in the data from CINC2020 via the following code:

    >>> from utils.scoring_aux_data import dx_cooccurrence_all
    >>> dx_cooccurrence_all.loc["AF", ["PAC","PVC","SVPB","VPB"]]
    PAC     20
    PVC     19
    SVPB     4
    VPB     20
    Name: AF, dtype: int64
    

    This can also be seen in this dataset, for example via the following code:

    >>> from data_reader import CPSC2020Reader as CR
    >>> db_dir = "/media/cfs/wenhao71/data/CPSC2020/TrainingSet/"
    >>> dr = CR(db_dir)
    >>> rec = dr.all_records[1]
    >>> dr.plot(rec, sampfrom=0, sampto=4000, ticks_granularity=2)
    
  3. PVC and SPB can also co-exist, as illustrated via the following code (from CINC2020):

    >>> from utils.scoring_aux_data import dx_cooccurrence_all
    >>> dx_cooccurrence_all.loc[["PVC", "VPB"], ["PAC", "SVPB"]]
         PAC  SVPB
    PVC   14     1
    VPB   27     0

    and also from the following code:

    >>> from itertools import product
    >>> import numpy as np
    >>> for rec in dr.all_records:
    ...     ann = dr.load_ann(rec)
    ...     spb = ann["SPB_indices"]
    ...     pvc = ann["PVC_indices"]
    ...     # minimal distance (in samples) between consecutive beats of the same type
    ...     if len(np.diff(spb)) > 0:
    ...         print(f"{rec}: min dist among SPB = {np.min(np.diff(spb))}")
    ...     if len(np.diff(pvc)) > 0:
    ...         print(f"{rec}: min dist among PVC = {np.min(np.diff(pvc))}")
    ...     # minimal distance (in samples) between any SPB and any PVC
    ...     diff = [s - p for s, p in product(spb, pvc)]
    ...     if len(diff) > 0:
    ...         print(f"{rec}: min dist between SPB and PVC = {np.min(np.abs(diff))}")
    A01: min dist among SPB = 630
    A02: min dist among SPB = 696
    A02: min dist among PVC = 87
    A02: min dist between SPB and PVC = 562
    A03: min dist among SPB = 7044
    A03: min dist among PVC = 151
    A03: min dist between SPB and PVC = 3750
    A04: min dist among SPB = 175
    A04: min dist among PVC = 156
    A04: min dist between SPB and PVC = 178
    A05: min dist among SPB = 182
    A05: min dist between SPB and PVC = 22320
    A06: min dist among SPB = 455158
    A07: min dist among SPB = 603
    A07: min dist among PVC = 153
    A07: min dist between SPB and PVC = 257
    A08: min dist among SPB = 2903029
    A08: min dist among PVC = 106
    A08: min dist between SPB and PVC = 350
    A09: min dist among SPB = 180
    A09: min dist among PVC = 7719290
    A09: min dist between SPB and PVC = 1271
    A10: min dist among SPB = 148
    A10: min dist among PVC = 708
    A10: min dist between SPB and PVC = 177
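The loop above builds every SPB/PVC pair with `product`, so its cost grows with the product of the two beat counts. The sketch below computes the same kind of minimum distance with a binary search over sorted indices, and converts a distance in samples to milliseconds at the 400 Hz sampling frequency. The function names are illustrative, and the indices in the usage lines are made up for demonstration rather than taken from the dataset.

```python
from bisect import bisect_left
from typing import Optional, Sequence

FS = 400  # sampling frequency of CPSC2020, in Hz


def min_cross_distance(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Smallest |x - y| over x in `a`, y in `b`; None if either is empty."""
    if len(a) == 0 or len(b) == 0:
        return None
    b_sorted = sorted(b)
    best = None
    for x in a:
        pos = bisect_left(b_sorted, x)
        # Only the neighbours around the insertion point can be closest to x.
        for y in b_sorted[max(pos - 1, 0): pos + 1]:
            dist = abs(x - y)
            if best is None or dist < best:
                best = dist
    return best


def samples_to_ms(n_samples: int, fs: int = FS) -> float:
    """Convert a distance in samples to milliseconds."""
    return 1000 * n_samples / fs


spb = [1000, 5000, 9000]  # made-up SPB indices
pvc = [4800, 12000]       # made-up PVC indices
print(min_cross_distance(spb, pvc))  # 200
print(samples_to_ms(87))             # 217.5
```

For instance, the minimal PVC-to-PVC distance of 87 samples reported for A02 corresponds to 217.5 ms.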
    

Usage

  1. ECG arrhythmia (PVC, SPB) detection

Issues

  1. currently, when xqrs is used as the QRS detector, far more R peaks than annotated beats (over 1000 more) are detected for A02, A07 and A08, which might be caused by motion artefacts (or AF?), and far fewer (over 1000 fewer) are detected for A04. Numeric details are as follows:

    ====  ====  ===============  =============
    rec   ?AF   # beats by xqrs  # Total beats
    ====  ====  ===============  =============
    A01   No    109,502          109,086
    A02   Yes   119,562          103,490
    A03   Yes   135,912          137,631
    A04   No    92,746           100,302
    A05   No    94,674           94,640
    A06   No    77,955           77,627
    A07   No    98,390           91,956
    A08   Yes   126,908          118,311
    A09   No    89,972           89,693
    A10   No    83,509           82,061
    ====  ====  ===============  =============

  2. (fixed by an official update) A04 had duplicate entries in “PVC_indices” (13534856, 27147621 and 35141190 each appeared twice). Before load_ann was corrected:

    >>> from collections import Counter
    >>> from data_reader import CPSC2020Reader
    >>> db_dir = "/mnt/wenhao71/data/CPSC2020/TrainingSet/"
    >>> data_gen = CPSC2020Reader(db_dir=db_dir, working_dir=db_dir)
    >>> rec = 4
    >>> ann = data_gen.load_ann(rec)
    >>> Counter(ann["PVC_indices"]).most_common()[:4]
    [(13534856, 2), (27147621, 2), (35141190, 2), (848, 1)]
    
  3. when extracting morphological features using augmented rpeaks for A04, the warning

    RuntimeWarning: invalid value encountered in double_scalars

    is raised by

    \[R\_value = (R\_value - y\_min) / (y\_max - y\_min)\]

    and by

    \[y\_values[n] = (y\_values[n] - y\_min) / (y\_max - y\_min).\]

    This is caused by the 13882273-th sample, which is contained in “PVC_indices”. Whether it is a PVC beat or just a motion artefact is in doubt.
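NumPy emits a warning of this kind when a floating-point operation yields NaN, for example 0/0, which would happen in the two formulas above if y_max equals y_min in the window around that sample. That this is the actual cause for A04 is an assumption and is not confirmed by the text above. A minimal sketch of a guarded min-max normalization, with a hypothetical helper named `safe_minmax`:

```python
from typing import List, Sequence


def safe_minmax(values: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Min-max normalize `values` to [0, 1]; return zeros for a (near-)flat input."""
    if len(values) == 0:
        return []
    y_min, y_max = min(values), max(values)
    span = y_max - y_min
    if span < eps:
        # Flat segment: (y - y_min) / (y_max - y_min) would be 0 / 0.
        return [0.0 for _ in values]
    return [(y - y_min) / span for y in values]


print(safe_minmax([1.0, 2.0, 3.0]))  # [0.0, 0.5, 1.0]
print(safe_minmax([0.3, 0.3, 0.3]))  # [0.0, 0.0, 0.0]
```

Returning zeros for a flat window is only one possible convention; skipping such beats altogether may be more appropriate if they are suspected artefacts.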

References

Citation

10.1166/jmihi.2020.3289

Parameters:
  • db_dir (path-like, optional) – Storage path of the database. If not specified, data will be fetched from Physionet.

  • working_dir (path-like, optional) – Working directory, to store intermediate files and log files.

  • verbose (int, default 1) – Level of logging verbosity.

  • kwargs (dict, optional) – Auxiliary keyword arguments.

property database_info: DataBaseInfo

The DataBaseInfo object of the database.

get_absolute_path(rec: str | int, extension: str | None = None, ann: bool = False) -> Path

Get the absolute path of the record rec.

Parameters:
  • rec (str or int) – Record name or index of the record in all_records.

  • extension (str, optional) – Extension of the file.

  • ann (bool, default False) – Whether to get the annotation file path or not.

Returns:

abs_path – Absolute path of the file.

Return type:

pathlib.Path

get_subject_id(rec: int | str) -> int

Attach a unique subject ID to the record.

Parameters:

rec (str or int) – Record name or index of the record in all_records.

Returns:

pid – Subject ID corresponding to rec.

Return type:

int

load_ann(rec: int | str, sampfrom: int | None = None, sampto: int | None = None) -> Dict[str, ndarray]

Load the annotations of the record rec.

Parameters:
  • rec (str or int) – Record name or index of the record in all_records.

  • sampfrom (int, optional) – Start index of the data to be loaded.

  • sampto (int, optional) – End index of the data to be loaded.

Returns:

ann – Annotation dictionary with items (ndarray) “SPB_indices” and “PVC_indices”, which record the indices of SPBs and PVCs.

Return type:

dict
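The effect of sampfrom and sampto on the returned indices can be pictured with the sketch below. It keeps indices in the half-open range [sampfrom, sampto) and, by default, leaves them as absolute sample positions. Whether load_ann follows exactly this convention (half-open range, no shift) is an assumption to check against the source. The helper `crop_ann` and the indices are hypothetical.

```python
from typing import Dict, List, Optional


def crop_ann(
    ann: Dict[str, List[int]],
    sampfrom: Optional[int] = None,
    sampto: Optional[int] = None,
    shift: bool = False,
) -> Dict[str, List[int]]:
    """Keep indices in [sampfrom, sampto); optionally make them relative to sampfrom."""
    start = 0 if sampfrom is None else sampfrom
    end = float("inf") if sampto is None else sampto
    offset = start if shift else 0
    return {
        key: [idx - offset for idx in indices if start <= idx < end]
        for key, indices in ann.items()
    }


# Made-up annotation dictionary with the two documented keys.
ann = {"SPB_indices": [120, 4100, 9000], "PVC_indices": [848, 3990, 4000]}
print(crop_ann(ann, sampfrom=0, sampto=4000))
# {'SPB_indices': [120], 'PVC_indices': [848, 3990]}
print(crop_ann(ann, sampfrom=4000, shift=True))
# {'SPB_indices': [100, 5000], 'PVC_indices': [0]}
```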

load_data(rec: int | str, sampfrom: int | None = None, sampto: int | None = None, data_format: str = 'channel_first', units: str = 'mV', fs: Real | None = None, return_fs: bool = False) -> ndarray | Tuple[ndarray, Real]

Load the ECG data of the record rec.

Parameters:
  • rec (str or int) – Record name or index of the record in all_records.

  • sampfrom (int, optional) – Start index of the data to be loaded.

  • sampto (int, optional) – End index of the data to be loaded.

  • data_format (str, default "channel_first") – Format of the ECG data, “channel_last” (alias “lead_last”), or “channel_first” (alias “lead_first”), or “flat” (alias “plain”).

  • units (str or None, default "mV") – Units of the output signal, can also be “μV” (with aliases “uV”, “muV”).

  • fs (numbers.Real, optional) – Sampling frequency of the output signal. If not None, the loaded data will be resampled to this frequency; if None, the loaded data will be returned as is.

  • return_fs (bool, default False) – Whether to return the sampling frequency of the output signal.

Returns:

  • data (numpy.ndarray) – The loaded ECG data.

  • data_fs (numbers.Real, optional) – Sampling frequency of the output signal. Returned if return_fs is True.
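When fs differs from the native 400 Hz, the returned signal has a different number of samples, so indices expressed at 400 Hz no longer line up with it. The sketch below shows the bookkeeping, assuming that annotation indices (for example from load_ann, which takes no fs argument) refer to the native sampling frequency. The helper names are illustrative and not part of torch_ecg, and the exact output length of a real resampler may differ by a sample.

```python
from typing import List, Sequence

NATIVE_FS = 400  # Hz, sampling frequency of CPSC2020


def resampled_length(n_samples: int, fs: float, native_fs: float = NATIVE_FS) -> int:
    """Approximate number of samples after resampling from native_fs to fs."""
    return int(round(n_samples * fs / native_fs))


def rescale_indices(
    indices: Sequence[int], fs: float, native_fs: float = NATIVE_FS
) -> List[int]:
    """Map sample indices at native_fs to the nearest indices at fs."""
    return [int(round(idx * fs / native_fs)) for idx in indices]


print(resampled_length(4000, fs=250))        # 2500, i.e. 10 s at 250 Hz
print(rescale_indices([848, 3990], fs=250))  # [530, 2494]
```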

locate_premature_beats(rec: int | str, premature_type: str | None = None, window: Real = 10, sampfrom: int | None = None, sampto: int | None = None) -> List[List[int]]

Locate the sample indices of premature beats in a record.

The locations are in the form of a list of lists, and each list contains the interval of sample indices of premature beats.

Parameters:
  • rec (str or int) – Record name or index of the record in all_records.

  • premature_type (str, optional) – Premature beat type, can be one of “SPB”, “PVC”. If not specified, both SPBs and PVCs will be located.

  • window (numbers.Real, default 10) – Window length of each premature beat, with units in seconds.

  • sampfrom (int, optional) – Start index of the premature beats to locate.

  • sampto (int, optional) – End index of the premature beats to locate.

Returns:

premature_intervals – List of intervals of premature beats.

Return type:

list
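One way to picture the returned structure: place a window of `window` seconds around each premature beat, clip it to the requested range, and merge windows that overlap. The sketch below does this under the assumptions that the window is centered on the beat and that overlapping windows are merged. Neither detail is confirmed by the description above, and `beats_to_intervals` is a hypothetical helper.

```python
from typing import List, Optional, Sequence

FS = 400  # Hz, sampling frequency of CPSC2020


def beats_to_intervals(
    beat_indices: Sequence[int],
    window: float = 10,
    fs: int = FS,
    sampfrom: int = 0,
    sampto: Optional[int] = None,
) -> List[List[int]]:
    """Return merged [start, end] sample intervals of `window` seconds around beats."""
    half = int(round(window * fs / 2))
    intervals: List[List[int]] = []
    for idx in sorted(beat_indices):
        if idx < sampfrom or (sampto is not None and idx >= sampto):
            continue  # beat lies outside the requested range
        start = max(sampfrom, idx - half)
        end = idx + half if sampto is None else min(sampto, idx + half)
        if intervals and start <= intervals[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            intervals[-1][1] = max(intervals[-1][1], end)
        else:
            intervals.append([start, end])
    return intervals


# Made-up beat indices; the first two windows overlap and are merged.
print(beats_to_intervals([848, 3990, 20000]))  # [[0, 5990], [18000, 22000]]
```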

plot(rec: int | str, data: ndarray | None = None, ann: Dict[str, ndarray] | None = None, ticks_granularity: int = 0, sampfrom: int | None = None, sampto: int | None = None, rpeak_inds: Sequence[int] | ndarray | None = None) -> None

Plot the ECG signal of a record.

Parameters:
  • rec (str or int) – Record name or index of the record in all_records.

  • data (numpy.ndarray, optional) – ECG signal to plot. If given, data of rec will not be used. This is useful when plotting filtered data.

  • ann (dict, optional) – Annotations for data, covering those from annotation files, with items “SPB_indices”, “PVC_indices”, each of which is a ndarray. Ignored if data is None.

  • ticks_granularity (int, default 0) – Granularity of the axis ticks; the higher the value, the more ticks are drawn: 0 (no ticks), 1 (major ticks), 2 (major + minor ticks).

  • sampfrom (int, optional) – Start index of the data to plot.

  • sampto (int, optional) – End index of the data to plot.

  • rpeak_inds (array_like, optional) – Indices of R peaks. If data is None, then indices should be the absolute indices in the record.

train_test_split_rec(test_rec_num: int = 2) -> Dict[str, List[str]]

Split the records into train set and test (val) set.

Parameters:

test_rec_num (int, default 2) – Number of records for the test (val) set.

Returns:

split_res – Split result dictionary, with items “train”, “test”, both of which are lists of record names.

Return type:

dict
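How records are chosen for the test set is not described here. As a rough sketch, a seeded random split over the ten record names could look as follows. The helper `split_records`, its seed argument, and the purely random selection are assumptions and do not describe the library's actual behavior.

```python
import random
from typing import Dict, List

ALL_RECORDS = [f"A{i:02d}" for i in range(1, 11)]  # A01 ... A10


def split_records(test_rec_num: int = 2, seed: int = 42) -> Dict[str, List[str]]:
    """Randomly hold out `test_rec_num` records; return {"train": [...], "test": [...]}."""
    if not 0 < test_rec_num < len(ALL_RECORDS):
        raise ValueError("test_rec_num must be between 1 and 9")
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    test = sorted(rng.sample(ALL_RECORDS, test_rec_num))
    train = [rec for rec in ALL_RECORDS if rec not in test]
    return {"train": train, "test": test}


split = split_records(test_rec_num=2)
print(split["train"], split["test"])
```

Because the records fall into rather different groups (see Note), a split that draws test records from specific groups may be preferable to a purely random one.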

property url: str

URL(s) for downloading the database.