GoldMine User Guide
A Component of the GOLD Suite

5.4 Release

Copyright © 2015 Cambridge Crystallographic Data Centre

Registered Charity No 800579

Conditions of Use

The GOLD suite of programs (the "Program") comprising all or some of the following:
Hermes (including as Relibase+ client and as SuperStar interface), GOLD, GoldMine,
associated documentation and software, are copyright works of CCDC Software Limited and
its licensors and all rights are protected. Use of the Program is permitted solely in
accordance with a valid Software Licence Agreement or a valid Licence and Support
Agreement with CCDC Software Limited or a valid Licence of Access to the CSD System with
CCDC and the Program is proprietary. All persons accessing the Program should make
themselves aware of the conditions contained in the Software Licence Agreement or Licence
and Support Agreement or Licence of Access Agreement.

In particular:

The Program is to be treated as confidential and may NOT be disclosed or
redistributed in any form, in whole or in part, to any third party.

No representations, warranties, or liabilities are expressed or implied in the supply
of the Program by CCDC Software Ltd., its servants or agents, except where such
exclusion or limitation is prohibited, void or unenforceable under governing law.

- GOLD © 2015 CCDC Software Ltd.
- Hermes © 2015 CCDC Software Ltd.
- GoldMine © 2015 CCDC Software Ltd.

Implementation of ChemScore, Heme, Kinase and Astex Statistical Potential scoring
functions and the Diverse Solutions code within GOLD © 2001-2015 Astex Therapeutics Ltd.

All rights reserved

Licences may be obtained from:

CCDC Software Ltd.
12 Union Road
Cambridge CB2 1EZ
United Kingdom

Web: www.ccdc.cam.ac.uk
Telephone: +44-1223-336408
Email: [email protected]

3.13.2 Calculating Correlation Matrices
• If accessing via the GoldMine Controller, select all the descriptors for which
correlations are sought, and then hit Statistics in the data-view menu,
followed by Correlations...

• If a spreadsheet is on display:
- To calculate a correlation matrix for two or more descriptors, choose the
required columns by clicking on the descriptor names at the column heads
(the column backgrounds will turn black) and then hit Statistics in the
spreadsheet data-view menu, followed by Correlations...

- If no columns are chosen, they will all be included.
- The correlation matrix will be based only on the currently visible rows.

• If any other data view is on display (e.g. a heat plot):
- Click on Statistics in the data-view menu followed by Correlations... to
calculate the correlation matrix for those descriptors involved in the data
view you are displaying (e.g. the two descriptors plotted on the heat plot).

• By default, the matrix will show Pearson correlation coefficients in the lower
triangle, and the significance levels of the coefficients in the upper triangle.
Correlations that are statistically significant (significance level < 0.05) will be shown
on a coloured background.

• The default contents of the matrix may be altered by clicking on the buttons at the
bottom-left and top-right of the display area. By default, these will say Correlation
and Significance, respectively. By clicking on them, you can alter the lower or upper
triangle, respectively, to display Pearson correlation coefficient, covariance,
significance level, or Spearman correlation coefficient.

• If either Pearson or Spearman correlation coefficients are shown in one of the
triangles and significance levels in the other, the significance level will be that of
whichever type of correlation coefficient is on display.

• The default presentation may also be modified by right-clicking anywhere in the
correlation-matrix display area (or selecting Display from the data-view menu) and
selecting Configure... from the resulting menu. The resulting dialogue box will allow
you to:

- Specify a title.
- Set the upper or lower triangle to display any of: Pearson correlation
coefficient, covariance, significance level, or Spearman correlation coefficient.
- Turn the highlighting of statistically significant values on or off and alter the
colour used for highlighting (click on the coloured panel and select from the
resulting colour palette).

- Alter the level at which a correlation is deemed significant.
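
For readers who want to reproduce a matrix of this kind outside GoldMine, the following is a minimal Python sketch of the layout described above: a correlation coefficient in the lower triangle and a significance level in the upper triangle, with values below 0.05 flagged. It is not GoldMine's implementation. The descriptor names and values are invented for illustration, and because this guide does not state how GoldMine computes significance levels, a seeded permutation test is used here as a stand-in.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # raises ZeroDivisionError for a constant column

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_p_value(x, y, statistic=pearson, trials=2000, seed=0):
    """Two-sided permutation p-value (an assumption, not GoldMine's stated method)."""
    rng = random.Random(seed)
    observed = abs(statistic(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(statistic(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (trials + 1)

def correlation_matrix(columns, lower=pearson, alpha=0.05):
    """Lower triangle: coefficient. Upper triangle: (p-value, significant?)."""
    names = list(columns)
    matrix = {}
    for i, row in enumerate(names):
        for j, col in enumerate(names):
            if i > j:
                matrix[row, col] = lower(columns[row], columns[col])
            elif i < j:
                p = permutation_p_value(columns[row], columns[col], statistic=lower)
                matrix[row, col] = (p, p < alpha)
    return matrix

# Hypothetical descriptor columns, one value per visible spreadsheet row.
columns = {
    "fitness": [52.1, 48.3, 61.0, 39.7, 57.4, 44.2, 65.8, 50.5],
    "hbond_energy": [5.0, 4.4, 6.3, 3.1, 5.9, 3.8, 7.0, 4.9],
    "buried_fraction": [0.61, 0.75, 0.58, 0.70, 0.66, 0.52, 0.73, 0.64],
}
matrix = correlation_matrix(columns)
for key, value in matrix.items():
    print(key, value)
```

Passing `lower=spearman` switches both triangles together, which mirrors the rule above that the significance level shown is that of whichever coefficient is on display.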

3.13.3 Principal Component Analysis
• The method of Principal Components Analysis (PCA) is based on transforming a set
of potentially correlated variables into a new, and smaller, set of uncorrelated and
mutually orthogonal variables called principal components. This process can make it
easier to understand multivariate data and can significantly aid the location and
identification of clusters of observations having similar geometry.

• The technique is often used when trying to analyse the variation in a number of
correlated molecular or intermolecular geometric parameters within a dataset of
related crystal structures.

• To perform a principal component analysis, choose the required descriptors by
clicking on the descriptor names at the column heads (the column backgrounds will
turn black) and then hit Statistics in the spreadsheet data-view menu, followed by
Principal components. If no columns are chosen, then a window will appear in
which you can choose which descriptors you wish to include. This will open a
data-view window.

- PCA can also be carried out from descriptors selected in the GoldMine
Controller.

- The descriptors that will be included in the calculation can be changed using
the tick-boxes next to the descriptor names. Alternatively, you can use the
All or None buttons to choose descriptors. Right-clicking anywhere in the
data-view and selecting Configure... from the resulting menu will open a
window in which you can customise the output from the calculation and its
appearance.

- Hit the Calculate button to run the PCA on the selected set of descriptors.
The resulting PC scores are displayed on the right-hand side of the data-view.
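
The transformation described in this section can be illustrated with a short Python sketch. It is an assumption-laden outline rather than GoldMine's own algorithm: it works on the covariance matrix of mean-centred descriptors without any scaling, finds each component by power iteration with deflation, and uses invented two-descriptor data. The function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Return (mean-centred rows, sample covariance matrix)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    return centred, cov

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(matrix)
    start = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start vector
    start_norm = math.sqrt(sum(v * v for v in start))
    vec = [v / start_norm for v in start]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance left to explain
            break
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(p) for j in range(p))
    return value, vec

def pca(rows, n_components):
    """Return [(explained variance fraction, loading vector)] and per-row PC scores."""
    centred, cov = covariance_matrix(rows)
    p = len(cov)
    total_variance = sum(cov[j][j] for j in range(p))
    components = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append((value / total_variance, vec))
        # Deflation: subtract the variance already captured by this component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(c * v for c, v in zip(row, vec)) for _, vec in components]
              for row in centred]
    return components, scores

# Hypothetical data: two strongly correlated descriptors, one row per structure.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
components, scores = pca(rows, n_components=2)
for fraction, loadings in components:
    print(round(fraction, 4), [round(v, 4) for v in loadings])
```

Because the two descriptors are almost collinear, nearly all of the variance falls on the first component, which is the situation in which PCA reduces a set of correlated variables to a smaller set of uncorrelated ones.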

• Bring up the Define Training and Test Sets window via the Tools command on the
Data Analysis menu bar. Drag and drop the descriptor Consensus_3term from the
Descriptors pane into the top area of this window and highlight it. Type in an
appropriate Best set name (e.g. Best_Consensus_3term) and click on Create. This
will create a selection representing the best poses per structure, according to the
Discrimination model we created earlier.

• We will now examine the performance of the model in discriminating actives from
inactives over only the best poses for each ligand according to the model.

• Create a selection representing the active poses within the
Best_Consensus_3term set by combining the Best_Consensus_3term and
Actives selections in the Selection manager AND box. Save this selection as
Best_consensus_3_term_actives.

• In the Descriptors pane highlight Consensus_3term, and select ROC Plot from the
Plots pull-down menu.

• In the resulting dialogue box choose Best_consensus_3_term_actives as the
selection to pick out.

• The ROC curve and associated enrichment metrics should now be displayed.

• Repeat this experiment, but use the GS_Best test set as the set to work with.
• Compare the enrichment metrics. Which test set is the model more successful on?
• You can finish the tutorial here. Alternatively, there are one or two things you might
like to experiment with.
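
The logic behind the steps above (keep the best pose per ligand, intersect that set with the actives, then measure how well the model score separates actives from inactives) can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of GoldMine internals: the `Pose` record, the pose data, and the convention that a higher Consensus_3term value is better are all hypothetical, and the area under the ROC curve is used as one common summary because this section does not specify which enrichment metrics GoldMine reports.

```python
from collections import namedtuple

# Hypothetical record: one docked pose with its model score and activity label.
Pose = namedtuple("Pose", "ligand pose_id score active")

def best_pose_per_ligand(poses):
    """Keep the highest-scoring pose of each ligand (assumes higher = better)."""
    best = {}
    for pose in poses:
        current = best.get(pose.ligand)
        if current is None or pose.score > current.score:
            best[pose.ligand] = pose
    return set(best.values())

def roc_auc(poses):
    """Area under the ROC curve: the chance an active outscores an inactive (ties count 0.5)."""
    actives = [p.score for p in poses if p.active]
    inactives = [p.score for p in poses if not p.active]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive pose")
    wins = sum(1.0 if a > i else 0.5 if a == i else 0.0
               for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))

# Invented poses for six ligands.
poses = [
    Pose("lig1", 1, 0.91, True), Pose("lig1", 2, 0.40, True),
    Pose("lig2", 1, 0.35, True), Pose("lig2", 2, 0.78, True),
    Pose("lig3", 1, 0.55, False), Pose("lig3", 2, 0.20, False),
    Pose("lig4", 1, 0.82, False), Pose("lig4", 2, 0.10, False),
    Pose("lig5", 1, 0.60, True), Pose("lig6", 1, 0.30, False),
]

best_set = best_pose_per_ligand(poses)         # analogue of the "Best" selection
actives = {p for p in poses if p.active}       # analogue of the Actives selection
best_actives = best_set & actives              # analogue of the AND box
auc_best = roc_auc(best_set)
print(len(best_set), len(best_actives), auc_best)
```

Running `roc_auc` on a second set of best poses gives a figure that can be compared directly with `auc_best`, in the same way the tutorial compares the two test sets.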

Other Things to Try
• You can try adding a fourth variable to the model using Auto add. Use the same
training and test sets as you did originally. Reject any variable that comes up that is a
component of a scoring function, to try to find something more interesting. Accept
the first variable that comes up that describes occluded functionality on the ligand.

Compare the enrichment characteristics with the three-component model. There
should be a slight improvement.

Check the coefficients of the model. Do they make sense? In other words are the
signs of the coefficients consistent with what you’d expect for the corresponding
descriptors?

Save the model. Create a selection for the best poses per ligand according to that
model and examine its discriminatory performance over those poses.

Experiment with adding further descriptors to the model and see if you can detect
when over-fitting is occurring.
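
One simple way to look for over-fitting is to record an enrichment metric on both the training and test sets each time a descriptor is added, and to note the first model for which the training value rises while the test value falls. The Python sketch below shows that bookkeeping only. The heuristic, the function name, and the AUC values are all invented for illustration and are not taken from GoldMine.

```python
def first_overfit_step(train_metric, test_metric, tolerance=0.0):
    """Index of the first model whose training metric improves while its
    test metric drops by more than `tolerance`; None if that never happens."""
    for step in range(1, len(train_metric)):
        train_gain = train_metric[step] - train_metric[step - 1]
        test_gain = test_metric[step] - test_metric[step - 1]
        if train_gain > 0 and test_gain < -tolerance:
            return step
    return None

# Made-up ROC AUC values for models with 3, 4, 5 and 6 descriptors.
n_descriptors = [3, 4, 5, 6]
train_auc = [0.80, 0.83, 0.86, 0.90]
test_auc = [0.78, 0.80, 0.79, 0.74]

step = first_overfit_step(train_auc, test_auc)
if step is not None:
    print("Test performance first falls at", n_descriptors[step], "descriptors")
```

A non-zero `tolerance` avoids flagging very small drops in the test metric that may simply be noise.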

This ends the tutorial.
