Help

Input types for enzymes/proteins and metabolites

    Enzymes

    The input for an enzyme must be a string containing the enzyme's amino acid sequence.

    Metabolites

    There are three valid input types for the metabolite: SMILES string, KEGG Compound ID, and InChI string.

      InChI string

      InChI strings are textual representations of chemical structures. Every InChI string is a unique identifier and contains detailed information about the structure of a small molecule. For more details on InChI, see this page from IUPAC.

      KEGG Compound ID

      The KEGG Compound database contains identifiers for many small molecules and drugs. A KEGG Compound ID starts with a "C" or "D" followed by a five-digit number. For more information see the KEGG homepage.

      SMILES

      The Simplified Molecular Input Line Entry Specification (SMILES) represents the structure of a molecule as an ASCII string. You can get the SMILES string for a molecule by, for example, searching for the molecule's name in PubChem. Since SMILES representations are not unique for all molecules, we recommend using InChI strings or KEGG Compound IDs instead, if possible.
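
      As a rough illustration of how the three input types differ, the sketch below guesses the type of a metabolite string. It relies on the fact that InChI strings begin with "InChI=" and on the KEGG pattern described above ("C" or "D" followed by five digits). The function name guess_metabolite_format is hypothetical and is not part of this web server. The sketch does not check whether a string is chemically valid.

```python
import re

# Pattern taken from the description above: "C" or "D" plus five digits.
KEGG_ID = re.compile(r"[CD]\d{5}")


def guess_metabolite_format(value: str) -> str:
    """Return 'InChI', 'KEGG' or 'SMILES' for a metabolite string."""
    text = value.strip()
    if text.startswith("InChI="):
        return "InChI"
    if KEGG_ID.fullmatch(text):
        return "KEGG"
    # Everything else is treated as SMILES; chemical validity is not checked.
    return "SMILES"


print(guess_metabolite_format("InChI=1S/H2O/h1H2"))  # water as InChI
print(guess_metabolite_format("C00001"))             # water as KEGG Compound ID
print(guess_metabolite_format("O"))                  # water as SMILES
```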

CSV file as input

    What is a CSV file?

    If you want to make multiple predictions at once, you need to upload a file in CSV format containing pairs such as enzyme-metabolite, enzyme-reaction, or transporter-molecule, depending on the model you are using. CSV files can be created with spreadsheet programs like Excel or OpenOffice Calc, or with any text editor. For more details on how to create a file in CSV format, see here.

    What should your CSV file look like?

    The required layout of your CSV file depends on the model you are using. Attention: InChI strings can contain commas (","). Therefore, the CSV file should use tabs or semicolons (";") as separators.
    You can download a sample CSV file.
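
    If you prefer to generate the file with a script, the following minimal sketch writes semicolon-separated text with Python's standard csv module. The helper name to_semicolon_csv is hypothetical and the amino acid sequence is a placeholder, not a real enzyme. The header shown is the two-column "Enzymes"/"Metabolites" layout. Adjust it to the model you are using.

```python
import csv
import io

# Placeholder rows for illustration only; the sequence is not a real enzyme.
rows = [
    ("MKTAYIAKQR", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),  # ethanol as InChI
    ("MKTAYIAKQR", "C00001"),                             # KEGG Compound ID
]


def to_semicolon_csv(header, rows):
    """Serialise a header and rows as semicolon-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


text = to_semicolon_csv(("Enzymes", "Metabolites"), rows)
print(text)
```

    Because the separator is a semicolon, the commas inside the InChI string do not split the field. To save the result, write text to a file opened with open("input.csv", "w", newline="").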

    CSV File Sample

    Example of multiple inputs in a CSV file. The amino acid sequences and metabolites displayed here are not real.

      Enzyme-Substrate Pair Prediction:

      Your CSV file should contain exactly two columns, one called "Enzymes" and one called "Metabolites". Every row of your file should contain one enzyme and one metabolite in one of the formats described above. At most 500 enzyme-metabolite pairs are accepted per file. You can download a sample CSV file here.

      kcat prediction:

      Your CSV file should contain exactly three columns, called "Enzymes", "Substrates", and "Products". Every row of your file should contain one enzyme-reaction pair in one of the formats described above. At most 500 enzyme-reaction pairs are accepted per file. You can download a sample CSV file here.

      KM prediction:

      Your CSV file should contain exactly two columns, one called "Enzymes" and one called "Metabolites". Every row of your file should contain one enzyme and one metabolite in one of the formats described above. At most 500 enzyme-metabolite pairs are accepted per file. You can download a sample CSV file here.

      SPOT:

      Your CSV file should contain exactly two columns, one called "Proteins" and one called "Molecules". Every row of your file should contain one transporter protein and one molecule in one of the formats described above. At most 1000 transporter-molecule pairs are accepted per file. You can download a sample CSV file here.
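
      The column names and row limits listed for the four models can be checked before uploading. The sketch below is one possible pre-upload check. The function check_csv and the dictionary keys ("ESP", "kcat", "KM", "SPOT") are hypothetical labels chosen for this example and are not part of the web server. It assumes a semicolon separator and does not check column order, since this help page does not say whether order matters.

```python
import csv
import io

# Column names and row limits as listed in this help page; the keys are
# hypothetical labels chosen for this sketch.
REQUIREMENTS = {
    "ESP": (["Enzymes", "Metabolites"], 500),
    "kcat": (["Enzymes", "Substrates", "Products"], 500),
    "KM": (["Enzymes", "Metabolites"], 500),
    "SPOT": (["Proteins", "Molecules"], 1000),
}


def check_csv(text, model, delimiter=";"):
    """Return a list of problems found in the CSV text (empty if none)."""
    columns, max_rows = REQUIREMENTS[model]
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return ["file is empty"]
    problems = []
    header, data = rows[0], rows[1:]
    # Compare column names without assuming a particular order.
    if sorted(header) != sorted(columns):
        problems.append(f"expected columns {columns}, found {header}")
    if len(data) > max_rows:
        problems.append(f"{len(data)} rows exceed the limit of {max_rows}")
    for number, row in enumerate(data, start=1):
        if len(row) != len(columns) or any(not cell.strip() for cell in row):
            problems.append(
                f"data row {number}: expected {len(columns)} non-empty values"
            )
    return problems


example = "Enzymes;Metabolites\nMKTAYIAKQR;C00001\n"
print(check_csv(example, "ESP"))
```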

Interpretation of the results

ESP (Enzyme-Substrate Pair Prediction)

    Prediction score

    The prediction score is a value between 0 and 1. Scores close to 1 mean that the model predicts that the metabolite is likely a substrate for the given enzyme, whereas scores close to 0 mean that the model predicts that the metabolite is likely not a substrate for the enzyme.
    Prediction scores close to 0.5 (i.e. scores in the range of 0.3 to 0.7) should be considered with caution. The prediction model is unsure which class it should predict in these cases.
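
    When processing many scores, the cautious reading described above can be applied with a small helper. This is only a sketch: the function name label_esp_score and the three labels are hypothetical, and treating the boundary values 0.3 and 0.7 as uncertain is an assumption made here.

```python
def label_esp_score(score, low=0.3, high=0.7):
    """Map an ESP prediction score to a coarse, cautious label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("prediction scores lie between 0 and 1")
    if score > high:
        return "likely substrate"
    if score < low:
        return "likely not a substrate"
    # Scores from low to high (inclusive) are treated as uncertain.
    return "uncertain"


for value in (0.05, 0.5, 0.95):
    print(value, label_esp_score(value))
```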

    Is the metabolite in training set?

    We have shown that the prediction performance of our model is low when it is applied to metabolites that were not present in our training set. Therefore, we check for every uploaded metabolite whether it was part of our training set. We return this information in the column "metabolite in training set". You can download a complete list of all metabolites that were part of our training set as a TSV file or as an Excel file.

TurNuP (kcat prediction)

    Output

    Both input modes, single input and multiple input, return one kcat prediction for every enzyme-reaction pair, in units of 1/s.

    How accurate is the kcat prediction?

    On average, predictions for new enzyme-reaction pairs deviate from the true kcat value by a factor of 4.9 (see our manuscript for more details). If similar enzymes or similar reactions were part of our training set, model accuracy increases. As an estimate of model performance, the single input option outputs an enzyme sequence identity and a reaction similarity score compared to the training data.
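
    To make the 4.9-fold figure concrete, the sketch below assumes that the fold deviation between a predicted and a measured kcat is the larger of the two ratios predicted/measured and measured/predicted. This definition is an assumption made for illustration; see the manuscript for how the reported value was actually computed. The function names are hypothetical. Note that 4.9 is an average, so individual predictions can be closer to or further from the true value.

```python
def fold_deviation(predicted, measured):
    """Fold deviation, assumed here to be max(p/m, m/p) for positive values."""
    if predicted <= 0 or measured <= 0:
        raise ValueError("kcat values must be positive")
    ratio = predicted / measured
    return max(ratio, 1.0 / ratio)


def typical_range(predicted, fold=4.9):
    """Interval spanned by an average fold deviation around a prediction."""
    return predicted / fold, predicted * fold


low, high = typical_range(10.0)  # predicted kcat of 10 1/s
print(f"{low:.2f} to {high:.2f} 1/s")
```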

KM prediction

    Output

    Both input modes, single input and multiple input, return one KM prediction for every enzyme-substrate pair, in units of mol/L.

    How accurate is the KM prediction?

    to write

SPOT (Transporter-Substrate Pair Prediction)

    Prediction score

    The prediction score is a value between 0 and 1. Scores close to 1 mean that the model predicts that the molecule is likely a substrate for the given transporter, whereas scores close to 0 mean that the model predicts that the molecule is likely not a substrate for the transporter. Prediction scores close to 0.5 (i.e. scores in the range of 0.4 to 0.6) should be considered with caution. The prediction model is unsure which class it should predict in these cases.

    Is the (potential) substrate in training set?

    We have shown that the prediction performance of our model is better when substrates are present in our training set. Therefore, we check for every uploaded molecule whether it was part of our training set. We return this information in the column "molecule in training set". You can download a complete list of all substrates that were part of our training set as a TSV file or as an Excel file.