Practical: UNIX 2

Aims

Objectives

After this practical you will:

Exercises

  1. The HTML source for the publication list of the Bioinformatics Research Group at Computing Science is in file /chalmers/users/kemp/DAT160/practical2/publications.html. Copy this file into your own file space. Write UNIX commands (using sed and/or grep) that filter the file in the following ways, and write the result to standard output:

    1. Change all occurrences of 'Bioinformatics' to 'BIOINFORMATICS'.

    2. Delete space characters at the start of each line.

    3. Display lines that do not contain the word 'Kemp', (i) using grep and (ii) using sed.

    4. Display the titles of all 2004 publications (hint: start by finding lines that contain 'publicationListItem' and '2004', then remove everything on the line before the title, and everything after the title).

    5. Remove all HTML tags (e.g. "<head>").

    6. Remove all HTML tags and then delete all lines that are blank or contain only space characters.
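
As a starting point for exercise 1, the filters for parts 1, 2, 3, 5 and 6 might be sketched as follows. The sample input is hypothetical and only stands in for publications.html, whose real layout will differ. The tag pattern assumes that every tag opens and closes on the same line.

```shell
# Hypothetical sample input standing in for publications.html.
sample=$(mktemp)
printf '%s\n' \
  '<html>' \
  '  <head><title>Bioinformatics Research Group</title></head>' \
  '  <p>Kemp, G. <i>Bioinformatics</i> methods</p>' \
  '   ' \
  '  <p>Another author</p>' \
  '</html>' > "$sample"

# Part 1: the g flag replaces every occurrence on a line, not just the first.
upper=$(sed 's/Bioinformatics/BIOINFORMATICS/g' "$sample")

# Part 2: delete any run of spaces at the start of each line.
trimmed=$(sed 's/^ *//' "$sample")

# Part 3: lines without 'Kemp', once with grep and once with sed.
# (grep -w would restrict the match to 'Kemp' as a whole word.)
no_kemp_grep=$(grep -v 'Kemp' "$sample")
no_kemp_sed=$(sed '/Kemp/d' "$sample")

# Parts 5 and 6: strip tags, then drop lines that are empty or all spaces.
text_only=$(sed 's/<[^>]*>//g' "$sample" | sed '/^ *$/d')

printf '%s\n' "$text_only"
rm -f "$sample"
```

Part 4 can be built in the same way by chaining grep with one sed substitution for the text before the title and another for the text after it.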

  2. The HTML source for the Pathguide pathway resource list is in file /chalmers/users/kemp/DAT160/practical2/pathguide.html.

    1. Count how many lines in that file contain the word 'fullrecord'.

    2. Use UNIX commands to create a single alphabetical list of the names of all of the resources in the Pathguide list (like the list in /chalmers/users/kemp/DAT160/practical2/resource_names).

      Count how many entries are in that list.
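
One possible shape for exercise 2 is sketched below. It assumes, purely for illustration, that each resource name is the link text on a line containing 'fullrecord'. The sample lines and resource names are made up, so check the real pathguide.html before relying on this pattern.

```shell
# Hypothetical sample input standing in for pathguide.html.
sample=$(mktemp)
printf '%s\n' \
  '<a href="fullrecord.php?id=2">Reactome</a>' \
  '<p>no link here</p>' \
  '<a href="fullrecord.php?id=1">BioCyc</a>' \
  '<a href="fullrecord.php?id=3">KEGG</a>' > "$sample"

# Part 1: grep -c counts matching lines (not individual occurrences).
count=$(grep -c 'fullrecord' "$sample")

# Part 2: keep the matching lines, strip the tags, sort alphabetically.
names=$(grep 'fullrecord' "$sample" | sed 's/<[^>]*>//g' | sort)

# Count the entries in the resulting list.
entries=$(printf '%s\n' "$names" | wc -l)

printf '%s\n' "$names"
echo "lines with fullrecord: $count, entries: $entries"
rm -f "$sample"
```

The result can be compared with the reference list using diff once the real file has been processed.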

  3. The Gene Ontology (GO) Consortium are developing three controlled vocabularies containing recommended terms that should be used when describing molecular function, biological processes and cellular components. File /chalmers/users/kemp/DAT160/practical2/goslim_generic.go contains a cut-down version of the GO ontologies.

    Examples of Gene Ontology terms in this file include "cell communication", "cell recognition", "cytoplasm", "peptidase activity", etc.

    Write a command that extracts all terms from a Gene Ontology file and writes a sorted list of terms (one term per line, without duplicates) to standard output.
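
A minimal sketch for exercise 3 follows. It assumes that each term line starts with optional indentation and one of the characters $, % or <, followed by the term name and then " ; GO:...", and that comment lines start with "!". This layout is an assumption, so inspect goslim_generic.go and adjust the patterns if it differs. The sample lines are illustrative only.

```shell
# Hypothetical sample input standing in for goslim_generic.go.
sample=$(mktemp)
printf '%s\n' \
  '!comment line (hypothetical)' \
  '$Gene_Ontology ; GO:0003673' \
  ' <cellular_component ; GO:0005575' \
  '  %cytoplasm ; GO:0005737' \
  ' <biological_process ; GO:0008150' \
  '  %cell communication ; GO:0007154' \
  '   %cell recognition ; GO:0008037' \
  '  %cell communication ; GO:0007154' > "$sample"

# Drop comment lines, remove the indentation and relationship character,
# cut everything from the first " ;" onwards, then sort without duplicates.
# LC_ALL=C makes the sort order independent of the local language settings.
terms=$(grep -v '^!' "$sample" \
  | sed -e 's/^ *[$%<]//' -e 's/ *;.*$//' \
  | LC_ALL=C sort -u)

printf '%s\n' "$terms"
rm -f "$sample"
```

Here sort -u does the work of sort followed by uniq, which is why the repeated "cell communication" line appears only once in the output.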

  4. File /chalmers/users/kemp/DAT160/latex/table.tex is a LaTeX source file that contains two tables.
    Copy this file, replace my name with your own, then create and view a PDF file produced from this file (hint: use the commands latex, dvipdf and acroread).

    How many ampersand characters (&) are in file table.tex?
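
Note that counting characters is not the same as counting lines. The sketch below, run on a hypothetical fragment rather than the real table.tex, shows one way to count each: tr deletes everything except ampersands before wc counts the remaining characters, while grep -c counts the lines that contain at least one ampersand.

```shell
# Hypothetical LaTeX fragment standing in for table.tex.
sample=$(mktemp)
printf '%s\n' \
  '\begin{tabular}{ll}' \
  'Name & Year \\' \
  'A & 1 \\ B & 2 \\' \
  '\end{tabular}' > "$sample"

# Characters: delete everything that is not '&', then count what is left.
amp_chars=$(tr -cd '&' < "$sample" | wc -c)

# Lines: grep -c reports lines containing '&', which can be a smaller number.
amp_lines=$(grep -c '&' "$sample")

echo "ampersand characters: $amp_chars, lines containing one: $amp_lines"
rm -f "$sample"
```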

  5. File /chalmers/users/kemp/DAT160/practical2/StatReport.html contains a summary of the number of structure files deposited in the Protein Data Bank and released each year from 1973 to mid-1998. Use UNIX commands to create two files, each containing two columns of numbers:

    1. file "deposited" which contains the year and the number of entries deposited in that year;

    2. file "released" which contains the year and the number of entries released in that year.
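
Exercise 5 could be approached along the following lines. The sketch assumes, hypothetically, that each table row holds the year, the number deposited and the number released, in that order. The sample rows and numbers are made up, so examine the real StatReport.html and change the column numbers to match.

```shell
# Hypothetical sample input standing in for StatReport.html.
workdir=$(mktemp -d)
printf '%s\n' \
  '<tr><th>Year</th><th>Deposited</th><th>Released</th></tr>' \
  '<tr><td>1973</td><td>8</td><td>0</td></tr>' \
  '<tr><td>1974</td><td>12</td><td>5</td></tr>' > "$workdir/sample.html"

# Replace each tag with a space so the cell values become separate fields,
# then keep only the rows whose first field is a four-digit year.
sed 's/<[^>]*>/ /g' "$workdir/sample.html" \
  | awk '$1 ~ /^[0-9][0-9][0-9][0-9]$/ { print $1, $2 }' > "$workdir/deposited"

sed 's/<[^>]*>/ /g' "$workdir/sample.html" \
  | awk '$1 ~ /^[0-9][0-9][0-9][0-9]$/ { print $1, $3 }' > "$workdir/released"

deposited=$(cat "$workdir/deposited")
released=$(cat "$workdir/released")
printf 'deposited:\n%s\nreleased:\n%s\n' "$deposited" "$released"
rm -r "$workdir"
```

The two pipelines differ only in the column that awk prints, so the tag-stripped text could also be saved to a temporary file once and reused.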

Supplementary Material

You can read more about the Gene Ontology (GO) project on the Gene Ontology Consortium web site. Several different web-based tools for searching and browsing GO have been implemented. Try using some of these.