I am writing a book about this work in general, with all the necessary explanations and some of the results of this research. Hopefully this will help and inspire others to become involved.
You will find some of the Perl script code, which you might need to get started making your own disease protein datasets, on the additional resources page (see below); these scripts are also included in the book's ample appendices.
Please be aware that there are a lot of colour images, so you may prefer to download the eBook onto your computer if your Kindle or other device does not show colour images properly.
If you are interested in buying this work please click on the link below.
Disease Motifs - A Wild Speculative Excursion into the Field of Bioinformatics
Some Perl scripts and links. Just click on the links in red, which are the first few words on the left-hand side of the page. The Perl scripts have been placed online as .txt files; when you click on the links they will simply open in the web browser. All you need to do is "select all" on the web page, then copy it to a Notepad text file, giving it the desired name such as master.pl and being sure that you do not save it as master.pl.txt (see the explanations in The Basics).
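If a script does end up saved with the doubled extension, it can be renamed afterwards. The following is only a minimal sketch in Python, not one of the book's Perl scripts; the function name strip_txt_suffix and its directory argument are hypothetical and chosen purely for illustration.

```python
from pathlib import Path


def strip_txt_suffix(directory):
    """Rename files such as master.pl.txt to master.pl; return the new names."""
    renamed = []
    for path in sorted(Path(directory).glob("*.pl.txt")):
        target = path.with_suffix("")  # drops only the trailing ".txt"
        if target.exists():
            continue  # never overwrite a script that is already there
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

Calling strip_txt_suffix on the folder holding the saved scripts leaves ordinary .txt files untouched and skips any file whose .pl version already exists.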
I have to warn you that these Perl scripts can damage the OS on your computer, to the point that you will need to reinstall it. You use these scripts at your own risk. I have taken out the gamma scripts, as you will know if you have read this book, because they stumbled when looking at the P53 protein and fatally wiped out Windows 10 on my laptop.
These scripts all go into the DiseaseGrab directory, except OHfilegrab, which should be placed in the databaseMaesta directory.
master.pl - Appendix 1.01.
DiseaseGrab.pl - Appendix 1.02.
fastaD.pl - Appendix 1.03.
FTvariant.pl - Appendix 1.04.
fastaDeliver.pl - Appendix 1.05.
OHfilegrab.pl - Appendix 1.06.
WinZip I use WinZip to extract the files from a compressed file.
The Complete Swiss-Prot Protein Database can be downloaded from Uniprot. Once in the "complete" directory at Uniprot select uniprot_sprot.dat.gz. This is a compressed file and you will need to extract it before using it; but try not to open this file as your computer is going to struggle. I usually rename this "all.dat".
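For readers without WinZip, the extraction step can also be done with Python's standard library. This is a hedged sketch rather than part of the book's scripts: the function name extract_gz is hypothetical, and the file names are simply the ones mentioned above. It copies the data in chunks, so the large file is never opened whole in memory or in an editor.

```python
import gzip
import shutil


def extract_gz(source, destination):
    """Decompress a .gz file to destination in chunks, never loading it whole."""
    with gzip.open(source, "rb") as compressed, open(destination, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return destination
```

For example, extract_gz("uniprot_sprot.dat.gz", "all.dat") would extract and rename the file in one step, assuming the download sits in the current directory.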
The Swiss-Prot Taxonomic Divisions Protein Databases can be downloaded from Uniprot. Once inside the "taxonomic_divisions" directory at Uniprot, select whichever files you want to work with, but please do not select any of the uniprot_trembl files, as these may cause damage to your computer's Operating System (OS) when used with these Perl scripts.
Further Perl scripts, blastall.exe, formatdb.exe and links.
Executable programmes blastall.exe and formatdb.exe
Clicking on these will open a Windows dialog box giving you the option to save these programmes to your computer.
formatdb.exe can be downloaded from here.
blastall.exe can be downloaded from here.
Formatting, database creation and fasta sequence file creation scripts
OSfilegrab.pl - Appendix 2.01.
formatdb.pl - Appendix 2.02. It is not really essential to download this, as it is already included in Maesta.pl.
fasta.pl - Appendix 2.03.
Maesta.pl - Appendix 2.04.
DEfilegrab.pl - Appendix 2.39. This is not essential but is an additional programme you might like to have.
selene.pl - Appendix 2.40. This is not essential but is an additional programme you might like to have.
M2 scripts and HighPoint.pl
master.pl - Appendix 2.06.
databases.txt - Appendix 2.07.
initialData.txt - Appendix 2.08.
alpha.pl - Appendix 2.09.
Xalpha.pl - Appendix 2.10.
alphaN.pl - Appendix 2.11.
alphaEx.pl - Appendix 2.12.
alphaExNC.pl - Appendix 2.13.
alphaExCN.pl - Appendix 2.14.
alphaGV.pl - Appendix 2.15.
alphaHTML.pl - Appendix 2.16.
alphaGV.css - Appendix 2.17.
indexpage.pl - Appendix 2.24.
newstyle.css - Appendix 2.25.
gNpolymorph.pl - Appendix 2.27.
masterGraphReDoWithoutBlast.pl - Appendix 2.28.
sets.txt - Appendix 2.28.
MasterSetMaker.pl - Appendix 2.29.
DiseaseGrab.pl - Appendix 2.30.
fastaD.pl - Appendix 2.31.
FTVariant.pl - Appendix 2.32.
fastaDeliver.pl - Appendix 2.33.
linkages.pl - Appendix 2.34.
setsHTML.pl - Appendix 2.35.
The Forbidden Gamma scripts
These scripts have been taken out of the M2 scripts as they brought the Windows 10 Operating System down, fatally, on my laptop. The protein datasheet responsible was that of the P53 protein (P53_HUMAN). Please do not try to reconnect these programmes into M2, as they have the capacity to bring down the OS on your computer.
Parkinson's Disease Protein Dataset Additional Files
mainSDhp_HP_10_dmean_1.txt - Appendix 2.39.
quest.pl - Appendix 3.07. Just use this with the standard list of databases otherwise things get a bit messy.
CompareAndCompose.pl - Appendix 3.08.
rickettsia.xlsx This is the (very) rough working Excel sheet of the 6SD_1dm data. On one sheet I was trying to input/create/find motifs manually. This is unfinished work.
inputs.zip This is a compressed (zipped) file of the total inputs file for the "parkinson" project. You can download this and then "extract" the contents and start your own project, or just look at the datasheets.
Please look at the further explanation of this in the book (earlier chapters). I am still in the process of writing this chapter (chapter four), which, all being well, will hopefully be available sometime late next year (2020). It might be made available separately, but almost certainly it will be appended to the book.
Primary Onco-Protein Derivative Datasets
These are compressed (zipped) files, all taken from the Swiss-Prot human protein database. You can download these and then "extract" the contents and start your own project; or you might just want to look at the datasheets. If you are a passive researcher who just wants to check things out, and you do come across any false positives in the datasheets, such as a statement within the datasheet saying "this is not a cancer protein" (or something similar), please let me know by sending me the ID of the protein (or IDs of the proteins) and I might subtract this (or these) from the dataset.
Also be aware that within the folder/directory that you are extracting, such as "cancer", there will be an inner folder named either "human", after the "human" protein database from which the proteins were taken, or "inputs", because I have renamed the original "human" folder to "inputs" for running the M2 scripts. If you want to run the CompareAndCompose.pl script, this inner folder needs to be named "human"; for the M2 scripts it has to be named "inputs".
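Switching the inner folder name back and forth by hand is easy to get wrong, so a small helper can do it. This is only an illustrative sketch in Python, not one of the book's scripts; the function name rename_inner_folder is hypothetical, and the folder names are the ones described above.

```python
from pathlib import Path


def rename_inner_folder(dataset_dir, old_name, new_name):
    """Rename dataset_dir/old_name to dataset_dir/new_name if needed."""
    root = Path(dataset_dir)
    old, new = root / old_name, root / new_name
    if new.is_dir():
        return new  # the folder already has the wanted name
    if not old.is_dir():
        raise FileNotFoundError(f"no '{old_name}' or '{new_name}' folder in {root}")
    old.rename(new)
    return new
```

For example, rename_inner_folder("cancer", "human", "inputs") prepares an extracted dataset for the M2 scripts, and swapping the last two arguments restores the name expected by CompareAndCompose.pl.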
Secondary Onco-Protein Derivative Datasets
These are taken from the cancer protein dataset, as opposed to the carcinoma protein dataset (see the discussion in Oncology when it is published).
As mentioned above, the inner folder will be named either "cancer" (not "human" this time) or "inputs". In these sets it would be particularly useful to point out any false positives, as any mention in the datasheet of the lung, prostate etc. might refer to the location of the protein and have no connection with any of the named cancers, such as lung cancer.