My experience on submitting the first Bioconductor package
I am sharing here what I learned while developing my first R package, PhyloProfile, and while successfully submitting it to the Bioconductor repository.
My tips for better R coding
Code format
Coding style and format must be consistent throughout the package:
- camelCaps for `variable` and `function` names
- 4 spaces for `indent` (no tabs!)
- no lines longer than 80 characters
- always use space after a comma
- use `<-` not `=` for assignment
- …
 
You should check the required format of the software repository to which you plan to submit your program or package.
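As a rough illustration, the sketch below scans source lines for two of the rules listed above (no tabs, no lines over 80 characters). The helper name `findFormatIssues` and the sample lines are hypothetical and only serve this example; for a real submission, rely on the checks that the target repository prescribes.

```r
# Hypothetical helper: flag lines that contain tabs or exceed a width limit.
findFormatIssues <- function(codeLines, maxWidth = 80) {
    tooLong <- nchar(codeLines) > maxWidth
    hasTab <- grepl("\t", codeLines, fixed = TRUE)
    # Keep only the lines that break at least one of the two rules
    data.frame(
        line = seq_along(codeLines),
        tooLong = tooLong,
        hasTab = hasTab
    )[tooLong | hasTab, ]
}

# Made-up input: one clean line, one with a tab, one that is 81 characters long
sampleLines <- c(
    "newDf <- df[i, c(\"geneID\", \"ncbiID\")]",
    "\tx = 1",
    strrep("a", 81)
)
issues <- findFormatIssues(sampleLines)
print(issues)
```

To check a real file, the same helper could be fed with `readLines()` on that file.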
Best practices
- Avoid repeated code! If you need the same chunk of code in many places, write a helper function instead.
- Functions should be no longer than 50 lines.
- Functions and their arguments need to be well documented: their description, the data type and format of input and output, and an example. Arguments should have default values.
- Avoid adding new dependencies (libraries/packages). Try to get as much as possible out of each package you already depend on. If the functions you need are available in base R, there is no need to use other libraries unless you have a convincing reason (e.g. higher speed). For example, here is a list of equivalent functions between `R base` and `stringr`. Tip: check for barely used libraries and try to replace them.
- Communicate with users using `message("hello friend")`, and give informative error messages with `stop("hey, this is wrong input!")` instead of just returning NULL with `return()`.
- Others:
    - Use `vapply()` instead of `sapply()`, and use the various `apply` functions instead of `for` loops
    - Use `seq_len()` or `seq_along()` instead of `1:...`
    - Use `TRUE`/`FALSE` instead of `T`/`F`
    - Avoid `class()==` and `class()!=`; use `is()` instead
    - Use `system2()` instead of `system`
    - Avoid the use of `<<-`
    - …
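Several of the points above can be seen together in one short sketch. The function `countOrthologs` and the data frame `exampleDf` are made up for illustration and are not part of PhyloProfile.

```r
# Hypothetical helper, used only to illustrate the practices listed above.
countOrthologs <- function(df = NULL, idColumns = c("geneID", "orthoID")) {
    # is() instead of class(df) == "data.frame"
    if (!methods::is(df, "data.frame")) {
        stop("Input must be a data frame, got: ", class(df)[1])
    }
    missingCols <- setdiff(idColumns, colnames(df))
    if (length(missingCols) > 0) {
        stop("Missing column(s): ", paste(missingCols, collapse = ", "))
    }
    message("Counting non-missing values in ", length(idColumns), " columns")
    # vapply() states the expected return type; seq_along() stays safe when
    # idColumns is empty, whereas 1:length(idColumns) would give c(1, 0)
    counts <- vapply(
        seq_along(idColumns),
        function(i) sum(!is.na(df[[idColumns[i]]])),
        FUN.VALUE = integer(1)
    )
    names(counts) <- idColumns
    return(counts)
}

exampleDf <- data.frame(
    geneID = c("g1", "g2", "g3"),
    orthoID = c("o1", NA, "o3"),
    stringsAsFactors = FALSE
)
counts <- countOrthologs(exampleDf)
```

A wrong input such as `countOrthologs(1)` stops with an explicit error message rather than silently returning nothing, which is what makes the failure easy to trace for the user.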
 
Efficient code
- Utilize vectorized functions and avoid loops (`for`, `while` or the `apply` family). If a loop is a must, keep it as simple as you can.
- Avoid copy-and-append (e.g. `rbind`); use pre-allocate-and-fill instead (e.g. `lapply()`)
- …
 
In the following example, I use four different methods to do the same thing: filter the input phylogenetic profile data, keep only the rows where an orthologous protein is found, and return a new data frame as output. The input data frame has 5 columns (“geneID”, “ncbiID”, “orthoID”, “Domain_similarity”, “traceability”) and 3234 rows. The output will contain 3 columns (“geneID”, “ncbiID”, “orthoID”) and 1366 rows.
- Using `for` loop and the returned data frame is not preallocated:

      forloop <- function(df) {
          newDf <- data.frame()
          for (i in seq_len(nrow(df))) {
              if (!is.na(df$orthoID[i])) {
                  newRow <- df[i, c("geneID", "ncbiID", "orthoID")]
                  newDf <- rbind(newDf, newRow)
              }
          }
          return(newDf)
      }

- Using `for` loop and the returned data frame is preallocated:

      forloopPreallocate <- function(df) {
          newDf <- data.frame(
              geneID = character(nrow(df)),
              ncbiID = character(nrow(df)),
              orthoID = character(nrow(df)),
              stringsAsFactors = FALSE
          )
          j <- 1
          for (i in seq_len(nrow(df))) {
              if (!is.na(df$orthoID[i])) {
                  newDf[j, ] <- df[i, c("geneID", "ncbiID", "orthoID")]
                  j <- j + 1
              }
          }
          return(newDf)
      }

- Using `sapply()` function (this, or other members of the “apply” function like `apply()`, `lapply()`, `vapply()` are just alternative forms of “for-loop”, but they are more effective in some cases):

      sapplyfn <- function(df) {
          tmp <- sapply(
              seq_len(nrow(df)),
              function (i) {
                  if (!is.na(df$orthoID[i])) {
                      return(df[i, c("geneID", "ncbiID", "orthoID")])
                  }
              }
          )
          return(do.call(rbind, tmp))
      }

- Using vectorized function `subset()`:

      vectorizefn <- function(df) {
          return(subset(df, !is.na(df$orthoID)))
      }

And this is the benchmark of those 4 functions in terms of calculation speed:
Unit: microseconds
               expr       min       mean      median         max
            forloop 305481.51 366958.9898 346929.2885 593712.576
 forloopPreallocate 170936.60 231358.6863 206390.8430 705593.905
           sapplyfn 126964.86 156118.3279 143331.4655 358841.424
        vectorizefn    182.09    309.3324    262.5635   2883.188
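To try a comparison of this kind yourself, here is a minimal sketch. It is built on assumptions: the data frame `toyDf` is synthetic (not the original profile data), the function names `appendRows` and `filterVectorized` are made up, and timing is done with base R's `system.time()` because the text does not say which benchmarking tool produced the table above. The vectorized variant here also selects the three output columns explicitly.

```r
# Synthetic stand-in for the phylogenetic profile (not the original data).
set.seed(1)
nRows <- 2000
toyDf <- data.frame(
    geneID = paste0("gene", seq_len(nRows)),
    ncbiID = paste0("ncbi", seq_len(nRows)),
    orthoID = ifelse(runif(nRows) < 0.4, paste0("ortho", seq_len(nRows)), NA),
    Domain_similarity = runif(nRows),
    traceability = runif(nRows),
    stringsAsFactors = FALSE
)

# Copy-and-append: the growing data frame is copied on every rbind() call.
appendRows <- function(df) {
    newDf <- data.frame()
    for (i in seq_len(nrow(df))) {
        if (!is.na(df$orthoID[i])) {
            newDf <- rbind(newDf, df[i, c("geneID", "ncbiID", "orthoID")])
        }
    }
    return(newDf)
}

# Vectorized: one logical index selects rows, a name vector selects columns.
filterVectorized <- function(df) {
    return(df[!is.na(df$orthoID), c("geneID", "ncbiID", "orthoID")])
}

timeAppend <- system.time(resAppend <- appendRows(toyDf))[["elapsed"]]
timeVector <- system.time(resVector <- filterVectorized(toyDf))[["elapsed"]]

# Both approaches should agree before their timings are compared.
stopifnot(identical(dim(resAppend), dim(resVector)))
stopifnot(identical(resAppend$orthoID, resVector$orthoID))
message("append: ", timeAppend, " s; vectorized: ", timeVector, " s")
```

Single `system.time()` runs are noisy, so treat the printed numbers as a rough indication only. A dedicated benchmarking package such as `microbenchmark` repeats each expression many times and reports summary statistics.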
There are some more approaches to speed up your code, you can find them at:
How to create an R package
This book by Hadley Wickham is an excellent resource for learning how to create an R package. It guides you from the overall package structure through each individual component that you need to build. This post also shows you how to develop a good R package. I suggest reading them before you start.
Some important things I have learned are:
- Write good documentation, including man pages for all the functions and vignettes for the package. A `man` file, or manual for a function, needs to be detailed. Besides describing the purpose of the function and the meaning of its parameters (arguments), you also need to give the data types of input and output, and an example (which should be runnable) showing the usage of that function. A `vignette`, unlike the `man` files, is used for explaining the details of the whole package, as well as for demonstrating some real use cases of the package.
- Test the functions with `testthat`
- Use continuous integration (such as travis-ci) to automatically build and test code changes
- Write good documentation (yes, it needs to be repeatedly emphasized!)
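The first two points can be sketched in a few lines. This assumes the man page is generated from roxygen2-style comments, which the text above does not state, and it uses a hypothetical function `filterOrthologs` that is not part of PhyloProfile.

```r
#' Keep only profile rows where an ortholog was found
#'
#' @param df data.frame with the character columns "geneID", "ncbiID" and
#' "orthoID" (missing orthologs are encoded as NA)
#' @return data.frame with the columns "geneID", "ncbiID" and "orthoID",
#' restricted to the rows whose orthoID is not NA
#' @examples
#' df <- data.frame(
#'     geneID = c("g1", "g2"), ncbiID = c("n1", "n2"),
#'     orthoID = c("o1", NA), stringsAsFactors = FALSE
#' )
#' filterOrthologs(df)
#' @export
filterOrthologs <- function(df) {
    return(df[!is.na(df$orthoID), c("geneID", "ncbiID", "orthoID")])
}

# In a package this test would live in tests/testthat/test-filterOrthologs.R;
# the guard only keeps the sketch runnable where testthat is not installed.
if (requireNamespace("testthat", quietly = TRUE)) {
    testthat::test_that("rows without ortholog are dropped", {
        df <- data.frame(
            geneID = c("g1", "g2"), ncbiID = c("n1", "n2"),
            orthoID = c("o1", NA), stringsAsFactors = FALSE
        )
        res <- filterOrthologs(df)
        testthat::expect_equal(nrow(res), 1)
        testthat::expect_equal(colnames(res), c("geneID", "ncbiID", "orthoID"))
    })
}
```

The comment block covers what a man page should contain according to the list above: purpose, input and output types, and a runnable example.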
 
How to submit your package to Bioconductor
To submit an R package to Bioconductor, you need to follow their package and contribution guidelines. Here I summarize the main steps and what you should pay attention to.
1. Store your R package in GitHub. The source code must be in the `master` branch of that GitHub repository.
2. Add SSH keys to your GitHub account.
    - `ssh-keygen -t rsa -b 4096 -C "your_email@example.com"`
    - copy and add the generated SSH public key to GitHub under https://github.com/settings/keys
    - add the private key file into `~/.ssh/config` (replace `id_rsa_private` with the file name of your private key)

          host git.bioconductor.org
              HostName git.bioconductor.org
              IdentityFile ~/.ssh/id_rsa_private
              User github_user_name

3. Open a new issue in the Bioconductor contributions GitHub. Share the link to your package repository and use the `package name` as the title of that issue.
4. Add a webhook to your repository in order to automatically trigger a package build when you push a valid commit to the master branch.
 
A valid commit is recognized by the version of the package. For the first commit, start with `Version: 0.99.0`. Whenever you want to trigger a new package build and check on the Bioconductor server, push a commit with a version bump, e.g. to `Version: 0.99.1`. Read this post to understand how the version number in Bioconductor is controlled.
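As a small illustration of such a version bump, the sketch below increases the last component of a version string with base R's `package_version()`. The helper `bumpVersion` is hypothetical; in practice you simply edit the `Version:` field in your package's DESCRIPTION file.

```r
# Hypothetical helper: increase the last component of a version string.
bumpVersion <- function(version = "0.99.0") {
    # Split the version into its integer components, e.g. 0, 99, 0
    parts <- unlist(package_version(version))
    parts[length(parts)] <- parts[length(parts)] + 1L
    return(paste(parts, collapse = "."))
}

newVersion <- bumpVersion("0.99.0")
# package_version objects compare component-wise, not as plain strings
stopifnot(package_version(newVersion) > package_version("0.99.0"))
stopifnot(package_version("0.99.10") > package_version("0.99.9"))
print(newVersion)
```

The second check shows why versions should not be compared as strings: as text, "0.99.10" would sort before "0.99.9".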
After being accepted, your package will be available in Bioconductor’s git repository. The SSH key you created in step 2 will be used to maintain your package (e.g. fixing bugs, adding new features, …). Make sure that the email linked to your package is the same as the one shown in your BioC git profile, and that it is also present in your GitHub account (it can be an alternative email).
In some cases, you may want to submit a related package (such as an experiment data package) that is located in another GitHub repository. To do that, you just need to post a comment in the current issue like `AdditionalPackage: https://github.com/username/repositoryname`. This must be posted by YOU, the same GitHub user that created the issue. You also need to add a webhook to that related package, the same as you did for the main package.
That’s it! Have fun and good luck with your first contribution ;-)