unix crash course
The goal of this crash-course is to get you acquainted with some of the basic concepts and commands of the UNIX command line.
Mastering it is not something done in a single day, so the goal is not to turn you into sysadmins, but to give you enough grounding so that, armed with a good cheatsheet and a bit of patience, you can run the commands needed to perform a differential expression or enrichment analysis and then interact with your result files.
This crash course is made to be followed along on a remote RStudio server (the teacher will tell you how to connect there).
You will learn to :
- navigate the filesystem using common unix commands
- look at text files and manipulate them
- use command line tool options
- redirect command line tool output to files
start a terminal in RStudio
filesystem navigation
type commands and then press enter to execute them
pwd
pwd
= print working directory : tells you where the terminal is in the filesystem
pwd
ls
ls
= list : lists the content of the working directory
ls
Giving arguments to commands - 1 : some commands take arguments (very much like functions in R).
Arguments are given separated by spaces.
For example, ls
can take an argument corresponding to the name of a directory, in which case it will list the content of that directory rather than the current working directory:
ls R_crash_course_data
cd
cd
= change directory : moves the current working directory to the folder given as argument
cd R_crash_course_data
pwd # check the new working directory
.. is a shortcut for “parent directory”
cd ..
. is a shortcut for “current directory”
cd . # we move to the current directory, so we stay in place
cd without argument will take you back to your home directory
cd
You can move through multiple folders in one go
cd R_crash_course_data/data/
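As a small detour, the .. shortcut can be chained in the same way. Assuming you just ran the command above, the following climbs back up two levels in one go, then comes back down so that we stay in place for the next examples:
cd ../.. # go up two levels: out of data/, then out of R_crash_course_data/
pwd # we are back in the home directory
cd R_crash_course_data/data/ # and back down again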
Giving arguments to commands - 2 :
Some arguments are given by name, in which case they start with -
(one letter argument) or --
(more than one letter argument).
Also, some arguments require a value, while others are just “flags” that merely need to be added to the command line to take effect:
ls --color=auto
ls -l
One can combine multiple one-letter arguments with the same -
ls -l -h
## equivalent to
ls -lh
So in the end you may have a command line looking like
ls -lh --color=auto
Its output looks like:
total 5.3M
-rw-rw-r-- 1 wandrille wandrille 2.7M Aug 15 13:22 diamonds.csv
-rw-rw-r-- 1 wandrille wandrille 2.7M Aug 15 13:22 diamonds.tsv
-rw-rw-r-- 1 wandrille wandrille 1.4K Aug 15 13:22 virginica.csv
- The first line tells you the total size of the files in the folder (but it does not account for sub-folders)
file permissions | # of links | owner | group | size | last modification | filename |
---|---|---|---|---|---|---|
-rw-rw-r-- | 1 | wandrille | wandrille | 2.7M | Aug 15 13:22 | diamonds.csv |
File permissions are read (r), write (w), and execute (x)
directory? | owner permissions | group permissions | others permission |
---|---|---|---|
- | rw- | rw- | r-- |
not a directory | can read and write | can read and write | can read only |
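To make this concrete, here is how the permission string of diamonds.csv from the listing above can be read, piece by piece (assuming you are still in the folder that was listed):
ls -l diamonds.csv # shows a single line starting with -rw-rw-r--
# -rw-rw-r-- reads as:
#   -     first character: - for a regular file, d for a directory
#   rw-   owner permissions: read and write, no execute
#   rw-   group permissions: read and write, no execute
#   r--   others permissions: read only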
moving files around
First let’s move back to our home folder
cd
mkdir
mkdir
= make directory : creates a directory
mkdir unix_exercise
ls -lh # to check that the directory was created
cp
cp
= copy : copies a file to a given location
cp R_crash_course_data/iris.csv . # copying the iris.csv file to the current directory
ls -lh --color=auto # checking the result
Entire folders can be copied recursively, but you need to add the -r
flag
cp R_crash_course_data/data/ unix_exercise/ #trying to copy the data/ folder to unix_exercise/
cp: -r not specified; omitting directory 'R_crash_course_data/data/'
cp -r R_crash_course_data/data/ unix_exercise/
ls -Rlh --color=auto unix_exercise ## checking the result; -R makes a recursive ls
mv
mv
= move : moves file to a given location
ls # check that iris.csv is here
mv iris.csv unix_exercise # move iris.csv to the unix_exercise/ folder
ls # check that iris.csv is not here anymore
ls unix_exercise # check that iris.csv is in unix_exercise/
it can also be used to change the name of a file or folder
cd unix_exercise
mv iris.csv IRIS.csv
ls
rm
rm
= remove : removes files or folders
Warning
there is no recycle bin, or trash bin, or whatever. Any deletion is final.
Warning
I repeat: any deletion is final.
let’s create a copy of IRIS.csv to delete it:
cp IRIS.csv iris_copy.csv
ls
rm IRIS.csv
folders can be deleted with the -r option
rm -r data
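Since deletions are final, it can be reassuring to add the -i flag, which makes rm ask for confirmation before each removal (this is standard rm behaviour; the throwaway.csv name below is made up just for this demonstration):
cp iris_copy.csv throwaway.csv # a throwaway copy we can safely delete
rm -i throwaway.csv # rm asks for confirmation; answer y to delete, n to keep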
wildcard character : the * of the show
*
is the wildcard character, which can substitute itself for any number of characters.
we use it to craft powerful filename patterns that let us apply commands to selected files based on their names:
# copy all files in ../R_crash_course_data/data/ to the current working directory
cp ../R_crash_course_data/data/* .
ls
diamonds.csv diamonds.tsv iris_copy.csv virginica.csv
here the *
substituted itself for all possible file names.
Where it shines is that it can be combined with other characters:
ls -l *.csv ## all csv files
ls -l diamonds.* ## all files starting with diamonds
exercise : basic file system manipulation
- change your working directory back to your home directory
- create a folder named unix_exercise2
- look at the files in the folder named /data/DATA/mouseMT. What is the size of each file? Do you have permission to write there?
- copy all sample_a files from /data/DATA/mouseMT to unix_exercise2
- look at the files you just copied: do you have write permission on them now?
looking at file content
Let’s move to the unix_exercise
folder
cd # to your home directory first
cd unix_exercise
head / tail
head
shows you the first 10 lines of a file
head diamonds.csv
"carat","cut","color","clarity","depth","table","price","x","y","z"
0.23,"Ideal","E","SI2",61.5,55,326,3.95,3.98,2.43
0.21,"Premium","E","SI1",59.8,61,326,3.89,3.84,2.31
0.23,"Good","E","VS1",56.9,65,327,4.05,4.07,2.31
0.29,"Premium","I","VS2",62.4,58,334,4.2,4.23,2.63
0.31,"Good","J","SI2",63.3,58,335,4.34,4.35,2.75
0.24,"Very Good","J","VVS2",62.8,57,336,3.94,3.96,2.48
0.24,"Very Good","I","VVS1",62.3,57,336,3.95,3.98,2.47
0.26,"Very Good","H","SI1",61.9,55,337,4.07,4.11,2.53
0.22,"Fair","E","VS2",65.1,61,337,3.87,3.78,2.49
with -n
you can specify the number of lines to show
head -n 3 diamonds.csv
"carat","cut","color","clarity","depth","table","price","x","y","z"
0.23,"Ideal","E","SI2",61.5,55,326,3.95,3.98,2.43
0.21,"Premium","E","SI1",59.8,61,326,3.89,3.84,2.31
tail
shows you the last lines of a file:
tail -n 3 diamonds.csv
0.7,"Very Good","D","SI1",62.8,60,2757,5.66,5.68,3.56
0.86,"Premium","H","SI2",61,58,2757,6.15,6.12,3.74
0.75,"Ideal","D","SI2",62.2,55,2757,5.83,5.87,3.64
more
more
prints a file content in the terminal.
When the file is large it prints a page, and you can press
* enter : print one more line
* space : print one more page
* q : stop printing
more diamonds.csv
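A closely related pager, less (which the exercise below also mentions), behaves like more but additionally lets you scroll back up with the arrow keys; as with more, press q to quit:
less diamonds.csv # navigate with the arrow keys or space, press q to quit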
wc
wc = word count : counts the number of lines, words, and characters in a file
wc diamonds.csv
53941 66023 2772143 diamonds.csv
We use it most often with option -l
to get just the number of lines:
wc -l iris_copy.csv
151 iris_copy.csv
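wc also accepts several files at once (and so wildcards too), in which case it adds a total line at the end; this is what we will exploit below to count lines in many files with a single command:
wc -l diamonds.csv iris_copy.csv # one count per file, plus a final total line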
exercise : looking at files
- look at the first and last lines of /data/DATA/mouseMT/sample_a1.fastq
- we have the logs of a Quality Control tool in the file /data/Solutions/Liu2015/010_l_fastqc_Liu2015.o. Look at its content with more or less: does it look like all files were processed OK?
- how many lines are in each of the fastq files in /data/DATA/mouseMT/ ? (hint: use * to make your life easier)
get --help
Of course, you cannot guess all of a command's arguments.
You can consult the manual page of a command with man
, but not all tools have one…
Instead, in general you can get a list of arguments with --help
(or sometimes -h
)
ls --help
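For the commands that do have a manual page, man opens it in a pager similar to more: scroll with space or the arrow keys, and press q to leave it.
man ls # the full manual page of ls; press q to quit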
advanced but useful stuff
output redirection : >
>
lets you send all the standard output of a command (but not its error messages) to a file you specify
This is very useful when you call a tool whose logs you would like to keep (the logs of many bioinformatics tools can be input files for other tools).
wc -l /data/DATA/mouseMT/*.fastq > mouseMT.file_sizes.txt
more mouseMT.file_sizes.txt
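To convince yourself that error messages are not captured by >, try redirecting the output of a command that fails (the file name below is made up and does not exist): the error is still printed in the terminal, and the output file stays empty.
wc -l a_file_that_does_not_exist.txt > test_redirect.txt # the "No such file or directory" message still shows up on screen
more test_redirect.txt # the redirected file is empty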
emergency exit : Ctrl+C
Sometimes you will launch a job, which may take a while and you will realize, with horror, that you made a mistake and this is not what you wanted to do!
Ctrl+C
stops the execution of whatever command is currently running.
For example, the following command counts number of lines in big files, which takes a while.
Let it run for a couple of seconds and then stop it with Ctrl+C
wc -l /data/DATA/Liu2015/*
background job : &
Often you want to launch a job which you know will take a while, but you’d like to be able to continue working while it runs.
By adding &
at the end of the command line the job will be executed in the background.
wc -l /data/DATA/Liu2015/* > Liu2015.file_sizes.txt &
You can use top
or ps -u
to see your job running (with top
, type q
to exit).
When the command is finished you will get in your terminal something like:
[1]+ Done wc -l /data/DATA/Liu2015/* > Liu2015.file_sizes.txt
variables
As in R, it is possible to store data in a variable.
It is mostly useful to help us manipulate file names more easily.
They are declared with =
, but you have to be careful: use no spaces around the =
INPUT_FOLDER=/data/DATA/mouseMT
then you can use them with ${variable_name}:
ls ${INPUT_FOLDER}/*
ls ${INPUT_FOLDER}/sample_a*
Note
You will notice most of my bash variables are UPPERCASE. It is not mandatory but I find it a useful convention.
Note
instead of ${INPUT_FOLDER}
, we could write $INPUT_FOLDER, but this form is more prone to mix-ups when integrating the variable content within other characters, so beware.
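To see the kind of mix-up this avoids, here is a small, purely illustrative example (echo simply prints whatever you give it; the _backup suffix is made up):
echo ${INPUT_FOLDER}_backup # prints /data/DATA/mouseMT_backup, as intended
echo $INPUT_FOLDER_backup # bash looks for a variable named INPUT_FOLDER_backup, which does not exist, so this prints an empty line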
loop
Loops let us “easily” deploy the same command while changing 1 variable.
This is very useful to deploy a bioinformatic tool on many input files in a tidy way.
For example, imagine we want to deploy a command separately on the mouseMT fastq files of sample a and of sample b; it could look something like:
# I use wc -l as my command here
wc -l /data/DATA/mouseMT/sample_a*.fastq > mouseMT.a.file_sizes.txt
wc -l /data/DATA/mouseMT/sample_b*.fastq > mouseMT.b.file_sizes.txt
That works, but imagine that you have more than a handful of samples: it quickly becomes tedious.
Additionally, speaking from experience, it is easy to make small typos that will ruin your day here (e.g., adapting the input file names but not the output file name).
A for loop in bash looks like:
for VAR in a b
do # do starts the loop block
## some command, which will be repeated, ${VAR} will have a different value each time
wc -l /data/DATA/mouseMT/sample_${VAR}*.fastq > mouseMT.${VAR}.file_sizes.txt
done # done finishes the loop block
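If you want to see the loop mechanics without touching any files, here is a minimal version that only prints the value of the loop variable at each iteration, again using echo:
for VAR in a b
do
    echo "this iteration works on sample ${VAR}"
done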