(finding-things)=
# Finding Things
```{objectives}
- Questions
- How can we find files in complex folders?
- How can we find lines in files?
- Keypoints
- `grep` and `find` can be used to find files
- `grep` can also be used to search in files
```
```{instructor-note}
- Demo/teaching: 15 min
- Exercise: 15 min
```
Being able to find files and folders again is a critical skill for working on UNIX systems.
````{note}
- First we will demonstrate a couple of commands and later we will use these
in an exercise.
- If you want to type-along with the instructors, you can download and extract
the example like this:
```
cd
wget https://gitlab.sigma2.no/training/tutorials/unix-for-hpc/-/raw/master/content/episodes/finding-things/finding-things.tar.gz --no-check-certificate
tar xzvf finding-things.tar.gz
```
````
## Finding files in folders with `find`
You should now have a folder named `finding-things` in your current directory containing
a bunch of files and folders with random names and different file types.
Now we will try to find a couple of different files in this mess.
Let's start with a file called `output.txt` which is in one of the
subfolders.
Our tool is fittingly called `find` and is available on all Unix and Linux
machines.
The general syntax is quite simple:
```console
$ find STARTING_POINT OPTIONS
```
but if we look into its manual page using `man find`, we see that there are
many more parameters and options available.
### Based on full file name
In our case, we know the folder in which to look (the 'STARTING_POINT') and the
exact name, so the command becomes:
```console
$ find finding-things -name output.txt
```
After pressing enter we will see the following result
```console
finding-things/godcjrbv/output.txt
```
which means that `find` successfully found the file named `output.txt` in the folder `finding-things/godcjrbv`.
If, for some reason, we are not sure about the capitalization of our file (was
it 'output.txt', 'Output.txt', 'OUTPUT.txt' or something else), we can use
`-iname` instead. This way we ignore the case of the search term:
```console
$ find finding-things -iname output.txt
```
The result of this command is in our case:
```console
finding-things/godcjrbv/output.txt
finding-things/aivxievn/feipoogd/OUTPUT.txt
```
so we see that there were actually two files with the same name, in different subfolders and with different capitalisation. Using the `-iname` option ensured that we found both.
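To try the difference between `-name` and `-iname` without the course archive, the following minimal sketch builds a throwaway directory tree. The folder names `run1` and `run2` are made up for this illustration and are not part of the example data.

```bash
# Build a small scratch tree (all names here are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/run1" "$demo/run2"
touch "$demo/run1/output.txt" "$demo/run2/OUTPUT.txt"

# -name is case-sensitive: only the lower-case file matches.
exact=$(find "$demo" -name output.txt)

# -iname ignores case: both files match.
anycase=$(find "$demo" -iname output.txt | LC_ALL=C sort)

echo "$exact"
echo "$anycase"

rm -rf "$demo"   # remove the scratch tree again
```

The first `echo` should print one path and the second one two paths.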
### Based on partial file name
Let's imagine that we forgot the name of a data file we want to use but we
remember it is a `.csv` file. We could look through each folder separately or
try out all names with a `.csv` suffix that we can think of.
Instead we can use find in combination with wildcard characters ([see here for more
information](https://linuxhint.com/bash_wildcard_tutorial/)). The two most
useful wildcards are the asterisk (\*) and the question mark (?).
The asterisk matches zero or more characters while the question mark represents
any _single_ character.
So if we want to list all `.csv` files, we can use the asterisk like this:
```console
$ find finding-things -iname '*.csv'
```
The output from this command is
```console
finding-things/aivxievn/qbafmtbq/data10.csv
finding-things/aivxievn/qbafmtbq/data1.csv
finding-things/aivxievn/qbafmtbq/data11.csv
finding-things/aivxievn/qbafmtbq/data2.csv
finding-things/aivxievn/qbafmtbq/data14.csv
finding-things/aivxievn/qbafmtbq/data13.csv
finding-things/aivxievn/qbafmtbq/data4.csv
finding-things/aivxievn/qbafmtbq/data3.csv
finding-things/aivxievn/feipoogd/data.csv
```
We use `-iname` here to also match `.CSV` files (we don't want to miss them
unintentionally). The quotation marks make sure that `*.csv` is interpreted
by `find` and not expanded by the shell (bash) itself.
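The effect of the two wildcards and of the quotation marks can be checked with a small experiment. This is only a sketch in a scratch directory with hypothetical file names; it assumes that exactly one `.csv` file sits in the top folder, so that the unquoted pattern expands to a single name.

```bash
# Scratch directory with made-up file names.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/data1.csv" "$demo/sub/data2.csv" "$demo/sub/data10.csv"

# '?' stands for exactly one character, so data10.csv is not matched.
single=$(find "$demo" -name 'data?.csv' | LC_ALL=C sort)

# Quoted pattern: find receives *.csv and applies it in every folder.
quoted=$(cd "$demo" && find . -name '*.csv' | LC_ALL=C sort)

# Unquoted pattern: the shell expands *.csv to data1.csv before find
# starts, so find only looks for files with exactly that name.
unquoted=$(cd "$demo" && find . -name *.csv)

echo "$single"
echo "$quoted"
echo "$unquoted"

rm -rf "$demo"
```

With the quotes, all three files are found; without them, only `./data1.csv` is reported. If several `.csv` files were present in the current folder, the unquoted form would typically fail with an error instead.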
### Based on other attributes
`find` can not only be used to look for files based on their name but also on
other properties like file size or access/modification date.
This can be useful to display only files which have been changed or created in
the last minutes or hours.
For example, let's create a new empty file with
```console
$ touch finding-things/new_file.txt
```
This file is now much newer than the other files in the folder and we can find
it with either the `-amin` or the `-mmin` option, depending on whether
we are looking for access or modification time, respectively.
So to find all files that were modified less than 10 minutes ago, you can do:
```console
$ find finding-things -mmin -10
```
resulting in the output
```console
finding-things
finding-things/new_file.txt
```
As expected, we got all files and folders that have been modified (in our case,
created) less than 10 min ago. The folder `finding-things` itself is listed because creating a file inside it changed its modification time. Your output might look different depending on when
you created the `finding-things` folder. By contrast, `-mmin +10` would show all
files that were modified more than 10 min ago.
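A self-contained way to try both directions is sketched below. It uses a scratch directory with hypothetical file names, sets an old modification time with `touch -t`, and adds `-type f` so that only regular files are listed and not the folder itself.

```bash
demo=$(mktemp -d)
touch "$demo/new_file.txt"                  # modified just now
touch -t 202001010000 "$demo/old_file.txt"  # modification time set to 1 Jan 2020

# -type f restricts the result to regular files, so the folder
# (which was also modified just now) is not listed.
recent=$(find "$demo" -type f -mmin -10)    # modified less than 10 min ago
older=$(find "$demo" -type f -mmin +10)     # modified more than 10 min ago

echo "recent: $recent"
echo "older:  $older"

rm -rf "$demo"
```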
Another property that is sometimes useful for differentiating files is their
size. If we want to see all files in our `finding-things` directory larger than 100
kB, we can use
```console
$ find finding-things -size +100k
```
which gives the result:
```console
finding-things/aivxievn/qbafmtbq/data10.csv
finding-things/aivxievn/qbafmtbq/data1.csv
finding-things/aivxievn/qbafmtbq/plot.jpg
finding-things/aivxievn/qbafmtbq/data11.csv
finding-things/aivxievn/qbafmtbq/data2.csv
finding-things/aivxievn/qbafmtbq/data14.csv
finding-things/aivxievn/qbafmtbq/data13.csv
finding-things/aivxievn/qbafmtbq/data4.csv
finding-things/aivxievn/qbafmtbq/data3.csv
```
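Other possible values are, for example, `+1M` (larger than 1 MB) or `+2G`
(larger than 2 GB). Be careful with the `-` prefix: at least GNU `find` rounds file sizes up to whole units before comparing, so `-1M` matches only empty files, while `-1024k` finds files smaller than about 1 MB.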
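The size tests are easiest to check with files of known size. The sketch below creates three hypothetical files with `dd` and compares a few `-size` expressions. It assumes the rounding behaviour documented for GNU `find`, where sizes are rounded up to the next whole unit before the comparison.

```bash
demo=$(mktemp -d)
: > "$demo/empty.dat"                                                # 0 bytes
dd if=/dev/zero of="$demo/small.dat" bs=1024 count=50  2>/dev/null   # 50 KiB
dd if=/dev/zero of="$demo/large.dat" bs=1024 count=150 2>/dev/null   # 150 KiB

# Larger than 100 units of 1024 bytes: only large.dat.
larger=$(find "$demo" -type f -size +100k)

# Sizes are rounded up to whole units, so every non-empty file counts
# as at least 1M and only the empty file is "less than 1M".
below_1m=$(find "$demo" -type f -size -1M)

# Comparing in bytes (suffix c) avoids the rounding: all three files match.
below_bytes=$(find "$demo" -type f -size -1048576c | LC_ALL=C sort)

echo "$larger"
echo "$below_1m"
echo "$below_bytes"

rm -rf "$demo"
```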
### Combining attributes
To unlock the full potential of `find`, it is possible to combine different attributes
to search for files very precisely. If we, for example, want to find all `.csv`
files larger than 200 kB, we can use:
```console
$ find finding-things -name '*.csv' -size +200k
```
There was one file with those attributes, namely
```console
finding-things/aivxievn/qbafmtbq/data13.csv
```
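Tests that are listed one after another are combined with a logical AND, so a file is only printed if it passes all of them. A minimal sketch with hypothetical files of known size:

```bash
demo=$(mktemp -d)
dd if=/dev/zero of="$demo/small.csv" bs=1024 count=10  2>/dev/null   # 10 KiB
dd if=/dev/zero of="$demo/big.csv"   bs=1024 count=300 2>/dev/null   # 300 KiB
dd if=/dev/zero of="$demo/big.jpg"   bs=1024 count=300 2>/dev/null   # 300 KiB

# small.csv fails the size test, big.jpg fails the name test,
# so only big.csv passes both.
both=$(find "$demo" -type f -name '*.csv' -size +200k)

echo "$both"

rm -rf "$demo"
```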
## Finding lines in a file with `grep`
While `find` is useful for finding files based on their names and other
properties, `grep` lets us find things within files.
The basic syntax is `grep whatToFind fileToSearch`; there are many options for more
advanced searches, see the man page (`man grep`).
### Finding lines in a specific file
We got a list of genes from a colleague that we want to analyse for a project.
First let's find the list with `find` and have a look at its layout with
`head`.
```console
$ find finding-things -name 'genelist.tsv'
```
From that command we found the path to this file, namely
```console
finding-things/genelist.tsv
```
We use this path to look at the first 5 lines of the file, just to get an overview of what the file looks like.
```console
$ head -n5 finding-things/genelist.tsv
```
which results in
```console
Entry Gene names Organism
Q14914 PTGR1 LTB4DH Homo sapiens (Human)
Q6GMI9 uxs1 zgc:91980 Danio rerio (Zebrafish) (Brachydanio rerio)
O75452 RDH16 RODH4 SDR9C8 Homo sapiens (Human)
O23530 SNC1 BAL At4g16890 dl4475c FCAALL.51 Arabidopsis thaliana (Mouse-ear cress)
```
We see that the file contains genes organised in three columns, the last
describing the organism. Assuming we first want to list all genes from rats, we
can use grep like this:
```console
$ grep rat finding-things/genelist.tsv
```
This returns one result:
```console
O70351 Hsd17b10 Erab rattus norvegicus (rat)
```
But from our colleague's email we know that there
should be two rat genes, so we try the `-i` option to ignore capitalization:
```console
$ grep -i rat finding-things/genelist.tsv
```
This time we get both genes:
```console
O08699 Hpgd Pgdh1 Rattus norvegicus (Rat)
O70351 Hsd17b10 Erab rattus norvegicus (rat)
```
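The same comparison can be reproduced on a tiny, made-up gene list. The sketch below writes a hypothetical tab-separated file (`genes.tsv` is an illustrative name) and uses the `-c` option of `grep`, which prints the number of matching lines instead of the lines themselves.

```bash
demo=$(mktemp -d)

# Hypothetical three-column, tab-separated gene list.
printf '%s\t%s\t%s\n' \
  'Entry'  'Gene names'    'Organism' \
  'Q14914' 'PTGR1 LTB4DH'  'Homo sapiens (Human)' \
  'O08699' 'Hpgd Pgdh1'    'Rattus norvegicus (Rat)' \
  'O70351' 'Hsd17b10 Erab' 'rattus norvegicus (rat)' > "$demo/genes.tsv"

# -c counts matching lines; -i additionally ignores capitalization.
sensitive=$(grep -c rat "$demo/genes.tsv")
insensitive=$(grep -ci rat "$demo/genes.tsv")

echo "case-sensitive: $sensitive, case-insensitive: $insensitive"

rm -rf "$demo"
```

Keep in mind that `grep` matches substrings, so a short pattern such as `rat` would also match any unrelated word that happens to contain these letters.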
### Finding lines in any file in a folder and its subfolders
We remembered these genes from a conversation with another colleague.
Fortunately we wrote down some notes. But where is the file with the notes? It
must be somewhere in our folder. So let's look for the gene name in all files in the
folder 'finding-things' using the `-r` option of `grep`:
```console
$ grep -r Hsd17b10 finding-things -C 1 -n
```
The result is the following:
```
finding-things/experiment1/notes.txt-5-- Genes which would be interesting to check out, according to Prof Ole Martin:
finding-things/experiment1/notes.txt:6:Hsd17b10, Slc25a51, AtFDH1, Hpgd Pgdh1
finding-things/experiment1/notes.txt-7-- Udon is a thick noodle made from wheat flour
--
finding-things/godcjrbv/sequence.fasta:1:>sp|O70351|HCD2_RAT 3-hydroxyacyl-CoA dehydrogenase type-2 OS=Rattus norvegicus OX=10116 GN=Hsd17b10 PE=1 SV=3
finding-things/godcjrbv/sequence.fasta-2-MAAAVRSVKGLVAVITGGASGLGLSTAKRLVGQGATAVLLDVPNSEGETEAKKLGGNCIF
--
finding-things/genelist.tsv-222-A0A1P8B9L1 FDH AtFDH1 FORMATE DEHYDROGENASE formate dehydrogenase At5g14780 T9L3.80 T9L3_80 Arabidopsis thaliana (Mouse-ear cress)
finding-things/genelist.tsv:223:O70351 Hsd17b10 Erab rattus norvegicus (rat)
finding-things/genelist.tsv-224-F4JGL5 NDB2 NAD(P)H dehydrogenase B2 At4g05020 C17L7.11 Arabidopsis thaliana (Mouse-ear cress)
```
As expected, this returns the gene in 'genelist.tsv', but also the notes file we were
looking for (in addition to another file in which this string occurs). `-n` shows the line number and `-C 1` provides us with some
context by displaying one line before and after the line containing our search
term.
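To experiment with these options in isolation, the following sketch searches a small scratch tree with hypothetical files. In addition to the options used above, it shows `-l`, which lists only the names of the files that contain a match; this can be handy when the matching lines themselves are not of interest.

```bash
demo=$(mktemp -d)
mkdir -p "$demo/experiment"

# Two made-up files; only notes.txt mentions the gene.
printf '%s\n' 'Meeting notes' 'Genes to check: Hsd17b10, Hpgd' 'Next meeting on Friday' \
  > "$demo/experiment/notes.txt"
printf '%s\n' 'Nothing relevant here' > "$demo/readme.txt"

# -r searches all files below the folder, -n prefixes line numbers,
# -C 1 adds one line of context before and after each match.
context=$(grep -rn -C 1 Hsd17b10 "$demo")

# -l prints only the names of files with at least one match.
files=$(grep -rl Hsd17b10 "$demo")

echo "$context"
echo "$files"

rm -rf "$demo"
```

In the context output, the matching line is marked with `:` around the line number, while context lines use `-`.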
## Exercise
`````{exercise} Exercise (15 min)
1. We not only forgot where we saved the output file `output.txt`, but also
the figure we plotted. How can you find the file `graph.jpg`? How do we
look for this file only in the folder `experiment1` and its subfolders?
2. Now we want to find a plot we saved somewhere but we don't remember which
file format we used (`.jpg`, `.bmp`, `.png` or something else).
How could you find a file with the name 'plot' but unknown suffix?
3. Let's assume that we can't remember if our output file is of type `.jpg`
or `.JPG` but we know that it is larger than 500 kB.
How can we find it?
4. The notes file also contained the gene 'AtFDH1' that might be interesting.
From which organism is that gene in the 'genelist.tsv' file?
5. How can we find the file containing the sequence of gene 'Hsd17b10'? It is
saved somewhere in our folder but not properly labeled.
````{solution}
1. If we want to look in all subfolders of our base folder `finding-things`, we can
use:
```console
$ find finding-things -name graph.jpg
```
If we only want to look in the subfolder `experiment1`, we can modify the
starting point of the search.
```console
$ find finding-things/experiment1 -name graph.jpg
```
2. One possibility to look for a file with an unknown suffix is to use a
wildcard:
```console
$ find finding-things -name 'plot.*'
```
Don't forget the quotation marks around `plot.*`.
3. You can combine multiple search options. First, remember to use `-iname` if
you want to search for files without taking capitalization into
account. Second, the `-size` parameter lets you find files based on their size.
```console
$ find finding-things -iname '*.jpg' -size +500k
```
4. Use grep to search for our gene name in the right file:
```console
$ grep AtFDH1 finding-things/genelist.tsv
```
5. Use the `-r` option to recursively look for the search phrase in all files
in the specified folder, in our case the `finding-things` folder.
```console
$ grep -r Hsd17b10 finding-things
```
````
`````
```{keypoints}
- It is easy to lose track of things in large folders with many subfolders and
files.
- Both `grep` and `find` enable us to find files (again).
- Both commands offer many options to find exactly what you are looking for;
check out their man pages with `man find` and `man grep`.
```