GEO Data Mining: A Personal Journey

Introduction

After attending one of Jimmy's talks, I decided to learn GEO data mining. I watched one of his videos, found it convincing, and then looked up the data mining posts he wrote last year. I did not expect the first step to be so challenging, but I eventually worked through it. This article shares that experience.

GTF: A Gene Annotation File

GTF (Gene Transfer Format) is a tab-delimited file format for gene annotation. Each record describes a feature such as a gene or a transcript, and its ninth column carries attributes such as the gene ID, the gene name, and the gene type. This article shows how to derive, from a current GTF file, a table that maps each gene to its type.
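To make the layout concrete, here is a minimal sketch built on a made-up GENCODE-style record. The gene ID, coordinates, and values are illustrative and not taken from a real file, and sample_gtf_line is a hypothetical helper. It prints the ninth column and then splits it into its key-value pairs.

```shell
# A made-up GENCODE-style gene record: 9 tab-separated columns (illustrative values only).
sample_gtf_line() {
  printf 'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG_FAKE.1"; gene_type "protein_coding"; gene_name "FAKE1";\n'
}

# Column 9 holds the attributes as `key "value";` pairs.
sample_gtf_line | cut -f9

# Put each key-value pair on its own line, dropping leading spaces and empty lines.
sample_gtf_line | cut -f9 | tr ';' '\n' | sed 's/^ *//' | grep -v '^$'
```

The conversion commands later in this article work on exactly this ninth column, after stripping the quotes and semicolons.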

Downloading the GTF File

To begin, I downloaded a GTF file for the human genome, GRCh38.96.gtf.gz, decompressed it in the shell, and compared it with the file shown in Jimmy's original post. The format did not match the one in the post, and I suspected that this was because I had a newer release of the annotation.

Converting the GTF File

I decided to convert the GTF file to the gene2type format. I used the following shell commands to achieve this:

awk '{if(!NF || /^#/){next}}1' public/reference/gtf/gencode/gencode.v25lift37.annotation.gtf | cut -f9 | sed 's/\"//g' | sed 's/;//g' | awk '{print $4"\t"$8}' | awk '{if(/^E/){next}}1' | awk '{print $2"\t"$1}' | sort -k 1 | uniq > gencode.v25lift37.annotation.gtf.gene2type

However, the result was clearly wrong. The output contained too few genes (only 7641), and many entries in the type column did not look like gene types at all. I suspected that the command did not fit this file.
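A quick way to spot this kind of problem is to count the distinct genes and tabulate the type column. The sketch below is one way to do that. summarize_gene2type is a hypothetical helper, and the demo file is made up (including the odd type value 2), so with real data you would pass the .gene2type file instead.

```shell
# Summarize a two-column (gene name, gene type) tab-separated file:
# prints the number of distinct gene names, then each type with its count.
summarize_gene2type() {
  printf 'genes\t%s\n' "$(cut -f1 "$1" | sort -u | wc -l | tr -d ' ')"
  cut -f2 "$1" | sort | uniq -c | sort -k1,1nr | awk '{print $2 "\t" $1}'
}

# Tiny made-up example file; the last row carries a value that is not a gene type.
demo=$(mktemp)
printf 'FAKE1\tprotein_coding\nFAKE2\tprotein_coding\nFAKE3\tlincRNA\nFAKE4\t2\n' > "$demo"
summarize_gene2type "$demo"
```

If the gene count is far below what you expect, or the type list contains values that are obviously not gene types, the field positions in the conversion command probably do not match the file.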

Finding the Correct GTF File

After carefully studying the download site, I found that Jimmy's original post used an older release of the GTF file. I downloaded the current release, gencode.v30lift37.annotation.gtf, and compared the two formats. The newer release no longer contains the gene_status field, but all other fields are the same. With one attribute fewer, every field after it moves up by two positions once the quotes and semicolons are stripped.
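The effect of the missing field can be sketched with two made-up attribute strings. The key order below is an assumption inferred from the commands in this article, the values are illustrative, and show_fields is a hypothetical helper.

```shell
# Strip quotes and semicolons from an attribute string, then print
# whitespace-separated fields 4, 6 and 8 with labels.
show_fields() {
  printf '%s\n' "$1" | tr -d '";' | awk '{print "$4=" $4, "$6=" $6, "$8=" $8}'
}

# Made-up attribute strings: the older layout with gene_status, the newer one without it.
old='gene_id "ENSG_FAKE.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAKE1"; level 2;'
new='gene_id "ENSG_FAKE.1"; gene_type "protein_coding"; gene_name "FAKE1"; level 2;'

show_fields "$old"
show_fields "$new"
```

In the older layout the gene name sits in field 8, while in the newer layout it sits in field 6 and field 8 holds an unrelated value.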

Converting the Correct GTF File

I then ran the conversion on the correct GTF file, changing the second printed field from $8 to $6:

awk '{if(!NF || /^#/){next}}1' gencode.v30lift37.annotation.gtf | cut -f9 | sed 's/\"//g' | sed 's/;//g' | awk '{print $4"\t"$6}' | awk '{if(/^E/){next}}1' | awk '{print $2"\t"$1}' | sort -k 1 | uniq > gencode.v30new.annotation.gtf.gene2type

This time the conversion worked, and I got the expected gene2type file.
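Because the positional approach breaks whenever a release adds or drops an attribute, one possible alternative is to look attributes up by key. The sketch below is not the method from Jimmy's post. It assumes the GENCODE key names gene_name and gene_type, keeps only records whose third column is gene, and uses made-up sample records. gene2type_by_key and sample_gtf are hypothetical helpers.

```shell
# Print "gene_name<TAB>gene_type" for every gene record read from stdin,
# looking attributes up by key rather than by position.
gene2type_by_key() {
  awk -F '\t' '
    # Return the quoted value that follows key in the attribute string, or "" if absent.
    function attr(s, key,    re) {
      re = key " \"[^\"]*\""
      if (match(s, re) == 0) return ""
      return substr(s, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
    }
    /^#/ || NF < 9 { next }   # skip comment lines and lines without 9 columns
    $3 == "gene" {
      gname = attr($9, "gene_name")
      gtype = attr($9, "gene_type")
      if (gname != "" && gtype != "") print gname "\t" gtype
    }
  ' | sort -u
}

# Made-up records: one gene with gene_status, one without, plus a transcript line.
sample_gtf() {
  printf '##description: made-up example\n'
  printf 'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG_FAKE.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAKE1"; level 2;\n'
  printf 'chr1\tHAVANA\ttranscript\t100\t200\t.\t+\t.\tgene_id "ENSG_FAKE.1"; transcript_id "ENST_FAKE.1"; gene_type "protein_coding"; gene_name "FAKE1";\n'
  printf 'chr2\tHAVANA\tgene\t300\t400\t.\t-\t.\tgene_id "ENSG_FAKE.2"; gene_type "lincRNA"; gene_name "FAKE2"; level 2;\n'
}

sample_gtf | gene2type_by_key
```

With a real file you would pipe the decompressed GTF into gene2type_by_key and redirect the output to a .gene2type file. The same function handles both layouts, with or without gene_status.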

Conclusion

My first step into GEO data mining was challenging, but I got through it. The main lesson is to study the download site carefully and compare file versions before running a conversion, because a command written for one release may not fit another. I also learned that perseverance and continued exploration matter when things go wrong. Along the way, I learned how to use Markdown and shell commands to convert files and share my experience with others.

Acknowledgement

I would like to thank Jimmy for his guidance and support throughout this journey. I would also like to thank the community for their feedback and suggestions.