Introduction to Data Management on High-Performance Computing Systems
High-Performance Computing (HPC) systems have become essential tools for Next-Generation Sequencing (NGS) data analysis. These powerful computing environments allow researchers to process and analyze massive genomic datasets that would be impossible to handle on standard desktop computers. However, working effectively with HPC systems requires understanding how to properly store, transfer, and share the large data files involved in NGS analysis.
In this tutorial, we’ll explore the fundamental aspects of data management on HPC systems, focusing specifically on the needs of NGS researchers with limited programming background. We’ll cover everything from basic storage concepts to secure data transfer methods across different operating systems.
Understanding HPC Storage Architecture
Before diving into specific tools and commands, it’s important to understand how storage typically works on HPC systems. Most HPC environments have several different storage locations, each with different purposes:
Home Directory
Your home directory (often accessed via `~/` or `/home/username/`) is where you'll find yourself when you first log in. This space:
- Is relatively small (often 50-100GB quota)
- Is usually backed up regularly
- Is suitable for scripts, small configuration files, and important results
- Is NOT suitable for raw NGS data or intermediate files
Project/Work Directory
Many HPC systems have a larger shared space for project data (accessed via paths like `/project/` or `/work/`). This space:
- Has much larger quotas (terabytes)
- May have some backup protection
- Is suitable for important processed data and results
- May be shared with other project members
Scratch Directory
The scratch space (often `/scratch/` or similar) is designed for temporary storage:
- Has very large capacity but may have automatic file deletion policies
- Has no backup protection
- Is optimized for high-speed I/O operations
- Is perfect for raw NGS data and intermediate files during processing
Understanding this structure is crucial because placing your NGS data files in the right location can significantly impact both performance and data security.
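Exact paths and quota tools vary from site to site, so it's worth probing your own system when you first log in. Here is a minimal sketch using standard Linux utilities (the `lfs quota` variant applies only to Lustre filesystems, a common choice for HPC scratch space):
# List mounted filesystems with their sizes and current usage
df -h
# Check your quota (the exact command name varies by site)
quota -s
# On Lustre-based scratch filesystems, this variant often works
lfs quota -h -u $USER /scratch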
Data Transfer and Download Tools for HPC
Downloading Public NGS Data Directly to HPC
When working with public NGS data, downloading directly to your HPC system saves time and prevents the need for multiple transfers. Here are the most common tools for this purpose:
SRA Toolkit
The SRA Toolkit is essential for downloading data from NCBI's Sequence Read Archive (SRA). Check if it's already installed on your HPC system by running the `module avail` command. If not available, you can easily install it using Conda following the steps outlined in the previous tutorial.
# Load the SRA toolkit module
module load sra-toolkit
# Download a FASTQ file by its accession number
fasterq-dump SRR28119110
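For large accessions, it is often more reliable to fetch the compressed SRA archive first with prefetch and then convert it locally. A sketch, where the thread count (-e) and output directory (-O) are illustrative values to adapt to your system:
# Fetch the SRA archive first, then convert it to FASTQ
prefetch SRR28119110
fasterq-dump --split-files -e 8 -O /scratch/username/raw_data SRR28119110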
wget and curl
For downloading data from other repositories or direct URLs, `wget` and `curl` are invaluable:
# Using wget to download a file
wget https://example.com/path/to/sequence_data.fastq.gz
# Using curl to download a file
curl -O https://example.com/path/to/sequence_data.fastq.gz
The key difference: `wget` is more robust for unstable connections since it automatically retries and can resume interrupted downloads, while `curl` offers more options for complex download scenarios.
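If a large download does get interrupted, both tools can pick up where they left off:
# Resume an interrupted download with wget
wget -c https://example.com/path/to/sequence_data.fastq.gz
# Resume with curl (-C - auto-detects how much was already downloaded)
curl -C - -O https://example.com/path/to/sequence_data.fastq.gz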
Aspera Connect
For faster downloads from repositories like EBI’s European Nucleotide Archive (ENA):
# Assuming ascp is in your path
ascp -QT -l 300m -P33001 \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR123/SRR12345678/SRR12345678_1.fastq.gz \
/scratch/your_username/
Aspera can be significantly faster than HTTP-based downloads, especially for large NGS datasets, as it uses a proprietary UDP-based protocol.
Data Sharing Between HPC Users
Sharing Within the Same HPC System
When collaborating with others who have access to the same HPC system, you have several options:
File Permissions
The simplest way to share data is by adjusting file permissions:
# Make a directory readable by everyone in your group
mkdir shared_data
chmod g+rx shared_data
# Make all files within the directory readable by your group
chmod -R g+r shared_data/*
Here, `g+rx` adds read and execute permissions for your group to the directory, while `g+r` adds read permission to all files inside.
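When group permissions are too coarse (for example, you want to share with one user who is not in your group), many HPC filesystems support POSIX ACLs. A sketch, assuming ACLs are enabled on your filesystem and collaborator_username is a placeholder for the account you want to grant access to:
# Grant a specific user read access to the whole directory tree
# (capital X adds execute only where appropriate, i.e., on directories)
setfacl -R -m u:collaborator_username:rX shared_data
# Review the ACLs now in place
getfacl shared_data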
Shared Project Spaces
Many HPC systems offer project spaces specifically designed for collaboration:
# Create a directory in the project space
mkdir /project/your_project_id/shared_ngs_data
# Set appropriate permissions
chmod 750 /project/your_project_id/shared_ngs_data
The permission `750` means the owner has full access, group members can read and execute, while others have no access.
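A useful refinement is the setgid bit, which makes files created inside the directory inherit the project group automatically, so permissions stay consistent as collaborators add data:
# Add the setgid bit to the shared directory
chmod g+s /project/your_project_id/shared_ngs_data
# Equivalent octal form (the leading 2 is the setgid bit)
chmod 2750 /project/your_project_id/shared_ngs_data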
Sharing With External Collaborators
When collaborating with colleagues who lack access to your High-Performance Computing (HPC) environment, several specialized tools can facilitate efficient and secure data sharing. Below are three leading solutions:
Globus

Globus is a powerful tool designed specifically for research data transfer. Accessed through a web interface or endpoint client, it provides a user-friendly experience while managing complex, large-scale data transfers securely. Many institutions maintain Globus endpoints, making data sharing as simple as selecting files and specifying the recipient’s endpoint. For setup assistance, contact your institution’s IT team.
Aspera

IBM Aspera is a high-speed data transfer solution designed for moving large research datasets efficiently. Accessible through web interfaces, desktop clients, or command-line tools, Aspera uses patented FASP® protocol technology to maximize transfer speeds regardless of network conditions or physical distance. Many research institutions and commercial enterprises maintain Aspera servers that enable secure data sharing with collaborators worldwide. Contact your organization’s IT department to determine if Aspera services are available or to establish a new deployment for your research needs.
Box

Box is a secure cloud content management platform widely adopted by research institutions. It offers comprehensive web interfaces, desktop synchronization clients, and mobile applications for accessing research data from anywhere. Box provides enterprise-grade security features including encryption, access controls, and compliance certifications that protect sensitive research data. Many academic institutions have enterprise Box agreements that provide researchers with enhanced storage quotas and collaboration features. Box’s strengths include its intuitive interface, robust sharing permissions, and integration with numerous research and productivity applications. Check with your institution’s IT department about available Box accounts and any specific institutional policies for research data storage.
Data Integrity Checking with MD5 Checksums
Understanding Data Integrity for NGS Files
NGS data files are typically very large—often reaching tens or hundreds of gigabytes. When transferring such massive files between systems, there’s always a risk of data corruption, which could lead to invalid analysis results. This is why verifying data integrity is a critical step in NGS data management.
What is MD5 and How Does it Work?
MD5 (Message Digest Algorithm 5) is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, typically expressed as a 32-character hexadecimal number. When applied to a file, MD5 generates a practically unique “fingerprint” or “checksum” that changes if even a single bit in the file is altered.
Here’s how the process works:
- The MD5 algorithm reads the entire file, processing it in fixed-size blocks
- It processes this data through a complex mathematical function
- The result is a fixed-length string (the checksum) that uniquely represents the file’s contents
- If the file changes in any way, the resulting MD5 checksum will be completely different
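You can see this sensitivity for yourself: two inputs differing by a single character produce completely unrelated checksums.
# Compare the checksums of two nearly identical inputs
echo "ACGT" | md5sum
echo "ACGA" | md5sum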
Creating MD5 Checksums for NGS Files
Before transferring NGS data, you should generate checksums for your files:
# Generate MD5 checksum for a single file
md5sum large_genome.fastq.gz > large_genome.fastq.gz.md5
# Generate MD5 checksums for multiple files
md5sum *.fastq.gz > fastq_files.md5
# View the contents of an MD5 file
cat large_genome.fastq.gz.md5
# Output: 7d9c92e0e1e2c9d248e6088fd9be8daf large_genome.fastq.gz
In many public repositories, MD5 checksums are provided alongside the data files. For example, when downloading from the European Nucleotide Archive (ENA), you'll often find `.md5` files that contain the expected checksums.
Verifying Data Integrity After Transfer
After transferring files to or from your HPC system, you should verify their integrity:
# Verify a single file against its MD5 file
md5sum -c large_genome.fastq.gz.md5
# Output: large_genome.fastq.gz: OK
# Verify multiple files at once
md5sum -c fastq_files.md5
# Output:
# sample1.fastq.gz: OK
# sample2.fastq.gz: OK
# sample3.fastq.gz: FAILED
# md5sum: WARNING: 1 computed checksum did NOT match
The `-c` flag tells `md5sum` to check the files against the checksums in the specified file. If a file's current checksum matches the one in the MD5 file, you'll see “OK.” If not, you'll see “FAILED,” indicating the file may be corrupted.
MD5 vs. Other Checksum Algorithms
While MD5 is commonly used for file integrity checking in bioinformatics, it’s worth noting that more secure alternatives exist:
# Using SHA-256 instead of MD5
sha256sum large_genome.fastq.gz > large_genome.fastq.gz.sha256
sha256sum -c large_genome.fastq.gz.sha256
# Using SHA-1
sha1sum large_genome.fastq.gz > large_genome.fastq.gz.sha1
sha1sum -c large_genome.fastq.gz.sha1
MD5 remains popular due to its speed and widespread support, while SHA-256 provides far stronger guarantees against collisions (where different files produce the same checksum), particularly deliberately engineered ones. For detecting accidental corruption in NGS data, MD5 is generally sufficient.
Practical Examples for NGS Workflows
Example 1: Downloading and Verifying Reference Genomes
When downloading reference genomes, always verify their integrity:
# Download a reference genome and its MD5 file
wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.md5
# Verify the downloaded genome
md5sum -c Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.md5
# Output: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz: OK
Example 2: Transferring and Verifying Your Own NGS Data
When moving your sequencing data to HPC:
# On your local machine, before transferring
cd /path/to/sequencing_data
md5sum *.fastq.gz > my_project_md5sums.md5
# Transfer files to HPC (including the MD5 file)
scp *.fastq.gz my_project_md5sums.md5 username@hpc.example.edu:/scratch/username/project/
# On HPC, after transferring
cd /scratch/username/project/
md5sum -c my_project_md5sums.md5
Transferring Data Between HPC and Personal Computers
Command-Line Transfer Tools
Using scp (Secure Copy)
The `scp` command is a straightforward tool for secure file transfers:
# From local to HPC
scp large_genome.fa username@hpc.example.edu:/scratch/username/
# From HPC to local
scp username@hpc.example.edu:/scratch/username/results.bam ./
# Transfer an entire directory
scp -r local_folder username@hpc.example.edu:/scratch/username/
Using SFTP (Secure File Transfer Protocol)
SFTP provides an interactive file transfer session with more flexibility than scp:
# Start an SFTP session
sftp username@hpc.example.edu
# Once connected, you can use various commands:
sftp> pwd # Show current remote directory
sftp> lpwd # Show current local directory
sftp> lls # List files in the local directory
sftp> cd /scratch # Change remote directory
sftp> lcd ~/Downloads # Change local directory
sftp> get results.bam # Download a file
sftp> put sequence.fa # Upload a file
sftp> mget *.fastq.gz # Download multiple files
sftp> mkdir new_dir # Create a directory on remote
sftp> ls -la # List remote files with details
sftp> exit # Close the connection
SFTP is particularly useful when you need to perform multiple transfer operations or navigate through directories before deciding what to transfer.
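Recurring SFTP operations can also be scripted in batch mode with the -b flag. A sketch, assuming key-based authentication is set up (batch mode cannot prompt for a password):
# Create a batch file of SFTP commands
cat > fetch_results.batch << 'EOF'
cd /scratch/username/project
get results.bam
EOF
# Run the commands non-interactively
sftp -b fetch_results.batch username@hpc.example.edu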
Using LFTP (Enhanced FTP/SFTP Client)
LFTP is a sophisticated file transfer program that supports multiple protocols:
# Connect to an HPC system using SFTP protocol
lftp sftp://username@hpc.example.edu
# Once connected, you can use enhanced commands:
lftp username@hpc.example.edu:~> mirror -R local_dir remote_dir # Upload a directory recursively
lftp username@hpc.example.edu:~> mirror remote_dir local_dir # Download a directory recursively
lftp username@hpc.example.edu:~> queue put large_file1.bam # Add file to transfer queue
lftp username@hpc.example.edu:~> queue put large_file2.bam # Add another file
lftp username@hpc.example.edu:~> queue start # Start the queued transfers
lftp username@hpc.example.edu:~> pget -n 4 huge_genome.fa # Download with 4 parallel connections
lftp username@hpc.example.edu:~> exit # Close the connection
LFTP excels with features like:
- Parallel transfers to maximize bandwidth
- Transfer queuing for batching operations
- Robust handling of unstable connections
- Ability to limit bandwidth usage
- Background transfer capabilities
Example of a background transfer:
# Start LFTP and put it in the background
lftp -c "open sftp://username@hpc.example.edu; \
mirror -v /scratch/username/results ~/local_results; \
quit"
This is extremely useful for transferring large NGS datasets that might take hours to complete.
Using rsync
For more robust transfers, especially with large datasets:
# Sync a local directory to HPC
rsync -avz --progress ~/my_project/ username@hpc.example.edu:/scratch/username/my_project/
# Sync from HPC to local with compression
rsync -avz --progress username@hpc.example.edu:/scratch/username/results/ ~/local_results/
The flags are important:
- `-a` preserves file attributes
- `-v` provides verbose output
- `-z` compresses data during transfer
- `--progress` shows progress during transfer
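Before launching a large sync, a dry run with -n (--dry-run) lists what would be transferred without moving any data, a cheap safeguard against path mistakes:
# Preview the transfer without copying anything
rsync -avzn --progress ~/my_project/ username@hpc.example.edu:/scratch/username/my_project/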
Graphical Interface Tools for Data Transfer
For those who prefer visual interfaces over command-line tools, several excellent options are available:
FileZilla (Cross-Platform)

FileZilla is a popular, free SFTP client with an intuitive interface:
- Installation: Download from https://filezilla-project.org/ and install
- Connection Setup:
- Click “File > Site Manager > New Site”
- Enter a name for your connection
- Set Protocol to “SFTP – SSH File Transfer Protocol”
- Enter Host: your HPC address (e.g., hpc.example.edu)
- Enter your username and password
- Click “Connect”
- Using FileZilla:
- The left pane shows your local files
- The right pane shows remote (HPC) files
- Navigate to source and destination folders
- Drag and drop files between panes
- Right-click for additional options
- View transfer progress at the bottom
FileZilla is excellent for researchers who need a visual representation of their file structure and prefer drag-and-drop operations.
Cyberduck (macOS and Windows)

Cyberduck offers a clean interface with bookmarking capabilities:
- Installation: Download from https://cyberduck.io/ and install
- Connection Setup:
- Click “Open Connection”
- Select SFTP from the dropdown
- Enter server, username, and password
- Click “Connect”
- Using Cyberduck:
- Browse through the file hierarchy
- Double-click to download files (they’ll go to your Downloads folder)
- Use the upload button to select files to transfer
- Right-click files for options like “Edit With” to modify remote files directly
- Create bookmarks for frequent connections
Cyberduck’s ability to edit remote files directly with local applications is particularly useful for NGS analysis scripts.
WinSCP (Windows)

WinSCP is a popular open-source SFTP client and file manager for Windows:
- Installation: Download from https://winscp.net and install
- Connection Setup:
- Launch WinSCP and enter your server details in the login dialog
- Input hostname, port number, username, and password
- Select SFTP as the file protocol
- Click “Save” to store the connection for future use
- Using WinSCP:
- Navigate through remote directories in the right panel
- Local files appear in the left panel for easy drag-and-drop transfers
- Transfer files by dragging between panels or using dedicated transfer buttons
- Edit files directly with the built-in editor or configure external editors
- Use the synchronize feature to mirror directories between systems

WinSCP's dual-panel interface and powerful synchronization capabilities make it particularly valuable for maintaining consistent datasets between local and HPC environments.
Data Security Considerations on HPC
Understanding Security Requirements for NGS Data
NGS data often contains sensitive information, especially when derived from human samples. Understanding the security requirements is essential:
Data Classification
Most institutions classify data based on sensitivity:
- Public data: Openly available genomic data like reference genomes
- Restricted data: De-identified human genomic data requiring controlled access
- Confidential data: Identifiable human genomic data with strict security requirements
Each classification will have different storage, transfer, and sharing requirements.
Encryption During Transfer
Always use encrypted transfer protocols:
# Good - uses encryption
scp my_data.bam username@hpc.example.edu:/scratch/username/
sftp username@hpc.example.edu
lftp sftp://username@hpc.example.edu
# Bad - no encryption
ftp example.edu # Avoid for sensitive data
Never use unencrypted protocols like FTP or HTTP for transferring NGS data.
Storage Encryption
Some HPC systems offer encrypted storage options for sensitive data:
# Check if your data is on encrypted storage (this heuristic only works
# if the mount or device name mentions encryption; ask your HPC
# administrators for an authoritative answer)
df -h /path/to/your/data | grep -i encrypt
For highly sensitive data, consult with your HPC administrators about encrypted storage options.
How Security Affects Your Workflows
Security requirements may impact how you work:
- Performance tradeoffs: Encrypted transfers and storage may be slower
- Access limitations: Sensitive data may require working in specific secure environments
- Sharing restrictions: May need formal agreements before sharing data
- Audit requirements: Your access and transfers may be logged for compliance
Always check your institution’s data security policies when working with NGS data, especially human genome sequences.
Best Practices for NGS Data Management on HPC
Organization Strategies
Keeping your data organized is crucial for effective HPC usage:
# Example directory structure
/scratch/username/
├── project_name/
│ ├── raw_data/
│ ├── processed_data/
│ ├── scripts/
│ └── results/
└── README.txt # Document what's stored where
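This entire skeleton can be created in one command using bash brace expansion:
# Create the full project directory structure in one step
mkdir -p /scratch/username/project_name/{raw_data,processed_data,scripts,results}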
Data Lifecycle Management
Plan for the entire lifecycle of your data:
- Acquisition: How will you get the data to HPC?
- Processing: Where will intermediate files be stored?
- Analysis: Where will results be saved?
- Archival: How will you preserve important results?
- Deletion: How will you clean up unnecessary files?
# Example cleanup script for old intermediate files
find /scratch/username/project/intermediate/ -type f -atime +30 -name "*.tmp" -delete
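Since -delete is irreversible, it's prudent to run the same search with -print first and review the list before committing:
# Preview which files would be deleted
find /scratch/username/project/intermediate/ -type f -atime +30 -name "*.tmp" -print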
Automation Tips
Automate repetitive transfer tasks:
# Create an SSH key for password-less transfers
ssh-keygen -t rsa -b 4096
ssh-copy-id username@hpc.example.edu
# Create a simple transfer script
cat > transfer_results.sh << 'EOF'
#!/bin/bash
rsync -avz --progress username@hpc.example.edu:/scratch/username/project/results/ ~/local_results/
EOF
chmod +x transfer_results.sh
For LFTP automation, create scripts for regular transfers:
# Create an LFTP script file
cat > sync_ngs_data.lftp << 'EOF'
open sftp://username@hpc.example.edu
lcd ~/local_ngs_data
cd /scratch/username/ngs_project
mirror -v --only-newer
quit
EOF
# Run the script when needed
lftp -f sync_ngs_data.lftp
This script will download only new files from your HPC project to your local directory.
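On your local machine, such a script can be scheduled with cron. A sketch, assuming key-based SSH authentication so no password prompt blocks the job (the daily 2 AM schedule is illustrative):
# Open your crontab for editing
crontab -e
# Add a line like this to run the sync every day at 2 AM
0 2 * * * lftp -f ~/sync_ngs_data.lftp >> ~/lftp_sync.log 2>&1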
Integrating Data Integrity Checks into Your Workflow
To ensure your NGS data remains intact throughout its lifecycle, integrate MD5 checks at key points:
# Example of integrated download, verification, and transfer workflow
#!/bin/bash
# 1. Download data from repository
echo "Downloading SRA data..."
prefetch SRR28119110
fastq-dump --split-files SRR28119110
# 2. Generate MD5 checksums immediately after download
echo "Generating checksums..."
md5sum SRR28119110*.fastq > ~/downloaded_files.md5
# 3. Transfer to analysis directory
echo "Transferring to analysis directory..."
rsync -avz SRR28119110*.fastq /scratch/username/project/
# 4. Verify files after transfer
echo "Verifying transfer integrity..."
cd /scratch/username/project/
md5sum -c ~/downloaded_files.md5
if [ $? -ne 0 ]; then
echo "ERROR: File integrity check failed!"
exit 1
fi
echo "Files transferred and verified successfully."
By integrating MD5 checks into your workflow, you establish multiple verification points that help identify when and where data corruption might occur.
Common Pitfalls and How to Avoid Them
Storage Quota Issues
Running out of space is a common problem:
# Check your current usage and quota
quota -s
# Find large files that might be wasting space
find /scratch/username -type f -size +1G -exec ls -lh {} \;
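Summarizing usage per top-level directory makes the biggest space consumers obvious at a glance:
# Rank your scratch directories by size, largest last
du -sh /scratch/username/* | sort -h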
Transfer Interruptions
For large NGS files, transfer interruptions are common:
# Instead of scp, use rsync which can resume
rsync -avz --partial --progress large_genome.bam username@hpc.example.edu:/scratch/username/
# Or use LFTP which handles interruptions well
lftp -c "open sftp://username@hpc.example.edu; \
pget -n 8 -c /scratch/username/huge_genome.fa ~/Downloads/"
The `--partial` flag in rsync keeps partially transferred files, allowing resumption. In LFTP, the `-c` flag enables continue/resume of interrupted transfers, while `-n 8` uses 8 parallel connections for faster transfers.
Permission Problems
Incorrect permissions can cause sharing issues:
# Fix permissions recursively
find /project/shared_data -type d -exec chmod 750 {} \;
find /project/shared_data -type f -exec chmod 640 {} \;
Forgetting Data Locations
As you work with multiple projects, tracking data location becomes challenging:
# Create a simple data catalog
cat > ~/data_catalog.txt << 'EOF'
Project: Liver Cancer RNAseq
Raw data: /scratch/username/liver_cancer/raw/
Results: /project/groupname/liver_cancer/results/
Backup: Globus endpoint "Institution DataVault" path "/backups/2023/liver_cancer/"
EOF
Checksum Verification Issues
Common problems with MD5 verification and how to solve them:
- Mismatched checksums: When a file doesn’t match its expected checksum
# Re-download or re-transfer the file
rm corrupted_file.fastq
wget https://source.example.edu/path/to/file.fastq
# Verify again
md5sum -c file.fastq.md5
- Missing MD5 files: When you need to verify files but don’t have a checksum file
# Contact the data provider for the correct checksums
# Or, if the file is shared by a collaborator, ask them to provide MD5s
- MD5 file format issues: Different systems might format MD5 files differently
# If your MD5 verification fails due to format issues, reformat the file
# (md5sum -c expects two spaces between the checksum and the filename):
awk '{print $1 "  " $2}' problematic.md5 > reformatted.md5
Troubleshooting Guide
Transfer Issues
| Problem | Possible Solution |
|---|---|
| “Connection timed out” | Check network, try smaller chunks, use Globus |
| “Permission denied” | Check file permissions, ensure correct username |
| “Disk quota exceeded” | Clean up unnecessary files, request quota increase |
| “File not found” | Verify paths, check for typos |
| “Connection reset by peer” | Try LFTP with auto-retry or use a graphical tool with resume capability |
| “Checksum verification failed” | Re-transfer the file, check if file was modified during transfer |
Command Debugging
When commands don’t work as expected:
# Add verbose flags
scp -v large_file.fastq username@hpc.example.edu:/scratch/username/
# Check system load on HPC
ssh username@hpc.example.edu "uptime"
# Check disk space
ssh username@hpc.example.edu "df -h /scratch"
# Test SFTP connection with debugging
sftp -v username@hpc.example.edu
Graphical Tool Issues
For problems with graphical transfer tools:
- Connection failures:
- Verify you can connect via command line first
- Check if a firewall is blocking the connection
- Try connecting to a different port if your HPC supports it
- Slow transfers:
- Try disabling any antivirus real-time scanning temporarily
- In FileZilla, adjust concurrent transfers (Settings > Transfers)
- In LFTP, increase parallel connections (`pget -n 8`)
- Failed transfers:
- Check local disk space
- Ensure file paths don’t contain special characters
- Try transferring to a different directory first
Conclusion
Effective data management is a fundamental skill for NGS analysis on HPC systems. By understanding the proper tools and techniques for storing, transferring, and sharing your genomic data, you can create more efficient workflows, collaborate more effectively, and ensure the security of sensitive information.
The addition of tools like SFTP and LFTP to your toolkit provides more flexibility for different transfer scenarios, while graphical interfaces make data management more accessible for those less comfortable with command-line operations. Perhaps most importantly, incorporating data integrity verification with MD5 checksums gives you confidence that your NGS data remains intact throughout its lifecycle—from download to analysis to archiving.
Remember that each HPC system has its own specific configuration and policies, so always consult your institution’s HPC documentation for system-specific details. With practice, managing large NGS datasets across systems will become second nature, allowing you to focus on the biological insights hidden within your data.
Data integrity verification isn’t just a best practice—it’s an essential step in ensuring reproducible science. When your analysis depends on terabytes of genomic data, even minor corruptions can lead to misleading results. By incorporating MD5 checksums into your workflow, you’re not just protecting your data; you’re protecting the validity of your scientific conclusions.
Have you faced any specific challenges with managing your NGS data on HPC systems? Feel free to share your experiences or questions in the comments section below!