Introduction to Data Management on High-Performance Computing Systems
High-Performance Computing (HPC) systems have become essential tools for Next-Generation Sequencing (NGS) data analysis. These powerful computing environments allow researchers to process and analyze massive genomic datasets that would be impossible to handle on standard desktop computers. However, working effectively with HPC systems requires understanding how to properly store, transfer, and share the large data files involved in NGS analysis.
In this tutorial, we’ll explore the fundamental aspects of data management on HPC systems, focusing specifically on the needs of NGS researchers with limited programming background. We’ll cover everything from basic storage concepts to secure data transfer methods across different operating systems.
Understanding HPC Storage Architecture
Before diving into specific tools and commands, it’s important to understand how storage typically works on HPC systems. Most HPC environments have several different storage locations, each with different purposes:
Home Directory
Your home directory (often accessed via `~/` or `/home/username/`) is where you'll find yourself when you first log in. This space:
- Is relatively small (often 50-100GB quota)
- Is usually backed up regularly
- Is suitable for scripts, small configuration files, and important results
- Is NOT suitable for raw NGS data or intermediate files
Project/Work Directory
Many HPC systems have a larger shared space for project data (accessed via paths like `/project/` or `/work/`). This space:
- Has much larger quotas (terabytes)
- May have some backup protection
- Is suitable for important processed data and results
- May be shared with other project members
Scratch Directory
The scratch space (often `/scratch/` or similar) is designed for temporary storage:
- Has very large capacity but may have automatic file deletion policies
- Has no backup protection
- Is optimized for high-speed I/O operations
- Is perfect for raw NGS data and intermediate files during processing
Understanding this structure is crucial because placing your NGS data files in the right location can significantly impact both performance and data security.
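Exact paths and quota tools vary from site to site, so it's worth probing your own system when you first log in. Here is a minimal sketch using standard Linux utilities (the `lfs quota` variant applies only to Lustre filesystems, a common choice for HPC scratch space):
# List mounted filesystems with their sizes and current usage
df -h
# Check your quota (the exact command name varies by site)
quota -s
# On Lustre-based scratch filesystems, this variant often works
lfs quota -h -u $USER /scratch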
Data Transfer and Download Tools for HPC
Downloading Public NGS Data Directly to HPC
When working with public NGS data, downloading directly to your HPC system saves time and prevents the need for multiple transfers. Here are the most common tools for this purpose:
SRA Toolkit
The SRA Toolkit is essential for downloading data from NCBI's Sequence Read Archive (SRA). Check if it's already installed on your HPC system by running the `module avail` command. If not available, you can easily install it using Conda following the steps outlined in the previous tutorial.
# Load the SRA toolkit module
module load sra-toolkit
# Download a FASTQ file by its accession number
fasterq-dump SRR28119110
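For large accessions, it is often more reliable to fetch the compressed SRA archive first with prefetch and then convert it locally. A sketch, where the thread count (-e) and output directory (-O) are illustrative values to adapt to your system:
# Fetch the SRA archive first, then convert it to FASTQ
prefetch SRR28119110
fasterq-dump --split-files -e 8 -O /scratch/username/raw_data SRR28119110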
wget and curl
For downloading data from other repositories or direct URLs, `wget` and `curl` are invaluable:
# Using wget to download a file
wget https://example.com/path/to/sequence_data.fastq.gz
# Using curl to download a file
curl -O https://example.com/path/to/sequence_data.fastq.gz
The key difference: `wget` is more robust for unstable connections since it automatically retries and can resume interrupted downloads, while `curl` offers more options for complex download scenarios.
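If a large download does get interrupted, both tools can pick up where they left off:
# Resume an interrupted download with wget
wget -c https://example.com/path/to/sequence_data.fastq.gz
# Resume with curl (-C - auto-detects how much was already downloaded)
curl -C - -O https://example.com/path/to/sequence_data.fastq.gz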
Aspera Connect
For faster downloads from repositories like EBI’s European Nucleotide Archive (ENA):
# Assuming ascp is in your path
ascp -QT -l 300m -P33001 \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR123/SRR12345678/SRR12345678_1.fastq.gz \
/scratch/your_username/
Aspera can be significantly faster than HTTP-based downloads, especially for large NGS datasets, as it uses a proprietary UDP-based protocol.
Data Sharing Between HPC Users
Sharing Within the Same HPC System
When collaborating with others who have access to the same HPC system, you have several options:
File Permissions
The simplest way to share data is by adjusting file permissions:
# Make a directory readable by everyone in your group
mkdir shared_data
chmod g+rx shared_data
# Make all files within the directory readable by your group
chmod -R g+r shared_data/*
Here, `g+rx` adds read and execute permissions for your group to the directory, while `g+r` adds read permission to all files inside.
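When group permissions are too coarse (for example, you want to share with one user who is not in your group), many HPC filesystems support POSIX ACLs. A sketch, assuming ACLs are enabled on your filesystem and collaborator_username is a placeholder for the account you want to grant access to:
# Grant a specific user read access to the whole directory tree
# (capital X adds execute only where appropriate, i.e., on directories)
setfacl -R -m u:collaborator_username:rX shared_data
# Review the ACLs now in place
getfacl shared_data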
Shared Project Spaces
Many HPC systems offer project spaces specifically designed for collaboration:
# Create a directory in the project space
mkdir /project/your_project_id/shared_ngs_data
# Set appropriate permissions
chmod 750 /project/your_project_id/shared_ngs_data
The permission `750` means the owner has full access, group members can read and execute, while others have no access.
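A useful refinement is the setgid bit, which makes files created inside the directory inherit the project group automatically, so permissions stay consistent as collaborators add data:
# Add the setgid bit to the shared directory
chmod g+s /project/your_project_id/shared_ngs_data
# Equivalent octal form (the leading 2 is the setgid bit)
chmod 2750 /project/your_project_id/shared_ngs_data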
Sharing With External Collaborators
When collaborating with colleagues who lack access to your High-Performance Computing (HPC) environment, several specialized tools can facilitate efficient and secure data sharing. Below are three leading solutions:
Globus

Globus is a powerful tool designed specifically for research data transfer. Accessed through a web interface or endpoint client, it provides a user-friendly experience while managing complex, large-scale data transfers securely. Many institutions maintain Globus endpoints, making data sharing as simple as selecting files and specifying the recipient’s endpoint. For setup assistance, contact your institution’s IT team.
Aspera

IBM Aspera is a high-speed data transfer solution designed for moving large research datasets efficiently. Accessible through web interfaces, desktop clients, or command-line tools, Aspera uses patented FASP® protocol technology to maximize transfer speeds regardless of network conditions or physical distance. Many research institutions and commercial enterprises maintain Aspera servers that enable secure data sharing with collaborators worldwide. Contact your organization’s IT department to determine if Aspera services are available or to establish a new deployment for your research needs.
Box

Box is a secure cloud content management platform widely adopted by research institutions. It offers comprehensive web interfaces, desktop synchronization clients, and mobile applications for accessing research data from anywhere. Box provides enterprise-grade security features including encryption, access controls, and compliance certifications that protect sensitive research data. Many academic institutions have enterprise Box agreements that provide researchers with enhanced storage quotas and collaboration features. Box’s strengths include its intuitive interface, robust sharing permissions, and integration with numerous research and productivity applications. Check with your institution’s IT department about available Box accounts and any specific institutional policies for research data storage.
Data Integrity Checking with MD5 Checksums
Understanding Data Integrity for NGS Files
NGS data files are typically very large—often reaching tens or hundreds of gigabytes. When transferring such massive files between systems, there’s always a risk of data corruption, which could lead to invalid analysis results. This is why verifying data integrity is a critical step in NGS data management.
What is MD5 and How Does it Work?
MD5 (Message Digest Algorithm 5) is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, typically expressed as a 32-character hexadecimal number. When applied to a file, MD5 generates a practically unique “fingerprint” or “checksum” that changes if even a single bit in the file is altered.
Here’s how the process works:
- The MD5 algorithm reads the entire file, processing it in fixed-size blocks
- It processes this data through a complex mathematical function
- The result is a fixed-length string (the checksum) that uniquely represents the file’s contents
- If the file changes in any way, the resulting MD5 checksum will be completely different
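You can see this sensitivity for yourself: two inputs differing by a single character produce completely unrelated checksums.
# Compare the checksums of two nearly identical inputs
echo "ACGT" | md5sum
echo "ACGA" | md5sum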
Creating MD5 Checksums for NGS Files
Before transferring NGS data, you should generate checksums for your files:
# Generate MD5 checksum for a single file
md5sum large_genome.fastq.gz > large_genome.fastq.gz.md5
# Generate MD5 checksums for multiple files
md5sum *.fastq.gz > fastq_files.md5
# View the contents of an MD5 file
cat large_genome.fastq.gz.md5
# Output: 7d9c92e0e1e2c9d248e6088fd9be8daf large_genome.fastq.gz
In many public repositories, MD5 checksums are provided alongside the data files. For example, when downloading from the European Nucleotide Archive (ENA), you'll often find `.md5` files that contain the expected checksums.
Verifying Data Integrity After Transfer
After transferring files to or from your HPC system, you should verify their integrity:
# Verify a single file against its MD5 file
md5sum -c large_genome.fastq.gz.md5
# Output: large_genome.fastq.gz: OK
# Verify multiple files at once
md5sum -c fastq_files.md5
# Output:
# sample1.fastq.gz: OK
# sample2.fastq.gz: OK
# sample3.fastq.gz: FAILED
# md5sum: WARNING: 1 computed checksum did NOT match
The `-c` flag tells `md5sum` to check the files against the checksums in the specified file. If a file's current checksum matches the one in the MD5 file, you'll see “OK.” If not, you'll see “FAILED,” indicating the file may be corrupted.
MD5 vs. Other Checksum Algorithms
While MD5 is commonly used for file integrity checking in bioinformatics, it’s worth noting that more secure alternatives exist:
# Using SHA-256 instead of MD5
sha256sum large_genome.fastq.gz > large_genome.fastq.gz.sha256
sha256sum -c large_genome.fastq.gz.sha256
# Using SHA-1
sha1sum large_genome.fastq.gz > large_genome.fastq.gz.sha1
sha1sum -c large_genome.fastq.gz.sha1
MD5 remains popular due to its speed and widespread support, while SHA-256 provides far stronger guarantees against collisions (where different files produce the same checksum), particularly deliberately engineered ones. For detecting accidental corruption in NGS data, MD5 is generally sufficient.
Practical Examples for NGS Workflows
Example 1: Downloading and Verifying Reference Genomes
When downloading reference genomes, always verify their integrity:
# Download a reference genome and its MD5 file
wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.md5
# Verify the downloaded genome
md5sum -c Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz.md5
# Output: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz: OK
Example 2: Transferring and Verifying Your Own NGS Data
When moving your sequencing data to HPC:
# On your local machine, before transferring
cd /path/to/sequencing_data
md5sum *.fastq.gz > my_project_md5sums.md5
# Transfer files to HPC (including the MD5 file)
scp *.fastq.gz my_project_md5sums.md5 username@hpc.example.edu:/scratch/username/project/
# On HPC, after transferring
cd /scratch/username/project/
md5sum -c my_project_md5sums.md5
Transferring Data Between HPC and Personal Computers
Command-Line Transfer Tools
Using scp (Secure Copy)
The `scp` command is a straightforward tool for secure file transfers:
# From local to HPC
scp large_genome.fa username@hpc.example.edu:/scratch/username/
# From HPC to local
scp username@hpc.example.edu:/scratch/username/results.bam ./
# Transfer an entire directory
scp -r local_folder username@hpc.example.edu:/scratch/username/
Using SFTP (Secure File Transfer Protocol)
SFTP provides an interactive file transfer session with more flexibility than scp:
# Start an SFTP session
sftp username@hpc.example.edu
# Once connected, you can use various commands:
sftp> pwd # Show current remote directory
sftp> lpwd # Show current local directory
sftp> lls # List files in the local directory
sftp> cd /scratch # Change remote directory
sftp> lcd ~/Downloads # Change local directory
sftp> get results.bam # Download a file
sftp> put sequence.fa # Upload a file
sftp> mget *.fastq.gz # Download multiple files
sftp> mkdir new_dir # Create a directory on remote
sftp> ls -la # List remote files with details
sftp> exit # Close the connection
SFTP is particularly useful when you need to perform multiple transfer operations or navigate through directories before deciding what to transfer.
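Recurring SFTP operations can also be scripted in batch mode with the -b flag. A sketch, assuming key-based authentication is set up (batch mode cannot prompt for a password):
# Create a batch file of SFTP commands
cat > fetch_results.batch << 'EOF'
cd /scratch/username/project
get results.bam
EOF
# Run the commands non-interactively
sftp -b fetch_results.batch username@hpc.example.edu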
Using LFTP (Enhanced FTP/SFTP Client)
LFTP is a sophisticated file transfer program that supports multiple protocols:
# Connect to an HPC system using SFTP protocol
lftp sftp://username@hpc.example.edu
# Once connected, you can use enhanced commands:
lftp username@hpc.example.edu:~> mirror -R local_dir remote_dir # Upload a directory recursively
lftp username@hpc.example.edu:~> mirror remote_dir local_dir # Download a directory recursively
lftp username@hpc.example.edu:~> queue put large_file1.bam # Add file to transfer queue
lftp username@hpc.example.edu:~> queue put large_file2.bam # Add another file
lftp username@hpc.example.edu:~> queue start # Start the queued transfers
lftp username@hpc.example.edu:~> pget -n 4 huge_genome.fa # Download with 4 parallel connections
lftp username@hpc.example.edu:~> exit # Close the connection
LFTP excels with features like:
- Parallel transfers to maximize bandwidth
- Transfer queuing for batching operations
- Robust handling of unstable connections
- Ability to limit bandwidth usage
- Background transfer capabilities
Example of a background transfer:
# Start LFTP and put it in the background
lftp -c "open sftp://username@hpc.example.edu; \
mirror -v /scratch/username/results ~/local_results; \
quit"
This is extremely useful for transferring large NGS datasets that might take hours to complete.
Using rsync
For more robust transfers, especially with large datasets:
# Sync a local directory to HPC
rsync -avz --progress ~/my_project/ username@hpc.example.edu:/scratch/username/my_project/
# Sync from HPC to local with compression
rsync -avz --progress username@hpc.example.edu:/scratch/username/results/ ~/local_results/
The flags are important:
- `-a` preserves file attributes
- `-v` provides verbose output
- `-z` compresses data during transfer
- `--progress` shows progress during transfer
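Before launching a large sync, a dry run with -n (--dry-run) lists what would be transferred without moving any data, a cheap safeguard against path mistakes:
# Preview the transfer without copying anything
rsync -avzn --progress ~/my_project/ username@hpc.example.edu:/scratch/username/my_project/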
Graphical Interface Tools for Data Transfer
For those who prefer visual interfaces over command-line tools, several excellent options are available:
FileZilla (Cross-Platform)

FileZilla is a popular, free SFTP client with an intuitive interface:
- Installation: Download from https://filezilla-project.org/ and install
- Connection Setup:
- Click “File > Site Manager > New Site”
- Enter a name for your connection
- Set Protocol to “SFTP – SSH File Transfer Protocol”
- Enter Host: your HPC address (e.g., hpc.example.edu)
- Enter your username and password
- Click “Connect”
- Using FileZilla:
- The left pane shows your local files
- The right pane shows remote (HPC) files
- Navigate to source and destination folders
- Drag and drop files between panes
- Right-click for additional options
- View transfer progress at the bottom
FileZilla is excellent for researchers who need a visual representation of their file structure and prefer drag-and-drop operations.
Cyberduck (macOS and Windows)

Cyberduck offers a clean interface with bookmarking capabilities:
- Installation: Download from https://cyberduck.io/ and install
- Connection Setup:
- Click “Open Connection”
- Select SFTP from the dropdown
- Enter server, username, and password
- Click “Connect”
- Using Cyberduck:
- Browse through the file hierarchy
- Double-click to download files (they’ll go to your Downloads folder)
- Use the upload button to select files to transfer
- Right-click files for options like “Edit With” to modify remote files directly
- Create bookmarks for frequent connections
Cyberduck’s ability to edit remote files directly with local applications is particularly useful for NGS analysis scripts.
WinSCP (Windows)

WinSCP is a popular open-source SFTP client and file manager for Windows:
- Installation: Download from https://winscp.net and install
- Connection Setup:
- Launch WinSCP and enter your server details in the login dialog
- Input hostname, port number, username, and password
- Select SFTP as the file protocol
- Click “Save” to store the connection for future use
- Using WinSCP:
- Navigate through remote directories in the right panel
- Local files appear in the left panel for easy drag-and-drop transfers
- Transfer files by dragging between panels or using dedicated transfer buttons
- Edit files directly with the built-in editor or configure external editors
- Use the synchronize feature to mirror directories between systems

WinSCP's dual-panel interface and powerful synchronization capabilities make it particularly valuable for maintaining consistent datasets between local and HPC environments.
Data Security Considerations on HPC
Understanding Security Requirements for NGS Data
NGS data often contains sensitive information, especially when derived from human samples. Understanding the security requirements is essential:
Data Classification
Most institutions classify data based on sensitivity:
- Public data: Openly available genomic data like reference genomes
- Restricted data: De-identified human genomic data requiring controlled access
- Confidential data: Identifiable human genomic data with strict security requirements
Each classification will have different storage, transfer, and sharing requirements.
Encryption During Transfer
Always use encrypted transfer protocols:
# Good - uses encryption
scp my_data.bam username@hpc.example.edu:/scratch/username/
sftp username@hpc.example.edu
lftp sftp://username@hpc.example.edu
# Bad - no encryption
ftp example.edu # Avoid for sensitive data
Never use unencrypted protocols like FTP or HTTP for transferring NGS data.
Storage Encryption
Some HPC systems offer encrypted storage options for sensitive data:
# Check if your data is on encrypted storage (this heuristic only works
# if the mount or device name mentions encryption; ask your HPC
# administrators for an authoritative answer)
df -h /path/to/your/data | grep -i encrypt
For highly sensitive data, consult with your HPC administrators about encrypted storage options.
How Security Affects Your Workflows
Security requirements may impact how you work:
- Performance tradeoffs: Encrypted transfers and storage may be slower
- Access limitations: Sensitive data may require working in specific secure environments
- Sharing restrictions: May need formal agreements before sharing data
- Audit requirements: Your access and transfers may be logged for compliance
Always check your institution’s data security policies when working with NGS data, especially human genome sequences.
Best Practices for NGS Data Management on HPC
Organization Strategies
Keeping your data organized is crucial for effective HPC usage:
# Example directory structure
/scratch/username/
├── project_name/
│ ├── raw_data/
│ ├── processed_data/
│ ├── scripts/
│ └── results/
└── README.txt # Document what's stored where
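This entire skeleton can be created in one command using bash brace expansion:
# Create the full project directory structure in one step
mkdir -p /scratch/username/project_name/{raw_data,processed_data,scripts,results}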
Data Lifecycle Management
Plan for the entire lifecycle of your data:
- Acquisition: How will you get the data to HPC?
- Processing: Where will intermediate files be stored?
- Analysis: Where will results be saved?
- Archival: How will you preserve important results?
- Deletion: How will you clean up unnecessary files?
# Example cleanup script for old intermediate files
find /scratch/username/project/intermediate/ -type f -atime +30 -name "*.tmp" -delete
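Since -delete is irreversible, it's prudent to run the same search with -print first and review the list before committing:
# Preview which files would be deleted
find /scratch/username/project/intermediate/ -type f -atime +30 -name "*.tmp" -print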
Automation Tips
Automate repetitive transfer tasks:
# Create an SSH key for password-less transfers
ssh-keygen -t rsa -b 4096
ssh-copy-id username@hpc.example.edu
# Create a simple transfer script
cat > transfer_results.sh << 'EOF'
#!/bin/bash
rsync -avz --progress username@hpc.example.edu:/scratch/username/project/results/ ~/local_results/
EOF
chmod +x transfer_results.sh
For LFTP automation, create scripts for regular transfers:
# Create an LFTP script file
cat > sync_ngs_data.lftp << 'EOF'
open sftp://username@hpc.example.edu
lcd ~/local_ngs_data
cd /scratch/username/ngs_project
mirror -v --only-newer
quit
EOF
# Run the script when needed
lftp -f sync_ngs_data.lftp
This script will download only new files from your HPC project to your local directory.
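On your local machine, such a script can be scheduled with cron. A sketch, assuming key-based SSH authentication so no password prompt blocks the job (the daily 2 AM schedule is illustrative):
# Open your crontab for editing
crontab -e
# Add a line like this to run the sync every day at 2 AM
0 2 * * * lftp -f ~/sync_ngs_data.lftp >> ~/lftp_sync.log 2>&1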
Integrating Data Integrity Checks into Your Workflow
To ensure your NGS data remains intact throughout its lifecycle, integrate MD5 checks at key points:
# Example of integrated download, verification, and transfer workflow
#!/bin/bash
# 1. Download data from repository
echo "Downloading SRA data..."
prefetch SRR28119110
fastq-dump --split-files SRR28119110
# 2. Generate MD5 checksums immediately after download
echo "Generating checksums..."
md5sum SRR28119110*.fastq > ~/downloaded_files.md5
# 3. Transfer to analysis directory
echo "Transferring to analysis directory..."
rsync -avz SRR28119110*.fastq /scratch/username/project/
# 4. Verify files after transfer
echo "Verifying transfer integrity..."
cd /scratch/username/project/
md5sum -c ~/downloaded_files.md5
if [ $? -ne 0 ]; then
echo "ERROR: File integrity check failed!"
exit 1
fi
echo "Files transferred and verified successfully."
By integrating MD5 checks into your workflow, you establish multiple verification points that help identify when and where data corruption might occur.
Common Pitfalls and How to Avoid Them
Storage Quota Issues
Running out of space is a common problem:
# Check your current usage and quota
quota -s
# Find large files that might be wasting space
find /scratch/username -type f -size +1G -exec ls -lh {} \;
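Summarizing usage per top-level directory makes the biggest space consumers obvious at a glance:
# Rank your scratch directories by size, largest last
du -sh /scratch/username/* | sort -h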
Transfer Interruptions
For large NGS files, transfer interruptions are common:
# Instead of scp, use rsync which can resume
rsync -avz --partial --progress large_genome.bam username@hpc.example.edu:/scratch/username/
# Or use LFTP which handles interruptions well
lftp -c "open sftp://username@hpc.example.edu; \
pget -n 8 -c /scratch/username/huge_genome.fa ~/Downloads/"
The `--partial` flag in rsync keeps partially transferred files, allowing resumption. In LFTP, the `-c` flag enables continue/resume of interrupted transfers, while `-n 8` uses 8 parallel connections for faster transfers.
Permission Problems
Incorrect permissions can cause sharing issues:
# Fix permissions recursively
find /project/shared_data -type d -exec chmod 750 {} \;
find /project/shared_data -type f -exec chmod 640 {} \;
Forgetting Data Locations
As you work with multiple projects, tracking data location becomes challenging:
# Create a simple data catalog
cat > ~/data_catalog.txt << 'EOF'
Project: Liver Cancer RNAseq
Raw data: /scratch/username/liver_cancer/raw/
Results: /project/groupname/liver_cancer/results/
Backup: Globus endpoint "Institution DataVault" path "/backups/2023/liver_cancer/"
EOF
Checksum Verification Issues
Common problems with MD5 verification and how to solve them:
- Mismatched checksums: When a file doesn’t match its expected checksum
# Re-download or re-transfer the file
rm corrupted_file.fastq
wget https://source.example.edu/path/to/file.fastq
# Verify again
md5sum -c file.fastq.md5
- Missing MD5 files: When you need to verify files but don’t have a checksum file
# Contact the data provider for the correct checksums
# Or, if the file is shared by a collaborator, ask them to provide MD5s
- MD5 file format issues: Different systems might format MD5 files differently
# If your MD5 verification fails due to format issues, reformat the file
# (md5sum -c expects two spaces between the checksum and the filename):
awk '{print $1 "  " $2}' problematic.md5 > reformatted.md5
Troubleshooting Guide
Transfer Issues
| Problem | Possible Solution |
|---|---|
| “Connection timed out” | Check network, try smaller chunks, use Globus |
| “Permission denied” | Check file permissions, ensure correct username |
| “Disk quota exceeded” | Clean up unnecessary files, request quota increase |
| “File not found” | Verify paths, check for typos |
| “Connection reset by peer” | Try LFTP with auto-retry or use a graphical tool with resume capability |
| “Checksum verification failed” | Re-transfer the file, check if file was modified during transfer |
Command Debugging
When commands don’t work as expected:
# Add verbose flags
scp -v large_file.fastq username@hpc.example.edu:/scratch/username/
# Check system load on HPC
ssh username@hpc.example.edu "uptime"
# Check disk space
ssh username@hpc.example.edu "df -h /scratch"
# Test SFTP connection with debugging
sftp -v username@hpc.example.edu
Graphical Tool Issues
For problems with graphical transfer tools:
- Connection failures:
- Verify you can connect via command line first
- Check if a firewall is blocking the connection
- Try connecting to a different port if your HPC supports it
- Slow transfers:
- Try disabling any antivirus real-time scanning temporarily
- In FileZilla, adjust concurrent transfers (Settings > Transfers)
- In LFTP, increase parallel connections (`pget -n 8`)
- Failed transfers:
- Check local disk space
- Ensure file paths don’t contain special characters
- Try transferring to a different directory first
Conclusion
Effective data management is a fundamental skill for NGS analysis on HPC systems. By understanding the proper tools and techniques for storing, transferring, and sharing your genomic data, you can create more efficient workflows, collaborate more effectively, and ensure the security of sensitive information.
The addition of tools like SFTP and LFTP to your toolkit provides more flexibility for different transfer scenarios, while graphical interfaces make data management more accessible for those less comfortable with command-line operations. Perhaps most importantly, incorporating data integrity verification with MD5 checksums gives you confidence that your NGS data remains intact throughout its lifecycle—from download to analysis to archiving.
Remember that each HPC system has its own specific configuration and policies, so always consult your institution’s HPC documentation for system-specific details. With practice, managing large NGS datasets across systems will become second nature, allowing you to focus on the biological insights hidden within your data.
Data integrity verification isn’t just a best practice—it’s an essential step in ensuring reproducible science. When your analysis depends on terabytes of genomic data, even minor corruptions can lead to misleading results. By incorporating MD5 checksums into your workflow, you’re not just protecting your data; you’re protecting the validity of your scientific conclusions.
Have you faced any specific challenges with managing your NGS data on HPC systems? Feel free to share your experiences or questions in the comments section below!