Feature/orchestration 3322 merge (#225)
* adding new index

* added unit tests for new index

* adding max_count to index buckets

* simplifying name

* adding test for the index

* preparing for a run

* adding reader

* using the extended readers and writers

* removing large (>1000 bp) clinvar variants

* adding ref alt and compressed length in the record

* fixing the unit tests

* compressing the nsa index buckets

* avoiding creating extended reader for every GetAnnotation query

* creating db and index for OneKg

* adding the lazy index

* adding unit tests for lazy index and buckets

* adding data source versions and genome assembly to the nsa index

* adding an interface for Supplementary annotation data item

* adding the provider

* updating the provider

* adding multiple annotations per position

* adding data source details like matchByAllele, isArray, etc

* adding new sa object to replace the old complex logic

* testing the nsa provider

* fixing bugs

* adding the heap to the writer and making dbsnp an array of ids

* optimizing index memory requirements

* creating gnomad db

* adding topmed nsa creator

* adding indexes and readers for intervals

* adding unit tests for interval index

* Nirvana is now working with everything streamed in

* Cleaned the code for handling inputs from S3 bucket

* Created the lambda wrapper project

* WIP

* preparing the interval readers and writers

* Signed URL based solution for VCF reading

* adding zstd dict to readers and writers. adding interval to NsaProvider

* Use POCO type for Lambda input parsing

* dgv ready to roll

* StreamAnnotation class implemented

* Bugfix; StreamAnnotation works well locally

* WIP

* switching over to fix SonarQube issues

* WIP

* make NirvanaLambda work

* Use Json for lambda output

* Use POCO for lambda output

* refactoring

* remove code that has been commented out

* return null if chrom does not exist in index

* displaying data related to compression decompression

* starting with indexes

* adding tests

* working on the reader

* Fixed the issue that truncated Json file generated

* modifying the saWriter to become the new writer that has the blocked streams built in

* removing compile errors

* WIP

* unit tests for reader writer working

* Added regionEndpoint to LambdaWrapper

* starting the preLoading effort

* reading SA from s3

* adding timing output

* code cleanup

* adding new sa to lambda

* ready for unit testing

* reorganizing code.

* better memory with better dictionary initialization

* bugfix. NsaProvider should not add null arrays of supp intervals to annotated position

* added onek SV database

* removing files, updating unit tests

* discarding large variants

* adding the cosmic db creator

* removing all signs of TSV creators, readers, indexes, etc.

* Updated the code to generate MITOMAP databases directly (#210)

* fixing bug, refactoring NsaProvider, added GlobalMinor item

* compiling with ref minor index

* bugfixing the ref minor db creator

* bug fixing. Ref minor tags are now showing up

* phylop database maker ready for testing

* phylop database is huge: 7.4 GB, even larger than Gnomad.

* npd writer done. working on npd reader

* comments from code review

* initial implementation

* adding omim support

* GetFileSize method tested

* adding omim db creator, gene db writer

* Tested Orchestrator locally

* Use MemoryStream to pass the payload

* Add JSON output index support

* adding unit tests for gene reader and writer

* gene annotations are showing up in output

* now the json output is valid

* Updated the csproj files for two Lambda projects

* exac gene scores in place

* removing empty omim entries

* Additional changes in csproj files

* removing legacy code. End to end unit tests broken

* phylop bug for unknown chromosome fixed

* use sync long running tasks

* adding custom annotation and a minor bug fix

* removing debugging code

* VCF filtering feature implemented

* Accept config files for http SA resources

* Code cleanup

* Custom SA Lambda created; Refactoring

* Fix: always add chromosome as the key in the preload dictionary

* Refactoring

* CustomSaLambda tested

* Get region endpoint from environment variable

* adding custom interval support. removing .version file for custom annotation

* adding unit tests for SA

* Refactoring and bugfix

* Use two readers for header stream and variant stream

* Remove the need of version file when creating custom SA

* new custom nsa naming convention. Increased unit test coverage for SAutils

* Fixed the bug in preload vcf stream

* LambdaWrapper bugfix

* adding unit tests for omim, exacScores, clingen

* PartitionInfo refactoring

* incrementing data version for anavrin

* Get the references in input VCF from tabix query

* Tabix now uses IChromosome internally.

* Merged some updates from develop.

* removing duplicates from data source versions in header

* Also check the chromosome in PassedTheEnd method of IVcfFilter

* Integrate Tabix bugfix and update

* fixing dbsnp output

* fix for dbsnp

* Fixed issues related to differing reference and tabix indices.

* Make blockoffset seek working

* bugfixing the preload operation in NsaReader

* Bugfix: FastForward now skips the header lines and checks chr name

* Don't throw exception when ending a section not opened yet in JasixIndex

* unit test for Nsa reader preload and nsa provider

* removing null global minor entries, reverting gene entries to old schema

* Refactoring; Enabled SIFT and PolyPhen

* Bugfix: create seekable webreadstream from HttpStreamSource

* fixed omim bug where omim gene symbol lookup was wrong

* Created Cloud project for common POCO class and AWS utils; Update the POCO models according to the Swagger page

* Bugfix: Update the name of annotation lambda; add missing base name to annotation output

* more omim bug fixes. discarding entries with no gene symbol from OMIM

* moving SV annotations to positions. added reciprocal overlap

* fixing reciprocal overlap issue

* fixed unit test

* dgv was missing

* updating cosmic schema

* Refactoring for unit tests

* custom annotation fields are all string

* trimming white spaces from headers in custom TSV

* Refactored the S3 upload function

* updated cosmic tissue object and removed AA from onekg

* Fixed bugs in MitoMap database generation

* Fixed the bugs in handling annotation jobs with annotation range set to null

* Update the break points in chromosome partitioning

* fixing bug for unrecognized contigs in preLoad utilities

* Reduce the memory usage of preload function

* fixing cosmic differences

* replacing _ with space for cancer types

* fixing unit test

* cosmic small variants are capped at length 1k. Filtering for conflicting alleles applied

* rerunning after onekg bug for GRCh38; the input file was incorrect

* global major freq is 7 decimal points

* Only preload a subset of the annotations

* fixing the empty genes section issue

* WIP

* removing items that don't have a valid refAllele

* removing reciprocal overlap for breakends

* fixed clinvar bug introduced during ref base checking

* Use both RefSeq and Ensembl cache; AnnotationLambda with the same qualifier as the NirvanaLambda will be invoked

* Added debugging code

* Upgrade to dotnet core 2.1

* ancestral allele back to onekg and custom annotation intervals being reported for all

* Set AmazonLambdaClient timeout to 5 mins; clean /tmp folder before and after each annotation job

* fixing clinvar bug and removing phylop scores for GRCh37 chrM

* reciprocal overlap is 0 for insertions

* fixing a typo

* reciprocal overlap for insertions is not reported

* Revert staggered preloading of SA

* Fixed problems introduced during the merging

* Cleaned the repo

* Additional code cleaning

* Add unit tests; Fix the bug in PassedTheEnd method

* Added more unit tests

* Calling AnnotationLambda w/o credentials; Always invoke the latest version of AnnotationLambda

* Get ARN of annotation lambda as an environment variable

* Bugfix: check null SaProvider before preloading

* Remove the SortedVcfChecker

* Use type keywords

* More changes about using type keywords
shulik7 authored and rajatshuvro committed Oct 9, 2018
1 parent 993e96c commit 04dce0b
Showing 133 changed files with 3,475 additions and 1,633 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -6,6 +6,7 @@
*.user
*.userosscache
*.sln.docstates
+aws-lambda*.json

# User-specific files (MonoDevelop/Xamarin Studio)
*.userprefs
@@ -258,4 +259,4 @@ paket-files/

# Python Tools for Visual Studio (PTVS)
__pycache__/
-*.pyc
+*.pyc
18 changes: 18 additions & 0 deletions AnnotationLambda/AnnotationConfig.cs
@@ -0,0 +1,18 @@
// ReSharper disable InconsistentNaming

using Cloud;
using Genome;

namespace AnnotationLambda
{
public sealed class AnnotationConfig
{
public string id;
public string genomeAssembly;
public S3Path inputVcf;
public S3Path outputDir;
public string outputPrefix;
public string supplementaryAnnotations;
public AnnotationRange annotationRange;
}
}
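
The lowercase public fields deliberately mirror the JSON keys of the incoming Lambda event, which is why the ReSharper naming suppression sits at the top of the file. Below is a minimal binding sketch, not part of the commit: it assumes Newtonsoft.Json semantics (the engine wrapped by Amazon.Lambda.Serialization.Json), and the payload values as well as the S3Path "path" member are hypothetical.

```
// Binding sketch only (not part of this commit). Assumes Newtonsoft.Json,
// which Amazon.Lambda.Serialization.Json wraps; the payload values and the
// S3Path "path" member are hypothetical.
using System;
using AnnotationLambda;
using Newtonsoft.Json;

public static class AnnotationConfigBindingSketch
{
    public static void Main()
    {
        const string payload = @"{
            ""id"": ""job-42"",
            ""genomeAssembly"": ""GRCh37"",
            ""inputVcf"": { ""bucketName"": ""example-bucket"", ""path"": ""in/sample.vcf.gz"" },
            ""outputDir"": { ""bucketName"": ""example-bucket"", ""path"": ""out/"" },
            ""outputPrefix"": ""sample"",
            ""annotationRange"": { ""chromosome"": ""chr1"", ""start"": 1, ""end"": 50000 }
        }";

        // Json.NET binds public fields by name, so the lowercase members map directly.
        var config = JsonConvert.DeserializeObject<AnnotationConfig>(payload);
        Console.WriteLine($"{config.id}: {config.genomeAssembly}");
    }
}
```

AnnotationResult below travels back to the caller the same way, serialized by the same Lambda serializer.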
34 changes: 34 additions & 0 deletions AnnotationLambda/AnnotationLambda.csproj
@@ -0,0 +1,34 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<GenerateRuntimeConfigurationFiles>true</GenerateRuntimeConfigurationFiles>
<NoWarn>NU1605</NoWarn>
<TargetFramework>netcoreapp2.1</TargetFramework>
<OutputPath>..\bin\$(Configuration)</OutputPath>
<DebugType>Full</DebugType>
</PropertyGroup>

<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'">
<TreatWarningsAsErrors>false</TreatWarningsAsErrors>
</PropertyGroup>

<ItemGroup>
<DotNetCliToolReference Include="Amazon.Lambda.Tools" Version="2.2.0" />
</ItemGroup>

<ItemGroup>
<PackageReference Include="Amazon.Lambda.Core" Version="1.0.0" />
<PackageReference Include="Amazon.Lambda.Serialization.Json" Version="1.3.0" />
<PackageReference Include="AWSSDK.Lambda" Version="3.3.0.3" />
<PackageReference Include="AWSSDK.S3" Version="3.3.1.2" />
</ItemGroup>

<ItemGroup>
<ProjectReference Include="..\Nirvana\Nirvana.csproj" />
<ProjectReference Include="..\Tabix\Tabix.csproj" />
<ProjectReference Include="..\Cloud\Cloud.csproj" />
</ItemGroup>

<Import Project="..\VariantAnnotation\CommonAssemblyInfo.props" />

</Project>
14 changes: 14 additions & 0 deletions AnnotationLambda/AnnotationResult.cs
@@ -0,0 +1,14 @@
// ReSharper disable InconsistentNaming

using ErrorHandling;

namespace AnnotationLambda
{
public sealed class AnnotationResult
{
public string id;
public string status;
public string filePath;
public ExitCodes exitCode;
}
}
3 changes: 3 additions & 0 deletions AnnotationLambda/AssemblyInfo.cs
@@ -0,0 +1,3 @@
using System.Runtime.CompilerServices;

[assembly: InternalsVisibleTo("UnitTests")]
140 changes: 140 additions & 0 deletions AnnotationLambda/LambdaWrapper.cs
@@ -0,0 +1,140 @@
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Net.Mime;
using System.Reflection;
using Amazon.Lambda.Core;
using Amazon.S3.Model;
using Cloud;
using Compression.FileHandling;
using ErrorHandling;
using Genome;
using IO;
using Nirvana;
using Vcf;
using Tabix;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]

namespace AnnotationLambda
{
public sealed class LambdaWrapper
{
public string LocalTempOutputPath = "/tmp/Nirvana_temp";
private const string AnnotationSuccessMessage = "Annotation Complete";


public AnnotationResult RunNirvana(AnnotationConfig annotationConfig, ILambdaContext context)
{
var output = new AnnotationResult { id = annotationConfig.id };
try
{
// may not be needed in the future
var tempFolder = new DirectoryInfo(LocalTempOutputPath).Parent;
NirvanaHelper.CleanOutput(tempFolder == null ? Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location) : tempFolder.FullName);

var inputS3Client = S3Utilities.GetS3ClientWrapperFromEnvironment(annotationConfig.inputVcf.bucketName);
var outputS3Client = S3Utilities.GetS3ClientWrapperFromEnvironment(annotationConfig.outputDir.bucketName);

var annotationResources = GetAnnotationResources(inputS3Client, annotationConfig, annotationConfig.genomeAssembly);

var byteRange = new ByteRange(VirtualPosition.From(annotationResources.InputStartVirtualPosition).FileOffset, long.MaxValue);

if (annotationConfig.annotationRange != null)
{
Console.WriteLine($"Annotation range: {annotationConfig.annotationRange.chromosome} {annotationConfig.annotationRange.start} {annotationConfig.annotationRange.end}");
}

using (var preloadVcfStream = new S3StreamSource(inputS3Client, annotationConfig.inputVcf).GetStream(byteRange))
{
annotationResources.GetVariantPositions(new BlockGZipStream(preloadVcfStream, CompressionMode.Decompress), annotationConfig.annotationRange);
}
Console.WriteLine("Variant preloading done.");

using (var inputVcfStream = new BlockGZipStream(new S3StreamSource(inputS3Client, annotationConfig.inputVcf).GetStream(byteRange),
CompressionMode.Decompress))
using (var headerStream = annotationConfig.annotationRange == null ? null : new BlockGZipStream(new S3StreamSource(inputS3Client, annotationConfig.inputVcf).GetStream(),
CompressionMode.Decompress))
using (var outputJsonStream = new BlockGZipStream(FileUtilities.GetCreateStream(LocalTempOutputPath + NirvanaHelper.JsonSuffix),
CompressionMode.Compress))
using (var outputJsonIndexStream = FileUtilities.GetCreateStream(LocalTempOutputPath + NirvanaHelper.JsonSuffix + NirvanaHelper.JsonIndexSuffix))
{

IVcfFilter vcfFilter = annotationConfig.annotationRange == null
? new NullVcfFilter() as IVcfFilter
: new VcfFilter(AnnotationRangeToChromosomeInterval(annotationConfig.annotationRange, annotationResources.SequenceProvider.RefNameToChromosome));


StreamAnnotation.Annotate(headerStream, inputVcfStream, outputJsonStream, outputJsonIndexStream, null, null,
annotationResources, vcfFilter);
Console.WriteLine("Annotation done.");
}

output.filePath = S3Utilities.UploadBaseAndIndexFiles(outputS3Client, annotationConfig.outputDir,
LocalTempOutputPath + NirvanaHelper.JsonSuffix,
annotationConfig.outputPrefix + NirvanaHelper.JsonSuffix, NirvanaHelper.JsonIndexSuffix);
Console.WriteLine("Nirvana output files uploaded.");

File.Delete(LocalTempOutputPath + NirvanaHelper.JsonSuffix);
File.Delete(LocalTempOutputPath + NirvanaHelper.JsonSuffix + NirvanaHelper.JsonIndexSuffix);
Console.WriteLine("Temp Nirvana output deleted.");

output.status = AnnotationSuccessMessage;
output.exitCode = ExitCodes.Success;
}
catch (Exception e)
{
Console.WriteLine($"StackTrace: {e.StackTrace}");
output.status = e.Message;
output.exitCode = ExitCodeUtilities.GetExitCode(e.GetType());
throw;
}

return output;
}

public static Index GetTabixIndex(Stream tabixStream, IDictionary<string, IChromosome> refNameToChromosome)
{
using (var binaryReader = new BinaryReader(new BlockGZipStream(tabixStream, CompressionMode.Decompress)))
{
return Reader.Read(binaryReader, refNameToChromosome);
}
}

internal static long GetTabixVirtualPosition(AnnotationConfig annotationConfig, IS3Client s3Client, IDictionary<string, IChromosome> refNameToChromosome)
{
// process the entire file if no range specified
if (annotationConfig.annotationRange == null) return 0;

var tabixStream = new S3StreamSource(s3Client, annotationConfig.inputVcf).GetAssociatedStreamSource(NirvanaHelper.TabixSuffix).GetStream();
var tabixIndex = GetTabixIndex(tabixStream, refNameToChromosome);
var chromosome = ReferenceNameUtilities.GetChromosome(refNameToChromosome, annotationConfig.annotationRange.chromosome);

return tabixIndex.GetOffset(chromosome, annotationConfig.annotationRange.start);
}

private static AnnotationResources GetAnnotationResources(IS3Client s3Client, AnnotationConfig annotationConfig, string genomeAssembly)
{
string cachePathPrefix = UrlCombine(NirvanaHelper.S3CacheFoler, genomeAssembly + "/" + NirvanaHelper.DefaultCacheSource);
string nirvanaS3Ref = NirvanaHelper.GetS3RefLocation(GenomeAssemblyHelper.Convert(genomeAssembly));
var saConfigFileLocation = GetSaConfigFileLocation(annotationConfig.supplementaryAnnotations, genomeAssembly);

var annotationResources = new AnnotationResources(nirvanaS3Ref, cachePathPrefix, saConfigFileLocation , null, false, false, true, false, false);

annotationResources.InputStartVirtualPosition = GetTabixVirtualPosition(annotationConfig, s3Client, annotationResources.SequenceProvider.RefNameToChromosome);

return annotationResources;
}

private static List<string> GetSaConfigFileLocation(string versionTag, string genomeAssembly) => versionTag == null ? null :
new List<string> { NirvanaHelper.S3Url + string.Join("_", versionTag, "SA", NirvanaHelper.ProjectName, genomeAssembly) + ".txt"};

private static string UrlCombine(string baseUrl, string relativeUrl) => baseUrl.TrimEnd('/') + '/' + relativeUrl.TrimStart('/');

private static IChromosomeInterval AnnotationRangeToChromosomeInterval(AnnotationRange annotationRange,
IDictionary<string, IChromosome> refnameToChromosome) => new ChromosomeInterval(ReferenceNameUtilities.GetChromosome(refnameToChromosome, annotationRange.chromosome),
annotationRange.start, annotationRange.end);
}
}
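
GetTabixVirtualPosition above feeds VirtualPosition.From(...).FileOffset into the ByteRange used for the ranged S3 read. In the BGZF/tabix format, a virtual position packs the compressed block's byte offset into the upper 48 bits and the in-block offset into the lower 16 bits. The sketch below shows that decomposition; the helper names are ours, and the repo's VirtualPosition type may differ.

```
// Sketch: decomposing a BGZF virtual file offset, as stored in a tabix
// index, into the compressed-file offset used to start a ranged S3 read.
// Helper names are illustrative, not the repo's actual API.
using System;

public static class VirtualPositionSketch
{
    // Upper 48 bits: byte offset of the BGZF block within the compressed file.
    public static long FileOffset(long virtualPosition) => virtualPosition >> 16;

    // Lower 16 bits: byte offset within the decompressed block.
    public static int BlockOffset(long virtualPosition) => (int)(virtualPosition & 0xFFFF);

    public static void Main()
    {
        long vpos = (123456L << 16) | 789; // hypothetical index entry
        Console.WriteLine(FileOffset(vpos));  // 123456
        Console.WriteLine(BlockOffset(vpos)); // 789
    }
}
```

Seeking to FileOffset and decompressing from there is what lets the Lambda annotate a single chromosome range without downloading the whole VCF.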
45 changes: 45 additions & 0 deletions AnnotationLambda/Readme.md
@@ -0,0 +1,45 @@
# AWS Lambda Empty Function Project

This starter project consists of:
* Function.cs - class file containing a class with a single function handler method
* aws-lambda-tools-defaults.json - default argument settings for use with Visual Studio and command line deployment tools for AWS

You may also have a test project depending on the options selected.

The generated function handler is a simple method accepting a string argument that returns the uppercase equivalent of the input string. Replace the body of this method, and parameters, to suit your needs.

## Here are some steps to follow from Visual Studio:

To deploy your function to AWS Lambda, right click the project in Solution Explorer and select *Publish to AWS Lambda*.

To view your deployed function open its Function View window by double-clicking the function name shown beneath the AWS Lambda node in the AWS Explorer tree.

To perform testing against your deployed function use the Test Invoke tab in the opened Function View window.

To configure event sources for your deployed function, for example to have your function invoked when an object is created in an Amazon S3 bucket, use the Event Sources tab in the opened Function View window.

To update the runtime configuration of your deployed function use the Configuration tab in the opened Function View window.

To view execution logs of invocations of your function use the Logs tab in the opened Function View window.

## Here are some steps to follow to get started from the command line:

Once you have edited your function you can use the following commands to build, test and deploy it to AWS Lambda (these examples assume the project name is *LambdaWrapper*):

Restore dependencies
```
cd "LambdaWrapper"
dotnet restore
```

Execute unit tests
```
cd "LambdaWrapper/test/LambdaWrapper.Tests"
dotnet test
```

Deploy function to AWS Lambda
```
cd "LambdaWrapper/src/LambdaWrapper"
dotnet lambda deploy-function
```
2 changes: 1 addition & 1 deletion CacheUtils/CacheUtils.csproj
@@ -1,7 +1,7 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>netcoreapp2.0</TargetFramework>
<TargetFramework>netcoreapp2.1</TargetFramework>
<OutputPath>..\bin\$(Configuration)</OutputPath>
<DebugType>Full</DebugType>
</PropertyGroup>
11 changes: 6 additions & 5 deletions CacheUtils/Commands/CreateCache/CreateNirvanaDatabaseMain.cs
@@ -16,6 +16,7 @@
using Genome;
using Intervals;
using IO;
+using IO.StreamSource;
using VariantAnnotation.Interface;
using VariantAnnotation.IO.Caches;
using VariantAnnotation.Logger;
@@ -40,11 +41,11 @@ private static ExitCodes ProgramExecution()

(var refIndexToChromosome, var refNameToChromosome, int numRefSeqs) = SequenceHelper.GetDictionaries(_inputReferencePath);

-using (var transcriptReader = new MutableTranscriptReader(GZipUtilities.GetAppropriateReadStream(transcriptPath), refIndexToChromosome))
-using (var regulatoryReader = new RegulatoryRegionReader(GZipUtilities.GetAppropriateReadStream(regulatoryPath), refIndexToChromosome))
-using (var siftReader = new PredictionReader(GZipUtilities.GetAppropriateReadStream(siftPath), refIndexToChromosome, IntermediateIoCommon.FileType.Sift))
-using (var polyphenReader = new PredictionReader(GZipUtilities.GetAppropriateReadStream(polyphenPath), refIndexToChromosome, IntermediateIoCommon.FileType.Polyphen))
-using (var geneReader = new UgaGeneReader(GZipUtilities.GetAppropriateReadStream(ExternalFiles.UniversalGeneFilePath), refNameToChromosome))
+using (var transcriptReader = new MutableTranscriptReader(GZipUtilities.GetAppropriateReadStream(new FileStreamSource(transcriptPath)), refIndexToChromosome))
+using (var regulatoryReader = new RegulatoryRegionReader(GZipUtilities.GetAppropriateReadStream(new FileStreamSource(regulatoryPath)), refIndexToChromosome))
+using (var siftReader = new PredictionReader(GZipUtilities.GetAppropriateReadStream(new FileStreamSource(siftPath)), refIndexToChromosome, IntermediateIoCommon.FileType.Sift))
+using (var polyphenReader = new PredictionReader(GZipUtilities.GetAppropriateReadStream(new FileStreamSource(polyphenPath)), refIndexToChromosome, IntermediateIoCommon.FileType.Polyphen))
+using (var geneReader = new UgaGeneReader(GZipUtilities.GetAppropriateReadStream(new FileStreamSource(ExternalFiles.UniversalGeneFilePath)), refNameToChromosome))
{
var genomeAssembly = transcriptReader.Header.Assembly;
var source = transcriptReader.Header.Source;
@@ -11,6 +11,7 @@
using CacheUtils.Logger;
using Genome;
using IO;
+using IO.StreamSource;
using VariantAnnotation.Interface;
using VariantAnnotation.Interface.AnnotatedPositions;
using VariantAnnotation.Logger;
@@ -106,7 +107,7 @@ private static Dictionary<string, GenbankEntry> GetIdToGenbank(ILogger logger, G
logger.Write("- loading the intermediate Genbank file... ");

Dictionary<string, GenbankEntry> genbankDict;
-using (var reader = new IntermediateIO.GenbankReader(GZipUtilities.GetAppropriateReadStream(ExternalFiles.GenbankFilePath)))
+using (var reader = new IntermediateIO.GenbankReader(GZipUtilities.GetAppropriateReadStream(new FileStreamSource(ExternalFiles.GenbankFilePath))))
{
genbankDict = reader.GetIdToGenbank();
}
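
The recurring change in this file, and in VepCacheParser below, is wrapping plain paths in FileStreamSource so the readers depend on a stream source rather than on the filesystem. A minimal sketch of the idea follows; the actual interfaces in the IO.StreamSource namespace may look different, so all names here are assumptions.

```
// Sketch of the stream-source abstraction these hunks move toward: a reader
// that takes a source instead of a path can be fed local files, S3 objects,
// or signed URLs interchangeably. All names here are assumptions.
using System.IO;
using System.Net.Http;

public interface IStreamSourceSketch
{
    Stream GetStream();
}

public sealed class FileStreamSourceSketch : IStreamSourceSketch
{
    private readonly string _path;
    public FileStreamSourceSketch(string path) => _path = path;
    public Stream GetStream() => File.OpenRead(_path);
}

public sealed class HttpStreamSourceSketch : IStreamSourceSketch
{
    private static readonly HttpClient Client = new HttpClient();
    private readonly string _url;
    public HttpStreamSourceSketch(string url) => _url = url;
    public Stream GetStream() => Client.GetStreamAsync(_url).GetAwaiter().GetResult();
}
```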
15 changes: 10 additions & 5 deletions CacheUtils/Commands/ParseVepCacheDirectory/VepCacheParser.cs
@@ -1,12 +1,16 @@
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Runtime.CompilerServices;
using CacheUtils.DataDumperImport.DataStructures.Import;
using CacheUtils.DataDumperImport.DataStructures.Mutable;
using CacheUtils.DataDumperImport.Import;
using CacheUtils.DataDumperImport.IO;
using Compression.Utilities;
using Genome;
using IO;
using IO.StreamSource;
using VariantAnnotation.Interface.AnnotatedPositions;

namespace CacheUtils.Commands.ParseVepCacheDirectory
@@ -32,8 +36,9 @@ public VepCacheParser(Source source)

private static List<IRegulatoryRegion> ParseRegulatoryFiles(IChromosome chromosome, string dirPath)
{
var files = Directory.GetFiles(dirPath, "*_reg_regulatory_regions_data_dumper.txt.gz");
var regulatoryRegions = new List<IRegulatoryRegion>();
var files = FileUtilities.GetFileNamesInDir(dirPath, "*_reg_regulatory_regions_data_dumper.txt.gz")
.ToArray();

foreach (string dumpPath in VepRootDirectory.GetSortedFiles(files))
{
@@ -45,8 +50,8 @@ private static List<MutableTranscript> ParseTranscriptFiles(IChromosome chromoso

private List<MutableTranscript> ParseTranscriptFiles(IChromosome chromosome, string dirPath)
{
var files = Directory.GetFiles(dirPath, "*_transcripts_data_dumper.txt.gz");
var transcripts = new List<MutableTranscript>();
var files = FileUtilities.GetFileNamesInDir(dirPath, "*_transcripts_data_dumper.txt.gz").ToArray();

foreach (string dumpPath in VepRootDirectory.GetSortedFiles(files))
{
@@ -61,7 +66,7 @@ private static void ParseRegulatoryDumpFile(IChromosome chromosome, string fileP
{
Console.WriteLine("- processing {0}", Path.GetFileName(filePath));

-using (var reader = new DataDumperReader(GZipUtilities.GetAppropriateReadStream(filePath)))
+using (var reader = new DataDumperReader(GZipUtilities.GetAppropriateReadStream(new FileStreamSource(filePath))))
{
foreach (var ad in reader.GetRootNode().Value.Values)
{
@@ -90,15 +95,15 @@ private void ParseTranscriptDumpFile(IChromosome chromosome, string filePath,
{
Console.WriteLine("- processing {0}", Path.GetFileName(filePath));

-using (var reader = new DataDumperReader(GZipUtilities.GetAppropriateReadStream(filePath)))
+using (var reader = new DataDumperReader(GZipUtilities.GetAppropriateReadStream(new FileStreamSource(filePath))))
{
foreach (var node in reader.GetRootNode().Value.Values)
{
if (!(node is ListObjectKeyValueNode transcriptNodes)) continue;

foreach (var tNode in transcriptNodes.Values)
{
if (!(tNode is ObjectValueNode transcriptNode)) throw new InvalidOperationException("Expected a transcript object value node, but the current node is not an object value.");
if (transcriptNode.Type != "Bio::EnsEMBL::Transcript") throw new InvalidOperationException($"Expected a transcript node, but the current data type is: [{transcriptNode.Type}]");

var transcript = ImportTranscript.Parse(transcriptNode, chromosome, _source);
