C#: Read a Large CSV or Any Character Separated Values File in Sized Batches/Chunks
DiponRoy - 02/Aug/2020
Read a large CSV, or any character-separated values file, chunk by chunk as a DataTable and an entity list
This article shows how to read a large CSV, or any character-separated values file, chunk by chunk, and populate a DataTable or an entity list representing each chunk.
Background
Imagine a scenario where we have to process or store the contents of a large character-separated values file in a database. I had a large .CSV file with 9-12 million rows; the file size was around 700-800 MB. Initially, I tried GenericParser, CsvHelper and a few other libraries, but ended up with either an out-of-memory exception or a very slow solution. So I wrote a parser of my own to chunk the data as a DataTable. A few days later, I also had to validate the source data, so I changed the parser to chunk the data as List<TSource> as well.
Today, I am going to share both solutions:
- read data as entity mapped
- read data as DataTable
Yes, there are existing libraries. Most of them load the entire file into a list or DataTable; a few chunk the data, but lack validation features. If we intend to log data-validation errors and error line details to a separate file or database while processing, we either cannot do it or have to maintain two separate processes.
Why Do We Need This?
- Uploading large data file content as a DataTable to a database using SqlBulkCopy (a sketch follows this list)
- Uploading large data file content as an entity list List<TSource> to a database using Entity Framework
- Validating large file content
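To make the first point concrete, here is a minimal sketch of a batched SqlBulkCopy upload. It assumes the FileReader class defined later in this article; the ToDataTable helper and the dbo.CsvLines destination table are hypothetical stand-ins for your own schema.
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
namespace CsvBatch
{
    public static class BulkUploader
    {
        /*hypothetical helper: converts a batch of CsvLine rows into a
          DataTable whose columns match the destination table*/
        private static DataTable ToDataTable(List<CsvLine> batch)
        {
            var table = new DataTable();
            table.Columns.Add("LineNo", typeof(long));
            table.Columns.Add("Line", typeof(string));
            foreach (var row in batch)
            {
                table.Rows.Add(row.LineNo, row.Line);
            }
            return table;
        }
        public static void Upload(string filePath, string connectionString)
        {
            var reader = new FileReader(filePath, hasHeaderLine: true, splitter: ',');
            foreach (List<CsvLine> batch in reader.Rows(batchSize: 10000))
            {
                /*one bulk copy per batch; only the current batch is in memory*/
                using (var bulkCopy = new SqlBulkCopy(connectionString))
                {
                    bulkCopy.DestinationTableName = "dbo.CsvLines"; /*hypothetical table*/
                    bulkCopy.WriteToServer(ToDataTable(batch));
                }
            }
        }
    }
}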
Good Practices!!!
- In these types of scenarios, use yield return and IEnumerable<TSource>.
- During insertion to a database, we should maintain a transaction for the entire file.
- For Entity Framework, it is good practice to commit small lists at a time and to reinitialize the DB context periodically, reusing the existing connection and transaction (see the sketch after this list).
- Maintain file-processing and error logs, if any. Move a processed file to an archive location after processing.
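As a hedged sketch of the Entity Framework points above: one transaction spans the whole file, while a fresh context is created per batch on the shared connection. The Student entity and SchoolContext are hypothetical; the EF6 DbContext(DbConnection, bool) constructor and Database.UseTransaction handle the connection and transaction sharing.
using System.Collections.Generic;
using System.Data.Entity;
using System.Data.SqlClient;
namespace CsvBatch
{
    /*hypothetical entity and context, for illustration only*/
    public class Student
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }
    public class SchoolContext : DbContext
    {
        public SchoolContext(SqlConnection connection, bool contextOwnsConnection)
            : base(connection, contextOwnsConnection) { }
        public DbSet<Student> Students { get; set; }
    }
    public static class BatchSaver
    {
        public static void Save(string connectionString, IEnumerable<List<Student>> batches)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                /*one transaction covers the entire file*/
                using (var transaction = connection.BeginTransaction())
                {
                    foreach (var batch in batches)
                    {
                        /*fresh context per batch keeps the change tracker small*/
                        using (var db = new SchoolContext(connection, contextOwnsConnection: false))
                        {
                            db.Database.UseTransaction(transaction);
                            db.Students.AddRange(batch);
                            db.SaveChanges();
                        }
                    }
                    transaction.Commit();
                }
            }
        }
    }
}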
Core
Here are the interfaces used in our helper project.
using System.Collections.Generic;
namespace CsvBatch
{
    /*a single data line of the file*/
    public interface ICsvLine
    {
        long LineNo { get; }
        string Line { get; }
        char Splitter { get; }
        List<string> LineValues();
        void Set(long lineNo, string line, char splitter);
    }
    /*marker interface for line-to-entity mappers*/
    public interface ICsvDataMapper : ICsvLine
    {
    }
    public interface ICsvDataMapper<T> : ICsvDataMapper
    {
        /*maps the line to an entity*/
        T Map();
        /*tries to map the line; on failure, returns false and reports errors*/
        bool Map(out T entity, out List<string> errors);
    }
}
This class represents a data line of a file.
using System.Collections.Generic;
namespace CsvBatch
{
    public class CsvLine : ICsvLine
    {
        public long LineNo { get; protected set; }
        public string Line { get; protected set; }
        public char Splitter { get; protected set; }
        /*splits the raw line into values; note that a plain Split
          does not handle quoted values that contain the splitter*/
        public List<string> LineValues()
        {
            var values = string.IsNullOrEmpty(Line) ?
                new List<string>() : new List<string>(Line.Split(Splitter));
            return values;
        }
        public void Set(long lineNo, string line, char splitter)
        {
            LineNo = lineNo;
            Line = line;
            Splitter = splitter;
        }
    }
}
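To show how the mapper contract fits together, here is a hypothetical mapper that derives from CsvLine and implements ICsvDataMapper<Student>. Student is the hypothetical entity from the earlier Entity Framework sketch, and the two-column layout (Id, Name) is an assumption for illustration.
using System.Collections.Generic;
namespace CsvBatch
{
    public class StudentCsvMapper : CsvLine, ICsvDataMapper<Student>
    {
        /*maps the line, assuming values[0] = Id and values[1] = Name*/
        public Student Map()
        {
            var values = LineValues();
            return new Student { Id = int.Parse(values[0]), Name = values[1] };
        }
        /*validating variant: collects errors instead of throwing*/
        public bool Map(out Student entity, out List<string> errors)
        {
            entity = null;
            errors = new List<string>();
            var values = LineValues();
            if (values.Count != 2)
            {
                errors.Add($"Line {LineNo}: expected 2 values, found {values.Count}.");
                return false;
            }
            int id;
            if (!int.TryParse(values[0], out id))
            {
                errors.Add($"Line {LineNo}: '{values[0]}' is not a valid Id.");
                return false;
            }
            entity = new Student { Id = id, Name = values[1] };
            return true;
        }
    }
}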
This class represents the file.
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Threading.Tasks;
namespace CsvBatch
{
    public class FileReader
    {
        public readonly string Path;
        public readonly char Splitter;
        public readonly bool HasHeaderLine;
        private List<string> _header;
        private FileReader(char splitter, bool hasHeaderLine)
        {
            Splitter = splitter;
            HasHeaderLine = hasHeaderLine;
        }
        public FileReader(string path, bool hasHeaderLine, char splitter)
            : this(splitter, hasHeaderLine)
        {
            Path = path;
        }
        public virtual List<string> Headers()
        {
            if (_header != null)
            {
                return _header;
            }
            if (!HasHeaderLine)
            {
                _header = new List<string>();
                return _header;
            }
            _header = new List<string>();
            using (var reader = new StreamReader(File.OpenRead(Path)))
            {
                if (HasHeaderLine && !reader.EndOfStream)
                {
                    _header = reader.ReadLine().Split(Splitter).ToList();
                }
            }
            return _header;
        }
        public virtual IEnumerable<string> Lines()
        {
            using (var reader = new StreamReader(File.OpenRead(Path)))
            {
                /*skip header row*/
                if (HasHeaderLine && !reader.EndOfStream)
                {
                    reader.ReadLine(); /*check header: string[] headers = reader.ReadLine().Split(Splitter);*/
                }
                /*data rows*/
                while (!reader.EndOfStream)
                {
                    var line = reader.ReadLine();
                    yield return line;
                }
            }
        }
        public virtual IEnumerable<CsvLine> Rows()
        {
            using (var reader = new StreamReader(File.OpenRead(Path)))
            {
                long lineNo = 0;
                /*skip header row*/
                if (HasHeaderLine && !reader.EndOfStream)
                {
                    ++lineNo;
                    reader.ReadLine(); /*check header: string[] headers = reader.ReadLine().Split(Splitter);*/
                }
                /*data rows*/
                while (!reader.EndOfStream)
                {
                    var line = new CsvLine();
                    line.Set(++lineNo, reader.ReadLine(), Splitter);
                    yield return line;
                }
            }
        }
        public IEnumerable<List<CsvLine>> Rows(int batchSize)
        {
            List<CsvLine> list = new List<CsvLine>();
            foreach (var row in Rows())
            {
                list.Add(row);
                if (list.Count == batchSize)
                {
                    yield return list;
                    list = new List<CsvLine>();
                }
            }
            /*last, possibly partial, batch*/
            if (list.Count > 0)
            {
                yield return list;
            }
        }
    }
}
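Putting it together, a quick usage sketch; the file path and batch size are arbitrary, and only one batch is held in memory at a time:
var reader = new FileReader(@"d:\data\students.csv", hasHeaderLine: true, splitter: ',');
List<string> headers = reader.Headers();
foreach (List<CsvLine> batch in reader.Rows(batchSize: 10000))
{
    foreach (CsvLine row in batch)
    {
        List<string> values = row.LineValues();
        /*validate, map, or persist each line here*/
    }
}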