Suppose you have a large text file containing adjacent JSON documents, all on a single line, as shown below. The first problem is that you cannot read the file line by line: reading that single line would load the entire file into memory. Instead, you have to read the file in fixed-size chunks and process it character by character.
{"key1":"value1","key2":"value2"}{"field":"value"}{"key":"value"}
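To make the chunked reading concrete, here is a minimal sketch (the file name matches the earlier example and the one-million-character chunk size matches what the class below uses; both are just placeholders):

with open('single_line_file.txt') as infile:
    while True:
        chunk = infile.read(1000000)  # read up to a million characters per call
        if not chunk:
            break
        for char in chunk:
            # inspect each character here (brace counting, quote tracking, etc.)
            pass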
Then comes the real challenge: extracting the individual JSON documents from the large file and writing them to another file in JSON Lines (jsonl) format. In other words, you want to extract each JSON document and write it to a separate line of an output file.
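For the sample input shown earlier, the output file would contain one document per line:

{"key1":"value1","key2":"value2"}
{"field":"value"}
{"key":"value"}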
Python Script to Extract JSON Documents from a Single Line of a Large Text File
Below is a Python class I wrote to extract individual JSON documents that sit next to each other, without any separator, on a single line of a large text file.
To parse the JSON from an input file to an output file, just instantiate the class and call its parse function as follows:
#Instantiate the class by passing the input and output file paths
parser=Single_Line_JSON_Parser('single_line_file.txt','output.jsonl')
#call the parse function to extract JSON and write to separate lines in output file
parser.parse()
Below is the code for the Single_Line_JSON_Parser class. Please do comment below if you find any issue using this class or if it was useful and saved you a lot of time and effort :)
Single Line JSON Parser Python Class
import json
import sys
#When infile has a single line containing a huge number of adjacent JSON entries,
#this class extracts each JSON entry and writes it to outfile as one JSON entry per line
#Sample content of infile is shown in the line below
#{"field1":"value1"}{"field2":"value2","field3":"value3"}
#to parse the content just create an instance of the class and call parse() as follows
#parser=Single_Line_JSON_Parser('infile_path','outfile_path')
#parser.parse()
class Single_Line_JSON_Parser:
    def __init__(self, infile, outfile):
        self.infile_path = infile
        self.outfile_path = outfile
        #buffer to store incomplete JSON
        self.char_buffer = []
        #maximum buffer length to store incomplete JSON. Should be greater than the length of the longest JSON entry you expect.
        #Increase this if required. This limit ensures you do not end up with an out-of-memory error in case we are unable to find valid JSON
        self.max_buffer = 1000000
        #number of valid JSON entries parsed so far
        self.count = 0
        #stores the previous three characters seen while parsing
        self.prev_char, self.prev_prev_char, self.prev_prev_prev_char = '', '', ''
        #indicator that tells whether we are inside a JSON string during parsing
        self.inside_string = False
        #open and close brace counters for the pending buffer
        self.open_braces_count = 0
        self.close_braces_count = 0

    #returns True if json_str is a valid JSON string, else returns False
    def is_valid_json(self, json_str):
        try:
            json.loads(json_str)
            return True
        except Exception:
            return False
    def raise_exception(self, message):
        data = ''.join(self.char_buffer)
        #write buffer as debug info
        print(data, file=sys.stderr)
        raise Exception(message)

    #write buffer content as a separate line to outfile
    def flush(self, outfile):
        if self.inside_string: #if we are still inside a JSON string, then something is wrong
            self.raise_exception("Unable to parse as we are still inside a JSON string.")
        #convert buffer to string
        json_data = ''.join(self.char_buffer)
        if not self.is_valid_json(json_data): #if this is not valid JSON
            #then something is wrong.. raise exception and exit
            self.raise_exception("Invalid JSON")
        #write JSON to a new line in the output file
        outfile.write(json_data + '\n')
        #update the number of JSONs we have extracted so far
        self.count = self.count + 1
        #reset the buffer
        self.char_buffer = []
        if self.count % 1000 == 0:
            print(self.count, "JSON entries extracted")
        #reset open and close brace counters
        self.open_braces_count = 0
        self.close_braces_count = 0
        #reset previous character info
        self.prev_prev_prev_char = ''
        self.prev_prev_char = ''
        self.prev_char = ''
    #checks for valid JSON at each closing brace found in the pending buffer
    #this is slow, but will fix most issues not handled by the fast path while extracting valid JSON
    #it is called when the buffer length exceeds the maximum characters allowed
    def slow_parse(self, outfile):
        print("Maximum buffer limit exceeded. Now parsing chunks in buffer.")
        start_pos = 0
        entry_count = 0
        for i, cc in enumerate(self.char_buffer):
            if cc == '}': #if closing brace
                #get the string from the starting position till this closing brace
                json_data = ''.join(self.char_buffer[start_pos:i + 1])
                if self.is_valid_json(json_data): #if this is valid JSON
                    #write JSON to a new line in the output file
                    outfile.write(json_data + '\n')
                    #update starting position of pending data
                    start_pos = i + 1
                    #update the number of JSONs we extracted from the buffer
                    entry_count = entry_count + 1
        if start_pos == 0: #if we are unable to reduce the pending buffer,
            #we should stop here as we are unable to parse the JSON.
            #This could be because there is still some scenario not handled in this code,
            #or more probably the pending JSON data is longer than max_buffer,
            #so you can increase max_buffer and try again.
            self.raise_exception("Unable to parse JSON. Try increasing the max_buffer limit")
        #update the buffer to retain only pending data
        self.char_buffer = self.char_buffer[start_pos:]
        #display the number of JSON entries extracted by slow parsing
        print("Slow parsed", entry_count, "entries from buffer")
        #update the global count with the number of JSONs extracted
        self.count = self.count + entry_count
        #reparse the pending buffer to update the parse state
        self.reparse()

    #reset and update parse state for the pending character buffer
    def reparse(self):
        #reset all parse state and recalculate it to match the characters in the pending buffer
        self.inside_string, self.open_braces_count, self.close_braces_count, self.prev_char, self.prev_prev_char, self.prev_prev_prev_char = False, 0, 0, '', '', ''
        for cc in self.char_buffer:
            if cc == '{': #if opening brace
                #increase open braces count if we are not inside a JSON string
                if not self.inside_string:
                    self.open_braces_count = self.open_braces_count + 1
            elif cc == '}': #if closing brace
                #increase close braces count if we are not inside a JSON string
                if not self.inside_string:
                    self.close_braces_count = self.close_braces_count + 1
            elif cc == '"': #if this character is a quote
                #if this quote is not escaped, i.e. it is not an escaped quote inside a JSON string
                if self.prev_char != "\\" or (self.prev_prev_char == "\\" and self.prev_prev_prev_char != "\\"):
                    #then toggle the flag that indicates whether we are inside a JSON string
                    self.inside_string = not self.inside_string
            self.prev_prev_prev_char = self.prev_prev_char
            self.prev_prev_char = self.prev_char
            self.prev_char = cc
    def do_parse(self, infile, outfile):
        while True:
            #read a huge chunk of characters. Increase this if you want to read more characters in every single read
            chars = infile.read(1000000)
            #no more pending characters in the file?
            if not chars:
                break
            for char in chars:
                if char == '{':
                    #increase open braces count if we are not inside a JSON string
                    if not self.inside_string:
                        self.open_braces_count = self.open_braces_count + 1
                elif char == '}':
                    #increase close braces count if we are not inside a JSON string
                    if not self.inside_string:
                        self.close_braces_count = self.close_braces_count + 1
                    #if the open and close brace counts do not match and we are beyond the buffer limit for the maximum JSON string size,
                    #then we have got something wrong while parsing this JSON.
                    #So try the more expensive way of parsing individual chunks at each closing brace in the buffer
                    if self.open_braces_count != self.close_braces_count and len(self.char_buffer) > self.max_buffer:
                        #add the latest closing brace to the buffer
                        self.char_buffer.append(char)
                        #try slow parse to check for valid JSON strings at every closing brace
                        self.slow_parse(outfile)
                        continue
                elif char == '"': #if this character is a quote
                    #if this quote is not escaped, i.e. it is not an escaped quote inside a JSON string
                    if self.prev_char != "\\" or (self.prev_prev_char == "\\" and self.prev_prev_prev_char != "\\"):
                        #then toggle the flag that indicates whether we are inside a JSON string
                        self.inside_string = not self.inside_string
                #add the latest char to the buffer
                self.char_buffer.append(char)
                #if the open and close brace counts match, then we have a complete valid JSON
                if self.open_braces_count > 0 and self.open_braces_count == self.close_braces_count:
                    self.flush(outfile)
                else:
                    #update previous character info
                    self.prev_prev_prev_char = self.prev_prev_char
                    self.prev_prev_char = self.prev_char
                    self.prev_char = char
        print(self.count, "JSON entries extracted")
        #strip pending data to remove any trailing spaces, newlines or other end-of-file characters
        json_data = ''.join(self.char_buffer).strip()
        #we should not have any more characters in the buffer,
        #because the last character should be a closing brace that forces the JSON to be written to the file in the loop above
        if json_data != '':
            self.raise_exception("Orphan data present at the end of file")
    #main function to be called to parse the contents of the input file into the output file
    def parse(self):
        #the with statement closes both files automatically, even if do_parse raises an exception
        with open(self.infile_path) as infile, open(self.outfile_path, 'w') as outfile:
            self.do_parse(infile, outfile)
            print("Completed JSON Extraction")
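Once parse() completes, the output file can be consumed line by line like any other JSON Lines file. As a quick sanity check (using the output path from the usage example above), something along these lines should load every extracted document without errors:

import json

with open('output.jsonl') as f:
    for line_number, line in enumerate(f, start=1):
        doc = json.loads(line)  # raises json.JSONDecodeError if this line is not valid JSON
        # doc is now a regular Python dict you can work with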