Suppose you have a large text file containing adjacent JSON documents, all in a single line, as follows. The first problem is that you cannot read the file line by line, because reading that single line would load the entire file into memory. So you have to read it character by character (or in small fixed-size chunks).

{"key1":"value1","key2":"value2"}{"field":"value"}{"key":"value"}Code language: JSON / JSON with Comments (json)

Then comes the real challenge: extracting the individual JSON documents from the large file and writing them to another file in JSON Lines (jsonl) format. In other words, you want to extract each JSON document and write it to a separate line of an output file.
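
For the sample input above, the output file would contain one JSON document per line:

{"key1":"value1","key2":"value2"}
{"field":"value"}
{"key":"value"}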

Python Script to Extract JSON Documents from a Single Line of a Large Text File

Below is a Python class I wrote to extract individual JSON documents that sit next to each other, without any separator, in a single line of a large text file.

To parse the JSON from an input file to an output file, just instantiate the class and call its parse function as follows:

#Instantiate the class by passing the input and output file paths
parser=Single_Line_JSON_Parser('single_line_file.txt','output.jsonl')
#call the parse function to extract JSON and write to separate lines in output file
parser.parse()
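
If you prefer to pass the file paths on the command line instead of hard-coding them, a small wrapper like the one below also works. This is only an optional sketch that assumes the Single_Line_JSON_Parser class is defined in (or imported into) the same script; the file names are placeholders.

import sys

if __name__=='__main__':
	#usage: python extract_json.py <input_file> <output_file>
	parser=Single_Line_JSON_Parser(sys.argv[1],sys.argv[2])
	parser.parse()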

Below is the code for the Single_Line_JSON_Parser class. Please do comment below if you find any issues using this class, or if it was useful and saved you a lot of time and effort :)

Single Line JSON Parser Python Class

import json
import sys
#When infile has a single line and a huge number of adjacent multiple JSON entries
#this class extracts each JSON entry and writes them to outfile as one JSON entry per line
#Sample content of infile is in the line below
#{"field1":"value1"}{"field2":"value2","field3":"value3"}
#to parse the content just create an instance of the class and call parse() as follows
#parser=Single_Line_JSON_Parser('infile_path','outfile_path')
#parser.parse()
class Single_Line_JSON_Parser:
	def __init__(self,infile,outfile):
		self.infile_path=infile
		self.outfile_path=outfile
		#buffer to store incomplete JSON
		self.char_buffer=[]
		#maximum buffer length to store incomplete JSON. Should be greater than the length of the longest JSON entry you expect.
		#Increase this if required. This limit ensures you do not end up with an out of memory error in case we cannot find valid JSON
		self.max_buffer=1000000
		#number of valid JSON entries parsed so far
		self.count=0
		#stores the previous three characters seen while parsing
		self.prev_char,self.prev_prev_char,self.prev_prev_prev_char='','',''
		#indicator to tell if we are inside a JSON string during parsing
		self.inside_string=False
		#open and close brace counter for pending buffer
		self.open_braces_count=0
		self.close_braces_count=0
		
	#returns True if json_str is a valid JSON string, else returns False
	def is_valid_json(self,json_str):
		try:
			json.loads(json_str)
			return True
		except Exception as ex:
			return False
	def raise_exception(self,message):
		data=''.join(self.char_buffer)
		#write buffer as debug info
		print(data,file=sys.stderr)
		raise Exception(message)
	#write buffer content in a separate line to outfile
	def flush(self,outfile):
		if self.inside_string: #if we are still inside a JSON string, then something is wrong
			self.raise_exception("Unable to Parse as we are inside JSON string.")				
		
		#convert buffer to string
		json_data=''.join(self.char_buffer)
		if not self.is_valid_json(json_data):#if this is not valid JSON
			#then something is wrong.. raise exception and exit
			self.raise_exception("Invalid JSON")
		#write JSON to a new line in the output file
		outfile.write(json_data+'\n')
		#update number of jsons we extracted so far
		self.count=self.count+1
		#reset the buffer
		self.char_buffer=[]
		if self.count%1000==0:
			print(self.count,"JSON entries extracted")
		
		#reset open and close brace counter
		self.open_braces_count=0
		self.close_braces_count=0
		#reset previous characters info
		self.prev_prev_prev_char=''
		self.prev_prev_char=''
		self.prev_char=''
	#called to check for valid JSON for each closing brace encountered in pending buffer
	#this is slow, but will fix most issues not handled by this class while extracting valid JSON
	#This is called when buffer length exceeds maximum characters allowed
	def slow_parse(self,outfile):
		print("Maximum buffer limit exceeded. Now parsing chunks in buffer.")
		start_pos=0	
		entry_count=0				
		for i,cc in enumerate(self.char_buffer):							
			if cc=='}':#if closing brace
				#get string from starting position till closing brace
				json_data=''.join(self.char_buffer[start_pos:i+1])
				if self.is_valid_json(json_data):#if this is valid JSON
					#write JSON to a new line in the output file
					outfile.write(json_data+'\n')
					#update starting position of pending data
					start_pos=i+1				
					#update number of jsons we extracted from buffer
					entry_count=entry_count+1
	
		if start_pos==0:#if we are unable to reduce pending buffer. 
			#should stop here as we are unable to parse JSON
			#This could be because there is still some scenario not handled in this code
			#or more probably the pending json data is longer than max_buffer.
			#so you can increase max_buffer and try again.
			self.raise_exception("Unable to Parse JSON. Try increasing max_buffer limit")
		#update the buffer to retain only pending data
		self.char_buffer=self.char_buffer[start_pos:]
		#display number of JSON entries extracted from slow parsing
		print("Slow Parsed",entry_count,"from buffer")
		#update global count with number of JSONs extracted
		self.count=self.count+entry_count
		
		#reparse pending buffer to update global variables
		self.reparse()		
	#reset and update parse info for pending character buffer
	def reparse(self):
		#reset all parse info and recalculate the values to match the characters in the pending buffer
		self.inside_string,self.open_braces_count,self.close_braces_count,self.prev_char,self.prev_prev_char,self.prev_prev_prev_char=False,0,0,'','',''
		for cc in self.char_buffer:
			if cc=='{':#if opening brace
				#increase open braces count if we are not inside a JSON string
				if not self.inside_string:
					self.open_braces_count=self.open_braces_count+1
			elif cc=='}':#if closing brace
				#increase close braces count if we are not inside a JSON string
				if not self.inside_string:
					self.close_braces_count=self.close_braces_count+1
			elif cc=='"':#if this character is a quote
				#if this quote is not escaped i.e. it is not an escaped quote inside a JSON string
				if self.prev_char!="\\" or (self.prev_prev_char=="\\" and self.prev_prev_prev_char!="\\"):
					#then update the boolean that indicates whether we are inside a JSON string or not	
					self.inside_string= not self.inside_string
			self.prev_prev_prev_char=self.prev_prev_char
			self.prev_prev_char=self.prev_char
			self.prev_char=cc
	def do_parse(self,infile,outfile):
		while True:	
	
			#read a huge chunk of characters. Increase this if you want to read more characters in every single read
			chars=infile.read(1000000) 
			#no more pending characters in file?
			if not chars:break
			for char in chars:
				if char=='{':
					#increase open braces count if we are not inside a JSON string
					if not self.inside_string:
						self.open_braces_count=self.open_braces_count+1
				elif char=='}':
					#increase close braces count if we are not inside a JSON string
					if not self.inside_string:
						self.close_braces_count=self.close_braces_count+1
					#if open and close brace count does not match and we are beyond buffer limit of maximum json string size
					#then we have got something wrong in parsing this JSON.
					#So try the more expensive way of parsing individual chunks at each close brace in the buffer
					if self.open_braces_count!=self.close_braces_count and len(self.char_buffer)>self.max_buffer:
						#add the latest closing brace to buffer
						self.char_buffer.append(char) 
						#Try slow parse to check for valid json strings at every closing brace
						self.slow_parse(outfile)					
						
						continue
				elif char=='"':#if this character is a quote
					#if this quote is not escaped i.e. it is not an escaped quote inside a JSON string
					if self.prev_char!="\\" or (self.prev_prev_char=="\\" and self.prev_prev_prev_char!="\\"):
						#then update the boolean that indicates whether we are inside a JSON string or not
						self.inside_string = not self.inside_string		
				#add latest char to buffer
				self.char_buffer.append(char)
				#if open and close brace count matches, then it means we have a complete valid JSON
				if self.open_braces_count>0 and self.open_braces_count==self.close_braces_count:
					self.flush(outfile)
				else:
					#update previous character info
					self.prev_prev_prev_char=self.prev_prev_char
					self.prev_prev_char=self.prev_char
					self.prev_char=char
		print(self.count,"JSON entries extracted")
		#strip pending data to remove any trailing spaces, newlines or other end-of-file characters
		json_data=''.join(self.char_buffer).strip()
		#we should not have any more characters in buffer
		#because the last character should be a closing brace that would force JSON output to be written to file in the loop above
		if json_data!='':
			raise_exception("Orphan data present at the end of file")

	#main function to be called to parse contents of input file to output file
	def parse(self):
		#the with statement closes both files automatically, even if do_parse raises an exception
		with open(self.infile_path) as infile,open(self.outfile_path,'w') as outfile:
			self.do_parse(infile,outfile)
			print("Completed JSON Extraction")
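
As a side note, if your file is small enough to load into memory in one go, the standard library's json.JSONDecoder.raw_decode can also split adjacent JSON documents without hand-tracking braces and quotes. The sketch below is only an in-memory alternative; it does not stream the file in chunks the way the class above does.

import json

def split_concatenated_json(text):
	#parses one JSON document at a time from a string of adjacent documents
	decoder=json.JSONDecoder()
	documents=[]
	pos=0
	while pos<len(text):
		#skip any whitespace between documents
		while pos<len(text) and text[pos].isspace():
			pos=pos+1
		if pos>=len(text):
			break
		#raw_decode returns the parsed object and the index just after it
		obj,end=decoder.raw_decode(text,pos)
		documents.append(obj)
		pos=end
	return documents

#example with the sample input from the top of this post
print(split_concatenated_json('{"key1":"value1","key2":"value2"}{"field":"value"}{"key":"value"}'))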
