
Extract adjacent JSON documents in a single line of a large text file in Python

Suppose you have a large text file containing adjacent JSON documents, all in a single line, as follows. The first problem is that you cannot read the file line by line, because reading that single line would load the entire file into memory. So you have to read it in smaller chunks of characters.

{"key1":"value1","key2":"value2"}{"field":"value"}{"key":"value"}

Then comes the real challenge: extracting the individual JSON documents from the large file and writing them to another file in JSON Lines (jsonl) format. In other words, you want to extract each JSON document and write it to a separate line of an output file.
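As an aside, when the text does fit in memory, the standard library can already do the splitting: json.JSONDecoder.raw_decode parses one document from a given position and tells you where it ended, so you can walk the string document by document. A minimal sketch (split_adjacent_json is a name chosen here for illustration, not part of the class below):

```python
import json

def split_adjacent_json(text):
    """Split a string of back-to-back JSON documents into Python objects."""
    decoder = json.JSONDecoder()
    docs = []
    pos = 0
    while pos < len(text):
        # raw_decode parses one document starting at pos and returns
        # the parsed object plus the index just past that document
        obj, end = decoder.raw_decode(text, pos)
        docs.append(obj)
        pos = end
    return docs

print(split_adjacent_json('{"key1":"value1","key2":"value2"}{"field":"value"}{"key":"value"}'))
# → [{'key1': 'value1', 'key2': 'value2'}, {'field': 'value'}, {'key': 'value'}]
```

This only helps when the whole text can be held in memory at once, which is exactly what a large single-line file rules out; the class below works in bounded memory instead.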

Python Script to Extract JSON documents in a single line of a large text file

Below is a Python class I wrote to extract individual JSON documents that sit next to each other, without any separator, in a single line of a large text file.

To parse the JSON from an input file to an output file, just instantiate the class and call its parse function as follows:

#Instantiate the class by passing input and output file paths
parser=Single_Line_JSON_Parser('single_line_file.txt','output.jsonl')

#call the parse function to extract JSON and write to separate lines in output file
parser.parse()

Below is the code for the Single_Line_JSON_Parser class. Please do comment below if you find any issues using this class, or if it was useful and saved you a lot of time and effort :)

Single Line JSON Parser Python Class

import json
import sys

#When infile has a single line containing a huge number of adjacent JSON entries,
#this class extracts each JSON entry and writes them to outfile as one JSON entry per line
#Sample content of infile is in the line below
#{"field1":"value1"}{"field2":"value2","field3":"value3"}

#to parse the content just create an instance of the class and call parse() as follows
#parser=Single_Line_JSON_Parser('infile_path','outfile_path')
#parser.parse()

class Single_Line_JSON_Parser:

	def __init__(self,infile,outfile):

		self.infile_path=infile

		self.outfile_path=outfile

		#buffer to store incomplete JSON
		self.char_buffer=[]

		#maximum buffer length for an incomplete JSON document. Should be greater than the length of the longest JSON entry you expect.
		#Increase this if required. This limit ensures we do not run out of memory in case no valid JSON can be found
		self.max_buffer=1000000

		#number of valid JSON entries parsed so far
		self.count=0

		#stores the previous three characters seen while parsing (used to detect escaped quotes)
		self.prev_char,self.prev_prev_char,self.prev_prev_prev_char='','',''

		#indicator to tell if we are inside a JSON string during parsing
		self.inside_string=False

		#open and close brace counter for pending buffer
		self.open_braces_count=0
		self.close_braces_count=0
		

	#returns True if json_str is a valid JSON string, else returns False
	def is_valid_json(self,json_str):
		try:
			json.loads(json_str)
			return True
		except Exception as ex:
			return False

	def raise_exception(self,message):
		data=''.join(self.char_buffer)

		#write buffer as debug info
		print(data,file=sys.stderr)

		raise Exception(message)

	#write buffer content in a separate line to outfile
	def flush(self,outfile):

		if self.inside_string: #if we are still inside a JSON string, then something is wrong
			self.raise_exception("Unable to Parse as we are inside JSON string.")				
		
		#convert buffer to string
		json_data=''.join(self.char_buffer)

		if not self.is_valid_json(json_data):#if this is not valid JSON
			#then something is wrong.. raise exception and exit
			self.raise_exception("Invalid JSON")

		#write JSON to a new line in the output file
		outfile.write(json_data+'\n')

		#update number of jsons we extracted so far
		self.count=self.count+1

		#reset the buffer
		self.char_buffer=[]

		if self.count%1000==0:
			print(self.count,"JSON entries extracted")
		
		#reset open and close brace counter
		self.open_braces_count=0
		self.close_braces_count=0

		#reset previous characters info
		self.prev_prev_prev_char=''
		self.prev_prev_char=''
		self.prev_char=''

	#checks for a valid JSON document at each closing brace in the pending buffer
	#this is slow, but recovers most cases the main parsing loop cannot handle
	#it is called when the buffer length exceeds the maximum characters allowed
	def slow_parse(self,outfile):

		print("Maximum buffer limit exceeded. Now parsing chunks in buffer.")

		start_pos=0	
		entry_count=0				

		for i,cc in enumerate(self.char_buffer):							

			if cc=='}':#if closing brace
				#get string from starting position till closing brace
				json_data=''.join(self.char_buffer[start_pos:i+1])

				if self.is_valid_json(json_data):#if this is valid JSON
					#write JSON to a new line in the output file
					outfile.write(json_data+'\n')
					#update starting position of pending data
					start_pos=i+1				
					#update number of jsons we extracted from buffer
					entry_count=entry_count+1
	
		if start_pos==0:#if we are unable to reduce pending buffer. 
			#should stop here as we are unable to parse JSON
			#This could be because there is still some scenario not handled in this code
			#or more probably the pending json data is longer than max_buffer.
			#so can increase max buffer and try again.
			self.raise_exception("Unable to Parse JSON. Try increasing max_buffer limit")

		#update the buffer to retain only pending data
		self.char_buffer=self.char_buffer[start_pos:]

		#display number of JSON entries extracted from slow parsing
		print("Slow Parsed",entry_count,"from buffer")

		#update global count with number of JSONs extracted
		self.count=self.count+entry_count
		
		#reparse pending buffer to update global variables
		self.reparse()		

	#reset and update parse info for pending character buffer
	def reparse(self):
		#reset all parse info and re calculate their values to match characters in pending buffer 
		self.inside_string,self.open_braces_count,self.close_braces_count,self.prev_char,self.prev_prev_char,self.prev_prev_prev_char=False,0,0,'','',''
		for cc in self.char_buffer:
			if cc=='{':#if opening brace

				#increase open braces count if we are not inside a JSON string
				if not self.inside_string:
					self.open_braces_count=self.open_braces_count+1

			elif cc=='}':#if closing brace

				#increase close braces count if we are not inside a JSON string
				if not self.inside_string:
					self.close_braces_count=self.close_braces_count+1

			elif cc=='"':#if this character is a quote

				#if this quote is not escaped, i.e. it is not an escaped quote inside a JSON string
				#(note: this check only handles runs of up to three consecutive backslashes)
				if self.prev_char!="\\" or (self.prev_prev_char=="\\" and self.prev_prev_prev_char!="\\"):

					#then update the boolean that indicates whether we are inside a JSON string or not	
					self.inside_string= not self.inside_string

			self.prev_prev_prev_char=self.prev_prev_char
			self.prev_prev_char=self.prev_char
			self.prev_char=cc

	def do_parse(self,infile,outfile):

		while True:	
	
			#read a huge chunk of characters. Increase this if you want to read more characters in every single read
			chars=infile.read(1000000) 

			#no more pending characters in file?
			if not chars:break

			for char in chars:
				if char=='{':
					#increase open braces count if we are not inside a JSON string
					if not self.inside_string:
						self.open_braces_count=self.open_braces_count+1
				elif char=='}':
					#increase close braces count if we are not inside a JSON string
					if not self.inside_string:
						self.close_braces_count=self.close_braces_count+1

					#if open and close brace count does not match and we are beyond buffer limit of maximum json string size
					#then we have got something wrong in parsing this JSON.
					#So try the more expensive way of parsing individual chunks at each close brace in the buffer
					if self.open_braces_count!=self.close_braces_count and len(self.char_buffer)>self.max_buffer:

						#add the latest closing brace to buffer
						self.char_buffer.append(char) 

						#Try slow parse to check for valid json strings at every closing brace
						self.slow_parse(outfile)					
						
						continue

				elif char=='"':#if this character is a quote

					#if this quote is not escaped, i.e. it is not an escaped quote inside a JSON string
					#(note: this check only handles runs of up to three consecutive backslashes)
					if self.prev_char!="\\" or (self.prev_prev_char=="\\" and self.prev_prev_prev_char!="\\"):

						#then update the boolean that indicates whether we are inside a JSON string or not
						self.inside_string = not self.inside_string		

				#add latest char to buffer
				self.char_buffer.append(char)

				#if open and close brace counts match, we likely have a complete JSON document (flush validates it)
				if self.open_braces_count>0 and self.open_braces_count==self.close_braces_count:
					self.flush(outfile)
				else:
					#update previous character info
					self.prev_prev_prev_char=self.prev_prev_char
					self.prev_prev_char=self.prev_char
					self.prev_char=char

		print(self.count,"JSON entries extracted")

		#strip pending data to remove any trailing spaces, newlines or other end-of-file characters
		json_data=''.join(self.char_buffer).strip()

		#we should not have any more characters in buffer
		#because the last character should be a closing brace that would force JSON output to be written to file in the loop above
		if json_data!='':
			self.raise_exception("Orphan data present at the end of file")


	#main function to be called to parse contents of input file to output file
	def parse(self):
		#the with statement closes both files automatically, even if parsing fails
		with open(self.infile_path) as infile,open(self.outfile_path,'w') as outfile:
			self.do_parse(infile,outfile)
		print("Completed JSON Extraction")
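If you would rather not track braces and quote state by hand, the raw_decode idea from earlier can be combined with chunked reads so memory stays bounded by the longest single document rather than the whole file. This is a sketch of an alternative approach, not the class above; stream_concatenated_json and its parameters are names chosen here for illustration, and it assumes the documents are adjacent with no separators between them:

```python
import io
import json

def stream_concatenated_json(infile, outfile, chunk_size=1000000):
    """Copy adjacent JSON documents from infile to outfile, one per line.

    Only the current incomplete document is buffered, so memory usage is
    bounded by the longest single document, not by the whole file.
    """
    decoder = json.JSONDecoder()
    buffer = ''
    while True:
        chunk = infile.read(chunk_size)
        buffer += chunk
        pos = 0
        while pos < len(buffer):
            try:
                obj, end = decoder.raw_decode(buffer, pos)
            except json.JSONDecodeError:
                break  # document is incomplete; need more input
            # write the original text of the document, not a re-serialization
            outfile.write(buffer[pos:end] + '\n')
            pos = end
        # keep only the unparsed tail for the next iteration
        buffer = buffer[pos:]
        if not chunk:  # end of file reached
            if buffer.strip():
                raise ValueError('Orphan data present at the end of file')
            break

src = io.StringIO('{"a":1}{"b":[2,3]}{"c":"x}y"}')
dst = io.StringIO()
stream_concatenated_json(src, dst)
print(dst.getvalue())
```

Note that raw_decode raises on leading whitespace, so this sketch relies on the documents being strictly back to back; the class above makes the same assumption.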
Gurudev (https://www.hitxp.com) is the developer of the Gurunudi AI Platform.