
Extract adjacent JSON documents in a single line of a large text file in Python

Suppose you have a large text file containing adjacent JSON documents, all in a single line, as follows. The first problem is that you cannot read the file line by line, because reading that single line would load the entire file into memory. So you have to read it in bounded chunks, character by character.

{"key1":"value1","key2":"value2"}{"field":"value"}{"key":"value"}
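The bounded-memory reading described above can be sketched as a small generator (`read_chunks` and the chunk size here are illustrative, not part of the parser class shown later):

```python
#A minimal sketch of bounded-memory reading: pull fixed-size chunks
#instead of whole lines, so a single huge line is never loaded at once
def read_chunks(path,chunk_size=1000000):
    with open(path) as f:
        while True:
            chunk=f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

Each call to `f.read(chunk_size)` returns at most `chunk_size` characters, so memory use stays constant no matter how long the line is.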

Then comes the real challenge: extracting the individual JSON documents from the large file and writing them to another file in JSONL format. In other words, you want to extract each JSON document and write it to a separate line of an output file.

Python Script to Extract JSON Documents from a Single Line of a Large Text File

Below is a Python class I wrote to extract individual JSON documents that sit next to each other, without any separator, in a single line of a large text file.

To parse the JSON from an input file to an output file, just instantiate the class and call its parse function as follows:

```python
#Instantiate the class by passing input and output file paths
parser=Single_Line_JSON_Parser('single_line_file.txt','output.jsonl')
#call the parse function to extract JSON and write to separate lines in the output file
parser.parse()
```
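Once parsing completes, every line of the output file is an independent JSON document, so the results can be streamed back with a plain loop. A minimal sketch (`read_jsonl` is an illustrative helper, not part of the class):

```python
import json

#Stream the extracted documents back, one JSON document per line
def read_jsonl(path):
    with open(path) as f:
        for line in f:
            line=line.strip()
            if line:
                yield json.loads(line)
```

This is exactly the benefit of the JSONL format: downstream code never needs to hold more than one document in memory at a time.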

Below is the code for the Single_Line_JSON_Parser class. Please do comment below if you find any issue using this class or if it was useful and saved you a lot of time and effort :)

Single Line JSON Parser Python Class

```python
import json
import sys

#When infile has a single line containing a huge number of adjacent JSON entries,
#this class extracts each JSON entry and writes them to outfile as one JSON entry per line.
#Sample content of infile:
#{"field1":"value1"}{"field2":"value2","field3":"value3"}
#To parse the content, just create an instance of the class and call parse() as follows:
#parser=Single_Line_JSON_Parser('infile_path','outfile_path')
#parser.parse()
class Single_Line_JSON_Parser:
    def __init__(self,infile,outfile):
        self.infile_path=infile
        self.outfile_path=outfile
        #buffer to store incomplete JSON
        self.char_buffer=[]
        #maximum buffer length to store incomplete JSON. Should be greater than the
        #length of the longest JSON entry you expect; increase this if required.
        #This limit ensures we do not run out of memory when no valid JSON can be found.
        self.max_buffer=1000000
        #number of valid JSON entries parsed so far
        self.count=0
        #stores the previous three characters seen while parsing
        self.prev_char,self.prev_prev_char,self.prev_prev_prev_char='','',''
        #indicates whether we are inside a JSON string during parsing
        self.inside_string=False
        #open and close brace counters for the pending buffer
        self.open_braces_count=0
        self.close_braces_count=0

    #returns True if json_str is a valid JSON string, else returns False
    def is_valid_json(self,json_str):
        try:
            json.loads(json_str)
            return True
        except Exception:
            return False

    def raise_exception(self,message):
        #write the buffer as debug info
        data=''.join(self.char_buffer)
        print(data,file=sys.stderr)
        raise Exception(message)

    #write buffer content as a separate line to outfile
    def flush(self,outfile):
        if self.inside_string:
            #if we are still inside a JSON string, then something is wrong
            self.raise_exception("Unable to parse as we are still inside a JSON string.")
        #convert buffer to string
        json_data=''.join(self.char_buffer)
        if not self.is_valid_json(json_data):
            #if this is not valid JSON, then something is wrong: raise an exception and exit
            self.raise_exception("Invalid JSON")
        #write JSON to a new line in the output file
        outfile.write(json_data+'\n')
        #update the number of JSONs we have extracted so far
        self.count=self.count+1
        #reset the buffer
        self.char_buffer=[]
        if self.count%1000==0:
            print(self.count,"JSON entries extracted")
        #reset open and close brace counters
        self.open_braces_count=0
        self.close_braces_count=0
        #reset previous character info
        self.prev_prev_prev_char=''
        self.prev_prev_char=''
        self.prev_char=''

    #checks for valid JSON at each closing brace in the pending buffer.
    #This is slow, but will fix most issues not handled by this class while extracting valid JSON.
    #Called when the buffer length exceeds the maximum characters allowed.
    def slow_parse(self,outfile):
        print("Maximum buffer limit exceeded. Now parsing chunks in buffer.")
        start_pos=0
        entry_count=0
        for i,cc in enumerate(self.char_buffer):
            if cc=='}':#if closing brace
                #get string from the starting position till this closing brace
                json_data=''.join(self.char_buffer[start_pos:i+1])
                if self.is_valid_json(json_data):#if this is valid JSON
                    #write JSON to a new line in the output file
                    outfile.write(json_data+'\n')
                    #update starting position of pending data
                    start_pos=i+1
                    #update number of JSONs extracted from the buffer
                    entry_count=entry_count+1
        if start_pos==0:
            #we were unable to reduce the pending buffer, so stop here.
            #This could be because some scenario is still not handled in this code,
            #or more probably the pending JSON data is longer than max_buffer,
            #in which case increase max_buffer and try again.
            self.raise_exception("Unable to parse JSON. Try increasing the max_buffer limit")
        #update the buffer to retain only pending data
        self.char_buffer=self.char_buffer[start_pos:]
        #display the number of JSON entries extracted by slow parsing
        print("Slow parsed",entry_count,"entries from buffer")
        #update the global count with the number of JSONs extracted
        self.count=self.count+entry_count
        #reparse the pending buffer to update the parse state
        self.reparse()

    #reset and recalculate the parse state for the pending character buffer
    def reparse(self):
        self.inside_string,self.open_braces_count,self.close_braces_count,self.prev_char,self.prev_prev_char,self.prev_prev_prev_char=False,0,0,'','',''
        for cc in self.char_buffer:
            if cc=='{':#if opening brace
                #increase open braces count if we are not inside a JSON string
                if not self.inside_string:
                    self.open_braces_count=self.open_braces_count+1
            elif cc=='}':#if closing brace
                #increase close braces count if we are not inside a JSON string
                if not self.inside_string:
                    self.close_braces_count=self.close_braces_count+1
            elif cc=='"':#if this character is a quote
                #if this quote is not escaped, i.e. it is not an escaped quote inside a JSON string
                if self.prev_char!="\\" or (self.prev_prev_char=="\\" and self.prev_prev_prev_char!="\\"):
                    #toggle the flag that indicates whether we are inside a JSON string
                    self.inside_string=not self.inside_string
            self.prev_prev_prev_char=self.prev_prev_char
            self.prev_prev_char=self.prev_char
            self.prev_char=cc

    def do_parse(self,infile,outfile):
        while True:
            #read a huge chunk of characters. Increase this to read more characters per read.
            chars=infile.read(1000000)
            #no more pending characters in the file?
            if not chars:
                break
            for char in chars:
                if char=='{':
                    #increase open braces count if we are not inside a JSON string
                    if not self.inside_string:
                        self.open_braces_count=self.open_braces_count+1
                elif char=='}':
                    #increase close braces count if we are not inside a JSON string
                    if not self.inside_string:
                        self.close_braces_count=self.close_braces_count+1
                    #if the brace counts do not match and we are beyond the maximum buffer size,
                    #then we have got something wrong while parsing this JSON. So try the more
                    #expensive way of parsing individual chunks at each closing brace in the buffer.
                    if self.open_braces_count!=self.close_braces_count and len(self.char_buffer)>self.max_buffer:
                        #add the latest closing brace to the buffer
                        self.char_buffer.append(char)
                        #try slow parse to check for valid JSON strings at every closing brace
                        self.slow_parse(outfile)
                        continue
                elif char=='"':#if this character is a quote
                    #if this quote is not escaped, i.e. it is not an escaped quote inside a JSON string
                    if self.prev_char!="\\" or (self.prev_prev_char=="\\" and self.prev_prev_prev_char!="\\"):
                        #toggle the flag that indicates whether we are inside a JSON string
                        self.inside_string=not self.inside_string
                #add the latest char to the buffer
                self.char_buffer.append(char)
                #if the open and close brace counts match, then we have a complete valid JSON
                if self.open_braces_count>0 and self.open_braces_count==self.close_braces_count:
                    self.flush(outfile)
                else:
                    #update previous character info
                    self.prev_prev_prev_char=self.prev_prev_char
                    self.prev_prev_char=self.prev_char
                    self.prev_char=char
        print(self.count,"JSON entries extracted")
        #strip pending data to remove any trailing spaces, newlines or other end-of-file characters
        json_data=''.join(self.char_buffer).strip()
        #we should not have any more characters in the buffer, because the last character should be
        #a closing brace that forced the JSON to be written to the file in the loop above
        if json_data!='':
            self.raise_exception("Orphan data present at the end of file")

    #main function to be called to parse the contents of the input file to the output file
    def parse(self):
        #the with statement closes both files even if an exception is raised,
        #so no explicit close-and-reraise is needed
        with open(self.infile_path) as infile,open(self.outfile_path,'w') as outfile:
            self.do_parse(infile,outfile)
            print("Completed JSON Extraction")
```
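For comparison: when the data (or a buffered portion of it) fits in memory, the standard library can do the splitting itself. `json.JSONDecoder.raw_decode` parses one document from a string and returns the index where it ended, which makes splitting adjacent documents a short loop. This is a minimal sketch (`split_concatenated_json` is an illustrative name), trading the streaming behavior of the class above for simplicity:

```python
import json

#Yield each JSON document from a string of adjacent JSON documents
def split_concatenated_json(text):
    decoder=json.JSONDecoder()
    pos=0
    end=len(text)
    while pos<end:
        #skip any whitespace between documents
        while pos<end and text[pos].isspace():
            pos+=1
        if pos>=end:
            break
        #raw_decode returns the parsed object and the index just past it
        obj,pos=decoder.raw_decode(text,pos)
        yield obj
```

For a truly huge single-line file, this approach only works chunk by chunk with extra bookkeeping for documents that straddle chunk boundaries, which is exactly what the class above handles.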
